<p style="text-align: center"><img src="https://gitlab.aicrowd.com/aicrowd/assets/-/raw/master/challenges/clock-decomposition/notebook-banner.jpg?inline=false" alt="Drawing" style="height: 400px;"/></p>

# What is the notebook about?

The challenge is to use the features extracted from the Clock Drawing Test to build an automated and algorithm to predict whether each participant is one of three phases:

1)    Pre-Alzheimer’s (Early Warning)
2)    Post-Alzheimer’s (Detection)
3)    Normal (Not an Alzheimer’s patient)

In machine learning terms: this is a 3-class classification task.

# How to use this notebook? 📝

<p style="text-align: center"><img src="https://gitlab.aicrowd.com/aicrowd/assets/-/raw/master/notebook/aicrowd_notebook_submission_flow.png?inline=false" alt="notebook overview" style="width: 650px;"/></p>

- **Update the config parameters**. You can define the common variables here

Variable | Description
--- | ---
`AICROWD_DATASET_PATH` | Path to the file containing test data (The data will be available at `/ds_shared_drive/` on aridhia workspace). This should be an absolute path.
`AICROWD_PREDICTIONS_PATH` | Path to write the output to.
`AICROWD_ASSETS_DIR` | In case your notebook needs additional files (like model weights, etc.,), you can add them to a directory and specify the path to the directory here (please specify relative path). The contents of this directory will be sent to AIcrowd for evaluation.
`AICROWD_API_KEY` | In order to submit your code to AIcrowd, you need to provide your account's API key. This key is available at https://www.aicrowd.com/participants/me

- **Installing packages**. Please use the [Install packages 🗃](#install-packages-) section to install the packages
- **Training your models**. All the code within the [Training phase ⚙️](#training-phase-) section will be skipped during evaluation. **Please make sure to save your model weights in the assets directory and load them in the predictions phase section** 

# Setup AIcrowd Utilities 🛠

We use this to bundle the files for submission and create a submission on AIcrowd. Do not edit this block.

In [3]:
!pip install -q -U aicrowd-cli

In [1]:
%load_ext aicrowd.magic

In [16]:
!pip install sweetviz
!pip install -U jupyter

In [2]:
import sweetviz as sv

In [3]:
import os

# Please use the absolute for the location of the pip install Shapelydataset.
# Or you can use relative path with `os.getcwd() + "test_data/validation.csv"`
AICROWD_DATASET_PATH = os.getenv("DATASET_PATH", "/ds_shared_drive/validation.csv")
AICROWD_PREDICTIONS_PATH = os.getenv("PREDICTIONS_PATH", "predictions.csv")
AICROWD_ASSETS_DIR = "assets"


In [85]:
#!pip install ipywidgets
#!jupyter nbextension enable --py widgetsnbextension
#!conda install -y jupyterlab_widgets
#!pip install aquirdturtle_collapsible_headings

# Install packages 🗃

Please add all pacakage installations in this section

In [86]:
!pip install numpy pandas
!pip install -U imbalanced-learn
!pip install xgboost
!pip install lightgbm
!pip install catboost
!pip install tensorflow
!pip install shap
!pip install torch torchvision torchaudio

# Define preprocessing code 💻

The code that is common between the training and the prediction sections should be defined here. During evaluation, we completely skip the training section. Please make sure to add any common logic between the training and prediction sections here.

### Import common packages

Please import packages that are common for training and prediction phases here.

In [424]:
import numpy as np
import pandas as pd
import joblib
import matplotlib.pyplot as plt
from collections import Counter
import torch
from tqdm.notebook import tqdm
%matplotlib inline

### Credits
Adapted from https://discourse.aicrowd.com/t/target-distribution-in-the-test-set-lb-0-616-with-a-simple-magic-trick/5613 and http://glemaitre.github.io/imbalanced-learn/index.html

In [5]:
target_col = "diagnosis"
key_col = "row_id"
cat_cols = ['intersection_pos_rel_centre']
seed = 2021

target_values = ["normal", "post_alzheimer", "pre_alzheimer"]


In [394]:
scale = 4
translator2d = {1: [4,1], 2 : [5,2], 3: [6,3], 4:[5,4] ,5:[4,5] ,6:[3,6] ,7:[2,5] ,8:[1,4] ,9:[0,3],10:[1,2], 11:[2,1] , 12:[3,0]}
ccw_translate = {i: 12 - i  for i in range(1,13,1)}
ccw_translate[12] = 12
translator2d_ccw = {ccw_translate[k]:v for k,v in translator2d.items()}

In [395]:
import torchvision
import torch, torch.nn as nn
import torchvision.models as models
from torch.autograd import Variable
import torch.nn.functional as F
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
np.random.seed(0)
torch.manual_seed(0)

<torch._C.Generator at 0x7fdcf9c01150>

In [421]:
z_dim = 1 # image_repr_features.shape[1]
n_classes = 3
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(z_dim, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, n_classes)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.softmax(x)
model = Net()

# Training phase ⚙️

You can define your training code here. This sections will be skipped during evaluation.

In [10]:
train = pd.read_csv('/ds_shared_drive/train.csv')

In [11]:
# valid = pd.read_csv('/ds_shared_drive/validation.csv')
# valid_truth = pd.read_csv('/ds_shared_drive/validation_ground_truth.csv')
# valid_all = valid.merge(valid_truth,how='left')
# train = pd.concat([train, valid_all],axis = 0)

In [12]:
train = train[train[target_col].isin(target_values)].copy().reset_index(drop=True)

# Remove Constant Columns
train = train.loc[:, (train != train.iloc[0]).any()]
features = train.columns[1:-1].to_list()

numeric_features = [c for c in features if c not in cat_cols]

In [13]:
for c in numeric_features:
    train[c] = train[c].astype(float)

print(train[target_col].value_counts())
print(train.shape)

normal            31208
post_alzheimer     1149
pre_alzheimer       420
Name: diagnosis, dtype: int64
(32777, 120)


In [42]:

df_pos = train[train[target_col].isin(target_values[1:])]
nb_pos = df_pos.shape[0]
nb_neg = nb_pos*2
df_neg = train[train[target_col] == "normal"].sample(n=nb_neg, random_state=seed)
# df_neg = df_normal 
df_samples = pd.concat([df_pos, df_neg]).sample(frac=1).reset_index(drop=True)
# df_samples = train
df_samples.shape



(4707, 120)

In [43]:
df_samples.shape

(4707, 120)

In [326]:
print(cat_cols)
for c in cat_cols:
    df_samples[c].fillna("NA", inplace=True)
    
df_dummies = pd.get_dummies(df_samples[cat_cols], columns=cat_cols, dummy_na=True).add_prefix('CAT_')
dummy_cols = df_dummies.columns.to_list()
print(dummy_cols)

df_samples = pd.concat([df_samples, df_dummies], axis=1)
df_samples['cnt_NaN'] = df_samples[numeric_features].isna().sum(axis=1)
# df_samples.fillna(-1, inplace=True)
model_features = df_samples.columns.to_list()
model_features = [c for c in model_features if c not in [key_col, target_col] + cat_cols]
print(len(model_features))
X_train = df_samples[model_features]
y_train_all = df_samples[target_col].map(dict(zip(target_values, list(range(len(target_values))))))

['intersection_pos_rel_centre']
['CAT_intersection_pos_rel_centre_BL', 'CAT_intersection_pos_rel_centre_BR', 'CAT_intersection_pos_rel_centre_NA', 'CAT_intersection_pos_rel_centre_TL', 'CAT_intersection_pos_rel_centre_TR', 'CAT_intersection_pos_rel_centre_nan']
130


In [45]:
df_samples[target_col].value_counts()

normal            3138
post_alzheimer    1149
pre_alzheimer      420
Name: diagnosis, dtype: int64

In [352]:
image_repr_features = None
for n,row in tqdm(X_train.iterrows()):
    image_repr = np.zeros((1,7,7))
    centre_repr = np.zeros((1,7,7))
    for i in range(1,13):
        col = f'missing_digit_{i}'
        present = row[col]

        if present:
            ccw_flag = row["sequence_flag_ccw"] == 1
            translator = translator2d
            if ccw_flag:
                translator = translator2d_ccw
            pos = translator[i]
            image_repr[0,pos[1],pos[0]] = 1
    
    image_repr = np.kron(image_repr, np.ones((scale,scale)))
#     rot_angle_z = image_repr * row["final_rotation_angle"]/360
#     centre_dot = row["centre_dot_detect"]
#     if centre_dot == 1:
#         centre_repr[0,3,3] = 1
#     centre_repr = np.kron(centre_repr, np.ones((scale,scale))) 
#     image_repr = np.vstack([image_repr,rot_angle_z,centre_repr])
    image_repr = np.expand_dims(image_repr, axis = 0)
    if n > 0:
        image_repr_features = np.vstack([image_repr_features,image_repr])
    else:
        image_repr_features = image_repr

image_repr_features_no_nan = image_repr_features # np.nan_to_num(image_repr_features)

0it [00:00, ?it/s]

In [422]:
model = Net()

In [379]:
train_x, val_x, train_y, val_y = train_test_split(image_repr_features_no_nan, y_train_all, test_size = 0.1)
(train_x.shape, train_y.shape), (val_x.shape, val_y.shape)
train_x = torch.from_numpy(train_x).float()
val_x = torch.from_numpy(val_x).float()
train_y = torch.from_numpy(train_y.values).long()
val_y = torch.from_numpy(val_y.values).long()
def train(epoch):
    tr_loss = 0
    # getting the training set
    x_train, y_train = Variable(train_x), Variable(train_y)
    # getting the validation set
    x_val, y_val = Variable(val_x), Variable(val_y)
    # clearing the Gradients of the model parameters
    optimizer.zero_grad()
    
    # prediction for training and validation set
    output_train = model(x_train)
    output_val = model(x_val)

    # computing the training and validation loss
    loss_train = criterion(output_train, y_train)
    loss_val = criterion(output_val, y_val)
    train_losses.append(loss_train)
    val_losses.append(loss_val)

    # computing the updated weights of all the model parameters
    loss_train.backward()
    optimizer.step()
    tr_loss = loss_train.item()
    if epoch%2 == 0:
        # printing the validation loss
        print('Epoch : ',epoch+1, '\t', 'loss :', loss_val)


# defining the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.005)
# defining the loss function
criterion = nn.CrossEntropyLoss()
# defining the number of epochs
n_epochs = 15
# empty list to store training losses
train_losses = []
# empty list to store validation losses
val_losses = []
# training the model
for epoch in range(n_epochs):
    train(epoch)
# prediction for training set
with torch.no_grad():
    output = model(train_x)
    
softmax = output.cpu()
prob = list(softmax.numpy())
predictions = np.argmax(prob, axis=1)
print(predictions.sum())
# f1 score on training set
f1_score(train_y.numpy(), predictions, average='weighted')

In [912]:
# load your data

## Train your model

In [918]:
train_x, val_x, train_y, val_y = train_test_split(image_repr_features_no_nan, y_train_all, test_size = 0.1)
(train_x.shape, train_y.shape), (val_x.shape, val_y.shape)
train_x = torch.from_numpy(train_x).float()
val_x = torch.from_numpy(val_x).float()
train_y = torch.from_numpy(train_y.values).long()
val_y = torch.from_numpy(val_y.values).long()
def train(epoch):
    tr_loss = 0
    # getting the training set
    x_train, y_train = Variable(train_x), Variable(train_y)
    # getting the validation set
    x_val, y_val = Variable(val_x), Variable(val_y)
    # clearing the Gradients of the model parameters
    optimizer.zero_grad()
    
    # prediction for training and validation set
    output_train = model(x_train)
    output_val = model(x_val)

    # computing the training and validation loss
    loss_train = criterion(output_train, y_train)
    loss_val = criterion(output_val, y_val)
    train_losses.append(loss_train)
    val_losses.append(loss_val)

    # computing the updated weights of all the model parameters
    loss_train.backward()
    optimizer.step()
    tr_loss = loss_train.item()
    if epoch%2 == 0:
        # printing the validation loss
        print('Epoch : ',epoch+1, '\t', 'loss :', loss_val)

model = Net()
# defining the optimizer
optimizer = optim.Adam(model.parameters(), lr=0.005)
# defining the loss function
criterion = nn.CrossEntropyLoss()
# defining the number of epochs
n_epochs = 15
# empty list to store training losses
train_losses = []
# empty list to store validation losses
val_losses = []
# training the model
for epoch in range(n_epochs):
    train(epoch)
# prediction for training set
with torch.no_grad():
    output = model(train_x)
    
softmax = output.cpu()
prob = list(softmax.numpy())
predictions = np.argmax(prob, axis=1)
print(predictions.sum())
# f1 score on training set
f1_score(train_y.numpy(), predictions, average='weighted')

Pipeline(steps=[('adasyn', ADASYN(random_state=0)),
                ('lgbmclassifier', LGBMClassifier())])

## Save your trained model

In [396]:
filename = f'{AICROWD_ASSETS_DIR}/model_checkpoint'

check_point = {'params': model.state_dict(),
              'optimizer': optimizer.state_dict()}

torch.save(check_point, filename)


# Prediction phase 🔎

Please make sure to save the weights from the training section in your assets directory and load them in this section

In [397]:
file = f'{AICROWD_ASSETS_DIR}/model_checkpoint'
check_point = torch.load(file)
model.load_state_dict(check_point['params'])

<All keys matched successfully>

## Load test data

In [404]:
test_data = pd.read_csv(AICROWD_DATASET_PATH)
test_data.head()

Unnamed: 0,row_id,number_of_digits,missing_digit_1,missing_digit_2,missing_digit_3,missing_digit_4,missing_digit_5,missing_digit_6,missing_digit_7,missing_digit_8,...,top_area_perc,bottom_area_perc,left_area_perc,right_area_perc,hor_count,vert_count,eleven_ten_error,other_error,time_diff,centre_dot_detect
0,LA9JQ1JZMJ9D2MBZV,11.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.500272,0.499368,0.553194,0.446447,0,0,0,1,,
1,PSSRCWAPTAG72A1NT,6.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,...,0.572472,0.427196,0.496352,0.503273,0,1,0,1,,
2,GCTODIZJB42VCBZRZ,11.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.494076,0.505583,0.503047,0.496615,1,0,0,0,0.0,0.0
3,7YMVQGV1CDB1WZFNE,3.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,...,0.555033,0.444633,0.580023,0.419575,0,1,0,1,,
4,PHEQC6DV3LTFJYIJU,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,0.603666,0.395976,0.49499,0.504604,0,0,0,1,150.0,0.0


In [406]:
image_repr_features_test = None
for n,row in tqdm(test_data.iterrows()):
    image_repr = np.zeros((1,7,7))
    centre_repr = np.zeros((1,7,7))
    for i in range(1,13):
        col = f'missing_digit_{i}'
        present = row[col]

        if present:
            ccw_flag = row["sequence_flag_ccw"] == 1
            translator = translator2d
            if ccw_flag:
                translator = translator2d_ccw
            pos = translator[i]
            image_repr[0,pos[1],pos[0]] = 1
    
    image_repr = np.kron(image_repr, np.ones((scale,scale)))
#     rot_angle_z = image_repr * row["final_rotation_angle"]/360
#     centre_dot = row["centre_dot_detect"]
#     if centre_dot == 1:
#         centre_repr[0,3,3] = 1
#     centre_repr = np.kron(centre_repr, np.ones((scale,scale))) 
#     image_repr = np.vstack([image_repr,rot_angle_z,centre_repr])
    image_repr = np.expand_dims(image_repr, axis = 0)
    if n > 0:
        image_repr_features_test = np.vstack([image_repr_features_test,image_repr])
    else:
        image_repr_features_test = image_repr

image_repr_features_test_no_nan = image_repr_features_test # np.nan_to_num(image_repr_features)

0it [00:00, ?it/s]

In [407]:
# prediction for training set
test_x = torch.from_numpy(image_repr_features_test_no_nan).float()
with torch.no_grad():
    output = model(test_x)
    
softmax = output.cpu()
prob = list(softmax.numpy())
predictions = np.argmax(prob, axis=1)
print(predictions.sum())

33


  return F.softmax(x)


## Generate predictions

In [418]:
predictions = {
    "row_id": test_data["row_id"].values,
    "normal_diagnosis_probability": [x[0] for x in prob],
    "post_alzheimer_diagnosis_probability": [x[1] for x in prob],
    "pre_alzheimer_diagnosis_probability": [x[2] for x in prob],
}

predictions_df = pd.DataFrame.from_dict(predictions)

In [419]:
predictions_df.head()

Unnamed: 0,row_id,normal_diagnosis_probability,post_alzheimer_diagnosis_probability,pre_alzheimer_diagnosis_probability
0,LA9JQ1JZMJ9D2MBZV,0.166939,0.833061,5.098431e-13
1,PSSRCWAPTAG72A1NT,0.999258,0.000743,6.307389e-11
2,GCTODIZJB42VCBZRZ,0.999851,0.000149,1.460324e-11
3,7YMVQGV1CDB1WZFNE,0.107968,0.892032,4.458191e-25
4,PHEQC6DV3LTFJYIJU,0.920741,0.079259,1.108694e-18


## Save predictions 📨

In [420]:
predictions_df.to_csv(AICROWD_PREDICTIONS_PATH, index=False)

# Submit to AIcrowd 🚀

**NOTE: PLEASE SAVE THE NOTEBOOK BEFORE SUBMITTING IT (Ctrl + S)**

In [423]:
!DATASET_PATH=$AICROWD_DATASET_PATH \
aicrowd notebook submit \
    --assets-dir $AICROWD_ASSETS_DIR \
    --challenge addi-alzheimers-detection-challenge

[32mAPI Key valid[0m
[32mSaved API Key successfully![0m
Using notebook: /home/desktop0/ClockFeatures.ipynb for submission...
Removing existing files from submission directory...
Scrubbing API keys from the notebook...
Collecting notebook...
Validating the submission...
Executing install.ipynb...
[NbConvertApp] Converting notebook /home/desktop0/submission/install.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
[NbConvertApp] Writing 15301 bytes to /home/desktop0/submission/install.nbconvert.ipynb
Executing predict.ipynb...
[NbConvertApp] Converting notebook /home/desktop0/submission/predict.ipynb to notebook
[NbConvertApp] Executing notebook with kernel: python
2021-05-22 13:37:50.903967: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-05-22 13:37:50.904015: I tensorflow/stream_executor/cuda/cudart_stu