Childhood Autistic Spectrum Disorder Screening using Machine Learning

The early diagnosis of neurodevelopmental disorders can improve treatment and significantly reduce the associated healthcare costs. In this project, we used supervised learning to screen for Autistic Spectrum Disorder (ASD) based on behavioural features and individual characteristics. More specifically, we built and trained a neural network using the Keras API.

This project used a dataset provided by the UCI Machine Learning Repository that contains screening data for 292 patients. The dataset can be found at the following URL: https://archive.ics.uci.edu/ml/datasets/Autistic+Spectrum+Disorder+Screening+Data+for+Children++

1. Importing the Dataset

The data was obtained from the UCI Machine Learning Repository; it is distributed as a compressed zip archive, which was downloaded and extracted manually. The extracted text file is then read in with pandas.
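
For reproducibility, the download-and-extract step can also be scripted. Below is a minimal sketch; the archive URL and file layout are assumptions based on the UCI listing, so adjust them to match the actual download link on the dataset page.

import io
import zipfile
import urllib.request

# hypothetical archive URL - check the UCI dataset page for the actual link
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00419/Autism-Child-Data.zip'

# download the archive into memory and extract all of its members
with urllib.request.urlopen(url) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))
    archive.extractall('.')

# inspect the extracted file names
print(archive.namelist())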

In [2]:
# import pandas so the data can be read into a DataFrame
import pandas as pd

# path to the extracted data file
file = 'autism-data.txt'

# read the comma-separated values into a DataFrame
data = pd.read_csv(file, index_col=None)
In [3]:
# print the shape of the DataFrame, so we can see how many examples we have
print('Shape of DataFrame: {}'.format(data.shape))
print(data.loc[0])
Shape of DataFrame: (292, 21)
A1_Score                            1
A2_Score                            1
A3_Score                            0
A4_Score                            0
A5_Score                            1
A6_Score                            1
A7_Score                            0
A8_Score                            1
A9_Score                            0
A10_Score                           0
age                                 6
gender                              m
ethnicity                      Others
jundice                            no
family_history_of_PDD              no
contry_of_res                  Jordan
used_app_before                    no
result                              5
age_desc                 '4-11 years'
relation                       Parent
class                              NO
Name: 0, dtype: object
In [4]:
# print out multiple patients at the same time
data.loc[:10]
Out[4]:
A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score A9_Score A10_Score ... gender ethnicity jundice family_history_of_PDD contry_of_res used_app_before result age_desc relation class
0 1 1 0 0 1 1 0 1 0 0 ... m Others no no Jordan no 5 '4-11 years' Parent NO
1 1 1 0 0 1 1 0 1 0 0 ... m 'Middle Eastern ' no no Jordan no 5 '4-11 years' Parent NO
2 1 1 0 0 0 1 1 1 0 0 ... m ? no no Jordan yes 5 '4-11 years' ? NO
3 0 1 0 0 1 1 0 0 0 1 ... f ? yes no Jordan no 4 '4-11 years' ? NO
4 1 1 1 1 1 1 1 1 1 1 ... m Others yes no 'United States' no 10 '4-11 years' Parent YES
5 0 0 1 0 1 1 0 1 0 1 ... m ? no yes Egypt no 5 '4-11 years' ? NO
6 1 0 1 1 1 1 0 1 0 1 ... m White-European no no 'United Kingdom' no 7 '4-11 years' Parent YES
7 1 1 1 1 1 1 1 1 0 0 ... f 'Middle Eastern ' no no Bahrain no 8 '4-11 years' Parent YES
8 1 1 1 1 1 1 1 0 0 0 ... f 'Middle Eastern ' no no Bahrain no 7 '4-11 years' Parent YES
9 0 0 1 1 1 0 1 1 0 0 ... f ? no yes Austria no 5 '4-11 years' ? NO
10 1 0 0 0 1 1 1 1 1 1 ... m White-European yes no 'United Kingdom' no 7 '4-11 years' Self YES

11 rows × 21 columns

In [5]:
# print out a description of the dataframe
data.describe()
Out[5]:
A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score A9_Score A10_Score result
count 292.000000 292.000000 292.000000 292.000000 292.000000 292.000000 292.000000 292.000000 292.000000 292.000000 292.000000
mean 0.633562 0.534247 0.743151 0.551370 0.743151 0.712329 0.606164 0.496575 0.493151 0.726027 6.239726
std 0.482658 0.499682 0.437646 0.498208 0.437646 0.453454 0.489438 0.500847 0.500811 0.446761 2.284882
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000
50% 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 0.000000 1.000000 6.000000
75% 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 8.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10.000000

2. Data Preprocessing

This dataset requires several preprocessing steps. Some columns in the DataFrame aren't needed for training the neural network; several other columns store their values as strings and need to be converted to categorical (one-hot encoded) features. Finally, the dataset needs to be split into X and Y, where X holds the attributes used for prediction and Y holds the class labels.
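
To make the string-to-categorical step concrete, here is a toy illustration (the tiny frame below is made up for demonstration): pd.get_dummies turns each distinct string value, including the '?' placeholder this dataset uses for missing entries, into its own 0/1 indicator column.

import pandas as pd

# a made-up frame mimicking the string-valued columns in this dataset
demo = pd.DataFrame({'gender': ['m', 'f', 'm'],
                     'relation': ['Parent', '?', 'Self']})

# one-hot encode: yields gender_f, gender_m, relation_?, relation_Parent, relation_Self
print(pd.get_dummies(demo))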

In [6]:
# drop unwanted columns
data = data.drop(['result', 'age_desc'], axis=1)
In [7]:
data.loc[:10]
Out[7]:
A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score A9_Score A10_Score age gender ethnicity jundice family_history_of_PDD contry_of_res used_app_before relation class
0 1 1 0 0 1 1 0 1 0 0 6 m Others no no Jordan no Parent NO
1 1 1 0 0 1 1 0 1 0 0 6 m 'Middle Eastern ' no no Jordan no Parent NO
2 1 1 0 0 0 1 1 1 0 0 6 m ? no no Jordan yes ? NO
3 0 1 0 0 1 1 0 0 0 1 5 f ? yes no Jordan no ? NO
4 1 1 1 1 1 1 1 1 1 1 5 m Others yes no 'United States' no Parent YES
5 0 0 1 0 1 1 0 1 0 1 4 m ? no yes Egypt no ? NO
6 1 0 1 1 1 1 0 1 0 1 5 m White-European no no 'United Kingdom' no Parent YES
7 1 1 1 1 1 1 1 1 0 0 5 f 'Middle Eastern ' no no Bahrain no Parent YES
8 1 1 1 1 1 1 1 0 0 0 11 f 'Middle Eastern ' no no Bahrain no Parent YES
9 0 0 1 1 1 0 1 1 0 0 11 f ? no yes Austria no ? NO
10 1 0 0 0 1 1 1 1 1 1 10 m White-European yes no 'United Kingdom' no Self YES
In [8]:
# create X and Y datasets for training
x = data.drop(['class'], axis=1)
y = data['class']
In [9]:
x.loc[:10]
Out[9]:
A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score A9_Score A10_Score age gender ethnicity jundice family_history_of_PDD contry_of_res used_app_before relation
0 1 1 0 0 1 1 0 1 0 0 6 m Others no no Jordan no Parent
1 1 1 0 0 1 1 0 1 0 0 6 m 'Middle Eastern ' no no Jordan no Parent
2 1 1 0 0 0 1 1 1 0 0 6 m ? no no Jordan yes ?
3 0 1 0 0 1 1 0 0 0 1 5 f ? yes no Jordan no ?
4 1 1 1 1 1 1 1 1 1 1 5 m Others yes no 'United States' no Parent
5 0 0 1 0 1 1 0 1 0 1 4 m ? no yes Egypt no ?
6 1 0 1 1 1 1 0 1 0 1 5 m White-European no no 'United Kingdom' no Parent
7 1 1 1 1 1 1 1 1 0 0 5 f 'Middle Eastern ' no no Bahrain no Parent
8 1 1 1 1 1 1 1 0 0 0 11 f 'Middle Eastern ' no no Bahrain no Parent
9 0 0 1 1 1 0 1 1 0 0 11 f ? no yes Austria no ?
10 1 0 0 0 1 1 1 1 1 1 10 m White-European yes no 'United Kingdom' no Self
In [10]:
# convert the data to categorical values - one-hot-encoded vectors
X = pd.get_dummies(x)
In [11]:
# print the new categorical column labels
X.columns.values
Out[11]:
array(['A1_Score', 'A2_Score', 'A3_Score', 'A4_Score', 'A5_Score',
       'A6_Score', 'A7_Score', 'A8_Score', 'A9_Score', 'A10_Score',
       'age_10', 'age_11', 'age_4', 'age_5', 'age_6', 'age_7', 'age_8',
       'age_9', 'age_?', 'gender_f', 'gender_m',
       "ethnicity_'Middle Eastern '", "ethnicity_'South Asian'",
       'ethnicity_?', 'ethnicity_Asian', 'ethnicity_Black',
       'ethnicity_Hispanic', 'ethnicity_Latino', 'ethnicity_Others',
       'ethnicity_Pasifika', 'ethnicity_Turkish',
       'ethnicity_White-European', 'jundice_no', 'jundice_yes',
       'family_history_of_PDD_no', 'family_history_of_PDD_yes',
       "contry_of_res_'Costa Rica'", "contry_of_res_'Isle of Man'",
       "contry_of_res_'New Zealand'", "contry_of_res_'Saudi Arabia'",
       "contry_of_res_'South Africa'", "contry_of_res_'South Korea'",
       "contry_of_res_'U.S. Outlying Islands'",
       "contry_of_res_'United Arab Emirates'",
       "contry_of_res_'United Kingdom'", "contry_of_res_'United States'",
       'contry_of_res_Afghanistan', 'contry_of_res_Argentina',
       'contry_of_res_Armenia', 'contry_of_res_Australia',
       'contry_of_res_Austria', 'contry_of_res_Bahrain',
       'contry_of_res_Bangladesh', 'contry_of_res_Bhutan',
       'contry_of_res_Brazil', 'contry_of_res_Bulgaria',
       'contry_of_res_Canada', 'contry_of_res_China',
       'contry_of_res_Egypt', 'contry_of_res_Europe',
       'contry_of_res_Georgia', 'contry_of_res_Germany',
       'contry_of_res_Ghana', 'contry_of_res_India', 'contry_of_res_Iraq',
       'contry_of_res_Ireland', 'contry_of_res_Italy',
       'contry_of_res_Japan', 'contry_of_res_Jordan',
       'contry_of_res_Kuwait', 'contry_of_res_Latvia',
       'contry_of_res_Lebanon', 'contry_of_res_Libya',
       'contry_of_res_Malaysia', 'contry_of_res_Malta',
       'contry_of_res_Mexico', 'contry_of_res_Nepal',
       'contry_of_res_Netherlands', 'contry_of_res_Nigeria',
       'contry_of_res_Oman', 'contry_of_res_Pakistan',
       'contry_of_res_Philippines', 'contry_of_res_Qatar',
       'contry_of_res_Romania', 'contry_of_res_Russia',
       'contry_of_res_Sweden', 'contry_of_res_Syria',
       'contry_of_res_Turkey', 'used_app_before_no',
       'used_app_before_yes', "relation_'Health care professional'",
       'relation_?', 'relation_Parent', 'relation_Relative',
       'relation_Self', 'relation_self'], dtype=object)
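
Two quirks of the encoding are worth noting. First, age appears as one-hot columns (age_4 through age_11, plus age_?) rather than as a number: the '?' placeholders force pandas to read the column as strings, so get_dummies encodes it categorically. Second, relation_Self and relation_self are separate columns because the raw data mixes capitalisation. If a numeric age were preferred, the column could be coerced and imputed before calling get_dummies; a sketch of that alternative (not what this notebook does):

# alternative (not used here): keep age numeric by coercing '?' to NaN
# and filling the missing entries with the median age
x['age'] = pd.to_numeric(x['age'], errors='coerce')
x['age'] = x['age'].fillna(x['age'].median())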
In [12]:
# print an example patient from the categorical data
X.loc[1]
Out[12]:
A1_Score                               1
A2_Score                               1
A3_Score                               0
A4_Score                               0
A5_Score                               1
A6_Score                               1
A7_Score                               0
A8_Score                               1
A9_Score                               0
A10_Score                              0
age_10                                 0
age_11                                 0
age_4                                  0
age_5                                  0
age_6                                  1
age_7                                  0
age_8                                  0
age_9                                  0
age_?                                  0
gender_f                               0
gender_m                               1
ethnicity_'Middle Eastern '            1
ethnicity_'South Asian'                0
ethnicity_?                            0
ethnicity_Asian                        0
ethnicity_Black                        0
ethnicity_Hispanic                     0
ethnicity_Latino                       0
ethnicity_Others                       0
ethnicity_Pasifika                     0
                                      ..
contry_of_res_Italy                    0
contry_of_res_Japan                    0
contry_of_res_Jordan                   1
contry_of_res_Kuwait                   0
contry_of_res_Latvia                   0
contry_of_res_Lebanon                  0
contry_of_res_Libya                    0
contry_of_res_Malaysia                 0
contry_of_res_Malta                    0
contry_of_res_Mexico                   0
contry_of_res_Nepal                    0
contry_of_res_Netherlands              0
contry_of_res_Nigeria                  0
contry_of_res_Oman                     0
contry_of_res_Pakistan                 0
contry_of_res_Philippines              0
contry_of_res_Qatar                    0
contry_of_res_Romania                  0
contry_of_res_Russia                   0
contry_of_res_Sweden                   0
contry_of_res_Syria                    0
contry_of_res_Turkey                   0
used_app_before_no                     1
used_app_before_yes                    0
relation_'Health care professional'    0
relation_?                             0
relation_Parent                        1
relation_Relative                      0
relation_Self                          0
relation_self                          0
Name: 1, Length: 96, dtype: int64
In [13]:
# convert the class data to categorical values - one-hot-encoded vectors
Y = pd.get_dummies(y)
In [14]:
Y.iloc[:10]
Out[14]:
NO YES
0 1 0
1 1 0
2 1 0
3 1 0
4 0 1
5 1 0
6 0 1
7 0 1
8 0 1
9 1 0

3. Split the Dataset into Training and Testing Datasets

Before training the neural network, the dataset needs to be split into training and testing subsets. This can be done with the train_test_split() function provided by scikit-learn.
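
As a side note, passing random_state to train_test_split makes the split reproducible, and stratify keeps the NO/YES class balance the same in both subsets; a sketch of that variant (not what was run here):

from sklearn.model_selection import train_test_split

# reproducible, class-balanced variant of the split used in the next cell
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=y, random_state=42)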

In [15]:
from sklearn import model_selection
# split the X and Y data into training and testing datasets
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size = 0.2)
In [16]:
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
(233, 96)
(59, 96)
(233, 2)
(59, 2)

4. Building the Network - Keras

Keras is used to build and train the network. The model is relatively simple, using only dense (also known as fully connected) layers. The network has two hidden layers with ReLU activations and a softmax output layer, uses the Adam optimizer, and is trained with a categorical cross-entropy loss.

In [17]:
# build a neural network using Keras
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# define a function to build the keras model
def create_model():
    # create model: two small hidden layers followed by a two-unit output
    model = Sequential()
    model.add(Dense(8, input_dim=96, kernel_initializer='normal', activation='relu'))
    model.add(Dense(4, kernel_initializer='normal', activation='relu'))
    # softmax (rather than sigmoid) pairs correctly with the categorical
    # crossentropy loss, producing class probabilities that sum to one
    model.add(Dense(2, activation='softmax'))

    # compile model with the Adam optimizer and a learning rate of 0.001
    adam = Adam(lr=0.001)
    model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
    return model

model = create_model()

print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 8)                 776       
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 36        
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 10        
=================================================================
Total params: 822
Trainable params: 822
Non-trainable params: 0
_________________________________________________________________
None
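
The parameter counts in the summary follow from weights plus biases per layer: the first hidden layer has 96 × 8 + 8 = 776 parameters, the second has 8 × 4 + 4 = 36, and the output layer has 4 × 2 + 2 = 10, giving 822 in total.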

5. Training the Network

Train the Keras model by calling model.fit(), here for 50 epochs with a batch size of 10.
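
Since the log below shows training accuracy reaching 100%, it can also be worth holding out a slice of the training data to watch for overfitting as training progresses; a sketch of that variant (not what was run here):

# variant that reports metrics on a 10% validation slice each epoch
history = model.fit(X_train, Y_train, epochs=50, batch_size=10,
                    validation_split=0.1, verbose=1)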

In [18]:
# fit the model to the training data
model.fit(X_train, Y_train, epochs=50, batch_size=10, verbose=1)
Epoch 1/50
233/233 [==============================] - 0s 288us/step - loss: 0.6927 - acc: 0.5794
Epoch 2/50
233/233 [==============================] - 0s 245us/step - loss: 0.6910 - acc: 0.7210
Epoch 3/50
233/233 [==============================] - 0s 258us/step - loss: 0.6868 - acc: 0.7639
Epoch 4/50
233/233 [==============================] - 0s 236us/step - loss: 0.6779 - acc: 0.7082
Epoch 5/50
233/233 [==============================] - 0s 236us/step - loss: 0.6619 - acc: 0.8541
Epoch 6/50
233/233 [==============================] - 0s 305us/step - loss: 0.6340 - acc: 0.8283
Epoch 7/50
233/233 [==============================] - 0s 227us/step - loss: 0.5963 - acc: 0.8541
Epoch 8/50
233/233 [==============================] - 0s 305us/step - loss: 0.5446 - acc: 0.9399
Epoch 9/50
233/233 [==============================] - 0s 240us/step - loss: 0.4884 - acc: 0.8884
Epoch 10/50
233/233 [==============================] - 0s 227us/step - loss: 0.4220 - acc: 0.9227
Epoch 11/50
233/233 [==============================] - 0s 322us/step - loss: 0.3603 - acc: 0.9313
Epoch 12/50
233/233 [==============================] - 0s 245us/step - loss: 0.2935 - acc: 0.9614
Epoch 13/50
233/233 [==============================] - 0s 296us/step - loss: 0.2528 - acc: 0.9657
Epoch 14/50
233/233 [==============================] - 0s 330us/step - loss: 0.2087 - acc: 0.9657
Epoch 15/50
233/233 [==============================] - 0s 305us/step - loss: 0.1788 - acc: 0.9871
Epoch 16/50
233/233 [==============================] - 0s 313us/step - loss: 0.1605 - acc: 0.9700
Epoch 17/50
233/233 [==============================] - 0s 309us/step - loss: 0.1389 - acc: 0.9828
Epoch 18/50
233/233 [==============================] - 0s 335us/step - loss: 0.1258 - acc: 0.9785
Epoch 19/50
233/233 [==============================] - 0s 343us/step - loss: 0.1108 - acc: 0.9871
Epoch 20/50
233/233 [==============================] - 0s 399us/step - loss: 0.1004 - acc: 0.9871
Epoch 21/50
233/233 [==============================] - 0s 416us/step - loss: 0.0910 - acc: 0.9871
Epoch 22/50
233/233 [==============================] - 0s 343us/step - loss: 0.0820 - acc: 0.9871
Epoch 23/50
233/233 [==============================] - 0s 361us/step - loss: 0.0752 - acc: 0.9914
Epoch 24/50
233/233 [==============================] - 0s 356us/step - loss: 0.0714 - acc: 0.9957
Epoch 25/50
233/233 [==============================] - 0s 309us/step - loss: 0.0634 - acc: 0.9957
Epoch 26/50
233/233 [==============================] - 0s 339us/step - loss: 0.0585 - acc: 0.9957
Epoch 27/50
233/233 [==============================] - 0s 335us/step - loss: 0.0571 - acc: 1.0000
Epoch 28/50
233/233 [==============================] - 0s 429us/step - loss: 0.0526 - acc: 0.9957
Epoch 29/50
233/233 [==============================] - 0s 335us/step - loss: 0.0474 - acc: 1.0000
Epoch 30/50
233/233 [==============================] - 0s 322us/step - loss: 0.0463 - acc: 0.9957
Epoch 31/50
233/233 [==============================] - 0s 296us/step - loss: 0.0431 - acc: 1.0000
Epoch 32/50
233/233 [==============================] - 0s 348us/step - loss: 0.0381 - acc: 1.0000
Epoch 33/50
233/233 [==============================] - 0s 322us/step - loss: 0.0357 - acc: 1.0000
Epoch 34/50
233/233 [==============================] - 0s 292us/step - loss: 0.0331 - acc: 1.0000
Epoch 35/50
233/233 [==============================] - 0s 305us/step - loss: 0.0316 - acc: 1.0000
Epoch 36/50
233/233 [==============================] - 0s 335us/step - loss: 0.0294 - acc: 1.0000
Epoch 37/50
233/233 [==============================] - 0s 322us/step - loss: 0.0282 - acc: 1.0000
Epoch 38/50
233/233 [==============================] - 0s 236us/step - loss: 0.0281 - acc: 1.0000
Epoch 39/50
233/233 [==============================] - 0s 339us/step - loss: 0.0253 - acc: 1.0000
Epoch 40/50
233/233 [==============================] - 0s 223us/step - loss: 0.0252 - acc: 1.0000
Epoch 41/50
233/233 [==============================] - 0s 326us/step - loss: 0.0226 - acc: 1.0000
Epoch 42/50
233/233 [==============================] - 0s 326us/step - loss: 0.0213 - acc: 1.0000
Epoch 43/50
233/233 [==============================] - 0s 219us/step - loss: 0.0203 - acc: 1.0000
Epoch 44/50
233/233 [==============================] - 0s 215us/step - loss: 0.0193 - acc: 1.0000
Epoch 45/50
233/233 [==============================] - 0s 318us/step - loss: 0.0190 - acc: 1.0000
Epoch 46/50
233/233 [==============================] - 0s 232us/step - loss: 0.0176 - acc: 1.0000
Epoch 47/50
233/233 [==============================] - 0s 215us/step - loss: 0.0163 - acc: 1.0000
Epoch 48/50
233/233 [==============================] - 0s 202us/step - loss: 0.0161 - acc: 1.0000
Epoch 49/50
233/233 [==============================] - 0s 240us/step - loss: 0.0154 - acc: 1.0000
Epoch 50/50
233/233 [==============================] - 0s 223us/step - loss: 0.0150 - acc: 1.0000
Out[18]:
<keras.callbacks.History at 0x12000f28>

6. Testing and Performance Metrics

Now that the model has been trained, we evaluate its performance on the testing dataset. The model has never seen these examples before, so the test set tells us how well the model generalizes to data that wasn't used during training.

In [19]:
# generate classification report using predictions for categorical model
from sklearn.metrics import classification_report, accuracy_score

predictions = model.predict_classes(X_test)
predictions
Out[19]:
array([1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0], dtype=int64)
In [20]:
print('Results for Categorical Model')
print(accuracy_score(Y_test[['YES']], predictions))
print(classification_report(Y_test[['YES']], predictions))
Results for Categorical Model
0.9661016949152542
             precision    recall  f1-score   support

          0       0.97      0.97      0.97        36
          1       0.96      0.96      0.96        23

avg / total       0.97      0.97      0.97        59
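
The report corresponds to 57 of 59 test patients classified correctly (one error in each class). For a complementary view, a confusion matrix breaks those errors down by true and predicted class; a minimal sketch using the same predictions:

from sklearn.metrics import confusion_matrix

# rows are true classes (NO, YES); columns are predicted classes
print(confusion_matrix(Y_test['YES'], predictions))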
