Classifying DNA Sequences¶

This project explores the world of bioinformatics by using Markov models, K-nearest neighbor (KNN) algorithms, support vector machines, and other common classifiers to classify short E. Coli DNA sequences. This project will use a dataset from the UCI Machine Learning Repository that has 106 DNA sequences, with 57 sequential nucleotides (“base-pairs”) each.

Topics covered:

  • Import data from the UCI repository
  • Convert text inputs to numerical data
  • Build and train classification algorithms
  • Compare and contrast classification algorithms

Step 1: Importing the Dataset¶

The following code cells will import necessary libraries and import the dataset from the UCI repository as a Pandas DataFrame.

In [1]:
# To make sure all of the correct libraries are installed, import each module. 

import sys
import numpy
import sklearn
import pandas

In [2]:
# Import, change module names
import numpy as np
import pandas as pd

# import the uci Molecular Biology (Promoter Gene Sequences) Data Set
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.data'
names = ['Class', 'id', 'Sequence']
data = pd.read_csv(url, names = names)
In [3]:
print(data.iloc[0])
Class                                                       +
id                                                        S10
Sequence    \t\ttactagcaatacgcttgcgttcggtggttaagtatgtataat...
Name: 0, dtype: object

Step 2: Preprocessing the Dataset¶

The data is not in a usable form; as a result, we will need to process it before using it to train our algorithms.

In [4]:
# Building our Dataset by creating a custom Pandas DataFrame
# Each column in a DataFrame is called a Series. Lets start by making a series for each column.

classes = data.loc[:, 'Class']
print(classes[:5])
0    +
1    +
2    +
3    +
4    +
Name: Class, dtype: object
In [5]:
# generate list of DNA sequences
sequences = list(data.loc[:, 'Sequence'])
dataset = {}

# loop through sequences and split into individual nucleotides
for i, seq in enumerate(sequences):
    
    # split into nucleotides, remove tab characters
    nucleotides = list(seq)
    nucleotides = [x for x in nucleotides if x != '\t']
    
    # append class assignment
    nucleotides.append(classes[i])
    
    # add to dataset
    dataset[i] = nucleotides
    
print(dataset[0])
['t', 'a', 'c', 't', 'a', 'g', 'c', 'a', 'a', 't', 'a', 'c', 'g', 'c', 't', 't', 'g', 'c', 'g', 't', 't', 'c', 'g', 'g', 't', 'g', 'g', 't', 't', 'a', 'a', 'g', 't', 'a', 't', 'g', 't', 'a', 't', 'a', 'a', 't', 'g', 'c', 'g', 'c', 'g', 'g', 'g', 'c', 't', 't', 'g', 't', 'c', 'g', 't', '+']
In [6]:
# turn dataset into pandas DataFrame
dframe = pd.DataFrame(dataset)
print(dframe)
   0   1   2   3   4   5   6   7   8   9   ... 96  97  98  99  100 101 102  \
0    t   t   g   a   t   a   c   t   c   t ...   c   c   t   a   g   c   g   
1    a   g   t   a   c   g   a   t   g   t ...   c   g   a   g   a   c   t   
2    c   c   a   t   g   g   g   t   a   t ...   g   c   t   a   g   t   a   
3    t   t   c   t   a   g   g   c   c   t ...   a   t   g   g   a   c   t   
4    a   a   t   g   t   g   g   t   t   a ...   g   a   a   g   g   a   t   
5    g   t   a   t   a   c   g   a   t   a ...   t   g   c   g   c   a   c   
6    c   c   g   g   a   a   g   c   a   a ...   a   g   c   t   a   t   t   
7    a   c   a   a   t   a   t   a   a   t ...   g   a   g   g   t   g   c   
8    a   t   g   t   t   g   g   a   t   t ...   a   c   a   t   g   g   a   
9    t   g   a   g   a   g   g   a   a   t ...   c   t   a   a   t   c   a   
10   a   a   a   t   a   a   a   a   t   c ...   c   t   c   c   c   c   c   
11   c   c   c   g   c   g   g   c   a   c ...   c   t   g   t   a   t   a   
12   g   a   t   t   t   g   g   a   c   t ...   t   c   a   c   g   c   a   
13   c   g   a   a   a   a   a   c   t   c ...   t   t   g   c   c   t   g   
14   t   t   g   t   t   t   t   t   g   t ...   a   t   t   a   c   a   a   
15   t   t   t   c   t   g   t   t   c   t ...   g   g   c   a   t   a   t   
16   g   g   g   g   g   g   t   g   g   g ...   a   t   a   g   c   a   t   
17   c   t   c   a   a   a   a   a   a   t ...   g   t   a   a   g   c   a   
18   g   c   a   a   c   a   a   t   c   c ...   a   g   t   a   a   g   a   
19   t   a   t   g   g   a   g   a   a   a ...   g   a   c   g   c   g   c   
20   t   c   t   t   a   g   c   c   g   g ...   c   t   a   a   a   g   c   
21   c   g   a   g   a   a   c   t   g   g ...   a   t   g   g   a   t   g   
22   g   c   g   t   a   g   a   g   a   c ...   t   t   a   g   c   c   a   
23   g   t   c   g   a   g   t   t   c   c ...   g   t   c   a   t   t   c   
24   t   g   t   t   g   t   c   a   g   g ...   t   c   c   a   t   t   a   
25   g   a   t   t   c   t   t   t   t   g ...   c   c   g   g   g   g   g   
26   g   t   a   g   t   g   c   g   c   a ...   a   a   c   a   c   a   a   
27   t   t   t   c   g   c   c   a   c   a ...   g   t   t   t   a   g   t   
28   t   g   t   g   a   c   t   g   g   t ...   c   g   t   g   t   g   t   
29   a   g   t   g   a   g   g   c   t   a ...   c   c   t   a   a   g   c   
30   a   t   t   a   a   t   a   a   t   a ...   t   g   g   g   a   g   a   
31   g   g   t   g   a   a   t   t   c   c ...   c   g   a   g   a   t   a   
32   t   t   t   t   c   t   g   a   t   t ...   g   t   c   c   t   t   t   
33   a   c   t   a   c   a   a   c   g   c ...   a   g   t   t   g   t   c   
34   t   g   g   g   a   a   c   a   t   c ...   c   t   c   a   c   t   t   
35   g   t   t   a   c   a   g   g   g   c ...   a   t   t   g   t   t   c   
36   t   t   t   t   t   g   c   t   t   t ...   a   t   g   a   t   t   g   
37   a   a   a   g   a   a   a   a   a   a ...   c   t   g   c   t   g   t   
38   t   c   t   t   g   a   t   t   a   t ...   t   g   t   g   c   c   g   
39   a   a   c   t   a   a   a   a   a   a ...   t   c   a   t   t   t   g   
40   a   a   a   a   a   c   g   a   t   a ...   g   g   t   c   t   g   a   
41   t   t   t   g   t   t   t   t   c   t ...   c   c   t   t   g   a   t   
42   g   c   g   a   g   a   c   t   g   g ...   a   a   a   c   t   a   g   
43   c   t   c   a   c   g   a   g   c   c ...   t   a   c   t   a   a   g   
44   g   a   t   t   g   a   g   c   a   g ...   a   t   t   g   g   g   a   
45   c   a   a   a   c   g   c   t   a   c ...   a   g   g   c   a   g   c   
46   g   c   a   c   c   t   c   t   t   c ...   a   t   t   a   c   a   g   
47   g   g   c   t   t   c   c   c   g   a ...   t   t   g   t   g   g   t   
48   g   c   c   a   c   c   a   a   a   c ...   g   a   a   g   t   g   t   
49   c   a   a   a   c   g   t   a   a   c ...   c   a   a   g   g   a   c   
50   t   t   c   c   g   t   c   c   a   a ...   t   t   c   a   c   a   a   
51   t   c   c   a   t   t   a   a   t   c ...   t   c   a   g   c   c   a   
52   g   g   c   a   g   t   t   g   g   t ...   t   g   t   t   c   t   c   
53   t   c   g   a   g   a   g   a   g   g ...   c   c   t   a   t   a   a   
54   c   c   g   c   t   g   a   a   t   a ...   t   t   a   t   a   t   t   
55   g   a   c   t   a   g   a   c   t   c ...   t   t   t   g   c   a   t   
56   t   a   g   c   g   t   t   a   t   a ...   g   t   t   a   g   t   g   
57   +   +   +   +   +   +   +   +   +   + ...   -   -   -   -   -   -   -   

   103 104 105  
0    c   c   t  
1    g   t   a  
2    c   c   a  
3    g   g   c  
4    a   t   a  
5    c   c   t  
6    t   c   t  
7    a   t   a  
8    c   c   a  
9    g   a   t  
10   a   a   a  
11   t   t   a  
12   g   g   a  
13   a   g   t  
14   g   c   a  
15   a   c   a  
16   t   t   g  
17   g   c   g  
18   c   t   a  
19   c   a   g  
20   t   a   g  
21   g   a   c  
22   a   c   t  
23   g   g   c  
24   t   g   t  
25   g   g   a  
26   c   t   a  
27   t   c   t  
28   t   t   g  
29   c   t   g  
30   c   g   c  
31   g   a   a  
32   t   g   c  
33   t   g   t  
34   a   g   c  
35   c   g   a  
36   t   t   t  
37   g   t   t  
38   g   t   a  
39   a   t   g  
40   t   t   c  
41   t   t   c  
42   g   g   a  
43   t   c   a  
44   c   t   t  
45   a   g   c  
46   c   a   a  
47   c   a   a  
48   a   a   t  
49   a   g   c  
50   g   g   a  
51   g   a   a  
52   c   g   g  
53   t   g   a  
54   t   a   a  
55   c   a   c  
56   c   c   t  
57   -   -   -  

[58 rows x 106 columns]
In [7]:
# transpose the DataFrame
df = dframe.transpose()
print(df.iloc[:5])
  0  1  2  3  4  5  6  7  8  9  ... 48 49 50 51 52 53 54 55 56 57
0  t  a  c  t  a  g  c  a  a  t ...  g  c  t  t  g  t  c  g  t  +
1  t  g  c  t  a  t  c  c  t  g ...  c  a  t  c  g  c  c  a  a  +
2  g  t  a  c  t  a  g  a  g  a ...  c  a  c  c  c  g  g  c  g  +
3  a  a  t  t  g  t  g  a  t  g ...  a  a  c  a  a  a  c  t  c  +
4  t  c  g  a  t  a  a  t  t  a ...  c  c  g  t  g  g  t  a  g  +

[5 rows x 58 columns]
In [8]:
# for clarity, lets rename the last dataframe column to class
df.rename(columns = {57: 'Class'}, inplace = True) 
print(df.iloc[:5])
   0  1  2  3  4  5  6  7  8  9  ...  48 49 50 51 52 53 54 55 56 Class
0  t  a  c  t  a  g  c  a  a  t  ...   g  c  t  t  g  t  c  g  t     +
1  t  g  c  t  a  t  c  c  t  g  ...   c  a  t  c  g  c  c  a  a     +
2  g  t  a  c  t  a  g  a  g  a  ...   c  a  c  c  c  g  g  c  g     +
3  a  a  t  t  g  t  g  a  t  g  ...   a  a  c  a  a  a  c  t  c     +
4  t  c  g  a  t  a  a  t  t  a  ...   c  c  g  t  g  g  t  a  g     +

[5 rows x 58 columns]
In [9]:
# Let's start to familiarize ourselves with the dataset so we can pick the most suitable 
# algorithms for this data

df.describe()
Out[9]:
0 1 2 3 4 5 6 7 8 9 ... 48 49 50 51 52 53 54 55 56 Class
count 106 106 106 106 106 106 106 106 106 106 ... 106 106 106 106 106 106 106 106 106 106
unique 4 4 4 4 4 4 4 4 4 4 ... 4 4 4 4 4 4 4 4 4 2
top t a a c a a a a a a ... c c c t t c c t t -
freq 38 34 30 30 36 42 38 34 33 36 ... 36 42 31 33 35 32 29 29 34 53

4 rows × 58 columns

In [10]:
# desribe does not tell us enough information since the attributes are text. Lets record value counts for each sequence
series = []
for name in df.columns:
    series.append(df[name].value_counts())
    
info = pd.DataFrame(series)
details = info.transpose()
print(details)
      0     1     2     3     4     5     6     7     8     9  ...      48  \
+   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...     NaN   
-   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN  ...     NaN   
a  26.0  34.0  30.0  22.0  36.0  42.0  38.0  34.0  33.0  36.0  ...    23.0   
c  27.0  22.0  21.0  30.0  19.0  18.0  21.0  20.0  22.0  22.0  ...    36.0   
g  15.0  24.0  28.0  28.0  29.0  22.0  17.0  20.0  19.0  20.0  ...    26.0   
t  38.0  26.0  27.0  26.0  22.0  24.0  30.0  32.0  32.0  28.0  ...    21.0   

     49    50    51    52    53    54    55    56  Class  
+   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   53.0  
-   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   53.0  
a  24.0  28.0  27.0  25.0  22.0  26.0  24.0  27.0    NaN  
c  42.0  31.0  32.0  21.0  32.0  29.0  29.0  17.0    NaN  
g  18.0  24.0  14.0  25.0  22.0  28.0  24.0  28.0    NaN  
t  22.0  23.0  33.0  35.0  30.0  23.0  29.0  34.0    NaN  

[6 rows x 58 columns]
In [11]:
# We can't run machine learning algorithms on the data in 'String' formats. As a result, we need to switch
# it to numerical data. This can easily be accomplished using the pd.get_dummies() function
numerical_df = pd.get_dummies(df)
numerical_df.iloc[:5]
Out[11]:
0_a 0_c 0_g 0_t 1_a 1_c 1_g 1_t 2_a 2_c ... 55_a 55_c 55_g 55_t 56_a 56_c 56_g 56_t Class_+ Class_-
0 0 0 0 1 1 0 0 0 0 1 ... 0 0 1 0 0 0 0 1 1 0
1 0 0 0 1 0 0 1 0 0 1 ... 1 0 0 0 1 0 0 0 1 0
2 0 0 1 0 0 0 0 1 1 0 ... 0 1 0 0 0 0 1 0 1 0
3 1 0 0 0 1 0 0 0 0 0 ... 0 0 0 1 0 1 0 0 1 0
4 0 0 0 1 0 1 0 0 0 0 ... 1 0 0 0 0 0 1 0 1 0

5 rows × 230 columns

In [12]:
# We don't need both class columns.  We'll drop one then rename the other to simply 'Class'.
df = numerical_df.drop(columns=['Class_-'])

df.rename(columns = {'Class_+': 'Class'}, inplace = True)
print(df.iloc[:5])
   0_a  0_c  0_g  0_t  1_a  1_c  1_g  1_t  2_a  2_c  ...    54_t  55_a  55_c  \
0    0    0    0    1    1    0    0    0    0    1  ...       0     0     0   
1    0    0    0    1    0    0    1    0    0    1  ...       0     1     0   
2    0    0    1    0    0    0    0    1    1    0  ...       0     0     1   
3    1    0    0    0    1    0    0    0    0    0  ...       0     0     0   
4    0    0    0    1    0    1    0    0    0    0  ...       1     1     0   

   55_g  55_t  56_a  56_c  56_g  56_t  Class  
0     1     0     0     0     0     1      1  
1     0     0     1     0     0     0      1  
2     0     0     0     0     1     0      1  
3     0     1     0     1     0     0      1  
4     0     0     0     0     1     0      1  

[5 rows x 229 columns]
In [13]:
# Use the model_selection module to separate training and testing datasets
from sklearn import model_selection

# Create X and Y datasets for training
X = np.array(df.drop(['Class'], 1))
y = np.array(df['Class'])

# define seed for reproducibility
seed = 1

# split data into training and testing datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=seed)

Step 3: Training and Testing the Classification Algorithms¶

Now that we have preprocessed the data and built our training and testing datasets, we can start to deploy different classification algorithms. It's relatively easy to test multiple models; as a result, we will compare and contrast the performance of ten different algorithms.

In [14]:
# Import each algorithm we plan on using
# from sklearn. Import some performance metrics, such as accuracy_score and classification_report.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score

# define scoring method
scoring = 'accuracy'

# Define models to train
names = ["Nearest Neighbors", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "SVM Linear", "SVM RBF", "SVM Sigmoid"]

classifiers = [
    KNeighborsClassifier(n_neighbors = 3),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1),
    AdaBoostClassifier(),
    GaussianNB(),
    SVC(kernel = 'linear'), 
    SVC(kernel = 'rbf'),
    SVC(kernel = 'sigmoid')
]

models = zip(names, classifiers)

# evaluate each model in turn
results = []
names = []

for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state = seed)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Nearest Neighbors: 0.823214 (0.113908)
Gaussian Process: 0.873214 (0.056158)
Decision Tree: 0.750000 (0.185405)
Random Forest: 0.580357 (0.106021)
C:\Programdata\anaconda2\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py:564: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
Neural Net: 0.887500 (0.087500)
AdaBoost: 0.912500 (0.112500)
Naive Bayes: 0.837500 (0.137500)
SVM Linear: 0.850000 (0.108972)
SVM RBF: 0.737500 (0.117925)
SVM Sigmoid: 0.569643 (0.159209)
In [15]:
# Make predictions on the validation dataset.

for name, model in models:
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(name)
    print(accuracy_score(y_test, predictions))
    print(classification_report(y_test, predictions))
    
# Accuracy - ratio of correctly predicted observation to the total observations. 
# Precision - (false positives) ratio of correctly predicted positive observations to the total predicted positive observations
# Recall (Sensitivity) - (false negatives) ratio of correctly predicted positive observations to the all observations in actual class - yes.
# F1 score - F1 Score is the weighted average of Precision and Recall. Therefore, this score takes both false positives and false 
Nearest Neighbors
0.7777777777777778
             precision    recall  f1-score   support

          0       1.00      0.65      0.79        17
          1       0.62      1.00      0.77        10

avg / total       0.86      0.78      0.78        27

Gaussian Process
0.8888888888888888
             precision    recall  f1-score   support

          0       1.00      0.82      0.90        17
          1       0.77      1.00      0.87        10

avg / total       0.91      0.89      0.89        27

Decision Tree
0.7777777777777778
             precision    recall  f1-score   support

          0       1.00      0.65      0.79        17
          1       0.62      1.00      0.77        10

avg / total       0.86      0.78      0.78        27

Random Forest
0.5925925925925926
             precision    recall  f1-score   support

          0       0.88      0.41      0.56        17
          1       0.47      0.90      0.62        10

avg / total       0.73      0.59      0.58        27

Neural Net
0.9259259259259259
             precision    recall  f1-score   support

          0       1.00      0.88      0.94        17
          1       0.83      1.00      0.91        10

avg / total       0.94      0.93      0.93        27

AdaBoost
0.8518518518518519
             precision    recall  f1-score   support

          0       1.00      0.76      0.87        17
          1       0.71      1.00      0.83        10

avg / total       0.89      0.85      0.85        27

Naive Bayes
0.9259259259259259
             precision    recall  f1-score   support

          0       1.00      0.88      0.94        17
          1       0.83      1.00      0.91        10

avg / total       0.94      0.93      0.93        27

SVM Linear
0.9629629629629629
             precision    recall  f1-score   support

          0       1.00      0.94      0.97        17
          1       0.91      1.00      0.95        10

avg / total       0.97      0.96      0.96        27

SVM RBF
0.7777777777777778
             precision    recall  f1-score   support

          0       1.00      0.65      0.79        17
          1       0.62      1.00      0.77        10

avg / total       0.86      0.78      0.78        27

SVM Sigmoid
0.4444444444444444
             precision    recall  f1-score   support

          0       1.00      0.12      0.21        17
          1       0.40      1.00      0.57        10

avg / total       0.78      0.44      0.34        27

In [ ]: