This project explores the world of bioinformatics, using Markov models, K-nearest neighbors (KNN), support vector machines (SVMs), and other common classifiers to classify short E. coli DNA sequences. It uses a dataset from the UCI Machine Learning Repository containing 106 DNA sequences, each 57 sequential nucleotides ("base pairs") long.
The following code cells import the necessary libraries and load the dataset from the UCI repository into a Pandas DataFrame.
# Verify that the required libraries are installed by importing each module
import sys
import sklearn
# Import the main modules under their conventional aliases
import numpy as np
import pandas as pd
# Import the UCI Molecular Biology (Promoter Gene Sequences) dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/molecular-biology/promoter-gene-sequences/promoters.data'
names = ['Class', 'id', 'Sequence']
data = pd.read_csv(url, names = names)
print(data.iloc[0])
The raw data is not in a usable form, so we will need to preprocess it before using it to train our algorithms.
# Build our dataset by creating a custom Pandas DataFrame
# Each column in a DataFrame is called a Series. Let's start by making a Series for each column.
classes = data.loc[:, 'Class']
print(classes[:5])
# generate list of DNA sequences
sequences = list(data.loc[:, 'Sequence'])
dataset = {}
# loop through sequences and split into individual nucleotides
for i, seq in enumerate(sequences):
    # split into nucleotides, remove tab characters
    nucleotides = list(seq)
    nucleotides = [x for x in nucleotides if x != '\t']
    # append class assignment
    nucleotides.append(classes[i])
    # add to dataset
    dataset[i] = nucleotides
print(dataset[0])
# turn dataset into pandas DataFrame
dframe = pd.DataFrame(dataset)
print(dframe)
# transpose the DataFrame so each row holds one sequence and its class label
df = dframe.transpose()
print(df.iloc[:5])
# for clarity, let's rename the last DataFrame column to 'Class'
df.rename(columns = {57: 'Class'}, inplace = True)
print(df.iloc[:5])
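At this point the DataFrame should have one row per sequence: 57 nucleotide-position columns plus the class label. A quick sanity check against the counts quoted in the introduction:
# 106 sequences x (57 nucleotide positions + 1 class column)
print(df.shape)  # expected: (106, 58)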
# Let's start to familiarize ourselves with the dataset so we can pick the most suitable
# algorithms for this data
df.describe()
# describe() does not tell us enough, since the attributes are text. Let's record value counts for each column
series = []
for name in df.columns:
    series.append(df[name].value_counts())
info = pd.DataFrame(series)
details = info.transpose()
print(details)
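It is also worth confirming the class balance before modeling; the UCI documentation describes an even split of promoter ('+') and non-promoter ('-') instances, which we can check directly:
# distribution of class labels ('+' = promoter, '-' = non-promoter)
print(df['Class'].value_counts())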
# Machine learning algorithms can't operate on string data, so we need to convert
# it to numerical form. This can easily be accomplished with the pd.get_dummies() function
numerical_df = pd.get_dummies(df)
numerical_df.iloc[:5]
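To see what get_dummies() does, here is a minimal illustrative sketch on a toy single-column DataFrame (the data and the column name 'n0' are made up): every distinct value in a column becomes its own 0/1 indicator column.
# toy example: one-hot encode a single hypothetical column of nucleotides
toy = pd.DataFrame({'n0': ['a', 'c', 'g', 't', 'a']})
print(pd.get_dummies(toy))  # produces indicator columns n0_a, n0_c, n0_g, n0_t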
# We don't need both class columns. We'll drop one then rename the other to simply 'Class'.
df = numerical_df.drop(columns=['Class_-'])
df.rename(columns = {'Class_+': 'Class'}, inplace = True)
print(df.iloc[:5])
# Use the model_selection module to separate training and testing datasets
from sklearn import model_selection
# Create the X (features) and y (labels) arrays for training
X = np.array(df.drop(columns=['Class']))
y = np.array(df['Class'])
# define seed for reproducibility
seed = 1
# split data into training and testing datasets
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=seed)
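With only 106 samples, a purely random split can leave the two classes unevenly represented in the test set. As an optional variation (not part of the original split above), train_test_split accepts a stratify argument that preserves the class ratio in both partitions:
# optional: stratified split keeps the promoter/non-promoter ratio equal in train and test
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.25, random_state=seed, stratify=y)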
Now that we have preprocessed the data and built our training and testing datasets, we can start to deploy different classification algorithms. Scikit-learn makes it easy to test many models at once, so we will compare the performance of ten different algorithms.
# Import each algorithm we plan on using from sklearn, along with
# performance metrics such as accuracy_score and classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
# define scoring method
scoring = 'accuracy'
# Define models to train
names = ["Nearest Neighbors", "Gaussian Process",
"Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
"Naive Bayes", "SVM Linear", "SVM RBF", "SVM Sigmoid"]
classifiers = [
KNeighborsClassifier(n_neighbors = 3),
GaussianProcessClassifier(1.0 * RBF(1.0)),
DecisionTreeClassifier(max_depth=5),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
MLPClassifier(alpha=1),
AdaBoostClassifier(),
GaussianNB(),
SVC(kernel = 'linear'),
SVC(kernel = 'rbf'),
SVC(kernel = 'sigmoid')
]
# zip() returns a single-use iterator in Python 3, so materialize it as a list
# because we iterate over the models twice (cross-validation, then final evaluation)
models = list(zip(names, classifiers))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    # shuffle=True is required when passing random_state to KFold
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
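The printed means and standard deviations are easier to compare visually. Below is a minimal sketch (assuming matplotlib is available; it has not been imported elsewhere in this notebook) that box-plots the cross-validation scores collected above:
import matplotlib.pyplot as plt
# box plot of the 10-fold cross-validation accuracy for each algorithm
fig, ax = plt.subplots(figsize=(12, 6))
ax.boxplot(results, labels=names)
ax.set_title('Algorithm comparison (10-fold CV accuracy)')
ax.set_ylabel('Accuracy')
ax.tick_params(axis='x', rotation=45)
plt.show()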
# Make predictions on the validation dataset.
for name, model in models:
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(name)
    print(accuracy_score(y_test, predictions))
    print(classification_report(y_test, predictions))
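classification_report summarizes precision, recall, and F1 per class; a confusion matrix makes the raw error counts explicit. A short sketch, using the predictions left over from the last model in the loop above:
from sklearn.metrics import confusion_matrix
# rows = true classes, columns = predicted classes (for the last fitted model)
print(confusion_matrix(y_test, predictions))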
# Accuracy - ratio of correctly predicted observations to the total observations.
# Precision - ratio of correctly predicted positive observations to all predicted positive
# observations; it reflects how well the model avoids false positives.
# Recall (Sensitivity) - ratio of correctly predicted positive observations to all observations
# in the actual positive class; it reflects how well the model avoids false negatives.
# F1 score - the harmonic mean of precision and recall, so it takes both false positives
# and false negatives into account.
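As a concrete worked example of these definitions (the counts below are hypothetical, chosen purely for illustration):
# hypothetical confusion-matrix counts, for illustration only
tp, fp, fn, tn = 12, 2, 3, 10
accuracy = (tp + tn) / (tp + fp + fn + tn)          # 22/27 ~ 0.815
precision = tp / (tp + fp)                          # 12/14 ~ 0.857
recall = tp / (tp + fn)                             # 12/15 = 0.800
f1 = 2 * precision * recall / (precision + recall)  # ~ 0.828
print(accuracy, precision, recall, f1)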