
Prediction of DNA-Protein Interaction using CNN and LSTM

Last Updated on January 10, 2024 by Editorial Team

Author(s): Rakesh M K

Originally published on Towards AI.

Photo by National Cancer Institute on Unsplash

Table of contents

1. Introduction
2. About Data
3. Data Preprocessing
4. Model Building
5. Performance Analysis
6. Conclusion
7. References

Introduction

To begin with, let’s keep in mind that there exists a binding relationship between DNA and proteins. DNA-binding proteins are proteins that have DNA-binding domains and thus have a specific or general affinity for single- or double-stranded DNA [1]. Setting the genomics aside, the aim of this work is to develop a deep neural network that can predict whether a DNA sequence can bind to a protein or not. This makes it a binary classification problem, and I intend to show how a deep neural network composed of CNN and LSTM layers can solve it effectively.

About Data

The DNA sequences and labels are downloaded from https://github.com/abidlabs/deep-learning-genomics-primer/blob/master/sequences.txt and https://github.com/abidlabs/deep-learning-genomics-primer/blob/master/labels.txt respectively, both of which are publicly accessible. Let’s import the files and have a look at the data.

import pandas as pd

'''read the sequence and label files into single-column DataFrames'''
df1 = pd.read_csv('sequences.txt', sep=' ', header=None, names=['Sequence'])
df2 = pd.read_csv('labels.txt', sep=' ', header=None, names=['Label'])

'''combine them side by side into one DataFrame'''
df3 = pd.concat([df1, df2], axis=1)
df3.head(8)
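If the two files have not been downloaded manually, pandas can also read them straight from GitHub. This is a minimal sketch, assuming the standard raw-file mirror of the blob URLs linked above:

'''raw-file mirrors of the GitHub blob URLs above (assumed raw.githubusercontent.com mapping)'''
seqURL = 'https://raw.githubusercontent.com/abidlabs/deep-learning-genomics-primer/master/sequences.txt'
labelURL = 'https://raw.githubusercontent.com/abidlabs/deep-learning-genomics-primer/master/labels.txt'

df1 = pd.read_csv(seqURL, header=None, names=['Sequence'])
df2 = pd.read_csv(labelURL, header=None, names=['Label'])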

DNA consists of four bases: adenine (A), cytosine (C), guanine (G), and thymine (T). DNA sequencing is the laboratory process of determining the order of these four bases in a DNA molecule [1]. There are 2000 records, each of length 200, and the data is reasonably balanced, as a quick check confirms.
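A quick sanity check confirms the record count, sequence length, and class balance described above (a minimal check, not part of the original code):

'''record count, sequence length, and class balance'''
print(len(df3))                            # 2000 records
print(df3['Sequence'].str.len().unique())  # every sequence is 200 bases long
print(df3['Label'].value_counts())         # roughly equal counts of the two classes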

Data Preprocessing

The text sequences need to be converted into numbers before being fed to the model. One-hot encoding is applied to the sequences, since neural networks generally work well with this representation. The encoding is done as shown below.

import numpy as np

def one_hot_encode(sequence):
    '''convert a DNA string into a (length, 4) one-hot array'''
    nucleotide_to_index = {'A': 0, 'T': 1, 'G': 2, 'C': 3}
    one_hot_encoded = []

    for nucleotide in sequence:
        one_hot_vector = [0] * 4
        one_hot_vector[nucleotide_to_index[nucleotide]] = 1
        one_hot_encoded.append(one_hot_vector)

    return np.array(one_hot_encoded)

'''apply the encoder to every sequence and store the result in a new column'''
df3['one-hot'] = df3['Sequence'].apply(one_hot_encode)
df3['one-hot']

0 [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 1, 0], [1,...
1 [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0,...
2 [[0, 0, 1, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0,...
3 [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0,...
4 [[0, 0, 1, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0,...
...
1995 [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0,...
1996 [[0, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0,...
1997 [[1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0,...
1998 [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1,...
1999 [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0,...
Name: one-hot, Length: 2000, dtype: object

Model Building

A combination of Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) layers forms the core of the model. Conv1D layers are widely used in problems that involve sequential data: because the same filters slide along the sequence, they capture local patterns regardless of where those patterns occur. Since these layers have relatively few parameters, they are also computationally efficient.
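To make the parameter-efficiency claim concrete, the exact Conv1D configuration used in the model below can be inspected in isolation. This is a small sketch; the 4 input channels come from the one-hot encoding:

'''the same 7 filters slide along the whole sequence, so the weight count
is independent of the 200-step sequence length'''
from tensorflow.keras import layers

conv = layers.Conv1D(filters=7, kernel_size=5)
conv.build(input_shape=(None, 200, 4))  # (batch, timesteps, one-hot channels)
print(conv.count_params())  # (5 kernel * 4 channels + 1 bias) * 7 filters = 147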

LSTM is a Recurrent Neural Network designed to overcome the vanishing gradient problem. The three gates in the cell (the input, output, and forget gates) work together to decide what information is retained and propagated along the sequence.
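For reference, the standard LSTM update equations make each gate’s role explicit, where σ is the sigmoid function, x_t the input, h_t the hidden state, and c_t the cell state:

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)   (forget gate)
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)   (input gate)
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)   (output gate)
c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)
h_t = o_t \odot \tanh(c_t)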

The data is split into train and test sets as shown below.

from sklearn.model_selection import train_test_split

'''split the data'''
trainSeq, testSeq, trainLabels, testLabels = train_test_split(df3['one-hot'].to_numpy(),
                                                              df3['Label'].to_numpy(),
                                                              test_size=0.2,
                                                              random_state=101)

'''stack the per-sequence (200, 4) one-hot arrays into 3D numpy arrays,
since the neural layers expect data in 3D (samples, timesteps, channels)'''
X_train = np.stack(trainSeq)  # (1600, 200, 4)
X_test = np.stack(testSeq)    # (400, 200, 4)

'''check shape of train and test data'''
X_train.shape, X_test.shape
((1600, 200, 4), (400, 200, 4))

The model is configured in the TensorFlow framework using Keras, as the code below shows.

import tensorflow as tf
from tensorflow.keras import layers

tf.random.set_seed(101)

'''input layer matching the (timesteps, channels) shape of the one-hot data'''
inputs = layers.Input(shape=(200, 4))

'''Conv1D captures local motifs; the LSTM then models the whole sequence'''
x = layers.Conv1D(filters=7, kernel_size=5)(inputs)
x = layers.LSTM(units=64, return_sequences=True)(x)
x = layers.MaxPooling1D(pool_size=2, strides=2)(x)
x = layers.Flatten()(x)
x = tf.keras.layers.Masking(mask_value=0)(x)
x = layers.Dense(64, activation='relu')(x)
x = layers.Dropout(0.1)(x)

'''single sigmoid unit for binary classification'''
outputs = layers.Dense(1, activation='sigmoid')(x)

modelLSTM = tf.keras.Model(inputs, outputs, name='modelLSTM')

tf.keras.layers.Masking(mask_value=0) masks every timestep whose features are all equal to 0. The masked timesteps are ignored by downstream layers during training, which makes the computation more efficient.

The model is trained for 20 epochs; a sketch of the training step is shown below. A validation accuracy of 0.99 is quite impressive.
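The training call itself is not reproduced here, but a minimal compile-and-fit sketch consistent with this setup (binary cross-entropy loss, 20 epochs, the test split used for validation) would look as follows; the Adam optimizer and the batch size of 32 are assumptions:

'''compile for binary classification (Adam and batch_size=32 are assumed, not stated)'''
modelLSTM.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

history = modelLSTM.fit(X_train, trainLabels,
                        validation_data=(X_test, testLabels),
                        epochs=20,
                        batch_size=32)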

The accuracy and loss curves recorded during training are shown below.

Accuracy Curve
Loss Curve
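The two curves can be reproduced from the history object returned by fit(); this is a minimal matplotlib sketch, assuming the history variable from the training sketch above:

import matplotlib.pyplot as plt

'''plot training vs. validation accuracy and loss side by side'''
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(history.history['accuracy'], label='train')
ax1.plot(history.history['val_accuracy'], label='validation')
ax1.set_title('Accuracy Curve')
ax1.legend()
ax2.plot(history.history['loss'], label='train')
ax2.plot(history.history['val_loss'], label='validation')
ax2.set_title('Loss Curve')
ax2.legend()
plt.show()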

Performance Analysis

Let’s evaluate the model with test data and have a look into the classification report as well as the confusion matrix.

'''evaluate the model on test data: evaluate() returns [loss, accuracy]'''
modelLSTM.evaluate(X_test, testLabels)

[0.055308908224105835, 0.9900000095367432]
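The report and confusion matrix below can be generated from the model’s thresholded predictions; a minimal scikit-learn sketch (the 0.5 decision threshold is an assumption):

from sklearn.metrics import classification_report, confusion_matrix

'''threshold the sigmoid outputs at 0.5 to get hard class predictions'''
predLabels = (modelLSTM.predict(X_test) > 0.5).astype(int).ravel()

print(classification_report(testLabels, predLabels))
print(confusion_matrix(testLabels, predLabels))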

Classification report:

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99       195
         1.0       1.00      0.98      0.99       205

    accuracy                           0.99       400
   macro avg       0.99      0.99      0.99       400
weighted avg       0.99      0.99      0.99       400

Confusion matrix:

Confusion matrix

From the above results, the model’s performance is highly impressive, with only a negligible number of false negatives.

Conclusion

Classical machine-learning models such as SVM, KNN, and Random Forest were tried before moving to a neural network; the best of these reached an accuracy of 0.83 with SVM. The deep model outperformed them all, with an accuracy and F1-score of 0.99.
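For comparison, the classical baselines can be approximated by flattening each one-hot matrix into a single feature vector; this is a minimal SVM sketch, where the flattening step and the default RBF kernel are assumptions, since the baseline code is not shown in this article:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

'''flatten each (200, 4) one-hot matrix into an 800-dimensional vector'''
XtrainFlat = X_train.reshape(len(X_train), -1)
XtestFlat = X_test.reshape(len(X_test), -1)

svm = SVC()  # default RBF kernel
svm.fit(XtrainFlat, trainLabels)
print(accuracy_score(testLabels, svm.predict(XtestFlat)))  # ~0.83 reported above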

References

  1. Apply Machine Learning Algorithms for Genomics Data Classification | by Ernest Bonat, Ph.D. | MLearning.ai | Medium
  2. Convolutional Neural Network (CNN) | TensorFlow Core

