Gene/Protein Sequences, RNN

Simulating protein sequences using Recurrent Neural Networks

Generating a random protein sequence is easy; all you need is the list of amino acid residues and a random number generator to select one amino acid at a time and stitch them together. Or, you can use a built-in function to sample (with replacement) the desired number of residues from the list of amino acids, all at once. Generating a protein sequence becomes a real challenge when you have a set of protein sequences (let’s say a bunch of protein kinase C sequences from several organisms) and would like to simulate new sequences that are very similar to them, yet not exactly the same.
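To make the "easy" case concrete, here is a minimal sketch of both approaches using only Python's standard library. The function names and the fixed seed are my own choices for illustration, and the alphabet is assumed to be the 20 standard amino acids in one-letter code. Note that `random.choices` requires Python 3.6 or later.

```python
import random

# The 20 standard amino acids, as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_protein_one_at_a_time(length, rng):
    """Pick one residue at a time and stitch them together."""
    residues = []
    for _ in range(length):
        residues.append(rng.choice(AMINO_ACIDS))
    return "".join(residues)

def random_protein_all_at_once(length, rng):
    """Sample all residues in a single call, with replacement."""
    return "".join(rng.choices(AMINO_ACIDS, k=length))

rng = random.Random(0)  # fixed seed so that runs are repeatable
seq_a = random_protein_one_at_a_time(50, rng)
seq_b = random_protein_all_at_once(50, rng)
print(seq_a)
print(seq_b)
```

Both functions draw every residue uniformly and independently, which is exactly why the output looks nothing like a real protein family.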

One very exciting way to do so is with artificial neural networks or, more specifically, Recurrent Neural Networks (RNNs). RNNs are machine learning models designed for sequential data, and they have been used successfully for text-related tasks (Natural Language Processing; NLP). An RNN can, for example, be trained to classify comments left by internet users below a blog post as negative or positive. It can be used to predict the next word in a sentence. In a more complex application, it can be trained to generate a caption for a photo. RNNs are unreasonably effective at many NLP tasks. See this amazing post to learn exactly how unreasonably effective they are!

So, I wrote a small, crude Python script using Keras with the TensorFlow backend to generate new protein sequences based on a provided set of real protein sequences in FASTA format. The code was written and tested on Python 3.5.2, TensorFlow 1.0.1, and Keras 2.0.6. I tried to comment the code extensively wherever it seemed confusing or complicated. The RNN model is a character-level Long Short-Term Memory (LSTM) network. As a character-level model (as opposed to a word-level model), it treats each amino acid residue in the protein sequences as one input. LSTM is one particular type of recurrent network; the Gated Recurrent Unit (GRU) is another. For an intuitive explanation of the LSTM and GRU types of RNN, read this post.
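To illustrate what "character-level" means in practice, here is a rough sketch of how FASTA records could be turned into training examples: a fixed-size window of residues as the input and the residue that follows as the target, with each residue one-hot encoded. This is not the code linked from this post. The helper names, the window size of 5, and the two toy records are all made up for illustration, and only the standard library is used.

```python
def read_fasta(lines):
    """Return the list of sequences found in an iterable of FASTA lines."""
    sequences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record; store the previous one.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences

def make_training_pairs(sequences, window):
    """Slide a window over each sequence; the target is the next residue."""
    pairs = []
    for seq in sequences:
        for i in range(len(seq) - window):
            pairs.append((seq[i:i + window], seq[i + window]))
    return pairs

def one_hot(text, char_to_index):
    """Encode each character as a 0/1 vector with a single 1."""
    vectors = []
    for ch in text:
        vec = [0] * len(char_to_index)
        vec[char_to_index[ch]] = 1
        vectors.append(vec)
    return vectors

# Two toy records (hypothetical, not taken from the NCBI cluster).
example_fasta = """>seq1 hypothetical example record
MKTAYIAK
QRQISFVK
>seq2 hypothetical example record
MKTAYLAK
""".splitlines()

sequences = read_fasta(example_fasta)
alphabet = sorted(set("".join(sequences)))
char_to_index = {ch: i for i, ch in enumerate(alphabet)}

pairs = make_training_pairs(sequences, window=5)
first_window, first_target = pairs[0]
encoded = one_hot(first_window, char_to_index)
print(first_window, "->", first_target)
```

A network trained on such pairs learns to predict the next residue from the preceding ones, and a new sequence can then be generated one residue at a time by feeding each prediction back in as input.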

As I said above, the code is crude and not optimized by any means. No hyperparameter optimization was performed, and the code is not designed or tuned to work on, or to generate, any particular type of protein. The FASTA file provided is just a random protein cluster downloaded from NCBI for training purposes. Hyperparameter tuning is an essential, case-specific part of implementing any neural network model, and it should be done carefully, considering the size and type of the data and the questions we are asking of it.

Finally, here is the code. Enjoy!

Uncategorized

Programming Languages & Machine Learning Platforms

R and Python are two of the most sought-after programming languages for big data analysis among both research scientists and data scientists. They are free, open-source, user-friendly and come with a plethora of extremely helpful libraries to perform almost any task. Both are also supported by tons of documentation, tutorials, example code, and a strong developer community. To read more on their differences, visit this page.

When it comes to machine learning and, more specifically, deep learning, however, Python seems to be the clear winner due to a larger and considerably more reputable selection of specialized packages: TensorFlow, Theano, Caffe, Keras, Lasagne, and scikit-learn, to name a few. R is no slouch by any means, but it is definitely lagging behind Python in this department.

I personally use Python 3 with scikit-learn, TensorFlow and/or Keras on a Linux machine equipped with an NVIDIA GeForce GTX 1080 Ti GPU. There are two main ways to set up a machine learning platform with Python & co. on a Linux machine:

  • From scratch, by installing vanilla Python and adding the required packages; aka, the hard way!
  • Getting a preconfigured, ready-to-use Python ecosystem, such as Anaconda.

Not only does Anaconda come with most of the popular data science packages pre-installed, it also makes it much easier to find and install extra packages that are missing from the build, such as TensorFlow. To conclude, I recommend installing Anaconda + TensorFlow + Keras on a Linux machine, and I wish you all an amazing adventure!
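Whichever route you take, a quick sanity check is to ask Python whether the packages can be found. The sketch below uses only the standard library; the package list is my own assumption about what you would want to check (`sklearn` is the import name of scikit-learn), so adjust it to your setup.

```python
import importlib.util

# Import names of the packages to look for (adjust as needed).
PACKAGES = ["sklearn", "tensorflow", "keras"]

def installed(name):
    """Return True if a top-level package with this name can be located."""
    return importlib.util.find_spec(name) is not None

status = {name: installed(name) for name in PACKAGES}
for name, ok in status.items():
    print(name, "found" if ok else "missing")
```

This only checks that each package can be located, not that it imports cleanly or that TensorFlow can see the GPU.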