This is an implementation of DNA gene family classification. To run the code:

    python train_and_eval.py

Please install the required packages in requirements.txt:

    pip install -r requirements.txt

The data preprocessing is implemented in data_process.py.
Please see DNADataset class for details.
The DNA sequences are processed as follows:
- They are sliced into subsequences of at most 512 nucleotides.
- Optionally, before slicing and padding, they are augmented with their reverse complements.
- The subsequences are tokenized into 5-mers and then one-hot encoded.
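
The steps above can be sketched roughly as follows. This is only an illustration and not the code in data_process.py: the function names are hypothetical, and the 5-mer stride (overlapping, stride 1), the vocabulary ordering, and the handling of unknown characters are assumptions. Padding is left out because the padding scheme is not described here.

```python
from itertools import product

# Translation table for complementing nucleotides (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Vocabulary of all 4**5 = 1024 possible 5-mers over A, C, G, T.
# The ordering is an assumption made for this sketch.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=5))}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def slice_sequence(seq, max_len=512):
    """Cut a sequence into consecutive chunks of at most max_len nucleotides."""
    return [seq[i:i + max_len] for i in range(0, len(seq), max_len)]


def kmer_tokens(seq, k=5):
    """Split a sequence into overlapping k-mers (stride 1 is an assumption)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def one_hot(tokens, vocab=VOCAB):
    """Encode each token as a one-hot row; unknown tokens stay all-zero."""
    rows = []
    for tok in tokens:
        row = [0] * len(vocab)
        if tok in vocab:
            row[vocab[tok]] = 1
        rows.append(row)
    return rows


def preprocess(seq, augment=False):
    """Optionally augment, then slice, tokenize and one-hot encode."""
    seqs = [seq, reverse_complement(seq)] if augment else [seq]
    chunks = [c for s in seqs for c in slice_sequence(s)]
    return [one_hot(kmer_tokens(c)) for c in chunks]
```

With `augment=True`, each input sequence contributes twice as many subsequences, since the reverse complement is sliced and encoded in the same way as the original.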
A Convolutional Neural Network (CNN) is implemented in model.py.
The model is evaluated with 5-fold cross-validation: the input data is split into 5 folds, each containing 20% of the sequences. The folds are iterated over 5 times, so that each fold serves as the test set exactly once while the remaining 80% is used for training. In each iteration, the training portion is further split into 80% training and 20% validation.
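
A minimal sketch of this splitting scheme is given below. It is not taken from train_and_eval.py: the function name `cross_validation_splits` and its parameters are hypothetical, and the sketch assumes a plain shuffled split without stratification by class. With these proportions, each iteration uses 64% of the data for training, 16% for validation and 20% for testing.

```python
import random


def cross_validation_splits(n_samples, n_folds=5, val_fraction=0.2, seed=0):
    """Yield (train, val, test) index lists, one triple per fold."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal indices round-robin so that fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        # Everything outside the current test fold is available for training.
        rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        rng.shuffle(rest)
        # Hold out a validation subset from the training portion.
        n_val = int(len(rest) * val_fraction)
        yield rest[n_val:], rest[:n_val], test


if __name__ == "__main__":
    for train, val, test in cross_validation_splits(100):
        print(len(train), len(val), len(test))
```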
The final test score is computed over the pooled predictions for all sequences, where each sequence's prediction comes from the iteration in which it was part of the test set. We report the accuracy and F1 score of the model on the whole dataset.
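
The pooling of test predictions across folds could look like the sketch below. The helper names are hypothetical, and the F1 averaging mode is not stated here, so macro averaging is assumed purely for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def pooled_scores(labels, fold_predictions):
    """Combine per-fold test predictions and score them once, on all data.

    fold_predictions is an iterable of (test_indices, predictions) pairs,
    one pair per fold; every index is expected to appear exactly once.
    """
    pooled = [None] * len(labels)
    for test_indices, predictions in fold_predictions:
        for idx, pred in zip(test_indices, predictions):
            pooled[idx] = pred
    return accuracy(labels, pooled), macro_f1(labels, pooled)
```

Scoring the pooled predictions once differs from averaging per-fold scores: every sequence contributes equally, regardless of which fold it landed in.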