This project aims to add a quantitative approach to the testing and diagnosis of COVID-19 using gene expression data. Other approaches include biological analyses such as Nucleic Acid Tests.
It addresses a binary classification problem: predicting whether a patient is infected with COVID-19, using data from the Gene Expression Omnibus (GEO) database.
To reproduce the results, please read below.
Below is the structure of this repository.
├── data/
│   ├── preprocessed/
│   └── labels.csv
├── figures/
├── report/
├── results/
├── src/
├── .gitignore
├── LICENSE
├── README.md
└── requirements.yml
data/
This directory contains both raw and preprocessed feature matrix and target label files.
The raw file containing all labels is labels.csv, at the root of data/.
The preprocessed X_train, X_test, y_train, and y_test for each of the 10 random states are stored in data/preprocessed.
The file containing the original feature matrix is too large to store as a single file, so it is not included here.
See the Data section below for how to download the raw feature matrix.
figures/
This directory contains all figures generated in this project.
report/
This directory contains the pdf version of the final report of this project.
results/
This directory stores all saved models, tuned parameters, train/test scores, and corresponding training and testing sets from cross validation.
src/
This directory contains source code for both the midterm and final presentations, and the final report.
The latter one contains the most recent and important results.
The feature matrix file is publicly available on the GEO website under dataset ID GSE212041.
To download the feature matrix, follow the steps below.
- Go to the GEO page for dataset GSE212041.
- Scroll down to the bottom of the page to find the Supplementary file section.
- Download the one with file name GSE212041_Neutrophil_RNAseq_TPM_Matrix.txt.gz.
- Unzip it and you will find a .txt file with name TPM_S1G.txt.
- This file is the original feature matrix, with 60640 rows (genes) and 782 columns.
- One of the 782 columns is a 'Symbol' column containing the gene symbol of each of the 60640 genes. It is removed for this project, leaving 781 samples rather than 782.
- The matrix is transposed in this project, so that the samples are the rows and the 60640 genes are the columns.
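The loading steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the code in src/: the function name `load_feature_matrix` is hypothetical, and the sketch assumes the file is tab-delimited with a header row, which the description above does not state. A tiny in-memory table stands in for TPM_S1G.txt so the example runs without the download.

```python
import io

import pandas as pd


def load_feature_matrix(source, sep="\t", symbol_col="Symbol"):
    """Read a genes-by-samples table and return it as samples-by-genes.

    `source` may be a file path (e.g. the unzipped TPM_S1G.txt) or a
    file-like object. The tab delimiter is an assumption.
    """
    raw = pd.read_csv(source, sep=sep)
    # Drop the gene-symbol column so only per-sample values remain.
    values = raw.drop(columns=[symbol_col], errors="ignore")
    # Transpose: rows become samples, columns become genes.
    return values.T


# Small stand-in for the real file: 3 genes, 2 samples plus 'Symbol'.
demo = io.StringIO(
    "Symbol\tS1\tS2\n"
    "GENE_A\t1.0\t2.0\n"
    "GENE_B\t3.0\t4.0\n"
    "GENE_C\t5.0\t6.0\n"
)
X_demo = load_feature_matrix(demo)
print(X_demo.shape)
```

If the real file matches these assumptions, calling `load_feature_matrix("TPM_S1G.txt")` should give a frame with 781 rows and 60640 columns, in line with the counts listed above.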
The development environment used for this project is Anaconda with Python 3.10.12.
A requirements.yml file is provided in the root directory for ease of use.
The major packages and their versions are listed below.
- numpy 1.26.2
- scipy 1.11.4
- scikit-learn 1.2.2
- pandas 2.1.3
- matplotlib 3.6.0
- xgboost 2.0.2
- scanpy 1.9.5
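After setting up the environment, one way to confirm that the installed packages match the versions listed above is a short check like the one below. It is only a sketch using the standard library's `importlib.metadata`; the helper name `find_mismatches` is hypothetical and is not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the list above.
EXPECTED = {
    "numpy": "1.26.2",
    "scipy": "1.11.4",
    "scikit-learn": "1.2.2",
    "pandas": "2.1.3",
    "matplotlib": "3.6.0",
    "xgboost": "2.0.2",
    "scanpy": "1.9.5",
}


def find_mismatches(expected, get_version=version):
    """Return {package: (expected, installed)} for every package that differs.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means every listed package is installed at the stated version. Small version differences may still work, but the results were produced with the versions above.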
Zefeng Xu (zefeng_xu@brown.edu)
This project is licensed under the MIT License.