A Quantitative Test for SARS-CoV-2 Infection

This is the repository for Zefeng Xu's DATA 1030 Final Project at DSI, Brown University.

This project adds a quantitative approach to the testing and diagnosis of COVID-19 based on gene expression data, complementing biological assays such as nucleic acid tests.
It frames diagnosis as a binary classification problem: predicting whether a patient is infected with COVID-19, using data from the Gene Expression Omnibus (GEO) database.

To reproduce the results, please read below.

Below is the structure of this repository.

├── data/
│   ├── preprocessed/
│   └── labels.csv
├── figures/
├── report/
├── results/
├── src/
├── .gitignore
├── LICENSE
├── README.md
└── requirements.yml

data/
This directory contains the raw target labels and the preprocessed feature matrix and label files.
The raw file containing all labels is labels.csv, at the root of data/.
Preprocessed X_train, X_test, y_train, and y_test for each of the 10 random states are stored in data/preprocessed.
The original feature matrix is too large to store as a single file, so it is not included here.
See the Data section below for how to download the raw feature matrix.

figures/
This directory contains all figures generated in this project.

report/
This directory contains the pdf version of the final report of this project.

results/
This directory stores all saved models, tuned parameters, train/test scores, and corresponding training and testing sets from cross validation.

src/
This directory contains source code for the midterm presentation, the final presentation, and the final report.
The code for the final report contains the most recent and most important results.

Data

The feature matrix file is publicly available on the GEO website under dataset ID GSE212041.
To download the feature matrix, follow the steps below.

Go to the GEO accession page for GSE212041.
Scroll to the bottom of the page to the Supplementary file section.
Download the file named GSE212041_Neutrophil_RNAseq_TPM_Matrix.txt.gz.
Unzip it to obtain a .txt file named TPM_S1G.txt.
This file contains 60640 rows and 782 columns and is the original feature matrix.
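
The unzip step above can also be done from Python. The following is a minimal sketch using only the standard library; the helper name gunzip is hypothetical and is not part of this repository, and the example call assumes the downloaded archive sits in the current working directory.

```python
import gzip
import shutil
from pathlib import Path


def gunzip(src, dst):
    """Decompress the gzip file at src into dst and return dst as a Path.

    Hypothetical helper for illustration; not part of this repository.
    """
    src, dst = Path(src), Path(dst)
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        # Copy in chunks so the large matrix is never held fully in memory.
        shutil.copyfileobj(f_in, f_out)
    return dst


# Example call (assumes the archive is in the current directory):
# gunzip("GSE212041_Neutrophil_RNAseq_TPM_Matrix.txt.gz", "TPM_S1G.txt")
```

The output name is chosen by the caller here, so it is set to TPM_S1G.txt to match the file name given in the steps.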

This project transposes the feature matrix so that the 60640 genes become the columns (features) and the samples become the rows.
Of the 782 columns in the original file, one is a 'Symbol' column that holds the gene symbol of each of the 60640 genes. It is removed for this project, which leaves 781 samples rather than 782.
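
As an illustration of the transposition described above, here is a minimal pandas sketch. It is not the code in src/. It assumes the file is tab-delimited and that every column other than 'Symbol' is a sample; the function name load_feature_matrix is hypothetical. The toy table stands in for the real file so the example runs without the download.

```python
import io

import pandas as pd


def load_feature_matrix(source, sep="\t", symbol_col="Symbol"):
    """Read a genes-by-samples table and return it as samples-by-genes.

    Hypothetical helper. Assumes a delimited text file in which every
    column except `symbol_col` is one sample.
    """
    raw = pd.read_csv(source, sep=sep)
    # Drop the gene-symbol annotation column so only sample columns remain.
    raw = raw.drop(columns=[symbol_col], errors="ignore")
    # Transpose: rows become samples, columns become genes.
    return raw.T


# Toy stand-in for the real file: 3 genes, 2 samples plus a Symbol column.
toy = io.StringIO(
    "Symbol\tS1\tS2\n"
    "GENE_A\t1.0\t2.0\n"
    "GENE_B\t3.0\t4.0\n"
    "GENE_C\t5.0\t6.0\n"
)
X = load_feature_matrix(toy)
print(X.shape)  # (2, 3): 2 samples as rows, 3 genes as columns
```

Under the same assumptions, calling load_feature_matrix("TPM_S1G.txt") on the real file would be expected to give 781 rows and 60640 columns.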

Dependency

The development environment used for this project is Anaconda with Python 3.10.12.
The major packages and their versions are listed below.
A requirements.yml file in the root directory is provided for ease of use.

numpy 1.26.2
scipy 1.11.4
scikit-learn 1.2.2
pandas 2.1.3
matplotlib 3.6.0
xgboost 2.0.2
scanpy 1.9.5
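
To confirm that an environment matches the versions listed above, a small standard-library check can help. This is a sketch, not part of the repository; the function name check_versions is hypothetical, and the EXPECTED table simply copies the list above.

```python
from importlib import metadata

# Versions copied from the list above.
EXPECTED = {
    "numpy": "1.26.2",
    "scipy": "1.11.4",
    "scikit-learn": "1.2.2",
    "pandas": "2.1.3",
    "matplotlib": "3.6.0",
    "xgboost": "2.0.2",
    "scanpy": "1.9.5",
}


def check_versions(expected):
    """Return {package: (expected, installed)} for every mismatch.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None
        if have != want:
            mismatches[name] = (want, have)
    return mismatches


for name, (want, have) in check_versions(EXPECTED).items():
    print(f"{name}: expected {want}, found {have or 'not installed'}")
```

An empty output means every listed package is installed at the listed version.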

Author

Zefeng Xu (zefeng_xu@brown.edu)

License

This project is licensed under the MIT License.
