GunnerForever/1030_project

A Quantitative Test for SARS-CoV-2 Infection

This is the repository for Zefeng Xu's DATA 1030 Final Project at DSI, Brown University.

This project aims to add a quantitative approach to the testing and diagnosis of COVID-19 using gene expression data. Existing approaches include biological assays such as nucleic acid tests.
The project treats diagnosis as a binary classification problem: predicting whether a patient is infected with COVID-19, using data from the Gene Expression Omnibus (GEO) database.

To reproduce the results, follow the sections below.

The repository is structured as follows.

├── data/
│   ├── preprocessed/
│   └── labels.csv
├── figures/
├── report/
├── results/
├── src/
├── .gitignore
├── LICENSE
├── README.md
└── requirements.yml

data/
This directory contains the raw label file and the preprocessed feature matrices and target labels.
The raw file containing all labels is labels.csv, at the root of data/.
The preprocessed X_train, X_test, y_train, and y_test for each of the 10 random states are stored in data/preprocessed.
The original feature matrix is too large to store as a single file, so it is not included here.
See the Data section below for how to download the raw feature matrix.
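
The README does not say how the ten sets of splits were produced. As a rough sketch only, one common way to get one train/test split per random state is scikit-learn's train_test_split. The random states 0 to 9, the 20% test fraction, the stratification, and the synthetic data below are all assumptions made for illustration, not the project's actual settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 samples, 5 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = np.array([0, 1] * 10)

# Hypothetical choices: states 0..9, a 20% test fraction, stratified labels.
splits = {}
for state in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=state
    )
    splits[state] = (X_train, X_test, y_train, y_test)

print(len(splits), splits[0][0].shape, splits[0][1].shape)
```

Fixing the random state makes each split reproducible, and repeating the split over several states gives a sense of how much the scores vary with the partition.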

figures/
This directory contains all figures generated in this project.

report/
This directory contains the PDF version of the project's final report.

results/
This directory stores all saved models, tuned parameters, train/test scores, and corresponding training and testing sets from cross validation.

src/
This directory contains the source code for the midterm presentation, the final presentation, and the final report.
The code for the final report contains the most recent and important results.

Data

The feature matrix file is publicly available on the GEO website under dataset ID GSE212041.
To download the feature matrix, follow the steps below.

  1. Go to the GEO accession page for GSE212041.
  2. Scroll to the bottom of the page to find the Supplementary file section.
  3. Download the file named GSE212041_Neutrophil_RNAseq_TPM_Matrix.txt.gz.
  4. Unzip it to obtain a .txt file named TPM_S1G.txt.
  5. This file is the original feature matrix, with 60640 rows and 782 columns.
  • The feature matrix is transposed in this project, so the 60640 genes become the columns and the samples become the rows.
  • Only 781 of the 782 columns are samples. The remaining column, 'Symbol', holds the gene symbols of the 60640 genes and is removed for this project.
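
The steps described above (drop the 'Symbol' column, then transpose) can be sketched with pandas as follows. This is a minimal illustration, not the project's own loading code. It assumes the file is tab-delimited with a header row, and the function name load_feature_matrix and its parameters are hypothetical. If the real file carries a separate gene ID column, pass its position as index_col.

```python
import io

import pandas as pd


def load_feature_matrix(source, symbol_col="Symbol", sep="\t", index_col=None):
    """Read a genes-by-samples TPM table and return it as samples-by-genes.

    Assumption (not confirmed by the README): the file is tab-delimited
    and has a header row naming the 'Symbol' column and each sample.
    """
    raw = pd.read_csv(source, sep=sep, index_col=index_col)
    # Drop the gene-symbol annotation column so only sample columns remain.
    raw = raw.drop(columns=[symbol_col], errors="ignore")
    # Transpose: rows become samples, columns become genes.
    return raw.T


# Tiny stand-in for TPM_S1G.txt: 3 genes, a Symbol column, and 2 samples.
demo = io.StringIO(
    "Symbol\tsample_1\tsample_2\n"
    "A1\t0.5\t1.5\n"
    "B2\t2.0\t0.0\n"
    "C3\t3.5\t4.0\n"
)
X_demo = load_feature_matrix(demo)
print(X_demo.shape)
```

On the real file, a call such as load_feature_matrix("TPM_S1G.txt") should return 781 rows and 60640 columns if the counts given above hold.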

Dependency

The development environment used for this project is Anaconda with Python 3.10.12.
The major packages and their versions are listed below.
A requirements.yml file in the root directory is provided for convenience.

  • numpy 1.26.2
  • scipy 1.11.4
  • scikit-learn 1.2.2
  • pandas 2.1.3
  • matplotlib 3.6.0
  • xgboost 2.0.2
  • scanpy 1.9.5
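
To confirm that a local environment matches the versions listed above, a small check along these lines may help. It is a sketch that uses only the standard library's importlib.metadata. The helper name check_versions is hypothetical and not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in this README.
PINNED = {
    "numpy": "1.26.2",
    "scipy": "1.11.4",
    "scikit-learn": "1.2.2",
    "pandas": "2.1.3",
    "matplotlib": "3.6.0",
    "xgboost": "2.0.2",
    "scanpy": "1.9.5",
}


def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in check_versions(PINNED).items():
    print(f"{name}: expected {expected}, found {installed}")
```

If nothing is printed, every listed package is installed at the listed version.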

Author

Zefeng Xu (zefeng_xu@brown.edu)

License

This project is licensed under the MIT License.
