data-preprocessing

Preprocessing code for the diabetes dataset.

Dataset links

Common

  1. Raw dataset files

  2. Merged dataset

  3. Diabetes classified dataset - v3

  4. Selected raw attributes v3 - 375 attributes

  5. Feature Engineered dataset - 273 attributes

  6. Datatype corrected and reduced dataset - 74 attributes

Diabetes prediction

  1. Diabetes prediction leaked removed - selection stage 1 - 240 attributes

  2. Diabetes prediction leaked removed - selection from reduced dataset - To Be Uploaded

Severity classification

  1. Example dataset goes here

Treatment recommendation

  1. Example dataset goes here

How to use the huggingface datasets above

Say the dataset URL is https://huggingface.co/datasets/user_name/dataset_name.

Pass the user_name/dataset_name part to load_dataset:

```python
import pandas as pd
from datasets import load_dataset

# The default split is "train"; if the dataset has other splits, use them as needed.
dataset = load_dataset("user_name/dataset_name", split="train")

# Use the dataset as-is for an Arrow dataset, or convert to pandas if needed.
df = dataset.to_pandas()
```

Preprocessing scripts

  1. merge_nhanes_files.py : Merges multiple NHANES files into a single dataset.
  2. parquet_to_csv.py : Converts Parquet files to CSV format for easier data handling.
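The exact merge logic lives in merge_nhanes_files.py, but NHANES component files share the respondent ID column SEQN, so the core idea can be sketched as a join on SEQN (the data values and non-key column names below are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical NHANES-style component tables; SEQN is the respondent ID
# shared across NHANES files.
demographics = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [45, 60, 52]})
glucose = pd.DataFrame({"SEQN": [1, 3], "LBXGLU": [99.0, 140.0]})

# An outer merge on SEQN keeps every respondent, leaving NaN where a
# component file has no record for that person.
merged = demographics.merge(glucose, on="SEQN", how="outer")
```

An outer join is shown here so that respondents missing from one component file are not dropped; the actual script may use a different join strategy.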

Environment Variables

The project uses environment variables to manage input and output directories as well as options for generating CSV files. You can set these variables in a .env file based on the provided .env.example.
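As a rough illustration only, a .env file for this kind of setup might look like the following; the variable names here are assumptions, so check .env.example for the real ones:

```
# Hypothetical example -- see .env.example in the repo for the actual variable names
INPUT_DIR=./data/raw
OUTPUT_DIR=./data/processed
GENERATE_CSV=true
```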

Prerequisites

  - Python and Poetry (dependencies are installed with Poetry in the setup steps below).

Setup

  1. Clone the repository to your local machine.

  2. Navigate to the project directory.

  3. Install the required dependencies using poetry:

    poetry install
  4. Create a .env file in the project root directory and set the necessary environment variables as shown in the .env.example file.

  5. Run the preprocessing scripts as needed.
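Once the .env file from step 4 is in place, the scripts can read its values at runtime. A minimal standard-library sketch (the variable names INPUT_DIR and OUTPUT_DIR are assumptions, not confirmed from the code):

```python
import os

def get_io_dirs():
    """Read input/output directories from the environment, with fallbacks.

    INPUT_DIR and OUTPUT_DIR are hypothetical names; the project's
    .env.example defines the real ones.
    """
    input_dir = os.getenv("INPUT_DIR", "./data/raw")
    output_dir = os.getenv("OUTPUT_DIR", "./data/processed")
    return input_dir, output_dir
```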

Usage Example

To merge NHANES files, run the following command from the project root directory:

```shell
poetry run python code/merge_nhanes_files.py
```

Archived Links

V1

  1. Initial Merged NHANES Dataset
  2. Diabetes classified training dataset
  3. Selected attributes - 386 attributes
  4. Selected raw attributes - 375 attributes
