# Preprocessing codes for diabetes dataset

Datasets:

- Diabetes prediction leaked removed - selection stage 1 - 240 attributes
- Diabetes prediction leaked removed - selection from reduced dataset - To Be Uploaded
Say the dataset URL is https://huggingface.co/datasets/user_name/dataset_name; then use `user_name/dataset_name` as the dataset identifier when loading it:
```python
import pandas as pd
from datasets import load_dataset

# The default split is "train"; if the dataset has other splits, use them as necessary.
dataset = load_dataset("user_name/dataset_name", split="train")

# Use the dataset as-is as an Arrow dataset, or convert to pandas if needed.
df = dataset.to_pandas()
```

Scripts:

- `merge_nhanes_files.py`: Merges multiple NHANES files into a single dataset.
- `parquet_to_csv.py`: Converts Parquet files to CSV format for easier data handling.
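The merge step performed by `merge_nhanes_files.py` can be sketched as an outer join of the individual files on a shared respondent identifier. This is a minimal illustration, not the script itself: the join column name `SEQN` (the NHANES respondent sequence number) and the outer-join strategy are assumptions.

```python
import pandas as pd
from functools import reduce

def merge_nhanes_files(frames):
    """Outer-join a list of DataFrames on the shared respondent ID.

    Assumption: every NHANES table carries the SEQN identifier.
    """
    return reduce(
        lambda left, right: pd.merge(left, right, on="SEQN", how="outer"),
        frames,
    )

# Example with two toy tables sharing SEQN values:
demo = merge_nhanes_files([
    pd.DataFrame({"SEQN": [1, 2], "BMI": [25.0, 30.0]}),
    pd.DataFrame({"SEQN": [2, 3], "GLU": [90, 110]}),
])
```

An outer join keeps respondents that appear in only some of the files, filling the missing measurements with NaN.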
The project uses environment variables to manage input and output directories, as well as options for generating CSV files. You can set these variables in a `.env` file based on the provided `.env.example`.
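As a sketch of how a script might read those variables (the variable names below are hypothetical; see `.env.example` for the actual ones, and note that python-dotenv is commonly used to load a `.env` file into the environment, while this sketch sticks to the standard library):

```python
import os

# Hypothetical variable names with fallback defaults.
INPUT_DIR = os.environ.get("INPUT_DIR", "data/input")
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "data/output")
# Option flags arrive as strings and must be parsed explicitly.
GENERATE_CSV = os.environ.get("GENERATE_CSV", "false").lower() == "true"
```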
- Python 3.13 or higher
- Ensure you have poetry installed for dependency management.
1. Clone the repository to your local machine.
2. Navigate to the project directory.
3. Install the required dependencies using poetry:

   ```shell
   poetry install
   ```

4. Create a `.env` file in the project root directory and set the necessary environment variables as shown in the `.env.example` file.
5. Run the preprocessing scripts as needed.
To merge NHANES files, run the following command from the project root directory:

```shell
poetry run python code/merge_nhanes_files.py
```