SWE-Smith-DataScience fine-tunes an open-source large language model on data science problems extracted from a portion of the SWE-Smith dataset. It provides a modular pipeline for that fine-tuning and evaluates the resulting models with the SWE-agent framework.
- Model Fine-Tuning: Leverage finetune.py to train on data subsets.
- Automated Evaluation: Use the SWE-agent module to benchmark performance.
- Reproducible Workflows: Modular scripts for data processing and training.
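As a rough illustration of what training on a data subset might involve, the sketch below filters task records down to those that come from data science repositories. The `repo` field name, the `DATA_SCIENCE_KEYWORDS` list, and the `select_data_science_tasks` helper are assumptions made for this example. They are not taken from finetune.py and have not been checked against the SWE-Smith schema.

```python
from typing import Dict, Iterable, List

# Hypothetical keyword list; adjust it to the repositories you care about.
DATA_SCIENCE_KEYWORDS = ("pandas", "numpy", "scikit-learn", "matplotlib")


def select_data_science_tasks(
    records: Iterable[Dict[str, str]],
    keywords: Iterable[str] = DATA_SCIENCE_KEYWORDS,
) -> List[Dict[str, str]]:
    """Keep records whose (assumed) `repo` field mentions any keyword."""
    lowered = [k.lower() for k in keywords]
    selected = []
    for record in records:
        # Records without a `repo` field are treated as non-matching.
        repo = record.get("repo", "").lower()
        if any(k in repo for k in lowered):
            selected.append(record)
    return selected


if __name__ == "__main__":
    sample = [
        {"repo": "pandas-dev/pandas", "problem_statement": "..."},
        {"repo": "pallets/flask", "problem_statement": "..."},
    ]
    print([r["repo"] for r in select_data_science_tasks(sample)])
```

Substring matching on a repository name is deliberately simple. A real pipeline would likely need a curated repository list or richer metadata to decide what counts as a data science task.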
- Python 3.8 or higher
- Git
- (Optional) Modal CLI for scalable remote training
- Required Python packages:
pip install torch transformers datasets modal
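To confirm that the prerequisites are in place before training, a small check along these lines may help. It is a minimal sketch: `python_is_supported` and `missing_packages` are illustrative helper names, not part of the repository, and the package names are simply those from the pip command above.

```python
import importlib.util
import sys
from typing import Iterable, List, Tuple

# Import names for the packages in the pip command above.
# modal is only needed for the optional remote-training path.
REQUIRED_PACKAGES = ("torch", "transformers", "datasets", "modal")


def python_is_supported(min_version: Tuple[int, int] = (3, 8)) -> bool:
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= min_version


def missing_packages(names: Iterable[str] = REQUIRED_PACKAGES) -> List[str]:
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python version OK:", python_is_supported())
    print("Missing packages:", missing_packages())
```

`importlib.util.find_spec` only locates a package without importing it, so the check stays fast even for heavy dependencies such as torch.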
git clone https://github.com/HugoGoHe/SWE-Smith-DataScience.git
cd SWE-Smith-DataScience

SWE-Smith-DataScience/
├── SWE-agent/ # Evaluation agent and benchmarking scripts
├── finetune.py # Fine-tuning script for the model
├── requirements.txt # Python dependencies (optional)
├── LICENSE # MIT License
└── README.md # Project documentation
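After cloning, the layout above can be sanity-checked with a short script. This is only a sketch: `missing_entries` is a hypothetical helper, and the expected entries are copied from the tree, with requirements.txt left out because it is marked optional.

```python
from pathlib import Path
from typing import Iterable, List

# Entries taken from the tree above; requirements.txt is optional there.
EXPECTED_ENTRIES = ("SWE-agent", "finetune.py", "LICENSE", "README.md")


def missing_entries(
    repo_root: str, expected: Iterable[str] = EXPECTED_ENTRIES
) -> List[str]:
    """Return the expected files or directories absent from repo_root."""
    root = Path(repo_root)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    # Run from the repository root; an empty list means nothing is missing.
    print(missing_entries("."))
```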
