Description
Hi, thanks for the great work on EpiCoder!
I am trying to reproduce the pipeline described in the paper. While cluster/main.py implements the core-set extraction from The Stack v2 using kCenterGreedy, running it locally leads to variations in data quality depending on the initial pool and environment.
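To illustrate the source of this variation, here is a minimal, self-contained sketch of a k-center-greedy selection (not the repository's actual implementation; the random starting center is one assumption about where the nondeterminism enters):

```python
import numpy as np

def k_center_greedy(points: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Greedy k-center sketch: repeatedly pick the point farthest from
    the current centers. The randomly chosen first center means two runs
    with different seeds (or different initial pools) can select
    different core sets."""
    rng = np.random.default_rng(seed)
    n = len(points)
    first = int(rng.integers(n))  # random start -> run-to-run variation
    centers = [first]
    # Distance of every point to its nearest selected center so far.
    dist = np.linalg.norm(points - points[first], axis=1)
    while len(centers) < k:
        nxt = int(dist.argmax())  # farthest point from all current centers
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers

# Two seeds over the same embedding pool can yield different selections.
pts = np.random.default_rng(42).normal(size=(200, 16))
core_a = k_center_greedy(pts, 10, seed=0)
core_b = k_center_greedy(pts, 10, seed=1)
print(len(core_a), len(core_b))
```

Even when the embedding pool is fixed, the seed alone can change which files end up in the core set, which is why a released ID list would pin things down.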
Request
To ensure fair reproduction and to better study the impact of the feature-tree evolution, could you please release either:
- the exact list of IDs or indices (from The Stack v2) for the 150k Python files used as the seed data, or
- if possible, the core_set_150k.jsonl uploaded to Hugging Face (similar to the final EpiCoder-func-380k)?
This would help the community strictly align the "Start Point" of the pipeline with the paper's setting.
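For concreteness, here is a sketch of how the community could consume such a release; the file name core_set_150k.jsonl comes from the request above, but the per-line schema (a "blob_id" field identifying each Stack v2 file) is purely a hypothetical assumption:

```python
import json
import os
import tempfile

# Hypothetical release format: one JSON object per line, each carrying a
# Stack v2 identifier for a file in the 150k seed set.
sample_rows = [{"blob_id": f"id_{i}", "path": f"file_{i}.py"} for i in range(3)]

# Stand-in for a downloaded core_set_150k.jsonl.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    for row in sample_rows:
        f.write(json.dumps(row) + "\n")
    jsonl_path = f.name

# A consumer would load the IDs and filter their local Stack v2 copy
# down to exactly this set, aligning the pipeline's start point.
with open(jsonl_path) as f:
    seed_ids = {json.loads(line)["blob_id"] for line in f}

os.remove(jsonl_path)
print(sorted(seed_ids))
```

With a fixed ID set like this, the nondeterministic selection step is bypassed entirely and everyone starts from the same files.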
Additional Question
Regarding the evaluation: did you experiment with fine-tuning the base model directly on these 150k raw code files (without synthesis)? It would be interesting to see a "Raw Data SFT" vs. "EpiCoder Synthesis SFT" baseline to quantify the gain from the feature evolution process.
Thanks!