Release of the exact indices for the 150k core set used in the paper #5

Hi, thanks for the great work on EpiCoder!

I am trying to reproduce the pipeline described in the paper. cluster/main.py provides the method for extracting the core set from The Stack v2 with kCenterGreedy, but running it locally produces core sets whose data quality varies with the initial pool and the environment.
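To illustrate why the selection is sensitive to its starting conditions, here is a minimal farthest-first (k-center greedy) sketch in pure Python. It is not the code from cluster/main.py; the function name, the 2-D toy points, and the `first_index` parameter are assumptions made only for this illustration.

```python
import math
import random


def k_center_greedy(points, k, first_index):
    """Pick k indices; each new pick is the point farthest from all picks so far."""
    selected = [first_index]
    # Distance from every point to its nearest selected center.
    min_dist = [math.dist(p, points[first_index]) for p in points]
    while len(selected) < k:
        # Farthest-first step: largest distance to the nearest center wins.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist = [min(d, math.dist(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return selected


# Toy pool standing in for file embeddings (hypothetical data).
rng = random.Random(0)
pool = [(rng.random(), rng.random()) for _ in range(200)]

run_a = k_center_greedy(pool, k=10, first_index=0)
run_b = k_center_greedy(pool, k=10, first_index=1)

# Same pool, different starting point: the selections need not match.
print("shared indices:", len(set(run_a) & set(run_b)), "of", len(run_a))
```

Each pick depends on all earlier picks, so a different first center, a different pool, or a different embedding can change every later choice. A fixed list of indices would remove that source of variation.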

Request

To ensure a fair reproduction and to better study the impact of the feature tree evolution, could you please release:

The exact list of IDs or indices (from The Stack v2) for the 150k Python files used as the seed data.

Or, if possible, upload the core_set_150k.jsonl to Hugging Face (similar to the final EpiCoder-func-380k).

This would help the community strictly align the "Start Point" of the pipeline with the paper's setting.
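As a rough sketch of how such a released ID list could be used, the snippet below filters a local pool down to the listed files and computes an order-independent fingerprint that two people could compare to confirm they start from the same core set. The `blob_id` key, the helper names, and the toy records are assumptions for illustration; the actual file format would be whatever the authors choose to publish.

```python
import hashlib


def fingerprint(ids):
    """Order-independent SHA-256 digest of a collection of string IDs."""
    digest = hashlib.sha256()
    for item in sorted(set(ids)):
        digest.update(item.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def select_by_ids(records, ids, id_key="blob_id"):
    """Keep only the records whose ID appears in the released list."""
    wanted = set(ids)
    return [r for r in records if r[id_key] in wanted]


# Hypothetical local pool and released ID list.
local_pool = [
    {"blob_id": "a1", "content": "print(1)"},
    {"blob_id": "b2", "content": "print(2)"},
    {"blob_id": "c3", "content": "print(3)"},
]
released_ids = ["c3", "a1"]

core_set = select_by_ids(local_pool, released_ids)
print([r["blob_id"] for r in core_set])
print(fingerprint(r["blob_id"] for r in core_set) == fingerprint(released_ids))
```

If the two fingerprints match, the local core set contains exactly the released IDs, regardless of the order in which the files were loaded.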

Additional Question

Regarding the evaluation: did you experiment with fine-tuning the base model directly on these 150k raw code files (without synthesis)? It would be interesting to see a "Raw Data SFT" baseline against "EpiCoder Synthesis SFT" to quantify the gain from the feature evolution process.

Thanks!
