Release of the exact indices for the 150k core set used in the paper #5

@wuduher

Hi, thanks for the great work on EpiCoder!

I am trying to reproduce the pipeline described in the paper. cluster/main.py provides the method for extracting the core set from The Stack v2 with kCenterGreedy, but running it locally gives results whose data quality varies with the initial pool and the environment.
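To illustrate where the variation can come from, here is a minimal sketch of greedy k-center selection. It is not the code in cluster/main.py: the function name `k_center_greedy`, the `seed` parameter, and the use of Euclidean distance over precomputed embeddings are all assumptions made for illustration. The point is only that the result depends on the pool passed in and on the first center chosen.

```python
import numpy as np


def k_center_greedy(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Illustrative greedy k-center selection (not the repository's implementation).

    Starts from one randomly chosen point, then repeatedly adds the point
    that is farthest from its nearest already-selected center.
    """
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]

    # The first center is random, so the seed influences the whole selection.
    first = int(rng.integers(n))
    selected = [first]

    # min_dist[i] is the distance from point i to its nearest selected center.
    min_dist = np.linalg.norm(embeddings - embeddings[first], axis=1)

    while len(selected) < min(k, n):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)

    return selected
```

With the same pool and seed this returns the same indices, but a different pool, seed, or embedding model can give a different selection, which is why a released index list would help.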

Request: To ensure fair reproduction and to better study the impact of the feature tree evolution, could you please release one of the following:

The exact list of IDs or indices (from The Stack v2) for the 150k Python files used as the seed data.

Or, if possible, the core_set_150k.jsonl file itself, uploaded to Hugging Face (similar to the final EpiCoder-func-380k).
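To make the ask concrete, even something as small as the following would be enough. This is only a sketch under my own assumptions: the helper names `write_id_list` and `file_sha256`, the `"id"` field name, and the output path are hypothetical and do not reflect how the core set is actually stored.

```python
import hashlib
import json


def write_id_list(ids, path):
    """Write one JSON object per line, e.g. {"id": "..."} (field name is hypothetical)."""
    with open(path, "w", encoding="utf-8") as f:
        for item_id in ids:
            f.write(json.dumps({"id": item_id}) + "\n")


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Publishing the digest next to the file would let anyone confirm they are starting from exactly the same seed set.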

This would help the community strictly align the "Start Point" of the pipeline with the paper's setting.

Additional question regarding the evaluation: Did you experiment with fine-tuning the base model directly on these 150k raw code files (without synthesis)? It would be interesting to see a baseline of "Raw Data SFT" vs. "EpiCoder Synthesis SFT" to quantify the gain from the feature evolution process.

Thanks!
