Results on GPT-4 are lower than the results presented in the paper? #2

@hustcxx

Great job.
I have some questions for the authors.

  1. I ran the code with GPT-4 as the program generator (N=1, gold) using the same parameter settings, but the macro-F1 on FEVEROUS is lower than the text-davinci-003 result presented in the GitHub repository:
    FEVEROUS with GPT-4: 91.05
    FEVEROUS with text-davinci-003: 92.32 (presented in the GitHub repository)
    I find this result confusing.
  2. Were the results reported in the paper and in the GitHub repository computed on the full dataset or on a partially sampled dataset?
