Results on GPT-4 are lower than the results presented in the paper?

Great job.
I have some questions for the authors.
1. I ran the code with GPT-4 as the program generator (N=1, gold), using the same parameter settings, but the macro-F1 on FEVEROUS is lower than the text-davinci-003 result reported in the GitHub repository.
FEVEROUS with GPT-4: 91.05
FEVEROUS with text-davinci-003: 92.32 (reported in the GitHub repository)
This result is confusing, since I expected GPT-4 to score at least as high.
2. Are the results reported in the paper and in the GitHub repository computed on the full dataset or on a sampled subset?