Great job.
I have some questions for the authors.
I ran the code with GPT-4 as the program generator (N=1, gold) using the same parameter settings, but the macro-F1 on FEVEROUS is lower than the text-davinci-003 result presented on GitHub.
FEVEROUS with GPT-4: 91.05
FEVEROUS with text-davinci-003: 92.32 (presented on GitHub)
I find this result confusing.
Could you tell me whether the results reported in the paper and on GitHub were obtained on the full dataset or on a partially sampled subset?