
I can't reproduce the same results in the paper #3

@DSincerity

Description

Hi,
Like #2, I also tried to reproduce the FED paper's results using the released FED data (http://shikib.com/fed_data.json), but I couldn't obtain the same results as reported in the paper.

  1. Average scores of annotators.
    Applying the data processing described in the paper, I could reproduce similar results only for the dialog-level evaluation, not the turn-level one. How can I reproduce the turn-level results?
  • Avg. scores in the paper: [image]

  • Avg. scores in the FED data: [image]
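For reference, this is how I aggregate the annotator scores (a minimal sketch; the field names are my reading of fed_data.json — I assume turn-level entries carry a "response" key, dialog-level entries do not, and "annotations" maps each quality to a list of per-annotator scores):

```python
import json
from statistics import mean

def average_scores(data, quality, turn_level):
    """Average the per-annotator scores for one quality over all
    turn-level or dialog-level entries in the loaded FED data."""
    per_item = []
    for item in data:
        is_turn = "response" in item  # assumed convention in fed_data.json
        if is_turn != turn_level:
            continue
        scores = item.get("annotations", {}).get(quality)
        if scores:
            # first average the annotators for this item
            per_item.append(mean(scores))
    # then average across items
    return mean(per_item)

# Usage with the released data:
# data = json.load(open("fed_data.json"))
# print(average_scores(data, "Interesting", turn_level=True))
```

If the paper averaged across all raw scores rather than per-item means first, the numbers would differ slightly; that alone might explain part of the gap.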

  2. Correlation between follow-up utterance (FU) scores and annotators' average scores.
    I also calculated the correlation between FU scores and the averaged human scores. I obtained the FU scores with the DialoGPT (large) model, following the README (i.e., preprocessing the inputs and using the FED module).
    However, the correlations were totally different from those in the paper. Were the FU scores in the paper calculated in the same way as in this repository? How can I reproduce the reported correlations?
  • Correlation in the paper: [image]

  • Reproduced correlation: [image]

  • (Dialog-level) FU scores and annotator evaluations that I obtained:
    Calcuated_results.zip
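In case the discrepancy comes from the correlation metric itself: the paper reports Spearman correlation, which I compute as the Pearson correlation of rank vectors. A self-contained sketch (no SciPy dependency; `x` and `y` stand for the per-dialog FU scores and averaged annotator scores):

```python
def ranks(xs):
    """1-based average ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

If the paper used Pearson instead of Spearman (or vice versa), the two can disagree substantially, so it would help to know which one the reported tables use.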
