
Recreating evaluation segment of the original paper #4

@waron97


Dear Dr. Reuben,

I am part of the same reproduction team as @whalekeykeeper. It's a pleasure to make your acquaintance, and thank you for supporting our project.

As part of this team, I am specifically attempting to recreate your evaluation script from the original paper. You describe two models as necessary for this part: the main model for generating captions, and a separate one, trained on different data, which informs a second listener that identifies the target image of a caption among a group of distractors.

In the paper, you write as follows:

We train our production and evaluation models on separate sets consisting of regions in the Visual Genome dataset (Krishna et al., 2017) and full images in MSCOCO (Chen et al., 2015).

You have graciously included pretrained parameters for the encoder-decoder, more than one set in fact. However, I cannot tell whether the two sets of parameters, coco-[encoder/decoder] and vg-[encoder/decoder], correspond to these two models. If they do not, would you say it is still fine to use one pair for generating captions and the other pair for $L_{eval}$?

Since both TS1 and TS2 (as defined in Section 4.1 of the paper) are constructed from Visual Genome, it feels intuitively right to use vg-[encoder/decoder] for $L_{eval}$ and coco-[encoder/decoder] for caption generation. Unfortunately, we do not have the compute at this time to train new models from scratch, so being able to use the provided pretrained parameters would be a huge help.
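For concreteness, this is a minimal sketch of how we currently plan to wire the two checkpoint pairs on our side, assuming the file names below roughly match the released coco-/vg- parameters; the paths and the `load_pair` helper are placeholders of ours, not an API from your repository.

```python
# Minimal sketch of our intended setup; the checkpoint file names are
# placeholders and may not match the actual released files.
import torch

# coco pair -> production model (caption generation)
SPEAKER_CKPTS = {"encoder": "coco-encoder.pth", "decoder": "coco-decoder.pth"}
# vg pair -> evaluation model informing the second listener L_eval
L_EVAL_CKPTS = {"encoder": "vg-encoder.pth", "decoder": "vg-decoder.pth"}


def load_pair(ckpts, device="cpu"):
    """Load the raw encoder/decoder state dicts for one model pair."""
    return {name: torch.load(path, map_location=device) for name, path in ckpts.items()}


speaker_states = load_pair(SPEAKER_CKPTS)  # used to generate captions for TS1/TS2
l_eval_states = load_pair(L_EVAL_CKPTS)    # used to pick the target among distractors
```

Please let us know if this assignment of the two pairs is backwards or otherwise off.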

I thank you again for your time and support.
