Dear Dr. Reuben,
I am part of the same reproduction team as @whalekeykeeper. It is a pleasure to make your acquaintance, and thank you for your support of our project.
As a team member, I am specifically trying to recreate the evaluation script from the original paper. You describe two models as necessary for this step: the main model, which generates the captions, and a second model, trained on different data, which informs a listener that identifies the target image of a caption from among a group of distractors.
In the paper, you write as follows:
We train our production and evaluation models on separate sets consisting of regions in the Visual Genome dataset (Krishna et al., 2017) and full images in MSCOCO (Chen et al., 2015).
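To check that I have understood the setup, here is a minimal sketch (in Python) of the evaluation loop I have in mind; `score_image_given_caption` is a placeholder of my own for whatever the listener model actually computes, not a function from your code.

```python
from typing import Callable, List, Sequence


def listener_accuracy(
    items: Sequence[dict],
    score_image_given_caption: Callable[[object, str], float],
) -> float:
    """Fraction of items where the listener ranks the target image highest.

    Each item holds the generated `caption`, the `target` image, and its
    `distractors` (the other images in the cluster).
    """
    correct = 0
    for item in items:
        candidates: List = [item["target"]] + list(item["distractors"])
        scores = [score_image_given_caption(img, item["caption"]) for img in candidates]
        # The listener "identifies" the target if the target (index 0) gets the top score.
        correct += int(max(range(len(scores)), key=scores.__getitem__) == 0)
    return correct / len(items)
```

Please correct me if this misrepresents how the evaluation model is meant to be used.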
You have graciously included pretrained parameters for the encoder-decoder, more than one set in fact. However, I cannot tell whether the two sets of parameters, coco-[encoder/decoder] and vg-[encoder/decoder], correspond to these two models. If they do not, would you say it is still fine to use one pair for generating captions and the other pair for evaluation?
Since both TS1 and TS2 (as defined in Section 4.1 of the paper) are constructed from Visual Genome, it feels intuitively right to use vg-[encoder/decoder] for caption generation and coco-[encoder/decoder] for evaluation. Unfortunately, we do not have the compute at this time to train new models from scratch, so being able to use the provided pretrained parameters would be a huge help.
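To be concrete, this is roughly how I would wire up the pretrained pairs under that guess. I am assuming the provided parameter files are ordinary PyTorch checkpoints, and the file names below are placeholders for however the vg-/coco- parameters are actually named in the repository.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Production (speaker) pair: would be used only to generate the captions.
speaker_encoder_state = torch.load("vg-encoder", map_location=device)
speaker_decoder_state = torch.load("vg-decoder", map_location=device)

# Evaluation (listener) pair: would be used only to score captions against the
# target image and its distractors, never to generate text.
listener_encoder_state = torch.load("coco-encoder", map_location=device)
listener_decoder_state = torch.load("coco-decoder", map_location=device)
```

If the mapping should be the other way around, or if the two provided pairs are not the two models from the paper at all, I would be very grateful for a correction.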
I thank you again for your time and support.