Skip to content
This repository was archived by the owner on Sep 18, 2025. It is now read-only.

Postprocessor Improvement#75

Open
lh5844 wants to merge 14 commits intofacebookresearch:mainfrom
lh5844:postprocess
Open

Postprocessor Improvement#75
lh5844 wants to merge 14 commits intofacebookresearch:mainfrom
lh5844:postprocess

Conversation

@lh5844
Copy link
Contributor

@lh5844 lh5844 commented Sep 5, 2023

This addresses the issue of the sentence piece model not correcting when two words should be together. For example, if there were two text segments with the first one ending in "with" and the second one beginning with "out", the model would identify it as two different words. However, we want the two to be together as "without", and this would involve correcting the prediction list, delays list, and elapsed list for latency accuracy.


To run the spm_detokenizer_agent.py, use this command in the SimulEval directory:

simuleval  \ 
    --user-dir examples  \
    --agent-class examples.quick_start.spm_detokenizer_agent.DummyPipeline \
    --source examples/quick_start/spm_source.txt \
    --target examples/quick_start/spm_target.txt  \
    --output tmp_output \
     --segment-k 3  \
    --sentencepiece-model examples/quick_start/tokenizer.model \
    --detokenize-only 

This is the expected output for

  • instances.log
{"index": 0, "prediction": "Let's do it without hesitation.", "delays": [3, 6, 6, 9, 9], "elapsed": [0, 0, 0, 0, 0],
"prediction_length": 5, "reference": "Let's do it without hesitation.\n", "source": "\u2581Let ' s \u2581do 
\u2581it \u2581with out \u2581hesitation .", "source_length": 9} 
  • metrics.tsv
LAAL	AL	AP	DAL
3.3	3.3	0.733	3.96
  • scores.tsv
BLEU	LAAL	AL	AP	DAL	ATD
100.0	3.3	3.3	0.733	3.96	5.0

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Sep 5, 2023
@lh5844 lh5844 requested review from annasun28 and xutaima September 5, 2023 15:10
Copy link
Contributor

@annasun28 annasun28 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this makes sense, thanks Lucy! Could you add the test command you used and the outputs produced to your commit summary?

second_half = prediction_list[0]
complete_word = first_half + second_half
self.prediction_list.pop()
self.delays.pop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also pop the last from self.elapsed? It's similar to self.delays except that it also includes the actual inference time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup! I have made those changes now

@lh5844 lh5844 removed the request for review from xutaima September 6, 2023 18:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants