Export sampler for Seq2Seq models #207
Merged
Conversation
For the CUDA backend we want to keep most, if not all, computation on device. Assuming most ASR applications do greedy argmax sampling, we export and lower the sampling step as a method of the ExecuTorch model. This way the runner can choose to run it.
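A minimal sketch of the idea: greedy sampling is just an argmax over the vocabulary dimension, so it can be wrapped in a tiny module and exported as an extra method of the program. The module name, the `"sampler"` method name, the example vocab size, and the generic `to_edge` lowering below are illustrative assumptions, not the PR's exact code (the real CUDA lowering recipe differs).

```python
import torch
from torch.export import export

from executorch.exir import to_edge


class GreedySampler(torch.nn.Module):
    """Greedy (argmax) sampling over the vocabulary dimension."""

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # [batch, seq, vocab] logits -> [batch, seq] token ids
        return torch.argmax(logits, dim=-1)


# Export with example inputs; the vocab size (Whisper's 51865) is
# illustrative. The exported program is bundled as a "sampler" method
# next to "encoder" and "text_decoder" (naming mirrors the PR text).
sampler_ep = export(GreedySampler(), (torch.randn(1, 1, 51865),))
et_program = to_edge({"sampler": sampler_ep}).to_executorch()
```

Keeping the argmax inside the program is what lets the runner execute it on the same device as the decoder instead of round-tripping logits through host memory.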
mergennachin approved these changes · Jan 27, 2026
larryliu0820 added a commit to pytorch/executorch that referenced this pull request · Jan 27, 2026
With this PR (huggingface/optimum-executorch#207) we add a new "sampler" method to ASR models, alongside "encoder" and "text_decoder". The flow becomes: if temperature is 0 and the sampler method is available, run that method; otherwise fall back to the old path. This change should significantly improve performance on CUDA, since we no longer have to copy logits from device to CPU for sampling. Benchmark result on an RTX 5080:
```
======================================================================
                         BENCHMARK SUMMARY
======================================================================
Total runs:               30
Generated tokens per run: 104

THROUGHPUT (tokens/sec):
  Min:   793.89 t/s
  Max:   845.53 t/s
  Mean:  820.35 t/s
  Stdev:  11.86 t/s

MODEL LOAD TIME (ms):
  Min:   620 ms
  Max:  2170 ms
  Mean:  700 ms
  Stdev: 279 ms

ENCODE TIME (ms, inference_start to prompt_eval_end):
  Min:  36 ms
  Max:  38 ms
  Mean: 37 ms
  Stdev: 1 ms

DECODE TIME (ms, prompt_eval_end to inference_end):
  Min:  123 ms
  Max:  131 ms
  Mean: 127 ms
  Stdev:   2 ms
======================================================================
```
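On the runner side, the dispatch described in the commit message boils down to the logic below. This is a Python sketch of the flow (the actual runner is C++); the `sampler` callable stands in for the exported "sampler" method and, like the helper itself, is an assumption of this example rather than a real API.

```python
import torch


def pick_next_token(logits: torch.Tensor, temperature: float, sampler=None):
    """Choose the next token id from decoder logits.

    `sampler`, when given, is a callable wrapping the program's exported
    "sampler" method (an illustrative stand-in, not the real runner API).
    """
    if temperature == 0.0 and sampler is not None:
        # New path: argmax executes inside the .pte program, so the
        # logits never leave the CUDA device.
        return sampler(logits)
    # Old path: copy the last position's logits to host memory.
    last = logits[:, -1, :].float().cpu()
    if temperature == 0.0:
        return torch.argmax(last, dim=-1)
    probs = torch.softmax(last / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

The `.cpu()` call in the fallback path is the per-token device-to-host transfer that the new path eliminates.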