Conversation

@larryliu0820 (Collaborator)

For the CUDA backend we want to keep most, if not all, computation on device. Assuming most ASR applications use greedy argmax sampling, we export and lower the sampling step as a method of the ExecuTorch model. This way the runner can choose to run it.

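As a minimal sketch of the idea (module name, shapes, and vocab size are illustrative, not the actual optimum-executorch code): the greedy sampler is just an argmax over logits wrapped in a module and traced with torch.export, so it can be lowered as an extra method of the same ExecuTorch program.

```python
import torch


class GreedySampler(torch.nn.Module):
    """Greedy (argmax) sampling over decoder logits, kept on device."""

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # logits: (batch, seq, vocab) -> next-token ids: (batch, seq)
        return torch.argmax(logits, dim=-1)


# Example inputs; the vocab size here is illustrative.
example_logits = (torch.randn(1, 1, 51865),)
sampler_ep = torch.export.export(GreedySampler(), example_logits)
# sampler_ep can then be lowered alongside the "encoder" and
# "text_decoder" programs into a single multi-method ExecuTorch model.
```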
@larryliu0820 larryliu0820 merged commit f8aa919 into main Jan 27, 2026
69 of 83 checks passed
@larryliu0820 larryliu0820 deleted the export_sampler branch January 27, 2026 00:25
larryliu0820 added a commit to pytorch/executorch that referenced this pull request Jan 27, 2026
With this PR (huggingface/optimum-executorch#207) we add a new method, "sampler", to ASR models, alongside "encoder" and "text_decoder". The flow becomes: if the temperature is 0 and the sampler method is available, run that method; otherwise, fall back to the old path. This should significantly improve performance on CUDA, since logits no longer have to be copied from device to CPU for sampling.

Benchmark result on RTX 5080:

```

======================================================================
BENCHMARK SUMMARY
======================================================================
Total runs: 30
Generated tokens per run: 104

THROUGHPUT (tokens/sec):
  Min:    793.89 t/s
  Max:    845.53 t/s
  Mean:   820.35 t/s
  Stdev:  11.86 t/s

MODEL LOAD TIME (ms):
  Min:    620 ms
  Max:    2170 ms
  Mean:   700 ms
  Stdev:  279 ms

ENCODE TIME (ms, inference_start to prompt_eval_end):
  Min:    36 ms
  Max:    38 ms
  Mean:   37 ms
  Stdev:  1 ms

DECODE TIME (ms, prompt_eval_end to inference_end):
  Min:    123 ms
  Max:    131 ms
  Mean:   127 ms
  Stdev:  2 ms

======================================================================
```
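For context, a rough sketch of the runner-side dispatch described above (the real runner is C++; the module handle, `run_method` call, and `has_sampler_method` flag are assumptions used here purely for illustration):

```python
import torch


def select_next_token(model, logits, temperature, has_sampler_method):
    """Pick the next token, preferring the on-device "sampler" method."""
    if temperature == 0.0 and has_sampler_method:
        # New path: argmax runs on device; only one token id crosses to host.
        token = model.run_method("sampler", (logits,))[0]
        return int(token.item())
    # Old path: copy the full logits tensor to CPU and sample there.
    probs = torch.softmax(logits.float().cpu() / max(temperature, 1e-6), dim=-1)
    return int(torch.multinomial(probs[0, -1], num_samples=1).item())
```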