Conversation

dorian-K (Contributor)

No description provided.


dorian-K (Contributor, Author) commented Dec 19, 2025

Todo:

  • tracers that use energy and att_weights are broken now, and there is no straightforward fix: even if we automatically fall back to the RETURNN implementation once we detect a tracer, that implementation lives in the backend class, not in rf.dot_attention, so current tracer implementations don't find those variables. As far as I can tell this is only used in attention-weight analyses and in a test in test_rf_attention.py, so the impact is small, but it is still annoying. Plan: add an internal flag to fall back to the vanilla implementation, duplicate some of the tests in test_rf_attention to cover both the new and the vanilla implementation, and test att_weights / energy only for the vanilla implementation
  • write tests to verify that torch scaled_dot_product_attention produces the same result as the RETURNN fallback implementation
  • convert the existing building blocks in attention to use the efficient is_causal=True parameter
  • maybe rename scaled_dot_product_attention to just dot_attention
  • there have been significant changes to torch scaled_dot_product_attention between version 2.0.0 and now; verify that all of them are compatible with the current implementation
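The equivalence check in the second item could look roughly like the sketch below: a plain softmax(QK^T / sqrt(d))V reference is compared against `torch.nn.functional.scaled_dot_product_attention`, both without a mask and with the fused `is_causal=True` path. This is a minimal standalone sketch assuming torch >= 2.0; `vanilla_dot_attention` is a hypothetical reference helper, not RETURNN's actual fallback implementation.

```python
import math
import torch

def vanilla_dot_attention(q, k, v, is_causal=False):
    # Hypothetical reference implementation: softmax(q k^T / sqrt(d)) v.
    # The intermediate "energy" and "att_weights" tensors exist explicitly here,
    # which is what tracer-based attention-weight analyses rely on.
    d = q.shape[-1]
    energy = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)
    if is_causal:
        t_q, t_k = q.shape[-2], k.shape[-2]
        mask = torch.ones(t_q, t_k, dtype=torch.bool).tril()
        energy = energy.masked_fill(~mask, float("-inf"))
    att_weights = torch.softmax(energy, dim=-1)
    return torch.matmul(att_weights, v)

torch.manual_seed(0)
# (batch, heads, time, head_dim)
q, k, v = (torch.randn(2, 4, 10, 16) for _ in range(3))

# non-causal: fused kernel vs. reference
out_ref = vanilla_dot_attention(q, k, v)
out_sdpa = torch.nn.functional.scaled_dot_product_attention(q, k, v)
assert torch.allclose(out_ref, out_sdpa, atol=1e-4)

# causal: fused is_causal=True vs. explicit lower-triangular mask in the reference
out_ref_c = vanilla_dot_attention(q, k, v, is_causal=True)
out_sdpa_c = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
assert torch.allclose(out_ref_c, out_sdpa_c, atol=1e-4)
```

The real tests would additionally need to cover masked (variable-length) sequences and the different dtypes/backends that the fused kernel can dispatch to.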

albertz (Member) commented Dec 19, 2025

tracers that use energy and att_weights are broken now, and there isn't a straightforward way to fix because even if we automatically fall back to the returnn impl once we detect a tracer, that implementation is in the backend class and not in rf.dot_attention so current implementations don't find those variables. As far as I can tell this is only used in attention weight analyses and a test in test_rf_attention.py, so not huge impact but still annoying

I think it is to be expected that using such tracers can never be reliable and stable. We don't guarantee that, and we don't need to guarantee that. So this is not really an issue.

We should still keep the existing tests. For that, there should be a flag (maybe only internal flag) to disable this and fall back to the current vanilla implementation.
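Such a flag could be as simple as a module-level switch that tests flip to exercise the vanilla path. A minimal sketch, where the flag name, function names, and dispatch structure are all assumptions for illustration rather than RETURNN's actual API:

```python
import math
import torch

# Hypothetical internal flag (name is an assumption): tests and tracer-based
# analyses set it to True to force the vanilla implementation, where the
# intermediate energy / att_weights tensors still exist as named variables.
_use_vanilla_dot_attention = False

def _vanilla_dot_attention(q, k, v):
    # explicit computation; tracers can observe energy and att_weights here
    energy = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.shape[-1])
    att_weights = torch.softmax(energy, dim=-1)
    return torch.matmul(att_weights, v)

def dot_attention(q, k, v):
    # dispatch: fused kernel by default, vanilla path when the flag is set
    if _use_vanilla_dot_attention:
        return _vanilla_dot_attention(q, k, v)
    return torch.nn.functional.scaled_dot_product_attention(q, k, v)

torch.manual_seed(1)
q, k, v = (torch.randn(2, 4, 8, 16) for _ in range(3))
out_fused = dot_attention(q, k, v)
_use_vanilla_dot_attention = True
out_vanilla = dot_attention(q, k, v)
# both paths should agree numerically, so the duplicated tests stay meaningful
assert torch.allclose(out_fused, out_vanilla, atol=1e-4)
```

With this shape, the existing tests in test_rf_attention can be parametrized over the flag, and the att_weights / energy checks run only with the flag set.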
