vmoens (Collaborator) commented Feb 3, 2026

Summary

  • Optimizes data collection pipeline by using lazy stacks in collectors when a replay buffer is present
  • Enables single-write operations directly to storage via torch.stack with out= parameter
  • Reduces memory operations from 2 writes to 1 write when using collectors with replay buffers

Before

  1. Collector: torch.stack(tensordicts, out=_final_rollout) → Write 1
  2. Storage: storage[cursor] = data → Write 2

After

  1. Collector: LazyStackedTensorDict.lazy_stack(tensordicts) → No write (lazy)
  2. Storage: torch.stack(lazy.unbind(), out=storage[cursor]) → Single write
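
The difference between the two paths can be modeled without any tensor library. The sketch below is a toy in plain Python, not TorchRL code: `CountingBuffer`, `ToyLazyStack`, `eager_path`, and `lazy_path` are hypothetical names introduced only for illustration, and a "write" here means one per-step copy into a buffer.

```python
from typing import Any, List, Sequence


class CountingBuffer:
    """Fixed-size buffer that counts how many slots have been written."""

    def __init__(self, size: int) -> None:
        self.data: List[Any] = [None] * size
        self.writes = 0

    def write(self, index: int, value: Any) -> None:
        self.data[index] = value
        self.writes += 1


class ToyLazyStack:
    """Keeps references to the per-step items; nothing is copied here."""

    def __init__(self, items: Sequence[Any]) -> None:
        self.items = list(items)  # a list of references, not a data copy

    def unbind(self) -> List[Any]:
        return self.items


def eager_path(steps: Sequence[Any], storage: CountingBuffer, cursor: int) -> int:
    """Before: stack into an intermediate rollout, then copy into storage."""
    rollout = CountingBuffer(len(steps))
    for i, step in enumerate(steps):  # Write 1: materialize the rollout
        rollout.write(i, step)
    for i, value in enumerate(rollout.data):  # Write 2: copy into storage
        storage.write(cursor + i, value)
    return rollout.writes + storage.writes


def lazy_path(steps: Sequence[Any], storage: CountingBuffer, cursor: int) -> int:
    """After: keep the steps lazy and write them once, directly into storage."""
    lazy = ToyLazyStack(steps)  # no write
    for i, value in enumerate(lazy.unbind()):  # single write
        storage.write(cursor + i, value)
    return storage.writes


steps = ["step0", "step1", "step2"]
eager_storage = CountingBuffer(8)
lazy_storage = CountingBuffer(8)
print("eager writes:", eager_path(steps, eager_storage, cursor=2))
print("lazy writes:", lazy_path(steps, lazy_storage, cursor=2))
```

Both paths leave the storage in the same state; only the number of copies differs.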

Changes

  • Storage (storages.py): Modified TensorStorage.set() to detect LazyStackedTensorDict input and use torch.stack(..., out=storage[cursor]) instead of assignment
  • Collector (_single.py): Modified Collector.rollout() to use lazy stack when replay buffer is present with extend_buffer=True
  • Tests: Added test_extend_lazystack_direct_write and test_collector_with_rb_uses_lazy_stack
  • Benchmarks: Added test_single_with_rb and test_single_with_rb_pixels to compare performance
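
As a rough illustration of how the storage and collector changes fit together, the sketch below shows a collector-side decision (return a lazy stack only when a replay buffer is present and `extend_buffer` is true) and a storage-side type check. It is a simplified stand-in under stated assumptions: `ToyLazyStack`, `ToyStorage`, and `toy_rollout` are hypothetical and do not reproduce the actual `TensorStorage.set()` or `Collector.rollout()` code.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional, Sequence, Union


@dataclass
class ToyLazyStack:
    """Stand-in for a lazy stack: holds the per-step items without copying."""

    items: List[Any]


@dataclass
class ToyStorage:
    """Stand-in for a storage whose set() dispatches on the input type."""

    slots: List[Any]
    path_taken: List[str] = field(default_factory=list)

    def set(self, cursor: int, data: Union[ToyLazyStack, List[Any]]) -> None:
        if isinstance(data, ToyLazyStack):
            # Lazy input: write each item straight into its slot.
            for offset, item in enumerate(data.items):
                self.slots[cursor + offset] = item
            self.path_taken.append("direct")
        else:
            # Already-materialized input: plain assignment.
            self.slots[cursor:cursor + len(data)] = data
            self.path_taken.append("assign")


def toy_rollout(
    steps: Sequence[Any],
    replay_buffer: Optional[ToyStorage] = None,
    extend_buffer: bool = True,
) -> Union[ToyLazyStack, List[Any]]:
    """Return a lazy stack only when the result will be written to a buffer."""
    if replay_buffer is not None and extend_buffer:
        return ToyLazyStack(list(steps))
    return list(steps)  # eager result for callers that consume it directly


storage = ToyStorage(slots=[None] * 4)
storage.set(0, toy_rollout(["a", "b"], replay_buffer=storage))
storage.set(2, toy_rollout(["c", "d"]))
print(storage.slots, storage.path_taken)
```

Keeping the eager branch as the default means callers without a replay buffer see no change in the returned type.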

Test plan

  • Run test_extend_lazystack_direct_write to verify storage optimization works
  • Run test_collector_with_rb_uses_lazy_stack to verify collector integration
  • Run benchmarks to measure performance improvement

Made with Cursor

meta-cla bot added the CLA Signed label Feb 3, 2026 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
pytorch-bot bot commented Feb 3, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/3438

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Unrelated Failure

As of commit cddbc62 with merge base eb7a1e4:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions bot (Contributor) commented Feb 3, 2026

⚠️ PR Title Label Error

PR title must start with a label prefix in brackets (e.g., [BugFix]).

Current title: Lazy stack optimization for collector-to-buffer writes

Supported Prefixes (case-sensitive)

Your PR title must start with exactly one of these prefixes:

| Prefix | Label Applied | Example |
|---|---|---|
| [BugFix] | BugFix | [BugFix] Fix memory leak in collector |
| [Feature] | Feature | [Feature] Add new optimizer |
| [Doc] or [Docs] | Documentation | [Doc] Update installation guide |
| [Refactor] | Refactoring | [Refactor] Clean up module imports |
| [CI] | CI | [CI] Fix workflow permissions |
| [Test] or [Tests] | Tests | [Tests] Add unit tests for buffer |
| [Environment] or [Environments] | Environments | [Environments] Add Gymnasium support |
| [Data] | Data | [Data] Fix replay buffer sampling |
| [Performance] or [Perf] | Performance | [Performance] Optimize tensor ops |
| [BC-Breaking] | bc breaking | [BC-Breaking] Remove deprecated API |
| [Deprecation] | Deprecation | [Deprecation] Mark old function |

Note: Common variations like singular/plural are supported (e.g., [Doc] or [Docs]).
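
For checking a title locally before pushing, the rule above can be approximated with a short script. This is a minimal sketch, assuming the check only looks at the first bracketed token and compares it case-sensitively against the listed prefixes; `label_for_title` is a hypothetical helper, not the workflow's actual implementation.

```python
import re
from typing import Optional

# Prefix -> label, copied from the list of supported prefixes above.
SUPPORTED_PREFIXES = {
    "BugFix": "BugFix",
    "Feature": "Feature",
    "Doc": "Documentation",
    "Docs": "Documentation",
    "Refactor": "Refactoring",
    "CI": "CI",
    "Test": "Tests",
    "Tests": "Tests",
    "Environment": "Environments",
    "Environments": "Environments",
    "Data": "Data",
    "Performance": "Performance",
    "Perf": "Performance",
    "BC-Breaking": "bc breaking",
    "Deprecation": "Deprecation",
}

# The title must begin with "[...]"; the bracket contents are captured.
_PREFIX_RE = re.compile(r"^\[([^\]]+)\]")


def label_for_title(title: str) -> Optional[str]:
    """Return the label for a PR title, or None if the prefix is invalid."""
    match = _PREFIX_RE.match(title)
    if match is None:
        return None
    # Dictionary lookup is case-sensitive, matching the stated rule.
    return SUPPORTED_PREFIXES.get(match.group(1))


for title in (
    "Lazy stack optimization for collector-to-buffer writes",
    "[Performance] Lazy stack optimization for collector-to-buffer writes",
):
    print(repr(title), "->", label_for_title(title))
```

The current title has no bracketed prefix, so the sketch returns `None` for it; adding a supported prefix such as `[Performance]` resolves to a label.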

github-actions bot (Contributor) commented Feb 3, 2026

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of GPU Benchmark Tests

Total Benchmarks: 148. Improved: $\large\color{#35bf28}8$. Worsened: $\large\color{#d91a1a}11$.

Name Max Mean Ops Ops on Repo HEAD Change
test_tensor_to_bytestream_speed[pickle] 82.5609μs 81.6652μs 12.2451 KOps/s 12.4090 KOps/s $\color{#d91a1a}-1.32\%$
test_tensor_to_bytestream_speed[torch.save] 0.1408ms 0.1401ms 7.1369 KOps/s 7.1825 KOps/s $\color{#d91a1a}-0.63\%$
test_tensor_to_bytestream_speed[untyped_storage] 0.1342s 0.1334s 7.4977 Ops/s 7.8426 Ops/s $\color{#d91a1a}-4.40\%$
test_tensor_to_bytestream_speed[numpy] 2.8000μs 2.7731μs 360.6117 KOps/s 371.2428 KOps/s $\color{#d91a1a}-2.86\%$
test_tensor_to_bytestream_speed[safetensors] 38.7528μs 38.4387μs 26.0154 KOps/s 27.7441 KOps/s $\textbf{\color{#d91a1a}-6.23\%}$
test_simple 0.9237s 0.8318s 1.2021 Ops/s 1.2047 Ops/s $\color{#d91a1a}-0.21\%$
test_transformed 1.5646s 1.4810s 0.6752 Ops/s 0.6840 Ops/s $\color{#d91a1a}-1.29\%$
test_serial 2.4738s 2.3834s 0.4196 Ops/s 0.4285 Ops/s $\color{#d91a1a}-2.08\%$
test_parallel 2.0306s 1.9559s 0.5113 Ops/s 0.5170 Ops/s $\color{#d91a1a}-1.10\%$
test_step_mdp_speed[True-True-True-True-True] 0.3331ms 45.5816μs 21.9387 KOps/s 22.0546 KOps/s $\color{#d91a1a}-0.53\%$
test_step_mdp_speed[True-True-True-True-False] 58.1320μs 25.2313μs 39.6334 KOps/s 39.4378 KOps/s $\color{#35bf28}+0.50\%$
test_step_mdp_speed[True-True-True-False-True] 83.2830μs 25.0565μs 39.9098 KOps/s 39.0135 KOps/s $\color{#35bf28}+2.30\%$
test_step_mdp_speed[True-True-True-False-False] 42.2320μs 13.8860μs 72.0149 KOps/s 72.0465 KOps/s $\color{#d91a1a}-0.04\%$
test_step_mdp_speed[True-True-False-True-True] 92.5530μs 47.5380μs 21.0358 KOps/s 21.0664 KOps/s $\color{#d91a1a}-0.15\%$
test_step_mdp_speed[True-True-False-True-False] 56.9920μs 28.1629μs 35.5077 KOps/s 35.6896 KOps/s $\color{#d91a1a}-0.51\%$
test_step_mdp_speed[True-True-False-False-True] 60.4020μs 27.8838μs 35.8631 KOps/s 35.6908 KOps/s $\color{#35bf28}+0.48\%$
test_step_mdp_speed[True-True-False-False-False] 54.4120μs 16.9506μs 58.9949 KOps/s 59.1610 KOps/s $\color{#d91a1a}-0.28\%$
test_step_mdp_speed[True-False-True-True-True] 0.1265ms 50.8386μs 19.6701 KOps/s 19.2570 KOps/s $\color{#35bf28}+2.15\%$
test_step_mdp_speed[True-False-True-True-False] 69.5120μs 31.0308μs 32.2261 KOps/s 31.7400 KOps/s $\color{#35bf28}+1.53\%$
test_step_mdp_speed[True-False-True-False-True] 57.1820μs 27.6497μs 36.1668 KOps/s 35.1565 KOps/s $\color{#35bf28}+2.87\%$
test_step_mdp_speed[True-False-True-False-False] 43.7920μs 16.7693μs 59.6327 KOps/s 58.9501 KOps/s $\color{#35bf28}+1.16\%$
test_step_mdp_speed[True-False-False-True-True] 88.3730μs 53.4010μs 18.7263 KOps/s 19.0123 KOps/s $\color{#d91a1a}-1.50\%$
test_step_mdp_speed[True-False-False-True-False] 73.1320μs 33.6426μs 29.7242 KOps/s 29.7735 KOps/s $\color{#d91a1a}-0.17\%$
test_step_mdp_speed[True-False-False-False-True] 61.0620μs 30.5216μs 32.7637 KOps/s 32.9287 KOps/s $\color{#d91a1a}-0.50\%$
test_step_mdp_speed[True-False-False-False-False] 59.2420μs 19.5488μs 51.1539 KOps/s 50.6580 KOps/s $\color{#35bf28}+0.98\%$
test_step_mdp_speed[False-True-True-True-True] 99.9430μs 51.2706μs 19.5044 KOps/s 20.0951 KOps/s $\color{#d91a1a}-2.94\%$
test_step_mdp_speed[False-True-True-True-False] 69.6020μs 30.3911μs 32.9044 KOps/s 32.3169 KOps/s $\color{#35bf28}+1.82\%$
test_step_mdp_speed[False-True-True-False-True] 65.7520μs 31.6289μs 31.6167 KOps/s 31.2419 KOps/s $\color{#35bf28}+1.20\%$
test_step_mdp_speed[False-True-True-False-False] 49.8620μs 18.6783μs 53.5381 KOps/s 54.4276 KOps/s $\color{#d91a1a}-1.63\%$
test_step_mdp_speed[False-True-False-True-True] 2.6119ms 54.3592μs 18.3962 KOps/s 18.6500 KOps/s $\color{#d91a1a}-1.36\%$
test_step_mdp_speed[False-True-False-True-False] 66.4620μs 33.6886μs 29.6836 KOps/s 29.4274 KOps/s $\color{#35bf28}+0.87\%$
test_step_mdp_speed[False-True-False-False-True] 70.7630μs 34.5816μs 28.9171 KOps/s 29.2924 KOps/s $\color{#d91a1a}-1.28\%$
test_step_mdp_speed[False-True-False-False-False] 49.6410μs 21.1577μs 47.2641 KOps/s 47.4871 KOps/s $\color{#d91a1a}-0.47\%$
test_step_mdp_speed[False-False-True-True-True] 0.1019ms 55.5093μs 18.0150 KOps/s 17.8621 KOps/s $\color{#35bf28}+0.86\%$
test_step_mdp_speed[False-False-True-True-False] 77.4520μs 36.9224μs 27.0838 KOps/s 27.3222 KOps/s $\color{#d91a1a}-0.87\%$
test_step_mdp_speed[False-False-True-False-True] 73.8130μs 34.7936μs 28.7409 KOps/s 28.9086 KOps/s $\color{#d91a1a}-0.58\%$
test_step_mdp_speed[False-False-True-False-False] 57.6320μs 21.1730μs 47.2299 KOps/s 47.5436 KOps/s $\color{#d91a1a}-0.66\%$
test_step_mdp_speed[False-False-False-True-True] 0.1059ms 57.6433μs 17.3481 KOps/s 17.0978 KOps/s $\color{#35bf28}+1.46\%$
test_step_mdp_speed[False-False-False-True-False] 74.9530μs 39.0653μs 25.5981 KOps/s 25.2592 KOps/s $\color{#35bf28}+1.34\%$
test_step_mdp_speed[False-False-False-False-True] 78.3130μs 37.1615μs 26.9096 KOps/s 27.0238 KOps/s $\color{#d91a1a}-0.42\%$
test_step_mdp_speed[False-False-False-False-False] 64.7320μs 23.7667μs 42.0758 KOps/s 41.5881 KOps/s $\color{#35bf28}+1.17\%$
test_non_tensor_env_rollout_speed[1000-single-True] 0.8796s 0.7772s 1.2867 Ops/s 1.2711 Ops/s $\color{#35bf28}+1.22\%$
test_non_tensor_env_rollout_speed[1000-single-False] 0.7311s 0.6382s 1.5668 Ops/s 1.5383 Ops/s $\color{#35bf28}+1.86\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] 1.7979s 1.7164s 0.5826 Ops/s 0.5817 Ops/s $\color{#35bf28}+0.16\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] 1.5504s 1.4800s 0.6757 Ops/s 0.6745 Ops/s $\color{#35bf28}+0.17\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-True] 2.0422s 1.9726s 0.5069 Ops/s 0.5079 Ops/s $\color{#d91a1a}-0.20\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-False] 1.8130s 1.7396s 0.5748 Ops/s 0.5734 Ops/s $\color{#35bf28}+0.26\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] 4.7818s 4.6887s 0.2133 Ops/s 0.2124 Ops/s $\color{#35bf28}+0.41\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] 4.6534s 4.5146s 0.2215 Ops/s 0.2233 Ops/s $\color{#d91a1a}-0.78\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] 2.0375s 1.9781s 0.5055 Ops/s 0.5060 Ops/s $\color{#d91a1a}-0.09\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] 1.8762s 1.7351s 0.5763 Ops/s 0.5890 Ops/s $\color{#d91a1a}-2.15\%$
test_values[generalized_advantage_estimate-True-True] 22.7895ms 21.8448ms 45.7774 Ops/s 47.6312 Ops/s $\color{#d91a1a}-3.89\%$
test_values[vec_generalized_advantage_estimate-True-True] 0.1390s 3.7191ms 268.8820 Ops/s 262.2745 Ops/s $\color{#35bf28}+2.52\%$
test_values[td0_return_estimate-False-False] 0.1107ms 87.6554μs 11.4083 KOps/s 11.6354 KOps/s $\color{#d91a1a}-1.95\%$
test_values[td1_return_estimate-False-False] 52.8873ms 51.7450ms 19.3255 Ops/s 20.1711 Ops/s $\color{#d91a1a}-4.19\%$
test_values[vec_td1_return_estimate-False-False] 1.3778ms 1.1125ms 898.8859 Ops/s 910.6630 Ops/s $\color{#d91a1a}-1.29\%$
test_values[td_lambda_return_estimate-True-False] 85.9720ms 84.2263ms 11.8728 Ops/s 12.3443 Ops/s $\color{#d91a1a}-3.82\%$
test_values[vec_td_lambda_return_estimate-True-False] 1.3511ms 1.1066ms 903.6336 Ops/s 913.2117 Ops/s $\color{#d91a1a}-1.05\%$
test_gae_speed[generalized_advantage_estimate-False-1-512] 22.7706ms 21.9534ms 45.5510 Ops/s 48.1032 Ops/s $\textbf{\color{#d91a1a}-5.31\%}$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 1.0608ms 0.7841ms 1.2754 KOps/s 1.3065 KOps/s $\color{#d91a1a}-2.38\%$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.7443ms 0.7002ms 1.4282 KOps/s 1.4583 KOps/s $\color{#d91a1a}-2.07\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 1.6190ms 1.5263ms 655.2005 Ops/s 665.3428 Ops/s $\color{#d91a1a}-1.52\%$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 0.8304ms 0.7190ms 1.3908 KOps/s 1.4139 KOps/s $\color{#d91a1a}-1.63\%$
test_dqn_speed[False-None] 1.7035ms 1.5697ms 637.0633 Ops/s 633.9639 Ops/s $\color{#35bf28}+0.49\%$
test_dqn_speed[False-backward] 2.3261ms 2.2199ms 450.4721 Ops/s 455.9845 Ops/s $\color{#d91a1a}-1.21\%$
test_dqn_speed[True-None] 0.7117ms 0.5844ms 1.7110 KOps/s 1.8217 KOps/s $\textbf{\color{#d91a1a}-6.08\%}$
test_dqn_speed[True-backward] 1.2981ms 1.2216ms 818.6209 Ops/s 845.9448 Ops/s $\color{#d91a1a}-3.23\%$
test_dqn_speed[reduce-overhead-None] 0.6559ms 0.5780ms 1.7300 KOps/s 1.6798 KOps/s $\color{#35bf28}+2.99\%$
test_ddpg_speed[False-None] 3.5782ms 2.9772ms 335.8840 Ops/s 340.4654 Ops/s $\color{#d91a1a}-1.35\%$
test_ddpg_speed[False-backward] 4.6363ms 4.4225ms 226.1141 Ops/s 229.6927 Ops/s $\color{#d91a1a}-1.56\%$
test_ddpg_speed[True-None] 1.4449ms 1.3227ms 756.0499 Ops/s 774.5700 Ops/s $\color{#d91a1a}-2.39\%$
test_ddpg_speed[True-backward] 2.6812ms 2.5770ms 388.0521 Ops/s 400.4700 Ops/s $\color{#d91a1a}-3.10\%$
test_ddpg_speed[reduce-overhead-None] 1.6230ms 1.3495ms 741.0409 Ops/s 755.2130 Ops/s $\color{#d91a1a}-1.88\%$
test_sac_speed[False-None] 9.1066ms 8.6344ms 115.8162 Ops/s 118.2568 Ops/s $\color{#d91a1a}-2.06\%$
test_sac_speed[False-backward] 12.4077ms 11.9023ms 84.0172 Ops/s 84.5687 Ops/s $\color{#d91a1a}-0.65\%$
test_sac_speed[True-None] 2.1625ms 1.8358ms 544.7129 Ops/s 556.1260 Ops/s $\color{#d91a1a}-2.05\%$
test_sac_speed[True-backward] 3.7751ms 3.6478ms 274.1407 Ops/s 279.0405 Ops/s $\color{#d91a1a}-1.76\%$
test_sac_speed[reduce-overhead-None] 18.7499ms 10.4789ms 95.4296 Ops/s 93.8490 Ops/s $\color{#35bf28}+1.68\%$
test_redq_deprec_speed[False-None] 10.4665ms 9.5536ms 104.6724 Ops/s 106.0408 Ops/s $\color{#d91a1a}-1.29\%$
test_redq_deprec_speed[False-backward] 13.4714ms 13.0305ms 76.7430 Ops/s 79.3033 Ops/s $\color{#d91a1a}-3.23\%$
test_redq_deprec_speed[True-None] 2.8188ms 2.5646ms 389.9239 Ops/s 401.5192 Ops/s $\color{#d91a1a}-2.89\%$
test_redq_deprec_speed[True-backward] 4.8273ms 4.3912ms 227.7275 Ops/s 233.2272 Ops/s $\color{#d91a1a}-2.36\%$
test_redq_deprec_speed[reduce-overhead-None] 15.2756ms 9.5027ms 105.2332 Ops/s 88.6407 Ops/s $\textbf{\color{#35bf28}+18.72\%}$
test_td3_speed[False-None] 8.6424ms 8.4083ms 118.9298 Ops/s 119.8573 Ops/s $\color{#d91a1a}-0.77\%$
test_td3_speed[False-backward] 11.5241ms 11.0030ms 90.8844 Ops/s 90.4454 Ops/s $\color{#35bf28}+0.49\%$
test_td3_speed[True-None] 1.6923ms 1.6496ms 606.1989 Ops/s 624.7728 Ops/s $\color{#d91a1a}-2.97\%$
test_td3_speed[True-backward] 3.3637ms 3.2734ms 305.4932 Ops/s 320.6952 Ops/s $\color{#d91a1a}-4.74\%$
test_td3_speed[reduce-overhead-None] 65.2657ms 23.2247ms 43.0575 Ops/s 42.2562 Ops/s $\color{#35bf28}+1.90\%$
test_cql_speed[False-None] 18.3718ms 17.7461ms 56.3505 Ops/s 56.9199 Ops/s $\color{#d91a1a}-1.00\%$
test_cql_speed[False-backward] 24.0958ms 23.6379ms 42.3050 Ops/s 43.7598 Ops/s $\color{#d91a1a}-3.32\%$
test_cql_speed[True-None] 3.6122ms 3.3120ms 301.9365 Ops/s 310.7384 Ops/s $\color{#d91a1a}-2.83\%$
test_cql_speed[True-backward] 5.9664ms 5.5915ms 178.8435 Ops/s 188.5120 Ops/s $\textbf{\color{#d91a1a}-5.13\%}$
test_cql_speed[reduce-overhead-None] 18.4533ms 11.6673ms 85.7095 Ops/s 85.8677 Ops/s $\color{#d91a1a}-0.18\%$
test_a2c_speed[False-None] 4.3361ms 3.3440ms 299.0387 Ops/s 304.3224 Ops/s $\color{#d91a1a}-1.74\%$
test_a2c_speed[False-backward] 6.7295ms 6.5143ms 153.5077 Ops/s 161.2003 Ops/s $\color{#d91a1a}-4.77\%$
test_a2c_speed[True-None] 1.3966ms 1.3337ms 749.7931 Ops/s 754.7561 Ops/s $\color{#d91a1a}-0.66\%$
test_a2c_speed[True-backward] 3.1769ms 3.1245ms 320.0477 Ops/s 326.5576 Ops/s $\color{#d91a1a}-1.99\%$
test_a2c_speed[reduce-overhead-None] 1.0464ms 0.9674ms 1.0337 KOps/s 1.0352 KOps/s $\color{#d91a1a}-0.14\%$
test_ppo_speed[False-None] 4.0513ms 3.9196ms 255.1273 Ops/s 256.8064 Ops/s $\color{#d91a1a}-0.65\%$
test_ppo_speed[False-backward] 7.7365ms 7.3162ms 136.6832 Ops/s 138.3801 Ops/s $\color{#d91a1a}-1.23\%$
test_ppo_speed[True-None] 1.5489ms 1.4287ms 699.9293 Ops/s 714.9976 Ops/s $\color{#d91a1a}-2.11\%$
test_ppo_speed[True-backward] 3.4186ms 3.2824ms 304.6526 Ops/s 307.2243 Ops/s $\color{#d91a1a}-0.84\%$
test_ppo_speed[reduce-overhead-None] 1.1298ms 1.0419ms 959.7812 Ops/s 953.2886 Ops/s $\color{#35bf28}+0.68\%$
test_reinforce_speed[False-None] 2.4223ms 2.3157ms 431.8266 Ops/s 431.1167 Ops/s $\color{#35bf28}+0.16\%$
test_reinforce_speed[False-backward] 3.9281ms 3.4967ms 285.9821 Ops/s 291.0403 Ops/s $\color{#d91a1a}-1.74\%$
test_reinforce_speed[True-None] 1.4461ms 1.2675ms 788.9494 Ops/s 806.1003 Ops/s $\color{#d91a1a}-2.13\%$
test_reinforce_speed[True-backward] 3.2751ms 3.1542ms 317.0393 Ops/s 325.9236 Ops/s $\color{#d91a1a}-2.73\%$
test_reinforce_speed[reduce-overhead-None] 16.3020ms 9.0839ms 110.0854 Ops/s 98.0720 Ops/s $\textbf{\color{#35bf28}+12.25\%}$
test_iql_speed[False-None] 10.2912ms 9.6928ms 103.1690 Ops/s 103.2568 Ops/s $\color{#d91a1a}-0.09\%$
test_iql_speed[False-backward] 14.5589ms 13.7718ms 72.6120 Ops/s 72.0262 Ops/s $\color{#35bf28}+0.81\%$
test_iql_speed[True-None] 2.3327ms 2.1866ms 457.3387 Ops/s 464.8715 Ops/s $\color{#d91a1a}-1.62\%$
test_iql_speed[True-backward] 4.9754ms 4.8807ms 204.8868 Ops/s 207.5553 Ops/s $\color{#d91a1a}-1.29\%$
test_iql_speed[reduce-overhead-None] 16.9964ms 10.0065ms 99.9354 Ops/s 97.8589 Ops/s $\color{#35bf28}+2.12\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.6030ms 6.0613ms 164.9821 Ops/s 162.3710 Ops/s $\color{#35bf28}+1.61\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 1.0190ms 0.3189ms 3.1358 KOps/s 2.6525 KOps/s $\textbf{\color{#35bf28}+18.22\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.6154ms 0.2951ms 3.3883 KOps/s 3.3316 KOps/s $\color{#35bf28}+1.70\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.1533ms 5.8160ms 171.9394 Ops/s 166.9571 Ops/s $\color{#35bf28}+2.98\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 1.1444ms 0.3369ms 2.9685 KOps/s 2.8707 KOps/s $\color{#35bf28}+3.41\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.6471ms 0.2999ms 3.3343 KOps/s 3.3154 KOps/s $\color{#35bf28}+0.57\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.7317ms 1.3815ms 723.8446 Ops/s 687.2524 Ops/s $\textbf{\color{#35bf28}+5.32\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.5111ms 1.3090ms 763.9331 Ops/s 724.6747 Ops/s $\textbf{\color{#35bf28}+5.42\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.0502ms 5.9231ms 168.8312 Ops/s 162.1238 Ops/s $\color{#35bf28}+4.14\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.8187ms 0.4892ms 2.0442 KOps/s 2.1508 KOps/s $\color{#d91a1a}-4.95\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.8681ms 0.4678ms 2.1378 KOps/s 2.3010 KOps/s $\textbf{\color{#d91a1a}-7.09\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.0620ms 5.8291ms 171.5543 Ops/s 165.9214 Ops/s $\color{#35bf28}+3.39\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 1.1316ms 0.2898ms 3.4508 KOps/s 2.6249 KOps/s $\textbf{\color{#35bf28}+31.46\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.4811ms 0.2727ms 3.6670 KOps/s 2.7693 KOps/s $\textbf{\color{#35bf28}+32.41\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.1624ms 5.7756ms 173.1428 Ops/s 166.5402 Ops/s $\color{#35bf28}+3.96\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 2.1346ms 0.3776ms 2.6481 KOps/s 2.7006 KOps/s $\color{#d91a1a}-1.94\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.5112ms 0.3596ms 2.7808 KOps/s 2.8572 KOps/s $\color{#d91a1a}-2.68\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.1003ms 5.9746ms 167.3758 Ops/s 163.5469 Ops/s $\color{#35bf28}+2.34\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 2.1695ms 0.5455ms 1.8331 KOps/s 1.9647 KOps/s $\textbf{\color{#d91a1a}-6.70\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7562ms 0.5295ms 1.8887 KOps/s 2.1633 KOps/s $\textbf{\color{#d91a1a}-12.69\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 6.6225ms 5.1635ms 193.6688 Ops/s 48.4662 Ops/s $\textbf{\color{#35bf28}+299.60\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 8.9703ms 2.2595ms 442.5797 Ops/s 529.6709 Ops/s $\textbf{\color{#d91a1a}-16.44\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 10.1627ms 1.3319ms 750.8159 Ops/s 1.0432 KOps/s $\textbf{\color{#d91a1a}-28.03\%}$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 0.5848s 16.7784ms 59.6004 Ops/s 192.4404 Ops/s $\textbf{\color{#d91a1a}-69.03\%}$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 4.0202ms 1.8689ms 535.0668 Ops/s 531.2042 Ops/s $\color{#35bf28}+0.73\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 9.0428ms 1.2643ms 790.9388 Ops/s 782.7137 Ops/s $\color{#35bf28}+1.05\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 6.8963ms 5.3996ms 185.1995 Ops/s 186.3062 Ops/s $\color{#d91a1a}-0.59\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 11.6834ms 2.2207ms 450.3108 Ops/s 498.6093 Ops/s $\textbf{\color{#d91a1a}-9.69\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 3.9924ms 1.1726ms 852.8300 Ops/s 849.4931 Ops/s $\color{#35bf28}+0.39\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 39.5790ms 36.8974ms 27.1022 Ops/s 26.8297 Ops/s $\color{#35bf28}+1.02\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 20.0624ms 18.5696ms 53.8515 Ops/s 53.0097 Ops/s $\color{#35bf28}+1.59\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] 41.6254ms 38.1984ms 26.1791 Ops/s 25.9251 Ops/s $\color{#35bf28}+0.98\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] 20.6609ms 18.9919ms 52.6541 Ops/s 52.5124 Ops/s $\color{#35bf28}+0.27\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] 40.9988ms 39.6181ms 25.2410 Ops/s 25.0289 Ops/s $\color{#35bf28}+0.85\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] 21.9697ms 20.2432ms 49.3994 Ops/s 49.0797 Ops/s $\color{#35bf28}+0.65\%$
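
The Change column appears to be the relative difference between the PR's throughput (Ops) and the throughput on repo HEAD, and the bolded entries match the Improved/Worsened totals if a 5% threshold is assumed. Neither the formula nor the threshold is stated in the bot comment, so both are assumptions in the sketch below; `parse_ops`, `percent_change`, and `classify` are hypothetical helpers. Because the table shows rounded values, recomputed percentages can differ from the reported ones in the last digit.

```python
# Unit multipliers for the two units that appear in the table.
UNIT_SCALE = {"Ops/s": 1.0, "KOps/s": 1_000.0}


def parse_ops(text: str) -> float:
    """Convert a cell such as '1.0432 KOps/s' to operations per second."""
    value, unit = text.split()
    return float(value) * UNIT_SCALE[unit]


def percent_change(ops: str, ops_on_head: str) -> float:
    """Relative throughput change of the PR versus repo HEAD, in percent."""
    pr = parse_ops(ops)
    head = parse_ops(ops_on_head)
    return (pr - head) / head * 100.0


def classify(change_pct: float, threshold_pct: float = 5.0) -> str:
    """Bucket a change using an assumed +/-5% significance threshold."""
    if change_pct >= threshold_pct:
        return "improved"
    if change_pct <= -threshold_pct:
        return "worsened"
    return "unchanged"


# (name, Ops, Ops on Repo HEAD) taken from three rows of the GPU table.
rows = [
    ("test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400]",
     "193.6688 Ops/s", "48.4662 Ops/s"),
    ("test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400]",
     "750.8159 Ops/s", "1.0432 KOps/s"),
    ("test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000]",
     "3.3883 KOps/s", "3.3316 KOps/s"),
]

for name, ops, head in rows:
    change = percent_change(ops, head)
    print(f"{name}: {change:+.2f}% ({classify(change)})")
```

Note the unit handling: the second row mixes Ops/s and KOps/s, so both cells must be converted to a common unit before comparing.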

github-actions bot (Contributor) commented Feb 3, 2026

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of CPU Benchmark Tests

Total Benchmarks: 173. Improved: $\large\color{#35bf28}23$. Worsened: $\large\color{#d91a1a}6$.

Name Max Mean Ops Ops on Repo HEAD Change
test_tensor_to_bytestream_speed[pickle] 83.1249μs 81.8303μs 12.2204 KOps/s 12.3131 KOps/s $\color{#d91a1a}-0.75\%$
test_tensor_to_bytestream_speed[torch.save] 0.1424ms 0.1416ms 7.0597 KOps/s 7.1729 KOps/s $\color{#d91a1a}-1.58\%$
test_tensor_to_bytestream_speed[untyped_storage] 0.1260s 0.1255s 7.9676 Ops/s 8.5860 Ops/s $\textbf{\color{#d91a1a}-7.20\%}$
test_tensor_to_bytestream_speed[numpy] 2.7608μs 2.7527μs 363.2762 KOps/s 376.5348 KOps/s $\color{#d91a1a}-3.52\%$
test_tensor_to_bytestream_speed[safetensors] 40.0430μs 39.4921μs 25.3215 KOps/s 25.8995 KOps/s $\color{#d91a1a}-2.23\%$
test_simple 0.5665s 0.5649s 1.7702 Ops/s 1.7107 Ops/s $\color{#35bf28}+3.48\%$
test_transformed 1.2816s 1.1852s 0.8438 Ops/s 0.8447 Ops/s $\color{#d91a1a}-0.11\%$
test_serial 1.7071s 1.7023s 0.5874 Ops/s 0.5797 Ops/s $\color{#35bf28}+1.34\%$
test_parallel 1.2518s 1.1446s 0.8737 Ops/s 0.8787 Ops/s $\color{#d91a1a}-0.57\%$
test_step_mdp_speed[True-True-True-True-True] 0.3135ms 45.5695μs 21.9445 KOps/s 22.3578 KOps/s $\color{#d91a1a}-1.85\%$
test_step_mdp_speed[True-True-True-True-False] 50.1800μs 25.6568μs 38.9760 KOps/s 38.6807 KOps/s $\color{#35bf28}+0.76\%$
test_step_mdp_speed[True-True-True-False-True] 52.4010μs 25.5128μs 39.1960 KOps/s 39.2066 KOps/s $\color{#d91a1a}-0.03\%$
test_step_mdp_speed[True-True-True-False-False] 40.0610μs 13.8772μs 72.0606 KOps/s 71.0243 KOps/s $\color{#35bf28}+1.46\%$
test_step_mdp_speed[True-True-False-True-True] 82.9310μs 48.7441μs 20.5153 KOps/s 20.8427 KOps/s $\color{#d91a1a}-1.57\%$
test_step_mdp_speed[True-True-False-True-False] 52.0010μs 28.3848μs 35.2301 KOps/s 36.0186 KOps/s $\color{#d91a1a}-2.19\%$
test_step_mdp_speed[True-True-False-False-True] 54.5010μs 28.4400μs 35.1618 KOps/s 35.3258 KOps/s $\color{#d91a1a}-0.46\%$
test_step_mdp_speed[True-True-False-False-False] 47.6910μs 16.6906μs 59.9138 KOps/s 58.9862 KOps/s $\color{#35bf28}+1.57\%$
test_step_mdp_speed[True-False-True-True-True] 81.9110μs 51.1710μs 19.5423 KOps/s 19.5781 KOps/s $\color{#d91a1a}-0.18\%$
test_step_mdp_speed[True-False-True-True-False] 65.1620μs 30.5569μs 32.7258 KOps/s 32.4217 KOps/s $\color{#35bf28}+0.94\%$
test_step_mdp_speed[True-False-True-False-True] 58.0210μs 28.1203μs 35.5615 KOps/s 36.2835 KOps/s $\color{#d91a1a}-1.99\%$
test_step_mdp_speed[True-False-True-False-False] 41.0910μs 16.5066μs 60.5817 KOps/s 59.4238 KOps/s $\color{#35bf28}+1.95\%$
test_step_mdp_speed[True-False-False-True-True] 83.6110μs 53.2307μs 18.7862 KOps/s 18.4301 KOps/s $\color{#35bf28}+1.93\%$
test_step_mdp_speed[True-False-False-True-False] 61.8310μs 33.2951μs 30.0344 KOps/s 29.6171 KOps/s $\color{#35bf28}+1.41\%$
test_step_mdp_speed[True-False-False-False-True] 56.7810μs 30.9146μs 32.3472 KOps/s 31.9944 KOps/s $\color{#35bf28}+1.10\%$
test_step_mdp_speed[True-False-False-False-False] 44.1710μs 19.3205μs 51.7584 KOps/s 50.2879 KOps/s $\color{#35bf28}+2.92\%$
test_step_mdp_speed[False-True-True-True-True] 93.2720μs 50.7789μs 19.6932 KOps/s 19.4527 KOps/s $\color{#35bf28}+1.24\%$
test_step_mdp_speed[False-True-True-True-False] 55.4010μs 30.9580μs 32.3018 KOps/s 31.8329 KOps/s $\color{#35bf28}+1.47\%$
test_step_mdp_speed[False-True-True-False-True] 2.3809ms 32.9563μs 30.3432 KOps/s 30.4945 KOps/s $\color{#d91a1a}-0.50\%$
test_step_mdp_speed[False-True-True-False-False] 46.9610μs 18.6236μs 53.6953 KOps/s 52.6439 KOps/s $\color{#35bf28}+2.00\%$
test_step_mdp_speed[False-True-False-True-True] 88.0020μs 53.9715μs 18.5283 KOps/s 18.2946 KOps/s $\color{#35bf28}+1.28\%$
test_step_mdp_speed[False-True-False-True-False] 68.4310μs 33.7879μs 29.5964 KOps/s 28.9921 KOps/s $\color{#35bf28}+2.08\%$
test_step_mdp_speed[False-True-False-False-True] 74.2620μs 35.2407μs 28.3763 KOps/s 28.6374 KOps/s $\color{#d91a1a}-0.91\%$
test_step_mdp_speed[False-True-False-False-False] 78.1510μs 20.7062μs 48.2947 KOps/s 46.5609 KOps/s $\color{#35bf28}+3.72\%$
test_step_mdp_speed[False-False-True-True-True] 98.2820μs 56.6191μs 17.6619 KOps/s 17.5347 KOps/s $\color{#35bf28}+0.73\%$
test_step_mdp_speed[False-False-True-True-False] 58.1710μs 36.3056μs 27.5439 KOps/s 26.6291 KOps/s $\color{#35bf28}+3.44\%$
test_step_mdp_speed[False-False-True-False-True] 65.6820μs 34.8109μs 28.7266 KOps/s 28.7120 KOps/s $\color{#35bf28}+0.05\%$
test_step_mdp_speed[False-False-True-False-False] 48.2710μs 21.3093μs 46.9279 KOps/s 46.4138 KOps/s $\color{#35bf28}+1.11\%$
test_step_mdp_speed[False-False-False-True-True] 87.7420μs 58.7828μs 17.0118 KOps/s 16.6080 KOps/s $\color{#35bf28}+2.43\%$
test_step_mdp_speed[False-False-False-True-False] 63.1310μs 38.7992μs 25.7738 KOps/s 25.4178 KOps/s $\color{#35bf28}+1.40\%$
test_step_mdp_speed[False-False-False-False-True] 69.1110μs 37.3482μs 26.7750 KOps/s 27.5911 KOps/s $\color{#d91a1a}-2.96\%$
test_step_mdp_speed[False-False-False-False-False] 73.1610μs 23.7754μs 42.0603 KOps/s 42.0990 KOps/s $\color{#d91a1a}-0.09\%$
test_non_tensor_env_rollout_speed[1000-single-True] 0.7618s 0.7580s 1.3193 Ops/s 1.2575 Ops/s $\color{#35bf28}+4.92\%$
test_non_tensor_env_rollout_speed[1000-single-False] 0.7406s 0.6418s 1.5581 Ops/s 1.5399 Ops/s $\color{#35bf28}+1.18\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-True] 1.7752s 1.6959s 0.5897 Ops/s 0.5853 Ops/s $\color{#35bf28}+0.74\%$
test_non_tensor_env_rollout_speed[1000-serial-no-buffers-False] 1.5447s 1.4693s 0.6806 Ops/s 0.6730 Ops/s $\color{#35bf28}+1.12\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-True] 2.0303s 1.9466s 0.5137 Ops/s 0.5090 Ops/s $\color{#35bf28}+0.92\%$
test_non_tensor_env_rollout_speed[1000-serial-buffers-False] 1.8082s 1.7258s 0.5794 Ops/s 0.5742 Ops/s $\color{#35bf28}+0.91\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-True] 4.8335s 4.7478s 0.2106 Ops/s 0.2135 Ops/s $\color{#d91a1a}-1.32\%$
test_non_tensor_env_rollout_speed[1000-parallel-no-buffers-False] 4.6224s 4.5258s 0.2210 Ops/s 0.2194 Ops/s $\color{#35bf28}+0.69\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-True] 2.1283s 1.9709s 0.5074 Ops/s 0.5060 Ops/s $\color{#35bf28}+0.26\%$
test_non_tensor_env_rollout_speed[1000-parallel-buffers-False] 1.7510s 1.6717s 0.5982 Ops/s 0.5893 Ops/s $\color{#35bf28}+1.52\%$
test_values[generalized_advantage_estimate-True-True] 11.8817ms 10.8466ms 92.1951 Ops/s 94.3337 Ops/s $\color{#d91a1a}-2.27\%$
test_values[vec_generalized_advantage_estimate-True-True] 12.7238ms 11.1318ms 89.8331 Ops/s 56.2763 Ops/s $\textbf{\color{#35bf28}+59.63\%}$
test_values[td0_return_estimate-False-False] 0.2421ms 0.1406ms 7.1119 KOps/s 7.9330 KOps/s $\textbf{\color{#d91a1a}-10.35\%}$
test_values[td1_return_estimate-False-False] 30.3634ms 29.4540ms 33.9513 Ops/s 33.8013 Ops/s $\color{#35bf28}+0.44\%$
test_values[vec_td1_return_estimate-False-False] 11.3937ms 11.1592ms 89.6120 Ops/s 56.3942 Ops/s $\textbf{\color{#35bf28}+58.90\%}$
test_values[td_lambda_return_estimate-True-False] 45.5190ms 43.7966ms 22.8328 Ops/s 22.9114 Ops/s $\color{#d91a1a}-0.34\%$
test_values[vec_td_lambda_return_estimate-True-False] 12.2307ms 11.1662ms 89.5557 Ops/s 56.0432 Ops/s $\textbf{\color{#35bf28}+59.80\%}$
test_gae_speed[generalized_advantage_estimate-False-1-512] 9.6741ms 9.5319ms 104.9108 Ops/s 106.1474 Ops/s $\color{#d91a1a}-1.16\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 1.7121ms 1.5001ms 666.6059 Ops/s 659.2101 Ops/s $\color{#35bf28}+1.12\%$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.5178ms 0.4252ms 2.3520 KOps/s 2.3061 KOps/s $\color{#35bf28}+1.99\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 19.4366ms 18.7902ms 53.2192 Ops/s 28.7191 Ops/s $\textbf{\color{#35bf28}+85.31\%}$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 2.1422ms 1.7266ms 579.1851 Ops/s 584.0525 Ops/s $\color{#d91a1a}-0.83\%$
test_dqn_speed[False-None] 1.5183ms 1.4141ms 707.1437 Ops/s 695.5854 Ops/s $\color{#35bf28}+1.66\%$
test_dqn_speed[False-backward] 1.9760ms 1.9362ms 516.4777 Ops/s 510.3646 Ops/s $\color{#35bf28}+1.20\%$
test_dqn_speed[True-None] 0.7830ms 0.5414ms 1.8471 KOps/s 1.8293 KOps/s $\color{#35bf28}+0.98\%$
test_dqn_speed[True-backward] 1.0288ms 0.9982ms 1.0019 KOps/s 850.0212 Ops/s $\textbf{\color{#35bf28}+17.86\%}$
test_dqn_speed[reduce-overhead-None] 0.9418ms 0.5271ms 1.8971 KOps/s 1.8300 KOps/s $\color{#35bf28}+3.67\%$
test_ddpg_speed[False-None] 4.3221ms 2.9643ms 337.3507 Ops/s 343.7276 Ops/s $\color{#d91a1a}-1.86\%$
test_ddpg_speed[False-backward] 4.5009ms 4.1862ms 238.8783 Ops/s 240.1774 Ops/s $\color{#d91a1a}-0.54\%$
test_ddpg_speed[True-None] 1.6830ms 1.4119ms 708.2638 Ops/s 697.3215 Ops/s $\color{#35bf28}+1.57\%$
test_ddpg_speed[True-backward] 2.5150ms 2.3931ms 417.8633 Ops/s 376.8703 Ops/s $\textbf{\color{#35bf28}+10.88\%}$
test_ddpg_speed[reduce-overhead-None] 1.4554ms 1.3892ms 719.8625 Ops/s 698.1169 Ops/s $\color{#35bf28}+3.11\%$
test_sac_speed[False-None] 8.8154ms 8.1801ms 122.2479 Ops/s 122.1722 Ops/s $\color{#35bf28}+0.06\%$
test_sac_speed[False-backward] 12.1178ms 11.4394ms 87.4171 Ops/s 86.1405 Ops/s $\color{#35bf28}+1.48\%$
test_sac_speed[True-None] 2.5504ms 2.1354ms 468.2932 Ops/s 460.5500 Ops/s $\color{#35bf28}+1.68\%$
test_sac_speed[True-backward] 4.5836ms 4.1246ms 242.4461 Ops/s 244.4614 Ops/s $\color{#d91a1a}-0.82\%$
test_sac_speed[reduce-overhead-None] 2.4701ms 2.1311ms 469.2432 Ops/s 446.4885 Ops/s $\textbf{\color{#35bf28}+5.10\%}$
test_redq_speed[False-None] 10.9619ms 10.4670ms 95.5381 Ops/s 93.3354 Ops/s $\color{#35bf28}+2.36\%$
test_redq_speed[False-backward] 18.5157ms 17.8791ms 55.9311 Ops/s 56.0587 Ops/s $\color{#d91a1a}-0.23\%$
test_redq_speed[True-None] 4.7313ms 4.4026ms 227.1373 Ops/s 229.7392 Ops/s $\color{#d91a1a}-1.13\%$
test_redq_speed[True-backward] 10.0385ms 9.7333ms 102.7405 Ops/s 94.5727 Ops/s $\textbf{\color{#35bf28}+8.64\%}$
test_redq_speed[reduce-overhead-None] 4.5754ms 4.3225ms 231.3452 Ops/s 220.1939 Ops/s $\textbf{\color{#35bf28}+5.06\%}$
test_redq_deprec_speed[False-None] 11.6039ms 11.0649ms 90.3762 Ops/s 89.4540 Ops/s $\color{#35bf28}+1.03\%$
test_redq_deprec_speed[False-backward] 16.2611ms 15.7557ms 63.4691 Ops/s 62.3081 Ops/s $\color{#35bf28}+1.86\%$
test_redq_deprec_speed[True-None] 4.4014ms 3.6688ms 272.5655 Ops/s 267.7504 Ops/s $\color{#35bf28}+1.80\%$
test_redq_deprec_speed[True-backward] 7.8198ms 7.5692ms 132.1147 Ops/s 132.7005 Ops/s $\color{#d91a1a}-0.44\%$
test_redq_deprec_speed[reduce-overhead-None] 3.9961ms 3.6100ms 277.0081 Ops/s 279.3934 Ops/s $\color{#d91a1a}-0.85\%$
test_td3_speed[False-None] 8.3187ms 8.1529ms 122.6561 Ops/s 122.4674 Ops/s $\color{#35bf28}+0.15\%$
test_td3_speed[False-backward] 11.7888ms 11.0677ms 90.3533 Ops/s 90.4446 Ops/s $\color{#d91a1a}-0.10\%$
test_td3_speed[True-None] 1.8418ms 1.8180ms 550.0530 Ops/s 548.3735 Ops/s $\color{#35bf28}+0.31\%$
test_td3_speed[True-backward] 4.0380ms 3.6618ms 273.0885 Ops/s 252.2537 Ops/s $\textbf{\color{#35bf28}+8.26\%}$
test_td3_speed[reduce-overhead-None] 1.8489ms 1.7900ms 558.6661 Ops/s 554.0323 Ops/s $\color{#35bf28}+0.84\%$
test_cql_speed[False-None] 29.4037ms 26.4670ms 37.7829 Ops/s 38.0585 Ops/s $\color{#d91a1a}-0.72\%$
test_cql_speed[False-backward] 40.4521ms 36.1431ms 27.6678 Ops/s 27.8404 Ops/s $\color{#d91a1a}-0.62\%$
test_cql_speed[True-None] 12.6864ms 12.4144ms 80.5513 Ops/s 76.2404 Ops/s $\textbf{\color{#35bf28}+5.65\%}$
test_cql_speed[True-backward] 18.8409ms 18.3689ms 54.4398 Ops/s 54.6072 Ops/s $\color{#d91a1a}-0.31\%$
test_cql_speed[reduce-overhead-None] 12.7698ms 12.4139ms 80.5551 Ops/s 79.2421 Ops/s $\color{#35bf28}+1.66\%$
test_a2c_speed[False-None] 5.5748ms 5.3549ms 186.7445 Ops/s 182.5535 Ops/s $\color{#35bf28}+2.30\%$
test_a2c_speed[False-backward] 12.0977ms 11.7408ms 85.1730 Ops/s 84.1387 Ops/s $\color{#35bf28}+1.23\%$
test_a2c_speed[True-None] 4.1303ms 3.7458ms 266.9683 Ops/s 260.7267 Ops/s $\color{#35bf28}+2.39\%$
test_a2c_speed[True-backward] 8.9998ms 8.5783ms 116.5727 Ops/s 116.0092 Ops/s $\color{#35bf28}+0.49\%$
test_a2c_speed[reduce-overhead-None] 4.0865ms 3.6976ms 270.4430 Ops/s 267.0233 Ops/s $\color{#35bf28}+1.28\%$
test_ppo_speed[False-None] 6.3077ms 5.9525ms 167.9980 Ops/s 165.1361 Ops/s $\color{#35bf28}+1.73\%$
test_ppo_speed[False-backward] 13.2117ms 12.6313ms 79.1681 Ops/s 78.5366 Ops/s $\color{#35bf28}+0.80\%$
test_ppo_speed[True-None] 3.8158ms 3.6237ms 275.9637 Ops/s 272.2137 Ops/s $\color{#35bf28}+1.38\%$
test_ppo_speed[True-backward] 8.7543ms 8.3835ms 119.2813 Ops/s 117.1816 Ops/s $\color{#35bf28}+1.79\%$
test_ppo_speed[reduce-overhead-None] 3.8027ms 3.6233ms 275.9907 Ops/s 274.0391 Ops/s $\color{#35bf28}+0.71\%$
test_reinforce_speed[False-None] 4.9745ms 4.6246ms 216.2353 Ops/s 214.7035 Ops/s $\color{#35bf28}+0.71\%$
test_reinforce_speed[False-backward] 7.7310ms 7.4260ms 134.6620 Ops/s 135.1591 Ops/s $\color{#d91a1a}-0.37\%$
test_reinforce_speed[True-None] 3.0943ms 2.8731ms 348.0514 Ops/s 334.5105 Ops/s $\color{#35bf28}+4.05\%$
test_reinforce_speed[True-backward] 8.2443ms 7.7281ms 129.3976 Ops/s 116.8220 Ops/s $\textbf{\color{#35bf28}+10.76\%}$
test_reinforce_speed[reduce-overhead-None] 3.0146ms 2.8582ms 349.8695 Ops/s 339.9249 Ops/s $\color{#35bf28}+2.93\%$
test_iql_speed[False-None] 24.9990ms 20.3817ms 49.0637 Ops/s 49.9612 Ops/s $\color{#d91a1a}-1.80\%$
test_iql_speed[False-backward] 35.7492ms 30.8243ms 32.4419 Ops/s 32.7725 Ops/s $\color{#d91a1a}-1.01\%$
test_iql_speed[True-None] 8.7555ms 8.5182ms 117.3952 Ops/s 111.1575 Ops/s $\textbf{\color{#35bf28}+5.61\%}$
test_iql_speed[True-backward] 17.0398ms 16.7205ms 59.8068 Ops/s 59.6576 Ops/s $\color{#35bf28}+0.25\%$
test_iql_speed[reduce-overhead-None] 8.8931ms 8.6050ms 116.2119 Ops/s 113.4854 Ops/s $\color{#35bf28}+2.40\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.2626ms 6.1170ms 163.4800 Ops/s 160.0243 Ops/s $\color{#35bf28}+2.16\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 2.1514ms 0.3076ms 3.2506 KOps/s 3.4421 KOps/s $\textbf{\color{#d91a1a}-5.56\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.5301ms 0.2953ms 3.3861 KOps/s 2.9051 KOps/s $\textbf{\color{#35bf28}+16.55\%}$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.1947ms 5.9531ms 167.9806 Ops/s 166.8636 Ops/s $\color{#35bf28}+0.67\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 2.3421ms 0.3398ms 2.9425 KOps/s 2.9778 KOps/s $\color{#d91a1a}-1.19\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.5108ms 0.3105ms 3.2202 KOps/s 2.9646 KOps/s $\textbf{\color{#35bf28}+8.62\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.5089ms 1.2922ms 773.8705 Ops/s 685.7287 Ops/s $\textbf{\color{#35bf28}+12.85\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.5222ms 1.2287ms 813.8654 Ops/s 724.0638 Ops/s $\textbf{\color{#35bf28}+12.40\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 10.2244ms 6.3318ms 157.9318 Ops/s 162.2306 Ops/s $\color{#d91a1a}-2.65\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.1664ms 0.5058ms 1.9770 KOps/s 1.9154 KOps/s $\color{#35bf28}+3.22\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7039ms 0.4762ms 2.0999 KOps/s 1.9792 KOps/s $\textbf{\color{#35bf28}+6.10\%}$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 6.0671ms 5.9860ms 167.0576 Ops/s 166.8224 Ops/s $\color{#35bf28}+0.14\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 2.3386ms 0.3382ms 2.9571 KOps/s 3.5428 KOps/s $\textbf{\color{#d91a1a}-16.53\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.4736ms 0.2654ms 3.7673 KOps/s 3.7376 KOps/s $\color{#35bf28}+0.79\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 6.1289ms 5.9370ms 168.4364 Ops/s 168.7031 Ops/s $\color{#d91a1a}-0.16\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 2.0518ms 0.2973ms 3.3636 KOps/s 2.8816 KOps/s $\textbf{\color{#35bf28}+16.73\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.6059ms 0.3094ms 3.2321 KOps/s 3.0186 KOps/s $\textbf{\color{#35bf28}+7.07\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 6.5709ms 6.1498ms 162.6069 Ops/s 163.0967 Ops/s $\color{#d91a1a}-0.30\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.1564ms 0.4664ms 2.1443 KOps/s 1.9899 KOps/s $\textbf{\color{#35bf28}+7.76\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7145ms 0.4799ms 2.0836 KOps/s 2.0099 KOps/s $\color{#35bf28}+3.67\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 0.6709s 18.4775ms 54.1200 Ops/s 55.6767 Ops/s $\color{#d91a1a}-2.80\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 11.3631ms 2.0106ms 497.3655 Ops/s 521.5782 Ops/s $\color{#d91a1a}-4.64\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 7.1116ms 1.2219ms 818.4024 Ops/s 788.1009 Ops/s $\color{#35bf28}+3.84\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 6.4934ms 5.1285ms 194.9883 Ops/s 192.4866 Ops/s $\color{#35bf28}+1.30\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 4.0254ms 1.8000ms 555.5658 Ops/s 570.3471 Ops/s $\color{#d91a1a}-2.59\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 2.9208ms 1.1678ms 856.3190 Ops/s 1.1156 KOps/s $\textbf{\color{#d91a1a}-23.24\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 0.5233s 15.7409ms 63.5289 Ops/s 56.2444 Ops/s $\textbf{\color{#35bf28}+12.95\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 4.3081ms 2.0418ms 489.7536 Ops/s 471.9836 Ops/s $\color{#35bf28}+3.76\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 12.1121ms 1.5354ms 651.3160 Ops/s 951.4083 Ops/s $\textbf{\color{#d91a1a}-31.54\%}$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-True] 38.7000ms 36.4806ms 27.4118 Ops/s 27.6035 Ops/s $\color{#d91a1a}-0.69\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-10000-10000-100-False] 20.1726ms 18.5138ms 54.0139 Ops/s 54.4383 Ops/s $\color{#d91a1a}-0.78\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-True] 41.4421ms 37.9187ms 26.3722 Ops/s 26.6670 Ops/s $\color{#d91a1a}-1.11\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-100000-10000-100-False] 20.9041ms 19.1495ms 52.2207 Ops/s 52.9276 Ops/s $\color{#d91a1a}-1.34\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-True] 41.7349ms 39.5742ms 25.2690 Ops/s 25.4669 Ops/s $\color{#d91a1a}-0.78\%$
test_rb_extend_sample[ReplayBuffer-LazyTensorStorage-RandomSampler-1000000-10000-100-False] 22.4944ms 20.6912ms 48.3298 Ops/s 47.5251 Ops/s $\color{#35bf28}+1.69\%$
test_storage_write_lazystack[50-img_shape0-small] 0.9244ms 0.2268ms 4.4090 KOps/s 4.3519 KOps/s $\color{#35bf28}+1.31\%$
test_storage_write_lazystack[100-img_shape1-atari] 1.5560ms 1.3887ms 720.1190 Ops/s 710.1624 Ops/s $\color{#35bf28}+1.40\%$
test_storage_write_lazystack[100-img_shape2-large_img] 2.6261ms 2.3293ms 429.3095 Ops/s 417.5578 Ops/s $\color{#35bf28}+2.81\%$
test_storage_write_lazystack[200-img_shape3-large_batch] 3.0676ms 2.9143ms 343.1334 Ops/s 341.1287 Ops/s $\color{#35bf28}+0.59\%$
test_storage_write_contiguous[50-img_shape0-small] 0.2150ms 0.1400ms 7.1444 KOps/s 7.1779 KOps/s $\color{#d91a1a}-0.47\%$
test_storage_write_contiguous[100-img_shape1-atari] 0.3590ms 0.2006ms 4.9842 KOps/s 5.1440 KOps/s $\color{#d91a1a}-3.11\%$
test_storage_write_contiguous[100-img_shape2-large_img] 1.9206ms 1.7679ms 565.6324 Ops/s 588.9215 Ops/s $\color{#d91a1a}-3.95\%$
test_storage_write_contiguous[200-img_shape3-large_batch] 1.5159ms 1.3199ms 757.6285 Ops/s 782.2691 Ops/s $\color{#d91a1a}-3.15\%$
test_collector_stack_then_write[50-img_shape0-small] 1.2966ms 1.1401ms 877.1381 Ops/s 877.4109 Ops/s $\color{#d91a1a}-0.03\%$
test_collector_stack_then_write[100-img_shape1-atari] 3.7205ms 3.6168ms 276.4860 Ops/s 275.5326 Ops/s $\color{#35bf28}+0.35\%$
test_collector_stack_then_write[100-img_shape2-large_img] 10.8101ms 5.8143ms 171.9890 Ops/s 173.1712 Ops/s $\color{#d91a1a}-0.68\%$
test_collector_stack_then_write[200-img_shape3-large_batch] 7.5840ms 7.0208ms 142.4333 Ops/s 142.3719 Ops/s $\color{#35bf28}+0.04\%$
test_collector_lazystack_then_write[50-img_shape0-small] 0.4733ms 0.2753ms 3.6327 KOps/s 3.6063 KOps/s $\color{#35bf28}+0.73\%$
test_collector_lazystack_then_write[100-img_shape1-atari] 1.6816ms 1.4989ms 667.1357 Ops/s 657.2712 Ops/s $\color{#35bf28}+1.50\%$
test_collector_lazystack_then_write[100-img_shape2-large_img] 2.5618ms 2.4193ms 413.3453 Ops/s 398.2725 Ops/s $\color{#35bf28}+3.78\%$
test_collector_lazystack_then_write[200-img_shape3-large_batch] 3.3883ms 3.1380ms 318.6699 Ops/s 318.1328 Ops/s $\color{#35bf28}+0.17\%$
test_collector_without_rb[100-img_shape0-atari] 35.5111ms 34.9737ms 28.5929 Ops/s 28.7912 Ops/s $\color{#d91a1a}-0.69\%$
test_collector_without_rb[200-img_shape1-large_batch] 69.5882ms 68.6075ms 14.5757 Ops/s 14.4894 Ops/s $\color{#35bf28}+0.60\%$
test_collector_with_rb[100-img_shape0-atari] 45.8965ms 44.1824ms 22.6335 Ops/s 20.7459 Ops/s $\textbf{\color{#35bf28}+9.10\%}$
test_collector_with_rb[200-img_shape1-large_batch] 99.3002ms 98.7993ms 10.1215 Ops/s 10.3131 Ops/s $\color{#d91a1a}-1.86\%$
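
The percentage in the last column can be re-derived from the two throughput columns. The sketch below assumes the columns are, in order, max time, mean time, Ops/s on this PR, Ops/s on the baseline, and relative change; the header is not shown in this part of the page, but the assumption is consistent with the rows (for example, a mean of 1.4989ms corresponds to about 667 Ops/s). The helper name `percent_change` is made up for illustration.

```python
def percent_change(ops_pr: float, ops_base: float) -> float:
    """Relative throughput change of the PR versus the baseline, in percent."""
    return (ops_pr / ops_base - 1.0) * 100.0


# test_collector_with_rb[100-img_shape0-atari]: 22.6335 vs 20.7459 Ops/s
rb_atari = percent_change(22.6335, 20.7459)

# test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400]:
# 651.3160 vs 951.4083 Ops/s
rb_populate = percent_change(651.3160, 951.4083)

# Within this run, the two collector write strategies can also be compared
# directly on the atari shape (Ops/s of lazystack_then_write over
# stack_then_write).
lazy_vs_eager = 667.1357 / 276.4860

print(f"{rb_atari:+.2f}%  {rb_populate:+.2f}%  {lazy_vs_eager:.2f}x")
```

Rows where the max time is many times the mean (such as 12.1121ms against 1.5354ms) may be dominated by outliers, so a large swing on such a row is worth re-running before drawing conclusions.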

@vmoens changed the title from "Lazy stack optimization for collector-to-buffer writes" to "[Performance] Lazy stack optimization for collector-to-buffer writes" Feb 3, 2026
@github-actions bot added the Performance label Feb 3, 2026
vmoens and others added 4 commits February 3, 2026 18:15
Optimize data collection pipeline by using lazy stacks in collectors when
a replay buffer is present, enabling single-write operations directly to
storage instead of two separate write operations.

Before:
1. Collector: torch.stack(tensordicts, out=_final_rollout) -> Write 1
2. Storage: storage[cursor] = data -> Write 2

After:
1. Collector: LazyStackedTensorDict.lazy_stack(tensordicts) -> No write
2. Storage: torch.stack(lazy.unbind(), out=storage[cursor]) -> Single write

Changes:
- TensorStorage.set() now detects LazyStackedTensorDict and uses
  torch.stack(..., out=) to write directly to storage
- Collector.rollout() uses lazy_stack when replay buffer is present
- Added tests for storage and collector integration
- Added benchmarks to measure the improvement

Co-authored-by: Cursor <cursoragent@cursor.com>
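
The two write paths described in this commit can be modeled with a toy example. This is a minimal sketch in plain Python, not the tensordict or torchrl implementation: `CountingBuffer`, `eager_stack_then_write` and `lazy_stack_then_write` are hypothetical names, and each "write" stands for copying one step's data.

```python
class CountingBuffer:
    """A preallocated list that counts how many element writes it receives."""

    def __init__(self, size):
        self.data = [None] * size
        self.writes = 0

    def write(self, index, value):
        self.data[index] = value
        self.writes += 1


def eager_stack_then_write(steps, storage, cursor):
    # Write 1: materialize the rollout in an intermediate buffer.
    rollout = CountingBuffer(len(steps))
    for i, step in enumerate(steps):
        rollout.write(i, step)
    # Write 2: copy the materialized rollout into storage.
    for i, step in enumerate(rollout.data):
        storage.write(cursor + i, step)
    return rollout.writes + storage.writes


def lazy_stack_then_write(steps, storage, cursor):
    # The "lazy stack" only keeps references to the steps; nothing is copied yet.
    lazy = tuple(steps)
    # Single write: each step goes straight to its storage slot.
    for i, step in enumerate(lazy):
        storage.write(cursor + i, step)
    return storage.writes


steps = [{"obs": t, "reward": 0.0} for t in range(4)]
eager_storage = CountingBuffer(8)
lazy_storage = CountingBuffer(8)
eager_writes = eager_stack_then_write(steps, eager_storage, cursor=2)
lazy_writes = lazy_stack_then_write(steps, lazy_storage, cursor=2)
print(eager_writes, lazy_writes)
```

Both paths leave the same content in storage; only the number of copies differs.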
The torch.stack(..., out=) approach for TensorDict doesn't work correctly.
Reverted to using the normal assignment path self._storage[cursor] = data
which handles lazy stacks through TensorDict's __setitem__.

Also simplified the test to verify data integrity more reliably.

Co-authored-by: Cursor <cursoragent@cursor.com>
Instead of torch.stack(..., out=), iterate through the lazy stack's
tensordicts and use update_() to write each directly to the corresponding
storage location. This avoids creating an intermediate contiguous copy.

The optimization only applies when stack_dim == 0 (the batch dimension),
which is the common case for collector outputs.

Co-authored-by: Cursor <cursoragent@cursor.com>
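
The per-element write described above can be sketched as follows. This is a toy model that assumes preallocated storage rows and uses `dict.update` as a stand-in for an in-place `update_()`; `ToyLazyStack` and `try_direct_write` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class ToyLazyStack:
    """Hypothetical stand-in for a lazy stack: per-step dicts plus a stack_dim."""

    items: list
    stack_dim: int = 0


def try_direct_write(storage: list, cursor: int, lazy: ToyLazyStack) -> bool:
    """Write each element of `lazy` into its own storage row, in place.

    Returns False when the fast path does not apply, so the caller can fall
    back to the regular assignment path.
    """
    if lazy.stack_dim != 0:
        return False
    for offset, item in enumerate(lazy.items):
        # In-place update of the preallocated row: no intermediate stacked copy.
        storage[cursor + offset].update(item)
    return True


storage = [{"obs": None, "reward": None} for _ in range(5)]
row_ids = [id(row) for row in storage]
lazy = ToyLazyStack(items=[{"obs": t, "reward": float(t)} for t in range(3)])
used_fast_path = try_direct_write(storage, cursor=1, lazy=lazy)
```

The storage rows keep their identity after the call, which is the property the commit relies on: data lands in the existing storage rather than in a freshly built contiguous copy.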
For slice indices, storage[slice] returns a view, so we can use
_stack_onto_ to copy directly from the lazy stack's tensordicts.

For non-contiguous tensor indices, we continue to iterate and
update each element individually since storage[tensor] returns a copy.

Co-authored-by: Cursor <cursoragent@cursor.com>
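
The view-versus-copy distinction behind this commit can be illustrated with the standard library alone. The sketch uses `memoryview` over a `bytearray` as an analogy for slicing a storage tensor; it is not tensordict code, and `_stack_onto_` itself is not shown.

```python
storage = bytearray(8)

# A slice of a memoryview is a view: writing to it writes into storage.
view = memoryview(storage)[2:5]
view[:] = bytes([1, 2, 3])

# Gathering arbitrary indices builds a new object: a copy.
indices = [0, 6, 7]
gathered = bytearray(storage[i] for i in indices)
gathered[:] = bytes([9, 9, 9])      # modifies the copy only
after_copy_write = list(storage)    # storage is unchanged at 0, 6, 7

# To write through non-contiguous indices, assign element by element.
for i, value in zip(indices, (7, 8, 9)):
    storage[i] = value

print(after_copy_write, list(storage))
```

This mirrors why the slice case can be filled in one pass while the tensor-index case falls back to per-element updates.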
@vmoens vmoens force-pushed the feat/lazy-stack-collector-optimization branch from bacb6c5 to b54982c Compare February 3, 2026 18:15
Extend the lazy stack optimization in TensorStorage.set() to handle
any stack_dim, not just stack_dim=0. This is important for parallel
environments where the storage is 2D [max_size, n_steps] and the
lazy stack has stack_dim=1 (time dimension).

Changes:
- Use _stack_onto_ for slices with any stack_dim
- For tensor indices with stack_dim>0, check if contiguous and convert to slice
- Add tests for 2D storage with lazy stack (stack_dim=1)
- Add collector integration test with parallel envs

Co-authored-by: Cursor <cursoragent@cursor.com>
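
The "check if contiguous and convert to slice" step can be sketched in a few lines. This is an illustrative helper over a plain list of integers, under the assumption that "contiguous" means consecutive ascending indices; `as_slice` is a hypothetical name and the actual check in the PR may differ.

```python
def as_slice(indices):
    """Return slice(start, stop) if `indices` are consecutive ascending ints, else None."""
    if not indices:
        return None
    start = indices[0]
    for offset, idx in enumerate(indices):
        if idx != start + offset:
            return None
    return slice(start, start + len(indices))


rows = list(range(10))
contiguous = as_slice([3, 4, 5])      # usable as a slice, so the view path applies
scattered = as_slice([3, 5, 6])       # not contiguous, keep per-element updates
selected = rows[contiguous] if contiguous is not None else None
```

When the helper returns a slice, indexing the storage with it yields a view, so the write can go through the same path as a regular slice index.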

Labels

- Benchmarks: rl/benchmark changes
- CLA Signed: This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
- Collectors
- Performance: Performance issue or suggestion for improvement
- ReplayBuffers

2 participants