Fix advantage calculation off-by-one error (#363) #445

jacobarrio · 2025-12-30T01:24:27Z

Fixes #363

Changed buffer shapes from (segments, horizon) to (segments, horizon + 1). With horizon = 64, we now store 65 states, which gives us 64 full transitions instead of just 63.

Also changed the rollout loop condition from >= to > at line 300 so the loop actually collects that last transition.

Before this fix, advantages[:, -1] was always 0, so the last sample in every segment had broken gradients: pg_loss ≈ 0, the entropy term pushed the policy toward random, and the value function got conservative updates.

Now all 64 samples get proper advantage calculations.
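To illustrate why horizon + 1 stored states are needed for horizon advantages, here is a minimal single-segment sketch of GAE. It is not the code from this PR: the function name `gae`, its arguments, and the toy inputs are all hypothetical. The point is only that each transition t needs the value of state t + 1, so the last transition needs one extra stored state to bootstrap from.

```python
def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one segment (illustrative sketch).

    rewards, dones: length T, one entry per transition.
    values: length T + 1, one entry per stored state, including the
            final state that is only used for bootstrapping.
    """
    T = len(rewards)
    assert len(values) == T + 1, "need horizon + 1 states for horizon transitions"
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])
        # TD error for transition t uses the value of the next state, t + 1.
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        advantages[t] = running
    return advantages


# Toy inputs (assumed, not from the PR): horizon of 64, reward 1 per step.
horizon = 64
rewards = [1.0] * horizon
dones = [False] * horizon
values = [0.0] * (horizon + 1)  # 65 stored states

adv = gae(rewards, values, dones, gamma=1.0, lam=1.0)
print(len(adv), adv[-1])
```

With only 64 stored states there would be no values[64] to bootstrap from, so the final transition could not get a TD error, which matches the advantages[:, -1] == 0 symptom described above.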

Closes #363

fix advantage off-by-one: store horizon+1 states for horizon transitions

430467c
