
@jacobarrio

Fixes #363

Changed the buffer shapes from (segments, horizon) to (segments, horizon + 1). We now store 65 states, which yields 64 full transitions instead of 63.
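A minimal sketch of the shape change, assuming NumPy buffers and illustrative names (`obs_buf`, `reward_buf`, etc. are not the repo's actual field names):

```python
import numpy as np

segments, horizon, obs_dim = 8, 64, 4

# States need one extra slot: transition t uses states[t] and states[t + 1],
# so horizon transitions require horizon + 1 stored states.
obs_buf = np.zeros((segments, horizon + 1, obs_dim))  # was (segments, horizon, obs_dim)
value_buf = np.zeros((segments, horizon + 1))         # was (segments, horizon)

# Per-transition quantities keep the original horizon-length shape.
action_buf = np.zeros((segments, horizon), dtype=np.int64)
reward_buf = np.zeros((segments, horizon))

# horizon + 1 stored states -> horizon full (s, a, r, s') transitions.
assert obs_buf.shape[1] - 1 == reward_buf.shape[1] == horizon
```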

Also changed the rollout loop condition at line 300 from >= to > so it actually collects that last transition.
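The off-by-one in the loop boundary can be illustrated like this (hypothetical loop structure; the actual code at line 300 may differ):

```python
horizon = 64
stored_states = []
step = 0
while True:
    # With the old condition `step >= horizon`, the loop exited after
    # storing states 0..63, so the pair (states[63], states[64]) was
    # never completed and the final transition was lost.
    if step > horizon:  # was: step >= horizon
        break
    stored_states.append(step)  # stand-in for storing state `step`
    step += 1

# States 0..64 stored -> 64 full transitions.
assert len(stored_states) == horizon + 1
```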

Before this fix, advantages[:, -1] was always 0, so the last sample in every segment produced broken gradients: pg_loss ≈ 0, the entropy term pushed the policy toward random actions, and the value function received overly conservative updates.

Now all 64 samples get proper advantage calculations.
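A sketch of why the extra stored state matters, assuming a standard GAE backward pass (names and hyperparameters here are illustrative, not taken from the repo):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation.

    `values` has len(rewards) + 1 entries: one value per stored state,
    including the bootstrap value for the final (horizon + 1)-th state.
    """
    horizon = len(rewards)
    adv = np.zeros(horizon)
    last = 0.0
    for t in reversed(range(horizon)):
        # TD error at step t needs values[t + 1]; for t = horizon - 1 that
        # is exactly the bootstrap value the old buffer never stored.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

rewards = np.ones(64)
values = np.full(65, 0.5)  # horizon + 1 values, including the bootstrap
adv = gae(rewards, values)

# With the bootstrap value present, the final advantage is a real TD error
# rather than the hard-coded zero it effectively was before the fix.
assert adv[-1] != 0.0
```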

Closes #363
