RLM: Fix trajectory collision #786
base: main
Conversation
This reverts commit a1052f8.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
hmmm would it be possible to have this handled by overriding add_trajectory_step from MultiTurnEnv? maybe we still want to log those steps, but they shouldn't be in the main sequence in state['trajectory'] if they won't be used for training

IMO this feels like some of the RLM logic is creeping a bit too low into the stack (base Environment shouldn't know what an RLM is or have to think about it), and preparing the context for the API call should probably be handled by get_prompt_messages. If the RLM environment promises that get_prompt_messages only ever contains "increasing" sequences of messages on the subset of turns where a step will be added to state['trajectory'], then the old/current approach should work without changes I think?

In general though, I'm a bit skeptical about trying to shoehorn RLMs into the interleaved strategy. Ultimately we want a single "best of both worlds" strategy which is always TITO, always "just works" with get_prompt_messages, and aggressively interleaves until a message sequence forces a branch
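A rough sketch of what that override could look like. The add_trajectory_step signature, the trajectory_id field, and the sub_llm_steps bucket are assumptions for illustration, not the actual verifiers API:

```python
from typing import Any

import verifiers as vf  # assumes MultiTurnEnv is exposed at the package top level


class RLMEnvSketch(vf.MultiTurnEnv):
    """Illustration only: keep sub-LLM steps out of state['trajectory']."""

    def add_trajectory_step(self, state: dict[str, Any], step: dict[str, Any]) -> None:
        # Steps produced by sub-LLM calls are still logged, but in a side list,
        # so the main sequence used for training tokens stays clean.
        if step.get("trajectory_id") != state.get("trajectory_id"):
            state.setdefault("sub_llm_steps", []).append(step)
            return
        super().add_trajectory_step(state, step)
```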
Description
- Fixes the trajectory collision in RLMEnv by filtering to main-trajectory steps, preventing sub-LLM steps from corrupting main-LLM tokenization (the root cause of the /tokenize empty-messages errors in training); a sketch of the filtering idea follows the Note section below.
- Routes sub-LLM calls through Environment._call_model_api, so main and sub-LLMs share the same request/validation path; see the sketch after the checklist.
- Adds tests/test_rlm_env.py.

Type of Change

Testing

- Ran uv run pytest locally.

Checklist
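The shared call path mentioned in the description could look roughly like the following. This is a minimal sketch assuming an OpenAI-compatible async client; the method body, parameters, and validation are illustrative, not the actual implementation in this PR:

```python
from typing import Any

from openai import AsyncOpenAI


class Environment:
    """Sketch only: a single choke point for all model requests."""

    def __init__(self, client: AsyncOpenAI, model: str):
        self.client = client
        self.model = model

    async def _call_model_api(
        self, messages: list[dict[str, Any]], **sampling_args: Any
    ):
        # One place to validate requests for both the main LLM and sub-LLMs,
        # so problems like empty message lists are caught before they reach
        # the server's tokenization path.
        if not messages:
            raise ValueError("refusing to send an empty message list")
        return await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            **sampling_args,
        )
```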
Note
- Centralizes model API calls in Environment._call_model_api, used by both main and sub-LLM paths (chat, tokens endpoint), with overlong-prompt handling.
- RLMEnv.get_model_response filters state["trajectory"] to the main trajectory_id before get_prompt_ids; sub-LLM calls keep a minimal sub_state to compute incremental prompt_ids across tool turns.
- Uses /chat/completions/tokens with return_token_ids, and avoids direct chat.completions.create when interleaved. Adds tests for tokens endpoint usage, message normalization, and incremental prompt IDs.
- Dataset examples include example_id and task; removes the eval dataset/num_eval_examples. The README notes eval via re-instantiating with a different seed.

Written by Cursor Bugbot for commit fd52bee. This will update automatically on new commits.
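As a rough illustration of the filtering described above (the step and state field names below are assumptions based on this summary, not an excerpt of the actual change):

```python
from typing import Any


def main_trajectory_steps(state: dict[str, Any]) -> list[dict[str, Any]]:
    """Keep only steps belonging to the main trajectory before tokenization."""
    main_id = state["trajectory_id"]
    return [
        step
        for step in state["trajectory"]
        if step.get("trajectory_id") == main_id
    ]


# Conceptually: prompt_ids for the main LLM are then computed from the filtered
# steps only, while each sub-LLM call tracks its own minimal sub_state, so its
# incremental prompt_ids never leak into the main-LLM sequence.
```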