RLM: Fix trajectory collision #786
base: main
Conversation
This reverts commit a1052f8.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
hmmm would it be possible to have this handled by overriding add_trajectory_step from MultiTurnEnv? maybe we still want to log those steps, but they shouldn't be in the main sequence in state['trajectory'] if they won't be used for training

IMO this feels like some of the RLM logic is creeping a bit too low into the stack (base Environment shouldn't know what an RLM is or have to think about it), and preparing the context for the API call should probably be handled by get_prompt_messages. If the RLM environment promises that get_prompt_messages only ever contains "increasing" sequences of messages on the subset of turns where a step will be added to state['trajectory'], then the old/current approach should work without changes I think?

In general though, I'm a bit skeptical about trying to shoehorn RLMs into the interleaved strategy. Ultimately we want a single "best of both worlds" strategy which is always TITO, always "just works" with get_prompt_messages, and aggressively interleaves until a message sequence forces a branch
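A rough sketch of what that override could look like. The add_trajectory_step signature, the trajectory_id field, and the sub_llm_steps bucket are assumptions for illustration, not the actual verifiers API:

```python
from typing import Any

import verifiers as vf  # assumes MultiTurnEnv is exposed at the package top level


class RLMEnvSketch(vf.MultiTurnEnv):
    """Illustration only: keep sub-LLM steps out of state['trajectory']."""

    def add_trajectory_step(self, state: dict[str, Any], step: dict[str, Any]) -> None:
        # Steps produced by sub-LLM calls are still logged, but in a side list,
        # so the main sequence used for training tokens stays clean.
        if step.get("trajectory_id") != state.get("trajectory_id"):
            state.setdefault("sub_llm_steps", []).append(step)
            return
        super().add_trajectory_step(state, step)
```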
Description
- Fixes the trajectory collision in RLMEnv by filtering to main-trajectory steps, preventing sub-LLM steps from corrupting main-LLM tokenization (the root cause of the /tokenize empty-messages errors in training); a sketch of the filtering idea follows the Note section below.
- Routes sub-LLM calls through Environment._call_model_api, so main and sub-LLMs share the same request/validation path; see the sketch after the checklist.
- Adds tests/test_rlm_env.py.

Type of Change

Testing

- Ran uv run pytest locally.

Checklist
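The shared call path mentioned in the description could look roughly like the following. This is a minimal sketch assuming an OpenAI-compatible async client; the method body, parameters, and validation are illustrative, not the actual implementation in this PR:

```python
from typing import Any

from openai import AsyncOpenAI


class Environment:
    """Sketch only: a single choke point for all model requests."""

    def __init__(self, client: AsyncOpenAI, model: str):
        self.client = client
        self.model = model

    async def _call_model_api(
        self, messages: list[dict[str, Any]], **sampling_args: Any
    ):
        # One place to validate requests for both the main LLM and sub-LLMs,
        # so problems like empty message lists are caught before they reach
        # the server's tokenization path.
        if not messages:
            raise ValueError("refusing to send an empty message list")
        return await self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            **sampling_args,
        )
```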
Note
- Centralizes model API calls in Environment._call_model_api, used by both main and sub-LLM paths (chat, tokens endpoint), with overlong-prompt handling.
- RLMEnv.get_model_response filters state["trajectory"] to the main trajectory_id before get_prompt_ids; sub-LLM calls keep a minimal sub_state to compute incremental prompt_ids across tool turns.
- Uses /chat/completions/tokens with return_token_ids, and avoids direct chat.completions.create when interleaved. Adds tests for tokens endpoint usage, message normalization, and incremental prompt IDs.
- Dataset examples include example_id and task; removes the eval dataset/num_eval_examples. The README notes eval via re-instantiating with a different seed.

Written by Cursor Bugbot for commit fd52bee. This will update automatically on new commits.
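As a rough illustration of the filtering described above (the step and state field names below are assumptions based on this summary, not an excerpt of the actual change):

```python
from typing import Any


def main_trajectory_steps(state: dict[str, Any]) -> list[dict[str, Any]]:
    """Keep only steps belonging to the main trajectory before tokenization."""
    main_id = state["trajectory_id"]
    return [
        step
        for step in state["trajectory"]
        if step.get("trajectory_id") == main_id
    ]


# Conceptually: prompt_ids for the main LLM are then computed from the filtered
# steps only, while each sub-LLM call tracks its own minimal sub_state, so its
# incremental prompt_ids never leak into the main-LLM sequence.
```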