AssertionError in _log_rollout_data when training Qwen3-VL-8B-Instruct with true_on_policy_mode

Hello! First of all, thank you for the great work on this framework.

I encountered an assertion error while running the true_on_policy_vlm example. Details are below.

### Environment & Steps to Reproduce

Codebase version: slime @ 0934a0e

Command executed:

`SLIME_SCRIPT_MODEL_NAME=Qwen3-VL-8B-Instruct SLIME_SCRIPT_NUM_GPUS=8 python examples/true_on_policy_vlm/run_simple.py`

### Error Description

Running the command above raises an AssertionError in `_log_rollout_data`. With true_on_policy_mode enabled, the logged `log_probs` (-0.2051895260810852) does not equal `rollout_log_probs` (-0.20450712740421295), a gap of roughly 6.8e-4.

```
Traceback (most recent call last):
  File "/root/slime/train.py", line 106, in <module>
    train(args)
  File "/root/slime/train.py", line 79, in train
    ray.get(actor_model.async_train(rollout_id, rollout_data_ref))
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2972, in get
    values, debugger_breakpoint = worker.get_objects(
                                  ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1031, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::FSDPTrainRayActor.train() (pid=858541, ip=10.102.98.154, actor_id=61e27bdd7141b62f9b9a4da302000000, repr=<slime.backends.fsdp_utils.actor.FSDPTrainRayActor object at 0x7fd9e7159430>)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/backends/fsdp_utils/actor.py", line 486, in train
    self._train_core(rollout_id=rollout_id, rollout_data=rollout_data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/backends/fsdp_utils/actor.py", line 550, in _train_core
    self._log_rollout_data(rollout_id, rollout_data, packed_batches)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/slime/slime/backends/fsdp_utils/actor.py", line 525, in _log_rollout_data
    assert log_dict["rollout/log_probs"] == log_dict["rollout/rollout_log_probs"], (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: CI check failed: true_on_policy_mode is enabled, but log_probs (-0.2051895260810852) != rollout_log_probs (-0.20450712740421295)
```
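To put a number on the mismatch, here is a minimal sketch that uses only the two values printed in the assertion message. It is plain Python and does not call into slime. The 1e-3 relative tolerance is an arbitrary value I picked for illustration, not something the framework uses.

```python
import math

# Values copied verbatim from the AssertionError message.
log_probs = -0.2051895260810852
rollout_log_probs = -0.20450712740421295

# Absolute and relative size of the gap between the two logged means.
abs_diff = abs(log_probs - rollout_log_probs)
rel_diff = abs_diff / abs(rollout_log_probs)

print(f"abs diff: {abs_diff:.3e}")
print(f"rel diff: {rel_diff:.3e}")

# The CI check in the traceback uses exact equality (==).
print("exactly equal:", log_probs == rollout_log_probs)

# Illustrative tolerance check; rel_tol=1e-3 is an assumption for this sketch.
print("close at rel_tol=1e-3:", math.isclose(log_probs, rollout_log_probs, rel_tol=1e-3))
```

The two values differ even under this loose tolerance, so the failure does not look like a last-digit rounding difference in the logged metric.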

### Additional Context

Notably, under the same configuration, the `Qwen3-VL-2B-Instruct` and `Qwen3-VL-4B-Instruct` models run successfully without this error. Of the three sizes I tried, only `Qwen3-VL-8B-Instruct` fails.

Are there any known solutions or directions for troubleshooting? Please let me know if you need more logs or environment details.
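In case it helps narrow this down, one diagnostic I could run is a per-token comparison of the two log-prob sequences, to see where they first diverge. The sketch below assumes I can dump both sequences as plain lists of floats. `first_divergence` and the sample values are hypothetical and made up for illustration; they are not part of slime and not taken from my run.

```python
from typing import Optional, Sequence


def first_divergence(
    train_lp: Sequence[float], rollout_lp: Sequence[float]
) -> Optional[int]:
    """Return the index of the first position where the two sequences
    are not exactly equal, or None if they match everywhere."""
    if len(train_lp) != len(rollout_lp):
        raise ValueError(
            f"length mismatch: {len(train_lp)} vs {len(rollout_lp)}"
        )
    for i, (a, b) in enumerate(zip(train_lp, rollout_lp)):
        if a != b:  # exact comparison, mirroring the == in the CI check
            return i
    return None


# Made-up per-token log-probs, for illustration only.
train = [-0.10, -0.25, -0.31, -0.07]
rollout = [-0.10, -0.25, -0.30, -0.07]

print("first divergent token index:", first_divergence(train, rollout))
```

If the first divergent index lines up with a particular token type or position, that might point to the component that behaves differently between rollout and training for the 8B model.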

Thank you for your attention and help!
