
Training hangs at actor-infer step with Qwen3-8B on an 8-GPU node #329

@UsernameFull

Description

I attached pystack to the hung process to see where it is blocked; the captured stack is as follows:

File "/root/miniconda3/env/roll/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
File "/root/miniconda3/env/roll/python3.11/site-packages/vllm/utils/__init__.py", line 2985, in run_method
    return func(*args, **kwargs)
File "/home/ROLL/roll/third_party/vllm/worker_helper.py", line 80, in setup_collective_group
    collective.allreduce(torch.zeros(1).to(current_platform.device_type), group_name=group_name)
File "/home/ROLL/roll/third_party/utils/collective/collective.py", line 84, in allreduce
    dist.all_reduce(tensor, op=op, group=_group_mgr.get_group_by_name(group_name))
File "/root/miniconda3/env/roll/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
File "/root/miniconda3/env/roll/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2180, in all_reduce
    work = group.allreduce([tensor], opts)
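
The bottom frames show the actor-infer vLLM worker blocked inside a warm-up all_reduce issued while setting up a named collective group. As a rough illustration only (illustrative names, default env:// rendezvous assumed; this is not the ROLL or vLLM code), the blocking pattern is: the call returns only after every rank expected in the group has called in, so a missing or mis-addressed rank leaves the worker waiting forever.

# Rough sketch only (illustrative, not ROLL's API): a warm-up all_reduce
# over a fresh process group blocks until every expected rank has joined.
import torch
import torch.distributed as dist

def warmup_allreduce(rank: int, world_size: int) -> None:
    # Assumes MASTER_ADDR / MASTER_PORT are exported for the default env:// rendezvous.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank % torch.cuda.device_count())
    tensor = torch.zeros(1, device="cuda")
    # Mirrors the traced setup_collective_group call: all_reduce on torch.zeros(1).
    dist.all_reduce(tensor)  # hangs here if any of the world_size ranks never calls in
    dist.destroy_process_group()

If the world size or rank list used for this group does not match the workers that actually reach setup_collective_group, a hang with exactly this stack is the expected symptom.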

Here is the training config used:

defaults:
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

pg_variant: topr # topr, vanilla, tis, cispo, kimi15, ppo
exp_name: Qwen3-8B-RLVR-${pg_variant}
seed: 42
logging_dir: ./output/logs
output_dir: ./output
system_envs:
  USE_MODELSCOPE: '1'

checkpoint_config:
  type: file_system
  output_dir: /data/cpfs_0/rl_examples/models/${exp_name}

num_gpus_per_node: 8

max_steps: 500
save_steps: 100
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false


rollout_batch_size: 128  # prompt
prompt_length: 2048
response_length: 8192

num_return_sequences_in_group: 8
ppo_epochs: 1
adv_estimator: "reinforce"

# clip
value_clip: 0.5
reward_clip: 10
advantage_clip: 2.0
dual_clip_loss: true

# normalize
norm_mean_type: batch
norm_std_type: batch

# data mask
max_len_mask: true
difficulty_mask: true
difficulty_low_threshold: 0.1
difficulty_high_threshold: 0.95
error_max_len_clip: false

# data weight
difficulty_loss_weight: false
length_loss_weight: false

# reward
add_token_level_kl: false

# advantage
whiten_advantages: true


pretrain: Qwen/Qwen3-8B
reward_pretrain: Qwen/Qwen3-8B

validation:
  data_args:
    template: qwen3
    file_name:
      - data/math_benchmarks.jsonl
  generating_args:
    top_p: 0.6
    top_k: 50
    num_beams: 1
    temperature: 0.6
    num_return_sequences: 1
  eval_steps: 10

actor_train:
  worker_cls: roll.pipeline.rlvr.actor_pg_worker.ActorPGWorker
  pg_variant: topr # topr, vanilla, tis, cispo, kimi15, ppo
  model_args:
    flash_attn: fa2
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 64
    warmup_steps: 20
    num_train_epochs: 50
  data_args:
    template: qwen2_5
    file_name:
      - data/math_deepmath_deal.jsonl
    domain_interleave_probs:
      math_rule: 1
    dataset_dir: data
    messages: messages
    interleave_probs: "1.0"
    preprocessing_num_workers: 16
  strategy_args:
    strategy_name: deepspeed_train
    strategy_config: ${deepspeed_zero3}
  device_mapping: list(range(0,4))
  infer_batch_size: 4

actor_infer:
  model_args:
    flash_attn: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  data_args:
    template: qwen3
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.6
      block_size: 16
      max_model_len: 8000
  device_mapping: list(range(4,6))
  infer_batch_size: 1

reference:
  model_args:
    flash_attn: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: hf_infer
    strategy_config: ~
  device_mapping: list(range(6,8))
  infer_batch_size: 8

rewards:
  math_rule:
    worker_cls: roll.pipeline.rlvr.rewards.math_rule_reward_worker.MathRuleRewardWorker
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen2_5
    tag_included: [deepmath_103k, 'MATH-500', 'OlympiadBench', 'minervamath', 'aime2025', 'gsm8k', 'aime', 'amc23', 'math_rule']
    world_size: 8
    infer_batch_size: 1
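
For clarity, here is how this config splits the node's 8 GPUs across roles, simply evaluating the device_mapping expressions above (a quick check, not part of ROLL):

# Evaluate the device_mapping expressions from the config.
mappings = {
    "actor_train": list(range(0, 4)),
    "actor_infer": list(range(4, 6)),
    "reference":   list(range(6, 8)),
}
for role, gpus in mappings.items():
    print(role, "->", gpus)
# actor_train -> [0, 1, 2, 3]
# actor_infer -> [4, 5]
# reference   -> [6, 7]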
