
Training hangs at actor-infer step with Qwen3-8B on an 8-GPU node #329

@UsernameFull

Description

I attached pystack to the hung process to see where it is blocked; the captured stack is as follows:

File "/root/miniconda3/env/roll/python3.11/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
File "/root/miniconda3/env/roll/python3.11/site-packages/vllm/utils/__init__.py", line 2985, in run_method
    return func(*args, **kwargs)
File "/home/ROLL/roll/third_party/vllm/worker_helper.py", line 80, in setup_collective_group
    collective.allreduce(torch.zeros(1).to(current_platform.device_type), group_name=group_name)
File "/home/ROLL/roll/third_party/utils/collective/collective.py", line 84, in allreduce
    dist.all_reduce(tensor, op=op, group=_group_mgr.get_group_by_name(group_name))
File "/root/miniconda3/env/roll/python3.11/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
File "/root/miniconda3/env/roll/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2180, in all_reduce
    work = group.allreduce([tensor], opts)
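
The bottom frames show the actor-infer vLLM worker blocked inside a warm-up all_reduce issued while setting up a named collective group. As a rough illustration only (illustrative names, default env:// rendezvous assumed; this is not the ROLL or vLLM code), the blocking pattern is: the call returns only after every rank expected in the group has called in, so a missing or mis-addressed rank leaves the worker waiting forever.

# Rough sketch only (illustrative, not ROLL's API): a warm-up all_reduce
# over a fresh process group blocks until every expected rank has joined.
import torch
import torch.distributed as dist

def warmup_allreduce(rank: int, world_size: int) -> None:
    # Assumes MASTER_ADDR / MASTER_PORT are exported for the default env:// rendezvous.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank % torch.cuda.device_count())
    tensor = torch.zeros(1, device="cuda")
    # Mirrors the traced setup_collective_group call: all_reduce on torch.zeros(1).
    dist.all_reduce(tensor)  # hangs here if any of the world_size ranks never calls in
    dist.destroy_process_group()

If the world size or rank list used for this group does not match the workers that actually reach setup_collective_group, a hang with exactly this stack is the expected symptom.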

Here is the training config used:

defaults:
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

pg_variant: topr # topr, vanilla, tis, cispo, kimi15, ppo
exp_name: Qwen3-8B-RLVR-${pg_variant}
seed: 42
logging_dir: ./output/logs
output_dir: ./output
system_envs:
  USE_MODELSCOPE: '1'

checkpoint_config:
  type: file_system
  output_dir: /data/cpfs_0/rl_examples/models/${exp_name}

num_gpus_per_node: 8

max_steps: 500
save_steps: 100
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false


rollout_batch_size: 128  # prompt
prompt_length: 2048
response_length: 8192

num_return_sequences_in_group: 8
ppo_epochs: 1
adv_estimator: "reinforce"

# clip
value_clip: 0.5
reward_clip: 10
advantage_clip: 2.0
dual_clip_loss: true

# normalize
norm_mean_type: batch
norm_std_type: batch

# data mask
max_len_mask: true
difficulty_mask: true
difficulty_low_threshold: 0.1
difficulty_high_threshold: 0.95
error_max_len_clip: false

# data weight
difficulty_loss_weight: false
length_loss_weight: false

# reward
add_token_level_kl: false

# advantage
whiten_advantages: true


pretrain: Qwen/Qwen3-8B
reward_pretrain: Qwen/Qwen3-8B

validation:
  data_args:
    template: qwen3
    file_name:
      - data/math_benchmarks.jsonl
  generating_args:
    top_p: 0.6
    top_k: 50
    num_beams: 1
    temperature: 0.6
    num_return_sequences: 1
  eval_steps: 10

actor_train:
  worker_cls: roll.pipeline.rlvr.actor_pg_worker.ActorPGWorker
  pg_variant: topr # topr, vanilla, tis, cispo, kimi15, ppo
  model_args:
    flash_attn: fa2
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 64
    warmup_steps: 20
    num_train_epochs: 50
  data_args:
    template: qwen2_5
    file_name:
      - data/math_deepmath_deal.jsonl
    domain_interleave_probs:
      math_rule: 1
    dataset_dir: data
    messages: messages
    interleave_probs: "1.0"
    preprocessing_num_workers: 16
  strategy_args:
    strategy_name: deepspeed_train
    strategy_config: ${deepspeed_zero3}
  device_mapping: list(range(0,4))
  infer_batch_size: 4

actor_infer:
  model_args:
    flash_attn: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  data_args:
    template: qwen3
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.6
      block_size: 16
      max_model_len: 8000
  device_mapping: list(range(4,6))
  infer_batch_size: 1

reference:
  model_args:
    flash_attn: fa2
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: hf_infer
    strategy_config: ~
  device_mapping: list(range(6,8))
  infer_batch_size: 8

rewards:
  math_rule:
    worker_cls: roll.pipeline.rlvr.rewards.math_rule_reward_worker.MathRuleRewardWorker
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen2_5
    tag_included: [deepmath_103k, 'MATH-500', 'OlympiadBench', 'minervamath', 'aime2025', 'gsm8k', 'aime', 'amc23', 'math_rule']
    world_size: 8
    infer_batch_size: 1
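
For clarity, here is how this config splits the node's 8 GPUs across roles, simply evaluating the device_mapping expressions above (a quick check, not part of ROLL):

# Evaluate the device_mapping expressions from the config.
mappings = {
    "actor_train": list(range(0, 4)),
    "actor_infer": list(range(4, 6)),
    "reference":   list(range(6, 8)),
}
for role, gpus in mappings.items():
    print(role, "->", gpus)
# actor_train -> [0, 1, 2, 3]
# actor_infer -> [4, 5]
# reference   -> [6, 7]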
