Description
I'm trying to get the pipeline running end to end on a single GPU, but I hit an error that I don't know how to resolve. The config is based on example_grpo with modifications. Also, which strategy is recommended for single-GPU runs? Looking forward to, and thankful for, the author's reply.
Error log:
```
(ActorWorker(reference-0) pid=136685) [2026-01-11 21:25:34] [context_managers.py (41)] [INFO] [reference-0 0 / 1][PID 136685] reference/compute_log_probs_start_offload, memory allocated (GB): 0.0, memory reserved (GB): 0.0, memory max reserved (GB): 0.0, rss (GB): 0.9964942932128906 memory device used (GB): 4.25604248046875
(ActorWorker(reference-0) pid=136685) /root/miniconda3/envs/roll/lib/python3.10/site-packages/torch/cuda/memory.py:491: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
(ActorWorker(reference-0) pid=136685)   warnings.warn(
(ActorWorker(reference-0) pid=136685) /root/miniconda3/envs/roll/lib/python3.10/site-packages/torch/cuda/memory.py:517: FutureWarning: torch.cuda.reset_max_memory_cached now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
(ActorWorker(reference-0) pid=136685)   warnings.warn(
(ActorWorker(reference-0) pid=136685) [2026-01-11 21:25:35] [context_managers.py (41)] [INFO] [reference-0 0 / 1][PID 136685] reference/compute_log_probs_start_onload, memory allocated (GB): 0.0, memory reserved (GB): 0.0, memory max reserved (GB): 0.0, rss (GB): 0.9965057373046875 memory device used (GB): 17.13885498046875
(ActorWorker(reference-0) pid=136685) [2026-01-11 21:25:35] [context_managers.py (41)] [INFO] [reference-0 0 / 1][PID 136685] reference/compute_log_probs_end_onload, memory allocated (GB): 0.0028085708618164062, memory reserved (GB): 0.00390625, memory max reserved (GB): 0.00390625, rss (GB): 1.0738525390625 memory device used (GB): 17.52490234375
(ActorWorker(reference-0) pid=136685) [2026-01-11 21:25:35] [context_managers.py (41)] [INFO] [reference-0 0 / 1][PID 136685] reference/compute_log_probs_end_offload, memory allocated (GB): 0.0028085708618164062, memory reserved (GB): 0.00390625, memory max reserved (GB): 0.00390625, rss (GB): 1.0738525390625 memory device used (GB): 4.64208984375
Traceback (most recent call last):
  File "/root/ROLL/examples/start_rlvr_pipeline.py", line 36, in <module>
    main()
  File "/root/ROLL/examples/start_rlvr_pipeline.py", line 32, in main
    pipeline.run()
  File "/root/miniconda3/envs/roll/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/root/ROLL/roll/pipeline/rlvr/rlvr_pipeline.py", line 568, in run
    ref_log_probs.rename(old_keys="log_probs", new_keys="ref_log_probs")
  File "/root/ROLL/roll/distributed/scheduler/protocol.py", line 542, in rename
    self.batch.rename_key_(tuple(old_keys), tuple(new_keys))
AttributeError: 'NoneType' object has no attribute 'rename_key_'
(MathRuleRewardWorker(reward-math_rule-1) pid=137269) [2026-01-11 21:25:33] [worker.py (154)] [WARNING] [reward-math_rule-1 1 / 8][PID 137269] worker has not strategy [repeated 8x across cluster]
(MathRuleRewardWorker(reward-math_rule-4) pid=137272) [2026-01-11 21:25:34] [platform.py (89)] [WARNING] [reward-math_rule-4 4 / 8][PID 137272] Current platform cpu does not have 'empty_cache' attribute. [repeated 8x across cluster]
(ActorWorker(reference-0) pid=136685) Elapsed time: 0.8141 seconds [repeated 4x across cluster]
(ActorWorker(reference-0) pid=136685) (EngineCore_DP0 pid=140470) INFO 01-11 21:25:35 [block_pool.py:292] Successfully reset prefix cache [repeated 2x across cluster]
(ActorWorker(reference-0) pid=136685) (EngineCore_DP0 pid=140470) INFO 01-11 21:25:35 [gpu_worker.py:116] Sleep mode freed 12.88 GiB memory, 4.02 GiB memory is still in use.
```
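My reading of the traceback: `rename()` immediately dereferences `self.batch`, so the error means the object returned for `ref_log_probs` came back with `batch=None`, i.e. the reference role apparently produced no `log_probs` at all on this step. A minimal standalone sketch of that failure pattern, just to show what I think is happening (this is a stand-in class, not ROLL's actual `DataProto`):

```python
# Minimal sketch of the failure pattern in the traceback above.
# FakeDataProto is a stand-in, NOT roll.distributed.scheduler.protocol.DataProto.
class FakeDataProto:
    def __init__(self, batch=None):
        # batch would normally hold the tensor dict, including "log_probs"
        self.batch = batch

    def rename(self, old_keys, new_keys):
        # same shape as protocol.py line 542: crashes whenever batch is None
        self.batch.rename_key_(old_keys, new_keys)


ref_log_probs = FakeDataProto(batch=None)  # what the pipeline apparently received
ref_log_probs.rename("log_probs", "ref_log_probs")
# AttributeError: 'NoneType' object has no attribute 'rename_key_'
```

So my guess is that the reference worker never returned a populated batch under this config, which is why I suspect my `reference.strategy_args` below, but I'm not sure.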
Configuration:
```yaml
defaults:
  - ../config/deepspeed_zero@_here_
  - ../config/deepspeed_zero2@_here_
  - ../config/deepspeed_zero3@_here_
  - ../config/deepspeed_zero3_cpuoffload@_here_

hydra:
  run:
    dir: .
  output_subdir: null

exp_name: "test"
seed: 42
logging_dir: ./output/logs
output_dir: ./output
system_envs:
  USE_MODELSCOPE: '1'

checkpoint_config:
  type: file_system
  output_dir: /data/cpfs_0/rl_examples/models/${exp_name}

track_with: tensorboard
tracker_kwargs:
  log_dir: /data/oss_bucket_0/rl_examples/llm/tensorboard/roll_exp/rlvr

num_gpus_per_node: 1

max_steps: 100
save_steps: 100
logging_steps: 1
eval_steps: 10
resume_from_checkpoint: false

# --------------------------
# important tips:
# NOTE: Configurations prefixed with "example_" are for documentation purposes only;
#       no guarantee on training performance. For actual usage,
#       please refer to configurations without the "example_" prefix.

# grpo related
rollout_batch_size: 4  # prompt
adv_estimator: "grpo"
num_return_sequences_in_group: 4
prompt_length: 2048
response_length: 2048
ppo_epochs: 1
use_kl_loss: true
kl_loss_coef: 0.001
loss_agg_mode: "seq-mean-token-mean"

# ppo related
# advantage
whiten_advantages: true
advantage_clip: 2.0
dual_clip_loss: true
# clip
reward_clip: 10
# normalize
reward_norm: null
reward_shift: false
reward_scale: false
# reward
add_token_level_kl: false
# --------------------------

# Below are detailed configurations for each role.
pretrain: Qwen/Qwen2.5-0.5B-Instruct
reward_pretrain: Qwen/Qwen2.5-0.5B-Instruct

validation:
  data_args:
    template: qwen2_5
    file_name:
      - data/math_benchmarks.jsonl
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.6
    top_k: 50
    num_beams: 1
    temperature: 0.6
    num_return_sequences: 1

actor_train:
  model_args:
    # attn_implementation: fa2
    disable_gradient_checkpointing: false
    dtype: bf16
    model_type: ~
  training_args:
    learning_rate: 1.0e-6
    weight_decay: 0
    per_device_train_batch_size: 1
    gradient_accumulation_steps: 32
    warmup_steps: 20
    num_train_epochs: 50
  data_args:
    template: qwen2_5
    file_name:
      - data/math_deepmath_deal.jsonl
    domain_interleave_probs:
      math_rule: 1
    dataset_dir: data
    messages: messages
    interleave_probs: "1.0"
    preprocessing_num_workers: 16
  strategy_args:
    strategy_name: deepspeed_train
    strategy_config: ${deepspeed_zero3}
    # strategy_config:
    #   tensor_model_parallel_size: 1
    #   pipeline_model_parallel_size: 1
    #   expert_model_parallel_size: 1
    #   use_distributed_optimizer: true
    #   recompute_granularity: full
  device_mapping: list(range(0,1))
  infer_batch_size: 1

actor_infer:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  generating_args:
    max_new_tokens: ${response_length}
    top_p: 0.99
    top_k: 100
    num_beams: 1
    temperature: 0.99
    num_return_sequences: ${num_return_sequences_in_group}
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.5
      block_size: 16
      max_model_len: 2048
  device_mapping: list(range(0,1))
  infer_batch_size: 1

reference:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
    model_type: ~
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: vllm
    strategy_config:
      gpu_memory_utilization: 0.5
      block_size: 16
      max_model_len: 2048
    # strategy_config:
    #   tensor_model_parallel_size: 1
    #   pipeline_model_parallel_size: 1
    #   expert_model_parallel_size: 1
  device_mapping: list(range(0,1))
  infer_batch_size: 2

rewards:
  math_rule:
    worker_cls: roll.pipeline.rlvr.rewards.math_rule_reward_worker.MathRuleRewardWorker
    model_args:
      model_name_or_path: ${reward_pretrain}
    data_args:
      template: qwen2_5
    tag_included: [deepmath_103k, aime]
    world_size: 8
    infer_batch_size: 1
```
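On the single-card strategy question: since the reference role only needs forward log-probs, I'm wondering whether it should use a plain inference strategy instead of vllm when everything shares one GPU. Something like the sketch below is what I have in mind (`strategy_name: hf_infer` is my guess based on other configs in the repo, not verified; please correct the name or shape if that's not the intended choice):

```yaml
# Sketch only -- is this the recommended shape for the reference role on one card?
# (strategy_name: hf_infer is my assumption, not verified.)
reference:
  model_args:
    disable_gradient_checkpointing: true
    dtype: bf16
  data_args:
    template: qwen2_5
  strategy_args:
    strategy_name: hf_infer
    strategy_config: ~
  device_mapping: list(range(0,1))
  infer_batch_size: 2
```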