Add train-infer mismatch correction (add feature train-infer-mismatch), more complete and comprehensive by millioniron · Pull Request #288 · alibaba/ROLL

millioniron · 2025-12-08T15:04:16Z

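I merged and revised the changes. Specifically, for the part of the new version that obtains infer-log-prob I follow the official implementation, but the train-infer correction itself uses my own revision.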

What does this PR do?

✨ What's Changed

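1. Core component refactoring

New `InferCorrectionHandler` class (roll/utils/infer_correction.py): dedicated to importance sampling (IS) correction and sample rejection, replacing the logic previously mixed into loss_func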

handler = InferCorrectionHandler(pipeline_config)
weighted_loss, final_mask, metrics = handler(
    old_log_probs, infer_log_probs, response_mask, pg_loss
)

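2. Three-tier rejection strategy

| Strategy | Trigger | Protects against | Key parameters |
|---|---|---|---|
| Token-level rejection | IS ratio outside the acceptable range | Gradient explosion from a single token | `infer_token_mask_threshold_{min,max}` |
| Sequence-level rejection | Abnormal IS ratio for the sequence as a whole | Loss of sequence-level consistency | `enable_seq_reject`, `infer_seq_mask_threshold_{min,max}` |
| Catastrophic rejection ✨ | IS ratio < 1e-3 (orders-of-magnitude probability gap) | Complete training collapse | `infer_catastrophic_threshold` |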
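
To make the three tiers concrete, here is a minimal sketch for a single sequence. It is not the code in this PR: plain Python lists stand in for tensors so it runs without PyTorch, the function name `rejection_masks` and all threshold values are hypothetical placeholders, the ratio is assumed to be exp(old_log_prob - infer_log_prob), the sequence-level statistic is assumed to be the geometric-mean ratio, and a single catastrophic token is assumed to drop the whole sequence.

```python
import math


def rejection_masks(old_lp, infer_lp, mask,
                    tok_min=0.5, tok_max=2.0,
                    seq_min=0.8, seq_max=1.25,
                    catastrophic=1e-3, enable_seq_reject=True):
    """Return the final 0/1 token mask for one sequence.

    old_lp / infer_lp: per-token log-probs from the training and inference engines.
    mask: 1 for response tokens, 0 for prompt or padding.
    All thresholds are placeholder values, not defaults taken from the PR.
    """
    log_ratio = [o - i for o, i in zip(old_lp, infer_lp)]
    valid = [d for d, m in zip(log_ratio, mask) if m]
    if not valid:
        return [0] * len(mask)

    # Tier 3 (catastrophic): one valid token below the floor drops the sequence.
    if math.exp(min(valid)) < catastrophic:
        return [0] * len(mask)

    # Tier 2 (sequence): geometric-mean ratio must stay inside [seq_min, seq_max].
    if enable_seq_reject:
        seq_ratio = math.exp(sum(valid) / len(valid))
        if not (seq_min <= seq_ratio <= seq_max):
            return [0] * len(mask)

    # Tier 1 (token): keep only valid tokens whose own ratio is in range.
    return [int(bool(m) and tok_min <= math.exp(d) <= tok_max)
            for d, m in zip(log_ratio, mask)]
```

With these placeholder thresholds, a token whose ratio is e (about 2.72) is masked on its own when `enable_seq_reject=False`, while the same input with sequence rejection enabled drops the full sequence because the geometric-mean ratio exceeds 1.25.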

3. Smart importance sampling

Switchable modes:
```
infer_is_mode: Literal["token", "sequence", "geometric", "none"]
```
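- token: conventional token-level IS (default)
- sequence: total log-ratio over the sequence (stabilizes long-sequence training)
- geometric: geometric-mean ratio (tempers extreme values)
- none: disables IS (for baseline comparisons)

Adaptive clipping: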

is_weight = raw_is_weight.clamp(
    min=infer_is_threshold_min,
    max=infer_is_threshold_max,
)
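
The four modes and the clipping step can be sketched together as follows. This is an illustration under stated assumptions rather than the PR's implementation: plain Python lists replace tensors, the function name `is_weights` and the bounds `w_min`/`w_max` are hypothetical stand-ins for `infer_is_threshold_min`/`infer_is_threshold_max`, and the log-ratio is assumed to be old_log_prob - infer_log_prob.

```python
import math


def is_weights(old_lp, infer_lp, mask, mode="token", w_min=0.0, w_max=2.0):
    """Clipped per-token IS weights for one sequence (0.0 at masked positions)."""
    # Masked positions contribute nothing to the sequence-level sums.
    log_ratio = [(o - i) if m else 0.0
                 for o, i, m in zip(old_lp, infer_lp, mask)]
    n = sum(1 for m in mask if m)

    if mode == "none" or n == 0:
        raw = [1.0] * len(mask)                                # IS disabled
    elif mode == "token":
        raw = [math.exp(d) for d in log_ratio]                 # per-token ratio
    elif mode == "sequence":
        raw = [math.exp(sum(log_ratio))] * len(mask)           # product of ratios
    elif mode == "geometric":
        raw = [math.exp(sum(log_ratio) / n)] * len(mask)       # length-normalized
    else:
        raise ValueError(f"unknown infer_is_mode: {mode!r}")

    # Clipping, mirroring the clamp(min=..., max=...) call shown above.
    return [min(max(w, w_min), w_max) if m else 0.0
            for w, m in zip(raw, mask)]
```

Because the sequence mode multiplies per-token ratios, its raw weight grows or shrinks exponentially with length, which is why the clipping bounds matter most there; the geometric mode divides the summed log-ratio by the token count before exponentiating.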

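4. Industrial-grade diagnostics

StatsCollector manages metrics centrally, in three groups:
- Basic distribution: token_ratio_mean/std/min/max
- Rejection analysis: token_reject_frac, seq_reject_frac, catastrophic_seq_frac
- Training health: inferkl (raw KL), inferkl_reject (KL after rejection)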
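
As a rough illustration of the two KL metrics, the sketch below averages a per-token log-ratio once over the full response mask and once over the mask that survives rejection. The PR does not state which estimator it uses; this assumes the simple mean of infer_log_prob - old_log_prob over tokens sampled by the inference engine, and the helper names `masked_mean` and `infer_kl` are hypothetical.

```python
def masked_mean(values, mask):
    """Mean of `values` over positions where `mask` is non-zero (0.0 if none)."""
    n = sum(1 for m in mask if m)
    return sum(v for v, m in zip(values, mask) if m) / n if n else 0.0


def infer_kl(old_lp, infer_lp, mask):
    """Assumed estimator: mean of (infer_log_prob - old_log_prob) under `mask`."""
    return masked_mean([i - o for o, i in zip(old_lp, infer_lp)], mask)


old_lp = [-1.0, -2.0, -1.0]
infer_lp = [-1.0, -1.0, -1.0]
response_mask = [1, 1, 1]   # all response tokens
final_mask = [1, 0, 1]      # middle token rejected

kl_raw = infer_kl(old_lp, infer_lp, response_mask)   # analogue of inferkl
kl_kept = infer_kl(old_lp, infer_lp, final_mask)     # analogue of inferkl_reject
```

Comparing the two values shows how much of the train-infer divergence the rejection step removed from the tokens that still contribute to the loss.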

Deferred computation: avoids frequent GPU-CPU synchronization

self.stats.add_tensor_stat("token_ratio", ratio, mask)  # register only, nothing is computed yet
self.stats.compute_tensor_stats()  # compute all registered stats in one batch
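
The register-then-compute pattern can be sketched as below. This is not the PR's StatsCollector: the class name `DeferredStats` is hypothetical, plain Python lists replace tensors (so the GPU-CPU synchronization itself is not modeled), and the standard deviation is the population form. Only the two method names and the `_mean/_std/_min/_max` suffixes come from the description above.

```python
import math


class DeferredStats:
    """Collects (name, values, mask) triples and reduces them in one pass."""

    def __init__(self):
        self._pending = []   # registered but not yet reduced
        self.metrics = {}    # filled by compute_tensor_stats()

    def add_tensor_stat(self, name, values, mask):
        # Registration only: no reduction happens here.
        self._pending.append((name, values, mask))

    def compute_tensor_stats(self):
        # Reduce everything registered so far, then clear the queue.
        for name, values, mask in self._pending:
            kept = [v for v, m in zip(values, mask) if m]
            if not kept:
                continue
            mean = sum(kept) / len(kept)
            var = sum((v - mean) ** 2 for v in kept) / len(kept)
            self.metrics[f"{name}_mean"] = mean
            self.metrics[f"{name}_std"] = math.sqrt(var)
            self.metrics[f"{name}_min"] = min(kept)
            self.metrics[f"{name}_max"] = max(kept)
        self._pending.clear()
        return self.metrics


stats = DeferredStats()
stats.add_tensor_stat("token_ratio", [1.0, 3.0, 100.0], [1, 1, 0])
metrics_before = dict(stats.metrics)          # still empty at this point
metrics_after = stats.compute_tensor_stats()  # reductions happen here
```

In a tensor-based version, the benefit of this layout would be that scalar extraction can be grouped at the end of a step instead of being triggered at every call site.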

- new file: examples/qwen2.5-infer_correction/agentic_webshop_infer_correction.yaml
- new file: examples/qwen2.5-infer_correction/rlvr_infer_correction_config.yaml
- new file: examples/qwen2.5-infer_correction/run_agentic_pipeline_webshop.sh
- new file: examples/qwen2.5-infer_correction/run_rlvr_pipeline.sh
- modified: roll/configs/base_config.py
- modified: roll/configs/generating_args.py
- modified: roll/distributed/scheduler/generate_scheduler.py
- modified: roll/pipeline/agentic/env_manager/step_env_manager.py
- modified: roll/pipeline/agentic/env_manager/traj_env_manager.py
- modified: roll/pipeline/base_worker.py
- modified: roll/pipeline/rlvr/actor_pg_worker.py
- modified: roll/pipeline/rlvr/actor_worker.py

millioniron · 2025-12-08T15:12:51Z

done

millioniron added 7 commits December 3, 2025 14:08

Reworked the overall layout and abstracted the logic into a class so that it can be configured more freely

77c1e5d

Reworked the overall layout and abstracted the logic into a class so that it can be configured more freely

e98c4ce

Merge branch 'main' of https://github.com/millioniron/ROLL

7613dbd

modified: roll/pipeline/agentic/env_manager/step_env_manager.py

c3d1121

Remove the original official train-infer implementation

515bf39

millioniron mentioned this pull request Dec 8, 2025

Add train-infer mismatch correction (add feature train-infer-mismatch) #273

Closed

modified: roll/pipeline/agentic/env_manager/step_env_manager.py

243a961

millioniron wants to merge 8 commits into alibaba:main from millioniron:main