I ran your basic MLLM RLVR configuration, examples/qwen2.5-vl-7B-rlvr/rlvr_megatron.yaml, but the training performance stayed almost flat. SwanLab URL:
https://swanlab.cn/@canghongjian/web_public/runs/n5yko5po0dgktwbl9zs0d/chart
Test performance also shows essentially no change after training:
| Model | RefCOCO_test | CountBenchQA | RefCOCO_g_test | RefCOCO_plus_test | MathVista_MINI | MathVerse_MINI | CountQA_test |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2_5_VL_7B_Instruct | 0.8992 | 0.8789 | 0.8624 | 0.8113 | 0.6850 | 0.4508 | 0.2075 |
| vlm_roll_rlvr_ckpt_20 | 0.8976 | 0.8727 | 0.8610 | 0.8089 | 0.6950 | 0.4114 | 0.2055 |
| vlm_roll_rlvr_ckpt_100 | 0.8970 | 0.8645 | 0.8606 | 0.8147 | 0.6940 | 0.4363 | 0.2016 |
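
To make the "no change" claim easier to check, here is a minimal sketch that computes each checkpoint's per-benchmark difference from the base model. The numbers are copied from the results above; the `deltas` helper and the variable names are only illustrative and are not part of the ROLL codebase.

```python
# Scores copied from the reported results; column order matches the benchmark list.
benchmarks = [
    "RefCOCO_test", "CountBenchQA", "RefCOCO_g_test", "RefCOCO_plus_test",
    "MathVista_MINI", "MathVerse_MINI", "CountQA_test",
]
scores = {
    "Qwen2_5_VL_7B_Instruct": [0.8992, 0.8789, 0.8624, 0.8113, 0.6850, 0.4508, 0.2075],
    "vlm_roll_rlvr_ckpt_20": [0.8976, 0.8727, 0.8610, 0.8089, 0.6950, 0.4114, 0.2055],
    "vlm_roll_rlvr_ckpt_100": [0.8970, 0.8645, 0.8606, 0.8147, 0.6940, 0.4363, 0.2016],
}


def deltas(baseline, checkpoint):
    """Return checkpoint score minus baseline score for every benchmark."""
    return {
        name: round(ckpt - base, 4)
        for name, base, ckpt in zip(benchmarks, scores[baseline], scores[checkpoint])
    }


if __name__ == "__main__":
    base = "Qwen2_5_VL_7B_Instruct"
    for ckpt in ("vlm_roll_rlvr_ckpt_20", "vlm_roll_rlvr_ckpt_100"):
        print(ckpt)
        for name, diff in deltas(base, ckpt).items():
            print(f"  {name}: {diff:+.4f}")
```

Every difference is within about 0.04 of the base model in absolute terms, and most are below 0.01.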