Hi, I'm using GRPO to train on my own task. I noticed that with Qwen2-VL-2B, the model tends to generate overly verbose and irrelevant content in the early training stages, which makes it hard for it to learn the desired output format.
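For context, by "desired format" I mean something like a tag-based template enforced with a binary format reward. This is only an illustrative sketch (the `<think>`/`<answer>` tags and the function name are placeholders, not my exact setup):

```python
import re

# Hypothetical format reward: 1.0 if the completion matches the
# <think>...</think><answer>...</answer> template, else 0.0.
# The tag names here are placeholders for whatever format the task expects.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Binary reward for matching the expected output template."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

print(format_reward("<think>step by step</think><answer>42</answer>"))  # 1.0
print(format_reward("Sure! Let me explain at length..."))               # 0.0
```

Early in training the model rarely hits this template, so the format reward is almost always 0 and the rollouts are dominated by long, off-topic text.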
I'm wondering if you encountered a similar issue during your training process. If so, how did you address it?
Thanks in advance!