Hi, I'm using GRPO to train on my own task. I noticed that with Qwen2-VL-2B, the model tends to generate overly verbose and irrelevant content in the early training stages, which makes it hard for it to learn the desired output format.
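For context, by "desired format" I mean something like a tag-based template enforced with a binary format reward. This is only an illustrative sketch (the `<think>`/`<answer>` tags and the function name are placeholders, not my exact setup):

```python
import re

# Hypothetical format reward: 1.0 if the completion matches the
# <think>...</think><answer>...</answer> template, else 0.0.
# The tag names here are placeholders for whatever format the task expects.
FORMAT_PATTERN = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL
)

def format_reward(completion: str) -> float:
    """Binary reward for matching the expected output template."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

print(format_reward("<think>step by step</think><answer>42</answer>"))  # 1.0
print(format_reward("Sure! Let me explain at length..."))               # 0.0
```

Early in training the model rarely hits this template, so the format reward is almost always 0 and the rollouts are dominated by long, off-topic text.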
I'm wondering if you encountered a similar issue during your training process. If so, how did you address it?
Thanks in advance!