**Reproduction Environment**

- GPU: 4× H20
- Configuration: codebase default settings
- Dataset: first 15k SAT samples (as per the default config)

**My Results**

Figure 1. My training curve

Figure 2. My test curve

- My reproduction (base + GRPO): 58.4 (step 1000)
- Qwen2-VL instruct model: 61.6

**Results from [report](https://turningpointai.notion.site/the-multimodal-aha-moment-on-2b-model)**

Figure 3. Test curve from report

**Key Questions**

1. There is a performance gap between my reproduction (58.4) and the reported result (~59.5). What could explain it?
2. The Qwen2-VL instruct model scores inconsistently: 61.6 in my local evaluation vs. ~56 in the report.
3. The reproduced SFT and GRPO test curves (Figure 2) show an abnormal trend.
4. Why does the default config use only the first 15k SAT samples instead of the full dataset?
5. In your experience, what are the likely causes of these abnormal reproduction results?
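For reference, the gaps in questions 1 and 2 work out as below. This is only a small Python sketch using the numbers quoted in this issue. The report values are approximate readings (marked ~ above), and the variable names are illustrative and do not come from the codebase.

```python
# Scores quoted in this issue. Report values are approximate ("~" in the text).
scores = {
    "base+GRPO": {"mine": 58.4, "report": 59.5},           # report value approximate
    "Qwen2-VL instruct": {"mine": 61.6, "report": 56.0},   # report value approximate
}

# Gap = my local score minus the reported score, rounded to one decimal place.
gaps = {name: round(s["mine"] - s["report"], 1) for name, s in scores.items()}

for name, gap in gaps.items():
    direction = "above" if gap > 0 else "below"
    print(f"{name}: {abs(gap)} points {direction} the report")
```

The two gaps point in opposite directions: my GRPO run is below the report while my instruct baseline is above it, which may suggest a difference in the evaluation setup rather than in training alone.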