Thanks for the great work! May I ask whether you have tried different values of `per_device_train_batch_size` and `num_generations`? In my current experiment I tried a larger batch size, but training seems to have difficulty converging. Does using a smaller `num_generations` lead to a drop in performance, and does the "Aha Moment" still appear?
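For concreteness, here is a hypothetical comparison of the kind of settings I mean (all names and values are placeholders, not taken from your repo; I am assuming a TRL-style GRPO setup where each prompt is sampled `num_generations` times and the group size must divide the global batch):

```python
# Two hypothetical runs with the same total completions per optimizer step
# (64), but different group sizes for the GRPO advantage baseline.
run_a = {"per_device_train_batch_size": 8,  "num_generations": 8, "grad_accum": 4, "num_gpus": 2}
run_b = {"per_device_train_batch_size": 16, "num_generations": 4, "grad_accum": 2, "num_gpus": 2}

def prompts_per_step(cfg):
    # Total completions per optimizer step across all devices, divided by the
    # number of samples drawn per prompt (the GRPO group size).
    completions = cfg["per_device_train_batch_size"] * cfg["num_gpus"] * cfg["grad_accum"]
    assert completions % cfg["num_generations"] == 0, "group size must divide the global batch"
    return completions // cfg["num_generations"]

print(prompts_per_step(run_a))  # 8 distinct prompts, 8 completions each
print(prompts_per_step(run_b))  # 16 distinct prompts, 4 completions each
```

My question is essentially whether shifting from `run_a`-like settings toward `run_b`-like settings (fewer samples per prompt, more prompts) hurt convergence or the emergence of the "Aha Moment" in your experiments.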