If I want to use an LLM to score responses as a reward model, how should I configure this via `custom-rm-path`? Should I refer to the implementation in `examples/on_policy_distillation`? And in that setup, does the LLM that produces the reward score act as the teacher model?