I really appreciate your excellent work on LLM Compass. It provides very useful insights for model evaluation.
I have two questions about the implementation:
- For Figure 5, could you explain how the real metrics were tested and calculated?
- Does LLM Compass support simulating other models besides GPT-3? The current implementation seems specifically designed for GPT-3, and adapting it to other models appears to require significant engineering effort.
Thank you for your time and consideration.