Ideas for eval-friendly or research usage

SoftPrompt-IR aims to reduce ambiguity *before sampling*.

Questions to explore:
- How could this be tested systematically?
- What would a fair baseline look like?
- Which metrics even make sense here?
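
One possible starting point for the questions above, offered as a sketch rather than a proposed standard: treat disagreement between repeated samples as a rough proxy for ambiguity, then compare a plain-prompt baseline against the same tasks written with SoftPrompt-IR under identical sampling settings. Everything named here (`disagreement_rate`, `compare_conditions`, the dict-of-samples layout) is hypothetical and not part of SoftPrompt-IR. It assumes you have already collected several sampled outputs per task for each condition.

```python
from itertools import combinations
from statistics import mean


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't count."""
    return " ".join(text.lower().split())


def disagreement_rate(samples: list[str]) -> float:
    """Fraction of sample pairs whose normalized outputs differ (0.0 means all identical)."""
    normed = [normalize(s) for s in samples]
    pairs = list(combinations(normed, 2))
    if not pairs:  # fewer than two samples: nothing to compare
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def compare_conditions(
    baseline: dict[str, list[str]],
    treated: dict[str, list[str]],
) -> dict[str, float]:
    """Mean disagreement per condition over the task ids present in both, plus the difference."""
    shared = sorted(baseline.keys() & treated.keys())
    if not shared:
        raise ValueError("no task ids are shared between the two conditions")
    base = mean(disagreement_rate(baseline[t]) for t in shared)
    treat = mean(disagreement_rate(treated[t]) for t in shared)
    return {"baseline": base, "treated": treat, "reduction": base - treat}
```

Exact-match disagreement is deliberately crude. It only suits tasks with short, canonical answers, and lower disagreement does not by itself mean the answers are correct, so a task-accuracy metric would need to sit alongside it.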

This is an open discussion; no formal eval is required to take part.