SoftPrompt-IR aims to reduce ambiguity *before sampling*. Questions to explore:

- How could this be tested systematically?
- What would a fair baseline look like?
- Which metrics even make sense here?

Open discussion, no formal eval required.