For example, test_image_input passes locally but is failing on #424. As far as I can tell, this doesn't indicate a bug.
A few options:
- We could run each test several times and pass it if a sufficient fraction of the runs pass.
- We could record a trace of a live LLM when writing the test and replay that trace when running it.
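
To make the two options concrete, here is a rough sketch of each. Everything in it is hypothetical and not taken from this repo: the `flaky` decorator, the `ReplayingLLM` wrapper, the JSON trace file, and the assumption that the LLM can be treated as a plain `prompt -> response` callable on strings. Image inputs would probably need to be keyed by a hash of the request rather than by the raw prompt.

```python
import functools
import json
from pathlib import Path
from typing import Callable, Optional


def flaky(runs: int = 5, min_pass_fraction: float = 0.6) -> Callable:
    """Option 1 (hypothetical helper): re-run a test and pass if enough runs succeed."""
    def decorator(test_fn: Callable) -> Callable:
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            passed = 0
            for _ in range(runs):
                try:
                    test_fn(*args, **kwargs)
                    passed += 1
                except AssertionError:
                    # Count assertion failures only; other exceptions still propagate.
                    pass
            if passed / runs < min_pass_fraction:
                raise AssertionError(f"only {passed}/{runs} runs passed")
        return wrapper
    return decorator


class ReplayingLLM:
    """Option 2 (hypothetical wrapper): record live responses once, then replay them.

    Assumes the LLM is a callable taking a prompt string and returning a string.
    """

    def __init__(self, trace_path: str, live_llm: Optional[Callable[[str], str]] = None):
        self.trace_path = Path(trace_path)
        self.live_llm = live_llm
        # Load a previously recorded trace if one exists.
        if self.trace_path.exists():
            self.trace = json.loads(self.trace_path.read_text())
        else:
            self.trace = {}

    def __call__(self, prompt: str) -> str:
        if prompt in self.trace:
            return self.trace[prompt]  # replay: no network call, deterministic
        if self.live_llm is None:
            raise KeyError(f"no recorded response for prompt: {prompt!r}")
        response = self.live_llm(prompt)  # record: call the live model once
        self.trace[prompt] = response
        self.trace_path.write_text(json.dumps(self.trace, indent=2))
        return response
```

The tradeoff as I see it: the first option still exercises the real model but is slower and can still fail by chance, while the second is deterministic but only checks our code against whatever the model said when the trace was recorded.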