LLM integration tests should be made more reliable

For example, `test_image_input` passes locally but is failing on #424. As far as I can tell, this doesn't indicate a bug.

A few options:
- We could run each test several times and count it as passing if a sufficient fraction of the runs pass.
- We could record a trace of a live LLM's responses when writing the test and replay that trace when running it.
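
A rough sketch of what both options might look like, to make the discussion concrete. Nothing here exists in the repo: `passes_often_enough`, `ReplayingLLM`, the default run count and pass rate, and the JSON trace format keyed by prompt are all hypothetical. It also assumes the LLM can be treated as a plain `prompt -> str` callable, which is probably too simple for image inputs.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def passes_often_enough(
    check: Callable[[], None], runs: int = 5, min_pass_rate: float = 0.6
) -> bool:
    """Option 1: run `check` several times and tolerate some failures.

    A run counts as failed only if it raises AssertionError; any other
    exception propagates, since that more likely indicates a real bug.
    """
    passed = 0
    for _ in range(runs):
        try:
            check()
            passed += 1
        except AssertionError:
            pass
    return passed / runs >= min_pass_rate


class ReplayingLLM:
    """Option 2: record live responses once, then replay them from disk.

    With `live_call` set, unseen prompts are sent to the live model and the
    response is appended to the trace file. Without it (e.g. in CI), only
    recorded prompts can be answered.
    """

    def __init__(
        self, trace_path, live_call: Optional[Callable[[str], str]] = None
    ):
        self.trace_path = Path(trace_path)
        self.live_call = live_call
        if self.trace_path.exists():
            self.trace = json.loads(self.trace_path.read_text())
        else:
            self.trace = {}

    def __call__(self, prompt: str) -> str:
        if prompt in self.trace:
            return self.trace[prompt]
        if self.live_call is None:
            raise KeyError(f"no recorded response for prompt: {prompt!r}")
        response = self.live_call(prompt)
        self.trace[prompt] = response
        self.trace_path.write_text(json.dumps(self.trace, indent=2))
        return response
```

The trade-off as I see it: the repeated-run approach still exercises the live model but is slower and can still flake, while the replay approach is deterministic but won't notice if the live model's behavior drifts until someone re-records the trace.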