Add eval support: EvalHarness ABC, LightEvalHarness, LLMClient, LLMJudge#388
Conversation
Generic RPC abstraction for LLM endpoints with OpenAI-compatible implementation. LLMJudge rubric uses this client for LLM-as-a-judge reward computation with configurable prompt templates and score parsing.
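A minimal sketch of the LLM-as-a-judge flow described above. The template string, regex, and 0–10 scale here are illustrative assumptions; the PR's actual `LLMJudge` defaults and prompt template may differ:

```python
import re

# Hypothetical rubric template; the real default template may differ.
PROMPT_TEMPLATE = (
    "Rate the response on a 0-10 scale.\n"
    "Action: {action}\nObservation: {observation}\nScore:"
)

def render_prompt(action: str, observation: str) -> str:
    return PROMPT_TEMPLATE.format(action=action, observation=observation)

def judge(llm_complete, action, observation, normalize=True, default=0.0):
    """LLM-as-a-judge: send a rubric prompt, parse a numeric score back."""
    reply = llm_complete(render_prompt(action, observation))
    m = re.search(r"([\d.]+)", reply)
    if m is None:
        return default  # unparseable reply falls back to the default score
    score = float(m.group(1))
    return score / 10.0 if normalize else score
```

Usage: any callable with the `complete(prompt) -> str` shape can stand in for the client, e.g. `judge(lambda prompt: "Score: 8", "act", "obs")` yields a normalized score of 0.8.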
Concrete EvalHarness subclass with deferred lighteval imports (DaytonaProvider pattern) so the class is importable without lighteval installed. Supports transformers/vllm/tgi/openai/sglang backends. Adds lighteval optional dep group to pyproject.toml.
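The deferred-import pattern mentioned above can be sketched roughly as follows. The class name and the `lighteval` module match the PR, but the body is an illustrative assumption, not the actual implementation:

```python
class LightEvalHarness:
    """Sketch of the deferred-import pattern (illustrative, not the PR's code)."""

    def run(self, dataset: str) -> dict:
        # The heavy optional dependency is imported inside the method,
        # so this module (and class) import cleanly without lighteval.
        try:
            import lighteval  # noqa: F401
        except ImportError as exc:
            raise ImportError(
                "LightEvalHarness requires the 'lighteval' extra: "
                "pip install 'openenv[lighteval]'"
            ) from exc
        # Placeholder result; the real harness would build and run a Pipeline.
        return {"dataset": dataset, "scores": {}}
```

The key property is that constructing the harness never touches the optional dependency; only `run()` does, and it fails with an actionable message when the extra is missing.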
Greptile Summary

This PR adds evaluation infrastructure to OpenEnv across three areas: an eval harness abstraction (`EvalHarness` with a `LightEvalHarness` implementation), an LLM client abstraction (`LLMClient`/`OpenAIClient`), and an LLM-as-a-judge rubric (`LLMJudge`).
Confidence Score: 4/5
Important Files Changed
Class Diagram

```mermaid
classDiagram
    class EvalHarness {
        <<abstract>>
        +run(harness_version, library_versions, dataset, eval_parameters) Dict
        +run_from_config(config: EvalConfig) EvalResult
        +name: str
    }
    class LightEvalHarness {
        +output_dir: str
        +save_details: bool
        +run(...) Dict
        -_build_model_config(backend, model_name, model_args) Any
    }
    class EvalConfig {
        +harness_name: str
        +harness_version: str
        +library_versions: Dict
        +dataset: str
        +eval_parameters: Dict
    }
    class EvalResult {
        +config: EvalConfig
        +scores: Dict
    }
    class LLMClient {
        <<abstract>>
        +endpoint: str
        +port: int
        +complete(prompt, **kwargs) str
        +base_url: str
    }
    class OpenAIClient {
        +model: str
        +system_prompt: str
        +temperature: float
        +max_tokens: int
        +complete(prompt, **kwargs) str
    }
    class Rubric {
        <<abstract>>
        +forward(action, observation) float
    }
    class LLMJudge {
        +prompt_template: str
        +default_score: float
        +normalize: bool
        +forward(action, observation) float
        -_render_prompt(action, observation) str
        -_parse_score(response) float
        +state_dict() Dict
        +load_state_dict(state) void
    }
    EvalHarness <|-- LightEvalHarness
    EvalHarness ..> EvalConfig : uses
    EvalHarness ..> EvalResult : produces
    EvalResult o-- EvalConfig
    LLMClient <|-- OpenAIClient
    Rubric <|-- LLMJudge
    LLMJudge --> LLMClient : uses
```
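As an illustration of how the diagrammed pieces compose, here is a minimal, self-contained sketch. The dataclasses and `ToyHarness` below are toy stand-ins under assumed field names from the diagram, not the PR's actual Pydantic models:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class EvalConfig:
    harness_name: str
    harness_version: str = "0"
    library_versions: Dict[str, str] = field(default_factory=dict)
    dataset: str = ""
    eval_parameters: Dict[str, Any] = field(default_factory=dict)

@dataclass
class EvalResult:
    config: EvalConfig
    scores: Dict[str, float]

class EvalHarness(ABC):
    name: str = "base"

    @abstractmethod
    def run(self, harness_version, library_versions, dataset, eval_parameters) -> Dict[str, float]:
        ...

    def run_from_config(self, config: EvalConfig) -> EvalResult:
        # Template method: unpack the typed config, delegate to run(),
        # and wrap the raw scores in a typed result.
        scores = self.run(config.harness_version, config.library_versions,
                          config.dataset, config.eval_parameters)
        return EvalResult(config=config, scores=scores)

class ToyHarness(EvalHarness):
    name = "toy"

    def run(self, harness_version, library_versions, dataset, eval_parameters):
        return {"accuracy": 1.0}
```

Usage: `ToyHarness().run_from_config(EvalConfig(harness_name="toy"))` returns an `EvalResult` carrying both the scores and the config that produced them.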
Last reviewed commit: cde69f7
src/openenv/core/evals/lighteval.py
Outdated
```python
        save_details: bool = False,
    ):
        self.output_dir = output_dir
        self.save_details = save_details
```
save_details is stored but never used
self.save_details is set in __init__ but never referenced in run() or _build_model_config(). It is not passed to PipelineParameters, EnvConfig, or any other LightEval API call. This means the parameter has no effect regardless of its value.
Either wire it into the pipeline configuration (likely as a kwarg to PipelineParameters or Pipeline) or remove it to avoid misleading callers.
Fixed in cb9ba18. save_and_push_results() is now gated on self.save_details, so the parameter controls whether results are persisted. Tests verify both paths (test_evaluate_called_but_save_skipped_by_default and test_save_called_when_save_details_enabled).
src/openenv/core/llm_client.py
Outdated
```python
        temperature=kwargs.get("temperature", self.temperature),
        max_tokens=kwargs.get("max_tokens", self.max_tokens),
    )
    return response.choices[0].message.content
```
complete() can return None instead of str
response.choices[0].message.content can be None when the OpenAI API returns a tool-call-only response or when the model produces no text content. The return type annotation is str, but this line will return None in those cases, which would cause a TypeError downstream (e.g., in LLMJudge._parse_score() which calls self._score_pattern.search(response) on the result).
Suggested change:

```diff
- return response.choices[0].message.content
+ return response.choices[0].message.content or ""
```
Fixed in cb9ba18. complete() now returns response.choices[0].message.content or "", ensuring the -> str contract is satisfied even when the API returns None content.
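The fixed behavior can be demonstrated with a stub response object. The attribute names follow the OpenAI chat-completions response shape; `extract_content` is a simplified stand-in for the client method, not the PR's code:

```python
from types import SimpleNamespace

def extract_content(response) -> str:
    # Mirrors the fix: coalesce None content (e.g. a tool-call-only
    # response) to an empty string so the -> str contract holds.
    return response.choices[0].message.content or ""

# Stub a tool-call-only response whose text content is None.
resp = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content=None))]
)
assert extract_content(resp) == ""
```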
src/openenv/core/evals/lighteval.py
Outdated
```python
    model_config=model_config,
)
pipeline.evaluate()
pipeline.save_and_push_results()
```
Unconditional save_and_push_results() may cause side effects
pipeline.save_and_push_results() is called unconditionally on every run(). Depending on the LightEval configuration, this could write files to disk or push results to a remote hub. Users who just want to retrieve scores without side effects (e.g., in tests or CI) would need to work around this.
Consider making this conditional, e.g. controlled by the `save_details` parameter that is already accepted but unused:

```python
if self.save_details:
    pipeline.save_and_push_results()
```
Fixed in cb9ba18. save_and_push_results() is now conditional on self.save_details (defaults to False), so users who just want scores get no side effects.
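The gating described in this fix can be sketched with a fake pipeline. `FakePipeline` and the free-standing `run()` are illustrative stand-ins for LightEval's `Pipeline` and the harness method, not the PR's actual code:

```python
class FakePipeline:
    """Stand-in for LightEval's Pipeline, recording side effects."""
    def __init__(self):
        self.saved = False
    def evaluate(self):
        return {"acc": 0.5}
    def save_and_push_results(self):
        self.saved = True

def run(pipeline, save_details: bool = False):
    scores = pipeline.evaluate()
    # Persistence is opt-in: no disk writes or hub pushes unless requested.
    if save_details:
        pipeline.save_and_push_results()
    return scores
```

With the default `save_details=False`, callers (e.g. tests and CI) get scores back with no side effects; passing `save_details=True` triggers persistence.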
Pull request overview
Adds an evaluation foundation to openenv.core (typed config/result + harness abstraction) and introduces an LLM-based rubric workflow by adding an LLM client abstraction and an LLM-as-a-judge rubric.
Changes:
- Introduce eval primitives (`EvalHarness`, `EvalConfig`, `EvalResult`) and a `LightEvalHarness` integration with deferred `lighteval` imports.
- Add `LLMClient` abstraction with an `OpenAIClient` implementation for OpenAI-compatible endpoints.
- Add `LLMJudge` rubric plus comprehensive tests for evals, the LLM client, and the rubric.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| `pyproject.toml` | Adds `lighteval` optional dependency group. |
| `src/openenv/core/__init__.py` | Re-exports `LLMClient` / `OpenAIClient` from `openenv.core`. |
| `src/openenv/core/llm_client.py` | Introduces `LLMClient` ABC and `OpenAIClient` implementation. |
| `src/openenv/core/evals/__init__.py` | Exposes eval types + `LightEvalHarness` from `openenv.core.evals`. |
| `src/openenv/core/evals/base.py` | Adds `EvalHarness` ABC and `run_from_config()` helper. |
| `src/openenv/core/evals/types.py` | Adds Pydantic models `EvalConfig` and `EvalResult`. |
| `src/openenv/core/evals/lighteval.py` | Implements `LightEvalHarness` with deferred imports + backend mapping. |
| `src/openenv/core/rubrics/__init__.py` | Re-exports `LLMJudge`. |
| `src/openenv/core/rubrics/llm_judge.py` | Adds LLM-as-a-judge rubric with prompt rendering, parsing, and state (de)serialization. |
| `tests/core/test_llm_client.py` | Unit tests for `LLMClient` ABC + `OpenAIClient`. |
| `tests/core/test_rubrics/test_llm_judge.py` | Unit tests for prompt rendering, parsing, hooks, containers, and `state_dict`. |
| `tests/core/test_evals/__init__.py` | Test package init for eval tests. |
| `tests/core/test_evals/test_eval_harness.py` | Tests for the `EvalHarness` ABC and `run_from_config()`. |
| `tests/core/test_evals/test_eval_types.py` | Tests for `EvalConfig` / `EvalResult` validation + equality. |
| `tests/core/test_evals/test_lighteval_harness.py` | Tests `LightEvalHarness` with mocked `lighteval` modules. |
src/openenv/core/llm_client.py
Outdated
```python
        temperature=kwargs.get("temperature", self.temperature),
        max_tokens=kwargs.get("max_tokens", self.max_tokens),
    )
    return response.choices[0].message.content
```
OpenAIClient.complete() assumes response.choices[0].message.content is always present. OpenAI-compatible APIs can return an empty choices list or content=None (e.g., tool calls / refusal), which would raise IndexError or violate the -> str contract. Consider validating the response and raising a clear error (or returning a fallback) when no text content is available.
Suggested change:

```diff
- return response.choices[0].message.content
+ # Validate that the response contains at least one choice with non-empty content.
+ choices = getattr(response, "choices", None)
+ if not choices:
+     raise ValueError("LLM completion returned no choices; cannot extract text response.")
+ first_choice = choices[0]
+ message = getattr(first_choice, "message", None)
+ content = getattr(message, "content", None) if message is not None else None
+ if content is None:
+     raise ValueError("LLM completion choice has no message content; cannot extract text response.")
+ # Preserve existing behavior: assume content is text, but ensure we return a string.
+ return content if isinstance(content, str) else str(content)
```
Fixed in cb9ba18. We went with return response.choices[0].message.content or "" rather than the more verbose validation — keeps the code simple and satisfies the -> str contract. LLMJudge._parse_score() already handles empty strings gracefully (returns default_score).
src/openenv/core/evals/lighteval.py
Outdated
```python
pipeline_params = PipelineParameters(
    launcher_type=launcher_type,
    env_config=EnvConfig(cache_dir=self.output_dir),
    custom_tasks_directory=custom_tasks_directory,
```
output_dir is documented as an output directory, but in the current implementation it is only passed to EnvConfig(cache_dir=...). Either rename/update the docs to reflect that this is a cache dir, or wire it into LightEval's output/result directory configuration.
Good catch on the naming. In LightEval, EnvConfig.cache_dir serves as both the cache and output directory for results. We kept output_dir as the public parameter name since it better describes user intent (where outputs go), while cache_dir is the LightEval-internal name. The docstring says "Directory for evaluation outputs" which is accurate.
src/openenv/core/evals/lighteval.py
Outdated
```python
def __init__(
    self,
    *,
    output_dir: Optional[str] = None,
    save_details: bool = False,
):
    self.output_dir = output_dir
    self.save_details = save_details
```
save_details is accepted and stored on the harness, but it is not referenced anywhere in run(). This makes the public API misleading; either remove the arg for now or plumb it into the LightEval pipeline configuration / save behavior.
Fixed in cb9ba18. save_details now gates pipeline.save_and_push_results() — when False (default), results are only returned in-memory without writing to disk or pushing to hub.
```python
match = self._score_pattern.search(response)
if match is None:
    return self.default_score
try:
    score = float(match.group(1))
```
_parse_score() always reads match.group(1). If a caller supplies a score_pattern without a capturing group, parsing will always fail and silently fall back to default_score. Either document that the first capturing group is required, or support patterns without groups by using group(1) when available else group(0).
Fixed in 6e11d8f (merge from main). _parse_score() now uses match.group(1) if match.lastindex else match.group(0), so patterns without a capture group fall back to the full match instead of silently returning default_score.
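The fallback described in this fix can be illustrated with plain `re`. The function and pattern strings below are hypothetical examples, not the rubric's actual defaults:

```python
import re

def parse_score(response: str, pattern: str, default: float = 0.0) -> float:
    match = re.compile(pattern).search(response)
    if match is None:
        return default
    # Use the first capture group when the pattern has one,
    # otherwise fall back to the full match (lastindex is None).
    text = match.group(1) if match.lastindex else match.group(0)
    try:
        return float(text)
    except ValueError:
        return default

assert parse_score("Score: 7.5", r"Score:\s*([\d.]+)") == 7.5  # pattern with a group
assert parse_score("7.5", r"[\d.]+") == 7.5                    # pattern without a group
```

`Match.lastindex` is `None` when the pattern has no capturing groups, which is what makes this a safe discriminator between the two cases.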
- Gate `pipeline.save_and_push_results()` on the `save_details` flag so it only runs when explicitly requested (fixes unused parameter bug)
- Return empty string from `OpenAIClient.complete()` when content is `None`, satisfying the `-> str` contract
…l-support

# Conflicts:
#	src/openenv/core/llm_client.py
#	src/openenv/core/rubrics/llm_judge.py
LightEval authors are migrating to Inspect AI. Replace the concrete harness implementation with InspectAIHarness wrapping inspect_ai.eval(). The EvalHarness ABC, EvalConfig, and EvalResult remain unchanged.
Summary
- `EvalHarness` ABC, `EvalConfig`, and `EvalResult` Pydantic models in `src/openenv/core/evals/` as the foundation for evaluation support
- `LightEvalHarness(EvalHarness)` wrapping HuggingFace's LightEval Pipeline API with deferred imports (DaytonaProvider pattern), importable without lighteval installed
- `LLMClient` ABC and `OpenAIClient` implementation in `src/openenv/core/llm_client.py`
- `LLMJudge` rubric in `src/openenv/core/rubrics/llm_judge.py`
- `lighteval` optional dependency group in `pyproject.toml`

Closes #384

Test plan
- Unit tests (`tests/core/test_evals/`): harness ABC, types, and `LightEvalHarness` with mocked lighteval
- Unit tests (`tests/core/test_llm_client.py`, `tests/core/test_rubrics/test_llm_judge.py`)