
Add eval support: EvalHarness ABC, LightEvalHarness, LLMClient, LLMJudge#388

Open
Darktex wants to merge 6 commits into main from feature/issue-384-eval-support

Conversation

Darktex (Contributor) commented Feb 17, 2026

Summary

  • Adds EvalHarness ABC, EvalConfig, and EvalResult Pydantic models in src/openenv/core/evals/ as the foundation for evaluation support
  • Adds LightEvalHarness(EvalHarness) wrapping HuggingFace's LightEval Pipeline API with deferred imports (DaytonaProvider pattern) — importable without lighteval installed
  • Adds LLMClient ABC and OpenAIClient implementation in src/openenv/core/llm_client.py
  • Adds LLMJudge rubric in src/openenv/core/rubrics/llm_judge.py
  • Adds lighteval optional dependency group to pyproject.toml
  • 5 backends supported: transformers, vllm, tgi, openai, sglang

Closes #384
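The deferred-import behavior called out above can be sketched as follows. This is an illustrative reconstruction, not the actual LightEvalHarness code; the class name and install hint are assumptions based on the optional dependency group:

```python
# Illustrative sketch of the deferred-import (DaytonaProvider-style) pattern:
# the module imports cleanly without lighteval installed; the dependency is
# only resolved when run() is actually called.
class DeferredImportHarness:
    def run(self, **eval_parameters):
        try:
            # Deferred: lighteval is required at run time, not at import time.
            from lighteval.pipeline import Pipeline  # noqa: F401
        except ImportError as exc:
            raise ImportError(
                "This harness requires lighteval. "
                "Install it with: pip install 'openenv[lighteval]'"
            ) from exc
        # ... build the pipeline, evaluate, and return scores here ...
        return {}
```

The import cost (and the hard dependency) is paid only inside run(), so importing the module never fails when lighteval is absent.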

Test plan

  • 58 eval tests pass (tests/core/test_evals/) — harness ABC, types, and LightEvalHarness with mocked lighteval
  • LLMClient and LLMJudge tests pass (tests/core/test_llm_client.py, tests/core/test_rubrics/test_llm_judge.py)
  • Full test suite passes (876 passed, 94 skipped, pre-existing websocket port conflict excluded)
  • Lint and format clean

Generic RPC abstraction for LLM endpoints with an OpenAI-compatible implementation. The LLMJudge rubric uses this client for LLM-as-a-judge reward computation with configurable prompt templates and score parsing.

Concrete EvalHarness subclass with deferred lighteval imports (DaytonaProvider pattern) so the class is importable without lighteval installed. Supports transformers/vllm/tgi/openai/sglang backends. Adds the lighteval optional dependency group to pyproject.toml.
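As a rough sketch of the client abstraction these commits describe (attribute names follow the PR's class diagram; EchoClient, the defaults, and the URL scheme are hypothetical):

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of the LLMClient ABC; endpoint/port/base_url/complete
# mirror the PR's class diagram, but details are illustrative.
class LLMClient(ABC):
    def __init__(self, endpoint: str = "localhost", port: int = 8000):
        self.endpoint = endpoint
        self.port = port

    @property
    def base_url(self) -> str:
        # OpenAI-compatible servers conventionally serve under /v1.
        return f"http://{self.endpoint}:{self.port}/v1"

    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        """Return the model's text completion for `prompt`."""

class EchoClient(LLMClient):
    """Toy concrete client used only to exercise the ABC."""
    def complete(self, prompt: str, **kwargs) -> str:
        return f"echo: {prompt}"
```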
meta-cla bot added the CLA Signed label Feb 17, 2026
greptile-apps bot (Contributor) commented Feb 17, 2026

Greptile Summary

This PR adds evaluation infrastructure to OpenEnv across three areas: an EvalHarness ABC with EvalConfig/EvalResult Pydantic models, a LightEvalHarness wrapping HuggingFace's LightEval Pipeline API (5 backends: transformers, vllm, tgi, openai, sglang), and an LLMClient abstraction with OpenAIClient implementation powering the new LLMJudge rubric. The changes align with RFC 004's design for rubric-based reward computation.

  • LLMJudge integrates cleanly with the existing Rubric hierarchy (hooks, WeightedSum composition, state_dict serialization)
  • LightEvalHarness uses deferred imports (matching the DaytonaProvider pattern) so it's importable without lighteval installed
  • Bug: OpenAIClient.complete() can return None when message.content is None from the API, which will cause a TypeError in LLMJudge._parse_score() — see inline comment
  • Dead code: LightEvalHarness.save_details parameter is stored but never wired into the pipeline
  • Test coverage is thorough (58 eval tests, LLMClient tests, LLMJudge tests with mocked dependencies)

Confidence Score: 4/5

  • This PR is safe to merge after addressing the None-return bug in OpenAIClient.complete() and the unused save_details parameter.
  • The architecture is clean, follows project patterns (deferred imports, Pydantic models, ABC hierarchy), and has thorough test coverage. The two issues found are a potential None return type violation in OpenAIClient.complete() that could cause runtime errors downstream, and a dead save_details parameter. Neither is a blocker for merge, but the None return should be fixed before the LLMJudge is used in production.
  • src/openenv/core/llm_client.py (potential None return from complete()), src/openenv/core/evals/lighteval.py (unused save_details and unconditional save_and_push_results())

Important Files Changed

Filename Overview
pyproject.toml Adds lighteval optional dependency group with lighteval[accelerate]>=0.13.0. Straightforward addition following the existing daytona pattern.
src/openenv/core/__init__.py Re-exports LLMClient and OpenAIClient from core. Follows existing pattern for other client types. The openai package is already a core dependency.
src/openenv/core/evals/base.py Clean ABC with run() and run_from_config() convenience method. Simple and well-structured.
src/openenv/core/evals/types.py Pydantic models EvalConfig and EvalResult with extra="forbid" and validate_assignment=True. Follows project conventions for wire types.
src/openenv/core/evals/lighteval.py LightEval wrapper with deferred imports. Two issues: save_details parameter is accepted but never used (dead code), and save_and_push_results() is called unconditionally which may cause unwanted side effects.
src/openenv/core/llm_client.py LLM client ABC and OpenAI implementation. Bug: complete() can return None when message.content is None from the API, violating the str return type and causing downstream TypeError in LLMJudge._parse_score().
src/openenv/core/rubrics/llm_judge.py LLM-as-judge rubric with async forward(), prompt templating, regex score parsing, and state_dict/load_state_dict. Clean implementation that integrates with existing Rubric hierarchy and hook system.
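The base.py and types.py pieces from the table above fit together roughly as in this sketch, which uses dataclasses as lightweight stand-ins for the actual Pydantic models (field names follow the class diagram; DummyHarness is hypothetical):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict

# Dataclass stand-ins for the Pydantic wire types (the real models also set
# extra="forbid" and validate_assignment=True).
@dataclass
class EvalConfig:
    harness_name: str
    harness_version: str = "0"
    library_versions: Dict[str, str] = field(default_factory=dict)
    dataset: str = ""
    eval_parameters: Dict[str, Any] = field(default_factory=dict)

@dataclass
class EvalResult:
    config: EvalConfig
    scores: Dict[str, float]

class EvalHarness(ABC):
    """Abstract harness: subclasses implement run(); run_from_config() wraps it."""

    name: str = "base"

    @abstractmethod
    def run(self, harness_version, library_versions, dataset, eval_parameters) -> Dict[str, float]:
        ...

    def run_from_config(self, config: EvalConfig) -> EvalResult:
        # Convenience path: unpack the typed config, run, re-wrap as a result.
        scores = self.run(
            harness_version=config.harness_version,
            library_versions=config.library_versions,
            dataset=config.dataset,
            eval_parameters=config.eval_parameters,
        )
        return EvalResult(config=config, scores=scores)

class DummyHarness(EvalHarness):
    """Hypothetical harness used only to show the flow."""
    name = "dummy"

    def run(self, **kwargs) -> Dict[str, float]:
        return {"accuracy": 1.0}
```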

Class Diagram

classDiagram
    class EvalHarness {
        <<abstract>>
        +run(harness_version, library_versions, dataset, eval_parameters) Dict
        +run_from_config(config: EvalConfig) EvalResult
        +name: str
    }

    class LightEvalHarness {
        +output_dir: str
        +save_details: bool
        +run(...) Dict
        -_build_model_config(backend, model_name, model_args) Any
    }

    class EvalConfig {
        +harness_name: str
        +harness_version: str
        +library_versions: Dict
        +dataset: str
        +eval_parameters: Dict
    }

    class EvalResult {
        +config: EvalConfig
        +scores: Dict
    }

    class LLMClient {
        <<abstract>>
        +endpoint: str
        +port: int
        +complete(prompt, **kwargs) str
        +base_url: str
    }

    class OpenAIClient {
        +model: str
        +system_prompt: str
        +temperature: float
        +max_tokens: int
        +complete(prompt, **kwargs) str
    }

    class Rubric {
        <<abstract>>
        +forward(action, observation) float
    }

    class LLMJudge {
        +prompt_template: str
        +default_score: float
        +normalize: bool
        +forward(action, observation) float
        -_render_prompt(action, observation) str
        -_parse_score(response) float
        +state_dict() Dict
        +load_state_dict(state) void
    }

    EvalHarness <|-- LightEvalHarness
    EvalHarness ..> EvalConfig : uses
    EvalHarness ..> EvalResult : produces
    EvalResult o-- EvalConfig
    LLMClient <|-- OpenAIClient
    Rubric <|-- LLMJudge
    LLMJudge --> LLMClient : uses

Last reviewed commit: cde69f7

greptile-apps bot (Contributor) left a comment


15 files reviewed, 3 comments


Comment on lines 50 to 53 of src/openenv/core/evals/lighteval.py
    save_details: bool = False,
):
    self.output_dir = output_dir
    self.save_details = save_details
greptile-apps bot (Contributor) commented:

save_details is stored but never used

self.save_details is set in __init__ but never referenced in run() or _build_model_config(). It is not passed to PipelineParameters, EnvConfig, or any other LightEval API call. This means the parameter has no effect regardless of its value.

Either wire it into the pipeline configuration (likely as a kwarg to PipelineParameters or Pipeline) or remove it to avoid misleading callers.


Darktex (Author) replied:

Fixed in cb9ba18. save_and_push_results() is now gated on self.save_details, so the parameter controls whether results are persisted. Tests verify both paths (test_evaluate_called_but_save_skipped_by_default and test_save_called_when_save_details_enabled).

    temperature=kwargs.get("temperature", self.temperature),
    max_tokens=kwargs.get("max_tokens", self.max_tokens),
)
return response.choices[0].message.content
greptile-apps bot (Contributor) commented on src/openenv/core/llm_client.py line 114:

complete() can return None instead of str

response.choices[0].message.content can be None when the OpenAI API returns a tool-call-only response or when the model produces no text content. The return type annotation is str, but this line will return None in those cases, which would cause a TypeError downstream (e.g., in LLMJudge._parse_score() which calls self._score_pattern.search(response) on the result).

Suggested change
return response.choices[0].message.content
return response.choices[0].message.content or ""

Darktex (Author) replied:

Fixed in cb9ba18. complete() now returns response.choices[0].message.content or "", ensuring the -> str contract is satisfied even when the API returns None content.

    model_config=model_config,
)
pipeline.evaluate()
pipeline.save_and_push_results()
greptile-apps bot (Contributor) commented on src/openenv/core/evals/lighteval.py line 126:

Unconditional save_and_push_results() may cause side effects

pipeline.save_and_push_results() is called unconditionally on every run(). Depending on the LightEval configuration, this could write files to disk or push results to a remote hub. Users who just want to retrieve scores without side effects (e.g., in tests or CI) would need to work around this.

Consider making this conditional, e.g. controlled by the save_details parameter that is already accepted but unused:

if self.save_details:
    pipeline.save_and_push_results()

Darktex (Author) replied:

Fixed in cb9ba18. save_and_push_results() is now conditional on self.save_details (defaults to False), so users who just want scores get no side effects.
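The resulting control flow can be illustrated with a small stub (PipelineStub and run_pipeline are hypothetical stand-ins for the LightEval Pipeline and LightEvalHarness.run(); they are not the real API):

```python
class PipelineStub:
    """Stand-in that records whether results were persisted."""
    def __init__(self):
        self.saved = False

    def evaluate(self):
        return {"acc": 1.0}

    def save_and_push_results(self):
        self.saved = True

def run_pipeline(pipeline, save_details: bool = False):
    results = pipeline.evaluate()
    if save_details:  # gate the side effect behind the flag (default off)
        pipeline.save_and_push_results()
    return results
```

With the default `save_details=False`, callers get scores back with no writes to disk or pushes to the hub.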

Copilot AI (Contributor) left a comment

Pull request overview

Adds an evaluation foundation to openenv.core (typed config/result + harness abstraction) and introduces an LLM-based rubric workflow by adding an LLM client abstraction and an LLM-as-a-judge rubric.

Changes:

  • Introduce eval primitives (EvalHarness, EvalConfig, EvalResult) and a LightEvalHarness integration with deferred lighteval imports.
  • Add LLMClient abstraction with an OpenAIClient implementation for OpenAI-compatible endpoints.
  • Add LLMJudge rubric plus comprehensive tests for evals, the LLM client, and the rubric.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pyproject.toml Adds lighteval optional dependency group.
src/openenv/core/__init__.py Re-exports LLMClient / OpenAIClient from openenv.core.
src/openenv/core/llm_client.py Introduces LLMClient ABC and OpenAIClient implementation.
src/openenv/core/evals/__init__.py Exposes eval types + LightEvalHarness from openenv.core.evals.
src/openenv/core/evals/base.py Adds EvalHarness ABC and run_from_config() helper.
src/openenv/core/evals/types.py Adds Pydantic models EvalConfig and EvalResult.
src/openenv/core/evals/lighteval.py Implements LightEvalHarness with deferred imports + backend mapping.
src/openenv/core/rubrics/__init__.py Re-exports LLMJudge.
src/openenv/core/rubrics/llm_judge.py Adds LLM-as-a-judge rubric with prompt rendering, parsing, and state (de)serialization.
tests/core/test_llm_client.py Unit tests for LLMClient ABC + OpenAIClient.
tests/core/test_rubrics/test_llm_judge.py Unit tests for prompt rendering, parsing, hooks, containers, and state_dict.
tests/core/test_evals/__init__.py Test package init for eval tests.
tests/core/test_evals/test_eval_harness.py Tests for the EvalHarness ABC and run_from_config().
tests/core/test_evals/test_eval_types.py Tests for EvalConfig / EvalResult validation + equality.
tests/core/test_evals/test_lighteval_harness.py Tests LightEvalHarness with mocked lighteval modules.


    temperature=kwargs.get("temperature", self.temperature),
    max_tokens=kwargs.get("max_tokens", self.max_tokens),
)
return response.choices[0].message.content
Copilot AI commented Feb 17, 2026

OpenAIClient.complete() assumes response.choices[0].message.content is always present. OpenAI-compatible APIs can return an empty choices list or content=None (e.g., tool calls / refusal), which would raise IndexError or violate the -> str contract. Consider validating the response and raising a clear error (or returning a fallback) when no text content is available.

Suggested change
return response.choices[0].message.content
# Validate that the response contains at least one choice with non-empty content.
choices = getattr(response, "choices", None)
if not choices:
    raise ValueError("LLM completion returned no choices; cannot extract text response.")
first_choice = choices[0]
message = getattr(first_choice, "message", None)
content = getattr(message, "content", None) if message is not None else None
if content is None:
    raise ValueError("LLM completion choice has no message content; cannot extract text response.")
# Preserve existing behavior: assume content is text, but ensure we return a string.
return content if isinstance(content, str) else str(content)

Darktex (Author) replied:

Fixed in cb9ba18. We went with return response.choices[0].message.content or "" rather than the more verbose validation — keeps the code simple and satisfies the -> str contract. LLMJudge._parse_score() already handles empty strings gracefully (returns default_score).
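The interplay between the `or ""` fix and the parser's fallback can be sketched like this (the regex and function are illustrative, not the actual LLMJudge code):

```python
import re

# Hypothetical score pattern: matches e.g. "Score: 0.8" or "rating 4".
SCORE_PATTERN = re.compile(r"(?:score|rating)[:\s]*([0-9]*\.?[0-9]+)", re.IGNORECASE)

def parse_score(response: str, default_score: float = 0.0) -> float:
    match = SCORE_PATTERN.search(response)
    if match is None:
        return default_score  # an empty string from complete() lands here
    try:
        return float(match.group(1))
    except ValueError:
        return default_score
```

So a `None` content normalized to `""` by complete() simply yields `default_score` instead of a TypeError.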

Comment on lines 111 to 114
pipeline_params = PipelineParameters(
    launcher_type=launcher_type,
    env_config=EnvConfig(cache_dir=self.output_dir),
    custom_tasks_directory=custom_tasks_directory,
Copilot AI commented Feb 17, 2026

output_dir is documented as an output directory, but in the current implementation it is only passed to EnvConfig(cache_dir=...). Either rename/update the docs to reflect that this is a cache dir, or wire it into LightEval's output/result directory configuration.

Darktex (Author) replied:

Good catch on the naming. In LightEval, EnvConfig.cache_dir serves as both the cache and output directory for results. We kept output_dir as the public parameter name since it better describes user intent (where outputs go), while cache_dir is the LightEval-internal name. The docstring says "Directory for evaluation outputs" which is accurate.

Comment on lines 46 to 54
def __init__(
    self,
    *,
    output_dir: Optional[str] = None,
    save_details: bool = False,
):
    self.output_dir = output_dir
    self.save_details = save_details

Copilot AI commented Feb 17, 2026

save_details is accepted and stored on the harness, but it is not referenced anywhere in run(). This makes the public API misleading; either remove the arg for now or plumb it into the LightEval pipeline configuration / save behavior.

Darktex (Author) replied:

Fixed in cb9ba18. save_details now gates pipeline.save_and_push_results() — when False (default), results are only returned in-memory without writing to disk or pushing to hub.

Comment on lines 87 to 91
match = self._score_pattern.search(response)
if match is None:
    return self.default_score
try:
    score = float(match.group(1))
Copilot AI commented Feb 17, 2026

_parse_score() always reads match.group(1). If a caller supplies a score_pattern without a capturing group, parsing will always fail and silently fall back to default_score. Either document that the first capturing group is required, or support patterns without groups by using group(1) when available else group(0).

Darktex (Author) replied:

Fixed in 6e11d8f (merge from main). _parse_score() now uses match.group(1) if match.lastindex else match.group(0), so patterns without a capture group fall back to the full match instead of silently returning default_score.
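The `lastindex` fallback behaves as in this small sketch (extract() is a hypothetical helper, not the PR's code):

```python
import re

def extract(pattern: str, text: str):
    """Return group(1) when the pattern has a capturing group, else the full match."""
    match = re.compile(pattern).search(text)
    if match is None:
        return None
    # match.lastindex is None when the pattern has no capturing groups.
    return match.group(1) if match.lastindex else match.group(0)
```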

- Gate pipeline.save_and_push_results() on save_details flag so it only
  runs when explicitly requested (fixes unused parameter bug)
- Return empty string from OpenAIClient.complete() when content is None,
  satisfying the -> str contract
…l-support

# Conflicts:
#	src/openenv/core/llm_client.py
#	src/openenv/core/rubrics/llm_judge.py
LightEval authors are migrating to Inspect AI. Replace the concrete
harness implementation with InspectAIHarness wrapping inspect_ai.eval().
The EvalHarness ABC, EvalConfig, and EvalResult remain unchanged.

Development

Successfully merging this pull request may close these issues.

Add support for Evals

2 participants