
Add eval support: EvalHarness ABC, LightEvalHarness, LLMClient, LLMJudge#388

Open
Darktex wants to merge 6 commits into main from feature/issue-384-eval-support

Conversation

Darktex (Contributor) commented Feb 17, 2026

Summary

  • Adds EvalHarness ABC, EvalConfig, and EvalResult Pydantic models in src/openenv/core/evals/ as the foundation for evaluation support
  • Adds LightEvalHarness(EvalHarness) wrapping HuggingFace's LightEval Pipeline API with deferred imports (DaytonaProvider pattern) — importable without lighteval installed
  • Adds LLMClient ABC and OpenAIClient implementation in src/openenv/core/llm_client.py
  • Adds LLMJudge rubric in src/openenv/core/rubrics/llm_judge.py
  • Adds lighteval optional dependency group to pyproject.toml
  • 5 backends supported: transformers, vllm, tgi, openai, sglang

Closes #384
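The deferred-import behavior called out above can be sketched as follows. This is an illustrative reconstruction, not the actual LightEvalHarness code; the class name and install hint are assumptions based on the optional dependency group:

```python
# Illustrative sketch of the deferred-import (DaytonaProvider-style) pattern:
# the module imports cleanly without lighteval installed; the dependency is
# only resolved when run() is actually called.
class DeferredImportHarness:
    def run(self, **eval_parameters):
        try:
            # Deferred: lighteval is required at run time, not at import time.
            from lighteval.pipeline import Pipeline  # noqa: F401
        except ImportError as exc:
            raise ImportError(
                "This harness requires lighteval. "
                "Install it with: pip install 'openenv[lighteval]'"
            ) from exc
        # ... build the pipeline, evaluate, and return scores here ...
        return {}
```

The import cost (and the hard dependency) is paid only inside run(), so importing the module never fails when lighteval is absent.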

Test plan

  • 58 eval tests pass (tests/core/test_evals/) — harness ABC, types, and LightEvalHarness with mocked lighteval
  • LLMClient and LLMJudge tests pass (tests/core/test_llm_client.py, tests/core/test_rubrics/test_llm_judge.py)
  • Full test suite passes (876 passed, 94 skipped, pre-existing websocket port conflict excluded)
  • Lint and format clean

Generic RPC abstraction for LLM endpoints with an OpenAI-compatible implementation. The LLMJudge rubric uses this client for LLM-as-a-judge reward computation with configurable prompt templates and score parsing.

Concrete EvalHarness subclass with deferred lighteval imports (DaytonaProvider pattern) so the class is importable without lighteval installed. Supports transformers/vllm/tgi/openai/sglang backends. Adds the lighteval optional dependency group to pyproject.toml.
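As a rough sketch of the client abstraction these commits describe (attribute names follow the PR's class diagram; EchoClient, the defaults, and the URL scheme are hypothetical):

```python
from abc import ABC, abstractmethod

# Hypothetical sketch of the LLMClient ABC; endpoint/port/base_url/complete
# mirror the PR's class diagram, but details are illustrative.
class LLMClient(ABC):
    def __init__(self, endpoint: str = "localhost", port: int = 8000):
        self.endpoint = endpoint
        self.port = port

    @property
    def base_url(self) -> str:
        # OpenAI-compatible servers conventionally serve under /v1.
        return f"http://{self.endpoint}:{self.port}/v1"

    @abstractmethod
    def complete(self, prompt: str, **kwargs) -> str:
        """Return the model's text completion for `prompt`."""

class EchoClient(LLMClient):
    """Toy concrete client used only to exercise the ABC."""
    def complete(self, prompt: str, **kwargs) -> str:
        return f"echo: {prompt}"
```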
meta-cla bot added the CLA Signed label Feb 17, 2026
greptile-apps bot (Contributor) commented Feb 17, 2026

Greptile Summary

This PR adds evaluation infrastructure to OpenEnv across three areas: an EvalHarness ABC with EvalConfig/EvalResult Pydantic models, a LightEvalHarness wrapping HuggingFace's LightEval Pipeline API (5 backends: transformers, vllm, tgi, openai, sglang), and an LLMClient abstraction with OpenAIClient implementation powering the new LLMJudge rubric. The changes align with RFC 004's design for rubric-based reward computation.

  • LLMJudge integrates cleanly with the existing Rubric hierarchy (hooks, WeightedSum composition, state_dict serialization)
  • LightEvalHarness uses deferred imports (matching the DaytonaProvider pattern) so it's importable without lighteval installed
  • Bug: OpenAIClient.complete() can return None when message.content is None from the API, which will cause a TypeError in LLMJudge._parse_score() — see inline comment
  • Dead code: LightEvalHarness.save_details parameter is stored but never wired into the pipeline
  • Test coverage is thorough (58 eval tests, LLMClient tests, LLMJudge tests with mocked dependencies)

Confidence Score: 4/5

  • This PR is safe to merge after addressing the None-return bug in OpenAIClient.complete() and the unused save_details parameter.
  • The architecture is clean, follows project patterns (deferred imports, Pydantic models, ABC hierarchy), and has thorough test coverage. The two issues found are a potential None return type violation in OpenAIClient.complete() that could cause runtime errors downstream, and a dead save_details parameter. Neither is a blocker for merge, but the None return should be fixed before the LLMJudge is used in production.
  • src/openenv/core/llm_client.py (potential None return from complete()), src/openenv/core/evals/lighteval.py (unused save_details and unconditional save_and_push_results())

Important Files Changed

Filename Overview
pyproject.toml Adds lighteval optional dependency group with lighteval[accelerate]>=0.13.0. Straightforward addition following the existing daytona pattern.
src/openenv/core/__init__.py Re-exports LLMClient and OpenAIClient from core. Follows existing pattern for other client types. The openai package is already a core dependency.
src/openenv/core/evals/base.py Clean ABC with run() and run_from_config() convenience method. Simple and well-structured.
src/openenv/core/evals/types.py Pydantic models EvalConfig and EvalResult with extra="forbid" and validate_assignment=True. Follows project conventions for wire types.
src/openenv/core/evals/lighteval.py LightEval wrapper with deferred imports. Two issues: save_details parameter is accepted but never used (dead code), and save_and_push_results() is called unconditionally which may cause unwanted side effects.
src/openenv/core/llm_client.py LLM client ABC and OpenAI implementation. Bug: complete() can return None when message.content is None from the API, violating the str return type and causing downstream TypeError in LLMJudge._parse_score().
src/openenv/core/rubrics/llm_judge.py LLM-as-judge rubric with async forward(), prompt templating, regex score parsing, and state_dict/load_state_dict. Clean implementation that integrates with existing Rubric hierarchy and hook system.
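The base.py and types.py pieces from the table above fit together roughly as in this sketch, which uses dataclasses as lightweight stand-ins for the actual Pydantic models (field names follow the class diagram; DummyHarness is hypothetical):

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict

# Dataclass stand-ins for the Pydantic wire types (the real models also set
# extra="forbid" and validate_assignment=True).
@dataclass
class EvalConfig:
    harness_name: str
    harness_version: str = "0"
    library_versions: Dict[str, str] = field(default_factory=dict)
    dataset: str = ""
    eval_parameters: Dict[str, Any] = field(default_factory=dict)

@dataclass
class EvalResult:
    config: EvalConfig
    scores: Dict[str, float]

class EvalHarness(ABC):
    """Abstract harness: subclasses implement run(); run_from_config() wraps it."""

    name: str = "base"

    @abstractmethod
    def run(self, harness_version, library_versions, dataset, eval_parameters) -> Dict[str, float]:
        ...

    def run_from_config(self, config: EvalConfig) -> EvalResult:
        # Convenience path: unpack the typed config, run, re-wrap as a result.
        scores = self.run(
            harness_version=config.harness_version,
            library_versions=config.library_versions,
            dataset=config.dataset,
            eval_parameters=config.eval_parameters,
        )
        return EvalResult(config=config, scores=scores)

class DummyHarness(EvalHarness):
    """Hypothetical harness used only to show the flow."""
    name = "dummy"

    def run(self, **kwargs) -> Dict[str, float]:
        return {"accuracy": 1.0}
```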

Class Diagram

classDiagram
    class EvalHarness {
        <<abstract>>
        +run(harness_version, library_versions, dataset, eval_parameters) Dict
        +run_from_config(config: EvalConfig) EvalResult
        +name: str
    }

    class LightEvalHarness {
        +output_dir: str
        +save_details: bool
        +run(...) Dict
        -_build_model_config(backend, model_name, model_args) Any
    }

    class EvalConfig {
        +harness_name: str
        +harness_version: str
        +library_versions: Dict
        +dataset: str
        +eval_parameters: Dict
    }

    class EvalResult {
        +config: EvalConfig
        +scores: Dict
    }

    class LLMClient {
        <<abstract>>
        +endpoint: str
        +port: int
        +complete(prompt, **kwargs) str
        +base_url: str
    }

    class OpenAIClient {
        +model: str
        +system_prompt: str
        +temperature: float
        +max_tokens: int
        +complete(prompt, **kwargs) str
    }

    class Rubric {
        <<abstract>>
        +forward(action, observation) float
    }

    class LLMJudge {
        +prompt_template: str
        +default_score: float
        +normalize: bool
        +forward(action, observation) float
        -_render_prompt(action, observation) str
        -_parse_score(response) float
        +state_dict() Dict
        +load_state_dict(state) void
    }

    EvalHarness <|-- LightEvalHarness
    EvalHarness ..> EvalConfig : uses
    EvalHarness ..> EvalResult : produces
    EvalResult o-- EvalConfig
    LLMClient <|-- OpenAIClient
    Rubric <|-- LLMJudge
    LLMJudge --> LLMClient : uses

Last reviewed commit: cde69f7

greptile-apps bot (Contributor) left a comment


15 files reviewed, 3 comments


Comment on lines 50 to 53 of src/openenv/core/evals/lighteval.py
    save_details: bool = False,
):
    self.output_dir = output_dir
    self.save_details = save_details
greptile-apps bot (Contributor) commented:

save_details is stored but never used

self.save_details is set in __init__ but never referenced in run() or _build_model_config(). It is not passed to PipelineParameters, EnvConfig, or any other LightEval API call. This means the parameter has no effect regardless of its value.

Either wire it into the pipeline configuration (likely as a kwarg to PipelineParameters or Pipeline) or remove it to avoid misleading callers.


Darktex (Author) replied:

Fixed in cb9ba18. save_and_push_results() is now gated on self.save_details, so the parameter controls whether results are persisted. Tests verify both paths (test_evaluate_called_but_save_skipped_by_default and test_save_called_when_save_details_enabled).

    temperature=kwargs.get("temperature", self.temperature),
    max_tokens=kwargs.get("max_tokens", self.max_tokens),
)
return response.choices[0].message.content
greptile-apps bot (Contributor) commented on src/openenv/core/llm_client.py line 114:

complete() can return None instead of str

response.choices[0].message.content can be None when the OpenAI API returns a tool-call-only response or when the model produces no text content. The return type annotation is str, but this line will return None in those cases, which would cause a TypeError downstream (e.g., in LLMJudge._parse_score() which calls self._score_pattern.search(response) on the result).

Suggested change
return response.choices[0].message.content
return response.choices[0].message.content or ""

Darktex (Author) replied:

Fixed in cb9ba18. complete() now returns response.choices[0].message.content or "", ensuring the -> str contract is satisfied even when the API returns None content.

    model_config=model_config,
)
pipeline.evaluate()
pipeline.save_and_push_results()
greptile-apps bot (Contributor) commented on src/openenv/core/evals/lighteval.py line 126:

Unconditional save_and_push_results() may cause side effects

pipeline.save_and_push_results() is called unconditionally on every run(). Depending on the LightEval configuration, this could write files to disk or push results to a remote hub. Users who just want to retrieve scores without side effects (e.g., in tests or CI) would need to work around this.

Consider making this conditional, e.g. controlled by the save_details parameter that is already accepted but unused:

if self.save_details:
    pipeline.save_and_push_results()

Darktex (Author) replied:

Fixed in cb9ba18. save_and_push_results() is now conditional on self.save_details (defaults to False), so users who just want scores get no side effects.
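The resulting control flow can be illustrated with a small stub (PipelineStub and run_pipeline are hypothetical stand-ins for the LightEval Pipeline and LightEvalHarness.run(); they are not the real API):

```python
class PipelineStub:
    """Stand-in that records whether results were persisted."""
    def __init__(self):
        self.saved = False

    def evaluate(self):
        return {"acc": 1.0}

    def save_and_push_results(self):
        self.saved = True

def run_pipeline(pipeline, save_details: bool = False):
    results = pipeline.evaluate()
    if save_details:  # gate the side effect behind the flag (default off)
        pipeline.save_and_push_results()
    return results
```

With the default `save_details=False`, callers get scores back with no writes to disk or pushes to the hub.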

Copilot AI (Contributor) left a comment

Pull request overview

Adds an evaluation foundation to openenv.core (typed config/result + harness abstraction) and introduces an LLM-based rubric workflow by adding an LLM client abstraction and an LLM-as-a-judge rubric.

Changes:

  • Introduce eval primitives (EvalHarness, EvalConfig, EvalResult) and a LightEvalHarness integration with deferred lighteval imports.
  • Add LLMClient abstraction with an OpenAIClient implementation for OpenAI-compatible endpoints.
  • Add LLMJudge rubric plus comprehensive tests for evals, the LLM client, and the rubric.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pyproject.toml Adds lighteval optional dependency group.
src/openenv/core/__init__.py Re-exports LLMClient / OpenAIClient from openenv.core.
src/openenv/core/llm_client.py Introduces LLMClient ABC and OpenAIClient implementation.
src/openenv/core/evals/__init__.py Exposes eval types + LightEvalHarness from openenv.core.evals.
src/openenv/core/evals/base.py Adds EvalHarness ABC and run_from_config() helper.
src/openenv/core/evals/types.py Adds Pydantic models EvalConfig and EvalResult.
src/openenv/core/evals/lighteval.py Implements LightEvalHarness with deferred imports + backend mapping.
src/openenv/core/rubrics/__init__.py Re-exports LLMJudge.
src/openenv/core/rubrics/llm_judge.py Adds LLM-as-a-judge rubric with prompt rendering, parsing, and state (de)serialization.
tests/core/test_llm_client.py Unit tests for LLMClient ABC + OpenAIClient.
tests/core/test_rubrics/test_llm_judge.py Unit tests for prompt rendering, parsing, hooks, containers, and state_dict.
tests/core/test_evals/__init__.py Test package init for eval tests.
tests/core/test_evals/test_eval_harness.py Tests for the EvalHarness ABC and run_from_config().
tests/core/test_evals/test_eval_types.py Tests for EvalConfig / EvalResult validation + equality.
tests/core/test_evals/test_lighteval_harness.py Tests LightEvalHarness with mocked lighteval modules.


    temperature=kwargs.get("temperature", self.temperature),
    max_tokens=kwargs.get("max_tokens", self.max_tokens),
)
return response.choices[0].message.content
Copilot AI commented Feb 17, 2026

OpenAIClient.complete() assumes response.choices[0].message.content is always present. OpenAI-compatible APIs can return an empty choices list or content=None (e.g., tool calls / refusal), which would raise IndexError or violate the -> str contract. Consider validating the response and raising a clear error (or returning a fallback) when no text content is available.

Suggested change
return response.choices[0].message.content
# Validate that the response contains at least one choice with non-empty content.
choices = getattr(response, "choices", None)
if not choices:
    raise ValueError("LLM completion returned no choices; cannot extract text response.")
first_choice = choices[0]
message = getattr(first_choice, "message", None)
content = getattr(message, "content", None) if message is not None else None
if content is None:
    raise ValueError("LLM completion choice has no message content; cannot extract text response.")
# Preserve existing behavior: assume content is text, but ensure we return a string.
return content if isinstance(content, str) else str(content)

Darktex (Author) replied:

Fixed in cb9ba18. We went with return response.choices[0].message.content or "" rather than the more verbose validation — keeps the code simple and satisfies the -> str contract. LLMJudge._parse_score() already handles empty strings gracefully (returns default_score).
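The interplay between the `or ""` fix and the parser's fallback can be sketched like this (the regex and function are illustrative, not the actual LLMJudge code):

```python
import re

# Hypothetical score pattern: matches e.g. "Score: 0.8" or "rating 4".
SCORE_PATTERN = re.compile(r"(?:score|rating)[:\s]*([0-9]*\.?[0-9]+)", re.IGNORECASE)

def parse_score(response: str, default_score: float = 0.0) -> float:
    match = SCORE_PATTERN.search(response)
    if match is None:
        return default_score  # an empty string from complete() lands here
    try:
        return float(match.group(1))
    except ValueError:
        return default_score
```

So a `None` content normalized to `""` by complete() simply yields `default_score` instead of a TypeError.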

Comment on lines 111 to 114
pipeline_params = PipelineParameters(
    launcher_type=launcher_type,
    env_config=EnvConfig(cache_dir=self.output_dir),
    custom_tasks_directory=custom_tasks_directory,
Copilot AI commented Feb 17, 2026

output_dir is documented as an output directory, but in the current implementation it is only passed to EnvConfig(cache_dir=...). Either rename/update the docs to reflect that this is a cache dir, or wire it into LightEval's output/result directory configuration.

Darktex (Author) replied:

Good catch on the naming. In LightEval, EnvConfig.cache_dir serves as both the cache and output directory for results. We kept output_dir as the public parameter name since it better describes user intent (where outputs go), while cache_dir is the LightEval-internal name. The docstring says "Directory for evaluation outputs" which is accurate.

Comment on lines 46 to 54
def __init__(
    self,
    *,
    output_dir: Optional[str] = None,
    save_details: bool = False,
):
    self.output_dir = output_dir
    self.save_details = save_details

Copilot AI commented Feb 17, 2026

save_details is accepted and stored on the harness, but it is not referenced anywhere in run(). This makes the public API misleading; either remove the arg for now or plumb it into the LightEval pipeline configuration / save behavior.

Darktex (Author) replied:

Fixed in cb9ba18. save_details now gates pipeline.save_and_push_results() — when False (default), results are only returned in-memory without writing to disk or pushing to hub.

Comment on lines 87 to 91
match = self._score_pattern.search(response)
if match is None:
    return self.default_score
try:
    score = float(match.group(1))
Copilot AI commented Feb 17, 2026

_parse_score() always reads match.group(1). If a caller supplies a score_pattern without a capturing group, parsing will always fail and silently fall back to default_score. Either document that the first capturing group is required, or support patterns without groups by using group(1) when available else group(0).

Darktex (Author) replied:

Fixed in 6e11d8f (merge from main). _parse_score() now uses match.group(1) if match.lastindex else match.group(0), so patterns without a capture group fall back to the full match instead of silently returning default_score.
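The `lastindex` fallback behaves as in this small sketch (extract() is a hypothetical helper, not the PR's code):

```python
import re

def extract(pattern: str, text: str):
    """Return group(1) when the pattern has a capturing group, else the full match."""
    match = re.compile(pattern).search(text)
    if match is None:
        return None
    # match.lastindex is None when the pattern has no capturing groups.
    return match.group(1) if match.lastindex else match.group(0)
```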

- Gate pipeline.save_and_push_results() on save_details flag so it only
  runs when explicitly requested (fixes unused parameter bug)
- Return empty string from OpenAIClient.complete() when content is None,
  satisfying the -> str contract
…l-support

# Conflicts:
#	src/openenv/core/llm_client.py
#	src/openenv/core/rubrics/llm_judge.py
LightEval authors are migrating to Inspect AI. Replace the concrete
harness implementation with InspectAIHarness wrapping inspect_ai.eval().
The EvalHarness ABC, EvalConfig, and EvalResult remain unchanged.

Development

Successfully merging this pull request may close these issues.

Add support for Evals

2 participants