
Add per-input trialCount support to Eval() #1341

Open

dflynn15 wants to merge 1 commit into braintrustdata:main from dflynn15:feature/js-per-input-trial-count

Conversation

@dflynn15 dflynn15 commented Feb 5, 2026

Why?

Braintrust's Eval() function supports a trialCount parameter that runs each input multiple times to measure variance in non-deterministic LLM outputs. However, this setting applies globally to all inputs, which creates friction in several evaluation workflows. For example:

  1. Targeted Debugging is Expensive: When investigating a single flaky test case, you want to run it 10-20 times to understand the variance pattern. With global trialCount, this means running your entire suite 10-20 times, multiplying costs and wait time unnecessarily.

  2. Mixed Determinism is Common: Real evaluation suites contain a mix of deterministic scenarios (math problems, factual lookups) and non-deterministic ones (creative writing, open-ended reasoning). Forcing the same trial count on both wastes resources.

  3. Cost Scales Linearly: Every additional trial means another LLM API call. A global trialCount: 5 on a 100-item dataset means 500 API calls, even if only 10 items actually need variance analysis.

To address this, we've built a custom solution that I want to propose as a contribution. Specifically, it allows each data item to specify its own trialCount, overriding the global default. This gives users fine-grained control over where to invest their evaluation budget.


What?

Eval("My Project", {
  data: [
    { input: "stable query", expected: "..." },                    // Uses global (3)
    { input: "flaky query", expected: "...", trialCount: 10 },     // Override to 10
    { input: "deterministic", expected: "...", trialCount: 1 },    // Override to 1
  ],
  task: myTask,
  scores: [Factuality],
  trialCount: 3, // Global default
});
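
With this configuration, the suite performs 3 + 10 + 1 = 14 task runs instead of the 9 (3 items × 3 trials) that a purely global trialCount: 3 would produce, concentrating the extra budget on the flaky case.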

There is a corresponding Python PR to match it here: #1342

Allow data items to specify their own trialCount, overriding the global
evaluator setting. This enables targeted debugging of flaky test cases
and mixed determinism scenarios without multiplying the entire suite.

- Add optional `trialCount` field to `EvalCase` type
- Per-item trialCount takes precedence over global trialCount
- Items without trialCount use the global value (or 1 if neither is set); see the sketch below
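
A minimal sketch of that precedence rule, assuming simplified types: the SDK's real EvalCase type carries more fields, and resolveTrialCount is a hypothetical helper name used here for illustration, not the SDK's internal function.

// Simplified stand-in for the SDK's EvalCase type.
interface EvalCase<Input, Expected> {
  input: Input;
  expected?: Expected;
  trialCount?: number; // optional per-item override
}

// Hypothetical helper: the per-item trialCount wins, then the
// global setting, then a default of 1.
function resolveTrialCount(
  item: EvalCase<unknown, unknown>,
  globalTrialCount?: number
): number {
  return item.trialCount ?? globalTrialCount ?? 1;
}

resolveTrialCount({ input: "flaky query", trialCount: 10 }, 3); // => 10
resolveTrialCount({ input: "stable query" }, 3);                // => 3
resolveTrialCount({ input: "stable query" });                   // => 1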