Evals

Stop shipping prompt regressions blind.

Attach a golden set of cases to any AI Gateway endpoint. Every config change re-runs them automatically; the diff shows which cases regressed, which got fixed, and the per-case scores. Evals are first-class data in the gateway, not a side project.

Coming soon · Eval docs ↗
WHY THIS MATTERS

Prompt edits regress quality silently.

Tweaking a system prompt to fix one customer's case usually breaks two others, and you only find out when the next ticket comes in. Switching providers because "the new one is cheaper" works great until edge cases that passed on the old model start failing. Bumping a model version inherits every upstream behaviour change.

Endpoint evals catch regressions before the change ships. A case is just a request body plus an expected outcome. Attach a few dozen to your endpoint, save a config change, and the eval runs automatically. The run summary shows you "five cases regressed, two got fixed, the rest unchanged", with per-case diffs.

  • Regression detection on every prompt or config change
  • Same eval re-runs after a model upgrade or provider switch
  • Side-by-side diff vs the previous run, per case
  • Score lives next to the endpoint, not in a separate Notion doc
case-shape.json
{
  "name": "customer-vat-extraction",
  "request": {
    "messages": [{
      "role": "user",
      "content": "Invoice from ACME GmbH, VAT-ID DE123456789, total €1240.50"
    }]
  },
  "expect": {
    "json_path": {
      "$.vat_id":   "DE123456789",
      "$.total":    1240.50,
      "$.currency": "EUR"
    }
  },
  "weight": 1.0
}
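As a sketch of how a case like this could be scored, assuming simple `$.key` paths and an equality check per path (the gateway's actual JSON Path support is not specified here, so treat the names and shapes as illustrative):

```python
import json

def resolve_path(doc, path):
    # Minimal resolver for dotted paths like "$.vat_id".
    # (An assumption for illustration; real JSON Path is richer.)
    cur = doc
    for key in path.lstrip("$").lstrip(".").split("."):
        cur = cur[key]
    return cur

def score_case(case, response_text):
    """Fraction of json_path assertions that hold, as a 0..1 score."""
    response = json.loads(response_text)
    expected = case["expect"]["json_path"]
    hits = 0
    for path, want in expected.items():
        try:
            if resolve_path(response, path) == want:
                hits += 1
        except (KeyError, TypeError):
            pass  # missing path counts as a miss
    return hits / len(expected)

case = {"expect": {"json_path": {
    "$.vat_id": "DE123456789", "$.total": 1240.50, "$.currency": "EUR"}}}
reply = '{"vat_id": "DE123456789", "total": 1240.5, "currency": "EUR"}'
print(score_case(case, reply))  # 1.0
```

A case with a missing or mismatched field simply loses that fraction of its score, which is what makes partial credit per case possible.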
run-summary.txt
— Endpoint: /classify-intent —
Run #14   ·   2026-05-09 13:42   ·   42 cases   ·   3m 12s

  Pass:    37   (88%)   ↑ from 34 (81%)   +3 fixed
  Fail:     5   (12%)   ↓ from 8  (19%)   -3 fixed
  Diff:           +3 fixed,  0 new regressions

— Newly passing —
  ✓ refund-request-with-typo            score 1.00 (was 0.40)
  ✓ booking-change-with-tone-shift      score 0.95 (was 0.60)
  ✓ cancellation-non-english            score 1.00 (was 0.20)

— Still failing —
  ✗ multilingual-mixed-CN-EN            score 0.40
  ✗ ambiguous-tone-positive-with-neg    score 0.55
  ...

— Triggered by —
  endpoint config v23 → v24
  diff: prompt template changed (3 lines)
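The header of a summary like the one above is just arithmetic over per-case scores. A minimal sketch, assuming a 0.8 pass threshold (the real pass bar isn't specified in this page):

```python
def summarize(scores, prev_scores, threshold=0.8):
    """Build a pass-count line like the run header.
    `scores` maps case name -> 0..1; threshold is an assumed pass bar."""
    total = len(scores)
    passed = sum(1 for s in scores.values() if s >= threshold)
    prev_passed = sum(1 for s in prev_scores.values() if s >= threshold)
    pct = round(100 * passed / total)
    prev_pct = round(100 * prev_passed / len(prev_scores))
    return f"Pass: {passed}/{total} ({pct}%)  was {prev_passed} ({prev_pct}%)"

scores = {"a": 1.0, "b": 0.95, "c": 0.40}
prev = {"a": 0.40, "b": 0.95, "c": 0.40}
print(summarize(scores, prev))  # Pass: 2/3 (67%)  was 1 (33%)
```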
HOW IT WORKS

Three components — set, run, diff.

  1. Eval set — a JSON file or in-UI table of cases. Each case has a request body, an "expect" assertion (exact match, JSON path match, regex match, contains, LLM-as-judge, or a custom hook), and an optional weight.
  2. Run — every endpoint save creates a new endpoint version, which triggers the attached eval set in the background. Each case fires a real request through the new config; the response is scored against the assertion.
  3. Diff — a run is paired with the previous successful run on the same eval set. The diff shows every case where the score changed, separated into "newly passing" and "newly failing".

Runs are independent of CI — you don't need a separate test pipeline. The eval IS the regression test. Attach an eval set to a customer-critical endpoint and you've got a continuous safety net for every prompt edit.
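Step 3's pairing logic can be sketched in a few lines. The run shape (case name mapped to score) and the 0.8 pass threshold are assumptions for illustration:

```python
def diff_runs(prev, curr, threshold=0.8):
    """Pair two runs of the same eval set and bucket score changes.
    Cases absent from the previous run count as previously failing."""
    newly_passing = sorted(n for n in curr
                           if curr[n] >= threshold and prev.get(n, 0.0) < threshold)
    newly_failing = sorted(n for n in curr
                           if curr[n] < threshold and prev.get(n, 0.0) >= threshold)
    return {"fixed": newly_passing, "regressed": newly_failing}

prev = {"refund-request-with-typo": 0.40, "multilingual-mixed-CN-EN": 0.40}
curr = {"refund-request-with-typo": 1.00, "multilingual-mixed-CN-EN": 0.40}
print(diff_runs(prev, curr))
# {'fixed': ['refund-request-with-typo'], 'regressed': []}
```

A case that fails in both runs lands in neither bucket, which is why "still failing" is reported separately from the diff.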

HOW CASES SCORE

Six assertion types, mix-and-match.

Different cases need different success criteria — pick the cheapest one that's accurate enough.

Exact match

Response equals the expected string. Cheap, deterministic, useful for classification cases with a fixed label set.

JSON path match

Pull values from the response JSON via JSON Path; assert each. The default for any endpoint with an output schema.

Regex match

Pattern-matched against the response text. Good for "the response must mention this keyword" or "must follow this format".

Contains-all / contains-any

A list of substrings; assert that all are present, any is present, or none is present. Easy positive and negative test cases.
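The three string-level assertions above (exact, regex, contains) are cheap enough to sketch together. Function names and parameters here are illustrative, not the gateway's API:

```python
import re

def exact_score(text, expected):
    # Deterministic: the response must equal the expected label.
    return 1.0 if text.strip() == expected else 0.0

def regex_score(text, pattern):
    # Pattern anywhere in the response text counts as a pass.
    return 1.0 if re.search(pattern, text) else 0.0

def contains_score(text, must_all=(), must_any=(), must_none=()):
    # All of must_all, at least one of must_any (if given), none of must_none.
    ok = (all(s in text for s in must_all)
          and (any(s in text for s in must_any) if must_any else True)
          and not any(s in text for s in must_none))
    return 1.0 if ok else 0.0

print(exact_score("refund", "refund"))                       # 1.0
print(regex_score("Total: EUR 1240.50", r"EUR \d+\.\d{2}"))  # 1.0
print(contains_score("a polite reply", must_all=["polite"],
                     must_none=["refuse"]))                  # 1.0
```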

LLM-as-judge

Score with another LLM call against a rubric you write. Most expensive, but unavoidable for "is the tone right?" / "is this a polite reply?" cases.
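A judge is just another model call plus a parse step. A minimal sketch of the two halves around the call itself (the rubric wording and the `Score:` reply format are assumptions; the actual LLM request is elided):

```python
import re

def build_judge_prompt(rubric, candidate):
    # Wrap the grader's rubric and the response under test in one prompt.
    return ("You are grading a model response against a rubric.\n"
            f"Rubric: {rubric}\n"
            f"Response: {candidate}\n"
            "Reply with a single line: Score: <0.0-1.0>")

def parse_judge_score(reply):
    # Pull the first "Score: x" number; clamp defensively into 0..1.
    m = re.search(r"Score:\s*([01](?:\.\d+)?)", reply)
    if not m:
        return 0.0  # unparseable judge output counts as a fail
    return min(1.0, max(0.0, float(m.group(1))))

print(parse_judge_score("Score: 0.85"))  # 0.85
```

Treating an unparseable judge reply as 0.0 keeps a flaky judge from silently inflating pass rates.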

Custom hook

POST the response to a webhook you control; the webhook returns a 0..1 score. Lets you bring your own grader for domain-specific evaluation.
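The caller side of that hook contract could look like this. The request payload shape and the `score` field name are assumptions; only the 0..1 range is stated above:

```python
import json
from urllib import request

def parse_hook_reply(body_text):
    # Expect {"score": 0..1}; clamp out-of-range graders defensively.
    score = float(json.loads(body_text)["score"])
    return min(1.0, max(0.0, score))

def hook_score(url, response_text, timeout=10):
    """POST the model response to a grader webhook and read back a score."""
    payload = json.dumps({"response": response_text}).encode()
    req = request.Request(url, data=payload,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=timeout) as resp:
        return parse_hook_reply(resp.read().decode())
```

Keeping the parse step separate makes the contract easy to unit-test without standing up a webhook.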

Quality gates for prompts.

Endpoint versioning + endpoint evals together: every config change is captured AND scored. Roll back a regression in one click.

Coming soon · All features →