Stop shipping prompt regressions blind.
Attach a golden set of cases to any AI Gateway endpoint. Every config change runs them again automatically; the diff shows you which cases regressed, which got fixed, and the per-case scores. Evals are first-class data in the gateway, not a side project.
Prompt edits regress quality silently.
Tweaking a system prompt to fix one customer's case usually breaks two others, and you only find out when the next ticket comes in. Switching providers because "the new one is cheaper" works great until edge cases that worked on the old model start failing. Bumping a model version inherits all of the upstream's behaviour changes.
Endpoint evals catch regressions before the change ships. A case is just a request body and an expected outcome. Attach a few dozen to your endpoint, save a config change, and the eval runs automatically. The run report shows you "five cases regressed, two got fixed, the rest unchanged", with a per-case diff.
- Regression detection on every prompt or config change
- Same eval re-runs after a model upgrade or provider switch
- Side-by-side diff vs the previous run, per case
- Score lives next to the endpoint, not in a separate Notion doc
{
  "name": "customer-vat-extraction",
  "request": {
    "messages": [{
      "role": "user",
      "content": "Invoice from ACME GmbH, VAT-ID DE123456789, total €1240.50"
    }]
  },
  "expect": {
    "json_path": {
      "$.vat_id": "DE123456789",
      "$.total": 1240.50,
      "$.currency": "EUR"
    }
  },
  "weight": 1.0
}

Endpoint: /classify-intent
Run #14 · 2026-05-09 13:42 · 42 cases · 3m 12s

Pass: 37 (88%)  ↑ from 34 (81%)  (+3 fixed)
Fail:  5 (12%)  ↓ from  8 (19%)  (-3 fixed)
Diff: +3 fixed, 0 new regressions

Newly passing
✓ refund-request-with-typo          score 1.00 (was 0.40)
✓ booking-change-with-tone-shift    score 0.95 (was 0.60)
✓ cancellation-non-english          score 1.00 (was 0.20)

Still failing
✗ multilingual-mixed-CN-EN            score 0.40
✗ ambiguous-tone-positive-with-neg    score 0.55
...

Triggered by
endpoint config v23 → v24
diff: prompt template changed (3 lines)
Three components — set, run, diff.
- Eval set — a JSON file or in-UI table of cases. Each case has a request body, an "expect" assertion (exact match, JSON path, regex, contains, LLM-as-judge, or a custom hook), and an optional weight.
- Run — every endpoint save creates a new endpoint version, which triggers the attached eval set in the background. Each case fires a real request through the new config; the response is scored against the assertion.
- Diff — a run is paired with the previous successful run on the same eval set. The diff shows every case where the score changed, separated into "newly passing" and "newly failing".
Runs are independent of CI — you don't need a separate test pipeline. The eval IS the regression test. Attach an eval set to a customer-critical endpoint and you've got a continuous safety net for every prompt edit.
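For concreteness, here is a rough sketch of an eval set file with two cases attached to /classify-intent. The per-case fields (name, request, expect, weight) and the json_path assertion mirror the example case above; the top-level endpoint and cases keys, the case names, and the weights are illustrative placeholders, not the gateway's documented schema.

{
  "endpoint": "/classify-intent",
  "cases": [
    {
      "name": "refund-request-plain",
      "request": {
        "messages": [{
          "role": "user",
          "content": "I want my money back for order #4412, it never arrived"
        }]
      },
      "expect": {
        "json_path": { "$.intent": "refund_request" }
      },
      "weight": 1.0
    },
    {
      "name": "cancellation-german",
      "request": {
        "messages": [{
          "role": "user",
          "content": "Bitte stornieren Sie meine Buchung für Freitag"
        }]
      },
      "expect": {
        "json_path": { "$.intent": "cancellation" }
      },
      "weight": 0.5
    }
  ]
}

Save the endpoint config and a set like this runs in the background; the next diff tells you whether either case moved.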
Six assertion types, mix-and-match.
Different cases need different success criteria — pick the cheapest one that's accurate enough.
Exact match
Response equals the expected string. Cheap, deterministic, useful for classification cases with a fixed label set.
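A minimal expect block for a fixed-label classifier might look like this; the exact key is an assumed name here (only json_path appears verbatim elsewhere on this page), so check the gateway's schema for the real spelling.

"expect": {
  "exact": "refund_request"
}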
JSON path match
Pull values from the response JSON via JSON Path; assert each. The default for any endpoint with an output schema.
Regex match
Pattern-matched against the response text. Good for "the response must mention this keyword" or "must follow this format".
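A sketch with an assumed regex key; this assertion passes only if the reply cites an order number in the expected #1234 format.

"expect": {
  "regex": "(?i)order\\s+#\\d{4,}"
}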
Contains-all / contains-any
Lists of substrings: assert that all are present, any is present, or none are present. The quick way to write positive/negative test cases.
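A sketch with assumed key names (contains_all, contains_none); it insists the reply repeats the VAT ID and total, and never falls back to a refusal.

"expect": {
  "contains_all": ["DE123456789", "1240.50"],
  "contains_none": ["I am not able to", "as an AI"]
}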
LLM-as-judge
Score with another LLM call against a rubric you write. Most expensive, but unavoidable for "is the tone right?" / "is this a polite reply?" cases.
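A sketch of what a judge assertion could look like; the judge key and the model field are assumptions, and only the rubric text is something you write yourself.

"expect": {
  "judge": {
    "model": "gpt-4o-mini",
    "rubric": "The reply apologises for the delay, stays polite, and offers a concrete next step. Score 1.0 if all three hold, 0.5 if two, otherwise 0."
  }
}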
Custom hook
POST the response to a webhook you control; the webhook returns a 0..1 score. Lets you bring your own grader for domain-specific evaluation.
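A sketch with an assumed webhook key, a placeholder URL, and an assumed timeout_ms field; the gateway POSTs the model response to that URL and reads the 0..1 score from the reply.

"expect": {
  "webhook": {
    "url": "https://graders.example.com/vat-check",
    "timeout_ms": 5000
  }
}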
Quality gates for prompts.
Endpoint versioning + endpoint evals together: every config change is captured AND scored. Roll back a regression in one click.