Smart routing, declarative rules.
Per-endpoint YAML rules decide which provider and which model gets the call — based on input size, schema presence, monthly spend, time of day, or any signal you wire up. Clients keep using one model name; the gateway picks the right backend.
Smart cost control without rewriting every call site.
The economics of LLMs are bimodal: cheap models handle 80% of work fine, expensive models are necessary for the rest. The naive answer — "always use the expensive one" — bleeds money. The slightly-less-naive answer — "let each app pick" — pushes routing logic into every client.
Routing rules invert that: the gateway holds the policy. Clients call model: "smart" or model: "cheap"; the rules behind the alias decide what actually runs. Change the policy and every client follows on its next request — without a single deploy on the app side.
- Long inputs route to a 200K-context model; short ones to a fast one
- JSON-schema enforcement on; route to a model that respects it
- Monthly budget burning fast; downgrade after 80% spend
- Off-hours; cheaper model since latency matters less
```yaml
# Per-endpoint rules — first match wins, fall through to default.
default:
  provider: anthropic
  model: claude-3-5-sonnet
rules:
  # Long inputs → bigger context window
  - if: "input_tokens > 100000"
    use:
      provider: google
      model: gemini-2.5-pro
  # Schema mode — must round-trip clean JSON
  - if: "has_schema"
    use:
      provider: openai
      model: gpt-4o
  # Burn-rate guard — switch to cheaper model past 80% of monthly budget
  - if: "monthly_spend_pct > 0.8"
    use:
      provider: groq
      model: llama-3.3-70b
  # Off-hours batch processing
  - if: "hour_local in [22, 23, 0, 1, 2, 3, 4, 5]"
    use:
      provider: together
      model: deepseek-r1
```
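The evaluation order — first matching rule wins, otherwise the default — can be sketched in a few lines of Python. The predicates here are plain callables standing in for the gateway's `if:` expression language; this is a sketch of the semantics, not the gateway's implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    predicate: Callable[[dict], bool]  # stands in for the "if:" expression
    use: dict                          # provider/model to dispatch to

def route(signals: dict, rules: list[Rule], default: dict) -> dict:
    """First match wins; fall through to the default backend."""
    for rule in rules:
        if rule.predicate(signals):
            return rule.use
    return default

# Predicates mirroring the YAML rules above.
rules = [
    Rule(lambda s: s["input_tokens"] > 100_000,
         {"provider": "google", "model": "gemini-2.5-pro"}),
    Rule(lambda s: s["has_schema"],
         {"provider": "openai", "model": "gpt-4o"}),
    Rule(lambda s: s["monthly_spend_pct"] > 0.8,
         {"provider": "groq", "model": "llama-3.3-70b"}),
    Rule(lambda s: s["hour_local"] in {22, 23, 0, 1, 2, 3, 4, 5},
         {"provider": "together", "model": "deepseek-r1"}),
]
default = {"provider": "anthropic", "model": "claude-3-5-sonnet"}

# A short, schema-free, mid-day request under budget falls through to the default.
print(route({"input_tokens": 1200, "has_schema": False,
             "monthly_spend_pct": 0.3, "hour_local": 14},
            rules, default))
# → {'provider': 'anthropic', 'model': 'claude-3-5-sonnet'}
```

Note that ordering matters: a 200K-token request that also carries a schema hits the long-input rule first, because earlier rules shadow later ones.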
What rules can match on.
Every routing decision lands in the audit log alongside the matched rule, so 'why did it pick that?' is always answerable after the fact.
input_tokens
Token count of the prompt before dispatch. Useful for context-window-aware routing.
has_schema
Boolean — true when the request includes a JSON schema. Route to schema-strict models.
has_tools
Boolean — true when tool definitions are passed. Route to tool-capable models.
monthly_spend_pct
Fraction of the endpoint's monthly USD budget already consumed. Burn-rate-aware downgrades.
time_of_day / hour_local / weekday
Time signals for cost or latency policies that vary across the day.
streaming
Boolean — true when the client requested SSE. Route to streaming-capable models only.
Custom JSONPath signals
Pull any value from the request body — $.metadata.priority, $.user_tier — and match on it.
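A custom signal might be declared and matched like the fragment below. The `signals:` block and the `priority` name are illustrative assumptions, not a documented schema — only the JSONPath expressions come from the text above:

```yaml
# Illustrative only — exact signal-declaration syntax may differ.
signals:
  priority: "$.metadata.priority"   # pulled from the request body
rules:
  - if: "priority == 'high'"
    use:
      provider: anthropic
      model: claude-3-5-sonnet
```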
Failover chain
Independent of rules: every endpoint can list secondary credentials that kick in on 5xx / rate-limit. Logged with the routing decision.
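A failover chain might sit alongside the rules like this. The shape and field names are assumptions for illustration; the documented behavior is only that secondaries take over on 5xx or rate-limit:

```yaml
# Illustrative — field names are assumptions, not a documented schema.
failover:
  - provider: anthropic          # primary credentials
    model: claude-3-5-sonnet
  - provider: openai             # tried on 5xx / rate-limit from the primary
    model: gpt-4o
```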
Move the policy, not the apps.
Edit a rule on the gateway and every client sees it on the next request; the audit log records which rule matched on every call.