ROUTE

Smart routing, declarative rules.

Per-endpoint YAML rules decide which provider and which model gets the call — based on input size, schema presence, monthly spend, time of day, or any signal you wire up. Clients keep using one model name; the gateway picks the right backend.

WHY YOU'D WANT THIS

Smart cost control without rewriting every call site.

The economics of LLMs are bimodal: cheap models handle 80% of work fine, expensive models are necessary for the rest. The naive answer — "always use the expensive one" — bleeds money. The slightly-less-naive answer — "let each app pick" — pushes routing logic into every client.

Routing rules invert that: the gateway holds the policy. Clients call model: "smart" or model: "cheap"; the rules behind the alias decide what actually runs. Change the policy and every client follows immediately, without a single deploy on the app side.
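The invariant is easiest to see from the client side. A minimal Python sketch, assuming an OpenAI-style chat payload (the message content and field shapes here are illustrative, not the product's documented API):

```python
import json

# The client names only the alias -- never a provider or a concrete model.
payload = {
    "model": "smart",  # alias; gateway rules decide what actually runs
    "messages": [{"role": "user", "content": "Summarize this contract."}],
}

body = json.dumps(payload)

# Whether "smart" resolves to claude-3-5-sonnet or gemini-2.5-pro today is
# the gateway's decision; the bytes the client sends never change.
print(body)
```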

  • Long inputs route to a 200K-context model; short ones to a fast one
  • JSON-schema enforcement on; route to a model that respects it
  • Monthly budget burning fast; downgrade after 80% spend
  • Off-hours; cheaper model since latency matters less
endpoint-routing.yaml
# Per-endpoint rules — first match wins, fall through to default.
default:
  provider: anthropic
  model:    claude-3-5-sonnet

rules:
  # Long inputs → bigger context window
  - if: "input_tokens > 100000"
    use:
      provider: google
      model:    gemini-2.5-pro

  # Schema mode — must round-trip clean JSON
  - if: "has_schema"
    use:
      provider: openai
      model:    gpt-4o

  # Burn-rate guard — switch to cheaper model past 80% of monthly budget
  - if: "monthly_spend_pct > 0.8"
    use:
      provider: groq
      model:    llama-3.3-70b

  # Off-hours batch processing
  - if: "hour_local in [22, 23, 0, 1, 2, 3, 4, 5]"
    use:
      provider: together
      model:    deepseek-r1
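The matching semantics above can be sketched in a few lines of Python. This is an illustration of first-match-wins ordering over the same rules, not the gateway's actual engine:

```python
# Each rule pairs a predicate over the request's signals with a target backend.
# Rules are checked top to bottom; the first match wins, else the default.
DEFAULT = {"provider": "anthropic", "model": "claude-3-5-sonnet"}

RULES = [
    (lambda s: s["input_tokens"] > 100_000,
     {"provider": "google", "model": "gemini-2.5-pro"}),
    (lambda s: s["has_schema"],
     {"provider": "openai", "model": "gpt-4o"}),
    (lambda s: s["monthly_spend_pct"] > 0.8,
     {"provider": "groq", "model": "llama-3.3-70b"}),
    (lambda s: s["hour_local"] in {22, 23, 0, 1, 2, 3, 4, 5},
     {"provider": "together", "model": "deepseek-r1"}),
]

def route(signals):
    """Return the first matching rule's target, else the default."""
    for predicate, target in RULES:
        if predicate(signals):
            return target
    return DEFAULT

# Ordering matters: a 150K-token request that also carries a schema routes to
# gemini-2.5-pro, because the long-input rule is listed first.
print(route({"input_tokens": 150_000, "has_schema": True,
             "monthly_spend_pct": 0.1, "hour_local": 14}))
```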

AVAILABLE SIGNALS

What rules can match on.

Every routing decision lands in the audit log alongside the matched rule, so 'why did it pick that?' is always answerable after the fact.

input_tokens

Token count of the prompt before dispatch. Useful for context-window-aware routing.

has_schema

Boolean — true when the request includes a JSON schema. Route to schema-strict models.

has_tools

Boolean — true when tool definitions are passed. Route to tool-capable models.

monthly_spend_pct

Fraction of the endpoint's monthly USD budget already consumed. Burn-rate-aware downgrades.
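Note that this signal is a fraction in [0, 1], not a percentage, which is why the example rule compares against 0.8 rather than 80. A quick illustration with assumed numbers:

```python
# $410 already spent against a $500 monthly budget gives 0.82, which crosses
# the 0.8 threshold in the burn-rate rule above. (Figures are made up.)
spend_usd, budget_usd = 410.0, 500.0
monthly_spend_pct = spend_usd / budget_usd

print(monthly_spend_pct > 0.8)  # the burn-rate downgrade would fire
```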

time_of_day / hour_local / weekday

Time signals for cost or latency policies that vary across the day.

streaming

Boolean — true when the client requested SSE. Route streaming-capable models only.

Custom JSONPath signals

Pull any value from the request body — $.metadata.priority, $.user_tier — and match on it.
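A rule over a custom signal might look like the fragment below. Only the JSONPath expressions come from the list above; the exact way a JSONPath appears inside an `if:` condition is a sketch, not confirmed config grammar:

```yaml
rules:
  # Premium-tier callers keep the strongest model regardless of other guards
  - if: "$.user_tier == 'premium'"
    use:
      provider: anthropic
      model:    claude-3-5-sonnet

  # Jobs the caller flags as low priority go to the cheapest backend
  - if: "$.metadata.priority == 'low'"
    use:
      provider: groq
      model:    llama-3.3-70b
```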

Failover chain

Independent of rules: every endpoint can list secondary credentials that kick in on 5xx / rate-limit. Logged with the routing decision.
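A failover chain might be declared like this. The key names (`failover:` and its entries) are assumptions for illustration; the documented behavior is only that secondary credentials take over on 5xx / rate-limit and that the switch is logged:

```yaml
default:
  provider: anthropic
  model:    claude-3-5-sonnet
  failover:            # tried in order when the primary returns 5xx / 429
    - provider: openai
      model:    gpt-4o
    - provider: groq
      model:    llama-3.3-70b
```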

Move the policy, not the apps.

Edit the rule on the gateway and every client sees it on the next request. The audit log records which rule matched on every call.

Coming soon