Skip to content

Model Routing and API Behavior

See also: Configuration Reference, Provider API Compatibility, Data Relationships, Identity and Access, Request Lifecycle and Failure Modes, Pricing Catalog and Accounting, Observability and Request Logs, ADR: Model Aliases and Provider-Only Route Config, ADR: Capability-Aware Route Gating with Strict Fail-Fast Validation, ADR: Route-Level Provider API Compatibility Profiles

This page explains how the public /v1/* surface resolves a request into one concrete route.

Source of Truth

Public Endpoints

The live public endpoints are:

  • GET /v1/models
  • POST /v1/chat/completions
  • POST /v1/responses
  • POST /v1/embeddings

All are authenticated.

Requested Versus Resolved Model Identity

The gateway keeps two model identities in play:

  • requested model
    • what the caller asked for
  • resolved model
    • the canonical execution target after alias resolution

That distinction is persisted into request logs.

Model Forms

Configured gateway models are either:

  • provider-backed
  • alias-backed

A model cannot define both routes and alias_of.

tag: Selectors

The request model field can be:

  • a concrete gateway model key
  • a tag selector such as tag:fast

Tag selectors use AND semantics.

  • every requested tag must exist on the chosen model
  • selection only considers models already allowed for the authenticated API key
  • candidates are ordered by model rank, then model key

Routes, Priority, and Weight

Provider-backed models resolve to one or more routes.

Each route can define:

  • provider
  • upstream_model
  • priority
  • weight
  • enabled
  • capabilities

Current planner behavior:

  • lower priority is attempted first
  • weight only matters within the same priority bucket
  • disabled routes and routes with non-positive weight are excluded

Current runtime nuance:

  • weighted routing is not multi-route fallback
  • the planner produces an ordered route list
  • the handler executes only the first eligible route

Weight affects selection inside a single priority bucket. It does not mean the gateway sends one request to several providers, retries the next route, or falls back after an upstream error. Configurable retry and fallback remains separate follow-up work in issue #118.

Capability-Aware Gating

Routes are filtered before provider execution based on request requirements and route capability.

Current capability dimensions:

  • chat_completions
  • responses
  • stream
  • embeddings
  • tools
  • vision
  • json_schema
  • developer_role

Capability metadata exists to fail early at the gateway edge. It is not a copy of provider marketing language.

Effective capability is the intersection of route metadata and provider runtime support.

  • route capability defaults are permissive
  • provider implementations can still reject unsupported API families
  • partial provider routes should explicitly disable unsupported API families

For example, current Vertex routes support the chat path but not the Responses path. A Vertex chat route should keep responses: false so /v1/responses fails during capability filtering instead of later inside the provider adapter.

Compatibility Profiles

Routes can also define provider API compatibility metadata.

Capabilities and compatibility have different jobs:

  • capabilities gates whether the route can execute a request at all
  • compatibility rewrites the outbound provider request shape after a route is selected

OpenAI-compatible route profiles currently cover deterministic Chat Completions transforms such as store removal, token field renaming, developer role rewriting, reasoning_effort handling, and stream usage requests. Responses uses a separate typed request/provider path; Chat Completions transforms must not be used as Responses shims.

See provider-api-compatibility.md for the compatibility matrix and field-level contract.

Worked Request Path

One plain path looks like this:

  • request model:
    • tag:fast
  • allowed model set:
    • gpt-4o-mini, claude-3-5-haiku
  • selection result:
    • gpt-4o-mini
  • alias result:
    • openai-gpt-4o-mini
  • planned route order:
    • openai-primary, then openai-backup
  • capability filter:
    • openai-primary stays eligible
  • execution:
    • the handler uses openai-primary
  • request-log fields:
    • model_key = gpt-4o-mini
    • resolved_model_key = openai-gpt-4o-mini
    • provider_key = openai-primary

Use request-lifecycle-and-failure-modes.md for the later logging, pricing, and budget effects.

/v1/models

GET /v1/models returns the gateway models visible to the authenticated API key.

Important notes:

  • it reflects gateway model identity, not raw provider catalogs
  • it shows grant-visible identities
  • it does not promise executable routes

That last point matters. A model can be visible and still fail if route viability or capability checks remove every route.

/v1/chat/completions

Current behavior highlights:

  • request IDs are propagated through x-request-id
  • budget checks run before provider execution
  • successful requests write usage when usage can be normalized
  • request logs store both requested and resolved model identity

/v1/responses

POST /v1/responses follows the same authentication, model resolution, route planning, budget guard, logging, and ledger flow as Chat Completions.

Important differences:

  • route capability filtering requires responses
  • provider execution calls the provider's Responses methods, not Chat Completions methods
  • streaming preserves Responses response.* event names and payloads instead of rewriting them into Chat Completions chunks
  • usage is normalized from input_tokens, output_tokens, and total_tokens

/v1/embeddings

POST /v1/embeddings follows the same high-level path:

  • authenticate
  • resolve the requested model
  • capability-filter the route set
  • execute the first eligible route
  • write usage when usage can be normalized

Current limitation:

  • Vertex embeddings remain out of scope in this slice and should be excluded by capability gating

Route Viability Versus Capability Mismatch

SymptomMeaning
invalid_requestthe model resolved, but capability filtering removed every route
no_routes_availablethe model exists, but no usable route survived provider and route-viability checks

That distinction is one of the fastest ways to debug a visible-but-unusable model.

Current V1 Behavior

The live runtime is intentionally narrow in this slice:

  • single-route execution only
  • no retry loop
  • no live fallback loop
  • strict capability filtering before provider execution

The open retry/fallback policy work must amend this section when it lands; see issue #118.

What This Page Does Not Own