How Routing Works

Statistical methods, exploration, and the trust invariant.


Outcome-Driven, Not Benchmark-Driven

Most routers optimize for metrics they can measure on their own: cost, latency, benchmark scores. None of these tell you whether the output actually worked.

Kalibr is different. You report outcomes (success or failure) and Kalibr learns from them. This creates a feedback loop that other routers don't have:

  • Model returns valid JSON but wrong data? You report failure. Kalibr learns.
  • Provider has a silent regression? Success rate drops. Traffic shifts away.
  • New model performs better for your task? Traffic shifts toward it.

This is why Kalibr can detect problems other routers miss: semantic failures that still return HTTP 200.
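
A minimal sketch of that feedback loop, with the outcome check done in your own code. The kalibr import path, the response shape, and the report_outcome method name are illustrative assumptions here, not a confirmed API; the routing examples later on this page show the actual Router constructor.

import json

from kalibr import Router  # import path assumed for illustration

router = Router(
    goal="extract_company",
    paths=["gpt-4o", "claude-sonnet-4-20250514"],
)

response = router.completion(
    messages=[{"role": "user", "content": "Extract the company name as JSON."}]
)

# The router can't judge semantic correctness: that check is yours.
try:
    data = json.loads(response.choices[0].message.content)  # response shape assumed
    ok = bool(data.get("company"))  # valid JSON with wrong or empty data still counts as failure
except (TypeError, ValueError, AttributeError):
    ok = False

router.report_outcome(success=ok)  # hypothetical method name; this report is what Kalibr learns from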


Statistical Foundation

Kalibr uses Thompson Sampling for routing decisions and Wilson score intervals for confidence estimation. Both are well-established statistical methods: Thompson Sampling balances exploration against exploitation, and Wilson intervals keep confidence estimates honest at small sample sizes.

Why this approach?

  • Naturally balances exploration vs exploitation
  • Adapts to changing conditions automatically
  • Conservative with small sample sizes

You don't need to understand the math. The short version: Kalibr picks each path in proportion to how likely it is to be the best one, given the evidence so far.
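
A conceptual sketch of Thompson Sampling itself (the standard technique, not Kalibr's internal code): each path keeps a Beta posterior over its success rate, one value is drawn per path, and the highest draw wins.

import random

# Per-path evidence: counts of reported successes and failures (example numbers).
paths = {
    "gpt-4o": {"successes": 90, "failures": 10},
    "claude-sonnet-4-20250514": {"successes": 3, "failures": 2},
}

def choose_path() -> str:
    # Sample one plausible success rate per path from Beta(successes+1, failures+1),
    # then route to whichever path drew highest.
    draws = {
        name: random.betavariate(stats["successes"] + 1, stats["failures"] + 1)
        for name, stats in paths.items()
    }
    return max(draws, key=draws.get)

# The well-proven path wins most draws; the uncertain path still wins occasionally
# because its posterior is wide. That occasional win is the exploration.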


Confidence and Sample Size

Kalibr is conservative with small samples. A path with 5 successes out of 5 attempts isn't trusted more than a path with 80 out of 100.

This matters because:

  • New paths get fair exploration before being discarded
  • Lucky streaks don't cause permanent routing decisions
  • Unlucky streaks don't permanently kill good paths

As the sample size grows, so does confidence, but Kalibr never stops exploring entirely.
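
To make the 5-of-5 versus 80-of-100 comparison concrete, here is the standard Wilson score lower bound (a sketch of the general formula, not Kalibr's internal code):

import math

def wilson_lower_bound(successes: int, attempts: int, z: float = 1.96) -> float:
    # Pessimistic estimate of the true success rate at ~95% confidence.
    if attempts == 0:
        return 0.0
    p = successes / attempts
    denom = 1 + z * z / attempts
    centre = p + z * z / (2 * attempts)
    spread = z * math.sqrt(p * (1 - p) / attempts + z * z / (4 * attempts * attempts))
    return (centre - spread) / denom

print(wilson_lower_bound(5, 5))     # ~0.57: a perfect record, but only 5 samples
print(wilson_lower_bound(80, 100))  # ~0.71: a lower raw rate backed by more evidence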


Exploration vs Exploitation

Cold start: When a goal is new, Kalibr explores randomly until it has enough data to make informed decisions.

Steady state: After sufficient data, Kalibr mostly exploits the best-performing path while continuing to test alternatives. This lets it detect when conditions change.

You can adjust the exploration rate:

router = Router(
    goal="extract_company",
    paths=["gpt-4o", "claude-sonnet-4-20250514"],
    exploration_rate=0.05  # Lower = more exploitation
)

const router = new Router({
  goal: 'extract_company',
  paths: ['gpt-4o', 'claude-sonnet-4-20250514'],
  explorationRate: 0.05,  // Lower = more exploitation
});

Lower exploration = more consistent, slower to adapt
Higher exploration = more variance, faster to detect changes

For high-stakes production tasks, use lower exploration. For experimental features, use higher.
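
One common way an exploration rate is layered on top of a learned policy is epsilon-greedy style selection. This is a sketch of the general idea, not necessarily Kalibr's exact mechanism:

import random

def route(paths: list[str], best_path: str, exploration_rate: float = 0.05) -> str:
    # With probability exploration_rate, try a random alternative;
    # otherwise exploit the current best-performing path.
    if random.random() < exploration_rate:
        return random.choice(paths)
    return best_path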


Trend Detection

Kalibr compares recent performance against historical baseline to detect drift.

A path can be:

  • Improving - Recent success rate significantly above baseline
  • Stable - Consistent with baseline
  • Degrading - Recent success rate significantly below baseline

This catches silent model regressions. When a provider pushes a bad update, the success rate drops, Kalibr notices, and traffic shifts away, often before you'd spot it manually.
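
A conceptual drift check (a sketch of the general technique, not Kalibr's internal algorithm): compare the recent window's Wilson interval against the historical baseline rate and classify the path accordingly.

import math

def wilson_bounds(successes: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    # Lower and upper Wilson score bounds at ~95% confidence.
    if attempts == 0:
        return 0.0, 1.0
    p = successes / attempts
    denom = 1 + z * z / attempts
    centre = p + z * z / (2 * attempts)
    spread = z * math.sqrt(p * (1 - p) / attempts + z * z / (4 * attempts * attempts))
    return (centre - spread) / denom, (centre + spread) / denom

def classify_trend(recent_successes: int, recent_attempts: int, baseline_rate: float) -> str:
    lower, upper = wilson_bounds(recent_successes, recent_attempts)
    if lower > baseline_rate:
        return "improving"  # even the pessimistic recent estimate beats the baseline
    if upper < baseline_rate:
        return "degrading"  # even the optimistic recent estimate trails the baseline
    return "stable"

# A path that was 92% historically but went 30/50 in the recent window:
print(classify_trend(30, 50, baseline_rate=0.92))  # degrading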


The Trust Invariant

Kalibr optimizes for success first, cost second. Always.

A path with a higher success rate will never lose to one with a lower success rate, even if the lower-performing path is significantly cheaper.

Cost and latency only matter when comparing paths with similar success rates. This ensures you never sacrifice quality for cost savings.
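
A sketch of that ranking rule (illustrative only; the 2% similarity margin and the third path name are assumptions, not Kalibr's actual thresholds): success rate decides first, and cost only breaks near-ties.

SIMILARITY_MARGIN = 0.02  # assumed threshold for "similar" success rates

def pick_path(candidates: list[dict]) -> dict:
    # Keep only paths whose success rate is within the margin of the best,
    # then pick the cheapest among those near-ties.
    best_rate = max(c["success_rate"] for c in candidates)
    contenders = [c for c in candidates if best_rate - c["success_rate"] <= SIMILARITY_MARGIN]
    return min(contenders, key=lambda c: c["cost_per_call"])

paths = [
    {"name": "gpt-4o", "success_rate": 0.95, "cost_per_call": 0.012},
    {"name": "claude-sonnet-4-20250514", "success_rate": 0.94, "cost_per_call": 0.008},
    {"name": "hypothetical-cheap-model", "success_rate": 0.70, "cost_per_call": 0.001},
]
print(pick_path(paths)["name"])  # the 0.94 path: within the margin of the best, and cheaper
# The 0.70 path never wins despite costing 12x less, because its success rate is far lower.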


Bypass When Needed

Sometimes you need to override routing:

# Force a specific model
response = router.completion(
    messages=[...],
    force_model="gpt-4o"
)

// Force a specific model
const response = await router.completion(messages, {
  forceModel: 'gpt-4o',
});

The call is still traced, but routing is bypassed. Use this for:

  • Debugging specific model behavior
  • Reproducing customer issues
  • Load testing a specific provider

Don't use it as your default; you lose the learning benefits.

