How Routing Works
Statistical methods, exploration, and the trust invariant.
Outcome-Driven, Not Benchmark-Driven
Most routers optimize for metrics they can measure themselves: cost, latency, benchmark scores. But none of these tell you whether the output actually worked.
Kalibr is different. You report outcomes (success or failure) and Kalibr learns from them. This creates a feedback loop that other routers don't have:
- Model returns valid JSON but wrong data? You report failure. Kalibr learns.
- Provider has a silent regression? Success rate drops. Traffic shifts away.
- New model performs better for your task? Traffic shifts toward it.
This is why Kalibr can detect problems that other routers miss - semantic failures that still return HTTP 200.
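A minimal sketch of that feedback loop in Python, assuming an OpenAI-style response shape and a hypothetical report_outcome hook (the actual outcome-reporting call may be named differently in your Kalibr version):
import json

# `report_outcome` is a hypothetical name for the outcome-reporting hook;
# the response shape and success check are illustrative assumptions.
def extract_company(router, messages):
    response = router.completion(messages=messages)
    try:
        data = json.loads(response.choices[0].message.content)
        ok = bool(data.get("company_name"))  # your own definition of "worked"
    except (json.JSONDecodeError, AttributeError, IndexError, TypeError):
        ok = False
    router.report_outcome(response, success=ok)  # close the feedback loop
    return data if ok else None
The success check is yours to define - anything from schema validation to a downstream business result can feed the same boolean back to Kalibr.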
Statistical Foundation
Kalibr uses Thompson Sampling for routing decisions and Wilson score intervals for confidence estimation. Thompson Sampling is a well-established algorithm for balancing exploration and exploitation; Wilson intervals give reliable confidence bounds even at small sample sizes.
Why this approach?
- Naturally balances exploration vs exploitation
- Adapts to changing conditions automatically
- Conservative with small sample sizes
You don't need to understand the math. The short version: Kalibr tries paths proportionally to how likely they are to be best, based on evidence so far.
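For intuition, here is a generic Beta-Bernoulli Thompson Sampling loop in Python - an illustration of the algorithm, not Kalibr's internal code:
import random

# Each path keeps success/failure counts. To route, draw one plausible
# success rate per path from its Beta posterior and pick the highest draw.
counts = {
    "gpt-4o": {"success": 0, "failure": 0},
    "claude-sonnet-4-20250514": {"success": 0, "failure": 0},
}

def choose_path():
    samples = {
        path: random.betavariate(c["success"] + 1, c["failure"] + 1)
        for path, c in counts.items()
    }
    return max(samples, key=samples.get)

def record(path, success):
    counts[path]["success" if success else "failure"] += 1
Paths with strong track records produce high draws most of the time; paths with little data produce a wide spread of draws, so they still get tried occasionally.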
Confidence and Sample Size
Kalibr is conservative with small samples. A path with 5 successes out of 5 attempts isn't trusted more than a path with 80 successes out of 100.
This matters because:
- New paths get fair exploration before being discarded
- Lucky streaks don't cause permanent routing decisions
- Unlucky streaks don't permanently kill good paths
As sample size grows, confidence grows - but Kalibr never stops exploring entirely.
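The Wilson score lower bound makes the 5-of-5 versus 80-of-100 comparison above concrete (illustrative Python; the exact formula and confidence level Kalibr uses internally are implementation details):
import math

def wilson_lower_bound(successes, trials, z=1.96):
    # Lower bound of the Wilson score interval for a success proportion.
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = p + z**2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (centre - margin) / denom

print(wilson_lower_bound(5, 5))     # ~0.57: promising, but little evidence
print(wilson_lower_bound(80, 100))  # ~0.71: lower raw rate, more trustworthy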
Exploration vs Exploitation
Cold start: When a goal is new, Kalibr explores randomly until it has enough data to make informed decisions.
Steady state: After sufficient data, Kalibr mostly exploits the best-performing path while continuing to test alternatives. This lets it detect when conditions change.
You can adjust the exploration rate:
router = Router(
    goal="extract_company",
    paths=["gpt-4o", "claude-sonnet-4-20250514"],
    exploration_rate=0.05,  # Lower = more exploitation
)
const router = new Router({
  goal: 'extract_company',
  paths: ['gpt-4o', 'claude-sonnet-4-20250514'],
  explorationRate: 0.05, // Lower = more exploitation
});
- Lower exploration: more consistent routing, slower to adapt
- Higher exploration: more variance, faster to detect changes
For high-stakes production tasks, use a lower exploration rate. For experimental features, use a higher one.
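One common way an exploration rate is applied - a sketch of the general pattern, not necessarily Kalibr's exact mechanism - is epsilon-style: with probability exploration_rate, try a random path; otherwise exploit the current best:
import random

def select_path(paths, best_path, exploration_rate=0.05):
    # Occasionally sample an alternative so regressions on the current best
    # path, and improvements on other paths, can still be detected.
    if random.random() < exploration_rate:
        return random.choice(paths)
    return best_path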
Trend Detection
Kalibr compares recent performance against a historical baseline to detect drift.
A path can be:
- Improving - Recent success rate significantly above baseline
- Stable - Consistent with baseline
- Degrading - Recent success rate significantly below baseline
This catches silent model regressions. When a provider pushes a bad update, Kalibr notices and routes away - often before you'd notice manually.
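A simplified Python sketch of the idea - the window and threshold here are assumptions for illustration, not Kalibr's actual values:
def classify_trend(recent_outcomes, baseline_rate, threshold=0.10):
    # recent_outcomes: booleans for the most recent calls on this path.
    if not recent_outcomes:
        return "stable"
    recent_rate = sum(recent_outcomes) / len(recent_outcomes)
    if recent_rate > baseline_rate + threshold:
        return "improving"
    if recent_rate < baseline_rate - threshold:
        return "degrading"
    return "stable"

# 6 successes in the last 10 calls against a 0.92 baseline -> "degrading"
print(classify_trend([True] * 6 + [False] * 4, baseline_rate=0.92))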
The Trust Invariant
Kalibr optimizes for success first, cost second. Always.
A path with a meaningfully higher success rate will never lose to a path with a lower one, even if the lower-performing path is significantly cheaper.
Cost and latency only break ties between paths whose success rates are statistically similar. This ensures you never trade quality for cost savings.
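Conceptually the comparison is lexicographic: success dominates, and cost only breaks ties when success rates are effectively indistinguishable. A simplified sketch, where the tolerance is an assumed value for illustration:
def better_path(a, b, tolerance=0.02):
    # a, b: dicts with "success_rate" and "cost_per_call".
    # Success first: a clearly higher success rate always wins.
    if abs(a["success_rate"] - b["success_rate"]) > tolerance:
        return a if a["success_rate"] > b["success_rate"] else b
    # Cost second: only when success rates are effectively tied.
    return a if a["cost_per_call"] <= b["cost_per_call"] else b

cheap = {"name": "cheap-model", "success_rate": 0.81, "cost_per_call": 0.002}
strong = {"name": "strong-model", "success_rate": 0.94, "cost_per_call": 0.020}
print(better_path(cheap, strong)["name"])  # strong-model: success wins despite 10x cost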
Bypass When Needed
Sometimes you need to override routing:
# Force a specific model
response = router.completion(
    messages=[...],
    force_model="gpt-4o",
)
// Force a specific model
const response = await router.completion(messages, {
  forceModel: 'gpt-4o',
});
The call is still traced, but routing is bypassed. Use this for:
- Debugging specific model behavior
- Reproducing customer issues
- Load testing a specific provider
Don't use it as your default - you lose the learning benefits.
Next
- API Reference - Full Router API including get_policy()
- Production Guide - Graceful degradation, monitoring