How Kalibr Works
Kalibr detects failures, heals them automatically, and learns which model works best for each task over time.
Most agent frameworks pick one model and stick with it. When that model silently degrades, or when the output is structurally wrong but the HTTP response was 200, nothing catches it. Kalibr does.
Kalibr sits between your agent and the model. It evaluates whether every output actually succeeded, reroutes automatically when it doesn't, and learns from outcomes to pick the best model over time. No alerting. No manual rollback. No human required.
1. Kalibr detects failures
After each model call, Kalibr runs two evaluation gates before the result is considered complete:
Gate 1. Structural eval (synchronous, every call). A fast, deterministic check that runs inline with no LLM calls. What it checks depends on goal type:
code_generationPython AST parse passes, or TypeScript has function/class structureweb_scrapingfield completeness ≥ 0.8, at least 1 row returnedclassificationreturned label is in the allowed setsummarizationoutput is not a near-verbatim copy, not empty, not a refusallead_scoringscore is numeric and in [0, 100]outreach_generationsubject line and body both present, 50–2,000 charsresearchat least 200 characters, no error markers in output- All other goal types, output is non-empty and non-trivial
Gate 1 result feeds directly into report(success=bool). No configuration needed, Kalibr knows the success contract for each goal type.
Gate 2 - LLM quality judge (async, ~20% sample rate, research and outreach only). For goals where structural correctness isn't enough to measure quality, specifically research and outreach_generation Kalibr runs a background quality judge on approximately 20% of outputs that passed Gate 1. The judge uses a cheap model (DeepSeek, specifically deepseek-chat, never a premium model) and returns a float score from 0.0 to 1.0. Scores below 0.6 are treated as low quality. This score feeds into report(success=bool, score=float) and gives the router finer discrimination between models that both pass Gate 1 but produce different quality output.
Gate 2 is fire-and-forget. It never blocks the main execution path. Standard routing has no LLM call in the hot path. The exception is when repair_prompt=True is set on the Router, which adds one judge-model call to rewrite the prompt on quality failure.
The 0.6 quality threshold is a global default, not per-goal calibrated. If your goal type produces consistently short outputs that score low by default (e.g., classification tasks that return a single word), set score_when instead of relying on the default Gate 2 judge. score_when gives you full control over the scoring function.
2. Kalibr heals failures
When Gate 1 fails (structurally bad output, wrong format, empty response, provider error) Kalibr records the failure against that model for this goal. On the next call for the same goal, Kalibr routes to the next-best model based on current success rates. No configuration. No threshold to set. It just switches.
This reroute is what the dashboard calls a heal. Every heal is an execution that would have reached your users as a failure, intercepted and redirected automatically. The heal count on your Agents page is the count of those interventions.
Healing catches failures that HTTP status codes miss: a model that returns 200 with malformed JSON, a summarization model that returns a verbatim copy of the input, a code model that returns syntactically invalid Python. Gate 1 catches all of these. The provider never flagged them as errors.
Prompt repair
When healing=True and a call fails Gate 1, Kalibr injects a deterministic repair system prompt before retrying on the same model. The repair prompt describes the exact validation failure and instructs the model what format is required. This happens inline, with no extra LLM call. If the retry also fails, Kalibr swaps to the next path.
Optionally, set repair_prompt=True on the Router (without healing=True) to enable a Gate 2 path where the judge_model rewrites the user prompt itself when quality falls below threshold. This path makes one additional LLM call to rewrite the prompt before trying the next model.
When all paths are exhausted
When the heal loop has exhausted all paths, the router returns the best partial response received (with response.kalibr_heal_exhausted = True) rather than raising an exception. This prevents benchmarks and tolerant callers from counting partial results as hard failures. A RuntimeError is only raised if no response was received at all (e.g. a network failure before any bytes arrived).
Inspect attributes on the returned response:
kalibr_heal_exhausted— True when all paths failedkalibr_healed— True when healing modified at least one callkalibr_heal_count— number of repair attempts madekalibr_models_tried— list of models attempted in order
3. Kalibr learns from outcomes
Routing priors are stored server-side by the Kalibr intelligence service. They persist across process restarts, redeployments, and scaling events. Your Router starts accumulating learning from the first call and retains it indefinitely. You do not need to warm up the Router again after a restart.
Before your tenant has any run history for a goal, Kalibr selects a starting model from a global pool of outcome data, aggregated across all tenants, all task types, weighted by task similarity. This warm-start means your first run routes to a model with a known track record for that goal type, not a coin flip.
As your agent accumulates outcomes, tenant-specific data takes over. The global prior becomes a progressively smaller influence. Your routing reflects your actual workload.
Scoring signals
Kalibr accepts two types of outcome signals:
- Binary
report(success=True/False). Updates the model's success rate directly. Every structural eval produces this. - Continuous
report(success=True, score=0.85). The float score gives finer discrimination. A score of 0.85 counts as 0.85 successes and 0.15 failures in the routing model. Two models that both pass Gate 1 at 90% will look identical on binary scoring, but if one consistently scores 0.92 and the other 0.61, Kalibr routes to the better one. The LLM quality judge (Gate 2) produces this signal automatically for eligible goal types.
Statistical model
Kalibr uses Thompson Sampling over Beta(alpha, beta) distributions, one per (tenant, goal, model_path) triple.
alpha = success_count + 1 (Laplace prior) beta = failure_count + 1 (Laplace prior)
At routing time, one sample is drawn from each path's distribution. The highest sample wins. This naturally balances exploitation (favoring known-good paths) with exploration (occasionally trying underperforming paths when uncertainty is high).
Exploration floor: paths with fewer than 50 total outcomes receive a guaranteed minimum traffic share (approximately 15.8%, or 1/6.3). This prevents starvation during cold start and ensures every path accumulates enough data for reliable estimates before Thompson Sampling fully dominates.
The distributions are not reset on process restart. They are stored server-side and updated after every reported outcome.
Non-stationarity: Kalibr weights recent outcomes more heavily than older ones when computing the effective alpha/beta. The evaluator computes a 24-hour recent window against a 7-30 day baseline and tracks a trend signal (improving, stable, degrading). A model that was excellent six months ago but degraded last week will lose routing priority within days, not months.
Behavioral signal blending (Gate 3)
User behavioral signals (report_user_turn, report_action, report_session_end) blend with structural and quality signals at a 20% weight, requiring a minimum of 5 signals before blending activates. The blending is additive: the routing prior is adjusted by (0.2 * behavioral_signal_weight) + (0.8 * structural_quality_weight). This weight is fixed in the current version; per-goal or per-tenant weight tuning is not yet supported.
Gate 3 is off by default and enabled per tenant via the gate3_signal_blend feature flag. When disabled, the ClickHouse signal query is skipped entirely to keep routing latency at network-only.
Selection bias note: users who reprompt are not a random sample of all users. The behavioral signal is most useful as a directional indicator (this model produces outputs users accept) rather than a precise probability estimate.
Trend detection and drift
Kalibr compares recent performance against historical baseline to detect drift. A model that was working last week may not be working this week, silent provider regressions happen constantly.
A model's trend can be:
- Improving Recent success rate significantly above baseline
- Stable Consistent with baseline
- Degrading Recent success rate significantly below baseline
When a model is degrading, it loses routing priority. When it recovers, routing gradually returns to it. This works across all modalities, a degrading transcription model gets the same treatment as a degrading text LLM.
The Trust Invariant
Kalibr optimizes for success first, cost second. Always.
A path with higher success rate will never lose to a path with lower success rate, even if the lower-performing path is significantly cheaper.
Cost and latency only matter when comparing paths with similar success rates. This ensures you never sacrifice quality for cost savings.
Bypass When Needed
Sometimes you need to override routing:
# Force a specific model response = router.completion( messages=[...], force_model="gpt-4o" )
// Force a specific model
const response = await router.completion(messages, {
forceModel: 'gpt-4o',
});The call is still traced, but routing is bypassed. Use this for:
- Debugging specific model behavior
- Reproducing customer issues
- Load testing a specific provider
Don't use it as your default. You lose the learning benefits.
Cost savings
The trust invariant (see below) guarantees Kalibr never sacrifices reliability for cost. But when two models have similar success rates, Kalibr routes to the cheaper one. Over time this compounds: a model that costs $0.004/call replacing one that costs $0.018/call across thousands of runs is real money. The Cost Saved by Kalibr KPI on your dashboard measures exactly this, the delta between what you spent and what you would have spent routing everything through the most expensive model in your path list.
Auto Path Generation [FEATURE FLAG]
Not enabled by default. When the auto_path_generation flag is on, a background job runs hourly and extends the path registry automatically:
- Identifies numeric parameters (temperature, top_p) where the best-performing value is at the boundary of explored space
- Generates new paths with interpolated values (e.g., if temperature 0.3 is the best and it's the lowest tested, tries 0.15)
- Paths that underperform (>20 percentage points below best) are automatically disabled after 30+ samples
- Maximum 5 auto-generated paths per goal, maximum 3 new paths per goal per run
Contact us to enable this on your account.
Rate limits and concurrency
The Kalibr intelligence service is rate-limited by your API key tier. The routing decision call (/decide) is fast and designed for high concurrency. If you exceed your tier, requests return 429. Provider-level rate limits (OpenAI, Anthropic) are not managed by Kalibr -- your application should handle these using standard retry logic.
Failure modes and reliability
Intelligence service unavailable: If the /decide call to the Kalibr intelligence service fails or times out, the Router falls back to the first path in the paths list and logs a warning. Routing continues without learning until the service recovers. Outcome reports are queued locally and retried.
All paths fail: When the heal loop exhausts all paths, the Router returns the best partial response with response.kalibr_heal_exhausted = True. A RuntimeError is only raised if no bytes were received at all (network-level failure before any response).
Rate limits: Kalibr routing decisions (/decide) are separate from provider calls. If your Kalibr tier is exceeded, the /decide call returns 429 and the Router falls back to the first path. Provider rate limits (OpenAI, Anthropic) are not managed by Kalibr and should be handled with standard retry logic in your application.
Next
- API Reference. Full Router API including get_policy()
- Production Guide. Graceful degradation, monitoring