
How Kalibr Works

Kalibr detects failures, heals them automatically, and learns which model works best for each task over time.

Most agent frameworks pick one model and stick with it. When that model silently degrades, or when the output is structurally wrong but the HTTP response was 200, nothing catches it. Kalibr does.

Kalibr sits between your agent and the model. It evaluates whether every output actually succeeded, reroutes automatically when it doesn't, and learns from outcomes to pick the best model over time. No alerting. No manual rollback. No human required.

1. Kalibr detects failures

After each model call, Kalibr runs two evaluation gates before the result is considered complete:

Gate 1. Structural eval (synchronous, every call). A fast, deterministic check that runs inline with no LLM calls. What it checks depends on goal type:

The Gate 1 result feeds directly into report(success=bool). No configuration is needed: Kalibr knows the success contract for each goal type.

Gate 2. LLM quality judge (async, ~20% sample rate, research and outreach only). For goals where structural correctness isn't enough to measure quality (specifically research and outreach_generation), Kalibr runs a background quality judge on approximately 20% of outputs that passed Gate 1. The judge uses a cheap model (DeepSeek, specifically deepseek-chat, never a premium model) and returns a float score from 0.0 to 1.0. Scores below 0.6 are treated as low quality. This score feeds into report(success=bool, score=float) and gives the router finer discrimination between models that both pass Gate 1 but produce output of different quality.

Gate 2 is fire-and-forget. It never blocks the main execution path. Standard routing has no LLM call in the hot path. The exception is when repair_prompt=True is set on the Router, which adds one judge-model call to rewrite the prompt on quality failure.

The 0.6 quality threshold is a global default, not per-goal calibrated. If your goal type produces consistently short outputs that score low by default (e.g., classification tasks that return a single word), set score_when instead of relying on the default Gate 2 judge. score_when gives you full control over the scoring function.
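For a single-word classification goal, a custom scoring function might look like the sketch below. The function name, its signature (output string in, float out), and the label set are assumptions made for illustration; this page does not show the exact callable that score_when expects, so check the SDK reference before wiring it in.

```python
# Hypothetical label set for a single-word classification goal.
VALID_LABELS = {"positive", "negative", "neutral"}


def score_classification(output: str) -> float:
    """Return 1.0 when the output is exactly one known label, else 0.0."""
    label = output.strip().lower()
    return 1.0 if label in VALID_LABELS else 0.0
```

A scorer like this rewards a correct one-word answer with a full score, which a general-purpose quality judge may not do for very short outputs.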

2. Kalibr heals failures

When Gate 1 fails (structurally bad output, wrong format, empty response, provider error), Kalibr records the failure against that model for this goal. On the next call for the same goal, Kalibr routes to the next-best model based on current success rates. No configuration. No threshold to set. It just switches.

This reroute is what the dashboard calls a heal. Every heal is an execution that would have reached your users as a failure, intercepted and redirected automatically. The heal count on your Agents page is the count of those interventions.

Healing catches failures that HTTP status codes miss: a model that returns 200 with malformed JSON, a summarization model that returns a verbatim copy of the input, a code model that returns syntactically invalid Python. Gate 1 catches all of these. The provider never flagged them as errors.
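The three failure types above can be approximated with deterministic checks that need no LLM call. The sketch below is illustrative only: the function names are invented here, and Kalibr's actual Gate 1 checks are not shown on this page and may be stricter.

```python
import ast
import json


def is_valid_json(output: str) -> bool:
    """True when the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def is_valid_python(output: str) -> bool:
    """True when the output is non-empty and parses as Python source."""
    if not output.strip():
        return False
    try:
        ast.parse(output)
        return True
    except (SyntaxError, ValueError):
        return False


def is_real_summary(source: str, output: str) -> bool:
    """True when the summary is non-empty and not a verbatim copy of the input."""
    text = output.strip()
    return bool(text) and text != source.strip()
```

Each check runs in-process and returns a plain boolean, which is the shape a success=bool report needs.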

Prompt repair

When healing=True and a call fails Gate 1, Kalibr injects a deterministic repair system prompt before retrying on the same model. The repair prompt describes the exact validation failure and tells the model what format is required. This happens inline, with no extra LLM call. If the retry also fails, Kalibr swaps to the next path.

Optionally, set repair_prompt=True on the Router (without healing=True) to enable a Gate 2 path where the judge_model rewrites the user prompt itself when quality falls below threshold. This path makes one additional LLM call to rewrite the prompt before trying the next model.
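The healing=True flow (validate, retry once on the same model with a repair prompt, then swap paths) can be sketched as a plain loop. Everything named here is hypothetical: call_model stands in for a provider call, validate for a Gate 1 check, and the repair prompt wording is invented. For simplicity the sketch returns the last output when every path fails, whereas the Router is described as returning the best partial response.

```python
from typing import Callable, Optional, Sequence, Tuple

# call_model(path, system_prompt) -> output text (hypothetical provider call).
CallModel = Callable[[str, Optional[str]], str]
# validate(output) -> None when the output passes, else a failure description.
Validate = Callable[[str], Optional[str]]


def heal_loop(
    paths: Sequence[str], call_model: CallModel, validate: Validate
) -> Tuple[Optional[str], bool]:
    """Return (output, passed), trying each path with one repair retry."""
    last_output: Optional[str] = None
    for path in paths:
        output = call_model(path, None)
        last_output = output
        error = validate(output)
        if error is None:
            return output, True
        # Deterministic repair prompt built from the failure; no extra LLM call.
        repair = (
            f"Your previous output failed validation: {error}. "
            "Return only the required format."
        )
        output = call_model(path, repair)
        last_output = output
        if validate(output) is None:
            return output, True
        # Retry failed as well: fall through to the next path.
    return last_output, False
```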

When all paths are exhausted

When the heal loop has exhausted all paths, the router returns the best partial response received (with response.kalibr_heal_exhausted = True) rather than raising an exception. This prevents benchmarks and tolerant callers from counting partial results as hard failures. A RuntimeError is only raised if no response was received at all (e.g. a network failure before any bytes arrived).

Inspect attributes on the returned response:
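For example, a caller can check the exhausted flag before trusting a result. The helper below is a sketch: heal_exhausted is a name invented here, and it assumes the kalibr_heal_exhausted attribute may be absent on ordinary responses, so it reads the attribute defensively. The SimpleNamespace objects are stand-ins for real Router responses.

```python
from types import SimpleNamespace


def heal_exhausted(response: object) -> bool:
    """True when the response is a partial result returned after all paths failed."""
    return bool(getattr(response, "kalibr_heal_exhausted", False))


# Stand-in objects for illustration; a real response comes from router.completion(...).
partial = SimpleNamespace(kalibr_heal_exhausted=True)
normal = SimpleNamespace()

if heal_exhausted(partial):
    print("partial result: every path failed validation")
```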

3. Kalibr learns from outcomes

Routing priors are stored server-side by the Kalibr intelligence service. They persist across process restarts, redeployments, and scaling events. Your Router starts accumulating learning from the first call and retains it indefinitely. You do not need to warm up the Router again after a restart.

Before your tenant has any run history for a goal, Kalibr selects a starting model from a global pool of outcome data, aggregated across all tenants and task types and weighted by task similarity. This warm-start means your first run routes to a model with a known track record for that goal type, not a coin flip.

As your agent accumulates outcomes, tenant-specific data takes over. The global prior becomes a progressively smaller influence. Your routing reflects your actual workload.
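One common way to get this behavior is to treat the global prior as a fixed number of pseudo-observations that tenant outcomes gradually outweigh. The sketch below illustrates that idea only; the function name, the prior_strength parameter, and the formula itself are assumptions, not Kalibr's documented method.

```python
def blended_success_estimate(
    global_rate: float,
    prior_strength: float,
    tenant_successes: int,
    tenant_failures: int,
) -> float:
    """Mean of a Beta posterior seeded with a global prior (illustrative assumption)."""
    # The global prior counts as `prior_strength` pseudo-observations.
    prior_alpha = global_rate * prior_strength
    prior_beta = (1.0 - global_rate) * prior_strength
    alpha = prior_alpha + tenant_successes
    beta = prior_beta + tenant_failures
    return alpha / (alpha + beta)
```

With no tenant history the estimate equals the global rate; after 100 tenant outcomes and a prior strength of 10, the tenant data dominates.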

Scoring signals

Kalibr accepts two types of outcome signals:

Statistical model

Kalibr uses Thompson Sampling over Beta(alpha, beta) distributions, one per (tenant, goal, model_path) triple.

```
alpha = success_count + 1   (Laplace prior)
beta  = failure_count + 1   (Laplace prior)
```

At routing time, one sample is drawn from each path's distribution. The highest sample wins. This naturally balances exploitation (favoring known-good paths) with exploration (occasionally trying underperforming paths when uncertainty is high).
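A minimal sketch of that sampling step, using the alpha and beta definitions above. The function name and the shape of the stats dictionary are assumptions for illustration; the exploration floor and recency weighting are left out.

```python
import random
from typing import Dict, Tuple


def choose_path(stats: Dict[str, Tuple[int, int]], rng: random.Random) -> str:
    """Pick a path by Thompson Sampling.

    `stats` maps path name -> (success_count, failure_count).
    """
    best_path, best_sample = "", -1.0
    for path, (successes, failures) in stats.items():
        # One draw from Beta(successes + 1, failures + 1) per path.
        sample = rng.betavariate(successes + 1, failures + 1)
        if sample > best_sample:
            best_path, best_sample = path, sample
    return best_path
```

A path with little data has a wide distribution, so it occasionally produces the highest draw and gets traffic; a path with a long record of failures almost never does.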

Exploration floor: paths with fewer than 50 total outcomes receive a guaranteed minimum traffic share (approximately 15.8%, or 1/6.3). This prevents starvation during cold start and ensures every path accumulates enough data for reliable estimates before Thompson Sampling fully dominates.

The distributions are not reset on process restart. They are stored server-side and updated after every reported outcome.

Non-stationarity: Kalibr weights recent outcomes more heavily than older ones when computing the effective alpha/beta. The evaluator computes a 24-hour recent window against a 7-30 day baseline and tracks a trend signal (improving, stable, degrading). A model that was excellent six months ago but degraded last week will lose routing priority within days, not months.

Behavioral signal blending (Gate 3)

User behavioral signals (report_user_turn, report_action, report_session_end) blend with structural and quality signals at a 20% weight, requiring a minimum of 5 signals before blending activates. The blending is additive: the routing prior is adjusted by (0.2 * behavioral_signal_weight) + (0.8 * structural_quality_weight). This weight is fixed in the current version; per-goal or per-tenant weight tuning is not yet supported.
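The blend described above reduces to a few lines. The function and constant names are invented for this sketch, and it assumes that below the 5-signal minimum the structural and quality weight is used unchanged.

```python
BEHAVIORAL_WEIGHT = 0.2  # fixed in the current version, per the text above
MIN_SIGNALS = 5          # blending activates at this many behavioral signals


def blended_weight(
    structural_quality: float, behavioral: float, signal_count: int
) -> float:
    """Blend behavioral signal into the routing weight at a fixed 20% share."""
    if signal_count < MIN_SIGNALS:
        # Assumption: with too few signals, the behavioral term is ignored.
        return structural_quality
    return (
        BEHAVIORAL_WEIGHT * behavioral
        + (1.0 - BEHAVIORAL_WEIGHT) * structural_quality
    )
```

For example, a structural and quality weight of 0.9 with a behavioral signal of 0.5 blends to 0.2 * 0.5 + 0.8 * 0.9 = 0.82 once five signals are available.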

Gate 3 is off by default and enabled per tenant via the gate3_signal_blend feature flag. When disabled, the ClickHouse signal query is skipped entirely to keep routing latency at network-only.

Selection bias note: users who reprompt are not a random sample of all users. The behavioral signal is most useful as a directional indicator (this model produces outputs users accept) rather than a precise probability estimate.

Trend detection and drift

Kalibr compares recent performance against a historical baseline to detect drift. A model that was working last week may not be working this week; silent provider regressions happen constantly.

A model's trend can be improving, stable, or degrading.

When a model is degrading, it loses routing priority. When it recovers, routing gradually returns to it. This works across all modalities: a degrading transcription model gets the same treatment as a degrading text LLM.
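A trend label can be derived by comparing a recent success rate against the baseline rate. The sketch below is an assumption about how such a comparison could work: the function name and the 0.05 tolerance are invented here, and Kalibr's actual thresholds are not documented on this page.

```python
def classify_trend(
    recent_rate: float, baseline_rate: float, tolerance: float = 0.05
) -> str:
    """Label a model's trend from its recent vs. baseline success rate."""
    delta = recent_rate - baseline_rate
    if delta > tolerance:
        return "improving"
    if delta < -tolerance:
        return "degrading"
    return "stable"
```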

The Trust Invariant

Kalibr optimizes for success first, cost second. Always.

A path with higher success rate will never lose to a path with lower success rate, even if the lower-performing path is significantly cheaper.

Cost and latency only matter when comparing paths with similar success rates. This ensures you never sacrifice quality for cost savings.
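The ordering can be sketched as a two-stage choice: keep only the paths whose success rate is close to the best, then pick the cheapest among them. The function name, the dictionary keys, and the 0.02 definition of "similar" are assumptions for illustration; the page does not state the real threshold.

```python
from typing import Dict, List


def pick_path(paths: List[Dict], similarity: float = 0.02) -> str:
    """Success first, cost second: cheapest path among the top performers."""
    best_rate = max(p["success_rate"] for p in paths)
    # Only paths within `similarity` of the best success rate compete on cost.
    contenders = [p for p in paths if best_rate - p["success_rate"] <= similarity]
    return min(contenders, key=lambda p: p["cost_per_call"])["name"]
```

A path that is much cheaper but clearly less reliable never enters the contender list, so its price cannot win it traffic.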

Bypass When Needed

Sometimes you need to override routing:

```python
# Force a specific model
response = router.completion(
    messages=[...],
    force_model="gpt-4o"
)
```

```typescript
// Force a specific model
const response = await router.completion(messages, {
  forceModel: 'gpt-4o',
});
```

The call is still traced, but routing is bypassed. Use this for:

Don't use it as your default. You lose the learning benefits.

Cost savings

The trust invariant (see above) guarantees Kalibr never sacrifices reliability for cost. But when two models have similar success rates, Kalibr routes to the cheaper one. Over time this compounds: a model that costs $0.004/call replacing one that costs $0.018/call across thousands of runs is real money. The Cost Saved by Kalibr KPI on your dashboard measures exactly this: the delta between what you spent and what you would have spent routing everything through the most expensive model in your path list.
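As a worked example with the per-call prices above, the saving is the price difference multiplied by the number of calls. The function name and the figure of 10,000 calls are assumptions for illustration, and the sketch simplifies by assuming every call went to the cheaper model.

```python
def cost_saved(
    calls: int, actual_cost_per_call: float, most_expensive_cost_per_call: float
) -> float:
    """Difference between the most-expensive-model baseline and actual spend."""
    return calls * (most_expensive_cost_per_call - actual_cost_per_call)


# 10,000 calls at $0.004 instead of $0.018 saves roughly $140.
saved = cost_saved(10_000, 0.004, 0.018)
print(f"${saved:.2f}")
```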

Auto Path Generation [FEATURE FLAG]

Not enabled by default. When the auto_path_generation flag is on, a background job runs hourly and extends the path registry automatically:

Contact us to enable this on your account.

Rate limits and concurrency

The Kalibr intelligence service is rate-limited by your API key tier. The routing decision call (/decide) is fast and designed for high concurrency. If you exceed your tier, requests return 429. Provider-level rate limits (OpenAI, Anthropic) are not managed by Kalibr; your application should handle these using standard retry logic.
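Standard retry logic for provider rate limits usually means exponential backoff with jitter. The sketch below is generic and not part of Kalibr: RateLimitError is a placeholder for whatever exception your provider SDK raises on HTTP 429, and the function name and default values are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Placeholder for a provider SDK's HTTP 429 exception."""


def call_with_retry(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on rate-limit errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

The sleep parameter exists so the delay can be replaced in tests; in production the default time.sleep is used.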

Failure modes and reliability

Intelligence service unavailable: If the /decide call to the Kalibr intelligence service fails or times out, the Router falls back to the first path in the paths list and logs a warning. Routing continues without learning until the service recovers. Outcome reports are queued locally and retried.
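The fallback behavior described above amounts to a guarded call. This sketch is illustrative only: choose_with_fallback and the decide callable are hypothetical names, not the Router's internals, and the local queueing of outcome reports is not shown.

```python
import logging
from typing import Callable, Sequence

logger = logging.getLogger("kalibr_fallback_sketch")


def choose_with_fallback(decide: Callable[[], str], paths: Sequence[str]) -> str:
    """Return the routed path, or the first configured path if routing fails."""
    try:
        return decide()
    except Exception as exc:  # timeout, connection error, 429, and so on
        logger.warning("decide failed (%s); falling back to %s", exc, paths[0])
        return paths[0]
```

Because the fallback is always the first entry, the order of the paths list determines what runs during an outage.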

All paths fail: When the heal loop exhausts all paths, the Router returns the best partial response with response.kalibr_heal_exhausted = True. A RuntimeError is only raised if no bytes were received at all (network-level failure before any response).

Rate limits: Kalibr routing decisions (/decide) are separate from provider calls. If your Kalibr tier is exceeded, the /decide call returns 429 and the Router falls back to the first path. Provider rate limits (OpenAI, Anthropic) are not managed by Kalibr and should be handled with standard retry logic in your application.
