Core Concepts
The Problem
You have an agent that books meetings. It uses gpt-4o. Sometimes it fails - wrong times, missed constraints, hallucinated availability.
You wonder: would Claude be better? What about with a different temperature? What if you added a calendar validation tool?
You could run manual experiments. Or you could let production tell you.
Goals
A goal is a task with a consistent success criterion.
Good goals:
- book_meeting
- extract_company
- classify_ticket
- generate_sql
Bad goals:
- handle_request (too vague)
- llm_call (no success criterion)
Each goal gets its own routing state. Kalibr learns independently for each.
When to create a new goal
- Success criteria change - extract_company vs extract_company_with_domain
- Input types differ - summarize_email vs summarize_transcript
When to keep the same goal
- Only the input content varies (different emails, same extraction task)
- You're testing different prompts for the same task
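The idea above can be sketched in plain Python (this is illustrative only, not Kalibr's internals): per-goal routing state just means each goal accumulates its own outcome counts per path, even when goals share a model.

```python
# Illustrative sketch: separate routing state per goal.
# goal -> path -> outcome counts
routing_state = {}

def record(goal: str, path: str, success: bool) -> None:
    paths = routing_state.setdefault(goal, {})
    stats = paths.setdefault(path, {"successes": 0, "failures": 0})
    stats["successes" if success else "failures"] += 1

record("extract_company", "gpt-4o", True)
record("extract_company_with_domain", "gpt-4o", False)
# The same model accrues separate statistics under each goal,
# so one goal's learning never bleeds into another's.
```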
Paths
A path is a complete execution configuration. Paths work across any modality: text LLMs, voice models, image generators, embedding models, and any model on HuggingFace.
Just models:
paths = ["gpt-4o", "claude-sonnet-4-20250514", "openai/whisper-large-v3"]
const paths = ["gpt-4o", "claude-sonnet-4-20250514", "openai/whisper-large-v3"];
Model + tool combinations:
paths = [
{"model": "gpt-4o", "tools": ["calendar_api"]},
{"model": "gpt-4o", "tools": ["google_calendar"]},
{"model": "claude-sonnet-4-20250514", "tools": ["calendar_api"]}
]
const paths = [
{ model: "gpt-4o", tools: ["calendar_api"] },
{ model: "gpt-4o", tools: ["google_calendar"] },
{ model: "claude-sonnet-4-20250514", tools: ["calendar_api"] }
];
Model + tool + parameter combinations:
paths = [
{"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.3}},
{"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.7}},
]
const paths = [
{ model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.3 } },
{ model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.7 } },
];
Kalibr tracks success rates for each unique path. If gpt-4o + calendar_api works better than gpt-4o + google_calendar, traffic shifts automatically.
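One common way this kind of automatic traffic shift works is Thompson Sampling over Beta posteriors, which the Feedback Loop section below names as Kalibr's method. Here is a minimal sketch with made-up counts for the two paths above; the exact priors and bookkeeping are assumptions:

```python
import random

# Made-up observed outcomes per path (mirroring the example above).
stats = {
    "gpt-4o + calendar_api": {"successes": 40, "failures": 10},
    "gpt-4o + google_calendar": {"successes": 25, "failures": 25},
}

def pick_path() -> str:
    # Sample a plausible success rate from each path's Beta posterior
    # (Beta(successes + 1, failures + 1)), then route to the highest sample.
    samples = {
        path: random.betavariate(s["successes"] + 1, s["failures"] + 1)
        for path, s in stats.items()
    }
    return max(samples, key=samples.get)

random.seed(0)
picks = [pick_path() for _ in range(1000)]
# The stronger path wins most draws; the occasional other pick is exploration.
```

Because the selection is a random draw rather than a hard argmax, the weaker path still receives a trickle of traffic, which is how degradation or recovery gets detected.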
Outcomes
An outcome is what you report after execution: success or failure, optionally with a continuous quality score.
# Binary outcome
router.report(success=True)
router.report(success=False, reason="invalid_time")
# Continuous quality score — feeds directly into routing
router.report(success=True, score=0.85)
# Score provides finer signal than binary alone.
# A path scoring 0.85 consistently will be preferred
# over one scoring 0.6, even if both technically "succeed."
await router.report({ success: true });
await router.report({ success: false, reason: "invalid_time" });
await router.report({ success: true, score: 0.85 });
Without outcomes, Kalibr can't learn. This is the feedback loop.
What Kalibr tracks per path:
- Success rate (binary pass/fail)
- Quality score distribution (continuous 0-1, when reported)
- Sample count
- Trend (improving / stable / degrading)
- Cost and latency (from traces)
What Kalibr ignores:
- Your prompts
- Response content
- Anything that could leak sensitive data
The Feedback Loop
Kalibr captures execution telemetry and serves it back as structured intelligence. The full loop:
1. Report outcomes — your agent reports success or failure after each task, optionally with a continuous quality score (0-1) and a structured failure_category (timeout, tool_error, hallucination_detected, etc.). The continuous score feeds directly into Thompson Sampling for finer-grained routing.
2. Kalibr learns — Thompson Sampling updates beliefs about which paths work best, using both binary outcomes and continuous quality scores. Trend detection identifies degradation. Rollback monitoring disables failing paths automatically.
3. Query insights — a coding agent calls get_insights() and receives structured diagnostics: which goals are healthy, which are failing, which failure modes dominate, which paths underperform, which parameters matter.
4. Update outcomes — when real-world signals arrive later (customer reopened ticket 48 hours after "resolution"), update_outcome() corrects the record. Every downstream component learns from the correction.
5. Auto-explore — when enabled, Kalibr automatically generates new path configurations by interpolating parameter values (e.g., trying temperature 0.2 if 0.3 was the best value tested). New paths are evaluated through existing exploration traffic.
The human's role: set goals, define success criteria, own billing, check in occasionally. Everything else is agent-to-agent.
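The auto-explore step can be sketched as parameter interpolation. Only the example from the text (trying 0.2 when 0.3 was the best tested value) comes from the source; the generation strategy below, stepping around the winner and taking midpoints toward other tested values, is a hypothetical illustration.

```python
# Hypothetical sketch of auto-explore candidate generation for one
# parameter (temperature). Not Kalibr's actual algorithm.
def propose_temperatures(tested: list[float], best: float, step: float = 0.1) -> list[float]:
    # Small offsets around the current best value.
    candidates = {round(best - step, 2), round(best + step, 2)}
    # Midpoints between the best value and every other tested value.
    for t in tested:
        if t != best:
            candidates.add(round((t + best) / 2, 2))
    # Keep only valid, untried values.
    return sorted(c for c in candidates if 0.0 <= c <= 2.0 and c not in tested)

new = propose_temperatures(tested=[0.3, 0.7], best=0.3)
# → [0.2, 0.4, 0.5]: a step below the winner, a step above it,
#   and the midpoint toward the other tested value.
```

New candidates like these would then be evaluated through the existing exploration traffic rather than a separate experiment.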
Failure Categories
Instead of free-text failure reasons that can't be aggregated, Kalibr supports structured failure categories. These enable clean clustering: "60% of failures for this goal are timeouts" rather than parsing thousands of unique error strings.
from kalibr import FAILURE_CATEGORIES
# timeout, context_exceeded, tool_error, rate_limited,
# validation_failed, hallucination_detected, user_unsatisfied,
# empty_response, malformed_output, auth_error, provider_error, unknown
router.report(success=False, failure_category="timeout",
reason="Provider timed out after 30s")
Constraints
You can add constraints to routing decisions:
policy = get_policy(
goal="book_meeting",
constraints={
"max_cost_usd": 0.05,
"max_latency_ms": 2000,
"min_quality": 0.8
}
)
const policy = await getPolicy({
goal: "book_meeting",
constraints: {
maxCostUsd: 0.05,
maxLatencyMs: 2000,
minQuality: 0.8
}
});
Kalibr will only recommend paths that meet all constraints.
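The filtering rule reduces to a conjunction over per-path metrics. A minimal sketch, with invented cost/latency/quality numbers (in Kalibr these come from traces and reported scores):

```python
# Hypothetical per-path metrics; only the constraint keys mirror the example above.
paths = [
    {"name": "gpt-4o", "cost_usd": 0.04, "latency_ms": 1500, "quality": 0.85},
    {"name": "gpt-4o-mini", "cost_usd": 0.01, "latency_ms": 900, "quality": 0.75},
    {"name": "claude-sonnet-4-20250514", "cost_usd": 0.06, "latency_ms": 1800, "quality": 0.9},
]
constraints = {"max_cost_usd": 0.05, "max_latency_ms": 2000, "min_quality": 0.8}

# A path is eligible only if it satisfies every constraint.
eligible = [
    p["name"] for p in paths
    if p["cost_usd"] <= constraints["max_cost_usd"]
    and p["latency_ms"] <= constraints["max_latency_ms"]
    and p["quality"] >= constraints["min_quality"]
]
# Only gpt-4o passes: the mini model misses min_quality,
# and the claude path exceeds max_cost_usd.
```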
What Kalibr Doesn't Do
- Not a proxy - Calls go directly to providers. Kalibr just decides which one.
- Not a retry system - If a call fails, it fails. Kalibr learns and routes away next time.
- Not eval tooling - Kalibr doesn't judge output quality. You define success.
- Not an agent framework - You own your logic. Kalibr only picks the path.
Next
- How Routing Works - Statistical methods, exploration vs exploitation
- API Reference - Full Router API
- Production Guide - Error handling, monitoring, debugging