Core Concepts
The Problem
You have an agent that books meetings. It uses gpt-4o. Sometimes it fails - wrong times, missed constraints, hallucinated availability.
You wonder: would Claude be better? What about with a different temperature? What if you added a calendar validation tool?
You could run manual experiments. Or you could let production tell you.
Goals
A goal is a task with a consistent success criterion.
Good goals:
- book_meeting
- extract_company
- classify_ticket
- generate_sql
Bad goals:
- handle_request (too vague)
- llm_call (no success criterion)
Each goal gets its own routing state. Kalibr learns independently for each.
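Because each goal has its own routing state, you simply pass a stable goal name with every request and Kalibr accumulates outcomes per goal. A minimal sketch; the kalibr import path is an assumption (only get_policy and its goal argument appear in this doc):

from kalibr import get_policy  # import path assumed

# Two goals, two independent routing states -- outcomes reported for one
# never influence routing decisions for the other.
booking_policy = get_policy(goal="book_meeting")
extraction_policy = get_policy(goal="extract_company")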
When to create a new goal
- Success criteria change - extract_company vs extract_company_with_domain
- Input types differ - summarize_email vs summarize_transcript
When to keep the same goal
- Only the input content varies (different emails, same extraction task)
- You're testing different prompts for the same task (see the sketch below)
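For example, routing several prompt variants of the same extraction task under one goal keeps all of their outcomes in a single routing state; Kalibr never sees the prompt itself. A sketch, with the import path and prompt texts purely illustrative:

from kalibr import get_policy  # import path assumed

PROMPTS = {
    "v1": "Extract the company name from this email.",
    "v2": "Return only the company mentioned in the email below.",
}

def extract_company(email_text: str, prompt_version: str):
    policy = get_policy(goal="extract_company")  # same goal every time
    prompt = f"{PROMPTS[prompt_version]}\n\n{email_text}"
    ...  # execute on the recommended path, then report the outcome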
Paths
A path is a complete execution configuration:
Just models:
paths = ["gpt-4o", "claude-sonnet-4-20250514"]
const paths = ["gpt-4o", "claude-sonnet-4-20250514"];
Model + tool combinations:
paths = [
{"model": "gpt-4o", "tools": ["calendar_api"]},
{"model": "gpt-4o", "tools": ["google_calendar"]},
{"model": "claude-sonnet-4-20250514", "tools": ["calendar_api"]}
]
const paths = [
{ model: "gpt-4o", tools: ["calendar_api"] },
{ model: "gpt-4o", tools: ["google_calendar"] },
{ model: "claude-sonnet-4-20250514", tools: ["calendar_api"] }
];
Model + tool + parameter combinations:
paths = [
{"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.3}},
{"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.7}},
]
const paths = [
{ model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.3 } },
{ model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.7 } },
];
Kalibr tracks success rates for each unique path. If gpt-4o + calendar_api works better than gpt-4o + google_calendar, traffic shifts automatically.
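Conceptually, every unique path carries its own success and failure counts, and routing keeps sampling all paths while sending most traffic to whichever currently looks best. The toy sketch below uses Thompson sampling to illustrate that trade-off; it is an illustration only, not Kalibr's actual algorithm (see How Routing Works):

import random
from collections import defaultdict

# Toy illustration of per-path stats and traffic shifting -- NOT Kalibr's implementation.
paths = [
    "gpt-4o + calendar_api",
    "gpt-4o + google_calendar",
    "claude-sonnet-4-20250514 + calendar_api",
]
stats = defaultdict(lambda: {"success": 0, "failure": 0})

def choose_path():
    # Thompson sampling: draw from each path's Beta posterior, take the best draw.
    def draw(path):
        s = stats[path]
        return random.betavariate(s["success"] + 1, s["failure"] + 1)
    return max(paths, key=draw)

def record(path, success):
    stats[path]["success" if success else "failure"] += 1

As a path's observed success rate improves, its posterior draws win more often and it receives more traffic, while weaker paths still get occasional exploration.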
Outcomes
An outcome is what you report after execution: success or failure, optionally with a reason or a quality score.
router.report(success=True)
router.report(success=False, reason="invalid_time")
router.report(success=True, score=0.85)
await router.report({ success: true });
await router.report({ success: false, reason: "invalid_time" });
await router.report({ success: true, score: 0.85 });
Without outcomes, Kalibr can't learn. This is the feedback loop.
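Putting it together, the loop is: request a policy, execute on the recommended path, apply your own success criterion, report. In this sketch the import path, the policy.model / policy.tools attribute names, and the call_llm / is_valid_slot helpers are all placeholders for illustration (router is the reporting object shown above); the real names are in the API Reference:

from kalibr import get_policy  # import path assumed

def book_meeting(request):
    policy = get_policy(goal="book_meeting")

    # Execute with whatever path the policy recommends.
    # `policy.model` and `policy.tools` are hypothetical attribute names;
    # `call_llm` stands in for your own provider call.
    result = call_llm(model=policy.model, tools=policy.tools, request=request)

    # You define success: here a booking counts only if the slot is valid.
    if is_valid_slot(result.time):
        router.report(success=True)
    else:
        router.report(success=False, reason="invalid_time")
    return result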
What Kalibr tracks per path:
- Success rate
- Sample count
- Trend (improving / stable / degrading)
- Cost and latency (from traces)
What Kalibr ignores:
- Your prompts
- Response content
- Anything that could leak sensitive data
Constraints
You can add constraints to routing decisions:
policy = get_policy(
goal="book_meeting",
constraints={
"max_cost_usd": 0.05,
"max_latency_ms": 2000,
"min_quality": 0.8
}
)
const policy = await getPolicy({
goal: "book_meeting",
constraints: {
maxCostUsd: 0.05,
maxLatencyMs: 2000,
minQuality: 0.8
}
});
Kalibr will only recommend paths that meet all constraints.
What Kalibr Doesn't Do
- Not a proxy - Calls go directly to providers. Kalibr just decides which one.
- Not a retry system - If a call fails, it fails. Kalibr learns and routes away next time.
- Not eval tooling - Kalibr doesn't judge output quality. You define success.
- Not an agent framework - You own your logic. Kalibr only picks the path.
Next
- How Routing Works - Statistical methods, exploration vs exploitation
- API Reference - Full Router API
- Production Guide - Error handling, monitoring, debugging