Core Concepts
The Problem
You have an agent that books meetings. It uses gpt-4o. Sometimes it fails - wrong times, missed constraints, hallucinated availability.
You wonder: would Claude be better? What about with a different temperature? What if you added a calendar validation tool?
You could run manual experiments. Or you could let production tell you.
Goals
A goal is a task with a consistent success criterion.
Good goals:
- book_meeting
- extract_company
- classify_ticket
- generate_sql
Bad goals:
- handle_request (too vague)
- llm_call (no success criterion)
Each goal gets its own routing state. Kalibr learns independently for each.
When to create a new goal
- Success criteria change - extract_company vs extract_company_with_domain
- Input types differ - summarize_email vs summarize_transcript
When to keep the same goal
- Only the input content varies (different emails, same extraction task)
- You're testing different prompts for the same task
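The idea above can be sketched in plain Python (this is illustrative only, not Kalibr's internals): per-goal routing state just means each goal accumulates its own outcome counts per path, even when goals share a model.

```python
# Illustrative sketch: separate routing state per goal.
# goal -> path -> outcome counts
routing_state = {}

def record(goal: str, path: str, success: bool) -> None:
    paths = routing_state.setdefault(goal, {})
    stats = paths.setdefault(path, {"successes": 0, "failures": 0})
    stats["successes" if success else "failures"] += 1

record("extract_company", "gpt-4o", True)
record("extract_company_with_domain", "gpt-4o", False)
# The same model accrues separate statistics under each goal,
# so one goal's learning never bleeds into another's.
```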
Paths
A path is a complete execution configuration. Paths work across any modality: text LLMs, voice models, image generators, embedding models, and any model on HuggingFace.
Just models:
paths = ["gpt-4o", "claude-sonnet-4-20250514", "openai/whisper-large-v3"]
const paths = ["gpt-4o", "claude-sonnet-4-20250514", "openai/whisper-large-v3"];
Model + tool combinations:
paths = [
{"model": "gpt-4o", "tools": ["calendar_api"]},
{"model": "gpt-4o", "tools": ["google_calendar"]},
{"model": "claude-sonnet-4-20250514", "tools": ["calendar_api"]}
]
const paths = [
{ model: "gpt-4o", tools: ["calendar_api"] },
{ model: "gpt-4o", tools: ["google_calendar"] },
{ model: "claude-sonnet-4-20250514", tools: ["calendar_api"] }
];
Model + tool + parameter combinations:
paths = [
{"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.3}},
{"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.7}},
]
const paths = [
{ model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.3 } },
{ model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.7 } },
];
Kalibr tracks success rates for each unique path. If gpt-4o + calendar_api works better than gpt-4o + google_calendar, traffic shifts automatically.
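One common way this kind of automatic traffic shift works is Thompson Sampling over Beta posteriors, which the Feedback Loop section below names as Kalibr's method. Here is a minimal sketch with made-up counts for the two paths above; the exact priors and bookkeeping are assumptions:

```python
import random

# Made-up observed outcomes per path (mirroring the example above).
stats = {
    "gpt-4o + calendar_api": {"successes": 40, "failures": 10},
    "gpt-4o + google_calendar": {"successes": 25, "failures": 25},
}

def pick_path() -> str:
    # Sample a plausible success rate from each path's Beta posterior
    # (Beta(successes + 1, failures + 1)), then route to the highest sample.
    samples = {
        path: random.betavariate(s["successes"] + 1, s["failures"] + 1)
        for path, s in stats.items()
    }
    return max(samples, key=samples.get)

random.seed(0)
picks = [pick_path() for _ in range(1000)]
# The stronger path wins most draws; the occasional other pick is exploration.
```

Because the selection is a random draw rather than a hard argmax, the weaker path still receives a trickle of traffic, which is how degradation or recovery gets detected.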
Outcomes
An outcome is what you report after execution: success or failure, optionally with a continuous quality score.
# Binary outcome
router.report(success=True)
router.report(success=False, reason="invalid_time")
# Continuous quality score — feeds directly into routing
router.report(success=True, score=0.85)
# Score provides finer signal than binary alone.
# A path scoring 0.85 consistently will be preferred
# over one scoring 0.6, even if both technically "succeed."
await router.report({ success: true });
await router.report({ success: false, reason: "invalid_time" });
await router.report({ success: true, score: 0.85 });
Without outcomes, Kalibr can't learn. This is the feedback loop.
What Kalibr tracks per path:
- Success rate (binary pass/fail)
- Quality score distribution (continuous 0-1, when reported)
- Sample count
- Trend (improving / stable / degrading)
- Cost and latency (from traces)
What Kalibr ignores:
- Your prompts
- Response content
- Anything that could leak sensitive data
The Feedback Loop
Kalibr captures execution telemetry and serves it back as structured intelligence. The full loop:
1. Report outcomes — your agent reports success or failure after each task, optionally with a continuous quality score (0-1) and a structured failure_category (timeout, tool_error, hallucination_detected, etc.). The continuous score feeds directly into Thompson Sampling for finer-grained routing.
2. Kalibr learns — Thompson Sampling updates beliefs about which paths work best, using both binary outcomes and continuous quality scores. Trend detection identifies degradation. Rollback monitoring disables failing paths automatically.
3. Query insights — a coding agent calls get_insights() and receives structured diagnostics: which goals are healthy, which are failing, which failure modes dominate, which paths underperform, which parameters matter.
4. Update outcomes — when real-world signals arrive later (customer reopened ticket 48 hours after "resolution"), update_outcome() corrects the record. Every downstream component learns from the correction.
5. Auto-explore — when enabled, Kalibr automatically generates new path configurations by interpolating parameter values (e.g., trying temperature 0.2 if 0.3 was the best value tested). New paths are evaluated through existing exploration traffic.
The human's role: set goals, define success criteria, own billing, check in occasionally. Everything else is agent-to-agent.
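The auto-explore step can be sketched as parameter interpolation. Only the example from the text (trying 0.2 when 0.3 was the best tested value) comes from the source; the generation strategy below, stepping around the winner and taking midpoints toward other tested values, is a hypothetical illustration.

```python
# Hypothetical sketch of auto-explore candidate generation for one
# parameter (temperature). Not Kalibr's actual algorithm.
def propose_temperatures(tested: list[float], best: float, step: float = 0.1) -> list[float]:
    # Small offsets around the current best value.
    candidates = {round(best - step, 2), round(best + step, 2)}
    # Midpoints between the best value and every other tested value.
    for t in tested:
        if t != best:
            candidates.add(round((t + best) / 2, 2))
    # Keep only valid, untried values.
    return sorted(c for c in candidates if 0.0 <= c <= 2.0 and c not in tested)

new = propose_temperatures(tested=[0.3, 0.7], best=0.3)
# → [0.2, 0.4, 0.5]: a step below the winner, a step above it,
#   and the midpoint toward the other tested value.
```

New candidates like these would then be evaluated through the existing exploration traffic rather than a separate experiment.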
Failure Categories
Instead of free-text failure reasons that can't be aggregated, Kalibr supports structured failure categories. These enable clean clustering: "60% of failures for this goal are timeouts" rather than parsing thousands of unique error strings.
from kalibr import FAILURE_CATEGORIES
# timeout, context_exceeded, tool_error, rate_limited,
# validation_failed, hallucination_detected, user_unsatisfied,
# empty_response, malformed_output, auth_error, provider_error, unknown
router.report(success=False, failure_category="timeout",
reason="Provider timed out after 30s")
Constraints
You can add constraints to routing decisions:
policy = get_policy(
goal="book_meeting",
constraints={
"max_cost_usd": 0.05,
"max_latency_ms": 2000,
"min_quality": 0.8
}
)
const policy = await getPolicy({
goal: "book_meeting",
constraints: {
maxCostUsd: 0.05,
maxLatencyMs: 2000,
minQuality: 0.8
}
});
Kalibr will only recommend paths that meet all constraints.
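The filtering rule reduces to a conjunction over per-path metrics. A minimal sketch, with invented cost/latency/quality numbers (in Kalibr these come from traces and reported scores):

```python
# Hypothetical per-path metrics; only the constraint keys mirror the example above.
paths = [
    {"name": "gpt-4o", "cost_usd": 0.04, "latency_ms": 1500, "quality": 0.85},
    {"name": "gpt-4o-mini", "cost_usd": 0.01, "latency_ms": 900, "quality": 0.75},
    {"name": "claude-sonnet-4-20250514", "cost_usd": 0.06, "latency_ms": 1800, "quality": 0.9},
]
constraints = {"max_cost_usd": 0.05, "max_latency_ms": 2000, "min_quality": 0.8}

# A path is eligible only if it satisfies every constraint.
eligible = [
    p["name"] for p in paths
    if p["cost_usd"] <= constraints["max_cost_usd"]
    and p["latency_ms"] <= constraints["max_latency_ms"]
    and p["quality"] >= constraints["min_quality"]
]
# Only gpt-4o passes: the mini model misses min_quality,
# and the claude path exceeds max_cost_usd.
```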
What Kalibr Doesn't Do
- Not a proxy - Calls go directly to providers. Kalibr just decides which one.
- Not a retry system - If a call fails, it fails. Kalibr learns and routes away next time.
- Not eval tooling - Kalibr doesn't judge output quality. You define success.
- Not an agent framework - You own your logic. Kalibr only picks the path.
Next
- How Routing Works - Statistical methods, exploration vs exploitation
- API Reference - Full Router API
- Production Guide - Error handling, monitoring, debugging