Core Concepts


The Problem

You have an agent that books meetings. It uses gpt-4o. Sometimes it fails - wrong times, missed constraints, hallucinated availability.

You wonder: would Claude be better? What about with a different temperature? What if you added a calendar validation tool?

You could run manual experiments. Or you could let production tell you.


Goals

A goal is a task with a consistent success criterion.

Good goals:

  • book_meeting
  • extract_company
  • classify_ticket
  • generate_sql

Bad goals:

  • handle_request (too vague)
  • llm_call (no success criterion)

Each goal gets its own routing state. Kalibr learns independently for each.

When to create a new goal

  • Success criteria change - extract_company vs extract_company_with_domain
  • Input types differ - summarize_email vs summarize_transcript

When to keep the same goal

  • Only the input content varies (different emails, same extraction task)
  • You're testing different prompts for the same task
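The idea that "each goal gets its own routing state" can be pictured with a toy sketch. This is illustrative only, not Kalibr's internals; the class and names here are invented for the example:

```python
# Toy sketch: one independent state object per goal.
# Not Kalibr's actual implementation - just the concept.
from collections import defaultdict

class GoalState:
    """Tracks per-path outcomes independently for a single goal."""
    def __init__(self):
        self.successes = defaultdict(int)
        self.attempts = defaultdict(int)

    def report(self, path, success):
        self.attempts[path] += 1
        if success:
            self.successes[path] += 1

    def success_rate(self, path):
        n = self.attempts[path]
        return self.successes[path] / n if n else 0.0

# Outcomes reported for one goal never affect another goal's routing.
states = defaultdict(GoalState)
states["book_meeting"].report("gpt-4o", True)
states["extract_company"].report("gpt-4o", False)

states["book_meeting"].success_rate("gpt-4o")     # 1.0
states["extract_company"].success_rate("gpt-4o")  # 0.0
```

This is why overly broad goals like handle_request hurt: unrelated tasks get pooled into one state, and the success rates stop meaning anything.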

Paths

A path is a complete execution configuration: a model, optionally combined with tools and parameters. Paths can be as simple or as specific as you need:

Just models:

Python:

paths = ["gpt-4o", "claude-sonnet-4-20250514"]

TypeScript:

const paths = ["gpt-4o", "claude-sonnet-4-20250514"];

Model + tool combinations:

Python:

paths = [
    {"model": "gpt-4o", "tools": ["calendar_api"]},
    {"model": "gpt-4o", "tools": ["google_calendar"]},
    {"model": "claude-sonnet-4-20250514", "tools": ["calendar_api"]}
]

TypeScript:

const paths = [
  { model: "gpt-4o", tools: ["calendar_api"] },
  { model: "gpt-4o", tools: ["google_calendar"] },
  { model: "claude-sonnet-4-20250514", tools: ["calendar_api"] }
];

Model + tool + parameter combinations:

Python:

paths = [
    {"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.3}},
    {"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.7}}
]

TypeScript:

const paths = [
  { model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.3 } },
  { model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.7 } }
];

Kalibr tracks success rates for each unique path. If gpt-4o + calendar_api works better than gpt-4o + google_calendar, traffic shifts automatically.
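Conceptually, "traffic shifts automatically" means routing favors the path with the best observed record. A minimal sketch, assuming made-up stats and a plain greedy rule (Kalibr's real algorithm is not documented here):

```python
# Toy greedy selector over observed per-path stats.
# The numbers and the selection rule are illustrative, not Kalibr's.
stats = {
    "gpt-4o + calendar_api":    {"successes": 42, "attempts": 50},  # 84%
    "gpt-4o + google_calendar": {"successes": 30, "attempts": 50},  # 60%
}

def best_path(stats):
    """Route to the path with the highest observed success rate."""
    return max(stats, key=lambda p: stats[p]["successes"] / stats[p]["attempts"])

best_path(stats)  # "gpt-4o + calendar_api"
```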


Outcomes

An outcome is what you report after execution: success or failure, optionally with a failure reason or a quality score.

Python:

router.report(success=True)
router.report(success=False, reason="invalid_time")
router.report(success=True, score=0.85)

TypeScript:

await router.report({ success: true });
await router.report({ success: false, reason: "invalid_time" });
await router.report({ success: true, score: 0.85 });

Without outcomes, Kalibr can't learn. This is the feedback loop.

What Kalibr tracks per path:

  • Success rate
  • Sample count
  • Trend (improving / stable / degrading)
  • Cost and latency (from traces)

What Kalibr ignores:

  • Your prompts
  • Response content
  • Anything that could leak sensitive data

Constraints

You can add constraints to routing decisions:

Python:

policy = get_policy(
    goal="book_meeting",
    constraints={
        "max_cost_usd": 0.05,
        "max_latency_ms": 2000,
        "min_quality": 0.8
    }
)

TypeScript:

const policy = await getPolicy({
  goal: "book_meeting",
  constraints: {
    maxCostUsd: 0.05,
    maxLatencyMs: 2000,
    minQuality: 0.8
  }
});

Kalibr will only recommend paths that meet all constraints.
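"Meet all constraints" is a simple conjunction over each path's observed stats. A sketch with invented per-path numbers (not values Kalibr exposes in this form):

```python
# Toy constraint filter: a path is eligible only if every limit holds.
# Path names and metrics are made up for the example.
paths = [
    {"name": "gpt-4o",      "cost_usd": 0.04, "latency_ms": 1800, "quality": 0.85},
    {"name": "gpt-4o-mini", "cost_usd": 0.01, "latency_ms": 900,  "quality": 0.70},
]
constraints = {"max_cost_usd": 0.05, "max_latency_ms": 2000, "min_quality": 0.8}

def eligible(path, c):
    return (path["cost_usd"] <= c["max_cost_usd"]
            and path["latency_ms"] <= c["max_latency_ms"]
            and path["quality"] >= c["min_quality"])

[p["name"] for p in paths if eligible(p, constraints)]  # ["gpt-4o"]
```

Here gpt-4o-mini is cheaper and faster, but its quality falls below min_quality, so it is excluded: one failed constraint is enough.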


What Kalibr Doesn't Do

  • Not a proxy - Calls go directly to providers. Kalibr just decides which one.
  • Not a retry system - If a call fails, it fails. Kalibr learns and routes away next time.
  • Not eval tooling - Kalibr doesn't judge output quality. You define success.
  • Not an agent framework - You own your logic. Kalibr only picks the path.

Next