Core Concepts


The Problem

You have an agent that books meetings. It uses gpt-4o. Sometimes it fails - wrong times, missed constraints, hallucinated availability.

You wonder: would Claude be better? What about with a different temperature? What if you added a calendar validation tool?

You could run manual experiments. Or you could let production tell you.


Goals

A goal is a task with a consistent success criterion.

Good goals:

  • book_meeting
  • extract_company
  • classify_ticket
  • generate_sql

Bad goals:

  • handle_request (too vague)
  • llm_call (no success criterion)

Each goal gets its own routing state. Kalibr learns independently for each.

When to create a new goal

  • Success criteria change - extract_company vs extract_company_with_domain
  • Input types differ - summarize_email vs summarize_transcript

When to keep the same goal

  • Only the input content varies (different emails, same extraction task)
  • You're testing different prompts for the same task
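The "own routing state" point can be made concrete with a sketch (plain Python, illustrating the concept rather than the Kalibr API): each goal name keys an independent table of per-path statistics, so outcomes reported for one goal never influence routing for another.

```python
from collections import defaultdict

# Conceptual model: routing state is partitioned by goal name.
# Each goal keeps its own per-path success/failure counts.
routing_state = defaultdict(
    lambda: defaultdict(lambda: {"successes": 0, "failures": 0})
)

def record(goal, path, success):
    stats = routing_state[goal][path]
    stats["successes" if success else "failures"] += 1

# Outcomes for extract_company never touch extract_company_with_domain.
record("extract_company", "gpt-4o", True)
record("extract_company_with_domain", "gpt-4o", False)
```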

Paths

A path is a complete execution configuration. Paths work across any modality: text LLMs, voice models, image generators, embedding models, and any model on HuggingFace.

Just models:

Python:

paths = ["gpt-4o", "claude-sonnet-4-20250514", "openai/whisper-large-v3"]

TypeScript:

const paths = ["gpt-4o", "claude-sonnet-4-20250514", "openai/whisper-large-v3"];

Model + tool combinations:

Python:

paths = [
    {"model": "gpt-4o", "tools": ["calendar_api"]},
    {"model": "gpt-4o", "tools": ["google_calendar"]},
    {"model": "claude-sonnet-4-20250514", "tools": ["calendar_api"]}
]

TypeScript:

const paths = [
  { model: "gpt-4o", tools: ["calendar_api"] },
  { model: "gpt-4o", tools: ["google_calendar"] },
  { model: "claude-sonnet-4-20250514", tools: ["calendar_api"] }
];

Model + tool + parameter combinations:

Python:

paths = [
    {"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.3}},
    {"model": "gpt-4o", "tools": ["calendar_api"], "params": {"temperature": 0.7}},
]

TypeScript:

const paths = [
  { model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.3 } },
  { model: "gpt-4o", tools: ["calendar_api"], params: { temperature: 0.7 } },
];

Kalibr tracks success rates for each unique path. If gpt-4o + calendar_api works better than gpt-4o + google_calendar, traffic shifts automatically.
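What "each unique path" means in practice: the full configuration, not just the model, identifies a path. One way to sketch this (assumed internals, not Kalibr's actual implementation) is to serialize the configuration into a hashable key and accumulate outcomes per key:

```python
import json

def path_key(path):
    """Normalize a path config (string or dict) into a stable, hashable key."""
    if isinstance(path, str):
        path = {"model": path}
    return json.dumps(path, sort_keys=True)

# key -> (successes, total attempts)
success_counts = {}

def report_outcome(path, success):
    key = path_key(path)
    wins, total = success_counts.get(key, (0, 0))
    success_counts[key] = (wins + int(success), total + 1)

# Same model, different tools: two distinct paths with separate stats.
report_outcome({"model": "gpt-4o", "tools": ["calendar_api"]}, True)
report_outcome({"model": "gpt-4o", "tools": ["google_calendar"]}, False)
```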


Outcomes

An outcome is what you report after execution: success or failure, optionally with a continuous quality score.

Python:

# Binary outcome
router.report(success=True)
router.report(success=False, reason="invalid_time")

# Continuous quality score: feeds directly into routing
router.report(success=True, score=0.85)

TypeScript:

// Binary outcome
await router.report({ success: true });
await router.report({ success: false, reason: "invalid_time" });

// Continuous quality score
await router.report({ success: true, score: 0.85 });

A score provides finer signal than binary outcomes alone: a path scoring 0.85 consistently will be preferred over one scoring 0.6, even if both technically "succeed."

Without outcomes, Kalibr can't learn. This is the feedback loop.
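Why continuous scores matter beyond binary success can be shown with a toy comparison (illustrative numbers, not Kalibr internals): two paths with identical success rates separate cleanly once mean quality is considered.

```python
# Two paths, both "succeeding" 100% of the time in this sample...
scores_a = [0.85, 0.87, 0.84, 0.86]   # consistently high quality
scores_b = [0.60, 0.62, 0.58, 0.61]   # passes, but lower quality

mean_a = sum(scores_a) / len(scores_a)
mean_b = sum(scores_b) / len(scores_b)

# Binary outcomes alone cannot distinguish these paths; mean score can.
assert mean_a > mean_b
```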

What Kalibr tracks per path:

  • Success rate (binary pass/fail)
  • Quality score distribution (continuous 0-1, when reported)
  • Sample count
  • Trend (improving / stable / degrading)
  • Cost and latency (from traces)

What Kalibr ignores:

  • Your prompts
  • Response content
  • Anything that could leak sensitive data

The Feedback Loop

Kalibr captures execution telemetry and serves it back as structured intelligence. The full loop:

1. Report outcomes — your agent reports success or failure after each task, optionally with a continuous quality score (0-1) and a structured failure_category (timeout, tool_error, hallucination_detected, etc.). The continuous score feeds directly into Thompson Sampling for finer-grained routing.

2. Kalibr learns — Thompson Sampling updates beliefs about which paths work best, using both binary outcomes and continuous quality scores. Trend detection identifies degradation. Rollback monitoring disables failing paths automatically.

3. Query insights — a coding agent calls get_insights() and receives structured diagnostics: which goals are healthy, which are failing, which failure modes dominate, which paths underperform, which parameters matter.

4. Update outcomes — when real-world signals arrive later (customer reopened ticket 48 hours after "resolution"), update_outcome() corrects the record. Every downstream component learns from the correction.

5. Auto-explore — when enabled, Kalibr automatically generates new path configurations by interpolating parameter values (e.g., trying temperature 0.2 if 0.3 was the best value tested). New paths are evaluated through existing exploration traffic.

The human's role: set goals, define success criteria, own billing, check in occasionally. Everything else is agent-to-agent.
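Step 2's Thompson Sampling can be sketched in a few lines. This is a minimal Beta-Bernoulli version over binary outcomes only; Kalibr's actual implementation also incorporates continuous scores, trend detection, and rollback monitoring.

```python
import random

# Maintain a Beta(successes + 1, failures + 1) belief per path,
# sample once from each belief, and route to the highest sample.
paths = {
    "gpt-4o+calendar_api":    {"successes": 90, "failures": 10},
    "gpt-4o+google_calendar": {"successes": 60, "failures": 40},
}

def choose_path():
    samples = {
        name: random.betavariate(s["successes"] + 1, s["failures"] + 1)
        for name, s in paths.items()
    }
    return max(samples, key=samples.get)

# Over many draws, traffic concentrates on the stronger path while the
# weaker one still receives occasional exploration traffic.
picks = [choose_path() for _ in range(1000)]
share = picks.count("gpt-4o+calendar_api") / len(picks)
```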

Failure Categories

Instead of free-text failure reasons that can't be aggregated, Kalibr supports structured failure categories. These enable clean clustering: "60% of failures for this goal are timeouts" rather than parsing thousands of unique error strings.

from kalibr import FAILURE_CATEGORIES

# timeout, context_exceeded, tool_error, rate_limited,
# validation_failed, hallucination_detected, user_unsatisfied,
# empty_response, malformed_output, auth_error, provider_error, unknown

router.report(success=False, failure_category="timeout",
              reason="Provider timed out after 30s")
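The aggregation that structured categories enable can be sketched directly, using plain Python over a hypothetical batch of reported failures:

```python
from collections import Counter

# Hypothetical failure reports for one goal.
failures = (
    ["timeout"] * 6 + ["tool_error"] * 3 + ["hallucination_detected"] * 1
)

counts = Counter(failures)
total = len(failures)

# Structured categories cluster cleanly:
# "60% of failures for this goal are timeouts."
timeout_share = counts["timeout"] / total
dominant_category = counts.most_common(1)[0][0]
```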

Constraints

You can add constraints to routing decisions:

Python:

policy = get_policy(
    goal="book_meeting",
    constraints={
        "max_cost_usd": 0.05,
        "max_latency_ms": 2000,
        "min_quality": 0.8
    }
)

TypeScript:

const policy = await getPolicy({
  goal: "book_meeting",
  constraints: {
    maxCostUsd: 0.05,
    maxLatencyMs: 2000,
    minQuality: 0.8
  }
});

Kalibr will only recommend paths that meet all constraints.
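Constraint filtering can be modeled as a simple pre-filter over tracked per-path metrics, applied before any routing choice is made (a sketch with illustrative numbers, not Kalibr internals):

```python
# Tracked per-path metrics (in Kalibr, derived from traces and reported scores).
path_metrics = {
    "gpt-4o":          {"cost_usd": 0.04, "latency_ms": 1800, "quality": 0.85},
    "claude-sonnet-4": {"cost_usd": 0.06, "latency_ms": 1500, "quality": 0.90},
    "gpt-4o-mini":     {"cost_usd": 0.01, "latency_ms": 900,  "quality": 0.70},
}

constraints = {"max_cost_usd": 0.05, "max_latency_ms": 2000, "min_quality": 0.8}

def meets(metrics, c):
    # A path is eligible only if it satisfies every constraint.
    return (metrics["cost_usd"] <= c["max_cost_usd"]
            and metrics["latency_ms"] <= c["max_latency_ms"]
            and metrics["quality"] >= c["min_quality"])

eligible = [p for p, m in path_metrics.items() if meets(m, constraints)]
```

Here "claude-sonnet-4" is excluded on cost and "gpt-4o-mini" on quality, leaving only paths that satisfy all three limits.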


What Kalibr Doesn't Do

  • Not a proxy - Calls go directly to providers. Kalibr just decides which one.
  • Not a retry system - If a call fails, it fails. Kalibr learns and routes away next time.
  • Not eval tooling - Kalibr doesn't judge output quality. You define success.
  • Not an agent framework - You own your logic. Kalibr only picks the path.
