How Kalibr evaluates outputs

Every call goes through three evaluation layers. Each one operates at a different speed and asks a different question about the output. Together they determine what gets reported back to the routing engine and what gets retried.

Gate 1: Did the output meet the minimum bar?

This check runs on every call, inline, before the response is returned. It is synchronous and typically completes in milliseconds, so the added latency is negligible. Gate 1 answers a simple question: is the output structurally correct for this task type?

code_generation: does the output parse as valid code?
outreach_generation: does it have a subject line and a body?
research: is it at least 200 characters with 3 or more distinct sentences?
summarization: is the output at least 50 characters with 3 or more complete sentences?
Everything else: is there any output at all, with no error text?
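The checks listed above can be approximated in a few lines of plain Python. This is a minimal sketch, not Kalibr's implementation: the function names, the sentence-splitting regex, the "Subject:" convention, and the error markers are all assumptions, and the code check only handles Python source.

```python
import ast
import re

# Hypothetical helpers illustrating the Gate 1 rules; not part of the kalibr package.


def count_sentences(text: str) -> int:
    """Count distinct, non-empty sentences, splitting on '.', '!' or '?'."""
    parts = re.split(r"[.!?]+", text)
    return len({p.strip().lower() for p in parts if p.strip()})


def passes_gate1(goal: str, output: str) -> bool:
    """Return True if the output is structurally valid for the given goal."""
    text = output.strip()
    if goal == "code_generation":
        # Assumption: the generated code is Python, so ast.parse is the parser.
        if not text:
            return False
        try:
            ast.parse(text)
            return True
        except (SyntaxError, ValueError):
            return False
    if goal == "outreach_generation":
        # Assumption: a first line starting with "Subject:" followed by a body.
        first, _, body = text.partition("\n")
        return first.lower().startswith("subject:") and bool(body.strip())
    if goal == "research":
        return len(text) >= 200 and count_sentences(text) >= 3
    if goal == "summarization":
        return len(text) >= 50 and count_sentences(text) >= 3
    # Everything else: any output at all, with no error text (markers assumed).
    lowered = text.lower()
    return bool(text) and not any(m in lowered for m in ("error:", "traceback"))
```

A function with this shape is also the kind of predicate you could pass as success_when if the built-in rules do not match your task.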

If the output passes, Kalibr records a success and returns the result. If it fails, Kalibr records the failure, swaps to the next model in your path list, and retries the call. Your code receives the output from the retry that passed. The failure is never surfaced to the user.
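The retry behavior described above amounts to walking the path list until one output passes. A minimal sketch, assuming a hypothetical call_model(model, prompt) callable and a check(output) predicate; neither is a Kalibr API, and what happens when every path fails is an assumption here (the sketch returns None).

```python
from typing import Callable, List, Optional, Sequence, Tuple


def complete_with_fallback(
    paths: Sequence[str],
    prompt: str,
    call_model: Callable[[str, str], str],  # hypothetical: (model, prompt) -> output
    check: Callable[[str], bool],           # hypothetical structural check
) -> Tuple[Optional[str], List[Tuple[str, bool]]]:
    """Try each model in order; return the first passing output and the outcome log."""
    outcomes: List[Tuple[str, bool]] = []  # (model, passed) in the order attempted
    for model in paths:
        output = call_model(model, prompt)
        passed = check(output)
        outcomes.append((model, passed))
        if passed:
            return output, outcomes
    # Assumption: if every path fails the check, the caller decides what to do.
    return None, outcomes
```

The outcome log stands in for the success and failure records that the text says are reported back to the routing engine.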

If you use success_when on the Router, your lambda runs in place of the built-in check. If you skip both, Kalibr uses heuristics and auto-reports.

python

from kalibr import Router

router = Router(
    goal="code_generation",
    paths=["gpt-4o", "claude-sonnet-4", "deepseek-coder"],
)

result = router.completion(messages=[...])
# Gate 1 runs automatically. If output fails structural check,
# Kalibr swaps to the next path and retries before returning.

Gate 2: Was the output actually good?

This check runs in the background, after the response is already returned. It never blocks your agent. Gate 2 asks a harder question: not just whether the output is structurally valid, but whether it is high quality.

A lightweight LLM judge scores the output from 0.0 to 1.0. Gate 2 only runs on research and outreach_generation tasks, on roughly 20% of calls. The score feeds back into routing as a continuous signal, separate from the pass/fail from Gate 1. A model that consistently scores 0.85 will be routed to more often than one that scores 0.6, even if both technically pass.
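The sampling and scoring described above can be pictured with a small sketch. The goal set and the 20% rate come from the text; the running-mean aggregation, the class name, and the function names are assumptions made for illustration, not Kalibr internals.

```python
import random

JUDGED_GOALS = {"research", "outreach_generation"}  # goals named in the text
SAMPLE_RATE = 0.2                                    # "roughly 20% of calls"


def should_judge(goal: str, rng: random.Random) -> bool:
    """Decide whether this call is sent to the background judge."""
    return goal in JUDGED_GOALS and rng.random() < SAMPLE_RATE


class QualityTracker:
    """Running mean of judge scores per model (an assumed aggregation)."""

    def __init__(self) -> None:
        self._total = {}
        self._count = {}

    def record(self, model: str, score: float) -> None:
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must be between 0.0 and 1.0")
        self._total[model] = self._total.get(model, 0.0) + score
        self._count[model] = self._count.get(model, 0) + 1

    def mean(self, model: str) -> float:
        count = self._count.get(model, 0)
        return self._total[model] / count if count else 0.0

    def ranked(self) -> list:
        """Models ordered from highest to lowest mean score."""
        return sorted(self._count, key=self.mean, reverse=True)
```

Under this sketch, a model averaging 0.85 ranks ahead of one averaging 0.6 even though both pass Gate 1, which is the behavior the paragraph describes.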

If you set score_when on the Router, your scoring function runs instead of the async judge.

Gate 2 uses your existing provider API keys. No extra services required. The judge model defaults to DeepSeek for cost, and is configurable.

python

from kalibr import Router

router = Router(
    goal="research",
    paths=["claude-sonnet-4", "gpt-4o", "llama-3.3-70b"],
    score_when=lambda output: 0.9 if len(output) > 500 else 0.5,
    # Custom scoring overrides the default async judge for this router instance
)

Gate 3: What did the user do with it?

Gates 1 and 2 evaluate the output in isolation. Gate 3 evaluates what happened after. When a user reprompts, edits, copies, or discards an output, that behavior is the most direct signal of whether the model actually worked. Gate 3 captures it.

You wire it in by calling four lightweight functions at the right moments in your app:

report_pipeline(). Call this when a pipeline run finishes. It anchors the session so subsequent signals can be attributed back to the right model and goal.

report_user_turn(). Call this each time the user sends a follow-up message. Kalibr tracks whether the conversation is converging (the user is getting what they want) or diverging (the user keeps reprompting, with messages getting longer and more frustrated). Kalibr calls this trend momentum.

report_session_end(). Call this when the session closes. If the conversation was converging, Kalibr emits a weak positive signal. If it was diverging, a weak negative. If there is not enough data to tell, it emits nothing.

report_action(). The strongest signal. Call this when the user does something concrete with the output: copies it, edits it, uses it verbatim, or discards it. This tells Kalibr exactly how the output landed.
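To make the momentum idea concrete, here is a minimal sketch that classifies a session from the lengths of the user's follow-up messages and turns that into the weak session-end signal. The length-based heuristic, the three-turn threshold, and the 0.25 magnitude are all assumptions; the source does not say how Kalibr actually computes momentum.

```python
from typing import List, Optional

MIN_TURNS = 3        # assumed threshold for "enough data to tell"
WEAK_REWARD = 0.25   # assumed magnitude of a weak signal


def momentum(turn_lengths: List[int]) -> Optional[str]:
    """Classify a session from the lengths of the user's follow-up messages."""
    if len(turn_lengths) < MIN_TURNS:
        return None  # not enough data to tell
    # Compare the average length of the later turns with the earlier ones.
    mid = len(turn_lengths) // 2
    early = sum(turn_lengths[:mid]) / mid
    late = sum(turn_lengths[mid:]) / (len(turn_lengths) - mid)
    if late < early:
        return "converging"  # follow-ups are getting shorter
    if late > early:
        return "diverging"   # follow-ups are getting longer
    return None


def session_end_signal(turn_lengths: List[int]) -> Optional[float]:
    """Weak positive if converging, weak negative if diverging, else nothing."""
    state = momentum(turn_lengths)
    if state == "converging":
        return WEAK_REWARD
    if state == "diverging":
        return -WEAK_REWARD
    return None
```

Returning None rather than 0.0 mirrors the rule that a session with too little data emits nothing at all.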

All Gate 3 signals blend into routing at 20% weight. At least 5 signals are required before they start influencing decisions, so early noise does not pollute the router.

Gates 1 and 2 always dominate. Gate 3 is an additive signal, not a replacement. It nudges routing over time based on real user behavior.
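The 20% weight and the five-signal minimum can be illustrated with a short worked example. The constants come from the text; the weighted-average formula, the function name, and the assumption that signals are normalized to the range 0.0 to 1.0 are illustrative guesses at how such a blend might look.

```python
GATE3_WEIGHT = 0.2   # "20% weight" from the text
MIN_SIGNALS = 5      # "at least 5 signals" from the text


def blended_score(base_score: float, behavior_signals: list) -> float:
    """Blend a Gate 1/2 score with the mean Gate 3 signal once enough signals exist."""
    if len(behavior_signals) < MIN_SIGNALS:
        # Too few signals: behavioral data is ignored entirely.
        return base_score
    behavior = sum(behavior_signals) / len(behavior_signals)
    return (1 - GATE3_WEIGHT) * base_score + GATE3_WEIGHT * behavior
```

With a base score of 0.8, four positive signals leave the score unchanged, while five positive signals of 1.0 move it to 0.8 × 0.8 + 0.2 × 1.0 = 0.84. The structural and quality signals still carry 80% of the result.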

python

from kalibr.feedback import report_pipeline, report_user_turn, report_session_end, report_action

session_id = "session-abc-123"

# When the pipeline run finishes (the output is already available)
report_pipeline(session_id, goal="outreach_generation", prompt=user_prompt, output=result, model="claude-sonnet-4")

# On each user follow-up
report_user_turn(session_id, user_message="make it shorter, too formal")

# When session ends
report_session_end(session_id)  # fires weak signal based on conversation momentum

# If user does something with the output (highest quality signal)
report_action(session_id, "output_edited")  # or: output_used_verbatim, output_copied, output_discarded

Gate 3 signals are advisory. They blend at 20% weight and require a minimum of 5 signals before influencing routing. Clean structural and quality signals always dominate.

How they work together

The three gates run at different speeds, but they all feed the same place: the routing engine that decides which model to call next.

Gate 1 runs in milliseconds, inline on every call.
Gate 2 runs in seconds, async in the background.
Gate 3 runs over the lifetime of a session, from seconds to hours.

Together they give Kalibr three separate windows into whether the model is working: the output itself, the quality of that output, and what the user did with it. Each routing decision for a given goal type gets better as more of these signals accumulate.

flow

agent call
  → [Gate 1] structural check
      pass → record success, return result
      fail → swap model, retry
  → [Gate 2] quality judge (async, 20%)
      score → update path quality signal
  → [Gate 3] behavioral signal (session lifetime)
      user turns → momentum tracking
      session end → infer reward from trajectory
      user action → direct outcome signal
  → routing engine updates
  → next call routes better

Customizing evaluation

All three gates are configurable per Router instance. The defaults work out of the box for most use cases.

python

from kalibr import Router

# Custom structural check
router = Router(
    goal="code_generation",
    paths=["gpt-4o", "deepseek-coder"],
    success_when=lambda output: "def " in output or "class " in output,
)

# Continuous quality scoring (overrides Gate 2 async judge)
router = Router(
    goal="research",
    paths=["claude-sonnet-4", "gpt-4o"],
    score_when=lambda output: min(1.0, len(output.split()) / 300),
)

# Gate 2 stays off for this router without any extra configuration
router = Router(
    goal="summarization",
    paths=["llama-3.3-70b", "gpt-4o-mini"],
    # By default the async judge only fires for research and outreach_generation.
    # For other goal types, such as summarization, it does not run automatically.
)

Next

How Routing Works. Statistical methods, exploration vs exploitation
API Reference. Full Router API
Production Guide. Error handling, monitoring, debugging
