FAQ
Short answers to common questions.
"Why not just hardcode the best model?"
Benchmarks don't match production. GPT-4o might win on MMLU but fail on your specific task. Kalibr learns what works for your goals, not average performance across public benchmarks.
"How is this different from LangSmith/Langfuse?"
LangSmith and Langfuse are observability tools. They show you what happened. Kalibr acts on what happened: it changes which model handles the next request. Observability without action is just dashboards.
"What if my success criteria change?"
Create a new goal. Goals are namespaced. extract_company_v1 and extract_company_v2 have separate routing. Old outcomes don't contaminate new criteria.
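For example, a minimal sketch using the same Router constructor shown throughout this page:
from kalibr import Router

# Each goal learns independently: outcomes reported for v1 never influence v2 routing.
router_v1 = Router(goal="extract_company_v1", paths=["gpt-4o", "claude-sonnet-4-20250514"])
router_v2 = Router(goal="extract_company_v2", paths=["gpt-4o", "claude-sonnet-4-20250514"])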
"Does this add latency?"
One HTTP call to decide() before each completion. Typically 10-50ms. If the intelligence service is slow or down, Kalibr falls back to your first path immediately.
"What happens if Kalibr is down?"
Your agent keeps running. Router falls back to the first path in your list. Outcomes aren't recorded until service recovers, but your users don't see errors.
"How long until routing is stable?"
After ~20-50 outcomes per path, Kalibr has enough data to exploit confidently. Before that, expect more exploration.
"Can I use this with LangChain?"
Yes. Install the extra with pip install kalibr[langchain], then call router.as_langchain() to get a LangChain-compatible chat model:
from kalibr import Router
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

router = Router(goal="summarize", paths=["gpt-4o", "claude-sonnet-4-20250514"])
llm = router.as_langchain()  # a LangChain-compatible chat model

prompt = ChatPromptTemplate.from_template("Summarize this text:\n\n{text}")
chain = prompt | llm | StrOutputParser()
LangChain integration is Python-only. For TypeScript, use the Router directly or auto-instrumentation:
import { Router } from '@kalibr/sdk';
const router = new Router({
  goal: 'summarize',
  paths: ['gpt-4o', 'claude-sonnet-4-20250514'],
});
// Use router.completion() directly
const response = await router.completion(messages);
await router.report(true);
"Does Kalibr see my prompts or responses?"
No. Kalibr sees: which model was called, token counts, cost, latency, and success/failure. Your actual prompts and responses go directly to the LLM provider.
"When should I use success_when vs manual report()?"
Use success_when for simple output validation:
- Output length, contains "@", starts with "{"
router = Router(
goal="extract",
paths=["gpt-4o"],
success_when=lambda output: len(output) > 0
)
Use manual report() for complex validation:
- Parsing JSON and validating structure
- API calls to check results
- Multi-step workflows
import json

# Parse and validate the output before reporting the outcome
result = json.loads(response.choices[0].message.content)
is_valid = validate_schema(result)
router.report(success=is_valid, reason=None if is_valid else "invalid_schema")
In TypeScript, use successWhen for the same simple checks:
const router = new Router({
  goal: 'extract',
  paths: ['gpt-4o'],
  successWhen: (output) => output.length > 0,
});
And manual report() for the same complex cases:
const result = JSON.parse(response.choices[0].message.content);
const isValid = validateSchema(result);
await router.report(isValid, isValid ? undefined : 'invalid_schema');
"How do I handle multi-turn conversations?"
Use force_model to keep the same model across turns:
response1 = router.completion(messages=[...])
model = response1.model
response2 = router.completion(messages=[...], force_model=model)
router.report(success=issue_resolved)
In TypeScript, use forceModel to keep the same model across turns:
const response1 = await router.completion(messages);
const model = response1.model;
const response2 = await router.completion(messages, { forceModel: model });
await router.report(issueResolved);
"Can I use Router in async/concurrent code?"
Router is not thread-safe. Create separate Router instances per thread/task.
# Each thread gets its own router
def worker():
router = Router(goal="extract", paths=[...])
router.completion(...)
router.report(success=True)
The same applies in TypeScript: in serverless or edge functions, create a new Router per request; in long-running Node.js apps, create separate instances per async context.
// Each request handler gets its own router
app.post('/api/extract', async (req, res) => {
  const router = new Router({ goal: 'extract', paths: [...] });
  const response = await router.completion(req.body.messages);
  await router.report(true);
  res.json(response);
});
"What if I change my success criteria?"
Create a new goal with a version suffix; the old goal's outcomes won't affect the new one:
router = Router(goal="extract_company_v2", paths=[...])
const router = new Router({ goal: 'extract_company_v2', paths: [...] });
"How do I route between different temperatures?"
router = Router(
goal="creative_writing",
paths=[
{"model": "gpt-4o", "params": {"temperature": 0.3}},
{"model": "gpt-4o", "params": {"temperature": 0.9}}
]
)
const router = new Router({
  goal: 'creative_writing',
  paths: [
    { model: 'gpt-4o', params: { temperature: 0.3 } },
    { model: 'gpt-4o', params: { temperature: 0.9 } },
  ],
});
"Is this just A/B testing?"
It's similar in spirit but different in execution. Traditional A/B testing:
- Runs for a fixed period
- Requires manual analysis
- Needs a deploy to change allocation
- Tests one thing at a time
Kalibr:
- Runs continuously
- Adapts automatically based on outcomes
- No deploys needed to shift traffic
- Tests multiple paths simultaneously (model × tool × params)
Think of it as A/B testing that never ends and deploys itself.
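For example, one goal can explore models and parameters at the same time; a Python sketch using the dict-path form from the temperature question above (the goal name here is made up):
from kalibr import Router

# One goal, four candidate paths: two models x two temperature settings.
# Kalibr spreads traffic across all of them and shifts toward the combinations
# you report as succeeding.
router = Router(
    goal="support_reply",
    paths=[
        {"model": "gpt-4o", "params": {"temperature": 0.2}},
        {"model": "gpt-4o", "params": {"temperature": 0.8}},
        {"model": "claude-sonnet-4-20250514", "params": {"temperature": 0.2}},
        {"model": "claude-sonnet-4-20250514", "params": {"temperature": 0.8}},
    ],
)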
"How is this different from OpenRouter or Portkey?"
They optimize for metrics they can measure: cost, latency, uptime. Kalibr optimizes for your definition of success.
The litmus test: If a model starts returning syntactically valid but semantically wrong answers for three days, will the router notice?
- OpenRouter: No. It only sees HTTP 200s and latency.
- Portkey: No. It routes by rules you configure, not outcomes.
- Kalibr: Yes. You report failures, Kalibr learns, traffic shifts away.
Other routers: "This model is cheapest and fastest"
Kalibr: "This model actually works for your task"
"How is this different from LangSmith?"
LangSmith is observability: it shows you what happened. You look at dashboards, notice problems, then manually change your code.
Kalibr is autonomous optimization: it changes what happens next without human intervention.
LangSmith: "Here's a dashboard showing gpt-4o failed 20% of the time yesterday. You should probably do something about that."
Kalibr: "gpt-4o started failing more. Traffic automatically shifted to Claude. You didn't have to do anything."
Observability tells you there's a problem. Kalibr fixes it.
"What if Kalibr goes down?"
The SDK falls back to the first path in your list. Your application keeps working; it just loses the optimization benefits until Kalibr recovers.
We designed for this explicitly. Kalibr should never be a single point of failure for your agent.
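In code terms nothing changes on your side. A sketch of that fallback behavior (message shape assumed, as in the other examples):
from kalibr import Router

router = Router(goal="extract", paths=["gpt-4o", "claude-sonnet-4-20250514"])

# If the intelligence service is unreachable, the decide() step is skipped and
# the first path ("gpt-4o" here) handles the request, so this call still returns.
response = router.completion(messages=[{"role": "user", "content": "Extract the company name from: ..."}])

# Outcomes reported during the outage aren't recorded, but the call doesn't
# raise, so your users never see the failure.
router.report(success=True)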
"What's the minimum traffic needed?"
Kalibr needs enough outcomes to learn. Rough guidelines:
- < 10 outcomes/day per goal: Too low; routing will be mostly random
- 10-50 outcomes/day: Learning happens, but slowly
- 50+ outcomes/day: Meaningful optimization
If you have very low traffic for a goal, consider using get_policy() with a longer time window, or just hardcode the path.
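For a very low-traffic goal, those two options look roughly like this. A sketch: rare_task is a made-up goal name, and the shape of get_policy()'s return value is an assumption for illustration.
from kalibr import Router

router = Router(goal="rare_task", paths=["gpt-4o", "claude-sonnet-4-20250514"])

# Option 1: inspect the learned policy yourself on a slower cadence instead of
# expecting per-request routing to converge quickly.
policy = router.get_policy()
print(policy)  # per-path stats; exact shape assumed

# Option 2: skip routing entirely and pin a single known-good path.
pinned = Router(goal="rare_task", paths=["gpt-4o"])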
"Does Kalibr see my prompts or responses?"
No. Kalibr tracks:
- Which path was used (model, tool, params)
- Whether it succeeded or failed (what you report)
- Cost and latency (from traces)
Kalibr does not track:
- Prompt content
- Response content
- User data
Your prompts and completions go directly to providers. Kalibr only sees metadata.
More questions?
Email support@kalibr.systems or check the dashboard.