Production Guide
Running Kalibr in production.
When Kalibr helps most
- Multiple models: You have 2+ models that could work, and you don't know which is best
- Flaky providers: Model performance varies over time (degradations, silent updates)
- Long-running agents: Enough volume to generate outcome data
Kalibr needs outcomes to learn. If you have <10 calls/day for a goal, learning will be slow.
Exploration vs stability
What exploration means
By default, Kalibr explores 10% of the time. This means 1 in 10 calls goes to a non-optimal path to gather data.
How to control it
router = Router(
    goal="extract_company",
    paths=["gpt-4o", "claude-sonnet-4-20250514"],
    exploration_rate=0.05,  # 5% exploration
)
const router = new Router({
  goal: 'extract_company',
  paths: ['gpt-4o', 'claude-sonnet-4-20250514'],
  explorationRate: 0.05, // 5% exploration
});
When to turn it down
- High-stakes tasks where failures are costly
- After you've collected enough data and want pure exploitation
Default values
- Exploration rate: 0.1 (10%)
- Min samples before exploit: 20 outcomes
- Staleness threshold: 7 days (paths not tried in 7 days get re-explored)
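If the defaults don't fit your workload, you can override them when constructing the Router. A minimal sketch, assuming min_samples_before_exploit (referenced later in this guide) is accepted as a constructor parameter alongside exploration_rate:

router = Router(
    goal="extract_company",
    paths=["gpt-4o", "claude-sonnet-4-20250514"],
    exploration_rate=0.05,           # explore 5% of calls instead of the default 10%
    min_samples_before_exploit=30,   # assumed constructor parameter; default is 20 outcomes
)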
Failure modes
No outcomes reported
- Symptom: Routing stays random
- Cause: report() never called
- Fix: Add report(success=...) after every completion (see the sketch below)
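A minimal sketch of the reporting pattern; check_output is a hypothetical stand-in for whatever success criterion your application uses:

response = router.completion(messages=[{"role": "user", "content": "Extract the company name"}])
success = check_output(response)  # hypothetical helper: your own success check
router.report(success=success)    # without this call, Kalibr never learns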
Low traffic
- Symptom: Routing changes slowly or not at all
- Cause: Need ~20-50 outcomes before Kalibr exploits confidently
- Fix: Lower min_samples_before_exploit if you're confident in your data
Cold start behavior
- First calls explore randomly
- No preference until outcomes are reported
- First path in list is fallback if intelligence service is unavailable
Cost & latency
The trust invariant
Success rate ALWAYS dominates. Cost/latency only break ties among paths within 5% of best success rate.
If GPT-4o has 95% success and GPT-4o-mini has 85% success, Kalibr routes to GPT-4o regardless of cost. Cost only matters when success rates are within 5% of each other.
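Conceptually, the selection rule looks like the sketch below. This is illustrative only, not Kalibr's actual implementation; the stats shape is hypothetical.

# Illustrative sketch of the trust invariant, not the real routing code.
def pick_path(stats):
    # stats: {path_name: {"success_rate": float, "cost": float}}  (hypothetical shape)
    best = max(s["success_rate"] for s in stats.values())
    # Only paths within 5% of the best success rate are eligible for cost tie-breaking.
    contenders = {p: s for p, s in stats.items() if s["success_rate"] >= best - 0.05}
    return min(contenders, key=lambda p: contenders[p]["cost"])

# 95% vs 85%: the gap exceeds 5 points, so the cheaper path never wins on cost alone.
pick_path({
    "gpt-4o": {"success_rate": 0.95, "cost": 1.0},
    "gpt-4o-mini": {"success_rate": 0.85, "cost": 0.1},
})  # -> "gpt-4o"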
Turning Kalibr off
What happens if you remove it
If you remove Kalibr imports, you must replace Router calls with direct SDK calls.
How to fall back safely
Use force_model/forceModel to bypass routing:
response = router.completion(
    messages=[...],
    force_model="gpt-4o",  # Always use gpt-4o, ignore routing
)
const response = await router.completion(messages, {
  forceModel: 'gpt-4o', // Always use gpt-4o, ignore routing
});
Or replace Router with direct SDK calls:
# From this:
response = router.completion(messages=[...])
# To this:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(model="gpt-4o", messages=[...])
// From this:
const response = await router.completion(messages);
// To this:
import OpenAI from 'openai';
const client = new OpenAI();
const response = await client.chat.completions.create({
  model: 'gpt-4o',
  messages,
});
Error Handling Patterns
Provider errors vs Intelligence service errors
Kalibr handles two types of errors differently:
Provider errors (OpenAI, Anthropic, Google)
- Router re-raises the exception to your code
- You must handle these with try/except or try/catch
- Kalibr auto-reports as failure before raising
import time

try:
    response = router.completion(messages=[...])
    router.report(success=True)
except Exception as e:
    # Already auto-reported as failure
    if "RateLimitError" in str(type(e)):
        time.sleep(60)  # back off before retrying
    else:
        log_error(e)  # your own error logging
try {
  const response = await router.completion(messages);
  await router.report(true);
} catch (error) {
  // Already auto-reported as failure
  if (error instanceof Error && error.message.includes('RateLimitError')) {
    await new Promise((r) => setTimeout(r, 60000)); // back off before retrying
  } else {
    console.error(error);
  }
}
Intelligence service errors
- Router falls back to first path automatically
- Your code keeps running
- Logged as warning
Implication: If the intelligence service is down, your agent uses the first path until it recovers.
Multi-turn Conversations
For chat agents with multiple turns:
router = Router(
    goal="customer_support",
    paths=["gpt-4o", "claude-sonnet-4-20250514"]
)

conversation = [{"role": "user", "content": "I need help"}]

# Turn 1 - router decides model
response1 = router.completion(messages=conversation)
selected_model = response1.model
conversation.append({
    "role": "assistant",
    "content": response1.choices[0].message.content
})

# Turn 2 - force same model
conversation.append({"role": "user", "content": "That didn't work"})
response2 = router.completion(
    messages=conversation,
    force_model=selected_model
)

# Report once at end
router.report(success=issue_resolved)
const router = new Router({
  goal: 'customer_support',
  paths: ['gpt-4o', 'claude-sonnet-4-20250514'],
});

const conversation: Message[] = [{ role: 'user', content: 'I need help' }];

// Turn 1 - router decides model
const response1 = await router.completion(conversation);
const selectedModel = response1.model;
conversation.push({
  role: 'assistant',
  content: response1.choices[0].message.content,
});

// Turn 2 - force same model
conversation.push({ role: 'user', content: "That didn't work" });
const response2 = await router.completion(conversation, {
  forceModel: selectedModel,
});

// Report once at end
await router.report(issueResolved);
Key principles:
- Use force_model/forceModel to keep the same model across turns
- Report once at the end, not after each turn
- Build conversation history correctly
Thread Safety
Router is not thread-safe. Create one Router instance per thread or async context.
Wrong (race condition):
router = Router(goal="extract", paths=[...])
# Two threads using same router
thread1: router.completion(...) # Sets trace_id=ABC
thread2: router.completion(...) # Overwrites trace_id=XYZ
thread1: router.report(success=True) # Reports for XYZ (WRONG!)
Right:
# Thread 1
router1 = Router(goal="extract", paths=[...])
router1.completion(...)
router1.report(success=True)
# Thread 2
router2 = Router(goal="extract", paths=[...])
router2.completion(...)
router2.report(success=True)
Wrong (race condition):
const router = new Router({ goal: 'extract', paths: [...] });
// Two async contexts using same router
context1: await router.completion(...) // Sets trace_id=ABC
context2: await router.completion(...) // Overwrites trace_id=XYZ
context1: await router.report(true) // Reports for XYZ (WRONG!)
Right:
// Request handler 1
const router1 = new Router({ goal: 'extract', paths: [...] });
await router1.completion(...);
await router1.report(true);
// Request handler 2
const router2 = new Router({ goal: 'extract', paths: [...] });
await router2.completion(...);
await router2.report(true);
TypeScript note: In serverless/edge functions, create a new Router per request. In long-running Node.js apps, create separate instances per async context.
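The same per-request pattern in Python, sketched here with FastAPI purely as an example framework; any per-request or per-task scope works:

from fastapi import FastAPI  # example framework choice, not required by Kalibr

app = FastAPI()

@app.post("/extract")
def extract(payload: dict):
    # One Router per request, so trace state never leaks between concurrent requests.
    # Router is the same Kalibr class used throughout this guide.
    router = Router(goal="extract", paths=["gpt-4o", "claude-sonnet-4-20250514"])
    response = router.completion(messages=[{"role": "user", "content": payload["text"]}])
    router.report(success=True)  # replace with your real success check
    return {"output": response.choices[0].message.content}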
Troubleshooting Routing
If routing isn't improving, check these common issues:
1. No outcomes being reported
Check: Go to your dashboard. Are outcomes appearing for your goal?
Fix: Make sure you're calling router.report(success=...) after every completion.
2. Not enough data yet
Check: Do you have >20 outcomes per path for this goal?
Why: Kalibr needs ~20-50 outcomes per path per goal before routing becomes stable. Before that, expect more exploration.
3. Success criteria too noisy
Check: Are both models showing similar success rates (e.g., both at 60%)?
Why: If all paths perform similarly, routing will stay exploratory. This might mean your task is too hard for current models, or your success criteria needs refinement.
4. Low traffic
Check: Are you making at least 10-20 calls per day per goal?
Why: With low traffic, it takes longer to gather enough outcomes. Consider lowering exploration_rate if you need faster convergence.
Graceful Degradation
If Kalibr is unavailable (network error, service down), the SDK falls back to the first path in your list. Your application never crashes due to Kalibr being unreachable.
paths = ["gpt-4o", "claude-sonnet-4-20250514"]
# If Kalibr is down, gpt-4o is used automatically
const paths = ['gpt-4o', 'claude-sonnet-4-20250514'];
// If Kalibr is down, gpt-4o is used automatically
Best practice: Put your most reliable path first. This becomes your fallback.
Trend Monitoring
Check the dashboard for paths marked as "degrading". These are paths where recent performance is significantly worse than historical baseline.
Common causes:
- Provider model updates (silent changes to model behavior)
- Changes in your input distribution
- Upstream API issues or rate limiting
- Prompt changes that affect certain models differently
When you see a degrading path:
- Check provider status pages
- Review recent changes to your prompts or inputs
- Consider temporarily disabling the path if degradation is severe (see the sketch below)
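If you do need to pull a degrading path, the simplest option is to stop listing it when you construct the Router (assuming you have no separate mechanism for disabling paths):

# Before: the second path is flagged as degrading on the dashboard
# router = Router(goal="extract_company", paths=["gpt-4o", "claude-sonnet-4-20250514"])

# Temporary mitigation: route everything to the healthy path
router = Router(goal="extract_company", paths=["gpt-4o"])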
When to Use force_model
The force_model parameter bypasses routing:
response = router.completion(
    messages=[...],
    force_model="gpt-4o",
)
const response = await router.completion(messages, {
  forceModel: 'gpt-4o',
});
Use it for:
- Debugging specific model behavior
- Reproducing customer-reported issues
- Load testing a specific provider
- Temporary workarounds during incidents
Don't use it as your default - you lose the learning benefits and won't detect regressions.
Latency Overhead
Kalibr adds a routing decision before each completion. Typical overhead:
- Cold (first request): 50-100ms
- Warm (cached routing state): 10-30ms
For latency-critical paths, you can:
- Use get_policy() to cache recommendations (see the sketch below)
- Lower exploration rate to reduce variability
- Use force_model for paths where latency is critical
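A sketch of the caching idea. It assumes get_policy() is a Router method that returns the currently recommended path name; check the SDK reference for the actual signature and return shape.

import time

# Same Kalibr Router used throughout this guide.
router = Router(goal="extract_company", paths=["gpt-4o", "claude-sonnet-4-20250514"])

CACHE_TTL_SECONDS = 300
_cached_path = None
_cached_at = 0.0

def recommended_path():
    """Refresh the recommendation every few minutes instead of on every call."""
    global _cached_path, _cached_at
    if _cached_path is None or time.time() - _cached_at > CACHE_TTL_SECONDS:
        _cached_path = router.get_policy()  # assumed to return a path name like "gpt-4o"
        _cached_at = time.time()
    return _cached_path

# Latency-critical call: skip the per-call routing round-trip
response = router.completion(messages=[...], force_model=recommended_path())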
Next
- FAQ - Common questions