# Resilience Benchmark
When your best execution path degrades, Kalibr routes around it automatically. Hardcoded systems keep failing until a human intervenes.
## The Question
What happens when an agent's execution path starts failing in production?
- Hardcoded systems continue sending traffic to the same path until an engineer notices, diagnoses the issue, and deploys a fix.
- Kalibr continuously observes outcomes and shifts traffic toward paths that are succeeding.
This benchmark compares those two behaviors under identical conditions.
## What Makes This Different
This is execution path routing, not model routing.
Each path is a complete execution strategy combining model and tool:
| Path ID | Model | Tool | Description |
|---|---|---|---|
| `gpt4o-serper` | gpt-4o | Serper | Primary path (hardcoded baseline) |
| `gpt4o-tavily` | gpt-4o | Tavily | Backup tool |
| `gpt4o-mini-tavily` | gpt-4o-mini | Tavily | Cost-optimized backup |
The paths differ by tool, not just model. This reflects how real agents are built.
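For illustration only (this is not Kalibr's published API), a path can be modeled as a small (model, tool) record; the `ExecutionPath` type and `PATHS` list below are hypothetical names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionPath:
    """One complete execution strategy: a model paired with a search tool."""
    path_id: str
    model: str
    tool: str

# The three paths used in this benchmark.
PATHS = [
    ExecutionPath("gpt4o-serper", model="gpt-4o", tool="serper"),
    ExecutionPath("gpt4o-tavily", model="gpt-4o", tool="tavily"),
    ExecutionPath("gpt4o-mini-tavily", model="gpt-4o-mini", tool="tavily"),
]
```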
## The Agent
A realistic multi-step research agent:
- Plan → Generate search queries (LLM call)
- Search → Call external API (Serper or Tavily)
- Extract → Pull facts with source references (LLM call)
- Synthesize → Write answer with citations (LLM call)
- Validate → Verify citations reference valid sources
A task succeeds only if all steps complete and validation passes.
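A minimal sketch of that pipeline, with hypothetical stub functions standing in for the real LLM and search-API calls:

```python
# Sketch of the five-step research pipeline. The helper functions are
# hypothetical stubs; the real benchmark makes actual LLM and API calls.

def plan_queries(question: str, model: str) -> list[str]:
    return [question]  # stub: a real agent asks the LLM to draft queries

def search(queries: list[str], tool: str) -> list[dict]:
    return [{"url": "https://example.com", "text": "..."}]  # stub: Serper/Tavily call

def extract_facts(results: list[dict], model: str) -> list[dict]:
    return [{"fact": "...", "source": r["url"]} for r in results]  # stub: LLM extraction

def synthesize_answer(question: str, facts: list[dict], model: str) -> dict:
    return {"text": "...", "citations": [f["source"] for f in facts]}  # stub: LLM synthesis

def validate_citations(answer: dict, sources: list[dict]) -> bool:
    valid = {s["url"] for s in sources}
    return all(c in valid for c in answer["citations"])

def run_task(question: str, model: str, tool: str) -> bool:
    """A task succeeds only if every step completes and validation passes."""
    try:
        queries = plan_queries(question, model)             # Plan
        results = search(queries, tool)                     # Search
        facts = extract_facts(results, model)               # Extract
        answer = synthesize_answer(question, facts, model)  # Synthesize
        return validate_citations(answer, results)          # Validate
    except Exception:
        return False  # any failed step fails the whole task
```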
## Experimental Conditions

### Hardcoded Baseline

- Always uses `gpt4o-serper`
- No fallback logic
- No adaptive routing
- Represents a typical production configuration
### Kalibr
- Chooses execution paths based on observed outcomes
- Explores alternatives during normal operation
- Shifts traffic away from degraded paths automatically
- No explicit failure signals or special-case logic
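Kalibr's actual routing policy is not detailed here. As a minimal illustration of the underlying idea, outcome-based routing, an epsilon-greedy selector over per-path success rates behaves the same way: exploit what is working, keep exploring a little so recovered paths are rediscovered. The class below is a toy, not Kalibr's implementation:

```python
import random

class OutcomeRouter:
    """Toy epsilon-greedy router: NOT Kalibr's real policy, just an
    illustration of routing on observed outcomes."""

    def __init__(self, path_ids: list[str], explore_rate: float = 0.1):
        self.stats = {p: {"wins": 0, "tries": 0} for p in path_ids}
        self.explore_rate = explore_rate

    def choose(self) -> str:
        # Occasionally explore so a recovered path can be rediscovered.
        if random.random() < self.explore_rate:
            return random.choice(list(self.stats))

        # Otherwise exploit the path with the best observed success rate,
        # treating untried paths optimistically.
        def score(path_id: str) -> float:
            s = self.stats[path_id]
            return s["wins"] / s["tries"] if s["tries"] else 1.0

        return max(self.stats, key=score)

    def record(self, path_id: str, success: bool) -> None:
        self.stats[path_id]["tries"] += 1
        self.stats[path_id]["wins"] += int(success)
```

Once `gpt4o-serper`'s observed success rate collapses, `choose()` stops selecting it almost immediately; the small explore rate is what lets traffic return if the path later recovers.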
### Phases
| Phase | Tasks | Description |
|---|---|---|
| Learning | 15 | Normal operation. No failures injected. |
| Degraded | 25 | Serper fails 70% of requests. Tavily unaffected. |
| Recovery | 10 | Degradation continues. Measure steady state. |
### Failure Injection
At task 16, a 70% failure rate is injected only on Serper. Tavily remains healthy.
This simulates real-world degradation:
- API rate limits
- Provider outages
- Service degradation
The failure is scoped to one path. Kalibr can route around it; the hardcoded baseline cannot.
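A minimal sketch of how such an injection can be wired into the search step (the constants and function name below are hypothetical, not the benchmark's exact code):

```python
import random

FAILURE_START_TASK = 16   # degradation begins after the 15-task learning phase
FAILURE_RATE = 0.70       # 70% of Serper requests fail

def maybe_inject_failure(tool: str, task_index: int) -> None:
    """Simulate a degraded provider: Serper calls fail 70% of the time
    from task 16 onward, mimicking rate limiting. Tavily is untouched."""
    if tool == "serper" and task_index >= FAILURE_START_TASK:
        if random.random() < FAILURE_RATE:
            raise RuntimeError("simulated 429: rate limited")
```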
## Results

### Run 1
| Phase | Hardcoded | Kalibr | Delta |
|---|---|---|---|
| Learning | 100.0% | 100.0% | +0.0% |
| Degraded | 36.0% | 92.0% | +56.0% |
| Recovery | 30.0% | 100.0% | +70.0% |
| Overall | 54.0% | 96.0% | +42.0% |
### Run 2
| Phase | Hardcoded | Kalibr | Delta |
|---|---|---|---|
| Learning | 100.0% | 100.0% | +0.0% |
| Degraded | 16.0% | 88.0% | +72.0% |
| Recovery | 20.0% | 100.0% | +80.0% |
| Overall | 42.0% | 94.0% | +52.0% |
### Run 3
| Phase | Hardcoded | Kalibr | Delta |
|---|---|---|---|
| Learning | 93.3% | 100.0% | +6.7% |
| Degraded | 24.0% | 88.0% | +64.0% |
| Recovery | 30.0% | 100.0% | +70.0% |
| Overall | 46.0% | 94.0% | +48.0% |
Results are consistent across runs.
### Path Distribution (Run 3)
| Path | Tasks | Success Rate |
|---|---|---|
| `gpt4o-serper` | 5 | 40.0% |
| `gpt4o-tavily` | 28 | 100.0% |
| `gpt4o-mini-tavily` | 17 | 100.0% |
Kalibr learned that Serper was failing and shifted traffic to Tavily paths.
## What This Demonstrates
**During normal operation:** both systems perform identically. Kalibr adds no overhead when nothing is wrong.

**During degradation:**

**Hardcoded system:**
- Kept routing 100% of traffic to the broken path
- Success rate dropped to ~16-36%
- Would require human intervention to recover

**Kalibr:**
- Observed failures and shifted traffic automatically
- Maintained ~88-92% success rate
- No code change, no human intervention
This is not an optimization. It is a behavioral difference that hardcoded systems cannot exhibit.
## What This Does Not Claim
This benchmark does not demonstrate:
- Superior reasoning or intelligence
- Universal optimality across all tasks
- Guaranteed reliability in all environments
Kalibr is a control system. It routes execution based on what is actually working.
## Limitations
- Single task type (research agent)
- Three execution paths
- Synthetic failure injection (rate limiting simulation)
Results should not be extrapolated to all workloads. The purpose is to validate adaptive execution path routing under degradation.
## Run It Yourself

```bash
pip install kalibr openai httpx

export KALIBR_API_KEY=your-key
export KALIBR_TENANT_ID=your-tenant
export OPENAI_API_KEY=your-key
export SERPER_API_KEY=your-key
export TAVILY_API_KEY=your-key

python resilience_benchmark.py
```
Options:

```bash
python resilience_benchmark.py --quick   # ~25 tasks, ~3 min
python resilience_benchmark.py           # ~50 tasks, ~5 min
python resilience_benchmark.py --full    # ~100 tasks, ~10 min
```
Requirements: ~$0.30 in API usage (standard run), Python 3.10+
## Summary

| Metric | Hardcoded | Kalibr |
|---|---|---|
| Success during degradation | ~16-36% | ~88-92% |
| Human intervention required | Yes | No |
| Code changes required | Yes | No |
When execution paths degrade, hardcoded systems fail until humans intervene. Kalibr adapts automatically.
## Source Code
The complete benchmark is open source: [github.com/kalibr-ai/kalibr-benchmark](https://github.com/kalibr-ai/kalibr-benchmark)