# Resilience Benchmark
When your best execution path degrades, Kalibr routes around it automatically. Hardcoded systems keep failing until a human intervenes.
## The Question
What happens when an agent's execution path starts failing in production?
- Hardcoded systems continue sending traffic to the same path until an engineer notices, diagnoses the issue, and deploys a fix.
- Kalibr continuously observes outcomes and shifts traffic toward paths that are succeeding.
This benchmark compares those two behaviors under identical conditions.
## What Makes This Different
This is execution path routing, not model routing.
Each path is a complete execution strategy combining model and tool:
| Path ID | Model | Tool | Description |
|---|---|---|---|
| `gpt4o-serper` | gpt-4o | Serper | Primary path (hardcoded baseline) |
| `gpt4o-tavily` | gpt-4o | Tavily | Backup tool |
| `gpt4o-mini-tavily` | gpt-4o-mini | Tavily | Cost-optimized backup |
The paths differ by tool, not just model. This reflects how real agents are built.
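For illustration only (this is not Kalibr's published API), a path can be modeled as a small (model, tool) record; the `ExecutionPath` type and `PATHS` list below are hypothetical names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionPath:
    """One complete execution strategy: a model paired with a search tool."""
    path_id: str
    model: str
    tool: str

# The three paths used in this benchmark.
PATHS = [
    ExecutionPath("gpt4o-serper", model="gpt-4o", tool="serper"),
    ExecutionPath("gpt4o-tavily", model="gpt-4o", tool="tavily"),
    ExecutionPath("gpt4o-mini-tavily", model="gpt-4o-mini", tool="tavily"),
]
```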
## The Agent
A realistic multi-step research agent:
- Plan → Generate search queries (LLM call)
- Search → Call external API (Serper or Tavily)
- Extract → Pull facts with source references (LLM call)
- Synthesize → Write answer with citations (LLM call)
- Validate → Verify citations reference valid sources
A task succeeds only if all steps complete and validation passes.
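A minimal sketch of that pipeline, with hypothetical stub functions standing in for the real LLM and search-API calls:

```python
# Sketch of the five-step research pipeline. The helper functions are
# hypothetical stubs; the real benchmark makes actual LLM and API calls.

def plan_queries(question: str, model: str) -> list[str]:
    return [question]  # stub: a real agent asks the LLM to draft queries

def search(queries: list[str], tool: str) -> list[dict]:
    return [{"url": "https://example.com", "text": "..."}]  # stub: Serper/Tavily call

def extract_facts(results: list[dict], model: str) -> list[dict]:
    return [{"fact": "...", "source": r["url"]} for r in results]  # stub: LLM extraction

def synthesize_answer(question: str, facts: list[dict], model: str) -> dict:
    return {"text": "...", "citations": [f["source"] for f in facts]}  # stub: LLM synthesis

def validate_citations(answer: dict, sources: list[dict]) -> bool:
    valid = {s["url"] for s in sources}
    return all(c in valid for c in answer["citations"])

def run_task(question: str, model: str, tool: str) -> bool:
    """A task succeeds only if every step completes and validation passes."""
    try:
        queries = plan_queries(question, model)             # Plan
        results = search(queries, tool)                     # Search
        facts = extract_facts(results, model)               # Extract
        answer = synthesize_answer(question, facts, model)  # Synthesize
        return validate_citations(answer, results)          # Validate
    except Exception:
        return False  # any failed step fails the whole task
```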
## Experimental Conditions

### Hardcoded Baseline

- Always uses `gpt4o-serper`
- No fallback logic
- No adaptive routing
- Represents a typical production configuration
### Kalibr
- Chooses execution paths based on observed outcomes
- Explores alternatives during normal operation
- Shifts traffic away from degraded paths automatically
- No explicit failure signals or special-case logic
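Kalibr's actual routing policy is not detailed here. As a minimal illustration of the underlying idea, outcome-based routing, an epsilon-greedy selector over per-path success rates behaves the same way: exploit what is working, keep exploring a little so recovered paths are rediscovered. The class below is a toy, not Kalibr's implementation:

```python
import random

class OutcomeRouter:
    """Toy epsilon-greedy router: NOT Kalibr's real policy, just an
    illustration of routing on observed outcomes."""

    def __init__(self, path_ids: list[str], explore_rate: float = 0.1):
        self.stats = {p: {"wins": 0, "tries": 0} for p in path_ids}
        self.explore_rate = explore_rate

    def choose(self) -> str:
        # Occasionally explore so a recovered path can be rediscovered.
        if random.random() < self.explore_rate:
            return random.choice(list(self.stats))

        # Otherwise exploit the path with the best observed success rate,
        # treating untried paths optimistically.
        def score(path_id: str) -> float:
            s = self.stats[path_id]
            return s["wins"] / s["tries"] if s["tries"] else 1.0

        return max(self.stats, key=score)

    def record(self, path_id: str, success: bool) -> None:
        self.stats[path_id]["tries"] += 1
        self.stats[path_id]["wins"] += int(success)
```

Once `gpt4o-serper`'s observed success rate collapses, `choose()` stops selecting it almost immediately; the small explore rate is what lets traffic return if the path later recovers.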
### Phases
| Phase | Tasks | Description |
|---|---|---|
| Learning | 15 | Normal operation. No failures injected. |
| Degraded | 25 | Serper fails 70% of requests. Tavily unaffected. |
| Recovery | 10 | Degradation continues. Measure steady state. |
### Failure Injection
At task 16, a 70% failure rate is injected only on Serper. Tavily remains healthy.
This simulates real-world degradation:
- API rate limits
- Provider outages
- Service degradation
The failure is scoped to one path. Kalibr can route around it; the hardcoded baseline cannot.
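A minimal sketch of how such an injection can be wired into the search step (the constants and function name below are hypothetical, not the benchmark's exact code):

```python
import random

FAILURE_START_TASK = 16   # degradation begins after the 15-task learning phase
FAILURE_RATE = 0.70       # 70% of Serper requests fail

def maybe_inject_failure(tool: str, task_index: int) -> None:
    """Simulate a degraded provider: Serper calls fail 70% of the time
    from task 16 onward, mimicking rate limiting. Tavily is untouched."""
    if tool == "serper" and task_index >= FAILURE_START_TASK:
        if random.random() < FAILURE_RATE:
            raise RuntimeError("simulated 429: rate limited")
```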
## Results

### Run 1
| Phase | Hardcoded | Kalibr | Delta |
|---|---|---|---|
| Learning | 100.0% | 100.0% | +0.0% |
| Degraded | 36.0% | 92.0% | +56.0% |
| Recovery | 30.0% | 100.0% | +70.0% |
| Overall | 54.0% | 96.0% | +42.0% |
### Run 2
| Phase | Hardcoded | Kalibr | Delta |
|---|---|---|---|
| Learning | 100.0% | 100.0% | +0.0% |
| Degraded | 16.0% | 88.0% | +72.0% |
| Recovery | 20.0% | 100.0% | +80.0% |
| Overall | 42.0% | 94.0% | +52.0% |
### Run 3
| Phase | Hardcoded | Kalibr | Delta |
|---|---|---|---|
| Learning | 93.3% | 100.0% | +6.7% |
| Degraded | 24.0% | 88.0% | +64.0% |
| Recovery | 30.0% | 100.0% | +70.0% |
| Overall | 46.0% | 94.0% | +48.0% |
Results are consistent across runs.
### Path Distribution (Run 3)
| Path | Tasks | Success Rate |
|---|---|---|
| `gpt4o-serper` | 5 | 40.0% |
| `gpt4o-tavily` | 28 | 100.0% |
| `gpt4o-mini-tavily` | 17 | 100.0% |
Kalibr learned that Serper was failing and shifted traffic to Tavily paths.
## What This Demonstrates
**During normal operation:** both systems perform identically. Kalibr adds no overhead when nothing is wrong.

**During degradation:**

**Hardcoded system:**
- Kept routing 100% of traffic to the broken path
- Success rate dropped to ~16-36%
- Would require human intervention to recover

**Kalibr:**
- Observed failures and shifted traffic automatically
- Maintained ~88-92% success rate
- No code change, no human intervention
This is not an optimization. It is a behavioral difference that hardcoded systems cannot exhibit.
## What This Does Not Claim
This benchmark does not demonstrate:
- Superior reasoning or intelligence
- Universal optimality across all tasks
- Guaranteed reliability in all environments
Kalibr is a control system. It routes execution based on what is actually working.
## Limitations
- Single task type (research agent)
- Three execution paths
- Synthetic failure injection (rate limiting simulation)
Results should not be extrapolated to all workloads. The purpose is to validate adaptive execution path routing under degradation.
## Run It Yourself

```bash
pip install kalibr openai httpx

export KALIBR_API_KEY=your-key
export KALIBR_TENANT_ID=your-tenant
export OPENAI_API_KEY=your-key
export SERPER_API_KEY=your-key
export TAVILY_API_KEY=your-key

python resilience_benchmark.py
```
Options:

```bash
python resilience_benchmark.py --quick   # ~25 tasks, ~3 min
python resilience_benchmark.py           # ~50 tasks, ~5 min
python resilience_benchmark.py --full    # ~100 tasks, ~10 min
```
Requirements: ~$0.30 in API usage (standard run), Python 3.10+
## Summary

| Metric | Hardcoded | Kalibr |
|---|---|---|
| Success during degradation | ~16-36% | ~88-92% |
| Human intervention required | Yes | No |
| Code changes required | Yes | No |
When execution paths degrade, hardcoded systems fail until humans intervene. Kalibr adapts automatically.
## Source Code
The complete benchmark is open source: [github.com/kalibr-ai/kalibr-benchmark](https://github.com/kalibr-ai/kalibr-benchmark)