Resilience Benchmark

When your best execution path degrades, Kalibr routes around it automatically. Hardcoded systems keep failing until a human intervenes.


The Question

What happens when an agent's execution path starts failing in production?

  • Hardcoded systems continue sending traffic to the same path until an engineer notices, diagnoses the issue, and deploys a fix.
  • Kalibr continuously observes outcomes and shifts traffic toward paths that are succeeding.

This benchmark compares those two behaviors under identical conditions.


What Makes This Different

This is execution path routing, not model routing.

Each path is a complete execution strategy combining model and tool:

| Path ID | Model | Tool | Description |
| --- | --- | --- | --- |
| gpt4o-serper | gpt-4o | Serper | Primary path (hardcoded baseline) |
| gpt4o-tavily | gpt-4o | Tavily | Backup tool |
| gpt4o-mini-tavily | gpt-4o-mini | Tavily | Cost-optimized backup |

The paths differ by tool, not just model. This reflects how real agents are built.
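
For concreteness, a path can be thought of as a (model, tool) pair. A minimal sketch in Python (the ExecutionPath dataclass and its field names are illustrative, not Kalibr's actual API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExecutionPath:
    """One complete execution strategy: a model paired with a search tool."""
    path_id: str
    model: str
    tool: str

PATHS = [
    ExecutionPath("gpt4o-serper", "gpt-4o", "serper"),
    ExecutionPath("gpt4o-tavily", "gpt-4o", "tavily"),
    ExecutionPath("gpt4o-mini-tavily", "gpt-4o-mini", "tavily"),
]
```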


The Agent

A realistic multi-step research agent:

  1. Plan → Generate search queries (LLM call)
  2. Search → Call external API (Serper or Tavily)
  3. Extract → Pull facts with source references (LLM call)
  4. Synthesize → Write answer with citations (LLM call)
  5. Validate → Verify citations reference valid sources

A task succeeds only if all steps complete and validation passes.
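
The sketch below only illustrates that all-or-nothing criterion; the step functions are hypothetical placeholders for the five stages above (the real implementations live in resilience_benchmark.py):

```python
def run_task(question: str, model: str, tool: str) -> bool:
    """Run one research task end to end; a failure at any step fails the task."""
    try:
        queries = plan_queries(question, model=model)       # 1. Plan (LLM call)
        results = search(queries, tool=tool)                # 2. Search (Serper or Tavily)
        facts = extract_facts(results, model=model)         # 3. Extract (LLM call)
        answer = synthesize(question, facts, model=model)   # 4. Synthesize (LLM call)
        return validate_citations(answer, results)          # 5. Validate citations
    except Exception:
        return False  # e.g., a search API error at step 2 fails the whole task
```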


Experimental Conditions

Hardcoded Baseline

  • Always uses gpt4o-serper
  • No fallback logic
  • No adaptive routing
  • Represents a typical production configuration

Kalibr

  • Chooses execution paths based on observed outcomes
  • Explores alternatives during normal operation
  • Shifts traffic away from degraded paths automatically
  • No explicit failure signals or special-case logic
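
This benchmark does not disclose Kalibr's internal policy, but the behavioral contrast can be illustrated with a simple epsilon-greedy router. Everything below is a stand-in for exposition, not Kalibr's actual algorithm:

```python
import random
from collections import defaultdict

class HardcodedRouter:
    """Always picks the primary path; outcomes are ignored."""
    def choose(self) -> str:
        return "gpt4o-serper"

    def record(self, path_id: str, success: bool) -> None:
        pass  # no feedback loop

class AdaptiveRouter:
    """Epsilon-greedy stand-in: mostly exploit the best observed path,
    occasionally explore alternatives."""
    def __init__(self, path_ids: list[str], epsilon: float = 0.1):
        self.path_ids = path_ids
        self.epsilon = epsilon
        self.stats = defaultdict(lambda: [0, 0])  # path_id -> [successes, attempts]

    def choose(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(self.path_ids)  # explore
        # Exploit: highest observed success rate; unseen paths get an optimistic prior.
        return max(self.path_ids, key=lambda p: (
            self.stats[p][0] / self.stats[p][1] if self.stats[p][1] else 1.0))

    def record(self, path_id: str, success: bool) -> None:
        self.stats[path_id][0] += int(success)
        self.stats[path_id][1] += 1
```

After each task, the harness calls record(path_id, success). A few consecutive Serper failures are enough to pull even this simple policy off the broken path, which is the directional behavior the benchmark measures.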

Phases

| Phase | Tasks | Description |
| --- | --- | --- |
| Learning | 15 | Normal operation. No failures injected. |
| Degraded | 25 | Serper fails 70% of requests. Tavily unaffected. |
| Recovery | 10 | Degradation continues. Measure steady state. |

Failure Injection

At task 16, a 70% failure rate is injected only on Serper. Tavily remains healthy.

This simulates real-world degradation:

  • API rate limits
  • Provider outages
  • Service degradation

The failure is scoped to a single path: Kalibr can route around it; the hardcoded baseline cannot.
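
A harness can scope the injection to a single wrapper around the search call. The constant names and the exception below are illustrative, not necessarily how resilience_benchmark.py implements it:

```python
import random

FAILURE_START_TASK = 16    # degradation begins at task 16 (per the phase table)
SERPER_FAILURE_RATE = 0.70

def maybe_inject_failure(tool: str, task_index: int) -> None:
    """Raise before the real Serper call once the degraded phase begins.

    70% of Serper requests fail; Tavily is never touched.
    """
    if tool == "serper" and task_index >= FAILURE_START_TASK:
        if random.random() < SERPER_FAILURE_RATE:
            raise RuntimeError("injected failure: simulated rate limit (HTTP 429)")
```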


Results

Run 1

| Phase | Hardcoded | Kalibr | Delta |
| --- | --- | --- | --- |
| Learning | 100.0% | 100.0% | +0.0% |
| Degraded | 36.0% | 92.0% | +56.0% |
| Recovery | 30.0% | 100.0% | +70.0% |
| Overall | 54.0% | 96.0% | +42.0% |

Run 2

| Phase | Hardcoded | Kalibr | Delta |
| --- | --- | --- | --- |
| Learning | 100.0% | 100.0% | +0.0% |
| Degraded | 16.0% | 88.0% | +72.0% |
| Recovery | 20.0% | 100.0% | +80.0% |
| Overall | 42.0% | 94.0% | +52.0% |

Run 3

| Phase | Hardcoded | Kalibr | Delta |
| --- | --- | --- | --- |
| Learning | 93.3% | 100.0% | +6.7% |
| Degraded | 24.0% | 88.0% | +64.0% |
| Recovery | 30.0% | 100.0% | +70.0% |
| Overall | 46.0% | 94.0% | +48.0% |

Results are consistent across runs.

Path Distribution (Run 3)

| Path | Tasks | Success Rate |
| --- | --- | --- |
| gpt4o-serper | 5 | 40.0% |
| gpt4o-tavily | 28 | 100.0% |
| gpt4o-mini-tavily | 17 | 100.0% |

Kalibr learned that Serper was failing and shifted traffic to Tavily paths.


What This Demonstrates

During normal operation: Both systems perform comparably. Kalibr adds no overhead when nothing is wrong.

During degradation:

Hardcoded system:

  • Kept routing 100% of traffic to the broken path
  • Success rate dropped to ~16-36%
  • Would require human intervention to recover

Kalibr:

  • Observed failures and shifted traffic automatically
  • Maintained ~90% success rate
  • No code change, no human intervention

This is not an optimization. It is a behavioral difference that hardcoded systems cannot exhibit.


What This Does Not Claim

This benchmark does not demonstrate:

  • Superior reasoning or intelligence
  • Universal optimality across all tasks
  • Guaranteed reliability in all environments

Kalibr is a control system. It routes execution based on what is actually working.


Limitations

  • Single task type (research agent)
  • Three execution paths
  • Synthetic failure injection (simulated rate limiting)

Results should not be extrapolated to all workloads. The purpose is to validate adaptive execution path routing under degradation.


Run It Yourself

pip install kalibr openai httpx

export KALIBR_API_KEY=your-key
export KALIBR_TENANT_ID=your-tenant
export OPENAI_API_KEY=your-key
export SERPER_API_KEY=your-key
export TAVILY_API_KEY=your-key

python resilience_benchmark.py

Options:

python resilience_benchmark.py --quick  # ~25 tasks, ~3 min
python resilience_benchmark.py          # ~50 tasks, ~5 min
python resilience_benchmark.py --full   # ~100 tasks, ~10 min

Requirements: ~$0.30 in API usage (standard run), Python 3.10+


Summary

| Metric | Hardcoded | Kalibr |
| --- | --- | --- |
| Success during degradation | ~16-36% | ~88-92% |
| Human intervention required | Yes | No |
| Code changes required | Yes | No |

When execution paths degrade, hardcoded systems fail until humans intervene. Kalibr adapts automatically.


Source Code

The complete benchmark is open source: github.com/kalibr-ai/kalibr-benchmark