Every time a user accepts or rejects an agent output, Kalibr can learn from it. Connect that signal and future runs route to better models automatically — no redeployment, no config changes.
Structural evals and provider outcomes tell Kalibr what succeeded technically. User signals tell Kalibr what actually worked for real users.
Three levels. Pick the one that matches how your app is built.
Use this when your app has a clear thumbs up / thumbs down interaction, or when you can detect acceptance from user behavior.
from kalibr.feedback import user_accepted, user_rejected, track_run

# After a pipeline run, store context so feedback can reference it:
result = router.completion(messages=[...])
track_run(result)

# When the user approves the output:
user_accepted()

# When the user pushes back or asks for a redo:
user_rejected(reason="output too short")
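One way to connect a thumbs up / thumbs down widget to these calls is a small dispatcher. This is a minimal sketch, not part of Kalibr: `make_feedback_handler`, the `"up"`/`"down"` vote strings, and the `"unspecified"` default reason are hypothetical names chosen for illustration. The accept and reject callbacks are injected so the handler can be exercised without the SDK.

```python
from typing import Callable, Optional


def make_feedback_handler(
    on_accept: Callable[[], None],
    on_reject: Callable[..., None],
) -> Callable[..., None]:
    """Build a handler that maps a UI vote to an accept or reject callback.

    Hypothetical helper: the vote strings and default reason are assumptions.
    """

    def handle(vote: str, reason: Optional[str] = None) -> None:
        if vote == "up":
            on_accept()
        elif vote == "down":
            # Pass a reason through as a keyword argument, as in the example above.
            on_reject(reason=reason or "unspecified")
        else:
            raise ValueError(f"unknown vote: {vote!r}")

    return handle
```

In an app you would presumably build it once with `make_feedback_handler(user_accepted, user_rejected)` and call `handle("down", reason="output too short")` from the widget's callback.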
Use this when users send follow-up messages. Kalibr classifies whether the message is an acceptance, rejection, or continuation — you do not have to detect it yourself.
from kalibr.feedback import report_pipeline, report_user_turn
# After the pipeline runs — anchor this session:
report_pipeline(
    session_id="user-session-123",
    goal="research_summary",
    prompt=system_prompt,
    output=response,
    model="gpt-4o",
)
# When the user sends their next message:
report_user_turn(
    session_id="user-session-123",
    user_message=user_next_message,  # Kalibr classifies this automatically
)
report_user_turn uses a two-layer classifier: a heuristic check first, an LLM fallback only if confidence is below threshold. It runs in a background thread and never blocks your main loop.
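The two-layer pattern described here can be illustrated with a small standalone sketch. This is not Kalibr's implementation: the keyword lists, the function names, and the confidence scores are all assumptions made for illustration, and the LLM layer is represented by an injected callable. The 0.85 cutoff is the value this page gives for the fallback threshold.

```python
import threading
from typing import Callable, Tuple

CONFIDENCE_THRESHOLD = 0.85  # cutoff stated on this page for the LLM fallback


def heuristic_classify(message: str) -> Tuple[str, float]:
    """Hypothetical keyword heuristic returning (label, confidence)."""
    text = message.lower()
    if any(phrase in text for phrase in ("thanks", "perfect", "looks good")):
        return "acceptance", 0.95
    if any(phrase in text for phrase in ("redo", "try again", "wrong")):
        return "rejection", 0.9
    return "continuation", 0.5


def classify(message: str, llm_fallback: Callable[[str], str]) -> str:
    """Use the heuristic label when it is confident, otherwise ask the fallback."""
    label, confidence = heuristic_classify(message)
    if confidence >= CONFIDENCE_THRESHOLD:
        return label
    return llm_fallback(message)


def classify_in_background(
    message: str,
    llm_fallback: Callable[[str], str],
    on_result: Callable[[str], None],
) -> threading.Thread:
    """Run classification on a daemon thread so the caller is never blocked."""
    worker = threading.Thread(
        target=lambda: on_result(classify(message, llm_fallback)),
        daemon=True,
    )
    worker.start()
    return worker
```

The point of the ordering is cost: clear-cut messages are settled by the cheap local check, and only ambiguous ones pay for a model call.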
Use this when you can observe what the user did with the output — whether they used it verbatim, edited it, or discarded it. This overrides classifier results and gives Kalibr the most precise feedback.
from kalibr.feedback import report_action

# Output was used exactly as produced:
report_action(session_id="user-session-123", action_type="output_used_verbatim")

# Output was used but edited before use:
report_action(session_id="user-session-123", action_type="output_edited")

# Output was not used:
report_action(session_id="user-session-123", action_type="output_discarded")
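If your app can see both the text the agent produced and the text the user finally kept, the action type can be derived by comparing the two. The helper below is a hypothetical sketch, not part of Kalibr: `infer_action_type` and its whitespace-insensitive comparison are assumptions, and only the three `action_type` strings come from the example above.

```python
from typing import Optional


def infer_action_type(produced: str, final_text: Optional[str]) -> str:
    """Map what the user kept to one of the three action_type values.

    final_text is None (or blank) when the user threw the output away.
    Leading and trailing whitespace is ignored when comparing.
    """
    if final_text is None or not final_text.strip():
        return "output_discarded"
    if final_text.strip() == produced.strip():
        return "output_used_verbatim"
    return "output_edited"
```

The result can then be passed along, for example `report_action(session_id="user-session-123", action_type=infer_action_type(response, saved_text))`, where `saved_text` stands for whatever your app persisted.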
report_user_turn classifies the message locally first. It only calls an LLM if the heuristic confidence is below 0.85.
The optional raw_evidence field is capped at 500 characters if you choose to include it.