Summary
Evaluation tells you whether the agent is good enough for a task. Observability tells you why it passed or failed. Production systems need both, because scores without traces are hard to improve and traces without metrics are hard to prioritize.

Why It Matters
Agent systems are probabilistic and multi-step. That makes them harder to judge than deterministic software.
- A correct answer may depend on search, tools, or file state.
- The same task can fail for very different reasons.
- A prompt or model change can improve one capability while silently harming another.
Because a single score cannot explain these behaviors, you need two loops:
- an evaluation loop for measuring capability
- a diagnostic loop for explaining behavior
Mental Model
Think in three layers.
- Offline evaluation: benchmark-style checks run on known tasks to compare prompts, models, tools, and policies.
- Online evaluation: production signals such as success rate, latency, escalation rate, retries, or human overrides.
- Observability: traces, tool logs, state transitions, and artifacts that show what the system actually did.
Different task types call for different checks:
- Tool use often needs structured correctness checks such as function and parameter matching.
- General assistant tasks often need answer-level correctness plus task-level completion.
- Data generation or synthesis tasks may need comparative review, judge models, or human verification.
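The first kind of check can be made concrete as a small structured-matching scorer. This is a minimal sketch; the `name`/`arguments` schema and the `score_tool_call` helper are illustrative assumptions, not something defined by the source material.

```python
# Sketch of structured tool-use evaluation: compare the agent's emitted
# tool call against a gold reference call. Field names are illustrative.

def score_tool_call(predicted: dict, expected: dict) -> dict:
    """Return function-match and parameter-match flags for one tool call."""
    fn_match = predicted.get("name") == expected.get("name")
    pred_args = predicted.get("arguments", {})
    exp_args = expected.get("arguments", {})
    # Parameters match only if every expected key is present with the same value;
    # extra predicted arguments are ignored here (a stricter policy could reject them).
    param_match = fn_match and all(
        pred_args.get(k) == v for k, v in exp_args.items()
    )
    return {"function_match": fn_match, "parameter_match": param_match}

pred = {"name": "search_files", "arguments": {"query": "invoice", "limit": 10}}
gold = {"name": "search_files", "arguments": {"query": "invoice"}}
print(score_tool_call(pred, gold))  # {'function_match': True, 'parameter_match': True}
```

Separating the function flag from the parameter flag matters in practice: an agent that picks the right tool with wrong arguments fails differently from one that picks the wrong tool, and the two call for different fixes.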
Architecture Diagram
Tool Landscape
The imported reference material highlights three useful evaluation shapes:
- benchmark-style tool-use evaluation, where structured matching checks whether the agent selected the right function and arguments
- general-assistant evaluation, where tasks require multi-step reasoning and broader success judgments
- generation-quality evaluation, where relative comparison or human review is often more useful than one exact metric
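For the generation-quality shape, relative comparison can be reduced to pairwise preferences aggregated into win rates. The sketch below stubs the judge with a trivial length heuristic; in reality the judge would be a model or a human reviewer, and the function names here are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def judge(output_a: str, output_b: str) -> str:
    """Stub judge: prefer the longer output. A real judge is a model or a human."""
    return output_a if len(output_a) >= len(output_b) else output_b

def win_rates(outputs: dict) -> dict:
    """Compare every pair of candidate outputs and report each candidate's win rate."""
    wins = Counter()
    for (name_a, out_a), (name_b, out_b) in combinations(outputs.items(), 2):
        winner = name_a if judge(out_a, out_b) == out_a else name_b
        wins[winner] += 1
    n_comparisons = len(outputs) - 1  # each candidate appears in this many pairs
    return {name: wins[name] / n_comparisons for name in outputs}

candidates = {
    "model_a": "short",
    "model_b": "a longer answer",
    "model_c": "the longest answer of all",
}
print(win_rates(candidates))
```

The point of the shape, not the stub: pairwise comparison sidesteps the need for one exact metric, which is exactly why it suits generation tasks where absolute scoring is unreliable.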
For traces, the practical guidance is:
- Keep full tool inputs and outputs.
- Preserve failure records rather than collapsing them into generic errors.
- Track step order, retries, and state changes.
- Keep traces readable by both humans and machines.
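Those four guidelines can be satisfied by a structured per-step trace record. The schema below is an illustrative assumption, not a standard; any format works as long as it preserves full inputs, real errors, ordering, and retry links.

```python
import json
import time

def record_step(trace, step, tool, tool_input, tool_output, error=None, retry_of=None):
    """Append one step to the trace, keeping full inputs/outputs and failure detail."""
    trace.append({
        "step": step,          # preserves step order
        "ts": time.time(),
        "tool": tool,
        "input": tool_input,   # full tool input, not a summary
        "output": tool_output, # full tool output (None on failure)
        "error": error,        # the real error, not a generic "tool failed"
        "retry_of": retry_of,  # links a retry back to the failed step
    })

trace = []
record_step(trace, 1, "read_file", {"path": "config.yaml"}, None,
            error="FileNotFoundError: config.yaml")
record_step(trace, 2, "read_file", {"path": "./config.yaml"},
            {"content": "timeout: 30"}, retry_of=1)
print(json.dumps(trace, indent=2))  # JSON keeps it readable by humans and machines
```

Emitting plain JSON is one easy way to meet the last guideline: the same record can be grepped during an incident and parsed by an aggregation job later.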
Tradeoffs
- Offline benchmarks are useful, but they can overfit the system to lab tasks that are cleaner than production reality.
- Online metrics reflect real usage, but they lag and are noisy without good segmentation.
- Judge-model evaluation scales well, but it still needs human calibration.
- Rich traces improve diagnosis, but they create storage, privacy, and review overhead.
A short checklist:
- Evaluate the capability you are actually changing.
- Keep traces for both failed and successful runs.
- Review failure modes before rewriting prompts.
- Do not ship “tool failed” as the only explanation developers can see.
Citations
- Source input: Chapter 12 Agent Performance Evaluation
- Source input: Extra09 Agent build pitfalls and observability lessons
Reading Extensions
Update Log
- 2026-04-21: Initial repo-native draft based on imported reference material and lab rewrite rules.