Technical Deep Dive

How to Know If Your AI Agent Actually Works

Model benchmarks tell you nothing about agent performance. Trajectory analysis, the three evaluation pillars, and the metrics that actually matter.

MMNTM Research Team
6 min read
#AI Agents #Evaluation #Testing #Best Practices

What is Agent Evaluation?

Agent evaluation is the practice of measuring AI agent performance through trajectory analysis—not model benchmarks. Unlike traditional ML evaluation that measures accuracy on test sets, agent evals analyze the complete sequence of observations, thoughts, actions, and tool calls. The three pillars are: automated testing (task success, latency, cost), LLM-as-judge (helpfulness, reasoning quality), and human review (edge cases, domain expertise). Standard benchmarks show the gap: GPT-4 achieves only ~14% on WebArena autonomous web tasks.



Model Evaluation ≠ Agent Evaluation

A model that scores brilliantly on benchmarks can still build an agent that fails catastrophically in production.

Model evaluation measures potential—can this LLM generate coherent text? Agent evaluation measures performance in action—can this system complete multi-step tasks, choose the right tools, handle errors, and achieve goals?

Traditional metrics like perplexity and BLEU scores tell you nothing about whether your agent will complete workflows, recover from failures, or maintain context across sessions.

Trajectory Analysis: The Core Technique

The defining shift in agent evaluation is trajectory analysis—evaluating the complete sequence of observations, thoughts, actions, and tool calls an agent makes.

An agent might produce the correct final answer but take an unnecessarily long path, use inappropriate tools, or hallucinate intermediate results. Conversely, an agent might fail the task but demonstrate correct reasoning up to a specific failure point—telling you exactly where to fix the system.

Google's Vertex AI, for example, provides trajectory-specific metrics covering dimensions such as:

  • Exact match: Agent trajectory mirrors the ideal solution
  • Tool sequence: Correct tools called in the right order
  • Parameter accuracy: Tools called with correct arguments
  • Efficiency: Goals achieved without unnecessary steps

Outcome-only evaluation is blind to reasoning quality. Trajectory evaluation reveals the quality of your agent's decision-making.
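To make these checks concrete, here is a minimal sketch of how they might be computed, assuming a trajectory is recorded as an ordered list of tool calls with their arguments. The ToolCall type and the comparison logic are illustrative, not Vertex AI's actual API:

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One step in an agent trajectory: which tool was called and with what arguments."""
    tool: str
    args: dict


def exact_match(actual: list[ToolCall], ideal: list[ToolCall]) -> bool:
    """Trajectory mirrors the ideal solution: same tools, same order, same arguments."""
    return actual == ideal


def tool_sequence_match(actual: list[ToolCall], ideal: list[ToolCall]) -> bool:
    """Correct tools called in the right order, ignoring arguments."""
    return [c.tool for c in actual] == [c.tool for c in ideal]


def parameter_accuracy(actual: list[ToolCall], ideal: list[ToolCall]) -> float:
    """Fraction of ideal calls whose tool AND arguments appear in the actual trajectory."""
    if not ideal:
        return 1.0
    hits = sum(1 for call in ideal if call in actual)
    return hits / len(ideal)


def efficiency(actual: list[ToolCall], ideal: list[ToolCall]) -> float:
    """Ratio of ideal length to actual length; 1.0 means no unnecessary steps."""
    if not actual:
        return 0.0
    return min(1.0, len(ideal) / len(actual))
```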

The Three Evaluation Pillars

Production-grade evaluation combines three approaches:

1. Automated Testing

Programmatic assertions process thousands of interactions to establish baselines (a minimal harness sketch follows this list):

  • Task success rates
  • Latency and throughput under load
  • Cost per task (tokens + API calls)
  • Tool-calling accuracy
  • Hallucination rates
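A harness along these lines might look as follows, assuming each test case pairs a task with a programmatic check and the agent exposes a run(task) method returning its answer, tool calls, and token counts. All class and method names here are illustrative:

```python
import statistics
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentResult:
    """What a single agent run is assumed to return."""
    answer: str
    tool_calls: list[str]
    prompt_tokens: int
    completion_tokens: int


@dataclass
class TestCase:
    task: str
    check: Callable[[AgentResult], bool]           # programmatic assertion on the outcome
    expected_tools: list[str] = field(default_factory=list)


def run_suite(agent, cases: list[TestCase], usd_per_1k_tokens: float = 0.01) -> dict:
    """Run every case, recording success, latency, cost, and tool-calling accuracy."""
    successes, latencies, costs, tool_hits = 0, [], [], 0
    for case in cases:
        start = time.perf_counter()
        result: AgentResult = agent.run(case.task)  # assumed agent interface
        latencies.append(time.perf_counter() - start)
        tokens = result.prompt_tokens + result.completion_tokens
        costs.append(tokens / 1000 * usd_per_1k_tokens)
        if case.check(result):
            successes += 1
        if case.expected_tools and result.tool_calls == case.expected_tools:
            tool_hits += 1
    cases_with_tools = sum(1 for c in cases if c.expected_tools)
    return {
        "task_success_rate": successes / len(cases),
        "p95_latency_s": statistics.quantiles(latencies, n=20)[-1],
        "mean_cost_usd": statistics.fmean(costs),
        "tool_accuracy": tool_hits / max(1, cases_with_tools),
    }
```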

2. LLM-as-Judge

A separate LLM evaluates outputs based on rubrics for helpfulness, coherence, safety, and reasoning quality. Well-designed LLM judges achieve 74-82% agreement with human evaluators.

Critical caveat: Arize's research found that evaluators show self-evaluation bias—OpenAI models favor OpenAI outputs (+9.4%), and Anthropic models show similar patterns. Use a different model family for the judge than for the agent.
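A sketch of a rubric-based judge, assuming a generic judge_client.complete(prompt) call that returns text from a model in a different family than the agent under test. The client interface, rubric wording, and 1-5 scale are illustrative:

```python
import json

JUDGE_RUBRIC = """You are grading an AI agent's response.
Score each criterion from 1 (poor) to 5 (excellent) and return JSON only:
{{"helpfulness": int, "coherence": int, "safety": int, "reasoning_quality": int, "justification": str}}

Task given to the agent:
{task}

Agent's response:
{response}"""


def judge_response(judge_client, task: str, response: str) -> dict:
    """Ask a separate judge model to score one agent output against the rubric.

    To reduce self-evaluation bias, judge_client should wrap a model from a
    different provider than the agent being evaluated.
    """
    prompt = JUDGE_RUBRIC.format(task=task, response=response)
    raw = judge_client.complete(prompt)  # assumed: returns the judge model's text
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"error": "judge returned non-JSON output", "raw": raw}
```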

3. Human Review

Humans assess what automated systems can't: subtle biases, contextual appropriateness, domain-specific edge cases, and subjective qualities like empathy.

The winning approach: automated evaluators provide broad coverage, LLM judges assess subjective dimensions, human review validates edge cases.

Model Metrics vs. System Metrics

A critical distinction that trips up many teams:

Model Benchmarks
  • What they measure: raw LLM capability
  • Examples: MMLU, SWE-bench, HumanEval, perplexity, BLEU
  • What they tell you: "Can this model reason?"

System Metrics
  • What they measure: agent performance in production
  • Examples: task completion rate, P95 latency, CPCT, rework rate
  • What they tell you: "Does this agent deliver value?"

Why this matters: A model scoring 90% on SWE-bench may power an agent with 60% task completion. The gap is everything—tool integration, context management, error recovery, real-world edge cases. Report system metrics to executives; use model benchmarks only for initial model selection.

The Metrics That Matter

Modern frameworks assess performance across four dimensions—all system-level metrics, not model benchmarks:

Task Completion: Success rate, step-level accuracy, intent resolution

Tool Interaction: Correct selection, parameter extraction, efficiency

System Performance: P50/P95 latency, throughput, cost per successful task

Quality & Safety: Groundedness, hallucination rate, error recovery
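The system-performance numbers above can be aggregated from run logs in a few lines. This sketch assumes each logged run carries a success flag, latency, and cost; the record fields are illustrative:

```python
import statistics
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One logged agent run, as assumed to be captured in production telemetry."""
    success: bool
    latency_s: float
    cost_usd: float


def system_performance(records: list[RunRecord], window_s: float) -> dict:
    """Aggregate P50/P95 latency, throughput, and cost per successful task."""
    latencies = sorted(r.latency_s for r in records)
    successes = [r for r in records if r.success]
    total_cost = sum(r.cost_usd for r in records)
    return {
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": statistics.quantiles(latencies, n=20)[-1],
        "throughput_per_min": len(records) / (window_s / 60),
        # Failed runs still cost money, so divide total spend by successful tasks only.
        "cost_per_successful_task_usd": total_cost / max(1, len(successes)),
        "task_completion_rate": len(successes) / len(records),
    }
```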

Multi-Agent Evaluation

When systems evolve from single agents to orchestrated teams, evaluation complexity increases dramatically. You're measuring coordination, not just individual performance.

Key multi-agent metrics:

  • Handoff quality: Was context passed correctly between agents?
  • Coordination efficiency: Communication overhead vs. task value
  • Conflict resolution: How do agents handle disagreements?

Anthropic's multi-agent research system demonstrated the stakes: their coordinated approach outperformed single-agent Opus 4 by 90.2% on research tasks. The swarm patterns that enable this require evaluation at both agent and system levels.
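Handoff quality in particular lends itself to automated checks. A minimal sketch, assuming each handoff is logged with the context dictionary passed between agents and that required keys are defined per route (both structures are illustrative):

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    """One logged handoff between agents in an orchestrated run."""
    from_agent: str
    to_agent: str
    context: dict  # whatever payload the receiving agent was given


# Keys the receiving agent needs to do its job, per route (illustrative).
REQUIRED_CONTEXT = {
    ("planner", "researcher"): ["goal", "constraints"],
    ("researcher", "writer"): ["goal", "findings", "sources"],
}


def handoff_quality(handoffs: list[Handoff]) -> float:
    """Fraction of handoffs where all required context keys were actually passed."""
    if not handoffs:
        return 1.0
    ok = 0
    for h in handoffs:
        required = REQUIRED_CONTEXT.get((h.from_agent, h.to_agent), [])
        if all(key in h.context and h.context[key] for key in required):
            ok += 1
    return ok / len(handoffs)
```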

The Benchmark Reality Check

Standardized benchmarks provide comparative baselines:

WebArena: Autonomous web agents on realistic websites. GPT-4 achieved only ~14% success initially; 2025 architectures reach 60%+.

AgentBench: Eight diverse environments from web shopping to database queries. Results are sobering: GPT-4 scored ~4.0 while strong open-source models scored under 1.0. General LLM capability doesn't translate automatically to agent performance.

OSWorld: Desktop automation. State-of-the-art reaches 40-50% on entry-level tasks—full computer automation remains distant.

Building Your Eval Suite

Start with these practices:

  1. Define success criteria before building frameworks—what does "working" mean for your use case?

  2. Evaluate trajectories, not just outcomes—understanding reasoning quality reveals where to improve

  3. Use synthetic data to scale test coverage. Databricks' approach generates diverse test cases from proprietary documents. The best prompts don't come from manual engineering—they emerge from systematic evolution, as explained in The Prompt DNA Hypothesis.

  4. Integrate into CI/CD—treat agent quality like code quality. Fail builds when scores drop below thresholds (see the sketch after this list).

  5. Monitor production continuously—evaluation doesn't end at deployment. See Agent Observability for the operational side.
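For step 4, the gate can be as simple as a test that fails the build when suite scores drop below agreed thresholds. This pytest-style sketch assumes a run_eval_suite() helper that returns the aggregated metrics dictionary; the helper module and the threshold values are illustrative:

```python
# test_agent_quality.py — run in CI alongside unit tests; a failing assert fails the build.
from my_evals import run_eval_suite  # hypothetical module: runs the eval cases, returns a metrics dict

THRESHOLDS = {
    "task_success_rate": 0.85,  # illustrative minimums agreed with the team
    "tool_accuracy": 0.90,
}
MAX_P95_LATENCY_S = 10.0


def test_agent_meets_quality_bar():
    metrics = run_eval_suite()
    for name, minimum in THRESHOLDS.items():
        assert metrics[name] >= minimum, f"{name} regressed: {metrics[name]:.2f} < {minimum}"
    assert metrics["p95_latency_s"] <= MAX_P95_LATENCY_S, "P95 latency regression"
```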

The Bottom Line

Gartner predicts that over 40% of agentic AI projects will be cancelled by the end of 2027. The difference between success and failure often comes down to evaluation rigor.

Model benchmarks measure potential. Trajectory analysis measures reality. Build for the latter.
