MMNTM logo
Return to Index
Technical Deep Dive

Building Agent Evals: From Zero to Production

Why 40% of agent projects fail: the 5-level maturity model for production evals. Move beyond SWE-bench scores to measure task completion, error recovery, and ROI.

MMNTM Research Team
14 min read
#AI Agents#Evaluation#Testing#Best Practices#Infrastructure

The Paradigm Shift

Measuring models tells you what's possible. Measuring agents tells you what works.

A model that dominates benchmarks can still build an agent that fails catastrophically in production. Claude Opus 4.5 scores 80.9% on SWE-bench Verified—but that number doesn't predict whether your coding agent will complete tickets, handle edge cases, or recover from API failures.

Agent evaluation measures performance in action: multi-step task completion, tool orchestration, error recovery, and goal achievement. The shift from evaluating isolated intelligence to evaluating interactive systems requires new frameworks, metrics, and infrastructure.

Projects Cancelled

40%

By 2027 (Gartner)

Hallucination Rate

0.2%

Harvey AI benchmark

Cost Reduction

94.2%

With model cascading


The Five-Level Maturity Model

Most organizations are stuck at Level 0 or 1. World-class teams treat evals as core infrastructure, progressing through five distinct stages:

Level 5: Self-Improving Evals — Production failures automatically generate regression tests. LLM-generated adversarial test cases. Automatic threshold tuning based on production metrics. Harvey AI achieved their 0.2% hallucination rate through continuous eval cycles.


Multi-Dimensional Metrics Framework

A single accuracy score is insufficient. Ship/no-ship decisions require a balanced scorecard:

Task Completion Rate (TCR)

The north star metric. Percentage of end-to-end tasks successfully completed according to predefined success criteria.

Measurement: For coding agents, does the linter pass? For booking agents, was the calendar invite sent? Use database state comparison for verification.

30-70%

Complex Benchmarks

WebArena, AgentBench

90%+

Production Target

Critical workflows

Hallucination Rate

Percentage of outputs containing unsupported or fabricated claims.

Advanced measurement:

  • Claim decomposition: Breaking outputs into atomic claims
  • Source verification: Checking each claim against provided context
  • Confidence scoring: Identifying uncertain statements

Hallucination Thresholds by Domain

FeatureDomainAcceptable RatePopularNotes
Legal / MedicalHigh-stakesUnder 1%Zero tolerance
Customer ServiceMedium-stakes2-5%May be tolerable
CreativeLow-stakesHigher OKContext dependent

Latency Metrics

Track percentiles, not averages. P95 and P99 reveal the experience of tail users.

Use CaseTarget Latency
Real-time chatUnder 2s
Background workflows10-30s acceptable
Batch processingMinutes to hours

Cost Per Task

Total expenditure (tokens + API calls + infrastructure) for one successful task completion.

Why it matters: Prevents "infinite loops" where agents spin in reasoning steps without solving problems. Cost regressions directly affect unit economics.

Safety Metrics

  • Jailbreak resistance: 2-10% attack success rate for production systems
  • PII leakage detection: Automated scanning + compliance validation
  • Harmful output rate: Multi-dimensional bias evaluation

Test Suite Architecture: The Pyramid

Don't rely solely on expensive end-to-end tests. Build a pyramid of verification:

Test Distribution (Recommended)

Unit Tests60
Integration Tests25
Regression Tests10
Adversarial Tests5

Unit Tests (Fast & Cheap)

Scope: Single-turn evaluation of specific capabilities

Example: "Summarize this contract clause" → validate summary accuracy against ground truth

When to use: Prompt engineering iterations, model comparison, regression testing for specific capabilities

Integration Tests (Multi-Turn Workflows)

Scope: End-to-end task execution across multiple steps

Technique: Trajectory testing—record full agent execution trace (tool calls, reasoning, outputs). Assert on intermediate states and final outcomes.

Regression Tests (Golden Datasets)

Purpose: Detect when changes break previously working functionality

Execution: Run on every code/prompt change. Block merges if regression rate exceeds threshold (typically 5%).

Adversarial Tests (Red Team)

Categories:

  1. Prompt injection attacks: goal hijacking, instruction override, data exfiltration
  2. Jailbreak attempts: safety guardrail bypass, multi-turn manipulation
  3. Edge case generation: input validation stress tests, context overflow, malformed data

Tools: Promptfoo for automated red team configuration, AgentDojo for prompt injection testing


Evaluation Methods: The Hybrid Approach

MethodBest ForProsCons
Rule-BasedJSON formatting, tool arguments, syntaxFast, deterministic, freeBrittle, misses nuance
LLM-as-JudgeTone, helpfulness, reasoning qualityFlexible, scales infinitelyExpensive, position bias
Human ReviewGold standard creation, edge casesHighest qualitySlow, expensive, unscalable
HybridProduction constraintsBest of both worldsComplex to set up

LLM-as-Judge Best Practices

Well-designed LLM judges achieve 74-82% agreement with human evaluators, but introduce biases:

Self-evaluation bias: Arize research found OpenAI models favor OpenAI outputs (+9.4%), Anthropic shows similar patterns. Use different models for agents and judges.

Mitigation strategies:

  • Use temperature=0 for evaluation models
  • Structured prompts with explicit rubrics
  • Multiple judges with aggregation
  • Calibration sets to validate alignment with humans

The Tiered Judge Strategy

Use a small, cheap model (Haiku 4.5 at $0.80/M input) as first-pass gatekeeper. Only send difficult or ambiguous cases to Opus 4.5 ($5.00/M input) for final ruling.


Golden Dataset Strategy

Data Sources

Dataset Source Comparison

FeatureSyntheticProductionPopularLLM-Generated
QualityHigh (known truth)High (realistic)Variable
CoverageNarrowBroadBroad
CostHigh (manual)MediumLow
Scalability
Real-world patterns

Recommended Composition

Golden Dataset Mix

70
20
10
RepresentativeEdge CasesAdversarial

For generating synthetic evaluation datasets from proprietary data at scale, platforms like Databricks Agent Bricks automate this process—using the enterprise's own data to generate realistic "questions and answers" for testing agents. This is particularly valuable when production data contains sensitive information that can't be used directly for evaluation. The complete evaluation infrastructure stack is detailed in Databricks Foundation.

Size Recommendations

Minimum Viable

100+

Basic validation

Robust Evaluation

500-1K

Production systems

Comprehensive

2,000+

Exhaustive testing


Organizational Playbooks

The PM-Eng-QA Triad

RoleResponsibilityAction
Product ManagerAccountableDefines success criteria. Writes golden examples.
AI EngineerResponsibleImplements eval infrastructure. Connects CI/CD.
QA / SREConsultedMonitors production drift. Alerts on divergence.

Critical Process: Weekly Eval Review

Key outcome: Every production failure becomes a regression test. This prevents the same bug from happening twice.

Agenda:

  1. Review pass rates and trends over past week
  2. Analyze new failures and categorize by root cause
  3. Discuss production incidents and eval coverage gaps
  4. Update test dataset based on findings
  5. Adjust thresholds or success criteria if needed

The Tool Landscape

LangSmith (The Observability Giant)

Best for: Teams deep in LangChain ecosystem needing full trace visibility

  • Dataset management and version control
  • "Queue" mode for human review
  • Native GitHub Actions integration to block PRs on regression
  • Production trace logging and debugging

Braintrust (The Enterprise Standard)

Best for: Large enterprises requiring governance and on-prem deployment

  • Evaluation of prompts independent of code
  • Git-like workflow for version control
  • Framework-agnostic (not tied to LangChain)
  • Pricing: $249/month for Pro, Custom for Enterprise

Promptfoo (The Hacker's Choice)

Best for: Security teams and engineers who love the CLI

  • Open-source, CLI-first tool
  • Adversarial testing focus with red team capabilities
  • Local execution (no data leaves your machine)
  • Simple YAML-based configuration

The Six-Month Roadmap

Foundation

100 core test cases, automated scoring, basic dashboard. 1-2 engineers full-time.

Milestone

Automation

CI/CD integration, pass/fail gates blocking deployments, Slack notifications.

Milestone

Production Parity

Golden datasets from production, adversarial testing, A/B testing infrastructure.

Milestone

Continuous Optimization

Auto-generated regression tests, drift detection, LLM adversarial generation.

Investment Scaling

Team SizeEval Investment
Small (under 10)10-20% time on evals
Medium (10-50)1-2 dedicated eval engineers
Large (50+)Dedicated eval team (3-5 people)

Common Pitfalls and Solutions

Problem: Optimizing for eval metrics leads to degraded real-world performance

Solutions:

  • Hold-out test sets never used for development
  • Production A/B testing as ultimate validation
  • Regular refresh of eval datasets to prevent memorization

Problem: Evals only cover happy path, miss edge cases and failure modes

Solutions:

  • Dedicated adversarial testing phase
  • Production sampling to capture real diversity
  • Automated edge case generation using LLMs

Problem: Evals use synthetic data that doesn't reflect real usage

Solutions:

  • Golden datasets sourced from production snapshots
  • Continuous monitoring to detect distribution drift
  • A/B testing to validate eval predictions

Problem: LLM-as-judge costs spiral with comprehensive evaluation

Solutions:

  • Hybrid scoring: LLM for nuance, rules for determinism
  • Model cascading: cheap models first, expensive for hard cases
  • Batching and caching of evaluation calls

Problem: LLM-as-judge gives different scores for same output

Solutions:

  • Use temperature=0 for evaluation models
  • Structured prompts with explicit rubrics
  • Multiple evaluation runs with aggregation

The Bottom Line

40% of agentic AI projects will be cancelled by 2027. The difference between success and failure often comes down to evaluation rigor.

Model benchmarks measure potential. Agent evals measure reality.

The shift from evaluating isolated intelligence to evaluating interactive systems requires:

  • Progressive maturity: Moving through 5 clear stages from manual testing to self-improving evals
  • Multi-dimensional measurement: Tracking task completion, hallucination, latency, cost, and safety simultaneously
  • Comprehensive test coverage: Unit, integration, regression, and adversarial tests working together
  • Hybrid evaluation methods: Combining LLM-as-judge, rule-based checks, and human validation
  • Production-driven datasets: Golden datasets that evolve with real user behavior

Target Detection Rate

95%+

Issues caught before production

To World-Class

6 months

Following this roadmap

MMNTM Research TeamDec 15, 2025
Building Agent Evals: From Zero to Production