Building Agent Evals: From Zero to Production

The Paradigm Shift

Measuring models tells you what's possible. Measuring agents tells you what works.

A model that dominates benchmarks can still build an agent that fails catastrophically in production. Claude Opus 4.5 scores 80.9% on SWE-bench Verified—but that number doesn't predict whether your coding agent will complete tickets, handle edge cases, or recover from API failures.

Agent evaluation measures performance in action: multi-step task completion, tool orchestration, error recovery, and goal achievement. The shift from evaluating isolated intelligence to evaluating interactive systems requires new frameworks, metrics, and infrastructure.

Projects Cancelled

40%

By 2027 (Gartner)

Hallucination Rate

0.2%

Harvey AI benchmark

Cost Reduction

94.2%

With model cascading

The Five-Level Maturity Model

Most organizations are stuck at Level 0 or 1. World-class teams treat evals as core infrastructure, progressing through five distinct stages:

Level 5: Self-Improving Evals — Production failures automatically generate regression tests. LLM-generated adversarial test cases. Automatic threshold tuning based on production metrics. Harvey AI achieved their 0.2% hallucination rate through continuous eval cycles.

Multi-Dimensional Metrics Framework

A single accuracy score is insufficient. Ship/no-ship decisions require a balanced scorecard:

Task Completion Rate (TCR)

The north star metric. Percentage of end-to-end tasks successfully completed according to predefined success criteria.

Measurement: For coding agents, does the linter pass? For booking agents, was the calendar invite sent? Use database state comparison for verification.

30-70%

Complex Benchmarks

WebArena, AgentBench

90%+

Production Target

Critical workflows

Hallucination Rate

Percentage of outputs containing unsupported or fabricated claims.

Advanced measurement:

Claim decomposition: Breaking outputs into atomic claims
Source verification: Checking each claim against provided context
Confidence scoring: Identifying uncertain statements

Hallucination Thresholds by Domain

Feature	Domain	Acceptable RatePopular	Notes
Legal / Medical	High-stakes	Under 1%	Zero tolerance
Customer Service	Medium-stakes	2-5%	May be tolerable
Creative	Low-stakes	Higher OK	Context dependent

Latency Metrics

Track percentiles, not averages. P95 and P99 reveal the experience of tail users.

Use Case	Target Latency
Real-time chat	Under 2s
Background workflows	10-30s acceptable
Batch processing	Minutes to hours

Cost Per Task

Total expenditure (tokens + API calls + infrastructure) for one successful task completion.

Why it matters: Prevents "infinite loops" where agents spin in reasoning steps without solving problems. Cost regressions directly affect unit economics.

Safety Metrics

Jailbreak resistance: 2-10% attack success rate for production systems
PII leakage detection: Automated scanning + compliance validation
Harmful output rate: Multi-dimensional bias evaluation

Test Suite Architecture: The Pyramid

Don't rely solely on expensive end-to-end tests. Build a pyramid of verification:

Test Distribution (Recommended)

Unit Tests60

Integration Tests25

Regression Tests10

Adversarial Tests5

Unit Tests (Fast & Cheap)

Scope: Single-turn evaluation of specific capabilities

Example: "Summarize this contract clause" → validate summary accuracy against ground truth

When to use: Prompt engineering iterations, model comparison, regression testing for specific capabilities

Integration Tests (Multi-Turn Workflows)

Scope: End-to-end task execution across multiple steps

Technique: Trajectory testing—record full agent execution trace (tool calls, reasoning, outputs). Assert on intermediate states and final outcomes.

Regression Tests (Golden Datasets)

Purpose: Detect when changes break previously working functionality

Execution: Run on every code/prompt change. Block merges if regression rate exceeds threshold (typically 5%).

Adversarial Tests (Red Team)

Categories:

Prompt injection attacks: goal hijacking, instruction override, data exfiltration
Jailbreak attempts: safety guardrail bypass, multi-turn manipulation
Edge case generation: input validation stress tests, context overflow, malformed data

Tools: Promptfoo for automated red team configuration, AgentDojo for prompt injection testing

Evaluation Methods: The Hybrid Approach

Method	Best For	Pros	Cons
Rule-Based	JSON formatting, tool arguments, syntax	Fast, deterministic, free	Brittle, misses nuance
LLM-as-Judge	Tone, helpfulness, reasoning quality	Flexible, scales infinitely	Expensive, position bias
Human Review	Gold standard creation, edge cases	Highest quality	Slow, expensive, unscalable
Hybrid	Production constraints	Best of both worlds	Complex to set up

LLM-as-Judge Best Practices

Well-designed LLM judges achieve 74-82% agreement with human evaluators, but introduce biases:

Self-evaluation bias: Arize research found OpenAI models favor OpenAI outputs (+9.4%), Anthropic shows similar patterns. Use different models for agents and judges.

Mitigation strategies:

Use temperature=0 for evaluation models
Structured prompts with explicit rubrics
Multiple judges with aggregation
Calibration sets to validate alignment with humans

The Tiered Judge Strategy

Use a small, cheap model (Haiku 4.5 at $0.80/M input) as first-pass gatekeeper. Only send difficult or ambiguous cases to Opus 4.5 ($5.00/M input) for final ruling.

Golden Dataset Strategy

Data Sources

Dataset Source Comparison

Feature	Synthetic	ProductionPopular	LLM-Generated
Quality	High (known truth)	High (realistic)	Variable
Coverage	Narrow	Broad	Broad
Cost	High (manual)	Medium	Low
Scalability
Real-world patterns

Recommended Composition

Golden Dataset Mix

RepresentativeEdge CasesAdversarial

For generating synthetic evaluation datasets from proprietary data at scale, platforms like Databricks Agent Bricks automate this process—using the enterprise's own data to generate realistic "questions and answers" for testing agents. This is particularly valuable when production data contains sensitive information that can't be used directly for evaluation. The complete evaluation infrastructure stack is detailed in Databricks Foundation.

Size Recommendations

Minimum Viable

100+

Basic validation

Robust Evaluation

500-1K

Production systems

Comprehensive

2,000+

Exhaustive testing

Organizational Playbooks

The PM-Eng-QA Triad

Role	Responsibility	Action
Product Manager	Accountable	Defines success criteria. Writes golden examples.
AI Engineer	Responsible	Implements eval infrastructure. Connects CI/CD.
QA / SRE	Consulted	Monitors production drift. Alerts on divergence.

Critical Process: Weekly Eval Review

Key outcome: Every production failure becomes a regression test. This prevents the same bug from happening twice.

Agenda:

Review pass rates and trends over past week
Analyze new failures and categorize by root cause
Discuss production incidents and eval coverage gaps
Update test dataset based on findings
Adjust thresholds or success criteria if needed

The Tool Landscape

LangSmith (The Observability Giant)

Best for: Teams deep in LangChain ecosystem needing full trace visibility

Dataset management and version control
"Queue" mode for human review
Native GitHub Actions integration to block PRs on regression
Production trace logging and debugging

Braintrust (The Enterprise Standard)

Best for: Large enterprises requiring governance and on-prem deployment

Evaluation of prompts independent of code
Git-like workflow for version control
Framework-agnostic (not tied to LangChain)
Pricing: $249/month for Pro, Custom for Enterprise

Promptfoo (The Hacker's Choice)

Best for: Security teams and engineers who love the CLI

Open-source, CLI-first tool
Adversarial testing focus with red team capabilities
Local execution (no data leaves your machine)
Simple YAML-based configuration

The Six-Month Roadmap

Month 1

Foundation

100 core test cases, automated scoring, basic dashboard. 1-2 engineers full-time.

Months 2-3Milestone

Automation

CI/CD integration, pass/fail gates blocking deployments, Slack notifications.

Months 4-6Milestone

Production Parity

Golden datasets from production, adversarial testing, A/B testing infrastructure.

Month 7+Milestone

Continuous Optimization

Auto-generated regression tests, drift detection, LLM adversarial generation.

Investment Scaling

Team Size	Eval Investment
Small (under 10)	10-20% time on evals
Medium (10-50)	1-2 dedicated eval engineers
Large (50+)	Dedicated eval team (3-5 people)

Common Pitfalls and Solutions

Problem: Optimizing for eval metrics leads to degraded real-world performance

Solutions:

Hold-out test sets never used for development
Production A/B testing as ultimate validation
Regular refresh of eval datasets to prevent memorization

Problem: Evals only cover happy path, miss edge cases and failure modes

Solutions:

Dedicated adversarial testing phase
Production sampling to capture real diversity
Automated edge case generation using LLMs

Problem: Evals use synthetic data that doesn't reflect real usage

Solutions:

Golden datasets sourced from production snapshots
Continuous monitoring to detect distribution drift
A/B testing to validate eval predictions

Problem: LLM-as-judge costs spiral with comprehensive evaluation

Solutions:

Hybrid scoring: LLM for nuance, rules for determinism
Model cascading: cheap models first, expensive for hard cases
Batching and caching of evaluation calls

Problem: LLM-as-judge gives different scores for same output

Solutions:

Use temperature=0 for evaluation models
Structured prompts with explicit rubrics
Multiple evaluation runs with aggregation

The Bottom Line

40% of agentic AI projects will be cancelled by 2027. The difference between success and failure often comes down to evaluation rigor.

Model benchmarks measure potential. Agent evals measure reality. And crucially, verification is domain-specific. Legal has citations. Code has tests. Healthcare has clinical standards. General knowledge work has... human judgment. The verticals that define what "good" looks like will capture the value—those that can't verify objectively will always need humans in the loop.

The shift from evaluating isolated intelligence to evaluating interactive systems requires:

Progressive maturity: Moving through 5 clear stages from manual testing to self-improving evals
Multi-dimensional measurement: Tracking task completion, hallucination, latency, cost, and safety simultaneously
Comprehensive test coverage: Unit, integration, regression, and adversarial tests working together
Hybrid evaluation methods: Combining LLM-as-judge, rule-based checks, and human validation
Production-driven datasets: Golden datasets that evolve with real user behavior

Target Detection Rate

95%+

Issues caught before production

To World-Class

6 months

Following this roadmap

Best Practices7 min

You're Monitoring Agents Like APIs. That's Why They Fail Silently.

Agents don't fail like software. They fail like employees—doing technically correct work that produces wrong outcomes. The observability stack that catches behavioral failures, not just operational ones.

Read

Best Practices8 min

Agent Economics: The Unit Economics of Autonomous Work

Stop measuring cost per token. The metric that matters is Cost Per Completed Task. Here is the framework for measuring, optimizing, and governing the economics of AI agents.

Read

The Paradigm Shift

The Five-Level Maturity Model

Multi-Dimensional Metrics Framework

Task Completion Rate (TCR)

Hallucination Rate

Hallucination Thresholds by Domain

Latency Metrics

Cost Per Task

Safety Metrics

Test Suite Architecture: The Pyramid

Test Distribution (Recommended)

Unit Tests (Fast & Cheap)

Integration Tests (Multi-Turn Workflows)

Regression Tests (Golden Datasets)

Adversarial Tests (Red Team)

Evaluation Methods: The Hybrid Approach

LLM-as-Judge Best Practices

The Tiered Judge Strategy

Golden Dataset Strategy

Data Sources

Dataset Source Comparison

Recommended Composition

Golden Dataset Mix

Size Recommendations

Organizational Playbooks

The PM-Eng-QA Triad

Critical Process: Weekly Eval Review

The Tool Landscape

LangSmith (The Observability Giant)

Braintrust (The Enterprise Standard)

Promptfoo (The Hacker's Choice)

The Six-Month Roadmap

Foundation

Automation

Production Parity

Continuous Optimization

Investment Scaling

Common Pitfalls and Solutions

The Bottom Line

You're Monitoring Agents Like APIs. That's Why They Fail Silently.

Agent Economics: The Unit Economics of Autonomous Work

Related

Ask a follow-up