Technical Deep Dive

The Contamination Problem: Why Production Agents Silently Degrade

Production AI has two contamination problems: memory contamination and role contamination. Both produce the same failure: the agent works but is subtly worse. This post traces the design principle that connects OpenClaw's memory architecture to gstack's cognitive gear-shifting.

Greg Salwitz
11 min read
Tags: Agent Architecture, Contamination, gstack, OpenClaw, Production AI, Evals

When a production agent starts performing worse, the first instinct is to blame the model. The prompt got too long. The context window filled up. The model version changed. These explanations are sometimes correct. More often, the degradation is contamination — and contamination has two forms that look nothing alike but produce the same failure mode: the agent works, it's subtly worse, and you can't tell why.


Two Kinds of Contamination

Memory contamination is what happens when accumulated context degrades decision quality. OpenClaw's accumulation model stores everything — daily logs, vector indexes, 400-token chunks with 80-token overlap. The theory: total recall is worth the cost. The failure: as context grows, the agent crosses a threshold and silently forgets instructions. It doesn't crash. It continues with confidence. The signal drowns in noise, and the noise is invisible because it looks like context.
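The accumulation model's ingestion step can be sketched in a few lines. The 400/80 figures come from the post; the word-level "tokens" are a stand-in for a real tokenizer:

```python
# Sketch of OpenClaw-style chunking: fixed-size chunks with overlap,
# so boundary context is never lost between adjacent chunks.

def chunk(tokens, size=400, overlap=80):
    """Split a token list into overlapping fixed-size chunks."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = [f"t{i}" for i in range(1000)]
chunks = chunk(tokens)
# Adjacent chunks share their 80 boundary tokens.
assert chunks[0][-80:] == chunks[1][:80]
```

The overlap is exactly why accumulation is noisy: every stored token past the first chunk boundary exists in two places, and everything is stored forever.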

Role contamination is what happens when an agent shifts between cognitive modes in the same session. gstack identified this pattern precisely: an agent that helped design something is a worse auditor of that thing. The planner doesn't challenge assumptions aggressively enough because it's also thinking about implementation. The reviewer doesn't find bugs rigorously enough because it already has emotional investment in the code. This isn't a prompt engineering quirk — it's a structural property of context. The agent's prior work contaminates its current judgment.

From the gstack README: "Planning is not review. Review is not shipping. Founder taste is not engineering rigor. If you blur all of that together, you usually get a mediocre blend of all four." The same principle applies to memory: accumulation is not curation. Curation is not infrastructure optimization. Blur them and you get a mediocre blend.

Both contamination types produce the same diagnostic puzzle: the agent appears to function normally. Its outputs are syntactically correct. It follows instructions. The degradation shows up as a quality drop that looks like the model having an off day — except it's systematic and it compounds.


Decontamination Strategies

Every memory architecture and every agent workflow framework is, at root, a decontamination strategy. The strategy chosen reveals the theory of what kind of contamination matters most.

Contamination Types and Decontamination Strategies

| Feature | Contamination Type | Strategy | Mechanism | When Strategy Fails |
|---|---|---|---|---|
| Memory (accumulation) | Noise accumulates in context | OpenClaw: store everything, search to find signal | Hybrid search (70/30 vector/BM25), over-fetch 4×, pre-compaction flush | Context cliff — agent silently forgets instructions |
| Memory (curation) | Signal-to-noise ratio degrades | Hermes: keep only what matters | 3,575-char hard cap, frozen at session start | Consolidation wall — agent stops learning when memory fills |
| Memory (infrastructure) | Cache structure mismanaged | Claude Code: platform manages memory | Prompt caching, static → project → conversation layering | Cache miss — economic failure, cost spikes |
| Role (specialization) | Cognitive modes bleed together | gstack: separate agents by role with explicit refusals | 13 skills across 5 domains, each with persona + mandate + what it refuses to do | Contamination when skills are composed incorrectly |
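The accumulation row's retrieval mechanism can be sketched directly. The 70/30 blend and 4× over-fetch are from the table; the toy scores and the assumption that both score scales are already normalized are mine:

```python
# Minimal sketch of a 70/30 vector/BM25 hybrid rerank with 4x over-fetch:
# pull a larger candidate pool by vector similarity, then rerank the pool
# with the blended score and keep the top k.

def hybrid_rank(vector_scores, bm25_scores, k=2, overfetch=4):
    """Rerank an over-fetched candidate pool by a 0.7/0.3 blend."""
    pool = sorted(vector_scores, key=vector_scores.get,
                  reverse=True)[:k * overfetch]
    blended = {doc: 0.7 * vector_scores[doc] + 0.3 * bm25_scores.get(doc, 0.0)
               for doc in pool}
    return sorted(blended, key=blended.get, reverse=True)[:k]

vec = {"a": 0.9, "b": 0.8, "c": 0.4, "d": 0.1}
bm25 = {"a": 0.0, "b": 0.9, "c": 0.95, "d": 0.2}
assert hybrid_rank(vec, bm25) == ["b", "a"]
```

Note what the over-fetch buys: "b" loses on pure vector similarity but wins after the keyword signal is blended in. That is the accumulation model's bet — search, not curation, finds the signal.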

What gstack Gets Right

gstack's 13 skills across five domains each have a persona, a mandate, and — crucially — an explicit list of things they refuse to do. /plan-ceo-review questions whether the feature should exist at all. /ship executes the release without re-litigating those questions. /plan-design-review audits the design but never touches code. The constraints are the feature.
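A skill built around persona, mandate, and refusals is ultimately a small data structure. This is a hypothetical sketch, not gstack's actual schema — the field names and the /ship contents are illustrative:

```python
# Hypothetical sketch of a gstack-style skill definition: the refusal
# list is first-class, not an afterthought in the prompt.

from dataclasses import dataclass

@dataclass(frozen=True)
class Skill:
    name: str
    persona: str
    mandate: str
    refuses: tuple  # actions this skill must decline, by design

ship = Skill(
    name="/ship",
    persona="release engineer",
    mandate="execute the release checklist",
    refuses=("re-litigate the plan", "redesign the feature", "write new code"),
)

def allowed(skill, action):
    """A composed workflow checks refusals before routing an action."""
    return action not in skill.refuses

assert not allowed(ship, "re-litigate the plan")
assert allowed(ship, "tag the release")
```

Making the refusal list machine-checkable is the point: a composition layer can reject a misrouted request before it ever reaches the model, instead of hoping the prompt holds.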

This is the same insight as Hermes' 3,575-character cap, but applied to role instead of memory: the value comes from what is excluded, not from what is included. A memory system that stores everything is vulnerable to context cliffs. An agent that does everything is vulnerable to role contamination. Both failures stem from the same root: insufficient separation.
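The exclusion principle on the memory side can be sketched as well. The 3,575-character figure is from the post; the priority-order admission logic is an illustrative guess at how such a cap might be enforced:

```python
# Sketch of a Hermes-style curated memory: a hard character cap,
# frozen at session start. Entries are admitted in priority order;
# everything past the cap is simply excluded.

CAP = 3_575

def freeze_memory(entries, cap=CAP):
    """Admit entries until the cap would be exceeded; drop the rest."""
    kept, used = [], 0
    for text in entries:
        if used + len(text) > cap:
            break  # exclusion is the feature
        kept.append(text)
        used += len(text)
    return "\n".join(kept)

mem = freeze_memory(["a" * 2000, "b" * 1500, "c" * 500])
assert len(mem) <= CAP
assert "c" not in mem
```

The cap is a blunt instrument on purpose: a memory that cannot grow cannot drift toward the context cliff, at the cost of hitting the consolidation wall instead.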

The gstack testing architecture makes this concrete. Their four-tier pyramid addresses what they call "invisible couplings" — the fact that changing any text in a skill prompt can change agent behavior in ways that conventional pass/fail tests can't detect. The solution: LLM-as-judge evaluation that scores documentation quality across three dimensions, eval persistence that saves every run for comparison, and a blame protocol that requires proof before calling a failure "pre-existing."
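A minimal shape for judge-plus-persistence looks like this. The `judge()` stub stands in for a real model call, and the three dimension names are illustrative, not gstack's actual rubric:

```python
# Sketch of LLM-as-judge evaluation with run persistence: score each
# output on a multi-dimension rubric, save every run, and flag any
# dimension that drops below the baseline run.

import time

DIMENSIONS = ("accuracy", "completeness", "clarity")

def judge(output: str) -> dict:
    # Stub: a real judge prompts an LLM with a rubric and parses scores.
    return {dim: 0.8 for dim in DIMENSIONS}

def run_eval(output: str, history: list) -> dict:
    run = {"ts": time.time(), "scores": judge(output)}
    history.append(run)  # persist every run for later comparison
    return run

def regressed(history: list, tolerance: float = 0.05) -> list:
    """Dimensions where the latest run fell below the first (baseline) run."""
    base, latest = history[0]["scores"], history[-1]["scores"]
    return [d for d in DIMENSIONS if latest[d] < base[d] - tolerance]

history = []
run_eval("baseline docs", history)
run_eval("edited docs", history)
assert regressed(history) == []
```

The persistence is what makes the blame protocol possible: "pre-existing" is a claim you can only prove against a saved prior run.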

The invisible coupling insight generalizes. Text changes in any agent system — prompt modifications, memory updates, context adjustments — have behavioral consequences you can't predict from the text alone. The eval architecture needed to detect these regressions is the same whether the contamination is in memory or in role.


Diagnosing Contamination in Production

Signs of memory contamination:

  • Agent performance degrades over time within a session (context cliff approaching)
  • Agent makes contradictory statements that reference stale context
  • The same prompt produces worse results on a long-running agent than a fresh one
  • Agent confidently follows instructions that were superseded earlier in the session

Signs of role contamination:

  • Agent planning and reviewing in the same session produces mediocre output at both
  • Code review quality drops when the agent also wrote the code
  • Agent hedges on decisions it should be decisive about (because it's also thinking about execution)
  • Quality varies by task order within a session, not by task difficulty

The diagnostic test: Run the same task on a fresh agent with no accumulated context and no prior role in the same session. If quality improves significantly, you have contamination. If it doesn't, you have a different problem (model capability, prompt quality, task difficulty). The fresh-agent baseline is to contamination what a control group is to an experiment.
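The diagnostic reduces to a two-arm comparison. `run` and the quality scores below are hypothetical stand-ins for your agent harness and whatever eval metric you already track:

```python
# Sketch of the fresh-agent baseline test: run the same task on the
# incumbent (accumulated context) and on a fresh agent, and compare.

def diagnose(task, incumbent, fresh, score, margin=0.1):
    """Return 'contamination' if the fresh agent clearly wins."""
    incumbent_quality = score(incumbent.run(task))
    fresh_quality = score(fresh.run(task))
    if fresh_quality - incumbent_quality > margin:
        return "contamination"
    return "other"  # model capability, prompt quality, task difficulty

class StubAgent:
    def __init__(self, quality):
        self.quality = quality
    def run(self, task):
        return self.quality

assert diagnose("review PR", StubAgent(0.6), StubAgent(0.9), score=float) == "contamination"
assert diagnose("review PR", StubAgent(0.85), StubAgent(0.9), score=float) == "other"
```

The margin matters: a fresh agent will often score marginally better by chance, so only a clear gap should be read as contamination.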


The Unified Principle

The design principle that connects OpenClaw's memory architecture to gstack's cognitive gear-shifting is the same: separation is not about adding capabilities — it's about removing contamination.

OpenClaw's pre-compaction flush saves memories before context overflows — it's a decontamination mechanism for accumulated noise. Hermes' 3,575-character cap prevents noise from accumulating in the first place. gstack's role separation prevents cognitive modes from contaminating each other. Claude Code's prompt caching layering prevents shared context from being contaminated by session-specific context.

Every architecture decision in production AI is, at root, a decontamination decision. The failure mode you're managing determines which contamination type you're defending against. The strategy you choose determines what you're willing to sacrifice to achieve separation.


Patterns Worth Stealing

Separate by role, not just by task. Multiple agents with distinct mandates and explicit refusals compose better than one agent doing everything. The refusal is as important as the capability — /ship that can't be talked out of shipping is more useful than a flexible agent that re-litigates the plan.

Test for invisible couplings. Any text change in an agent system — prompt, memory, context — can change behavior. Conventional pass/fail tests miss this. LLM-as-judge evaluation with persistence and run-to-run comparison catches regressions that binary tests cannot.

The fresh-agent diagnostic. When quality degrades, test whether a fresh agent (no accumulated context, no prior role) produces better results. If it does, your problem is contamination, not capability.

Eval persistence is the memory system for your testing. Save every eval run. Compare to baselines. Track efficiency trends (tool call counts, iteration counts). The eval database is to agent quality what memory is to agent context — without it, you're starting from zero every time.
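The efficiency-trend part of this pattern is easy to sketch. The record shape and field names are illustrative; the principle is that quality and cost metrics are persisted side by side:

```python
# Sketch of efficiency-trend tracking in a persisted eval store:
# record tool-call and iteration counts alongside quality so that
# drift shows up run-to-run, not just pass/fail.

def record(store, run_id, quality, tool_calls, iterations):
    store[run_id] = {"quality": quality,
                     "tool_calls": tool_calls,
                     "iterations": iterations}

def efficiency_trend(store, metric):
    """Metric values in run order. Rising tool_calls with flat quality is a smell."""
    return [store[r][metric] for r in sorted(store)]

store = {}
record(store, "run-001", quality=0.82, tool_calls=14, iterations=3)
record(store, "run-002", quality=0.81, tool_calls=22, iterations=5)
assert efficiency_trend(store, "tool_calls") == [14, 22]
```

A dict stands in for the eval database here; the same shape maps directly onto a SQLite table keyed by run ID.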


See also: The Memory Model Is Your Failure Mode for the three memory architectures and their failure modes, How OpenClaw Implements Agent Memory for the code-level walkthrough of decontamination mechanisms, and Building Agent Evals for the eval construction patterns that detect invisible couplings.

Greg Salwitz · Apr 6, 2026