
The Architect's Guide to Engineering Claude Code Skills

A comprehensive manual for process engineering, context economics, and agent specialization. Learn how to transform Claude Code from a generalist into a specialized agent through modular skills.

MMNTM Team
8 min read
#claude-code #agent-engineering #prompt-engineering #skills #workflows

Introduction: The Shift from Prompting to Engineering

Out of the box, Claude Code is a high-competence generalist. It can write Python, debug JavaScript, and draft emails. However, enterprise workflows require specialists: a Senior Architect who knows your legacy stack, a Compliance Officer who knows GDPR by heart, or a Product Owner who adheres to a strict prioritization framework.

Skills are the mechanism that transforms Claude from a generic assistant into a specialized agent. They are modular "onboarding packets" containing procedural knowledge, approved workflows, and custom tools.

Scope: This guide assumes familiarity with the basic mechanics of Claude Code. It focuses on architectural patterns for production-grade agent behaviors. For essential workflows and context management fundamentals, see The Claude Code Superuser Guide.


Part 1: The Canonical Architecture

A skill is not just a text prompt; it is a directory structure designed for Context Economics. It optimizes the trade-off between agent capability and token consumption using Progressive Disclosure.

1. The Directory Structure

Do not dump everything into a single text file. Adhere to the Hub-and-Spoke model to prevent context flooding.

skill-name/
├── SKILL.md (The Hub)
│   ├── YAML Frontmatter (Router Configuration)
│   └── Markdown Body (Logic, State Machine, Iron Laws)
└── Bundled Resources (The Spokes)
    ├── scripts/      # Deterministic Tools (Python/Bash) - Black Boxes
    ├── references/   # Interface Documentation (APIs, Schemas) - On Demand
    └── assets/       # Output Artifacts (Templates, Fonts)

2. The Ecosystem: CLAUDE.md vs. Skills

  • CLAUDE.md (The Map): Lives in the project root. It provides high-level context (tech stack, testing commands) and points to the skills. "For database changes, see the postgres-schema skill."
  • Skills (The Territory): Live in .claude/skills. They contain the deep, task-specific procedures.
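
For illustration, a minimal CLAUDE.md acting as the map might look like this (the project details are hypothetical; only the pointer pattern matters):

# Project: Billing Service
Stack: TypeScript, PostgreSQL, Jest. Run tests with `npm test`.

## Skills
- Database changes: see the postgres-schema skill (.claude/skills/postgres-schema/)
- Bugs and test failures: see the systematic-debugging skill

The file stays short; the deep procedures live in the skills it points to.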

Part 2: The Reference Implementation (SKILL.md)

Use this skeleton to encode the "Iron Law," "State Machine," and "Consultative" patterns.

Frontmatter (YAML):

---
name: systematic-debugging
description: Debugs failing tests or unexpected behavior by finding root causes before proposing fixes. Use for any bug, test failure, or unexplained output.
version: 1.0.0
---

Content Structure:

The Iron Law

<iron_law>
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST.
If you haven't completed the investigation workflow (Phase 3), you cannot propose fixes.
Violating the letter of this rule is violating the spirit of this rule.
</iron_law>

Phase 1: The Announce Protocol

Announce at start: "I am using the Systematic Debugging skill. My governing rule is: No fixes without root cause."

Phase 2: Consultative Interface (The Interview Gate)

Do not execute immediately. If the user provided vague symptoms ("it's broken"), ask the following clarifying questions first:

  1. "How do I reproduce this?"
  2. "What was the last working state?"
  3. "Are there recent relevant commits?"

Phase 3: The Workflow (State Machine)

digraph logic {
  "Start" -> "Inventory Current State";
  "Inventory Current State" -> "Form Hypothesis";
  "Form Hypothesis" -> "Create Minimal Reproduction";
  "Create Minimal Reproduction" -> "Implement Fix" [label="Test fails (bug reproduced)"];
  "Implement Fix" -> "Verify Fix";
  "Create Minimal Reproduction" -> "Refine Hypothesis" [label="Test passes (bug not reproduced)"];
  "Refine Hypothesis" -> "Create Minimal Reproduction";
}

Phase 4: Context Strategy

  • Tools: Run scripts/analyze_logs.py --help. DO NOT read the script source.
  • Reference: For error codes, read references/error_dictionary.md.

Phase 5: The Escape Hatch

If the user explicitly requests a "hotfix" or uses the override phrase "Emergency Override", you may bypass the Iron Law, but you must output a warning: "⚠️ BYPASSING ROOT CAUSE ANALYSIS."


Part 3: Phrasing & Psychological Control

Writing prompts for agents is Adversarial Engineering. You must anticipate the model's bias toward agreeableness, laziness, and speed, and engineer against it.

1. The Trigger Description

The description field in the frontmatter is the API for your skill.

  • Rule: Describe WHAT it does concisely, and WHEN to use it specifically (symptoms, intent).
  • The Trap: Never summarize the steps of the workflow here. If you do, Claude might read the summary and try to "wing it" without loading the full SKILL.md file.
  • Good: "Debugs failing tests. Use when encountering any bug, test failure, or unexpected behavior."

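For contrast, here is a hypothetical description that falls into this trap:

---
name: systematic-debugging
description: Debugs by inventorying the current state, forming a hypothesis, creating a minimal reproduction, and then verifying the fix.
---

The trigger now leaks the workflow. Claude can skim those four steps from the frontmatter and improvise them without ever loading the body of SKILL.md.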

2. XML Structural Anchoring

LLMs adhere to instructions better when they are compartmentalized. Use custom XML tags to isolate critical logic from general text.

  • Tags: <iron_law>, <execution_steps>, <output_format>.
  • Why: Claude is trained to parse XML data. Structuring your prompt as data increases adherence to complex multi-step rules.
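
As a sketch, a skill body compartmentalized this way might look like the following (the tag names are conventions you choose, not a fixed schema):

<execution_steps>
1. Inventory the current state (read-only).
2. Form one falsifiable hypothesis.
3. Build a minimal reproduction before proposing any fix.
</execution_steps>

<output_format>
Report four fields, one line each: HYPOTHESIS, EVIDENCE, FIX, VERIFICATION.
</output_format>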

3. Tone Policing (The Anti-Sycophancy Protocol)

Default models are trained to be polite ("I hope this finds you well!"). In a CLI environment, this is noise.

  • Instruction: "Adopt a Senior Engineer persona. Be terse. Do not offer performative agreement (e.g., 'You're absolutely right!'). Do not apologize for errors; simply correct them."

4. Rationalization Pre-buttals

Anticipate the AI's tendency to cut corners and forbid it explicitly.

  • Example: "You will want to say 'I fixed it' without running the tests because you are confident. This is wrong. Stop. Run the verification command."

Part 4: Context Economics & The "Read" Strategy

The most expensive resource an agent has is attention (context window). Explicitly manage what the agent reads.

For more on context management and the "Compact Before Auto-Compact" pattern, see The Claude Code Superuser Guide.

1. The "Black Box" Pattern (Scripts)

Scenario: Deterministic logic (math, file parsing, heavy regex).

Rule: Prevent the LLM from reading the source code.

Instruction: "DO NOT read scripts/analyze.py. Treat it as a binary. Run with --help to see usage."

Why: Reading a 500-line script wastes tokens and encourages the model to try "fixing" the tool rather than using it.
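
To make the pattern concrete, here is a minimal sketch of what scripts/analyze.py might contain (the script and its flags are illustrative, not a real tool). The agent never sees this source; it only sees the --help output, a few dozen tokens instead of the full file:

#!/usr/bin/env python3
"""Count ERROR lines per module in a log file. Agents run this; they never read it."""
import argparse
import collections
import re

def main():
    parser = argparse.ArgumentParser(
        description="Summarize error frequencies in a log file."
    )
    parser.add_argument("logfile", help="Path to the log file to analyze")
    parser.add_argument("--top", type=int, default=10,
                        help="Show only the N most frequent modules")
    args = parser.parse_args()

    counts = collections.Counter()
    with open(args.logfile) as f:
        for line in f:
            # Matches lines like: 2024-05-01 ERROR [billing.retry] timeout
            match = re.search(r"ERROR\s+\[([\w.]+)\]", line)
            if match:
                counts[match.group(1)] += 1

    for module, count in counts.most_common(args.top):
        print(f"{count}\t{module}")

if __name__ == "__main__":
    main()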

2. The "Interface" Pattern (References)

Scenario: The agent needs to use an internal library or DSL.

Rule: Provide a condensed interface definition, not the implementation.

Implementation: Create references/my-dsl-docs.md containing only signatures and usage examples.

Why: The agent needs to know how to call the function, not how the function works.
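
A sketch of references/my-dsl-docs.md under this rule (the DSL and its functions are hypothetical) would hold only signatures and one usage line apiece:

## Pipeline DSL quick reference

load_table(name: str) -> Table
    orders = load_table("orders")

filter_rows(table: Table, predicate: str) -> Table
    recent = filter_rows(orders, "created_at > '2024-01-01'")

write_report(table: Table, path: str) -> None
    write_report(recent, "out/report.csv")

Thirty lines of interface can stand in for thousands of lines of implementation.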

3. The "Inventory-First" Pattern

Scenario: Modifying a complex existing system (PPTX decks, codebases).

Rule: Force a read-only inventory step before generation.

Instruction: "Before writing code, run grep or ls -R to map the directory structure. Output a JSON summary of the file hierarchy."

Why: Prevents "blind edits" (hallucinating file paths) by forcing the agent to build a mental map first.
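
A minimal Python sketch of such an inventory step (the skip list is an assumption; tune it to your repo):

#!/usr/bin/env python3
"""Emit a JSON map of the file hierarchy so the agent surveys before it edits."""
import json
import os

def inventory(root: str) -> dict:
    tree = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune dependency and VCS directories that would flood the context window.
        dirnames[:] = [d for d in dirnames
                       if d not in {".git", "node_modules", "__pycache__"}]
        tree[os.path.relpath(dirpath, root)] = sorted(filenames)
    return tree

if __name__ == "__main__":
    print(json.dumps(inventory("."), indent=2))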


Part 5: The Four Archetypes & Evals

Successful skills generally fall into one of four archetypes. Use these Evaluation Criteria (Evals) to test them.

Type A: The Specialist (Strategic)

Embeds high-level frameworks a raw LLM lacks.

  • Example: product-manager-toolkit (RICE prioritization).
  • Eval: Give the agent a list of features. Does it ask for "Reach" and "Effort" data before ranking them? Or does it just guess?

Type B: The Disciplinarian (Guardrail)

Enforces process rigor.

  • Example: verification-before-completion.
  • Eval: Ask the agent to "fix the bug" without providing a reproduction case. Does the agent refuse and demand the repro case (Pass), or does it attempt a fix immediately (Fail)?

Type C: The Operator (Tool Expert)

Overcomes outdated training data regarding specific tools.

  • Example: k8s-deploy-pro.
  • Eval: Ask the agent to deploy to a cluster. Does it construct a valid kubectl command based on references/CLI.md, or does it hallucinate flags from an older version?
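
A references/CLI.md for such a skill might pin the exact invocations the agent may use (the deployment name and file paths are placeholders):

## Approved kubectl commands

Apply manifests:
    kubectl apply -f k8s/deployment.yaml

Wait for the rollout to finish (fails loudly on timeout):
    kubectl rollout status deployment/my-app --timeout=120s

Never run kubectl delete without explicit user confirmation.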

Type D: The Archivist (Organizational Memory)

Ensures consistency with company standards.

  • Example: brand-guidelines.
  • Eval: Ask the agent to create a slide. Does it use the specific Hex code from assets/palette.json, or does it pick a generic "Blue"?

Part 6: The Development Lifecycle (TDD for Skills)

Treat prompts as code. Use Test-Driven Development (TDD) for your skills.

  1. Red (The Baseline): Run a complex task without the skill. Record the failure (e.g., "The agent skipped the tests").
  2. Green (The Fix): Write a SKILL.md that addresses that specific failure (e.g., "Verification Gate: Run npm test").
  3. Refactor (Optimization): Run the task again. If the agent finds a loophole ("I ran the tests but ignored the failure"), tighten the rule.
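
A minimal Red-step harness sketch, assuming the claude CLI's -p print mode and the systematic-debugging skill from Part 2 (adapt the prompt and assertions to the failure you recorded):

#!/usr/bin/env python3
"""Check: does the agent announce the skill and demand a repro instead of guessing a fix?"""
import subprocess

PROMPT = "The checkout test is broken, just fix it."

# Assumes non-interactive print mode; swap in however you invoke your agent.
result = subprocess.run(["claude", "-p", PROMPT],
                        capture_output=True, text=True, timeout=300)
transcript = result.stdout.lower()

checks = {
    "announced the skill": "systematic debugging" in transcript,
    "asked how to reproduce": "reproduce" in transcript,
}
for name, passed in checks.items():
    print(f"{'PASS' if passed else 'FAIL'}: {name}")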



Quick Reference: The Audit Checklist

Before deploying a skill, audit it against these criteria:

  1. Trigger-Based Description: Does it describe when to use it?
  2. The Iron Law: Is there a non-negotiable constraint wrapped in <iron_law>?
  3. Consultative Interface: Does it ask questions before executing ambiguous tasks?
  4. Black Box vs. Interface: Are scripts treated as binaries? Are docs separated into references/?
  5. Rationalization Check: Does it explicitly forbid the "lazy" path?
  6. Tone Police: Does it strip away "customer service" language?
  7. Visual Logic: Are complex flows rendered in dot code blocks?

Key Takeaways

Skills are modular "onboarding packets" that transform Claude Code from a generalist into a specialized agent through:

  • Hub-and-Spoke Architecture - Optimized for context economics
  • Iron Laws - Non-negotiable constraints that enforce process rigor
  • Consultative Interfaces - Ask questions before executing ambiguous tasks
  • Context Strategies - Black box scripts, interface-only docs, inventory-first patterns
  • TDD for Prompts - Iteratively evolve skills based on real failures

By treating skills as engineered artifacts rather than informal prompts, you can build reliable, production-grade agent behaviors that scale across your organization.


Ready to build your first skill? Start with the canonical template and audit checklist above.
