Introduction: The Shift from Prompting to Engineering
Out of the box, Claude Code is a high-competence generalist. It can write Python, debug JavaScript, and draft emails. However, enterprise workflows require specialists: a Senior Architect who knows your legacy stack, a Compliance Officer who knows GDPR by heart, or a Product Owner who adheres to strict prioritization.
Skills are the mechanism that transforms Claude from a generic assistant into a specialized agent. They are modular "onboarding packets" containing procedural knowledge, approved workflows, and custom tools.
Scope: This guide assumes familiarity with the basic mechanics of Claude Code. It focuses on architectural patterns for production-grade agent behaviors. For essential workflows and context management fundamentals, see The Claude Code Superuser Guide.
Part 1: The Canonical Architecture
A skill is not just a text prompt; it is a directory structure designed for Context Economics. It optimizes the trade-off between agent capability and token consumption using Progressive Disclosure.
1. The Directory Structure
Do not dump everything into a single text file. Adhere to the Hub-and-Spoke model to prevent context flooding.
```text
skill-name/
├── SKILL.md                 (The Hub)
│   ├── YAML Frontmatter     (Router Configuration)
│   └── Markdown Body        (Logic, State Machine, Iron Laws)
└── Bundled Resources        (The Spokes)
    ├── scripts/      # Deterministic Tools (Python/Bash) - Black Boxes
    ├── references/   # Interface Documentation (APIs, Schemas) - On Demand
    └── assets/       # Output Artifacts (Templates, Fonts)
```
2. The Ecosystem: CLAUDE.md vs. Skills
- `CLAUDE.md` (The Map): Lives in the project root. It provides high-level context (tech stack, testing commands) and points to the skills: "For database changes, see the `postgres-schema` skill."
- Skills (The Territory): Live in `.claude/skills`. They contain the deep, task-specific procedures.
Part 2: The Reference Implementation (SKILL.md)
Use this skeleton to encode the "Iron Law," "State Machine," and "Consultative" patterns.
Frontmatter (YAML):
```yaml
---
name: systematic-debugging
description: Debugs failing tests or unexpected behavior by finding root causes before proposing fixes. Use for any bug, test failure, or unexplained output.
version: 1.0.0
---
```
Content Structure:
The Iron Law
<iron_law>
NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST.
If you haven't completed Phase 1, you cannot propose fixes.
Violating the letter of this rule is violating the spirit of this rule.
</iron_law>
Phase 1: The Announce Protocol
Announce at start: "I am using the Systematic Debugging skill. My governing rule is: No fixes without root cause."
Phase 2: Consultative Interface (The Interview Gate)
Do not execute immediately. If the user provided vague symptoms ("it's broken"), ask the following clarifying questions first:
- "How do I reproduce this?"
- "What was the last working state?"
- "Are there recent relevant commits?"
Phase 3: The Workflow (State Machine)
```dot
digraph logic {
  "Start" -> "Inventory Current State";
  "Inventory Current State" -> "Form Hypothesis";
  "Form Hypothesis" -> "Create Minimal Reproduction";
  "Create Minimal Reproduction" -> "Verify Fix" [label="Test Fails"];
  "Create Minimal Reproduction" -> "Refine Hypothesis" [label="Test Passes"];
}
```
Phase 4: Context Strategy
- Tools: Run `scripts/analyze_logs.py --help`. DO NOT read the script source.
- Reference: For error codes, read `references/error_dictionary.md`.
Phase 5: The Escape Hatch
If the user explicitly requests a "hotfix" or uses the override phrase "Emergency Override", you may bypass the Iron Law, but you must output a warning: "⚠️ BYPASSING ROOT CAUSE ANALYSIS."
Part 3: Phrasing & Psychological Control
Writing prompts for agents is Adversarial Engineering. You must anticipate the model's bias toward agreeableness, laziness, and speed, and engineer against it.
1. The Trigger Description
The description field in the frontmatter is the API for your skill.
- Rule: Describe WHAT it does concisely, and WHEN to use it specifically (symptoms, intent).
- The Trap: Never summarize the steps of the workflow here. If you do, Claude might read the summary and try to "wing it" without loading the full `SKILL.md` file.
- Good: "Debugs failing tests. Use when encountering any bug, test failure, or unexpected behavior."
2. XML Structural Anchoring
LLMs adhere to instructions better when they are compartmentalized. Use custom XML tags to isolate critical logic from general text.
- Tags: `<iron_law>`, `<execution_steps>`, `<output_format>`.
- Why: Claude is trained to parse XML data. Structuring your prompt as data increases adherence to complex multi-step rules.
3. Tone Policing (The Anti-Sycophancy Protocol)
Default models are trained to be polite ("I hope this finds you well!"). In a CLI environment, this is noise.
- Instruction: "Adopt a Senior Engineer persona. Be terse. Do not offer performative agreement (e.g., 'You're absolutely right!'). Do not apologize for errors; simply correct them."
4. Rationalization Pre-buttals
Anticipate the AI's tendency to cut corners and forbid it explicitly.
- Example: "You will want to say 'I fixed it' without running the tests because you are confident. This is wrong. Stop. Run the verification command."
Part 4: Context Economics & The "Read" Strategy
The most expensive resource an agent has is attention (context window). Explicitly manage what the agent reads.
For more on context management and the "Compact Before Auto-Compact" pattern, see The Claude Code Superuser Guide.
1. The "Black Box" Pattern (Scripts)
Scenario: Deterministic logic (math, file parsing, heavy regex).
Rule: Prevent the LLM from reading the source code.
Instruction: "DO NOT read scripts/analyze.py. Treat it as a binary. Run with --help to see usage."
Why: Reading a 500-line script wastes tokens and encourages the model to try "fixing" the tool rather than using it.
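To make this concrete, here is a minimal sketch of what such a black-box script might look like (the filename, flags, and log format are illustrative, not part of any shipped skill). Because `argparse` generates the `--help` output, the agent gets complete usage instructions without ever opening the file:

```python
# scripts/analyze.py -- hypothetical "black box" tool (names and flags are illustrative).
# The skill tells the agent to run `--help` instead of reading this source.
import argparse
import json
import re
import sys

def main() -> None:
    parser = argparse.ArgumentParser(
        description="Scan a log file and report error frequencies as JSON."
    )
    parser.add_argument("logfile", help="Path to the log file to analyze")
    parser.add_argument("--level", default="ERROR",
                        help="Log level to count (default: ERROR)")
    args = parser.parse_args()

    pattern = re.compile(rf"\b{re.escape(args.level)}\b")
    counts: dict[str, int] = {}
    with open(args.logfile, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if pattern.search(line):
                # Use the first token after the level as a rough error key.
                key = line.split(args.level, 1)[-1].strip().split(" ")[0] or "<unknown>"
                counts[key] = counts.get(key, 0) + 1

    json.dump(counts, sys.stdout, indent=2)

if __name__ == "__main__":
    main()
```

The richer the `--help` text, the less reason the agent has to open the source at all.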
2. The "Interface" Pattern (References)
Scenario: The agent needs to use an internal library or DSL.
Rule: Provide a condensed interface definition, not the implementation.
Implementation: Create references/my-dsl-docs.md containing only signatures and usage examples.
Why: The agent needs to know how to call the function, not how the function works.
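One way to keep such a reference condensed is to generate it mechanically. The sketch below is a hypothetical helper (module and output names are assumed) that introspects a Python module and emits only public function signatures with one-line summaries, which you could redirect into `references/my-dsl-docs.md`:

```python
# gen_interface_doc.py -- hypothetical helper: distill a module into a
# signatures-only reference so the implementation stays out of the agent's context.
import importlib
import inspect
import sys

def build_interface_doc(module_name: str) -> str:
    module = importlib.import_module(module_name)
    lines = [f"# `{module_name}` interface", ""]
    for name, obj in vars(module).items():
        # Only public functions; classes and private helpers stay hidden.
        if name.startswith("_") or not inspect.isfunction(obj):
            continue
        signature = inspect.signature(obj)
        summary = (inspect.getdoc(obj) or "No description.").splitlines()[0]
        lines.append(f"## `{name}{signature}`")
        lines.append(summary)
        lines.append("")
    return "\n".join(lines)

if __name__ == "__main__":
    # Usage: python gen_interface_doc.py my_dsl > references/my-dsl-docs.md
    print(build_interface_doc(sys.argv[1]))
```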
3. The "Inventory-First" Pattern
Scenario: Modifying a complex existing system (PPTX, Codebases).
Rule: Force a read-only inventory step before generation.
Instruction: "Before writing code, run grep or ls -R to map the directory structure. Output a JSON summary of the file hierarchy."
Why: Prevents "blind edits" (hallucinating file paths) by forcing the agent to build a mental map first.
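If you prefer a deterministic tool over ad-hoc `grep`/`ls` calls, the inventory step can itself be a bundled script. A minimal sketch (path handling and the depth limit are illustrative choices):

```python
# scripts/inventory.py -- hypothetical read-only inventory step: map the tree
# before any edits and emit the JSON summary the skill asks for.
import json
import os
import sys

def inventory(root: str, max_depth: int = 3) -> dict:
    """Return a nested {dirname: {...}, "files": [...]} map of `root`."""
    tree: dict = {"files": []}
    base_depth = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.count(os.sep) - base_depth
        if depth >= max_depth:
            dirnames[:] = []  # stop descending past the depth limit
        node = tree
        rel = os.path.relpath(dirpath, root)
        if rel != ".":
            for part in rel.split(os.sep):
                node = node.setdefault(part, {"files": []})
        node["files"] = sorted(filenames)
    return tree

if __name__ == "__main__":
    print(json.dumps(inventory(sys.argv[1] if len(sys.argv) > 1 else "."), indent=2))
```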
Part 5: The Four Archetypes & Evals
Successful skills generally fall into one of four archetypes. Use these Evaluation Criteria (Evals) to test them.
Type A: The Specialist (Strategic)
Embeds high-level frameworks a raw LLM lacks.
- Example: `product-manager-toolkit` (RICE prioritization).
- Eval: Give the agent a list of features. Does it ask for "Reach" and "Effort" data before ranking them? Or does it just guess?
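For reference, RICE ranks features by (Reach × Impact × Confidence) ÷ Effort. A skill of this type might bundle the formula as a deterministic script so the agent computes scores instead of estimating them; the feature data below is made up for illustration:

```python
# rice.py -- the RICE formula a Specialist skill could encode as a deterministic tool.
from dataclasses import dataclass

@dataclass
class Feature:
    name: str
    reach: float       # users affected per quarter
    impact: float      # e.g. 0.25 / 0.5 / 1 / 2 / 3
    confidence: float  # 0.0 - 1.0
    effort: float      # person-months

    @property
    def rice_score(self) -> float:
        return (self.reach * self.impact * self.confidence) / self.effort

# Illustrative inputs -- the eval checks that the agent asks for these numbers
# rather than inventing them.
features = [
    Feature("SSO login", reach=4000, impact=2, confidence=0.8, effort=3),
    Feature("Dark mode", reach=9000, impact=0.5, confidence=0.9, effort=2),
]
for f in sorted(features, key=lambda f: f.rice_score, reverse=True):
    print(f"{f.name}: {f.rice_score:.0f}")
```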
Type B: The Disciplinarian (Guardrail)
Enforces process rigor.
- Example: `verification-before-completion`.
- Eval: Ask the agent to "fix the bug" without providing a reproduction case. Does the agent refuse and demand the repro case (Pass), or does it attempt a fix immediately (Fail)?
Type C: The Operator (Tool Expert)
Overcomes outdated training data regarding specific tools.
- Example: `k8s-deploy-pro`.
- Eval: Ask the agent to deploy to a cluster. Does it construct a valid `kubectl` command based on `references/CLI.md`, or does it hallucinate flags from an older version?
Type D: The Archivist (Organizational Memory)
Ensures consistency with company standards.
- Example: `brand-guidelines`.
- Eval: Ask the agent to create a slide. Does it use the specific hex code from `assets/palette.json`, or does it pick a generic "Blue"?
Part 6: The Development Lifecycle (TDD for Skills)
Treat prompts as code. Use Test-Driven Development (TDD) for your skills.
- Red (The Baseline): Run a complex task without the skill. Record the failure (e.g., "The agent skipped the tests").
- Green (The Fix): Write a `SKILL.md` that addresses that specific failure (e.g., "Verification Gate: Run `npm test`").
- Refactor (Optimization): Run the task again. If the agent finds a loophole ("I ran the tests but ignored the failure"), tighten the rule.
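These Red/Green checks can be automated. The sketch below assumes the Claude Code CLI is on PATH and supports non-interactive runs via `claude -p "<prompt>"`; it simply scans the transcript for required and forbidden phrases, and the prompts and phrases themselves are illustrative:

```python
# eval_skill.py -- minimal red/green harness sketch for skill regressions.
import subprocess

CASES = [
    # (prompt, substring that must appear, substring that must NOT appear)
    ("The checkout test is failing, please fix it.",
     "Systematic Debugging",   # Announce Protocol fired
     "I fixed it"),            # lazy completion claim (illustrative)
]

def run_case(prompt: str, must_have: str, must_not_have: str) -> bool:
    result = subprocess.run(
        ["claude", "-p", prompt],     # assumes non-interactive print mode
        capture_output=True, text=True, timeout=600,
    )
    transcript = result.stdout
    return must_have in transcript and must_not_have not in transcript

if __name__ == "__main__":
    for prompt, yes, no in CASES:
        status = "PASS" if run_case(prompt, yes, no) else "FAIL"
        print(f"[{status}] {prompt}")
```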
Quick Reference: The Audit Checklist
Before deploying a skill, audit it against these criteria:
- Trigger-Based Description: Does it describe when to use it?
- The Iron Law: Is there a non-negotiable constraint wrapped in `<iron_law>`?
- Consultative Interface: Does it ask questions before executing ambiguous tasks?
- Black Box vs. Interface: Are scripts treated as binaries? Are docs separated into `references/`?
- Rationalization Check: Does it explicitly forbid the "lazy" path?
- Tone Police: Does it strip away "customer service" language?
- Visual Logic: Are complex flows rendered in `dot` code blocks?
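Several of these checks are mechanical and can be scripted. A rough sketch is below (the file name is hypothetical, and it only verifies structure, so tone and rationalization checks still need a human or eval-based pass):

```python
# audit_skill.py -- hypothetical pre-deployment audit of a skill directory.
from pathlib import Path
import sys

def audit(skill_dir: str) -> list[str]:
    root = Path(skill_dir)
    skill_md = root / "SKILL.md"
    if not skill_md.exists():
        return [f"missing {skill_md}"]
    body = skill_md.read_text(encoding="utf-8")

    problems: list[str] = []
    if "description:" not in body:
        problems.append("frontmatter has no description field")
    if "<iron_law>" not in body:
        problems.append("no <iron_law> block found")
    if not (root / "references").is_dir():
        problems.append("no references/ directory for interface docs")
    if "digraph" not in body:
        problems.append("no dot workflow diagram in SKILL.md")
    return problems

if __name__ == "__main__":
    issues = audit(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("OK" if not issues else "\n".join(f"- {issue}" for issue in issues))
```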
Key Takeaways
Skills are modular "onboarding packets" that transform Claude Code from a generalist into a specialized agent through:
- Hub-and-Spoke Architecture - Optimized for context economics
- Iron Laws - Non-negotiable constraints that enforce process rigor
- Consultative Interfaces - Ask questions before executing ambiguous tasks
- Context Strategies - Black box scripts, interface-only docs, inventory-first patterns
- TDD for Prompts - Iteratively evolve skills based on real failures
By treating skills as engineered artifacts rather than informal prompts, you can build reliable, production-grade agent behaviors that scale across your organization.
Ready to build your first skill? Start with the canonical template and audit checklist above.