
This article is based on a week-long optimization effort across three production Skills — prd-analysis, system-design, and autoforge — covering the full loop from token-level measurement to actual code changes. All numbers come from real JSONL session files, with inflation factors corrected.

Why Skill Cost Deserves Its Own Treatment

Generic “LLM cost reduction” articles usually talk about context pruning, cache warmup, and model downgrading. These apply to Skills too — but the Skill execution environment has several structural differences:

  1. Long sessions + deep call stacks. A Skill dispatches multiple subagents, and each subagent may launch its own tool loop. One dispatch = one independent conversation context, and subagents do not share prompt cache with each other.
  2. Every turn re-pays for a grown context. Once the main agent’s context grows, every subsequent turn pays cache_read. A Skill session often has 15–20 main-agent turns; any file that enters the main-agent context is billed as cache_read on every single turn afterward.
  3. Who controls the tier is unclear. Built-in subagent types (Explore, code-reviewer, etc.) force a tier that the Skill cannot override; only general-purpose + an explicit model puts control in the Skill’s hands.
  4. Output tokens are badly underestimated. Sonnet’s output is 50× its cache_read price; Opus’s is 50× too. Skill authors instinctively optimize “read fewer files” while ignoring the bigger lever: “write fewer prompts.”

The six principles below are distilled from real measurements. Each can be applied by directly editing Skill files.


Principle 0: Accurate Accounting Is the Prerequisite

Before any optimization, learn to correctly compute the cost of a Skill session.

Pitfall: Accumulating by JSONL Line Double-Counts

Claude Code splits a single assistant message with multiple content blocks into multiple lines in JSONL — but every line carries the same usage field. Summing by line means every message is counted 2–4 times.

The correct approach: deduplicate by message.id before accumulating usage.

import json

seen = set()
totals = {"input_tokens": 0, "cache_creation_input_tokens": 0,
          "cache_read_input_tokens": 0, "output_tokens": 0}
for line in open(jsonl_path):
    obj = json.loads(line)
    if obj.get("type") != "assistant":
        continue
    mid = obj["message"]["id"]
    if mid in seen:
        continue  # duplicate JSONL line for the same message: skip
    seen.add(mid)
    usage = obj["message"]["usage"]  # the true value for this message
    for k in totals:
        totals[k] += usage.get(k, 0)

I fell into this trap on my first diagnosis, reporting $57.99. Deduplicated, it was $17.12 — a 3.4× difference.

Pricing Intuition (Claude 4 series, $/1M tokens)

Tier   | input | cache_create | cache_read | output
------ | ----- | ------------ | ---------- | ------
Opus   | 15    | 18.75        | 1.50       | 75
Sonnet | 3     | 3.75         | 0.30       | 15
Haiku  | 1     | 1.25         | 0.10       | 5

Remember two ratios:

  • output ≈ 50× cache_read (within the same tier)
  • Opus ≈ 5× Sonnet ≈ 15× Haiku (on the output axis)
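The table and both ratios can be checked mechanically. A sketch (the usage field names follow the Anthropic API’s usage object; prices are from the table above):

```python
# $/1M tokens, straight from the table above
PRICES = {
    "opus":   {"input": 15, "cache_create": 18.75, "cache_read": 1.50, "output": 75},
    "sonnet": {"input": 3,  "cache_create": 3.75,  "cache_read": 0.30, "output": 15},
    "haiku":  {"input": 1,  "cache_create": 1.25,  "cache_read": 0.10, "output": 5},
}

def message_cost(tier, usage):
    """Dollar cost of one deduplicated message's usage counts."""
    p = PRICES[tier]
    return (usage.get("input_tokens", 0) * p["input"]
            + usage.get("cache_creation_input_tokens", 0) * p["cache_create"]
            + usage.get("cache_read_input_tokens", 0) * p["cache_read"]
            + usage.get("output_tokens", 0) * p["output"]) / 1e6

# The two ratios worth memorizing:
assert PRICES["opus"]["output"] / PRICES["opus"]["cache_read"] == 50.0
assert PRICES["opus"]["output"] == 5 * PRICES["sonnet"]["output"] == 15 * PRICES["haiku"]["output"]
```

Feed message_cost the deduplicated totals from the snippet above and you get the per-session figure directly.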

Pitfall: Transport-Layer Retries Distort Cost Perception

API timeout → Claude Code auto-retries. Retries are not billed for output tokens (only a 200 OK settles billing). So the perception “it looped for 25 minutes and burned money” is often wrong — that’s wasted time, not wasted money. When diagnosing “where the money went,” separate these two kinds of waste.


Principle 1: Align Task with Tier — Don’t Fall Back on Defaults

Anti-pattern: a Skill writing Agent({ description: "...", prompt: "..." }) and letting model fall through to inherit.

Consequence: parent is Opus, so subagents are Opus. Dispatching 9 Opus subagents to do “replace this line of text with that line” easily costs $80+ per session.
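One way to make the inherit fallback impossible is to route every dispatch through a wrapper that demands a tier. A sketch (build_dispatch and its payload shape are hypothetical, not the harness API):

```python
TIERS = {"opus", "sonnet", "haiku"}

def build_dispatch(description, prompt, model):
    # Hypothetical payload builder: the real Agent() signature is harness-specific,
    # but the guard is the point. No explicit tier, no dispatch.
    if model not in TIERS:
        raise ValueError(f"explicit tier alias required, got {model!r}")
    return {
        "subagent_type": "general-purpose",  # keeps tier control in the Skill's hands
        "description": description,
        "model": model,
        "prompt": prompt,
    }

# Deterministic edits never need Opus:
fix = build_dispatch("apply Fix items",
                     "Read REVIEW.md and execute its Fix: instructions",
                     "sonnet")
```

Calling build_dispatch("...", "...", "inherit") raises immediately, which is exactly the failure mode you want at authoring time rather than on the bill.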

Task → Tier Mapping

Task | Appropriate Tier | Reason
---- | ---------------- | ------
File listing, Grep scans, keyword matching | Haiku | Pattern matching is Haiku’s strong suit
Structured parsing (YAML/Markdown → data), text classification | Haiku or Sonnet | Simple → Haiku; needs semantic judgment → Sonnet
Deterministic text edits (execute “Fix: X” replacements) | Sonnet | Opus’s reasoning budget adds no value here
Deep single-file review (catch AC gaps, cross-field consistency) | Sonnet | Haiku misses real issues; Opus is wasteful
Cross-module architecture judgment, interface alignment, conflict resolution | Opus | Genuinely needs long-chain reasoning
One-shot parsing + reduction of large files (>5k tokens) | Sonnet | Haiku misreads structured schemas

The Two Most Common Misuses

  1. Deterministic edits running inherit=Opus: a single revise flow dispatches 9 subagents, each running hundreds of lines of MultiEdit, consuming 80% of total session cost. Switching to model: "sonnet" immediately drops it 5×.
  2. Per-file review running on Explore (a built-in light tier): findings are all wording/formatting issues; cross-field consistency and AC completeness get missed. The --review → --revise loop never tightens — convergence fails. Switching to general-purpose + sonnet typically cuts rounds from 4 → 2.

Principle 2: subagent_type Decides Who Controls the Tier

Claude Code subagents fall into two categories:

  • Built-in specialized agents (Explore, code-reviewer, Plan, etc.) — the tier is decided by the harness and cannot be overridden by the Skill.
  • general-purpose — the tier is decided by the Skill’s model parameter.

When writing a Skill, the default should be: general-purpose + explicit model. Only use a built-in agent when the task genuinely matches its intended role (e.g., pure codebase exploration → Explore).

Counter-Example

review-mode.md initially dispatched Explore for per-file PRD review. It looked cheap, but was actually running on the Haiku tier and lacked the recall needed to surface real issues. Switching to general-purpose + sonnet raised per-round cost 3×, but total rounds halved and overall cost came out flat or lower — plus users could immediately tell “the review reports are now hitting real issues.”

The Version-Pinning Anti-Pattern

Do not write model: claude-sonnet-4-6 in a Skill. Models rotate (4.5 → 4.6 → 4.7), and pinning a specific version rots the Skill. Always use tier aliases: opus / sonnet / haiku. Skill authors are expressing “I want this tier,” not “I want this specific model.”
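This rule is easy to enforce mechanically. A lint sketch, assuming model: values appear inline in Skill files as model: "..." (the exact Skill-file syntax may differ):

```python
import re

ALIASES = {"opus", "sonnet", "haiku"}

def lint_model_values(skill_text):
    """Return offending model: values (pinned versions, typos) found in a Skill file."""
    bad = []
    for m in re.finditer(r'model:\s*"?([\w.-]+)"?', skill_text):
        if m.group(1) not in ALIASES:
            bad.append(m.group(1))
    return bad

assert lint_model_values('model: "claude-sonnet-4-6"') == ["claude-sonnet-4-6"]
assert lint_model_values('model: "sonnet"') == []
```

Run it over every Skill file in CI and version pinning never reaches production.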


Principle 3: cache_read × rounds — the Hidden Cost Most People Miss

Once the main agent Reads a file, that file sits in the prompt cache for every subsequent turn, each turn billed as cache_read. This cost doesn’t show up in any single Read tool’s stats — it’s the tail of every downstream turn “freeloading off the cache.”

Quantifying a Typical Case

A REVIEW report of 1,000 lines ≈ 30k tokens, main agent is Opus, revise process runs 20 turns:

  • Main agent cost: 30,000 × 20 × $1.50/M = $0.90
  • 8 Fix subagents (Sonnet), 4 turns each: 30,000 × 4 × 8 × $0.30/M = $0.29
  • Total: $1.19 — all from “read once, but cached for a long time”

Pattern: Delegate Large-File Reads to a Subagent; Main Agent Only Holds the Manifest

The main agent doesn’t read the large file. Instead, dispatch a Sonnet subagent:

  • Input: the path to the large file
  • Output: a 2–4k-token structured summary (YAML/JSON manifest)

The main agent’s context now carries only a 2–4k manifest, still over 20 turns:

  • Main agent cost: 3,000 × 20 × $1.50/M = $0.09 (↓ 90%)
  • The subagent’s 30k × 1 read happens inside its own short session — cost is tiny
  • If downstream Fix subagents still need the raw text, let them read it themselves — their context is thrown away after completion and won’t be amplified across turns.

When to apply this pattern: any intermediate artifact larger than 5k tokens (review reports, long-form research findings, architecture docs) where the main agent only needs the “index” layer, not the body.
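The before/after arithmetic generalizes to a small estimator. A sketch, with the token counts and cache_read rates taken from the worked example and the Principle 0 table:

```python
def cache_read_cost(context_tokens, turns, price_per_m):
    # Dollars paid just to re-read a cached context across `turns` turns.
    return context_tokens * turns * price_per_m / 1e6

OPUS_CR, SONNET_CR = 1.50, 0.30  # cache_read $/1M from the Principle 0 table

# Before: main agent holds the full 30k-token REVIEW report for 20 Opus turns,
# and 8 Sonnet Fix subagents (4 turns each) carry it too.
before = (cache_read_cost(30_000, 20, OPUS_CR)
          + cache_read_cost(30_000, 4 * 8, SONNET_CR))   # ≈ $1.19
# After: main agent holds only a 3k-token manifest.
after = cache_read_cost(3_000, 20, OPUS_CR)              # $0.09
```

Plug in your own report size and turn count to decide whether the manifest indirection pays for itself.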


Principle 4: Template A vs Template B — Output Tokens Are the Most Expensive

Subagent dispatch comes in two common styles:

  • Template A (by reference): the prompt says “read {path} and execute the Fix: instructions inside.”
  • Template B (inline): the prompt writes out “replace line 34 X with Y, line 56 …”

The cost difference lands on the main agent:

  • Template B: the main agent has to write out every edit in the prompt. A cluster with 2k tokens of edits means 2k main-agent output tokens. Opus output cost: 2k × $75/M = $0.15 — for one dispatch of one subagent.
  • Template A: the main agent’s prompt is only ≈ 200 tokens (paths + target file list). The subagent reads and extracts on its own. Main-agent output ≈ $0.015. 10× cheaper.
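The comparison is easy to reproduce. A sketch, with the prompt sizes from the example above and the Opus output rate from Principle 0:

```python
OPUS_OUTPUT = 75  # $/1M output tokens

def dispatch_output_cost(prompt_tokens, price_per_m=OPUS_OUTPUT):
    # Main-agent output cost of writing one dispatch prompt.
    return prompt_tokens * price_per_m / 1e6

template_b = dispatch_output_cost(2_000)  # inline every edit in the prompt
template_a = dispatch_output_cost(200)    # paths + target file list only
```

The gap scales linearly with edit volume, so the larger the revise round, the more Template A saves.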

When to Use Which

Scenario | Recommended | Reason
-------- | ----------- | ------
Source (findings / edits) is already persisted to a file | A | Subagent reads directly; saves main-agent output tokens
Source is dynamically generated in-session (interactive mode) | B | No persisted file to reference; must inline
Each cluster has very few edits (<500 tokens) | B | Output cost is tiny; A’s file read adds overhead

In one line: Skill authors’ first instinct is usually “make subagents read less” (Template B thinking), but in reality main-agent output is the more expensive side.


Principle 5: Cluster Size and MultiEdit — Two Engineering Details, One Big Rule

Cluster Sizing: 3 Files Is the Sweet Spot

Each Fix subagent has N turns per conversation, and each turn cache_reads all currently open file contents.

  • 3-file cluster, 5k per file, total 15k context
  • 40 edits → 40 turns (one per Edit)
  • total cache_read: 15k × 40 = 600k

5-file cluster? 25k × 40 = 1,000k, i.e. 67% larger.

So Skills should dispatch more, smaller clusters in parallel, not a few large clusters.
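The arithmetic behind the sweet spot, as a sketch (per-file size and edit-turn count taken from the example above):

```python
def cluster_cache_read_tokens(n_files, tokens_per_file, n_edit_turns):
    # Every edit turn re-reads the whole cluster's file contents from cache.
    return n_files * tokens_per_file * n_edit_turns

small = cluster_cache_read_tokens(3, 5_000, 40)  # 3-file cluster
large = cluster_cache_read_tokens(5, 5_000, 40)  # 5-file cluster
```

The cost grows linearly in cluster size but parallel small clusters each carry a short context, which is why many-small beats few-large.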

MultiEdit vs Sequential Edit

When a file has multiple edits:

  • Sequential Edit: each Edit is one turn, and the N-th turn cache_reads the conversation state accumulated over the previous N−1 turns. Total cost is O(N²).
  • MultiEdit: one turn. Cost is O(N).

Skill prompts must force subagents to use MultiEdit when there are >1 edits. Not “suggested” — “forbid sequential Edit.”
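The O(N²) vs O(N) gap can be modeled directly. A sketch, where delta_per_edit (tokens each Edit tool result adds to the history) is a hypothetical ballpark, not a measured value:

```python
def sequential_edit_cache_read(base_context, n_edits, delta_per_edit=300):
    # Turn i re-reads the base context plus the i-1 tool results already in history.
    return sum(base_context + (i - 1) * delta_per_edit
               for i in range(1, n_edits + 1))

def multiedit_cache_read(base_context, n_edits):
    # One turn: the context is read once, regardless of how many edits it carries.
    return base_context

seq = sequential_edit_cache_read(15_000, 40)
one = multiedit_cache_read(15_000, 40)
```

At 40 edits the sequential path re-reads over 50× more tokens, and the quadratic term only gets worse as edit counts grow.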


Principle 6: Convergence Rate Is a Cost Signal

This one is the most commonly overlooked.

If you observe that every review round surfaces issues, each revise round fixes them, but the next round reports new issues — and the total issue count never drops — this is not normal iteration. It’s tier misalignment. Symptoms:

  • The Haiku-tier review subagent only finds surface-level (wording, formatting) issues
  • After revise, the text barely changes, so the next Haiku pass finds another batch from a different shallow angle
  • Deep issues (incomplete ACs, cross-field contradictions, data-model misalignment) remain untouched
  • The user feels “it’s converging” but it never reaches zero issues

How to diagnose: over 3 consecutive --review rounds, are the categories of Critical findings decreasing? If the count is similar but the angle keeps shifting, that’s fake convergence.

How to fix: not more rounds — one tier upgrade to the right tier. Moving per-file review from Haiku to Sonnet raises per-round cost 3–5×, but often cuts total rounds from 4 → 2, so net cost drops and deep issues finally surface.
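The diagnostic can be automated if you log finding categories per round. A sketch, where the per-round sets of Critical-finding categories are a hypothetical representation of your review logs:

```python
def fake_convergence(rounds, drop_ratio=0.5):
    """rounds: one set of Critical-finding categories per --review round.
    Real convergence: the count drops. Fake: the count stays flat while the angle shifts."""
    counts = [len(r) for r in rounds]
    if counts[-1] <= counts[0] * drop_ratio:
        return False  # genuinely tightening
    # Similar volume but new categories each round: tier misalignment, not iteration.
    return len(rounds[-1] - rounds[0]) > 0

converging = [{"ac-gap", "data-model", "wording"}, {"wording"}, set()]
churning = [{"wording", "format", "tone"},
            {"headings", "style", "links"},
            {"tone", "caps", "spacing"}]
```

The drop_ratio threshold is arbitrary; the signal that matters is flat counts combined with category churn.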


Landing Checklist (10 Items)

When cost-reviewing a Skill, ask in order:

  1. Do all Agent() calls explicitly set model? (Not inherit.)
  2. Is model a tier alias (sonnet) or a pinned version (claude-sonnet-4-6)? Must be an alias.
  3. Is subagent_type: "Explore" being used for tasks that require semantic judgment? Switch to general-purpose + sonnet.
  4. Does the main agent Read any intermediate artifact > 5k tokens? Can it be delegated to a subagent with only a manifest consumed upstream?
  5. Are subagent dispatch prompts using reference (A) or inline (B) style? When a persistable source file exists, use A.
  6. Do Fix / edit subagents have an upper bound on cluster size? Recommended ≤3 files.
  7. Does the Skill force MultiEdit (not sequential Edit)? State it as “forbidden” rather than “suggested.”
  8. Any “verification reads” (a Read right after an Edit)? Forbid them.
  9. Any redundant Grep/Glob discovery (paths already given in the prompt but the subagent searches again)? Forbid them.
  10. Any history of “review not converging”? If yes, 9 times out of 10 it’s a tier problem.

Quantified Results

One /prd-analysis --revise session, before and after applying these principles:

Item | Before | After
---- | ------ | -----
Subagent tier | 9 × Opus (inherit) | 9 × Sonnet (explicit)
Cluster size | Unlimited | ≤3 files
Edit mode | Mixed Edit / MultiEdit | Forced MultiEdit
REVIEW file read | Main agent reads | Clustering subagent reads
Template | Mixed A/B | Template A when a persisted source exists
Total cost | ~$80 | ~$15

5× cost reduction with equal or better convergence speed. The only cost is ~150 extra lines in the Skill file governing “who uses which model, when” — rules that should have been there all along, previously left to implicit harness defaults.


Closing Thoughts

Skill cost optimization is not the same problem as traditional LLM-application cost optimization. Skill sessions are longer, call stacks deeper, and model-choice effects are amplified by the number of turns. The two most expensive things — output tokens and the main agent’s persistent cache_read — happen to be the two that intuition most easily misses.

Skill authors should treat the chain “task → tier → subagent_type → template” as a design decision as important as the business logic itself. The cost of writing Agent({ prompt: "..." }) and letting everything default might be the $94 line item on next month’s bill.