Cost and Performance Optimization for Claude Code Skills: 6 Principles from a Real Session
This article is based on a week-long optimization effort across three production Skills — `prd-analysis`, `system-design`, and `autoforge` — covering the full loop from token-level measurement to actual code changes. All numbers come from real JSONL session files, with inflation factors corrected.
Why Skill Cost Deserves Its Own Treatment
Generic “LLM cost reduction” articles usually talk about context pruning, cache warmup, and model downgrading. These apply to Skills too — but the Skill execution environment has several structural differences:
- Long sessions + deep call stacks. A Skill dispatches multiple subagents, and each subagent may launch its own tool loop. One dispatch = one independent conversation context, and subagents do not share prompt cache with each other.
- Once the main agent’s context grows, every subsequent turn pays cache_read. A Skill session often has 15–20 main-agent turns; any file that enters the main-agent context is billed as cache_read on every single turn afterward.
- Who controls the tier is unclear. Built-in subagent types (`Explore`, `code-reviewer`, etc.) force a tier that the Skill cannot override; only `general-purpose` plus an explicit `model` puts control in the Skill's hands.
- Output tokens are badly underestimated. Sonnet's output is 50× its cache_read price, and so is Opus's. Skill authors instinctively optimize "read fewer files" while ignoring the bigger lever: "write fewer prompts."
The six principles below are distilled from real measurements. Each can be applied by directly editing Skill files.
Principle 0: Accurate Accounting Is the Prerequisite
Before any optimization, learn to correctly compute the cost of a Skill session.
Pitfall: Accumulating by JSONL Line Double-Counts
Claude Code splits a single assistant message with multiple content blocks into multiple lines in JSONL — but every line carries the same usage field. Summing by line means every message is counted 2–4 times.
The correct approach: deduplicate by message.id before accumulating usage.
```python
import json

seen = set()
for line in open(jsonl_path):   # jsonl_path: the session file being audited
    obj = json.loads(line)
    if obj.get("type") != "assistant":
        continue
    mid = obj["message"]["id"]
    if mid in seen:             # duplicate JSONL line for the same message: skip
        continue
    seen.add(mid)
    usage = obj["message"]["usage"]   # the true usage for this message
    # accumulate usage["input_tokens"], usage["output_tokens"],
    # usage["cache_creation_input_tokens"], usage["cache_read_input_tokens"] here
```
I fell into this trap on my first diagnosis, reporting $57.99. Deduplicated, it was $17.12 — a 3.4× difference.
Pricing Intuition (Claude 4 series, $/1M tokens)
| Tier | input | cache_create | cache_read | output |
|---|---|---|---|---|
| Opus | 15 | 18.75 | 1.50 | 75 |
| Sonnet | 3 | 3.75 | 0.30 | 15 |
| Haiku | 1 | 1.25 | 0.10 | 5 |
Remember two ratios:
- output ≈ 50× cache_read (within the same tier)
- Opus ≈ 5× Sonnet ≈ 15× Haiku (on the output axis)
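To make the table concrete, here is a minimal sketch of a session pricer that plugs the deduplicated usage totals from Principle 0 into these rates (the `PRICES` dict and `session_cost` name are mine, not part of any SDK):

```python
# Minimal sketch: price a session from deduplicated usage totals ($ per 1M tokens).
PRICES = {
    "opus":   {"input": 15.0, "cache_create": 18.75, "cache_read": 1.50, "output": 75.0},
    "sonnet": {"input": 3.0,  "cache_create": 3.75,  "cache_read": 0.30, "output": 15.0},
    "haiku":  {"input": 1.0,  "cache_create": 1.25,  "cache_read": 0.10, "output": 5.0},
}

def session_cost(tier: str, usage: dict) -> float:
    p = PRICES[tier]
    return (usage.get("input_tokens", 0)                  * p["input"]
            + usage.get("cache_creation_input_tokens", 0) * p["cache_create"]
            + usage.get("cache_read_input_tokens", 0)     * p["cache_read"]
            + usage.get("output_tokens", 0)               * p["output"]) / 1e6
```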
Pitfall: Transport-Layer Retries Distort Cost Perception
API timeout → Claude Code auto-retries. Retries are not billed for output tokens (only a 200 OK settles billing). So the perception “it looped for 25 minutes and burned money” is often wrong — that’s wasted time, not wasted money. When diagnosing “where the money went,” separate these two kinds of waste.
Principle 1: Align Task with Tier — Don’t Fall Back on Defaults
Anti-pattern: a Skill writing `Agent({ description: "...", prompt: "..." })` and letting `model` fall through to `inherit`.
Consequence: parent is Opus, so subagents are Opus. Dispatching 9 Opus subagents to do “replace this line of text with that line” easily costs $80+ per session.
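In the article's `Agent()` notation (the call shape is illustrative, not the harness's exact API), the difference is a single field:

```python
# Anti-pattern: model omitted, so the subagent inherits the parent tier (Opus here).
Agent({"description": "Apply fixes", "prompt": "..."})

# Fix: state the tier explicitly with an alias.
Agent({"description": "Apply fixes", "prompt": "...", "model": "sonnet"})
```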
Task → Tier Mapping
| Task | Appropriate Tier | Reason |
|---|---|---|
| File listing, Grep scans, keyword matching | Haiku | Pattern matching is Haiku’s strong suit |
| Structured parsing (YAML/Markdown → data), text classification | Haiku or Sonnet | Simple → Haiku; needs semantic judgment → Sonnet |
| Deterministic text edits (execute “Fix: X” replacements) | Sonnet | Opus’s reasoning budget adds no value here |
| Deep single-file review (catch AC gaps, cross-field consistency) | Sonnet | Haiku misses real issues; Opus is wasteful |
| Cross-module architecture judgment, interface alignment, conflict resolution | Opus | Genuinely needs long-chain reasoning |
| One-shot parsing + reduction of large files (>5k tokens) | Sonnet | Haiku misreads structured schemas |
The Two Most Common Misuses
- Deterministic edits running on `inherit` (= Opus): a single revise flow dispatches 9 subagents, each running hundreds of lines of MultiEdit, consuming 80% of total session cost. Switching to `model: "sonnet"` immediately drops that cost 5×.
- Per-file review running on `Explore` (a built-in light tier): the findings are all wording/formatting issues, while cross-field consistency and AC completeness get missed. The `--review → --revise` loop never tightens — convergence fails. Switching to `general-purpose` + `sonnet` typically cuts rounds from 4 to 2.
Principle 2: subagent_type Decides Who Controls the Tier
Claude Code subagents fall into two categories:
- Built-in specialized agents (`Explore`, `code-reviewer`, `Plan`, etc.) — the tier is decided by the harness and cannot be overridden by the Skill.
- `general-purpose` — the tier is decided by the Skill's `model` parameter.
When writing a Skill, the default should be: `general-purpose` + an explicit `model`. Only use a built-in agent when the task genuinely matches its intended role (e.g., pure codebase exploration → `Explore`).
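A sketch of both cases, again in the article's `Agent()` notation (prompts abbreviated, wording hypothetical):

```python
# Default: general-purpose + explicit tier alias — the Skill controls the cost.
Agent({
    "description": "Review one PRD file",
    "subagent_type": "general-purpose",
    "model": "sonnet",
    "prompt": "Review {path} for AC gaps and cross-field consistency ...",
})

# Built-in agent: only when the task matches its role — the harness picks the tier.
Agent({
    "description": "Locate session persistence code",
    "subagent_type": "Explore",
    "prompt": "Find the modules that write JSONL session files.",
})
```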
Counter-Example
`review-mode.md` initially dispatched `Explore` for per-file PRD review. It looked cheap, but it was actually running on the Haiku tier and lacked the recall needed to surface real issues. Switching to `general-purpose` + `sonnet` raised per-round cost 3×, but total rounds halved and overall cost came out flat or lower — plus users could immediately tell "the review reports are now hitting real issues."
The Version-Pinning Anti-Pattern
Do not write `model: claude-sonnet-4-6` in a Skill. Models rotate (4.5 → 4.6 → 4.7), and pinning a specific version rots the Skill. Always use tier aliases: `opus` / `sonnet` / `haiku`. Skill authors are expressing "I want this tier," not "I want this specific model."
Principle 3: cache_read × rounds — the Hidden Cost Most People Miss
Once the main agent Reads a file, that file sits in the prompt cache for every subsequent turn, each turn billed as cache_read. This cost doesn’t show up in any single Read tool’s stats — it’s the tail of every downstream turn “freeloading off the cache.”
Quantifying a Typical Case
A REVIEW report of 1,000 lines ≈ 30k tokens, main agent is Opus, revise process runs 20 turns:
- Main agent cost: 30,000 × 20 × $1.50/M = $0.90
- 8 Fix subagents (Sonnet), 4 turns each: 30,000 × 4 × 8 × $0.30/M = $0.29
- Total: $1.19 — all from "read once, but cached for a long time"
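The same arithmetic as a quick check (cache_read rates from the pricing table):

```python
# Back-of-envelope for the case above (cache_read rates in $/1M tokens).
report = 30_000
main_agent = report * 20 * 1.50 / 1e6      # Opus, 20 turns            -> $0.90
fix_agents = report * 4 * 8 * 0.30 / 1e6   # 8 Sonnet agents, 4 turns  -> $0.288
total = main_agent + fix_agents            # -> ~$1.19
```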
Pattern: Delegate Large-File Reads to a Subagent; Main Agent Only Holds the Manifest
The main agent doesn’t read the large file. Instead, dispatch a Sonnet subagent:
- Input: the path to the large file
- Output: a 2–4k-token structured summary (YAML/JSON manifest)
The main agent’s context now carries only a 2–4k manifest, still over 20 turns:
- Main agent cost: 3,000 × 20 × $1.50/M = $0.09 (↓ 90%)
- The subagent's 30k × 1 read happens inside its own short session — cost is tiny
- If downstream Fix subagents still need the raw text, let them read it themselves — their context is thrown away after completion and isn't amplified across turns.
When to apply this pattern: any intermediate artifact larger than 5k tokens (review reports, long-form research findings, architecture docs) where the main agent only needs the “index” layer, not the body.
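A sketch of the dispatch, with a hypothetical report path and prompt wording:

```python
# Sketch: a Sonnet subagent reads the big file and returns only a manifest.
manifest = Agent({
    "description": "Summarize the REVIEW report",
    "subagent_type": "general-purpose",
    "model": "sonnet",
    "prompt": (
        "Read reports/REVIEW.md and return a YAML manifest of at most 4k tokens: "
        "one entry per finding with id, severity, file, and a one-line summary. "
        "Do not return the full report text."
    ),
})
# Only `manifest` (~3k tokens) enters the main agent's context for the remaining
# turns; Fix subagents that need the raw text re-read the file themselves.
```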
Principle 4: Template A vs Template B — Output Tokens Are the Most Expensive
Subagent dispatch comes in two common styles:
- Template A (by reference): the prompt says "read `{path}` and execute the `Fix:` instructions inside."
- Template B (inline): the prompt writes out "replace line 34 X with Y, line 56 …"
The cost difference lands on the main agent:
- Template B: the main agent has to write out every edit in the prompt. A cluster with 2k tokens of edits means 2k main-agent output tokens. Opus output cost: 2k × $75/M = $0.15 — for one dispatch of one subagent.
- Template A: the main agent's prompt is only ≈ 200 tokens (paths + target file list). The subagent reads and extracts on its own. Main-agent output ≈ $0.015 — 10× cheaper.
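The two templates side by side, as prompts (paths and wording hypothetical):

```python
# Template A (by reference): main-agent output stays around 200 tokens.
Agent({
    "description": "Apply fixes for cluster 2",
    "subagent_type": "general-purpose",
    "model": "sonnet",
    "prompt": "Read reports/REVIEW.md, section cluster-2, and apply every "
              "'Fix:' instruction to src/a.md, src/b.md, src/c.md.",
})

# Template B (inline): every edit is spelled out, so a 2k-token edit list
# costs 2k main-agent output tokens on every dispatch.
Agent({
    "description": "Apply fixes for cluster 2",
    "subagent_type": "general-purpose",
    "model": "sonnet",
    "prompt": "In src/a.md replace line 34 'X' with 'Y'; in src/b.md replace ...",
})
```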
When to Use Which
| Scenario | Recommended | Reason |
|---|---|---|
| Source (findings / edits) is already persisted to a file | A | Subagent reads directly; save main-agent output tokens |
| Source is dynamically generated in-session (interactive mode) | B | No persisted file to reference, must inline |
| Each cluster has very few edits (<500 tokens) | B | Output cost is tiny; A’s file read adds overhead |
In one line: Skill authors’ first instinct is usually “make subagents read less” (Template B thinking), but in reality main-agent output is the more expensive side.
Principle 5: Cluster Size and MultiEdit — Two Engineering Details, One Big Rule
Cluster Sizing: 3 Files Is the Sweet Spot
Each Fix subagent has N turns per conversation, and each turn cache_reads all currently open file contents.
- 3-file cluster, 5k tokens per file, total 15k context
- 40 edits → 40 turns (one per Edit)
- Total cache_read: 15k × 40 = 600k tokens
A 5-file cluster? 25k × 40 = 1000k — about 67% more.
So Skills should dispatch more, smaller clusters in parallel, not a few large clusters.
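The scaling, written out (token counts from the example above):

```python
# cache_read per cluster: files x tokens-per-file x turns (one turn per edit).
def cluster_cache_read(files: int, edits: int, tokens_per_file: int = 5_000) -> int:
    return files * tokens_per_file * edits

cluster_cache_read(3, 40)  # 600_000 tokens
cluster_cache_read(5, 40)  # 1_000_000 tokens -> ~67% more for the same edits
```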
MultiEdit vs Sequential Edit
When a file has multiple edits:
- Sequential Edit: each Edit is one turn. The N-th Edit cache_reads the entire N−1 turns of conversation state. Cost is O(N²).
- MultiEdit: one turn. Cost is O(N).
Skill prompts must force subagents to use `MultiEdit` whenever a file needs more than one edit. Not "suggested" — "forbid sequential `Edit`."
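A toy model of the difference — the per-turn growth constant is an assumption; the quadratic-vs-linear shape is the point:

```python
# Toy model: cache_read for N edits on one file (base context ~5k tokens).
N, base, per_turn = 40, 5_000, 300   # per_turn: assumed tokens appended per Edit turn

sequential = sum(base + i * per_turn for i in range(N))  # turn i replays i prior turns: O(N^2)
multiedit = base                                         # one turn, one read: O(N)
# sequential = 434_000 tokens vs. multiedit = 5_000 tokens
```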
Principle 6: Convergence Rate Is a Cost Signal
This one is the most commonly overlooked.
If you observe that every review round surfaces issues, each revise round fixes them, but the next round reports new issues — and the total issue count never drops — this is not normal iteration. It’s tier misalignment. Symptoms:
- The Haiku-tier review subagent only finds surface-level (wording, formatting) issues
- After revise, the text barely changes, so the next Haiku pass finds another batch from a different shallow angle
- Deep issues (incomplete ACs, cross-field contradictions, data-model misalignment) remain untouched
- The user feels “it’s converging” but it never reaches zero issues
How to diagnose: over 3 consecutive --review rounds, are the categories of Critical findings decreasing? If the count is similar but the angle keeps shifting, that’s fake convergence.
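One way to mechanize the check, assuming each Critical finding is tagged with a category (the data shape here is invented):

```python
# Sketch: fake-convergence check over Critical findings from 3 review rounds.
rounds = [
    {"wording", "formatting"},   # round 1 categories
    {"formatting", "tone"},      # round 2
    {"wording", "style"},        # round 3
]
count_stable = max(len(r) for r in rounds) - min(len(r) for r in rounds) <= 1
angle_shifts = len(set().union(*rounds)) > max(len(r) for r in rounds)
if count_stable and angle_shifts:
    print("Fake convergence: stable count, shifting angle -> upgrade the review tier")
```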
How to fix: not more rounds, but a single upgrade to the right tier. Moving per-file review from Haiku to Sonnet raises per-round cost 3–5×, but often cuts total rounds from 4 to 2, so net cost drops and deep issues finally surface.
Landing Checklist (10 Items)
When cost-reviewing a Skill, ask in order:
- Do all `Agent()` calls explicitly set `model`? (Not `inherit`.)
- Is `model` a tier alias (`sonnet`) or a pinned version (`claude-sonnet-4-6`)? It must be an alias.
- Is `subagent_type: "Explore"` being used for tasks that require semantic judgment? Switch to `general-purpose` + `sonnet`.
- Does the main agent Read any intermediate artifact > 5k tokens? Can the read be delegated to a subagent, with only a manifest kept upstream?
- Are subagent dispatch prompts using reference (A) or inline (B) style? When a persisted source file exists, use A.
- Do Fix / edit subagents have an upper bound on cluster size? Recommended: ≤3 files.
- Does the Skill force `MultiEdit` (not sequential `Edit`)? State it as "forbidden" rather than "suggested."
- Any "verification reads" (a Read right after an Edit)? Forbid them.
- Any redundant Grep/Glob discovery (paths already given in the prompt, but the subagent searches again)? Forbid it.
- Any history of "review not converging"? If yes, nine times out of ten it's a tier problem.
Quantified Results
One `/prd-analysis --revise` session, before and after applying these principles:
| Item | Before | After |
|---|---|---|
| Subagent tier | 9 × Opus (inherit) | 9 × Sonnet (explicit) |
| Cluster size | Unlimited | ≤3 files |
| Edit mode | Mixed Edit / MultiEdit | Forced MultiEdit |
| REVIEW file read | Main agent reads | Clustering subagent reads |
| Template | Mixed A/B | Force A when persistence exists |
| Total cost | ~$80 | ~$15 |
5× cost reduction with equal or better convergence speed. The only cost is ~150 extra lines in the Skill file governing “who uses which model, when” — rules that should have been there all along, previously left to implicit harness defaults.
Closing Thoughts
Skill cost optimization is not the same problem as traditional LLM-application cost optimization. Skill sessions are longer, call stacks deeper, and model-choice effects are amplified by the number of turns. The two most expensive things — output tokens and the main agent’s persistent cache_read — happen to be the two that intuition most easily misses.
Skill authors should treat the chain “task → tier → subagent_type → template” as a design decision as important as the business logic itself. The cost of writing Agent({ prompt: "..." }) and letting everything default might be the $94 line item on next month’s bill.