Skill Quality · Part 3

Plain Sonnet sounds confident. Skills make sure it's right — so you don't burn Saturday debugging a hallucination. Two judges from two vendors agreed.

benchmarks · skills · evaluation · retrieval · anthropic · openai

Skills Actually Help: The Numbers

It's 9pm. You've got two hours before bed and a Stripe webhook that double-charges customers under retries. You ask Sonnet. It answers — confidently, fluently, with a code block and everything. Sometimes it's confidently wrong, in the specific way that ships Friday and pages you Tuesday. That's the tax on confident-but-generic AI: not the hallucinations you catch, the ones you don't.

WinDAGs' windags_skill_graft grafts four senior specialists from a 547-skill catalog straight into the model's context — Stripe webhook idempotency, Postgres MVCC, JWT rotation, Airflow retries — plus the on-demand ability to pull deeper reference docs when the agent decides it needs them. You don't pick the skills. The cascade does. Your context only ever sees the 4–8 the matcher picked for the question you actually asked.

Two independent judges from two different vendors — Anthropic's Opus 4.7 and OpenAI's gpt-5.5 — read every pair blind on a 5-criterion rubric. Both picked the grafted answer. Opus 4.7 picked it 70% of the time vs. 12% for vanilla. gpt-5.5 picked it 58% of the time vs. 40%. Different model families, same call: the senior-engineer answer beats the generalist one.

Erich Owens, Curiositech

What windags_skill_graft actually puts in your context

Not 551 skill descriptions stuffed into your prompt. The cascade runs server-side and ships back the 4 specialists it picked — full bodies, ~10–12K tokens — plus 4 adjacent catalog entries (name + description, ~500 tokens) for awareness, plus an on-demand tool the agent can use to pull specific reference docs when it needs them. Your context budget stays for the actual problem.

The shape follows the skillful-node-prompt dossier: it is what every WinDAGs DAG node gets when it executes, and it is the same shape we tested here.

Identity branch
  • 4 primary skills: full SKILL.md bodies merged, ~10–12K tokens of grafted expertise
  • 4 adjacent skills: name + description only, for catalog awareness
  • Reference files listed per skill (paths + sizes), loadable on demand via read_skill_reference
Protocol branch
  • Task-handling loop: restate → assess fit → execute → validate → report
  • Escalation contract: refuse explicitly when grafted skills don't cover the question
  • Mandatory confidence block at the end of every response
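
The two branches above can be sketched as a data shape plus a renderer. This is a minimal illustration, not the WinDAGs schema: every type and field name here (`SkillfulNodePrompt`, `PrimarySkill`, `renderPrompt`, and so on) is an assumption made for the example. Only the contents of the branches and the `read_skill_reference` tool name come from the text.

```typescript
// Hypothetical types: names are illustrative, not the real WinDAGs schema.
interface ReferenceFile { path: string; sizeBytes: number }

interface PrimarySkill {
  name: string;
  body: string;                // full SKILL.md body
  references: ReferenceFile[]; // listed only; contents load on demand
}

interface AdjacentSkill { name: string; description: string }

// The task-handling loop from the protocol branch.
const PROTOCOL_LOOP = ["restate", "assess fit", "execute", "validate", "report"] as const;

interface SkillfulNodePrompt {
  identity: { primary: PrimarySkill[]; adjacent: AdjacentSkill[] };
  protocol: { loop: readonly string[] };
}

function renderPrompt(p: SkillfulNodePrompt): string {
  const primary = p.identity.primary.map((s) => {
    const refs = s.references
      .map((r) => `  - ${r.path} (${r.sizeBytes} bytes)`)
      .join("\n");
    return `## ${s.name}\n${s.body}\nReferences (load with read_skill_reference):\n${refs}`;
  });
  const adjacent = p.identity.adjacent.map((s) => `- ${s.name}: ${s.description}`);
  return [
    "# Identity",
    ...primary,
    "# Adjacent skills (name + description only)",
    ...adjacent,
    "# Protocol",
    `Loop: ${p.protocol.loop.join(" -> ")}`,
    "If the grafted skills do not cover the question, refuse explicitly.",
    "End every response with a confidence block.",
  ].join("\n\n");
}
```

Keeping reference files as path-and-size entries rather than inlined text is what lets the prompt stay near the 10–12K-token figure while still telling the agent what it could load.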

What the agent did with the tools we gave it

  • 84% of prompts (42/50): the agent took an extra turn to load references or search the catalog
  • 56 read_skill_reference calls: 37 loaded a real file, 19 hit the helpful “wrong path” listing
  • 43/50 cascade hit-rate: known-good specialist in the top 8 (9 escalations to the wider catalog via windags_skill_search)
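
The “wrong path” listing is worth a sketch, because it turns a failed tool call into a useful one. The version below is an assumption about how such a handler could behave, not the actual `read_skill_reference` implementation: the in-memory `ReferenceStore`, the `ReadResult` shape, and the function signature are all hypothetical.

```typescript
// Hypothetical in-memory store: skill name -> (reference path -> file contents).
type ReferenceStore = Map<string, Map<string, string>>;

type ReadResult =
  | { kind: "file"; path: string; content: string }
  | { kind: "listing"; message: string; available: string[] };

function readSkillReference(
  store: ReferenceStore,
  skill: string,
  path: string,
): ReadResult {
  const files = store.get(skill) ?? new Map<string, string>();
  const content = files.get(path);
  if (content !== undefined) {
    return { kind: "file", path, content };
  }
  // Wrong path: answer with what does exist so the agent can retry next turn
  // instead of guessing again.
  return {
    kind: "listing",
    message: `No reference "${path}" for skill "${skill}".`,
    available: [...files.keys()].sort(),
  };
}
```

Under this design a miss costs one turn rather than a dead end, which is consistent with 19 of the 56 calls landing on the listing.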

Two judges, blind, head-to-head

Two flagship judges from two different vendors read every pair (vanilla Sonnet 4.6 vs. Sonnet 4.6 + WinDAGs Skill Graft) blind, with randomized first position, on a 5-criterion rubric. Bigger green bar = the grafted answer won more often.

  • Opus 4.7: graft 35 / vanilla 6 · 9 tie. +29-prompt margin (70% vs. 12%); grafted answer wins. All 50 prompts judged.
  • gpt-5.5-2026-04-23: graft 29 / vanilla 20 · 1 tie. +9-prompt margin (58% vs. 40%); grafted answer wins. All 50 prompts judged.
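
A minimal sketch of the blind, randomized-position comparison and the tally behind the numbers above. The helper names (`makeBlindPair`, `unblind`, `tally`) are hypothetical and the real judge script may be structured differently; the random source is injectable so the example is deterministic under test.

```typescript
type Condition = "vanilla" | "graft";
type Verdict = "A" | "B" | "tie";
type Outcome = Condition | "tie";

interface BlindPair {
  a: string; // shown to the judge as "Response A"
  b: string; // shown to the judge as "Response B"
  key: { A: Condition; B: Condition }; // never shown to the judge
}

// Randomize which condition appears first so position cannot leak identity.
function makeBlindPair(
  vanilla: string,
  graft: string,
  rand: () => number = Math.random,
): BlindPair {
  return rand() < 0.5
    ? { a: graft, b: vanilla, key: { A: "graft", B: "vanilla" } }
    : { a: vanilla, b: graft, key: { A: "vanilla", B: "graft" } };
}

// Map the judge's positional verdict back to a condition.
function unblind(pair: BlindPair, verdict: Verdict): Outcome {
  return verdict === "tie" ? "tie" : pair.key[verdict];
}

function tally(outcomes: Outcome[]) {
  const n = outcomes.length;
  const graft = outcomes.filter((o) => o === "graft").length;
  const vanilla = outcomes.filter((o) => o === "vanilla").length;
  return {
    graft,
    vanilla,
    ties: n - graft - vanilla,
    graftRate: graft / n,
    vanillaRate: vanilla / n,
    marginPrompts: graft - vanilla,
  };
}
```

Fed the Opus 4.7 outcomes (35 graft, 6 vanilla, 9 ties), `tally` gives a graft rate of 0.7, a vanilla rate of 0.12, and a 29-prompt margin.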

Where Skill Graft pulls ahead, per criterion (Opus 4.7)

On Opus 4.7's 5-criterion rubric, the grafted answer wins decisively on three criteria: respects conventions (this codebase's, this framework's) at 29 to 7, correctness at 28 to 8, and actionable (one obvious next step, not a dissertation) at 27 to 10. Addresses actual problem is mostly a tie (37 of 50). Avoids hallucinations is the one criterion where vanilla comes out ahead, 18 to 13. Both Sonnet versions know syntax; skills carry the judgment.

  • Respects conventions: vanilla 7 · tie 14 · graft 29
  • Correctness: vanilla 8 · tie 14 · graft 28
  • Actionable: vanilla 10 · tie 13 · graft 27
  • Addresses actual problem: vanilla 2 · tie 37 · graft 11
  • Avoids hallucinations: vanilla 18 · tie 19 · graft 13

By question category (Opus 4.7)

5 prompts per category. Skill Graft swept Data Pipelines and Observability (5 of 5 each) and won 4 of 5 in GraphQL & REST, Kubernetes, and Stripe / Payments. The gap narrows in Frontend (2 wins, the rest ties) and disappears in Auth & OAuth, where each side won 2.

  • Data Pipelines: graft 5
  • Observability: graft 5
  • GraphQL & REST: vanilla 1 · graft 4
  • Build & Deploy: tie 2 · graft 3
  • Kubernetes: vanilla 1 · graft 4
  • Stripe / Payments: vanilla 1 · graft 4
  • Postgres: tie 2 · graft 3
  • Frontend: tie 3 · graft 2
  • ML Pipelines: vanilla 1 · tie 1 · graft 3
  • Auth & OAuth: vanilla 2 · tie 1 · graft 2

All 50 prompts, drillable

Pick any prompt to see the grafted skills, the tools the agent actually called, both responses side-by-side, and what each judge said.


Context budget: graft vs. dump-everything

The naive way to give an agent specialist knowledge is to stuff every skill description into the system prompt. With 551 skills that's ~50K tokens of catalog noise the model has to read past before getting to your actual question. Graft does the opposite: the cascade picks the 4–8 specialists for this question and ships only those.

  • Naive dump, ~50K tokens: 551 skill names + descriptions + tags. Context-poisoning at scale.
  • Skill Graft, ~10–12K tokens: 4 full SKILL.md bodies + 4 catalog adjacencies. References loaded on demand if the agent decides it needs them.
  • Savings, ~75% fewer tokens: plus better outputs (per the judges above). The cascade decides what's relevant; you don't pay attention tax on what isn't.
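
The savings figure is simple arithmetic on the approximate token counts quoted in this post. A small sketch, where the `savings` helper and the constant names are illustrative:

```typescript
// Fraction of tokens saved relative to the naive catalog dump.
function savings(naiveTokens: number, graftTokens: number): number {
  return 1 - graftTokens / naiveTokens;
}

const NAIVE_DUMP = 50_000; // ~50K tokens of names + descriptions + tags
const ADJACENT = 500;      // ~500 tokens for the 4 adjacent catalog entries

// 4 full SKILL.md bodies run ~10-12K tokens, so bracket both ends.
const bestCase = savings(NAIVE_DUMP, 10_000 + ADJACENT);  // about 0.79
const worstCase = savings(NAIVE_DUMP, 12_000 + ADJACENT); // 0.75

console.log(`${Math.round(worstCase * 100)}% to ${Math.round(bestCase * 100)}% fewer tokens`);
```

The quoted ~75% is the conservative end of that range, before any on-demand reference loads, which add tokens only when the agent asks for them.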

Scope of this study

  • One-shot Q&A. Each prompt gets one Sonnet 4.6 response (with up to 8 tool-use turns for on-demand reference loading). Real engineering involves iteration, tests, file edits — those live in /next-move and get their own measurement.
  • 50 hand-written senior-engineering prompts across 10 categories (auth, payments, Postgres, K8s, ML, frontend, build, observability, data pipelines, GraphQL/REST), 5 prompts each. Not a fuzz test, not a leaderboard — a load-bearing sample of what a senior engineer Slacks at 9pm.
  • Two flagship judges, two vendors (Anthropic Opus 4.7, OpenAI gpt-5.5). Two-vendor agreement is the point: the effect isn't Anthropic-flavored taste. A larger judge ensemble would tighten confidence intervals and is on the to-do.
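
To make the confidence-interval caveat concrete, here is a rough sketch of a 95% Wilson score interval on each judge's graft win rate. This is not part of the published analysis: counting ties as non-wins and choosing the Wilson interval are assumptions made for illustration.

```typescript
// 95% Wilson score interval for a binomial proportion (z = 1.96 by default).
function wilson(wins: number, n: number, z = 1.96): [number, number] {
  const p = wins / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n))) / denom;
  return [center - half, center + half];
}

// Graft wins out of 50 prompts, ties counted as non-wins (an assumption).
const opusInterval = wilson(35, 50); // roughly 0.56 to 0.81
const gptInterval = wilson(29, 50);  // roughly 0.44 to 0.71
console.log({ opusInterval, gptInterval });
```

At n = 50 the intervals span roughly 25 percentage points, so more prompts or more judges would narrow them.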

Reproduce

Scripts, prompts, raw outputs, and judge verdicts live in the public curiositech/windags-skills repo. Set ANTHROPIC_API_KEY and OPENAI_API_KEY, then:

git clone https://github.com/curiositech/windags-skills.git
cd windags-skills/scripts/bench
pnpm install

# 50 prompts × 2 conditions, Sonnet 4.6, max_tokens 32768
pnpm tsx runner-skill-graft-v2.ts --run sg-v2 --concurrency 4

# Judge with Opus 4.7 (run again with the gpt-5.5 provider and model for the second judge)
pnpm tsx judge-pairs.ts runs/sg-v2 \
  --pair vanilla_v2,skill_graft_v2 \
  --provider anthropic --model claude-opus-4-7 --tag opus-4-7

# Bundle the JSON
pnpm tsx export-sg-v2.ts runs/sg-v2