Skill Quality · Part 3

Plain Sonnet sounds confident. Skills make sure it's right — so you don't burn Saturday debugging a hallucination. Two judges from two vendors agreed.

Tags: benchmarks, skills, evaluation, retrieval, anthropic, openai

Skills Actually Help: The Numbers

It's 9pm. You've got two hours before bed and a Stripe webhook that double-charges customers under retries. You ask Sonnet. It answers — confidently, fluently, with a code block and everything. Sometimes it's confidently wrong, in the specific way that ships Friday and pages you Tuesday. That's the tax on confident-but-generic AI: not the hallucinations you catch, the ones you don't.

WinDAGs' windags_skill_graft grafts four senior specialists from a 551-skill catalog straight into the model's context (Stripe webhook idempotency, Postgres MVCC, JWT rotation, Airflow retries), plus the on-demand ability to pull deeper reference docs when the agent decides it needs them. You don't pick the skills. The cascade does. Your context only ever sees the 4–8 the matcher picked for the question you actually asked.

Two independent judges from two different vendors — Anthropic's Opus 4.7 and OpenAI's gpt-5.5 — read every pair blind on a 5-criterion rubric. Both picked the grafted answer. Opus 4.7 picked it 70% of the time vs. 12% for vanilla. gpt-5.5 picked it 58% of the time vs. 40%. Different model families, same call: the senior-engineer answer beats the generalist one.

Wednesday, April 29, 2026 | Erich Owens, Curiositech

What `windags_skill_graft` actually puts in your context

Not 551 skill descriptions stuffed into your prompt. The cascade runs server-side and ships back the 4 specialists it picked — full bodies, ~10–12K tokens — plus 4 adjacent catalog entries (name + description, ~500 tokens) for awareness, plus an on-demand tool the agent can use to pull specific reference docs when it needs them. Your context budget stays for the actual problem.

Shape per the skillful-node-prompt dossier — what every WinDAGs DAG node gets when it executes. Same shape we tested here.

Identity branch

4 primary skills: full SKILL.md bodies merged, ~10–12K tokens of grafted expertise
4 adjacent skills: name + description only, for catalog awareness
Reference files listed per skill (paths + sizes), loadable on demand via read_skill_reference
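
The identity branch above can be pictured as a small data structure plus one tool. The sketch below is illustrative only: every type and field name is hypothetical, not the WinDAGs schema, and `readSkillReference` merely imitates the behavior described in this post (return the file when the path is right, otherwise list the paths that do exist).

```typescript
// Hypothetical shapes; the real WinDAGs payload may look different.
interface ReferenceFile { path: string; sizeBytes: number; content: string }
interface PrimarySkill { name: string; body: string; references: ReferenceFile[] }
interface AdjacentSkill { name: string; description: string }
interface GraftedContext { primary: PrimarySkill[]; adjacent: AdjacentSkill[] }

type ReadResult =
  | { ok: true; content: string }
  | { ok: false; available: string[] };

// Assumed behavior: a wrong path returns a listing of valid paths instead of an error.
function readSkillReference(ctx: GraftedContext, skill: string, path: string): ReadResult {
  const owner = ctx.primary.find((s) => s.name === skill);
  const refs = owner ? owner.references : [];
  const hit = refs.find((r) => r.path === path);
  if (hit) return { ok: true, content: hit.content };
  return { ok: false, available: refs.map((r) => r.path) };
}

// Toy context with made-up skill and file names.
const ctx: GraftedContext = {
  primary: [
    {
      name: "stripe-webhooks",
      body: "full SKILL.md body goes here",
      references: [
        { path: "references/idempotency.md", sizeBytes: 2048, content: "dedupe on event id" },
      ],
    },
  ],
  adjacent: [{ name: "postgres-mvcc", description: "name + description only" }],
};

const found = readSkillReference(ctx, "stripe-webhooks", "references/idempotency.md");
const missed = readSkillReference(ctx, "stripe-webhooks", "idempotency.md");
```

Returning a listing on a miss gives the agent enough information to correct the path on its next turn.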

Protocol branch

Task-handling loop: restate → assess fit → execute → validate → report
Escalation contract: refuse explicitly when grafted skills don't cover the question
Mandatory confidence block at the end of every response

What the agent did with the tools we gave it

84% of prompts (42/50): the agent took an extra turn to load references or search the catalog

56 calls to read_skill_reference: 37 loaded a real file, 19 hit the helpful “wrong path” listing

43/50 cascade hit-rate: known-good specialist in top-8 (9 escalations to wider catalog via windags_skill_search)

windags-skills/scripts/bench
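
A top-8 hit-rate like the one reported above can be computed in a few lines. This is a minimal sketch that assumes each prompt has one hand-labeled known-good specialist. The record shape and the skill names are hypothetical, not the bench's actual format.

```typescript
// Hypothetical record: one row per benchmark prompt.
interface CascadeResult {
  promptId: string;
  rankedSkills: string[]; // matcher output, best first
  knownGood: string;      // hand-labeled specialist for this prompt
}

// Fraction of prompts whose known-good skill appears in the first k results.
function hitRateAtK(results: CascadeResult[], k: number): number {
  if (results.length === 0) return 0;
  const hits = results.filter((r) => r.rankedSkills.slice(0, k).includes(r.knownGood)).length;
  return hits / results.length;
}

// Toy data with made-up skill names.
const sample: CascadeResult[] = [
  { promptId: "p1", rankedSkills: ["a", "b"], knownGood: "a" },
  { promptId: "p2", rankedSkills: ["c", "d", "e"], knownGood: "e" },
  { promptId: "p3", rankedSkills: ["f"], knownGood: "z" },
];

const top8 = hitRateAtK(sample, 8); // two of three prompts hit
const top1 = hitRateAtK(sample, 1); // one of three prompts hits
```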

Two judges, blind, head-to-head

Two flagship judges from two different vendors read every pair (vanilla Sonnet 4.6 vs. Sonnet 4.6 + WinDAGs Skill Graft) blind, with randomized first position, on a 5-criterion rubric. Bigger green bar = the grafted answer won more often.
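
Blind judging with a randomized first position can be sketched as follows. This is an illustration of the general technique under assumed names (`makeBlindPair`, `unblind`), not the code in judge-pairs.ts: the judge sees only responses A and B, and the verdict is mapped back to a condition afterwards.

```typescript
type Condition = "vanilla" | "graft";
type Verdict = "A" | "B" | "tie";

interface BlindPair {
  a: string;      // response shown first
  b: string;      // response shown second
  aIs: Condition; // kept out of the judge prompt
}

// Flip a coin to decide which condition is shown first.
function makeBlindPair(
  vanilla: string,
  graft: string,
  rng: () => number = Math.random,
): BlindPair {
  return rng() < 0.5
    ? { a: graft, b: vanilla, aIs: "graft" }
    : { a: vanilla, b: graft, aIs: "vanilla" };
}

// Translate the judge's positional verdict back into a condition.
function unblind(pair: BlindPair, verdict: Verdict): Condition | "tie" {
  if (verdict === "tie") return "tie";
  const bIs: Condition = pair.aIs === "graft" ? "vanilla" : "graft";
  return verdict === "A" ? pair.aIs : bIs;
}
```

Injecting `rng` keeps the shuffle reproducible in tests. Randomizing order matters because LLM judges can favor whichever answer they read first.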

Opus 4.7

graft 35 · vanilla 6 · tie 9

+29-prompt margin (58 percentage points) · grafted answer wins

All 50 prompts judged.

gpt-5.5-2026-04-23

graft 29 · vanilla 20 · tie 1

+9-prompt margin (18 percentage points) · grafted answer wins

All 50 prompts judged.
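
For readers who want to check the arithmetic, the tallies above convert to win rates and margins like this. The counts are the ones reported in this post. The helper name `summarize` is only for illustration.

```typescript
interface Tally { graft: number; vanilla: number; tie: number }

function summarize(t: Tally) {
  const total = t.graft + t.vanilla + t.tie;
  return {
    total,
    graftPct: (100 * t.graft) / total,
    vanillaPct: (100 * t.vanilla) / total,
    marginPrompts: t.graft - t.vanilla,                  // difference in prompt counts
    marginPoints: (100 * (t.graft - t.vanilla)) / total, // same gap in percentage points
  };
}

const opus = summarize({ graft: 35, vanilla: 6, tie: 9 });
const gpt = summarize({ graft: 29, vanilla: 20, tie: 1 });

console.log(opus); // graftPct 70, vanillaPct 12, marginPrompts 29, marginPoints 58
console.log(gpt);  // graftPct 58, vanillaPct 40, marginPrompts 9, marginPoints 18
```

With 50 prompts, each prompt is worth two percentage points, so a 29-prompt margin is a 58-point gap in win rate.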

Where Skill Graft pulls ahead, per criterion (Opus 4.7)

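On Opus 4.7's 5-criterion rubric, the grafted answer wins decisively on the things you can't Google your way to at 9pm: respects conventions (this codebase's, this framework's; 29 to 7), correctness (28 to 8), and actionable (one obvious next step, not a dissertation; 27 to 10). Addressing the actual problem mostly ties (37 of 50), since both Sonnet versions understand the question. Avoiding hallucinations is the one criterion where vanilla edges ahead (18 to 13, with 19 ties). Skills carry the judgment.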

Respects conventions: graft 29 · tie 14 · vanilla 7

Correctness: graft 28 · tie 14 · vanilla 8

Actionable: graft 27 · tie 13 · vanilla 10

Addresses actual problem: graft 11 · tie 37 · vanilla 2

Avoids hallucinations: graft 13 · tie 19 · vanilla 18

By question category (Opus 4.7)

5 prompts per category. Skill Graft swept Data Pipelines and Observability (5 of 5 each) and won 4 of 5 in GraphQL & REST, Kubernetes, and Stripe / Payments. The gap narrows in Frontend (2 wins, 3 ties) and closes in Auth & OAuth (2 wins each).

Data Pipelines: graft 5

Observability: graft 5

GraphQL & REST: graft 4 · vanilla 1

Build & Deploy: graft 3 · tie 2

Kubernetes: graft 4 · vanilla 1

Stripe / Payments: graft 4 · vanilla 1

Postgres: graft 3 · tie 2

Frontend: graft 2 · tie 3

ML Pipelines: graft 3 · tie 1 · vanilla 1

Auth & OAuth: graft 2 · tie 1 · vanilla 2

Context budget: graft vs. dump-everything

The naive way to give an agent specialist knowledge is to stuff every skill description into the system prompt. With 551 skills that's ~50K tokens of catalog noise the model has to read past before getting to your actual question. Graft does the opposite: the cascade picks the 4–8 specialists for this question and ships only those.

Naive dump: ~50K tokens. 551 skill names + descriptions + tags. Context-poisoning at scale.

Skill Graft: ~10–12K tokens. 4 full SKILL.md bodies + 4 catalog adjacencies. References loaded on demand if the agent decides it needs them.

Savings: ~75% fewer tokens. Plus better outputs (per the judges above). The cascade decides what's relevant; you don't pay attention tax on what isn't.
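
The savings figure follows directly from the two budgets. A quick worked check, using only the approximate token counts quoted in this post (the 500-token adjacent-entry figure is the estimate given earlier for name + description entries):

```typescript
// Percent of tokens saved relative to the naive dump.
function savingsPct(naiveTokens: number, graftTokens: number): number {
  return (100 * (naiveTokens - graftTokens)) / naiveTokens;
}

const NAIVE_DUMP = 50_000; // ~50K tokens for 551 catalog entries
const perEntry = Math.round(NAIVE_DUMP / 551); // average tokens per catalog entry

const bestCase = savingsPct(NAIVE_DUMP, 10_000);           // low end of the graft budget
const worstCase = savingsPct(NAIVE_DUMP, 12_000);          // high end of the graft budget
const withAdjacent = savingsPct(NAIVE_DUMP, 12_000 + 500); // adjacent entries counted on top

console.log({ perEntry, bestCase, worstCase, withAdjacent });
// { perEntry: 91, bestCase: 80, worstCase: 76, withAdjacent: 75 }
```

So the graft saves between 76% and 80% on skill bodies alone, and about 75% if the adjacent entries are counted on top of the high end.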

Scope of this study

One-shot Q&A. Each prompt gets one Sonnet 4.6 response (with up to 8 tool-use turns for on-demand reference loading). Real engineering involves iteration, tests, file edits — those live in /next-move and get their own measurement.
50 hand-written senior-engineering prompts across 10 categories (auth, payments, Postgres, K8s, ML, frontend, build, observability, data pipelines, GraphQL/REST), 5 prompts each. Not a fuzz test, not a leaderboard — a load-bearing sample of what a senior engineer Slacks at 9pm.
Two flagship judges, two vendors (Anthropic Opus 4.7, OpenAI gpt-5.5). Two-vendor agreement is the point: the effect isn't Anthropic-flavored taste. A larger judge ensemble would tighten confidence intervals and is on the to-do.
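
The "one response, up to 8 tool-use turns" setup in the first point can be pictured as a bounded loop. This is a synchronous toy sketch under stated assumptions: the `Model` and `Tool` types are hypothetical stand-ins for real API calls, and the behavior when the budget runs out is a guess, not what the bench runner does.

```typescript
type Step =
  | { kind: "tool"; name: string; input: string }
  | { kind: "final"; text: string };
type Model = (transcript: string[]) => Step; // stand-in for a model call
type Tool = (input: string) => string;       // stand-in for a tool handler

function runWithTools(
  model: Model,
  tools: Record<string, Tool>,
  maxToolTurns = 8,
): { text: string; toolTurns: number } {
  const transcript: string[] = [];
  let toolTurns = 0;
  while (true) {
    const step = model(transcript);
    if (step.kind === "final") return { text: step.text, toolTurns };
    // Assumption: once the budget is spent, stop instead of calling another tool.
    if (toolTurns >= maxToolTurns) {
      return { text: "[stopped: tool-turn budget exhausted]", toolTurns };
    }
    const tool = tools[step.name];
    transcript.push(tool ? tool(step.input) : `unknown tool: ${step.name}`);
    toolTurns++;
  }
}

// Toy model: loads two references, then answers.
const toyModel: Model = (transcript) =>
  transcript.length < 2
    ? { kind: "tool", name: "read_skill_reference", input: "references/example.md" }
    : { kind: "final", text: `answer after ${transcript.length} reference loads` };

const toyTools: Record<string, Tool> = {
  read_skill_reference: (path) => `contents of ${path}`,
};

const result = runWithTools(toyModel, toyTools);
```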

Reproduce

Scripts, prompts, raw outputs, and judge verdicts live in the public curiositech/windags-skills repo. Set ANTHROPIC_API_KEY and OPENAI_API_KEY, then:

git clone https://github.com/curiositech/windags-skills.git
cd windags-skills/scripts/bench
pnpm install

# 50 prompts × 2 conditions, Sonnet 4.6, max_tokens 32768
pnpm tsx runner-skill-graft-v2.ts --run sg-v2 --concurrency 4

# Judge with Opus 4.7; rerun with --provider, --model, and --tag set for gpt-5.5
pnpm tsx judge-pairs.ts runs/sg-v2 \
  --pair vanilla_v2,skill_graft_v2 \
  --provider anthropic --model claude-opus-4-7 --tag opus-4-7

# Bundle the JSON
pnpm tsx export-sg-v2.ts runs/sg-v2
