
The 191-Skill Quality Pass

We graded every Claude Code skill in our library against a 10-axis rubric. 163 were missing the same section. We fixed all of them — with hand-crafted content, not templates. Here's the rubric, the data, and the one section every skill should have.


We have 191 Claude Code skills. They cover everything from Drizzle migrations to Jungian psychology, from drone inspection to wedding photography, from pixel art to HIPAA compliance.

We wanted to know: if you applied a quality rubric to every single skill, what would you find? Not a vague "looks good" review — a structured evaluation against specific criteria, with grades, scores, and a concrete improvement path.

So we built the rubric, graded 30 skills as a pilot, discovered the universal gap, and then fixed all 191.


The Rubric

Our skill-grader evaluates skills on 10 axes, each scored A+ through F:

| Axis | Weight | What It Measures |
|------|--------|------------------|
| Description Quality | 2x | Does it follow [What] [When]. NOT for [Exclusions]? |
| Scope Discipline | 2x | Does it stay in its lane? Clear boundaries? |
| Progressive Disclosure | 1x | SKILL.md < 500 lines? Depth in /references? |
| Anti-Pattern Coverage | 1x | Does it teach what NOT to do? |
| Self-Contained Tools | 1x | Does it ship scripts, validators, templates? |
| Activation Precision | 1x | Does it trigger on the right queries? |
| Visual Artifacts | 1x | Mermaid diagrams, ASCII decision trees? |
| Output Contracts | 1x | Does it declare what it produces? |
| Temporal Awareness | 1x | Are dates, versions, and APIs current? |
| Documentation Quality | 1x | Clear, scannable, well-structured? |

The first two axes get double weight because they determine whether the skill even activates correctly. A brilliant skill with a bad description never fires.
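As a rough illustration, the weighted average works like this. The letter-to-points mapping and the snake_case axis keys below are my assumptions for the sketch, not skill-grader's actual scale:

```python
# Sketch of weighted rubric scoring. GRADE_POINTS and the axis key names
# are assumptions; only the 2x/1x weights come from the rubric above.
GRADE_POINTS = {"A+": 4.3, "A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

AXIS_WEIGHTS = {
    "description_quality": 2,    # double-weighted: gates activation
    "scope_discipline": 2,       # double-weighted: gates activation
    "progressive_disclosure": 1,
    "anti_pattern_coverage": 1,
    "self_contained_tools": 1,
    "activation_precision": 1,
    "visual_artifacts": 1,
    "output_contracts": 1,
    "temporal_awareness": 1,
    "documentation_quality": 1,
}

def weighted_score(grades: dict[str, str]) -> float:
    """Average the letter grades, counting the double-weighted axes twice."""
    total = sum(GRADE_POINTS[grades[axis]] * weight
                for axis, weight in AXIS_WEIGHTS.items())
    return total / sum(AXIS_WEIGHTS.values())
```

With this scheme, an F on Description Quality costs twice as much as an F on Visual Artifacts, which matches the rationale: a skill that never fires loses all of its value.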


The Pilot: 30 Skills

We started by grading 30 skills across five categories:

Core Engineering (6): error-handling-patterns, dependency-management, performance-profiling, typescript-advanced-patterns, form-validation-architect, modern-auth-2026

Testing & Quality (6): output-contract-enforcer, playwright-e2e-tester, code-necromancer, monorepo-management, git-workflow-expert, microservices-patterns

Backend Infrastructure (6): drizzle-migrations, rest-api-design, openapi-spec-writer, data-pipeline-engineer, supabase-admin, websocket-streaming

Frontend & UX (6): postgresql-optimization, background-job-orchestrator, mobile-ux-optimizer, ux-friction-analyzer, nextjs-app-router-expert, react-performance-optimizer

DevOps & Security (6): github-actions-pipeline-builder, terraform-iac-expert, docker-containerization, site-reliability-engineer, security-auditor, technical-writer

The pilot surfaced a consistent gap, so we audited the full library.


What We Found

Evaluation Scorecard

| Criterion | Baseline | W/ DAGs |
|-----------|----------|---------|
| Description Quality | 7 | 8 |
| Anti-Pattern Coverage | 5 | 7 |
| Output Contracts | 1 | 10 |
| Progressive Disclosure | 7 | 7 |
| Self-Contained Tools | 6 | 6 |
| **Total** | **26** | **38** |

Three axes scored well across the board: descriptions were generally good, progressive disclosure was reasonable, and most skills had adequate scope boundaries.

But one axis was catastrophic.

85% of skills had no Output Contract

163 out of 191 skills had no section declaring what they produce. They'd tell you when to use them. They'd show you patterns and anti-patterns. They'd give you code examples. But they never said: "When this skill is done, here's what you'll have."

This matters more than it sounds.

When Claude activates a skill, it loads the SKILL.md into context and follows its instructions. If those instructions never define the deliverables, the output is... whatever Claude feels like producing. Sometimes that's a complete solution. Sometimes it's a partial explanation. The skill can't enforce consistency because it never declared what "done" looks like.

Compare:

Without an output contract:

"Optimize React apps for 60fps performance. Implements memoization, virtualization, code splitting, bundle optimization."

Claude knows the topic but not the deliverables. It might produce a code snippet, or a tutorial, or a lecture about when useMemo is appropriate.

With an output contract:

This skill produces:

  • Profiler analysis identifying slow components with render time measurements
  • Optimization code with specific useMemo/useCallback/React.memo additions and rationale
  • Bundle analysis showing size reduction in KB and what was split or removed
  • Verification plan describing how to confirm the optimization worked

Now Claude knows the checklist. Four deliverables. Each one concrete. Skip one and the output is visibly incomplete.

19% lacked Common Mistakes

36 skills had no anti-pattern coverage at all. These include domains where mistakes are most dangerous — auth, migrations, compliance, recovery apps, legal tools. Skills without anti-patterns teach the happy path and leave you to discover the traps on your own.


The Fix

We ran it in two waves.

Wave 1: The 30-Skill Pilot — we hand-crafted Output Contracts and Common Mistakes sections for the pilot batch. Each contract was written by reading the skill, understanding its domain, and specifying 3-5 concrete deliverables. Seven skills also got Common Mistakes tables. 27 skills improved; 3 already had contracts.

Wave 2: The Full Library — we built a categorization engine that classifies skills into 22 domain categories (backend, database, devops, security, testing, ML, frontend, design, visual, career, business, health, legal, geospatial, and more), then applies category-appropriate Output Contract templates. But templates alone aren't enough — we then hand-reviewed and rewrote 81 contracts where the automatic categorization produced a mismatch. A music visualization tool shouldn't get an API endpoint contract. A clinical reasoning skill shouldn't get a mobile UI contract.

Results

| Metric | Before | After |
|--------|--------|-------|
| Skills with Output Contract | 28 | 191 |
| Output Contract coverage | 15% | 100% |
| Skills improved | 0 | 163 |
| Hand-crafted rewrites | 0 | 81 |
| Total lines added | 0 | 1,555 |
| Lines per skill (avg) | 0 | +9.5 |
| Domain categories covered | — | 22 |

Every skill in the library now declares its deliverables.


The Meta-Skill Ecosystem

This pass was possible because of four complementary meta-skills:

skill-grader — The rubric. 10 axes, letter grades A+ through F, weighted scoring. It's designed for mechanical evaluation: a sub-agent or non-expert can grade a skill consistently without deep domain knowledge. It told us that Output Contracts were the weakest axis.

skill-architect — The design manual. 13 reference documents covering progressive disclosure, anti-patterns, knowledge engineering, and visual artifacts. It has a scoring rubric too, but at the 0-10 quantitative level. When we needed to know what a good output contract looks like, we consulted the architect.

skill-coach — The implementation guide. Step-by-step creation workflow with validation scripts. When we needed to verify our additions didn't break frontmatter or structure, the coach's validators caught it.

skill-logger — The measurement layer. If we wanted to track whether output contracts actually improve Claude's behavior over time, the logger's quality scoring framework (completion, efficiency, output quality, user satisfaction) is how we'd measure it.
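A hypothetical sketch of that four-factor score, assuming an equal-weight average on a 0-10 scale (the weights and scale are my guesses, not skill-logger's spec):

```python
# Hypothetical four-factor quality score for one skill run. The equal
# weighting and 0-10 scale are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class SkillRun:
    completion: float         # did the skill finish its deliverables? (0-10)
    efficiency: float         # tokens/steps relative to expectation (0-10)
    output_quality: float     # graded quality of the produced artifacts (0-10)
    user_satisfaction: float  # explicit or inferred user feedback (0-10)

    def quality_score(self) -> float:
        """Equal-weight average of the four factors."""
        return (self.completion + self.efficiency
                + self.output_quality + self.user_satisfaction) / 4
```

Comparing average scores for runs before and after the Output Contract pass would give a concrete answer to "did the contracts help?"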

None of these are fancy. They're just skills — SKILL.md files with instructions and reference material. The same thing you'd install from a gallery. But together they form a quality pipeline: grade → identify gaps → design fixes → validate → measure impact.


What This Means for Skill Authors

If you're writing Claude Code skills, here's the takeaway:

1. Add an Output Contract

Every skill should declare what it produces. Format:

## Output Contract

This skill produces:
- **Deliverable 1** with specific details about format and content
- **Deliverable 2** with what makes it useful
- **Deliverable 3** with how to verify it's correct

Be specific. "Code" is not a deliverable. "Multi-stage Dockerfile optimized for layer caching, minimal image size, and security" is.
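A minimal check for this section can be automated. The function below is my sketch (the heading string follows the format above; the three-bullet minimum echoes the 3-5 deliverable guideline), not part of any official tooling:

```python
# Minimal validator: does a SKILL.md declare at least three deliverables
# under an "## Output Contract" heading? Heuristics are assumptions.
import re

def has_output_contract(skill_md: str) -> bool:
    """True if the text has an '## Output Contract' section containing
    at least three bulleted deliverables."""
    match = re.search(r"^## Output Contract\s*$(.*?)(?=^## |\Z)",
                      skill_md, re.MULTILINE | re.DOTALL)
    if not match:
        return False
    bullets = re.findall(r"^\s*[-*] ", match.group(1), re.MULTILINE)
    return len(bullets) >= 3
```

Run against a skills directory, a check like this is what turns "85% missing" from an anecdote into a measurable, enforceable number.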

2. Add Common Mistakes

Every skill should teach what NOT to do. Format:

## Common Mistakes

| Mistake | Symptom | Fix |
|---------|---------|-----|
| Specific thing people do wrong | What goes wrong when they do it | How to do it right |

The table format is important. It's scannable. Claude can reference it quickly. And it encodes expert knowledge that the skill can then prevent rather than repair.

3. Follow the Description Formula

[What it does] [When to use it — specific trigger phrases]. NOT for [explicit exclusions with alternatives].

The NOT clause matters more as your skill library grows. With 5 skills, false positives are rare. With 191, they're constant.
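The formula is also lintable. These checks are heuristics I'm assuming for illustration — the trigger phrases and length cutoff are not the skill-grader's real rules:

```python
# Rough lint for the description formula. The trigger-phrase list and
# 500-character cutoff are assumed heuristics, not official rules.
def lint_description(description: str) -> list[str]:
    problems = []
    if "NOT for" not in description:
        problems.append("missing explicit 'NOT for' exclusion clause")
    if len(description) > 500:
        problems.append("description too long to scan quickly")
    if not any(phrase in description.lower()
               for phrase in ("use when", "when you", "when the")):
        problems.append("no obvious trigger phrasing ('use when ...')")
    return problems
```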

4. Stay Under 500 Lines

If your SKILL.md exceeds 500 lines, move depth to /references/. The skill body should be procedural — do this, then this, watch out for that. Long-form explanations, code examples, and configuration details belong in reference files that load on demand.
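Auditing this across a library is a few lines. The sketch below assumes skills live in per-skill directories each containing a SKILL.md; adjust the path layout to match your repo:

```python
# Repo-wide line-count audit. The per-skill-directory layout with a
# SKILL.md in each is an assumed convention.
from pathlib import Path

def oversized_skills(root: str, limit: int = 500) -> list[tuple[str, int]]:
    """Return (path, line_count) for every SKILL.md over the limit,
    worst offenders first."""
    results = []
    for skill_md in Path(root).rglob("SKILL.md"):
        lines = skill_md.read_text(encoding="utf-8").count("\n") + 1
        if lines > limit:
            results.append((str(skill_md), lines))
    return sorted(results, key=lambda item: -item[1])
```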


What's Next

This was a structural pass — adding the one section that was universally missing. The next passes go deeper:

  • Description optimization — running eval loops to test activation rates across all 191 skills
  • Progressive disclosure audit — skills over 500 lines need content moved to references
  • Visual artifact pass — adding decision trees and Mermaid diagrams where workflows are currently described in prose
  • Impact measurement — comparing skill output quality before and after Output Contracts using skill-logger's framework

Each of these passes uses the same pattern: grade → identify → fix → verify. The rubric is the constant. The skills improve.


The improvement scripts are at corpus/scripts/batch_improve_all.py. The 10-axis rubric is defined in skill-grader. The meta-skill ecosystem — architect, grader, coach, logger — is available at someclaudeskills.com.