The 191-Skill Quality Pass
We graded every Claude Code skill in our library against a 10-axis rubric. 163 were missing the same section. We fixed all of them — with hand-crafted content, not templates. Here's the rubric, the data, and the one section every skill should have.
We have 191 Claude Code skills. They cover everything from Drizzle migrations to Jungian psychology, from drone inspection to wedding photography, from pixel art to HIPAA compliance.
We wanted to know: if you applied a quality rubric to every single skill, what would you find? Not a vague "looks good" review — a structured evaluation against specific criteria, with grades, scores, and a concrete improvement path.
So we built the rubric, graded 30 skills as a pilot, discovered the universal gap, and then fixed all 191.
The Rubric
Our skill-grader evaluates skills on 10 axes, each scored A+ through F:
| Axis | Weight | What It Measures |
|---|---|---|
| Description Quality | 2x | Does it follow [What] [When]. NOT for [Exclusions]? |
| Scope Discipline | 2x | Does it stay in its lane? Clear boundaries? |
| Progressive Disclosure | 1x | SKILL.md < 500 lines? Depth in /references? |
| Anti-Pattern Coverage | 1x | Does it teach what NOT to do? |
| Self-Contained Tools | 1x | Does it ship scripts, validators, templates? |
| Activation Precision | 1x | Does it trigger on the right queries? |
| Visual Artifacts | 1x | Mermaid diagrams, ASCII decision trees? |
| Output Contracts | 1x | Does it declare what it produces? |
| Temporal Awareness | 1x | Are dates, versions, and APIs current? |
| Documentation Quality | 1x | Clear, scannable, well-structured? |
The first two axes get double weight because they determine whether the skill even activates correctly. A brilliant skill with a bad description never fires.
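The weighting scheme can be sketched in a few lines of Python. The GPA-style letter-to-number mapping here is an assumption for illustration; skill-grader's actual numeric scale may differ.

```python
# Sketch of the weighted rubric scoring. The grade-point mapping is an
# assumed GPA-style scale, not necessarily skill-grader's internal one.
GRADE_POINTS = {"A+": 4.3, "A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0,
                "B-": 2.7, "C+": 2.3, "C": 2.0, "C-": 1.7, "D": 1.0, "F": 0.0}

AXIS_WEIGHTS = {
    "Description Quality": 2,   # double weight: gates whether the skill fires
    "Scope Discipline": 2,      # double weight: gates whether the skill fires
    "Progressive Disclosure": 1,
    "Anti-Pattern Coverage": 1,
    "Self-Contained Tools": 1,
    "Activation Precision": 1,
    "Visual Artifacts": 1,
    "Output Contracts": 1,
    "Temporal Awareness": 1,
    "Documentation Quality": 1,
}

def weighted_score(grades: dict[str, str]) -> float:
    """Weighted mean of letter grades across the 10 axes (0.0-4.3 scale)."""
    total = sum(GRADE_POINTS[grades[axis]] * weight
                for axis, weight in AXIS_WEIGHTS.items())
    return total / sum(AXIS_WEIGHTS.values())
```

Note how the double weighting plays out: a skill that scores straight A's except an F on Description Quality loses a sixth of its total score from that one axis.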
The Pilot: 30 Skills
We started by grading 30 skills across six categories:
Core Engineering (6): error-handling-patterns, dependency-management, performance-profiling, typescript-advanced-patterns, form-validation-architect, modern-auth-2026
Testing & Quality (6): output-contract-enforcer, playwright-e2e-tester, code-necromancer, monorepo-management, git-workflow-expert, microservices-patterns
Backend Infrastructure (6): drizzle-migrations, rest-api-design, openapi-spec-writer, data-pipeline-engineer, supabase-admin, websocket-streaming
Frontend & UX (6): postgresql-optimization, background-job-orchestrator, mobile-ux-optimizer, ux-friction-analyzer, nextjs-app-router-expert, react-performance-optimizer
DevOps & Security (6): github-actions-pipeline-builder, terraform-iac-expert, docker-containerization, site-reliability-engineer, security-auditor, technical-writer
The pilot confirmed a pattern. We audited the full library.
What We Found
Evaluation Scorecard
| Criterion | Baseline | After |
|---|---|---|
| Description Quality | 7 | 8 |
| Anti-Pattern Coverage | 5 | 7 |
| Output Contracts | 1 | 10 |
| Progressive Disclosure | 7 | 7 |
| Self-Contained Tools | 6 | 6 |
| Total | 26 | 38 |
Three axes scored well across the board: descriptions were generally good, progressive disclosure was reasonable, and most skills had adequate scope boundaries.
But one axis was catastrophic.
85% of skills had no Output Contract
163 out of 191 skills had no section declaring what they produce. They'd tell you when to use them. They'd show you patterns and anti-patterns. They'd give you code examples. But they never said: "When this skill is done, here's what you'll have."
This matters more than it sounds.
When Claude activates a skill, it loads the SKILL.md into context and follows its instructions. If those instructions never define the deliverables, the output is... whatever Claude feels like producing. Sometimes that's a complete solution. Sometimes it's a partial explanation. The skill can't enforce consistency because it never declared what "done" looks like.
Compare:
Without an output contract:
"Optimize React apps for 60fps performance. Implements memoization, virtualization, code splitting, bundle optimization."
Claude knows the topic but not the deliverables. It might produce a code snippet, or a tutorial, or a lecture about when useMemo is appropriate.
With an output contract:
This skill produces:
- Profiler analysis identifying slow components with render time measurements
- Optimization code with specific useMemo/useCallback/React.memo additions and rationale
- Bundle analysis showing size reduction in KB and what was split or removed
- Verification plan describing how to confirm the optimization worked
Now Claude knows the checklist. Four deliverables. Each one concrete. Skip one and the output is visibly incomplete.
19% lacked Common Mistakes
36 skills had no anti-pattern coverage at all. These include domains where mistakes are most dangerous — auth, migrations, compliance, recovery apps, legal tools. Skills without anti-patterns teach the happy path and leave you to discover the traps on your own.
The Fix
We ran it in two waves.
Wave 1: The 30-Skill Pilot — hand-crafted Output Contracts and Common Mistakes for the pilot batch. Each contract was written by reading the skill, understanding its domain, and specifying 3-5 concrete deliverables. Seven skills also got Common Mistakes tables. 27 improved, 3 already had contracts.
Wave 2: The Full Library — we built a categorization engine that classifies skills into 22 domain categories (backend, database, devops, security, testing, ML, frontend, design, visual, career, business, health, legal, geospatial, and more), then applies category-appropriate Output Contract templates. But templates alone aren't enough — we then hand-reviewed and rewrote 81 contracts where the automatic categorization produced a mismatch. A music visualization tool shouldn't get an API endpoint contract. A clinical reasoning skill shouldn't get a mobile UI contract.
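A minimal sketch of such a categorization engine, assuming simple keyword matching. The categories, keywords, and template text below are illustrative stand-ins, not the real engine's 22-category taxonomy:

```python
# Hypothetical keyword-based categorizer with per-category contract templates.
# Naive substring matching ("rest" matches "interesting") is exactly the kind
# of mismatch that forced the 81 hand-crafted rewrites described above.
CATEGORY_KEYWORDS = {
    "backend": ["api", "endpoint", "rest", "server"],
    "database": ["migration", "postgres", "schema", "query"],
    "frontend": ["react", "component", "render", "css"],
}

CONTRACT_TEMPLATES = {
    "backend": "- Endpoint implementation with request/response examples",
    "database": "- Migration files with a tested rollback path",
    "frontend": "- Component code with render-performance notes",
    "general": "- Concrete deliverables specific to this domain",
}

def categorize(description: str) -> str:
    """Pick the category with the most keyword hits; fall back to general."""
    words = description.lower()
    scores = {cat: sum(kw in words for kw in kws)
              for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

def contract_for(description: str) -> str:
    """Render a category-appropriate Output Contract stub."""
    return ("## Output Contract\nThis skill produces:\n"
            + CONTRACT_TEMPLATES[categorize(description)])
```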
Results
| Metric | Before | After |
|---|---|---|
| Skills with Output Contract | 28 | 191 |
| Output Contract coverage | 15% | 100% |
| Skills improved | 0 | 163 |
| Hand-crafted rewrites | 0 | 81 |
| Total lines added | 0 | 1,555 |
| Lines per skill (avg) | 0 | +9.5 |
| Domain categories covered | — | 22 |
Every skill in the library now declares its deliverables.
The Meta-Skill Ecosystem
This pass was possible because of four complementary meta-skills:
skill-grader — The rubric. 10 axes, letter grades A+ through F, weighted scoring. It's designed for mechanical evaluation: a sub-agent or non-expert can grade a skill consistently without deep domain knowledge. It told us that Output Contracts were the weakest axis.
skill-architect — The design manual. 13 reference documents covering progressive disclosure, anti-patterns, knowledge engineering, and visual artifacts. It has a scoring rubric too, but at the 0-10 quantitative level. When we needed to know what a good output contract looks like, we consulted the architect.
skill-coach — The implementation guide. Step-by-step creation workflow with validation scripts. When we needed to verify our additions didn't break frontmatter or structure, the coach's validators caught it.
skill-logger — The measurement layer. If we wanted to track whether output contracts actually improve Claude's behavior over time, the logger's quality scoring framework (completion, efficiency, output quality, user satisfaction) is how we'd measure it.
None of these are fancy. They're just skills — SKILL.md files with instructions and reference material. The same thing you'd install from a gallery. But together they form a quality pipeline: grade → identify gaps → design fixes → validate → measure impact.
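A frontmatter check in the spirit of the coach's validators might look like the sketch below. The required fields (`name`, `description`) are the usual skill frontmatter fields; the real validators are presumably stricter:

```python
# Hypothetical structural validator: confirm SKILL.md still has intact
# YAML frontmatter with name and description after batch edits.
import re

def validate_frontmatter(text: str) -> list[str]:
    """Return a list of problems; an empty list means the frontmatter looks intact."""
    problems = []
    match = re.match(r"^---\n(.*?)\n---\n", text, re.DOTALL)
    if not match:
        return ["missing YAML frontmatter block"]
    fields = dict(line.split(":", 1)
                  for line in match.group(1).splitlines() if ":" in line)
    for required in ("name", "description"):
        if required not in fields or not fields[required].strip():
            problems.append(f"missing or empty '{required}' field")
    return problems
```

Cheap checks like this are what let a batch pass over 191 files run without silently corrupting any of them.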
What This Means for Skill Authors
If you're writing Claude Code skills, here's the takeaway:
1. Add an Output Contract
Every skill should declare what it produces. Format:
```markdown
## Output Contract

This skill produces:
- **Deliverable 1** with specific details about format and content
- **Deliverable 2** with what makes it useful
- **Deliverable 3** with how to verify it's correct
```
Be specific. "Code" is not a deliverable. "Multi-stage Dockerfile optimized for layer caching, minimal image size, and security" is.
2. Add Common Mistakes
Every skill should teach what NOT to do. Format:
```markdown
## Common Mistakes

| Mistake | Symptom | Fix |
|---------|---------|-----|
| Specific thing people do wrong | What goes wrong when they do it | How to do it right |
```
The table format is important. It's scannable. Claude can reference it quickly. And it encodes expert knowledge that the skill can then prevent rather than repair.
3. Follow the Description Formula
[What it does] [When to use it — specific trigger phrases]. NOT for [explicit exclusions with alternatives].
The NOT clause matters more as your skill library grows. With 5 skills, false positives are rare. With 191, they're constant.
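A rough lint for the formula can be automated. The specific checks below, including the 1024-character ceiling, are assumptions for illustration rather than an official spec:

```python
# Hypothetical description lint for the [What] [When]. NOT for [Exclusions] shape.
import re

def check_description(description: str) -> list[str]:
    """Return warnings for descriptions that drift from the formula."""
    warnings = []
    if "NOT for" not in description:
        warnings.append("no 'NOT for' exclusion clause")
    if not re.search(r"\bwhen\b", description, re.IGNORECASE):
        warnings.append("no 'when to use' trigger language")
    if len(description) > 1024:  # assumed ceiling for frontmatter descriptions
        warnings.append("description over 1024 characters")
    return warnings
```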
4. Stay Under 500 Lines
If your SKILL.md exceeds 500 lines, move depth to /references/. The skill body should be procedural — do this, then this, watch out for that. Long-form explanations, code examples, and configuration details belong in reference files that load on demand.
What's Next
This was a structural pass — adding the one section that was universally missing. The next passes go deeper:
- Description optimization — running eval loops to test activation rates across all 191 skills
- Progressive disclosure audit — skills over 500 lines need content moved to references
- Visual artifact pass — adding decision trees and Mermaid diagrams where workflows are currently described in prose
- Impact measurement — comparing skill output quality before and after Output Contracts using skill-logger's framework
Each of these passes uses the same pattern: grade → identify → fix → verify. The rubric is the constant. The skills improve.
The improvement scripts are at corpus/scripts/batch_improve_all.py. The 10-axis rubric is defined in skill-grader. The meta-skill ecosystem — architect, grader, coach, logger — is available at someclaudeskills.com.