
The 191-Skill Quality Pass

We graded every Claude Code skill in our library against a 10-axis rubric. 163 were missing the same section. We fixed all of them — with hand-crafted content, not templates. Here's the rubric, the data, and the one section every skill should have.


We have 191 Claude Code skills. They cover everything from Drizzle migrations to Jungian psychology, from drone inspection to wedding photography, from pixel art to HIPAA compliance.

We wanted to know: if you applied a quality rubric to every single skill, what would you find? Not a vague "looks good" review — a structured evaluation against specific criteria, with grades, scores, and a concrete improvement path.

So we built the rubric, graded 30 skills as a pilot, discovered the universal gap, and then fixed all 191.


The Rubric

Our skill-grader evaluates skills on 10 axes, each scored A+ through F:

| Axis | Weight | What It Measures |
|------|--------|------------------|
| Description Quality | 2x | Does it follow [What] [When]. NOT for [Exclusions]? |
| Scope Discipline | 2x | Does it stay in its lane? Clear boundaries? |
| Progressive Disclosure | 1x | SKILL.md < 500 lines? Depth in /references? |
| Anti-Pattern Coverage | 1x | Does it teach what NOT to do? |
| Self-Contained Tools | 1x | Does it ship scripts, validators, templates? |
| Activation Precision | 1x | Does it trigger on the right queries? |
| Visual Artifacts | 1x | Mermaid diagrams, ASCII decision trees? |
| Output Contracts | 1x | Does it declare what it produces? |
| Temporal Awareness | 1x | Are dates, versions, and APIs current? |
| Documentation Quality | 1x | Clear, scannable, well-structured? |

The first two axes get double weight because they determine whether the skill even activates correctly. A brilliant skill with a bad description never fires.
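As a rough illustration, the weighted average works like this. The letter-to-points mapping and the snake_case axis keys below are my assumptions for the sketch, not skill-grader's actual scale:

```python
# Sketch of weighted rubric scoring. GRADE_POINTS and the axis key names
# are assumptions; only the 2x/1x weights come from the rubric above.
GRADE_POINTS = {"A+": 4.3, "A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0, "F": 0.0}

AXIS_WEIGHTS = {
    "description_quality": 2,    # double-weighted: gates activation
    "scope_discipline": 2,       # double-weighted: gates activation
    "progressive_disclosure": 1,
    "anti_pattern_coverage": 1,
    "self_contained_tools": 1,
    "activation_precision": 1,
    "visual_artifacts": 1,
    "output_contracts": 1,
    "temporal_awareness": 1,
    "documentation_quality": 1,
}

def weighted_score(grades: dict[str, str]) -> float:
    """Average the letter grades, counting the double-weighted axes twice."""
    total = sum(GRADE_POINTS[grades[axis]] * weight
                for axis, weight in AXIS_WEIGHTS.items())
    return total / sum(AXIS_WEIGHTS.values())
```

With this scheme, an F on Description Quality costs twice as much as an F on Visual Artifacts, which matches the rationale: a skill that never fires loses all of its value.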


The Pilot: 30 Skills

We started by grading 30 skills across five categories:

Core Engineering (6): error-handling-patterns, dependency-management, performance-profiling, typescript-advanced-patterns, form-validation-architect, modern-auth-2026

Testing & Quality (6): output-contract-enforcer, playwright-e2e-tester, code-necromancer, monorepo-management, git-workflow-expert, microservices-patterns

Backend Infrastructure (6): drizzle-migrations, rest-api-design, openapi-spec-writer, data-pipeline-engineer, supabase-admin, websocket-streaming

Frontend & UX (6): postgresql-optimization, background-job-orchestrator, mobile-ux-optimizer, ux-friction-analyzer, nextjs-app-router-expert, react-performance-optimizer

DevOps & Security (6): github-actions-pipeline-builder, terraform-iac-expert, docker-containerization, site-reliability-engineer, security-auditor, technical-writer

The pilot surfaced a consistent gap, so we audited the full library.


What We Found

Evaluation Scorecard

| Criterion | Baseline | W/ DAGs |
|-----------|----------|---------|
| Description Quality | 7 | 8 |
| Anti-Pattern Coverage | 5 | 7 |
| Output Contracts | 1 | 10 |
| Progressive Disclosure | 7 | 7 |
| Self-Contained Tools | 6 | 6 |
| **Total** | **26** | **38** |

Three axes scored well across the board: descriptions were generally good, progressive disclosure was reasonable, and most skills had adequate scope boundaries.

But one axis was catastrophic.

85% of skills had no Output Contract

163 out of 191 skills had no section declaring what they produce. They'd tell you when to use them. They'd show you patterns and anti-patterns. They'd give you code examples. But they never said: "When this skill is done, here's what you'll have."

This matters more than it sounds.

When Claude activates a skill, it loads the SKILL.md into context and follows its instructions. If those instructions never define the deliverables, the output is... whatever Claude feels like producing. Sometimes that's a complete solution. Sometimes it's a partial explanation. The skill can't enforce consistency because it never declared what "done" looks like.

Compare:

Without an output contract:

"Optimize React apps for 60fps performance. Implements memoization, virtualization, code splitting, bundle optimization."

Claude knows the topic but not the deliverables. It might produce a code snippet, or a tutorial, or a lecture about when useMemo is appropriate.

With an output contract:

This skill produces:

  • Profiler analysis identifying slow components with render time measurements
  • Optimization code with specific useMemo/useCallback/React.memo additions and rationale
  • Bundle analysis showing size reduction in KB and what was split or removed
  • Verification plan describing how to confirm the optimization worked

Now Claude knows the checklist. Four deliverables. Each one concrete. Skip one and the output is visibly incomplete.

19% lacked Common Mistakes

36 skills had no anti-pattern coverage at all. These include domains where mistakes are most dangerous — auth, migrations, compliance, recovery apps, legal tools. Skills without anti-patterns teach the happy path and leave you to discover the traps on your own.


The Fix

We ran it in two waves.

Wave 1: The 30-Skill Pilot — we hand-crafted Output Contracts and Common Mistakes sections for the pilot batch. Each contract was written by reading the skill, understanding its domain, and specifying 3-5 concrete deliverables. Seven skills also got Common Mistakes tables. 27 skills improved; 3 already had contracts.

Wave 2: The Full Library — we built a categorization engine that classifies skills into 22 domain categories (backend, database, devops, security, testing, ML, frontend, design, visual, career, business, health, legal, geospatial, and more), then applies category-appropriate Output Contract templates. But templates alone aren't enough — we then hand-reviewed and rewrote 81 contracts where the automatic categorization produced a mismatch. A music visualization tool shouldn't get an API endpoint contract. A clinical reasoning skill shouldn't get a mobile UI contract.

Results

| Metric | Before | After |
|--------|--------|-------|
| Skills with Output Contract | 28 | 191 |
| Output Contract coverage | 15% | 100% |
| Skills improved | 0 | 163 |
| Hand-crafted rewrites | 0 | 81 |
| Total lines added | 0 | 1,555 |
| Lines per skill (avg) | 0 | +9.5 |
| Domain categories covered | — | 22 |

Every skill in the library now declares its deliverables.


The Meta-Skill Ecosystem

This pass was possible because of four complementary meta-skills:

skill-grader — The rubric. 10 axes, letter grades A+ through F, weighted scoring. It's designed for mechanical evaluation: a sub-agent or non-expert can grade a skill consistently without deep domain knowledge. It told us that Output Contracts were the weakest axis.

skill-architect — The design manual. 13 reference documents covering progressive disclosure, anti-patterns, knowledge engineering, and visual artifacts. It has a scoring rubric too, but at the 0-10 quantitative level. When we needed to know what a good output contract looks like, we consulted the architect.

skill-coach — The implementation guide. Step-by-step creation workflow with validation scripts. When we needed to verify our additions didn't break frontmatter or structure, the coach's validators caught it.

skill-logger — The measurement layer. If we wanted to track whether output contracts actually improve Claude's behavior over time, the logger's quality scoring framework (completion, efficiency, output quality, user satisfaction) is how we'd measure it.
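A hypothetical sketch of that four-factor score, assuming an equal-weight average on a 0-10 scale (the weights and scale are my guesses, not skill-logger's spec):

```python
# Hypothetical four-factor quality score for one skill run. The equal
# weighting and 0-10 scale are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class SkillRun:
    completion: float         # did the skill finish its deliverables? (0-10)
    efficiency: float         # tokens/steps relative to expectation (0-10)
    output_quality: float     # graded quality of the produced artifacts (0-10)
    user_satisfaction: float  # explicit or inferred user feedback (0-10)

    def quality_score(self) -> float:
        """Equal-weight average of the four factors."""
        return (self.completion + self.efficiency
                + self.output_quality + self.user_satisfaction) / 4
```

Comparing average scores for runs before and after the Output Contract pass would give a concrete answer to "did the contracts help?"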

None of these are fancy. They're just skills — SKILL.md files with instructions and reference material. The same thing you'd install from a gallery. But together they form a quality pipeline: grade → identify gaps → design fixes → validate → measure impact.


What This Means for Skill Authors

If you're writing Claude Code skills, here's the takeaway:

1. Add an Output Contract

Every skill should declare what it produces. Format:

## Output Contract

This skill produces:
- **Deliverable 1** with specific details about format and content
- **Deliverable 2** with what makes it useful
- **Deliverable 3** with how to verify it's correct

Be specific. "Code" is not a deliverable. "Multi-stage Dockerfile optimized for layer caching, minimal image size, and security" is.
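A minimal check for this section can be automated. The function below is my sketch (the heading string follows the format above; the three-bullet minimum echoes the 3-5 deliverable guideline), not part of any official tooling:

```python
# Minimal validator: does a SKILL.md declare at least three deliverables
# under an "## Output Contract" heading? Heuristics are assumptions.
import re

def has_output_contract(skill_md: str) -> bool:
    """True if the text has an '## Output Contract' section containing
    at least three bulleted deliverables."""
    match = re.search(r"^## Output Contract\s*$(.*?)(?=^## |\Z)",
                      skill_md, re.MULTILINE | re.DOTALL)
    if not match:
        return False
    bullets = re.findall(r"^\s*[-*] ", match.group(1), re.MULTILINE)
    return len(bullets) >= 3
```

Run against a skills directory, a check like this is what turns "85% missing" from an anecdote into a measurable, enforceable number.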

2. Add Common Mistakes

Every skill should teach what NOT to do. Format:

## Common Mistakes

| Mistake | Symptom | Fix |
|---------|---------|-----|
| Specific thing people do wrong | What goes wrong when they do it | How to do it right |

The table format is important. It's scannable. Claude can reference it quickly. And it encodes expert knowledge that the skill can then prevent rather than repair.

3. Follow the Description Formula

[What it does] [When to use it — specific trigger phrases]. NOT for [explicit exclusions with alternatives].

The NOT clause matters more as your skill library grows. With 5 skills, false positives are rare. With 191, they're constant.
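The formula is also lintable. These checks are heuristics I'm assuming for illustration — the trigger phrases and length cutoff are not the skill-grader's real rules:

```python
# Rough lint for the description formula. The trigger-phrase list and
# 500-character cutoff are assumed heuristics, not official rules.
def lint_description(description: str) -> list[str]:
    problems = []
    if "NOT for" not in description:
        problems.append("missing explicit 'NOT for' exclusion clause")
    if len(description) > 500:
        problems.append("description too long to scan quickly")
    if not any(phrase in description.lower()
               for phrase in ("use when", "when you", "when the")):
        problems.append("no obvious trigger phrasing ('use when ...')")
    return problems
```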

4. Stay Under 500 Lines

If your SKILL.md exceeds 500 lines, move depth to /references/. The skill body should be procedural — do this, then this, watch out for that. Long-form explanations, code examples, and configuration details belong in reference files that load on demand.
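Auditing this across a library is a few lines. The sketch below assumes skills live in per-skill directories each containing a SKILL.md; adjust the path layout to match your repo:

```python
# Repo-wide line-count audit. The per-skill-directory layout with a
# SKILL.md in each is an assumed convention.
from pathlib import Path

def oversized_skills(root: str, limit: int = 500) -> list[tuple[str, int]]:
    """Return (path, line_count) for every SKILL.md over the limit,
    worst offenders first."""
    results = []
    for skill_md in Path(root).rglob("SKILL.md"):
        lines = skill_md.read_text(encoding="utf-8").count("\n") + 1
        if lines > limit:
            results.append((str(skill_md), lines))
    return sorted(results, key=lambda item: -item[1])
```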


What's Next

This was a structural pass — adding the one section that was universally missing. The next passes go deeper:

  • Description optimization — running eval loops to test activation rates across all 191 skills
  • Progressive disclosure audit — skills over 500 lines need content moved to references
  • Visual artifact pass — adding decision trees and Mermaid diagrams where workflows are currently described in prose
  • Impact measurement — comparing skill output quality before and after Output Contracts using skill-logger's framework

Each of these passes uses the same pattern: grade → identify → fix → verify. The rubric is the constant. The skills improve.


The improvement scripts are at corpus/scripts/batch_improve_all.py. The 10-axis rubric is defined in skill-grader. The meta-skill ecosystem — architect, grader, coach, logger — is available at someclaudeskills.com.