Tags: skills, meta, anthropic, evaluation, recursion, composition

When Meta-Skills Collide

We built a cross-evaluation agent, pointed Anthropic's skill-creator and our skill-architect at each other, and recorded what happened. Real transcripts. Real diffs. Real analysis of what two different philosophies of skill-building value.


Two AI skill-creation tools. Each thinks it knows how to build skills. We pointed them at each other and recorded what happened.

One is Anthropic's, open-sourced under Apache 2.0. One is ours. Both are meta-skills -- skills whose entire job is to create, evaluate, and improve other skills. We built an agent to wield each one against the other, captured every word, and now we're going to show you exactly what they said.

No summaries. No paraphrasing. The transcripts are right here.


Thank You, Anthropic

This experiment exists because Anthropic open-sourced their skill creation tooling under the Apache 2.0 license. They didn't have to. The license permits reproduction, derivative works, and public display with attribution. We include their complete skill-creator with full provenance -- every file, every script, every agent definition.

For the latest version, go to their repository. Everything here is a snapshot from commit b0cbd3df, March 7, 2026.


The Contenders: See For Yourself

Before we tell you what we found, look at both skills. Not a summary -- the actual folders. Every file, with line counts. Click the file buttons to browse references, scripts, and agents.

Side-by-Side Comparison

Source: Anthropic (Apache 2.0), on GitHub.

Anthropic's skill-creator

skill-creator/                         (485 lines SKILL.md, 33,168 bytes)
├── SKILL.md                           485 lines
├── LICENSE.txt                        201 lines
├── agents/
│   ├── analyzer.md                    275 lines
│   ├── comparator.md                  203 lines
│   └── grader.md                      224 lines
├── assets/
│   └── eval_review.html                45 lines
├── eval-viewer/
│   ├── generate_review.py             312 lines
│   └── viewer.html                  1,247 lines
├── references/
│   └── schemas.md                     306 lines
└── scripts/
    ├── __init__.py                      0 lines
    ├── aggregate_benchmark.py         187 lines
    ├── generate_report.py              89 lines
    ├── improve_description.py         142 lines
    ├── package_skill.py                98 lines
    ├── quick_validate.py               72 lines
    ├── run_eval.py                    156 lines
    ├── run_loop.py                    234 lines
    └── utils.py                        47 lines

Total: 19 files. 9 Python scripts that actually run. 3 agent definitions for evaluation pipelines. An HTML eval viewer that generates standalone review pages. This is a factory floor.

WinDAGs' skill-architect

skill-architect/                       (503 lines SKILL.md, 23,647 bytes)
├── SKILL.md                           503 lines
├── CHANGELOG.md                       139 lines
├── README.md                          132 lines
├── agents/
│   └── cross-evaluator.md              87 lines
├── scripts/
│   ├── validate_mermaid.py            649 lines
│   ├── validate_skill.py              310 lines
│   ├── check_self_contained.py        210 lines
│   └── init_skill.py                  193 lines
└── references/
    ├── antipatterns.md                308 lines
    ├── claude-extension-taxonomy.md   344 lines
    ├── description-guide.md           188 lines
    ├── knowledge-engineering.md       290 lines
    ├── mcp-template.md                118 lines
    ├── plugin-architecture.md         220 lines
    ├── scoring-rubric.md               82 lines
    ├── self-contained-tools.md        209 lines
    ├── skill-composition.md            87 lines
    ├── skill-lifecycle.md              95 lines
    ├── subagent-design.md             248 lines
    ├── subagent-template.md           196 lines
    └── visual-artifacts.md            428 lines

Total: 22 files. 13 reference documents spanning knowledge engineering to plugin architecture. 4 Python scripts for validation and scaffolding. A scoring rubric. An anti-pattern catalog with shibboleth templates. This is a library.

What Skill-Creator's Scripts Actually Do

Anthropic shipped real tooling, not templates. Here's the pipeline:

| Script | Lines | What It Does |
|---|---|---|
| run_eval.py | 310 | Tests whether a skill's description triggers correctly. Spawns `claude -p` subprocesses for each eval query, captures whether the skill fired. JSON output. |
| run_loop.py | 328 | The main loop: run_eval + improve_description, iterating until all pass or max iterations. Tracks history, supports train/test split to prevent overfitting. |
| improve_description.py | 247 | Takes eval results and generates an improved description by calling `claude -p`. The improvement is guided by which queries failed to trigger. |
| aggregate_benchmark.py | 401 | Reads grading.json files from run directories, produces mean/stddev/min/max for each metric, computes deltas between with-skill and without-skill configurations. |
| generate_report.py | 326 | Generates a visual HTML report from run_loop output. Shows each description attempt with pass/fail for every test case, distinguishing train vs test. |
| package_skill.py | 136 | Creates a distributable .skill file (zip archive) from a skill folder. Validates frontmatter before packaging. |
| quick_validate.py | 102 | Checks frontmatter completeness: name, description, valid field names. Uses PyYAML. |
| utils.py | 47 | Shared SKILL.md parser -- extracts name, description, and full content from frontmatter. |
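The shared parser in utils.py is the foundation the other scripts build on: it pulls the name and description out of SKILL.md frontmatter. A minimal sketch of that kind of parser (the function name and the flat `key: value` handling are our assumptions, not Anthropic's code):

```python
import re

def parse_skill_md(text: str) -> dict:
    """Split a SKILL.md into frontmatter fields and body.

    Minimal sketch: assumes flat `key: value` frontmatter between
    `---` fences, which is enough for name/description extraction.
    """
    match = re.match(r"^---\n(.*?)\n---\n?(.*)$", text, re.DOTALL)
    if not match:
        # No frontmatter fence: treat the whole file as body.
        return {"frontmatter": {}, "body": text}
    fields = {}
    for line in match.group(1).splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return {"frontmatter": fields, "body": match.group(2)}
```

The real utils.py uses PyYAML (quick_validate.py declares that dependency), so nested YAML values would also parse; the regex split above is only the shape of the idea.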

The eval viewer (eval-viewer/generate_review.py, 312 lines + viewer.html, 1,247 lines) generates a standalone HTML page with two tabs: Outputs (review each test case, leave feedback) and Benchmark (quantitative comparison with pass rates, timing, token usage). It's a complete review workstation in a single HTML file.

The three agent definitions (agents/analyzer.md, agents/comparator.md, agents/grader.md) are prompts for specialized subagents. The grader evaluates assertions against outputs. The comparator does blind A/B comparison. The analyzer explains why one version beat another.

This is a factory floor. It's meant to be run, not read.


The Algebra

We have two operators and two artifacts. Here's the map of what we did:

Composition Algebra

[Interactive diagram -- click a node: SA₀ (7.2/10), SC₀ (4.7/10), SC ∘ SA (~8/10 est.), SA ∘ SC (~7/10 est.); edges labeled "SC evaluates", "SA evaluates", "commutes?"]

SA and SC are functions. SA(SC) means "skill-architect evaluates and improves skill-creator." The ∘ is function composition. Click any node above to see the file tree, frontmatter, and diffs.

The question that drives this entire experiment: what happens when you iterate?


The Critiques

Each skill got its own source folder, the target folder (read-only), and a writable output copy. Tools: Read, Write, Edit, Glob, Grep, Bash. (Anthropic's skill-creator ships with three specialized agent definitions — agents/analyzer.md, agents/comparator.md, agents/grader.md. SC read and loaded them as part of reading its own folder.)

```
claude -p "You are Skill Architect. Your source folder: {sa_path}.
           Evaluate and improve skill-creator at: {sc_path}.
           Write all improvements to: {output_path}." \
  --allowed-tools Read,Write,Edit,Glob,Grep,Bash(python:*) \
  --permission-mode bypassPermissions
```
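The invocation above is easy to wrap in a small harness. This sketch only builds and spawns that exact command; the function names are hypothetical, not part of either skill's tooling:

```python
import subprocess

def build_cross_eval_cmd(sa_path: str, sc_path: str, output_path: str) -> list[str]:
    """Build the `claude -p` argv for one cross-evaluation run.

    Mirrors the command shown above: one evaluator, one read-only
    target, one writable output copy.
    """
    prompt = (
        f"You are Skill Architect. Your source folder: {sa_path}. "
        f"Evaluate and improve skill-creator at: {sc_path}. "
        f"Write all improvements to: {output_path}."
    )
    return [
        "claude", "-p", prompt,
        "--allowed-tools", "Read,Write,Edit,Glob,Grep,Bash(python:*)",
        "--permission-mode", "bypassPermissions",
    ]

def run_cross_eval(sa_path: str, sc_path: str, output_path: str) -> int:
    # Spawn the agent and block until it finishes; return the exit code.
    return subprocess.run(build_cross_eval_cmd(sa_path, sc_path, output_path)).returncode
```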

Each agent read its own references and ran its scripts before scoring. Here's the initial scorecard and side-by-side breakdown:

Evaluation Scorecard

| Criterion | Baseline | WinDAGs |
|---|---|---|
| Frontmatter | 6 | 7 |
| Progressive Disclosure | 5 | 6 |
| Anti-Patterns | 4 | 4 |
| Visual Artifacts | 3 | 7 |
| Shibboleths | 7 | 4 |
| Self-Containment | 5 | 7 |
| Activation Quality | 6 | 8 |
| Total | 36 | 43 |

Layer 1: The Critiques

skill-architect evaluates skill-creator: 4.7/10

| Criterion | Score | Finding |
|---|---|---|
| Frontmatter | 5 | Missing allowed-tools, missing NOT clause |
| Progressive Disclosure | 6 | 485 lines, duplicated JSON schemas in body + references |
| Anti-Patterns | 5 | Commits 3 of 10 anti-patterns it teaches others to avoid |
| Visual Artifacts | 1 | Zero Mermaid diagrams for 4+ complex workflows |
| Shibboleths | 4 | Some implicit signals, no systematic encoding |
| Self-Containment | 7 | Scripts are real and functional; no requirements.txt |
| Activation Quality | 5 | Good adaptive behavior, but weak explicit triggers |

Verbatim from transcript

There are **zero** Mermaid diagrams or visual artifacts anywhere in the SKILL.md or its references.
The skill's own validation script (quick_validate.py) checks allowed-tools as a valid field, yet the skill itself doesn't use it.
The skill teaches others to avoid these exact anti-patterns but commits three of them itself.

What it changed

  • Added 2 Mermaid diagrams (lifecycle flowchart + eval sequence)
  • Added NOT clause: "NOT for installing skills, general coding, skill browsing"
  • Added allowed-tools: Read,Write,Edit,Bash,Glob,Grep,TodoWrite,WebFetch
  • Reorganized into 7 numbered Phases matching lifecycle diagram
  • Added Common Mistakes shibboleth table
  • Removed "Cool? Cool." and informal tone markers
Self-assessment
Confidence: 0.78
Best: The Mermaid diagrams transform a 485-line prose wall into a navigable workflow.
Gap: The tension between the skill's ambition and the 500-line budget.

What Skill-Architect Found in Skill-Creator

Score: 5.1/10 — then 6.0 → 8.0 with full folder access

The architect applied its seven-dimension rubric with forensic precision:

| Criterion | Score | Key Finding |
|---|---|---|
| Frontmatter | 6/10 | Missing NOT clause; description not precisely pushy |
| Progressive Disclosure | 5/10 | Near 500-line limit; platform-specific sections bloating SKILL.md |
| Anti-Patterns | 4/10 | No NOT clause, no Mermaid diagrams, platform bloat, no shibboleth section |
| Visual Artifacts | 3/10 | Only the directory tree; workflow, eval loop, and trigger optimization are prose |
| Shibboleths | 7/10 | Genuine domain expertise: trigger rates, when quantitative evals help, overfitting risk |
| Self-Containment | 5/10 | Scripts likely real but no explicit bundled resources list |
| Activation Quality | 6/10 | Correct triggers, but false-positive risk on adjacent "prompt engineering" queries |

On the visual artifacts finding:

"The only structural artifact is the directory tree. The creation workflow, the eval loop (spawn → draft assertions → grade → viewer → feedback → iterate), and the trigger optimization loop are all described as numbered steps in prose. These are textbook Mermaid diagram candidates."

Notably, skill-architect gave skill-creator a 7/10 on shibboleths — higher than skill-architect's own self-evaluation score on that dimension (4/10). skill-creator encodes real expertise about when quantitative evals help vs. don't, the overfitting risk in description optimization loops, and the importance of reading transcripts not just metrics. The architect recognized expertise it couldn't represent in its own format.

With its source folder loaded, SA ran its own validators and rubric before writing a single change. Score moved to 8.0. What changed in SC's folder:

| Addition | What it is |
|---|---|
| description-optimization.md | SA's description craft methodology — didn't exist in SC |
| platform-notes.md | Platform adaptation guide — was buried in SKILL.md body |
| CHANGELOG.md | Version history — didn't exist |
SA also added the two Mermaid diagrams it flagged as missing (a high-level creation-to-packaging flowchart and a sequence diagram of the eval loop) and a proper anti-pattern section with Novice/Expert/Timeline shibboleth templates.

SA's honest comparison:

"skill-creator is a more operationally sophisticated skill than skill-architect. It ships a complete eval pipeline, working scripts, three specialized agent definitions, a JSON schema reference, and a working HTML viewer — none of which sa0 ships in comparable completeness. In terms of self-contained tooling, sc0 is arguably better. Where sa0 exceeds sc0: structural rigor."

An evaluator acknowledging the target is better-equipped than itself.

What Skill-Creator Found in Skill-Architect

Score: 6.7/10 — then 7.6/10 with full folder access

The creator used its own seven-dimension rubric and found real gaps on the operational side:

| Dimension | Score | Key Finding |
|---|---|---|
| Triggering | 6/10 | "Design, create, audit, improve" misses natural phrasings: "write a skill", "build a skill" |
| Output Quality | 5/10 | Create has a template; audit, improve, debug have no output contract |
| Eval Loop Readiness | 7/10 | Validation checklist is assertion-ready; success metrics are measurable |
| Iteration Support | 7/10 | Clear metrics (>90% activation, <5% false positive); scoring rubric lives in references, not inline |
| Communication Clarity | 8/10 | Strong tables and diagrams; "shibboleth" appears before it's defined |
| Description Optimization | 6/10 | Lists specific operations; should lead with what users get, not what operations exist |
| Self-Containment | 8/10 | References indexed; one path ambiguity in script commands |

The output quality gap:

"Four operations are implied (create, audit, improve, debug) but only create has a defined output format. Audit, improve, and debug produce... something. An eval harness would struggle to grade audit or debug outputs automatically."

And the irony finding:

"The SKILL.md is 503 lines. The skill's own rule says '<500 lines.' This is a do-as-I-say-not-as-I-do violation."

With full folder access, SC ran SA's own validators against SA:

```
validate_skill.py:
   [size] SKILL.md is 505 lines (max 500). ERROR.

check_self_contained.py:
   Phantom reference: references/server-components-deep-dive.md (does not exist)
```

SA's SKILL.md violates its own 500-line rule. SA's own checker caught it.

Then SC went into the reference files. skill-lifecycle.md — lifecycle state machine rendered as ASCII box-drawing. Anti-pattern #10 in SA's own catalog says "use Mermaid." SC replaced it with a proper stateDiagram-v2. skill-composition.md — five ASCII art dependency diagrams, all converted to Mermaid flowcharts.

SC also found a real script bug in validate_mermaid.py:

```python
# Line 565 -- both branches produce identical empty strings
icon = "  " if issue.severity == "error" else "  "
```

Error and warning were visually identical in output. SC fixed it.
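A plausible reconstruction of the fix (the transcript doesn't show SC's exact glyphs, so the markers here are our assumptions):

```python
def severity_icon(severity: str) -> str:
    """Return a distinct marker per severity level.

    Hypothetical reconstruction: the shipped line rendered both
    severities as the same blank string; any two distinct markers
    restore the error/warning distinction in validator output.
    """
    return "[E] " if severity == "error" else "[W] "
```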

SC's biggest move: Output Contracts — a new section defining what each operation produces (create → SKILL.md file; audit → structured report with dimension scores; improve → complete rewritten SKILL.md with diff; debug → diagnosed root cause and concrete fix). This transforms the skill from instruction-only to assertion-ready.
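The Output Contracts idea lends itself to a mechanical encoding. A hypothetical sketch (type and field names are ours; the contract contents come from the paragraph above):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OutputContract:
    """What one operation must produce, and what a grader can check."""
    operation: str
    produces: str
    assertable: str  # the mechanical check an eval harness can run

# One contract per operation, so "audit" no longer produces "...something."
CONTRACTS = {
    "create": OutputContract("create", "SKILL.md file",
                             "frontmatter parses; name and description present"),
    "audit": OutputContract("audit", "structured report",
                            "one score per rubric dimension"),
    "improve": OutputContract("improve", "rewritten SKILL.md with diff",
                              "diff applies cleanly to the original"),
    "debug": OutputContract("debug", "diagnosed root cause and concrete fix",
                            "fix references a real file and line"),
}
```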

What changed in SA's folder:

| Change | Type |
|---|---|
| SKILL.md: 505 → 476 lines | Compressed |
| Fixed false NOT clause (MCP excluded, but SA teaches MCP via 3 ref files) | Consistency |
| validate_mermaid.py: error icon fixed | Bug fix |
| Phantom reference removed from antipatterns.md | Phantom fix |
| ASCII art → Mermaid in skill-lifecycle.md | Diagram conversion |
| ASCII art → Mermaid ×5 in skill-composition.md | Diagram conversion |
| troubleshooting.md added | New file |

SC's honest comparison:

"skill-architect is more comprehensive than skill-creator in raw content — more shibboleths, more reference files, working scripts. skill-creator's advantage is tighter discipline around eval methodology and assertion-based quality measurement. If these two skills were composed, the combined quality would exceed either alone."

Self-Evaluations: Mirrors Turned Inward

We also ran SA on itself and SC on itself with full folder access.

SA on SA: SA ran check_self_contained.py against itself. The script returned 7 "phantom" references. 5 were false positives — the checker was matching reference patterns inside illustrative prose. SA's quality gate was fundamentally broken: it was flagging its own documentation examples as missing files. SA fixed check_self_contained.py with an ILLUSTRATIVE_MARKERS regex, then added activation-debugging.md — a gap it found by noticing the skill listed activation debugging as a use case but shipped no content for it.

Grade: 7.3 → 8.8/10 (B → B+)
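The ILLUSTRATIVE_MARKERS fix can be sketched. This is our guess at the mechanism, not SA's actual code: skip reference-like paths on lines that are plainly illustrative prose, and only flag the rest as phantoms.

```python
import re

# Hypothetical markers: prose cues that a cited path is an example,
# not a bundled file the skill promises to ship.
ILLUSTRATIVE_MARKERS = re.compile(
    r"(false positive|illustrative|placeholder|for example|e\.g\.)",
    re.IGNORECASE,
)
REF_PATTERN = re.compile(r"`((?:references|scripts)/[\w.\-]+)`")

def find_phantom_refs(text: str, bundled: set[str]) -> list[str]:
    """Return cited paths that neither exist in the bundle nor appear
    on a line the markers identify as illustrative prose."""
    phantoms = []
    for line in text.splitlines():
        if ILLUSTRATIVE_MARKERS.search(line):
            continue  # documentation example, not a real dependency
        for ref in REF_PATTERN.findall(line):
            if ref not in bundled:
                phantoms.append(ref)
    return phantoms
```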

SC on SC: SC found a functional path bug. The SKILL.md tells users to save outputs to eval-<ID>/with_skill/outputs/. But aggregate_benchmark.py expects grading.json at eval-<ID>/with_skill/run-*/grading.json. The aggregator skips directories with no run-* subdirs. Result: benchmark.json would always be empty. A silent bug that a user would only discover after running a complete eval cycle. SC fixed it, rewrote the description to follow its own "pushy principle," and added eval-patterns.md.

Score: 8.2/10
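The aggregator bug is a one-directory-level mismatch. A sketch of a fixed lookup that accepts both layouts (function name hypothetical; the real aggregate_benchmark.py required only `run-*/grading.json`):

```python
import json
from pathlib import Path

def collect_grading(config_dir: Path) -> list[dict]:
    """Collect grading.json results under one eval configuration.

    Accepts both the run-subdirectory layout the aggregator expected
    (with_skill/run-*/grading.json) and the flat layout the SKILL.md
    instructions produced, so neither silently yields an empty benchmark.
    """
    paths = sorted(config_dir.glob("run-*/grading.json"))
    flat = config_dir / "grading.json"
    if flat.exists():
        paths.append(flat)
    return [json.loads(p.read_text(encoding="utf-8")) for p in paths]
```

The design point: a missing directory level should surface as a warning or be tolerated, never as a silently empty benchmark.json.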

What the Tools Revealed

Each evaluator's tools revealed a different dimension of quality.

SA's tools are architectural probes. They find what's missing: no diagram, no NOT clause, no anti-pattern section. They don't fire on existing content.

SC's tools are consistency probes. They find what's wrong: script bugs, path inconsistencies, self-contradictions. They don't add new architectural layers.

Both revealed something text couldn't. SA's own check_self_contained.py was generating false positives from its own illustrative prose — the skill was failing the very quality gate it ships. SC's aggregate_benchmark.py would always produce empty output because of a missing directory level. Neither failure was visible from reading SKILL.md alone. Both were invisible until you ran the code.

A skill you can't run is less self-contained than it appears.

Side-by-Side Comparison

SC (SA-improved)
Source: Anthropic (Apache 2.0), on GitHub.

The Iteration Paths

The Braid: Layer 1 Crossings

[Interactive braid diagram: the SA (skill-architect) and SC (skill-creator) strands cross over each other; click a node to inspect that crossing.]

Now that we have four artifacts (see the algebra diagram above), there are three paths for iteration:

Path A: Fixed-Base -- SA(SC₁) = SC₂, SA(SC₂) = SC₃... Same evaluator grinding. Measures convergence.

Path B: Self-Reflective -- SA(SA) = SA₁, then SA₁(SC). Does self-improvement make you a better evaluator?

Path C: Cross-Spiral -- Each generation uses the OTHER's latest version. This is the braid. Does the diagram commute? SC(SA(SC)) vs SA(SC(SA))? Almost certainly not. The difference reveals what each evaluator cannot see about itself.


Round 3: The Cross-Spiral

In Round 2, each skill evaluated the other from scratch. In Round 3, we let each skill improve itself first — then sent its improved version to evaluate the other's improved version. SA₁ (SA after self-evaluating its full folder) evaluated SC₁ (SC after self-evaluating its full folder). SC₁ evaluated SA₁. The question: what does a cross-evaluator find after both sides have already cleaned their own houses?

SA₁ Evaluates SC₁

Score: 8.2 → 8.8/10 (A−)

SA₁ ran its validation scripts against SC₁ and found three things SC₁'s self-evaluation missed:

526 lines. SC₁ had grown past 500 lines during self-improvement (iter-2 added a workspace diagram and a grader prompt template without extracting anything). SC's scripts don't enforce the 500-line rule — SA's do. SC₁ couldn't catch itself violating a rule it doesn't measure.

No Mermaid diagram for the core loop. SC₁'s central concept — the create/test/grade/improve eval loop — is described in prose. SA₁'s cross-evaluator spotted this immediately; it's one of SA's six evaluation dimensions (Visual Artifacts). SC₁ evaluated itself against its own rubric, which doesn't include that dimension. The diagram that was missing was the one SC was most accustomed to not having.

No NOT clause. SC₁'s description still lacked the exclusion clause SA's rubric requires. Again: SC₁'s rubric doesn't mandate NOT clauses. It couldn't flag their absence in itself.

SA₁'s summary of the finding:

"self-evaluators don't notice what they're used to reading."

SA₁'s convergence assessment: the diff from iter-2 to iter-3 is "meaningfully smaller" than iter-1 to iter-2. Convergence confidence: high.

SC₁ Evaluates SA₁

Score: 8.8 → 8.97/10 (A−)

SC₁ ran SA₁'s validation scripts cold and immediately found something the self-evaluation had missed — because the self-evaluation couldn't have seen it.

SA₁'s EVALUATION.md described the phantom reference fixes it made. To explain which paths had been false positives, it cited them inline:

"`scripts/analyze.py` (false positive — illustrative)"
"`` `references/X.md` `` (false positive — illustrative placeholder)"
"referenced `references/api-guide.md`"
"cited `scripts/validate.py`"

SC₁ ran check_self_contained.py on the finished output and it failed — on exactly those four lines. The evaluation document describing the phantom-detection fix was itself triggering phantom detection. SA₁ had declared the checker now passed. The checker was failing on SA₁'s own words about passing.

SC₁'s diagnosis:

"The evaluation was written after the fix, so the author wasn't running the checker against the completed EVALUATION.md. This is a workflow gap: self-evaluation only validates the skill body, not the evaluation artifact itself."

SC₁ also found HTML entities in four reference files — &gt;, &lt;, &amp; rendering as literal text in the agent-loaded markdown. SA₁'s validate_skill.py only checked SKILL.md. SC₁ extended the validator to scan all .md files recursively, catching the entities the tool had always missed.
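That validator extension can be sketched as a recursive scan (a minimal version in the spirit of the patch, not SC₁'s actual code):

```python
import re
from pathlib import Path

# Literal HTML entities that render as text in agent-loaded markdown.
ENTITY = re.compile(r"&(gt|lt|amp);")

def find_html_entities(root: Path) -> dict[str, int]:
    """Count literal HTML entities in every .md file under root.

    The original validate_skill.py only checked SKILL.md; scanning
    recursively catches entities hiding in reference files too.
    """
    hits = {}
    for path in sorted(root.rglob("*.md")):
        count = len(ENTITY.findall(path.read_text(encoding="utf-8")))
        if count:
            hits[path.relative_to(root).as_posix()] = count
    return hits
```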

SC₁'s convergence assessment:

"Diminishing returns visible. Each iteration yields smaller gains."

What the Cross-Spiral Reveals

Two kinds of evaluator blindness emerged clearly in Round 3:

You can't see what you don't measure. SA₁ found SC₁'s missing visual artifact and line-count violation because SA's rubric measures those. SC₁ wouldn't have flagged either — they're not in SC's quality framework. Each cross-evaluator finds a different shadow: the shadow of its own values, projected onto the target.

You can't see what you just wrote. SA₁'s EVALUATION.md described the fix in the very language that would trigger the failure. This isn't carelessness — it's structural. You write the evaluation after fixing the files, you reference the paths you fixed, and you don't think to run the checker again on the document you're in the middle of writing. A cross-evaluator reads the finished artifact cold. The self-evaluator never gets that perspective.

Both rounds converged to the same composite score: A−. But they got there by finding different things.

Side-by-Side Comparison

SC₁ (cross-evaluated)
Source: Anthropic (Apache 2.0), on GitHub.

What We Learned

1. Tools inherit their creator's values

Skill-architect was built by someone who values knowledge architecture: layered references, progressive disclosure, shibboleth encoding, Mermaid diagrams. Skill-creator was built by someone who values measurement infrastructure: scripts, benchmarks, automated grading, iteration loops.

When each evaluates the other, they find what's missing from their own perspective. The architect gave skill-creator 3/10 on visual artifacts because diagrams are sacred to the architect. The creator scored skill-architect 5/10 on output quality because assertability is sacred to the creator — if you can't write a test for it, it doesn't exist.

Neither is wrong. They're applying different value systems.

2. Both skills violate their own rules

The architect's SKILL.md is 503 lines. Its own rule says <500.

The creator teaches "pushy descriptions" and "always include a NOT clause." Its own description is neutral with no exclusions.

The creator's quick_validate.py checks for allowed-tools in frontmatter. The creator's own frontmatter doesn't have it.

This isn't a gotcha. This is the fundamental problem of meta-tools: the cobbler's children go barefoot.

3. NOT clauses are contextual, not universal

With 15 skills, NOT clauses are hygiene. With 191 skills, they're architecture. The right answer depends on how many skills are competing for activation in your namespace.

4. The scorecard had a surprise

The architect gave skill-creator 7/10 on shibboleths — its highest score, and higher than skill-architect's own self-evaluation on that dimension (4/10). This is real: skill-creator encodes subtle expertise about when quantitative evals help vs. don't, the overfitting risk in description optimization loops, and the importance of reading transcripts instead of just metrics. The architect recognized expertise it couldn't represent in its own format.

Meanwhile, both evaluators independently found the same description problem: internal vocabulary that users don't type. "Expert-level progressive disclosure" became "my skill doesn't trigger." "Measure skill performance" became "run skill evals." The semantic matching engine doesn't care about your internal vocabulary. Both skills failed this test on themselves.

5. The best meta-skill would be both

An encyclopedia with a factory floor. The architect's knowledge depth (13 reference files, anti-pattern catalogs, shibboleth templates, 23-type Mermaid guide) combined with the creator's measurement infrastructure (9 scripts, 3 evaluation agents, HTML viewer, benchmark aggregation). Neither covers the full space alone.

Both evaluators concluded this independently. SA said SC's tooling is "arguably better" on self-containment. SC said if the two skills "were composed, the combined quality would exceed either alone." Two different philosophies. Same answer.

6. Tools are more honest than text

In Layer 1, each skill described what it valued. In Layer 2, each skill demonstrated what it valued by using its tools.

SA's tools are architectural probes. Running them finds what's structurally absent. SC's tools are consistency probes. Running them finds what's mechanically broken.

Both revealed something text wouldn't. SA's own check_self_contained.py was generating false positives from its own illustrative prose — the skill was failing the very quality gate it ships. SC's aggregate_benchmark.py would always produce empty output because of a missing directory level in the path structure. Neither failure was visible from reading SKILL.md alone. Both were invisible until you ran the code.

The implication: a skill you can't run is less self-contained than it appears.

7. The evaluation artifact is also subject to evaluation

In Round 3, SA₁'s EVALUATION.md — the document that said "check_self_contained.py now passes" — caused check_self_contained.py to fail. The self-evaluator writes the evaluation after fixing the files, references the paths it just fixed, and doesn't think to run the checker against the document it's in the middle of writing.

SC₁, reading the finished output cold, ran the checker and found the failure immediately.

This is a general principle: the act of documenting a fix can reintroduce the problem being fixed. A self-evaluator can't get outside its own output to notice this. A cross-evaluator can.

It also suggests that evaluation infrastructure needs its own testing. SC₁'s fix — adding <!-- phantom-ok --> annotations and extending ILLUSTRATIVE_MARKERS with evaluation-document prose patterns — was itself a form of meta-evaluation: auditing the evaluator's assumptions about its own output.


What Comes Next

Three rounds of cross-evaluation are done. Both evaluators are now at A− and agree they're near convergence. The interesting remaining questions aren't about improving these skills further — they're structural questions about the evaluation process itself.

Does the braid commute? The cross-spiral ran SA₁(SC₁) and SC₁(SA₁) — two directions of the same crossing. Both found different things and both landed at the same grade. But if you computed SA(SC(SA)) vs SC(SA(SC)) from the originals, would they converge to the same point? The two evaluators apply different rubrics and find different failures. The limit might depend on which direction you travel.

What's the fixed point of the composition SA ∘ SC? We have SA(SC₁) = SC₂ and SC(SA₁) = SA₂. But what about (SA ∘ SC)(SA) — applying both evaluators in sequence to the same target? Round 2 and Round 3 applied them independently. What would the composed skill look like if you ran both evaluators together, letting each build on the other's findings?

Can you compose the skills? Both evaluators independently concluded that the ideal meta-skill would have SA's knowledge depth and SC's measurement infrastructure. That's a hypothesis, not a skill. Building it would require merging 13 reference files with 9 evaluation scripts, unifying two different rubrics, and resolving the NOT-clause philosophy difference. That's a design problem, not an evaluation problem.

The experiment continues. Transcripts and diffs are in the eval-data directory. The eval-viewer (shipped as part of skill-creator's tooling) can render the benchmark results as a standalone HTML review page if you want to explore the grading data yourself.


All experiment data, transcripts, and diffs are in the eval-data directory. Anthropic's skill-creator is included under Apache 2.0 with full attribution (see PROVENANCE.md).