
Add quality-playbook skill for automated quality system generation#659

Open
andrewstellman wants to merge 4 commits into anthropics:main from andrewstellman:quality-playbook

Conversation


@andrewstellman andrewstellman commented Mar 16, 2026

Summary

A skill that revives traditional quality engineering practices — the kind most teams cut decades ago — and uses AI to make them cheap enough to run on every project.

Most AI testing tools work from source code: they generate test stubs based on what the code does. This skill does something different. It explores a codebase first — reading specs, schemas, defensive patterns, architecture, even developer chat history — to understand what the code is supposed to do, then generates a complete quality infrastructure to verify it. That's the gap between "does this function return the right value?" and "does this system fulfill its purpose?" — and it's the problem quality engineering was invented to solve.

Tip: For a slash command, create `.claude/commands/quality-playbook.md` containing a single line: `Run the quality-playbook skill against this codebase.`

What it generates

Six deliverables that together form a repeatable quality system:

| Deliverable | What It Does |
| --- | --- |
| QUALITY.md | Quality constitution — defines what "correct" means for this specific project, with coverage targets tied to rationale so future sessions can't argue standards down |
| Functional tests | Spec-traced tests in the project's native language and test framework, derived from what the spec says should happen — not from what the code does |
| RUN_CODE_REVIEW.md | Code review protocol with anti-hallucination guardrails: cite line numbers, grep before claiming something is missing, flag uncertainty as QUESTION not BUG |
| RUN_INTEGRATION_TESTS.md | Integration test protocol with a runnable test matrix — an AI agent can read this file and execute it as an autonomous test runner |
| RUN_SPEC_AUDIT.md | Council of Three protocol: three independent AI models audit code against specs (in testing, 74% of defects were caught by only one model — single-model review misses most problems) |
| AGENTS.md | Bootstrap file so every future AI session inherits the quality system instead of starting from scratch |

What it doesn't do

This is not a test stub generator, a linter, or a code review bot. It doesn't mechanically produce tests from source code. It doesn't scan for vulnerabilities. It doesn't score your existing test suite. It builds the quality system that tells you what to test, why, and how to know when it's working — the piece that sits between "here are specifications" and "here are tests" that no existing tool generates.

Existing tools fall into four buckets, none of which do what this skill does:

  • Test generators (Diffblue, Qodo, Early.ai) generate tests mechanically from source code, including tests that verify bugs
  • Code review tools (CodeRabbit, Greptile) flag problems in code already written, single-model
  • Spec-driven frameworks (GitHub Spec Kit, Kiro) manage the process but don't generate the testing infrastructure
  • Open-source skills (agentic-bootstrap, Star Chamber) generate personas or do standalone review, not quality infrastructure

Three things this does that nothing else does

Each maps to a traditional quality practice that got cut for cost:

1. Forensic inversion of defensive patterns (root cause analysis)
Every tool I surveyed treats try/except blocks and null checks as evidence of robustness. This skill inverts that: defensive code is scar tissue from past failures. Every try/except block, every null check, every retry loop points to a failure mode that belongs in the test plan. Instead of proving the code is safe, the skill reads the confessions.
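A minimal sketch of the inversion, with an invented function (`load_config` and its failure modes are illustrative, not taken from the skill's output): each guarded exception becomes an explicit test case that forces the failure instead of trusting the guard.

```python
import json
import pathlib
import tempfile

def load_config(path):
    """Hypothetical application code. The broad except clause is 'scar
    tissue': it confesses two failure modes worth dedicated tests."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        return {}  # fallback presumably added after a missing/corrupt file broke something

# Forensic inversion: derive one test per failure mode the code guards against.
with tempfile.TemporaryDirectory() as d:
    missing = pathlib.Path(d) / "nope.json"
    assert load_config(missing) == {}   # exercises the OSError branch

    corrupt = pathlib.Path(d) / "config.json"
    corrupt.write_text("{not valid json")
    assert load_config(corrupt) == {}   # exercises the JSONDecodeError branch
```

The point is the direction of inference: the defensive block is read as evidence that these inputs occur in practice, so the test plan must contain them.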

2. Coverage theater prevention (test plan review)
Explicitly defines and flags fake tests: existence-only checks that prove something is there without verifying what it is, presence-only assertions, mocked validators that bypass the thing you're testing, single-variant testing. The difference between 95% coverage that catches nothing and tests that actually find bugs.
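A hedged illustration of the contrast (the validator here is invented for the example): an existence-only assertion passes no matter what the function returns, while a real test pins exact output, tries a second input variant, and exercises the rejection path.

```python
import re

def normalize_phone(raw):
    """Hypothetical validator: strips punctuation, requires 10 digits."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        raise ValueError("expected 10 digits")
    return digits

# Coverage theater: proves a value exists, not that it is correct.
result = normalize_phone("(555) 123-4567")
assert result is not None  # would still pass if the value were garbage

# A real test: exact output, a second variant, and the rejection path.
assert normalize_phone("(555) 123-4567") == "5551234567"
assert normalize_phone("555.123.4567") == "5551234567"
try:
    normalize_phone("123")
except ValueError:
    pass
else:
    raise AssertionError("short numbers should be rejected")
```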

3. Generated protocols that AI agents execute autonomously (quality planning)
The integration test protocol isn't just documentation. When you tell Claude Code to run the tests, it reads the protocol, follows the execution plan step by step, reports progress, and produces a final summary with a ship-or-no-ship recommendation. The documentation becomes the test runner's instructions.
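As a rough sketch of the shape such a protocol might take (this excerpt is invented for illustration; the RUN_INTEGRATION_TESTS.md the skill generates is specific to each project):

```markdown
## Test Matrix

| # | Scenario                 | Command                     | Pass Criteria              |
|---|--------------------------|-----------------------------|----------------------------|
| 1 | Cold start, empty config | `./app --config empty.json` | Exits 0, defaults logged   |
| 2 | Duplicate job submission | `./submit.sh job.yaml` (x2) | Second submit rejected     |

## Execution Plan
1. Run each row in order; record PASS/FAIL with evidence (logs, exit codes).
2. On any FAIL, capture a minimal reproduction before continuing.
3. Finish with a summary table and a ship / no-ship recommendation.
```

Because the matrix rows carry concrete commands and pass criteria, an agent can execute the file directly rather than interpreting prose.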

Testing

Tested against five codebases across four languages. The skill had never seen any of them:

| Project | Language | Tests | Key Result |
| --- | --- | --- | --- |
| Octobatch (batch orchestrator) | Python | Full playbook | All passing |
| BlazorMatchGame (game UI) | C# | 30 functional | Found 2 real bugs: ghost match vulnerability, timer lifecycle leak |
| Spring PetClinic (REST API) | Java | 44 functional | Found telephone validation gap (previously unreported bug) |
| Gson (Google's JSON library) | Java | 53 new + 4,638 existing | All pass, zero regressions |
| Javalin (web framework) | Kotlin/Java | 48 functional + 13 integration groups | 309 total tests, zero failures |

Also submitted to github/awesome-copilot — the skill follows the shared Agent Skills specification and works with both Claude Code and Copilot.

Background

This skill emerged from an O'Reilly Radar series on AI-driven development. The full story covers how it was built, why it works, and the quality engineering theory behind it.

Test plan

  • Eval harness: 20/20 skill triggering accuracy
  • Octobatch (Python): 6 artifacts generated, functional tests passing
  • Gson (Java): 53 functional tests passing against 4,638 existing tests
  • Spring Petclinic (Java/Spring Boot): 44 functional tests, found novel telephone validation bug
  • Javalin (Kotlin): 48 functional tests passing, 13/13 integration tests passing
  • BlazorMatchGame (C#): 30 functional tests, found 2 real bugs
  • Frontmatter validates against skill-creator quick_validate.py allowed properties

andrewstellman added a commit to andrewstellman/awesome-copilot that referenced this pull request Mar 25, 2026
A skill that explores any codebase from scratch and generates six
quality artifacts: quality constitution, spec-traced functional tests,
code review protocol, integration testing protocol, Council of Three
spec audit, and AI bootstrap file (AGENTS.md).

Tested against five codebases across four languages. Also submitted
to anthropics/skills#659.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explores any codebase from scratch and generates six quality artifacts:
QUALITY.md (constitution), spec-traced functional tests, code review
protocol, integration test protocol, Council of Three spec audit, and
AGENTS.md bootstrap file. Works with any language. Includes Phase 4
Quick Start prompts for executing generated artifacts.
…anner

- Add Phase 2 regression test generation to code review protocol
- Add startup banner displaying version and author
- Bump version 1.0.0 → 1.1.0
- Update skill description to mention regression test generation
- Add language-specific regression test guidance (Go, Rust, Python, Java)

Validated against 4 codebases: agent-deck (Go, 7 races confirmed),
edgequake (Rust, 5 bugs confirmed), open-swe (Python, 5 bugs confirmed),
agentscope-java (Java, functional + integration tests passing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andrewstellman
Author

Updated with v1.1.0: adds regression test generation (Phase 2 of code review protocol), startup banner, and version metadata. Validated across 8 codebases in 5 languages — 19 confirmed bugs found with failing regression tests.

andrewstellman and others added 2 commits March 28, 2026 19:54
… detection

Add Step 5a (state machine completeness analysis) and expand Step 6
with missing safeguard detection patterns. These catch two categories
of bugs that defensive pattern analysis alone misses: unhandled states
in lifecycle/status machines, and operations that commit users to
expensive work without adequate preview or termination conditions.

Validated against 5 codebases (Octobatch, agent-deck, edgequake,
open-swe, agentscope-java) — all known bugs surfaced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
