
Add quality-playbook skill for automated quality system generation#659

Open
andrewstellman wants to merge 4 commits into anthropics:main from andrewstellman:quality-playbook

Conversation


@andrewstellman andrewstellman commented Mar 16, 2026

Summary

A skill that revives traditional quality engineering practices — the kind most teams cut decades ago — and uses AI to make them cheap enough to run on every project.

Most AI testing tools work from source code: they generate test stubs based on what the code does. This skill does something different. It explores a codebase first — reading specs, schemas, defensive patterns, architecture, even developer chat history — to understand what the code is supposed to do, then generates a complete quality infrastructure to verify it. That's the gap between "does this function return the right value?" and "does this system fulfill its purpose?" — and it's the problem quality engineering was invented to solve.

Tip: For a slash command, create `.claude/commands/quality-playbook.md` containing a single line: `Run the quality-playbook skill against this codebase.`

What it generates

Six deliverables that together form a repeatable quality system:

| Deliverable | What It Does |
| --- | --- |
| QUALITY.md | Quality constitution — defines what "correct" means for this specific project, with coverage targets tied to rationale so future sessions can't argue standards down |
| Functional tests | Spec-traced tests in the project's native language and test framework, derived from what the spec says should happen — not from what the code does |
| RUN_CODE_REVIEW.md | Code review protocol with anti-hallucination guardrails: cite line numbers, grep before claiming something is missing, flag uncertainty as QUESTION not BUG |
| RUN_INTEGRATION_TESTS.md | Integration test protocol with a runnable test matrix — an AI agent can read this file and execute it as an autonomous test runner |
| RUN_SPEC_AUDIT.md | Council of Three protocol: three independent AI models audit code against specs (in testing, 74% of defects were caught by only one model — single-model review misses most problems) |
| AGENTS.md | Bootstrap file so every future AI session inherits the quality system instead of starting from scratch |

What it doesn't do

This is not a test stub generator, a linter, or a code review bot. It doesn't mechanically produce tests from source code. It doesn't scan for vulnerabilities. It doesn't score your existing test suite. It builds the quality system that tells you what to test, why, and how to know when it's working — the piece that sits between "here are specifications" and "here are tests" that no existing tool generates.

Existing tools fall into four buckets, none of which do what this skill does:

  • Test generators (Diffblue, Qodo, Early.ai) generate tests mechanically from source code, including tests that verify bugs
  • Code review tools (CodeRabbit, Greptile) flag problems in code already written, single-model
  • Spec-driven frameworks (GitHub Spec Kit, Kiro) manage the process but don't generate the testing infrastructure
  • Open-source skills (agentic-bootstrap, Star Chamber) generate personas or do standalone review, not quality infrastructure

Three things this does that nothing else does

Each maps to a traditional quality practice that got cut for cost:

1. Forensic inversion of defensive patterns (root cause analysis)
Every tool I surveyed treats try/except blocks and null checks as evidence of robustness. This skill inverts that: defensive code is scar tissue from past failures. Every try/except block, every null check, every retry loop points to a failure mode that belongs in the test plan. Instead of proving the code is safe, the skill reads the confessions.
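A minimal sketch of the inversion, with an invented function (`load_config` and its failure modes are illustrative, not taken from the skill's output): each guarded exception becomes an explicit test case that forces the failure instead of trusting the guard.

```python
import json
import pathlib
import tempfile

def load_config(path):
    """Hypothetical application code. The broad except clause is 'scar
    tissue': it confesses two failure modes worth dedicated tests."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        return {}  # fallback presumably added after a missing/corrupt file broke something

# Forensic inversion: derive one test per failure mode the code guards against.
with tempfile.TemporaryDirectory() as d:
    missing = pathlib.Path(d) / "nope.json"
    assert load_config(missing) == {}   # exercises the OSError branch

    corrupt = pathlib.Path(d) / "config.json"
    corrupt.write_text("{not valid json")
    assert load_config(corrupt) == {}   # exercises the JSONDecodeError branch
```

The point is the direction of inference: the defensive block is read as evidence that these inputs occur in practice, so the test plan must contain them.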

2. Coverage theater prevention (test plan review)
Explicitly defines and flags fake tests: existence-only checks that prove something is there without verifying what it is, presence-only assertions, mocked validators that bypass the thing you're testing, single-variant testing. The difference between 95% coverage that catches nothing and tests that actually find bugs.
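A hedged illustration of the contrast (the validator here is invented for the example): an existence-only assertion passes no matter what the function returns, while a real test pins exact output, tries a second input variant, and exercises the rejection path.

```python
import re

def normalize_phone(raw):
    """Hypothetical validator: strips punctuation, requires 10 digits."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        raise ValueError("expected 10 digits")
    return digits

# Coverage theater: proves a value exists, not that it is correct.
result = normalize_phone("(555) 123-4567")
assert result is not None  # would still pass if the value were garbage

# A real test: exact output, a second variant, and the rejection path.
assert normalize_phone("(555) 123-4567") == "5551234567"
assert normalize_phone("555.123.4567") == "5551234567"
try:
    normalize_phone("123")
except ValueError:
    pass
else:
    raise AssertionError("short numbers should be rejected")
```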

3. Generated protocols that AI agents execute autonomously (quality planning)
The integration test protocol isn't just documentation. When you tell Claude Code to run the tests, it reads the protocol, follows the execution plan step by step, reports progress, and produces a final summary with a ship-or-no-ship recommendation. The documentation becomes the test runner's instructions.
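As a rough sketch of the shape such a protocol might take (this excerpt is invented for illustration; the RUN_INTEGRATION_TESTS.md the skill generates is specific to each project):

```markdown
## Test Matrix

| # | Scenario                 | Command                     | Pass Criteria              |
|---|--------------------------|-----------------------------|----------------------------|
| 1 | Cold start, empty config | `./app --config empty.json` | Exits 0, defaults logged   |
| 2 | Duplicate job submission | `./submit.sh job.yaml` (x2) | Second submit rejected     |

## Execution Plan
1. Run each row in order; record PASS/FAIL with evidence (logs, exit codes).
2. On any FAIL, capture a minimal reproduction before continuing.
3. Finish with a summary table and a ship / no-ship recommendation.
```

Because the matrix rows carry concrete commands and pass criteria, an agent can execute the file directly rather than interpreting prose.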

Testing

Tested against five codebases across four languages. The skill had never seen any of them:

| Project | Language | Tests | Key Result |
| --- | --- | --- | --- |
| Octobatch (batch orchestrator) | Python | Full playbook | All passing |
| BlazorMatchGame (game UI) | C# | 30 functional | Found 2 real bugs: ghost match vulnerability, timer lifecycle leak |
| Spring PetClinic (REST API) | Java | 44 functional | Found telephone validation gap (previously unreported bug) |
| Gson (Google's JSON library) | Java | 53 new + 4,638 existing | All pass, zero regressions |
| Javalin (web framework) | Kotlin/Java | 48 functional + 13 integration groups | 309 total tests, zero failures |

Also submitted to github/awesome-copilot — the skill follows the shared Agent Skills specification and works with both Claude Code and Copilot.

Background

This skill emerged from an O'Reilly Radar series on AI-driven development. The full story covers how it was built, why it works, and the quality engineering theory behind it.

Test plan

  • Eval harness: 20/20 skill triggering accuracy
  • Octobatch (Python): 6 artifacts generated, functional tests passing
  • Gson (Java): 53 functional tests passing against 4,638 existing tests
  • Spring Petclinic (Java/Spring Boot): 44 functional tests, found novel telephone validation bug
  • Javalin (Kotlin): 48 functional tests passing, 13/13 integration tests passing
  • BlazorMatchGame (C#): 30 functional tests, found 2 real bugs
  • Frontmatter validates against skill-creator quick_validate.py allowed properties

andrewstellman added a commit to andrewstellman/awesome-copilot that referenced this pull request Mar 25, 2026
A skill that explores any codebase from scratch and generates six
quality artifacts: quality constitution, spec-traced functional tests,
code review protocol, integration testing protocol, Council of Three
spec audit, and AI bootstrap file (AGENTS.md).

Tested against five codebases across four languages. Also submitted
to anthropics/skills#659.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explores any codebase from scratch and generates six quality artifacts:
QUALITY.md (constitution), spec-traced functional tests, code review
protocol, integration test protocol, Council of Three spec audit, and
AGENTS.md bootstrap file. Works with any language. Includes Phase 4
Quick Start prompts for executing generated artifacts.
…anner

- Add Phase 2 regression test generation to code review protocol
- Add startup banner displaying version and author
- Bump version 1.0.0 → 1.1.0
- Update skill description to mention regression test generation
- Add language-specific regression test guidance (Go, Rust, Python, Java)

Validated against 4 codebases: agent-deck (Go, 7 races confirmed),
edgequake (Rust, 5 bugs confirmed), open-swe (Python, 5 bugs confirmed),
agentscope-java (Java, functional + integration tests passing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andrewstellman
Author

Updated with v1.1.0: adds regression test generation (Phase 2 of code review protocol), startup banner, and version metadata. Validated across 8 codebases in 5 languages — 19 confirmed bugs found with failing regression tests.

andrewstellman and others added 2 commits March 28, 2026 19:54
… detection

Add Step 5a (state machine completeness analysis) and expand Step 6
with missing safeguard detection patterns. These catch two categories
of bugs that defensive pattern analysis alone misses: unhandled states
in lifecycle/status machines, and operations that commit users to
expensive work without adequate preview or termination conditions.

Validated against 5 codebases (Octobatch, agent-deck, edgequake,
open-swe, agentscope-java) — all known bugs surfaced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
