Add quality-playbook skill for automated quality system generation #659
Open
andrewstellman wants to merge 4 commits into anthropics:main from
Conversation
Force-pushed from 55b2c3a to 8d58b69
scutuatua-crypto approved these changes on Mar 16, 2026
andrewstellman added a commit to andrewstellman/awesome-copilot that referenced this pull request on Mar 25, 2026
A skill that explores any codebase from scratch and generates six quality artifacts: quality constitution, spec-traced functional tests, code review protocol, integration testing protocol, Council of Three spec audit, and AI bootstrap file (AGENTS.md). Tested against five codebases across four languages. Also submitted to anthropics/skills#659. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
andrewstellman added a commit to andrewstellman/awesome-copilot that referenced this pull request on Mar 25, 2026
Explores any codebase from scratch and generates six quality artifacts: QUALITY.md (constitution), spec-traced functional tests, code review protocol, integration test protocol, Council of Three spec audit, and AGENTS.md bootstrap file. Works with any language. Includes Phase 4 Quick Start prompts for executing generated artifacts.
Force-pushed from 8d58b69 to 427e2d4
…anner

- Add Phase 2 regression test generation to code review protocol
- Add startup banner displaying version and author
- Bump version 1.0.0 → 1.1.0
- Update skill description to mention regression test generation
- Add language-specific regression test guidance (Go, Rust, Python, Java)

Validated against 4 codebases: agent-deck (Go, 7 races confirmed), edgequake (Rust, 5 bugs confirmed), open-swe (Python, 5 bugs confirmed), agentscope-java (Java, functional + integration tests passing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author
Updated with v1.1.0: adds regression test generation (Phase 2 of code review protocol), startup banner, and version metadata. Validated across 8 codebases in 5 languages — 19 confirmed bugs found with failing regression tests.
… detection

Add Step 5a (state machine completeness analysis) and expand Step 6 with missing safeguard detection patterns. These catch two categories of bugs that defensive pattern analysis alone misses: unhandled states in lifecycle/status machines, and operations that commit users to expensive work without adequate preview or termination conditions.

Validated against 5 codebases (Octobatch, agent-deck, edgequake, open-swe, agentscope-java) — all known bugs surfaced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This was referenced Apr 2, 2026
This was referenced Apr 6, 2026
Summary
A skill that revives traditional quality engineering practices — the kind most teams cut decades ago — and uses AI to make them cheap enough to run on every project.
Most AI testing tools work from source code: they generate test stubs based on what the code does. This skill does something different. It explores a codebase first — reading specs, schemas, defensive patterns, architecture, even developer chat history — to understand what the code is supposed to do, then generates a complete quality infrastructure to verify it. That's the gap between "does this function return the right value?" and "does this system fulfill its purpose?" — and it's the problem quality engineering was invented to solve.
Tip: For a slash command, create `.claude/commands/quality-playbook.md` containing a single line: `Run the quality-playbook skill against this codebase.`

What it generates
Six deliverables that together form a repeatable quality system:

- QUALITY.md (quality constitution)
- Spec-traced functional tests
- Code review protocol
- Integration testing protocol
- Council of Three spec audit
- AGENTS.md (AI bootstrap file)
What it doesn't do
This is not a test stub generator, a linter, or a code review bot. It doesn't mechanically produce tests from source code. It doesn't scan for vulnerabilities. It doesn't score your existing test suite. It builds the quality system that tells you what to test, why, and how to know when it's working — the piece that sits between "here are specifications" and "here are tests" that no existing tool generates.
Existing tools fall into four buckets, none of which do what this skill does:
Three things this does that nothing else does
Each maps to a traditional quality practice that got cut for cost:
1. Forensic inversion of defensive patterns (root cause analysis)
Every tool I surveyed treats try/except blocks and null checks as evidence of robustness. This skill inverts that: defensive code is scar tissue from past failures. Every try/except block, every null check, every retry loop points to a failure mode that belongs in the test plan. Instead of proving the code is safe, the skill reads the confessions.
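As a minimal sketch of the inversion idea (not the skill's actual implementation — the function name and message format here are invented for illustration), a scanner can walk a module's AST, find every `try/except`, and turn each handled exception into a test-plan entry:

```python
import ast

def defensive_patterns(source: str):
    """Treat each try/except handler as evidence of a past failure mode,
    and emit a test-plan entry for it (illustrative sketch only)."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Try):
            for handler in node.handlers:
                # Bare `except:` has no type; tuple types fall back to a generic label.
                name = getattr(handler.type, "id", "Exception") if handler.type else "BaseException"
                findings.append(f"line {node.lineno}: handles {name} -> test this failure mode")
    return findings

sample = '''def load(path):
    try:
        return open(path).read()
    except FileNotFoundError:
        return None
'''

for finding in defensive_patterns(sample):
    print(finding)
```

The same principle extends to null checks and retry loops; the point is that each defensive construct names a failure the original authors expected, which makes it a candidate test case rather than proof of safety.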
2. Coverage theater prevention (test plan review)
Explicitly defines and flags fake tests: existence-only checks that prove something is there without verifying what it is, presence-only assertions, mocked validators that bypass the thing you're testing, single-variant testing. The difference between 95% coverage that catches nothing and tests that actually find bugs.
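A toy contrast (names invented for illustration) shows the difference between an existence-only check and an assertion that can actually fail when the code is wrong:

```python
def parse_config(text):
    """Parse 'key=value' lines into a dict (toy parser for the example)."""
    return dict(line.split("=", 1) for line in text.splitlines() if "=" in line)

cfg = parse_config("retries=3\ntimeout=30")

# Coverage theater: passes even if every parsed value is wrong.
assert "retries" in cfg

# Real test: verifies what the values actually are, so a parsing bug fails it.
assert cfg == {"retries": "3", "timeout": "30"}
```

The first assertion inflates coverage numbers without constraining behavior; the second pins the contract down.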
3. Generated protocols that AI agents execute autonomously (quality planning)
The integration test protocol isn't just documentation. When you tell Claude Code to run the tests, it reads the protocol, follows the execution plan step by step, reports progress, and produces a final summary with a ship-or-no-ship recommendation. The documentation becomes the test runner's instructions.
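In miniature, the pattern looks something like the sketch below — a driver treats the protocol's numbered steps as an execution plan, reports progress per step, and ends with a ship/no-ship verdict. The step names and the `CHECKS` mapping are hypothetical; the real skill has an agent interpret the protocol document directly rather than a hard-coded runner:

```python
PROTOCOL = """\
1. Verify the build succeeds
2. Run the functional test suite
3. Check integration points respond
"""

CHECKS = {  # each step maps to a callable returning True on pass (stubbed here)
    "Verify the build succeeds": lambda: True,
    "Run the functional test suite": lambda: True,
    "Check integration points respond": lambda: False,
}

def run_protocol(text):
    """Execute each numbered protocol step, report progress, and recommend."""
    results = []
    for line in text.splitlines():
        step = line.split(". ", 1)[1]
        passed = CHECKS[step]()
        results.append(passed)
        print(f"[{'PASS' if passed else 'FAIL'}] {step}")
    verdict = "SHIP" if all(results) else "NO-SHIP"
    print(f"Recommendation: {verdict}")
    return verdict

run_protocol(PROTOCOL)  # prints per-step status, then "Recommendation: NO-SHIP"
```

The design point is that the protocol document, not the runner, carries the plan: updating the protocol updates the test run.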
Testing
Tested against five codebases across four languages. The skill had never seen any of them:
Also submitted to github/awesome-copilot — the skill follows the shared Agent Skills specification and works with both Claude Code and Copilot.
Background
This skill emerged from an O'Reilly Radar series on AI-driven development. The full story covers how it was built, why it works, and the quality engineering theory behind it.
Test plan