The Complete Guide to Production AI: From 87% Failure Rate to Deployment Success
480+ Production Checklist Items • 20 Domains • CRISP-DM Based • Enterprise Ready
📥 Download Interactive Checklist • 📊 Download CSV Template • 🏗️ View Architecture
A battle-tested checklist built from 27 years of enterprise experience and analysis of $15B+ in AI failures (IBM Watson, Zillow, Babylon Health, Character.AI). Avoid the mistakes that killed billion-dollar AI projects.
🎯 New to this checklist?
📊 Know what you need? Jump to: Architecture | Security | Monitoring | Healthcare AI |
⬆️ Top · Next: Why This Checklist ➡️
After 27 years of building enterprise systems and analyzing why AI projects fail in production, I've compiled this checklist of everything you need to consider before deploying AI to real users.
This checklist helps you avoid:
- 💸 Financial disasters like Zillow's $500M+ algorithmic home-buying (Zillow Offers) collapse
- ⚠️ Safety failures like Character.AI's crisis mishandling leading to teen suicide
- 🏥 Clinical harm like IBM Watson's unsafe treatment recommendations
- 📉 Business failures like Babylon Health's $4.2B → $0 collapse
- ⚖️ Legal liability from EU AI Act violations, HIPAA breaches, or bias lawsuits
| Metric | Value | Source |
|---|---|---|
| ML projects failing to reach production | 87% | Industry research |
| Companies with full operational AI integration | 1% | McKinsey |
| Organizations planning to increase AI investment (2025) | 92% | Gartner |
| Organizations using AI agents in production | 79% | Industry survey |
| Enterprises with 50+ generative AI use cases in pipeline | 80% | Enterprise survey |
| Organizations actively managing AI spending (2x from 2024) | 63% | FinOps Foundation |
| Faster model deployment with comprehensive MLOps | 60% | MLOps research |
| Reduction in production incidents with proper governance | 40% | Governance studies |
Market Growth:
- AI agents market: $5.4B → $7.6B (2024→2025)
- Enterprise LLM market: $5.9B → $71.1B projected by 2035
⬆️ Quick Start · Next: Architecture ➡️
⬅️ Why This Checklist · Next: How to Use ➡️
This checklist helps you systematically evaluate your AI system's readiness for production deployment. Each section addresses a critical aspect of enterprise AI operations—skip any section at your own risk.
- Assess Current State - Go through each section and check items you've already completed
- Identify Gaps - Unchecked items represent potential risks or missing capabilities
- Prioritize by Risk - Focus on Security, Safety, and Monitoring first—these prevent disasters
- Filter by Stage - Use lifecycle stage filters to focus on items relevant to your current phase
- Create Action Plan - Turn unchecked items into tasks with owners and deadlines
- Track Progress - Use the interactive HTML checklist with auto-save and dark mode support
| Priority | Sections | Why |
|---|---|---|
| 🔴 Critical | Security & Compliance, Safety & Ethics, Assured Intelligence | Legal liability, user safety, quantified uncertainty |
| 🟠 High | Monitoring & Observability, Cost Management, Data Quality | You can't fix what you can't see; costs can explode |
| 🟡 Important | Red Teaming, Governance, Evaluation, Metric Alignment | Prevent attacks, ensure compliance, maintain quality |
| 🟢 Foundation | Architecture, Agentic AI, Performance | Long-term scalability and maintainability |
| 🔵 Enablers | Prompt Engineering, Strategy, Team | Operational excellence and continuous improvement |
| Score | Level | What It Means |
|---|---|---|
| 0-20% | 🔴 Prototype | Demo only—not ready for any real users |
| 21-40% | 🟠 Alpha | Internal testing only with technical users |
| 41-60% | 🟡 Beta | Limited external users with clear warnings |
| 61-80% | 🟢 Production Ready | Ready for general availability |
| 81-100% | 🏆 Enterprise Grade | Mission-critical deployment ready |
| Section | What It Covers | Key Risk If Skipped |
|---|---|---|
| Architecture & Design | Data pipelines, model infrastructure, system design | Technical debt, scaling failures |
| 🔬 Data Quality & Statistical Validity | Training-serving skew, data leakage, drift detection | Silent failures, "optimism trap," model degradation |
| Agentic AI & MAS | Multi-agent patterns, orchestration, collaboration | Coordination failures, unpredictable behavior |
| Security & Compliance | Auth, encryption, privacy, industry standards | Data breaches, legal penalties |
| Red Teaming & LLM Security | OWASP vulnerabilities, adversarial testing | Prompt injection, data leakage |
| Performance & Scale | Latency, throughput, parallelism | Poor user experience, outages |
| Cost Management & FinOps | Token tracking, budgets, optimization | Unexpected bills, budget overruns |
| Safety & Ethics | Input/output safety, bias, responsible AI | Harmful outputs, reputation damage |
| Monitoring & Observability | Metrics, alerting, dashboards | Blind to issues, slow incident response |
| Operations & Maintenance | Deployment, model management, DR | Downtime, data loss |
| 🔧 Technical Debt & System Integrity | CACE principle, pipeline jungles, feedback loops | Brittle systems, cascading failures, stagnation |
| AI Governance | Regulatory compliance, EU AI Act, audit trails | Fines, legal action, failed audits |
| LLM Evaluation & Testing | Quality metrics, testing types, benchmarks | Degraded quality, hallucinations |
| 📐 Metric Alignment & Evaluation | Proxy problems, Goodhart's Law, online evaluation | Business-destructive "optimized" models |
| 🔬 Assured Intelligence & Quantitative Safety | Conformal prediction, calibration, causal inference, zero-FN | Overconfident wrong predictions, unquantified risk, proxy discrimination |
| Prompt Engineering | Design principles, version control, CI/CD | Inconsistent outputs, maintenance chaos |
| AI Strategy & Transformation | Roadmap, implementation phases, change management | Failed adoption, wasted investment |
| Team & Process | Documentation, training, organizational readiness | Knowledge silos, operational failures |
| 🏥 Healthcare & Mental Health AI | Crisis detection, clinical validation, ethics | Patient harm, deaths, lawsuits |
| Anti-Patterns: Case Studies | Zillow, Amazon, Epic failure analysis | Repeating billion-dollar mistakes |
⬅️ Architecture · Next: Essential 20 ➡️
Don't have time for 400+ items? Start here. These 20 items are non-negotiable for ANY AI project going to production. Complete these first, then expand based on your persona path.
| # | Item | Why It's Critical | Section |
|---|---|---|---|
| 1 | Authentication (JWT/OAuth) | No auth = anyone can abuse your API | Security |
| 2 | Rate limiting per user | Prevents cost explosions and abuse | Security |
| 3 | Prompt injection detection | #1 LLM vulnerability (OWASP LLM01) | Red Teaming |
| 4 | Output toxicity filtering | Prevents harmful/offensive outputs | Safety |
| 5 | PII detection and masking | Legal requirement (GDPR, HIPAA) | Privacy |
| 6 | Error handling with fallbacks | Graceful degradation, not crashes | Architecture |
| 7 | Basic monitoring (latency, errors) | You can't fix what you can't see | Monitoring |
| 8 | Cost alerts and hard limits | Prevents $100K surprise bills | FinOps |
| 9 | Rollback procedure documented | Quick recovery from bad deployments | Operations |
| 10 | Human escalation path defined | When AI fails, humans must intervene | Safety |
| 11 | Golden test dataset (~50 prompts) | Catch regressions before users do | Evaluation |
| 12 | Model/prompt version control | Know what's deployed, enable rollback | MLOps |
| 13 | TLS encryption (data in transit) | Basic security requirement | Security |
| 14 | Backup strategy (3-2-1 rule) | Recover from disasters | DR |
| 15 | API documentation | Others can use and maintain it | Team |
| 16 | Hallucination rate tracking | Know how often your AI lies | Evaluation |
| 17 | Clear scope boundaries | Users know what AI can/can't do | Safety |
| 18 | Audit logging | Forensics when things go wrong | Compliance |
| 19 | Bias testing completed | Avoid discrimination lawsuits | Ethics |
| 20 | Kill switch / disable capability | Emergency shutdown when needed | Operations |
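Item 11 (the golden test dataset) is one of the fastest wins on this list. As a hedged illustration, here is a minimal pytest-style regression test, assuming a hypothetical `generate_answer()` wrapper around your LLM call and a `golden_set.jsonl` file of prompt/expected-fact pairs:

```python
# golden_test.py — regression test over a small "golden" prompt set.
# Assumes a hypothetical generate_answer(prompt) wrapper around your LLM call and
# a golden_set.jsonl file with one {"prompt": ..., "must_contain": ...} object per line.
import json
import pytest

from my_app import generate_answer   # hypothetical application entry point

with open("golden_set.jsonl", encoding="utf-8") as f:
    GOLDEN_CASES = [json.loads(line) for line in f if line.strip()]

@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["prompt"][:40])
def test_golden_prompt(case):
    answer = generate_answer(case["prompt"])
    # Cheap, deterministic check: the answer must mention the expected fact.
    assert case["must_contain"].lower() in answer.lower()
```

Run it in CI before every prompt or model change; exact-substring checks are crude, but they catch regressions deterministically and cost nothing per run.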
Completed all 20? You're at ~40% readiness (Alpha stage). Now pick your persona path to reach production.
⬆️ Back to Top · Next: Persona Paths ➡️
Different roles need different priorities. Find your persona below and follow the customized path to production readiness.
| I am a... | My main concern is... | Jump to |
|---|---|---|
| CTO / Technical Executive | Technical strategy, team scaling, risk | CTO Path |
| VP of AI / Head of ML | AI roadmap, team leadership, delivery | VP AI Path |
| Startup Founder | Ship fast without disasters | Startup Path |
| Enterprise Architect | Scale, compliance, integration | Enterprise Path |
| Solo Developer | Side project / learning | Solo Path |
| Healthcare/Medical | Patient safety, FDA, HIPAA | Healthcare Path |
| Financial Services | Fraud, compliance, audit | FinServ Path |
| Data Scientist | Transitioning to ML Engineering | DS→MLE Path |
| Platform Team | Infrastructure, MLOps | Platform Path |
| Compliance/Legal | Risk, regulations, audit | Compliance Path |
| Agency/Consultancy | Building for clients | Agency Path |
| Government/Public Sector | Transparency, FedRAMP, citizens | Government Path |
Your Reality: Board accountability, budget ownership, team scaling, technical risk across the organization, vendor relationships, security posture.
Your Risk Profile: Career-defining decisions. AI failures become your failures. Must balance innovation speed with enterprise risk.
flowchart TB
subgraph CTO["👔 CTO STRATEGIC FRAMEWORK"]
direction TB
subgraph Governance["🏛️ GOVERNANCE & RISK"]
G1["AI Risk Committee"]
G2["Board Reporting"]
G3["Insurance Coverage"]
end
subgraph Technical["⚙️ TECHNICAL STRATEGY"]
T1["Build vs Buy"]
T2["Vendor Selection"]
T3["Architecture Standards"]
end
subgraph Team["👥 ORGANIZATION"]
O1["Team Structure"]
O2["Hiring Strategy"]
O3["Skills Development"]
end
subgraph Delivery["🚀 DELIVERY"]
D1["Portfolio Prioritization"]
D2["Success Metrics"]
D3["Incident Response"]
end
end
style CTO fill:transparent,stroke:#1e40af,stroke-width:2px
style Governance fill:#fef2f2,stroke:#dc2626
style Technical fill:#dbeafe,stroke:#3b82f6
style Team fill:#dcfce7,stroke:#22c55e
style Delivery fill:#fef3c7,stroke:#f59e0b
| Priority | Decision | Key Questions |
|---|---|---|
| 🔴 Week 1 | AI Risk Assessment | What's our risk appetite? What could kill the company? |
| 🔴 Week 2 | Build vs Buy Strategy | Core competency or commodity? Vendor lock-in risks? |
| 🟠 Week 3 | Team & Budget | Do we have the talent? What's realistic budget? |
| 🟠 Week 4 | Governance Model | Who approves AI projects? What are the gates? |
- AI steering committee formed (you + CEO + Legal + Product)
- AI ethics guidelines published internally
- Vendor evaluation criteria established
- Security review process for AI tools defined
- Budget allocation and tracking system
- Success metrics defined (business outcomes, not just technical)
- Incident response plan for AI failures
- Board reporting dashboard created
- Insurance coverage reviewed for AI-specific risks
- Regulatory compliance roadmap (EU AI Act, etc.)
- Technical debt management process
- Knowledge sharing across AI teams
| Decision | Options | Consider |
|---|---|---|
| Build vs Buy | Internal team vs Vendors vs Hybrid | Core IP, time-to-market, talent availability |
| Model Strategy | Proprietary vs Open Source vs API | Cost, control, compliance, capabilities |
| Risk Tolerance | Conservative vs Aggressive | Industry, stage, competition, regulation |
| Team Structure | Centralized vs Federated vs Hybrid | Company size, culture, use case diversity |
| Vendor Selection | OpenAI vs Anthropic vs Google vs OSS | Cost, features, data residency, reliability |
| Metric | Why It Matters | Target |
|---|---|---|
| AI Project ROI | Justify investment to board | >3x within 18 months |
| Time to Production | Measure team velocity | <90 days for typical project |
| Incident Rate | Operational excellence | <1 P1 per quarter |
| Cost per Inference | Unit economics | Decreasing trend |
| Compliance Score | Risk management | 100% mandatory items |
| Team Retention | Talent strategy | >85% annual retention |
Present these quarterly:
- Portfolio Status - Projects, stages, blockers
- Risk Register - Top 5 AI risks and mitigations
- Financial - Spend vs budget, ROI by project
- Compliance - Regulatory status, audit findings
- Competitive - How we compare to industry
- AI Governance — Own the framework, delegate implementation
- AI Strategy & Transformation — Your primary section
- Security & Compliance — Ensure coverage, don't implement
- Cost Management & FinOps — Budget accountability
- Technical implementation → VP of AI / Engineering leads
- Day-to-day operations → Platform team
- Compliance details → Legal / Compliance team
- Vendor negotiations → Procurement (with your input)
⬆️ Back to Personas · Next: VP of AI ➡️
Your Reality: Translating strategy into execution, managing ML teams, delivering AI products, balancing research vs production, hiring and retaining talent.
Your Risk Profile: Accountable for AI delivery. Must ship while maintaining quality. Team success = your success.
flowchart LR
subgraph VPAI["🎯 VP OF AI OPERATIONAL FRAMEWORK"]
direction LR
subgraph Strategy["📋 STRATEGY"]
S1["Roadmap"]
S2["Prioritization"]
S3["Resource<br/>Allocation"]
end
subgraph Delivery["🚀 DELIVERY"]
D1["Project<br/>Management"]
D2["Quality<br/>Gates"]
D3["Release<br/>Process"]
end
subgraph Team["👥 TEAM"]
T1["Hiring"]
T2["Development"]
T3["Culture"]
end
subgraph Excellence["⭐ EXCELLENCE"]
E1["Best<br/>Practices"]
E2["Tooling"]
E3["Metrics"]
end
Strategy --> Delivery --> Team --> Excellence
end
style VPAI fill:transparent,stroke:#7c3aed,stroke-width:2px
style Strategy fill:#dbeafe,stroke:#3b82f6
style Delivery fill:#dcfce7,stroke:#22c55e
style Team fill:#fef3c7,stroke:#f59e0b
style Excellence fill:#fae8ff,stroke:#a855f7
| Priority | Action | Outcome |
|---|---|---|
| 🔴 Week 1-2 | Assess current team capabilities | Skills matrix, gap analysis |
| 🔴 Week 2-3 | Establish project intake process | Clear prioritization criteria |
| 🟠 Week 3-4 | Define quality gates | Stage-gate process adopted |
| 🟠 Month 2 | Set up MLOps foundations | CI/CD, monitoring, versioning |
- Project portfolio dashboard created
- Sprint/iteration cadence established
- Code review and ML review process defined
- Experiment tracking system implemented
- Model registry and versioning in place
- Evaluation framework standardized
- Self-service ML platform capabilities
- Reusable components library
- Cross-team knowledge sharing (ML guild)
- Continuous improvement retrospectives
- Career ladders and growth paths defined
- On-call rotation and incident management
flowchart TB
subgraph Structures["TEAM STRUCTURE OPTIONS"]
subgraph Central["🏢 CENTRALIZED"]
C1["All ML in one team"]
C2["Pros: Standards, efficiency"]
C3["Cons: Bottleneck, distant from product"]
end
subgraph Embedded["🔀 EMBEDDED"]
E1["ML in each product team"]
E2["Pros: Close to product"]
E3["Cons: Inconsistent, silos"]
end
subgraph Hybrid["⚖️ HYBRID (Recommended)"]
H1["Platform + Embedded"]
H2["Pros: Best of both"]
H3["Cons: Coordination overhead"]
end
end
style Central fill:#fecaca,stroke:#dc2626
style Embedded fill:#fef3c7,stroke:#f59e0b
style Hybrid fill:#dcfce7,stroke:#22c55e
| Day | Focus | Activities |
|---|---|---|
| Monday | Planning | Project status, blocker resolution, priority alignment |
| Tuesday | Technical | Architecture reviews, technical debt discussions |
| Wednesday | People | 1:1s, hiring interviews, career conversations |
| Thursday | Delivery | Demo reviews, quality gate checks, release planning |
| Friday | Strategy | Roadmap refinement, stakeholder alignment, learning |
| Category | Metric | Target |
|---|---|---|
| Delivery | Projects on schedule | >80% |
| Quality | Models meeting accuracy targets | >90% |
| Velocity | Time from idea to production | <60 days |
| Reliability | Model uptime | >99.5% |
| Efficiency | Model retraining frequency | As needed, <monthly |
| Team | Engineer satisfaction (eNPS) | >40 |
| Cost | Cost per prediction | Decreasing |
| Anti-Pattern | Symptoms | Solution |
|---|---|---|
| Research Trap | Always experimenting, never shipping | Time-box research, define "good enough" |
| Hero Culture | 1-2 people know everything | Documentation, pair programming, rotation |
| Technical Debt Spiral | Shipping fast, breaking often | Dedicated debt sprints, quality gates |
| Evaluation Theater | Good offline metrics, bad production | Real-world validation, shadow deployments |
| Scope Creep | Projects never finish | Clear success criteria, MVP mindset |
| Role | When to Hire | Key Skills |
|---|---|---|
| ML Engineer | First hire after you | Production systems, software engineering |
| Data Scientist | When you have data | Statistics, experimentation, modeling |
| MLOps Engineer | At scale | Infrastructure, automation, monitoring |
| Research Scientist | Competitive advantage needed | Novel methods, publications not required |
| ML Manager | Team > 6 people | Leadership, project management, technical |
- LLM Evaluation & Testing — Quality is your responsibility
- Operations & Maintenance — Delivery excellence
- Monitoring & Observability — See problems early
- Agentic AI & Multi-Agent Systems — Architecture patterns
- Technical Debt & System Integrity — Keep systems healthy
| Stakeholder | They Care About | Give Them |
|---|---|---|
| CTO | Risk, budget, strategy | Monthly exec summary, risk register |
| Product | Features, timelines | Roadmap alignment, trade-off discussions |
| Engineering | Integration, reliability | API contracts, SLAs, documentation |
| Data | Quality, access | Data requirements, feedback loops |
| Business | ROI, capabilities | Business impact metrics, demos |
⬆️ Back to Personas · ⬅️ CTO · Next: Startup ➡️
Your Reality: Limited resources, need to ship fast, can't afford disasters, investors watching.
Your Risk Profile: High speed, medium-high risk tolerance, but one bad incident could kill the company.
Focus on items that prevent company-killing incidents:
| Priority | Items | Why |
|---|---|---|
| 🔴 Day 1 | Authentication, Rate Limiting, Cost Limits | Prevent abuse and bankruptcy |
| 🔴 Day 2-3 | Prompt Injection Protection, Output Filtering | Prevent PR disasters |
| 🟠 Day 4-5 | Basic Monitoring, Error Handling, Logging | Know when things break |
| 🟠 Week 2 | Golden Test Set, Rollback Procedure, Kill Switch | Catch issues, recover fast |
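As a rough illustration of the Day 1 items above, the sketch below shows per-user rate limiting plus a hard daily cost cap in plain Python. It is in-memory and single-process only; a real deployment would typically enforce this at an API gateway or back it with Redis, and the limits shown are arbitrary placeholders.

```python
import time
from collections import defaultdict, deque

MAX_REQUESTS_PER_MINUTE = 20       # illustrative limits only
DAILY_COST_LIMIT_USD = 50.0

_request_log = defaultdict(deque)  # user_id -> timestamps of recent requests
_spend_today_usd = 0.0             # reset by a daily scheduled job (not shown)

def check_rate_limit(user_id: str) -> None:
    now = time.time()
    window = _request_log[user_id]
    while window and now - window[0] > 60:
        window.popleft()                            # drop requests older than 60 s
    if len(window) >= MAX_REQUESTS_PER_MINUTE:
        raise RuntimeError("Rate limit exceeded; retry in a minute.")
    window.append(now)

def record_cost(call_cost_usd: float) -> None:
    global _spend_today_usd
    _spend_today_usd += call_cost_usd
    if _spend_today_usd >= DAILY_COST_LIMIT_USD:
        raise RuntimeError("Daily LLM budget exhausted; service paused.")
```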
As you get users, add:
- User feedback collection
- A/B testing framework
- Hallucination tracking
- Basic bias testing
- Privacy policy & ToS
Before Series A or major growth:
- SOC 2 Type I preparation
- GDPR compliance (if EU users)
- Comprehensive monitoring
- Incident response runbook
- On-call rotation
- Security & Compliance (auth, rate limiting)
- Safety & Ethics (output filtering)
- Cost Management (prevent bill shock)
- Monitoring (basic observability)
- Assured Intelligence (add after product-market fit)
- Full Governance (add when preparing for enterprise sales)
- Scale & Parallelism (premature optimization)
⬆️ Back to Personas · Next: Enterprise ➡️
Your Reality: Complex stakeholder landscape, existing systems to integrate, compliance requirements, long procurement cycles.
Your Risk Profile: Low risk tolerance, high scrutiny, failures are career-limiting.
Get organizational buy-in with proper governance:
| Priority | Items | Why |
|---|---|---|
| 🔴 Week 1-2 | AI Vision, Use Case Prioritization, Cross-functional Team | Align stakeholders |
| 🔴 Week 2-3 | EU AI Act Mapping, Risk Classification, Legal Review | Regulatory compliance |
| 🟠 Week 3-4 | Security Architecture, Zero-Trust Design, RBAC | Enterprise security |
| 🟠 Month 2 | Data Governance, Lineage, Contracts | Data foundation |
- Shadow mode deployment
- A/B testing with internal users
- Full audit trail implementation
- Integration with existing SIEM/monitoring
- Vendor risk assessment (if using third-party LLMs)
- Blue-green deployment capability
- Multi-region failover
- SOC 2 Type II audit
- Full incident response procedures
- Executive dashboards
- FinOps optimization
- Model registry and versioning
- Automated retraining pipelines
- Advanced monitoring (drift, bias)
- AI Governance — Start here
- Security & Compliance
- Architecture & Design
- Monitoring & Observability
- Technical Debt & System Integrity
- Procurement: Add LLM vendor to approved vendor list
- Legal: AI-specific terms in vendor contracts
- HR: AI usage policies for employees
- Finance: FinOps integration with existing cost centers
⬆️ Back to Personas · ⬅️ Startup · Next: Solo Dev ➡️
Your Reality: Learning, limited time, no budget, acceptable if it breaks.
Your Risk Profile: High risk tolerance for yourself, but still need basics.
| # | Item | Time | Why |
|---|---|---|---|
| 1 | API key in environment variables (not code) | 5 min | Basic security |
| 2 | Rate limiting (even basic) | 30 min | Prevent abuse |
| 3 | Cost alerts on your LLM provider | 10 min | Avoid surprise bills |
| 4 | Basic input validation | 1 hour | Prevent injection |
| 5 | Error handling with user-friendly messages | 1 hour | Better UX |
| 6 | Simple logging (console or file) | 30 min | Debug issues |
| 7 | README with setup instructions | 30 min | Future you will thank you |
| 8 | Git repository with .gitignore (no secrets!) | 15 min | Version control basics |
Total time: ~4 hours for a solid foundation
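For reference, a hedged sketch of items 1, 4, 5, and 6 from the table above. The `call_llm()` function is a placeholder for whichever provider SDK you use, and `LLM_API_KEY` is an assumed environment variable name:

```python
import logging
import os

logging.basicConfig(filename="app.log", level=logging.INFO)   # item 6: simple logging

API_KEY = os.environ.get("LLM_API_KEY")   # item 1: key from the environment, never hard-coded
if not API_KEY:
    raise SystemExit("Set LLM_API_KEY before starting the app.")

def call_llm(prompt: str, api_key: str) -> str:
    raise NotImplementedError("Replace with your provider's SDK call.")

def answer(prompt: str) -> str:
    if not prompt or len(prompt) > 4_000:       # item 4: basic input validation
        return "Sorry, that request is empty or too long."
    try:
        return call_llm(prompt, api_key=API_KEY)
    except Exception:                            # item 5: fail with a friendly message
        logging.exception("LLM call failed")
        return "Something went wrong on our side. Please try again."
```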
Upgrade to Startup Path when:
- You have real users (not just friends)
- Processing any PII or sensitive data
- Charging money for the service
- Storing conversation history
- Free monitoring: Sentry free tier, simple uptime checks
- Free LLM: Ollama locally, or free tiers of commercial APIs
- Free hosting: Vercel, Railway, Fly.io free tiers
- Cost control: Set hard spending limits on all API providers
⬆️ Back to Personas · ⬅️ Enterprise · Next: Healthcare ➡️
Your Reality: Lives at stake, heavy regulation, long validation cycles, clinical workflows.
Your Risk Profile: ZERO tolerance for safety failures. One death can end the company.
⚠️ Critical: Healthcare AI has unique requirements. The Healthcare & Mental Health AI section is MANDATORY, not optional.
flowchart TD
subgraph Regulatory["⚠️ BEFORE WRITING ANY CODE"]
Q1{"1. Is this a<br/>Medical Device?"}
Q1 -->|YES| FDA["📋 FDA Pathway<br/>510(k) / De Novo / PMA"]
Q1 -->|NO| Q5
FDA --> Q2{"2. Targeting<br/>EU Market?"}
Q2 -->|YES| CE["🇪🇺 CE Marking<br/>MDR/IVDR Compliance"]
Q2 -->|NO| Q3
CE --> Q3{"3. Mental Health<br/>Application?"}
Q3 -->|YES| CRISIS["🚨 Crisis Detection<br/>100% Recall Required"]
Q3 -->|NO| Q4
CRISIS --> Q4{"4. Processing<br/>Patient Data?"}
Q4 -->|YES| HIPAA["🔒 HIPAA/HITECH<br/>Compliance Required"]
Q4 -->|NO| Q5
HIPAA --> Q5["✅ Proceed with<br/>Development"]
end
style Regulatory fill:#fef2f2,stroke:#dc2626,stroke-width:2px
style Q1 fill:#fef3c7,stroke:#f59e0b,color:#78350f
style Q2 fill:#fef3c7,stroke:#f59e0b,color:#78350f
style Q3 fill:#fef3c7,stroke:#f59e0b,color:#78350f
style Q4 fill:#fef3c7,stroke:#f59e0b,color:#78350f
style FDA fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
style CE fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
style CRISIS fill:#fecaca,stroke:#dc2626,color:#7f1d1d
style HIPAA fill:#fecaca,stroke:#dc2626,color:#7f1d1d
style Q5 fill:#dcfce7,stroke:#22c55e,color:#14532d
| Priority | Items | Why |
|---|---|---|
| 🔴 Week 1 | FDA SaMD Classification, Regulatory Strategy | Determines everything else |
| 🔴 Week 2-4 | IEC 62304 Software Lifecycle, ISO 13485 QMS | Required for FDA |
| 🔴 Month 2 | Safety-Critical Architecture (IEC 61508) | Formal safety invariants |
| 🔴 Month 2-3 | Crisis Detection System (if mental health) | 100% recall, <1s response |
- IRB approval for clinical studies
- Independent third-party validation
- Geographic validation (all target regions)
- Demographic validation (all patient groups)
- Clinician workflow integration testing
- Clinical evidence package
- Risk management file (ISO 14971)
- Software documentation package
- Cybersecurity documentation
- Human factors validation
- Adverse event reporting system
- Post-market surveillance
- Continuous clinical monitoring
- Model performance tracking
- Regulatory update monitoring
- Healthcare & Mental Health AI Safety — START HERE
- Assured Intelligence — Uncertainty quantification
- AI Governance — Regulatory compliance
- Safety & Ethics — Output safety
- Security & Compliance — HIPAA compliance
| Metric | Target | Why |
|---|---|---|
| Crisis detection recall | 100% | Zero false negatives for safety |
| Crisis response latency | <1 second | Immediate intervention |
| False positive rate | <5% | Minimize alert fatigue |
| Clinician override availability | Always | Humans must be able to intervene |
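One way to operationalize the 100% recall target is to choose the decision threshold from validation data so that no known crisis case falls below it, as in the sketch below. This only guarantees zero false negatives on the validation set, not in the field, which is exactly why the clinician override and post-market monitoring items remain mandatory.

```python
import numpy as np

def zero_fn_threshold(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Lowest score assigned to any true crisis case (label 1) in validation data.

    Flagging everything at or above this threshold yields zero false negatives
    on this data; field recall still has to be monitored continuously.
    """
    crisis_scores = y_score[y_true == 1]
    return float(crisis_scores.min())

def false_positive_rate(y_true: np.ndarray, y_score: np.ndarray, threshold: float) -> float:
    flagged = y_score >= threshold
    return float(flagged[y_true == 0].mean())   # the alert-fatigue metric from the table
```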
⬆️ Back to Personas · ⬅️ Solo Dev · Next: FinServ ➡️
Your Reality: Regulated industry, fraud concerns, audit requirements, model explainability mandates.
Your Risk Profile: Low tolerance, regulators watching, fiduciary duty.
- US: OCC, Fed, CFPB guidance on AI/ML in banking
- EU: EBA guidelines on ICT risk, DORA, AI Act
- Global: Basel Committee principles for AI
- Fair Lending: ECOA, Fair Housing Act (explainability required)
| Priority | Items | Why |
|---|---|---|
| 🔴 Week 1-2 | Model Risk Management (SR 11-7) | Federal Reserve requirement |
| 🔴 Week 2-3 | Fair Lending Analysis, Disparate Impact Testing | Avoid discrimination claims |
| 🔴 Week 3-4 | Explainability Requirements, Adverse Action Notices | Regulatory mandate |
| 🟠 Month 2 | Audit Trail, Model Lineage, Version Control | Examination readiness |
- Model inventory and tiering
- Independent model validation (second line)
- Model performance monitoring
- Champion/challenger framework
- Model documentation standards
- Real-time fraud detection integration
- Transaction monitoring
- Suspicious activity reporting
- Customer complaint tracking
- Regulatory reporting automation
- AI Governance — Model risk management
- Metric Alignment & Evaluation — Avoid Goodhart's Law
- Assured Intelligence — Calibration, uncertainty
- Anti-Patterns: Case Studies — Learn from Zillow
- Technical Debt & System Integrity — CACE principle
- Explainability: Every decision must be explainable to regulators and customers
- Audit: Complete audit trail for all model decisions
- Fairness: Regular disparate impact analysis across protected classes
- Stress Testing: Model performance under adverse economic conditions
⬆️ Back to Personas · ⬅️ Healthcare · Next: DS→MLE ➡️
Your Reality: Strong in modeling, learning production skills, bridging the gap.
Your Risk Profile: Learning curve, need to understand ops and infrastructure.
flowchart LR
subgraph DS["🔬 DATA SCIENTIST<br/>Skills"]
DS1["📓 Jupyter<br/>Notebooks"]
DS2["🧪 Local<br/>Experiments"]
DS3["🎯 Model<br/>Accuracy"]
DS4["📦 Batch<br/>Processing"]
DS5["🐍 Python<br/>Scripts"]
end
subgraph GAP["🌉 BRIDGE THE GAP"]
G1["Version<br/>Control"]
G2["Reproducibility"]
G3["System<br/>Reliability"]
G4["Real-time<br/>Serving"]
G5["Production<br/>Code"]
end
subgraph MLE["⚙️ ML ENGINEER<br/>Skills"]
MLE1["📊 Git<br/>MLflow"]
MLE2["🐳 Docker<br/>CI/CD"]
MLE3["📈 Monitoring<br/>Alerting"]
MLE4["🚀 APIs<br/>Streaming"]
MLE5["✅ Testing<br/>Error Handling"]
end
DS1 --> G1 --> MLE1
DS2 --> G2 --> MLE2
DS3 --> G3 --> MLE3
DS4 --> G4 --> MLE4
DS5 --> G5 --> MLE5
style DS fill:#fae8ff,stroke:#a855f7,stroke-width:2px
style GAP fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style MLE fill:#dcfce7,stroke:#22c55e,stroke-width:2px
style DS1 fill:#ffffff,stroke:#a855f7
style DS2 fill:#ffffff,stroke:#a855f7
style DS3 fill:#ffffff,stroke:#a855f7
style DS4 fill:#ffffff,stroke:#a855f7
style DS5 fill:#ffffff,stroke:#a855f7
style G1 fill:#ffffff,stroke:#f59e0b
style G2 fill:#ffffff,stroke:#f59e0b
style G3 fill:#ffffff,stroke:#f59e0b
style G4 fill:#ffffff,stroke:#f59e0b
style G5 fill:#ffffff,stroke:#f59e0b
style MLE1 fill:#ffffff,stroke:#22c55e
style MLE2 fill:#ffffff,stroke:#22c55e
style MLE3 fill:#ffffff,stroke:#22c55e
style MLE4 fill:#ffffff,stroke:#22c55e
style MLE5 fill:#ffffff,stroke:#22c55e
| Priority | Items | Why |
|---|---|---|
| 🔴 Week 1 | Version Control (prompts, models, data) | Reproducibility |
| 🔴 Week 2 | CI/CD Basics, Automated Testing | Quality gates |
| 🟠 Week 3 | Containerization (Docker), Environment Management | Consistency |
| 🟠 Week 4 | API Design, Error Handling | Production serving |
- Monitoring dashboards (Grafana, DataDog)
- Alerting and on-call basics
- Log aggregation and analysis
- Performance profiling
- Cost tracking per experiment
- Feature stores
- Model registry
- A/B testing framework
- Drift detection
- Automated retraining triggers
- Operations & Maintenance — Deployment basics
- Monitoring & Observability — See what's happening
- Data Quality & Statistical Validity — Training-serving skew
- LLM Evaluation & Testing — Production evaluation
- Technical Debt & System Integrity — Avoid ML-specific debt
- Book: "Designing Machine Learning Systems" by Chip Huyen
- Course: "Made With ML" (free, production-focused)
- Practice: Take a notebook project and deploy it end-to-end
⬆️ Back to Personas · ⬅️ FinServ · Next: Platform ➡️
Your Reality: Supporting multiple ML teams, standardization, self-service, scale.
Your Risk Profile: Reliability is your product. Downtime affects everyone.
Build the internal platform that makes ML teams successful.
| Priority | Items | Why |
|---|---|---|
| 🔴 Week 1-2 | Kubernetes + GPU Operators | Compute foundation |
| 🔴 Week 2-3 | Model Serving Infrastructure (vLLM, Triton) | Inference platform |
| 🟠 Week 3-4 | Secrets Management, KMS | Security foundation |
| 🟠 Month 2 | Observability Stack (metrics, logs, traces) | Platform monitoring |
- Model registry (MLflow, Weights & Biases)
- Feature store (Feast, Tecton)
- Experiment tracking
- CI/CD pipelines for ML
- A/B testing infrastructure
- Developer portal / documentation
- Cost allocation and showback
- Quota management
- Audit logging
- Policy-as-code guardrails
- Architecture & Design — Infrastructure patterns
- Performance & Scale — Latency, throughput
- Cost Management & FinOps — Platform economics
- Operations & Maintenance — Reliability
- Monitoring & Observability — Platform health
| Metric | Target | Why |
|---|---|---|
| Model deployment time | <1 hour | Self-service goal |
| Platform availability | 99.9% | Reliability target |
| Cost per inference | Track & optimize | FinOps |
| Time to first experiment | <1 day | Developer experience |
⬆️ Back to Personas · ⬅️ DS→MLE · Next: Compliance ➡️
Your Reality: Protect the organization, manage liability, ensure regulatory compliance.
Your Risk Profile: Your job is to identify and mitigate risks others miss.
- Data provenance and licensing verified
- Training data consent/rights confirmed
- Output ownership/IP determined
- Liability allocation documented
- Insurance coverage reviewed
- EU AI Act risk classification completed
- Prohibited use cases verified (social scoring, etc.)
- High-risk requirements mapped (if applicable)
- GDPR/privacy impact assessment done
- Industry-specific regulations addressed
- AI-specific terms in vendor contracts
- Indemnification clauses reviewed
- SLA requirements defined
- Audit rights preserved
- Data processing agreements updated
- AI ethics policy published
- Incident response procedure documented
- Escalation paths defined
- Board/executive reporting established
- External audit schedule set
- AI Governance — Regulatory frameworks
- Security & Compliance — Data protection
- Safety & Ethics — Responsible AI
- Anti-Patterns: Case Studies — Learn from failures
- Healthcare & Mental Health AI — If applicable
- How do we know the model isn't discriminating?
- What happens when the model is wrong?
- Can we explain decisions to regulators/customers?
- How quickly can we disable the AI if needed?
- What does our audit trail look like?
⬆️ Back to Personas · ⬅️ Platform · Next: Agency ➡️
Your Reality: Building for clients, varied requirements, handoff considerations, repeatable processes.
Your Risk Profile: Client's risk becomes your risk. Reputation is everything.
Before starting any AI project, clarify:
| Question | Why It Matters |
|---|---|
| Who owns the trained model? | IP and liability |
| What data can we use for training? | Legal rights |
| What are the regulatory requirements? | Compliance scope |
| Who operates it post-handoff? | Documentation needs |
| What's the budget for ongoing costs? | FinOps planning |
- Requirements documentation
- Risk assessment
- Architecture design
- Cost estimation
- Timeline and milestones
- Environment setup (reproducible)
- Core functionality
- Testing suite
- Documentation (client-facing)
- Security review
- Operations runbook
- Monitoring dashboards
- Training sessions
- Support transition plan
- Sign-off documentation
- Architecture & Design — Reusable patterns
- Operations & Maintenance — Handoff docs
- Team & Process — Documentation standards
- Cost Management & FinOps — Client cost clarity
- Template everything: Reusable monitoring, CI/CD, documentation
- Document decisions: Client sign-off on architecture choices
- Clear handoff: Runbooks, training, support transition
- Cost transparency: Show clients ongoing operational costs
⬆️ Back to Personas · ⬅️ Compliance · Next: Government ➡️
Your Reality: Public accountability, transparency requirements, procurement rules, citizen impact.
Your Risk Profile: Public trust is paramount. Failures make headlines.
- Algorithmic impact assessment published
- Public documentation of AI use cases
- Citizen appeal/challenge mechanism
- Regular public reporting on AI performance
- Freedom of Information considerations
- FedRAMP authorization (US federal)
- StateRAMP (US state/local)
- Vendor AI ethics assessment
- Source code escrow
- Data sovereignty requirements
- Accessibility compliance (508/WCAG)
- Language access (LEP populations)
- Digital divide considerations
- Disparate impact analysis
- Community input process
- AI Governance — Public sector accountability
- Safety & Ethics — Equity and fairness
- Metric Alignment & Evaluation — Avoid gaming
- Security & Compliance — FedRAMP, FISMA
- Assured Intelligence — Explainability
| Metric | Requirement | Why |
|---|---|---|
| Explainability | High | Public accountability |
| Bias audits | Regular, public | Equity requirements |
| Uptime | High | Public service reliability |
| Data retention | Per records laws | Legal requirements |
⬆️ Back to Personas · ⬅️ Agency · Next: Flowchart ➡️
flowchart TD
subgraph Decision["🗺️ WHERE DO I START?"]
START["🚀 START HERE<br/>Do you have users?"]
START -->|NO| BUILDING["🔨 Still Building"]
START -->|YES| DEPLOYED["✅ Already Deployed"]
BUILDING --> SENSITIVE{"Handling sensitive data?<br/>(PII, health, financial)"}
DEPLOYED --> MONITORING{"Do you have<br/>monitoring & alerting?"}
SENSITIVE -->|YES| PATH_SECURE["🔐 START WITH:<br/>━━━━━━━━━━━━<br/>• Security<br/>• Privacy<br/>• Compliance<br/>• Then Essential 20"]
SENSITIVE -->|NO| PATH_ESSENTIAL["📋 START WITH:<br/>━━━━━━━━━━━━<br/>• Essential 20 items<br/>• Your persona path"]
MONITORING -->|NO| PATH_URGENT["🚨 STOP! ADD NOW:<br/>━━━━━━━━━━━━<br/>• Monitoring<br/>• Alerting<br/>• Logging<br/>• Rollback"]
MONITORING -->|YES| PATH_OPTIMIZE["📈 CHECK:<br/>━━━━━━━━━━━━<br/>• Cost management<br/>• Evaluation<br/>• Governance<br/>• Scale readiness"]
end
style START fill:#3b82f6,stroke:#1e40af,color:#ffffff,stroke-width:3px
style BUILDING fill:#f59e0b,stroke:#d97706,color:#ffffff
style DEPLOYED fill:#22c55e,stroke:#16a34a,color:#ffffff
style SENSITIVE fill:#fef3c7,stroke:#f59e0b,color:#78350f
style MONITORING fill:#fef3c7,stroke:#f59e0b,color:#78350f
style PATH_SECURE fill:#fecaca,stroke:#dc2626,color:#7f1d1d
style PATH_ESSENTIAL fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
style PATH_URGENT fill:#dc2626,stroke:#991b1b,color:#ffffff
style PATH_OPTIMIZE fill:#dcfce7,stroke:#22c55e,color:#14532d
style Decision fill:transparent,stroke:#64748b,stroke-width:2px
| Your Situation | Start With | Then Add |
|---|---|---|
| Side project, no users yet | Essential 20 | Nothing until you have users |
| Startup, pre-launch | Essential 20 → Startup Path | Security, basic monitoring |
| Startup, have users | Startup Path | Evaluation, cost management |
| Enterprise, new project | Enterprise Path | Full governance from start |
| Healthcare/Medical | Healthcare Path | Everything in Healthcare section is mandatory |
| Financial services | FinServ Path | Explainability, audit trails |
| Production with issues | Monitoring | Whatever is causing the issues |
| Scaling problems | Performance & Scale | Cost management |
| Compliance audit coming | AI Governance | Security, documentation |
⬆️ Back to Top · ⬅️ Personas · Next: FAQ ➡️
Do I need to complete ALL 400+ items?
No. The checklist is comprehensive by design—it covers everything from startups to enterprise healthcare AI.
- Minimum viable: Complete the Essential 20 items
- Production ready: Complete items relevant to your persona path
- Enterprise grade: Complete 80%+ of all applicable items
Many items are marked "Configurable" meaning they depend on your context.
What's the minimum for a POC/prototype?
For a POC that only YOU will use:
- API keys in environment variables (not code)
- Basic error handling
- Cost limits set on your LLM provider
For a POC that OTHERS will see:
- Add: Authentication, rate limiting, basic input validation
- Add: Clear "this is a prototype" disclaimers
For a POC with REAL DATA:
- Add: Everything in the Essential 20
How long does it take to become production-ready?
It depends on your starting point and target:
| Starting Point | Target | Typical Effort |
|---|---|---|
| Jupyter notebook | Internal tool | 2-4 weeks |
| Working prototype | Startup MVP | 4-8 weeks |
| MVP | Production | 2-3 months |
| Production | Enterprise-grade | 3-6 months |
Healthcare/Financial add 2-6 months for compliance.
What if I'm a small team (1-3 people)?
Focus on high-impact, low-effort items:
- Automate security basics: Auth, rate limiting, input validation
- Use managed services: Don't build monitoring from scratch
- Start with Essential 20: This covers 80% of critical risks
- Skip scale sections: Until you actually need to scale
- Use templates: Don't write runbooks from scratch
See Solo Developer Path or Startup Path.
What items cause the most production incidents?
Based on industry data and case studies:
- Missing rate limiting → Cost explosions, abuse
- No monitoring → Hours/days to detect issues
- No rollback procedure → Extended outages
- Prompt injection vulnerability → Data leakage, jailbreaks
- Training-serving skew → Silent model degradation
- Missing cost limits → $10K+ surprise bills
- No golden test set → Regressions reach users
- Hallucination without detection → User trust erosion
Which items can I defer until later?
Safe to defer (until you need them):
| Item | When to Add |
|---|---|
| Multi-region failover | When you have users in multiple regions |
| Model parallelism | When single-GPU isn't enough |
| A/B testing framework | When you're optimizing, not building |
| Advanced FinOps | When costs exceed $10K/month |
| Formal verification | When in safety-critical domains |
| Full governance framework | When preparing for enterprise or compliance |
Never defer: Security, basic monitoring, cost limits, rollback capability
What's different about LLM/GenAI vs traditional ML?
Key differences this checklist addresses:
| Traditional ML | LLM/GenAI | Checklist Section |
|---|---|---|
| Feature engineering | Prompt engineering | Prompt Engineering |
| Model accuracy | Hallucination rate | LLM Evaluation |
| Batch inference | Real-time, streaming | Performance |
| Model drift | Prompt injection | Red Teaming |
| Fixed costs | Token-based costs | Cost Management |
| Input validation | Output safety | Safety & Ethics |
How do I convince my manager/team to use this checklist?
Show them the cost of NOT using it:
| Company | What Went Wrong | Cost |
|---|---|---|
| Zillow | Model overconfidence, no uncertainty quantification | $500M+ loss, 25% layoffs |
| IBM Watson | No clinical validation, unsafe recommendations | Killed the healthcare division |
| Character.AI | No crisis detection, inadequate safety | Teen suicide, lawsuits |
| Babylon Health | Overpromised, underdelivered on safety | $4.2B → $0 |
Then show them the Essential 20 takes ~2 weeks and prevents most disasters.
How often should I review the checklist?
- Before major releases: Full relevant sections
- Monthly: Monitoring and alerting effectiveness
- Quarterly: Security and compliance sections
- Annually: Full checklist review
- After incidents: Relevant sections that could have prevented it
- When regulations change: Governance sections
Is this checklist specific to any cloud provider or framework?
No. The checklist is cloud-agnostic and framework-agnostic. It works with:
- Cloud: AWS, Azure, GCP, or on-premise
- LLM Providers: OpenAI, Anthropic, Google, open-source models
- Frameworks: LangChain, LlamaIndex, custom implementations
- MLOps: MLflow, Weights & Biases, Kubeflow, custom solutions
The companion Technology Selection Guide provides specific tool recommendations.
⬆️ Back to Top · ⬅️ Flowchart · Next: Lifecycle Stages ➡️
Why stage-based workflow matters: Only 54% of AI projects transition from pilot to production (Gartner), and only 11% of companies unlock significant AI value (BCG). A structured stage-gate approach dramatically improves success rates by ensuring the right work happens at the right time.
flowchart LR
subgraph Planning["📋 PLANNING"]
S1[💡 Ideation]
S2[🔍 Discovery]
end
subgraph Development["🔨 DEVELOPMENT"]
S3[🧪 POC]
S4[🔧 MVP]
S5[👥 Pilot]
end
subgraph Operations["⚙️ OPERATIONS"]
S6[🚀 Production]
S7[📈 Scale]
S8[⚡ Optimize]
end
S1 -->|Business Approved| S2
S2 -->|Feasible| S3
S3 -->|Viable| S4
S4 -->|Usable| S5
S5 -->|Safe & Effective| S6
S6 -->|Stable| S7
S7 -->|SLAs Met| S8
S8 -.->|Continuous Improvement| S1
style S1 fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
style S2 fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
style S3 fill:#fae8ff,stroke:#a855f7,color:#581c87
style S4 fill:#fae8ff,stroke:#a855f7,color:#581c87
style S5 fill:#fef3c7,stroke:#f59e0b,color:#78350f
style S6 fill:#dcfce7,stroke:#22c55e,color:#14532d
style S7 fill:#dcfce7,stroke:#22c55e,color:#14532d
style S8 fill:#dcfce7,stroke:#22c55e,color:#14532d
style Planning fill:transparent,stroke:#3b82f6,stroke-width:2px,color:#1e3a5f
style Development fill:transparent,stroke:#a855f7,stroke-width:2px,color:#581c87
style Operations fill:transparent,stroke:#22c55e,stroke-width:2px,color:#14532d
📋 Detailed Stage Breakdown — Click to expand
| Stage | Key Activities | Exit Gate |
|---|---|---|
| 1. Ideation | Business case, use case ID, success metrics, stakeholder buy-in | Business Approval |
| 2. Discovery | Data assessment, feasibility, risk assessment, resource plan | Technical Feasible? |
| 3. POC | Technical feasibility, core algorithm, initial results | Viable? |
| 4. MVP | Working prototype, basic UI, integration | Usable? |
| 5. Pilot | Limited users, real-world test, feedback loops, safety validation | Safe & Effective? |
| 6. Production | Full deployment, MLOps pipeline, monitoring, governance | Production Ready? |
| 7. Scale | Multi-region, performance, cost optimize, team scaling | Scalable? |
| 8. Optimize | Continuous improvement, retraining, innovation | ROI Met? |
📊 Industry Standard Comparison: CRISP-DM Mapping — Click to expand
Note: CRISP-DM (Cross-Industry Standard Process for Data Mining) is the de facto industry standard for data science and ML projects, consistently ranking #1 in KDnuggets polls over 12+ years. Our 8-stage model extends CRISP-DM to address modern AI/MLOps requirements.
| CRISP-DM Phase | Our Stage(s) | What We Add |
|---|---|---|
| 1. Business Understanding | 1. Ideation | Explicit stakeholder buy-in, success metrics |
| 2. Data Understanding | 2. Discovery | Risk assessment, resource planning |
| 3. Data Preparation | 2. Discovery + 3. POC | Integrated into discovery and POC phases |
| 4. Modeling | 3. POC + 4. MVP | Split into feasibility (POC) and prototype (MVP) |
| 5. Evaluation | 4. MVP + 5. Pilot | Extended with real-world pilot validation |
| 6. Deployment | 6. Production | Same focus on deployment |
| (not covered) | 7. Scale | NEW: Multi-region, performance optimization |
| (not covered) | 8. Optimize | NEW: Continuous improvement, retraining |
CRISP-DM was published in 1999 and, while still valuable, has known limitations for modern AI systems:
| CRISP-DM Limitation | How Our Model Addresses It |
|---|---|
| No MLOps/continuous training coverage | Stages 7-8 cover scaling and optimization |
| Designed for small teams | Gate system supports enterprise coordination |
| No pilot/validation phase | Stage 5 (Pilot) for real-world testing |
| Deployment is "done" | Stage 8 treats deployment as ongoing |
| Not AI-specific (Cognilytica) | Includes agentic AI, LLM, and safety considerations |
| Framework | Stages | Best For | Reference |
|---|---|---|---|
| CRISP-DM | 6 phases | Traditional ML/analytics | Wikipedia |
| Microsoft TDSP | 5 stages | Azure-based projects | Microsoft Docs |
| Google MLOps | 3 maturity levels | Automation-focused | Google Cloud |
| CPMAI | CRISP-DM + Agile | AI-specific projects | Cognilytica |
Best Practice: "Data science teams that combine a loose implementation of CRISP-DM with overarching team-based agile project management approaches will likely see the best results." — Data Science PM
Gates are classified into three categories based on risk:
| Type | Symbol | When Required | Rationale |
|---|---|---|---|
| Mandatory | 🔴 | Always | Legal, safety, or existential risk—cannot proceed without |
| Advisory | 🟡 | Strongly recommended | Significantly improves success probability |
| Configurable | 🟢 | Organization decides | Depends on industry, user base, risk tolerance |
📋 Gate Details by Type — Click to expand
| Gate | Items | Why Mandatory |
|---|---|---|
| Any → Next | Security vulnerabilities addressed | Legal liability, data breaches |
| Pilot → Production | Safety validation complete | User safety, especially Healthcare AI |
| Pilot → Production | Crisis detection tested (Healthcare) | Potential for fatal harm if missed |
| Any Stage | Data privacy compliance (GDPR/HIPAA) | Fines up to 4% of revenue |
| Production → Scale | Monitoring operational | Can't fix what you can't see |
| Gate | Items | Why Advisory |
|---|---|---|
| Discovery → POC | Risk assessment documented | Reduces surprises, but POC can surface unknowns |
| POC → MVP | Model accuracy targets defined | Important, but can refine in MVP |
| MVP → Pilot | Basic documentation complete | Helps users, but can iterate during pilot |
| Any Stage | Bias testing complete | Critical for fairness, depth varies by risk |
| Gate | Items | Factors to Consider |
|---|---|---|
| Any Stage | External validation | Required for Healthcare, optional for internal tools |
| POC → MVP | Clinical advisor review | Required for Healthcare AI, optional otherwise |
| Pilot → Production | A/B testing complete | Critical for consumer apps, optional for internal |
| Production → Scale | Multi-region deployment | Required for global, optional for single-market |
flowchart TD
Q1{Is there a legal/<br/>regulatory requirement?}
Q1 -->|YES| M1[🔴 MANDATORY]
Q1 -->|NO| Q2{Could failure cause<br/>user harm?}
Q2 -->|YES| M2[🔴 MANDATORY]
Q2 -->|NO| Q3{Does it significantly<br/>impact ROI?}
Q3 -->|YES| A1[🟡 ADVISORY]
Q3 -->|NO| C1[🟢 CONFIGURABLE]
style M1 fill:#fecaca,stroke:#dc2626,color:#7f1d1d
style M2 fill:#fecaca,stroke:#dc2626,color:#7f1d1d
style A1 fill:#fef3c7,stroke:#f59e0b,color:#78350f
style C1 fill:#dcfce7,stroke:#22c55e,color:#14532d
🏥 Healthcare AI: FDA Regulatory Overlay — Click to expand
When building Healthcare AI, enable this overlay to add FDA-specific requirements:
| Standard Stage | FDA Addition | Requirements |
|---|---|---|
| Stage 3: POC | + Pre-Submission | FDA feedback on regulatory pathway |
| Stage 4: MVP | + Analytical Validation | Technical performance verification |
| Stage 5: Pilot | + Clinical Validation | Real-world clinical testing |
| Stage 5→6 Gate | + Regulatory Submission | 510(k), De Novo, or PMA |
| Stage 6: Production | + Market Authorization | FDA clearance/approval required |
| Stage 8: Optimize | + Post-Market Surveillance | Ongoing safety monitoring |
FDA Gate Requirements (All Mandatory):
- Intended use clearly defined
- Risk classification determined (Class I, II, or III)
- Predicate device identified (for 510(k))
- Clinical evidence sufficient for risk level
- Quality Management System (QMS) established
- Post-market surveillance plan documented
📖 Deep Dive: See docs/LIFECYCLE-STAGES.md for detailed stage requirements and checklists.
Important
Why it matters: Poor architecture decisions made early become expensive technical debt. A well-designed AI system separates concerns, enables scaling, and makes debugging possible. This section covers the foundational infrastructure that everything else builds upon.
- Foundation Layer
- Data lakehouse combining flexibility of data lakes with structure of warehouses
- Governed data pipelines ensuring quality and compliance
- Semantic layers for consistent definitions and access patterns
- Model Infrastructure
- Specialized infrastructure for LLMs and prompt management
- MLOps integration with CI/CD for models & prompts
- Offline and online evaluation pipelines
- Responsible AI Automation
- Bias checks and red-teaming processes
- Explainability mechanisms
- Policy-as-code implementation
- Pre-production & Runtime
- Safety/quality gates and runtime guardrails
- Prompts and model configs treated as versioned artifacts
- Monitoring, drift detection, and outcome KPIs
- Scalable Infrastructure
- Kubernetes with GPU operators
- Autoscaling configured
- Mixed precision training/inference
- Data Pipeline Design
- Defined data ingestion strategy
- Implemented data validation and quality checks
- Set up data versioning system
- Created data lineage tracking
- Established data retention policies
💡 Implementation Tips
- Use tools like Dagster or Airflow for orchestration
- Implement Great Expectations for data quality
- Consider using DVC for data versioning
- Example from MultiDB-Chatbot: Separate databases for different data types
- AI-Ready Pipeline Components
- Schema validation with real-time checks and evolution planning
- Data enrichment (location, user-agent, IDs)
- Feature engineering for ML transformations
- Tiered storage (bronze/silver/gold)
- Data contracts between producers/consumers
💡 Data Pipeline Patterns
| Pattern | Use Case | Trade-offs |
|---|---|---|
| Batch Processing | Lower-volume, non-real-time | Simple but delayed |
| Stream Processing | Real-time decisions, IoT | Complex but immediate |
| Lambda | Comprehensive view | Dual system complexity |
| Kappa | Event-driven apps | Simplified, replay-based |
| Data Lakehouse | Unified analytics + ML | Best of both worlds |
| Data Mesh | Large enterprises | Autonomy vs. governance |
- Model Selection
- Evaluated multiple model options
- Performed cost-benefit analysis
- Tested fallback models
- Documented model limitations
- Created model cards
- Edge/Small Model Deployment
- On-device inference requirements assessed (mobile, IoT, embedded)
- Model quantization applied (INT4, INT8, FP16)
- Context window fits device memory constraints
- Offline capability tested (local vector store, cached responses)
- Battery/power consumption profiled
- Latency validated on target hardware (< 100ms for interactive)
- Model fits deployment target (Jamba 3B: phones, Llama 4 Scout: single GPU)
- Edge/cloud split ratio defined (e.g., 90% edge / 10% cloud fallback)
- Cloud fallback triggers documented (complexity, safety, connectivity)
- Total memory budget validated (≤8GB for consumer devices)
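The memory-budget item above can be sanity-checked with simple arithmetic before any benchmarking: weight memory is roughly parameter count times bytes per parameter at the chosen precision, with KV cache, activations, and runtime overhead on top. A small sketch, with illustrative numbers only:

```python
# Rough weight-memory estimate by precision; overheads (KV cache, runtime) are extra.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    # 1e9 parameters × bytes per parameter / 1e9 bytes per GB cancels to this product.
    return params_billions * BYTES_PER_PARAM[precision]

# Example: a 3B-parameter model quantized to INT4 needs ~1.5 GB for weights alone,
# leaving headroom for KV cache and the OS within an 8 GB consumer-device budget.
print(weight_memory_gb(3, "int4"))   # -> 1.5
```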
- Retrieval Augmented Generation (RAG)
- Designed chunking strategy
- Optimized embedding dimensions
- Implemented hybrid search (vector + keyword)
- Set up reranking pipeline
- Configured context window management
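As a starting point for the chunking item above, here is a minimal fixed-size chunker with overlap. It counts characters for simplicity; production pipelines usually count tokens and respect sentence or heading boundaries, so treat this as a sketch rather than a recommended default.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlapping windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap        # slide the window, keeping the overlap
    return chunks
```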
- Modular Design Requirements
- Loose coupling: Agents operate as services/processes
- Clear interfaces: APIs, event buses, message queues
- Policy-driven control: Guardrails define permissions, escalation, auditing
- Observability: All actions monitored and logged
- Zero-trust security for agent communications
- Versioning & rollback: Tag releases, automate rollbacks on failure
- Microservices Design
- Separated inference from business logic
- Implemented API gateway
- Designed for horizontal scaling
- Created service mesh
- Established circuit breakers
- Database Strategy
- Selected appropriate databases for each workload
- Implemented connection pooling
- Set up read replicas
- Configured automated backups
- Tested disaster recovery
💡 Architecture Patterns Comparison
| Pattern | Use Case | Trade-offs |
|---|---|---|
| Modular Systems | Independent components | Flexibility vs. coordination overhead |
| Centralized Platforms | Multiple use cases | Consistency vs. single point of failure |
| Decentralized | Department-managed AI | Autonomy vs. governance challenges |
| Federated Learning | Distributed data sources | Privacy vs. communication costs |
⬆️ Navigation · ⬅️ Lifecycle · Next: Data Quality ➡️
Important
Why it matters: Research reveals that 80%+ of AI failures trace to data issues, not model complexity. Training-Serving Skew is a "silent failure"—models output garbage predictions with high confidence without crashing. Data leakage creates an "optimism trap" where prototype metrics are artificially inflated. This section addresses the primary technical determinant of production success.
⚠️ "This skew acts as a 'silent failure'; the model does not crash or throw exceptions. It simply outputs garbage predictions with high confidence."
- Single Pipeline Architecture: Feature engineering code identical between training and inference (no dual-pipeline anti-pattern)
- Feature Store Implemented: Centralized repository ensures feature calculation consistency across environments
- Schema Enforcement: Input schemas validated at inference time match training schemas exactly
- Numerical Precision Parity: Training (Python/Pandas) and serving (Java/Go/C++) use identical numerical precision
- Time Zone Handling: Temporal features calculated identically (UTC normalization enforced)
- Missing Value Strategy: Imputation logic production-identical (not notebook-specific hacks)
- Shadow Mode Validation: New models run in parallel with existing, comparing outputs before promotion
💡 Anti-Pattern Alert
The "dual-pipeline" pattern (Data Scientists in Python → Engineers rewrite in Java) is a primary source of skew. Use Feature Stores (Feast, Tecton, Featureform) to structurally eliminate this risk.
⚠️ "Leakage artificially inflates evaluation metrics during the PoC, creating a false sense of security that evaporates upon deployment."
- Target Leakage Audit: All features verified to be causally available BEFORE prediction timestamp
- Train-Test Contamination Check: No global preprocessing (normalization, scaling) performed before data split
- Temporal Discipline: Time-series data split chronologically, never randomly
- Feature Provenance Documentation: Each feature's data source, calculation logic, and temporal availability documented
- Leakage Detection Tests: Automated tests flag suspiciously high-performing features (>0.95 correlation with target)
- Cross-Validation Strategy: Appropriate CV method for data type (TimeSeriesSplit for temporal, GroupKFold for hierarchical)
💡 The Antibiotic Example
A pneumonia prediction model learned that took_antibiotic=True predicts pneumonia perfectly—in historical data. In production, this feature is unknown at prediction time. The model fails catastrophically because it trained on leaked future information.
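Two of the checks above (chronological splitting and fitting preprocessing on the training split only), plus a crude leakage smoke test, can be sketched as follows; column names are illustrative, and the 0.95 correlation cutoff follows the checklist item:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("events.csv", parse_dates=["event_time"]).sort_values("event_time")

split = int(len(df) * 0.8)                         # chronological, not random, split
train, test = df.iloc[:split], df.iloc[split:]

scaler = StandardScaler().fit(train[["amount"]])   # fit preprocessing on train ONLY
train_amount = scaler.transform(train[["amount"]])
test_amount = scaler.transform(test[["amount"]])   # reuse the training statistics

# Crude leakage smoke test: features almost perfectly correlated with the target
# deserve a manual audit of when they actually become available.
corr = train.corr(numeric_only=True)["label"].abs().drop("label")
print(corr[corr > 0.95])
```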
⚠️ "The fundamental assumption that training and test data are IID (Independent and Identically Distributed) is rarely true in enterprise environments."
| Drift Type | Definition | Detection Method | Trigger Action |
|---|---|---|---|
| Covariate Shift | P(X) changes | KS-test, PSI on inputs | Alert + investigate |
| Concept Drift | P(Y\|X) changes | Performance degradation | Immediate retraining |
| Label Shift | P(Y) changes | Prior probability monitoring | Recalibration |
- Covariate Shift Monitoring: Statistical tests (Kolmogorov-Smirnov, Population Stability Index) on input feature distributions
- Concept Drift Detection: Ground truth feedback loops to detect P(Y|X) relationship changes
- Label Shift Tracking: Target variable distribution (base rates) monitored over time
- Automated Retraining Triggers: Drift thresholds trigger retraining pipelines (not just alerts)
- Windowed Performance Tracking: Rolling accuracy/precision calculated by time window (daily, weekly)
- Seasonality Accounting: Known cyclical patterns (holidays, quarters, fiscal years) factored into drift calculations
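A minimal sketch of the covariate-shift checks listed above, using a two-sample Kolmogorov-Smirnov test and a hand-rolled Population Stability Index against a training-time reference sample. The 0.01 p-value and 0.2 PSI thresholds are common rules of thumb, not universal constants:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of `current` against `reference`."""
    edges = np.unique(np.quantile(reference, np.linspace(0, 1, bins + 1)))
    # Clip both samples into the reference range so outliers land in the outer bins.
    ref_counts = np.histogram(np.clip(reference, edges[0], edges[-1]), edges)[0]
    cur_counts = np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0]
    ref_pct = ref_counts / len(reference) + 1e-6
    cur_pct = cur_counts / len(current) + 1e-6
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

def check_feature_drift(reference: np.ndarray, current: np.ndarray) -> dict:
    statistic, p_value = ks_2samp(reference, current)
    score = psi(reference, current)
    return {
        "ks_p_value": float(p_value),
        "psi": round(score, 4),
        "drift_suspected": p_value < 0.01 or score > 0.2,   # rule-of-thumb trigger
    }
```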
⚠️ "The Epic Sepsis Model claimed AUC of 0.76-0.83 internally; external validation found AUC as low as 0.63."
- Multi-Source Validation: Model tested on data from at least 2 independent sources/environments
- Demographic Stratification: Performance validated and documented across demographic segments
- Geographic Validation: If applicable, tested across all deployment regions/sites
- Temporal Holdout: Validated on data from a future time period (not random split)
- Site-Specific Calibration Plan: Strategy for adapting model to local deployment conditions
- Model Card with External Results: External validation results documented in public model card
⬆️ Navigation · ⬅️ Architecture · Next: Agentic AI ➡️
Important
Why it matters: 79% of organizations are already using AI agents in production. Agentic systems can handle complex workflows autonomously, but without proper design patterns they become unpredictable and unreliable. This section covers proven enterprise patterns for building agents that work together effectively.
-
Task-Oriented Agents
- Clear success criteria defined
- Error handling and retry logic implemented
- High reliability for repeatable operations
- Best for: Data entry, scheduling, document classification
-
Multi-Agent Collaboration
- Communication patterns established (sequential, hierarchical, bi-directional)
- Cross-check outputs to reduce hallucinations
- Conflict resolution mechanisms
- Distributed expertise coordination
-
Self-Improving Agents
- Feedback loops configured
- Performance monitoring active
- Drift detection implemented
- Continuous learning from interactions
- External reflection preferred over self-critique (code execution, tool validation)
- Environment feedback used to verify reasoning
-
RAG Agents
- Knowledge retrieval connected to reasoning
- Responses grounded in factual, up-to-date information
- Critical for document-heavy domains and compliance
-
Orchestrator Agents
- End-to-end workflow management
- Task distribution across specialized agents
- Failure handling with rerouting/fallback strategies
- Loose coupling and separation of concerns
-
ReAct Pattern (Reason + Act)
- Thought → Action → Observation loop implemented
- Tool failures handled in observation step with retry/fallback logic
- Reasoning traces logged for debugging and audit
- Dynamic re-planning when observations invalidate current plan
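A minimal sketch of the Thought → Action → Observation loop described above, with tool failures surfaced back into the observation step and every step logged for audit. The llm() function and the tool registry are stand-ins; wire in your actual model client and sandboxed tools.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("react")

def llm(prompt: str) -> str:
    """Stand-in for a real model call; replace with your LLM client.
    Must return JSON with keys: thought, action, action_input, final_answer."""
    if "Observation:" not in prompt:
        return json.dumps({"thought": "I should look this up.", "action": "search",
                           "action_input": "production AI readiness", "final_answer": None})
    return json.dumps({"thought": "I have enough information.", "action": None,
                       "action_input": None, "final_answer": "(stub) summarized answer"})

TOOLS = {
    "search": lambda query: f"(stub) top result for {query!r}",   # replace with real, sandboxed tools
}

def react_loop(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}\n"
    for step in range(max_steps):
        decision = json.loads(llm(transcript))
        log.info("step=%d thought=%s action=%s", step, decision["thought"], decision["action"])
        if decision["final_answer"]:                              # model chose to stop and answer
            return decision["final_answer"]
        tool = TOOLS.get(decision["action"])
        try:
            observation = tool(decision["action_input"]) if tool else f"Unknown tool: {decision['action']!r}"
        except Exception as exc:                                  # tool failure becomes an observation, not a crash
            observation = f"Tool error: {exc}"
        transcript += (f"Thought: {decision['thought']}\nAction: {decision['action']}\n"
                       f"Observation: {observation}\n")
    return "Stopped: step budget exhausted; escalate to a human or a fallback flow."

print(react_loop("Summarize what production readiness means for an AI feature."))
```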
💡 Academic vs Enterprise Patterns
| Academic Patterns | Enterprise Patterns |
|---|---|
| Reflection | Task-Oriented |
| Tool Use | Multi-Agent Collaboration |
| ReAct | Self-Improving |
| Planning | RAG Agents |
| Multi-Agent | Orchestrator Agents |

Tip: Start with the task-oriented pattern (lowest complexity, fastest time to value), then progress to sequential orchestration, then advanced patterns.
-
Core Components
- Agents with distinct roles, personas, specific contexts
- Agent management for collaboration patterns
- Human-in-the-loop for reliability in critical scenarios
- Specialized tools (web search, document processing, code)
- LLM backbone for processing and inference
- Context management with prompts enabling intent identification
- Memory systems (shared or individual) for context retention
-
MAS Design Best Practices
- Clearly defined agent roles and responsibilities
- Communication protocols for data sharing
- Adaptive decision-making capabilities
- Scalable architecture from the start
- Comprehensive monitoring framework
- Strong security (encryption, secure data handling)
- Regular audits for bias and fairness
- Error propagation prevention through data governance
💡 MAS vs Single-Agent Comparison
| Aspect | Single-Agent | Multi-Agent |
|---|---|---|
| Architecture | Monolithic | Distributed |
| Fault Tolerance | Single point of failure | Resilient—others continue |
| Scalability | Limited | Add agents at runtime |
| Hallucination | Higher risk | Cross-checking reduces errors |
| Context Windows | Limited | Distribute across agents |

-
Multi-Agent Frameworks Evaluated
- AutoGen (Microsoft): Dynamic agent interactions
- Semantic Kernel (Microsoft): Modular, bridges traditional programming and AI
- LlamaIndex: Knowledge-driven applications
- LangChain: Comprehensive orchestration
- CrewAI: Task-oriented multi-agent coordination
⬆️ Navigation · ⬅️ Data Quality · Next: Security ➡️
Important
Why it matters: AI systems handle sensitive data and make decisions that affect users. A security breach can expose PII, leak proprietary models, or allow prompt injection attacks. Compliance failures result in fines (GDPR: up to 4% of global revenue) and reputational damage. This is non-negotiable for production.
- Access Control
- Implemented JWT/OAuth 2.0
- Set up API key management
- Created role-based access control (RBAC)
- Implemented rate limiting per user/tier
- Added IP allowlisting capabilities
-
Encryption
- TLS 1.3+ for data in transit
- AES-256 for data at rest
- Encrypted model weights storage
- Secure key management (KMS)
- Implemented secrets rotation
-
Privacy
- PII detection and masking
- GDPR compliance (right to deletion)
- Data residency controls
- Audit logging for all data access
- Consent management system
- Industry Standards
- HIPAA (healthcare)
- PCI DSS (payments)
- SOC 2 Type II
- ISO 27001
- FedRAMP (government)
⬆️ Navigation · ⬅️ Agentic AI · Next: Red Teaming ➡️
Important
Why it matters: LLMs have unique vulnerabilities that traditional security doesn't cover. Prompt injection can bypass all your safety measures. NVIDIA's red team found that insecure RAG permissions and unsanitized outputs are the top attack vectors. Proactive adversarial testing catches these before attackers do.
- Vulnerability Assessment
- LLM01: Prompt Injection - tested and mitigated
- LLM02: Sensitive Data Leakage - prevention in place
- LLM07: System Prompt Leakage - protected
- Model theft prevention
- Bias detection and mitigation
- Data poisoning prevention
- RAG exploitation protection
- API abuse prevention
-
Planning Phase
- Scope defined
- Diverse team assembled (benign and adversarial mindsets)
- Domain experts included (healthcare, legal, etc.)
- Goals and success criteria set
-
Attack Design & Execution
- Adversarial inputs created
- Attack scenarios designed
- Production-like environment testing
- Testing at multiple layers (base model, RAG, application)
-
Analysis & Remediation
- Outputs scored systematically
- Vulnerabilities identified and documented
- Guardrails implemented
- Retraining if needed
- Regression testing after fixes
- CI/CD integration for continuous testing
- Content & Behavior
- Harmful content generation (offensive)
- Stereotypes and discrimination (bias)
- Data leakage (PII exposure)
- Non-robust responses (inconsistency)
- Prompt injection (user input manipulation)
- Jailbreaking (bypassing safety filters)
-
Critical Mitigations
- Sanitize all LLM output (remove markdown, HTML, URLs)
- Image content security policies implemented
- Display entire links to users before connecting
- Active content disabled where appropriate
- Secure permissions on RAG data stores
- LLM-generated code execution sandboxed
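A minimal sketch of the output-sanitization item above: stripping HTML, markdown images, and link markup before rendering model text in a UI. The regexes are deliberately simple and illustrative; they do not replace a proper HTML sanitizer or content security policy.

```python
import html
import re

MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")      # ![alt](url)
MD_LINK = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")    # [text](url)
HTML_TAG = re.compile(r"<[^>]+>")

def sanitize_llm_output(text: str) -> str:
    text = MD_IMAGE.sub("[image removed]", text)     # no untrusted remote images
    text = MD_LINK.sub(r"\1 (\2)", text)             # show the full target URL, not just the anchor text
    text = HTML_TAG.sub("", text)                    # strip active/HTML content
    return html.escape(text)                         # escape what's left before rendering in a UI

print(sanitize_llm_output('See <script>alert(1)</script>[docs](https://example.com/docs) and ![x](https://example.com/p.png)'))
# -> "See alert(1)docs (https://example.com/docs) and [image removed]"
```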
💡 Red Teaming Tools (2025)
- Promptfoo: Open-source LLM red teaming framework
- DeepTeam: Built on DeepEval for safety testing
- AutoRTAI (HiddenLayer): Agent-based automated red teaming
- Mindgard DAST-AI: Dynamic application security testing for AI
- Adversa: Continuous red teaming for LLMs
⬆️ Navigation · ⬅️ Security · Next: Performance ➡️
Important
Why it matters: Users abandon AI applications that feel slow—every 100ms of latency reduces engagement. LLM inference is expensive; poor optimization wastes GPU resources. At scale, the difference between 100ms and 500ms response time is the difference between delighted users and churned customers.
- Response Time Targets
- Time to First Token (TTFT) < 350ms
- Time to Incremental Token (TTIT) < 25ms
- P50 latency < 200ms
- P99 latency < 1s
- Implemented caching strategy
- Prompt/context caching enabled (reduces TTFT up to 70%)
- Optimized model serving
- Set up CDN for static assets
- Intermediate status shown to users ("Searching...", "Analyzing...")
- Non-LLM operations identified (use code instead of LLM calls where possible)
-
Load Handling
- Tested with expected peak load
- Implemented auto-scaling policies
- Set up load balancing
- Configured queue management
- Established back-pressure mechanisms
-
Concurrency
- Async request handling
- Connection pooling
- Worker pool management
- Batch inference capabilities
- Stream processing for real-time
- Compute Efficiency
- Model quantization implemented
- GPU utilization monitoring (aim for near 100%)
- CPU/Memory profiling
- Container right-sizing
- Spot instance usage
-
Scaling Strategies
- Data parallelism: Replicate model, distribute data
- Model parallelism: Split model across devices
- Tensor parallelism: Distribute tensor operations
- Pipeline parallelism: Sequential stages across devices
- Context parallelism: Distribute long context processing
💡 Deployment Options
| Option | Pros | Cons |
|---|---|---|
| Cloud | Flexible, scalable, pay-as-you-go | Data privacy concerns |
| On-Premises | Data control, security | High upfront cost |
| Hybrid | Best of both, cost optimization | Complexity |
| Edge | Low latency, data residency | Limited compute |

💡 Serving Frameworks (2025)
- vLLM: High-throughput, paged attention
- TensorRT-LLM: NVIDIA optimized inference
- Ray Serve: Distributed serving, LangChain integration
- Triton Inference Server: Multi-model, dynamic batching
- llm-d: Kubernetes-native distributed inference
⬆️ Navigation · ⬅️ Red Teaming · Next: Cost ➡️
Important
Why it matters: AI costs can spiral out of control overnight. A single misconfigured prompt can 10x your token usage. 63% of organizations are now actively managing AI spending (doubled from 2024). Without proper FinOps, that "free tier" experiment becomes a $50K monthly bill.
- Cost Tracking
- Token usage (input/output tokens processed)
- GPU compute (training and inference)
- Model training costs (initial and fine-tuning)
- Infrastructure (storage, network)
- API calls (third-party model usage)
- AI Cost Metrics
- Cost Per Token: Total cost / tokens processed
- Cost Per Inference: Total cost / inference requests
- Cost Per Unit of Work: e.g., cost per 100k words
- GPU Utilization: Aim for near 100%
- Training Cost Efficiency: Cost / model accuracy
- Metering
- Token counting per request
- API call tracking
- Storage usage monitoring
- Compute hour tracking
- Bandwidth monitoring
- Budget Management
- Set spending alerts
- Implemented hard limits
- Created usage quotas
- Automated cost reports
- Chargeback/showback system for teams
- Weekly/monthly forecasting cadence
-
Model Selection
- Choose appropriate model size for task complexity
- Use smaller models for simple tasks
- Consider fine-tuned smaller models vs. large general models
-
Infrastructure Optimization
- Autoscaling based on demand
- Spot instances for non-critical workloads
- Mixed precision training/inference
- Edge computing for latency-sensitive applications
-
Operational Optimization
- Prompt engineering ("be concise" reduces tokens 15-25%)
- Response caching for repeated queries
- Request batching
- Smart LLM routing (route to appropriate model)
- Build shared infrastructure (centralized vector stores)
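A minimal sketch tying several items above together: per-request token metering, cost-per-inference and cost-per-token metrics, and a spending alert at 80% of budget. The model names and prices are placeholders; substitute your provider's actual rates.

```python
from dataclasses import dataclass, field

# Placeholder prices (USD per 1M tokens); substitute your provider's actual rates.
PRICE_PER_1M = {"small-model": {"input": 0.15, "output": 0.60},
                "large-model": {"input": 3.00, "output": 15.00}}

@dataclass
class CostMeter:
    monthly_budget_usd: float
    total_usd: float = 0.0
    requests: int = 0
    tokens: int = 0
    alerts: list = field(default_factory=list)

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        price = PRICE_PER_1M[model]
        cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
        self.total_usd += cost
        self.requests += 1
        self.tokens += input_tokens + output_tokens
        if self.total_usd > 0.8 * self.monthly_budget_usd:        # spending alert at 80% of budget
            self.alerts.append(f"80% of budget reached at request {self.requests}")
        return cost

    def report(self) -> dict:
        return {"cost_per_inference": self.total_usd / max(self.requests, 1),
                "cost_per_1k_tokens": 1_000 * self.total_usd / max(self.tokens, 1),
                "total_usd": round(self.total_usd, 4)}

meter = CostMeter(monthly_budget_usd=500.0)
meter.record("large-model", input_tokens=1_200, output_tokens=300)
meter.record("small-model", input_tokens=800, output_tokens=150)
print(meter.report())
```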
⬆️ Navigation · ⬅️ Performance · Next: Safety ➡️
Important
Why it matters: LLMs can generate harmful, biased, or factually wrong content. One toxic output can go viral and destroy your brand. Organizations with ethical AI design report higher success rates. This section ensures your AI helps users without causing harm.
-
Input Validation
- Prompt injection detection
- Malicious input filtering
- Size limits enforcement
- Format validation
- Rate limiting by content type
-
Output Safety
- Toxicity filtering
- Bias detection
- Factuality checking
- Copyright detection
- PII scrubbing
- Responsible AI
- Bias testing completed
- Fairness metrics defined
- Transparency documentation
- Human-in-the-loop options
- Opt-out mechanisms
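A minimal sketch of the PII-scrubbing item above using regex patterns for emails, phone numbers, and SSN-like strings. Regexes only catch obvious cases; production systems typically add an NER-based detector on top, so treat these patterns as illustrative.

```python
import re

# More specific patterns first so an SSN is not swallowed by the broader phone pattern.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
}

def scrub_pii(text: str) -> tuple[str, dict]:
    """Replace detected PII with typed placeholders; return the scrubbed text and per-type counts."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label} REDACTED]", text)
        counts[label] = n
    return text, counts

scrubbed, found = scrub_pii("Contact Jane at jane.doe@example.com or +1 (415) 555-0137, SSN 123-45-6789.")
print(scrubbed)   # Contact Jane at [EMAIL REDACTED] or [PHONE REDACTED], SSN [SSN REDACTED].
print(found)      # {'EMAIL': 1, 'SSN': 1, 'PHONE': 1}
```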
⬆️ Navigation · ⬅️ Cost · Next: Monitoring ➡️
Important
Why it matters: You can't fix what you can't see. AI systems degrade silently—model drift, data quality issues, and hallucination rates creep up over time. Without proper monitoring, you'll learn about problems from angry users, not dashboards. This is how you maintain quality post-launch.
- Infrastructure Metrics
- CPU/Memory/Disk usage
- Network latency
- Queue depths
- Error rates
- Service health checks
- AI-Specific Metrics
- Model inference time
- Token usage per request
- Cache hit rates
- Embedding generation time
- Context retrieval accuracy
- KPI Tracking
- User satisfaction scores
- Task completion rates
- Revenue per user
- Cost per request
- Feature adoption rates
- Incident Detection
- Anomaly detection
- Threshold-based alerts
- Escalation policies
- On-call rotation
- Incident response runbooks
⬆️ Navigation · ⬅️ Safety · Next: Operations ➡️
Important
Why it matters: Production AI requires continuous care. Models need updates, prompts need tuning, and systems fail. Without proper deployment strategies (blue-green, canary), one bad release takes down production. Without disaster recovery, one outage becomes permanent data loss.
- Release Management
- Blue-green deployments
- Canary releases
- Feature flags
- Rollback procedures
- Database migration strategy
- Lifecycle Management
- Model versioning system
- A/B testing framework
- Model registry
- Performance tracking
- Retraining pipeline
- Business Continuity
- Backup strategy (3-2-1 rule)
- Recovery time objective (RTO)
- Recovery point objective (RPO)
- Failover procedures
- Regular DR drills
⬆️ Navigation · ⬅️ Monitoring · Next: Tech Debt ➡️
Important
Why it matters: ML systems have a unique capacity to incur massive, invisible maintenance costs. The CACE principle (Changing Anything Changes Everything) means small upstream changes can catastrophically break downstream models. This debt compounds silently during prototyping and surfaces explosively in production.
⚠️ "In an ML model, altering one input feature can change the optimal weights for all others, making systems incredibly brittle."
- Feature Dependency Map: Documented which features are correlated/entangled with each other
- Upstream Change Notifications: Automated alerts when data sources change schemas or distributions
- Full Retraining Policy: Clear policy for when to retrain entire model vs. incremental update
- Hyperparameter Sensitivity Analysis: Documented which hyperparameters are sensitive to data changes
- Model-Data Version Binding: Model versions explicitly tied to specific data snapshots
- Impact Analysis Process: Before any change, assess downstream impact on model performance
⚠️ "A failure in an upstream data source can propagate silently through the pipeline, corrupting training data without triggering an error."
- Pipeline DAG Visualization: Data lineage visualized from raw source to model input
- Data Contracts Enforced: Producer-consumer contracts for data schemas with automated validation
- Intermediate Checkpoints: Data quality checks at each pipeline stage (not just ingestion and output)
- Glue Code Elimination: Research/notebook code abstracted into testable modules (not copy-pasted)
- Pipeline Unit Tests: Transformation logic has unit tests with expected input/output pairs
- Null Propagation Alerts: Explicit handling and alerting for null/missing values at every stage
- Idempotency Guaranteed: Pipeline can be re-run safely without side effects
- Direct Feedback Loops Cataloged: Cases where model output directly becomes training data
- Hidden Feedback Loops Identified: Indirect influence paths (model → world → data)
- Loop Damping Mechanisms: Strategies to prevent runaway self-reinforcement
- Exploration Budget: System allocates capacity to explore beyond model recommendations
- Counterfactual Data Collection: Mechanisms to gather data on actions not taken
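A minimal sketch of the "data contracts enforced" and "pipeline unit tests" items above: a boundary check that fails loudly on schema or null violations, and one small transformation with an explicit expected input/output test. The contract fields are hypothetical.

```python
import math

# Illustrative data contract for one upstream table the pipeline consumes.
CONTRACT = {
    "required_columns": {"user_id": int, "amount": float, "country": str},
    "allow_nulls": set(),                      # no nullable columns in this feed
}

def validate_contract(rows: list[dict]) -> None:
    """Fail loudly at the pipeline boundary instead of letting bad data propagate silently."""
    for i, row in enumerate(rows):
        for col, col_type in CONTRACT["required_columns"].items():
            if col not in row or row[col] is None:
                if col not in CONTRACT["allow_nulls"]:
                    raise ValueError(f"row {i}: null/missing '{col}' violates contract")
            elif not isinstance(row[col], col_type):
                raise TypeError(f"row {i}: '{col}' expected {col_type.__name__}")

def log_amount(row: dict) -> dict:
    """One small, testable transformation instead of notebook glue code."""
    return {**row, "amount_log": math.log1p(row["amount"])}

def test_log_amount():
    # Pipeline unit test: explicit expected input/output pair.
    out = log_amount({"user_id": 1, "amount": 0.0, "country": "US"})
    assert out["amount_log"] == 0.0

validate_contract([{"user_id": 1, "amount": 10.0, "country": "US"}])
test_log_amount()
```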
⚠️ "Any change or improvement can inadvertently break critical downstream processes, creating fear of updating and model stagnation."
- Consumer Registry: All systems consuming model outputs documented and maintained
- Deprecation Policy: Formal process for notifying consumers of model changes
- Output Schema Versioning: Model outputs versioned with backward compatibility guarantees
- Contract Testing: Downstream systems tested when model interface changes
- Threshold Documentation: Any hard-coded thresholds on model outputs documented with owners
- Breaking Change Protocol: Process for coordinating breaking changes across consumers
⬆️ Navigation · ⬅️ Operations · Next: Governance ➡️
Important
Why it matters: The EU AI Act is now law. NIST and ISO 42001 are becoming enterprise requirements. Organizations that ignore governance face fines, failed audits, and banned products. Only 33% of organizations have embedded AI governance—being compliant is a competitive advantage.
- Regulatory Compliance Mapping
- EU AI Act: Risk-based classification, mandatory compliance
- NIST AI RMF: Risk management guidelines
- ISO 42001: International AI management standards
- OECD AI Principles: Ethical/human-centered guidelines
- Regional frameworks (UK Pro-Innovation, etc.)
⚠️ CRITICAL: A technically successful prototype may be ILLEGAL to deploy. Engaging in any of the following practices results in immediate project termination.
Absolutely Prohibited (No Exceptions):
- Social Scoring Ban: System does NOT evaluate/classify natural persons based on social behavior or personality traits leading to detrimental treatment
- Emotion Recognition Ban (Workplace/Education): System does NOT infer emotions of individuals in workplaces or educational institutions
- Real-Time Biometric ID Ban: System does NOT use real-time remote biometric identification in publicly accessible spaces (narrow law enforcement exceptions)
- Subliminal Manipulation Ban: System does NOT deploy subliminal techniques beyond consciousness to distort behavior
- Vulnerability Exploitation Ban: System does NOT exploit vulnerabilities of specific groups (age, disability, social/economic situation)
- Biometric Categorization Ban: System does NOT categorize individuals based on biometric data to infer race, political opinions, religious beliefs, sexual orientation
- Untargeted Facial Recognition Scraping Ban: System does NOT create facial recognition databases through untargeted scraping
Risk Classification Completed:
- System classified as: Prohibited / High-Risk / Limited Risk / Minimal Risk
- If High-Risk: Conformity assessment requirements identified
- If High-Risk: Quality management system documented
- Legal review completed for EU deployment
⛔ STOP GATE: If ANY prohibited practice applies to your system, EU deployment CANNOT proceed regardless of other readiness scores. Consult legal counsel immediately.
-
AI Organization
- Governance embedded within broader strategy
- Cross-functional team assembled
- Roles & responsibilities assigned
-
Legal & Regulatory Compliance
- Risk assessment methodology defined
- Regulatory mapping completed
- Data protection measures implemented
-
Ethics & Responsible AI
- Fairness, transparency, accountability documented
- Bias mitigation strategies identified
- Ethical guidelines published
-
Technology & Data
- Data governance framework established
- Model management policies defined
- AI model lifecycle processes mapped
-
Operations & Monitoring
- Continuous oversight mechanisms
- Audit trails implemented
- Monitoring & review cadence established
💡 Governance Maturity Levels (PwC 2025)
| Stage | Description | % of Organizations |
|---|---|---|
| Early | Building foundational policies | 18% |
| Training | Developing structures & guidance | 21% |
| Strategic | AI priorities defined & communicated | 28% |
| Embedded | Integrated into core operations | 33% |
⬆️ Navigation · ⬅️ Tech Debt · Next: Evaluation ➡️
Important
Why it matters: "It works on my laptop" isn't good enough for AI. LLMs hallucinate, drift, and behave differently with different inputs. Without systematic evaluation using golden datasets and automated testing, you're guessing about quality. This section ensures you can measure and maintain AI performance.
- Multiple Evaluation Methods
- Multiple Choice: Benchmark-based Q&A (MMLU)
- Verifiers: Code/logic verification
- Leaderboards: User preference voting (LM Arena)
- LLM-as-Judge: Automated evaluation at scale
- Quality Metrics
- Accuracy (correctness of responses)
- Relevancy (alignment with query intent)
- Coherence (logical flow of output)
- Faithfulness (grounded in provided context)
- Hallucination rate (false/unsupported claims)
- System Metrics
- Latency (response time)
- Throughput (queries per second)
- Token usage (cost tracking)
- Error rates
- Retrieval Quality
- Context precision (retrieved chunks actually useful)
- Context recall (relevant chunks retrieved)
- Faithfulness (output grounded in retrieval)
- Answer relevancy (concise, on-topic responses)
- Comprehensive Testing
- Functional testing: Task-specific capabilities (pre-deployment)
- Regression testing: Same test cases across iterations
- Adversarial testing: Edge cases and attacks (security validation)
- A/B testing: Compare model/prompt variants (production)
-
Quality Assurance
- "Golden" datasets (~200 prompts) as quality checkpoint
- Human review for failed or unclear judgments
- Combine offline (development) and online (production) evaluation
- Track metrics over time for drift detection
- CI/CD integration for automated quality gates
💡 Evaluation Tools (2025)
- DeepEval: Open-source, CI/CD integration, RAG support
- Arize Phoenix: Production observability and evaluation
- Braintrust: End-to-end evaluation platform
- LangSmith: LangChain's evaluation framework
- RAGAS: RAG-specific evaluation
- OpenAI Evals: Open-source, community-driven
⚠️ A system can have perfect crisis detection but still fail if responses feel robotic, inconsistent, or fail to build trust. Component metrics miss the full picture.
The Evaluation Gap:
| Component-Level (Current) | Agent-Level (Missing) |
|---|---|
| Intent classification accuracy | Therapeutic guideline adherence |
| Response latency (<2s) | Persona/character consistency |
| Embedding similarity scores | Tone consistency across sessions |
| RAG retrieval precision | User satisfaction (CSAT) |
| Generation perplexity | Therapeutic alliance strength |
-
Multi-Dimensional Framework
- Therapeutic/guideline adherence score (>90% via LLM-as-Judge)
- Persona consistency tracking (>85% alignment)
- Tone stability across sessions (VAD drift <0.15)
- User satisfaction (CSAT >80%)
- Engagement metrics (session continuation rate >70%)
-
Working Alliance Inventory - AI Adapted (WAI-AI)
- Task Agreement: "AI helps me work on what I want to focus on"
- Goal Agreement: "AI understands what I want to accomplish"
- Bond: "I feel the AI cares about me / I trust the AI"
- Target score: ≥4.0/5.0 on 12-item assessment
- Weekly micro-surveys (2 random items) + monthly full assessment
-
LLM-as-Judge with Rubrics
- Evaluation rubric defined with weighted dimensions
- Judge model selected (GPT-4/Claude for grading)
- Weekly human calibration (50 LLM judgments vs expert ratings)
- Alert on degradation (>5% drop week-over-week)
-
Behavioral Proxy Metrics
- Session length tracking
- Return rate measurement
- Disclosure depth scoring
- Engagement pattern analysis
💡 Sample LLM-as-Judge Rubric
EVALUATION_RUBRIC = {
    "crisis_resources": {"weight": 1.0, "desc": "Provides crisis resources when risk present"},
    "professional_boundaries": {"weight": 0.9, "desc": "Recommends help appropriately"},
    "empathetic_language": {"weight": 0.8, "desc": "Warm, validating, appropriate tone"},
    "evidence_based": {"weight": 0.7, "desc": "Uses appropriate techniques"},
    "continuation": {"weight": 0.6, "desc": "Maintains engagement"},
    "factual_accuracy": {"weight": 0.9, "desc": "No hallucinations"}
}
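A minimal sketch of how a rubric like the one above could be applied: the judge model returns a 0-1 score per dimension, and a weighted aggregate is compared against an alerting threshold. The judge_scores dict stands in for the LLM-as-Judge call, and the 0.85 threshold is illustrative.

```python
def aggregate_rubric_score(judge_scores: dict, rubric: dict) -> float:
    """Weighted average of per-dimension judge scores, each expected in [0, 1]."""
    total_weight = sum(dim["weight"] for dim in rubric.values())
    weighted = sum(rubric[name]["weight"] * judge_scores.get(name, 0.0) for name in rubric)
    return weighted / total_weight

# Stand-in for the judge model's per-dimension grades on one conversation.
judge_scores = {"crisis_resources": 1.0, "professional_boundaries": 0.9, "empathetic_language": 0.8,
                "evidence_based": 0.7, "continuation": 0.9, "factual_accuracy": 1.0}

score = aggregate_rubric_score(judge_scores, EVALUATION_RUBRIC)   # EVALUATION_RUBRIC defined above
print(f"Weighted rubric score: {score:.2f}")
if score < 0.85:  # illustrative alert threshold; pair with the >5% week-over-week degradation check
    print("Quality alert: rubric score below threshold")
```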
⬆️ Navigation · ⬅️ Governance · Next: Metrics ➡️
Important
Why it matters: A model can be mathematically "optimal" according to its loss function while being "destructive" to the business. Goodhart's Law explains why metrics degrade when they become targets. This section ensures your evaluation actually predicts real-world success, not just offline performance.
⚠️ "Optimizing for proxy metrics like CTR can lead a recommender to promote clickbait, ultimately degrading user trust and long-term retention."
- Metric Mapping Document: Each offline metric explicitly mapped to corresponding business KPI
- Negative Correlation Testing: Verified that optimizing proxy metric doesn't hurt true business objective
- Long-Term Impact Assessment: Short-term metrics (CTR, engagement) validated against long-term outcomes (LTV, retention)
- Multi-Objective Evaluation: Primary metric + guardrail metrics defined (optimize X while Y stays above threshold)
- Stakeholder Metric Sign-Off: Business owners reviewed and approved proxy metric relevance
💡 The Recommender Trap
Netflix/Spotify research shows optimizing for clicks/streams often NEGATIVELY correlates with long-term satisfaction. Users click clickbait, hate it, then churn.
⚠️ "When a measure becomes a target, it ceases to be a good measure."
- Adversarial Metric Analysis: Documented how each metric could theoretically be "gamed"
- Multi-Metric Dashboard: No single metric used as sole success criterion
- Human-in-Loop Reviews: Regular qualitative review of outputs beyond automated metrics
- Metric Validity Refresh: Scheduled cadence for reviewing whether metrics remain valid proxies
- Unintended Consequence Monitoring: Active tracking of side effects from metric optimization
💡 Call Center Paradox
AI optimized for "Average Handling Time" learns that hanging up immediately = 0 seconds = perfect score. Metric gamed, customers furious.
⚠️ "The feedback signal is 'censored'... the model reinforces its own initial biases, creating a self-fulfilling prophecy."
- Feedback Loop Identification: All ways model output influences future training data documented
- Hidden Loop Detection: Indirect feedback paths identified (model → user behavior → data)
- Exploration Strategy: Model occasionally explores non-optimal actions to gather unbiased data
- Off-Policy Evaluation Capability: Can estimate performance of alternative policies from logged data
- Censored Data Acknowledgment: Known limitations documented (only observe outcomes for actions taken)
- Debiasing Strategy: Plan for addressing selection bias in feedback data
💡 Predictive Policing Loop
Model predicts crime in Area A → Police deployed → Crime observed → Model reinforced. It predicts police deployment, not crime distribution.
- A/B Testing Framework: Infrastructure for randomized controlled experiments in production
- Shadow Mode Deployment: Models can run on live traffic without affecting user experience
- Interleaving Capability: For ranking systems, can mix results from models A and B in same response
- Guardrail Metrics: Safety/quality metrics that automatically halt experiments if breached
- Statistical Rigor: Sample size calculations and significance thresholds documented before experiments
- Experiment Velocity: Can run multiple concurrent experiments with proper isolation
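A minimal sketch of the statistical-rigor item above: a standard two-proportion sample-size calculation (normal approximation) for sizing an A/B test before launch. The baseline rate and minimum detectable effect are placeholders.

```python
from scipy.stats import norm

def samples_per_variant(baseline_rate: float, min_detectable_effect: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-sided two-proportion z-test sample size (normal approximation) per variant."""
    p1, p2 = baseline_rate, baseline_rate + min_detectable_effect
    p_bar = (p1 + p2) / 2
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

# Example: detect a 2-point lift on a 20% task-completion rate at 95% confidence, 80% power.
print(samples_per_variant(baseline_rate=0.20, min_detectable_effect=0.02))  # roughly 6,500 per variant
```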
⬆️ Navigation · ⬅️ Evaluation · Next: Assured Intelligence ➡️
Important
Why it matters: Traditional checklists ensure "probably works"—this section ensures "provably works within bounds." A model can achieve 95% accuracy while producing overconfident wrong predictions that cause patient deaths. Conformal Prediction, causal validation, and selective prediction provide mathematical guarantees that transform AI from "good enough" to "assured."
⚠️ "A prediction of 'sepsis probability 0.73' is meaningless without knowing if the 95% interval is [0.71, 0.75] or [0.23, 0.95]."
- Calibration Set Separated: Held-out data for conformal calibration (≥1000 samples)
- Non-Conformity Score Defined: Appropriate score function for task type
- Coverage Level Set: Target coverage defined (≥95% for healthcare, ≥90% typical)
- Prediction Intervals Generated: Every prediction includes conformal interval
- Coverage Validated Empirically: Actual coverage matches target
- Conditional Coverage Tested: Coverage validated across subgroups (fairness)
- Interval Width Monitored: Track and alert on interval width changes
💡 Conformal Prediction Explained
Conformal Prediction provides mathematically valid prediction intervals with guaranteed coverage—regardless of the underlying distribution.
P(Y_true ∈ Prediction_Set) ≥ 1 - α
This guarantee holds for ANY distribution (distribution-free) with finite samples.
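A minimal sketch of split conformal prediction for a classifier: calibrate a non-conformity score (1 minus the probability of the true class) on held-out data, then build prediction sets that contain the true label with probability at least 1 − α. The data is synthetic and the model is any scikit-learn-style classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=3000) > 0).astype(int)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_calib, X_test, y_calib, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Non-conformity score on the calibration set: 1 - P(true class)
alpha = 0.10                                               # target coverage 90%
calib_probs = model.predict_proba(X_calib)
scores = 1.0 - calib_probs[np.arange(len(y_calib)), y_calib]
q_level = np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores)
q_hat = np.quantile(scores, q_level)                       # conformal quantile

# Prediction sets: include every class whose probability clears the conformal threshold.
test_probs = model.predict_proba(X_test)
prediction_sets = test_probs >= (1.0 - q_hat)

covered = prediction_sets[np.arange(len(y_test)), y_test].mean()
print(f"Empirical coverage: {covered:.3f} (target >= {1 - alpha})")
print(f"Average set size: {prediction_sets.sum(axis=1).mean():.2f}")
```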
⚠️ "When a model outputs P=0.80, 80% of cases with that score must actually be positive. Modern neural networks are notoriously miscalibrated—overconfident."
- ECE Computed: Expected Calibration Error measured
- Healthcare: ECE < 0.05 (mandatory)
- Financial: ECE < 0.05 (recommended)
- Consumer: ECE < 0.10 (acceptable)
- Reliability Diagram Generated: Visual calibration assessment
- Post-Hoc Calibration Applied: Temperature scaling or Platt scaling if ECE too high
- Calibration Per Subgroup: ECE validated across demographic groups
- Recalibration Triggers: Automated recalibration when drift detected
💡 Calibration Metrics
| Metric | Formula | Target |
|---|---|---|
| ECE | Weighted average of \|accuracy - confidence\| per bin | < 0.05–0.10 (see thresholds above) |
| MCE | Maximum \|accuracy - confidence\| over bins | Lower is better |
| Brier Score | Mean squared error of probability estimates | Lower is better |
Key Research: On Calibration of Modern Neural Networks (Guo et al., 2017)
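A minimal sketch of the ECE computation from the table above: bin predictions by confidence, compare per-bin accuracy with per-bin mean confidence, and take the bin-weighted average of the gaps. The confidences and outcomes are synthetic placeholders.

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """ECE = sum over bins of (bin fraction) * |bin accuracy - bin mean confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

rng = np.random.default_rng(1)
confidences = rng.uniform(0.5, 1.0, size=10_000)
correct = rng.uniform(size=10_000) < confidences ** 2        # synthetic: model is overconfident

ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")   # compare against the < 0.05 (healthcare) / < 0.10 (consumer) thresholds above
```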
⚠️ "The most dangerous AI is one that's confidently wrong. The model's primary capability should be knowing when to say 'I don't know.'"
- Uncertainty Threshold Defined: Threshold above which model abstains
- Abstention Action Defined: Human review, fallback model, or error response
- Coverage Target Set: Minimum % of inputs that must receive predictions (e.g., 85%)
- OOD Detector Implemented: Out-of-distribution detection operational
- OOD Threshold Calibrated: Threshold tuned on calibration set
- Abstention Rate Monitored: Track % abstentions over time
- Accuracy-on-Predicted Tracked: Accuracy excluding abstained cases
💡 Coverage-Accuracy Trade-off
Accuracy
▲
99%├─────────────────────────────────╮
│ │
95%├───────────────╮ │
│ │ │
90%├──────╮ │ │
│ │ │ │
└──────┴────────┴─────────────────┴──────▶ Coverage
100% 90% 70% 50%
By abstaining on uncertain cases (reducing coverage), accuracy improves.
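A minimal sketch of that trade-off: abstain whenever the top-class probability falls below a threshold, then report coverage (fraction answered) and accuracy on the answered subset. The probabilities and labels are synthetic stand-ins for real model outputs.

```python
import numpy as np

def selective_metrics(probs: np.ndarray, labels: np.ndarray, threshold: float) -> dict:
    """Abstain when max probability < threshold; report coverage and accuracy-on-predicted."""
    confidence = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    answered = confidence >= threshold
    coverage = answered.mean()
    selective_accuracy = (predictions[answered] == labels[answered]).mean() if answered.any() else float("nan")
    return {"threshold": threshold, "coverage": round(float(coverage), 3),
            "selective_accuracy": round(float(selective_accuracy), 3),
            "abstention_rate": round(float(1 - coverage), 3)}

# Synthetic stand-in: 2-class probabilities and labels for 5,000 cases.
rng = np.random.default_rng(7)
p1 = rng.beta(2, 2, size=5_000)
probs = np.column_stack([1 - p1, p1])
labels = (rng.uniform(size=5_000) < p1).astype(int)

for threshold in (0.5, 0.7, 0.9):    # sweep to choose an operating point that meets the coverage target
    print(selective_metrics(probs, labels, threshold))
```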
⚠️ "The Amazon recruiting AI failed because it learned correlations (women's college names → rejection) not causes. Removing 'gender' doesn't fix proxy discrimination."
- Causal DAG Documented: Explicit causal graph for the domain
- Domain Expert Validation: Causal assumptions reviewed by experts
- Confounder Identification: All confounders identified and addressed
- Proxy Discrimination Tested: Protected attributes cannot be reconstructed from features
- Counterfactual Fairness Evaluated: Would prediction change if ONLY protected attribute changed?
- Backdoor Paths Blocked: Confounders adjusted for or controlled
💡 Correlation vs. Causation
| Analysis | Question | Method |
|---|---|---|
| Correlation | Are X and Y associated? | Statistical tests |
| Causation | Does X cause Y? | Do-calculus, interventions |
| Proxy | Can A be inferred from X? | Reconstruction testing |
| Counterfactual | What if A were different? | Causal inference |
Key Research: Causality (Pearl, 2009), DoWhy Library
⚠️ "In cancer screening, a false negative (missed cancer → death) is catastrophically worse than a false positive (unnecessary biopsy). Optimize for asymmetric error costs."
- Asymmetric Error Costs Quantified: FN cost and FP cost explicitly documented
- Cost Ratio Calculated: FN_cost / FP_cost ratio determines operating point
- Sensitivity Floor Defined: Minimum sensitivity requirement (e.g., 99.9%)
- Layered Architecture Implemented: Multiple detection layers for redundancy
- Layer 1: High-sensitivity detector (catch all positives)
- Layer 2: High-specificity classifier (reduce false positives)
- Layer 3: Anomaly detector (catch OOD cases)
- Layer 4: Human escalation (uncertain cases)
- Layer Independence: Layers use different approaches/features
- FN Root Cause Analysis: Every false negative investigated
- Sensitivity Monitored Per Subgroup: Validated across demographics
💡 Zero-False-Negative Architecture
Input → [High-Sensitivity Detector] → Positive? → [Specific Classifier] → ...
│ │
│ Negative │
▼ ▼
[Anomaly Detector] → Anomalous? → Human Review Output
│
│ Normal
▼
SAFE NEGATIVE
Key: A negative output requires ALL layers to agree.
Any positive triggers escalation.
| Category | Metric | Target (General) | Target (Healthcare) |
|---|---|---|---|
| Conformal | Coverage | ≥ 90% | ≥ 95% |
| Calibration | ECE | < 0.10 | < 0.05 |
| Selective | Abstention Rate | < 20% | < 30% |
| Selective | OOD Detection | > 90% | > 95% |
| Zero-FN | Sensitivity | ≥ 95% | ≥ 99% |
| Zero-FN | False Negatives | Track | 0 target |
📖 Deep Dive: See docs/ASSURED-INTELLIGENCE.md for comprehensive implementation guide with code patterns.
⬆️ Navigation · ⬅️ Metrics · Next: Prompts ➡️
Important
Why it matters: Prompts are the code of AI applications—they determine output quality, consistency, and cost. Research shows adding "be concise" reduces token usage by 15-25%. Treating prompts as versioned artifacts with CI/CD enables rapid iteration and prevents regression. This is how you make AI reliable.
-
Design Principles
- Clear context: Be specific about task and include relevant details
- Customized for each task: Tailor prompts to unique use cases
- Break tasks into steps: Simplify complex workflows
- Output specifications: Format, tone, structure requirements
- Input validation: Ensure inputs are clean and standardized
-
Advanced Techniques
- Set personas and tone: Align with audience and purpose
- Few-shot examples: Show patterns for consistent output
- Chain of thought: Encourage step-by-step reasoning
- Structured output: Specify exact format needed (JSON, tables)
-
Prompt Lifecycle Management
- Version control: Track changes, enable rollback
- CI/CD integration: Automate testing and deployment
- Monitor and iterate: Continuous improvement based on feedback
- Treat prompts as software artifacts
💡 Research-Backed Findings (2025)
- Structure matters: Most successful prompts follow clear pattern (intro, formatting, modular inputs)
- Adding "be concise" reduces token usage by 15-25%
- Different models respond better to different formatting patterns
- Prompts are repeatable—viral prompts work across thousands of users
Tools: Latitude, LangChain, PromptLayer, Lilypad
⬆️ Navigation · ⬅️ Assured Intelligence · Next: Strategy ➡️
Important
Why it matters: 87% of ML projects fail to reach production—most due to organizational issues, not technology. Leadership buy-in is the single most predictive factor for AI success. Without a clear strategy, roadmap, and change management, you'll build great AI that nobody uses. This section bridges technology and business.
-
Strategy & Governance
- AI vision defined
- Principles and governance framework established
-
Technology & Architecture
- Build/buy decisions made
- Sandbox environments available
- Design patterns documented
-
Data Management
- AI-ready data capabilities assessed
- Data quality evaluation completed
-
Talent & Organization
- Resourcing plan created
- Community of practice established
- Target operating model defined
-
Use Cases
- Prioritized by impact/feasibility
- 3-5 initial use cases selected
- Pilot selection criteria defined
-
Vendor Management
- Vendors selected and evaluated
- Cohesive AI vendor strategy evolving
-
Operations
- ModelOps practice established
- Observability implemented
- FinOps best practices applied
-
6-Phase Framework
- Phase 1 - Assessment (2-6 weeks): Evaluate readiness, identify gaps
- Phase 2 - Strategy (3-4 weeks): Define objectives, select use cases
- Phase 3 - Pilot: Select 1-2 use cases, build POC
- Phase 4 - Scale (6-12 months): Expand successful pilots
- Phase 5 - Operationalize: MLOps, monitoring, continuous improvement
- Phase 6 - Transform (12-24 months): Cultural shift, workforce transformation
💡 AI Maturity Levels
| Level | Description | Characteristics |
|---|---|---|
| Early Stage | Building foundations | Policies, frameworks being developed |
| Training Stage | Developing capabilities | Employee training, governance structures |
| Strategic Stage | Active integration | AI integrated into operations |
| Embedded Stage | Full operational integration | AI actively drives decision-making |
- Success Enablers
- Active leadership buy-in (single most predictive factor)
- Cross-functional teams (IT, business, data science)
- Clear business objectives (specific, measurable outcomes)
- Data quality foundation
- Change management program
- Iterative approach (start small, scale gradually)
- Governance framework (ethics, compliance, accountability)
- Anti-Patterns Identified
- Technology-first approach (adopting tool without clear problem)
- Underestimating data quality importance
- Neglecting governance and ethics
- Overreliance on technology (ignoring people/process/culture)
- Lack of ongoing monitoring and optimization
- Attempting too many simultaneous initiatives
⬆️ Navigation · ⬅️ Prompts · Next: Team ➡️
Important
Why it matters: Technology doesn't deploy itself—people do. Knowledge silos, missing documentation, and untrained teams cause operational failures. When the on-call engineer can't find the runbook at 3 AM, your users suffer. This section ensures your team can build, run, and maintain AI systems effectively.
- Technical Documentation
- Architecture diagrams
- API documentation
- Runbooks
- Troubleshooting guides
- Decision records (ADRs)
- Skills & Training
- On-call training completed
- Security training
- Incident response training
- Knowledge transfer sessions
- Cross-functional understanding
-
Organizational Readiness Checklist
- Data: Clean, accessible, API-ready
- Talent: Cross-functional group leads AI skill-building
- Governance: Documented policies for AI systems
- Culture: Employees encouraged to explore/propose AI use cases
- Tooling: Can prototype/deploy without IT bottlenecks
-
Change Management
- Address fears of job displacement openly
- Emphasize AI enhances (not replaces) human skills
- Build curiosity, flexibility, learning mindset
- Provide clear training and development paths
- Conduct skills gap analyses
- Process & Compliance
- Change management process
- Code review requirements
- Security review process
- Compliance audits scheduled
- Stakeholder sign-offs
⬆️ Navigation · ⬅️ Strategy · Next: Healthcare ➡️
Important
Why it matters: Healthcare AI failures don't just cost money—they cost lives. IBM Watson for Oncology ($4B+ failure), Babylon Health ($4.2B → $0), Forward CarePods ($650M → shutdown), and Character.AI (teen suicide) demonstrate that healthcare and mental health AI requires fundamentally different safety standards. The checklist items below address failure patterns unique to these high-stakes domains.
⚠️ CRITICAL: Character.AI's chatbot asked a teen if he had "a plan" for suicide. When he said he didn't know if it would work, the bot replied "Don't talk that way. That's not a good reason not to go through with it." The teen died by suicide hours later.
- Suicide/Self-Harm Detection: Multi-modal detection (explicit statements, indirect signals like "bridges over 25m in NYC")
- Crisis Response Protocol: Immediate safety resources displayed on detection (crisis hotlines, text lines)
- Human Escalation Path: 24/7 human handoff capability for high-risk conversations
- No Harmful Encouragement: Responses validated to NEVER encourage self-harm, even inadvertently
- Dependency Monitoring: User engagement patterns monitored for unhealthy attachment/addiction
- Age-Appropriate Safeguards: Enhanced protections for minors (no romantic/sexual content, parental visibility)
Crisis Detection Performance Targets:
| Metric | Target | Rationale |
|---|---|---|
| Recall | 100% | Zero false negatives - every crisis must be detected |
| False Positive Rate | <5% | Minimize alert fatigue while maintaining recall |
| Response Time | <1s | Regulatory standard often 30s; aim for real-time |
| Severity Grading | 3+ levels | IMMEDIATE (<30s) → URGENT (<5min) → ELEVATED (<1hr) |
- Crisis Detection Recall: 100% recall validated (zero false negatives)
- False Positive Rate: <5% FPR to prevent alert fatigue
- Response Time SLA: <1s detection time (regulatory max: 30s)
- Multi-Stage Severity Grading: Tiered response based on crisis severity
- Trajectory Analysis: 4+ turn progressive deterioration detection
💡 The Yara AI Lesson
A seasoned tech entrepreneur, together with a clinical psychologist co-founder, built the Yara AI therapy product—then voluntarily shut it down:
"We stopped Yara because we realized we were building in an impossible space. AI can be wonderful for everyday stress, sleep troubles, or processing a difficult conversation. But the moment someone truly vulnerable reaches out—someone in crisis, someone with deep trauma, someone contemplating ending their life—AI becomes dangerous. Not just inadequate. Dangerous."
Key Insight: Even with clinical expertise and AI safety focus, the founder determined mental health AI for vulnerable populations is currently impossible to do safely without strict scope boundaries.
⚠️ Brown University (2025) identified 15 ethical violations in mental health chatbots including deceptive empathy, unfair discrimination, and amplifying feelings of rejection.
- Contextual Adaptation: Responses account for user's lived experiences (not one-size-fits-all)
- Therapeutic Collaboration: AI does not dominate conversations or impose solutions
- Honest Empathy: No deceptive phrases like "I see you" that create false human connection
- Bias Testing: Validated across gender, culture, religion, and mental health conditions
- No Belief Reinforcement: AI does not reinforce user's false beliefs or delusions
- Stigma Testing: Equal quality of response across conditions (depression vs. schizophrenia vs. addiction)
- Rejection Mitigation: Responses validated to not amplify feelings of rejection
⚠️ IBM Watson for Oncology provided "inappropriate or even unsafe recommendations" because it was trained on US data and deployed internationally without validation.
- Geographic Validation: Model validated in ALL deployment regions (not just development region)
- Local Clinical Guidelines: Recommendations align with local treatment standards and drug availability
- Unsafe Output Prevention: Clinical recommendations reviewed for potential patient harm
- Peer-Reviewed Evidence: Marketing claims substantiated by independent clinical validation
- Regulatory Approval: Appropriate clearances obtained (FDA, CE marking, etc.) before deployment
- Clinician Override: Healthcare professionals can always override AI recommendations
⚠️ Google's diabetic retinopathy AI achieved 90%+ accuracy in lab settings but failed in Thai clinics due to lighting conditions, image quality, and internet connectivity.
- Real-World Environment Testing: Validated in actual deployment conditions (lighting, equipment, connectivity)
- Image/Input Quality Thresholds: Clear rejection criteria when input quality is insufficient
- Graceful Degradation: System behavior defined for suboptimal conditions
- Workflow Integration: Tested within actual clinical workflows, not just standalone
⚠️ Forward Health CarePods removed human oversight from clinical contexts and failed due to "technical breakdowns, usability failures, and clinical safety concerns."
- Human Review Required: All clinical AI recommendations require human clinician review
- Clear AI Disclosure: Users understand they are interacting with AI, not a human
- Human Handoff Protocol: Defined triggers for escalation to human professional
- Usability with Real Patients: Interface tested with actual patient populations (not just healthy tech workers)
- Clinical Context Preserved: Automation does not remove necessary human judgment from high-stakes decisions
⚠️ Yara AI founder: "The Transformer architecture is just not very good at longitudinal observation, making it ill-equipped to see little signs that build over time."
- Longitudinal Pattern Tracking: System tracks patterns across sessions, not just within sessions
- Deterioration Detection: Ability to detect gradual worsening over time
- Session History Integration: Current session informed by relevant history
- Trend Alerting: Concerning trends flagged for human review
⚠️ Babylon Health exacerbated health inequity by being "more accessible to younger (healthier) people than to older and less healthy groups."
- Accessibility Validation: Tested with elderly, low-tech-literacy, and disabled users
- Health Equity Assessment: AI does not create/worsen disparities across populations
- Cognitive Load Assessment: Interface appropriate for users in distress or with cognitive limitations
- Economic Model Validation: Business model tested against actual usage patterns (not optimistic projections)
⚠️ The most responsible mental health AI companies define clear boundaries. Yara's founder: "AI can be wonderful for everyday stress, sleep troubles, or processing a difficult conversation. But the moment someone truly vulnerable reaches out... AI becomes dangerous."
- Clear Scope Definition: Documented what the AI is designed for AND what it is NOT designed for
- Scope Enforcement: Technical controls prevent AI from operating outside defined scope
- User Expectation Setting: Users informed upfront about AI capabilities and limitations
- Graceful Scope Exit: When user needs exceed scope, clear path to appropriate resources
- Founder Kill Switch: Team prepared to shut down if safety cannot be assured
⚠️ "That's too risky at this stage for high-stakes situations like caregiving. We want to make sure that everyone understands that you can't take what [an AI] comes back with at face value."
- Human Review Required: All clinical recommendations reviewed by humans
- Accessibility Validated: UI/UX tested with elderly populations (vision, hearing, cognitive)
- Caregiver Integration: Family/caregiver notification and involvement paths
- Technology Fear Mitigation: Design addresses technology anxiety in elderly users
- Cognitive Decline Detection: Patterns flagged to appropriate care providers
- Medication Safety: Drug interaction and dosage recommendations verified
- HIPAA/HITECH: PHI protection verified
- FDA Software as Medical Device (SaMD): Classification determined
- EU MDR: Medical device regulation compliance (if applicable)
- State Mental Health Laws: Jurisdiction-specific requirements met
- Clinical Trial Requirements: Human subjects research protocols followed
- Liability Insurance: Professional liability coverage adequate
Medical Device Regulatory Path (FDA De Novo):
- ISO 13485: Quality Management System gap analysis complete
- IEC 62304: Software lifecycle classification determined (Class A/B/C)
- ISO 14971: Risk management file with device-specific risks
- Design History File (DHF): Initiated for FDA submission
- Q-Submission: Pre-submission meeting scheduled with FDA
- Clinical Trial Protocol: IRB approval obtained
- Regulatory Consultant: Engaged for submission guidance
⚠️ For healthcare/therapeutic AI with physical device integration (RPM wearables, smart home, robotics), safety architecture must be formally proven BEFORE deployment. Retrofitting safety is 10x more expensive.
Safety Invariants (Must Be Formally Verified):
SAFETY_INVARIANTS = {
"no_harm": "System SHALL NOT execute commands that could physically harm users",
"fail_safe": "On any failure, system SHALL revert to safe default state",
"human_override": "Human operator SHALL always be able to override automated decisions",
"crisis_priority": "Crisis responses SHALL preempt all other operations",
"audit_complete": "All safety-critical decisions SHALL be logged with full context"
}
- Deterministic Safety Kernel: Real-time guarantees (<10ms response time)
- Formal Verification: Mathematical proofs (Z3/TLA+) for all safety invariants
- Triple Modular Redundancy: 3 independent checks for critical decisions
- Hardware E-Stop: Physical override capability for all automated actions
- Safety Interlock Controller: Prevents unsafe command sequences
- Audit Logger: ISO 13485 compliant, 100% coverage of safety decisions
- Watchdog Timers: Auto-failsafe on timeout
- Zero Unproven Invariants: All safety properties formally proven
Success Criteria:
- Zero safety-critical failures in 1M simulations
- <10ms safety check latency
- 100% audit trail coverage
- Hardware E-stop tested and documented
| Failure | What Happened | Year | Loss | Prevention Check |
|---|---|---|---|---|
| IBM Watson | US-trained model failed internationally | 2023 | $4B+ | [ ] Geographic validation |
| Babylon Health | Unvalidated clinical claims | 2023 | $4.2B | [ ] Third-party clinical validation |
| Forward CarePods | Removed human oversight | 2024 | $650M | [ ] Human-in-the-loop maintained |
| Character.AI | No crisis detection, encouraged self-harm | 2024 | Teen suicide | [ ] Crisis detection + response safety |
| Yara | LLM can't track longitudinal patterns | 2025 | Voluntary | [ ] Longitudinal tracking |
| Brown Study | 15 ethical violations in therapy bots | 2025 | Research | [ ] Ethics validation |
| Stanford Study | Stigma toward certain conditions | 2025 | Research | [ ] Bias testing |
| Epic Sepsis | 67% miss rate, alert fatigue | 2021 | Clinical harm | [ ] PPV optimization |
| Google Verily | Lab accuracy failed in real clinics | 2020 | Undisclosed | [ ] Real-world environment testing |
| Olive AI | Healthcare ops unicorn collapse | 2024 | ~$4B | [ ] Economic model validation |
⬆️ Navigation · ⬅️ Team · Next: Anti-Patterns ➡️
Important
Why it matters: These case studies represent billions in losses and destroyed careers. Each failure provides concrete patterns to detect and avoid in your own systems.
What happened: Zillow's iBuying algorithm made instant cash offers on homes. In 2021, the division was shut down with a $500M+ write-down and 25% workforce reduction.
Root Causes Identified:
- Adverse Selection: Model errors weren't random. Homeowners accepted overvalued offers, rejected undervalued ones. Zillow systematically acquired "lemons."
- Regime Change Blindness: Model built on pre-COVID trends failed to adapt to volatile post-pandemic market.
- Algorithmic Hubris: Point estimates treated as truth; uncertainty and tail risk ignored.
Anti-Patterns to Check:
- Adverse Selection Analysis: Documented how counterparties might exploit asymmetric information about model errors
- Regime Change Planning: Strategy for detecting and responding when historical patterns break
- Uncertainty Quantification: Decisions use confidence intervals/prediction intervals, not point estimates
- Human Override Protocol: Clear escalation path for high-stakes decisions beyond model recommendation
- Asymmetric Error Costs: Documented and optimized for different costs of over-prediction vs. under-prediction
What happened: Amazon's resume-screening AI, trained on 10 years of hiring data, systematically penalized female candidates. Project scrapped.
Root Causes Identified:
- Historical Bias in Training Data: Data reflected decade of male-dominated tech hiring.
- Proxy Discrimination: Even with "gender" removed, model found proxies ("women's chess club," women's college names).
Anti-Patterns to Check:
- Proxy Variable Audit: Tested whether protected attributes can be reconstructed from remaining features
- Historical Bias Assessment: Training data evaluated for patterns reflecting historical discrimination
- Disparate Impact Testing: Model outputs tested for statistical disparities across demographic groups
- Bias Reconstruction Testing: Verified model can't infer protected attributes from allowed features
- Regular Fairness Audits: Scheduled re-evaluation (not just one-time pre-launch testing)
- Diverse Evaluation Team: People from affected groups involved in testing and evaluation
What happened: Widely deployed clinical AI for early sepsis detection. External validation found it missed 67% of cases with ~12% Positive Predictive Value.
Root Causes Identified:
- Alert Fatigue: ~8 false alarms per true positive. Clinicians ignored the tool entirely.
- Overfitting to Source: Model overfitted to specific hospitals' coding practices and workflows.
- COVID Regime Shift: During pandemic, couldn't distinguish COVID symptoms from sepsis (43% alert increase).
Anti-Patterns to Check:
- External Validation Mandatory: Model tested outside development environment before deployment
- PPV in Context: Positive Predictive Value calculated for actual deployment prevalence (not just sensitivity/specificity)
- Alert Fatigue Assessment: If alerting system, false positive burden on users explicitly evaluated
- User Trust Tracking: Monitoring whether users actually follow/trust model recommendations
- Local Calibration Required: Strategy for adapting model to each deployment site's characteristics
- Regime Change Detection: Monitoring for environmental shifts that invalidate model assumptions
| Anti-Pattern | Zillow | Amazon | Epic | Your System |
|---|---|---|---|---|
| Adversarial/gaming not considered | ✓ | | | [ ] |
| Historical bias in training data | ✓ | ✓ | | [ ] |
| Proxy discrimination possible | | ✓ | | [ ] |
| No external validation | | | ✓ | [ ] |
| Alert/recommendation fatigue risk | | | ✓ | [ ] |
| Regime change blindness | ✓ | | ✓ | [ ] |
| Point estimates without uncertainty | ✓ | | | [ ] |
| No local/site calibration | | | ✓ | [ ] |
⬆️ Navigation · ⬅️ Healthcare · Next: Scoring ➡️
Count your checked items:
| Score | Readiness Level | Recommendation |
|---|---|---|
| 0-20% | 🔴 Prototype | Not ready for any real users |
| 21-40% | 🟠 Alpha | Internal testing only |
| 41-60% | 🟡 Beta | Limited external users with warnings |
| 61-80% | 🟢 Production Ready | Ready for general availability |
| 81-100% | 🏆 Enterprise Grade | Ready for mission-critical deployment |
⬆️ Navigation · ⬅️ Anti-Patterns · Next: Quick Wins ➡️
If you're overwhelmed, start with these high-impact items:
- Authentication: Never deploy without it
- Rate Limiting: Prevent abuse and cost overruns
- Error Handling: Graceful failures save users
- Monitoring: You can't fix what you can't see
- Backup Strategy: Because data loss is unforgivable
⬆️ Navigation · ⬅️ Scoring · Next: Downloads ➡️
| Format | Description | Download |
|---|---|---|
| Interactive HTML | Apple HIG-inspired checklist with auto-scoring, dark mode, lifecycle stages, gate classifications, progress tracking | Download HTML |
| CSV/Excel Template | Spreadsheet format with all 400+ items, Stage/Gate columns, priority levels - works in Excel, Google Sheets, Numbers | Download CSV |
| Architecture Diagram | Draw.io component diagram showing how all checklist components work together | Download .drawio |
Apple Human Interface Guidelines Design:
- SF Pro typography with optimal letter-spacing and weights
- Native dark mode support (prefers-color-scheme)
- Glassmorphism panels with backdrop blur effects
- Custom circular checkboxes with animated checkmarks
- Segmented control-style navigation tabs
- 8-point grid spacing system
- 44px touch targets for accessibility
- Smooth spring animations and micro-interactions
Functionality:
- Auto-Scoring: Real-time progress calculation with readiness badges
- Lifecycle Filtering: Filter items by stage (Ideation → Optimize)
- Gate Classification: Visual indicators for Mandatory/Advisory/Configurable items
- Local Storage: Progress persists across browser sessions
- Export/Import: Save and restore progress as JSON
- Print-Friendly: Optimized print stylesheet
- Responsive: Works on desktop, tablet, and mobile
Data Features:
- CSV Version: Sortable by Section/Stage/Gate/Priority, add custom notes, calculate scores with formulas
- Diagram: Editable in draw.io - shows 5-layer architecture with data flow
📝 Text Version of Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ USER & CLIENT LAYER │
│ Users → Auth (JWT/OAuth) → Rate Limiting → API Gateway → Input Validation │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ AGENTIC AI & ORCHESTRATION LAYER │
│ Orchestrator → Task Agents → RAG Agents → Multi-Agent → Human-in-Loop │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ MODEL & INFERENCE LAYER │
│ Prompt Engine → LLM Router → Primary/Fallback LLM → Output Safety │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA QUALITY & VALIDATION LAYER │
│ Feature Store → Schema Validator → Drift Detector → Leakage Scanner │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA & KNOWLEDGE LAYER │
│ Vector DB → Knowledge Base → Cache → Data Lakehouse → External Data │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE & COMPUTE LAYER │
│ Kubernetes → GPU Cluster → Model Serving (vLLM) → Queue → Secrets │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ CROSS-CUTTING: Monitoring │ Governance │ MLOps │ Evaluation │ FinOps │Debt │
└─────────────────────────────────────────────────────────────────────────────┘
⬆️ Navigation · ⬅️ Quick Wins · Next: Tech Guides ➡️
Choosing the right architecture and tools is critical. These decision frameworks are based on Google's 76-page AI Agents whitepaper, Anthropic's MCP documentation, and production engineer comparisons from 2024-2025.
2025 Insight: Google's ICLR 2025 research shows RAG paradoxically reduces a model's ability to abstain when appropriate—additional context increases confidence and can lead to more hallucination. Add sufficiency checks before generation.
| Pattern | When to Use | When NOT to Use | Stage | Key Research |
|---|---|---|---|---|
| Naive RAG | Simple Q&A, single doc source, prototyping | Multi-step reasoning, complex queries | POC | Baseline approach |
| Advanced RAG | Better accuracy needed, multiple sources, reranking | Simple use cases, low latency required | MVP/Pilot | Hybrid search + rerankers |
| Self-RAG | Model decides when/how much to retrieve | Static retrieval patterns sufficient | Pilot | 2024 research |
| Modular RAG | Custom pipelines, domain-specific needs | Quick prototypes, standard use cases | Production | Component-based architecture |
| Graph RAG | Knowledge graphs, entity relationships, complex reasoning | Unstructured text only, simple retrieval | Production | Microsoft Graph RAG |
| Agentic RAG | Dynamic retrieval, tool use, multi-step reasoning | Static Q&A, simple lookups | Production/Scale | Google whitepaper patterns |
| Reasoning RAG | System 2 thinking, industry challenges | Simple factual queries | Scale | 2025 survey |
Production RAG Best Practices (2025):
- Sufficiency check before generation (Google ICLR 2025) - see the sketch after this list
- Retrieve more context OR re-rank when insufficient
- Tune abstention threshold with confidence signals
- Hybrid search (vector + keyword) implemented
- Streaming data ingestion for real-time updates
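To make the sufficiency-check item concrete, here is a minimal, framework-agnostic sketch of abstaining before generation. All names are hypothetical and `call_llm` is a placeholder; real pipelines would use an LLM judge or a trained sufficiency classifier rather than a raw similarity threshold.

```python
def call_llm(query: str, context: str) -> str:
    """Placeholder for the actual generation call (hosted API or local model)."""
    return f"[answer to '{query}' grounded in {len(context)} chars of retrieved context]"

def is_sufficient(chunks: list[dict], min_score: float = 0.75, min_chunks: int = 2) -> bool:
    """Crude sufficiency proxy: enough retrieved chunks above a similarity threshold."""
    return sum(c["score"] >= min_score for c in chunks) >= min_chunks

def answer_with_abstention(query: str, chunks: list[dict]) -> str:
    if not is_sufficient(chunks):
        # Alternatives: widen retrieval, re-rank, or escalate instead of guessing.
        return "I don't have enough grounded context to answer that reliably."
    return call_llm(query, "\n".join(c["text"] for c in chunks))

retrieved = [{"text": "Refunds are issued within 14 days.", "score": 0.81},
             {"text": "Contact support for billing issues.", "score": 0.77}]
print(answer_with_abstention("What is the refund window?", retrieved))
```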
Google's recommended patterns from their 76-page AI Agents whitepaper for production multi-agent systems:
| Pattern | When to Use | Complexity | Google Use Case |
|---|---|---|---|
| Single Agent | Simple tasks, clear success criteria | Low | Task-oriented agents |
| Tool-Using Agent | External API calls, calculations | Medium | Navigation, search |
| Hierarchical Orchestration | Central agent routes to domain experts | High | Connected vehicle system |
| Diamond Pattern | Post-hoc moderation needed | High | Content safety |
| Peer-to-Peer Handoff | Autonomous query rerouting | High | User support flows |
| Collaborative Synthesis | Multiple agents contribute to response | Very High | Response mixer pattern |
| Adaptive Looping | Iterative refinement needed | Very High | Complex reasoning |
Agent Decision Checklist:
- Task complexity assessed (single-step vs. multi-step)
- Human-in-the-loop requirements documented
- Error tolerance and fallback strategy defined
- Coordination overhead budget set
- Safety pattern selected (Diamond for moderation)
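As a rough illustration of the Hierarchical Orchestration and Diamond patterns from the table above, here is a framework-agnostic sketch: a central router delegates to domain agents, and a separate post-hoc moderation step sits between agents and the user. All agent names and rules are hypothetical; real systems would wrap LLM calls, add retries, tracing, and human-in-the-loop hooks.

```python
# Hypothetical domain agents; in practice these would wrap LLM or tool calls.
def billing_agent(q: str) -> str: return f"[billing answer to: {q}]"
def tech_agent(q: str) -> str:    return f"[tech-support answer to: {q}]"
def fallback_agent(q: str) -> str: return "Let me connect you with a human specialist."

AGENTS = {"billing": billing_agent, "tech": tech_agent}

def classify(query: str) -> str:
    """Toy intent classifier; production systems would use an LLM or trained model."""
    q = query.lower()
    if "invoice" in q or "charge" in q:
        return "billing"
    if "error" in q or "crash" in q:
        return "tech"
    return "other"

def route(query: str) -> str:
    """Central orchestrator: classify intent, delegate to a domain agent."""
    return AGENTS.get(classify(query), fallback_agent)(query)

def moderate(text: str) -> str:
    """Diamond pattern: a separate post-hoc check before anything reaches the user."""
    banned = ("guaranteed returns", "medical advice")
    return "[response withheld for review]" if any(b in text.lower() for b in banned) else text

print(moderate(route("Why does my invoice show two charges?")))
```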
2025 Update: MCP adopted by OpenAI (March 2025), Google DeepMind (April 2025), and Microsoft Azure. Thousands of MCP servers built by community.
| Protocol | Best For | Adoption | Security Notes |
|---|---|---|---|
| MCP (Model Context Protocol) | Tool integration, data connectors | Industry standard (2025) | Review prompt injection risks |
| A2A (Agent-to-Agent) | Multi-agent communication | Google standard | Enterprise MAS |
| OpenAI Agents SDK | OpenAI ecosystem | Growing | Native tool use |
| Custom REST/gRPC | Full control, legacy systems | Stable | Existing infrastructure |
MCP Production Benefits (Anthropic 2025):
- Code execution with MCP: 98.7% token reduction in complex workflows
- API handles connection management, tool discovery, error handling
- Pre-built servers: Google Drive, Slack, GitHub, Postgres, Puppeteer
Based on production engineer comparisons and DataCamp analysis.
| Framework | Best For | Learning Curve | Production Readiness | When to Use |
|---|---|---|---|---|
| LangGraph | Stateful workflows, complex graphs | Steep | High | Intricate branching workflows, need replay/rollback |
| CrewAI | Role-based teams, rapid prototyping | Easy | Medium | Defined role delegation, fastest to prototype |
| AutoGen | Dynamic conversations, Azure ecosystem | Medium | High | Enterprise environments, Microsoft stack |
| OpenAI Agents SDK | OpenAI-native agents | Easy | High | OpenAI ecosystem, simple agents |
| LlamaIndex | RAG, document Q&A | Easy | High | Data ingestion, retrieval pipelines |
| Haystack | Production RAG pipelines | Medium | Very High | Enterprise RAG, self-hosted |
| vLLM | High-throughput inference | Medium | Very High | Serving at scale, PagedAttention |
| TGI | HuggingFace model serving | Easy | High | HF ecosystem, production serving |
Framework Selection by Use Case:
- Intricate stateful workflows → LangGraph (state transitions, visual debugging)
- Dynamic conversational systems → AutoGen (conversation-first design)
- Defined role delegation → CrewAI (fastest path to working prototype)
- Enterprise reliability → AutoGen (Microsoft-backed, Azure integration)
Based on LMArena Leaderboard, Hugging Face Open LLM Leaderboard, and Artificial Analysis. Updated December 2025.
| Use Case | Top Models (Dec 2025) | Open-Source Alternative | Notes |
|---|---|---|---|
| Complex reasoning | GPT-5, Claude Opus 4.5, Gemini 3.0 Pro | DeepSeek R1, Qwen3-235B | Gemini 3.0 Pro leads GPQA Diamond (91.9%) |
| High volume | GPT-5 Mini, Claude Haiku 4.5, Gemini 2.5 Flash | Qwen3 (0.6B-235B range), Jamba 1.6 Mini | Gemini 2.5 Flash: 372 tokens/sec |
| On-premise/Privacy | Llama 4 Maverick (400B), Mistral Large | DeepSeek-V3.1, Qwen3 Next | Llama 4 Scout fits single H100 (Int4) |
| Long context (1M+) | Gemini 3.0 (10M), Llama 4 Scout (10M) | Jamba 1.6 (256K), Qwen3 (128K) | Llama 4 Scout: 10M token context |
| Code generation | Claude Opus 4.5, GPT-5 | DeepSeek Coder, Codestral | Claude Opus 4.5: first >80% SWE-Bench |
| Multimodal | GPT-5, Gemini 3.0, Claude Opus 4.5 | Llama 4 (native multimodal), SmolVLM | Llama 4: natively multimodal, 200 languages |
| Agents/Tool use | Gemini 3.0, Claude Sonnet 4.5 | Qwen3-Agent, Llama 4 Maverick | Sonnet 4.5: 61.4% OSWorld |
| EU data residency | Mistral (EU), Azure OpenAI (EU) | Mistral Large, Jamba 1.6 | Mistral HQ in Paris |
| Edge/Mobile | GPT-5 Nano, Gemini 2.5 Flash-Lite | Jamba Reasoning 3B, Qwen3-4B | Jamba 3B: 250K context on phones |
Latest Model Releases (Q4 2025):
- Gemini 3.0 Pro (Nov 2025): #1 on LMArena, 41% on Humanity's Last Exam
- Claude Opus 4.5 (Nov 2025): First model >80% SWE-Bench Verified
- GPT-5.1 (Nov 2025): Faster reasoning, extended prompt caching
- Llama 4 (Apr 2025): MoE architecture, 10M context (Scout), 400B params (Maverick)
- Qwen3 Next 80B (Sep 2025): 3× smaller than 235B, 4× more experts
Hugging Face CEO Insight (Nov 2025):
"You can use a smaller, more specialized model that is going to be cheaper, faster, that you're going to be able to run on your infrastructure as an enterprise. I think that is the future of AI."
Model Decision Checklist:
- Accuracy requirements benchmarked against leaderboards
- Token economics calculated (input/output pricing; see the cost sketch after this checklist)
- Context window requirements assessed
- Latency SLA vs. model size trade-off evaluated
- Data privacy/residency requirements documented
- Fine-tuning vs. RAG vs. prompt engineering decision made
- Open-source license compatibility verified
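For the token-economics item above, a back-of-the-envelope cost sketch; the prices and volumes below are placeholder assumptions, not current vendor pricing.

```python
def monthly_llm_cost(requests_per_day: int,
                     avg_input_tokens: int,
                     avg_output_tokens: int,
                     price_in_per_1m: float,
                     price_out_per_1m: float,
                     days: int = 30) -> float:
    """Rough monthly spend estimate; ignores caching, retries, and batch discounts."""
    tokens_in = requests_per_day * avg_input_tokens * days
    tokens_out = requests_per_day * avg_output_tokens * days
    return tokens_in / 1e6 * price_in_per_1m + tokens_out / 1e6 * price_out_per_1m

# Placeholder figures: 50k requests/day, 1.5k input / 400 output tokens, $1.00/$4.00 per 1M tokens.
print(f"${monthly_llm_cost(50_000, 1_500, 400, 1.00, 4.00):,.0f}/month")  # $4,650/month
```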
📖 Deep Dive: See docs/TECHNOLOGY-SELECTION-GUIDE.md for detailed decision trees and case studies.
⬆️ Navigation · ⬅️ Downloads · Next: Resources ➡️
For deeper dives into specific topics, see our detailed reference guides:
| Document | Description |
|---|---|
| Lifecycle Stages Guide | Detailed 8-stage workflow with gate requirements and FDA overlay |
| Technology Selection Guide | RAG, Agent, Framework, and Model decision frameworks |
| Assured Intelligence Guide | Conformal Prediction, Calibration, Causal Inference, Zero-False-Negative Engineering |
| Failure Taxonomy Deep Dive | Detailed analysis of the three failure domains: Data Schism, Metric Gap, Technical Debt |
| Case Studies | Expanded forensic analysis of Zillow ($500M+), Amazon (bias), Epic (clinical harm) |
| Healthcare AI Case Studies | 12 healthcare/mental health AI failures: IBM Watson, Babylon Health, Character.AI, Yara AI, and more |
| MLOps Maturity Model | Assessment tool and progression roadmap from Level 0 to Level 3 |
Agent & Orchestration:
- LangChain - RAG and agent framework
- LlamaIndex - Knowledge-driven AI applications
- AutoGen - Multi-agent conversation framework
- CrewAI - Task-oriented multi-agent coordination
- Semantic Kernel - Microsoft's modular AI framework
Evaluation & Testing:
- DeepEval - LLM evaluation with CI/CD support
- RAGAS - RAG evaluation framework
- Promptfoo - LLM red teaming and testing
- Arize Phoenix - LLM observability
Serving & Infrastructure:
- vLLM - High-throughput LLM serving
- Ray Serve - Scalable model serving
- Triton Inference Server - Multi-model serving
MLOps & Monitoring:
- Weights & Biases - ML experiment tracking
- MLflow - ML lifecycle management
- Prometheus - Monitoring and alerting
- Grafana - Observability dashboards
Infrastructure:
- Terraform - Infrastructure as Code
- Kubernetes - Container orchestration
- The Production ML Handbook
- Google's ML Best Practices
- Microsoft's Responsible AI
- NIST AI Risk Management Framework
- EU AI Act
- OWASP LLM Top 10
⬆️ Navigation · ⬅️ Tech Guides · Next: Contributing ➡️
This checklist is a living document. Please contribute your hard-won lessons:
- Fork the repository
- Add your items with practical examples
- Submit a pull request
- Share your production horror stories in discussions
⬆️ Navigation · ⬅️ Resources · Next: Credit ➡️
If you find this checklist helpful, please consider:
- Star this repo ⭐ to help others discover it
- Credit the source when sharing or adapting:
AI Production Readiness Checklist by Aejaz Sheriff at Pragmatic Logic AI
- Link back to this repository in your documentation, presentations, or articles
- Share on LinkedIn, Twitter/X, or your tech community
Your attribution helps support the continued development of open-source AI resources!
⬆️ Navigation · ⬅️ Contributing · Next: License ➡️
This project uses dual licensing to maximize both adoption and attribution:
| Content | License | What You Can Do |
|---|---|---|
| Code (HTML, CSV, templates) | MIT | Use, modify, distribute freely |
| Documentation (Markdown, guides) | CC BY 4.0 | Share and adapt with attribution |
Attribution for documentation:
AI Production Readiness Checklist by Pragmatic Logic AI
⬆️ Navigation · ⬅️ Please Credit · Next: Credits ➡️
Created by Aejaz Sheriff at Pragmatic Logic AI based on:
- 27 years of enterprise system development
- Countless production incidents and lessons learned
- Contributions from the amazing AI community
- Industry research from Gartner, McKinsey, PwC, and NVIDIA
🏷️ Keywords & Topics
Leadership & Strategy: CTO AI Strategy VP of AI Head of ML AI Team Leadership AI Executive Guide AI Board Reporting AI Risk Management Build vs Buy AI AI Vendor Selection AI Steering Committee AI Portfolio Management AI ROI Metrics
Personas & Roles: Startup AI Checklist Enterprise AI Architecture Solo Developer AI Healthcare AI Compliance Financial Services AI Data Scientist to ML Engineer Platform Team MLOps AI Compliance Officer Agency AI Development Government AI Public Sector AI
Production & Operations: AI Production LLM Deployment MLOps AI Governance Enterprise AI Generative AI AI Strategy AI Architecture Multi-Agent Systems RAG Agentic RAG ReAct Pattern Reason Act Pattern MCP Model Context Protocol Prompt Caching LLM Latency Optimization External Reflection Agent Reflection Prompt Engineering AI Security
Evaluation & Quality: LLM Evaluation Holistic Agent Evaluation WAI-AI Working Alliance Inventory LLM-as-Judge Persona Consistency AI FinOps Red Teaming OWASP LLM Golden Dataset Testing Hallucination Detection Bias Testing
Compliance & Regulation: AI Compliance EU AI Act IEC 61508 ISO 13485 IEC 62304 FDA De Novo FDA SaMD HIPAA AI SOC 2 AI FedRAMP AI Model Risk Management SR 11-7 Fair Lending AI
Healthcare & Safety: Responsible AI Healthcare AI Mental Health AI Safety Clinical AI Validation Therapeutic AI AI Ethics Safety-Critical AI Formal Verification Safety Invariants AI Crisis Detection Crisis Detection Recall
Data & ML Engineering: Training-Serving Skew Data Leakage Detection Model Drift AI Technical Debt Feature Store Edge AI Edge Cloud Split Model Registry A/B Testing ML Canary Deployment AI
Assured Intelligence: Conformal Prediction Causal AI Uncertainty Quantification Probability Calibration Zero-False-Negative Selective Prediction OOD Detection DoWhy CausalML Model Calibration ECE
⭐ Star this repo if it helps you avoid production disasters!
"In production, no one can hear your model scream."
