The Complete Guide to Production AI: From 87% Failure Rate to Deployment Success
480+ Production Checklist Items • 20 Domains • CRISP-DM Based • Enterprise Ready
📥 Download Interactive Checklist • 📊 Download CSV Template • 🏗️ View Architecture
A battle-tested checklist built from 27 years of enterprise experience and analysis of $15B+ in AI failures (IBM Watson, Zillow, Babylon Health, Character.AI). Avoid the mistakes that killed billion-dollar AI projects.
🎯 New to this checklist?
📊 Know what you need? Jump to: Architecture | Security | Monitoring | Healthcare AI |
⬆️ Top · Next: Why This Checklist ➡️
After 27 years of building enterprise systems and analyzing why AI projects fail in production, I've compiled this checklist of everything you need to consider before deploying AI to real users.
This checklist helps you avoid:
- 💸 Financial disasters like Zillow's $500M+ algorithmic home-buying (Zillow Offers) collapse
- ⚠️ Safety failures like Character.AI's crisis mishandling leading to teen suicide
- 🏥 Clinical harm like IBM Watson's unsafe treatment recommendations
- 📉 Business failures like Babylon Health's $4.2B → $0 collapse
- ⚖️ Legal liability from EU AI Act violations, HIPAA breaches, or bias lawsuits
| Metric | Value | Source |
|---|---|---|
| ML projects failing to reach production | 87% | Industry research |
| Companies with full operational AI integration | 1% | McKinsey |
| Organizations planning to increase AI investment (2025) | 92% | Gartner |
| Organizations using AI agents in production | 79% | Industry survey |
| Enterprises with 50+ generative AI use cases in pipeline | 80% | Enterprise survey |
| Organizations actively managing AI spending (2x from 2024) | 63% | FinOps Foundation |
| Faster model deployment with comprehensive MLOps | 60% | MLOps research |
| Reduction in production incidents with proper governance | 40% | Governance studies |
Market Growth:
- AI agents market: $5.4B → $7.6B (2024→2025)
- Enterprise LLM market: $5.9B → $71.1B projected by 2035
⬆️ Quick Start · Next: Architecture ➡️
⬅️ Why This Checklist · Next: How to Use ➡️
This checklist helps you systematically evaluate your AI system's readiness for production deployment. Each section addresses a critical aspect of enterprise AI operations—skip any section at your own risk.
- Assess Current State - Go through each section and check items you've already completed
- Identify Gaps - Unchecked items represent potential risks or missing capabilities
- Prioritize by Risk - Focus on Security, Safety, and Monitoring first—these prevent disasters
- Filter by Stage - Use lifecycle stage filters to focus on items relevant to your current phase
- Create Action Plan - Turn unchecked items into tasks with owners and deadlines
- Track Progress - Use the interactive HTML checklist with auto-save and dark mode support
| Priority | Sections | Why |
|---|---|---|
| 🔴 Critical | Security & Compliance, Safety & Ethics, Assured Intelligence | Legal liability, user safety, quantified uncertainty |
| 🟠 High | Monitoring & Observability, Cost Management, Data Quality | You can't fix what you can't see; costs can explode |
| 🟡 Important | Red Teaming, Governance, Evaluation, Metric Alignment | Prevent attacks, ensure compliance, maintain quality |
| 🟢 Foundation | Architecture, Agentic AI, Performance | Long-term scalability and maintainability |
| 🔵 Enablers | Prompt Engineering, Strategy, Team | Operational excellence and continuous improvement |
| Score | Level | What It Means |
|---|---|---|
| 0-20% | 🔴 Prototype | Demo only—not ready for any real users |
| 21-40% | 🟠 Alpha | Internal testing only with technical users |
| 41-60% | 🟡 Beta | Limited external users with clear warnings |
| 61-80% | 🟢 Production Ready | Ready for general availability |
| 81-100% | 🏆 Enterprise Grade | Mission-critical deployment ready |
| Section | What It Covers | Key Risk If Skipped |
|---|---|---|
| Architecture & Design | Data pipelines, model infrastructure, system design | Technical debt, scaling failures |
| 🔬 Data Quality & Statistical Validity | Training-serving skew, data leakage, drift detection | Silent failures, "optimism trap," model degradation |
| Agentic AI & MAS | Multi-agent patterns, orchestration, collaboration | Coordination failures, unpredictable behavior |
| Security & Compliance | Auth, encryption, privacy, industry standards | Data breaches, legal penalties |
| Red Teaming & LLM Security | OWASP vulnerabilities, adversarial testing | Prompt injection, data leakage |
| Performance & Scale | Latency, throughput, parallelism | Poor user experience, outages |
| Cost Management & FinOps | Token tracking, budgets, optimization | Unexpected bills, budget overruns |
| Safety & Ethics | Input/output safety, bias, responsible AI | Harmful outputs, reputation damage |
| Monitoring & Observability | Metrics, alerting, dashboards | Blind to issues, slow incident response |
| Operations & Maintenance | Deployment, model management, DR | Downtime, data loss |
| 🔧 Technical Debt & System Integrity | CACE principle, pipeline jungles, feedback loops | Brittle systems, cascading failures, stagnation |
| AI Governance | Regulatory compliance, EU AI Act, audit trails | Fines, legal action, failed audits |
| LLM Evaluation & Testing | Quality metrics, testing types, benchmarks | Degraded quality, hallucinations |
| 📐 Metric Alignment & Evaluation | Proxy problems, Goodhart's Law, online evaluation | Business-destructive "optimized" models |
| 🔬 Assured Intelligence & Quantitative Safety | Conformal prediction, calibration, causal inference, zero-FN | Overconfident wrong predictions, unquantified risk, proxy discrimination |
| Prompt Engineering | Design principles, version control, CI/CD | Inconsistent outputs, maintenance chaos |
| AI Strategy & Transformation | Roadmap, implementation phases, change management | Failed adoption, wasted investment |
| Team & Process | Documentation, training, organizational readiness | Knowledge silos, operational failures |
| 🏥 Healthcare & Mental Health AI | Crisis detection, clinical validation, ethics | Patient harm, deaths, lawsuits |
| Anti-Patterns: Case Studies | Zillow, Amazon, Epic failure analysis | Repeating billion-dollar mistakes |
⬅️ Architecture · Next: Essential 20 ➡️
Don't have time for 400+ items? Start here. These 20 items are non-negotiable for ANY AI project going to production. Complete these first, then expand based on your persona path.
| # | Item | Why It's Critical | Section |
|---|---|---|---|
| 1 | Authentication (JWT/OAuth) | No auth = anyone can abuse your API | Security |
| 2 | Rate limiting per user | Prevents cost explosions and abuse | Security |
| 3 | Prompt injection detection | #1 LLM vulnerability (OWASP LLM01) | Red Teaming |
| 4 | Output toxicity filtering | Prevents harmful/offensive outputs | Safety |
| 5 | PII detection and masking | Legal requirement (GDPR, HIPAA) | Privacy |
| 6 | Error handling with fallbacks | Graceful degradation, not crashes | Architecture |
| 7 | Basic monitoring (latency, errors) | You can't fix what you can't see | Monitoring |
| 8 | Cost alerts and hard limits | Prevents $100K surprise bills | FinOps |
| 9 | Rollback procedure documented | Quick recovery from bad deployments | Operations |
| 10 | Human escalation path defined | When AI fails, humans must intervene | Safety |
| 11 | Golden test dataset (~50 prompts) | Catch regressions before users do | Evaluation |
| 12 | Model/prompt version control | Know what's deployed, enable rollback | MLOps |
| 13 | TLS encryption (data in transit) | Basic security requirement | Security |
| 14 | Backup strategy (3-2-1 rule) | Recover from disasters | DR |
| 15 | API documentation | Others can use and maintain it | Team |
| 16 | Hallucination rate tracking | Know how often your AI lies | Evaluation |
| 17 | Clear scope boundaries | Users know what AI can/can't do | Safety |
| 18 | Audit logging | Forensics when things go wrong | Compliance |
| 19 | Bias testing completed | Avoid discrimination lawsuits | Ethics |
| 20 | Kill switch / disable capability | Emergency shutdown when needed | Operations |
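Item 11 (the golden test dataset) is one of the fastest wins on this list. As a hedged illustration, here is a minimal pytest-style regression test, assuming a hypothetical `generate_answer()` wrapper around your LLM call and a `golden_set.jsonl` file of prompt/expected-fact pairs:

```python
# golden_test.py — regression test over a small "golden" prompt set.
# Assumes a hypothetical generate_answer(prompt) wrapper around your LLM call and
# a golden_set.jsonl file with one {"prompt": ..., "must_contain": ...} object per line.
import json
import pytest

from my_app import generate_answer   # hypothetical application entry point

with open("golden_set.jsonl", encoding="utf-8") as f:
    GOLDEN_CASES = [json.loads(line) for line in f if line.strip()]

@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["prompt"][:40])
def test_golden_prompt(case):
    answer = generate_answer(case["prompt"])
    # Cheap, deterministic check: the answer must mention the expected fact.
    assert case["must_contain"].lower() in answer.lower()
```

Run it in CI before every prompt or model change; exact-substring checks are crude, but they catch regressions deterministically and cost nothing per run.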
Completed all 20? You're at ~40% readiness (Alpha stage). Now pick your persona path to reach production.
⬆️ Back to Top · Next: Persona Paths ➡️
Different roles need different priorities. Find your persona below and follow the customized path to production readiness.
| I am a... | My main concern is... | Jump to |
|---|---|---|
| CTO / Technical Executive | Technical strategy, team scaling, risk | CTO Path |
| VP of AI / Head of ML | AI roadmap, team leadership, delivery | VP AI Path |
| Startup Founder | Ship fast without disasters | Startup Path |
| Enterprise Architect | Scale, compliance, integration | Enterprise Path |
| Solo Developer | Side project / learning | Solo Path |
| Healthcare/Medical | Patient safety, FDA, HIPAA | Healthcare Path |
| Financial Services | Fraud, compliance, audit | FinServ Path |
| Data Scientist | Transitioning to ML Engineering | DS→MLE Path |
| Platform Team | Infrastructure, MLOps | Platform Path |
| Compliance/Legal | Risk, regulations, audit | Compliance Path |
| Agency/Consultancy | Building for clients | Agency Path |
| Government/Public Sector | Transparency, FedRAMP, citizens | Government Path |
Your Reality: Board accountability, budget ownership, team scaling, technical risk across the organization, vendor relationships, security posture.
Your Risk Profile: Career-defining decisions. AI failures become your failures. Must balance innovation speed with enterprise risk.
flowchart TB
subgraph CTO["👔 CTO STRATEGIC FRAMEWORK"]
direction TB
subgraph Governance["🏛️ GOVERNANCE & RISK"]
G1["AI Risk Committee"]
G2["Board Reporting"]
G3["Insurance Coverage"]
end
subgraph Technical["⚙️ TECHNICAL STRATEGY"]
T1["Build vs Buy"]
T2["Vendor Selection"]
T3["Architecture Standards"]
end
subgraph Team["👥 ORGANIZATION"]
O1["Team Structure"]
O2["Hiring Strategy"]
O3["Skills Development"]
end
subgraph Delivery["🚀 DELIVERY"]
D1["Portfolio Prioritization"]
D2["Success Metrics"]
D3["Incident Response"]
end
end
style CTO fill:transparent,stroke:#1e40af,stroke-width:2px
style Governance fill:#fef2f2,stroke:#dc2626
style Technical fill:#dbeafe,stroke:#3b82f6
style Team fill:#dcfce7,stroke:#22c55e
style Delivery fill:#fef3c7,stroke:#f59e0b
| Priority | Decision | Key Questions |
|---|---|---|
| 🔴 Week 1 | AI Risk Assessment | What's our risk appetite? What could kill the company? |
| 🔴 Week 2 | Build vs Buy Strategy | Core competency or commodity? Vendor lock-in risks? |
| 🟠 Week 3 | Team & Budget | Do we have the talent? What's realistic budget? |
| 🟠 Week 4 | Governance Model | Who approves AI projects? What are the gates? |
- AI steering committee formed (you + CEO + Legal + Product)
- AI ethics guidelines published internally
- Vendor evaluation criteria established
- Security review process for AI tools defined
- Budget allocation and tracking system
- Success metrics defined (business outcomes, not just technical)
- Incident response plan for AI failures
- Board reporting dashboard created
- Insurance coverage reviewed for AI-specific risks
- Regulatory compliance roadmap (EU AI Act, etc.)
- Technical debt management process
- Knowledge sharing across AI teams
| Decision | Options | Consider |
|---|---|---|
| Build vs Buy | Internal team vs Vendors vs Hybrid | Core IP, time-to-market, talent availability |
| Model Strategy | Proprietary vs Open Source vs API | Cost, control, compliance, capabilities |
| Risk Tolerance | Conservative vs Aggressive | Industry, stage, competition, regulation |
| Team Structure | Centralized vs Federated vs Hybrid | Company size, culture, use case diversity |
| Vendor Selection | OpenAI vs Anthropic vs Google vs OSS | Cost, features, data residency, reliability |
| Metric | Why It Matters | Target |
|---|---|---|
| AI Project ROI | Justify investment to board | >3x within 18 months |
| Time to Production | Measure team velocity | <90 days for typical project |
| Incident Rate | Operational excellence | <1 P1 per quarter |
| Cost per Inference | Unit economics | Decreasing trend |
| Compliance Score | Risk management | 100% mandatory items |
| Team Retention | Talent strategy | >85% annual retention |
Present these quarterly:
- Portfolio Status - Projects, stages, blockers
- Risk Register - Top 5 AI risks and mitigations
- Financial - Spend vs budget, ROI by project
- Compliance - Regulatory status, audit findings
- Competitive - How we compare to industry
- AI Governance — Own the framework, delegate implementation
- AI Strategy & Transformation — Your primary section
- Security & Compliance — Ensure coverage, don't implement
- Cost Management & FinOps — Budget accountability
- Technical implementation → VP of AI / Engineering leads
- Day-to-day operations → Platform team
- Compliance details → Legal / Compliance team
- Vendor negotiations → Procurement (with your input)
⬆️ Back to Personas · Next: VP of AI ➡️
Your Reality: Translating strategy into execution, managing ML teams, delivering AI products, balancing research vs production, hiring and retaining talent.
Your Risk Profile: Accountable for AI delivery. Must ship while maintaining quality. Team success = your success.
flowchart LR
subgraph VPAI["🎯 VP OF AI OPERATIONAL FRAMEWORK"]
direction LR
subgraph Strategy["📋 STRATEGY"]
S1["Roadmap"]
S2["Prioritization"]
S3["Resource<br/>Allocation"]
end
subgraph Delivery["🚀 DELIVERY"]
D1["Project<br/>Management"]
D2["Quality<br/>Gates"]
D3["Release<br/>Process"]
end
subgraph Team["👥 TEAM"]
T1["Hiring"]
T2["Development"]
T3["Culture"]
end
subgraph Excellence["⭐ EXCELLENCE"]
E1["Best<br/>Practices"]
E2["Tooling"]
E3["Metrics"]
end
Strategy --> Delivery --> Team --> Excellence
end
style VPAI fill:transparent,stroke:#7c3aed,stroke-width:2px
style Strategy fill:#dbeafe,stroke:#3b82f6
style Delivery fill:#dcfce7,stroke:#22c55e
style Team fill:#fef3c7,stroke:#f59e0b
style Excellence fill:#fae8ff,stroke:#a855f7
| Priority | Action | Outcome |
|---|---|---|
| 🔴 Week 1-2 | Assess current team capabilities | Skills matrix, gap analysis |
| 🔴 Week 2-3 | Establish project intake process | Clear prioritization criteria |
| 🟠 Week 3-4 | Define quality gates | Stage-gate process adopted |
| 🟠 Month 2 | Set up MLOps foundations | CI/CD, monitoring, versioning |
- Project portfolio dashboard created
- Sprint/iteration cadence established
- Code review and ML review process defined
- Experiment tracking system implemented
- Model registry and versioning in place
- Evaluation framework standardized
- Self-service ML platform capabilities
- Reusable components library
- Cross-team knowledge sharing (ML guild)
- Continuous improvement retrospectives
- Career ladders and growth paths defined
- On-call rotation and incident management
flowchart TB
subgraph Structures["TEAM STRUCTURE OPTIONS"]
subgraph Central["🏢 CENTRALIZED"]
C1["All ML in one team"]
C2["Pros: Standards, efficiency"]
C3["Cons: Bottleneck, distant from product"]
end
subgraph Embedded["🔀 EMBEDDED"]
E1["ML in each product team"]
E2["Pros: Close to product"]
E3["Cons: Inconsistent, silos"]
end
subgraph Hybrid["⚖️ HYBRID (Recommended)"]
H1["Platform + Embedded"]
H2["Pros: Best of both"]
H3["Cons: Coordination overhead"]
end
end
style Central fill:#fecaca,stroke:#dc2626
style Embedded fill:#fef3c7,stroke:#f59e0b
style Hybrid fill:#dcfce7,stroke:#22c55e
| Day | Focus | Activities |
|---|---|---|
| Monday | Planning | Project status, blocker resolution, priority alignment |
| Tuesday | Technical | Architecture reviews, technical debt discussions |
| Wednesday | People | 1:1s, hiring interviews, career conversations |
| Thursday | Delivery | Demo reviews, quality gate checks, release planning |
| Friday | Strategy | Roadmap refinement, stakeholder alignment, learning |
| Category | Metric | Target |
|---|---|---|
| Delivery | Projects on schedule | >80% |
| Quality | Models meeting accuracy targets | >90% |
| Velocity | Time from idea to production | <60 days |
| Reliability | Model uptime | >99.5% |
| Efficiency | Model retraining frequency | As needed, <monthly |
| Team | Engineer satisfaction (eNPS) | >40 |
| Cost | Cost per prediction | Decreasing |
| Anti-Pattern | Symptoms | Solution |
|---|---|---|
| Research Trap | Always experimenting, never shipping | Time-box research, define "good enough" |
| Hero Culture | 1-2 people know everything | Documentation, pair programming, rotation |
| Technical Debt Spiral | Shipping fast, breaking often | Dedicated debt sprints, quality gates |
| Evaluation Theater | Good offline metrics, bad production | Real-world validation, shadow deployments |
| Scope Creep | Projects never finish | Clear success criteria, MVP mindset |
| Role | When to Hire | Key Skills |
|---|---|---|
| ML Engineer | First hire after you | Production systems, software engineering |
| Data Scientist | When you have data | Statistics, experimentation, modeling |
| MLOps Engineer | At scale | Infrastructure, automation, monitoring |
| Research Scientist | Competitive advantage needed | Novel methods, publications not required |
| ML Manager | Team > 6 people | Leadership, project management, technical |
- LLM Evaluation & Testing — Quality is your responsibility
- Operations & Maintenance — Delivery excellence
- Monitoring & Observability — See problems early
- Agentic AI & Multi-Agent Systems — Architecture patterns
- Technical Debt & System Integrity — Keep systems healthy
| Stakeholder | They Care About | Give Them |
|---|---|---|
| CTO | Risk, budget, strategy | Monthly exec summary, risk register |
| Product | Features, timelines | Roadmap alignment, trade-off discussions |
| Engineering | Integration, reliability | API contracts, SLAs, documentation |
| Data | Quality, access | Data requirements, feedback loops |
| Business | ROI, capabilities | Business impact metrics, demos |
⬆️ Back to Personas · ⬅️ CTO · Next: Startup ➡️
Your Reality: Limited resources, need to ship fast, can't afford disasters, investors watching.
Your Risk Profile: High speed, medium-high risk tolerance, but one bad incident could kill the company.
Focus on items that prevent company-killing incidents:
| Priority | Items | Why |
|---|---|---|
| 🔴 Day 1 | Authentication, Rate Limiting, Cost Limits | Prevent abuse and bankruptcy |
| 🔴 Day 2-3 | Prompt Injection Protection, Output Filtering | Prevent PR disasters |
| 🟠 Day 4-5 | Basic Monitoring, Error Handling, Logging | Know when things break |
| 🟠 Week 2 | Golden Test Set, Rollback Procedure, Kill Switch | Catch issues, recover fast |
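As a rough illustration of the Day 1 items above, the sketch below shows per-user rate limiting plus a hard daily cost cap in plain Python. It is in-memory and single-process only; a real deployment would typically enforce this at an API gateway or back it with Redis, and the limits shown are arbitrary placeholders.

```python
import time
from collections import defaultdict, deque

MAX_REQUESTS_PER_MINUTE = 20       # illustrative limits only
DAILY_COST_LIMIT_USD = 50.0

_request_log = defaultdict(deque)  # user_id -> timestamps of recent requests
_spend_today_usd = 0.0             # reset by a daily scheduled job (not shown)

def check_rate_limit(user_id: str) -> None:
    now = time.time()
    window = _request_log[user_id]
    while window and now - window[0] > 60:
        window.popleft()                            # drop requests older than 60 s
    if len(window) >= MAX_REQUESTS_PER_MINUTE:
        raise RuntimeError("Rate limit exceeded; retry in a minute.")
    window.append(now)

def record_cost(call_cost_usd: float) -> None:
    global _spend_today_usd
    _spend_today_usd += call_cost_usd
    if _spend_today_usd >= DAILY_COST_LIMIT_USD:
        raise RuntimeError("Daily LLM budget exhausted; service paused.")
```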
As you get users, add:
- User feedback collection
- A/B testing framework
- Hallucination tracking
- Basic bias testing
- Privacy policy & ToS
Before Series A or major growth:
- SOC 2 Type I preparation
- GDPR compliance (if EU users)
- Comprehensive monitoring
- Incident response runbook
- On-call rotation
- Security & Compliance (auth, rate limiting)
- Safety & Ethics (output filtering)
- Cost Management (prevent bill shock)
- Monitoring (basic observability)
- Assured Intelligence (add after product-market fit)
- Full Governance (add when preparing for enterprise sales)
- Scale & Parallelism (premature optimization)
⬆️ Back to Personas · Next: Enterprise ➡️
Your Reality: Complex stakeholder landscape, existing systems to integrate, compliance requirements, long procurement cycles.
Your Risk Profile: Low risk tolerance, high scrutiny, failures are career-limiting.
Get organizational buy-in with proper governance:
| Priority | Items | Why |
|---|---|---|
| 🔴 Week 1-2 | AI Vision, Use Case Prioritization, Cross-functional Team | Align stakeholders |
| 🔴 Week 2-3 | EU AI Act Mapping, Risk Classification, Legal Review | Regulatory compliance |
| 🟠 Week 3-4 | Security Architecture, Zero-Trust Design, RBAC | Enterprise security |
| 🟠 Month 2 | Data Governance, Lineage, Contracts | Data foundation |
- Shadow mode deployment
- A/B testing with internal users
- Full audit trail implementation
- Integration with existing SIEM/monitoring
- Vendor risk assessment (if using third-party LLMs)
- Blue-green deployment capability
- Multi-region failover
- SOC 2 Type II audit
- Full incident response procedures
- Executive dashboards
- FinOps optimization
- Model registry and versioning
- Automated retraining pipelines
- Advanced monitoring (drift, bias)
- AI Governance — Start here
- Security & Compliance
- Architecture & Design
- Monitoring & Observability
- Technical Debt & System Integrity
- Procurement: Add LLM vendor to approved vendor list
- Legal: AI-specific terms in vendor contracts
- HR: AI usage policies for employees
- Finance: FinOps integration with existing cost centers
⬆️ Back to Personas · ⬅️ Startup · Next: Solo Dev ➡️
Your Reality: Learning, limited time, no budget, acceptable if it breaks.
Your Risk Profile: High risk tolerance for yourself, but still need basics.
| # | Item | Time | Why |
|---|---|---|---|
| 1 | API key in environment variables (not code) | 5 min | Basic security |
| 2 | Rate limiting (even basic) | 30 min | Prevent abuse |
| 3 | Cost alerts on your LLM provider | 10 min | Avoid surprise bills |
| 4 | Basic input validation | 1 hour | Prevent injection |
| 5 | Error handling with user-friendly messages | 1 hour | Better UX |
| 6 | Simple logging (console or file) | 30 min | Debug issues |
| 7 | README with setup instructions | 30 min | Future you will thank you |
| 8 | Git repository with .gitignore (no secrets!) | 15 min | Version control basics |
Total time: ~4 hours for a solid foundation
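For reference, a hedged sketch of items 1, 4, 5, and 6 from the table above. The `call_llm()` function is a placeholder for whichever provider SDK you use, and `LLM_API_KEY` is an assumed environment variable name:

```python
import logging
import os

logging.basicConfig(filename="app.log", level=logging.INFO)   # item 6: simple logging

API_KEY = os.environ.get("LLM_API_KEY")   # item 1: key from the environment, never hard-coded
if not API_KEY:
    raise SystemExit("Set LLM_API_KEY before starting the app.")

def call_llm(prompt: str, api_key: str) -> str:
    raise NotImplementedError("Replace with your provider's SDK call.")

def answer(prompt: str) -> str:
    if not prompt or len(prompt) > 4_000:       # item 4: basic input validation
        return "Sorry, that request is empty or too long."
    try:
        return call_llm(prompt, api_key=API_KEY)
    except Exception:                            # item 5: fail with a friendly message
        logging.exception("LLM call failed")
        return "Something went wrong on our side. Please try again."
```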
Upgrade to Startup Path when:
- You have real users (not just friends)
- Processing any PII or sensitive data
- Charging money for the service
- Storing conversation history
- Free monitoring: Sentry free tier, simple uptime checks
- Free LLM: Ollama locally, or free tiers of commercial APIs
- Free hosting: Vercel, Railway, Fly.io free tiers
- Cost control: Set hard spending limits on all API providers
⬆️ Back to Personas · ⬅️ Enterprise · Next: Healthcare ➡️
Your Reality: Lives at stake, heavy regulation, long validation cycles, clinical workflows.
Your Risk Profile: ZERO tolerance for safety failures. One death can end the company.
⚠️ Critical: Healthcare AI has unique requirements. The Healthcare & Mental Health AI section is MANDATORY, not optional.
flowchart TD
subgraph Regulatory["⚠️ BEFORE WRITING ANY CODE"]
Q1{"1. Is this a<br/>Medical Device?"}
Q1 -->|YES| FDA["📋 FDA Pathway<br/>510(k) / De Novo / PMA"]
Q1 -->|NO| Q5
FDA --> Q2{"2. Targeting<br/>EU Market?"}
Q2 -->|YES| CE["🇪🇺 CE Marking<br/>MDR/IVDR Compliance"]
Q2 -->|NO| Q3
CE --> Q3{"3. Mental Health<br/>Application?"}
Q3 -->|YES| CRISIS["🚨 Crisis Detection<br/>100% Recall Required"]
Q3 -->|NO| Q4
CRISIS --> Q4{"4. Processing<br/>Patient Data?"}
Q4 -->|YES| HIPAA["🔒 HIPAA/HITECH<br/>Compliance Required"]
Q4 -->|NO| Q5
HIPAA --> Q5["✅ Proceed with<br/>Development"]
end
style Regulatory fill:#fef2f2,stroke:#dc2626,stroke-width:2px
style Q1 fill:#fef3c7,stroke:#f59e0b,color:#78350f
style Q2 fill:#fef3c7,stroke:#f59e0b,color:#78350f
style Q3 fill:#fef3c7,stroke:#f59e0b,color:#78350f
style Q4 fill:#fef3c7,stroke:#f59e0b,color:#78350f
style FDA fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
style CE fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
style CRISIS fill:#fecaca,stroke:#dc2626,color:#7f1d1d
style HIPAA fill:#fecaca,stroke:#dc2626,color:#7f1d1d
style Q5 fill:#dcfce7,stroke:#22c55e,color:#14532d
| Priority | Items | Why |
|---|---|---|
| 🔴 Week 1 | FDA SaMD Classification, Regulatory Strategy | Determines everything else |
| 🔴 Week 2-4 | IEC 62304 Software Lifecycle, ISO 13485 QMS | Required for FDA |
| 🔴 Month 2 | Safety-Critical Architecture (IEC 61508) | Formal safety invariants |
| 🔴 Month 2-3 | Crisis Detection System (if mental health) | 100% recall, <1s response |
- IRB approval for clinical studies
- Independent third-party validation
- Geographic validation (all target regions)
- Demographic validation (all patient groups)
- Clinician workflow integration testing
- Clinical evidence package
- Risk management file (ISO 14971)
- Software documentation package
- Cybersecurity documentation
- Human factors validation
- Adverse event reporting system
- Post-market surveillance
- Continuous clinical monitoring
- Model performance tracking
- Regulatory update monitoring
- Healthcare & Mental Health AI Safety — START HERE
- Assured Intelligence — Uncertainty quantification
- AI Governance — Regulatory compliance
- Safety & Ethics — Output safety
- Security & Compliance — HIPAA compliance
| Metric | Target | Why |
|---|---|---|
| Crisis detection recall | 100% | Zero false negatives for safety |
| Crisis response latency | <1 second | Immediate intervention |
| False positive rate | <5% | Minimize alert fatigue |
| Clinician override availability | Always | Humans must be able to intervene |
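One way to operationalize the 100% recall target is to choose the decision threshold from validation data so that no known crisis case falls below it, as in the sketch below. This only guarantees zero false negatives on the validation set, not in the field, which is exactly why the clinician override and post-market monitoring items remain mandatory.

```python
import numpy as np

def zero_fn_threshold(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Lowest score assigned to any true crisis case (label 1) in validation data.

    Flagging everything at or above this threshold yields zero false negatives
    on this data; field recall still has to be monitored continuously.
    """
    crisis_scores = y_score[y_true == 1]
    return float(crisis_scores.min())

def false_positive_rate(y_true: np.ndarray, y_score: np.ndarray, threshold: float) -> float:
    flagged = y_score >= threshold
    return float(flagged[y_true == 0].mean())   # the alert-fatigue metric from the table
```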
⬆️ Back to Personas · ⬅️ Solo Dev · Next: FinServ ➡️
Your Reality: Regulated industry, fraud concerns, audit requirements, model explainability mandates.
Your Risk Profile: Low tolerance, regulators watching, fiduciary duty.
- US: OCC, Fed, CFPB guidance on AI/ML in banking
- EU: EBA guidelines on ICT risk, DORA, AI Act
- Global: Basel Committee principles for AI
- Fair Lending: ECOA, Fair Housing Act (explainability required)
| Priority | Items | Why |
|---|---|---|
| 🔴 Week 1-2 | Model Risk Management (SR 11-7) | Federal Reserve requirement |
| 🔴 Week 2-3 | Fair Lending Analysis, Disparate Impact Testing | Avoid discrimination claims |
| 🔴 Week 3-4 | Explainability Requirements, Adverse Action Notices | Regulatory mandate |
| 🟠 Month 2 | Audit Trail, Model Lineage, Version Control | Examination readiness |
- Model inventory and tiering
- Independent model validation (second line)
- Model performance monitoring
- Champion/challenger framework
- Model documentation standards
- Real-time fraud detection integration
- Transaction monitoring
- Suspicious activity reporting
- Customer complaint tracking
- Regulatory reporting automation
- AI Governance — Model risk management
- Metric Alignment & Evaluation — Avoid Goodhart's Law
- Assured Intelligence — Calibration, uncertainty
- Anti-Patterns: Case Studies — Learn from Zillow
- Technical Debt & System Integrity — CACE principle
- Explainability: Every decision must be explainable to regulators and customers
- Audit: Complete audit trail for all model decisions
- Fairness: Regular disparate impact analysis across protected classes
- Stress Testing: Model performance under adverse economic conditions
⬆️ Back to Personas · ⬅️ Healthcare · Next: DS→MLE ➡️
Your Reality: Strong in modeling, learning production skills, bridging the gap.
Your Risk Profile: Learning curve, need to understand ops and infrastructure.
flowchart LR
subgraph DS["🔬 DATA SCIENTIST<br/>Skills"]
DS1["📓 Jupyter<br/>Notebooks"]
DS2["🧪 Local<br/>Experiments"]
DS3["🎯 Model<br/>Accuracy"]
DS4["📦 Batch<br/>Processing"]
DS5["🐍 Python<br/>Scripts"]
end
subgraph GAP["🌉 BRIDGE THE GAP"]
G1["Version<br/>Control"]
G2["Reproducibility"]
G3["System<br/>Reliability"]
G4["Real-time<br/>Serving"]
G5["Production<br/>Code"]
end
subgraph MLE["⚙️ ML ENGINEER<br/>Skills"]
MLE1["📊 Git<br/>MLflow"]
MLE2["🐳 Docker<br/>CI/CD"]
MLE3["📈 Monitoring<br/>Alerting"]
MLE4["🚀 APIs<br/>Streaming"]
MLE5["✅ Testing<br/>Error Handling"]
end
DS1 --> G1 --> MLE1
DS2 --> G2 --> MLE2
DS3 --> G3 --> MLE3
DS4 --> G4 --> MLE4
DS5 --> G5 --> MLE5
style DS fill:#fae8ff,stroke:#a855f7,stroke-width:2px
style GAP fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style MLE fill:#dcfce7,stroke:#22c55e,stroke-width:2px
style DS1 fill:#ffffff,stroke:#a855f7
style DS2 fill:#ffffff,stroke:#a855f7
style DS3 fill:#ffffff,stroke:#a855f7
style DS4 fill:#ffffff,stroke:#a855f7
style DS5 fill:#ffffff,stroke:#a855f7
style G1 fill:#ffffff,stroke:#f59e0b
style G2 fill:#ffffff,stroke:#f59e0b
style G3 fill:#ffffff,stroke:#f59e0b
style G4 fill:#ffffff,stroke:#f59e0b
style G5 fill:#ffffff,stroke:#f59e0b
style MLE1 fill:#ffffff,stroke:#22c55e
style MLE2 fill:#ffffff,stroke:#22c55e
style MLE3 fill:#ffffff,stroke:#22c55e
style MLE4 fill:#ffffff,stroke:#22c55e
style MLE5 fill:#ffffff,stroke:#22c55e
| Priority | Items | Why |
|---|---|---|
| 🔴 Week 1 | Version Control (prompts, models, data) | Reproducibility |
| 🔴 Week 2 | CI/CD Basics, Automated Testing | Quality gates |
| 🟠 Week 3 | Containerization (Docker), Environment Management | Consistency |
| 🟠 Week 4 | API Design, Error Handling | Production serving |
- Monitoring dashboards (Grafana, DataDog)
- Alerting and on-call basics
- Log aggregation and analysis
- Performance profiling
- Cost tracking per experiment
- Feature stores
- Model registry
- A/B testing framework
- Drift detection
- Automated retraining triggers
- Operations & Maintenance — Deployment basics
- Monitoring & Observability — See what's happening
- Data Quality & Statistical Validity — Training-serving skew
- LLM Evaluation & Testing — Production evaluation
- Technical Debt & System Integrity — Avoid ML-specific debt
- Book: "Designing Machine Learning Systems" by Chip Huyen
- Course: "Made With ML" (free, production-focused)
- Practice: Take a notebook project and deploy it end-to-end
⬆️ Back to Personas · ⬅️ FinServ · Next: Platform ➡️
Your Reality: Supporting multiple ML teams, standardization, self-service, scale.
Your Risk Profile: Reliability is your product. Downtime affects everyone.
Build the internal platform that makes ML teams successful.
| Priority | Items | Why |
|---|---|---|
| 🔴 Week 1-2 | Kubernetes + GPU Operators | Compute foundation |
| 🔴 Week 2-3 | Model Serving Infrastructure (vLLM, Triton) | Inference platform |
| 🟠 Week 3-4 | Secrets Management, KMS | Security foundation |
| 🟠 Month 2 | Observability Stack (metrics, logs, traces) | Platform monitoring |
- Model registry (MLflow, Weights & Biases)
- Feature store (Feast, Tecton)
- Experiment tracking
- CI/CD pipelines for ML
- A/B testing infrastructure
- Developer portal / documentation
- Cost allocation and showback
- Quota management
- Audit logging
- Policy-as-code guardrails
- Architecture & Design — Infrastructure patterns
- Performance & Scale — Latency, throughput
- Cost Management & FinOps — Platform economics
- Operations & Maintenance — Reliability
- Monitoring & Observability — Platform health
| Metric | Target | Why |
|---|---|---|
| Model deployment time | <1 hour | Self-service goal |
| Platform availability | 99.9% | Reliability target |
| Cost per inference | Track & optimize | FinOps |
| Time to first experiment | <1 day | Developer experience |
⬆️ Back to Personas · ⬅️ DS→MLE · Next: Compliance ➡️
Your Reality: Protect the organization, manage liability, ensure regulatory compliance.
Your Risk Profile: Your job is to identify and mitigate risks others miss.
- Data provenance and licensing verified
- Training data consent/rights confirmed
- Output ownership/IP determined
- Liability allocation documented
- Insurance coverage reviewed
- EU AI Act risk classification completed
- Prohibited use cases verified (social scoring, etc.)
- High-risk requirements mapped (if applicable)
- GDPR/privacy impact assessment done
- Industry-specific regulations addressed
- AI-specific terms in vendor contracts
- Indemnification clauses reviewed
- SLA requirements defined
- Audit rights preserved
- Data processing agreements updated
- AI ethics policy published
- Incident response procedure documented
- Escalation paths defined
- Board/executive reporting established
- External audit schedule set
- AI Governance — Regulatory frameworks
- Security & Compliance — Data protection
- Safety & Ethics — Responsible AI
- Anti-Patterns: Case Studies — Learn from failures
- Healthcare & Mental Health AI — If applicable
- How do we know the model isn't discriminating?
- What happens when the model is wrong?
- Can we explain decisions to regulators/customers?
- How quickly can we disable the AI if needed?
- What does our audit trail look like?
⬆️ Back to Personas · ⬅️ Platform · Next: Agency ➡️
Your Reality: Building for clients, varied requirements, handoff considerations, repeatable processes.
Your Risk Profile: Client's risk becomes your risk. Reputation is everything.
Before starting any AI project, clarify:
| Question | Why It Matters |
|---|---|
| Who owns the trained model? | IP and liability |
| What data can we use for training? | Legal rights |
| What are the regulatory requirements? | Compliance scope |
| Who operates it post-handoff? | Documentation needs |
| What's the budget for ongoing costs? | FinOps planning |
- Requirements documentation
- Risk assessment
- Architecture design
- Cost estimation
- Timeline and milestones
- Environment setup (reproducible)
- Core functionality
- Testing suite
- Documentation (client-facing)
- Security review
- Operations runbook
- Monitoring dashboards
- Training sessions
- Support transition plan
- Sign-off documentation
- Architecture & Design — Reusable patterns
- Operations & Maintenance — Handoff docs
- Team & Process — Documentation standards
- Cost Management & FinOps — Client cost clarity
- Template everything: Reusable monitoring, CI/CD, documentation
- Document decisions: Client sign-off on architecture choices
- Clear handoff: Runbooks, training, support transition
- Cost transparency: Show clients ongoing operational costs
⬆️ Back to Personas · ⬅️ Compliance · Next: Government ➡️
Your Reality: Public accountability, transparency requirements, procurement rules, citizen impact.
Your Risk Profile: Public trust is paramount. Failures make headlines.
- Algorithmic impact assessment published
- Public documentation of AI use cases
- Citizen appeal/challenge mechanism
- Regular public reporting on AI performance
- Freedom of Information considerations
- FedRAMP authorization (US federal)
- StateRAMP (US state/local)
- Vendor AI ethics assessment
- Source code escrow
- Data sovereignty requirements
- Accessibility compliance (508/WCAG)
- Language access (LEP populations)
- Digital divide considerations
- Disparate impact analysis
- Community input process
- AI Governance — Public sector accountability
- Safety & Ethics — Equity and fairness
- Metric Alignment & Evaluation — Avoid gaming
- Security & Compliance — FedRAMP, FISMA
- Assured Intelligence — Explainability
| Metric | Requirement | Why |
|---|---|---|
| Explainability | High | Public accountability |
| Bias audits | Regular, public | Equity requirements |
| Uptime | High | Public service reliability |
| Data retention | Per records laws | Legal requirements |
⬆️ Back to Personas · ⬅️ Agency · Next: Flowchart ➡️
flowchart TD
subgraph Decision["🗺️ WHERE DO I START?"]
START["🚀 START HERE<br/>Do you have users?"]
START -->|NO| BUILDING["🔨 Still Building"]
START -->|YES| DEPLOYED["✅ Already Deployed"]
BUILDING --> SENSITIVE{"Handling sensitive data?<br/>(PII, health, financial)"}
DEPLOYED --> MONITORING{"Do you have<br/>monitoring & alerting?"}
SENSITIVE -->|YES| PATH_SECURE["🔐 START WITH:<br/>━━━━━━━━━━━━<br/>• Security<br/>• Privacy<br/>• Compliance<br/>• Then Essential 20"]
SENSITIVE -->|NO| PATH_ESSENTIAL["📋 START WITH:<br/>━━━━━━━━━━━━<br/>• Essential 20 items<br/>• Your persona path"]
MONITORING -->|NO| PATH_URGENT["🚨 STOP! ADD NOW:<br/>━━━━━━━━━━━━<br/>• Monitoring<br/>• Alerting<br/>• Logging<br/>• Rollback"]
MONITORING -->|YES| PATH_OPTIMIZE["📈 CHECK:<br/>━━━━━━━━━━━━<br/>• Cost management<br/>• Evaluation<br/>• Governance<br/>• Scale readiness"]
end
style START fill:#3b82f6,stroke:#1e40af,color:#ffffff,stroke-width:3px
style BUILDING fill:#f59e0b,stroke:#d97706,color:#ffffff
style DEPLOYED fill:#22c55e,stroke:#16a34a,color:#ffffff
style SENSITIVE fill:#fef3c7,stroke:#f59e0b,color:#78350f
style MONITORING fill:#fef3c7,stroke:#f59e0b,color:#78350f
style PATH_SECURE fill:#fecaca,stroke:#dc2626,color:#7f1d1d
style PATH_ESSENTIAL fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
style PATH_URGENT fill:#dc2626,stroke:#991b1b,color:#ffffff
style PATH_OPTIMIZE fill:#dcfce7,stroke:#22c55e,color:#14532d
style Decision fill:transparent,stroke:#64748b,stroke-width:2px
| Your Situation | Start With | Then Add |
|---|---|---|
| Side project, no users yet | Essential 20 | Nothing until you have users |
| Startup, pre-launch | Essential 20 → Startup Path | Security, basic monitoring |
| Startup, have users | Startup Path | Evaluation, cost management |
| Enterprise, new project | Enterprise Path | Full governance from start |
| Healthcare/Medical | Healthcare Path | Everything in Healthcare section is mandatory |
| Financial services | FinServ Path | Explainability, audit trails |
| Production with issues | Monitoring | Whatever is causing the issues |
| Scaling problems | Performance & Scale | Cost management |
| Compliance audit coming | AI Governance | Security, documentation |
⬆️ Back to Top · ⬅️ Personas · Next: FAQ ➡️
Do I need to complete ALL 400+ items?
No. The checklist is comprehensive by design—it covers everything from startups to enterprise healthcare AI.
- Minimum viable: Complete the Essential 20 items
- Production ready: Complete items relevant to your persona path
- Enterprise grade: Complete 80%+ of all applicable items
Many items are marked "Configurable" meaning they depend on your context.
What's the minimum for a POC/prototype?
For a POC that only YOU will use:
- API keys in environment variables (not code)
- Basic error handling
- Cost limits set on your LLM provider
For a POC that OTHERS will see:
- Add: Authentication, rate limiting, basic input validation
- Add: Clear "this is a prototype" disclaimers
For a POC with REAL DATA:
- Add: Everything in the Essential 20
How long does it take to become production-ready?
It depends on your starting point and target:
| Starting Point | Target | Typical Effort |
|---|---|---|
| Jupyter notebook | Internal tool | 2-4 weeks |
| Working prototype | Startup MVP | 4-8 weeks |
| MVP | Production | 2-3 months |
| Production | Enterprise-grade | 3-6 months |
Healthcare/Financial add 2-6 months for compliance.
What if I'm a small team (1-3 people)?
Focus on high-impact, low-effort items:
- Automate security basics: Auth, rate limiting, input validation
- Use managed services: Don't build monitoring from scratch
- Start with Essential 20: This covers 80% of critical risks
- Skip scale sections: Until you actually need to scale
- Use templates: Don't write runbooks from scratch
See Solo Developer Path or Startup Path.
What items cause the most production incidents?
Based on industry data and case studies:
- Missing rate limiting → Cost explosions, abuse
- No monitoring → Hours/days to detect issues
- No rollback procedure → Extended outages
- Prompt injection vulnerability → Data leakage, jailbreaks
- Training-serving skew → Silent model degradation
- Missing cost limits → $10K+ surprise bills
- No golden test set → Regressions reach users
- Hallucination without detection → User trust erosion
Which items can I defer until later?
Safe to defer (until you need them):
| Item | When to Add |
|---|---|
| Multi-region failover | When you have users in multiple regions |
| Model parallelism | When single-GPU isn't enough |
| A/B testing framework | When you're optimizing, not building |
| Advanced FinOps | When costs exceed $10K/month |
| Formal verification | When in safety-critical domains |
| Full governance framework | When preparing for enterprise or compliance |
Never defer: Security, basic monitoring, cost limits, rollback capability
What's different about LLM/GenAI vs traditional ML?
Key differences this checklist addresses:
| Traditional ML | LLM/GenAI | Checklist Section |
|---|---|---|
| Feature engineering | Prompt engineering | Prompt Engineering |
| Model accuracy | Hallucination rate | LLM Evaluation |
| Batch inference | Real-time, streaming | Performance |
| Model drift | Prompt injection | Red Teaming |
| Fixed costs | Token-based costs | Cost Management |
| Input validation | Output safety | Safety & Ethics |
How do I convince my manager/team to use this checklist?
Show them the cost of NOT using it:
| Company | What Went Wrong | Cost |
|---|---|---|
| Zillow | Model overconfidence, no uncertainty quantification | $500M+ loss, 25% layoffs |
| IBM Watson | No clinical validation, unsafe recommendations | Killed the healthcare division |
| Character.AI | No crisis detection, inadequate safety | Teen suicide, lawsuits |
| Babylon Health | Overpromised, underdelivered on safety | $4.2B → $0 |
Then show them the Essential 20 takes ~2 weeks and prevents most disasters.
How often should I review the checklist?
- Before major releases: Full relevant sections
- Monthly: Monitoring and alerting effectiveness
- Quarterly: Security and compliance sections
- Annually: Full checklist review
- After incidents: Relevant sections that could have prevented it
- When regulations change: Governance sections
Is this checklist specific to any cloud provider or framework?
No. The checklist is cloud-agnostic and framework-agnostic. It works with:
- Cloud: AWS, Azure, GCP, or on-premise
- LLM Providers: OpenAI, Anthropic, Google, open-source models
- Frameworks: LangChain, LlamaIndex, custom implementations
- MLOps: MLflow, Weights & Biases, Kubeflow, custom solutions
The companion Technology Selection Guide provides specific tool recommendations.
⬆️ Back to Top · ⬅️ Flowchart · Next: Lifecycle Stages ➡️
Why stage-based workflow matters: Only 54% of AI projects transition from pilot to production (Gartner), and only 11% of companies unlock significant AI value (BCG). A structured stage-gate approach dramatically improves success rates by ensuring the right work happens at the right time.
flowchart LR
subgraph Planning["📋 PLANNING"]
S1[💡 Ideation]
S2[🔍 Discovery]
end
subgraph Development["🔨 DEVELOPMENT"]
S3[🧪 POC]
S4[🔧 MVP]
S5[👥 Pilot]
end
subgraph Operations["⚙️ OPERATIONS"]
S6[🚀 Production]
S7[📈 Scale]
S8[⚡ Optimize]
end
S1 -->|Business Approved| S2
S2 -->|Feasible| S3
S3 -->|Viable| S4
S4 -->|Usable| S5
S5 -->|Safe & Effective| S6
S6 -->|Stable| S7
S7 -->|SLAs Met| S8
S8 -.->|Continuous Improvement| S1
style S1 fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
style S2 fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
style S3 fill:#fae8ff,stroke:#a855f7,color:#581c87
style S4 fill:#fae8ff,stroke:#a855f7,color:#581c87
style S5 fill:#fef3c7,stroke:#f59e0b,color:#78350f
style S6 fill:#dcfce7,stroke:#22c55e,color:#14532d
style S7 fill:#dcfce7,stroke:#22c55e,color:#14532d
style S8 fill:#dcfce7,stroke:#22c55e,color:#14532d
style Planning fill:transparent,stroke:#3b82f6,stroke-width:2px,color:#1e3a5f
style Development fill:transparent,stroke:#a855f7,stroke-width:2px,color:#581c87
style Operations fill:transparent,stroke:#22c55e,stroke-width:2px,color:#14532d
📋 Detailed Stage Breakdown — Click to expand
| Stage | Key Activities | Exit Gate |
|---|---|---|
| 1. Ideation | Business case, use case ID, success metrics, stakeholder buy-in | Business Approval |
| 2. Discovery | Data assessment, feasibility, risk assessment, resource plan | Technical Feasible? |
| 3. POC | Technical feasibility, core algorithm, initial results | Viable? |
| 4. MVP | Working prototype, basic UI, integration | Usable? |
| 5. Pilot | Limited users, real-world test, feedback loops, safety validation | Safe & Effective? |
| 6. Production | Full deployment, MLOps pipeline, monitoring, governance | Production Ready? |
| 7. Scale | Multi-region, performance, cost optimize, team scaling | Scalable? |
| 8. Optimize | Continuous improvement, retraining, innovation | ROI Met? |
📊 Industry Standard Comparison: CRISP-DM Mapping — Click to expand
Note: CRISP-DM (Cross-Industry Standard Process for Data Mining) is the de facto industry standard for data science and ML projects, consistently ranking #1 in KDnuggets polls over 12+ years. Our 8-stage model extends CRISP-DM to address modern AI/MLOps requirements.
| CRISP-DM Phase | Our Stage(s) | What We Add |
|---|---|---|
| 1. Business Understanding | 1. Ideation | Explicit stakeholder buy-in, success metrics |
| 2. Data Understanding | 2. Discovery | Risk assessment, resource planning |
| 3. Data Preparation | 2. Discovery + 3. POC | Integrated into discovery and POC phases |
| 4. Modeling | 3. POC + 4. MVP | Split into feasibility (POC) and prototype (MVP) |
| 5. Evaluation | 4. MVP + 5. Pilot | Extended with real-world pilot validation |
| 6. Deployment | 6. Production | Same focus on deployment |
| (not covered) | 7. Scale | NEW: Multi-region, performance optimization |
| (not covered) | 8. Optimize | NEW: Continuous improvement, retraining |
CRISP-DM was published in 1999 and, while still valuable, has known limitations for modern AI systems:
| CRISP-DM Limitation | How Our Model Addresses It |
|---|---|
| No MLOps/continuous training coverage | Stages 7-8 cover scaling and optimization |
| Designed for small teams | Gate system supports enterprise coordination |
| No pilot/validation phase | Stage 5 (Pilot) for real-world testing |
| Deployment is "done" | Stage 8 treats deployment as ongoing |
| Not AI-specific (Cognilytica) | Includes agentic AI, LLM, and safety considerations |
| Framework | Stages | Best For | Reference |
|---|---|---|---|
| CRISP-DM | 6 phases | Traditional ML/analytics | Wikipedia |
| Microsoft TDSP | 5 stages | Azure-based projects | Microsoft Docs |
| Google MLOps | 3 maturity levels | Automation-focused | Google Cloud |
| CPMAI | CRISP-DM + Agile | AI-specific projects | Cognilytica |
Best Practice: "Data science teams that combine a loose implementation of CRISP-DM with overarching team-based agile project management approaches will likely see the best results." — Data Science PM
Gates are classified into three categories based on risk:
| Type | Symbol | When Required | Rationale |
|---|---|---|---|
| Mandatory | 🔴 | Always | Legal, safety, or existential risk—cannot proceed without |
| Advisory | 🟡 | Strongly recommended | Significantly improves success probability |
| Configurable | 🟢 | Organization decides | Depends on industry, user base, risk tolerance |
📋 Gate Details by Type — Click to expand
| Gate | Items | Why Mandatory |
|---|---|---|
| Any → Next | Security vulnerabilities addressed | Legal liability, data breaches |
| Pilot → Production | Safety validation complete | User safety, especially Healthcare AI |
| Pilot → Production | Crisis detection tested (Healthcare) | Potential for fatal harm if missed |
| Any Stage | Data privacy compliance (GDPR/HIPAA) | Fines up to 4% of revenue |
| Production → Scale | Monitoring operational | Can't fix what you can't see |
| Gate | Items | Why Advisory |
|---|---|---|
| Discovery → POC | Risk assessment documented | Reduces surprises, but POC can surface unknowns |
| POC → MVP | Model accuracy targets defined | Important, but can refine in MVP |
| MVP → Pilot | Basic documentation complete | Helps users, but can iterate during pilot |
| Any Stage | Bias testing complete | Critical for fairness, depth varies by risk |
| Gate | Items | Factors to Consider |
|---|---|---|
| Any Stage | External validation | Required for Healthcare, optional for internal tools |
| POC → MVP | Clinical advisor review | Required for Healthcare AI, optional otherwise |
| Pilot → Production | A/B testing complete | Critical for consumer apps, optional for internal |
| Production → Scale | Multi-region deployment | Required for global, optional for single-market |
flowchart TD
Q1{Is there a legal/<br/>regulatory requirement?}
Q1 -->|YES| M1[🔴 MANDATORY]
Q1 -->|NO| Q2{Could failure cause<br/>user harm?}
Q2 -->|YES| M2[🔴 MANDATORY]
Q2 -->|NO| Q3{Does it significantly<br/>impact ROI?}
Q3 -->|YES| A1[🟡 ADVISORY]
Q3 -->|NO| C1[🟢 CONFIGURABLE]
style M1 fill:#fecaca,stroke:#dc2626,color:#7f1d1d
style M2 fill:#fecaca,stroke:#dc2626,color:#7f1d1d
style A1 fill:#fef3c7,stroke:#f59e0b,color:#78350f
style C1 fill:#dcfce7,stroke:#22c55e,color:#14532d
🏥 Healthcare AI: FDA Regulatory Overlay — Click to expand
When building Healthcare AI, enable this overlay to add FDA-specific requirements:
| Standard Stage | FDA Addition | Requirements |
|---|---|---|
| Stage 3: POC | + Pre-Submission | FDA feedback on regulatory pathway |
| Stage 4: MVP | + Analytical Validation | Technical performance verification |
| Stage 5: Pilot | + Clinical Validation | Real-world clinical testing |
| Stage 5→6 Gate | + Regulatory Submission | 510(k), De Novo, or PMA |
| Stage 6: Production | + Market Authorization | FDA clearance/approval required |
| Stage 8: Optimize | + Post-Market Surveillance | Ongoing safety monitoring |
FDA Gate Requirements (All Mandatory):
- Intended use clearly defined
- Risk classification determined (Class I, II, or III)
- Predicate device identified (for 510(k))
- Clinical evidence sufficient for risk level
- Quality Management System (QMS) established
- Post-market surveillance plan documented
📖 Deep Dive: See docs/LIFECYCLE-STAGES.md for detailed stage requirements and checklists.
Important
Why it matters: Poor architecture decisions made early become expensive technical debt. A well-designed AI system separates concerns, enables scaling, and makes debugging possible. This section covers the foundational infrastructure that everything else builds upon.
- Foundation Layer
- Data lakehouse combining flexibility of data lakes with structure of warehouses
- Governed data pipelines ensuring quality and compliance
- Semantic layers for consistent definitions and access patterns
- Model Infrastructure
- Specialized infrastructure for LLMs and prompt management
- MLOps integration with CI/CD for models & prompts
- Offline and online evaluation pipelines
- Responsible AI Automation
- Bias checks and red-teaming processes
- Explainability mechanisms
- Policy-as-code implementation
- Pre-production & Runtime
- Safety/quality gates and runtime guardrails
- Prompts and model configs treated as versioned artifacts
- Monitoring, drift detection, and outcome KPIs
- Scalable Infrastructure
- Kubernetes with GPU operators
- Autoscaling configured
- Mixed precision training/inference
- Data Pipeline Design
- Defined data ingestion strategy
- Implemented data validation and quality checks
- Set up data versioning system
- Created data lineage tracking
- Established data retention policies
💡 Implementation Tips
- Use tools like Dagster or Airflow for orchestration
- Implement Great Expectations for data quality
- Consider using DVC for data versioning
- Example from MultiDB-Chatbot: Separate databases for different data types
- AI-Ready Pipeline Components
- Schema validation with real-time checks and evolution planning
- Data enrichment (location, user-agent, IDs)
- Feature engineering for ML transformations
- Tiered storage (bronze/silver/gold)
- Data contracts between producers/consumers
💡 Data Pipeline Patterns
| Pattern | Use Case | Trade-offs |
|---|---|---|
| Batch Processing | Lower-volume, non-real-time | Simple but delayed |
| Stream Processing | Real-time decisions, IoT | Complex but immediate |
| Lambda | Comprehensive view | Dual system complexity |
| Kappa | Event-driven apps | Simplified, replay-based |
| Data Lakehouse | Unified analytics + ML | Best of both worlds |
| Data Mesh | Large enterprises | Autonomy vs. governance |
- Model Selection
- Evaluated multiple model options
- Performed cost-benefit analysis
- Tested fallback models
- Documented model limitations
- Created model cards
- Edge/Small Model Deployment
- On-device inference requirements assessed (mobile, IoT, embedded)
- Model quantization applied (INT4, INT8, FP16)
- Context window fits device memory constraints
- Offline capability tested (local vector store, cached responses)
- Battery/power consumption profiled
- Latency validated on target hardware (< 100ms for interactive)
- Model fits deployment target (Jamba 3B: phones, Llama 4 Scout: single GPU)
- Edge/cloud split ratio defined (e.g., 90% edge / 10% cloud fallback)
- Cloud fallback triggers documented (complexity, safety, connectivity)
- Total memory budget validated (≤8GB for consumer devices)
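The memory-budget item above can be sanity-checked with simple arithmetic before any benchmarking: weight memory is roughly parameter count times bytes per parameter at the chosen precision, with KV cache, activations, and runtime overhead on top. A small sketch, with illustrative numbers only:

```python
# Rough weight-memory estimate by precision; overheads (KV cache, runtime) are extra.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    # 1e9 parameters × bytes per parameter / 1e9 bytes per GB cancels to this product.
    return params_billions * BYTES_PER_PARAM[precision]

# Example: a 3B-parameter model quantized to INT4 needs ~1.5 GB for weights alone,
# leaving headroom for KV cache and the OS within an 8 GB consumer-device budget.
print(weight_memory_gb(3, "int4"))   # -> 1.5
```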
- Retrieval Augmented Generation (RAG)
- Designed chunking strategy
- Optimized embedding dimensions
- Implemented hybrid search (vector + keyword)
- Set up reranking pipeline
- Configured context window management
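As a starting point for the chunking item above, here is a minimal fixed-size chunker with overlap. It counts characters for simplicity; production pipelines usually count tokens and respect sentence or heading boundaries, so treat this as a sketch rather than a recommended default.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with overlapping windows."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap        # slide the window, keeping the overlap
    return chunks
```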
- Modular Design Requirements
- Loose coupling: Agents operate as services/processes
- Clear interfaces: APIs, event buses, message queues
- Policy-driven control: Guardrails define permissions, escalation, auditing
- Observability: All actions monitored and logged
- Zero-trust security for agent communications
- Versioning & rollback: Tag releases, automate rollbacks on failure
- Microservices Design
- Separated inference from business logic
- Implemented API gateway
- Designed for horizontal scaling
- Created service mesh
- Established circuit breakers
- Database Strategy
- Selected appropriate databases for each workload
- Implemented connection pooling
- Set up read replicas
- Configured automated backups
- Tested disaster recovery
💡 Architecture Patterns Comparison
| Pattern | Use Case | Trade-offs |
|---|---|---|
| Modular Systems | Independent components | Flexibility vs. coordination overhead |
| Centralized Platforms | Multiple use cases | Consistency vs. single point of failure |
| Decentralized | Department-managed AI | Autonomy vs. governance challenges |
| Federated Learning | Distributed data sources | Privacy vs. communication costs |
⬆️ Navigation · ⬅️ Lifecycle · Next: Data Quality ➡️
Important
Why it matters: Research reveals that 80%+ of AI failures trace to data issues, not model complexity. Training-Serving Skew is a "silent failure"—models output garbage predictions with high confidence without crashing. Data leakage creates an "optimism trap" where prototype metrics are artificially inflated. This section addresses the primary technical determinant of production success.
⚠️ "This skew acts as a 'silent failure'; the model does not crash or throw exceptions. It simply outputs garbage predictions with high confidence."
- Single Pipeline Architecture: Feature engineering code identical between training and inference (no dual-pipeline anti-pattern)
- Feature Store Implemented: Centralized repository ensures feature calculation consistency across environments
- Schema Enforcement: Input schemas validated at inference time match training schemas exactly
- Numerical Precision Parity: Training (Python/Pandas) and serving (Java/Go/C++) use identical numerical precision
- Time Zone Handling: Temporal features calculated identically (UTC normalization enforced)
- Missing Value Strategy: Imputation logic production-identical (not notebook-specific hacks)
- Shadow Mode Validation: New models run in parallel with existing, comparing outputs before promotion
💡 Anti-Pattern Alert
The "dual-pipeline" pattern (Data Scientists in Python → Engineers rewrite in Java) is a primary source of skew. Use Feature Stores (Feast, Tecton, Featureform) to structurally eliminate this risk.
⚠️ "Leakage artificially inflates evaluation metrics during the PoC, creating a false sense of security that evaporates upon deployment."
- Target Leakage Audit: All features verified to be causally available BEFORE prediction timestamp
- Train-Test Contamination Check: No global preprocessing (normalization, scaling) performed before data split
- Temporal Discipline: Time-series data split chronologically, never randomly
- Feature Provenance Documentation: Each feature's data source, calculation logic, and temporal availability documented
- Leakage Detection Tests: Automated tests flag suspiciously high-performing features (>0.95 correlation with target)
- Cross-Validation Strategy: Appropriate CV method for data type (TimeSeriesSplit for temporal, GroupKFold for hierarchical)
💡 The Antibiotic Example
A pneumonia prediction model learned that took_antibiotic=True predicts pneumonia perfectly—in historical data. In production, this feature is unknown at prediction time. The model fails catastrophically because it trained on leaked future information.
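Two of the checks above (chronological splitting and fitting preprocessing on the training split only), plus a crude leakage smoke test, can be sketched as follows; column names are illustrative, and the 0.95 correlation cutoff follows the checklist item:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("events.csv", parse_dates=["event_time"]).sort_values("event_time")

split = int(len(df) * 0.8)                         # chronological, not random, split
train, test = df.iloc[:split], df.iloc[split:]

scaler = StandardScaler().fit(train[["amount"]])   # fit preprocessing on train ONLY
train_amount = scaler.transform(train[["amount"]])
test_amount = scaler.transform(test[["amount"]])   # reuse the training statistics

# Crude leakage smoke test: features almost perfectly correlated with the target
# deserve a manual audit of when they actually become available.
corr = train.corr(numeric_only=True)["label"].abs().drop("label")
print(corr[corr > 0.95])
```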
⚠️ "The fundamental assumption that training and test data are IID (Independent and Identically Distributed) is rarely true in enterprise environments."
| Drift Type | Definition | Detection Method | Trigger Action |
|---|---|---|---|
| Covariate Shift | P(X) changes | KS-test, PSI on inputs | Alert + investigate |
| Concept Drift | P(Y\|X) changes | Performance degradation | Immediate retraining |
| Label Shift | P(Y) changes | Prior probability monitoring | Recalibration |
- Covariate Shift Monitoring: Statistical tests (Kolmogorov-Smirnov, Population Stability Index) on input feature distributions
- Concept Drift Detection: Ground truth feedback loops to detect P(Y|X) relationship changes
- Label Shift Tracking: Target variable distribution (base rates) monitored over time
- Automated Retraining Triggers: Drift thresholds trigger retraining pipelines (not just alerts)
- Windowed Performance Tracking: Rolling accuracy/precision calculated by time window (daily, weekly)
- Seasonality Accounting: Known cyclical patterns (holidays, quarters, fiscal years) factored into drift calculations
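A minimal sketch of the covariate-shift checks listed above, using a two-sample Kolmogorov-Smirnov test and a hand-rolled Population Stability Index against a training-time reference sample. The 0.01 p-value and 0.2 PSI thresholds are common rules of thumb, not universal constants:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index of `current` against `reference`."""
    edges = np.unique(np.quantile(reference, np.linspace(0, 1, bins + 1)))
    # Clip both samples into the reference range so outliers land in the outer bins.
    ref_counts = np.histogram(np.clip(reference, edges[0], edges[-1]), edges)[0]
    cur_counts = np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0]
    ref_pct = ref_counts / len(reference) + 1e-6
    cur_pct = cur_counts / len(current) + 1e-6
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

def check_feature_drift(reference: np.ndarray, current: np.ndarray) -> dict:
    statistic, p_value = ks_2samp(reference, current)
    score = psi(reference, current)
    return {
        "ks_p_value": float(p_value),
        "psi": round(score, 4),
        "drift_suspected": p_value < 0.01 or score > 0.2,   # rule-of-thumb trigger
    }
```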
⚠️ "The Epic Sepsis Model claimed AUC of 0.76-0.83 internally; external validation found AUC as low as 0.63."
- Multi-Source Validation: Model tested on data from at least 2 independent sources/environments
- Demographic Stratification: Performance validated and documented across demographic segments
- Geographic Validation: If applicable, tested across all deployment regions/sites
- Temporal Holdout: Validated on data from a future time period (not random split)
- Site-Specific Calibration Plan: Strategy for adapting model to local deployment conditions
- Model Card with External Results: External validation results documented in public model card
⬆️ Navigation · ⬅️ Architecture · Next: Agentic AI ➡️
Important
Why it matters: 79% of organizations are already using AI agents in production. Agentic systems can handle complex workflows autonomously, but without proper design patterns they become unpredictable and unreliable. This section covers proven enterprise patterns for building agents that work together effectively.
-
Task-Oriented Agents
- Clear success criteria defined
- Error handling and retry logic implemented
- High reliability for repeatable operations
- Best for: Data entry, scheduling, document classification
-
Multi-Agent Collaboration
- Communication patterns established (sequential, hierarchical, bi-directional)
- Cross-check outputs to reduce hallucinations
- Conflict resolution mechanisms
- Distributed expertise coordination
-
Self-Improving Agents
- Feedback loops configured
- Performance monitoring active
- Drift detection implemented
- Continuous learning from interactions
- External reflection preferred over self-critique (code execution, tool validation)
- Environment feedback used to verify reasoning
-
RAG Agents
- Knowledge retrieval connected to reasoning
- Responses grounded in factual, up-to-date information
- Critical for document-heavy domains and compliance
-
Orchestrator Agents
- End-to-end workflow management
- Task distribution across specialized agents
- Failure handling with rerouting/fallback strategies
- Loose coupling and separation of concerns
-
ReAct Pattern (Reason + Act)
- Thought → Action → Observation loop implemented
- Tool failures handled in observation step with retry/fallback logic
- Reasoning traces logged for debugging and audit
- Dynamic re-planning when observations invalidate current plan
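A minimal sketch of the Thought → Action → Observation loop described above, with tool failures surfaced back into the observation step and every step logged for audit. The llm() function and the tool registry are stand-ins; wire in your actual model client and sandboxed tools.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("react")

def llm(prompt: str) -> str:
    """Stand-in for a real model call; replace with your LLM client.
    Must return JSON with keys: thought, action, action_input, final_answer."""
    if "Observation:" not in prompt:
        return json.dumps({"thought": "I should look this up.", "action": "search",
                           "action_input": "production AI readiness", "final_answer": None})
    return json.dumps({"thought": "I have enough information.", "action": None,
                       "action_input": None, "final_answer": "(stub) summarized answer"})

TOOLS = {
    "search": lambda query: f"(stub) top result for {query!r}",   # replace with real, sandboxed tools
}

def react_loop(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}\n"
    for step in range(max_steps):
        decision = json.loads(llm(transcript))
        log.info("step=%d thought=%s action=%s", step, decision["thought"], decision["action"])
        if decision["final_answer"]:                              # model chose to stop and answer
            return decision["final_answer"]
        tool = TOOLS.get(decision["action"])
        try:
            observation = tool(decision["action_input"]) if tool else f"Unknown tool: {decision['action']!r}"
        except Exception as exc:                                  # tool failure becomes an observation, not a crash
            observation = f"Tool error: {exc}"
        transcript += (f"Thought: {decision['thought']}\nAction: {decision['action']}\n"
                       f"Observation: {observation}\n")
    return "Stopped: step budget exhausted; escalate to a human or a fallback flow."

print(react_loop("Summarize what production readiness means for an AI feature."))
```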
💡 Academic vs Enterprise Patterns
| Academic Patterns | Enterprise Patterns |
|---|---|
| Reflection | Task-Oriented |
| Tool Use | Multi-Agent Collaboration |
| ReAct | Self-Improving |
| Planning | RAG Agents |
| Multi-Agent | Orchestrator Agents |

Tip: Start with the task-oriented pattern (lowest complexity, fastest time to value), then progress to sequential orchestration, then advanced patterns.
-
Core Components
- Agents with distinct roles, personas, specific contexts
- Agent management for collaboration patterns
- Human-in-the-loop for reliability in critical scenarios
- Specialized tools (web search, document processing, code)
- LLM backbone for processing and inference
- Context management with prompts enabling intent identification
- Memory systems (shared or individual) for context retention
-
MAS Design Best Practices
- Clearly defined agent roles and responsibilities
- Communication protocols for data sharing
- Adaptive decision-making capabilities
- Scalable architecture from the start
- Comprehensive monitoring framework
- Strong security (encryption, secure data handling)
- Regular audits for bias and fairness
- Error propagation prevention through data governance
💡 MAS vs Single-Agent Comparison
| Aspect | Single-Agent | Multi-Agent |
|---|---|---|
| Architecture | Monolithic | Distributed |
| Fault Tolerance | Single point of failure | Resilient—others continue |
| Scalability | Limited | Add agents at runtime |
| Hallucination | Higher risk | Cross-checking reduces errors |
| Context Windows | Limited | Distribute across agents |

-
Multi-Agent Frameworks Evaluated
- AutoGen (Microsoft): Dynamic agent interactions
- Semantic Kernel (Microsoft): Modular, bridges traditional programming and AI
- LlamaIndex: Knowledge-driven applications
- LangChain: Comprehensive orchestration
- CrewAI: Task-oriented multi-agent coordination
⬆️ Navigation · ⬅️ Data Quality · Next: Security ➡️
Important
Why it matters: AI systems handle sensitive data and make decisions that affect users. A security breach can expose PII, leak proprietary models, or allow prompt injection attacks. Compliance failures result in fines (GDPR: up to 4% of global revenue) and reputational damage. This is non-negotiable for production.
- Access Control
- Implemented JWT/OAuth 2.0
- Set up API key management
- Created role-based access control (RBAC)
- Implemented rate limiting per user/tier
- Added IP allowlisting capabilities
-
Encryption
- TLS 1.3+ for data in transit
- AES-256 for data at rest
- Encrypted model weights storage
- Secure key management (KMS)
- Implemented secrets rotation
-
Privacy
- PII detection and masking
- GDPR compliance (right to deletion)
- Data residency controls
- Audit logging for all data access
- Consent management system
- Industry Standards
- HIPAA (healthcare)
- PCI DSS (payments)
- SOC 2 Type II
- ISO 27001
- FedRAMP (government)
⬆️ Navigation · ⬅️ Agentic AI · Next: Red Teaming ➡️
Important
Why it matters: LLMs have unique vulnerabilities that traditional security doesn't cover. Prompt injection can bypass all your safety measures. NVIDIA's red team found that insecure RAG permissions and unsanitized outputs are the top attack vectors. Proactive adversarial testing catches these before attackers do.
- Vulnerability Assessment
- LLM01: Prompt Injection - tested and mitigated
- LLM02: Sensitive Data Leakage - prevention in place
- LLM07: System Prompt Leakage - protected
- Model theft prevention
- Bias detection and mitigation
- Data poisoning prevention
- RAG exploitation protection
- API abuse prevention
-
Planning Phase
- Scope defined
- Diverse team assembled (benign and adversarial mindsets)
- Domain experts included (healthcare, legal, etc.)
- Goals and success criteria set
-
Attack Design & Execution
- Adversarial inputs created
- Attack scenarios designed
- Production-like environment testing
- Testing at multiple layers (base model, RAG, application)
-
Analysis & Remediation
- Outputs scored systematically
- Vulnerabilities identified and documented
- Guardrails implemented
- Retraining if needed
- Regression testing after fixes
- CI/CD integration for continuous testing
- Content & Behavior
- Harmful content generation (offensive)
- Stereotypes and discrimination (bias)
- Data leakage (PII exposure)
- Non-robust responses (inconsistency)
- Prompt injection (user input manipulation)
- Jailbreaking (bypassing safety filters)
-
Critical Mitigations
- Sanitize all LLM output (remove markdown, HTML, URLs)
- Image content security policies implemented
- Display entire links to users before connecting
- Active content disabled where appropriate
- Secure permissions on RAG data stores
- LLM-generated code execution sandboxed
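A minimal sketch of the output-sanitization item above: stripping HTML, markdown images, and link markup before rendering model text in a UI. The regexes are deliberately simple and illustrative; they do not replace a proper HTML sanitizer or content security policy.

```python
import html
import re

MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")      # ![alt](url)
MD_LINK = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")    # [text](url)
HTML_TAG = re.compile(r"<[^>]+>")

def sanitize_llm_output(text: str) -> str:
    text = MD_IMAGE.sub("[image removed]", text)     # no untrusted remote images
    text = MD_LINK.sub(r"\1 (\2)", text)             # show the full target URL, not just the anchor text
    text = HTML_TAG.sub("", text)                    # strip active/HTML content
    return html.escape(text)                         # escape what's left before rendering in a UI

print(sanitize_llm_output('See <script>alert(1)</script>[docs](https://example.com/docs) and ![x](https://example.com/p.png)'))
# -> "See alert(1)docs (https://example.com/docs) and [image removed]"
```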
💡 Red Teaming Tools (2025)
- Promptfoo: Open-source LLM red teaming framework
- DeepTeam: Built on DeepEval for safety testing
- AutoRTAI (HiddenLayer): Agent-based automated red teaming
- Mindgard DAST-AI: Dynamic application security testing for AI
- Adversa: Continuous red teaming for LLMs
⬆️ Navigation · ⬅️ Security · Next: Performance ➡️
Important
Why it matters: Users abandon AI applications that feel slow—every 100ms of latency reduces engagement. LLM inference is expensive; poor optimization wastes GPU resources. At scale, the difference between 100ms and 500ms response time is the difference between delighted users and churned customers.
- Response Time Targets
- Time to First Token (TTFT) < 350ms
- Time to Incremental Token (TTIT) < 25ms
- P50 latency < 200ms
- P99 latency < 1s
- Implemented caching strategy
- Prompt/context caching enabled (reduces TTFT up to 70%)
- Optimized model serving
- Set up CDN for static assets
- Intermediate status shown to users ("Searching...", "Analyzing...")
- Non-LLM operations identified (use code instead of LLM calls where possible)
-
Load Handling
- Tested with expected peak load
- Implemented auto-scaling policies
- Set up load balancing
- Configured queue management
- Established back-pressure mechanisms
-
Concurrency
- Async request handling
- Connection pooling
- Worker pool management
- Batch inference capabilities
- Stream processing for real-time
- Compute Efficiency
- Model quantization implemented
- GPU utilization monitoring (aim for near 100%)
- CPU/Memory profiling
- Container right-sizing
- Spot instance usage
-
Scaling Strategies
- Data parallelism: Replicate model, distribute data
- Model parallelism: Split model across devices
- Tensor parallelism: Distribute tensor operations
- Pipeline parallelism: Sequential stages across devices
- Context parallelism: Distribute long context processing
💡 Deployment Options
| Option | Pros | Cons |
|---|---|---|
| Cloud | Flexible, scalable, pay-as-you-go | Data privacy concerns |
| On-Premises | Data control, security | High upfront cost |
| Hybrid | Best of both, cost optimization | Complexity |
| Edge | Low latency, data residency | Limited compute |

💡 Serving Frameworks (2025)
- vLLM: High-throughput, paged attention
- TensorRT-LLM: NVIDIA optimized inference
- Ray Serve: Distributed serving, LangChain integration
- Triton Inference Server: Multi-model, dynamic batching
- llm-d: Kubernetes-native distributed inference
⬆️ Navigation · ⬅️ Red Teaming · Next: Cost ➡️
Important
Why it matters: AI costs can spiral out of control overnight. A single misconfigured prompt can 10x your token usage. 63% of organizations are now actively managing AI spending (doubled from 2024). Without proper FinOps, that "free tier" experiment becomes a $50K monthly bill.
- Cost Tracking
- Token usage (input/output tokens processed)
- GPU compute (training and inference)
- Model training costs (initial and fine-tuning)
- Infrastructure (storage, network)
- API calls (third-party model usage)
- AI Cost Metrics
- Cost Per Token: Total cost / tokens processed
- Cost Per Inference: Total cost / inference requests
- Cost Per Unit of Work: e.g., cost per 100k words
- GPU Utilization: Aim for near 100%
- Training Cost Efficiency: Cost / model accuracy
- Metering
- Token counting per request
- API call tracking
- Storage usage monitoring
- Compute hour tracking
- Bandwidth monitoring
- Budget Management
- Set spending alerts
- Implemented hard limits
- Created usage quotas
- Automated cost reports
- Chargeback/showback system for teams
- Weekly/monthly forecasting cadence
-
Model Selection
- Choose appropriate model size for task complexity
- Use smaller models for simple tasks
- Consider fine-tuned smaller models vs. large general models
-
Infrastructure Optimization
- Autoscaling based on demand
- Spot instances for non-critical workloads
- Mixed precision training/inference
- Edge computing for latency-sensitive applications
-
Operational Optimization
- Prompt engineering ("be concise" reduces tokens 15-25%)
- Response caching for repeated queries
- Request batching
- Smart LLM routing (route to appropriate model)
- Build shared infrastructure (centralized vector stores)
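A minimal sketch tying several items above together: per-request token metering, cost-per-inference and cost-per-token metrics, and a spending alert at 80% of budget. The model names and prices are placeholders; substitute your provider's actual rates.

```python
from dataclasses import dataclass, field

# Placeholder prices (USD per 1M tokens); substitute your provider's actual rates.
PRICE_PER_1M = {"small-model": {"input": 0.15, "output": 0.60},
                "large-model": {"input": 3.00, "output": 15.00}}

@dataclass
class CostMeter:
    monthly_budget_usd: float
    total_usd: float = 0.0
    requests: int = 0
    tokens: int = 0
    alerts: list = field(default_factory=list)

    def record(self, model: str, input_tokens: int, output_tokens: int) -> float:
        price = PRICE_PER_1M[model]
        cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
        self.total_usd += cost
        self.requests += 1
        self.tokens += input_tokens + output_tokens
        if self.total_usd > 0.8 * self.monthly_budget_usd:        # spending alert at 80% of budget
            self.alerts.append(f"80% of budget reached at request {self.requests}")
        return cost

    def report(self) -> dict:
        return {"cost_per_inference": self.total_usd / max(self.requests, 1),
                "cost_per_1k_tokens": 1_000 * self.total_usd / max(self.tokens, 1),
                "total_usd": round(self.total_usd, 4)}

meter = CostMeter(monthly_budget_usd=500.0)
meter.record("large-model", input_tokens=1_200, output_tokens=300)
meter.record("small-model", input_tokens=800, output_tokens=150)
print(meter.report())
```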
⬆️ Navigation · ⬅️ Performance · Next: Safety ➡️
Important
Why it matters: LLMs can generate harmful, biased, or factually wrong content. One toxic output can go viral and destroy your brand. Organizations with ethical AI design report higher success rates. This section ensures your AI helps users without causing harm.
-
Input Validation
- Prompt injection detection
- Malicious input filtering
- Size limits enforcement
- Format validation
- Rate limiting by content type
-
Output Safety
- Toxicity filtering
- Bias detection
- Factuality checking
- Copyright detection
- PII scrubbing
- Responsible AI
- Bias testing completed
- Fairness metrics defined
- Transparency documentation
- Human-in-the-loop options
- Opt-out mechanisms
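A minimal sketch of the PII-scrubbing item above using regex patterns for emails, phone numbers, and SSN-like strings. Regexes only catch obvious cases; production systems typically add an NER-based detector on top, so treat these patterns as illustrative.

```python
import re

# More specific patterns first so an SSN is not swallowed by the broader phone pattern.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{8,}\d"),
}

def scrub_pii(text: str) -> tuple[str, dict]:
    """Replace detected PII with typed placeholders; return the scrubbed text and per-type counts."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label} REDACTED]", text)
        counts[label] = n
    return text, counts

scrubbed, found = scrub_pii("Contact Jane at jane.doe@example.com or +1 (415) 555-0137, SSN 123-45-6789.")
print(scrubbed)   # Contact Jane at [EMAIL REDACTED] or [PHONE REDACTED], SSN [SSN REDACTED].
print(found)      # {'EMAIL': 1, 'SSN': 1, 'PHONE': 1}
```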
⬆️ Navigation · ⬅️ Cost · Next: Monitoring ➡️
Important
Why it matters: You can't fix what you can't see. AI systems degrade silently—model drift, data quality issues, and hallucination rates creep up over time. Without proper monitoring, you'll learn about problems from angry users, not dashboards. This is how you maintain quality post-launch.
- Infrastructure Metrics
- CPU/Memory/Disk usage
- Network latency
- Queue depths
- Error rates
- Service health checks
- AI-Specific Metrics
- Model inference time
- Token usage per request
- Cache hit rates
- Embedding generation time
- Context retrieval accuracy
- KPI Tracking
- User satisfaction scores
- Task completion rates
- Revenue per user
- Cost per request
- Feature adoption rates
- Incident Detection
- Anomaly detection
- Threshold-based alerts
- Escalation policies
- On-call rotation
- Incident response runbooks
⬆️ Navigation · ⬅️ Safety · Next: Operations ➡️
Important
Why it matters: Production AI requires continuous care. Models need updates, prompts need tuning, and systems fail. Without proper deployment strategies (blue-green, canary), one bad release takes down production. Without disaster recovery, one outage becomes permanent data loss.
- Release Management
- Blue-green deployments
- Canary releases
- Feature flags
- Rollback procedures
- Database migration strategy
- Lifecycle Management
- Model versioning system
- A/B testing framework
- Model registry
- Performance tracking
- Retraining pipeline
- Business Continuity
- Backup strategy (3-2-1 rule)
- Recovery time objective (RTO)
- Recovery point objective (RPO)
- Failover procedures
- Regular DR drills
⬆️ Navigation · ⬅️ Monitoring · Next: Tech Debt ➡️
Important
Why it matters: ML systems have a unique capacity to incur massive, invisible maintenance costs. The CACE principle (Changing Anything Changes Everything) means small upstream changes can catastrophically break downstream models. This debt compounds silently during prototyping and surfaces explosively in production.
⚠️ "In an ML model, altering one input feature can change the optimal weights for all others, making systems incredibly brittle."
- Feature Dependency Map: Documented which features are correlated/entangled with each other
- Upstream Change Notifications: Automated alerts when data sources change schemas or distributions
- Full Retraining Policy: Clear policy for when to retrain entire model vs. incremental update
- Hyperparameter Sensitivity Analysis: Documented which hyperparameters are sensitive to data changes
- Model-Data Version Binding: Model versions explicitly tied to specific data snapshots
- Impact Analysis Process: Before any change, assess downstream impact on model performance
⚠️ "A failure in an upstream data source can propagate silently through the pipeline, corrupting training data without triggering an error."
- Pipeline DAG Visualization: Data lineage visualized from raw source to model input
- Data Contracts Enforced: Producer-consumer contracts for data schemas with automated validation
- Intermediate Checkpoints: Data quality checks at each pipeline stage (not just ingestion and output)
- Glue Code Elimination: Research/notebook code abstracted into testable modules (not copy-pasted)
- Pipeline Unit Tests: Transformation logic has unit tests with expected input/output pairs
- Null Propagation Alerts: Explicit handling and alerting for null/missing values at every stage
- Idempotency Guaranteed: Pipeline can be re-run safely without side effects
- Direct Feedback Loops Cataloged: Cases where model output directly becomes training data
- Hidden Feedback Loops Identified: Indirect influence paths (model → world → data)
- Loop Damping Mechanisms: Strategies to prevent runaway self-reinforcement
- Exploration Budget: System allocates capacity to explore beyond model recommendations
- Counterfactual Data Collection: Mechanisms to gather data on actions not taken
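A minimal sketch of the "data contracts enforced" and "pipeline unit tests" items above: a boundary check that fails loudly on schema or null violations, and one small transformation with an explicit expected input/output test. The contract fields are hypothetical.

```python
import math

# Illustrative data contract for one upstream table the pipeline consumes.
CONTRACT = {
    "required_columns": {"user_id": int, "amount": float, "country": str},
    "allow_nulls": set(),                      # no nullable columns in this feed
}

def validate_contract(rows: list[dict]) -> None:
    """Fail loudly at the pipeline boundary instead of letting bad data propagate silently."""
    for i, row in enumerate(rows):
        for col, col_type in CONTRACT["required_columns"].items():
            if col not in row or row[col] is None:
                if col not in CONTRACT["allow_nulls"]:
                    raise ValueError(f"row {i}: null/missing '{col}' violates contract")
            elif not isinstance(row[col], col_type):
                raise TypeError(f"row {i}: '{col}' expected {col_type.__name__}")

def log_amount(row: dict) -> dict:
    """One small, testable transformation instead of notebook glue code."""
    return {**row, "amount_log": math.log1p(row["amount"])}

def test_log_amount():
    # Pipeline unit test: explicit expected input/output pair.
    out = log_amount({"user_id": 1, "amount": 0.0, "country": "US"})
    assert out["amount_log"] == 0.0

validate_contract([{"user_id": 1, "amount": 10.0, "country": "US"}])
test_log_amount()
```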
⚠️ "Any change or improvement can inadvertently break critical downstream processes, creating fear of updating and model stagnation."
- Consumer Registry: All systems consuming model outputs documented and maintained
- Deprecation Policy: Formal process for notifying consumers of model changes
- Output Schema Versioning: Model outputs versioned with backward compatibility guarantees
- Contract Testing: Downstream systems tested when model interface changes
- Threshold Documentation: Any hard-coded thresholds on model outputs documented with owners
- Breaking Change Protocol: Process for coordinating breaking changes across consumers
⬆️ Navigation · ⬅️ Operations · Next: Governance ➡️
Important
Why it matters: The EU AI Act is now law. NIST and ISO 42001 are becoming enterprise requirements. Organizations that ignore governance face fines, failed audits, and banned products. Only 33% of organizations have embedded AI governance—being compliant is a competitive advantage.
- Regulatory Compliance Mapping
- EU AI Act: Risk-based classification, mandatory compliance
- NIST AI RMF: Risk management guidelines
- ISO 42001: International AI management standards
- OECD AI Principles: Ethical/human-centered guidelines
- Regional frameworks (UK Pro-Innovation, etc.)
⚠️ CRITICAL: A technically successful prototype may be ILLEGAL to deploy. Engaging in any of the following practices results in immediate project termination.
Absolutely Prohibited (No Exceptions):
- Social Scoring Ban: System does NOT evaluate/classify natural persons based on social behavior or personality traits leading to detrimental treatment
- Emotion Recognition Ban (Workplace/Education): System does NOT infer emotions of individuals in workplaces or educational institutions
- Real-Time Biometric ID Ban: System does NOT use real-time remote biometric identification in publicly accessible spaces (narrow law enforcement exceptions)
- Subliminal Manipulation Ban: System does NOT deploy subliminal techniques beyond consciousness to distort behavior
- Vulnerability Exploitation Ban: System does NOT exploit vulnerabilities of specific groups (age, disability, social/economic situation)
- Biometric Categorization Ban: System does NOT categorize individuals based on biometric data to infer race, political opinions, religious beliefs, sexual orientation
- Untargeted Facial Recognition Scraping Ban: System does NOT create facial recognition databases through untargeted scraping
Risk Classification Completed:
- System classified as: Prohibited / High-Risk / Limited Risk / Minimal Risk
- If High-Risk: Conformity assessment requirements identified
- If High-Risk: Quality management system documented
- Legal review completed for EU deployment
⛔ STOP GATE: If ANY prohibited practice applies to your system, EU deployment CANNOT proceed regardless of other readiness scores. Consult legal counsel immediately.
-
AI Organization
- Governance embedded within broader strategy
- Cross-functional team assembled
- Roles & responsibilities assigned
-
Legal & Regulatory Compliance
- Risk assessment methodology defined
- Regulatory mapping completed
- Data protection measures implemented
-
Ethics & Responsible AI
- Fairness, transparency, accountability documented
- Bias mitigation strategies identified
- Ethical guidelines published
-
Technology & Data
- Data governance framework established
- Model management policies defined
- AI model lifecycle processes mapped
-
Operations & Monitoring
- Continuous oversight mechanisms
- Audit trails implemented
- Monitoring & review cadence established
💡 Governance Maturity Levels (PwC 2025)
| Stage | Description | % of Organizations |
|---|---|---|
| Early | Building foundational policies | 18% |
| Training | Developing structures & guidance | 21% |
| Strategic | AI priorities defined & communicated | 28% |
| Embedded | Integrated into core operations | 33% |
⬆️ Navigation · ⬅️ Tech Debt · Next: Evaluation ➡️
Important
Why it matters: "It works on my laptop" isn't good enough for AI. LLMs hallucinate, drift, and behave differently with different inputs. Without systematic evaluation using golden datasets and automated testing, you're guessing about quality. This section ensures you can measure and maintain AI performance.
- Multiple Evaluation Methods
- Multiple Choice: Benchmark-based Q&A (MMLU)
- Verifiers: Code/logic verification
- Leaderboards: User preference voting (LM Arena)
- LLM-as-Judge: Automated evaluation at scale
- Quality Metrics
- Accuracy (correctness of responses)
- Relevancy (alignment with query intent)
- Coherence (logical flow of output)
- Faithfulness (grounded in provided context)
- Hallucination rate (false/unsupported claims)
- System Metrics
- Latency (response time)
- Throughput (queries per second)
- Token usage (cost tracking)
- Error rates
- Retrieval Quality
- Context precision (retrieved chunks actually useful)
- Context recall (relevant chunks retrieved)
- Faithfulness (output grounded in retrieval)
- Answer relevancy (concise, on-topic responses)
- Comprehensive Testing
- Functional testing: Task-specific capabilities (pre-deployment)
- Regression testing: Same test cases across iterations
- Adversarial testing: Edge cases and attacks (security validation)
- A/B testing: Compare model/prompt variants (production)
-
Quality Assurance
- "Golden" datasets (~200 prompts) as quality checkpoint
- Human review for failed or unclear judgments
- Combine offline (development) and online (production) evaluation
- Track metrics over time for drift detection
- CI/CD integration for automated quality gates
💡 Evaluation Tools (2025)
- DeepEval: Open-source, CI/CD integration, RAG support
- Arize Phoenix: Production observability and evaluation
- Braintrust: End-to-end evaluation platform
- LangSmith: LangChain's evaluation framework
- RAGAS: RAG-specific evaluation
- OpenAI Evals: Open-source, community-driven
⚠️ A system can have perfect crisis detection but still fail if responses feel robotic, inconsistent, or fail to build trust. Component metrics miss the full picture.
The Evaluation Gap:
| Component-Level (Current) | Agent-Level (Missing) |
|---|---|
| Intent classification accuracy | Therapeutic guideline adherence |
| Response latency (<2s) | Persona/character consistency |
| Embedding similarity scores | Tone consistency across sessions |
| RAG retrieval precision | User satisfaction (CSAT) |
| Generation perplexity | Therapeutic alliance strength |
-
Multi-Dimensional Framework
- Therapeutic/guideline adherence score (>90% via LLM-as-Judge)
- Persona consistency tracking (>85% alignment)
- Tone stability across sessions (VAD drift <0.15)
- User satisfaction (CSAT >80%)
- Engagement metrics (session continuation rate >70%)
-
Working Alliance Inventory - AI Adapted (WAI-AI)
- Task Agreement: "AI helps me work on what I want to focus on"
- Goal Agreement: "AI understands what I want to accomplish"
- Bond: "I feel the AI cares about me / I trust the AI"
- Target score: ≥4.0/5.0 on 12-item assessment
- Weekly micro-surveys (2 random items) + monthly full assessment
-
LLM-as-Judge with Rubrics
- Evaluation rubric defined with weighted dimensions
- Judge model selected (GPT-4/Claude for grading)
- Weekly human calibration (50 LLM judgments vs expert ratings)
- Alert on degradation (>5% drop week-over-week)
-
Behavioral Proxy Metrics
- Session length tracking
- Return rate measurement
- Disclosure depth scoring
- Engagement pattern analysis
💡 Sample LLM-as-Judge Rubric
EVALUATION_RUBRIC = {
    "crisis_resources": {"weight": 1.0, "desc": "Provides crisis resources when risk present"},
    "professional_boundaries": {"weight": 0.9, "desc": "Recommends help appropriately"},
    "empathetic_language": {"weight": 0.8, "desc": "Warm, validating, appropriate tone"},
    "evidence_based": {"weight": 0.7, "desc": "Uses appropriate techniques"},
    "continuation": {"weight": 0.6, "desc": "Maintains engagement"},
    "factual_accuracy": {"weight": 0.9, "desc": "No hallucinations"}
}
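A minimal sketch of how a rubric like the one above could be applied: the judge model returns a 0-1 score per dimension, and a weighted aggregate is compared against an alerting threshold. The judge_scores dict stands in for the LLM-as-Judge call, and the 0.85 threshold is illustrative.

```python
def aggregate_rubric_score(judge_scores: dict, rubric: dict) -> float:
    """Weighted average of per-dimension judge scores, each expected in [0, 1]."""
    total_weight = sum(dim["weight"] for dim in rubric.values())
    weighted = sum(rubric[name]["weight"] * judge_scores.get(name, 0.0) for name in rubric)
    return weighted / total_weight

# Stand-in for the judge model's per-dimension grades on one conversation.
judge_scores = {"crisis_resources": 1.0, "professional_boundaries": 0.9, "empathetic_language": 0.8,
                "evidence_based": 0.7, "continuation": 0.9, "factual_accuracy": 1.0}

score = aggregate_rubric_score(judge_scores, EVALUATION_RUBRIC)   # EVALUATION_RUBRIC defined above
print(f"Weighted rubric score: {score:.2f}")
if score < 0.85:  # illustrative alert threshold; pair with the >5% week-over-week degradation check
    print("Quality alert: rubric score below threshold")
```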
⬆️ Navigation · ⬅️ Governance · Next: Metrics ➡️
Important
Why it matters: A model can be mathematically "optimal" according to its loss function while being "destructive" to the business. Goodhart's Law explains why metrics degrade when they become targets. This section ensures your evaluation actually predicts real-world success, not just offline performance.
⚠️ "Optimizing for proxy metrics like CTR can lead a recommender to promote clickbait, ultimately degrading user trust and long-term retention."
- Metric Mapping Document: Each offline metric explicitly mapped to corresponding business KPI
- Negative Correlation Testing: Verified that optimizing proxy metric doesn't hurt true business objective
- Long-Term Impact Assessment: Short-term metrics (CTR, engagement) validated against long-term outcomes (LTV, retention)
- Multi-Objective Evaluation: Primary metric + guardrail metrics defined (optimize X while Y stays above threshold)
- Stakeholder Metric Sign-Off: Business owners reviewed and approved proxy metric relevance
💡 The Recommender Trap
Netflix/Spotify research shows optimizing for clicks/streams often NEGATIVELY correlates with long-term satisfaction. Users click clickbait, hate it, then churn.
⚠️ "When a measure becomes a target, it ceases to be a good measure."
- Adversarial Metric Analysis: Documented how each metric could theoretically be "gamed"
- Multi-Metric Dashboard: No single metric used as sole success criterion
- Human-in-Loop Reviews: Regular qualitative review of outputs beyond automated metrics
- Metric Validity Refresh: Scheduled cadence for reviewing whether metrics remain valid proxies
- Unintended Consequence Monitoring: Active tracking of side effects from metric optimization
💡 Call Center Paradox
AI optimized for "Average Handling Time" learns that hanging up immediately = 0 seconds = perfect score. Metric gamed, customers furious.
⚠️ "The feedback signal is 'censored'... the model reinforces its own initial biases, creating a self-fulfilling prophecy."
- Feedback Loop Identification: All ways model output influences future training data documented
- Hidden Loop Detection: Indirect feedback paths identified (model → user behavior → data)
- Exploration Strategy: Model occasionally explores non-optimal actions to gather unbiased data
- Off-Policy Evaluation Capability: Can estimate performance of alternative policies from logged data
- Censored Data Acknowledgment: Known limitations documented (only observe outcomes for actions taken)
- Debiasing Strategy: Plan for addressing selection bias in feedback data
💡 Predictive Policing Loop
Model predicts crime in Area A → Police deployed → Crime observed → Model reinforced. It predicts police deployment, not crime distribution.
- A/B Testing Framework: Infrastructure for randomized controlled experiments in production
- Shadow Mode Deployment: Models can run on live traffic without affecting user experience
- Interleaving Capability: For ranking systems, can mix results from models A and B in same response
- Guardrail Metrics: Safety/quality metrics that automatically halt experiments if breached
- Statistical Rigor: Sample size calculations and significance thresholds documented before experiments
- Experiment Velocity: Can run multiple concurrent experiments with proper isolation
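A minimal sketch of the statistical-rigor item above: a standard two-proportion sample-size calculation (normal approximation) for sizing an A/B test before launch. The baseline rate and minimum detectable effect are placeholders.

```python
from scipy.stats import norm

def samples_per_variant(baseline_rate: float, min_detectable_effect: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Two-sided two-proportion z-test sample size (normal approximation) per variant."""
    p1, p2 = baseline_rate, baseline_rate + min_detectable_effect
    p_bar = (p1 + p2) / 2
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p2 - p1) ** 2) + 1

# Example: detect a 2-point lift on a 20% task-completion rate at 95% confidence, 80% power.
print(samples_per_variant(baseline_rate=0.20, min_detectable_effect=0.02))  # roughly 6,500 per variant
```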
⬆️ Navigation · ⬅️ Evaluation · Next: Assured Intelligence ➡️
Important
Why it matters: Traditional checklists ensure "probably works"—this section ensures "provably works within bounds." A model can achieve 95% accuracy while producing overconfident wrong predictions that cause patient deaths. Conformal Prediction, causal validation, and selective prediction provide mathematical guarantees that transform AI from "good enough" to "assured."
⚠️ "A prediction of 'sepsis probability 0.73' is meaningless without knowing if the 95% interval is [0.71, 0.75] or [0.23, 0.95]."
- Calibration Set Separated: Held-out data for conformal calibration (≥1000 samples)
- Non-Conformity Score Defined: Appropriate score function for task type
- Coverage Level Set: Target coverage defined (≥95% for healthcare, ≥90% typical)
- Prediction Intervals Generated: Every prediction includes conformal interval
- Coverage Validated Empirically: Actual coverage matches target
- Conditional Coverage Tested: Coverage validated across subgroups (fairness)
- Interval Width Monitored: Track and alert on interval width changes
💡 Conformal Prediction Explained
Conformal Prediction provides mathematically valid prediction intervals with guaranteed coverage—regardless of the underlying distribution.
P(Y_true ∈ Prediction_Set) ≥ 1 - α
This guarantee holds for ANY distribution (distribution-free) with finite samples.
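A minimal sketch of split conformal prediction for a classifier: calibrate a non-conformity score (1 minus the probability of the true class) on held-out data, then build prediction sets that contain the true label with probability at least 1 − α. The data is synthetic and the model is any scikit-learn-style classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=3000) > 0).astype(int)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_calib, X_test, y_calib, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Non-conformity score on the calibration set: 1 - P(true class)
alpha = 0.10                                               # target coverage 90%
calib_probs = model.predict_proba(X_calib)
scores = 1.0 - calib_probs[np.arange(len(y_calib)), y_calib]
q_level = np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores)
q_hat = np.quantile(scores, q_level)                       # conformal quantile

# Prediction sets: include every class whose probability clears the conformal threshold.
test_probs = model.predict_proba(X_test)
prediction_sets = test_probs >= (1.0 - q_hat)

covered = prediction_sets[np.arange(len(y_test)), y_test].mean()
print(f"Empirical coverage: {covered:.3f} (target >= {1 - alpha})")
print(f"Average set size: {prediction_sets.sum(axis=1).mean():.2f}")
```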
⚠️ "When a model outputs P=0.80, 80% of cases with that score must actually be positive. Modern neural networks are notoriously miscalibrated—overconfident."
- ECE Computed: Expected Calibration Error measured
- Healthcare: ECE < 0.05 (mandatory)
- Financial: ECE < 0.05 (recommended)
- Consumer: ECE < 0.10 (acceptable)
- Reliability Diagram Generated: Visual calibration assessment
- Post-Hoc Calibration Applied: Temperature scaling or Platt scaling if ECE too high
- Calibration Per Subgroup: ECE validated across demographic groups
- Recalibration Triggers: Automated recalibration when drift detected
💡 Calibration Metrics
| Metric | Formula | Target |
|---|---|---|
| ECE | Weighted average of \|accuracy - confidence\| per bin | < 0.05–0.10 (see thresholds above) |
| MCE | Maximum \|accuracy - confidence\| over bins | Lower is better |
| Brier Score | Mean squared error of probability estimates | Lower is better |
Key Research: On Calibration of Modern Neural Networks (Guo et al., 2017)
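A minimal sketch of the ECE computation from the table above: bin predictions by confidence, compare per-bin accuracy with per-bin mean confidence, and take the bin-weighted average of the gaps. The confidences and outcomes are synthetic placeholders.

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """ECE = sum over bins of (bin fraction) * |bin accuracy - bin mean confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

rng = np.random.default_rng(1)
confidences = rng.uniform(0.5, 1.0, size=10_000)
correct = rng.uniform(size=10_000) < confidences ** 2        # synthetic: model is overconfident

ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")   # compare against the < 0.05 (healthcare) / < 0.10 (consumer) thresholds above
```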
⚠️ "The most dangerous AI is one that's confidently wrong. The model's primary capability should be knowing when to say 'I don't know.'"
- Uncertainty Threshold Defined: Threshold above which model abstains
- Abstention Action Defined: Human review, fallback model, or error response
- Coverage Target Set: Minimum % of inputs that must receive predictions (e.g., 85%)
- OOD Detector Implemented: Out-of-distribution detection operational
- OOD Threshold Calibrated: Threshold tuned on calibration set
- Abstention Rate Monitored: Track % abstentions over time
- Accuracy-on-Predicted Tracked: Accuracy excluding abstained cases
💡 Coverage-Accuracy Trade-off
Accuracy
▲
99%├─────────────────────────────────╮
│ │
95%├───────────────╮ │
│ │ │
90%├──────╮ │ │
│ │ │ │
└──────┴────────┴─────────────────┴──────▶ Coverage
100% 90% 70% 50%
By abstaining on uncertain cases (reducing coverage), accuracy improves.
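A minimal sketch of that trade-off: abstain whenever the top-class probability falls below a threshold, then report coverage (fraction answered) and accuracy on the answered subset. The probabilities and labels are synthetic stand-ins for real model outputs.

```python
import numpy as np

def selective_metrics(probs: np.ndarray, labels: np.ndarray, threshold: float) -> dict:
    """Abstain when max probability < threshold; report coverage and accuracy-on-predicted."""
    confidence = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    answered = confidence >= threshold
    coverage = answered.mean()
    selective_accuracy = (predictions[answered] == labels[answered]).mean() if answered.any() else float("nan")
    return {"threshold": threshold, "coverage": round(float(coverage), 3),
            "selective_accuracy": round(float(selective_accuracy), 3),
            "abstention_rate": round(float(1 - coverage), 3)}

# Synthetic stand-in: 2-class probabilities and labels for 5,000 cases.
rng = np.random.default_rng(7)
p1 = rng.beta(2, 2, size=5_000)
probs = np.column_stack([1 - p1, p1])
labels = (rng.uniform(size=5_000) < p1).astype(int)

for threshold in (0.5, 0.7, 0.9):    # sweep to choose an operating point that meets the coverage target
    print(selective_metrics(probs, labels, threshold))
```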
⚠️ "The Amazon recruiting AI failed because it learned correlations (women's college names → rejection) not causes. Removing 'gender' doesn't fix proxy discrimination."
- Causal DAG Documented: Explicit causal graph for the domain
- Domain Expert Validation: Causal assumptions reviewed by experts
- Confounder Identification: All confounders identified and addressed
- Proxy Discrimination Tested: Protected attributes cannot be reconstructed from features
- Counterfactual Fairness Evaluated: Would prediction change if ONLY protected attribute changed?
- Backdoor Paths Blocked: Confounders adjusted for or controlled
💡 Correlation vs. Causation
| Analysis | Question | Method |
|---|---|---|
| Correlation | Are X and Y associated? | Statistical tests |
| Causation | Does X cause Y? | Do-calculus, interventions |
| Proxy | Can A be inferred from X? | Reconstruction testing |
| Counterfactual | What if A were different? | Causal inference |
Key Research: Causality (Pearl, 2009), DoWhy Library
⚠️ "In cancer screening, a false negative (missed cancer → death) is catastrophically worse than a false positive (unnecessary biopsy). Optimize for asymmetric error costs."
- Asymmetric Error Costs Quantified: FN cost and FP cost explicitly documented
- Cost Ratio Calculated: FN_cost / FP_cost ratio determines operating point
- Sensitivity Floor Defined: Minimum sensitivity requirement (e.g., 99.9%)
- Layered Architecture Implemented: Multiple detection layers for redundancy
- Layer 1: High-sensitivity detector (catch all positives)
- Layer 2: High-specificity classifier (reduce false positives)
- Layer 3: Anomaly detector (catch OOD cases)
- Layer 4: Human escalation (uncertain cases)
- Layer Independence: Layers use different approaches/features
- FN Root Cause Analysis: Every false negative investigated
- Sensitivity Monitored Per Subgroup: Validated across demographics
💡 Zero-False-Negative Architecture
Input → [High-Sensitivity Detector] → Positive? → [Specific Classifier] → ...
│ │
│ Negative │
▼ ▼
[Anomaly Detector] → Anomalous? → Human Review Output
│
│ Normal
▼
SAFE NEGATIVE
Key: A negative output requires ALL layers to agree.
Any positive triggers escalation.
| Category | Metric | Target (General) | Target (Healthcare) |
|---|---|---|---|
| Conformal | Coverage | ≥ 90% | ≥ 95% |
| Calibration | ECE | < 0.10 | < 0.05 |
| Selective | Abstention Rate | < 20% | < 30% |
| Selective | OOD Detection | > 90% | > 95% |
| Zero-FN | Sensitivity | ≥ 95% | ≥ 99% |
| Zero-FN | False Negatives | Track | 0 target |
📖 Deep Dive: See docs/ASSURED-INTELLIGENCE.md for comprehensive implementation guide with code patterns.
⬆️ Navigation · ⬅️ Metrics · Next: Prompts ➡️
Important
Why it matters: Prompts are the code of AI applications—they determine output quality, consistency, and cost. Research shows adding "be concise" reduces token usage by 15-25%. Treating prompts as versioned artifacts with CI/CD enables rapid iteration and prevents regression. This is how you make AI reliable.
-
Design Principles
- Clear context: Be specific about task and include relevant details
- Customized for each task: Tailor prompts to unique use cases
- Break tasks into steps: Simplify complex workflows
- Output specifications: Format, tone, structure requirements
- Input validation: Ensure inputs are clean and standardized
-
Advanced Techniques
- Set personas and tone: Align with audience and purpose
- Few-shot examples: Show patterns for consistent output
- Chain of thought: Encourage step-by-step reasoning
- Structured output: Specify exact format needed (JSON, tables)
-
Prompt Lifecycle Management
- Version control: Track changes, enable rollback
- CI/CD integration: Automate testing and deployment
- Monitor and iterate: Continuous improvement based on feedback
- Treat prompts as software artifacts
💡 Research-Backed Findings (2025)
- Structure matters: Most successful prompts follow clear pattern (intro, formatting, modular inputs)
- Adding "be concise" reduces token usage by 15-25%
- Different models respond better to different formatting patterns
- Prompts are repeatable—viral prompts work across thousands of users
Tools: Latitude, LangChain, PromptLayer, Lilypad
⬆️ Navigation · ⬅️ Assured Intelligence · Next: Strategy ➡️
Important
Why it matters: 87% of ML projects fail to reach production—most due to organizational issues, not technology. Leadership buy-in is the single most predictive factor for AI success. Without a clear strategy, roadmap, and change management, you'll build great AI that nobody uses. This section bridges technology and business.
-
Strategy & Governance
- AI vision defined
- Principles and governance framework established
-
Technology & Architecture
- Build/buy decisions made
- Sandbox environments available
- Design patterns documented
-
Data Management
- AI-ready data capabilities assessed
- Data quality evaluation completed
-
Talent & Organization
- Resourcing plan created
- Community of practice established
- Target operating model defined
-
Use Cases
- Prioritized by impact/feasibility
- 3-5 initial use cases selected
- Pilot selection criteria defined
-
Vendor Management
- Vendors selected and evaluated
- Cohesive AI vendor strategy evolving
-
Operations
- ModelOps practice established
- Observability implemented
- FinOps best practices applied
-
6-Phase Framework
- Phase 1 - Assessment (2-6 weeks): Evaluate readiness, identify gaps
- Phase 2 - Strategy (3-4 weeks): Define objectives, select use cases
- Phase 3 - Pilot: Select 1-2 use cases, build POC
- Phase 4 - Scale (6-12 months): Expand successful pilots
- Phase 5 - Operationalize: MLOps, monitoring, continuous improvement
- Phase 6 - Transform (12-24 months): Cultural shift, workforce transformation
💡 AI Maturity Levels
| Level | Description | Characteristics |
|---|---|---|
| Early Stage | Building foundations | Policies, frameworks being developed |
| Training Stage | Developing capabilities | Employee training, governance structures |
| Strategic Stage | Active integration | AI integrated into operations |
| Embedded Stage | Full operational integration | AI actively drives decision-making |
- Success Enablers
- Active leadership buy-in (single most predictive factor)
- Cross-functional teams (IT, business, data science)
- Clear business objectives (specific, measurable outcomes)
- Data quality foundation
- Change management program
- Iterative approach (start small, scale gradually)
- Governance framework (ethics, compliance, accountability)
- Anti-Patterns Identified
- Technology-first approach (adopting tool without clear problem)
- Underestimating data quality importance
- Neglecting governance and ethics
- Overreliance on technology (ignoring people/process/culture)
- Lack of ongoing monitoring and optimization
- Attempting too many simultaneous initiatives
⬆️ Navigation · ⬅️ Prompts · Next: Team ➡️
Important
Why it matters: Technology doesn't deploy itself—people do. Knowledge silos, missing documentation, and untrained teams cause operational failures. When the on-call engineer can't find the runbook at 3 AM, your users suffer. This section ensures your team can build, run, and maintain AI systems effectively.
- Technical Documentation
- Architecture diagrams
- API documentation
- Runbooks
- Troubleshooting guides
- Decision records (ADRs)
- Skills & Training
- On-call training completed
- Security training
- Incident response training
- Knowledge transfer sessions
- Cross-functional understanding
-
Organizational Readiness Checklist
- Data: Clean, accessible, API-ready
- Talent: Cross-functional group leads AI skill-building
- Governance: Documented policies for AI systems
- Culture: Employees encouraged to explore/propose AI use cases
- Tooling: Can prototype/deploy without IT bottlenecks
-
Change Management
- Address fears of job displacement openly
- Emphasize AI enhances (not replaces) human skills
- Build curiosity, flexibility, learning mindset
- Provide clear training and development paths
- Conduct skills gap analyses
- Process & Compliance
- Change management process
- Code review requirements
- Security review process
- Compliance audits scheduled
- Stakeholder sign-offs
⬆️ Navigation · ⬅️ Strategy · Next: Healthcare ➡️
Important
Why it matters: Healthcare AI failures don't just cost money—they cost lives. IBM Watson for Oncology ($4B+ failure), Babylon Health ($4.2B → $0), Forward CarePods ($650M → shutdown), and Character.AI (teen suicide) demonstrate that healthcare and mental health AI requires fundamentally different safety standards. The checklist items below address failure patterns unique to these high-stakes domains.
⚠️ CRITICAL: Character.AI's chatbot asked a teen if he had "a plan" for suicide. When he said he didn't know if it would work, the bot replied "Don't talk that way. That's not a good reason not to go through with it." The teen died by suicide hours later.
- Suicide/Self-Harm Detection: Multi-modal detection (explicit statements, indirect signals like "bridges over 25m in NYC")
- Crisis Response Protocol: Immediate safety resources displayed on detection (crisis hotlines, text lines)
- Human Escalation Path: 24/7 human handoff capability for high-risk conversations
- No Harmful Encouragement: Responses validated to NEVER encourage self-harm, even inadvertently
- Dependency Monitoring: User engagement patterns monitored for unhealthy attachment/addiction
- Age-Appropriate Safeguards: Enhanced protections for minors (no romantic/sexual content, parental visibility)
Crisis Detection Performance Targets:
| Metric | Target | Rationale |
|---|---|---|
| Recall | 100% | Zero false negatives - every crisis must be detected |
| False Positive Rate | <5% | Minimize alert fatigue while maintaining recall |
| Response Time | <1s | Regulatory standard often 30s; aim for real-time |
| Severity Grading | 3+ levels | IMMEDIATE (<30s) → URGENT (<5min) → ELEVATED (<1hr) |
- Crisis Detection Recall: 100% recall validated (zero false negatives)
- False Positive Rate: <5% FPR to prevent alert fatigue
- Response Time SLA: <1s detection time (regulatory max: 30s)
- Multi-Stage Severity Grading: Tiered response based on crisis severity
- Trajectory Analysis: 4+ turn progressive deterioration detection
💡 The Yara AI Lesson
A seasoned tech entrepreneur, together with a clinical psychologist co-founder, built the Yara AI therapy product—then voluntarily shut it down:
"We stopped Yara because we realized we were building in an impossible space. AI can be wonderful for everyday stress, sleep troubles, or processing a difficult conversation. But the moment someone truly vulnerable reaches out—someone in crisis, someone with deep trauma, someone contemplating ending their life—AI becomes dangerous. Not just inadequate. Dangerous."
Key Insight: Even with clinical expertise and AI safety focus, the founder determined mental health AI for vulnerable populations is currently impossible to do safely without strict scope boundaries.
⚠️ Brown University (2025) identified 15 ethical violations in mental health chatbots including deceptive empathy, unfair discrimination, and amplifying feelings of rejection.
- Contextual Adaptation: Responses account for user's lived experiences (not one-size-fits-all)
- Therapeutic Collaboration: AI does not dominate conversations or impose solutions
- Honest Empathy: No deceptive phrases like "I see you" that create false human connection
- Bias Testing: Validated across gender, culture, religion, and mental health conditions
- No Belief Reinforcement: AI does not reinforce user's false beliefs or delusions
- Stigma Testing: Equal quality of response across conditions (depression vs. schizophrenia vs. addiction)
- Rejection Mitigation: Responses validated to not amplify feelings of rejection
⚠️ IBM Watson for Oncology provided "inappropriate or even unsafe recommendations" because it was trained on US data and deployed internationally without validation.
- Geographic Validation: Model validated in ALL deployment regions (not just development region)
- Local Clinical Guidelines: Recommendations align with local treatment standards and drug availability
- Unsafe Output Prevention: Clinical recommendations reviewed for potential patient harm
- Peer-Reviewed Evidence: Marketing claims substantiated by independent clinical validation
- Regulatory Approval: Appropriate clearances obtained (FDA, CE marking, etc.) before deployment
- Clinician Override: Healthcare professionals can always override AI recommendations
⚠️ Google's diabetic retinopathy AI achieved 90%+ accuracy in lab settings but failed in Thai clinics due to lighting conditions, image quality, and internet connectivity.
- Real-World Environment Testing: Validated in actual deployment conditions (lighting, equipment, connectivity)
- Image/Input Quality Thresholds: Clear rejection criteria when input quality is insufficient
- Graceful Degradation: System behavior defined for suboptimal conditions
- Workflow Integration: Tested within actual clinical workflows, not just standalone
⚠️ Forward Health CarePods removed human oversight from clinical contexts and failed due to "technical breakdowns, usability failures, and clinical safety concerns."
- Human Review Required: All clinical AI recommendations require human clinician review
- Clear AI Disclosure: Users understand they are interacting with AI, not a human
- Human Handoff Protocol: Defined triggers for escalation to human professional
- Usability with Real Patients: Interface tested with actual patient populations (not just healthy tech workers)
- Clinical Context Preserved: Automation does not remove necessary human judgment from high-stakes decisions
⚠️ Yara AI founder: "The Transformer architecture is just not very good at longitudinal observation, making it ill-equipped to see little signs that build over time."
- Longitudinal Pattern Tracking: System tracks patterns across sessions, not just within sessions
- Deterioration Detection: Ability to detect gradual worsening over time
- Session History Integration: Current session informed by relevant history
- Trend Alerting: Concerning trends flagged for human review
⚠️ Babylon Health exacerbated health inequity by being "more accessible to younger (healthier) people than to older and less healthy groups."
- Accessibility Validation: Tested with elderly, low-tech-literacy, and disabled users
- Health Equity Assessment: AI does not create/worsen disparities across populations
- Cognitive Load Assessment: Interface appropriate for users in distress or with cognitive limitations
- Economic Model Validation: Business model tested against actual usage patterns (not optimistic projections)
⚠️ The most responsible mental health AI companies define clear boundaries. Yara's founder: "AI can be wonderful for everyday stress, sleep troubles, or processing a difficult conversation. But the moment someone truly vulnerable reaches out... AI becomes dangerous."
- Clear Scope Definition: Documented what the AI is designed for AND what it is NOT designed for
- Scope Enforcement: Technical controls prevent AI from operating outside defined scope
- User Expectation Setting: Users informed upfront about AI capabilities and limitations
- Graceful Scope Exit: When user needs exceed scope, clear path to appropriate resources
- Founder Kill Switch: Team prepared to shut down if safety cannot be assured
⚠️ "That's too risky at this stage for high-stakes situations like caregiving. We want to make sure that everyone understands that you can't take what [an AI] comes back with at face value."
- Human Review Required: All clinical recommendations reviewed by humans
- Accessibility Validated: UI/UX tested with elderly populations (vision, hearing, cognitive)
- Caregiver Integration: Family/caregiver notification and involvement paths
- Technology Fear Mitigation: Design addresses technology anxiety in elderly users
- Cognitive Decline Detection: Patterns flagged to appropriate care providers
- Medication Safety: Drug interaction and dosage recommendations verified
- HIPAA/HITECH: PHI protection verified
- FDA Software as Medical Device (SaMD): Classification determined
- EU MDR: Medical device regulation compliance (if applicable)
- State Mental Health Laws: Jurisdiction-specific requirements met
- Clinical Trial Requirements: Human subjects research protocols followed
- Liability Insurance: Professional liability coverage adequate
Medical Device Regulatory Path (FDA De Novo):
- ISO 13485: Quality Management System gap analysis complete
- IEC 62304: Software lifecycle classification determined (Class A/B/C)
- ISO 14971: Risk management file with device-specific risks
- Design History File (DHF): Initiated for FDA submission
- Q-Submission: Pre-submission meeting scheduled with FDA
- Clinical Trial Protocol: IRB approval obtained
- Regulatory Consultant: Engaged for submission guidance
⚠️ For healthcare/therapeutic AI with physical device integration (RPM wearables, smart home, robotics), safety architecture must be formally proven BEFORE deployment. Retrofitting safety is 10x more expensive.
Safety Invariants (Must Be Formally Verified):
SAFETY_INVARIANTS = {
"no_harm": "System SHALL NOT execute commands that could physically harm users",
"fail_safe": "On any failure, system SHALL revert to safe default state",
"human_override": "Human operator SHALL always be able to override automated decisions",
"crisis_priority": "Crisis responses SHALL preempt all other operations",
"audit_complete": "All safety-critical decisions SHALL be logged with full context"
}
- Deterministic Safety Kernel: Real-time guarantees (<10ms response time)
- Formal Verification: Mathematical proofs (Z3/TLA+) for all safety invariants
- Triple Modular Redundancy: 3 independent checks for critical decisions
- Hardware E-Stop: Physical override capability for all automated actions
- Safety Interlock Controller: Prevents unsafe command sequences
- Audit Logger: ISO 13485 compliant, 100% coverage of safety decisions
- Watchdog Timers: Auto-failsafe on timeout
- Zero Unproven Invariants: All safety properties formally proven
Success Criteria:
- Zero safety-critical failures in 1M simulations
- <10ms safety check latency
- 100% audit trail coverage
- Hardware E-stop tested and documented
| Failure | What Happened | Year | Loss | Prevention Check |
|---|---|---|---|---|
| IBM Watson | US-trained model failed internationally | 2023 | $4B+ | [ ] Geographic validation |
| Babylon Health | Unvalidated clinical claims | 2023 | $4.2B | [ ] Third-party clinical validation |
| Forward CarePods | Removed human oversight | 2024 | $650M | [ ] Human-in-the-loop maintained |
| Character.AI | No crisis detection, encouraged self-harm | 2024 | Teen suicide | [ ] Crisis detection + response safety |
| Yara | LLM can't track longitudinal patterns | 2025 | Voluntary | [ ] Longitudinal tracking |
| Brown Study | 15 ethical violations in therapy bots | 2025 | Research | [ ] Ethics validation |
| Stanford Study | Stigma toward certain conditions | 2025 | Research | [ ] Bias testing |
| Epic Sepsis | 67% miss rate, alert fatigue | 2021 | Clinical harm | [ ] PPV optimization |
| Google Verily | Lab accuracy failed in real clinics | 2020 | Undisclosed | [ ] Real-world environment testing |
| Olive AI | Healthcare ops unicorn collapse | 2024 | ~$4B | [ ] Economic model validation |
⬆️ Navigation · ⬅️ Team · Next: Anti-Patterns ➡️
Important
Why it matters: These case studies represent billions in losses and destroyed careers. Each failure provides concrete patterns to detect and avoid in your own systems.
What happened: Zillow's iBuying algorithm made instant cash offers on homes. In 2021, the division was shut down with a $500M+ write-down and 25% workforce reduction.
Root Causes Identified:
- Adverse Selection: Model errors weren't random. Homeowners accepted overvalued offers, rejected undervalued ones. Zillow systematically acquired "lemons."
- Regime Change Blindness: Model built on pre-COVID trends failed to adapt to volatile post-pandemic market.
- Algorithmic Hubris: Point estimates treated as truth; uncertainty and tail risk ignored.
Anti-Patterns to Check:
- Adverse Selection Analysis: Documented how counterparties might exploit asymmetric information about model errors
- Regime Change Planning: Strategy for detecting and responding when historical patterns break
- Uncertainty Quantification: Decisions use confidence intervals/prediction intervals, not point estimates
- Human Override Protocol: Clear escalation path for high-stakes decisions beyond model recommendation
- Asymmetric Error Costs: Documented and optimized for different costs of over-prediction vs. under-prediction
What happened: Amazon's resume-screening AI, trained on 10 years of hiring data, systematically penalized female candidates. Project scrapped.
Root Causes Identified:
- Historical Bias in Training Data: Data reflected decade of male-dominated tech hiring.
- Proxy Discrimination: Even with "gender" removed, model found proxies ("women's chess club," women's college names).
Anti-Patterns to Check:
- Proxy Variable Audit: Tested whether protected attributes can be reconstructed from remaining features
- Historical Bias Assessment: Training data evaluated for patterns reflecting historical discrimination
- Disparate Impact Testing: Model outputs tested for statistical disparities across demographic groups
- Bias Reconstruction Testing: Verified model can't infer protected attributes from allowed features
- Regular Fairness Audits: Scheduled re-evaluation (not just one-time pre-launch testing)
- Diverse Evaluation Team: People from affected groups involved in testing and evaluation
What happened: Widely deployed clinical AI for early sepsis detection. External validation found it missed 67% of cases with ~12% Positive Predictive Value.
Root Causes Identified:
- Alert Fatigue: ~8 false alarms per true positive. Clinicians ignored the tool entirely.
- Overfitting to Source: Model overfitted to specific hospitals' coding practices and workflows.
- COVID Regime Shift: During pandemic, couldn't distinguish COVID symptoms from sepsis (43% alert increase).
Anti-Patterns to Check:
- External Validation Mandatory: Model tested outside development environment before deployment
- PPV in Context: Positive Predictive Value calculated for actual deployment prevalence (not just sensitivity/specificity)
- Alert Fatigue Assessment: If alerting system, false positive burden on users explicitly evaluated
- User Trust Tracking: Monitoring whether users actually follow/trust model recommendations
- Local Calibration Required: Strategy for adapting model to each deployment site's characteristics
- Regime Change Detection: Monitoring for environmental shifts that invalidate model assumptions
| Anti-Pattern | Zillow | Amazon | Epic | Your System |
|---|---|---|---|---|
| Adversarial/gaming not considered | ✓ | | | [ ] |
| Historical bias in training data | ✓ | ✓ | | [ ] |
| Proxy discrimination possible | | ✓ | | [ ] |
| No external validation | | | ✓ | [ ] |
| Alert/recommendation fatigue risk | | | ✓ | [ ] |
| Regime change blindness | ✓ | | ✓ | [ ] |
| Point estimates without uncertainty | ✓ | | | [ ] |
| No local/site calibration | | | ✓ | [ ] |
⬆️ Navigation · ⬅️ Healthcare · Next: Scoring ➡️
Count your checked items:
| Score | Readiness Level | Recommendation |
|---|---|---|
| 0-20% | 🔴 Prototype | Not ready for any real users |
| 21-40% | 🟠 Alpha | Internal testing only |
| 41-60% | 🟡 Beta | Limited external users with warnings |
| 61-80% | 🟢 Production Ready | Ready for general availability |
| 81-100% | 🏆 Enterprise Grade | Ready for mission-critical deployment |
⬆️ Navigation · ⬅️ Anti-Patterns · Next: Quick Wins ➡️
If you're overwhelmed, start with these high-impact items:
- Authentication: Never deploy without it
- Rate Limiting: Prevent abuse and cost overruns
- Error Handling: Graceful failures save users
- Monitoring: You can't fix what you can't see
- Backup Strategy: Because data loss is unforgivable
⬆️ Navigation · ⬅️ Scoring · Next: Downloads ➡️
| Format | Description | Download |
|---|---|---|
| Interactive HTML | Apple HIG-inspired checklist with auto-scoring, dark mode, lifecycle stages, gate classifications, progress tracking | Download HTML |
| CSV/Excel Template | Spreadsheet format with all 400+ items, Stage/Gate columns, priority levels - works in Excel, Google Sheets, Numbers | Download CSV |
| Architecture Diagram | Draw.io component diagram showing how all checklist components work together | Download .drawio |
Apple Human Interface Guidelines Design:
- SF Pro typography with optimal letter-spacing and weights
- Native dark mode support (prefers-color-scheme)
- Glassmorphism panels with backdrop blur effects
- Custom circular checkboxes with animated checkmarks
- Segmented control-style navigation tabs
- 8-point grid spacing system
- 44px touch targets for accessibility
- Smooth spring animations and micro-interactions
Functionality:
- Auto-Scoring: Real-time progress calculation with readiness badges
- Lifecycle Filtering: Filter items by stage (Ideation → Optimize)
- Gate Classification: Visual indicators for Mandatory/Advisory/Configurable items
- Local Storage: Progress persists across browser sessions
- Export/Import: Save and restore progress as JSON
- Print-Friendly: Optimized print stylesheet
- Responsive: Works on desktop, tablet, and mobile
Data Features:
- CSV Version: Sortable by Section/Stage/Gate/Priority, add custom notes, calculate scores with formulas
- Diagram: Editable in draw.io - shows 5-layer architecture with data flow
📝 Text Version of Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ USER & CLIENT LAYER │
│ Users → Auth (JWT/OAuth) → Rate Limiting → API Gateway → Input Validation │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ AGENTIC AI & ORCHESTRATION LAYER │
│ Orchestrator → Task Agents → RAG Agents → Multi-Agent → Human-in-Loop │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ MODEL & INFERENCE LAYER │
│ Prompt Engine → LLM Router → Primary/Fallback LLM → Output Safety │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA QUALITY & VALIDATION LAYER │
│ Feature Store → Schema Validator → Drift Detector → Leakage Scanner │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA & KNOWLEDGE LAYER │
│ Vector DB → Knowledge Base → Cache → Data Lakehouse → External Data │
└─────────────────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────────────────┐
│ INFRASTRUCTURE & COMPUTE LAYER │
│ Kubernetes → GPU Cluster → Model Serving (vLLM) → Queue → Secrets │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ CROSS-CUTTING: Monitoring │ Governance │ MLOps │ Evaluation │ FinOps │Debt │
└─────────────────────────────────────────────────────────────────────────────┘
⬆️ Navigation · ⬅️ Quick Wins · Next: Tech Guides ➡️
Choosing the right architecture and tools is critical. These decision frameworks are based on Google's 76-page AI Agents whitepaper, Anthropic's MCP documentation, and production engineer comparisons from 2024-2025.
2025 Insight: Google's ICLR 2025 research shows RAG paradoxically reduces a model's ability to abstain when appropriate—additional context increases confidence and can lead to more hallucination. Add sufficiency checks before generation.
| Pattern | When to Use | When NOT to Use | Stage | Key Research |
|---|---|---|---|---|
| Naive RAG | Simple Q&A, single doc source, prototyping | Multi-step reasoning, complex queries | POC | Baseline approach |
| Advanced RAG | Better accuracy needed, multiple sources, reranking | Simple use cases, low latency required | MVP/Pilot | Hybrid search + rerankers |
| Self-RAG | Model decides when/how much to retrieve | Static retrieval patterns sufficient | Pilot | 2024 research |
| Modular RAG | Custom pipelines, domain-specific needs | Quick prototypes, standard use cases | Production | Component-based architecture |
| Graph RAG | Knowledge graphs, entity relationships, complex reasoning | Unstructured text only, simple retrieval | Production | Microsoft Graph RAG |
| Agentic RAG | Dynamic retrieval, tool use, multi-step reasoning | Static Q&A, simple lookups | Production/Scale | Google whitepaper patterns |
| Reasoning RAG | System 2 thinking, industry challenges | Simple factual queries | Scale | 2025 survey |
Production RAG Best Practices (2025):
- Sufficiency check before generation (Google ICLR 2025) - see the sketch after this list
- Retrieve more context OR re-rank when insufficient
- Tune abstention threshold with confidence signals
- Hybrid search (vector + keyword) implemented
- Streaming data ingestion for real-time updates
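To make the sufficiency-check item concrete, here is a minimal, framework-agnostic sketch of abstaining before generation. All names are hypothetical and `call_llm` is a placeholder; real pipelines would use an LLM judge or a trained sufficiency classifier rather than a raw similarity threshold.

```python
def call_llm(query: str, context: str) -> str:
    """Placeholder for the actual generation call (hosted API or local model)."""
    return f"[answer to '{query}' grounded in {len(context)} chars of retrieved context]"

def is_sufficient(chunks: list[dict], min_score: float = 0.75, min_chunks: int = 2) -> bool:
    """Crude sufficiency proxy: enough retrieved chunks above a similarity threshold."""
    return sum(c["score"] >= min_score for c in chunks) >= min_chunks

def answer_with_abstention(query: str, chunks: list[dict]) -> str:
    if not is_sufficient(chunks):
        # Alternatives: widen retrieval, re-rank, or escalate instead of guessing.
        return "I don't have enough grounded context to answer that reliably."
    return call_llm(query, "\n".join(c["text"] for c in chunks))

retrieved = [{"text": "Refunds are issued within 14 days.", "score": 0.81},
             {"text": "Contact support for billing issues.", "score": 0.77}]
print(answer_with_abstention("What is the refund window?", retrieved))
```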
Google's recommended patterns from their 76-page AI Agents whitepaper for production multi-agent systems:
| Pattern | When to Use | Complexity | Google Use Case |
|---|---|---|---|
| Single Agent | Simple tasks, clear success criteria | Low | Task-oriented agents |
| Tool-Using Agent | External API calls, calculations | Medium | Navigation, search |
| Hierarchical Orchestration | Central agent routes to domain experts | High | Connected vehicle system |
| Diamond Pattern | Post-hoc moderation needed | High | Content safety |
| Peer-to-Peer Handoff | Autonomous query rerouting | High | User support flows |
| Collaborative Synthesis | Multiple agents contribute to response | Very High | Response mixer pattern |
| Adaptive Looping | Iterative refinement needed | Very High | Complex reasoning |
Agent Decision Checklist:
- Task complexity assessed (single-step vs. multi-step)
- Human-in-the-loop requirements documented
- Error tolerance and fallback strategy defined
- Coordination overhead budget set
- Safety pattern selected (Diamond for moderation)
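As a rough illustration of the Hierarchical Orchestration and Diamond patterns from the table above, here is a framework-agnostic sketch: a central router delegates to domain agents, and a separate post-hoc moderation step sits between agents and the user. All agent names and rules are hypothetical; real systems would wrap LLM calls, add retries, tracing, and human-in-the-loop hooks.

```python
# Hypothetical domain agents; in practice these would wrap LLM or tool calls.
def billing_agent(q: str) -> str: return f"[billing answer to: {q}]"
def tech_agent(q: str) -> str:    return f"[tech-support answer to: {q}]"
def fallback_agent(q: str) -> str: return "Let me connect you with a human specialist."

AGENTS = {"billing": billing_agent, "tech": tech_agent}

def classify(query: str) -> str:
    """Toy intent classifier; production systems would use an LLM or trained model."""
    q = query.lower()
    if "invoice" in q or "charge" in q:
        return "billing"
    if "error" in q or "crash" in q:
        return "tech"
    return "other"

def route(query: str) -> str:
    """Central orchestrator: classify intent, delegate to a domain agent."""
    return AGENTS.get(classify(query), fallback_agent)(query)

def moderate(text: str) -> str:
    """Diamond pattern: a separate post-hoc check before anything reaches the user."""
    banned = ("guaranteed returns", "medical advice")
    return "[response withheld for review]" if any(b in text.lower() for b in banned) else text

print(moderate(route("Why does my invoice show two charges?")))
```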
2025 Update: MCP adopted by OpenAI (March 2025), Google DeepMind (April 2025), and Microsoft Azure. Thousands of MCP servers built by community.
| Protocol | Best For | Adoption | Security Notes |
|---|---|---|---|
| MCP (Model Context Protocol) | Tool integration, data connectors | Industry standard (2025) | Review prompt injection risks |
| A2A (Agent-to-Agent) | Multi-agent communication | Google standard | Enterprise MAS |
| OpenAI Agents SDK | OpenAI ecosystem | Growing | Native tool use |
| Custom REST/gRPC | Full control, legacy systems | Stable | Existing infrastructure |
MCP Production Benefits (Anthropic 2025):
- Code execution with MCP: 98.7% token reduction in complex workflows
- API handles connection management, tool discovery, error handling
- Pre-built servers: Google Drive, Slack, GitHub, Postgres, Puppeteer
Based on production engineer comparisons and DataCamp analysis.
| Framework | Best For | Learning Curve | Production Readiness | When to Use |
|---|---|---|---|---|
| LangGraph | Stateful workflows, complex graphs | Steep | High | Intricate branching workflows, need replay/rollback |
| CrewAI | Role-based teams, rapid prototyping | Easy | Medium | Defined role delegation, fastest to prototype |
| AutoGen | Dynamic conversations, Azure ecosystem | Medium | High | Enterprise environments, Microsoft stack |
| OpenAI Agents SDK | OpenAI-native agents | Easy | High | OpenAI ecosystem, simple agents |
| LlamaIndex | RAG, document Q&A | Easy | High | Data ingestion, retrieval pipelines |
| Haystack | Production RAG pipelines | Medium | Very High | Enterprise RAG, self-hosted |
| vLLM | High-throughput inference | Medium | Very High | Serving at scale, PagedAttention |
| TGI | HuggingFace model serving | Easy | High | HF ecosystem, production serving |
Framework Selection by Use Case:
- Intricate stateful workflows → LangGraph (state transitions, visual debugging)
- Dynamic conversational systems → AutoGen (conversation-first design)
- Defined role delegation → CrewAI (fastest path to working prototype)
- Enterprise reliability → AutoGen (Microsoft-backed, Azure integration)
Based on LMArena Leaderboard, Hugging Face Open LLM Leaderboard, and Artificial Analysis. Updated December 2025.
| Use Case | Top Models (Dec 2025) | Open-Source Alternative | Notes |
|---|---|---|---|
| Complex reasoning | GPT-5, Claude Opus 4.5, Gemini 3.0 Pro | DeepSeek R1, Qwen3-235B | Gemini 3.0 Pro leads GPQA Diamond (91.9%) |
| High volume | GPT-5 Mini, Claude Haiku 4.5, Gemini 2.5 Flash | Qwen3 (0.6B-235B range), Jamba 1.6 Mini | Gemini 2.5 Flash: 372 tokens/sec |
| On-premise/Privacy | Llama 4 Maverick (400B), Mistral Large | DeepSeek-V3.1, Qwen3 Next | Llama 4 Scout fits single H100 (Int4) |
| Long context (1M+) | Gemini 3.0 (10M), Llama 4 Scout (10M) | Jamba 1.6 (256K), Qwen3 (128K) | Llama 4 Scout: 10M token context |
| Code generation | Claude Opus 4.5, GPT-5 | DeepSeek Coder, Codestral | Claude Opus 4.5: first >80% SWE-Bench |
| Multimodal | GPT-5, Gemini 3.0, Claude Opus 4.5 | Llama 4 (native multimodal), SmolVLM | Llama 4: natively multimodal, 200 languages |
| Agents/Tool use | Gemini 3.0, Claude Sonnet 4.5 | Qwen3-Agent, Llama 4 Maverick | Sonnet 4.5: 61.4% OSWorld |
| EU data residency | Mistral (EU), Azure OpenAI (EU) | Mistral Large, Jamba 1.6 | Mistral HQ in Paris |
| Edge/Mobile | GPT-5 Nano, Gemini 2.5 Flash-Lite | Jamba Reasoning 3B, Qwen3-4B | Jamba 3B: 250K context on phones |
Latest Model Releases (Q4 2025):
- Gemini 3.0 Pro (Nov 2025): #1 on LMArena, 41% on Humanity's Last Exam
- Claude Opus 4.5 (Nov 2025): First model >80% SWE-Bench Verified
- GPT-5.1 (Nov 2025): Faster reasoning, extended prompt caching
- Llama 4 (Apr 2025): MoE architecture, 10M context (Scout), 400B params (Maverick)
- Qwen3 Next 80B (Sep 2025): 3× smaller than 235B, 4× more experts
Hugging Face CEO Insight (Nov 2025):
"You can use a smaller, more specialized model that is going to be cheaper, faster, that you're going to be able to run on your infrastructure as an enterprise. I think that is the future of AI."
Model Decision Checklist:
- Accuracy requirements benchmarked against leaderboards
- Token economics calculated (input/output pricing; see the cost sketch after this checklist)
- Context window requirements assessed
- Latency SLA vs. model size trade-off evaluated
- Data privacy/residency requirements documented
- Fine-tuning vs. RAG vs. prompt engineering decision made
- Open-source license compatibility verified
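For the token-economics item above, a back-of-the-envelope cost sketch; the prices and volumes below are placeholder assumptions, not current vendor pricing.

```python
def monthly_llm_cost(requests_per_day: int,
                     avg_input_tokens: int,
                     avg_output_tokens: int,
                     price_in_per_1m: float,
                     price_out_per_1m: float,
                     days: int = 30) -> float:
    """Rough monthly spend estimate; ignores caching, retries, and batch discounts."""
    tokens_in = requests_per_day * avg_input_tokens * days
    tokens_out = requests_per_day * avg_output_tokens * days
    return tokens_in / 1e6 * price_in_per_1m + tokens_out / 1e6 * price_out_per_1m

# Placeholder figures: 50k requests/day, 1.5k input / 400 output tokens, $1.00/$4.00 per 1M tokens.
print(f"${monthly_llm_cost(50_000, 1_500, 400, 1.00, 4.00):,.0f}/month")  # $4,650/month
```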
📖 Deep Dive: See docs/TECHNOLOGY-SELECTION-GUIDE.md for detailed decision trees and case studies.
⬆️ Navigation · ⬅️ Downloads · Next: Resources ➡️
For deeper dives into specific topics, see our detailed reference guides:
| Document | Description |
|---|---|
| Lifecycle Stages Guide | Detailed 8-stage workflow with gate requirements and FDA overlay |
| Technology Selection Guide | RAG, Agent, Framework, and Model decision frameworks |
| Assured Intelligence Guide | Conformal Prediction, Calibration, Causal Inference, Zero-False-Negative Engineering |
| Failure Taxonomy Deep Dive | Detailed analysis of the three failure domains: Data Schism, Metric Gap, Technical Debt |
| Case Studies | Expanded forensic analysis of Zillow ($500M+), Amazon (bias), Epic (clinical harm) |
| Healthcare AI Case Studies | 12 healthcare/mental health AI failures: IBM Watson, Babylon Health, Character.AI, Yara AI, and more |
| MLOps Maturity Model | Assessment tool and progression roadmap from Level 0 to Level 3 |
Agent & Orchestration:
- LangChain - RAG and agent framework
- LlamaIndex - Knowledge-driven AI applications
- AutoGen - Multi-agent conversation framework
- CrewAI - Task-oriented multi-agent coordination
- Semantic Kernel - Microsoft's modular AI framework
Evaluation & Testing:
- DeepEval - LLM evaluation with CI/CD support
- RAGAS - RAG evaluation framework
- Promptfoo - LLM red teaming and testing
- Arize Phoenix - LLM observability
Serving & Infrastructure:
- vLLM - High-throughput LLM serving
- Ray Serve - Scalable model serving
- Triton Inference Server - Multi-model serving
MLOps & Monitoring:
- Weights & Biases - ML experiment tracking
- MLflow - ML lifecycle management
- Prometheus - Monitoring and alerting
- Grafana - Observability dashboards
Infrastructure:
- Terraform - Infrastructure as Code
- Kubernetes - Container orchestration
- The Production ML Handbook
- Google's ML Best Practices
- Microsoft's Responsible AI
- NIST AI Risk Management Framework
- EU AI Act
- OWASP LLM Top 10
⬆️ Navigation · ⬅️ Tech Guides · Next: Contributing ➡️
This checklist is a living document. Please contribute your hard-won lessons:
- Fork the repository
- Add your items with practical examples
- Submit a pull request
- Share your production horror stories in discussions
⬆️ Navigation · ⬅️ Resources · Next: Credit ➡️
If you find this checklist helpful, please consider:
- Star this repo ⭐ to help others discover it
- Credit the source when sharing or adapting:
AI Production Readiness Checklist by Aejaz Sheriff at Pragmatic Logic AI
- Link back to this repository in your documentation, presentations, or articles
- Share on LinkedIn, Twitter/X, or your tech community
Your attribution helps support the continued development of open-source AI resources!
⬆️ Navigation · ⬅️ Contributing · Next: License ➡️
This project uses dual licensing to maximize both adoption and attribution:
| Content | License | What You Can Do |
|---|---|---|
| Code (HTML, CSV, templates) | MIT | Use, modify, distribute freely |
| Documentation (Markdown, guides) | CC BY 4.0 | Share and adapt with attribution |
Attribution for documentation:
AI Production Readiness Checklist by Pragmatic Logic AI
⬆️ Navigation · ⬅️ Please Credit · Next: Credits ➡️
Created by Aejaz Sheriff at Pragmatic Logic AI based on:
- 27 years of enterprise system development
- Countless production incidents and lessons learned
- Contributions from the amazing AI community
- Industry research from Gartner, McKinsey, PwC, and NVIDIA
🏷️ Keywords & Topics
Leadership & Strategy: CTO AI Strategy VP of AI Head of ML AI Team Leadership AI Executive Guide AI Board Reporting AI Risk Management Build vs Buy AI AI Vendor Selection AI Steering Committee AI Portfolio Management AI ROI Metrics
Personas & Roles: Startup AI Checklist Enterprise AI Architecture Solo Developer AI Healthcare AI Compliance Financial Services AI Data Scientist to ML Engineer Platform Team MLOps AI Compliance Officer Agency AI Development Government AI Public Sector AI
Production & Operations: AI Production LLM Deployment MLOps AI Governance Enterprise AI Generative AI AI Strategy AI Architecture Multi-Agent Systems RAG Agentic RAG ReAct Pattern Reason Act Pattern MCP Model Context Protocol Prompt Caching LLM Latency Optimization External Reflection Agent Reflection Prompt Engineering AI Security
Evaluation & Quality: LLM Evaluation Holistic Agent Evaluation WAI-AI Working Alliance Inventory LLM-as-Judge Persona Consistency AI FinOps Red Teaming OWASP LLM Golden Dataset Testing Hallucination Detection Bias Testing
Compliance & Regulation: AI Compliance EU AI Act IEC 61508 ISO 13485 IEC 62304 FDA De Novo FDA SaMD HIPAA AI SOC 2 AI FedRAMP AI Model Risk Management SR 11-7 Fair Lending AI
Healthcare & Safety: Responsible AI Healthcare AI Mental Health AI Safety Clinical AI Validation Therapeutic AI AI Ethics Safety-Critical AI Formal Verification Safety Invariants AI Crisis Detection Crisis Detection Recall
Data & ML Engineering: Training-Serving Skew Data Leakage Detection Model Drift AI Technical Debt Feature Store Edge AI Edge Cloud Split Model Registry A/B Testing ML Canary Deployment AI
Assured Intelligence: Conformal Prediction Causal AI Uncertainty Quantification Probability Calibration Zero-False-Negative Selective Prediction OOD Detection DoWhy CausalML Model Calibration ECE
⭐ Star this repo if it helps you avoid production disasters!
"In production, no one can hear your model scream."
