
Product Requirements Document

InformaticaProjectAnalysis

Version: 1.7.0 (roadmap)
Author: ad25343
Last Updated: 2026-03-19
License: CC BY-NC 4.0
Repository: github.com/ad25343/InformaticaProjectAnalysis
Contact: github.com/ad25343/InformaticaProjectAnalysis/issues

Bottom line: AI-powered pre-conversion analysis for teams migrating from Informatica PowerCenter to open code (Python, dbt, PySpark). Reads all mapping XMLs from a PowerCenter project, uses Claude to interpret SQL overrides and expression logic, identifies cross-mapping patterns, builds a dependency graph, and produces a conversion strategy that humans review before any conversion begins. GenAI is core to this tool — structural fingerprinting tells you the shape of your mappings; AI interpretation tells you what they actually do.

Deployment target: This is a department-level tool — designed to be run by a single data engineering or migration team inside an organisation. It runs as a local Python server (or on a team VM) and is accessed by a small number of named users (typically 3–15). It is not a multi-tenant SaaS platform. Enterprise deployment patterns (SSO, LDAP, multi-tenancy, compliance audit logs, horizontal scaling) are noted throughout as [ENTERPRISE-TODO] items for teams that outgrow the department scope.


1. Background — What Is Informatica PowerCenter?

Informatica PowerCenter is an enterprise ETL (Extract, Transform, Load) platform that has been used by large organizations — banks, insurers, telecoms, government agencies — for over two decades to move and transform data between systems. It is one of the most widely deployed data integration tools in the world.

In PowerCenter, each unit of data transformation logic is called a mapping. A mapping defines how data flows from a source (e.g., a database table, flat file, or API) through a series of transformations (lookups, expressions, filters, aggregations, routers) and into a target table. A single mapping might load a customer dimension table; another might aggregate daily transactions into a monthly summary.

A typical enterprise PowerCenter environment contains dozens to hundreds of these mappings, organized into folders and grouped by workflows that define when and in what order they run. All of this configuration is stored internally by PowerCenter, but it can be exported as XML files — one XML per mapping, per workflow, per parameter file.

These XML exports are the raw material that this tool works with.


2. The Problem — From SaaS Tools to Open Code

If your organization runs Informatica PowerCenter, you already know its gaps. The platform does what it was built to do — batch ETL on structured data — but it has not kept pace with what engineering teams are now being asked to deliver.

The most acute gap is GenAI. Business teams are demanding AI-enabled data pipelines — LLMs (Large Language Models) for classification, enrichment, anomaly detection, and summarization embedded directly in the data flow. Informatica PowerCenter has no native GenAI capability. It cannot call an LLM, cannot integrate with vector databases, and has no roadmap for AI-augmented pipelines. Staying on PowerCenter means the data infrastructure simply cannot participate in what the business is trying to do next.

The switch to open code closes these gaps. Python, PySpark, and dbt integrate with any model, any API, any tool in the ecosystem — and produce a codebase the team owns, version-controls, tests, and deploys without platform lock-in. For teams that have hit PowerCenter's ceiling, the destination is not another SaaS tool. The destination is a codebase.

This changes the conversion problem fundamentally. A SaaS-to-SaaS migration can lean on the destination platform's import wizards. But a SaaS-to-code migration needs to produce well-structured, maintainable source code — not just functionally equivalent scripts.

This is a conversion problem, not a rewrite. The business logic encoded in those mappings is tested, production-proven, and (often) poorly documented. The goal is to faithfully reproduce that logic in open code, not to redesign the data architecture from scratch.

Why One-at-a-Time Conversion Fails

The naive approach is to convert each mapping in isolation. Take mapping XML #1, parse it, produce the equivalent Python script, move on to mapping #2. This works — but it produces exactly the kind of unmaintainable codebase that teams are trying to escape:

  • If 14 mappings all follow the same truncate-and-load pattern differing only by table name, you get 14 separate scripts instead of one parameterized template plus a config file.
  • If 8 mappings share the same SCD2 (slowly changing dimension) pattern, each gets its own copy of the SCD2 logic instead of sharing a common implementation.
  • Shared lookup tables are redefined independently in every script.
  • There is no dependency graph — no way to know which scripts must run before others.
  • There is no project-level structure — no unified sources, no shared utilities, no layered organization.

The result is a converted codebase that works but is unmaintainable — hundreds of files with massive duplication, no structure, and no awareness of how the pieces fit together. You left a proprietary tool and landed in a code mess.

The fix is to analyze the full project before converting any individual mapping.


3. What This Tool Does

InformaticaProjectAnalysis is the pre-conversion analysis step for teams migrating from Informatica PowerCenter to open code. It reads all the mapping XMLs from a PowerCenter project (the complete collection of exported mappings, workflows, and parameter files) and produces a conversion strategy that answers three questions:

  1. Which mappings share the same structural pattern? Mappings that follow the same transformation flow (e.g., source → lookup → expression → target) with only table names and column names differing are candidates for a single parameterized template. The tool groups them together, shows the evidence, and assigns a confidence level.

  2. What depends on what? If mapping A loads DIM_CUSTOMER and mapping B does a lookup against DIM_CUSTOMER, then B depends on A — it must run after A completes. The tool builds a dependency graph across all mappings and computes a safe execution order.

  3. What needs human attention? Not every mapping can be automatically classified with high confidence. Custom SQL overrides, missing definitions, and unusual transformation patterns reduce certainty. The tool flags these for tech lead review.

The output is a strategy document (PDF + Excel + JSON) that tech leads and leadership review and approve before any conversion begins.

The tool observes and surfaces structural characteristics. It does not prescribe which target language to use (Python, dbt, PySpark, etc.), which warehouse to target, or how to orchestrate the converted pipelines. Those decisions belong to the humans reviewing the strategy and the conversion tools they choose.

What This Tool Does NOT Do

Out of scope — by design:

  • Effort / cost estimation. The tool surfaces complexity distribution and scope reduction figures. It does not translate these into story points, team-weeks, or monetary estimates. Those models are project- and team-specific and belong in your delivery toolchain, not here.

  • Migration wave planning / sprint sequencing. The dependency DAG gives you safe execution order for running the mappings in production. Choosing which mappings to convert first (risk management, incremental delivery, team allocation) is a programme management decision, not a structural analysis one.

  • Business criticality / impact scoring. The tool knows structural complexity but not business importance. Which mapping feeds the overnight regulatory batch? Which one is the CEO dashboard? That context does not exist in the XML files and must come from the organisation. It is not in scope for automated analysis.

  • The actual conversion. Code generation — producing Python, dbt, PySpark, or any other target — is a separate project. This tool produces the strategy that conversion is executed against.

Out of scope — enterprise deployment features:

These are deliberate non-goals for the department-level tool. Teams that outgrow these limits should treat the move from the department tool to a proper platform as a migration project in its own right.

  • [ENTERPRISE-TODO] Single Sign-On / LDAP / OAuth. Auth is simple username+password in SQLite. Enterprise identity integration (Okta, Azure AD, corporate LDAP) is out of scope.
  • [ENTERPRISE-TODO] Multi-tenancy. One SQLite database per deployment. Multiple independent teams should run separate instances, not share one.
  • [ENTERPRISE-TODO] Horizontal scaling / load balancing. Single-process FastAPI + SQLite. Sufficient for a department; not designed for concurrent large-estate analyses across dozens of projects simultaneously.
  • [ENTERPRISE-TODO] Compliance audit logging (SOX, GDPR). The audit trail is for the migration review process, not for regulatory compliance frameworks.
  • [ENTERPRISE-TODO] High-availability / disaster recovery. SQLite WAL mode is robust for the department use case. Backup strategy is the team's responsibility.

4. Target Personas

This is a department-level tool used by a small, named team (typically 3–15 people). From v1.6 onwards, each user logs in with a username and password and is assigned one of four roles. Role determines what they see and what they can do.

Role Who Access
TECH_LEAD Data engineering tech lead Full access — analysis, review, override, approve, tracking
MANAGER Engineering manager or delivery lead Dashboard, PDF/Excel summaries, progress tracking (read), portfolio view
DEVELOPER Conversion developer Their assigned mappings, tracking status updates, reconciliation for their work
ANALYST Migration analyst or PO All read access, pattern library, cross-project portfolio

TECH_LEAD — Primary persona

Reviews the strategy in detail. Validates pattern groupings, confirms or adjusts assignments, identifies edge cases. Needs: structural evidence, dependency graph, per-mapping confidence levels, override controls, reconciliation manifest, group notes. Approves or rejects the final strategy.

MANAGER — Secondary persona

Reviews the summary layer. Needs: mapping count, pattern groups, complexity distribution, risk flags, scope reduction %, dependency depth, conversion progress (% complete by stage), blocked mappings list. Does not need per-mapping technical detail. Receives the PDF report.

DEVELOPER — Tertiary persona

Receives their mapping assignments from the approved strategy. Updates conversion status (IN_PROGRESS → CONVERTED → VALIDATED). Views the reconciliation manifest for their mappings. Does not review or override groupings — that is the tech lead's domain.

ANALYST — Quaternary persona

Cross-project visibility. Monitors multiple projects in the portfolio view. Accesses the pattern library to see what patterns have been confirmed across projects. Full read access everywhere. Cannot override groupings or update tracking status.


Authentication note (v1.6): Simple username + hashed password stored in SQLite. JWT session token, 8-hour expiry. No OAuth, LDAP, or SSO — this is a department tool. [ENTERPRISE-TODO] For corporate identity integration, wrap the FastAPI app with an OAuth2 reverse proxy (e.g., oauth2-proxy in front of nginx).


5. Core Principles

  1. Analyze all N mappings together. Cross-mapping references (a mapping that looks up a table produced by another mapping) can only be detected when the full project is visible. Analyzing mappings in isolation misses these relationships.

  2. Pattern grouping is the primary goal. Structurally similar mappings become one template + config instead of N separate files. A project of 50 mappings might collapse into 8 templates + 12 unique files.

  3. Strategy is a recommendation with evidence. Every grouping shows the member mappings, structural evidence, parameter differences, and confidence level. Nothing is a black box.

  4. Converting, not rewriting. The analysis observes what exists and recommends smart conversion. It does not redesign the data architecture or suggest how the project "should have been built."

  5. Variation handling is explicit. Mappings within a group are not all identical. The tool classifies how much each member differs from the group's canonical pattern and surfaces this transparently.

  6. Classification by structural behavior, not naming conventions. Real-world projects do not follow consistent naming. A mapping called TBL_047_PROC might be a simple dimension load. Classification uses transformation topology and graph position, not names.

  7. Honest uncertainty. The strategy distinguishes high-confidence classifications from ambiguous ones that need human confirmation. The output says "here are 38 I'm confident about, 8 that need a human to confirm, and 4 I couldn't classify."

  8. Security is paramount. All input is validated, all XML parsing is hardened against XXE (XML External Entity) injection, no secrets in code or logs, all dependencies audited.


6. Input — Project Configuration

The primary input is a *.project.yaml file that defines the full migration scope. This is the single source of truth for the analysis.

project:
  name: "Project Name"
  version: "1.0"
  owner: "Team Name"

source:
  type: folder              # folder | repo | zip | s3
  location: "/path/to/exports/"

scope:
  mappings:
    include: ["mappings/**/*.xml"]
    exclude: ["mappings/archive/**"]
  workflows:
    include: ["workflows/**/*.xml"]
  parameters:
    include: ["parameter_files/*.xml"]
  default_parameter_env: "dev"

analysis:
  fingerprint_strictness: "moderate"
  min_group_size: 2
  confidence_threshold: 0.7
  detect_shared_assets: true
  build_dependency_dag: true        # DAG = Directed Acyclic Graph
  classify_expressions: true

review:
  tech_lead:
    name: "Name"
    email: "email"
  leadership:
    name: "Name"
    email: "email"

output:
  strategy_format: "json"
  output_dir: "/output/"

notifications:
  webhook_url: ""
  events:
    on_analysis_complete: true
    on_strategy_ready: true
    on_review_approved: true

Source types: folder (local path), repo (Git URL + branch), zip (uploaded archive), s3 (bucket path).


7. Pipeline Architecture

*.project.yaml (uploaded via UI, dropped in watcher dir, or POSTed via API)
    │
    ▼
Phase 1   Discovery
          ├── Step 1.1  Source Resolution       [deterministic]
          │             Clone repo / mount folder / extract ZIP / pull S3
          │             Scan using scope globs → list of mapping XMLs
          │
          ├── Step 1.2  Parse All Mappings      [deterministic]
          │             Parse each mapping XML → structural components
          │             Cache results by file content hash (SHA-256)
          │             Aggregate into project-level graph
          │
          ├── Step 1.3  Build Project Graph      [deterministic + AI-assisted]
          │             Cross-mapping dependency edges (target → lookup references)
          │             Shared asset detection (tables referenced by 3+ mappings)
          │             Repeated expression detection
          │             AI: interpret custom SQL overrides, classify expression
          │             complexity, infer missing mapplet behavior, detect
          │             implicit dependencies
          │
    ▼
Phase 2   Pattern Grouping                      [AI-assisted]
          ├── Step 2.1  Structural Fingerprinting
          │             Extract "spine" per mapping — the ordered sequence of
          │             transformation types from source to target (see §9.1)
          │             Group by matching spine
          │
          ├── Step 2.2  Variation Classification
          │             Within each spine group, diff parameters
          │             Classify: Tier 1 (parameter only), Tier 2 (minor structural),
          │             Tier 3 (fundamental — does not group)
          │
          ├── Step 2.3  AI Pattern Naming + Confidence
          │             Name groups in human terms
          │             Assign confidence per mapping-to-group assignment
          │             Flag edge cases and ambiguous mappings
          │
    ▼
Phase 3   Strategy Document Generation
          ├── Step 3.1  PDF Generation          [leadership summary + tech lead detail]
          ├── Step 3.2  Excel Generation        [5 sheets: groups, DAG, shared assets,
          │                                      assignments, risk flags]
          ├── Step 3.3  Strategy JSON           [machine-readable output]
          │
    │
    ▼
Phase 4   ◼ Human Gate — Strategy Review
          Tech leads + leadership review in the UI.
          Actions: confirm groupings, override assignments, add notes.
          Decision: APPROVE → Phase 5 | REJECT → re-analysis with notes
    │
    ▼
Phase 5   Strategy Delivery
          Approved strategy available as:
          ├── JSON file download
          ├── PDF + Excel download
          └── API endpoint (GET /api/projects/{id}/strategy.json)

8. Phase 1 — Discovery (Detail)

8.1 Source Resolution

The source section of the project config determines how mappings are located:

Type Resolution
folder Scan location recursively using scope glob patterns
repo Clone location at branch, scan path using scope globs
zip Extract uploaded archive, scan using scope globs
s3 Pull objects matching scope globs from location bucket path

All resolved files are auto-detected by content: mapping XML, workflow XML, parameter file, or unknown. Files that don't match any known type are logged and skipped.
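Content-based type detection for Step 1.1 might look like the sketch below. The root/child element names used as markers (MAPPING, WORKFLOW, PARAMETER) are assumptions about the PowerCenter export format, not a verified schema.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Tag-to-type heuristics — the element names are assumptions, not a schema.
_MARKERS = {"MAPPING": "mapping", "WORKFLOW": "workflow", "PARAMETER": "parameter"}

def detect_type(path: Path) -> str:
    """Classify one resolved file by content, never by filename."""
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return "unknown"                     # not well-formed XML: log and skip
    tags = {el.tag.upper() for el in root.iter()}
    for marker, kind in _MARKERS.items():
        if marker in tags:
            return kind
    return "unknown"

def scan(location: str, include: list[str], exclude: list[str]) -> dict[str, list[Path]]:
    """Apply the scope globs, then bucket surviving files by detected type."""
    base = Path(location)
    excluded = {p for pat in exclude for p in base.glob(pat)}
    buckets: dict[str, list[Path]] = {}
    for pattern in include:
        for path in sorted(base.glob(pattern)):
            if path in excluded or not path.is_file():
                continue
            buckets.setdefault(detect_type(path), []).append(path)
    return buckets
```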

8.2 Mapping Parser

Each mapping XML is parsed to extract its structural components:

Per-mapping parse output:
├── transformations[]         Each transformation with type, ports, expressions
│   ├── type                  "Expression", "Lookup", "Aggregator", "Router", etc.
│   ├── ports[]               Input/output ports with datatypes
│   ├── expressions[]         Expression bodies per port
│   └── table_attribs{}       Lookup table name, conditions, etc.
├── connectors[]              Wiring: from_instance → to_instance
├── sources[]                 Source tables with db_type, owner, fields
├── targets[]                 Target tables with db_type, owner, fields
├── parameters[]              $$VAR definitions with defaults
├── mapplet_instances[]       Mapplet references (expanded if definitions available)
└── sql_overrides[]           Custom SQL on Source Qualifiers

Results are cached by SHA-256 hash of file content. Unchanged files are not re-parsed on incremental runs.
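A stripped-down sketch of Step 1.2 with the content-hash cache is below. The element and attribute names (TRANSFORMATION, CONNECTOR, FROMINSTANCE, and so on) are assumptions about the export format, the cache is an in-memory dict rather than the SQLite ParseCache table, and per §5 principle 8 a production parser should be XXE-hardened (e.g. defusedxml).

```python
import hashlib
import xml.etree.ElementTree as ET
# NOTE: stdlib ElementTree does not fetch external entities, but a hardened
# parser such as defusedxml is the right choice in production (see §5).

_CACHE: dict[str, dict] = {}   # file_hash -> parse_output (SQLite in the real tool)

def parse_mapping(xml_text: str) -> dict:
    """Extract the structural components one mapping XML contributes to the
    project graph. Element/attribute names here are illustrative assumptions."""
    file_hash = hashlib.sha256(xml_text.encode()).hexdigest()
    if file_hash in _CACHE:                  # unchanged file: skip re-parsing
        return _CACHE[file_hash]
    root = ET.fromstring(xml_text)
    out = {
        "transformations": [{"name": t.get("NAME"), "type": t.get("TYPE")}
                            for t in root.iter("TRANSFORMATION")],
        "connectors": [(c.get("FROMINSTANCE"), c.get("TOINSTANCE"))
                       for c in root.iter("CONNECTOR")],
        "sources": [s.get("NAME") for s in root.iter("SOURCE")],
        "targets": [t.get("NAME") for t in root.iter("TARGET")],
    }
    _CACHE[file_hash] = out
    return out
```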

8.3 Cross-Mapping Graph Construction

From the per-mapping parse results, the tool builds a project-level graph:

Dependency edges — for each mapping, inspect the lookup table name of every Lookup transformation. If that table matches another mapping's target table, create a dependency edge: the mapping performing the lookup depends on the mapping that loads the table.

Shared assets — tables that appear as Lookup sources in three or more mappings across the project.

Repeated expressions — expression bodies that appear verbatim or structurally equivalent in 4+ mappings.
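The dependency-edge and shared-asset derivation above reduces to a small join over per-mapping parse results. A sketch, assuming each mapping has already been parsed down to its target and lookup table names:

```python
def build_graph(parsed: dict[str, dict]) -> dict:
    """parsed: mapping_name -> {"targets": [...], "lookup_tables": [...]}."""
    # Which mapping produces each target table?
    producer = {tbl: name for name, p in parsed.items() for tbl in p["targets"]}
    edges, lookup_refs = [], {}
    for name, p in parsed.items():
        for tbl in p["lookup_tables"]:
            lookup_refs.setdefault(tbl, []).append(name)
            if tbl in producer and producer[tbl] != name:
                # The looking-up mapping depends on the mapping loading the table
                edges.append({"from_mapping": producer[tbl],
                              "to_mapping": name, "via_table": tbl})
    # Shared assets: lookup tables referenced by 3+ mappings
    shared = {tbl: refs for tbl, refs in lookup_refs.items() if len(refs) >= 3}
    return {"dependency_dag": edges, "shared_assets": shared}
```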

8.4 AI-Assisted Interpretation

AI is called to interpret elements the parser cannot classify deterministically:

  • Custom SQL overrides in Source Qualifiers — what do they do, are two overrides structurally equivalent?
  • Expression complexity — is a 40-line DECODE the same pattern as a 3-line IIF, or fundamentally different?
  • Missing mapplet definitions — infer purpose from input/output ports and wiring
  • Implicit dependencies — references hidden in SQL or stored procedure calls

9. Phase 2 — Pattern Grouping (Detail)

9.1 Structural Fingerprinting

Each mapping's transformation topology is reduced to a canonical spine: the ordered sequence of transformation types from source to target, derived from the connectors graph.

Example spines:

  • SQ → EXP → TARGET (simple dimension load)
  • SQ → LKP → EXP → RTR → UPD → TARGET (SCD2)
  • SQ(×3) → JNR → LKP(×2) → EXP → RTR → TARGET(×2) (complex multi-source)

Mappings with matching spines are candidates for the same pattern group.
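Spine extraction is a topological walk over the connectors graph with run-collapsing. The abbreviation map below is illustrative, not the tool's actual vocabulary:

```python
from itertools import groupby

# Illustrative abbreviation map — the real tool's vocabulary may differ
ABBREV = {"Source Qualifier": "SQ", "Expression": "EXP", "Lookup": "LKP",
          "Router": "RTR", "Joiner": "JNR", "Update Strategy": "UPD",
          "Aggregator": "AGG", "Filter": "FIL"}

def spine(types: dict[str, str], connectors: list[tuple[str, str]]) -> str:
    """Reduce a mapping's connector graph to its canonical spine string.
    types: instance name -> transformation type (targets typed "TARGET")."""
    indeg = {n: 0 for n in types}
    succ = {n: [] for n in types}
    for frm, to in connectors:
        succ[frm].append(to)
        indeg[to] += 1
    # Kahn's algorithm gives a deterministic topological order
    order, ready = [], sorted(n for n, d in indeg.items() if d == 0)
    while ready:
        node = ready.pop(0)
        order.append(node)
        for nxt in succ[node]:
            indeg[nxt] -= 1
            if indeg[nxt] == 0:
                ready.append(nxt)
    # Collapse runs of the same type: SQ, SQ, SQ -> "SQ(×3)"
    steps = [ABBREV.get(types[n], types[n]) for n in order]
    parts = [step if (n := len(list(run))) == 1 else f"{step}(×{n})"
             for step, run in groupby(steps)]
    return " → ".join(parts)
```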

9.2 Variation Tiers

Within a spine group, variation is classified:

Tier 1 — Parameter variation. Structurally identical. Only table names, column names, filter values, connection strings differ. One template + one config file.

Tier 2 — Minor structural variation. Core flow is the same, but minor differences: an extra Expression, a Filter present in some but not all, different Lookup counts. Template accommodates variation via config flags.

Tier 3 — Fundamental structural variation. Different transformation types or flow shapes within the same spine match. Does not group. Convert individually.

The boundary between Tier 2 and Tier 3 is spine plus complexity profile: two mappings belong in the same group when they share a spine AND their complexity at each step is comparable.

9.3 Confidence Levels

Each mapping-to-group assignment carries a confidence level:

Confidence Meaning Action
HIGH Spine match + Tier 1 variation + no flags Auto-confirmed
MEDIUM Spine match + Tier 2 variation or minor flags Tech lead should confirm
LOW Weak spine match, significant variation, or AI uncertainty Requires human review
UNCLASSIFIED No spine match, custom SQL, missing definitions Convert individually

The analysis.confidence_threshold in the project config controls the cutoff below which mappings are flagged for human review (default: 0.7).
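One plausible mapping from a numeric match score to the four levels is sketched below; the scoring inputs and the 0.0 sentinel for "no spine match" are illustrative assumptions, and only the 0.7 cutoff comes from the project config.

```python
def confidence_level(score: float, variation_tier: int, flags: list[str],
                     threshold: float = 0.7) -> str:
    """Map a numeric spine-match score to the review levels of §9.3.
    threshold mirrors analysis.confidence_threshold (default 0.7)."""
    if score == 0.0:
        return "UNCLASSIFIED"        # no spine match at all — convert individually
    if score < threshold:
        return "LOW"                 # below cutoff — requires human review
    if variation_tier == 1 and not flags:
        return "HIGH"                # auto-confirmed
    return "MEDIUM"                  # tech lead should confirm
```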


10. Phase 3 — Strategy Document (Detail)

10.1 PDF Report

Two layers in one document:

Page 1 — Leadership Summary

  • Project name, analysis date, mapping count
  • Pattern groups found (count + names)
  • Unique mappings (count + names)
  • Scope reduction: "50 mappings → N templates + M unique files"
  • Complexity distribution: simple / medium / complex counts
  • Dependency depth: number of sequential stages
  • Risk flags: count by severity
  • Confidence distribution: HIGH / MEDIUM / LOW / UNCLASSIFIED counts

Remaining pages — Tech Lead Detail

  • Per pattern group: member list, spine diagram, variation table, parameter differences, evidence, confidence per member
  • Dependency DAG (visual)
  • Shared asset catalogue
  • Risk flags detail (per mapping)
  • Unclassified mappings with reasons

10.2 Excel Workbook (5 sheets)

Sheet Contents
Pattern Groups Group name, spine, member count, member names, variation notes, confidence summary
Dependency Graph Source mapping, target mapping, edge type, shared table name
Shared Assets Table name, reference type, referenced by (mapping list), recommendation
Per-Mapping Assignments Mapping name, assigned group (or "unique"), confidence, variation tier, flags, override notes
Risk Flags Mapping name, flag type, severity, description, recommendation

10.3 Strategy JSON

Machine-readable output describing the full analysis results.

{
    "strategy_version": 1,
    "project_name": "FirstBank_Q1_Migration",
    "analysis_job_id": "uuid",
    "analyzed_at": "ISO datetime",

    "summary": {
        "total_mappings": 50,
        "pattern_groups": 8,
        "template_candidates": 36,
        "unique_mappings": 14,
        "scope_reduction_pct": 56
    },

    "pattern_groups": [
        {
            "group_id": "trunc_load_01",
            "group_name": "Truncate & Load",
            "spine": "SQ → EXP → TARGET",
            "members": [
                {
                    "mapping_name": "m_load_customer",
                    "confidence": "HIGH",
                    "variation_tier": 1,
                    "variation_notes": null,
                    "override": null
                }
            ],
            "externalized_params": ["source_table", "target_table", "column_list"],
            "template_hints": "Single config-driven truncate-and-load with optional filter"
        }
    ],

    "unique_mappings": [
        {
            "mapping_name": "m_complex_reconciliation",
            "reason": "Tier 3 — fundamentally different structure, no pattern match",
            "risk_flags": ["CUSTOM_SQL_OVERRIDE", "5_JOINER_TRANSFORMATIONS"]
        }
    ],

    "shared_assets": [
        {
            "table_name": "DIM_CUSTOMER",
            "referenced_by": ["m_fact_daily_txn", "m_fact_loan_origination", "m_agg_monthly"],
            "reference_type": "lookup",
            "recommendation": "shared reference — referenced by 3 mappings"
        }
    ],

    "dependency_dag": [
        {"from_mapping": "m_stg_customer", "to_mapping": "m_dim_customer", "via_table": "STG_CUSTOMER"},
        {"from_mapping": "m_dim_customer", "to_mapping": "m_fact_daily_txn", "via_table": "DIM_CUSTOMER"}
    ],

    "execution_order": [
        ["m_stg_customer", "m_stg_account", "m_stg_transactions"],
        ["m_dim_customer", "m_dim_account"],
        ["m_fact_daily_txn", "m_fact_loan_origination"],
        ["m_agg_monthly_summary"]
    ],

    "review": {
        "approved_at": "ISO datetime",
        "approved_by": "reviewer_name",
        "overrides": [],
        "notes": ""
    }
}

Schema versioned via strategy_version field.
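The execution_order stages above can be derived from the dependency_dag edges with a level-order topological sort: stage k holds every mapping whose prerequisites all sit in earlier stages. A sketch:

```python
def execution_stages(mappings: list[str], dag: list[dict]) -> list[list[str]]:
    """Level-order topological sort over dependency_dag edges.
    Stage k contains mappings whose dependencies are all in stages 0..k-1."""
    deps = {m: set() for m in mappings}
    for edge in dag:
        deps[edge["to_mapping"]].add(edge["from_mapping"])
    stages, done = [], set()
    while len(done) < len(mappings):
        ready = sorted(m for m in mappings
                       if m not in done and deps[m] <= done)
        if not ready:
            raise ValueError("cycle detected — a DAG was expected")
        stages.append(ready)
        done.update(ready)
    return stages
```

Mappings within one stage have no dependencies on each other, so each inner list can run in parallel once the previous stage completes.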


11. Phase 4 — Human Gate

The strategy review is a structured decision gate in the UI.

11.1 Review Actions

Action Effect
Confirm mapping assignment Locks the mapping's group assignment
Move mapping to different group Simple override — no re-analysis
Convert mapping individually Removes mapping from its group; converts standalone
Split group Triggers lightweight re-validation (Phase 2 re-run on cached Phase 1 data)
Merge groups Triggers lightweight re-validation
Add notes Stored per mapping and per group; carried into strategy JSON
APPROVE Generates final strategy JSON with review metadata
REJECT Returns to analysis with reviewer notes as constraints

11.2 Audit Trail

Every review action is stamped with reviewer name, role, timestamp, and decision. Stored in the audit_log table and included in the strategy JSON.


12. Phase 5 — Strategy Delivery

The approved strategy is available in three formats:

JSON — machine-readable, schema-versioned. Downloaded via the UI or retrieved via GET /api/projects/{id}/strategy.json. Suitable as input to any downstream conversion tool or workflow.

PDF — human-readable strategy document. Leadership summary + tech lead detail.

Excel — reviewable tabular data. Five sheets covering all analysis outputs.

All three are generated from the same underlying analysis data and are consistent.
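A downstream conversion tool might consume the delivered JSON along these lines; guarding on strategy_version is the point, the helper names are hypothetical.

```python
import json

def load_strategy(text: str) -> dict:
    """Parse delivered strategy JSON and guard on the schema version."""
    strategy = json.loads(text)
    if strategy.get("strategy_version") != 1:
        raise ValueError(
            f"unsupported strategy_version: {strategy.get('strategy_version')}")
    return strategy

def conversion_units(strategy: dict) -> list[tuple[str, str]]:
    """Flatten pattern groups and unique mappings into (kind, name) work items."""
    units = [("template", g["group_name"]) for g in strategy["pattern_groups"]]
    units += [("individual", m["mapping_name"]) for m in strategy["unique_mappings"]]
    return units
```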


13. UI Architecture

13.1 Three Views

Dashboard (leadership) Project summary on one screen. Complexity heat map. Scope reduction metric. Dependency depth. Risk flag distribution. Confidence distribution. APPROVE / REJECT gate.

Pattern Groups (tech leads) Left panel: group list with member count and confidence indicator. Right panel: spine visualization, member table with variation tier and confidence, parameter differences table, evidence section, override controls, notes.

Dependency Graph (both) Interactive DAG. Nodes = mappings, colored by pattern group. Edges = dependencies. Click node for details. Execution stages highlighted. Critical path shown. Error propagation paths visible.

13.2 Technology

  • React frontend
  • FastAPI backend (port 8090)
  • SSE progress streaming during analysis
  • SQLite persistence
  • PDF generated server-side (reportlab or weasyprint)
  • Excel generated server-side (openpyxl)
  • DAG visualization: d3-dag or dagre-d3

14. API Surface

Method Path Description
POST /api/projects Upload project config YAML and start analysis
POST /api/projects/folder Start analysis from a folder path
GET /api/projects List all analysis jobs
GET /api/projects/{id} Get analysis job state
GET /api/projects/{id}/stream SSE progress stream
GET /api/projects/{id}/strategy.json Download strategy JSON
GET /api/projects/{id}/strategy.pdf Download strategy PDF
GET /api/projects/{id}/strategy.xlsx Download strategy Excel
POST /api/projects/{id}/review Submit review decision (APPROVE / REJECT)
POST /api/projects/{id}/override Submit mapping override (move, confirm, individualize)
GET /api/projects/{id}/graph Get dependency graph data (JSON for frontend rendering)
GET /api/projects/{id}/groups Get pattern groups with members
GET /api/projects/{id}/groups/{gid} Get single group detail
POST /api/projects/{id}/deliver Trigger strategy delivery
GET /api/projects/{id}/reconciliation.json Reconciliation test manifest (v1.1)
GET /api/projects/{id}/reconciliation.xlsx Reconciliation manifest as Excel (v1.1)
POST /api/projects/{id}/reconciliation/results Ingest test results from conversion tool (v1.1)
GET /api/schema/strategy/{version} Published JSON Schema for strategy output (v1.1)
GET /api/schema/strategy/changelog Schema version history (v1.1)
GET /api/projects/{id}/tracking Get conversion tracking state for all mappings (v1.2)
PATCH /api/projects/{id}/tracking/{mapping} Update conversion status for a mapping (v1.2)
GET /api/projects/{id}/change-impacts Get source-change impact alerts (v1.3)
POST /api/programmes Create a programme grouping multiple jobs (v1.4)
GET /api/programmes/{name} Aggregate portfolio view across programme (v1.4)
GET /api/programmes/{name}/report.pdf Portfolio PDF report (v1.4)
GET /api/library Browse pattern library entries (v1.5)
GET /api/library/{entry_id} Pattern detail with matching project history (v1.5)
POST /api/library/{entry_id}/suppress Suppress library match for current project (v1.5)
GET /api/audit Audit trail of all review decisions
GET /api/health Liveness + readiness probe
POST /api/auth/login Login and receive JWT (v1.6)
POST /api/auth/logout Invalidate session token (v1.6)
GET /api/auth/me Current user info (v1.6)
GET /api/users List users — TECH_LEAD only (v1.6)
POST /api/users Create user — TECH_LEAD only (v1.6)
PATCH /api/users/{user_id} Update role / display name (v1.6)
DELETE /api/users/{user_id} Deactivate user (v1.6)
GET /api/reports/{report_type} Role-enforced report download (v1.7)

15. Data Model

AnalysisJob
├── job_id              UUID
├── project_name        From project config
├── project_config      Full YAML content (stored)
├── status              AnalysisStatus enum
├── current_phase       1–5
├── created_at / updated_at
└── state               JSON blob — per-phase artifacts
    ├── source_resolution       Phase 1.1: files found, types detected
    ├── parse_results           Phase 1.2: per-mapping parse output (cached)
    ├── project_graph           Phase 1.3: aggregated graph, dependency edges, shared assets
    ├── pattern_groups          Phase 2: groups with members, spines, confidence
    ├── strategy_pdf_path       Phase 3: path to generated PDF
    ├── strategy_xlsx_path      Phase 3: path to generated Excel
    ├── strategy_json           Phase 3: strategy JSON
    ├── review                  Phase 4: review decision, overrides, notes
    └── delivery                Phase 5: delivery status

AnalysisStatus
├── PENDING
├── RESOLVING_SOURCE
├── PARSING
├── BUILDING_GRAPH
├── GROUPING
├── GENERATING_STRATEGY
├── AWAITING_REVIEW
├── APPROVED
├── DELIVERING
├── COMPLETE
├── FAILED
└── REJECTED

ReviewRecord
├── reviewer_name
├── reviewer_role
├── review_date
├── decision            APPROVED | REJECTED
├── overrides           List of mapping overrides applied
├── notes               Free-text reviewer notes

MappingOverride
├── mapping_name
├── action              confirm | move | individualize
├── from_group          Original group (if moving)
├── to_group            Target group (if moving)
├── notes               Reviewer rationale

ParseCache
├── file_hash           SHA-256 of XML content
├── parse_output        Cached parse result JSON
├── cached_at           Timestamp

ReconciliationManifest                        [v1.1]
├── job_id              FK → AnalysisJob
├── generated_at
└── entries[]
    ├── mapping_name
    ├── source_tables[]
    ├── target_table
    ├── join_key[]
    ├── column_checks[]
    ├── suggested_sql{}     source_count / target_count / mismatch_check
    └── ai_notes            Caveats from AI annotation

ConversionTrackingEntry                       [v1.2]
├── job_id              FK → AnalysisJob (the approved strategy job)
├── mapping_name
├── status              NOT_STARTED | IN_PROGRESS | CONVERTED | VALIDATED | BLOCKED | DEFERRED
├── assignee
├── notes
├── last_updated_at
└── history[]           Append-only list of status transitions

ChangeImpactAlert                             [v1.3]
├── job_id              FK → AnalysisJob (the re-analysis job)
├── mapping_name
├── change_type         STRUCTURAL | PARAMETER_ONLY
├── conversion_status_at_change_time
├── severity            INFO | WARNING | HIGH | CRITICAL
├── detected_at
└── acknowledged        bool

Programme                                     [v1.4]
├── name                Unique programme name
├── job_ids[]           FKs → AnalysisJob
└── created_at

PatternLibraryEntry                           [v1.5]
├── entry_id            UUID
├── spine_signature     Canonical spine string (e.g., "SQ → EXP → TARGET")
├── variation_profile   Fingerprint of variation characteristics (JSON)
├── pattern_name        Human name from approving tech lead
├── source_project      Project name where confirmed
├── source_job_id       FK → AnalysisJob
├── confirmed_at
├── confirmed_by
└── match_count         Incremented each time a new project matches this entry

ReconciliationResult                          [v1.1 optional ingest]
├── job_id              FK → AnalysisJob
├── mapping_name
├── result              PASS | FAIL | PARTIAL
├── notes
└── received_at

User                                          [v1.6]
├── user_id             UUID
├── username            Unique, case-insensitive
├── password_hash       bcrypt (12 rounds)
├── role                TECH_LEAD | MANAGER | DEVELOPER | ANALYST
├── display_name        Free text
├── created_at
└── last_login_at

16. Incremental Analysis

Phase 1 caches parse results by file content hash. On re-analysis:

  • Only new/changed XMLs are re-parsed
  • Phase 2 runs on the full project (fast — parsing is cached)
  • Previous human overrides are preserved unless the underlying XML changed
  • Strategy document includes a diff section showing what changed
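The cache-by-content-hash step can be sketched as follows. Table and function names are illustrative, not the shipped ParseCache implementation:

```python
import hashlib
import json
import sqlite3

def file_hash(xml_bytes: bytes) -> str:
    """Content hash used as the cache key — identical XML always hits the cache."""
    return hashlib.sha256(xml_bytes).hexdigest()

def get_or_parse(conn: sqlite3.Connection, xml_bytes: bytes, parse_fn):
    """Return a cached parse result if the XML is unchanged; otherwise parse and cache."""
    h = file_hash(xml_bytes)
    row = conn.execute(
        "SELECT parse_output FROM parse_cache WHERE file_hash = ?", (h,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])          # cache hit — no re-parse
    result = parse_fn(xml_bytes)           # cache miss — parse once, then store
    conn.execute(
        "INSERT INTO parse_cache (file_hash, parse_output, cached_at) "
        "VALUES (?, ?, datetime('now'))",
        (h, json.dumps(result)),
    )
    return result
```

Because the key is the content hash rather than the filename, renaming or moving a mapping XML does not invalidate its cache entry.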

17. Security Architecture

Security is infrastructure, not a feature layer. See SECURITY.md for full details.

  • XXE in mapping XMLs — safe_parse_xml(): DTD and entity resolution disabled
  • Path traversal in folder scanning — all paths resolved relative to the configured root; symlinks rejected
  • Zip Slip / Zip Bomb — validated extraction with byte and entry-count caps
  • Malformed project config — yaml.safe_load() + schema validation before processing
  • Secrets in XML — scan_xml_for_secrets() checks CONNECTION attributes at parse time
  • SSRF in repo cloning — URL allowlist; no arbitrary redirects
  • Unauthenticated access — session-cookie middleware on all non-static routes
  • Dependency CVEs — pip-audit in CI; dependencies pinned
  • Response hardening — HTTP security headers (CSP, HSTS, X-Frame-Options) on all responses
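One pragmatic way to implement the XXE guard is to reject DTD/entity declarations outright before the parser ever sees them. This is a sketch consistent with the safe_parse_xml() behaviour described above, not the shipped implementation:

```python
import xml.etree.ElementTree as ET

class UnsafeXMLError(ValueError):
    """Raised when a mapping XML contains constructs the tool refuses to parse."""

def safe_parse_xml(xml_bytes: bytes) -> ET.Element:
    """Parse mapping XML with DTD and entity processing refused outright.

    Rejecting any DOCTYPE/ENTITY declaration up front blocks both XXE and
    entity-expansion (billion-laughs) payloads — PowerCenter export XML
    never legitimately contains either.
    """
    if b"<!DOCTYPE" in xml_bytes or b"<!ENTITY" in xml_bytes:
        raise UnsafeXMLError("DTD/entity declarations are not allowed in mapping XML")
    return ET.fromstring(xml_bytes)
```

In practice a hardened parser such as defusedxml achieves the same result; the refuse-early check keeps the failure mode explicit and auditable.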

18. Success Metrics

Metric Target
Mapping parsing completion rate > 99% of mappings parsed successfully
Pattern group accuracy (human-confirmed) > 85% of auto-assigned groupings confirmed without override
Dependency DAG completeness > 90% of actual dependencies detected
Shared asset detection rate > 95% of tables referenced by 3+ mappings identified
Strategy generation time (50 mappings) < 10 minutes
Strategy generation time (500 mappings) < 60 minutes
Human review time (median) < 30 minutes for 50-mapping project
Scope reduction Typical project: 40-60% reduction (N mappings → fewer templates + unique files)
Incremental re-analysis time < 2 minutes for 5 new mappings added to 50-mapping project
False positive rate (incorrect groupings) < 10%
Unclassified mapping rate < 15% of project

19. Technical Constraints

  • Python 3.11+ — asyncio patterns, X | Y union syntax
  • SQLite — sufficient for single-instance deployment; PostgreSQL migration path via SQLAlchemy
  • Claude API required — ANTHROPIC_API_KEY must be set. Structural fingerprinting (spine extraction, dependency DAG) is deterministic, but interpreting what mappings actually do — understanding SQL override intent, classifying expression logic, inferring reconciliation caveats — requires AI. No Python library can reliably extract business intent from arbitrary Informatica SQL and expression syntax. Running without an API key produces structural groupings only, with no interpretation layer — not a useful output for a migration strategy review. [ENTERPRISE-TODO] Support alternative LLM providers (Azure OpenAI, Bedrock) via a swappable _client.py adapter.
  • Port 8090 — default; configurable via PORT env var
  • No Docker required — plain Python venv deployment; Dockerfile optional
  • License — CC BY-NC 4.0; commercial use requires written permission

20. Planned Capabilities — v1.1 through v1.7

The following capabilities extend the tool beyond structural analysis into the active migration lifecycle — tracking, protecting, and distributing the analysis artefacts as conversion proceeds. None of them require conversion to happen inside this tool.


20.1 Data Equivalence / Test Strategy Scaffolding (v1.1)

The gap: The tool knows what each mapping reads and writes — source tables, target tables, key columns, expression logic, join conditions. But it produces no guidance on how to prove the converted code is equivalent to the original Informatica mapping. Data reconciliation is one of the hardest parts of any ETL migration and currently falls entirely to the delivery team.

What this capability adds:

For each mapping (and each pattern group), generate a reconciliation test manifest — a structured description of what needs to be validated:

{
  "mapping_name": "m_load_customer",
  "reconciliation": {
    "source_tables": ["OLTP.CUSTOMER"],
    "target_table": "DW.DIM_CUSTOMER",
    "join_key": ["CUSTOMER_ID"],
    "column_checks": ["CUSTOMER_NAME", "EMAIL", "STATUS", "LOAD_DATE"],
    "row_count_check": true,
    "null_checks": ["CUSTOMER_ID", "EMAIL"],
    "suggested_sql": {
      "source_count": "SELECT COUNT(*) FROM OLTP.CUSTOMER WHERE ...",
      "target_count": "SELECT COUNT(*) FROM DW.DIM_CUSTOMER WHERE ...",
      "mismatch_check": "SELECT a.CUSTOMER_ID FROM ... EXCEPT SELECT b.CUSTOMER_ID FROM ..."
    },
    "notes": "Expression port CUST_FULL_NAME is a simple concat — validate output matches IIF(ISNULL(FIRST_NAME),'',FIRST_NAME)||' '||LAST_NAME"
  }
}

For pattern groups, the manifest covers the group template (one set of checks, parameterised):

{
  "group_id": "trunc_load_01",
  "group_name": "Truncate & Load",
  "reconciliation_template": {
    "description": "Row count source vs target. Column sample on key fields. NULL check on business keys.",
    "parameterised_by": ["source_table", "target_table", "join_key"],
    "members": [
      { "mapping_name": "m_load_customer", "source_table": "OLTP.CUSTOMER", "target_table": "DW.DIM_CUSTOMER", "join_key": "CUSTOMER_ID" },
      { "mapping_name": "m_load_account",  "source_table": "OLTP.ACCOUNT",  "target_table": "DW.DIM_ACCOUNT",  "join_key": "ACCOUNT_ID" }
    ]
  }
}

Outputs:

  • GET /api/projects/{id}/reconciliation.json — full test manifest
  • GET /api/projects/{id}/reconciliation.xlsx — one row per mapping with source/target/key/column checks
  • Reconciliation section in the PDF report (tech lead detail pages)
  • AI-assisted: for mappings with SQL overrides or complex expressions, Claude annotates the suggested checks with caveats ("this SQL override joins 3 tables — row count alone is not sufficient; validate intermediate join cardinality")

Project config addition:

analysis:
  generate_reconciliation_manifest: true    # default: true
  reconciliation_key_hint: ".*_ID$"         # regex hint for identifying key columns

Why this matters: The delivery team can take the reconciliation manifest directly into their test harness. The analysis tool has already done the work of figuring out source/target pairs, key columns, and expression complexity — the test plan writes itself from that data.
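A sketch of how the suggested_sql block can be derived mechanically from the manifest fields. The function name is illustrative, and real manifests carry filter predicates (the WHERE ... placeholders shown earlier), which this sketch omits:

```python
def suggested_sql(source_table: str, target_table: str, join_key: list[str]) -> dict:
    """Derive baseline reconciliation queries from manifest fields.

    Purely mechanical: the analysis has already identified the source/target
    pair and key columns, so the starter checks follow directly from them.
    """
    keys = ", ".join(join_key)
    return {
        "source_count": f"SELECT COUNT(*) FROM {source_table}",
        "target_count": f"SELECT COUNT(*) FROM {target_table}",
        "mismatch_check": (
            f"SELECT {keys} FROM {source_table} "
            f"EXCEPT SELECT {keys} FROM {target_table}"
        ),
    }
```

The AI annotation layer then adds the caveats these mechanical checks cannot know — for example, that a row-count check is insufficient when a SQL override changes join cardinality.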

Boundary — reconciliation feedback loop: This tool generates the test plan. Running the tests and reporting results is the conversion project's responsibility. The boundary is clean and explicit:

  • This tool outputs: reconciliation.json manifest (what to check and how)
  • Conversion tool inputs: that manifest into its test runner
  • Optional integration: the conversion tool may POST results back via webhook if the teams want closed-loop visibility. The expected POST schema:
    { "job_id": "abc123", "mapping_name": "m_load_customer",
      "result": "PASS | FAIL | PARTIAL", "notes": "row count matched, 2 columns mismatched" }
    This tool will store and surface those results if posted, but does not require them. Implementation of the test runner itself is out of scope.
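The validation side of the optional ingest can be as small as this sketch — it checks only the documented POST schema; the storage layer and endpoint path are left to the implementation:

```python
from dataclasses import dataclass

VALID_RESULTS = {"PASS", "FAIL", "PARTIAL"}

@dataclass
class ReconciliationResult:
    """One reconciliation outcome posted back by the conversion tool."""
    job_id: str
    mapping_name: str
    result: str
    notes: str = ""

    def __post_init__(self):
        if self.result not in VALID_RESULTS:
            raise ValueError(f"result must be one of {sorted(VALID_RESULTS)}")

def ingest(payload: dict) -> ReconciliationResult:
    """Validate an incoming webhook body against the expected schema."""
    return ReconciliationResult(**payload)
```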

20.2 Conversion Progress Tracking (v1.2)

The gap: Once the strategy is approved and conversion begins (in the separate conversion project), the approved strategy JSON becomes a static document. There is no way to track which mappings have been converted, which are in progress, which are blocked, and what percentage of the project is complete. Leadership has no live view of progress.

What this capability adds:

A conversion tracker layer on top of the approved strategy. Each mapping in the approved strategy gains a lifecycle status:

Status Meaning
NOT_STARTED Approved; conversion not yet begun
IN_PROGRESS Assigned to a developer; conversion underway
CONVERTED Code produced; awaiting validation
VALIDATED Reconciliation checks passed
BLOCKED Blocked — dependency, data quality, or scope question
DEFERRED Explicitly deferred to a later phase

Tracking is updated via API (push from the conversion project or manual update via UI):

PATCH /api/projects/{id}/tracking/{mapping_name}
{ "status": "CONVERTED", "assignee": "jane.smith", "notes": "PR #142" }

Dashboard additions:

  • Progress bar: N of M mappings converted (by status tier)
  • Pattern group progress: "Truncate & Load — 11/14 converted, 3 in progress"
  • Execution stage readiness: "Stage 2 (dimensions) — 100% validated. Stage 3 (facts) — 40% converted."
  • Blocked mapping list with notes
  • Velocity trend: mappings validated per week

New UI view: Progress Tracker

  • Table of all mappings with status, assignee, last updated, notes
  • Filter by status, group, stage, assignee
  • Bulk status update
  • Export to CSV for project management tools

Webhook events added:

  • mapping_status_changed — fires when any mapping status changes
  • stage_complete — fires when all mappings in an execution stage reach VALIDATED
  • project_complete — fires when all mappings reach VALIDATED

Why this matters: The strategy document is only useful if someone can tell whether it is being executed against. Progress tracking closes the loop between the approved strategy and the actual delivery — without requiring the conversion project to build its own tracking.


20.3 Source Change Detection During Migration (v1.3)

The gap: Informatica teams continue making production changes to mappings while conversion is underway. The tool's diff capability detects changes between analysis runs — but it does not cross-reference those changes against the conversion tracking state. A mapping that has been approved, assigned, and partially converted could silently change in the source Informatica project, invalidating the strategy and the work in progress.

What this capability adds:

Automatic impact alerting when a re-analysis detects changes to mappings that have an active conversion state.

The alert matrix:

  • NOT_STARTED + any change → INFO — noted in diff report; no action needed
  • IN_PROGRESS + structural change (spine or expression) → WARNING — developer notified; review against new analysis
  • IN_PROGRESS + parameter change only (table/column names) → INFO — noted; likely affects config, not template
  • CONVERTED + any structural change → HIGH — converted code may be incorrect; re-review required
  • VALIDATED + any structural change → CRITICAL — validated output may no longer match; re-validate
  • CONFIRMED override + underlying XML changed → WARNING — human override cleared; re-confirm required
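The matrix compiles into a small severity function. One assumption is made explicit in the comment: cells the matrix does not define (e.g. parameter-only changes to CONVERTED mappings) default to INFO here:

```python
def impact_severity(tracking_status: str, change_type: str,
                    override_confirmed: bool = False) -> str:
    """Map (conversion state, change type) to alert severity per the matrix.

    Undefined cells default to INFO — an assumption of this sketch,
    not a documented rule.
    """
    if override_confirmed:
        return "WARNING"                    # human override cleared; re-confirm
    if change_type == "STRUCTURAL":
        if tracking_status == "VALIDATED":
            return "CRITICAL"               # validated output may no longer match
        if tracking_status == "CONVERTED":
            return "HIGH"                   # converted code may be incorrect
        if tracking_status == "IN_PROGRESS":
            return "WARNING"                # review against new analysis
    return "INFO"                           # NOT_STARTED, or parameter-only change
```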

Alert delivery:

  • In-app notification panel (new UI element in dashboard header)
  • Webhook event: source_change_impact with affected mapping list and severity
  • Email digest (if ALERT_EMAIL env var set): daily summary of change impacts

New endpoint:

GET /api/projects/{id}/change-impacts

Returns list of mappings where source changed after conversion tracking status was set, grouped by severity.

Re-analysis trigger: When the watcher detects a changed *.project.yaml or when a CI/CD trigger fires, the tool automatically cross-references the diff report against conversion tracking states and generates impact alerts before any human reviews the new analysis.

Project config addition:

analysis:
  alert_on_source_change: true   # default: true when tracking is active
  structural_change_only: false  # if true, suppress alerts for parameter-only changes

Why this matters: A migration that takes 6 months is not running against a frozen source. Informatica teams fix bugs, add columns, and change logic during the conversion window. Without this, those changes are invisible until a UAT failure surfaces them — typically late, expensive, and stressful. This capability makes them visible the moment they happen.


20.4 Multi-Project Portfolio View (v1.4)

The gap: Large organisations run multiple Informatica projects in parallel — different business units, different migration phases, sometimes different delivery teams. Today every project is a completely isolated job. There is no aggregate view: no total mapping count across the programme, no cross-project risk comparison, no single leadership report that covers the whole migration.

What this capability adds:

A programme concept that groups multiple analysis jobs under one name. A programme is lightweight — just a label and a list of associated job IDs. No new analysis logic. All the data is already there.

New API:

POST /api/programmes               { "name": "FirstBank Full Migration", "job_ids": ["abc123", "def456"] }
GET  /api/programmes/{name}        Aggregate view
GET  /api/programmes/{name}/report Portfolio PDF/Excel

Portfolio dashboard — new top-level UI view:

Metric Shown across all projects
Total mappings Sum across all projects
Total pattern groups Sum; cross-project duplicates flagged
Scope reduction Weighted average across projects
Confidence distribution HIGH/MEDIUM/LOW/UNCLASSIFIED breakdown
Risk flags Sorted by severity across all projects
Review status Approved / Pending / Rejected per project
Conversion progress % VALIDATED across all projects (requires v1.2)

Project config addition:

project:
  name: "FirstBank DWH Migration"
  programme: "FirstBank Full Migration"   # optional grouping tag

When programme is set, the job is automatically associated with that programme on creation. The programme aggregate is always derived on-read — no separate sync needed.
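Because the aggregate is always derived on-read, the portfolio view reduces to a fold over the member jobs. The field names in this sketch are illustrative:

```python
def programme_aggregate(jobs: list[dict]) -> dict:
    """Derive the portfolio view from member jobs on read — nothing is stored."""
    total_mappings = sum(j["mapping_count"] for j in jobs)
    validated = sum(j["validated_count"] for j in jobs)
    return {
        "projects": len(jobs),
        "total_mappings": total_mappings,
        "pct_validated": (round(100 * validated / total_mappings, 1)
                          if total_mappings else 0.0),
        "review_status": {j["name"]: j["review_status"] for j in jobs},
    }
```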

Why this matters: Migration programmes are multi-project by nature. Leadership needs a single view of progress across the whole programme, not a tab-switching exercise across individual project dashboards. This capability costs almost nothing to build on top of the existing data model and immediately changes how the tool is used at the programme level.


20.5 Cross-Project Pattern Library (v1.5)

The gap: Every new project analysis starts from scratch. If "Truncate & Load" (spine: SQ → EXP → TARGET) has already been confirmed as a pattern in three previous projects and a template has been built for it, the fourth project's analysis produces that same pattern group again — with no awareness that it already exists and has already been solved.

Over time, repeat clients, industry verticals (banks, insurers, telcos), and in-house teams build up significant pattern knowledge that currently lives nowhere — or lives only in the heads of the engineers who did the earlier projects.

What this capability adds:

A pattern library — a shared registry of confirmed pattern signatures. The library is passive and automatic: no extra config, no manual curation required.

How it works:

When a strategy is approved and pattern groups are confirmed:

  1. Each HIGH-confidence group's spine + variation profile is written to the pattern library
  2. The group's name, source project, and confirmation date are recorded

When a new project runs Phase 2 (pattern grouping):

  1. After spine matching, each group is checked against the library
  2. Matches are surfaced as: "This group matches Truncate & Load (confirmed in 3 previous projects, most recently FirstBank Q1). A template may already exist."
  3. The match is a suggestion, not a constraint — tech leads can ignore it

Data model:

PatternLibraryEntry
├── entry_id            UUID
├── spine_signature     Canonical spine string (e.g., "SQ → EXP → TARGET")
├── variation_profile   Fingerprint of variation characteristics
├── pattern_name        Human name (from the approving tech lead)
├── source_project      Project name where confirmed
├── source_job_id       FK → AnalysisJob
├── confirmed_at
├── confirmed_by
└── match_count         How many subsequent projects have matched this entry

New endpoints:

GET  /api/library                     Browse all library entries
GET  /api/library/{entry_id}          Pattern detail + projects that matched
POST /api/library/{entry_id}/suppress  Exclude entry from future matching (per-project)

Matching is fuzzy, not exact. Two spines match if they share the same core sequence with a configurable tolerance for minor additions (Tier 2 variation). The matching threshold uses the same fingerprint_strictness setting from the project config.
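One illustrative interpretation of that tolerance rule: the library spine must occur in order within the candidate, with a bounded number of interleaved additions. The `max_extra_nodes` parameter is a stand-in for whatever the fingerprint_strictness setting maps to:

```python
def spines_match(library_spine: str, candidate_spine: str,
                 max_extra_nodes: int = 1) -> bool:
    """Fuzzy spine match: library spine as an ordered subsequence of the
    candidate, tolerating up to max_extra_nodes additions (Tier 2 variation)."""
    core = library_spine.split(" → ")
    cand = candidate_spine.split(" → ")
    i = 0
    for node in cand:               # greedy in-order subsequence scan
        if i < len(core) and node == core[i]:
            i += 1
    matched_all = i == len(core)
    extras = len(cand) - len(core)
    return matched_all and 0 <= extras <= max_extra_nodes
```

So "SQ → FIL → EXP → TARGET" still matches a "SQ → EXP → TARGET" library entry (one added filter), while a spine with two or more additions, or a missing core node, does not.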

Privacy / isolation: The library is local to the server instance. There is no cloud sync, no sharing of client data externally. In a multi-client consultancy scenario, the operator controls whether the library is shared across clients or isolated per deployment.

Why this matters: Pattern knowledge compounds. The 10th banking migration project doesn't need to rediscover that banks have the same 6 core patterns — the library already knows. Analysis gets faster to review, templates get reused sooner, and institutional knowledge is captured in the tool rather than lost between projects.


20.6 Persona-Based Access & Authentication (v1.6)

The gap: The tool currently has no user identity concept. Anyone who can reach port 8090 can do anything — approve strategies, override groupings, update tracking status. For a one-person team, this is fine. For a team of 5–15 people with different responsibilities, it creates accountability problems and accidental overwrites.

What this capability adds:

A lightweight login system with four roles. Department-appropriate — no OAuth, no LDAP, no corporate identity integration required.

Roles:

Role What they can do
TECH_LEAD Everything — analyse, review, override, approve, manage users
MANAGER Read all views, download reports, view tracking progress. Cannot approve or override
DEVELOPER View their assigned mappings, update tracking status for those mappings
ANALYST Full read access everywhere. Pattern library. Portfolio view

Authentication:

  • Login page in the React SPA (username + password)
  • Passwords hashed with bcrypt (12 rounds), stored in users SQLite table
  • JWT token returned on login, stored in browser localStorage
  • Token expiry: 8 hours. Refresh not required — re-login on expiry
  • FastAPI current_user dependency injected on all protected routes
  • First-run setup: CLI command python -m app.backend.setup_users creates the first TECH_LEAD account interactively

Role-based access enforcement:

  • TECH_LEAD — all endpoints
  • MANAGER — GET endpoints only; blocked from POST/PATCH/review/override
  • DEVELOPER — GET on job summary + their tracking mappings; PATCH on tracking only
  • ANALYST — all GET endpoints; blocked from all POST/PATCH
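The enforcement table compiles into a small predicate for the auth dependency. This is a sketch: the DEVELOPER rule here is coarse (any GET, plus PATCH on tracking routes), with the per-mapping assignee check left to the handler as described below:

```python
from typing import Callable

# Role → (method, path) predicate. Coarse route-level gate only.
ROLE_RULES: dict[str, Callable[[str, str], bool]] = {
    "TECH_LEAD": lambda method, path: True,
    "MANAGER":   lambda method, path: method == "GET",
    "ANALYST":   lambda method, path: method == "GET",
    "DEVELOPER": lambda method, path: method == "GET"
                 or (method == "PATCH" and "/tracking/" in path),
}

def is_allowed(role: str, method: str, path: str) -> bool:
    """Route-level role check run by the FastAPI dependency before the handler."""
    rule = ROLE_RULES.get(role)
    return bool(rule and rule(method, path))
```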

Users data model:

User
├── user_id         UUID
├── username        Unique, case-insensitive
├── password_hash   bcrypt
├── role            TECH_LEAD | MANAGER | DEVELOPER | ANALYST
├── display_name    Free text
├── created_at
└── last_login_at

User management endpoints:

POST   /api/auth/login              Return JWT (public)
POST   /api/auth/logout             Invalidate token
GET    /api/auth/me                 Current user info
GET    /api/users                   List users (TECH_LEAD only)
POST   /api/users                   Create user (TECH_LEAD only)
PATCH  /api/users/{user_id}         Update role/display name (TECH_LEAD only)
DELETE /api/users/{user_id}         Deactivate user (TECH_LEAD only)

Mapping assignment (DEVELOPER role):

  • Tech leads assign mappings to developers via the tracking view
  • PATCH /api/projects/{id}/tracking/{mapping} gains an assignee field
  • Developers can only PATCH mappings assigned to them
  • Assignment visible in the Progress Tracker UI

Project config addition:

# Optional: pre-assign mappings in config rather than via UI
assignments:
  jane.smith: ["m_load_customer", "m_load_account"]
  john.doe:   ["m_stg_transactions", "m_fact_daily"]

New dependencies: python-jose>=3.3,<4.0 (JWT), passlib[bcrypt]>=1.7,<2.0

[ENTERPRISE-TODO] SSO/LDAP integration: Wrap with an OAuth2 reverse proxy (e.g., oauth2-proxy, Authentik). The FastAPI JWT middleware is designed to be replaceable — swap the verify_token dependency for an OIDC token validator.

[ENTERPRISE-TODO] Fine-grained project access: Currently, roles are global. An enterprise deployment might need "Jane can see Project A but not Project B". This requires a project-level ACL table — not built for the department tool.

[ENTERPRISE-TODO] MFA: Not implemented. If the tool runs on an internal network, the corporate VPN provides a sufficient perimeter. External exposure requires TOTP or hardware key support.

Why this matters: Accountability without bureaucracy. The tech lead's approval is meaningful when it is stamped with their name and only they could have given it. The manager's view is cleaner when they are not accidentally shown controls they should not touch. The developer's view is focused when it only shows their work.


20.7 Persona-Aware Reporting (v1.7)

The gap: The tool produces one PDF report that contains both the leadership summary and the technical detail. This is the right document for a tech lead, but it is too detailed for a manager presenting to stakeholders and too high-level for a developer trying to understand their reconciliation requirements. Different roles need different views of the same underlying data.

What this capability adds:

Four report formats, each targeted at one persona — three new, plus the existing tech lead pack. All are generated from the same analysis data; no new analysis is required.

Report 1 — Executive Summary (for MANAGER role)

One page. Deliverable to a CTO or business sponsor who asked "where are we?".

Contents:

  • Project name, date, current phase (ANALYSIS / IN REVIEW / APPROVED / IN CONVERSION)
  • Total mappings: N mappings → M templates + K unique files (scope reduction)
  • Conversion progress: % NOT_STARTED / IN_PROGRESS / CONVERTED / VALIDATED / BLOCKED
  • Risk summary: count of HIGH/CRITICAL flags, blocked mappings
  • Execution stage readiness: "Stage 1: ✅ complete. Stage 2: 6/14 validated. Stage 3: 0% started"
  • One-line status: "On track" / "At risk — 3 mappings blocked" / "Needs attention"

Format: PDF (1 page) + optional email digest (daily, if REPORT_EMAIL env var set)

Report 2 — Tech Lead Review Pack (for TECH_LEAD role)

Existing PDF report — unchanged. This is the full strategy document with pattern group details, dependency graph, reconciliation tables, risk flags, and audit trail.

Report 3 — Developer Assignment Sheet (for DEVELOPER role)

Personal work breakdown. Shows only the mappings assigned to the logged-in developer.

Contents per mapping:

  • Mapping name, pattern group, variation tier
  • Source/target tables
  • Reconciliation checklist (from reconciliation manifest)
  • Current status + history
  • Any change-impact alerts for their mappings
  • Link to the strategy JSON entry

Format: Excel (1 sheet, one row per mapping) + filtered UI view

Report 4 — Portfolio Dashboard (for ANALYST role)

Cross-project view. Extends the v1.4 portfolio dashboard with persona-aware access.

Contents:

  • All projects in the programme: name, status, mapping count, % validated
  • Pattern library matches across projects
  • Aggregate risk flags and blocked mappings
  • Trend: mappings validated per week across the programme

Format: In-app UI view (existing portfolio dashboard, extended) + PDF export

Delivery:

  • All reports available via GET /api/reports/{type}?job_id={id} where type ∈ {executive, techlead, developer, portfolio}
  • Auth-enforced: MANAGER gets executive only, DEVELOPER gets their sheet only, TECH_LEAD gets all, ANALYST gets portfolio
  • Dashboard header shows only the download buttons the current user's role permits
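The role-to-report gate reduces to a lookup — a sketch of the enforcement described above:

```python
# Which report types each role may download. TECH_LEAD gets everything;
# the other roles get exactly their persona's report.
ROLE_REPORTS: dict[str, set[str]] = {
    "TECH_LEAD": {"executive", "techlead", "developer", "portfolio"},
    "MANAGER":   {"executive"},
    "DEVELOPER": {"developer"},
    "ANALYST":   {"portfolio"},
}

def allowed_report(role: str, report_type: str) -> bool:
    """Gate for GET /api/reports/{report_type} before generation begins."""
    return report_type in ROLE_REPORTS.get(role, set())
```

The same table drives the dashboard header, so the UI only renders download buttons the gate would permit.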

New endpoint:

GET /api/reports/{report_type}?job_id={id}   Role-enforced report download
GET /api/reports/digest/preview              Preview today's email digest (TECH_LEAD only)

[ENTERPRISE-TODO] Scheduled email delivery: REPORT_SCHEDULE=daily|weekly env var with cron job. Department tool uses a manual "Send now" button instead.

[ENTERPRISE-TODO] Custom report templates: Teams customising the PDF layout (logo, branding, section order) need a templating layer. Not in scope for the department tool — the reportlab output is the template.

Why this matters: A manager who receives the 25-page tech lead PDF will not read it. A developer who sees the full strategy document is confused about what they personally need to do. The right information for the right person is not a nice-to-have — it is what makes the tool actually used by the whole team rather than just one person.


21. Version Roadmap

v0.1.0 — Foundation ✓

  • Project config parser and validation
  • Source resolution (folder type only)
  • Mapping XML parser (extract transformations, connectors, sources, targets)
  • Cross-mapping project graph construction (dependency edges, shared assets)
  • Spine extraction, structural fingerprinting, and pattern grouping
  • Execution order generation (topological sort into parallel stages)
  • Strategy JSON generation
  • FastAPI backend with all API endpoints
  • React UI: dashboard view, pattern groups view, dependency graph view
  • Structured logging

v0.2.0 — Strategy Documents + Review Gate ✓

  • PDF report generation (leadership summary + tech lead detail)
  • Excel workbook generation (5 sheets)
  • SSE progress streaming
  • Human review gate (APPROVE / REJECT) with persistence
  • Override controls (confirm, move, individualize)
  • SQLite persistence for jobs and audit trail

v0.3.0 — AI Enhancement ✓

  • AI-assisted expression classification
  • AI-assisted custom SQL interpretation
  • Confidence scoring refinement
  • Dependency DAG visualization in UI (interactive)
  • PDF download button; SSE progress bar in UI

v0.4.0 — Watcher + Incremental ✓

  • Watcher mode for *.project.yaml files
  • Incremental analysis (parse caching, diff reporting)
  • Override preservation across re-analysis
  • Webhook notifications (HMAC-signed)

v0.5.0 — Extended Source Types + CI/CD ✓

  • Git repo source type (shallow clone via GitPython)
  • ZIP upload source type with path traversal protection
  • S3 source type
  • CI/CD API trigger endpoint with API key auth
  • Artifacts listing endpoint

v1.0.0 — Production Ready ✓

  • Full test suite (57 tests: unit + API + integration)
  • GitHub Actions CI pipeline
  • Security headers middleware + rate limiting (slowapi)
  • Input validation hardening
  • Parallel parsing (ThreadPoolExecutor)
  • Paginated job list
  • CAPABILITIES.md

v1.1.0 — Reconciliation Test Scaffolding + Schema Contract

See §20.1 for full specification.

  • Reconciliation manifest generator (per mapping + per pattern group)
  • Source/target/key column extraction from parse results
  • AI-assisted annotation for complex expressions and SQL overrides
  • GET /api/projects/{id}/reconciliation.json
  • GET /api/projects/{id}/reconciliation.xlsx
  • Reconciliation section in PDF report
  • analysis.generate_reconciliation_manifest config flag
  • analysis.reconciliation_key_hint regex config
  • Optional reconciliation result ingest endpoint (POST from conversion tool)
  • Strategy JSON schema contract:
    • JSON Schema spec published at GET /api/schema/strategy/{version}
    • strategy_version field upgraded to full semantic version string ("1.1.0")
    • Backward compatibility guarantee documented: v1.x readers can consume v1.0 output
    • Schema changelog maintained at GET /api/schema/strategy/changelog
    • Conversion tool declares supported schema version in its config; analysis tool rejects incompatible consumers with a clear error

v1.2.0 — Conversion Progress Tracking

See §20.2 for full specification.

  • Conversion lifecycle status per mapping (NOT_STARTED → VALIDATED)
  • PATCH /api/projects/{id}/tracking/{mapping_name} endpoint
  • Progress Tracker UI view (table, filter, bulk update)
  • Dashboard progress bar and stage readiness indicators
  • Webhook events: mapping_status_changed, stage_complete, project_complete
  • CSV export for project management tool integration

v1.3.0 — Source Change Detection During Migration

See §20.3 for full specification.

  • Cross-reference diff report against conversion tracking state
  • Alert severity matrix (INFO / WARNING / HIGH / CRITICAL)
  • In-app notification panel
  • GET /api/projects/{id}/change-impacts endpoint
  • Webhook event: source_change_impact
  • Email digest support (ALERT_EMAIL env var)
  • analysis.alert_on_source_change config flag
  • Automatic impact assessment on every re-analysis or watcher trigger

v1.4.0 — Multi-Project Portfolio View

See §20.4 for full specification.

  • Programme concept: group multiple jobs under one name
  • programme field in project config
  • POST /api/programmes, GET /api/programmes/{name}
  • Portfolio dashboard UI view (aggregate metrics across all projects)
  • Portfolio PDF/Excel report
  • Cross-project risk and confidence distribution

v1.5.0 — Cross-Project Pattern Library

See §20.5 for full specification.

  • Pattern library SQLite table; auto-populated on strategy approval
  • Phase 2 match step against library (fuzzy spine + variation profile)
  • Library match surfaced as suggestion in pattern groups view
  • GET /api/library, GET /api/library/{entry_id}
  • POST /api/library/{entry_id}/suppress per-project suppression
  • Match count tracking across projects
  • Local-only; no external sync

v1.6.0 — Persona-Based Access & Authentication

See §20.6 for full specification.

  • Login page in React SPA (username + password)
  • Four roles: TECH_LEAD, MANAGER, DEVELOPER, ANALYST
  • JWT auth, 8-hour expiry, stored in users SQLite table (bcrypt passwords)
  • Role-based endpoint enforcement (FastAPI dependency injection)
  • Developer mapping assignment — PATCH tracking for assigned mappings only
  • User management UI (TECH_LEAD only)
  • CLI first-run setup: python -m app.backend.setup_users
  • New dependencies: python-jose, passlib[bcrypt]

v1.7.0 — Persona-Aware Reporting

See §20.7 for full specification.

  • Executive Summary PDF (1 page, MANAGER role) — progress, risk, stage readiness
  • Developer Assignment Sheet Excel (DEVELOPER role) — their mappings + reconciliation
  • Portfolio Dashboard (ANALYST role) — cross-project, pattern library, trend
  • GET /api/reports/{report_type}?job_id={id} — role-enforced
  • Dashboard header adapts download buttons to logged-in user's role

22. Sample Data

The repository ships a 50-mapping FirstBank test project for development and testing:

Tier Count Characteristics
Simple 15 Single source, SQ → EXP → Target, dimension/reference loads
Medium 20 Multi-source, lookups, aggregations, SCD2, fact loads
Complex 15 3+ sources, joiners, routers, multiple targets, regulatory/risk

Located at sample_data/firstbank/ with a ready-to-use project config at firstbank_migration.project.yaml.

Expected pattern groups from this project (validation target):

  • Simple dimension load (7 mappings, spine: SQ → EXP → TARGET)
  • Reference table load (4 mappings, spine: SQ → EXP → TARGET)
  • Staging extract (3 mappings, spine: SQ → FIL → TARGET)
  • SCD2 dimension (3 mappings, spine: SQ → LKP → EXP → RTR → UPD)
  • Fact with single lookup (6-7 mappings, spine: SQ → LKP → EXP)
  • Aggregation (3 mappings, spine: SQ → AGG or SQ → JNR → AGG)
  • Complex risk/regulatory (4+ mappings, similar multi-source spine)
  • Unique/individual: 5-8 mappings too specialized to template