Version: 1.7.0 (roadmap) Author: ad25343 Last Updated: 2026-03-19 License: CC BY-NC 4.0 — github.com/ad25343/InformaticaProjectAnalysis Contact: github.com/ad25343/InformaticaProjectAnalysis/issues
Bottom line: AI-powered pre-conversion analysis for teams migrating from Informatica PowerCenter to open code (Python, dbt, PySpark). Reads all mapping XMLs from a PowerCenter project, uses Claude to interpret SQL overrides and expression logic, identifies cross-mapping patterns, builds a dependency graph, and produces a conversion strategy that humans review before any conversion begins. GenAI is core to this tool — structural fingerprinting tells you the shape of your mappings; AI interpretation tells you what they actually do.
Deployment target: This is a department-level tool — designed to be run by a single data engineering or migration team inside an organisation. It runs as a local Python server (or on a team VM) and is accessed by a small number of named users (typically 3–15). It is not a multi-tenant SaaS platform. Enterprise deployment patterns (SSO, LDAP, multi-tenancy, compliance audit logs, horizontal scaling) are noted throughout as `[ENTERPRISE-TODO]` items for teams that outgrow the department scope.
Informatica PowerCenter is an enterprise ETL (Extract, Transform, Load) platform that has been used by large organizations — banks, insurers, telecoms, government agencies — for over two decades to move and transform data between systems. It is one of the most widely deployed data integration tools in the world.
In PowerCenter, each unit of data transformation logic is called a mapping. A mapping defines how data flows from a source (e.g., a database table, flat file, or API) through a series of transformations (lookups, expressions, filters, aggregations, routers) and into a target table. A single mapping might load a customer dimension table; another might aggregate daily transactions into a monthly summary.
A typical enterprise PowerCenter environment contains dozens to hundreds of these mappings, organized into folders and grouped by workflows that define when and in what order they run. All of this configuration is stored internally by PowerCenter, but it can be exported as XML files — one XML per mapping, per workflow, per parameter file.
These XML exports are the raw material that this tool works with.
If your organization runs Informatica PowerCenter, you already know its gaps. The platform does what it was built to do — batch ETL on structured data — but it has not kept pace with what engineering teams are now being asked to deliver.
The most acute gap is GenAI. Business teams are demanding AI-enabled data pipelines — LLMs (Large Language Models) for classification, enrichment, anomaly detection, and summarization embedded directly in the data flow. Informatica PowerCenter has no native GenAI capability. It cannot call an LLM, cannot integrate with vector databases, and has no roadmap for AI-augmented pipelines. Staying on PowerCenter means the data infrastructure simply cannot participate in what the business is trying to do next.
The switch to open code closes these gaps. Python, PySpark, and dbt integrate with any model, any API, any tool in the ecosystem — and produce a codebase the team owns, version-controls, tests, and deploys without platform lock-in. For teams that have hit PowerCenter's ceiling, the destination is not another SaaS tool. The destination is a codebase.
This changes the conversion problem fundamentally. A SaaS-to-SaaS migration can lean on the destination platform's import wizards. But a SaaS-to-code migration needs to produce well-structured, maintainable source code — not just functionally equivalent scripts.
This is a conversion problem, not a rewrite. The business logic encoded in those mappings is tested, production-proven, and (often) poorly documented. The goal is to faithfully reproduce that logic in open code, not to redesign the data architecture from scratch.
The naive approach is to convert each mapping in isolation. Take mapping XML #1, parse it, produce the equivalent Python script, move on to mapping #2. This works — but it produces exactly the kind of unmaintainable codebase that teams are trying to escape:
- If 14 mappings all follow the same truncate-and-load pattern differing only by table name, you get 14 separate scripts instead of one parameterized template plus a config file.
- If 8 mappings share the same SCD2 (slowly changing dimension) pattern, each gets its own copy of the SCD2 logic instead of sharing a common implementation.
- Shared lookup tables are redefined independently in every script.
- There is no dependency graph — no way to know which scripts must run before others.
- There is no project-level structure — no unified sources, no shared utilities, no layered organization.
The result is a converted codebase that works but is unmaintainable — hundreds of files with massive duplication, no structure, and no awareness of how the pieces fit together. You left a proprietary tool and landed in a code mess.
The fix is to analyze the full project before converting any individual mapping.
InformaticaProjectAnalysis is the pre-conversion analysis step for teams migrating from Informatica PowerCenter to open code. It reads all the mapping XMLs from a PowerCenter project (the complete collection of exported mappings, workflows, and parameter files) and produces a conversion strategy that answers three questions:
- Which mappings share the same structural pattern? Mappings that follow the same transformation flow (e.g., source → lookup → expression → target) with only table names and column names differing are candidates for a single parameterized template. The tool groups them together, shows the evidence, and assigns a confidence level.
- What depends on what? If mapping A loads `DIM_CUSTOMER` and mapping B does a lookup against `DIM_CUSTOMER`, then B depends on A — it must run after A completes. The tool builds a dependency graph across all mappings and computes a safe execution order.
- What needs human attention? Not every mapping can be automatically classified with high confidence. Custom SQL overrides, missing definitions, and unusual transformation patterns reduce certainty. The tool flags these for tech lead review.
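The dependency rule in the second question is mechanical: if one mapping's lookup table is another mapping's target table, the looker-upper must run later. A minimal sketch of that rule (the mapping structure and names here are illustrative, not the tool's internal API):

```python
# Sketch of the target-to-lookup dependency rule described above.
# Mapping names and table structure are hypothetical examples.

def dependency_edges(mappings: dict[str, dict]) -> list[tuple[str, str]]:
    """Return (upstream, downstream) pairs: downstream looks up a table
    that upstream loads, so downstream must run after upstream."""
    # Index each target table by the mapping that loads it
    loaded_by = {m["target"]: name for name, m in mappings.items()}
    edges = []
    for name, m in mappings.items():
        for lookup_table in m.get("lookups", []):
            if lookup_table in loaded_by and loaded_by[lookup_table] != name:
                edges.append((loaded_by[lookup_table], name))
    return edges

mappings = {
    "m_dim_customer": {"target": "DIM_CUSTOMER", "lookups": []},
    "m_fact_txn": {"target": "FACT_TXN", "lookups": ["DIM_CUSTOMER"]},
}
print(dependency_edges(mappings))  # [('m_dim_customer', 'm_fact_txn')]
```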
The output is a strategy document (PDF + Excel + JSON) that tech leads and leadership review and approve before any conversion begins.
The tool observes and surfaces structural characteristics. It does not prescribe which target language to use (Python, dbt, PySpark, etc.), which warehouse to target, or how to orchestrate the converted pipelines. Those decisions belong to the humans reviewing the strategy and the conversion tools they choose.
Out of scope — by design:

- Effort / cost estimation. The tool surfaces complexity distribution and scope reduction figures. It does not translate these into story points, team-weeks, or monetary estimates. Those models are project- and team-specific and belong in your delivery toolchain, not here.
- Migration wave planning / sprint sequencing. The dependency DAG gives you safe execution order for running the mappings in production. Choosing which mappings to convert first (risk management, incremental delivery, team allocation) is a programme management decision, not a structural analysis one.
- Business criticality / impact scoring. The tool knows structural complexity but not business importance. Which mapping feeds the overnight regulatory batch? Which one is the CEO dashboard? That context does not exist in the XML files and must come from the organisation. It is not in scope for automated analysis.
- The actual conversion. Code generation — producing Python, dbt, PySpark, or any other target — is a separate project. This tool produces the strategy that conversion is executed against.
Out of scope — enterprise deployment features:
These are deliberate non-goals for the department-level tool. Teams that outgrow these limits should treat them as a migration project from the department tool to a proper platform.
- `[ENTERPRISE-TODO]` Single Sign-On / LDAP / OAuth. Auth is simple username + password in SQLite. Enterprise identity integration (Okta, Azure AD, corporate LDAP) is out of scope.
- `[ENTERPRISE-TODO]` Multi-tenancy. One SQLite database per deployment. Multiple independent teams should run separate instances, not share one.
- `[ENTERPRISE-TODO]` Horizontal scaling / load balancing. Single-process FastAPI + SQLite. Sufficient for a department; not designed for concurrent large-estate analyses across dozens of projects simultaneously.
- `[ENTERPRISE-TODO]` Compliance audit logging (SOX, GDPR). The audit trail is for the migration review process, not for regulatory compliance frameworks.
- `[ENTERPRISE-TODO]` High-availability / disaster recovery. SQLite WAL mode is robust for the department use case. Backup strategy is the team's responsibility.
This is a department-level tool used by a small, named team (typically 3–15 people). From v1.6 onwards, each user logs in with a username and password and is assigned one of four roles. Role determines what they see and what they can do.
| Role | Who | Access |
|---|---|---|
| `TECH_LEAD` | Data engineering tech lead | Full access — analysis, review, override, approve, tracking |
| `MANAGER` | Engineering manager or delivery lead | Dashboard, PDF/Excel summaries, progress tracking (read), portfolio view |
| `DEVELOPER` | Conversion developer | Their assigned mappings, tracking status updates, reconciliation for their work |
| `ANALYST` | Migration analyst or PO | All read access, pattern library, cross-project portfolio |
TECH_LEAD — Primary persona
Reviews the strategy in detail. Validates pattern groupings, confirms or adjusts assignments, identifies edge cases. Needs: structural evidence, dependency graph, per-mapping confidence levels, override controls, reconciliation manifest, group notes. Approves or rejects the final strategy.
MANAGER — Secondary persona
Reviews the summary layer. Needs: mapping count, pattern groups, complexity distribution, risk flags, scope reduction %, dependency depth, conversion progress (% complete by stage), blocked mappings list. Does not need per-mapping technical detail. Receives the PDF report.
DEVELOPER — Tertiary persona
Receives their mapping assignments from the approved strategy. Updates conversion status (IN_PROGRESS → CONVERTED → VALIDATED). Views the reconciliation manifest for their mappings. Does not review or override groupings — that is the tech lead's domain.
ANALYST — Quaternary persona
Cross-project visibility. Monitors multiple projects in the portfolio view. Accesses the pattern library to see what patterns have been confirmed across projects. Full read access everywhere. Cannot override groupings or update tracking status.
Authentication note (v1.6): Simple username + hashed password stored in SQLite. JWT session token, 8-hour expiry. No OAuth, LDAP, or SSO — this is a department tool.

`[ENTERPRISE-TODO]` For corporate identity integration, wrap the FastAPI app with an OAuth2 reverse proxy (e.g., oauth2-proxy in front of nginx).
- Analyze all N mappings together. Cross-mapping references (a mapping that looks up a table produced by another mapping) can only be detected when the full project is visible. Analyzing mappings in isolation misses these relationships.
- Pattern grouping is the primary goal. Structurally similar mappings become one template + config instead of N separate files. A project of 50 mappings might collapse into 8 templates + 12 unique files.
- Strategy is a recommendation with evidence. Every grouping shows the member mappings, structural evidence, parameter differences, and confidence level. Nothing is a black box.
- Converting, not rewriting. The analysis observes what exists and recommends smart conversion. It does not redesign the data architecture or suggest how the project "should have been built."
- Variation handling is explicit. Mappings within a group are not all identical. The tool classifies how much each member differs from the group's canonical pattern and surfaces this transparently.
- Classification by structural behavior, not naming conventions. Real-world projects do not follow consistent naming. A mapping called `TBL_047_PROC` might be a simple dimension load. Classification uses transformation topology and graph position, not names.
- Honest uncertainty. The strategy distinguishes high-confidence classifications from ambiguous ones that need human confirmation. The output says "here are 38 I'm confident about, 8 that need a human to confirm, and 4 I couldn't classify."
- Security is paramount. All input is validated, all XML parsing is hardened against XXE (XML External Entity) injection, no secrets in code or logs, all dependencies audited.
The primary input is a `*.project.yaml` file that defines the full migration scope. This is the single source of truth for the analysis.
```yaml
project:
  name: "Project Name"
  version: "1.0"
  owner: "Team Name"

source:
  type: folder                  # folder | repo | zip | s3
  location: "/path/to/exports/"

scope:
  mappings:
    include: ["mappings/**/*.xml"]
    exclude: ["mappings/archive/**"]
  workflows:
    include: ["workflows/**/*.xml"]
  parameters:
    include: ["parameter_files/*.xml"]
    default_parameter_env: "dev"

analysis:
  fingerprint_strictness: "moderate"
  min_group_size: 2
  confidence_threshold: 0.7
  detect_shared_assets: true
  build_dependency_dag: true    # DAG = Directed Acyclic Graph
  classify_expressions: true

review:
  tech_lead:
    name: "Name"
    email: "email"
  leadership:
    name: "Name"
    email: "email"

output:
  strategy_format: "json"
  output_dir: "/output/"

notifications:
  webhook_url: ""
  events:
    on_analysis_complete: true
    on_strategy_ready: true
    on_review_approved: true
```

Source types: `folder` (local path), `repo` (Git URL + branch), `zip` (uploaded archive), `s3` (bucket path).
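Before any processing, the config file is parsed with `yaml.safe_load()` and schema-validated (see the security table later in this document). A minimal sketch of that validation step, assuming the key layout shown above — the function name, required-key set, and error format here are illustrative, and a literal dict stands in for the `yaml.safe_load()` result so the sketch is self-contained:

```python
# Minimal structural validation of a parsed *.project.yaml dict.
# Key names follow the example config; the exact rules are illustrative.

REQUIRED = {
    "project": ["name", "version", "owner"],
    "source": ["type", "location"],
    "scope": ["mappings"],
}
SOURCE_TYPES = {"folder", "repo", "zip", "s3"}

def validate_config(cfg: dict) -> list[str]:
    """Return human-readable validation errors (empty list = valid)."""
    errors = []
    for section, keys in REQUIRED.items():
        block = cfg.get(section)
        if not isinstance(block, dict):
            errors.append(f"missing section: {section}")
            continue
        errors += [f"{section}.{k} is required" for k in keys if k not in block]
    src_type = cfg.get("source", {}).get("type")
    if src_type and src_type not in SOURCE_TYPES:
        errors.append(f"source.type must be one of {sorted(SOURCE_TYPES)}")
    return errors

cfg = {
    "project": {"name": "Demo", "version": "1.0", "owner": "Team"},
    "source": {"type": "folder", "location": "/path/to/exports/"},
    "scope": {"mappings": {"include": ["mappings/**/*.xml"]}},
}
print(validate_config(cfg))  # []
```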
*.project.yaml (uploaded via UI, dropped in watcher dir, or POSTed via API)
│
▼
Phase 1 Discovery
├── Step 1.1 Source Resolution [deterministic]
│ Clone repo / mount folder / extract ZIP / pull S3
│ Scan using scope globs → list of mapping XMLs
│
├── Step 1.2 Parse All Mappings [deterministic]
│ Parse each mapping XML → structural components
│ Cache results by file content hash (SHA-256)
│ Aggregate into project-level graph
│
├── Step 1.3 Build Project Graph [deterministic + AI-assisted]
│ Cross-mapping dependency edges (target → lookup references)
│ Shared asset detection (tables referenced by 3+ mappings)
│ Repeated expression detection
│ AI: interpret custom SQL overrides, classify expression
│ complexity, infer missing mapplet behavior, detect
│ implicit dependencies
│
▼
Phase 2 Pattern Grouping [AI-assisted]
├── Step 2.1 Structural Fingerprinting
│ Extract "spine" per mapping — the ordered sequence of
│ transformation types from source to target (see §9.1)
│ Group by matching spine
│
├── Step 2.2 Variation Classification
│ Within each spine group, diff parameters
│ Classify: Tier 1 (parameter only), Tier 2 (minor structural),
│ Tier 3 (fundamental — does not group)
│
├── Step 2.3 AI Pattern Naming + Confidence
│ Name groups in human terms
│ Assign confidence per mapping-to-group assignment
│ Flag edge cases and ambiguous mappings
│
▼
Phase 3 Strategy Document Generation
├── Step 3.1 PDF Generation [leadership summary + tech lead detail]
├── Step 3.2 Excel Generation [5 sheets: groups, DAG, shared assets,
│ assignments, risk flags]
├── Step 3.3 Strategy JSON [machine-readable output]
│
│
▼
Phase 4 ◼ Human Gate — Strategy Review
Tech leads + leadership review in the UI.
Actions: confirm groupings, override assignments, add notes.
Decision: APPROVE → Phase 5 | REJECT → re-analysis with notes
│
▼
Phase 5 Strategy Delivery
Approved strategy available as:
├── JSON file download
├── PDF + Excel download
└── API endpoint (GET /api/projects/{id}/strategy.json)
The source section of the project config determines how mappings are located:
| Type | Resolution |
|---|---|
| `folder` | Scan `location` recursively using scope glob patterns |
| `repo` | Clone `location` at `branch`, scan path using scope globs |
| `zip` | Extract uploaded archive, scan using scope globs |
| `s3` | Pull objects matching scope globs from `location` bucket path |
All resolved files are auto-detected by content: mapping XML, workflow XML, parameter file, or unknown. Files that don't match any known type are logged and skipped.
Each mapping XML is parsed to extract its structural components:
Per-mapping parse output:
├── transformations[] Each transformation with type, ports, expressions
│ ├── type "Expression", "Lookup", "Aggregator", "Router", etc.
│ ├── ports[] Input/output ports with datatypes
│ ├── expressions[] Expression bodies per port
│ └── table_attribs{} Lookup table name, conditions, etc.
├── connectors[] Wiring: from_instance → to_instance
├── sources[] Source tables with db_type, owner, fields
├── targets[] Target tables with db_type, owner, fields
├── parameters[] $$VAR definitions with defaults
├── mapplet_instances[] Mapplet references (expanded if definitions available)
└── sql_overrides[] Custom SQL on Source Qualifiers
Results are cached by SHA-256 hash of file content. Unchanged files are not re-parsed on incremental runs.
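The caching step amounts to a content-addressed store keyed by the SHA-256 digest. The class below is a simplified stand-in for the ParseCache table described later, not the tool's actual API:

```python
# Sketch of the SHA-256 parse cache: re-parse only when content changes.
import hashlib

class ParseCache:
    def __init__(self):
        self._store = {}   # file_hash -> cached parse output
        self.parses = 0    # counts actual (non-cached) parses

    def get_or_parse(self, xml_bytes: bytes, parse_fn):
        """Return cached output for this exact content, parsing on miss."""
        key = hashlib.sha256(xml_bytes).hexdigest()
        if key not in self._store:
            self._store[key] = parse_fn(xml_bytes)
            self.parses += 1
        return self._store[key]

cache = ParseCache()
doc = b"<MAPPING NAME='m_load_customer'/>"
cache.get_or_parse(doc, lambda b: {"bytes": len(b)})
cache.get_or_parse(doc, lambda b: {"bytes": len(b)})  # cache hit, no re-parse
print(cache.parses)  # 1
```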
From the per-mapping parse results, the tool builds a project-level graph:
Dependency edges — for each mapping, inspect all Lookup transformations' target table names. If that table name matches another mapping's target name, create a dependency edge: the lookup mapping depends on the target mapping.
Shared assets — tables that appear as Lookup sources in `min_group_size` or more mappings across the project.
Repeated expressions — expression bodies that appear verbatim or structurally equivalent in 4+ mappings.
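Shared-asset detection reduces to counting lookup references per table and flagging the tables at or above the threshold. A sketch of that counting step (mapping and table names are illustrative):

```python
# Sketch of shared-asset detection: tables used as Lookup sources by
# the threshold number of mappings (or more) are flagged as shared.
from collections import Counter

def shared_assets(lookups_by_mapping: dict[str, list[str]],
                  min_refs: int = 3) -> dict[str, list[str]]:
    """Map each shared table to the sorted list of mappings referencing it."""
    counts = Counter(t for tables in lookups_by_mapping.values()
                     for t in set(tables))  # de-dupe within a mapping
    return {
        table: sorted(m for m, ts in lookups_by_mapping.items() if table in ts)
        for table, n in counts.items() if n >= min_refs
    }

lookups = {
    "m_fact_daily_txn": ["DIM_CUSTOMER", "DIM_ACCOUNT"],
    "m_fact_loan_origination": ["DIM_CUSTOMER"],
    "m_agg_monthly": ["DIM_CUSTOMER"],
}
print(shared_assets(lookups))
# {'DIM_CUSTOMER': ['m_agg_monthly', 'm_fact_daily_txn', 'm_fact_loan_origination']}
```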
AI is called to interpret elements the parser cannot classify deterministically:
- Custom SQL overrides in Source Qualifiers — what do they do, are two overrides structurally equivalent?
- Expression complexity — is a 40-line DECODE the same pattern as a 3-line IIF, or fundamentally different?
- Missing mapplet definitions — infer purpose from input/output ports and wiring
- Implicit dependencies — references hidden in SQL or stored procedure calls
Each mapping's transformation topology is reduced to a canonical spine: the ordered sequence of transformation types from source to target, derived from the connectors graph.
Example spines:
- `SQ → EXP → TARGET` (simple dimension load)
- `SQ → LKP → EXP → RTR → UPD → TARGET` (SCD2)
- `SQ(×3) → JNR → LKP(×2) → EXP → RTR → TARGET(×2)` (complex multi-source)
Mappings with matching spines are candidates for the same pattern group.
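Once spines are extracted, the grouping itself is a straightforward bucketing step. A minimal sketch (the transformation sequences and mapping names are illustrative):

```python
# Sketch of spine-based grouping: reduce each mapping to its ordered
# transformation-type sequence and bucket mappings by identical spines.
from collections import defaultdict

def spine(transform_types: list[str]) -> str:
    """Render the ordered transformation sequence as a canonical string."""
    return " → ".join(transform_types)

def group_by_spine(mappings: dict[str, list[str]]) -> dict[str, list[str]]:
    groups = defaultdict(list)
    for name, types in mappings.items():
        groups[spine(types)].append(name)
    return dict(groups)

mappings = {
    "m_load_customer": ["SQ", "EXP", "TARGET"],
    "m_load_product":  ["SQ", "EXP", "TARGET"],
    "m_dim_account":   ["SQ", "LKP", "EXP", "RTR", "UPD", "TARGET"],
}
print(group_by_spine(mappings))
```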
Within a spine group, variation is classified:
Tier 1 — Parameter variation. Structurally identical. Only table names, column names, filter values, connection strings differ. One template + one config file.
Tier 2 — Minor structural variation. Core flow is the same, but minor differences: an extra Expression, a Filter present in some but not all, different Lookup counts. Template accommodates variation via config flags.
Tier 3 — Fundamental structural variation. Transformation types or flow shapes differ fundamentally even where spines superficially match. Does not group. Convert individually.
Boundary between Tier 2 and Tier 3: spine + complexity profile. Two mappings match when they share a spine AND their complexity at each step is comparable.
Each mapping-to-group assignment carries a confidence level:
| Confidence | Meaning | Action |
|---|---|---|
| HIGH | Spine match + Tier 1 variation + no flags | Auto-confirmed |
| MEDIUM | Spine match + Tier 2 variation or minor flags | Tech lead should confirm |
| LOW | Weak spine match, significant variation, or AI uncertainty | Requires human review |
| UNCLASSIFIED | No spine match, custom SQL, missing definitions | Convert individually |
The `analysis.confidence_threshold` in the project config controls the cutoff below which mappings are flagged for human review (default: 0.7).
Two layers in one document:
Page 1 — Leadership Summary
- Project name, analysis date, mapping count
- Pattern groups found (count + names)
- Unique mappings (count + names)
- Scope reduction: "50 mappings → N templates + M unique files"
- Complexity distribution: simple / medium / complex counts
- Dependency depth: number of sequential stages
- Risk flags: count by severity
- Confidence distribution: HIGH / MEDIUM / LOW / UNCLASSIFIED counts
Remaining pages — Tech Lead Detail
- Per pattern group: member list, spine diagram, variation table, parameter differences, evidence, confidence per member
- Dependency DAG (visual)
- Shared asset catalogue
- Risk flags detail (per mapping)
- Unclassified mappings with reasons
| Sheet | Contents |
|---|---|
| Pattern Groups | Group name, spine, member count, member names, variation notes, confidence summary |
| Dependency Graph | Source mapping, target mapping, edge type, shared table name |
| Shared Assets | Table name, reference type, referenced by (mapping list), recommendation |
| Per-Mapping Assignments | Mapping name, assigned group (or "unique"), confidence, variation tier, flags, override notes |
| Risk Flags | Mapping name, flag type, severity, description, recommendation |
Machine-readable output describing the full analysis results.
{
"strategy_version": 1,
"project_name": "FirstBank_Q1_Migration",
"analysis_job_id": "uuid",
"analyzed_at": "ISO datetime",
"summary": {
"total_mappings": 50,
"pattern_groups": 8,
"template_candidates": 36,
"unique_mappings": 14,
"scope_reduction_pct": 56
},
"pattern_groups": [
{
"group_id": "trunc_load_01",
"group_name": "Truncate & Load",
"spine": "SQ → EXP → TARGET",
"members": [
{
"mapping_name": "m_load_customer",
"confidence": "HIGH",
"variation_tier": 1,
"variation_notes": null,
"override": null
}
],
"externalized_params": ["source_table", "target_table", "column_list"],
"template_hints": "Single config-driven truncate-and-load with optional filter"
}
],
"unique_mappings": [
{
"mapping_name": "m_complex_reconciliation",
"reason": "Tier 3 — fundamentally different structure, no pattern match",
"risk_flags": ["CUSTOM_SQL_OVERRIDE", "5_JOINER_TRANSFORMATIONS"]
}
],
"shared_assets": [
{
"table_name": "DIM_CUSTOMER",
"referenced_by": ["m_fact_daily_txn", "m_fact_loan_origination", "m_agg_monthly"],
"reference_type": "lookup",
"recommendation": "shared reference — referenced by 3 mappings"
}
],
"dependency_dag": [
{"from_mapping": "m_stg_customer", "to_mapping": "m_dim_customer", "via_table": "STG_CUSTOMER"},
{"from_mapping": "m_dim_customer", "to_mapping": "m_fact_daily_txn", "via_table": "DIM_CUSTOMER"}
],
"execution_order": [
["m_stg_customer", "m_stg_account", "m_stg_transactions"],
["m_dim_customer", "m_dim_account"],
["m_fact_daily_txn", "m_fact_loan_origination"],
["m_agg_monthly_summary"]
],
"review": {
"approved_at": "ISO datetime",
"approved_by": "reviewer_name",
"overrides": [],
"notes": ""
}
}

Schema is versioned via the `strategy_version` field.
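The staged `execution_order` follows from the `dependency_dag` entries by a level-wise topological sort (Kahn's algorithm): each stage is the set of mappings whose dependencies are all satisfied by earlier stages. A sketch of that derivation (the function name and edge representation are illustrative):

```python
# Sketch: derive staged execution order from dependency edges
# via Kahn's algorithm, one stage per "wave" of ready mappings.
from collections import defaultdict

def execution_stages(all_mappings: list[str],
                     edges: list[tuple[str, str]]) -> list[list[str]]:
    indegree = {m: 0 for m in all_mappings}
    downstream = defaultdict(list)
    for frm, to in edges:
        indegree[to] += 1
        downstream[frm].append(to)
    stages = []
    ready = sorted(m for m, d in indegree.items() if d == 0)
    while ready:
        stages.append(ready)
        nxt = []
        for m in ready:
            for d in downstream[m]:
                indegree[d] -= 1
                if indegree[d] == 0:
                    nxt.append(d)
        ready = sorted(nxt)
    return stages

edges = [("m_stg_customer", "m_dim_customer"),
         ("m_dim_customer", "m_fact_daily_txn")]
print(execution_stages(
    ["m_stg_customer", "m_dim_customer", "m_fact_daily_txn"], edges))
# [['m_stg_customer'], ['m_dim_customer'], ['m_fact_daily_txn']]
```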
The strategy review is a structured decision gate in the UI.
| Action | Effect |
|---|---|
| Confirm mapping assignment | Locks the mapping's group assignment |
| Move mapping to different group | Simple override — no re-analysis |
| Convert mapping individually | Removes mapping from its group; converts standalone |
| Split group | Triggers lightweight re-validation (Phase 2 re-run on cached Phase 1 data) |
| Merge groups | Triggers lightweight re-validation |
| Add notes | Stored per mapping and per group; carried into strategy JSON |
| APPROVE | Generates final strategy JSON with review metadata |
| REJECT | Returns to analysis with reviewer notes as constraints |
Every review action is stamped with reviewer name, role, timestamp, and decision.
Stored in the audit_log table and included in the strategy JSON.
The approved strategy is available in three formats:
JSON — machine-readable, schema-versioned. Downloaded via the UI or retrieved
via GET /api/projects/{id}/strategy.json. Suitable as input to any downstream
conversion tool or workflow.
PDF — human-readable strategy document. Leadership summary + tech lead detail.
Excel — reviewable tabular data. Five sheets covering all analysis outputs.
All three are generated from the same underlying analysis data and are consistent.
Dashboard (leadership) Project summary on one screen. Complexity heat map. Scope reduction metric. Dependency depth. Risk flag distribution. Confidence distribution. APPROVE / REJECT gate.
Pattern Groups (tech leads) Left panel: group list with member count and confidence indicator. Right panel: spine visualization, member table with variation tier and confidence, parameter differences table, evidence section, override controls, notes.
Dependency Graph (both) Interactive DAG. Nodes = mappings, colored by pattern group. Edges = dependencies. Click node for details. Execution stages highlighted. Critical path shown. Error propagation paths visible.
- React frontend
- FastAPI backend (port 8090)
- SSE progress streaming during analysis
- SQLite persistence
- PDF generated server-side (reportlab or weasyprint)
- Excel generated server-side (openpyxl)
- DAG visualization: d3-dag or dagre-d3
| Method | Path | Description |
|---|---|---|
| POST | `/api/projects` | Upload project config YAML and start analysis |
| POST | `/api/projects/folder` | Start analysis from a folder path |
| GET | `/api/projects` | List all analysis jobs |
| GET | `/api/projects/{id}` | Get analysis job state |
| GET | `/api/projects/{id}/stream` | SSE progress stream |
| GET | `/api/projects/{id}/strategy.json` | Download strategy JSON |
| GET | `/api/projects/{id}/strategy.pdf` | Download strategy PDF |
| GET | `/api/projects/{id}/strategy.xlsx` | Download strategy Excel |
| POST | `/api/projects/{id}/review` | Submit review decision (APPROVE / REJECT) |
| POST | `/api/projects/{id}/override` | Submit mapping override (move, confirm, individualize) |
| GET | `/api/projects/{id}/graph` | Get dependency graph data (JSON for frontend rendering) |
| GET | `/api/projects/{id}/groups` | Get pattern groups with members |
| GET | `/api/projects/{id}/groups/{gid}` | Get single group detail |
| POST | `/api/projects/{id}/deliver` | Trigger strategy delivery |
| GET | `/api/projects/{id}/reconciliation.json` | Reconciliation test manifest (v1.1) |
| GET | `/api/projects/{id}/reconciliation.xlsx` | Reconciliation manifest as Excel (v1.1) |
| POST | `/api/projects/{id}/reconciliation/results` | Ingest test results from conversion tool (v1.1) |
| GET | `/api/schema/strategy/{version}` | Published JSON Schema for strategy output (v1.1) |
| GET | `/api/schema/strategy/changelog` | Schema version history (v1.1) |
| GET | `/api/projects/{id}/tracking` | Get conversion tracking state for all mappings (v1.2) |
| PATCH | `/api/projects/{id}/tracking/{mapping}` | Update conversion status for a mapping (v1.2) |
| GET | `/api/projects/{id}/change-impacts` | Get source-change impact alerts (v1.3) |
| POST | `/api/programmes` | Create a programme grouping multiple jobs (v1.4) |
| GET | `/api/programmes/{name}` | Aggregate portfolio view across programme (v1.4) |
| GET | `/api/programmes/{name}/report.pdf` | Portfolio PDF report (v1.4) |
| GET | `/api/library` | Browse pattern library entries (v1.5) |
| GET | `/api/library/{entry_id}` | Pattern detail with matching project history (v1.5) |
| POST | `/api/library/{entry_id}/suppress` | Suppress library match for current project (v1.5) |
| GET | `/api/audit` | Audit trail of all review decisions |
| GET | `/api/health` | Liveness + readiness probe |
| POST | `/api/auth/login` | Login and receive JWT (v1.6) |
| POST | `/api/auth/logout` | Invalidate session token (v1.6) |
| GET | `/api/auth/me` | Current user info (v1.6) |
| GET | `/api/users` | List users — TECH_LEAD only (v1.6) |
| POST | `/api/users` | Create user — TECH_LEAD only (v1.6) |
| PATCH | `/api/users/{user_id}` | Update role / display name (v1.6) |
| DELETE | `/api/users/{user_id}` | Deactivate user (v1.6) |
| GET | `/api/reports/{report_type}` | Role-enforced report download (v1.7) |
AnalysisJob
├── job_id UUID
├── project_name From project config
├── project_config Full YAML content (stored)
├── status AnalysisStatus enum
├── current_phase 1–5
├── created_at / updated_at
└── state JSON blob — per-phase artifacts
├── source_resolution Phase 1.1: files found, types detected
├── parse_results Phase 1.2: per-mapping parse output (cached)
├── project_graph Phase 1.3: aggregated graph, dependency edges, shared assets
├── pattern_groups Phase 2: groups with members, spines, confidence
├── strategy_pdf_path Phase 3: path to generated PDF
├── strategy_xlsx_path Phase 3: path to generated Excel
├── strategy_json Phase 3: strategy JSON
├── review Phase 4: review decision, overrides, notes
└── delivery Phase 5: delivery status
AnalysisStatus
├── PENDING
├── RESOLVING_SOURCE
├── PARSING
├── BUILDING_GRAPH
├── GROUPING
├── GENERATING_STRATEGY
├── AWAITING_REVIEW
├── APPROVED
├── DELIVERING
├── COMPLETE
├── FAILED
└── REJECTED
ReviewRecord
├── reviewer_name
├── reviewer_role
├── review_date
├── decision APPROVED | REJECTED
├── overrides List of mapping overrides applied
├── notes Free-text reviewer notes
MappingOverride
├── mapping_name
├── action confirm | move | individualize
├── from_group Original group (if moving)
├── to_group Target group (if moving)
├── notes Reviewer rationale
ParseCache
├── file_hash SHA-256 of XML content
├── parse_output Cached parse result JSON
├── cached_at Timestamp
ReconciliationManifest [v1.1]
├── job_id FK → AnalysisJob
├── generated_at
└── entries[]
├── mapping_name
├── source_tables[]
├── target_table
├── join_key[]
├── column_checks[]
├── suggested_sql{} source_count / target_count / mismatch_check
└── ai_notes Caveats from AI annotation
ConversionTrackingEntry [v1.2]
├── job_id FK → AnalysisJob (the approved strategy job)
├── mapping_name
├── status NOT_STARTED | IN_PROGRESS | CONVERTED | VALIDATED | BLOCKED | DEFERRED
├── assignee
├── notes
├── last_updated_at
└── history[] Append-only list of status transitions
ChangeImpactAlert [v1.3]
├── job_id FK → AnalysisJob (the re-analysis job)
├── mapping_name
├── change_type STRUCTURAL | PARAMETER_ONLY
├── conversion_status_at_change_time
├── severity INFO | WARNING | HIGH | CRITICAL
├── detected_at
└── acknowledged bool
Programme [v1.4]
├── name Unique programme name
├── job_ids[] FKs → AnalysisJob
└── created_at
PatternLibraryEntry [v1.5]
├── entry_id UUID
├── spine_signature Canonical spine string (e.g., "SQ → EXP → TARGET")
├── variation_profile Fingerprint of variation characteristics (JSON)
├── pattern_name Human name from approving tech lead
├── source_project Project name where confirmed
├── source_job_id FK → AnalysisJob
├── confirmed_at
├── confirmed_by
└── match_count Incremented each time a new project matches this entry
ReconciliationResult [v1.1 optional ingest]
├── job_id FK → AnalysisJob
├── mapping_name
├── result PASS | FAIL | PARTIAL
├── notes
└── received_at
User [v1.6]
├── user_id UUID
├── username Unique, case-insensitive
├── password_hash bcrypt (12 rounds)
├── role TECH_LEAD | MANAGER | DEVELOPER | ANALYST
├── display_name Free text
├── created_at
└── last_login_at
Phase 1 caches parse results by file content hash. On re-analysis:
- Only new/changed XMLs are re-parsed
- Phase 2 runs on the full project (fast — parsing is cached)
- Previous human overrides are preserved unless the underlying XML changed
- Strategy document includes a diff section showing what changed
Security is infrastructure, not a feature layer. See SECURITY.md for full details.
| Threat | Defence |
|---|---|
| XXE in mapping XMLs | safe_parse_xml() — DTD and entity resolution disabled |
| Path traversal in folder scanning | All paths resolved relative to configured root; symlinks rejected |
| Zip Slip / Zip Bomb | Validated extraction with byte and entry count caps |
| Malformed project config | yaml.safe_load() + schema validation before processing |
| Secrets in XML | scan_xml_for_secrets() checks CONNECTION attrs at parse time |
| SSRF in repo cloning | URL allowlist; no arbitrary redirects |
| Unauthenticated access | Session-cookie middleware on all non-static routes |
| Dependency CVEs | pip-audit in CI; dependencies pinned |
| Browser-side injection (XSS, clickjacking) | All HTTP security headers applied (CSP, HSTS, X-Frame-Options) |

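As an illustration of the XXE row: a minimal sketch of the idea behind `safe_parse_xml()`. The shipped implementation may differ; rejecting any DOCTYPE outright is one common way to disable both external-entity resolution and entity-expansion bombs, since PowerCenter exports never need a DTD:

```python
import xml.etree.ElementTree as ET


class UnsafeXMLError(ValueError):
    pass


def safe_parse_xml(text: str) -> ET.Element:
    """Reject any document that declares a DTD before parsing.

    Blocking DOCTYPE declarations stops XXE and entity-expansion attacks at
    the door; a mapping XML that carries one is treated as hostile. (Sketch
    only: the real safe_parse_xml() may use a hardened parser instead.)
    """
    if "<!DOCTYPE" in text[:4096].upper():
        raise UnsafeXMLError("DTD declarations are not allowed in mapping XML")
    return ET.fromstring(text)
```

A library such as `defusedxml` achieves the same effect at the parser level and is the usual choice when entity handling must be disabled rather than rejected.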
| Metric | Target |
|---|---|
| Mapping parsing completion rate | > 99% of mappings parsed successfully |
| Pattern group accuracy (human-confirmed) | > 85% of auto-assigned groupings confirmed without override |
| Dependency DAG completeness | > 90% of actual dependencies detected |
| Shared asset detection rate | > 95% of tables referenced by 3+ mappings identified |
| Strategy generation time (50 mappings) | < 10 minutes |
| Strategy generation time (500 mappings) | < 60 minutes |
| Human review time (median) | < 30 minutes for 50-mapping project |
| Scope reduction | Typical project: 40-60% reduction (N mappings → fewer templates + unique files) |
| Incremental re-analysis time | < 2 minutes for 5 new mappings added to 50-mapping project |
| False positive rate (incorrect groupings) | < 10% |
| Unclassified mapping rate | < 15% of project |
- Python 3.11+ — asyncio patterns, `X | Y` union syntax
- SQLite — sufficient for single-instance deployment; PostgreSQL migration path via SQLAlchemy
- Claude API required — `ANTHROPIC_API_KEY` must be set. Structural fingerprinting (spine extraction, dependency DAG) is deterministic, but interpreting what mappings actually do — understanding SQL override intent, classifying expression logic, inferring reconciliation caveats — requires AI. There are no Python libraries that can reliably extract business intent from arbitrary Informatica SQL and expression syntax. Attempting to run without an API key produces structural groupings only, with no interpretation layer — not a useful output for a migration strategy review. [ENTERPRISE-TODO] Support for alternative LLM providers (Azure OpenAI, Bedrock) via a swappable `_client.py` adapter
- Port 8090 — default; configurable via `PORT` env var
- No Docker required — plain Python venv deployment; Dockerfile optional
- License — CC BY-NC 4.0; commercial use requires written permission
The following three capabilities extend the tool beyond structural analysis into the active migration lifecycle. They do not require conversion to have happened — they are about tracking and protecting the analysis artefacts as conversion proceeds.
The gap: The tool knows what each mapping reads and writes — source tables, target tables, key columns, expression logic, join conditions. But it produces no guidance on how to prove the converted code is equivalent to the original Informatica mapping. Data reconciliation is one of the hardest parts of any ETL migration and currently falls entirely to the delivery team.
What this capability adds:
For each mapping (and each pattern group), generate a reconciliation test manifest — a structured description of what needs to be validated:
{
"mapping_name": "m_load_customer",
"reconciliation": {
"source_tables": ["OLTP.CUSTOMER"],
"target_table": "DW.DIM_CUSTOMER",
"join_key": ["CUSTOMER_ID"],
"column_checks": ["CUSTOMER_NAME", "EMAIL", "STATUS", "LOAD_DATE"],
"row_count_check": true,
"null_checks": ["CUSTOMER_ID", "EMAIL"],
"suggested_sql": {
"source_count": "SELECT COUNT(*) FROM OLTP.CUSTOMER WHERE ...",
"target_count": "SELECT COUNT(*) FROM DW.DIM_CUSTOMER WHERE ...",
"mismatch_check": "SELECT a.CUSTOMER_ID FROM ... EXCEPT SELECT b.CUSTOMER_ID FROM ..."
},
"notes": "Expression port CUST_FULL_NAME is a simple concat — validate output matches IIF(ISNULL(FIRST_NAME),'',FIRST_NAME)||' '||LAST_NAME"
}
}

For pattern groups, the manifest covers the group template (one set of checks, parameterised):
{
"group_id": "trunc_load_01",
"group_name": "Truncate & Load",
"reconciliation_template": {
"description": "Row count source vs target. Column sample on key fields. NULL check on business keys.",
"parameterised_by": ["source_table", "target_table", "join_key"],
"members": [
{ "mapping_name": "m_load_customer", "source_table": "OLTP.CUSTOMER", "target_table": "DW.DIM_CUSTOMER", "join_key": "CUSTOMER_ID" },
{ "mapping_name": "m_load_account", "source_table": "OLTP.ACCOUNT", "target_table": "DW.DIM_ACCOUNT", "join_key": "ACCOUNT_ID" }
]
}
}

Outputs:
- `GET /api/projects/{id}/reconciliation.json` — full test manifest
- `GET /api/projects/{id}/reconciliation.xlsx` — one row per mapping with source/target/key/column checks
- Reconciliation section in the PDF report (tech lead detail pages)
- AI-assisted: for mappings with SQL overrides or complex expressions, Claude annotates the suggested checks with caveats ("this SQL override joins 3 tables — row count alone is not sufficient; validate intermediate join cardinality")
Project config addition:
analysis:
  generate_reconciliation_manifest: true   # default: true
  reconciliation_key_hint: ".*_ID$"        # regex hint for identifying key columns

Why this matters: The delivery team can take the reconciliation manifest directly into their test harness. The analysis tool has already done the work of figuring out source/target pairs, key columns, and expression complexity — the test plan writes itself from that data.
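The `suggested_sql` section of the manifest can be derived mechanically from the extracted source/target/key metadata. A minimal sketch — the helper name is illustrative, and real queries would carry the WHERE clauses shown elided in the example above:

```python
def build_reconciliation_checks(source_table: str, target_table: str,
                                join_key: list[str]) -> dict:
    """Derive the three suggested reconciliation queries from parse results.

    Mirrors the manifest's suggested_sql section: row counts on both sides,
    plus a key-level EXCEPT to find keys present in source but not target.
    (Sketch only: filters/partition predicates are intentionally omitted.)
    """
    keys = ", ".join(join_key)
    return {
        "source_count": f"SELECT COUNT(*) FROM {source_table}",
        "target_count": f"SELECT COUNT(*) FROM {target_table}",
        "mismatch_check": (
            f"SELECT {keys} FROM {source_table} "
            f"EXCEPT SELECT {keys} FROM {target_table}"
        ),
    }
```

For a pattern-group template, the same function would simply be called once per member with that member's `source_table`, `target_table`, and `join_key` parameters.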
Boundary — reconciliation feedback loop: This tool generates the test plan. Running the tests and reporting results is the conversion project's responsibility. The boundary is clean and explicit:
- This tool outputs: `reconciliation.json` manifest (what to check and how)
- Conversion tool inputs: that manifest into its test runner
- Optional integration: the conversion tool may POST results back via webhook if the teams want closed-loop visibility. The expected POST schema:

{ "job_id": "abc123", "mapping_name": "m_load_customer", "result": "PASS | FAIL | PARTIAL", "notes": "row count matched, 2 columns mismatched" }

This tool will store and surface those results if posted, but does not require them. Implementation of the test runner itself is out of scope.
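A sketch of how the optional ingest might validate that payload before writing a ReconciliationResult row. The function name and rejection behaviour are assumptions, not the tool's actual code:

```python
VALID_RESULTS = {"PASS", "FAIL", "PARTIAL"}


def validate_reconciliation_post(payload: dict) -> dict:
    """Validate the optional webhook payload before storing it.

    Unknown keys are ignored; missing required keys or an invalid result
    value reject the whole payload, so a malformed conversion-tool
    integration fails loudly rather than polluting the results table.
    """
    missing = {"job_id", "mapping_name", "result"} - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if payload["result"] not in VALID_RESULTS:
        raise ValueError(f"result must be one of {sorted(VALID_RESULTS)}")
    return {
        "job_id": payload["job_id"],
        "mapping_name": payload["mapping_name"],
        "result": payload["result"],
        "notes": payload.get("notes", ""),  # notes are optional
    }
```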
The gap: Once the strategy is approved and conversion begins (in the separate conversion project), the approved strategy JSON becomes a static document. There is no way to track which mappings have been converted, which are in progress, which are blocked, and what percentage of the project is complete. Leadership has no live view of progress.
What this capability adds:
A conversion tracker layer on top of the approved strategy. Each mapping in the approved strategy gains a lifecycle status:
| Status | Meaning |
|---|---|
| `NOT_STARTED` | Approved; conversion not yet begun |
| `IN_PROGRESS` | Assigned to a developer; conversion underway |
| `CONVERTED` | Code produced; awaiting validation |
| `VALIDATED` | Reconciliation checks passed |
| `BLOCKED` | Blocked by a dependency, data quality issue, or scope question |
| `DEFERRED` | Explicitly deferred to a later phase |
Tracking is updated via API (push from the conversion project or manual update via UI):
PATCH /api/projects/{id}/tracking/{mapping_name}
{ "status": "CONVERTED", "assignee": "jane.smith", "notes": "PR #142" }
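The roadmap defines the six statuses but not which transitions are legal; a PATCH handler would still want to reject nonsensical moves. One possible transition guard (the edge set here is an assumption, not specification):

```python
# Illustrative transition rules: the spec names the statuses but not the
# legal moves between them, so these edges are assumptions.
ALLOWED = {
    "NOT_STARTED": {"IN_PROGRESS", "DEFERRED"},
    "IN_PROGRESS": {"CONVERTED", "BLOCKED", "DEFERRED"},
    "CONVERTED": {"VALIDATED", "IN_PROGRESS", "BLOCKED"},
    "VALIDATED": {"IN_PROGRESS"},  # e.g. re-opened after a source-change alert
    "BLOCKED": {"IN_PROGRESS", "DEFERRED"},
    "DEFERRED": {"NOT_STARTED", "IN_PROGRESS"},
}


def apply_status_change(current: str, new: str) -> str:
    """Validate a PATCH before persisting: reject unknown or illegal moves."""
    if new not in ALLOWED:
        raise ValueError(f"unknown status: {new}")
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {new}")
    return new
```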
Dashboard additions:
- Progress bar: N of M mappings converted (by status tier)
- Pattern group progress: "Truncate & Load — 11/14 converted, 3 in progress"
- Execution stage readiness: "Stage 2 (dimensions) — 100% validated. Stage 3 (facts) — 40% converted."
- Blocked mapping list with notes
- Velocity trend: mappings validated per week
New UI view: Progress Tracker
- Table of all mappings with status, assignee, last updated, notes
- Filter by status, group, stage, assignee
- Bulk status update
- Export to CSV for project management tools
Webhook events added:
- `mapping_status_changed` — fires when any mapping status changes
- `stage_complete` — fires when all mappings in an execution stage reach VALIDATED
- `project_complete` — fires when all mappings reach VALIDATED
Why this matters: The strategy document is only useful if someone can tell whether it is being executed against. Progress tracking closes the loop between the approved strategy and the actual delivery — without requiring the conversion project to build its own tracking.
The gap: Informatica teams continue making production changes to mappings while conversion is underway. The tool's diff capability detects changes between analysis runs — but it does not cross-reference those changes against the conversion tracking state. A mapping that has been approved, assigned, and partially converted could silently change in the source Informatica project, invalidating the strategy and the work in progress.
What this capability adds:
Automatic impact alerting when a re-analysis detects changes to mappings that have an active conversion state.
The alert matrix:
| Mapping state | Source change detected | Alert severity | Action required |
|---|---|---|---|
| `NOT_STARTED` | Any change | INFO | Noted in diff report; no action needed |
| `IN_PROGRESS` | Structural change (spine or expression) | WARNING | Developer notified; review against new analysis |
| `IN_PROGRESS` | Parameter change only (table/column names) | INFO | Noted; likely affects config, not template |
| `CONVERTED` | Any structural change | HIGH | Converted code may be incorrect; re-review required |
| `VALIDATED` | Any structural change | CRITICAL | Validated output may no longer match; re-validate |
| `CONFIRMED` override | Underlying XML changed | WARNING | Human override cleared; re-confirm required |
Alert delivery:
- In-app notification panel (new UI element in dashboard header)
- Webhook event: `source_change_impact` with affected mapping list and severity
- Email digest (if `ALERT_EMAIL` env var set): daily summary of change impacts
New endpoint:
GET /api/projects/{id}/change-impacts
Returns list of mappings where source changed after conversion tracking status was set, grouped by severity.
Re-analysis trigger:
When the watcher detects a changed `*.project.yaml` or when a CI/CD trigger fires, the tool automatically cross-references the diff report against conversion tracking states and generates impact alerts before any human reviews the new analysis.
Project config addition:
analysis:
  alert_on_source_change: true    # default: true when tracking is active
  structural_change_only: false   # if true, suppress alerts for parameter-only changes

Why this matters: A migration that takes 6 months is not running against a frozen source. Informatica teams fix bugs, add columns, and change logic during the conversion window. Without this, those changes are invisible until a UAT failure surfaces them — typically late, expensive, and stressful. This capability makes them visible the moment they happen.
The gap: Large organisations run multiple Informatica projects in parallel — different business units, different migration phases, sometimes different delivery teams. Today every project is a completely isolated job. There is no aggregate view: no total mapping count across the programme, no cross-project risk comparison, no single leadership report that covers the whole migration.
What this capability adds:
A programme concept that groups multiple analysis jobs under one name. A programme is lightweight — just a label and a list of associated job IDs. No new analysis logic. All the data is already there.
New API:
POST /api/programmes { "name": "FirstBank Full Migration", "job_ids": ["abc123", "def456"] }
GET /api/programmes/{name} Aggregate view
GET /api/programmes/{name}/report Portfolio PDF/Excel
Portfolio dashboard — new top-level UI view:
| Metric | Shown across all projects |
|---|---|
| Total mappings | Sum across all projects |
| Total pattern groups | Sum; cross-project duplicates flagged |
| Scope reduction | Weighted average across projects |
| Confidence distribution | HIGH/MEDIUM/LOW/UNCLASSIFIED breakdown |
| Risk flags | Sorted by severity across all projects |
| Review status | Approved / Pending / Rejected per project |
| Conversion progress | % VALIDATED across all projects (requires v1.2) |
Project config addition:
project:
  name: "FirstBank DWH Migration"
  programme: "FirstBank Full Migration"   # optional grouping tag

When `programme` is set, the job is automatically associated with that programme on creation. The programme aggregate is always derived on-read — no separate sync needed.
Why this matters: Migration programmes are multi-project by nature. Leadership needs a single view of progress across the whole programme, not a tab-switching exercise across individual project dashboards. This capability costs almost nothing to build on top of the existing data model and immediately changes how the tool is used at the programme level.
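Derived-on-read aggregation is cheap because it is just a fold over the member jobs. A sketch with illustrative field names (`mapping_count`, `validated_count`, `risk_flags` are assumptions, not the real job schema):

```python
SEVERITY_ORDER = ["CRITICAL", "HIGH", "WARNING", "INFO"]


def programme_aggregate(jobs: list[dict]) -> dict:
    """Derive portfolio metrics on-read; nothing is persisted per programme.

    jobs: the AnalysisJob records referenced by the programme's job_ids.
    """
    total = sum(j["mapping_count"] for j in jobs)
    validated = sum(j["validated_count"] for j in jobs)
    flags = [f for j in jobs for f in j["risk_flags"]]
    return {
        "projects": len(jobs),
        "total_mappings": total,
        "pct_validated": round(100 * validated / total, 1) if total else 0.0,
        # most severe first, matching the portfolio dashboard's sort order
        "risk_flags": sorted(flags, key=SEVERITY_ORDER.index),
    }
```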
The gap: Every new project analysis starts from scratch. If "Truncate & Load"
(spine: SQ → EXP → TARGET) has already been confirmed as a pattern in three previous
projects and a template has been built for it, the fourth project's analysis produces
that same pattern group again — with no awareness that it already exists and has already
been solved.
Over time, repeat clients, industry verticals (banks, insurers, telcos), and in-house teams build up significant pattern knowledge that currently lives nowhere — or lives only in the heads of the engineers who did the earlier projects.
What this capability adds:
A pattern library — a shared registry of confirmed pattern signatures. The library is passive and automatic: no extra config, no manual curation required.
How it works:
When a strategy is approved and pattern groups are confirmed:
- Each HIGH-confidence group's spine + variation profile is written to the pattern library
- The group's name, source project, and confirmation date are recorded
When a new project runs Phase 2 (pattern grouping):
- After spine matching, each group is checked against the library
- Matches are surfaced as: "This group matches Truncate & Load (confirmed in 3 previous projects, most recently FirstBank Q1). A template may already exist."
- The match is a suggestion, not a constraint — tech leads can ignore it
Data model:
PatternLibraryEntry
├── entry_id UUID
├── spine_signature Canonical spine string (e.g., "SQ → EXP → TARGET")
├── variation_profile Fingerprint of variation characteristics
├── pattern_name Human name (from the approving tech lead)
├── source_project Project name where confirmed
├── source_job_id FK → AnalysisJob
├── confirmed_at
├── confirmed_by
└── match_count How many subsequent projects have matched this entry
New endpoints:
GET /api/library Browse all library entries
GET /api/library/{entry_id} Pattern detail + projects that matched
POST /api/library/{entry_id}/suppress Exclude entry from future matching (per-project)
Matching is fuzzy, not exact. Two spines match if they share the same core sequence with a configurable tolerance for minor additions (Tier 2 variation). The matching threshold uses the same `fingerprint_strictness` setting from the project config.
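One way to implement the fuzzy comparison is token-level sequence similarity over the spine, with the threshold standing in for `fingerprint_strictness`. A sketch; the real matcher also compares variation profiles, which is omitted here:

```python
from difflib import SequenceMatcher


def spine_matches(candidate: str, library: str, strictness: float = 0.8) -> bool:
    """Fuzzy spine comparison: same core sequence, tolerance for additions.

    Spines are compared as token sequences ("SQ → EXP → TARGET" is 3 tokens),
    so one extra transformation in an otherwise identical spine still
    matches, while a structurally different spine does not.
    """
    a = candidate.split(" → ")
    b = library.split(" → ")
    return SequenceMatcher(None, a, b).ratio() >= strictness
```

With the default threshold, `SQ → EXP → FIL → TARGET` still matches the library entry `SQ → EXP → TARGET` (one Tier 2 addition), while `SQ → AGG` does not match `SQ → LKP → EXP → RTR → UPD`.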
Privacy / isolation: The library is local to the server instance. There is no cloud sync, no sharing of client data externally. In a multi-client consultancy scenario, the operator controls whether the library is shared across clients or isolated per deployment.
Why this matters: Pattern knowledge compounds. The 10th banking migration project doesn't need to rediscover that banks have the same 6 core patterns — the library already knows. Analysis gets faster to review, templates get reused sooner, and institutional knowledge is captured in the tool rather than lost between projects.
The gap: The tool currently has no user identity concept. Anyone who can reach port 8090 can do anything — approve strategies, override groupings, update tracking status. For a one-person team, this is fine. For a team of 5–15 people with different responsibilities, it creates accountability problems and accidental overwrites.
What this capability adds:
A lightweight login system with four roles. Department-appropriate — no OAuth, no LDAP, no corporate identity integration required.
Roles:
| Role | What they can do |
|---|---|
| `TECH_LEAD` | Everything — analyse, review, override, approve, manage users |
| `MANAGER` | Read all views, download reports, view tracking progress. Cannot approve or override |
| `DEVELOPER` | View their assigned mappings, update tracking status for those mappings |
| `ANALYST` | Full read access everywhere. Pattern library. Portfolio view |
Authentication:
- Login page in the React SPA (username + password)
- Passwords hashed with bcrypt (12 rounds), stored in `users` SQLite table
- JWT token returned on login, stored in browser `localStorage`
- Token expiry: 8 hours. Refresh not required — re-login on expiry
- FastAPI `current_user` dependency injected on all protected routes
- First-run setup: CLI command `python -m app.backend.setup_users` creates the first `TECH_LEAD` account interactively
Role-based access enforcement:
- `TECH_LEAD` — all endpoints
- `MANAGER` — GET endpoints only; blocked from POST/PATCH/review/override
- `DEVELOPER` — GET on job summary + their tracking mappings; PATCH on tracking only
- `ANALYST` — all GET endpoints; blocked from all POST/PATCH
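The enforcement rules collapse to a small predicate. A simplified sketch: the real FastAPI dependency would also resolve the current user from the JWT and check developer mapping assignments, both omitted here:

```python
def is_allowed(role: str, method: str, path: str) -> bool:
    """Simplified role check, per the enforcement rules above.

    The real DEVELOPER rule additionally verifies that the mapping named in
    the path is assigned to the calling user (assumption: omitted for
    brevity). Unknown roles are denied by default.
    """
    if role == "TECH_LEAD":
        return True  # all endpoints
    if role in ("MANAGER", "ANALYST"):
        return method == "GET"  # read-only roles
    if role == "DEVELOPER":
        # reads everywhere they can see, writes only on tracking routes
        return method == "GET" or (method == "PATCH" and "/tracking/" in path)
    return False
```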
Users data model:
User
├── user_id UUID
├── username Unique, case-insensitive
├── password_hash bcrypt
├── role TECH_LEAD | MANAGER | DEVELOPER | ANALYST
├── display_name Free text
├── created_at
└── last_login_at
User management endpoints:
POST /api/auth/login Return JWT (public)
POST /api/auth/logout Invalidate token
GET /api/auth/me Current user info
GET /api/users List users (TECH_LEAD only)
POST /api/users Create user (TECH_LEAD only)
PATCH /api/users/{user_id} Update role/display name (TECH_LEAD only)
DELETE /api/users/{user_id} Deactivate user (TECH_LEAD only)
Mapping assignment (DEVELOPER role):
- Tech leads assign mappings to developers via the tracking view
- `PATCH /api/projects/{id}/tracking/{mapping}` gains an `assignee` field
- Developers can only PATCH mappings assigned to them
- Assignment visible in the Progress Tracker UI
Project config addition:
# Optional: pre-assign mappings in config rather than via UI
assignments:
  jane.smith: ["m_load_customer", "m_load_account"]
  john.doe: ["m_stg_transactions", "m_fact_daily"]

New dependencies: `python-jose>=3.3,<4.0` (JWT), `passlib[bcrypt]>=1.7,<2.0`
[ENTERPRISE-TODO] SSO/LDAP integration: Wrap with an OAuth2 reverse proxy
(e.g., oauth2-proxy, Authentik). The FastAPI JWT middleware is designed to be
replaceable — swap the verify_token dependency for an OIDC token validator.
[ENTERPRISE-TODO] Fine-grained project access: Currently, roles are global.
An enterprise deployment might need "Jane can see Project A but not Project B". This
requires a project-level ACL table — not built for the department tool.
[ENTERPRISE-TODO] MFA: Not implemented. If the tool runs on an internal
network, the corporate VPN provides a sufficient perimeter. External exposure requires
TOTP or hardware key support.
Why this matters: Accountability without bureaucracy. The tech lead's approval is meaningful when it is stamped with their name and only they could have given it. The manager's view is cleaner when they are not accidentally shown controls they should not touch. The developer's view is focused when it only shows their work.
The gap: The tool produces one PDF report that contains both the leadership summary and the technical detail. This is the right document for a tech lead, but it is too detailed for a manager presenting to stakeholders and too high-level for a developer trying to understand their reconciliation requirements. Different roles need different views of the same underlying data.
What this capability adds:
Three distinct report formats, each targeted at one persona. Generated from the same analysis data — no new analysis required.
Report 1 — Executive Summary (for MANAGER role)
One page. Deliverable to a CTO or business sponsor who asked "where are we?".
Contents:
- Project name, date, current phase (ANALYSIS / IN REVIEW / APPROVED / IN CONVERSION)
- Total mappings: `N mappings → M templates + K unique files` (scope reduction)
- Conversion progress: % NOT_STARTED / IN_PROGRESS / CONVERTED / VALIDATED / BLOCKED
- Risk summary: count of HIGH/CRITICAL flags, blocked mappings
- Execution stage readiness: "Stage 1: ✅ complete. Stage 2: 6/14 validated. Stage 3: 0% started"
- One-line status: "On track" / "At risk — 3 mappings blocked" / "Needs attention"
Format: PDF (1 page) + optional email digest (daily, if `REPORT_EMAIL` env var set)
Report 2 — Tech Lead Review Pack (for TECH_LEAD role)
Existing PDF report — unchanged. This is the full strategy document with pattern group details, dependency graph, reconciliation tables, risk flags, and audit trail.
Report 3 — Developer Assignment Sheet (for DEVELOPER role)
Personal work breakdown. Shows only the mappings assigned to the logged-in developer.
Contents per mapping:
- Mapping name, pattern group, variation tier
- Source/target tables
- Reconciliation checklist (from reconciliation manifest)
- Current status + history
- Any change-impact alerts for their mappings
- Link to the strategy JSON entry
Format: Excel (1 sheet, one row per mapping) + filtered UI view
Report 4 — Portfolio Dashboard (for ANALYST role)
Cross-project view. Extends the v1.4 portfolio dashboard with persona-aware access.
Contents:
- All projects in the programme: name, status, mapping count, % validated
- Pattern library matches across projects
- Aggregate risk flags and blocked mappings
- Trend: mappings validated per week across the programme
Format: In-app UI view (existing portfolio dashboard, extended) + PDF export
Delivery:
- All reports available via `GET /api/reports/{type}?job_id={id}` where `type` ∈ {executive, techlead, developer, portfolio}
- Auth-enforced: MANAGER gets executive only, DEVELOPER gets their sheet only, TECH_LEAD gets all, ANALYST gets portfolio
- Dashboard header shows only the download buttons the current user's role permits
New endpoint:
GET /api/reports/{report_type}?job_id={id} Role-enforced report download
GET /api/reports/digest/preview Preview today's email digest (TECH_LEAD only)
[ENTERPRISE-TODO] Scheduled email delivery: REPORT_SCHEDULE=daily|weekly env var
with cron job. Department tool uses a manual "Send now" button instead.
[ENTERPRISE-TODO] Custom report templates: Teams customising the PDF layout
(logo, branding, section order) need a templating layer. Not in scope for the department
tool — the reportlab output is the template.
Why this matters: A manager who receives the 25-page tech lead PDF will not read it. A developer who sees the full strategy document is confused about what they personally need to do. The right information for the right person is not a nice-to-have — it is what makes the tool actually used by the whole team rather than just one person.
- Project config parser and validation
- Source resolution (folder type only)
- Mapping XML parser (extract transformations, connectors, sources, targets)
- Cross-mapping project graph construction (dependency edges, shared assets)
- Spine extraction, structural fingerprinting, and pattern grouping
- Execution order generation (topological sort into parallel stages)
- Strategy JSON generation
- FastAPI backend with all API endpoints
- React UI: dashboard view, pattern groups view, dependency graph view
- Structured logging
- PDF report generation (leadership summary + tech lead detail)
- Excel workbook generation (5 sheets)
- SSE progress streaming
- Human review gate (APPROVE / REJECT) with persistence
- Override controls (confirm, move, individualize)
- SQLite persistence for jobs and audit trail
- AI-assisted expression classification
- AI-assisted custom SQL interpretation
- Confidence scoring refinement
- Dependency DAG visualization in UI (interactive)
- PDF download button; SSE progress bar in UI
- Watcher mode for `*.project.yaml` files
- Incremental analysis (parse caching, diff reporting)
- Override preservation across re-analysis
- Webhook notifications (HMAC-signed)
- Git repo source type (shallow clone via GitPython)
- ZIP upload source type with path traversal protection
- S3 source type
- CI/CD API trigger endpoint with API key auth
- Artifacts listing endpoint
- Full test suite (57 tests: unit + API + integration)
- GitHub Actions CI pipeline
- Security headers middleware + rate limiting (slowapi)
- Input validation hardening
- Parallel parsing (ThreadPoolExecutor)
- Paginated job list
- CAPABILITIES.md
See §20.1 for full specification.
- Reconciliation manifest generator (per mapping + per pattern group)
- Source/target/key column extraction from parse results
- AI-assisted annotation for complex expressions and SQL overrides
- `GET /api/projects/{id}/reconciliation.json`
- `GET /api/projects/{id}/reconciliation.xlsx`
- Reconciliation section in PDF report
- `analysis.generate_reconciliation_manifest` config flag
- `analysis.reconciliation_key_hint` regex config
- Optional reconciliation result ingest endpoint (POST from conversion tool)
- Strategy JSON schema contract:
  - JSON Schema spec published at `GET /api/schema/strategy/{version}`
  - `strategy_version` field upgraded to full semantic version string ("1.1.0")
  - Backward compatibility guarantee documented: v1.x readers can consume v1.0 output
  - Schema changelog maintained at `GET /api/schema/strategy/changelog`
  - Conversion tool declares supported schema version in its config; analysis tool rejects incompatible consumers with a clear error
See §20.2 for full specification.
- Conversion lifecycle status per mapping (NOT_STARTED → VALIDATED)
- `PATCH /api/projects/{id}/tracking/{mapping_name}` endpoint
- Progress Tracker UI view (table, filter, bulk update)
- Dashboard progress bar and stage readiness indicators
- Webhook events: `mapping_status_changed`, `stage_complete`, `project_complete`
- CSV export for project management tool integration
See §20.3 for full specification.
- Cross-reference diff report against conversion tracking state
- Alert severity matrix (INFO / WARNING / HIGH / CRITICAL)
- In-app notification panel
- `GET /api/projects/{id}/change-impacts` endpoint
- Webhook event: `source_change_impact`
- Email digest support (`ALERT_EMAIL` env var)
- `analysis.alert_on_source_change` config flag
- Automatic impact assessment on every re-analysis or watcher trigger
See §20.4 for full specification.
- Programme concept: group multiple jobs under one name
- `programme` field in project config
- `POST /api/programmes`, `GET /api/programmes/{name}`
- Portfolio dashboard UI view (aggregate metrics across all projects)
- Portfolio PDF/Excel report
- Cross-project risk and confidence distribution
See §20.5 for full specification.
- Pattern library SQLite table; auto-populated on strategy approval
- Phase 2 match step against library (fuzzy spine + variation profile)
- Library match surfaced as suggestion in pattern groups view
- `GET /api/library`, `GET /api/library/{entry_id}`
- `POST /api/library/{entry_id}/suppress` per-project suppression
- Match count tracking across projects
- Local-only; no external sync
See §20.6 for full specification.
- Login page in React SPA (username + password)
- Four roles: TECH_LEAD, MANAGER, DEVELOPER, ANALYST
- JWT auth, 8-hour expiry; credentials stored in `users` SQLite table (bcrypt passwords)
- Role-based endpoint enforcement (FastAPI dependency injection)
- Developer mapping assignment — PATCH tracking for assigned mappings only
- User management UI (TECH_LEAD only)
- CLI first-run setup: `python -m app.backend.setup_users`
- New dependencies: `python-jose`, `passlib[bcrypt]`
See §20.7 for full specification.
- Executive Summary PDF (1 page, MANAGER role) — progress, risk, stage readiness
- Developer Assignment Sheet Excel (DEVELOPER role) — their mappings + reconciliation
- Portfolio Dashboard (ANALYST role) — cross-project, pattern library, trend
- `GET /api/reports/{report_type}?job_id={id}` — role-enforced
- Dashboard header adapts download buttons to logged-in user's role
The repository ships a 50-mapping FirstBank test project for development and testing:
| Tier | Count | Characteristics |
|---|---|---|
| Simple | 15 | Single source, SQ → EXP → Target, dimension/reference loads |
| Medium | 20 | Multi-source, lookups, aggregations, SCD2, fact loads |
| Complex | 15 | 3+ sources, joiners, routers, multiple targets, regulatory/risk |
Located at `sample_data/firstbank/` with a ready-to-use project config at `firstbank_migration.project.yaml`.
Expected pattern groups from this project (validation target):
- Simple dimension load (7 mappings, spine: SQ → EXP → TARGET)
- Reference table load (4 mappings, spine: SQ → EXP → TARGET)
- Staging extract (3 mappings, spine: SQ → FIL → TARGET)
- SCD2 dimension (3 mappings, spine: SQ → LKP → EXP → RTR → UPD)
- Fact with single lookup (6-7 mappings, spine: SQ → LKP → EXP)
- Aggregation (3 mappings, spine: SQ → AGG or SQ → JNR → AGG)
- Complex risk/regulatory (4+ mappings, similar multi-source spine)
- Unique/individual: 5-8 mappings too specialized to template