MCYJ Data Pipeline

A data pipeline that ingests Michigan child welfare licensing documents from the state API, uses LLMs to generate summaries and classify violations, and publishes an interactive dashboard.

New here? Start with ONBOARDING.md for setup instructions, data flow, and context.

Repository structure

ingestion/          Pull documents from Michigan API, extract PDF text
  data/             facility_information.csv, downloaded_files_database.csv,
                    document_info.csv, parquet_files/
  scripts/          Individual pipeline steps
  run.py            Orchestrator — runs all 4 steps + hash check

llm_analysis/       AI-powered document analysis
  data/             sir_summaries.csv, sir_violation_levels.csv,
                    staffing_summaries.csv, violation_curation_keyword_reduction.csv
  scripts/          Individual LLM steps
  theming/          Severity classification criteria
  run.py            Orchestrator — runs all 3 LLM steps

website/            Static dashboard (Vite + React)
  build.sh          Full build (data generation + Vite)
  src/              React components and pages

.github/workflows/  CI/CD automation

Quick start

# Install Python dependencies
uv sync

# Run the ingestion pipeline (limit to 5 docs for testing)
python ingestion/run.py --limit 5

# Run LLM analysis (requires OPENROUTER_KEY)
export OPENROUTER_KEY="your-key"
python llm_analysis/run.py --max-count 10

# Build the website
cd website
npm install
./build.sh

GitHub Actions

Workflow	Trigger	What it does
Run Download Pipeline	Manual	Runs ingestion, commits outputs
Run LLM	Manual	Runs LLM analysis, commits outputs
Deploy to GitHub Pages	Push to `main`	Builds and deploys website
Check Unique SHA256 Hashes	PRs with parquet changes	Validates no duplicate hashes
Sync Production	Manual	Pushes `main` to `production`

Details

ingestion/README.md — Ingestion pipeline steps, data schemas, and CLI flags
llm_analysis/README.md — LLM analysis scripts, theming criteria, and output formats
website/README.md — Dashboard development, build process, and deployment

Name		Name	Last commit message	Last commit date
Latest commit History 64 Commits
.github/workflows		.github/workflows
ingestion		ingestion
llm_analysis		llm_analysis
public		public
tests		tests
website		website
.gitignore		.gitignore
ONBOARDING.md		ONBOARDING.md
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MCYJ Data Pipeline

Repository structure

Quick start

GitHub Actions

Details

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MCYJ Data Pipeline

Repository structure

Quick start

GitHub Actions

Details

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages