Transform any website into clean, provable Markdown with full OCR accuracy
Render any URL with deterministic Chrome-for-Testing, tile screenshots into OCR-friendly slices, and stream structured Markdown + provenance back to AI agents, web apps, and automation pipelines.
Two ways to use it:
- 🌐 Browser UI (`/browser`) - Interactive web browsing with navigation history, address bar, and dual markdown views
- ⚙️ CLI + API - Programmatic capture for automation, batch processing, and agent workflows
The Challenge: Finviz.com is protected by Cloudflare bot detection that blocks traditional web scrapers. Our system bypasses this with comprehensive stealth techniques.
Rendered Markdown - Beautiful GitHub-styled formatting:
Raw Markdown - Syntax-highlighted source with full provenance:
<!-- source: tile_0000, y=0, height=1288, sha256=557720698e6ee5e6474e69abc8305307d9e080198ab89cdccb0f7cfbe5e176dc, scale=0.50, viewport_y=0, overlap_px=120, path=artifact/tiles/tile_0000.png, highlight=/jobs/690fec5fca24499c901305d38bc85b6f/artifact/highlight?tile=artifact%2Ftiles%2Ftile_0000.png&y0=0&y1=1288 -->
This screenshot from Finviz shows a financial visualization dashboard with various stock market indices and tickers, as well as a color-coded sector heatmap. Here's a breakdown of the main sections:
1. **Indices and Charts:**
- **DOW:** Nov 7, +74.80 (0.16%), 46987.1
- **NASDAQ:** Nov 7, -49.46 (0.21%), 23004.5
- **S&P 500:** Nov 7, +8.48 (0.13%), 6728.80
2. **Advancing vs Declining Stocks:**
- Advancing: 56.0% (3116)
- Declining: 40.5% (2254)
- New High: 19.5% (110)
- New Low: 19.5% (110)
3. **Top Gainers and Top Losers:**
- Top Gainers:
- MSGM: 70.78%
- BKYI: 51.57%
- GIFJ: 49.68%
- ORGO: 44.73%
- Top Losers:
- DTCK: -77.93%
- ENGS: -54.55%
- ELDN: -49.76%
- MEHA: -46.93%
4. **Sector Heatmap:**
- Shows color-coded sectors such as Technology, Consumer Cyclical, Communication Services, Industrials, etc.
5. **Headlines:**
- 05:15PM What private data says about America's job engine
- 03:39PM The 'buy everything' rally now feels like an uphill battle, putting bull market to the test
- Nov-07 Stock Market News, Nov. 7, 2025: Nasdaq Has Its Worst Week Since April
<!-- source: tile_0001, y=1440, height=1288, sha256=9a04a7f422964951f8b411e11790ca476389c777614d5085e05008b750eb90bf, scale=0.50, viewport_y=0, overlap_px=120, path=artifact/tiles/tile_0001.png, highlight=/jobs/690fec5fca24499c901305d38bc85b6f/artifact/highlight?tile=artifact%2Ftiles%2Ftile_0001.png&y0=0&y1=1288 -->
### Insider Trading:
- OSIS: Morben Paul Keith (PRES., OPTOELEC) sold 416 shares at $279.10, valued at $116,106.
- OSIS: HAWKINS JAMES B (Director) sold 1,500 shares at $283.15, valued at $424,725.
- EL: Leonard A. Lauder 20 (10% Owner) sold 2,786,040 shares at $89.70, valued at $249,907,788.
### Futures Prices:
- Crude Oil: Last 59.78, Change +0.03 (+0.05%)
- Natural Gas: Last 4.4530, Change -0.1380 (+3.20%)
- Gold: Last 4019.50, Change +9.70 (+0.24%)
- Dow: Last 47250.00, Change +165.00 (+0.35%)
- S&P 500: Last 6786.50, Change +32.75 (+0.48%)
- Nasdaq 100: Last 25345.50, Change +179.25 (+0.71%)
<!-- ... 2 more tiles with earnings releases, forex data, and full market overview ... -->
- ✅ Bypasses Cloudflare bot detection - Chrome's `--headless=new` mode (undetectable) + 60+ lines of stealth JavaScript
- ✅ Comprehensive fingerprint masking - `navigator.webdriver`, plugins, permissions API, hardware specs
- ✅ Captures 4 tiles with overlapping regions for seamless stitching
- ✅ 95%+ OCR accuracy - Extracts all stock tickers, prices, and percentages
- ✅ Full provenance - Every section links back to exact pixel coordinates
- ✅ Works on protected sites - finviz.com (Cloudflare), financial dashboards, SPAs
NEW: A complete browser-like interface for viewing web pages as clean, readable markdown in real-time.
Once installed, navigate to:
http://localhost:8000/browser
- 🔍 Smart Address Bar: Enter any URL or search term (auto-detects and searches Google)
- ⬅️➡️ Navigation History: Back/forward buttons with full browsing history
- 🔄 Refresh: Force reload to bypass cache
- 👁️ Dual View Modes:
- Rendered: Beautiful GitHub-styled markdown with proper formatting
- Raw: Syntax-highlighted markdown source (Prism.js)
- ⚡ Smart Caching: Pages cached for 1 hour for instant repeat visits
- 📊 Real-time Progress: Live tile processing updates with progress bars
- ⌨️ Keyboard Shortcuts:
  - `Alt+Left/Right` - Navigate back/forward
  - `Ctrl+R` - Refresh page
  - `Ctrl+U` - Toggle rendered/raw view
  - `Ctrl+L` - Focus address bar
- Enter a URL (e.g., `https://news.ycombinator.com`) or search term
- Backend captures page as tiled screenshots via headless Chrome
- OCR extracts text from each tile with 95%+ accuracy
- Markdown is stitched together with provenance tracking
- View rendered markdown OR syntax-highlighted raw source
- Navigate back/forward like a real browser
Perfect for:
- Reading web content without distractions
- AI agents browsing websites without vision models
- Archiving web pages as clean markdown
- Research with full history and navigation
📖 Full documentation: See docs/BROWSER_UI.md for detailed features, keyboard shortcuts, and troubleshooting.
- Deterministic output: Same input = same markdown every time
- Verifiable provenance: Every sentence links back to exact pixel coordinates
- Rich metadata: Links, headings, tables extracted from both DOM and visuals
- OCR + DOM fusion: Catches content missed by traditional scrapers
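Because the provenance comments use a fixed `key=value` layout (see the examples earlier in this README), downstream tooling can recover tile coordinates with a few lines of parsing. A minimal sketch, assuming only the comment shape shown above:

```python
import re

# Matches the provenance comments emitted into out.md, e.g.:
# <!-- source: tile_0000, y=0, height=1288, sha256=..., scale=0.50, ... -->
PROVENANCE_RE = re.compile(r"<!--\s*source:\s*(?P<tile>tile_\d+),\s*(?P<fields>.*?)\s*-->")

def parse_provenance(markdown: str) -> list[dict[str, str]]:
    """Return one {key: value} dict per provenance comment, plus the tile id."""
    records = []
    for match in PROVENANCE_RE.finditer(markdown):
        fields = dict(
            part.split("=", 1)                  # split each "key=value" pair once
            for part in match.group("fields").split(", ")
            if "=" in part
        )
        fields["tile"] = match.group("tile")
        records.append(fields)
    return records
```

Feed it an entire `out.md` and every section's `y`/`height`/`sha256` metadata comes back as strings, ready to drive highlight links or verification.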
| Method | Visual Accuracy | Provenance | Deterministic | Complex Layouts |
|---|---|---|---|---|
| Markdown Web Browser | ✅ 95%+ | ✅ Pixel-level | ✅ Chrome-for-Testing | ✅ OCR + DOM |
| Puppeteer + Readability | ❌ 60% | ❌ None | ❌ DOM-only | |
| BeautifulSoup | ❌ 40% | ❌ None | ✅ Yes | ❌ No visuals |
| Selenium screenshots | ✅ 90% | ❌ None | ❌ Driver variance | |
AI Research & Analysis
- Process 10,000+ financial reports/day with 95% accuracy
- Extract data from PDFs, SPAs, and interactive dashboards
- Archive regulatory filings with full audit trails
Content Intelligence
- Monitor competitor websites with pixel-perfect change detection
- Extract structured data from news sites, forums, and social platforms
- Generate documentation from live web applications
Compliance & Legal
- Create admissible evidence with cryptographic provenance
- Archive website states for regulatory submissions
- Track website changes with timestamped, verifiable records
Get started in under 2 minutes with our automated installer:
```bash
curl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/markdown_web_browser/main/install.sh | bash -s -- --yes
```

What this installer does for you:
- Checks your system - Detects your OS (Ubuntu/Debian, macOS, RHEL, Arch)
- Installs uv package manager - The modern Python package manager from Astral
- Installs system dependencies - Automatically installs libvips (image processing library)
- Clones the repository - Downloads the latest Markdown Web Browser code
- Sets up Python 3.13 environment - Creates isolated virtual environment with all dependencies
- Installs Playwright browsers - Downloads Chrome for Testing with bot detection evasion built-in
- Configures environment - Sets up `.env` file with default settings
- Runs verification tests - Ensures everything is working correctly
- Creates launcher script - Provides a convenient `mdwb` command for CLI usage
For interactive installation or custom options:

```bash
# Interactive mode (prompts for each step)
curl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/markdown_web_browser/main/install.sh | bash

# Custom directory with OCR API key
curl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/markdown_web_browser/main/install.sh | bash -s -- \
  --dir=/opt/mdwb --ocr-key=sk-YOUR-API-KEY

# See all options
curl -fsSL https://raw.githubusercontent.com/Dicklesworthstone/markdown_web_browser/main/install.sh | bash -s -- --help
```

- Screenshot-first: Captures exactly what users see—no PDF/print CSS surprises.
- Deterministic + auditable: Every run emits tiles, `out.md`, `links.json`, and `manifest.json` (with CfT label/build, Playwright version, screenshot style hash, warnings, and timings).
- Agent-friendly extras: DOM-derived `links.json`, sqlite-vec embeddings, SSE/NDJSON feeds, and CLI helpers so builders can consume Markdown immediately.
- Ops-ready: Python 3.13 + FastAPI + Playwright with uv packaging, structured settings via `python-decouple`, telemetry hooks, and smoke/latency automation.
- User Interface Layer:
  - Browser UI (`/browser`) - Interactive browsing with history, view toggling, and real-time progress
  - Job Dashboard (`/`) - HTMX-based monitoring with SSE live updates
  - CLI + API - Programmatic access for automation and agents
- FastAPI `/jobs` endpoint enqueues a capture via the `JobManager`.
- Playwright (Chromium CfT, viewport 1280×2000, DPR 2, reduced motion) performs a deterministic viewport sweep.
- `pyvips` slices sweeps into ≤1288 px tiles with ≈120 px overlap; each tile carries offsets, DPR, hashes.
- The OCR client submits tiles (HTTP/2) to hosted or local olmOCR, with retries + concurrency autotune.
- Stitcher merges Markdown, aligns headings with the DOM outline, trims overlaps via SSIM + fuzzy text comparisons, injects provenance comments (with tile metadata + highlight links), and builds the Links Appendix.
- `Store` writes artifacts under a content-addressed path and updates sqlite + sqlite-vec metadata for embeddings search.
- `/jobs/{id}`, `/jobs/{id}/stream`, `/jobs/{id}/events`, `/jobs/{id}/links.json`, etc., feed all UI layers (Browser UI, Dashboard, CLI) with consistent data.
- The job dashboard relies on the HTMX SSE extension for real-time updates (state, manifest, warning pills), while the Browser UI uses JavaScript polling for simplicity.
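The slicing step boils down to offset arithmetic over the page height. A simplified sketch for intuition only: the tile height and overlap mirror the defaults quoted above, but the real pyvips tiler operates on viewport sweeps and may place tile edges differently.

```python
def tile_offsets(page_height: int, tile_height: int = 1288, overlap: int = 120) -> list[int]:
    """Y-offsets for slicing a tall capture into overlapping tiles.

    Each tile starts `tile_height - overlap` below the previous one, and the
    final tile is pinned flush with the bottom so no pixels are dropped.
    """
    if page_height <= tile_height:
        return [0]
    step = tile_height - overlap
    offsets = list(range(0, page_height - tile_height, step))
    offsets.append(page_height - tile_height)  # last tile ends exactly at the bottom
    return offsets
```

The overlap region is what lets the stitcher run SSIM + fuzzy text comparisons across the seam before trimming duplicates.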
See PLAN_TO_IMPLEMENT_MARKDOWN_WEB_BROWSER_PROJECT.md §§2–5, 19 for the full breakdown.
1. Start the server: `mdwb demo stream`
2. Open your browser: `http://localhost:8000/browser`
3. Enter a URL or search term:
   - Try: `https://example.com` or just search `markdown tutorial`
   - Watch the progress bar as tiles are processed
   - Toggle between rendered and raw markdown views
✅ Success indicators:
- Clean markdown appears in ~30-60 seconds
- Navigation buttons work (back/forward)
- Toggle switches between rendered/raw views
Step 1: Verify Setup
```bash
# Test the installation
mdwb demo stream
```

✅ Success indicators:
- Fake job runs with progress bars
- No import or dependency errors
- Server responds on localhost:8000
Tip: Any `mdwb` CLI command that supports `--json` also accepts `--format toon` for TOON output (falls back to JSON if `tru` is unavailable).
Step 2: Capture a Real Page
```bash
# Start with a simple page
mdwb fetch https://example.com --watch
```

✅ What you should see:
🔄 Job abc123 submitted successfully
📸 Screenshots: ████████████ 100% (2/2 tiles)
🔤 OCR Processing: ████████████ 100% (completed in 12.4s)
🧵 Stitching: ████████████ 100% (completed in 0.3s)
✅ Job completed successfully in 15.2s
📄 Markdown saved to: /cache/example.com/abc123/out.md
🔗 Links extracted: /cache/example.com/abc123/links.json
📊 Full manifest: /cache/example.com/abc123/manifest.json
Step 3: Validate Output Quality
```bash
# Check the generated markdown
cat /cache/example.com/abc123/out.md

# Verify provenance comments are included
grep "source: tile_" /cache/example.com/abc123/out.md

# Check extracted links
cat /cache/example.com/abc123/links.json | jq '.anchors | length'
```

✅ Quality indicators:
- Markdown contains readable text (not OCR gibberish)
- Provenance comments show `<!-- source: tile_X -->`
- `links.json` contains discovered anchors
- No "ERROR" or "FAILED" in `manifest.json`
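If you run these checks often, they are easy to script. A minimal sketch that mirrors the indicators above — it only greps for the documented markers; a real validator would parse `manifest.json` properly:

```python
from pathlib import Path

def quick_quality_check(job_dir: str) -> list[str]:
    """Return a list of problems found in a finished job directory
    (empty list = passes the quality indicators)."""
    root = Path(job_dir)
    problems = []
    md = (root / "out.md").read_text()
    if "<!-- source: tile_" not in md:
        problems.append("no provenance comments in out.md")
    manifest = (root / "manifest.json").read_text()
    if "ERROR" in manifest or "FAILED" in manifest:
        problems.append("manifest.json reports errors")
    return problems
```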
Step 4: Try a Complex Page with Bot Detection
```bash
# Test with a real-world site that has Cloudflare protection
mdwb fetch https://finviz.com --watch
```

- Install prerequisites
  - Python 3.13, uv ≥0.8, and the system deps Playwright requires.
  - Install system dependencies: `sudo apt-get install libvips-dev` (Ubuntu/Debian) or `brew install vips` (macOS).
  - Install the CfT build Playwright expects: `playwright install chromium --with-deps --channel=cft`.
  - Create/sync the env: `uv venv --python 3.13 && uv sync`.
  - Optional (GPU/olmOCR power users): run `scripts/setup_olmocr_cuda12.sh` to provision CUDA 12.6 + the local vLLM toolchain described in `docs/olmocr_cli_tool_documentation.md`.
- Configure environment
  - Copy `.env.example` → `.env`.
  - Fill in OCR creds, `API_BASE_URL`, CfT label/build, screenshot style hash overrides, webhook secret, etc.
  - Settings are loaded exclusively via `python-decouple` (`app/settings.py`), so keep `.env` private.
- Run the API/UI
  - `scripts/dev_run.sh` (defaults to uvicorn with reload). Open `http://localhost:8000` for the HTMX/Alpine interface.
  - For production-style smoke, flip to Granian: `SERVER_IMPL=granian UVICORN_RELOAD=false HOST=0.0.0.0 PORT=8000 scripts/dev_run.sh --workers 4 --granian-runtime-threads 2`. This wraps `scripts/run_server.py`, so the same flags work in CI or systemd units.
- Trigger a capture
  - UI Run button posts `/jobs`.
  - CLI example: `uv run python scripts/mdwb_cli.py fetch https://example.com --watch`
- The UI profile dropdown and CLI `--profile <id>` flag reuse login/storage state under `CACHE_ROOT/profiles/<id>/storage_state.json`. Pick distinct IDs for red/blue teams or authenticated personas.
- Profiles are recorded in `manifest.profile_id`, surfaced via `/jobs/{id}`/SSE/CLI diagnostics, and stored in `runs.db` so ops can audit which captures used which credentials.
- Storage directories are slugged automatically (`[A-Za-z0-9._-]`), so feel free to pass human-friendly names (e.g., `agent.alpha`).
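The slugging rule can be pictured as a one-line substitution over that character class. Illustrative only — the shipping implementation may choose a different replacement character:

```python
import re

def slug_profile_id(profile_id: str) -> str:
    """Collapse anything outside [A-Za-z0-9._-] to a dash so arbitrary
    profile names map to safe storage directory names.
    (Replacement character "-" is an assumption for this sketch.)"""
    return re.sub(r"[^A-Za-z0-9._-]", "-", profile_id)
```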
- Links now stream into domain-grouped sections so it is easy to scan anchors/forms per host (relative URLs and fragments fall into `(relative)`/`(fragment)` buckets).
- Coverage badges highlight whether a link came from the DOM, OCR, or both, and raise warnings for text mismatches; attribute badges summarize `target`/`rel` metadata, which is useful when triaging overlays or sandbox issues.
- Each row exposes inline actions:
  - Open in new job populates the toolbar URL field and immediately triggers a capture run.
  - Copy Markdown copies the Markdown anchor (or best-effort fallback) to the clipboard.
  - Mark crawled toggles a local badge + dimmed state so agents can keep track of which URLs they have already followed; the selection persists in `localStorage`.
- The OCR client now starts at `OCR_MIN_CONCURRENCY` and automatically scales up toward `OCR_MAX_CONCURRENCY` when latency is healthy, or throttles when responses turn slow/errored. The live Events tab and Manifest view both stream these adjustments so you can see when the controller steps in.
- Manifests (`ocr_autotune`) and CLI commands (`mdwb diag`, `mdwb jobs ocr-metrics`) include the initial/peak/final limits plus a short history of adjustments. Use `MDWB_SERVER_IMPL=granian` + higher worker counts when you want the autotune headroom to matter.
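The throttle/scale behaviour described above is essentially an additive-increase/multiplicative-decrease loop. A toy sketch for intuition — the real controller's thresholds, step sizes, and health signal are internal details and may differ:

```python
class ConcurrencyAutotune:
    """AIMD-style controller: creep the concurrency limit upward while
    responses stay healthy, halve it on slow or errored responses."""

    def __init__(self, min_limit: int = 2, max_limit: int = 16) -> None:
        self.min_limit, self.max_limit = min_limit, max_limit
        self.limit = min_limit  # start low, like OCR_MIN_CONCURRENCY

    def observe(self, latency_s: float, errored: bool, healthy_under_s: float = 5.0) -> int:
        if errored or latency_s > healthy_under_s:
            # multiplicative decrease on trouble, floored at the minimum
            self.limit = max(self.min_limit, self.limit // 2)
        else:
            # additive increase while latency is healthy, capped at the maximum
            self.limit = min(self.max_limit, self.limit + 1)
        return self.limit
```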
- `POST /jobs` now deduplicates captures using a content-address (url + CfT + viewport + DSF + OCR model + profile). By default the CLI enables this, so identical requests return immediately with `cache_hit=true` and reuse existing artifacts.
- Disable reuse with `mdwb fetch --no-cache` (or `reuse_cache=false` in the API payload) when you need a fresh capture even if nothing changed.
- Manifests, `/jobs/{id}` snapshots, SSE logs, and `mdwb diag` all expose `cache_hit` so downstream tooling can tell whether a job ran or reused cached output.
- `fetch <url> [--watch]` — enqueue + optionally stream Markdown as tiles finish (percent/ETA shown unless `--no-progress`; add `--reuse-session` to keep one HTTP/2 client alive across submit + stream).
- `fetch <url> --no-cache` — force a fresh capture even if an identical cache entry exists.
- `fetch <url> --resume [--resume-root path]` — skip URLs already recorded in `done_flags/` (optionally `work_index_list.csv.zst`) under the chosen root; the CLI auto-enables `--watch` so completed jobs write their flag/index entries. Override locations via `--resume-index`/`--resume-done-dir`.
- `fetch <url> --webhook-url https://... [--webhook-event DONE --webhook-event FAILED]` — register callbacks right after the job is created.
- `show <job-id> [--ocr-metrics]` — dump the latest job snapshot, optionally with OCR batch/quota telemetry.
- `stream <job-id>` — follow the SSE feed.
- `watch <job-id>` / `events <job-id> --follow --since <ISO>` — tail the `/jobs/{id}/events` NDJSON log (use `--on EVENT=COMMAND` for hooks; add `--no-progress` to suppress the percent/ETA overlay, `--reuse-session` to keep a single HTTP client). DOM-assist events now print counts/reasons so you immediately see when hybrid recovery patched a tile.
- `diag <job-id>` — print CfT/Playwright metadata, capture/OCR timings, warnings, and blocklist hits for incident triage.
- `jobs replay manifest <manifest.json>` — resubmit a stored manifest via `/replay` with validation/JSON output support.
- `jobs embeddings search <job-id> --vector-file vector.json [--top-k 5]` — search sqlite-vec section embeddings for a run (supports inline `--vector` strings and `--json` output).
- `jobs agents bead-summary <plan.md>` — convert a markdown checklist into bead-ready summaries (mirrors the intra-agent tracker described in PLAN §21).
- `warnings --count 50` — tail `ops/warnings.jsonl` for capture/blocklist incidents.
- `dom links --job-id <id>` — render the stored `links.json` (anchors/forms/headings/meta).
- `jobs ocr-metrics <job-id> [--json]` — summarize OCR batch latency, request IDs, and quota usage from the manifest.
- `resume status --root path [--limit 10 --pending --json]` — inspect the resume state; `--pending` shows outstanding URLs, `--json` emits `completed_entries` + `pending_entries` for automation.
- `demo snapshot|stream|events` — exercise the demo endpoints without hitting a live pipeline.
The CLI reads `API_BASE_URL` + `MDWB_API_KEY` from `.env`; override with `--api-base` when targeting staging. For CUDA/vLLM workflows, see `docs/olmocr_cli_tool_documentation.md` and `docs/olmocr_cli_integration.md` for detailed setup + merge notes.
- `uv run python -m scripts.agents.summarize_article summarize --url https://example.com [--out summary.txt]` — submit (or reuse via `--job-id`) and print/save a short summary (defaults to `--reuse-session`).
- `uv run python -m scripts.agents.generate_todos todos --job-id <id> [--json] [--out todos.json]` — extract TODO-style bullets (JSON when `--json`, newline text otherwise); accepts `--url` to run a fresh capture and also defaults to `--reuse-session`.
Both helpers reuse the CLI’s auth + HTTP plumbing, accept the same --api-base/--http2 flags, fall back to existing jobs when you only need post-processing, and now support --out so automations can ingest the results directly.
- Chrome for Testing pin: Set `CFT_VERSION` + `CFT_LABEL` in `.env` so manifests and ops dashboards stay consistent. Re-run `playwright install` whenever the label/build changes.
- Transport + viewport: Defaults (`PLAYWRIGHT_TRANSPORT=cdp`, viewport 1280×2000, DPR 2) live in `app/settings.py` and must align with PLAN §§3, 19.
- OCR credentials: `OLMOCR_SERVER`, `OLMOCR_API_KEY`, and `OLMOCR_MODEL` are required unless you point at `OCR_LOCAL_URL`.
- Warning log + blocklist: Keep `WARNING_LOG_PATH` and `BLOCKLIST_PATH` writable so scroll/overlay incidents are persisted (`docs/config.md` documents every field).
- System packages: Install libvips 8.15+ so the pyvips-based tiler works (`sudo apt-get install libvips` on Debian/Ubuntu, `brew install vips` on macOS). `scripts/run_checks.sh` checks for `pyvips` and fails fast with install instructions unless you explicitly set `SKIP_LIBVIPS_CHECK=1` (for targeted CLI/unit runs on machines without libvips).
Run these before pushing or shipping capture-facing changes:
```bash
uv run ruff check --fix --unsafe-fixes
uvx ty check
npx playwright test --config=playwright.config.mjs  # or PLAYWRIGHT_BIN=/path/to/playwright-test …
```

`./scripts/run_checks.sh` wraps the same sequence for CI. Set `PLAYWRIGHT_BIN=/path/to/playwright-test` if you need to invoke the Node-based runner; otherwise the script prefers `npx playwright test --config=playwright.config.mjs` (which inherits the defaults from PLAN/AGENTS: viewport 1280×2000, DPR 2, reduced motion, light scheme, mask selectors, CDP/BiDi transport via `PLAYWRIGHT_TRANSPORT`). When Node Playwright isn't installed it falls back to `uv run playwright test` and prints a warning if the Python CLI lacks `test`.
When you already know libvips isn't available in a minimal container, export `SKIP_LIBVIPS_CHECK=1` to bypass the preflight warning. Optional toggles inside `scripts/run_checks.sh`:

- `MDWB_CHECK_METRICS=1` (optionally `CHECK_METRICS_TIMEOUT=<seconds>`) appends the Prometheus health check after pytest/Playwright.
- `MDWB_RUN_E2E=1` runs the lightweight placeholder suite in `tests/test_e2e_small.py` so CI can keep a fast E2E sentinel without invoking FlowLogger.
- `MDWB_RUN_E2E_RICH=1` runs the full FlowLogger scenarios in `tests/test_e2e_cli.py`; transcript artifacts are copied to `tmp/rich_e2e_cli/` (override via `RICH_E2E_ARTIFACT_DIR=/path/to/dir`) so operators can review the panels/tables/progress output without hunting through pytest temp dirs.
- `MDWB_RUN_E2E_GENERATED=1` runs the generative guardrail suite (`tests/test_e2e_generated.py`). Point `MDWB_GENERATED_E2E_CASES=/path/to/cases.json` at a bespoke cases file when you need to refresh or extend the Markdown baselines.
Grab the resulting tmp/rich_e2e_cli/*.log|*.html files in CI for postmortems.
- The bundled pytest targets now include the store/manifest persistence suite (`tests/test_store_manifest.py`, `tests/test_manifest_contract.py`), the Prometheus CLI health checks (`tests/test_check_metrics.py`), and the ops regressions for `show_latest_smoke`/`update_smoke_pointers` in addition to the CLI coverage. This keeps RunRecord fields, smoke pointer tooling, and metrics hooks under CI without needing a live API server.
- Playwright defaults to the Chrome for Testing build. Leave `PLAYWRIGHT_CHANNEL` unset (or set it to `cft`) so local smoke runs match the capture pipeline; if you have to fall back to stock Chromium, set `PLAYWRIGHT_CHANNEL=chromium` or use a comma-separated preference such as `PLAYWRIGHT_CHANNEL="chromium,cft"`. Likewise, keep `PLAYWRIGHT_TRANSPORT=cdp` unless you are explicitly exercising WebDriver BiDi—when you do, a value like `PLAYWRIGHT_TRANSPORT="bidi,cdp"` makes the preferred/fallback order obvious to anyone reading CI metadata.
- Every `run_checks` invocation now emits `tmp/pytest_report.xml` and `tmp/pytest_summary.json` (override with `PYTEST_JUNIT_PATH`/`PYTEST_SUMMARY_PATH`). The JSON digest lists totals and the first few failing test names, so CI/Agent Mail can quote failures without re-running pytest.
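Automation that wants to quote failures can read that digest directly. A sketch with assumed key names (`total`, `failed`, `failing_tests` are illustrative guesses); inspect an actual `tmp/pytest_summary.json` for the real schema:

```python
import json

def summarize_pytest_digest(raw: str) -> str:
    """One-line status from the JSON digest.
    NOTE: the key names used here are assumptions for illustration."""
    data = json.loads(raw)
    failed = data.get("failed", 0)
    if not failed:
        return f"OK: {data.get('total', '?')} tests passed"
    names = ", ".join(data.get("failing_tests", [])[:3])
    return f"FAIL: {failed}/{data.get('total', '?')} (first failures: {names})"
```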
Also run `uv run python scripts/check_env.py` whenever `.env` changes—CI and nightly smokes depend on it to confirm CfT pins + OCR secrets.
Additional expectations (per PLAN §§14, 19.10, 22):
- Keep nightly smokes green via `uv run python scripts/run_smoke.py --date $(date -u +%Y-%m-%d)`.
- Refresh `benchmarks/production/weekly_summary.json` (generated automatically by the smoke script) for Monday ops reports.
- Run `uv run python scripts/check_metrics.py --check-weekly` (with the default `benchmarks/production/weekly_summary.json`) before handoff so we fail fast when capture/OCR SLO p99 values exceed their 2×p95 budgets.
- Tail `ops/warnings.jsonl` or `mdwb warnings` for canvas/video/overlay spikes.
- Reserve + communicate: Before editing, reserve files and announce the pickup via Agent Mail (cite the bead id). Keep PLAN sections annotated with `_Status — <agent>` entries so the written record matches reality.
- Track via beads: Use `bd list`/`show` to pick the next unblocked issue, add comments for status updates, and close with findings/tests noted.
- Run the required checks: `ruff`, `ty`, Playwright smoke, `scripts/check_env.py`, plus any bead-specific tests (e.g., sqlite-vec search or CLI watch). Never skip the capture smoke after touching Playwright/OCR code.
- Sync docs: README, PLAN, `docs/config.md`, and `docs/ops.md` must stay consistent; update them alongside code changes so ops can trust the written guidance.
- Ops handoff: For capture/OCR fixes, capture job ids + manifest paths in your bead comment and Mail thread so others can reproduce issues quickly.
- `scripts/run_smoke.py` — nightly URL set capture + manifest/latency aggregation.
- `scripts/show_latest_smoke.py` — quick pointers to the latest smoke outputs; manifest rows now include overlap ratios, validation failure counts, and seam marker/hash counts so regressions stand out. The `--weekly` view prints seam marker percentiles plus capture/OCR SLO status (p99 vs 2×p95) using the data generated by the nightly smoke script, and `--slo` renders the aggregated `latest_slo_summary.json` table (counts, p50/p95 capture/OCR/total, budget breaches). It now fails fast when `latest.txt` exists but is empty, so rerun `scripts/update_smoke_pointers.py <run-dir>` whenever the pointer guard triggers.
- `scripts/olmocr_cli.py` + `docs/olmocr_cli.md` — hosted olmOCR orchestration/diagnostics.
- `scripts/analyze_stitch.py` — lightweight research helper that reads a manifest index and reports seam counts/hash diversity plus hyphen-break dom-assist incidents per run (optionally `--json`). Handy for bd-0jc overlap/SSIM and hyphen guard experiments.
- Weekly seam telemetry — run `uv run python scripts/show_latest_smoke.py --weekly --json` (or parse `benchmarks/production/weekly_summary.json`) to pull `seam_markers.count`/`hashes`/`events` p50/p95 for every category. Feed those numbers straight into Grafana/Prometheus so seam regressions (or fallback spikes) page operators alongside capture/OCR SLO breaches, and archive `weekly_slo.json`/`.prom` for the rolling capture/OCR SLO window.
- `mdwb jobs replay manifest <manifest.json>` — re-run a job with a stored manifest via `POST /replay` (accepts `--api-base`, `--http2`, `--json`); keep `scripts/replay_job.sh` around for legacy automation until everything points at the CLI.
- `mdwb jobs show <job-id>` — inspect the latest snapshot plus sweep stats/validation issues in one table. When manifests are missing (cached jobs, trimmed SSE payloads), the CLI still prints stored seam counts (`Seam markers: X (unique hashes: Y)`) so you can spot duplicate sweeps without spelunking manifests. `mdwb diag --ocr-metrics` shows the detailed seam marker table when manifests are available.
- `scripts/update_smoke_pointers.py <run-dir> [--root path]` — refresh `latest_summary.md`, `latest_manifest_index.json`, and `latest_metrics.json` after ad-hoc smoke runs so dashboards point at the right data (defaults to `MDWB_SMOKE_ROOT` unless `--root` is provided; add `--weekly-source` when overriding the rolling summary). The command now computes `latest_slo_summary.json` by default using the manifest index + PLAN §22 budget file; pass `--no-compute-slo` to skip or `--budget-file` to point at an alternate budget definition.
- `scripts/check_metrics.py` — ping `/metrics` plus the exporter; supports `--api-base`, `--exporter-url`, `--json`, and now `--check-weekly` (validates `benchmarks/production/weekly_summary.json` so release builds fail fast if the rolling SLOs are blown). When you pass `--check-weekly --json`, the CLI always emits a `weekly` block (`status`, `summary_path`, `failures`) even if the summary file is missing/unreadable, which makes automation logs self-explanatory. `scripts/prom_scrape_check.py` remains as a compatibility wrapper but simply re-exports the same Typer CLI.
- `scripts/compute_slo.py` — consumes the latest `latest_manifest_index.json` (or any manifest index) plus the benchmark budget file to produce capture/OCR SLO summaries. The CLI writes a JSON report (`--out benchmarks/production/latest_slo_summary.json`) and optionally emits Prometheus textfile metrics via `--prom-output tmp/mdwb_slo.prom`, enabling dashboards/alerts to track per-category p95 totals, breach ratios, and overall SLO status. `scripts/run_smoke.py` invokes this automatically after each smoke run so `benchmarks/production/latest_slo_summary.json` and `latest_slo.prom` stay fresh; rerun manually when you need ad-hoc SLO snapshots.
- Prometheus metrics now cover capture/OCR/stitch durations, warning/blocklist counts, job completions, and SSE heartbeats via `prometheus-fastapi-instrumentator`. Scrape `/metrics` on the API port or hit the background exporter on `PROMETHEUS_PORT` (default 9000); `docs/ops.md` lists the metric names + alert hooks.
- Set `MDWB_CHECK_METRICS=1` (optionally `CHECK_METRICS_TIMEOUT=<seconds>`) when running `scripts/run_checks.sh` to include the Prometheus smoke (`scripts/check_metrics.py`) alongside the usual lint/type/pytest/Playwright stack.
```bash
# Validate env
uv run python scripts/check_env.py

# Run CLI demo job
uv run python scripts/mdwb_cli.py demo stream

# Replay an existing manifest
uv run python scripts/mdwb_cli.py jobs replay manifest cache/example.com/.../manifest.json

# Search embeddings for a run (vector as JSON array)
uv run python scripts/mdwb_cli.py jobs embeddings search JOB_ID --vector "[0.12, 0.04, ...]" --top-k 3

# Tail warning log via CLI
uv run python scripts/mdwb_cli.py warnings --count 25

# Download a job's tar bundle (tiles + markdown + manifest)
uv run python scripts/mdwb_cli.py jobs bundle <job-id> --out path/to/bundle.tar.zst

# Run nightly smoke for docs/articles only (dry run)
uv run python scripts/run_smoke.py --date $(date -u +%Y-%m-%d) --category docs_articles --dry-run
```

- `artifact/tiles/tile_*.png` — viewport-sweep tiles (≤1288 px long side) with overlap + SHA metadata.
- `/jobs/{id}/artifact/highlight?tile=…&y0=…&y1=…` — quick HTML viewer that overlays the region referenced by each provenance comment (handy for code reviews and incident reports).
- `out.md` — final Markdown with DOM-guided heading normalization plus provenance comments (`<!-- source: tile_i ... , path=…, highlight=/jobs/... -->`) and Links Appendix.
- `links.json` — anchors/forms/headings/meta harvested from the DOM snapshot.
- `manifest.json` — CfT label/build, Playwright version, screenshot style hash, warnings, sweep stats, timings, and the new `seam_marker_events` list whenever seam hints were required to align tiles.
- `dom_snapshot.html` — raw DOM capture used for link diffs and hybrid recovery (when enabled).
- `bundle.tar.zst` — optional tarball for incidents/export (`Store.build_bundle`).
- Markdown output now includes seam markers (`<!-- seam-marker … -->`) and enriched provenance comments (`viewport_y`, `overlap_px`, highlight links) plus detailed `<!-- table-header-trimmed reason=… -->` breadcrumbs so reviewers can jump straight to stitched regions.
Use `mdwb jobs bundle …` or `mdwb jobs artifacts manifest …` (or `/jobs/{id}/artifact/...`) to reproduce a job locally and fetch its artifacts for debugging.
- Beads (`bd ...`) track every feature/bug (map bead IDs to Plan sections in Agent Mail threads).
- Agent Mail (MCP) is the coordination channel—reserve files before editing, summarize work in the relevant bead thread, and note Plan updates inline (see §§10–11 for example status notes).
- `AGENTS.md` — ground rules (no destructive git cmds, uv usage, capture policies).
- `PLAN_TO_IMPLEMENT_MARKDOWN_WEB_BROWSER_PROJECT.md` — canonical spec + incremental upgrades.
- `docs/architecture.md` — best practices + data flow diagrams.
- `docs/blocklist.md`, `docs/config.md`, `docs/models.yaml`, `docs/ops.md`, `docs/olmocr_cli.md` — supporting specs.
- `docs/release_checklist.md` — step-by-step release & regression runbook (CfT/Playwright/model toggles, smoke commands, artifact list).
Q: Why is OCR slow/failing?
```bash
# Check OCR quota and performance
mdwb jobs ocr-metrics job123
mdwb warnings --count 20 | grep -i "ocr\|quota"

# Reduce concurrency if hitting rate limits
export OCR_MAX_CONCURRENCY=5
```

A: Common causes:
- OCR API rate limiting (reduce `OCR_MAX_CONCURRENCY`)
- Network latency to OCR service (consider local olmOCR)
- Complex images requiring more processing time
Q: Poor OCR accuracy on my content?
```bash
# Check image quality in tiles
ls cache/your-site.com/job123/artifact/tiles/

# View highlight links for problematic sections
curl "http://localhost:8000/jobs/job123/artifact/highlight?tile=5&y0=100&y1=200"
```

A: Optimization strategies:
- Increase viewport size for better text rendering
- Use authenticated profiles for login-walled content
- Check for overlay/popup interference in manifest warnings
Q: Missing content compared to browser view?
```bash
# Check DOM snapshot vs final markdown
curl "http://localhost:8000/jobs/job123/artifact/dom_snapshot.html"
mdwb dom links --job-id job123
```

A: Common causes:
- JavaScript-heavy SPAs (content loads after initial render)
- Authentication required (use `--profile` with logged-in state)
- Overlays/popups blocking content (check blocklist configuration)
Slow jobs:
- Check tile count: `mdwb diag job123` - High tile counts increase OCR time
- Review warnings: `mdwb warnings` - Canvas/scroll issues affect performance
- Monitor concurrency: OCR auto-tune may be throttling due to latency
Memory issues:
- Large pages: Set `TILE_MAX_SIZE` lower to reduce memory per tile
- Concurrent jobs: Limit active jobs with `MAX_ACTIVE_JOBS`
- Cache cleanup: Implement retention policy for old artifacts
Network problems:
- OCR connectivity: Test with `curl ${OLMOCR_SERVER}/health`
- Firewall issues: Ensure outbound HTTPS access
- Proxy configuration: Set `HTTP_PROXY`/`HTTPS_PROXY` if needed
| Error Code | Meaning | Solution |
|---|---|---|
| `OCR_QUOTA_EXCEEDED` | Hit API rate limits | Wait or increase quota |
| `SCREENSHOT_TIMEOUT` | Page load too slow | Increase timeout, check URL |
| `TILE_PROCESSING_FAILED` | Image processing error | Check libvips installation |
| `MANIFEST_VALIDATION_FAILED` | Corrupt job state | Restart job, check disk space |
| `DOM_SNAPSHOT_FAILED` | Can't save DOM | Check write permissions |


