feat(annotate): support HTML files and URL annotation #545
Merged
backnotprop merged 32 commits into main on Apr 13, 2026
Block javascript:, data:, and vbscript: URLs in InlineMarkdown link rendering. Links with dangerous protocols render as plain text instead of clickable anchors. Uses a blocklist approach so existing links with custom protocols (obsidian://, vscode://, Windows C:\ paths) continue to work. For provenance purposes, this commit was AI assisted.
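A minimal sketch of the blocklist approach described in this commit. The function name sanitizeLinkUrl is taken from a later commit in this PR; the signature, the null return convention, and the whitespace normalization are assumptions for illustration, not the PR's actual code.

```typescript
// Protocols named in this commit. Anything not listed is allowed through,
// which is what keeps obsidian://, vscode://, and C:\ paths working.
const DANGEROUS_PROTOCOLS = ["javascript:", "data:", "vbscript:"];

/**
 * Hypothetical helper: returns the URL unchanged when it is safe to render
 * as an anchor, or null when the caller should render plain text instead.
 */
function sanitizeLinkUrl(url: string): string | null {
  // Assumption: strip whitespace and control characters and lowercase the
  // value before comparing, so " JavaScript:..." cannot slip past the check.
  const normalized = url.replace(/[\u0000-\u0020]/g, "").toLowerCase();
  for (const protocol of DANGEROUS_PROTOCOLS) {
    if (normalized.startsWith(protocol)) return null;
  }
  return url;
}
```

A renderer would call this once per link and fall back to emitting the link text alone when the result is null.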
- html-to-markdown.ts: Turndown wrapper with GFM table rule, strips script/style/noscript tags
- url-to-markdown.ts: Jina Reader (free, returns markdown) with fetch+Turndown fallback. Warns on Jina failure, auto-skips Jina for local/private URLs (localhost, 192.168.*, 10.*, etc.)
- config.ts: add jina setting and resolveUseJina() with priority chain --no-jina flag > PLANNOTATOR_JINA env > config.json > default true
For provenance purposes, this commit was AI assisted.
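The priority chain for resolveUseJina() can be sketched as follows. Only the ordering (flag, then env, then config, then default true) comes from the commit message; the JinaInputs shape and the set of env values treated as "off" are assumptions made for this example.

```typescript
// Hypothetical input shape; the real function likely reads these itself.
interface JinaInputs {
  noJinaFlag: boolean;              // true when --no-jina was passed
  envValue: string | undefined;     // raw value of PLANNOTATOR_JINA
  configValue: boolean | undefined; // "jina" setting from config.json
}

function resolveUseJina(inputs: JinaInputs): boolean {
  // 1. The CLI flag always wins.
  if (inputs.noJinaFlag) return false;

  // 2. Then the environment variable, if set to a non-empty value.
  //    Assumption: "0", "false", "no", and "off" disable Jina.
  if (inputs.envValue !== undefined && inputs.envValue.trim() !== "") {
    const value = inputs.envValue.trim().toLowerCase();
    return !["0", "false", "no", "off"].includes(value);
  }

  // 3. Then config.json.
  if (inputs.configValue !== undefined) return inputs.configValue;

  // 4. Default: Jina Reader is enabled.
  return true;
}
```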
Extend the annotate subcommand to accept .html/.htm local files (converted via Turndown) and https:// URLs (fetched via Jina Reader with fetch+Turndown fallback). URL content is fetched terminal-side before opening the browser. Add --no-jina global flag to disable Jina Reader per-invocation. Add 10MB file size guard for local HTML files. For provenance purposes, this commit was AI assisted.
- Widen file browser glob to include .html/.htm alongside markdown
- handleDoc converts HTML files via Turndown on demand when selected
- hasMarkdownFiles accepts optional extensions param for folder validation
- Add sourceInfo field to annotate server API response
- Add _site/, public/, out/, .docusaurus/, .jekyll-cache/, storybook-static/ to FILE_BROWSER_EXCLUDED
For provenance purposes, this commit was AI assisted.
Show a subtle badge in DocBadges displaying the URL hostname or HTML filename for converted content. Thread sourceInfo from API response through App → Viewer → DocBadges. Also update Pi extension to accept HTML-only folders in annotate mode. For provenance purposes, this commit was AI assisted.
For provenance purposes, this commit was AI assisted.
Security:
- Add project-root containment check for HTML files in /api/doc handler
using exported isWithinProjectRoot() from resolve-file.ts
- Blocks path traversal via absolute paths or ../ escapes
isLocalUrl fixes:
- Add bracketed IPv6 loopback [::1] detection
- Replace hostname.startsWith('10.') with proper IPv4 regex to avoid
matching public hostnames like 10.example.com
Revert Pi extension change:
- Pi server doesn't implement HTML file browsing or conversion yet
- Keep Pi folder validation markdown-only until both implementations
are updated per CLAUDE.md guidelines
Cleanup:
- Remove dead el.children || el.childNodes fallback in table rule
- Extract hostnameOrFallback() helper to @plannotator/shared/project
replacing duplicated try/catch IIFEs in DocBadges and index.ts
For provenance purposes, this commit was AI assisted.
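To illustrate the isLocalUrl fix in this commit: a prefix test such as hostname.startsWith('10.') also matches the public hostname 10.example.com, whereas requiring a full dotted quad does not. The sketch below is an assumption about how such a check could look. The helper names are hypothetical, and the exact set of ranges covered at this point in the PR (here 10/8, 127/8, 172.16/12, and 192.168/16) is not confirmed by the source.

```typescript
/** Parses a dotted-quad IPv4 hostname into four octets, or returns null. */
function parseIpv4(hostname: string): number[] | null {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(hostname);
  if (match === null) return null;
  const octets = match.slice(1).map(Number);
  return octets.every((octet) => octet <= 255) ? octets : null;
}

/** Hypothetical helper: true only for literal private or loopback IPv4 addresses. */
function isPrivateIpv4(hostname: string): boolean {
  const octets = parseIpv4(hostname);
  if (octets === null) return false; // e.g. "10.example.com" is a name, not an IP
  const a = octets[0] ?? -1;
  const b = octets[1] ?? -1;
  return (
    a === 10 ||                          // 10.0.0.0/8
    a === 127 ||                         // 127.0.0.0/8 loopback
    (a === 192 && b === 168) ||          // 192.168.0.0/16
    (a === 172 && b >= 16 && b <= 31)    // 172.16.0.0/12
  );
}
```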
Bring the Pi extension to full parity with the Bun server for HTML annotation support:
- Vendor html-to-markdown and url-to-markdown via vendor.sh
- walkMarkdownFiles now scans .html/.htm alongside markdown
- handleDocRequest converts HTML files on-demand via Turndown with isWithinProjectRoot containment check
- serverAnnotate includes sourceInfo in /api/plan response
- index.ts supports URL detection (Jina Reader + fallback), HTML file detection with Turndown conversion, folder HTML validation, and 10MB file size guard
- openMarkdownAnnotation accepts and threads sourceInfo
- Add turndown as a Pi extension dependency
For provenance purposes, this commit was AI assisted.
- Add extensions param to walkMarkdownFiles (default: HTML-inclusive)
- Obsidian callers pass /\.mdx?$/i to match Bun server behavior
- Add try/catch around HTML file reads in handleDocRequest
For provenance purposes, this commit was AI assisted.
… IP, dead code
Security:
- Add isWithinProjectRoot check to the base-relative block for HTML files in both Bun and Pi /api/doc handlers. Previously HTML files served via the base query param bypassed the containment guard.
- Add 169.254.0.0/16 (link-local / cloud metadata) to isLocalUrl private IP ranges
Cleanup:
- Remove dead hostname === "[::1]" check (WHATWG URL parser strips brackets; hostname === "::1" already handles it)
- Remove dead parent?.childNodes fallback in table cell() function
For provenance purposes, this commit was AI assisted.
Drop ~60 lines of hand-rolled GFM table conversion that had a bug (tables without explicit <thead> produced invalid GFM). Use the official turndown-plugin-gfm plugin (24KB) which correctly handles all table patterns plus adds strikethrough and task list support. For provenance purposes, this commit was AI assisted.
Expand the backslash escape regex to cover all CommonMark-defined
escapable characters (. ) - # > + | { } &), not just the subset
the parser uses for formatting. Fixes literal backslashes appearing
in rendered output for Turndown-escaped content like "1\." → "1.".
For provenance purposes, this commit was AI assisted.
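A small sketch of the unescaping step this commit describes. CommonMark allows a backslash before any ASCII punctuation character, so this example uses character ranges covering all of them; the PR's actual regex appears to list characters explicitly, and the function name unescapeMarkdown is hypothetical.

```typescript
// Matches a backslash followed by any ASCII punctuation character:
// ! through /, : through @, [ through `, and { through ~.
const BACKSLASH_ESCAPE = /\\([!-\/:-@\[-`{-~])/g;

/** Hypothetical helper: drops the backslash from escaped punctuation. */
function unescapeMarkdown(text: string): string {
  return text.replace(BACKSLASH_ESCAPE, "$1");
}
```

With this in place, Turndown output such as "1\." renders as "1.", while a backslash before a letter or digit is left untouched.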
Replace redirect: "follow" with redirect: "manual" in fetchViaTurndown and validate each redirect hop against isLocalUrl. Blocks attacks where an external URL redirects to cloud metadata endpoints (169.254.169.254) or other private IPs. Limits redirect chain to 10 hops. For provenance purposes, this commit was AI assisted.
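The manual redirect handling described above can be sketched roughly as follows. To keep the example self-contained and testable without network access, the fetch step is injected as a function returning only a status and a Location value; in real code that step would presumably call fetch with redirect: "manual". The names followRedirectsSafely, HopResponse, and FetchHop are hypothetical, and isLocalUrl is passed in rather than reimplemented.

```typescript
// Minimal view of one HTTP response hop (assumption for this sketch).
interface HopResponse {
  status: number;
  location: string | null; // value of the Location header, if any
}
type FetchHop = (url: string) => Promise<HopResponse>;

const MAX_REDIRECTS = 10;

async function followRedirectsSafely(
  startUrl: string,
  fetchHop: FetchHop,
  isLocalUrl: (url: string) => boolean,
): Promise<{ url: string; status: number }> {
  let url = startUrl;
  for (let followed = 0; ; followed++) {
    const res = await fetchHop(url);
    const location = res.location;
    // Not a redirect (or no target given): this is the final response.
    if (res.status < 300 || res.status >= 400 || location === null) {
      return { url, status: res.status };
    }
    if (followed >= MAX_REDIRECTS) throw new Error("Too many redirects");
    // Resolve relative Location values against the current URL.
    const next = new URL(location, url).toString();
    // Validate every hop, not just the first URL.
    if (isLocalUrl(next)) {
      throw new Error(`Blocked redirect to local/private URL: ${next}`);
    }
    url = next;
  }
}
```

Checking each hop matters because the initial URL can be public while a later Location header points at 169.254.169.254 or another private address.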
bun install needed to resolve turndown-plugin-gfm in the Pi extension workspace after adding it to apps/pi-extension/package.json. For provenance purposes, this commit was AI assisted.
Replace unmaintained turndown-plugin-gfm (2017, v1.0.2) with the actively maintained Joplin fork (2025, v1.0.64, 16KB).
Fix TypeScript errors that broke CI:
- Add @ts-expect-error for untyped @joplin/turndown-plugin-gfm import
- Restructure fetchViaTurndown redirect loop to avoid an uninitialized variable: first fetch before the loop, loop only for redirects
For provenance purposes, this commit was AI assisted.
Add declarations.d.ts for @joplin/turndown-plugin-gfm with typed function signatures, remove the ts-expect-error suppression. For provenance purposes, this commit was AI assisted.
CI's tsc wasn't finding the ambient module declaration with implicit include. Add explicit include to ensure declarations.d.ts is always picked up regardless of environment. For provenance purposes, this commit was AI assisted.
CI's tsc does not pick up ambient declarations.d.ts files despite local tsc finding them — likely a module resolution discrepancy between environments. Revert to @ts-expect-error which passes in both CI and local typecheck. For provenance purposes, this commit was AI assisted.
… protocol
- Add 10MB body size limit to both Jina and fetch+Turndown URL paths, matching the local HTML file guard. Streams response body and aborts if limit exceeded.
- Distinguish "Too many redirects" from a genuine 3xx response after redirect loop exhaustion.
- Add file: to the dangerous protocol blocklist in sanitizeLinkUrl.
For provenance purposes, this commit was AI assisted.
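A rough sketch of a streaming size guard like the one described in the first bullet. The name readBodyWithLimit appears later in this PR, but the signature here is an assumption: it takes any async iterable of byte chunks instead of a Response so the example runs without network access.

```typescript
const MAX_BODY_BYTES = 10 * 1024 * 1024; // 10MB, matching the local HTML guard

async function readBodyWithLimit(
  chunks: AsyncIterable<Uint8Array>,
  maxBytes: number = MAX_BODY_BYTES,
): Promise<string> {
  const decoder = new TextDecoder();
  let received = 0;
  let text = "";
  for await (const chunk of chunks) {
    received += chunk.byteLength;
    if (received > maxBytes) {
      // Throwing inside for-await closes the iterator, so the rest of the
      // body is never read.
      throw new Error(`Response body exceeds ${maxBytes} bytes`);
    }
    // stream: true keeps multi-byte characters split across chunks intact.
    text += decoder.decode(chunk, { stream: true });
  }
  return text + decoder.decode(); // flush any buffered bytes
}
```

Counting bytes as they arrive means an oversized or unbounded response is rejected without buffering it in full, which a Content-Length check alone cannot guarantee.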
- Remove containment check from base-relative block for HTML files in both Bun and Pi /api/doc handlers. Matches markdown behavior so HTML files in annotated folders outside cwd are served correctly. Standalone block (no base) retains its cwd check as fallback.
- Widen isLocalMd → isLocalDoc to treat .html/.htm links as linked documents. Clicking [Next](next.html) in a converted page now opens it via /api/doc with Turndown conversion instead of a new browser tab.
For provenance purposes, this commit was AI assisted.
…nv vars
- Expand loopback check from just 127.0.0.1 to the full 127.0.0.0/8 range so all loopback addresses skip Jina Reader
- Cancel redirect response body before re-fetching to avoid leaking TCP connections back to the pool
- Document PLANNOTATOR_JINA and JINA_API_KEY in CLAUDE.md env var table
For provenance purposes, this commit was AI assisted.
…s, comments
- Add [::1] back to isLocalUrl: the WHATWG URL hostname getter preserves brackets for IPv6 (verified: Bun and Node both return "[::1]"). Add comment explaining the empirical verification so future reviewers don't re-flag.
- Fix readBodyWithLimit null-body fallback to still enforce the 10MB limit via text length check instead of silently falling through.
- Document PLANNOTATOR_JINA and JINA_API_KEY in AGENTS.md env var table (CLAUDE.md is a symlink to AGENTS.md).
- Add comments to base-relative blocks in both Bun and Pi handleDoc explaining the intentional lack of containment check (matches pre-existing markdown behavior, base is set server-side).
For provenance purposes, this commit was AI assisted.
…calUrl
Add PRIVATE_IPV6 regex matching bracketed IPv6 private/reserved ranges:
- ::ffff: (IPv4-mapped; embeds private IPv4 as hex, e.g. [::ffff:c0a8:1])
- fe80: (link-local)
- fc00::/7 (unique-local, covers fc00:: through fdff::)
Closes the redirect-SSRF bypass where a public URL redirects to a private address expressed as IPv4-mapped IPv6, e.g. http://[::ffff:169.254.169.254]/latest/meta-data/
For provenance purposes, this commit was AI assisted.
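One way such a pattern could be written is shown below. The constant name PRIVATE_IPV6 and the three prefixes come from the commit message; the exact regex in the PR is not shown there, so this is an illustrative guess. It expects the bracketed form that the URL hostname getter returns, and the helper name isPrivateIpv6Host is hypothetical.

```typescript
// Bracketed IPv6 hostnames in private/reserved ranges:
//   ::ffff:            IPv4-mapped
//   fe80:              link-local (prefix as named in the commit)
//   fc00: .. fdff:     unique-local, fc00::/7
const PRIVATE_IPV6 = /^\[(?:::ffff:|fe80:|f[cd][0-9a-f]{2}:)/i;

/** Hypothetical helper wrapping the regex test. */
function isPrivateIpv6Host(hostname: string): boolean {
  return PRIVATE_IPV6.test(hostname);
}
```

Loopback ([::1]) is intentionally not covered here, since the PR handles it with a separate check.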
…annotate flow
- Expand isLocalUrl comment with full empirical verification table showing actual hostname getter output for every IPv6 format in both Bun and Node, which prevents false-positive review findings about brackets
- Add sourceInfo to /api/plan response type in App.tsx for type safety
- Update CLAUDE.md annotate flow diagram to reflect HTML/URL/folder input types
For provenance purposes, this commit was AI assisted.
…Info
- Add ( to backslash escape regex alongside existing ), since Turndown emits \( in link-adjacent contexts
- Cancel response body before throwing on !res.ok in both fetchViaJina and fetchViaTurndown error paths (redirect loop already did this)
- Document sourceInfo field in AGENTS.md annotate server API table
For provenance purposes, this commit was AI assisted.
- Skip dirname(filePath) base injection when filePath is a URL in both Bun and Pi annotate servers. dirname on a URL string produces a nonsensical filesystem path, causing linked doc clicks to 404. URL annotations now let links open normally instead.
- Cancel response body before throwing on content-type mismatch and content-length overflow in fetchViaTurndown/readBodyWithLimit.
- Fix double parseInt in readBodyWithLimit content-length check.
- Correct AGENTS.md flow diagram: OpenCode not yet implemented for HTML/URL annotation.
For provenance purposes, this commit was AI assisted.
Add URL detection (Jina Reader + fallback), HTML file detection with Turndown conversion, 10MB file size guard, and sourceInfo threading to OpenCode's handleAnnotateCommand. Uses the same shared utilities as the Bun CLI and Pi extension. OpenCode uses the Bun server directly (startAnnotateServer from @plannotator/server/annotate), so no server-side changes needed — only the command handler routing was missing. Note: folder annotation mode is not added (OpenCode didn't have it before this PR for markdown either — separate scope). For provenance purposes, this commit was AI assisted.
…ssages
- OpenCode plannotator-annotate.md description now mentions HTML/URL
- Align fetch progress messages across all three clients: all now show "(via Jina Reader)" or "(via fetch+Turndown)" consistently
For provenance purposes, this commit was AI assisted.
…leanup
- URLs ending in .md/.mdx are fetched raw: no Jina, no Turndown. Content is already markdown. Removes text/plain from fetchViaTurndown content-type whitelist since .md URLs are now short-circuited.
- Wikilink regex widened to preserve .html/.htm targets instead of appending .md (e.g. [[page.html]] no longer becomes page.html.md)
- Remove redundant existsSync before statSync in OpenCode handler
For provenance purposes, this commit was AI assisted.
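The wikilink change in the second bullet can be illustrated with a tiny sketch. The function name resolveWikilinkTarget is hypothetical, and treating .md/.mdx targets as already complete is an assumption; only the .html/.htm behavior is stated in the commit.

```typescript
// Targets that already carry a known document extension are left alone.
const DOC_EXTENSION = /\.(?:mdx?|html?)$/i;

/** Hypothetical helper: maps a [[wikilink]] target to a file name. */
function resolveWikilinkTarget(target: string): string {
  return DOC_EXTENSION.test(target) ? target : `${target}.md`;
}
```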
Tests cover the core conversion utility that all three clients depend on:
- Basic HTML → markdown (headings, paragraphs, links, code blocks)
- Tables with and without <thead> (the GFM plugin bug that was caught)
- Script/style/noscript stripping
- Strikethrough (GFM)
- Empty HTML handling
- Dangerous links preserved (sanitization is in the renderer, not here)
For provenance purposes, this commit was AI assisted.
…kdown
URLs ending in .md/.mdx (e.g. GitHub's viewer page for README.md) may return HTML instead of raw markdown. fetchRawText now checks the response content-type: if the server returns HTML, it returns null so the caller falls through to Jina/Turndown for proper conversion.
For provenance purposes, this commit was AI assisted.
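The content-type decision might look something like this. The helper name isHtmlContentType is hypothetical, and including application/xhtml+xml is an assumption beyond what the commit states.

```typescript
/**
 * Hypothetical helper: true when a Content-Type header value indicates HTML,
 * in which case a raw-markdown fetch should give up and return null.
 */
function isHtmlContentType(contentType: string | null): boolean {
  if (contentType === null) return false;
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mediaType = (contentType.split(";")[0] ?? "").trim().toLowerCase();
  return mediaType === "text/html" || mediaType === "application/xhtml+xml";
}
```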
fetchRawText (for .md/.mdx URLs) was using default redirect: "follow" with no isLocalUrl validation on redirect hops — a .md URL redirecting to 169.254.169.254 would be followed and credentials returned as "markdown". Now uses redirect: "manual" with per-hop isLocalUrl checks, matching fetchViaTurndown's SSRF protection. For provenance purposes, this commit was AI assisted.
Summary
- plannotator annotate file.html converts HTML to markdown via Turndown (~192KB dep) and feeds it into the existing annotation pipeline
- plannotator annotate https://... fetches content via Jina Reader (free, handles JS-rendered pages) with fetch+Turndown fallback. Local/private URLs skip Jina automatically
- File browser shows .html/.htm files alongside markdown, with on-demand Turndown conversion
- Dangerous link protocols (javascript:, data:, vbscript:) blocked in the markdown renderer; 10MB file size guard for local HTML; isWithinProjectRoot containment check on HTML /api/doc handler
- --no-jina CLI flag, PLANNOTATOR_JINA env var, and config.json setting to disable Jina Reader
- Pi extension: /api/doc HTML conversion, URL/Jina support, sourceInfo in API, walkMarkdownFiles with HTML extensions, vendored shared utilities via vendor.sh
Test plan
- plannotator annotate test.html: opens browser with converted content, annotations work
- plannotator annotate https://en.wikipedia.org/wiki/Markdown: fetches via Jina, renders clean markdown
- plannotator annotate --no-jina https://example.com: uses plain fetch+Turndown
- plannotator annotate ./folder-with-html/: file browser shows HTML files, clicking one renders converted markdown
- <a href="javascript:alert(1)">: link renders as plain text, not clickable
- plannotator annotate https://localhost:3000: skips Jina, fetches directly