feat(annotate): support HTML files and URL annotation #545

Merged: backnotprop merged 32 commits into main from feat/annotate-html on Apr 13, 2026

backnotprop commented Apr 12, 2026

Summary

  • HTML file annotation: plannotator annotate file.html converts HTML to markdown via Turndown (~192KB dep) and feeds it into the existing annotation pipeline
  • URL annotation: plannotator annotate https://... fetches content via Jina Reader (free, handles JS-rendered pages) with fetch+Turndown fallback. Local/private URLs skip Jina automatically
  • Folder mode: file browser now shows .html/.htm files alongside markdown, with on-demand Turndown conversion
  • Security: dangerous link protocols (javascript:, data:, vbscript:) blocked in the markdown renderer; 10MB file size guard for local HTML; isWithinProjectRoot containment check on HTML /api/doc handler
  • Config: --no-jina CLI flag, PLANNOTATOR_JINA env var, and config.json setting to disable Jina Reader
  • UX: subtle source attribution badge showing URL hostname or "Converted from file.html"
  • Pi extension parity: full HTML annotation support mirrored in the Pi (Node.js) server — file browser, /api/doc HTML conversion, URL/Jina support, sourceInfo in API, walkMarkdownFiles with HTML extensions, vendored shared utilities via vendor.sh

Test plan

  • plannotator annotate test.html — opens browser with converted content, annotations work
  • plannotator annotate https://en.wikipedia.org/wiki/Markdown — fetches via Jina, renders clean markdown
  • plannotator annotate --no-jina https://example.com — uses plain fetch+Turndown
  • plannotator annotate ./folder-with-html/ — file browser shows HTML files, clicking one renders converted markdown
  • HTML file with <a href="javascript:alert(1)"> — link renders as plain text, not clickable
  • plannotator annotate https://localhost:3000 — skips Jina, fetches directly
  • Existing markdown annotation workflow unchanged (no regressions)
  • Source badge shows hostname for URLs, filename for HTML files
  • Pi extension: HTML file and URL annotation works identically to Bun server
  • Pi extension: Obsidian vault listings show only markdown (not HTML)

Block javascript:, data:, and vbscript: URLs in InlineMarkdown link
rendering. Links with dangerous protocols render as plain text instead
of clickable anchors. Uses a blocklist approach so existing links with
custom protocols (obsidian://, vscode://, Windows C:\ paths) continue
to work.

For provenance purposes, this commit was AI assisted.
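The blocklist approach described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name `sanitizeLinkUrl` is borrowed from a later commit in this PR, while the signature, the `null` return convention, and the normalization step are assumptions.

```typescript
// Protocol names come from the commit message; everything else is assumed.
const DANGEROUS_PROTOCOLS = ["javascript:", "data:", "vbscript:"];

/** Returns the URL unchanged if it may be rendered as a link, otherwise null. */
function sanitizeLinkUrl(url: string): string | null {
  // Strip control characters and whitespace, then lowercase, so that
  // variants such as " JaVaScript:" or "java\tscript:" are still caught.
  // This errs on the side of blocking.
  const normalized = url.replace(/[\u0000-\u0020]/g, "").toLowerCase();
  const blocked = DANGEROUS_PROTOCOLS.some((p) => normalized.startsWith(p));
  return blocked ? null : url;
}

// A renderer would emit plain text when the result is null.
console.log(sanitizeLinkUrl("javascript:alert(1)"));
console.log(sanitizeLinkUrl("obsidian://open?vault=notes"));
```

Because only listed schemes are rejected, custom protocols and Windows paths such as `C:\docs\plan.md` pass through untouched, which is the compatibility argument the commit makes for a blocklist over an allowlist.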
- html-to-markdown.ts: Turndown wrapper with GFM table rule, strips
  script/style/noscript tags
- url-to-markdown.ts: Jina Reader (free, returns markdown) with
  fetch+Turndown fallback. Warns on Jina failure, auto-skips Jina for
  local/private URLs (localhost, 192.168.*, 10.*, etc.)
- config.ts: add jina setting and resolveUseJina() with priority chain
  --no-jina flag > PLANNOTATOR_JINA env > config.json > default true

For provenance purposes, this commit was AI assisted.
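The priority chain for `resolveUseJina()` might look something like the sketch below. The input shape (`JinaSources`) and the set of env strings treated as "off" are assumptions made for illustration; only the ordering (flag, then env, then config.json, then default true) comes from the commit message.

```typescript
// Hypothetical input shape; the real function likely reads these itself.
interface JinaSources {
  noJinaFlag?: boolean;  // true when --no-jina was passed
  envValue?: string;     // raw PLANNOTATOR_JINA value, if set
  configValue?: boolean; // "jina" setting from config.json, if present
}

function resolveUseJina({ noJinaFlag, envValue, configValue }: JinaSources): boolean {
  // 1. The CLI flag always wins.
  if (noJinaFlag) return false;
  // 2. A non-empty env var overrides config.json.
  //    Which strings count as "off" is an assumption here.
  if (envValue !== undefined && envValue !== "") {
    return !["0", "false", "no", "off"].includes(envValue.trim().toLowerCase());
  }
  // 3. Then the config file setting.
  if (configValue !== undefined) return configValue;
  // 4. Default: Jina Reader enabled.
  return true;
}
```

For example, `resolveUseJina({ envValue: "false", configValue: true })` yields `false`, because the environment variable outranks the config file.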
Extend the annotate subcommand to accept .html/.htm local files
(converted via Turndown) and https:// URLs (fetched via Jina Reader
with fetch+Turndown fallback). URL content is fetched terminal-side
before opening the browser.

Add --no-jina global flag to disable Jina Reader per-invocation.
Add 10MB file size guard for local HTML files.

For provenance purposes, this commit was AI assisted.
- Widen file browser glob to include .html/.htm alongside markdown
- handleDoc converts HTML files via Turndown on demand when selected
- hasMarkdownFiles accepts optional extensions param for folder validation
- Add sourceInfo field to annotate server API response
- Add _site/, public/, out/, .docusaurus/, .jekyll-cache/,
  storybook-static/ to FILE_BROWSER_EXCLUDED

For provenance purposes, this commit was AI assisted.
Show a subtle badge in DocBadges displaying the URL hostname or HTML
filename for converted content. Thread sourceInfo from API response
through App → Viewer → DocBadges.

Also update Pi extension to accept HTML-only folders in annotate mode.

For provenance purposes, this commit was AI assisted.
For provenance purposes, this commit was AI assisted.
Security:
- Add project-root containment check for HTML files in /api/doc handler
  using exported isWithinProjectRoot() from resolve-file.ts
- Blocks path traversal via absolute paths or ../ escapes

isLocalUrl fixes:
- Add bracketed IPv6 loopback [::1] detection
- Replace hostname.startsWith('10.') with proper IPv4 regex to avoid
  matching public hostnames like 10.example.com

Revert Pi extension change:
- Pi server doesn't implement HTML file browsing or conversion yet
- Keep Pi folder validation markdown-only until both implementations
  are updated per CLAUDE.md guidelines

Cleanup:
- Remove dead el.children || el.childNodes fallback in table rule
- Extract hostnameOrFallback() helper to @plannotator/shared/project
  replacing duplicated try/catch IIFEs in DocBadges and index.ts

For provenance purposes, this commit was AI assisted.
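A containment check of the kind this commit describes could be sketched as below. The name `isWithinProjectRoot` is from the PR; the parameter order and the implementation are assumptions, and the sketch does not resolve symlinks.

```typescript
import { isAbsolute, relative, resolve, sep } from "node:path";

// Sketch only: the real helper in resolve-file.ts may differ.
function isWithinProjectRoot(candidate: string, root: string): boolean {
  const absRoot = resolve(root);
  // resolve() collapses "../" segments; an absolute candidate replaces the root.
  const absCandidate = resolve(absRoot, candidate);
  const rel = relative(absRoot, absCandidate);
  if (rel === "") return true; // the root itself
  // A relative path that climbs out starts with ".." as a full segment.
  const escapes = rel === ".." || rel.startsWith(".." + sep);
  // On Windows, a path on another drive comes back absolute.
  return !escapes && !isAbsolute(rel);
}

console.log(isWithinProjectRoot("docs/page.html", "/srv/project"));
console.log(isWithinProjectRoot("../secret.html", "/srv/project"));
```

Comparing the computed relative path, rather than testing with `startsWith(root)`, avoids treating a sibling directory such as `/srv/project-old` as being inside `/srv/project`.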
Bring the Pi extension to full parity with the Bun server for HTML
annotation support:

- Vendor html-to-markdown and url-to-markdown via vendor.sh
- walkMarkdownFiles now scans .html/.htm alongside markdown
- handleDocRequest converts HTML files on-demand via Turndown with
  isWithinProjectRoot containment check
- serverAnnotate includes sourceInfo in /api/plan response
- index.ts supports URL detection (Jina Reader + fallback), HTML file
  detection with Turndown conversion, folder HTML validation, and 10MB
  file size guard
- openMarkdownAnnotation accepts and threads sourceInfo
- Add turndown as a Pi extension dependency

For provenance purposes, this commit was AI assisted.
- Add extensions param to walkMarkdownFiles (default: HTML-inclusive)
- Obsidian callers pass /\.mdx?$/i to match Bun server behavior
- Add try/catch around HTML file reads in handleDocRequest

For provenance purposes, this commit was AI assisted.
… IP, dead code

Security:
- Add isWithinProjectRoot check to the base-relative block for HTML
  files in both Bun and Pi /api/doc handlers. Previously HTML files
  served via the base query param bypassed the containment guard.
- Add 169.254.0.0/16 (link-local / cloud metadata) to isLocalUrl
  private IP ranges

Cleanup:
- Remove dead hostname === "[::1]" check (WHATWG URL parser strips
  brackets; hostname === "::1" already handles it)
- Remove dead parent?.childNodes fallback in table cell() function

For provenance purposes, this commit was AI assisted.
Drop ~60 lines of hand-rolled GFM table conversion that had a bug
(tables without explicit <thead> produced invalid GFM). Use the
official turndown-plugin-gfm plugin (24KB) which correctly handles
all table patterns plus adds strikethrough and task list support.

For provenance purposes, this commit was AI assisted.
Expand the backslash escape regex to cover all CommonMark-defined
escapable characters (. ) - # > + | { } &), not just the subset
the parser uses for formatting. Fixes literal backslashes appearing
in rendered output for Turndown-escaped content like "1\." → "1.".

For provenance purposes, this commit was AI assisted.
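To illustrate the unescaping step, here is a minimal sketch that strips the backslash in front of any ASCII punctuation character, which is the set CommonMark allows to be backslash-escaped. The function name `unescapeMarkdown` is hypothetical, and the PR's actual regex covers a specific subset listed in the commit message rather than the full set used here.

```typescript
// All ASCII punctuation characters; a backslash before any of them is an escape.
const ESCAPED_PUNCTUATION = /\\([!"#$%&'()*+,\-.\/:;<=>?@[\\\]^_`{|}~])/g;

// Hypothetical helper: drops the backslash and keeps the escaped character.
function unescapeMarkdown(text: string): string {
  return text.replace(ESCAPED_PUNCTUATION, "$1");
}

console.log(unescapeMarkdown("1\\. First item"));
```

A backslash followed by a letter or digit is left alone, so Windows-style paths like `C:\notes` are not altered. A real renderer would also need to skip code spans and code blocks, where backslashes are literal.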
Replace redirect: "follow" with redirect: "manual" in fetchViaTurndown
and validate each redirect hop against isLocalUrl. Blocks attacks where
an external URL redirects to cloud metadata endpoints (169.254.169.254)
or other private IPs. Limits redirect chain to 10 hops.

For provenance purposes, this commit was AI assisted.
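The manual redirect handling can be sketched as a loop that validates every hop before following it. This is an illustration under assumptions: `fetchWithCheckedRedirects` and the `FetchLike` type are hypothetical names, the `isLocalUrl` predicate is passed in rather than imported, and the fetch implementation is injectable only so the sketch can be exercised without a network.

```typescript
type FetchLike = (url: string, init: { redirect: "manual" }) => Promise<Response>;

const MAX_REDIRECTS = 10; // hop limit stated in the commit message

async function fetchWithCheckedRedirects(
  startUrl: string,
  isLocalUrl: (url: string) => boolean,
  fetchImpl: FetchLike = fetch,
): Promise<Response> {
  let url = startUrl;
  for (let hop = 0; ; hop++) {
    const res = await fetchImpl(url, { redirect: "manual" });
    const location = res.headers.get("location");
    // Anything that is not a 3xx with a Location header is the final response.
    if (res.status < 300 || res.status >= 400 || location === null) return res;

    // Discard the redirect body before issuing the next request.
    await res.body?.cancel();
    if (hop >= MAX_REDIRECTS) throw new Error("Too many redirects");

    // Location may be relative, so resolve it against the current URL.
    const next = new URL(location, url).toString();
    if (isLocalUrl(next)) {
      throw new Error(`Blocked redirect to local/private address: ${next}`);
    }
    url = next;
  }
}
```

The key point is that the check runs on the resolved target of each hop, so a public URL cannot bounce the fetcher to `169.254.169.254` or another private address.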
bun install needed to resolve turndown-plugin-gfm in the Pi extension
workspace after adding it to apps/pi-extension/package.json.

For provenance purposes, this commit was AI assisted.
Replace unmaintained turndown-plugin-gfm (2017, v1.0.2) with the
actively maintained Joplin fork (2025, v1.0.64, 16KB).

Fix TypeScript errors that broke CI:
- Add @ts-expect-error for untyped @joplin/turndown-plugin-gfm import
- Restructure fetchViaTurndown redirect loop to avoid uninitialized
  variable — first fetch before loop, loop only for redirects

For provenance purposes, this commit was AI assisted.
Add declarations.d.ts for @joplin/turndown-plugin-gfm with typed
function signatures, remove the ts-expect-error suppression.

For provenance purposes, this commit was AI assisted.
CI's tsc wasn't finding the ambient module declaration with implicit
include. Add explicit include to ensure declarations.d.ts is always
picked up regardless of environment.

For provenance purposes, this commit was AI assisted.
CI's tsc does not pick up ambient declarations.d.ts files despite
local tsc finding them — likely a module resolution discrepancy
between environments. Revert to @ts-expect-error which passes in
both CI and local typecheck.

For provenance purposes, this commit was AI assisted.
… protocol

- Add 10MB body size limit to both Jina and fetch+Turndown URL paths,
  matching the local HTML file guard. Streams response body and aborts
  if limit exceeded.
- Distinguish "Too many redirects" from a genuine 3xx response after
  redirect loop exhaustion.
- Add file: to the dangerous protocol blocklist in sanitizeLinkUrl.

For provenance purposes, this commit was AI assisted.
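A streaming size guard along these lines might look like the following sketch. The name `readBodyWithLimit` appears in later commits of this PR, but the signature, error messages, and the exact handling shown here are assumptions.

```typescript
const MAX_BODY_BYTES = 10 * 1024 * 1024; // 10MB, matching the local HTML guard

async function readBodyWithLimit(res: Response, limit: number = MAX_BODY_BYTES): Promise<string> {
  // Fast path: reject early when the server declares an oversized body.
  const declared = Number.parseInt(res.headers.get("content-length") ?? "", 10);
  if (Number.isFinite(declared) && declared > limit) {
    await res.body?.cancel();
    throw new Error(`Response exceeds ${limit} bytes`);
  }

  // No stream available: fall back to text() but still enforce the limit.
  if (!res.body) {
    const text = await res.text();
    if (text.length > limit) throw new Error(`Response exceeds ${limit} bytes`);
    return text;
  }

  // Stream the body and abort as soon as the running total passes the limit.
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let received = 0;
  let text = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done || value === undefined) break;
    received += value.byteLength;
    if (received > limit) {
      await reader.cancel();
      throw new Error(`Response exceeds ${limit} bytes`);
    }
    text += decoder.decode(value, { stream: true });
  }
  return text + decoder.decode(); // flush any buffered partial character
}
```

Counting bytes while streaming matters because `content-length` can be absent or wrong; the header check is only an optimization.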
- Remove containment check from base-relative block for HTML files in
  both Bun and Pi /api/doc handlers. Matches markdown behavior so HTML
  files in annotated folders outside cwd are served correctly.
  Standalone block (no base) retains its cwd check as fallback.
- Widen isLocalMd → isLocalDoc to treat .html/.htm links as linked
  documents. Clicking [Next](next.html) in a converted page now opens
  it via /api/doc with Turndown conversion instead of a new browser tab.

For provenance purposes, this commit was AI assisted.
…nv vars

- Expand loopback check from just 127.0.0.1 to the full 127.0.0.0/8
  range so all loopback addresses skip Jina Reader
- Cancel redirect response body before re-fetching to avoid leaking
  TCP connections back to the pool
- Document PLANNOTATOR_JINA and JINA_API_KEY in CLAUDE.md env var table

For provenance purposes, this commit was AI assisted.
…s, comments

- Add [::1] back to isLocalUrl — WHATWG URL hostname getter preserves
  brackets for IPv6 (verified: Bun and Node both return "[::1]").
  Add comment explaining the empirical verification so future reviewers
  don't re-flag.
- Fix readBodyWithLimit null-body fallback to still enforce the 10MB
  limit via text length check instead of silently falling through.
- Document PLANNOTATOR_JINA and JINA_API_KEY in AGENTS.md env var table
  (CLAUDE.md is a symlink to AGENTS.md).
- Add comments to base-relative blocks in both Bun and Pi handleDoc
  explaining the intentional lack of containment check (matches
  pre-existing markdown behavior, base is set server-side).

For provenance purposes, this commit was AI assisted.
…calUrl

Add PRIVATE_IPV6 regex matching bracketed IPv6 private/reserved ranges:
- ::ffff: (IPv4-mapped — embeds private IPv4 as hex, e.g. [::ffff:c0a8:1])
- fe80: (link-local)
- fc00::/7 (unique-local, covers fc00:: through fdff::)

Closes the redirect-SSRF bypass where a public URL redirects to a
private address expressed as IPv4-mapped IPv6, e.g.
http://[::ffff:169.254.169.254]/latest/meta-data/

For provenance purposes, this commit was AI assisted.
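Putting the IPv4 and IPv6 checks from this PR together, an `isLocalUrl` check might be sketched as below. The regexes are reconstructed from the commit messages, not copied from the code; in particular the `172.16.0.0/12` range is an assumption, since the PR only lists "localhost, 192.168.*, 10.*, etc.".

```typescript
// Dotted-quad IPv4 in loopback, private, or link-local ranges.
// Anchored so hostnames like "10.example.com" do not match.
const PRIVATE_IPV4 =
  /^(127|10)\.\d{1,3}\.\d{1,3}\.\d{1,3}$|^192\.168\.\d{1,3}\.\d{1,3}$|^169\.254\.\d{1,3}\.\d{1,3}$|^172\.(1[6-9]|2\d|3[01])\.\d{1,3}\.\d{1,3}$/;

// Bracketed IPv6: IPv4-mapped, link-local, unique-local (fc00::/7).
const PRIVATE_IPV6 = /^\[(::ffff:|fe80:|f[cd][0-9a-f]{2}:)/i;

function isLocalUrl(raw: string): boolean {
  let hostname: string;
  try {
    // The URL parser normalizes the host, e.g. an IPv4-mapped address
    // written in dotted form is serialized in hex.
    hostname = new URL(raw).hostname.toLowerCase();
  } catch {
    return false; // not a parseable URL; the caller handles that separately
  }
  if (hostname === "localhost" || hostname === "[::1]") return true;
  return PRIVATE_IPV4.test(hostname) || PRIVATE_IPV6.test(hostname);
}

console.log(isLocalUrl("http://[::ffff:169.254.169.254]/latest/meta-data/"));
console.log(isLocalUrl("http://10.example.com/"));
```

Matching on the parsed `hostname` rather than the raw string is what lets the single `::ffff:` prefix catch IPv4-mapped addresses regardless of how the attacker wrote them.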
…annotate flow

- Expand isLocalUrl comment with full empirical verification table
  showing actual hostname getter output for every IPv6 format in both
  Bun and Node — prevents false-positive review findings about brackets
- Add sourceInfo to /api/plan response type in App.tsx for type safety
- Update CLAUDE.md annotate flow diagram to reflect HTML/URL/folder
  input types

For provenance purposes, this commit was AI assisted.
…Info

- Add ( to backslash escape regex alongside existing ) — Turndown
  emits \( in link-adjacent contexts
- Cancel response body before throwing on !res.ok in both fetchViaJina
  and fetchViaTurndown error paths (redirect loop already did this)
- Document sourceInfo field in AGENTS.md annotate server API table

For provenance purposes, this commit was AI assisted.
- Skip dirname(filePath) base injection when filePath is a URL in both
  Bun and Pi annotate servers. dirname on a URL string produces a
  nonsensical filesystem path, causing linked doc clicks to 404.
  URL annotations now let links open normally instead.
- Cancel response body before throwing on content-type mismatch and
  content-length overflow in fetchViaTurndown/readBodyWithLimit.
- Fix double parseInt in readBodyWithLimit content-length check.
- Correct AGENTS.md flow diagram: OpenCode not yet implemented for
  HTML/URL annotation.

For provenance purposes, this commit was AI assisted.
Add URL detection (Jina Reader + fallback), HTML file detection with
Turndown conversion, 10MB file size guard, and sourceInfo threading
to OpenCode's handleAnnotateCommand. Uses the same shared utilities
as the Bun CLI and Pi extension.

OpenCode uses the Bun server directly (startAnnotateServer from
@plannotator/server/annotate), so no server-side changes needed —
only the command handler routing was missing.

Note: folder annotation mode is not added (OpenCode didn't have it
before this PR for markdown either — separate scope).

For provenance purposes, this commit was AI assisted.
…ssages

- OpenCode plannotator-annotate.md description now mentions HTML/URL
- Align fetch progress messages across all three clients: all now show
  "(via Jina Reader)" or "(via fetch+Turndown)" consistently

For provenance purposes, this commit was AI assisted.
…leanup

- URLs ending in .md/.mdx are fetched raw — no Jina, no Turndown.
  Content is already markdown. Removes text/plain from fetchViaTurndown
  content-type whitelist since .md URLs are now short-circuited.
- Wikilink regex widened to preserve .html/.htm targets instead of
  appending .md (e.g. [[page.html]] no longer becomes page.html.md)
- Remove redundant existsSync before statSync in OpenCode handler

For provenance purposes, this commit was AI assisted.
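The wikilink change can be pictured with a small helper like the one below. `resolveWikilinkTarget` is a hypothetical name and the real code works inside a larger regex; the sketch only shows the extension rule described above.

```typescript
// Hypothetical helper: keep recognized document extensions, otherwise append .md.
function resolveWikilinkTarget(target: string): string {
  const name = target.trim();
  return /\.(mdx?|html?)$/i.test(name) ? name : `${name}.md`;
}

console.log(resolveWikilinkTarget("page.html"));
console.log(resolveWikilinkTarget("notes"));
```

With this rule, `[[page.html]]` resolves to `page.html` while a bare `[[notes]]` still resolves to `notes.md`.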
Tests cover the core conversion utility that all three clients depend on:
- Basic HTML → markdown (headings, paragraphs, links, code blocks)
- Tables with and without <thead> (the GFM plugin bug that was caught)
- Script/style/noscript stripping
- Strikethrough (GFM)
- Empty HTML handling
- Dangerous links preserved (sanitization is in the renderer, not here)

For provenance purposes, this commit was AI assisted.
…kdown

URLs ending in .md/.mdx (e.g. GitHub's viewer page for README.md)
may return HTML instead of raw markdown. fetchRawText now checks the
response content-type — if the server returns HTML, returns null so
the caller falls through to Jina/Turndown for proper conversion.

For provenance purposes, this commit was AI assisted.
fetchRawText (for .md/.mdx URLs) was using default redirect: "follow"
with no isLocalUrl validation on redirect hops — a .md URL redirecting
to 169.254.169.254 would be followed and credentials returned as
"markdown". Now uses redirect: "manual" with per-hop isLocalUrl checks,
matching fetchViaTurndown's SSRF protection.

For provenance purposes, this commit was AI assisted.
@backnotprop backnotprop merged commit b780739 into main Apr 13, 2026
7 checks passed