fix(security): validate file paths in /api/documents/upload by Chen17-sq · Pull Request #361 · volcengine/MineContext

Chen17-sq · 2026-04-27T11:53:55Z

Summary

/api/documents/upload currently accepts any absolute path on the host and feeds the file's contents into the document-processing pipeline, which forwards them to the configured VLM / embedding provider (often a remote API). With api_auth.enabled defaulting to false, any process that can reach 127.0.0.1:1733 can exfiltrate arbitrary user files this way — defeating the project's stated privacy posture ("Your data is processed and stored entirely on your local device").

This PR adds a path allow-list plus a defense-in-depth deny list to that endpoint. The set of permitted directories is sourced from the user's existing watched folders and a new explicit config knob, with sensible fallback defaults.

Threat model

The endpoint is reachable not only from the official UI but also from any co-located process able to talk to localhost:

Browser extensions with host_permissions for http://127.0.0.1:1733/* (background pages bypass CORS).
Malicious dev-time dependencies (npm, pip, VS Code extensions) running under the user.
Sibling containers sharing the host network namespace.
Any user that has changed web.host to bind beyond loopback.

Because these clients are not browsers issuing a cross-origin fetch, the existing CORS allow-list does not apply.

Reproducer (against `main` before this patch)

# Default config has api_auth.enabled=false, so no auth header needed.
curl -s -X POST http://127.0.0.1:1733/api/documents/upload \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"/Users/<you>/.ssh/id_rsa"}'
# -> {"code":0,"message":"Document queued for processing successfully"}

The file is then read by ContextOperations.add_document and passed through generate_with_messages_async (opencontext/llm/global_vlm_client.py) to the configured vlm_model.base_url for OCR / summarization, and through do_vectorize_async for embedding. An attacker who additionally calls /api/model_settings/update (also unauthenticated by default) can swap that base URL to a server they control before triggering the upload, completing the exfiltration in two requests.

Fix

opencontext/server/context_operations.py — add_document() now:

Requires an absolute path and resolve()s it (symlinks followed) before any check, so .. and symlink games cannot escape the allow-list.
Hard-denies paths whose components match a small list of clearly sensitive names (.ssh, .aws, .gnupg, .azure, .gcloud, .kube, .docker, .password-store, Keychains), whose filename matches .env, id_rsa, id_ed25519, id_ecdsa, id_dsa, shadow, or whose prefix is under /etc, /proc, /sys, /dev, /root, /var/log, /var/db (and the /private/... equivalents on macOS) — independent of the allow-list.
Otherwise requires the resolved path to live under one of:
- capture.folder_monitor.watch_folder_paths (directories the user has already opted in to having watched).
- security.document_upload_allowed_paths (new config, explicit extension point — see updated config/config.yaml).
- Fallback when neither is configured: ~/Documents, ~/Downloads, ~/Desktop (only those that exist).
Returns a descriptive error message on rejection, including the active allow-list and the config key the user can edit to extend it.

Helpers (_is_sensitive_path, _resolve_paths_from_config, _resolve_allowed_upload_roots, _is_path_under_any_root) are private module-level functions to keep add_document itself readable and to make the predicates easy to reason about in isolation.

Behavior on the legitimate path

Users who have configured watch_folder_paths (the documented "watch this folder" workflow) keep working with no config changes.
Users on a fresh install whose chosen documents live in ~/Documents, ~/Downloads, or ~/Desktop keep working with no config changes.
Users who put documents elsewhere see a clear error pointing at security.document_upload_allowed_paths, which they can extend in config.yaml.

Out of scope (worth follow-ups)

This PR fixes the most concrete arbitrary-file-read primitive but does not address the broader pattern of "privileged endpoints rely on api_auth which is off by default." In particular /api/model_settings/update allowing the LLM base_url to be reset over plain HTTP is a separate concern that I'd be happy to follow up on if maintainers agree on the direction.

Test plan

python -m black --check opencontext/server/context_operations.py
python -m isort --check-only opencontext/server/context_operations.py
Direct exercise of helpers — verified DENY for ~/.ssh/id_rsa, ~/.aws/credentials, /etc/passwd, /proc/1/environ, ~/.env, ~/Documents/id_rsa; ALLOW for ~/Documents/report.pdf, ~/Downloads/notes.md. Verified _resolve_allowed_upload_roots honors watch_folder_paths, security.document_upload_allowed_paths, and the ~/Documents / ~/Downloads / ~/Desktop fallback. Verified _is_path_under_any_root rejects paths outside the configured roots.
Maintainer to reproduce the curl PoC against main, then re-run against this branch to confirm the second request returns Document path is not allowed: ....

The /api/documents/upload endpoint accepted any absolute path on the host and forwarded the file's contents to the configured VLM / embedding provider for processing. Combined with api_auth being disabled by default, any process that could reach 127.0.0.1:1733 — for example a browser extension with host_permissions, a malicious npm/pip dependency, or another container sharing the host network — could exfiltrate arbitrary user files (SSH keys, cloud credentials, etc.) by submitting their paths and letting the document pipeline ship the contents to the cloud LLM endpoint. This change rejects upload paths that are not under an allowed root, with a defense-in-depth deny list for clearly sensitive locations regardless of configuration: * Allow-list sources, in order: 1. capture.folder_monitor.watch_folder_paths (already opted-in) 2. security.document_upload_allowed_paths (new, explicit extension) 3. Fallback to ~/Documents, ~/Downloads, ~/Desktop when present. * Hard deny: paths whose components include .ssh / .aws / .gnupg / .azure / .gcloud / .kube / .docker / .password-store / Keychains; filenames .env / id_rsa / id_ed25519 / id_ecdsa / id_dsa / shadow; and paths under /etc, /proc, /sys, /dev, /root, /var/log, /var/db (and the /private equivalents on macOS). Paths are resolved (with symlinks followed) before checks, so the allow-list cannot be escaped via `..` traversal or symlink tricks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(security): validate file paths in /api/documents/upload#361

fix(security): validate file paths in /api/documents/upload#361
Chen17-sq wants to merge 1 commit intovolcengine:mainfrom
Chen17-sq:security/validate-document-upload-path

Chen17-sq commented Apr 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Chen17-sq commented Apr 27, 2026

Summary

Threat model

Reproducer (against main before this patch)

Fix

Behavior on the legitimate path

Out of scope (worth follow-ups)

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Reproducer (against `main` before this patch)