Skip to content

fix(security): validate file paths in /api/documents/upload#361

Open
Chen17-sq wants to merge 1 commit intovolcengine:mainfrom
Chen17-sq:security/validate-document-upload-path
Open

fix(security): validate file paths in /api/documents/upload#361
Chen17-sq wants to merge 1 commit intovolcengine:mainfrom
Chen17-sq:security/validate-document-upload-path

Conversation

@Chen17-sq
Copy link
Copy Markdown

Summary

/api/documents/upload currently accepts any absolute path on the host and feeds the file's contents into the document-processing pipeline, which forwards them to the configured VLM / embedding provider (often a remote API). With api_auth.enabled defaulting to false, any process that can reach 127.0.0.1:1733 can exfiltrate arbitrary user files this way — defeating the project's stated privacy posture ("Your data is processed and stored entirely on your local device").

This PR adds a path allow-list plus a defense-in-depth deny list to that endpoint. The set of permitted directories is sourced from the user's existing watched folders and a new explicit config knob, with sensible fallback defaults.

Threat model

The endpoint is reachable not only from the official UI but also from any co-located process able to talk to localhost:

  • Browser extensions with host_permissions for http://127.0.0.1:1733/* (background pages bypass CORS).
  • Malicious dev-time dependencies (npm, pip, VS Code extensions) running under the user.
  • Sibling containers sharing the host network namespace.
  • Any user that has changed web.host to bind beyond loopback.

Because these clients are not browsers issuing a cross-origin fetch, the existing CORS allow-list does not apply.

Reproducer (against main before this patch)

# Default config has api_auth.enabled=false, so no auth header needed.
curl -s -X POST http://127.0.0.1:1733/api/documents/upload \
  -H 'Content-Type: application/json' \
  -d '{"file_path":"/Users/<you>/.ssh/id_rsa"}'
# -> {"code":0,"message":"Document queued for processing successfully"}

The file is then read by ContextOperations.add_document and passed through generate_with_messages_async (opencontext/llm/global_vlm_client.py) to the configured vlm_model.base_url for OCR / summarization, and through do_vectorize_async for embedding. An attacker who additionally calls /api/model_settings/update (also unauthenticated by default) can swap that base URL to a server they control before triggering the upload, completing the exfiltration in two requests.

Fix

opencontext/server/context_operations.pyadd_document() now:

  1. Requires an absolute path and resolve()s it (symlinks followed) before any check, so .. and symlink games cannot escape the allow-list.
  2. Hard-denies paths whose components match a small list of clearly sensitive names (.ssh, .aws, .gnupg, .azure, .gcloud, .kube, .docker, .password-store, Keychains), whose filename matches .env, id_rsa, id_ed25519, id_ecdsa, id_dsa, shadow, or whose prefix is under /etc, /proc, /sys, /dev, /root, /var/log, /var/db (and the /private/... equivalents on macOS) — independent of the allow-list.
  3. Otherwise requires the resolved path to live under one of:
    • capture.folder_monitor.watch_folder_paths (directories the user has already opted in to having watched).
    • security.document_upload_allowed_paths (new config, explicit extension point — see updated config/config.yaml).
    • Fallback when neither is configured: ~/Documents, ~/Downloads, ~/Desktop (only those that exist).
  4. Returns a descriptive error message on rejection, including the active allow-list and the config key the user can edit to extend it.

Helpers (_is_sensitive_path, _resolve_paths_from_config, _resolve_allowed_upload_roots, _is_path_under_any_root) are private module-level functions to keep add_document itself readable and to make the predicates easy to reason about in isolation.

Behavior on the legitimate path

  • Users who have configured watch_folder_paths (the documented "watch this folder" workflow) keep working with no config changes.
  • Users on a fresh install whose chosen documents live in ~/Documents, ~/Downloads, or ~/Desktop keep working with no config changes.
  • Users who put documents elsewhere see a clear error pointing at security.document_upload_allowed_paths, which they can extend in config.yaml.

Out of scope (worth follow-ups)

This PR fixes the most concrete arbitrary-file-read primitive but does not address the broader pattern of "privileged endpoints rely on api_auth which is off by default." In particular /api/model_settings/update allowing the LLM base_url to be reset over plain HTTP is a separate concern that I'd be happy to follow up on if maintainers agree on the direction.

Test plan

  • python -m black --check opencontext/server/context_operations.py
  • python -m isort --check-only opencontext/server/context_operations.py
  • Direct exercise of helpers — verified DENY for ~/.ssh/id_rsa, ~/.aws/credentials, /etc/passwd, /proc/1/environ, ~/.env, ~/Documents/id_rsa; ALLOW for ~/Documents/report.pdf, ~/Downloads/notes.md. Verified _resolve_allowed_upload_roots honors watch_folder_paths, security.document_upload_allowed_paths, and the ~/Documents / ~/Downloads / ~/Desktop fallback. Verified _is_path_under_any_root rejects paths outside the configured roots.
  • Maintainer to reproduce the curl PoC against main, then re-run against this branch to confirm the second request returns Document path is not allowed: ....

The /api/documents/upload endpoint accepted any absolute path on the host
and forwarded the file's contents to the configured VLM / embedding
provider for processing. Combined with api_auth being disabled by default,
any process that could reach 127.0.0.1:1733 — for example a browser
extension with host_permissions, a malicious npm/pip dependency, or another
container sharing the host network — could exfiltrate arbitrary user files
(SSH keys, cloud credentials, etc.) by submitting their paths and letting
the document pipeline ship the contents to the cloud LLM endpoint.

This change rejects upload paths that are not under an allowed root, with
a defense-in-depth deny list for clearly sensitive locations regardless of
configuration:

  * Allow-list sources, in order:
      1. capture.folder_monitor.watch_folder_paths (already opted-in)
      2. security.document_upload_allowed_paths (new, explicit extension)
      3. Fallback to ~/Documents, ~/Downloads, ~/Desktop when present.
  * Hard deny: paths whose components include .ssh / .aws / .gnupg / .azure
    / .gcloud / .kube / .docker / .password-store / Keychains; filenames
    .env / id_rsa / id_ed25519 / id_ecdsa / id_dsa / shadow; and paths
    under /etc, /proc, /sys, /dev, /root, /var/log, /var/db (and the
    /private equivalents on macOS).

Paths are resolved (with symlinks followed) before checks, so the
allow-list cannot be escaped via `..` traversal or symlink tricks.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant