fix(security): validate file paths in /api/documents/upload#361
Open
Chen17-sq wants to merge 1 commit intovolcengine:mainfrom
Open
fix(security): validate file paths in /api/documents/upload#361Chen17-sq wants to merge 1 commit intovolcengine:mainfrom
Chen17-sq wants to merge 1 commit intovolcengine:mainfrom
Conversation
The /api/documents/upload endpoint accepted any absolute path on the host
and forwarded the file's contents to the configured VLM / embedding
provider for processing. Combined with api_auth being disabled by default,
any process that could reach 127.0.0.1:1733 — for example a browser
extension with host_permissions, a malicious npm/pip dependency, or another
container sharing the host network — could exfiltrate arbitrary user files
(SSH keys, cloud credentials, etc.) by submitting their paths and letting
the document pipeline ship the contents to the cloud LLM endpoint.
This change rejects upload paths that are not under an allowed root, with
a defense-in-depth deny list for clearly sensitive locations regardless of
configuration:
* Allow-list sources, in order:
1. capture.folder_monitor.watch_folder_paths (already opted-in)
2. security.document_upload_allowed_paths (new, explicit extension)
3. Fallback to ~/Documents, ~/Downloads, ~/Desktop when present.
* Hard deny: paths whose components include .ssh / .aws / .gnupg / .azure
/ .gcloud / .kube / .docker / .password-store / Keychains; filenames
.env / id_rsa / id_ed25519 / id_ecdsa / id_dsa / shadow; and paths
under /etc, /proc, /sys, /dev, /root, /var/log, /var/db (and the
/private equivalents on macOS).
Paths are resolved (with symlinks followed) before checks, so the
allow-list cannot be escaped via `..` traversal or symlink tricks.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
/api/documents/uploadcurrently accepts any absolute path on the host and feeds the file's contents into the document-processing pipeline, which forwards them to the configured VLM / embedding provider (often a remote API). Withapi_auth.enableddefaulting tofalse, any process that can reach127.0.0.1:1733can exfiltrate arbitrary user files this way — defeating the project's stated privacy posture ("Your data is processed and stored entirely on your local device").This PR adds a path allow-list plus a defense-in-depth deny list to that endpoint. The set of permitted directories is sourced from the user's existing watched folders and a new explicit config knob, with sensible fallback defaults.
Threat model
The endpoint is reachable not only from the official UI but also from any co-located process able to talk to localhost:
host_permissionsforhttp://127.0.0.1:1733/*(background pages bypass CORS).npm,pip, VS Code extensions) running under the user.web.hostto bind beyond loopback.Because these clients are not browsers issuing a cross-origin
fetch, the existing CORS allow-list does not apply.Reproducer (against
mainbefore this patch)The file is then read by
ContextOperations.add_documentand passed throughgenerate_with_messages_async(opencontext/llm/global_vlm_client.py) to the configuredvlm_model.base_urlfor OCR / summarization, and throughdo_vectorize_asyncfor embedding. An attacker who additionally calls/api/model_settings/update(also unauthenticated by default) can swap that base URL to a server they control before triggering the upload, completing the exfiltration in two requests.Fix
opencontext/server/context_operations.py—add_document()now:resolve()s it (symlinks followed) before any check, so..and symlink games cannot escape the allow-list..ssh,.aws,.gnupg,.azure,.gcloud,.kube,.docker,.password-store,Keychains), whose filename matches.env,id_rsa,id_ed25519,id_ecdsa,id_dsa,shadow, or whose prefix is under/etc,/proc,/sys,/dev,/root,/var/log,/var/db(and the/private/...equivalents on macOS) — independent of the allow-list.capture.folder_monitor.watch_folder_paths(directories the user has already opted in to having watched).security.document_upload_allowed_paths(new config, explicit extension point — see updatedconfig/config.yaml).~/Documents,~/Downloads,~/Desktop(only those that exist).Helpers (
_is_sensitive_path,_resolve_paths_from_config,_resolve_allowed_upload_roots,_is_path_under_any_root) are private module-level functions to keepadd_documentitself readable and to make the predicates easy to reason about in isolation.Behavior on the legitimate path
watch_folder_paths(the documented "watch this folder" workflow) keep working with no config changes.~/Documents,~/Downloads, or~/Desktopkeep working with no config changes.security.document_upload_allowed_paths, which they can extend inconfig.yaml.Out of scope (worth follow-ups)
This PR fixes the most concrete arbitrary-file-read primitive but does not address the broader pattern of "privileged endpoints rely on
api_authwhich is off by default." In particular/api/model_settings/updateallowing the LLMbase_urlto be reset over plain HTTP is a separate concern that I'd be happy to follow up on if maintainers agree on the direction.Test plan
python -m black --check opencontext/server/context_operations.pypython -m isort --check-only opencontext/server/context_operations.py~/.ssh/id_rsa,~/.aws/credentials,/etc/passwd,/proc/1/environ,~/.env,~/Documents/id_rsa; ALLOW for~/Documents/report.pdf,~/Downloads/notes.md. Verified_resolve_allowed_upload_rootshonorswatch_folder_paths,security.document_upload_allowed_paths, and the~/Documents/~/Downloads/~/Desktopfallback. Verified_is_path_under_any_rootrejects paths outside the configured roots.main, then re-run against this branch to confirm the second request returnsDocument path is not allowed: ....