feat: add URL scanning support to PlanResolver trait #587
Conversation
Add scan_url method to PlanResolver so resolvers participate in data source URL handling. DataFusionResolver moves from a privileged terminal resolver into a regular resolver in the pipeline chain.

Key changes:
- ParsedUrl struct for structured URL representation passed to scanners
- ResolverCapabilities proto + MergedCapabilities for URL support negotiation
- DataBaseUrlSetting enum for explicit base URL API (Default/Disabled/Custom)
- resolve_url() shared function for plan-time and eval-time URL resolution
- GetCapabilities RPC for remote capability propagation (gRPC + WASM)
- Python bridge: scan_url, scan_url_proto, capabilities on PlanResolver
- data_base_url parameter threaded through pre_transform_* and ChartState APIs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace unreachable!() with proper error in pipeline resolve
- Use url::Url::join() for RFC 3986 URL resolution
- Check URL scheme in DataSpec::supported against capabilities
- Handle protocol-relative URLs by prepending https:
- Deduplicate scheme lists in DataFusionResolver::scan_url
- Gate url::Url::from_file_path behind cfg(not(wasm32))
- Add catch-all arm in server test match for GetCapabilities variant
- Format Python files with ruff

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
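The RFC 3986 join and protocol-relative handling described above can be illustrated with Python's `urllib.parse.urljoin`, which implements the same reference-resolution rules as Rust's `url::Url::join` (a sketch with a hypothetical base URL, not the PR's code):

```python
from urllib.parse import urljoin

# RFC 3986 reference resolution: relative references resolve against a base.
base = "https://cdn.example.com/data/"  # hypothetical base URL

print(urljoin(base, "weather.csv"))      # relative path appended to base dir
print(urljoin(base, "/other/data.csv"))  # absolute path replaces base path

# A protocol-relative reference ("//host/path") inherits the base's scheme.
# With no base available, one option is prepending "https:" explicitly.
print(urljoin(base, "//mirror.example.com/d.csv"))
print("https:" + "//mirror.example.com/d.csv")
```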
The scan_url abstraction was losing the Vega format.parse spec by passing &None to read_csv. This caused incorrect date/timezone handling for CSV datasets with explicit parse directives (e.g., seattle-weather.csv).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Skip scheme validation for raw absolute paths in data.rs (e.g., C:\Users\...) which haven't been resolved to file:// URLs yet. url::Url::parse misinterprets "C:" as a scheme on Windows.
- Gate Unix-path tests with #[cfg(not(target_os = "windows"))] since Url::from_file_path rejects Unix paths on Windows.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
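Python's `urllib.parse` exhibits the same ambiguity the commit describes: a Windows drive letter satisfies the RFC 3986 scheme grammar, so generic URL parsers read `C:` as a one-letter scheme. A sketch (the drive-letter check below is illustrative, not the PR's implementation):

```python
from urllib.parse import urlsplit

# "C:" is a syntactically valid scheme (ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )),
# so a generic URL parser treats a Windows drive letter as a scheme.
parts = urlsplit(r"C:\Users\me\data.csv")
print(parts.scheme)  # "c" -- not recognized as a file path by the parser

# A raw absolute path must therefore be detected *before* URL parsing,
# e.g. with a simple drive-letter check (hypothetical helper):
def looks_like_windows_path(s: str) -> bool:
    return len(s) >= 3 and s[0].isalpha() and s[1] == ":" and s[2] in ("\\", "/")

print(looks_like_windows_path(r"C:\Users\me\data.csv"))  # True
```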
…ystems

Use the proper URL term "scheme" (RFC 3986) instead of "protocol" for ExternalTableProvider, ExternalDataset, codec serialization keys, and all related APIs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…runtime

This enables a default resolve_plan implementation that walks the LogicalPlan tree, calls resolve_table for each ExternalTableProvider, and replaces them with MemTable scans. Implementers can now override just resolve_table instead of the full resolve_plan method.

The trait needed to live in vegafusion-runtime because the default implementation depends on DataFusion types (ExternalTableProvider, MemTable, TreeNodeRewriter) that are not available in vegafusion-core.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Threads the ExternalTableProvider's scheme through to resolve_table so resolvers can identify the data source type without parsing metadata.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ataset

No backward compatibility needed — this API hasn't been released yet. Scheme now comes before schema in all signatures to reflect its role as the primary discriminator for external data sources.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Accurately document why the fallback exists (method absent on wasm32-unknown-unknown, not just a runtime failure) and note the percent-encoding limitation for reserved characters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
has_url_scheme used `contains("://")`, so relative references like
`fetch?target=http://evil.com/data` were misclassified as absolute URLs.
resolve_url then returned them as-is, causing downstream Url::parse to
fail with RelativeUrlWithoutBase.
Now validates that `://` is preceded by a valid RFC 3986 scheme prefix.
Also replaced the duplicated inline `contains("://")` check in data.rs
with a call to the fixed has_url_scheme.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
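The fixed check described above can be sketched in Python; the PR's Rust implementation differs, but the scheme grammar is the same (RFC 3986 §3.1), anchored at the start of the string:

```python
import re

# RFC 3986 scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), anchored at
# the start so "://" appearing later in a relative reference doesn't match.
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")

def has_url_scheme(url: str) -> bool:
    return _SCHEME_RE.match(url) is not None

print(has_url_scheme("https://example.com/data.csv"))       # True
print(has_url_scheme("s3://bucket/key"))                    # True
print(has_url_scheme("fetch?target=http://evil.com/data"))  # False: "://" not at a scheme prefix
```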
…tion

Add `supports_arrow_tables` bool to `ResolverCapabilities` proto so resolvers can declare whether they efficiently consume in-memory Arrow tables. DataFusion sets this true; remote resolvers (e.g. Spark) default to false.

Replaces the `has_user_resolvers()` heuristic with `should_materialize(plan)` which inspects the actual LogicalPlan:
- Materialize if all resolvers support Arrow tables (fast path)
- Materialize if the plan has no ExternalTableProvider nodes
- Otherwise keep lazy for resolver interception

This avoids unnecessary lazy plans when a non-arrow resolver is registered but the specific plan doesn't involve any external tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
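The decision rule in that commit message can be sketched as plain Python (names simplified; the real implementation inspects a DataFusion `LogicalPlan` rather than a boolean flag):

```python
class Resolver:
    """Minimal stand-in for a PlanResolver with an Arrow-tables capability flag."""
    def __init__(self, supports_arrow_tables: bool):
        self.supports_arrow_tables = supports_arrow_tables

def should_materialize(resolvers, plan_has_external_tables: bool) -> bool:
    # Fast path: every resolver can consume in-memory Arrow tables efficiently.
    if all(r.supports_arrow_tables for r in resolvers):
        return True
    # No ExternalTableProvider nodes: nothing for a resolver to intercept.
    if not plan_has_external_tables:
        return True
    # Otherwise stay lazy so non-Arrow resolvers can intercept the plan.
    return False

print(should_materialize([Resolver(True)], True))    # True  (fast path)
print(should_materialize([Resolver(False)], False))  # True  (no external tables)
print(should_materialize([Resolver(False)], True))   # False (keep lazy)
```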
…ments

Add Sphinx feature page, three Python example scripts, and improved docstrings for the PlanResolver extensibility system.

New files:
- docs/source/features/plan_resolver.md
- examples/python-examples/plan_resolver_basic.py
- examples/python-examples/plan_resolver_url_scanning.py
- examples/python-examples/plan_resolver_sql.py

Docstring improvements:
- capabilities() now documents supports_arrow_tables key
- resolve_plan_proto() and resolve_plan() have Args/Returns sections
- ExternalDataset.schema, .metadata, .data properties have docstrings
- Rust PlanResolver trait has top-level doc comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes ruff FA102 (PEP 604 union syntax without future annotations).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace manual char-by-char RFC 3986 scheme validation with a compiled regex.

Also fix E501 line-length violations in the capabilities() docstring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion at runtime

Remove the capabilities system (supported_schemes, supported_format_types, supported_extensions) from the planner. The planner no longer checks URL schemes or format types against resolver capabilities -- all URL-backed datasets are considered plannable and errors surface at runtime.

Move URL resolution from planning time to runtime. Static URLs are no longer resolved by MakeTasksVisitor; instead DataUrlTask::eval() resolves both static and signal-based URLs uniformly.

Replace the supports_arrow_tables proto field with a direct trait method on PlanResolver. ResolverPipeline queries resolvers directly instead of going through MergedCapabilities.

Removed:
- ResolverCapabilities proto message and GetCapabilities RPC
- DataBaseUrlSettingProto and data_base_url from pretransform opts
- MergedCapabilities struct and planner_capabilities() trait method
- capabilities() from Rust and Python PlanResolver
- fetch_capabilities_via_query_fn() from WASM
- Scheme/format checks from DataSpec::supported()

35 files changed, +70 -593 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After removing capabilities-based format checks, the planner pushed all URL-backed datasets server-side including topojson, which DataFusion can't read. Add a hardcoded SUPPORTED_FORMATS list so formats like topojson stay client-side for Vega JS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…l.rs

The file no longer contains the PlanResolver trait (moved to vegafusion-runtime in an earlier commit). Rename to reflect its actual contents: URL types, resolution, and parsing utilities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
data_base_url is a runtime configuration, not a task graph property. Remove it from the DataUrlTask proto, MakeTasksVisitor, ChartStateOpts, and the VegaFusionRuntimeTrait method signatures. Instead, store the resolved base URL on ResolverPipeline and read it at eval time.

Add data_base_url parameter to Python VegaFusionRuntime: None/True = CDN default, str = custom URL, False = disabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
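The three-state `data_base_url` option could be normalized along these lines (a hypothetical sketch, not the PR's code; the exact CDN default URL is assumed, and `normalize_data_base_url` is an illustrative name):

```python
from typing import Optional, Union

# Assumed default; the PR only says "the Vega datasets CDN".
VEGA_DATASETS_CDN = "https://cdn.jsdelivr.net/npm/vega-datasets@2/data/"

def normalize_data_base_url(value: Union[None, bool, str]) -> Optional[str]:
    """None/True -> CDN default, str -> custom base, False -> disabled."""
    if value is None or value is True:
        return VEGA_DATASETS_CDN
    if value is False:
        return None  # relative URL resolution disabled
    if isinstance(value, str):
        # Ensure a trailing slash so relative joins append rather than replace.
        return value if value.endswith("/") else value + "/"
    raise TypeError("data_base_url must be None, bool, or str")
```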
…untime Bundle the scattered eval parameters (tz_config, inline_datasets, pipeline, data_base_url) into a TaskContext struct. Move data_base_url from ResolverPipeline to VegaFusionRuntime since it's orthogonal to the resolver chain. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add filters parameter to PlanResolver::resolve_table (Rust and Python) so resolvers can optimize data loading with pushed-down predicates. Filters are hints — DataFusion re-applies them regardless.

ExternalTableProvider now reports Inexact for supports_filters_pushdown so DataFusion pushes filter expressions into TableScan nodes.

Add unparse_expr_to_sql function (Rust pyfunction + Python wrapper) that converts LogicalExprNode proto messages to SQL strings. Accepts a single expression or a list (joined with AND). Supports all existing SQL dialects (default, postgres, mysql, sqlite, duckdb, bigquery).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
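The filters-as-hints contract can be sketched with a toy resolver; `SqlResolver`, `run_sql`, and the string-typed filters are hypothetical stand-ins (the real API passes filter expressions that can be rendered with `unparse_expr_to_sql`):

```python
class SqlResolver:
    """Hypothetical resolver that treats pushed-down filters as hints only.

    Returning unfiltered rows would still be correct: the engine re-applies
    the predicates after resolution ("Inexact" pushdown).
    """
    def __init__(self, run_sql):
        self.run_sql = run_sql  # callable: SQL string -> rows

    def resolve_table(self, scheme, schema, metadata, projected_columns, filters):
        sql = f"SELECT * FROM {metadata['table']}"
        if filters:  # optimization hint -- skipping this branch is also correct
            sql += " WHERE " + " AND ".join(filters)
        return self.run_sql(sql)

captured = []
resolver = SqlResolver(lambda sql: captured.append(sql) or [])
resolver.resolve_table("spark", None, {"table": "sales"}, None,
                       ["year = 2024", "amount > 10"])
print(captured[0])  # SELECT * FROM sales WHERE year = 2024 AND amount > 10
```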
- Move ResolutionResult from vegafusion-core to vegafusion-runtime (lives with PlanResolver where it's used)
- Remove reserved 6 from DataUrlTask proto (never merged with that field)
- Update stale docs referencing removed capabilities concept
- Remove source field from ExternalTableProvider (unused, metadata covers this use case)
- Introduce VegaFusionRuntimeOpts with Default, replacing two constructors with a single new(opts) method
- Remove section header comments from Python test file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add test proving filter transforms work end-to-end with resolve_table (DataFusion applies filters after resolution). Assert that filters are currently not pushed down to resolve_table due to _vf_order window blocking PushDownFilter, with TODO to address via with_index changes.

Remove unused optimize_filters infrastructure from ResolverPipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop plan_resolver_basic.py (covered by url_scanning example) and custom_resolver.rs (logging pass-through doesn't demonstrate real use). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove dead _vf_scheme check (removed implementation detail)
- Drop redundant snapshot from proto_message unparse test (already covered by from_resolver test, keep bytes==proto equality check)
- Add clarifying comment to scan_url_not_called_without_override test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove links to deleted examples (plan_resolver_basic.py, custom_resolver.rs)
- Add filters parameter to resolve_table code snippets
- Add unparse_expr_to_sql to API reference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add imports to all code snippets
- Use generic examples with comments explaining real-world usage
- Show data_base_url for relative URL resolution
- Fix configuration bullets to show defaults (thread_safe=True, skip_when_no_external_tables=True, supports_arrow_tables=False)
- Clarify protobuf dependency note (external_table_scan_node needs it, not scan_url itself)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove pandas dependency from url_scanning example (use print(table))
- Use parsed_url["url"] as table_name to avoid collisions
- Remove unused source_table from SQL example constructor
- Clarify hardcoded return in SQL example with explicit comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vegafusion-core/src/data/url.rs
/// Resolver produced a rewritten plan for the next resolver to handle,
/// or for DataFusion to execute if this is the last resolver
Plan(LogicalPlan),
}
I don't think this is used in vegafusion-core any more, move to vegafusion-runtime with the PlanResolver trait itself?
int32 batch_size = 3;
ScanUrlFormat format_type = 4;
transforms.TransformPipeline pipeline = 5;
reserved 6;
remove reserved, a version with base url here was never merged
this was moved from vegafusion-core/src/data/plan_resolver.rs, but git didn't pickup on that
///
/// 1. **Planning phase**: [`capabilities`](Self::capabilities) declares supported
///    URL schemes/formats, and [`scan_url`](Self::scan_url) converts URLs into
///    `LogicalPlan` nodes (typically `ExternalTableProvider` markers).
capabilities concept was removed, update docs here
_scheme: &str,
_schema: SchemaRef,
_metadata: &serde_json::Value,
_projected_columns: Option<Vec<String>>,
add filters from DataFusion
// in a server context, wrap in catch_unwind just in case.
let pipeline = self.pipeline.clone();
let task_ctx = TaskContext {
    tz_config: None, // overridden per-task from task.tz_config
Oh, double check that we ever use this one then
it is overridden
vegafusion-python/src/lib.rs
/// scheme: Scheme identifier (e.g. "spark").
/// schema: Arrow schema (arro3.core.Schema) — required for logical planning.
/// metadata: Optional JSON-serializable dict of metadata.
/// source: Optional source identifier.
double check that we still need a dedicated source here (rather then this going in metadata if needed)
Ok(!cls_method.is(&base_method))
})();
result.unwrap_or(false)
}
look into whether this handles the case of a subclass that overrides the method, and a subclass of that does not override the method
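On the Python side, the reviewer's scenario resolves correctly as long as the check looks the method up on the class, because class attribute lookup follows the MRO: a grandchild that doesn't override still inherits its parent's override. A sketch with hypothetical class names:

```python
class Base:
    def scan_url(self, url):
        raise NotImplementedError

class Child(Base):
    def scan_url(self, url):  # overrides the base method
        return url

class GrandChild(Child):
    pass  # does not override, but inherits Child's override via the MRO

def overrides_scan_url(cls) -> bool:
    # Attribute lookup walks the MRO, so an override anywhere between
    # cls and Base (exclusive) is detected.
    return cls.scan_url is not Base.scan_url

print(overrides_scan_url(Base))        # False
print(overrides_scan_url(Child))       # True
print(overrides_scan_url(GrandChild))  # True -- inherited override counts
```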
/// Formats that VegaFusion can read server-side. Anything else (e.g. topojson)
/// stays client-side for Vega JS to handle.
const SUPPORTED_FORMATS: &'static [&'static str] = &["csv", "tsv", "json", "arrow", "parquet"];
will we let through urls that don't have an extension and don't have a format specified? We want to support cases like relative paths that don't have file extensions
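One possible inference order for the reviewer's question, sketched in Python: explicit format wins, then the URL's extension, and extension-less URLs with no format fall back to a policy choice (here they return `None`, i.e. stay client-side; this is an assumption, not necessarily the PR's resolution):

```python
from urllib.parse import urlsplit
from pathlib import PurePosixPath

SUPPORTED_FORMATS = ("csv", "tsv", "json", "arrow", "parquet")

def infer_format(url: str, explicit_format=None):
    """Hypothetical format inference; None means 'leave client-side'."""
    if explicit_format is not None:
        return explicit_format if explicit_format in SUPPORTED_FORMATS else None
    # Fall back to the URL path's file extension.
    ext = PurePosixPath(urlsplit(url).path).suffix.lstrip(".").lower()
    return ext if ext in SUPPORTED_FORMATS else None

print(infer_format("https://x/data.csv"))           # csv (from extension)
print(infer_format("https://x/api/fetch", "json"))  # json (explicit format)
print(infer_format("./relative/no-extension"))      # None -> client-side
```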
assert "Unknown dialect" in str(resolver.error)

# ── scan_url tests ──
drop this style of comment
Strip extended-length path prefix (\\?\) from fs::canonicalize on Windows to fix path prefix matching in allowed_base_urls checks.

Also fix formatting issues across Rust and Python files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
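The prefix strip itself is simple string handling; a minimal sketch (the helper name is hypothetical; the PR does this in Rust after `fs::canonicalize`):

```python
# Windows fs::canonicalize returns extended-length paths like \\?\C:\data,
# which break naive prefix comparison against allowed_base_urls entries.
EXTENDED_PREFIX = "\\\\?\\"  # the literal four characters \\?\

def strip_extended_prefix(path: str) -> str:
    if path.startswith(EXTENDED_PREFIX):
        return path[len(EXTENDED_PREFIX):]
    return path

print(strip_extended_prefix("\\\\?\\C:\\data\\file.csv"))  # C:\data\file.csv
print(strip_extended_prefix("/home/user/file.csv"))        # unchanged
```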
cc @OlegWock in case you're interested. These changes will make it possible for a custom resolver to handle URLs. So your Spark resolver could handle csv or parquet urls or file paths, or the spec could contain custom …

In our case we fully control input Vega spec, so …
Summary
Adds URL scanning support to the `PlanResolver` trait, allowing resolvers to claim and handle data source URLs (e.g., custom schemes like `spark://`, `delta://` or built-in schemes like `https://` with csv files). `DataFusionResolver` is a regular resolver in the pipeline chain, establishing a uniform scan/resolve two-phase pattern for all resolvers.

Standardizes runtime URL configuration around `base_url` and `allowed_base_urls`. Relative URLs resolve against a configurable `base_url` (defaulting to the Vega datasets CDN). The Python API accepts three states: `None`/`True` for the CDN default, a string for a custom absolute URL or absolute path, or `False` to disable relative URL resolution. `allowed_base_urls` adds an optional allowlist for external access. `None` preserves VegaFusion's current behavior (no additional restriction), while an empty list denies all external URLs. Explicit entries support `*`, generic `"<scheme>:"` matches, URL prefixes, wildcard-host prefixes like `https://*.example.com/`, and filesystem roots.

URL policy is normalized once at runtime construction and passed to tasks at eval time via `TaskContext` — it is not part of the task graph or proto. `DataUrlTask` resolves the raw URL against `base_url`, strips fragments, skips internal dataset URLs, checks the initial resolved URL against `allowed_base_urls`, and only then dispatches to resolvers or built-in readers. For consistency across protocols, access control applies to the initial resolved URL only. VegaFusion does not re-check redirect destinations after a fetch begins.

Makes `vegafusion-server` the configuration authority for gRPC mode. The server now exposes `--base-url`, `--no-base-url`, repeated `--allowed-base-url`, and `--no-allowed-urls`, and maps them directly into `VegaFusionRuntimeOpts`. Python `grpc_connect()` rejects local non-default URL policy settings, and the corresponding setters reject changes while connected over gRPC, so client-side configuration cannot appear to override the server.

This is a breaking change: previously if a relative path didn't match a vega dataset, it would fall back to looking for a local file. The public runtime/Python option is also named `base_url`.

Renames `ExternalTableProvider.protocol` and `ExternalDataset.protocol` to `scheme` for consistent RFC 3986 terminology. `scheme` is a required parameter on `resolve_table`, `ExternalTableProvider`, and `ExternalDataset`. `PlanResolver` lives in `vegafusion-runtime` since it depends on runtime types. The `scheme` parameter is added to `resolve_table` and made required (non-optional) on `ExternalTableProvider` and `ExternalDataset` since the API is unreleased.

`supports_arrow_tables` is a trait method on `PlanResolver`. The materialization decision (`should_materialize`) inspects the `LogicalPlan`: plans with no `ExternalTableProvider` nodes are always materialized, even when a non-Arrow resolver is registered.

`has_url_scheme` anchors scheme detection at the start of the string (RFC 3986), preventing relative references like `fetch?target=http://evil.com/data` from being misclassified as absolute URLs.

Adds `resolve_table` filter hints (`filters` parameter) and `unparse_expr_to_sql` for converting DataFusion filter expressions to SQL strings. Filter pushdown to `ExternalTableProvider` scans is not yet active due to the `_vf_order` window placement (#589).

Introduces `VegaFusionRuntimeOpts` with `Default` implementation, replacing multiple constructors with a single `VegaFusionRuntime::new(opts)`.

Removes `source` field from `ExternalTableProvider` (unused — metadata covers this use case).

Moves `ResolutionResult` from `vegafusion-core` to `vegafusion-runtime` (lives with `PlanResolver` where it's used).

Adds documentation and tests for plan resolvers, URL policy helpers, server flags, and gRPC/runtime behavior.
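The `allowed_base_urls` semantics described above can be sketched in Python; this is an illustrative model of the stated rules, and the PR's Rust implementation may differ in edge cases:

```python
from urllib.parse import urlsplit
from fnmatch import fnmatch

def url_allowed(url, allowed):
    """Illustrative allowlist check: None = unrestricted, [] = deny all."""
    if allowed is None:
        return True  # no additional restriction (current behavior)
    parts = urlsplit(url)
    for entry in allowed:
        if entry == "*":
            return True  # allow everything
        if entry.endswith(":") and "/" not in entry:
            # Generic scheme match, e.g. "s3:" allows any s3:// URL.
            if parts.scheme + ":" == entry:
                return True
        elif "*" in entry:
            # Wildcard-host prefix, e.g. "https://*.example.com/".
            e = urlsplit(entry)
            if (parts.scheme == e.scheme
                    and fnmatch(parts.netloc, e.netloc)
                    and parts.path.startswith(e.path)):
                return True
        elif url.startswith(entry):
            return True  # plain URL prefix or filesystem root
    return False  # empty list falls through: deny all external URLs
```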
Motivation
Previously, URL-backed datasets were handled exclusively by DataFusion's built-in readers. Custom resolvers could only participate after a plan node was already constructed, which meant they couldn't intercept URLs with custom schemes or formats that DataFusion doesn't support.
This change enables resolvers to:
- `scan_url` -- inspect a pre-parsed URL and return a `LogicalPlan` node
- `supports_arrow_tables` -- resolvers that can't efficiently consume Arrow tables keep data as lazy plans
- `resolve_table` -- the simplest path, no protobuf needed
- `resolve_plan` -- full control for SQL transpilation or remote execution

External data connectors (Spark, Delta Lake, custom APIs) can now register as resolvers and handle their own URL schemes without forking or wrapping the runtime.
Moving relative URL resolution and URL permission checks into the runtime also makes server-side execution predictable: embedded runtimes and gRPC runtimes share the same URL semantics, while the server remains the source of truth for remote access policy.