
feat: add URL scanning support to PlanResolver trait#587

Open
jonmmease wants to merge 36 commits into main from resolver-url-scanning

Conversation


@jonmmease jonmmease commented Mar 12, 2026

Summary

Adds URL scanning support to the PlanResolver trait, allowing resolvers to claim and handle data source URLs (e.g., custom schemes like spark://, delta:// or built-in schemes like https:// with csv files). DataFusionResolver is a regular resolver in the pipeline chain, establishing a uniform scan/resolve two-phase pattern for all resolvers.

Standardizes runtime URL configuration around base_url and allowed_base_urls.

Relative URLs resolve against a configurable base_url (defaulting to the Vega datasets CDN). The Python API accepts three states: None/True for the CDN default, a string for a custom absolute URL or absolute path, or False to disable relative URL resolution.
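The three-state mapping can be sketched as follows. This is an illustrative sketch only — `normalize_base_url` and `CDN_DEFAULT` are hypothetical names, and the real CDN default URL is defined by the runtime, not shown here:

```python
# Hypothetical placeholder; the actual Vega datasets CDN URL is
# configured inside the runtime and is not reproduced here.
CDN_DEFAULT = "https://example-cdn.invalid/vega-datasets/"


def normalize_base_url(value):
    """Map the three-state Python API onto a concrete base URL.

    None/True -> CDN default, str -> custom base, False -> disabled.
    """
    if value is None or value is True:
        return CDN_DEFAULT
    if value is False:
        return None  # relative URL resolution disabled
    return value  # custom absolute URL or absolute path
```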

allowed_base_urls adds an optional allowlist for external access. None preserves VegaFusion's current behavior (no additional restriction), while an empty list denies all external URLs. Explicit entries support *, generic "<scheme>:" matches, URL prefixes, wildcard-host prefixes like https://*.example.com/, and filesystem roots.
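The allowlist semantics described above can be approximated in a standalone sketch (hypothetical helper; the real matching lives in the Rust runtime and may differ in detail, e.g. path normalization for filesystem roots):

```python
from urllib.parse import urlsplit


def url_allowed(url, allowed):
    """Sketch of allowed_base_urls matching: None = no restriction,
    [] = deny all, entries may be "*", "<scheme>:", URL prefixes,
    wildcard-host prefixes, or filesystem roots (plain prefixes here)."""
    if allowed is None:
        return True  # preserve current behavior: no additional restriction
    parts = urlsplit(url)
    for entry in allowed:
        if entry == "*":
            return True  # explicit allow-all
        if entry.endswith(":") and "/" not in entry:
            if parts.scheme == entry[:-1]:  # e.g. "s3:" matches any s3 URL
                return True
        elif "://*." in entry:
            # Wildcard-host prefix, e.g. https://*.example.com/
            scheme, rest = entry.split("://*.", 1)
            host_suffix, _, path_prefix = rest.partition("/")
            if (
                parts.scheme == scheme
                and (parts.hostname or "").endswith("." + host_suffix)
                and parts.path.startswith("/" + path_prefix)
            ):
                return True
        elif url.startswith(entry):
            return True  # plain URL prefix or filesystem root
    return False  # an empty list denies all external URLs
```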

URL policy is normalized once at runtime construction and passed to tasks at eval time via TaskContext — it is not part of the task graph or proto. DataUrlTask resolves the raw URL against base_url, strips fragments, skips internal dataset URLs, checks the initial resolved URL against allowed_base_urls, and only then dispatches to resolvers or built-in readers.
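The resolve-then-check order can be sketched with standard URL handling (hypothetical function name; the actual logic lives in DataUrlTask's eval path, and the allowlist check plus internal-dataset skip happen in the caller before any fetch):

```python
from urllib.parse import urldefrag, urljoin


def resolve_data_url(raw_url, base_url):
    """Resolve a raw spec URL against base_url and strip any fragment.

    The caller then checks the *resolved* URL against allowed_base_urls
    before dispatching to resolvers or built-in readers.
    """
    resolved = urljoin(base_url, raw_url) if base_url is not None else raw_url
    url, _fragment = urldefrag(resolved)  # fragments are stripped
    return url
```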

For consistency across protocols, access control applies to the initial resolved URL only. VegaFusion does not re-check redirect destinations after a fetch begins.

Makes vegafusion-server the configuration authority for gRPC mode. The server now exposes --base-url, --no-base-url, repeated --allowed-base-url, and --no-allowed-urls, and maps them directly into VegaFusionRuntimeOpts. Python grpc_connect() rejects local non-default URL policy settings, and the corresponding setters reject changes while connected over gRPC, so client-side configuration cannot appear to override the server.

This is a breaking change: previously, if a relative path didn't match a Vega dataset, VegaFusion fell back to looking for a local file; that fallback is removed. The public runtime/Python option is also named base_url.

Renames ExternalTableProvider.protocol and ExternalDataset.protocol to scheme for consistent RFC 3986 terminology. scheme is a required parameter on resolve_table, ExternalTableProvider, and ExternalDataset.

PlanResolver lives in vegafusion-runtime since it depends on runtime types. The scheme parameter is added to resolve_table and made required (non-optional) on ExternalTableProvider and ExternalDataset since the API is unreleased.

supports_arrow_tables is a trait method on PlanResolver. The materialization decision (should_materialize) inspects the LogicalPlan: plans with no ExternalTableProvider nodes are always materialized, even when a non-Arrow resolver is registered.
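The decision described above reduces to a small predicate. This sketch uses an illustrative signature (the real `should_materialize` inspects the LogicalPlan directly rather than taking booleans):

```python
def should_materialize(plan_has_external_tables, resolvers_support_arrow):
    """Sketch of the materialization decision described above."""
    if all(resolvers_support_arrow):
        return True  # fast path: every resolver consumes Arrow tables
    if not plan_has_external_tables:
        return True  # nothing for a resolver to intercept
    return False  # keep the plan lazy for resolver interception
```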

has_url_scheme anchors scheme detection at the start of the string (RFC 3986), preventing relative references like fetch?target=http://evil.com/data from being misclassified as absolute URLs.
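The anchored check is straightforward to express as a regex over the RFC 3986 scheme grammar (a minimal sketch of the idea; the actual implementation is in Rust):

```python
import re

# RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# Anchoring at the start is the fix: a "://" appearing later inside a
# relative reference must not make the string an absolute URL.
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def has_url_scheme(url):
    return _SCHEME_RE.match(url) is not None
```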

Adds resolve_table filter hints (filters parameter) and unparse_expr_to_sql for converting DataFusion filter expressions to SQL strings. Filter pushdown to ExternalTableProvider scans is not yet active due to the _vf_order window placement (#589).

Introduces VegaFusionRuntimeOpts with Default implementation, replacing multiple constructors with a single VegaFusionRuntime::new(opts).

Removes source field from ExternalTableProvider (unused — metadata covers this use case).

Moves ResolutionResult from vegafusion-core to vegafusion-runtime (lives with PlanResolver where it's used).

Adds documentation and tests for plan resolvers, URL policy helpers, server flags, and gRPC/runtime behavior.

Motivation

Previously, URL-backed datasets were handled exclusively by DataFusion's built-in readers. Custom resolvers could only participate after a plan node was already constructed, which meant they couldn't intercept URLs with custom schemes or formats that DataFusion doesn't support.

This change enables resolvers to:

  1. Claim URLs at eval time via scan_url -- inspect a pre-parsed URL and return a LogicalPlan node
  2. Control materialization via supports_arrow_tables -- resolvers that can't efficiently consume Arrow tables keep data as lazy plans
  3. Provide per-table data via resolve_table -- the simplest path, no protobuf needed
  4. Rewrite or transpile plans via resolve_plan -- full control for SQL transpilation or remote execution
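The scan phase (item 1) can be modeled as a first-claim-wins chain. Everything in this sketch is illustrative of the pattern rather than the actual VegaFusion API — the real resolvers return LogicalPlan nodes, not strings:

```python
from dataclasses import dataclass


@dataclass
class ParsedUrl:  # simplified stand-in for the real pre-parsed URL struct
    scheme: str
    url: str


class SparkResolver:
    def scan_url(self, parsed):
        if parsed.scheme == "spark":
            return f"SparkScan({parsed.url})"  # stand-in for a plan node
        return None  # decline; the next resolver gets a chance


class DataFusionResolver:
    """A regular resolver, last in the chain rather than privileged."""

    def scan_url(self, parsed):
        if parsed.scheme in ("http", "https", "file"):
            return f"DataFusionScan({parsed.url})"
        return None


def scan(resolvers, parsed):
    """First resolver to claim the URL produces the plan node."""
    for resolver in resolvers:
        plan = resolver.scan_url(parsed)
        if plan is not None:
            return plan
    raise ValueError(f"no resolver claims scheme {parsed.scheme!r}")
```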

External data connectors (Spark, Delta Lake, custom APIs) can now register as resolvers and handle their own URL schemes without forking or wrapping the runtime.

Moving relative URL resolution and URL permission checks into the runtime also makes server-side execution predictable: embedded runtimes and gRPC runtimes share the same URL semantics, while the server remains the source of truth for remote access policy.

jonmmease and others added 4 commits March 13, 2026 10:14
Add scan_url method to PlanResolver so resolvers participate in data
source URL handling. DataFusionResolver moves from a privileged terminal
resolver into a regular resolver in the pipeline chain.

Key changes:
- ParsedUrl struct for structured URL representation passed to scanners
- ResolverCapabilities proto + MergedCapabilities for URL support negotiation
- DataBaseUrlSetting enum for explicit base URL API (Default/Disabled/Custom)
- resolve_url() shared function for plan-time and eval-time URL resolution
- GetCapabilities RPC for remote capability propagation (gRPC + WASM)
- Python bridge: scan_url, scan_url_proto, capabilities on PlanResolver
- data_base_url parameter threaded through pre_transform_* and ChartState APIs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace unreachable!() with proper error in pipeline resolve
- Use url::Url::join() for RFC 3986 URL resolution
- Check URL scheme in DataSpec::supported against capabilities
- Handle protocol-relative URLs by prepending https:
- Deduplicate scheme lists in DataFusionResolver::scan_url
- Gate url::Url::from_file_path behind cfg(not(wasm32))
- Add catch-all arm in server test match for GetCapabilities variant
- Format Python files with ruff

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The scan_url abstraction was losing the Vega format.parse spec by passing
&None to read_csv. This caused incorrect date/timezone handling for CSV
datasets with explicit parse directives (e.g., seattle-weather.csv).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Skip scheme validation for raw absolute paths in data.rs (e.g.,
  C:\Users\...) which haven't been resolved to file:// URLs yet.
  url::Url::parse misinterprets "C:" as a scheme on Windows.
- Gate Unix-path tests with #[cfg(not(target_os = "windows"))] since
  Url::from_file_path rejects Unix paths on Windows.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jonmmease jonmmease force-pushed the resolver-url-scanning branch from a748a67 to 0eb7694 on March 13, 2026 14:14
@jonmmease jonmmease changed the base branch from external-table-provider to main March 13, 2026 14:25
jonmmease and others added 16 commits March 13, 2026 10:47
…ystems

Use the proper URL term "scheme" (RFC 3986) instead of "protocol"
for ExternalTableProvider, ExternalDataset, codec serialization keys,
and all related APIs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…runtime

This enables a default resolve_plan implementation that walks the
LogicalPlan tree, calls resolve_table for each ExternalTableProvider,
and replaces them with MemTable scans. Implementers can now override
just resolve_table instead of the full resolve_plan method.

The trait needed to live in vegafusion-runtime because the default
implementation depends on DataFusion types (ExternalTableProvider,
MemTable, TreeNodeRewriter) that are not available in vegafusion-core.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Threads the ExternalTableProvider's scheme through to resolve_table
so resolvers can identify the data source type without parsing metadata.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ataset

No backward compatibility needed — this API hasn't been released yet.
Scheme now comes before schema in all signatures to reflect its role
as the primary discriminator for external data sources.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Accurately document why the fallback exists (method absent on
wasm32-unknown-unknown, not just a runtime failure) and note the
percent-encoding limitation for reserved characters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
has_url_scheme used `contains("://")`, so relative references like
`fetch?target=http://evil.com/data` were misclassified as absolute URLs.
resolve_url then returned them as-is, causing downstream Url::parse to
fail with RelativeUrlWithoutBase.

Now validates that `://` is preceded by a valid RFC 3986 scheme prefix.
Also replaced the duplicated inline `contains("://")` check in data.rs
with a call to the fixed has_url_scheme.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

Add `supports_arrow_tables` bool to `ResolverCapabilities` proto so
resolvers can declare whether they efficiently consume in-memory Arrow
tables. DataFusion sets this true; remote resolvers (e.g. Spark) default
to false.

Replaces the `has_user_resolvers()` heuristic with `should_materialize(plan)`
which inspects the actual LogicalPlan:
- Materialize if all resolvers support Arrow tables (fast path)
- Materialize if the plan has no ExternalTableProvider nodes
- Otherwise keep lazy for resolver interception

This avoids unnecessary lazy plans when a non-arrow resolver is
registered but the specific plan doesn't involve any external tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ments

Add Sphinx feature page, three Python example scripts, and improved
docstrings for the PlanResolver extensibility system.

New files:
- docs/source/features/plan_resolver.md
- examples/python-examples/plan_resolver_basic.py
- examples/python-examples/plan_resolver_url_scanning.py
- examples/python-examples/plan_resolver_sql.py

Docstring improvements:
- capabilities() now documents supports_arrow_tables key
- resolve_plan_proto() and resolve_plan() have Args/Returns sections
- ExternalDataset.schema, .metadata, .data properties have docstrings
- Rust PlanResolver trait has top-level doc comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes ruff FA102 (PEP 604 union syntax without future annotations).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace manual char-by-char RFC 3986 scheme validation with a
compiled regex. Also fix E501 line-length violations in the
capabilities() docstring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jonmmease jonmmease force-pushed the resolver-url-scanning branch from 9aa3880 to 97d0c4f on March 16, 2026 15:47
jonmmease and others added 5 commits March 16, 2026 15:15
…tion at runtime

Remove the capabilities system (supported_schemes, supported_format_types,
supported_extensions) from the planner. The planner no longer checks URL
schemes or format types against resolver capabilities -- all URL-backed
datasets are considered plannable and errors surface at runtime.

Move URL resolution from planning time to runtime. Static URLs are no
longer resolved by MakeTasksVisitor; instead DataUrlTask::eval() resolves
both static and signal-based URLs uniformly.

Replace the supports_arrow_tables proto field with a direct trait method
on PlanResolver. ResolverPipeline queries resolvers directly instead of
going through MergedCapabilities.

Removed:
- ResolverCapabilities proto message and GetCapabilities RPC
- DataBaseUrlSettingProto and data_base_url from pretransform opts
- MergedCapabilities struct and planner_capabilities() trait method
- capabilities() from Rust and Python PlanResolver
- fetch_capabilities_via_query_fn() from WASM
- Scheme/format checks from DataSpec::supported()

35 files changed, +70 -593 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After removing capabilities-based format checks, the planner pushed
all URL-backed datasets server-side including topojson, which DataFusion
can't read. Add a hardcoded SUPPORTED_FORMATS list so formats like
topojson stay client-side for Vega JS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…l.rs

The file no longer contains the PlanResolver trait (moved to
vegafusion-runtime in an earlier commit). Rename to reflect its
actual contents: URL types, resolution, and parsing utilities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
data_base_url is a runtime configuration, not a task graph property.
Remove it from the DataUrlTask proto, MakeTasksVisitor, ChartStateOpts,
and the VegaFusionRuntimeTrait method signatures. Instead, store the
resolved base URL on ResolverPipeline and read it at eval time.

Add data_base_url parameter to Python VegaFusionRuntime:
  None/True = CDN default, str = custom URL, False = disabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jonmmease jonmmease marked this pull request as ready for review March 17, 2026 22:39
…untime

Bundle the scattered eval parameters (tz_config, inline_datasets,
pipeline, data_base_url) into a TaskContext struct. Move data_base_url
from ResolverPipeline to VegaFusionRuntime since it's orthogonal to
the resolver chain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jonmmease and others added 5 commits March 18, 2026 10:38
Add filters parameter to PlanResolver::resolve_table (Rust and Python)
so resolvers can optimize data loading with pushed-down predicates.
Filters are hints — DataFusion re-applies them regardless.

ExternalTableProvider now reports Inexact for supports_filters_pushdown
so DataFusion pushes filter expressions into TableScan nodes.

Add unparse_expr_to_sql function (Rust pyfunction + Python wrapper)
that converts LogicalExprNode proto messages to SQL strings. Accepts
a single expression or a list (joined with AND). Supports all existing
SQL dialects (default, postgres, mysql, sqlite, duckdb, bigquery).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move ResolutionResult from vegafusion-core to vegafusion-runtime
  (lives with PlanResolver where it's used)
- Remove reserved 6 from DataUrlTask proto (never merged with that field)
- Update stale docs referencing removed capabilities concept
- Remove source field from ExternalTableProvider (unused, metadata
  covers this use case)
- Introduce VegaFusionRuntimeOpts with Default, replacing two
  constructors with a single new(opts) method
- Remove section header comments from Python test file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add test proving filter transforms work end-to-end with resolve_table
(DataFusion applies filters after resolution). Assert that filters are
currently not pushed down to resolve_table due to _vf_order window
blocking PushDownFilter, with TODO to address via with_index changes.

Remove unused optimize_filters infrastructure from ResolverPipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop plan_resolver_basic.py (covered by url_scanning example) and
custom_resolver.rs (logging pass-through doesn't demonstrate real use).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove dead _vf_scheme check (removed implementation detail)
- Drop redundant snapshot from proto_message unparse test (already
  covered by from_resolver test, keep bytes==proto equality check)
- Add clarifying comment to scan_url_not_called_without_override test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jonmmease and others added 4 commits March 18, 2026 12:19
- Remove links to deleted examples (plan_resolver_basic.py,
  custom_resolver.rs)
- Add filters parameter to resolve_table code snippets
- Add unparse_expr_to_sql to API reference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add imports to all code snippets
- Use generic examples with comments explaining real-world usage
- Show data_base_url for relative URL resolution
- Fix configuration bullets to show defaults (thread_safe=True,
  skip_when_no_external_tables=True, supports_arrow_tables=False)
- Clarify protobuf dependency note (external_table_scan_node needs
  it, not scan_url itself)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove pandas dependency from url_scanning example (use print(table))
- Use parsed_url["url"] as table_name to avoid collisions
- Remove unused source_table from SQL example constructor
- Clarify hardcoded return in SQL example with explicit comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@jonmmease jonmmease left a comment


self review

/// Resolver produced a rewritten plan for the next resolver to handle,
/// or for DataFusion to execute if this is the last resolver
Plan(LogicalPlan),
}

I don't think this is used in vegafusion-core any more, move to vegafusion-runtime with the PlanResolver trait itself?

int32 batch_size = 3;
ScanUrlFormat format_type = 4;
transforms.TransformPipeline pipeline = 5;
reserved 6;

remove reserved, a version with base url here was never merged


this was moved from vegafusion-core/src/data/plan_resolver.rs, but git didn't pick up on that

///
/// 1. **Planning phase**: [`capabilities`](Self::capabilities) declares supported
/// URL schemes/formats, and [`scan_url`](Self::scan_url) converts URLs into
/// `LogicalPlan` nodes (typically `ExternalTableProvider` markers).

capabilities concept was removed, update docs here


done

_scheme: &str,
_schema: SchemaRef,
_metadata: &serde_json::Value,
_projected_columns: Option<Vec<String>>,

add filters from DataFusion


done

// in a server context, wrap in catch_unwind just in case.
let pipeline = self.pipeline.clone();
let task_ctx = TaskContext {
tz_config: None, // overridden per-task from task.tz_config

Oh, double check that we ever use this one then


it is overridden

/// scheme: Scheme identifier (e.g. "spark").
/// schema: Arrow schema (arro3.core.Schema) — required for logical planning.
/// metadata: Optional JSON-serializable dict of metadata.
/// source: Optional source identifier.

double check that we still need a dedicated source here (rather than this going in metadata if needed)

Ok(!cls_method.is(&base_method))
})();
result.unwrap_or(false)
}

look into whether this handles the case of a subclass that overrides the method, and a subclass of that does not override the method


it does
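The question above — whether identity-comparing the class method against the base method handles a grandchild that does not itself override — can be checked in plain Python. This sketch uses hypothetical names; the actual check is done through pyo3 (`cls_method.is(&base_method)`), but the lookup semantics are the same:

```python
class PlanResolverBase:
    def scan_url(self, parsed):  # default implementation: decline everything
        return None


def overrides_scan_url(cls):
    # Attribute lookup walks the MRO, so a grandchild that doesn't
    # redefine scan_url still resolves to its parent's override, and the
    # identity check against the base method correctly reports True.
    return getattr(cls, "scan_url") is not PlanResolverBase.scan_url


class Child(PlanResolverBase):
    def scan_url(self, parsed):
        return "claimed"


class GrandChild(Child):  # no override of its own
    pass
```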


/// Formats that VegaFusion can read server-side. Anything else (e.g. topojson)
/// stays client-side for Vega JS to handle.
const SUPPORTED_FORMATS: &'static [&'static str] = &["csv", "tsv", "json", "arrow", "parquet"];

will we let through URLs that don't have an extension and don't have a format specified? We want to support cases like relative paths that don't have file extensions

assert "Unknown dialect" in str(resolver.error)


# ── scan_url tests ──

drop this style of comment

Strip extended-length path prefix (\\?\) from fs::canonicalize on
Windows to fix path prefix matching in allowed_base_urls checks.
Also fix formatting issues across Rust and Python files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jonmmease

cc @OlegWock in case you're interested. These changes will make it possible for a custom resolver to handle URLs. So your Spark resolver could handle CSV or Parquet URLs or file paths, or the spec could contain custom spark:// URLs that your resolver would handle.

@OlegWock

In our case we fully control the input Vega spec, so vegafusion+dataset combined with ExternalDataset covers our needs pretty well, but I think this is a nice feature to have

