
feat: add URL scanning support to PlanResolver trait#587

Open
jonmmease wants to merge 36 commits into main from resolver-url-scanning

Conversation


@jonmmease jonmmease commented Mar 12, 2026

Summary

Adds URL scanning support to the PlanResolver trait, allowing resolvers to claim and handle data source URLs (e.g., custom schemes like spark://, delta:// or built-in schemes like https:// with csv files). DataFusionResolver is a regular resolver in the pipeline chain, establishing a uniform scan/resolve two-phase pattern for all resolvers.

Standardizes runtime URL configuration around base_url and allowed_base_urls.

Relative URLs resolve against a configurable base_url (defaulting to the Vega datasets CDN). The Python API accepts three states: None/True for the CDN default, a string for a custom absolute URL or absolute path, or False to disable relative URL resolution.
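The three-state mapping can be sketched as follows. This is an illustrative sketch only — `normalize_base_url` and `CDN_DEFAULT` are hypothetical names, and the real CDN default URL is defined by the runtime, not shown here:

```python
# Hypothetical placeholder; the actual Vega datasets CDN URL is
# configured inside the runtime and is not reproduced here.
CDN_DEFAULT = "https://example-cdn.invalid/vega-datasets/"


def normalize_base_url(value):
    """Map the three-state Python API onto a concrete base URL.

    None/True -> CDN default, str -> custom base, False -> disabled.
    """
    if value is None or value is True:
        return CDN_DEFAULT
    if value is False:
        return None  # relative URL resolution disabled
    return value  # custom absolute URL or absolute path
```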

allowed_base_urls adds an optional allowlist for external access. None preserves VegaFusion's current behavior (no additional restriction), while an empty list denies all external URLs. Explicit entries support *, generic "<scheme>:" matches, URL prefixes, wildcard-host prefixes like https://*.example.com/, and filesystem roots.
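The allowlist semantics described above can be approximated in a standalone sketch (hypothetical helper; the real matching lives in the Rust runtime and may differ in detail, e.g. path normalization for filesystem roots):

```python
from urllib.parse import urlsplit


def url_allowed(url, allowed):
    """Sketch of allowed_base_urls matching: None = no restriction,
    [] = deny all, entries may be "*", "<scheme>:", URL prefixes,
    wildcard-host prefixes, or filesystem roots (plain prefixes here)."""
    if allowed is None:
        return True  # preserve current behavior: no additional restriction
    parts = urlsplit(url)
    for entry in allowed:
        if entry == "*":
            return True  # explicit allow-all
        if entry.endswith(":") and "/" not in entry:
            if parts.scheme == entry[:-1]:  # e.g. "s3:" matches any s3 URL
                return True
        elif "://*." in entry:
            # Wildcard-host prefix, e.g. https://*.example.com/
            scheme, rest = entry.split("://*.", 1)
            host_suffix, _, path_prefix = rest.partition("/")
            if (
                parts.scheme == scheme
                and (parts.hostname or "").endswith("." + host_suffix)
                and parts.path.startswith("/" + path_prefix)
            ):
                return True
        elif url.startswith(entry):
            return True  # plain URL prefix or filesystem root
    return False  # an empty list denies all external URLs
```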

URL policy is normalized once at runtime construction and passed to tasks at eval time via TaskContext — it is not part of the task graph or proto. DataUrlTask resolves the raw URL against base_url, strips fragments, skips internal dataset URLs, checks the initial resolved URL against allowed_base_urls, and only then dispatches to resolvers or built-in readers.
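The resolve-then-check order can be sketched with standard URL handling (hypothetical function name; the actual logic lives in DataUrlTask's eval path, and the allowlist check plus internal-dataset skip happen in the caller before any fetch):

```python
from urllib.parse import urldefrag, urljoin


def resolve_data_url(raw_url, base_url):
    """Resolve a raw spec URL against base_url and strip any fragment.

    The caller then checks the *resolved* URL against allowed_base_urls
    before dispatching to resolvers or built-in readers.
    """
    resolved = urljoin(base_url, raw_url) if base_url is not None else raw_url
    url, _fragment = urldefrag(resolved)  # fragments are stripped
    return url
```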

For consistency across protocols, access control applies to the initial resolved URL only. VegaFusion does not re-check redirect destinations after a fetch begins.

Makes vegafusion-server the configuration authority for gRPC mode. The server now exposes --base-url, --no-base-url, repeated --allowed-base-url, and --no-allowed-urls, and maps them directly into VegaFusionRuntimeOpts. Python grpc_connect() rejects local non-default URL policy settings, and the corresponding setters reject changes while connected over gRPC, so client-side configuration cannot appear to override the server.

This is a breaking change: previously, if a relative path didn't match a Vega dataset, VegaFusion fell back to looking for a local file; that fallback is removed. The public runtime/Python option is also named base_url.

Renames ExternalTableProvider.protocol and ExternalDataset.protocol to scheme for consistent RFC 3986 terminology. scheme is a required parameter on resolve_table, ExternalTableProvider, and ExternalDataset.

PlanResolver lives in vegafusion-runtime since it depends on runtime types. The scheme parameter is added to resolve_table and made required (non-optional) on ExternalTableProvider and ExternalDataset since the API is unreleased.

supports_arrow_tables is a trait method on PlanResolver. The materialization decision (should_materialize) inspects the LogicalPlan: plans with no ExternalTableProvider nodes are always materialized, even when a non-Arrow resolver is registered.
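The decision described above reduces to a small predicate. This sketch uses an illustrative signature (the real `should_materialize` inspects the LogicalPlan directly rather than taking booleans):

```python
def should_materialize(plan_has_external_tables, resolvers_support_arrow):
    """Sketch of the materialization decision described above."""
    if all(resolvers_support_arrow):
        return True  # fast path: every resolver consumes Arrow tables
    if not plan_has_external_tables:
        return True  # nothing for a resolver to intercept
    return False  # keep the plan lazy for resolver interception
```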

has_url_scheme anchors scheme detection at the start of the string (RFC 3986), preventing relative references like fetch?target=http://evil.com/data from being misclassified as absolute URLs.
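The anchored check is straightforward to express as a regex over the RFC 3986 scheme grammar (a minimal sketch of the idea; the actual implementation is in Rust):

```python
import re

# RFC 3986: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
# Anchoring at the start is the fix: a "://" appearing later inside a
# relative reference must not make the string an absolute URL.
_SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def has_url_scheme(url):
    return _SCHEME_RE.match(url) is not None
```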

Adds resolve_table filter hints (filters parameter) and unparse_expr_to_sql for converting DataFusion filter expressions to SQL strings. Filter pushdown to ExternalTableProvider scans is not yet active due to the _vf_order window placement (#589).

Introduces VegaFusionRuntimeOpts with Default implementation, replacing multiple constructors with a single VegaFusionRuntime::new(opts).

Removes source field from ExternalTableProvider (unused — metadata covers this use case).

Moves ResolutionResult from vegafusion-core to vegafusion-runtime (lives with PlanResolver where it's used).

Adds documentation and tests for plan resolvers, URL policy helpers, server flags, and gRPC/runtime behavior.

Motivation

Previously, URL-backed datasets were handled exclusively by DataFusion's built-in readers. Custom resolvers could only participate after a plan node was already constructed, which meant they couldn't intercept URLs with custom schemes or formats that DataFusion doesn't support.

This change enables resolvers to:

  1. Claim URLs at eval time via scan_url -- inspect a pre-parsed URL and return a LogicalPlan node
  2. Control materialization via supports_arrow_tables -- resolvers that can't efficiently consume Arrow tables keep data as lazy plans
  3. Provide per-table data via resolve_table -- the simplest path, no protobuf needed
  4. Rewrite or transpile plans via resolve_plan -- full control for SQL transpilation or remote execution
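The scan phase (item 1) can be modeled as a first-claim-wins chain. Everything in this sketch is illustrative of the pattern rather than the actual VegaFusion API — the real resolvers return LogicalPlan nodes, not strings:

```python
from dataclasses import dataclass


@dataclass
class ParsedUrl:  # simplified stand-in for the real pre-parsed URL struct
    scheme: str
    url: str


class SparkResolver:
    def scan_url(self, parsed):
        if parsed.scheme == "spark":
            return f"SparkScan({parsed.url})"  # stand-in for a plan node
        return None  # decline; the next resolver gets a chance


class DataFusionResolver:
    """A regular resolver, last in the chain rather than privileged."""

    def scan_url(self, parsed):
        if parsed.scheme in ("http", "https", "file"):
            return f"DataFusionScan({parsed.url})"
        return None


def scan(resolvers, parsed):
    """First resolver to claim the URL produces the plan node."""
    for resolver in resolvers:
        plan = resolver.scan_url(parsed)
        if plan is not None:
            return plan
    raise ValueError(f"no resolver claims scheme {parsed.scheme!r}")
```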

External data connectors (Spark, Delta Lake, custom APIs) can now register as resolvers and handle their own URL schemes without forking or wrapping the runtime.

Moving relative URL resolution and URL permission checks into the runtime also makes server-side execution predictable: embedded runtimes and gRPC runtimes share the same URL semantics, while the server remains the source of truth for remote access policy.

jonmmease and others added 4 commits March 13, 2026 10:14
Add scan_url method to PlanResolver so resolvers participate in data
source URL handling. DataFusionResolver moves from a privileged terminal
resolver into a regular resolver in the pipeline chain.

Key changes:
- ParsedUrl struct for structured URL representation passed to scanners
- ResolverCapabilities proto + MergedCapabilities for URL support negotiation
- DataBaseUrlSetting enum for explicit base URL API (Default/Disabled/Custom)
- resolve_url() shared function for plan-time and eval-time URL resolution
- GetCapabilities RPC for remote capability propagation (gRPC + WASM)
- Python bridge: scan_url, scan_url_proto, capabilities on PlanResolver
- data_base_url parameter threaded through pre_transform_* and ChartState APIs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace unreachable!() with proper error in pipeline resolve
- Use url::Url::join() for RFC 3986 URL resolution
- Check URL scheme in DataSpec::supported against capabilities
- Handle protocol-relative URLs by prepending https:
- Deduplicate scheme lists in DataFusionResolver::scan_url
- Gate url::Url::from_file_path behind cfg(not(wasm32))
- Add catch-all arm in server test match for GetCapabilities variant
- Format Python files with ruff

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The scan_url abstraction was losing the Vega format.parse spec by passing
&None to read_csv. This caused incorrect date/timezone handling for CSV
datasets with explicit parse directives (e.g., seattle-weather.csv).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Skip scheme validation for raw absolute paths in data.rs (e.g.,
  C:\Users\...) which haven't been resolved to file:// URLs yet.
  url::Url::parse misinterprets "C:" as a scheme on Windows.
- Gate Unix-path tests with #[cfg(not(target_os = "windows"))] since
  Url::from_file_path rejects Unix paths on Windows.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jonmmease jonmmease force-pushed the resolver-url-scanning branch from a748a67 to 0eb7694 on March 13, 2026 14:14
@jonmmease jonmmease changed the base branch from external-table-provider to main March 13, 2026 14:25
jonmmease and others added 16 commits March 13, 2026 10:47
…ystems

Use the proper URL term "scheme" (RFC 3986) instead of "protocol"
for ExternalTableProvider, ExternalDataset, codec serialization keys,
and all related APIs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…runtime

This enables a default resolve_plan implementation that walks the
LogicalPlan tree, calls resolve_table for each ExternalTableProvider,
and replaces them with MemTable scans. Implementers can now override
just resolve_table instead of the full resolve_plan method.

The trait needed to live in vegafusion-runtime because the default
implementation depends on DataFusion types (ExternalTableProvider,
MemTable, TreeNodeRewriter) that are not available in vegafusion-core.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Threads the ExternalTableProvider's scheme through to resolve_table
so resolvers can identify the data source type without parsing metadata.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ataset

No backward compatibility needed — this API hasn't been released yet.
Scheme now comes before schema in all signatures to reflect its role
as the primary discriminator for external data sources.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Accurately document why the fallback exists (method absent on
wasm32-unknown-unknown, not just a runtime failure) and note the
percent-encoding limitation for reserved characters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
has_url_scheme used `contains("://")`, so relative references like
`fetch?target=http://evil.com/data` were misclassified as absolute URLs.
resolve_url then returned them as-is, causing downstream Url::parse to
fail with RelativeUrlWithoutBase.

Now validates that `://` is preceded by a valid RFC 3986 scheme prefix.
Also replaced the duplicated inline `contains("://")` check in data.rs
with a call to the fixed has_url_scheme.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion

Add `supports_arrow_tables` bool to `ResolverCapabilities` proto so
resolvers can declare whether they efficiently consume in-memory Arrow
tables. DataFusion sets this true; remote resolvers (e.g. Spark) default
to false.

Replaces the `has_user_resolvers()` heuristic with `should_materialize(plan)`
which inspects the actual LogicalPlan:
- Materialize if all resolvers support Arrow tables (fast path)
- Materialize if the plan has no ExternalTableProvider nodes
- Otherwise keep lazy for resolver interception

This avoids unnecessary lazy plans when a non-arrow resolver is
registered but the specific plan doesn't involve any external tables.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ments

Add Sphinx feature page, three Python example scripts, and improved
docstrings for the PlanResolver extensibility system.

New files:
- docs/source/features/plan_resolver.md
- examples/python-examples/plan_resolver_basic.py
- examples/python-examples/plan_resolver_url_scanning.py
- examples/python-examples/plan_resolver_sql.py

Docstring improvements:
- capabilities() now documents supports_arrow_tables key
- resolve_plan_proto() and resolve_plan() have Args/Returns sections
- ExternalDataset.schema, .metadata, .data properties have docstrings
- Rust PlanResolver trait has top-level doc comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fixes ruff FA102 (PEP 604 union syntax without future annotations).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace manual char-by-char RFC 3986 scheme validation with a
compiled regex. Also fix E501 line-length violations in the
capabilities() docstring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jonmmease jonmmease force-pushed the resolver-url-scanning branch from 9aa3880 to 97d0c4f on March 16, 2026 15:47
jonmmease and others added 5 commits March 16, 2026 15:15
…tion at runtime

Remove the capabilities system (supported_schemes, supported_format_types,
supported_extensions) from the planner. The planner no longer checks URL
schemes or format types against resolver capabilities -- all URL-backed
datasets are considered plannable and errors surface at runtime.

Move URL resolution from planning time to runtime. Static URLs are no
longer resolved by MakeTasksVisitor; instead DataUrlTask::eval() resolves
both static and signal-based URLs uniformly.

Replace the supports_arrow_tables proto field with a direct trait method
on PlanResolver. ResolverPipeline queries resolvers directly instead of
going through MergedCapabilities.

Removed:
- ResolverCapabilities proto message and GetCapabilities RPC
- DataBaseUrlSettingProto and data_base_url from pretransform opts
- MergedCapabilities struct and planner_capabilities() trait method
- capabilities() from Rust and Python PlanResolver
- fetch_capabilities_via_query_fn() from WASM
- Scheme/format checks from DataSpec::supported()

35 files changed, +70 -593 lines.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After removing capabilities-based format checks, the planner pushed
all URL-backed datasets server-side including topojson, which DataFusion
can't read. Add a hardcoded SUPPORTED_FORMATS list so formats like
topojson stay client-side for Vega JS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…l.rs

The file no longer contains the PlanResolver trait (moved to
vegafusion-runtime in an earlier commit). Rename to reflect its
actual contents: URL types, resolution, and parsing utilities.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
data_base_url is a runtime configuration, not a task graph property.
Remove it from the DataUrlTask proto, MakeTasksVisitor, ChartStateOpts,
and the VegaFusionRuntimeTrait method signatures. Instead, store the
resolved base URL on ResolverPipeline and read it at eval time.

Add data_base_url parameter to Python VegaFusionRuntime:
  None/True = CDN default, str = custom URL, False = disabled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jonmmease jonmmease marked this pull request as ready for review March 17, 2026 22:39
…untime

Bundle the scattered eval parameters (tz_config, inline_datasets,
pipeline, data_base_url) into a TaskContext struct. Move data_base_url
from ResolverPipeline to VegaFusionRuntime since it's orthogonal to
the resolver chain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jonmmease and others added 5 commits March 18, 2026 10:38
Add filters parameter to PlanResolver::resolve_table (Rust and Python)
so resolvers can optimize data loading with pushed-down predicates.
Filters are hints — DataFusion re-applies them regardless.

ExternalTableProvider now reports Inexact for supports_filters_pushdown
so DataFusion pushes filter expressions into TableScan nodes.

Add unparse_expr_to_sql function (Rust pyfunction + Python wrapper)
that converts LogicalExprNode proto messages to SQL strings. Accepts
a single expression or a list (joined with AND). Supports all existing
SQL dialects (default, postgres, mysql, sqlite, duckdb, bigquery).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Move ResolutionResult from vegafusion-core to vegafusion-runtime
  (lives with PlanResolver where it's used)
- Remove reserved 6 from DataUrlTask proto (never merged with that field)
- Update stale docs referencing removed capabilities concept
- Remove source field from ExternalTableProvider (unused, metadata
  covers this use case)
- Introduce VegaFusionRuntimeOpts with Default, replacing two
  constructors with a single new(opts) method
- Remove section header comments from Python test file

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add test proving filter transforms work end-to-end with resolve_table
(DataFusion applies filters after resolution). Assert that filters are
currently not pushed down to resolve_table due to _vf_order window
blocking PushDownFilter, with TODO to address via with_index changes.

Remove unused optimize_filters infrastructure from ResolverPipeline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop plan_resolver_basic.py (covered by url_scanning example) and
custom_resolver.rs (logging pass-through doesn't demonstrate real use).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove dead _vf_scheme check (removed implementation detail)
- Drop redundant snapshot from proto_message unparse test (already
  covered by from_resolver test, keep bytes==proto equality check)
- Add clarifying comment to scan_url_not_called_without_override test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
jonmmease and others added 4 commits March 18, 2026 12:19
- Remove links to deleted examples (plan_resolver_basic.py,
  custom_resolver.rs)
- Add filters parameter to resolve_table code snippets
- Add unparse_expr_to_sql to API reference

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add imports to all code snippets
- Use generic examples with comments explaining real-world usage
- Show data_base_url for relative URL resolution
- Fix configuration bullets to show defaults (thread_safe=True,
  skip_when_no_external_tables=True, supports_arrow_tables=False)
- Clarify protobuf dependency note (external_table_scan_node needs
  it, not scan_url itself)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove pandas dependency from url_scanning example (use print(table))
- Use parsed_url["url"] as table_name to avoid collisions
- Remove unused source_table from SQL example constructor
- Clarify hardcoded return in SQL example with explicit comment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@jonmmease jonmmease left a comment


self review

/// Resolver produced a rewritten plan for the next resolver to handle,
/// or for DataFusion to execute if this is the last resolver
Plan(LogicalPlan),
}

I don't think this is used in vegafusion-core any more, move to vegafusion-runtime with the PlanResolver trait itself?

int32 batch_size = 3;
ScanUrlFormat format_type = 4;
transforms.TransformPipeline pipeline = 5;
reserved 6;

remove reserved, a version with base url here was never merged


this was moved from vegafusion-core/src/data/plan_resolver.rs, but git didn't pick up on that

///
/// 1. **Planning phase**: [`capabilities`](Self::capabilities) declares supported
/// URL schemes/formats, and [`scan_url`](Self::scan_url) converts URLs into
/// `LogicalPlan` nodes (typically `ExternalTableProvider` markers).

capabilities concept was removed, update docs here


done

_scheme: &str,
_schema: SchemaRef,
_metadata: &serde_json::Value,
_projected_columns: Option<Vec<String>>,

add filters from DataFusion


done

// in a server context, wrap in catch_unwind just in case.
let pipeline = self.pipeline.clone();
let task_ctx = TaskContext {
tz_config: None, // overridden per-task from task.tz_config

Oh, double check that we ever use this one then


it is overridden

/// scheme: Scheme identifier (e.g. "spark").
/// schema: Arrow schema (arro3.core.Schema) — required for logical planning.
/// metadata: Optional JSON-serializable dict of metadata.
/// source: Optional source identifier.

double check that we still need a dedicated source here (rather than this going in metadata if needed)

Ok(!cls_method.is(&base_method))
})();
result.unwrap_or(false)
}

look into whether this handles the case of a subclass that overrides the method, and a subclass of that does not override the method


it does
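The question above — whether identity-comparing the class method against the base method handles a grandchild that does not itself override — can be checked in plain Python. This sketch uses hypothetical names; the actual check is done through pyo3 (`cls_method.is(&base_method)`), but the lookup semantics are the same:

```python
class PlanResolverBase:
    def scan_url(self, parsed):  # default implementation: decline everything
        return None


def overrides_scan_url(cls):
    # Attribute lookup walks the MRO, so a grandchild that doesn't
    # redefine scan_url still resolves to its parent's override, and the
    # identity check against the base method correctly reports True.
    return getattr(cls, "scan_url") is not PlanResolverBase.scan_url


class Child(PlanResolverBase):
    def scan_url(self, parsed):
        return "claimed"


class GrandChild(Child):  # no override of its own
    pass
```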


/// Formats that VegaFusion can read server-side. Anything else (e.g. topojson)
/// stays client-side for Vega JS to handle.
const SUPPORTED_FORMATS: &'static [&'static str] = &["csv", "tsv", "json", "arrow", "parquet"];

will we let through URLs that don't have an extension and don't have a format specified? We want to support cases like relative paths that don't have file extensions

assert "Unknown dialect" in str(resolver.error)


# ── scan_url tests ──

drop this style of comment

Strip extended-length path prefix (\\?\) from fs::canonicalize on
Windows to fix path prefix matching in allowed_base_urls checks.
Also fix formatting issues across Rust and Python files.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jonmmease

cc @OlegWock in case you're interested. These changes will make it possible for a custom resolver to handle URLs. So your Spark resolver could handle CSV or Parquet URLs or file paths, or the spec could contain custom spark:// URLs that your resolver would handle.

@OlegWock

In our case we fully control the input Vega spec, so vegafusion+dataset combined with ExternalDataset covers our needs pretty well, but I think this is a nice feature to have

