
feat: adds altair.datasets #3848

Merged
mattijn merged 259 commits into main from pr-3631 on Jul 11, 2025

Conversation

@mattijn (Contributor) commented Jul 11, 2025

This PR builds upon the amazing work in #3631 by @dangotbanned.

It introduces a modernised dataset interface for Altair, now based on vega-datasets version 3.2.0 from the Vega organisation (link).

This will provide Altair with more than 70 datasets that can be lazily loaded without any hard dependency on a specific dataframe library (single hard dependency on narwhals 🫶).

Primary interface:

from altair.datasets import data

# Load with default engine (pandas)
cars_df = data.cars()

# Load with specific engine
cars_polars = data.cars(engine="polars")
cars_pyarrow = data.cars(engine="pyarrow")

# Get URL
cars_url = data.cars.url

# Set default engine for all datasets
data.set_default_engine("polars")
movies_df = data.movies()  # Uses polars engine

# List available datasets
available_datasets = data.list_datasets()

Expert interface:

# Use Loader
from altair.datasets import Loader

load = Loader.from_backend("polars")
load("penguins")
load.url("penguins")

# Use direct functions
from altair.datasets import load, url

# Load a dataset
cars_df = load("cars", backend="polars")

# Get dataset URL
cars_url = url("cars")
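As a rough picture of what `url("cars")` resolves to, here is a hedged sketch. The `BASE` constant and `_EXT` table are assumptions for illustration (the real package resolves names and formats from bundled vega-datasets metadata), though the jsDelivr npm CDN path shown follows the standard `package@version/path` layout:

```python
# Hypothetical sketch of dataset-name -> URL resolution.
BASE = "https://cdn.jsdelivr.net/npm/vega-datasets@3.2.0/data"

# Assumed name -> file-extension table; the real mapping lives in metadata.
_EXT = {"cars": ".json", "penguins": ".csv", "us-10m": ".json"}


def url(name: str) -> str:
    if name not in _EXT:
        raise ValueError(f"unknown dataset: {name}")
    return f"{BASE}/{name}{_EXT[name]}"
```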

Spatial data support
Spatial datasets are automatically loaded using:

  • geopandas when using the pandas engine
  • polars-st when using the polars engine

Example:

from altair.datasets import data

source = data.us_10m(engine="polars", layer="states")
source.st.plot().project(type="albersUsa")
(screenshot: US states rendered with the albersUsa projection)

Migration path
Migrating from the Python vega_datasets package (link) to the new altair.datasets is just a one-line change:

# old way (this is deprecated)
from vega_datasets import data

# new way (this is awesome)
from altair.datasets import data

No other changes are needed; everything will continue to work, and much more becomes possible (at least, that is the aim).

Caching and metadata notes (from the review discussion):

- Basic mechanism for discovering new versions; tries to minimise the number and total size of requests. Not required for these requests, but may be helpful to avoid rate limits.
- As an example, for comparing against the most recent, the 5 most recent have been added.
- Experimenting with querying the URL cache with expressions.
- `metadata_full.parquet` stores **all known** file metadata: roughly 3000 rows; a single release is **9kb** vs **21kb** for all 46 releases.
- `GitHub.refresh()` maintains integrity in a safe manner.
- Still undecided exactly how this functionality should work; the `npm` tags != `gh` tags issue also needs resolving.
- Suffix was only added due to *now-removed* test files.
dangotbanned and others added 25 commits, February 7, 2025 11:19
Made possible via vega/vega-datasets#681

- Removes temp files
- Removes some outdated APIs
- Removes a test based on the removed `"points"` dataset
Still need to update tests
- Seems to work pretty similarly to `geopandas`
- The repr isn't as clean
- Pretty cool that you can get *something* from `load("us-10m").st.plot()`
