
feat: adds altair.datasets #3848

Merged
mattijn merged 259 commits into main from pr-3631 on Jul 11, 2025

Conversation

@mattijn (Contributor) commented Jul 11, 2025

This PR builds upon the amazing work in #3631 by @dangotbanned.

It introduces a modernised dataset interface for Altair, now based on vega-datasets version 3.2.0 from the Vega organisation (link).

This will provide Altair with more than 70 datasets that can be lazily loaded without any hard dependency on a specific dataframe library (single hard dependency on narwhals 🫶).

Primary interface:

from altair.datasets import data

# Load with default engine (pandas)
cars_df = data.cars()

# Load with specific engine
cars_polars = data.cars(engine="polars")
cars_pyarrow = data.cars(engine="pyarrow")

# Get URL
cars_url = data.cars.url

# Set default engine for all datasets
data.set_default_engine("polars")
movies_df = data.movies()  # Uses polars engine

# List available datasets
available_datasets = data.list_datasets()

Expert interface:

# Use Loader
from altair.datasets import Loader

load = Loader.from_backend("polars")
load("penguins")
load.url("penguins")

# Use direct functions
from altair.datasets import load, url

# Load a dataset
cars_df = load("cars", backend="polars")

# Get dataset URL
cars_url = url("cars")
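As a rough picture of what `url("cars")` resolves to, here is a hedged sketch. The `BASE` constant and `_EXT` table are assumptions for illustration (the real package resolves names and formats from bundled vega-datasets metadata), though the jsDelivr npm CDN path shown follows the standard `package@version/path` layout:

```python
# Hypothetical sketch of dataset-name -> URL resolution.
BASE = "https://cdn.jsdelivr.net/npm/vega-datasets@3.2.0/data"

# Assumed name -> file-extension table; the real mapping lives in metadata.
_EXT = {"cars": ".json", "penguins": ".csv", "us-10m": ".json"}


def url(name: str) -> str:
    if name not in _EXT:
        raise ValueError(f"unknown dataset: {name}")
    return f"{BASE}/{name}{_EXT[name]}"
```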

Spatial data support
Spatial datasets are automatically loaded using:

  • geopandas when using the pandas engine
  • polars-st when using the polars engine

Example:

from altair.datasets import data

source = data.us_10m(engine="polars", layer="states")
source.st.plot().project(type="albersUsa")
(screenshot: US states rendered with the albersUsa projection)

Migration path
Migrating from the Python vega_datasets package (link) to the new altair.datasets is just a one-line change:

# old way (this is deprecated)
from vega_datasets import data

# new way (this is awesome)
from altair.datasets import data

No other changes are needed; everything will continue to work, and much more becomes possible (at least, that is the aim).

Caching and metadata notes (from the review discussion):

- Basic mechanism for discovering new versions; tries to minimise the number and total size of requests. Not required for these requests, but may be helpful to avoid rate limits.
- As an example, for comparing against the most recent, the 5 most recent have been added.
- Experimenting with querying the URL cache with expressions.
- `metadata_full.parquet` stores **all known** file metadata: roughly 3000 rows; a single release is **9kb** vs **21kb** for all 46 releases.
- `GitHub.refresh()` maintains integrity in a safe manner.
- Still undecided exactly how this functionality should work; the `npm` tags != `gh` tags issue also needs resolving.
- Suffix was only added due to *now-removed* test files.
dangotbanned and others added 25 commits, February 7, 2025 11:19
Made possible via vega/vega-datasets#681

- Removes temp files
- Removes some outdated APIs
- Removes a test based on the removed `"points"` dataset
Still need to update tests
- Seems to work pretty similarly to `geopandas`
- The repr isn't as clean
- Pretty cool that you can get *something* from `load("us-10m").st.plot()`
