
docs: Add missing descriptions, sources, and licenses #663

Merged
dsmedia merged 59 commits into vega:main from dsmedia:add-missing-metadata
Feb 2, 2025

Conversation

@dsmedia
Collaborator

@dsmedia dsmedia commented Jan 12, 2025

Pull Request: Add missing descriptions, sources, and licenses to datapackage files

Objective:

Following up on the metadata infrastructure work in #634, #639, and #646, this PR adds missing description, source, and license metadata to datapackage_additions.toml. The checklist below indicates which metadata entries are missing and still need to be added. Schema entries (describing each dataset column) will be handled in a subsequent pull request.

Dataset descriptions are written to avoid formulations such as "This dataset...", consistent with PEP 257 – Docstring Conventions.

✅ PR Merged - 🚧 Work Remaining

Descriptions were added for all datasets.
Resources still missing sources:

  • flare-dependencies.json, flare.json, movies.json, sp500.csv, stocks.csv, udistrict.json, weekly-weather.json, windvectors.csv

Resources still missing licenses:

  • anscombe.json, driving.json, flare-dependencies.json, flare.json, la-riots.csv, movies.json, ohlc.json, platformer-terrain.json, sp500-2000.csv, sp500.csv, stocks.csv, udistrict.json, volcano.json, weekly-weather.json, windvectors.csv

Open Questions: Complete

  • Decide how to handle license information that cannot be determined after research, ensuring validation against the Frictionless standard for licenses.
  • [ ] Add a disclaimer about license information, deferring it to a separate PR
  • Regenerated datapackage.json and datapackage.md

Status:

The following checklist indicates the completion status of the description, sources, and licenses metadata for each dataset. A green checkmark (✅) indicates the metadata is present; a red X (❌) indicates the metadata is missing. The leading checkbox is only checked if all three types of metadata are present.

Process:
Changes are validated using scripts/build_datapackage.py, which generates machine-readable metadata describing the contents of /data/.
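
To illustrate the kind of check involved, here is a minimal sketch (not the actual build_datapackage.py logic) that verifies a resource entry carries all three metadata types. The TOML layout and license values are illustrative assumptions rather than copies from datapackage_additions.toml; field names follow the Frictionless Data Package spec.

```python
import tomllib  # Python 3.11+

# Hypothetical entry mimicking a datapackage_additions.toml resource.
entry = """
[[resources]]
path = "normal-2d.json"
description = "Simulated bivariate normal sample."

[[resources.sources]]
title = "Generated for vega-datasets"

[[resources.licenses]]
name = "BSD-3-Clause"
title = "BSD 3-Clause License"
"""

for resource in tomllib.loads(entry)["resources"]:
    # The three metadata types tracked in this PR's checklist.
    missing = [k for k in ("description", "sources", "licenses") if k not in resource]
    print(resource["path"], "✅" if not missing else f"❌ missing: {', '.join(missing)}")
```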

Legend:

  • ✅ - Indicates the metadata is present
  • ❌ - Indicates the metadata is missing
  • [x] - Indicates all three types of metadata are present
  • [ ] - Indicates one or more types of metadata are missing

Checklist:

Update datasets.toml with missing source metadata for 7zip.png dataset
- adds citation to protovis in description
- fixes link to image in sources
- adds license
- fixes bad link in annual-precip.json; adds license
- adds license to birdstrikes.csv, budget.json, burtin.json, and cars.json
- expands description to explain units and seasonal adjustment
- adds additional source directly to dataset csv
- adds license details from source
- expands description
- adds license
- Document that data used in flights* datasets are collected under US DOT requirements
- Add row counts to flight dataset descriptions (2k-3M rows)
- Note regulatory basis (14 CFR Part 234) while acknowledging unclear license terms
- adds license for londonBoroughs.json
- adds sources, license for londonCentroids.json (itself derived from londonBoroughs.json)
- expands description, corrects source URL, updates source title, and adds license for londonTubeLines.json
- global-temp.csv
- iowa-electricity.csv
- jobs.json
- monarchs.json
- political-contributions.json (also updates link to FEC github), note that FEC provides an explicit underlying license
- population_engineers_hurricanes.csv
- seattle-weather-hourly-normals.csv
- seattle-weather.csv
- unemployment-across-industries.json
- unemployment.tsv
- us-employment.csv
- weather.csv

Note that many pages hosting US government datasets do not explicitly grant a license. As a result, when there is doubt, a link is provided to the USA government works page, which explains the nuances of licenses for data on US government web sites.
- adds license (football.json, la-riots.csv, penguins.json, platformer-terrain.json, population.json, sp500-2000.csv, sp500.csv, volcano.json)
- airports.csv (adds description, sources, license)
- barley.csv (updates description and source; adds license)
- disasters.csv (expands description, updates sources, add license)
- driving.json (adds description, updates source, adds license)
- ohlc.json (modifies description, adds additional source, and license)
- stocks.csv (adds source, license)
- weekly-weather.json (adds source, license)
- windvectors.csv (adds source, license)
- updates description, adds sources and
- adds description, source and license
- makes license title of U.S. Government Datasets consistent for cases where specific license terms are undetermined
- focuses on how data is used in edge bundling example
- would benefit from additional detail in the description
- corrects description, adds source, license
- ffox.png (updates license)
- gapminder.json (adds license)
- gimp.png (updates description, adds source, license)
- github.csv (adds description, source, license)
- lookup_groups.csv, lookup_people.csv (adds description, source, license)
- miserables.json (adds description, source, license)
- movies.json (adds source, license)
- normal-2d.json (adds description, source, license)
- stocks.csv (adds description)
@dsmedia
Collaborator Author

dsmedia commented Jan 20, 2025

Here is the code to validate the statistical description of normal-2d.json added in a86b9dd.

```python
import pandas as pd
from scipy import stats

# Read data
df = pd.read_json("https://raw.githubusercontent.com/vega/vega-datasets/main/data/normal-2d.json")

# Generate key statistics
stats_output = {
    "Sample size": len(df),
    "Means": df.mean().round(3).to_dict(),
    "Standard deviations": df.std().round(3).to_dict(),
    "Correlation": round(df.corr().iloc[0, 1], 3),
    "Ranges": {col: [round(df[col].min(), 3), round(df[col].max(), 3)] for col in df.columns},
    "Normality p-values": {col: round(stats.normaltest(df[col]).pvalue, 3) for col in df.columns}
}

# Print statistics
print("Dataset Statistics for Description:")
print(f"Sample size: {stats_output['Sample size']} points")
print(f"Centers: {stats_output['Means']}")
print(f"Standard deviations: {stats_output['Standard deviations']}")
print(f"Correlation: {stats_output['Correlation']}")
print(f"Ranges: {stats_output['Ranges']}")
print(f"Normality test p-values: {stats_output['Normality p-values']}")
```

...which produced:

```
Dataset Statistics for Description:
Sample size: 500 points
Centers: {'u': 0.005, 'v': -0.011}
Standard deviations: {'u': 0.192, 'v': 0.199}
Correlation: 0.026
Ranges: {'u': [-0.578, 0.533], 'v': [-0.534, 0.606]}
Normality test p-values: {'u': 0.68, 'v': 0.763}
```

dsmedia and others added 10 commits February 1, 2025 12:50
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
- runs build_datapackage.py to verify
- source and license can be clarified in a future PR
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
@dsmedia
Collaborator Author

dsmedia commented Feb 1, 2025

> Some really great stuff in here @dsmedia!
>
> Let me know if you need any clarifications

Have gone through and addressed all I believe - appreciate the diligent review @dangotbanned!

@dangotbanned
Member

> Some really great stuff in here @dsmedia!
> Let me know if you need any clarifications
>
> Have gone through and addressed all I believe - appreciate the diligent review @dangotbanned!

Thanks @dsmedia, hopefully will have a chance to look it over tomorrow

@dangotbanned
Member


@dsmedia all the new changes are great 🎉

Reading through the new datapackage.md (https://github.com/vega/vega-datasets/blob/dbecc75a57d3a5cc0bbeea1861b0254d3be4e15c/datapackage.md), some more things stood out to me.
Hopefully they should be easy to resolve

- github.csv: move time range to schema
- add categories to schema in seattle-weather.csv
- sp500.csv, udistrict.json, uniform-2d, weather.json: move description content into schema
- reformat usgs disclaimer in us-state-capitals.json
- rerun build_datapackage.py
@dangotbanned
Member

Thanks @dsmedia, feel free to merge whenever you're ready (and #672)

@dsmedia dsmedia merged commit 06a3ade into vega:main Feb 2, 2025
2 checks passed
@dsmedia dsmedia deleted the add-missing-metadata branch February 2, 2025 15:01
mattijn added a commit to vega/altair that referenced this pull request Jul 11, 2025
* feat: Adds `.arrow` support

To support [flights-200k.arrow](https://github.com/vega/vega-datasets/blob/f637f85f6a16f4b551b9e2eb669599cc21d77e69/data/flights-200k.arrow)

* feat: Add support for caching metadata

* feat: Support env var `VEGA_GITHUB_TOKEN`

Not required for these requests, but may be helpful to avoid limits

* feat: Add support for multi-version metadata

As an example, for comparing against the most recent release, I've added the 5 most recent

* refactor: Renaming, docs, reorganize

* feat: Support collecting release tags

See https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repository-tags

* feat: Adds `refresh_tags`

- Basic mechanism for discovering new versions
- Tries to minimise the number and total size of requests

* feat(DRAFT): Adds `url_from`

Experimenting with querying the url cache w/ expressions

* fix: Wrap all requests with auth

* chore: Remove `DATASET_NAMES_USED`

* feat: Major `GitHub` rewrite, handle rate limiting

- `metadata_full.parquet` stores **all known** file metadata
- `GitHub.refresh()` to maintain integrity in a safe manner
- Roughly 3000 rows
- Single release: **9kb** vs 46 releases: **21kb**

* feat(DRAFT): Partial implement `data("name")`

* fix(typing): Resolve some `mypy` errors

* fix(ruff): Apply `3.8` fixes

https://github.com/vega/altair/actions/runs/11495437283/job/31994955413

* docs(typing): Add `WorkInProgress` marker to `data(...)`

- Still undecided exactly how this functionality should work
- Need to resolve `npm` tags != `gh` tags issue as well

* feat(DRAFT): Add a source for available `npm` versions

* refactor: Bake `"v"` prefix into `tags_npm`

* refactor: Move `_npm_metadata` into a class

* chore: Remove unused, add todo

* feat: Adds `app` context for github<->npm

* fix: Invalidate old trees

* chore: Remove early test files

* refactor: Rename `metadata_full` -> `metadata`

Suffix was only added due to *now-removed* test files

* refactor: `tools.vendor_datasets` -> `tools.datasets` package

Will be following up with some more splitting into composite modules

* refactor: Move `TypedDict`, `NamedTuple`(s) -> `datasets.models`

* refactor: Move, rename `semver`-related tools

* refactor: Remove `write_schema` from `_Npm`, `_GitHub`

Handled in `Application` now

* refactor: Rename, split `_Npm`, `_GitHub` into own modules

`tools.datasets.npm` will later be performing the requests that are in `Dataset.__call__` currently

* refactor: Move `DataLoader.__call__` -> `DataLoader.url()`

-`data.name()` -> `data(name)`
- `data.name.url` -> `data.url(name)`

* feat(typing): Generate annotations based on known datasets

* refactor(typing): Utilize `datasets._typing`

* feat: Adds `Npm.dataset` for remote reading

* refactor: Remove dead code

* refactor: Replace `name_js`, `name_py` with `dataset_name`

Since we're just using strings, there is no need for 2 forms of the name.
The legacy package needed this for `__getattr__` access with valid identifiers

* fix: Remove invalid `semver.sort` op

I think this was added in error, since the schema of the file never had `semver` columns

Only noticed the bug when doing a full rebuild

* fix: Add missing init path for `refresh_trees`

* refactor: Move public interface to `_io`

Temporary home, see module docstring

* refactor(perf): Don't recreate path mapping on every attribute access

* refactor: Split `Reader._url_from` into `url`, `_query`

- Much more generic now in what it can be used for
- For the caching, I'll need more columns than just `"url_npm"`
- `"url_github" contains a hash

* feat(DRAFT): Adds `GitHubUrl.BLOBS`

- Common prefix to all rows in `metadata[url_github]`
- Stripping this leaves only `sha`
- For **2800** rows, there are only **109** unique hashes, so these can be used to reduce cache size

* feat: Store `sha` instead of `github_url`

Related 661a385

* feat(perf): Adds caching to `ALTAIR_DATASETS_DIR`

* feat(DRAFT): Adds initial generic backends

* feat: Generate and move `Metadata` (`TypedDict`) to `datasets._typing`

* feat: Adds optional backends, `polars[pyarrow]`, `with_backend`

* feat: Adds `pyarrow` backend

* docs: Update `.with_backend()`

* chore: Remove `duckdb` comment

Not planning to support this anymore, requires `fsspec` which isn't in `dev`

```
InvalidInputException
Traceback (most recent call last)
Cell In[6], line 5
       3 with duck._reader._opener.open(url) as f:
       4     fn = duck._reader._read_fn['.json']
----> 5     thing = fn(f.read())

InvalidInputException: Invalid Input Error: This operation could not be completed because required module 'fsspec' is not installed"
```

* ci(typing): Add `pyarrow-stubs` to `dev` dependencies

Will put this in another PR, but need it here for IDE support

* refactor: `generate_datasets_typing` -> `Application.generate_typing`

* refactor: Split `datasets` into public/private packages

- `tools.datasets`: Building & updating metadata file(s), generating annotations
- `altair.datasets`: Consuming metadata, remote & cached dataset management

* refactor: Provide `npm` url to `GitHub(...)`

* refactor: Rename `ext` -> `suffix`

* refactor: Remove unimplemented `tag="latest"`

Since `metadata.parquet` is sorted, this was already the behavior when not providing a tag

* feat: Rename `_datasets_dir`, make configurable, add docs

Still on the fence about `Loader.cache_dir` vs `Loader.cache`

* docs: Adds examples to `Loader.with_backend`

* refactor: Clean up requirements -> imports

* docs: Add basic example to `Loader` class

Also incorporates changes from previous commit into `__repr__`
4a2a2e0

* refactor: Reorder `alt.datasets` module

* docs: Fill out `Loader.url`

* feat: Adds `_Reader._read_metadata`

* refactor: Rename `(reader|scanner_from()` -> `(read|scan)_fn()`

* refactor(typing): Replace some explicit casts

* refactor: Shorten and document request delays

* feat(DRAFT): Make `[tag]` a `pl.Enum`

* fix: Handle `pyarrow` scalars conversion

* test: Adds `test_datasets`

Initially quite basic; need to add more parametrization and test caching

* fix(DRAFT): hotfix `pyarrow` read

* fix(DRAFT): Treat `polars` as exception, invalidate cache

Possibly fix https://github.com/vega/altair/actions/runs/11768349827/job/32778071725?pr=3631

* test: Skip `pyarrow` tests on `3.9`

Forgot that this gets uninstalled in CI
https://github.com/vega/altair/actions/runs/11768424121/job/32778234026?pr=3631

* refactor: Tidy up changes from last 4 commits

- Rename and properly document "file-like object" handling
  - Also made a bit clearer what is being called and when
- Use a more granular approach to skipping in `@backends`
  - Previously, everything was skipped regardless of whether it required `pyarrow`
  - Now, `polars`, `pandas` **always** run - with `pandas` expected to fail
- I had to clean up `skip_requires_pyarrow` to make it compatible with `pytest.param`
  - It has a runtime check for if `MarkDecorator`, instead of just a callable

bb7bc17, ebc1bfa, fe0ae88,
7089f2a

* refactor: Rework `_readers.py`

- Moved `_Reader._metadata` -> module-level constant `_METADATA`.
  - It was never modified and is based on the relative directory of this module
- Generally improved the readability with more method-chaining (less assignment)
- Renamed, improved doc `_filter_reduce` -> `_parse_predicates_constraints`

* test: Adds tests for missing dependencies

* test: Adds `test_dataset_not_found`

* test: Adds `test_reader_cache`

* docs: Finish `_Reader`, fill parameters of `Loader.__call__`

Still need examples for `Loader.__call__`

* refactor: Rename `backend` -> `backend_name`, `get_backend` -> `backend`

`get_` was the wrong term since it isn't a free operation

* fix(DRAFT): Add multiple fallbacks for `pyarrow` JSON

* test: Remove `pandas` fallback for `pyarrow`

There are enough alternatives here, it only added complexity

* test: Adds `test_all_datasets`

Disabled by default, since there are 74 datasets

* refactor: Remove `_Reader._response`

Can't reproduce the original issue that led to adding this.
All backends support `HTTPResponse` directly

* fix: Correctly handle no remote connection

Previously, `Path.touch()` appeared to be a cache-hit - despite being an empty file.
- Fixes that bug
- Adds tests

* docs: Align `_typing.Metadata` and `Loader.(url|__call__)` descriptions

Related c572180

* feat: Update to `v2.10.0`, fix tag inconsistency

- Noticed one branch that missed the join to `npm`
  - Moved the join to `.tags()` and added a doc
- https://github.com/vega/vega-datasets/releases/tag/v2.10.0

* refactor: Tidying up `tools.datasets`

* revert: Remove tags schema files

* ci: Introduce `datasets` refresh to `generate_schema_wrapper`

Unrelated to schema, but needs to hook in somewhere

* docs: Add `tools.datasets.Application` doc

* revert: Remove comment

* docs: Add a table preview to `Metadata`

* docs: Add examples for `Loader.__call__`

* refactor: Rename `DatasetName` -> `Dataset`, `VersionTag` -> `Version`

* fix: Ensure latest `[tag]` appears first

When updating from `v2.9.0` -> `v2.10.0`, new tags were appended to the bottom.
This invalidated an assumption in `Loader.(dataset|url)` that the first result is the latest

* refactor: Misc `models.py` updates

- Remove unused `ParsedTreesResponse`
- Align more of the doc style
- Rename `ReParsedTag` -> `SemVerTag`

* docs: Update `tools.datasets.__init__.py`

* test: Fix `@datasets_debug` selection

Wasn't being recognised by `-m not datasets_debug` and always ran

* test: Add support for overrides in `test_all_datasets`

vega/vega-datasets#627

* test: Adds `test_metadata_columns`

* fix: Warn instead of raise for hit rate limit

There should be enough handling elsewhere to stop requesting

https://github.com/vega/altair/actions/runs/11823002117/job/32941324941#step:8:102

* feat: Update for `v2.11.0`

https://github.com/vega/vega-datasets/releases/tag/v2.11.0
Includes support for `.parquet` following:
- vega/vega-datasets#628
- vega/vega-datasets#627

* feat: Always use `pl.read_csv(try_parse_dates=True)`

Related #3631 (comment)
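
As a quick illustration of what that flag buys (dataset URL chosen for the example, not taken from the commit):

```python
import polars as pl

url = "https://raw.githubusercontent.com/vega/vega-datasets/main/data/seattle-weather.csv"
# try_parse_dates=True lets polars infer temporal columns without naming
# them, so the `date` column arrives as a Date dtype rather than a string.
df = pl.read_csv(url, try_parse_dates=True)
print(df.schema)
```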

* feat: Adds `_pl_read_json_roundtrip`

First mentioned in #3631 (comment)

Addresses most of the `polars` part of #3631 (comment)

* feat(DRAFT): Adds infer-based `altair.datasets.load`

Requested by @joelostblom in:
#3631 (comment)
#3631 (comment)

* refactor: Rename `Loader.with_backend` -> `Loader.from_backend`

#3631 (comment)

* feat(DRAFT): Add optional `backend` parameter for `load(...)`

Requested by @jonmmease
#3631 (comment)
#3631 (comment)

* feat(DRAFT): Adds `altair.datasets.url`

A dataframe package is still required for now.
Can later be adapted to fit the requirements of (#3631 (comment)).

Related:
- #3631 (comment)
- #3631 (comment)
- #3150 (reply in thread)

@mattijn, @joelostblom

* feat: Support `url(...)` without dependencies

#3631 (comment), #3631 (comment), #3631 (comment)

* fix(DRAFT): Don't generate csv on refresh

https://github.com/vega/altair/actions/runs/11942284568/job/33288974210?pr=3631

* test: Replace rogue `NotImplementedError`

https://github.com/vega/altair/actions/runs/11942364658/job/33289235198?pr=3631

* fix: Omit `.gz` last modification time header

Previously was creating a diff on every refresh, since the current time updated.
https://docs.python.org/3/library/gzip.html#gzip.GzipFile.mtime

https://github.com/vega/altair/actions/runs/11942284568/job/33288974210?pr=3631
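
A minimal sketch of the idea (file name and payload are hypothetical): pinning the mtime header makes the compressed bytes deterministic, so a refresh with identical content produces no diff.

```python
import gzip

payload = b"dataset_name,suffix\ncars,.json\n"  # stand-in content
# mtime=0 writes a constant timestamp into the gzip header, so the same
# input always compresses to byte-identical output.
with gzip.GzipFile("metadata.csv.gz", mode="wb", mtime=0) as f:
    f.write(payload)
```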

* docs: Add doc for `Application.write_csv_gzip`

* revert: Remove `"polars[pyarrow]" backend

Partially related to #3631 (comment)

After some thought, this backend didn't add support for any unique dependency configs.
I've only ever used `use_pyarrow=True` for `pl.DataFrame.write_parquet` to resolve an issue with invalid headers in `"polars<1.0.0;>=0.19.0"`

* test: Add a complex `xfail` for `test_load_call`

Doesn't happen in CI, still unclear why the import within `pandas` breaks under these conditions.
Have tried multiple combinations of `pytest.MonkeyPatch`, hard imports, but had no luck in fixing the bug

* refactor: Renaming/recomposing `_readers.py`

The next commits benefit from having functionality decoupled from `_Reader.query`.
Mainly, keeping things lazy and not raising a user-facing error

* build: Generate `VERSION_LATEST`

Simplifies logic that relies on enum/categoricals that may not be recognised as ordered

* feat: Adds `_cache.py` for `UrlCache`, `DatasetCache`

Docs to follow

* ci(ruff): Ignore `0.8.0` violations

#3687 (comment)

* fix: Use stable `narwhals` imports

narwhals-dev/narwhals#1426, #3693 (comment)

* revert(ruff): Ignore `0.8.0` violations

f21b52b

* revert: Remove `_readers._filter`

Feature has been adopted upstream in narwhals-dev/narwhals#1417

* feat: Adds example and tests for disabling caching

* refactor: Tidy up `DatasetCache`

* docs: Finish `Loader.cache`

Not using doctest style here, none of these return anything but I want them hinted at

* refactor(typing): Use `Mapping` instead of `dict`

Mutability is not needed.
Also see #3573

* perf: Use `to_list()` for all backends

narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment)

* feat(DRAFT): Utilize `datapackage` schemas in `pandas` backends

Provides a generalized solution to `pd.read_(csv|json)` requiring the names of date columns to attempt parsing.
cc @joelostblom

The solution is possible in large part thanks to vega/vega-datasets#631

#3631 (comment)
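
Roughly, the idea looks like this (hand-written Frictionless-style field list standing in for the one read from datapackage.json):

```python
import pandas as pd

# Illustrative schema fragment; the real one comes from datapackage.json.
fields = [
    {"name": "date", "type": "date"},
    {"name": "price", "type": "number"},
]
# Collect date-typed columns so pandas knows what to parse.
date_cols = [f["name"] for f in fields if f["type"] in {"date", "datetime"}]

url = "https://raw.githubusercontent.com/vega/vega-datasets/main/data/stocks.csv"
df = pd.read_csv(url, parse_dates=date_cols)
print(df.dtypes)
```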

* refactor(ruff): Apply `TC006` fixes in new code

Related #3706

* docs(DRAFT): Add notes on `datapackage.features_typing`

* docs: Update `Loader.from_backend` example w/ dtypes

Related 909e7d0

* feat: Use `_pl_read_json_roundtrip` instead of `pl.read_json` for `pyarrow`

Provides better dtype inference

* docs: Replace example dataset

Switching to one with a timestamp that `frictionless` recognises

https://github.com/vega/vega-datasets/blob/8745f5c61ba951fe057a42562b8b88604b4a3735/datapackage.json#L2674-L2689

https://github.com/vega/vega-datasets/blob/8745f5c61ba951fe057a42562b8b88604b4a3735/datapackage.json#L45-L57

* fix(ruff): resolve `RUF043` warnings

https://github.com/vega/altair/actions/runs/12439154550/job/34732432411?pr=3631

* build: run `generate-schema-wrapper`

https://github.com/vega/altair/actions/runs/12439184312/job/34732516789?pr=3631

* chore: update schemas

Changes from vega/vega-datasets#648

Currently pinned on `main` until `v3.0.0` introduces `datapackage.json`
https://github.com/vega/vega-datasets/tree/main

* feat(typing): Update `frictionless` model hierarchy

- Adds some incomplete types for fields (`sources`, `licenses`)
- Misc changes from vega/vega-datasets#651, vega/vega-datasets#643

* chore: Freeze all metadata

Mainly for `datapackage.json`, which is now temporarily stored un-transformed

Using version (vega/vega-datasets@7c2e67f)

* feat: Support and extract `hash` from `datapackage.json`

Related vega/vega-datasets#665

* feat: Build dataset url with `datapackage.json`

New column deviates from original approach, to support working from `main`

https://github.com/vega/altair/blob/e259fbabfc38c3803de0a952f7e2b081a22a3ba3/altair/datasets/_readers.py#L154

* revert: Removes `is_name_collision`

Not relevant following upstream change vega/vega-datasets#633

* build: Re-enable and generate `datapackage_features.parquet`

Eventually, will replace `metadata.parquet`
- But for a single version (current) only
- Paired with a **limited** `.csv.gz` version, to support cases where `.parquet` reading is not available (`pandas` w/o (`pyarrow`|`fastparquet`))

* feat: add temp `_Reader.*_dpkg` methods

- Will be replacing the non-suffixed versions
- Need to do this gradually as `tag` will likely be dropped
  - Breaking most of the tests

* test: Remove/replace all `tag` based tests

* revert: Remove all `tag` based features

* feat: Source version from `tool.altair.vega.vega-datasets`

* refactor(DRAFT): Migrate to `datapackage.json` only

Major switch from multiple github/npm endpoints -> a single file.
Was only possible following vega/vega-datasets#665

Still need to rewrite/fill out the `Metadata` doc, then moving onto features

* docs: Update `Metadata` example

* docs: Add missing descriptions to `Metadata`

* refactor: Renaming/reorganize in `tools/`

Mainly removing `Fl` prefix, as there is no confusion now `models.py` is purely `frictionless` structures

* test: Skip `is_image` datasets

* refactor: Make caching **opt-out**, use `$XDG_CACHE_HOME`

Caching is the more sensible default when considering a notebook environment
Using a standardised path now also https://specifications.freedesktop.org/basedir-spec/latest/#variables
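
For reference, a sketch of the standard resolution order (the function name is illustrative, not altair's API):

```python
import os
from pathlib import Path

def default_cache_dir(app: str = "altair") -> Path:
    # Honour $XDG_CACHE_HOME when set; fall back to ~/.cache per the
    # XDG Base Directory spec.
    base = os.environ.get("XDG_CACHE_HOME") or Path.home() / ".cache"
    return Path(base) / app

print(default_cache_dir())
```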

* refactor(typing): Add `_iter_results` helper

* feat(DRAFT): Replace `UrlCache` w/ `CsvCache`

Now that only a single version is supported, it is possible to mitigate the `pandas` case w/o `.parquet` support (#3631 (comment))

This commit adds the file and some tools needed to implement this - but I'll need to follow up with some more changes to integrate this into `_Reader`

* refactor: Misc reworking caching

- Made paths a `ClassVar`
- Removed unused `SchemaCache` methods
- Replace `_FIELD_TO_DTYPE` w/ `_DTYPE_TO_FIELD`
  - Only one variant is ever used
- Use a `SchemaCache` instance per-`pandas`-based reader
- Make fallback `csv_cache` initialization lazy
  - Only going to use the global when no dependencies found
  - Otherwise, instance-per-reader

* chore: Include `.parquet` in `metadata.csv.gz`

- Readable via url w/ `vegafusion` installed
- Currently no cases where a dataset has both `.parquet` and another extension

* feat: Extend `_extract_suffix` to support `Metadata`

Most subsequent changes are operating on this `TypedDict` directly, as it provides richer info for error handling

* refactor(typing): Simplify `Dataset` import

* fix: Convert `str` to correct types in `CsvCache`

* feat: Support `pandas` w/o a `.parquet` reader

* refactor: Reduce repetition w/ `_Reader._download`

* feat(DRAFT): `Metadata`-based error handling

- Adds `_exceptions.py` with some initial cases
- Renaming `result` -> `meta`
- Reduced the complexity of `_PyArrowReader`
- Generally, trying to avoid exceptions from 3rd parties - to allow suggesting an alternate path that may work

* chore(ruff): Remove unused `0.9.2` ignores

Related #3771

https://github.com/vega/altair/actions/runs/12810882256/job/35718940621?pr=3631

* refactor: clean up, standardize `_exceptions.py`

* test: Refactor decorators, test new errors

* docs: Replace outdated docs

- Using `load` instead of `data`
- Don't mention multi-versions, as that was dropped

* refactor: Clean up `tools.datasets`

- `Application.generate_typing` now mostly populated by `DataPackage` methods
- Docs are defined alongside expressions
- Factored out repetitive code into `spell_literal_alias`
- `Metadata` examples table is now generated inside the doc

* test: `test_datasets` overhaul

- Eliminated all flaky tests
- Mocking more of the internals that is safer to run in parallel
- Split out non-threadsafe tests with `@no_xdist`
- Huge performance improvement for the slower tests
- Added some helper functions (`is_*`) where common patterns were identified
- **Removed skipping from native `pandas` backend**
  - Confirms that its now safe without `pyarrow` installed

* refactor: Reuse `tools.fs` more, fix `app.(read|scan)`

Using only `.parquet` was relevant in earlier versions that produced multiple `.parquet` files
Now these methods safely handle all formats in use

* feat(typing): Set `"polars"` as default in `Loader.from_backend`

Without a default, I found that VSCode was always suggesting the **last** overload first (`"pyarrow"`)
This is a bad suggestion, as it provides the *worst native* experience.

The default now aligns with the backend providing the *best native* experience

* docs: Adds module-level doc to `altair.datasets`

- Multiple **brief** examples, for a taste of the public API
  - See (#3763)
- Refs to everywhere a first-time user may need help from
- Also aligned the (`Loader`|`load`) docs w/ each other and the new phrasing here

* test: Clean up `test_datasets`

- Reduce superfluous docs
- Format/reorganize remaining docs
- Follow up on some comments
- Misc style changes

* docs: Make `sphinx` happy with docs

These changes are very minor in VSCode, but fix a lot of rendering issues on the website

* refactor: Add `find_spec` fastpath to `is_available`

Have a lot of changes locally that use `find_spec`, but would prefer a single name associated with this action
The actual spec is never relevant for this usage

* feat(DRAFT): Private API overhaul

**Public API is unchanged**
Core changes are to simplify testing and extension:

- `_readers.py` -> `_reader.py`
  - w/ two new support modules `_constraints`, and `_readimpl`
- Functions (`BaseImpl`) are declared with what they support (`include`) and restrictions (`exclude`) on that subset
  - Transforms a lot of the imperative logic into set operations
- Greatly improved `pyarrow` support
  - Utilize schema
  - Provides additional fallback `.json` implementations
  - `_stdlib_read_json_to_arrow` finally resolves `"movies.json"` issue

* refactor: Simplify obsolete paths in `CsvCache`

They were an artifact of *previously* using multiple `vega-dataset` versions in `.parquet` - but only the most recent in `.csv.gz`

Currently both store the same range of names, so this error handling never triggered

* chore: add workaround for `narwhals` bug

Opened (narwhals-dev/narwhals#1897)
Marking (#3631 (comment)) as resolved

* feat(typing): replace `(Read|Scan)Impl` classes with aliases

- Shorter names `Read`, `Scan`
- The single unique method is now `into_scan`
- There was no real need to have concrete classes when they behave the same as the parent

* feat: Rename, docs `unwrap_or` -> `unwrap_or_skip`

* refactor: Replace `._contents` w/ `.__str__()`

Inspired by https://github.com/pypa/packaging/blob/8510bd9d3bab5571974202ec85f6ef7b0359bfaf/src/packaging/requirements.py#L67-L71

* fix: Use correct type for `pyarrow.csv.read_csv`

Resolves:
```py
File ../altair/.venv/Lib/site-packages/pyarrow/csv.pyx:1258, in pyarrow._csv.read_csv()
TypeError: Cannot convert dict to pyarrow._csv.ParseOptions
```
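
The shape of the fix, on toy data (not the altair call site): pyarrow wants option objects rather than plain dicts.

```python
from io import BytesIO

import pyarrow.csv as pa_csv

data = BytesIO(b"a,b\n1,2\n")
# Passing a ParseOptions instance avoids the dict-conversion TypeError above.
table = pa_csv.read_csv(data, parse_options=pa_csv.ParseOptions(delimiter=","))
print(table)
```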

* docs: Add docs for `Read`, `Scan`, `BaseImpl`

* docs: Clean up `_merge_kwds`, `_solve`

* refactor(typing): Include all suffixes in `Extension`

Also simplifies and removes outdated `Extension`-related tooling

* feat: Finish `Reader.profile`

- Reduced the scope a bit, now just un/supported
- Added `pprint` option
- Finished docs, including example pointing to use `url(...)`

* test: Use `Reader.profile` in `is_polars_backed_pyarrow`

* feat: Clean up, add tests for new exceptions

* feat: Adds `Reader.open_markdown`

- Will be even more useful after merging vega/vega-datasets#663
- Thinking this is a fair tradeoff vs inlining the descriptions into `altair`
  - All the info is available and it is quicker than manually searching the headings in a browser

* docs: fix typo

Resolves #3631 (comment)

* fix: fix typo in error message

#3631 (comment)

* refactor: utilize narwhals fix

narwhals-dev/narwhals#1934

* refactor: utilize `nw.Implementation.from_backend`

See narwhals-dev/narwhals#1888

* feat(typing): utilize `nw.LazyFrame` working `TypeVar`

Possible since narwhals-dev/narwhals#1930

@MarcoGorelli if you're interested what that PR did (besides fix warnings 😉)

* docs: Show less data in examples

* feat: Update for `vega-datasets@3.0.0-alpha.1`

Made possible via vega/vega-datasets#681

- Removes temp files
- Removes some outdated apis
- Remove test based on removed `"points"` dataset

* refactor: replace `SchemaCache.schema_pyarrow` -> `nw.Schema.to_arrow`

Related
- narwhals-dev/narwhals#1924
- #3631 (comment)

* feat(typing): Properly annotate `dataset_name`, `suffix`

Makes more sense following (755ab4f)

* chore: bump `vega-datasets==3.1.0`

* test(typing): Ignore `_pytest` imports for `pyright`

See microsoft/pyright#10248 (comment)

* feat: Basic `geopandas` impl

Still need to update tests

* fix: Add missing `v` prefix to url

* test: Update `test_spatial`

* ci: Try pinning locked `ruff`

https://github.com/vega/altair/actions/runs/14478364865/job/40609439929

* ci(uv): Add `--group geospatial`

* chore: Reduce `geopandas` pin

* feat: Basic `polars-st` impl

- Seems to work pretty similarly to `geopandas`
- The repr isn't as clean
- Pretty cool that you can get *something* from `load("us-10m").st.plot()`

* ci(typing): `mypy` ignore `polars-st`

https://github.com/vega/altair/actions/runs/14494920661/job/40660098022?pr=3631

* build against vega-datasets 3.2.0

* run generate-schema-wrapper

* prevent infinite recursion in _split_markers

* sync to v6

* resolve doctest on lower python versions

* resolve comment in github action

* changed examples to modern interface to pass docbuild

---------

Co-authored-by: dangotbanned <125183946+dangotbanned@users.noreply.github.com>