
docs: Add missing descriptions, sources, and licenses #663

Merged
dsmedia merged 59 commits into vega:main from dsmedia:add-missing-metadata
Feb 2, 2025

Conversation

@dsmedia
Collaborator

@dsmedia dsmedia commented Jan 12, 2025

Pull Request: Add missing descriptions, sources, and licenses to datapackage files

Objective:

Following up on the metadata infrastructure work in #634, #639, and #646, this PR adds missing description, source, and license metadata to datapackage_additions.toml. The checklist below indicates which metadata entries are missing and still need to be added. Schema entries (describing each dataset column) will be handled in a subsequent pull request.

Dataset descriptions are written to avoid formulations such as "This dataset...", consistent with PEP 257 – Docstring Conventions.

✅ PR Merged - 🚧 Work Remaining

Descriptions were added for all datasets.
Resources still missing sources:

  • flare-dependencies.json, flare.json, movies.json, sp500.csv, stocks.csv, udistrict.json, weekly-weather.json, windvectors.csv

Resources still missing licenses:

  • anscombe.json, driving.json, flare-dependencies.json, flare.json, la-riots.csv, movies.json, ohlc.json, platformer-terrain.json, sp500-2000.csv, sp500.csv, stocks.csv, udistrict.json, volcano.json, weekly-weather.json, windvectors.csv

Open Questions: Complete

  • Decide how to handle license information that cannot be determined after research, ensuring validation against the Frictionless standard for licenses.
  • [ ] Add a disclaimer about license information, deferring it to a separate PR
  • Regenerated datapackage.json and datapackage.md

Status:

The following checklist indicates the completion status of the description, sources, and licenses metadata for each dataset. A green checkmark (✅) indicates the metadata is present; a red X (❌) indicates the metadata is missing. The leading checkbox is only checked if all three types of metadata are present.

Process:
Changes are validated using scripts/build_datapackage.py, which generates machine-readable metadata describing the contents of /data/.
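
To illustrate the kind of check involved, here is a minimal sketch (not the actual build_datapackage.py logic) that verifies a resource entry carries all three metadata types. The TOML layout and license values are illustrative assumptions rather than copies from datapackage_additions.toml; field names follow the Frictionless Data Package spec.

```python
import tomllib  # Python 3.11+

# Hypothetical entry mimicking a datapackage_additions.toml resource.
entry = """
[[resources]]
path = "normal-2d.json"
description = "Simulated bivariate normal sample."

[[resources.sources]]
title = "Generated for vega-datasets"

[[resources.licenses]]
name = "BSD-3-Clause"
title = "BSD 3-Clause License"
"""

for resource in tomllib.loads(entry)["resources"]:
    # The three metadata types tracked in this PR's checklist.
    missing = [k for k in ("description", "sources", "licenses") if k not in resource]
    print(resource["path"], "✅" if not missing else f"❌ missing: {', '.join(missing)}")
```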

Legend:

  • ✅ - Indicates the metadata is present
  • ❌ - Indicates the metadata is missing
  • [x] - Indicates all three types of metadata are present
  • [ ] - Indicates one or more types of metadata are missing

Checklist:

Update datasets.toml with missing source metadata for 7zip.png dataset
- adds citation to protovis in description
- fixes link to image in sources
- adds license
- fixes bad link in annual-precip.json; adds license
- adds license to birdstrikes.csv, budget.json, burtin.json, and cars.json
- expands description to explain units and seasonal adjustment
- adds additional source directly to dataset csv
- adds license details from source
- expands description
- adds license
- Document that data used in flights* datasets are collected under US DOT requirements
- Add row counts to flight dataset descriptions (2k-3M rows)
- Note regulatory basis (14 CFR Part 234) while acknowledging unclear license terms
- adds license for londonBoroughs.json
- adds sources, license for londonCentroids.json (itself derived from londonBoroughs.json)
- expands description, corrects source URL, updates source title, and adds license for londonTubeLines.json
- global-temp.csv
- iowa-electricity.csv
- jobs.json
- monarchs.json
- political-contributions.json (also updates link to FEC github), note that FEC provides an explicit underlying license
- population_engineers_hurricanes.csv
- seattle-weather-hourly-normals.csv
- seattle-weather.csv
- unemployment-across-industries.json
- unemployment.tsv
- us-employment.csv
- weather.csv

Note that many pages hosting US government datasets do not explicitly grant a license. As a result, when there is doubt, a link is provided to the USA government works page, which explains the nuances of licenses for data on US government web sites.
- adds license (football.json, la-riots.csv, penguins.json, platformer-terrain.json, population.json, sp500-2000.csv, sp500.csv, volcano.json)
- airports.csv (adds description, sources, license)
- barley.csv (updates description and source; adds license)
- disasters.csv (expands description, updates sources, add license)
- driving.json (adds description, updates source, adds license)
- ohlc.json (modifies description, adds additional source, and license)
- stocks.csv (adds source, license)
- weekly-weather.json (adds source, license)
- windvectors.csv (adds source, license)
- updates description, adds sources and
- adds description, source and license
- makes license title of U.S. Government Datasets consistent for cases where specific license terms are undetermined
- focuses on how data is used in edge bundling example
- would benefit from additional detail in the description
- corrects description, adds source, license
- ffox.png (updates license)
- gapminder.json (adds license)
- gimp.png (updates description, adds source, license)
- github.csv (adds description, source, license)
- lookup_groups.csv, lookup_people.csv (adds description, source, license)
- miserables.json (adds description, source, license)
- movies.json (adds source, license)
- normal-2d.json (adds description, source, license)
- stocks.csv (adds description)
@dsmedia
Collaborator Author

dsmedia commented Jan 20, 2025

Here is the code to validate the statistical description of normal-2d.json added in a86b9dd.

```python
import pandas as pd
from scipy import stats

# Read data
df = pd.read_json("https://raw.githubusercontent.com/vega/vega-datasets/main/data/normal-2d.json")

# Generate key statistics
stats_output = {
    "Sample size": len(df),
    "Means": df.mean().round(3).to_dict(),
    "Standard deviations": df.std().round(3).to_dict(),
    "Correlation": round(df.corr().iloc[0, 1], 3),
    "Ranges": {col: [round(df[col].min(), 3), round(df[col].max(), 3)] for col in df.columns},
    "Normality p-values": {col: round(stats.normaltest(df[col]).pvalue, 3) for col in df.columns}
}

# Print statistics
print("Dataset Statistics for Description:")
print(f"Sample size: {stats_output['Sample size']} points")
print(f"Centers: {stats_output['Means']}")
print(f"Standard deviations: {stats_output['Standard deviations']}")
print(f"Correlation: {stats_output['Correlation']}")
print(f"Ranges: {stats_output['Ranges']}")
print(f"Normality test p-values: {stats_output['Normality p-values']}")
```

...which produced:

```
Dataset Statistics for Description:
Sample size: 500 points
Centers: {'u': 0.005, 'v': -0.011}
Standard deviations: {'u': 0.192, 'v': 0.199}
Correlation: 0.026
Ranges: {'u': [-0.578, 0.533], 'v': [-0.534, 0.606]}
Normality test p-values: {'u': 0.68, 'v': 0.763}
```

dsmedia and others added 10 commits February 1, 2025 12:50
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
- runs build_datapackage.py to verify
- source and license can be clarified in a future PR
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
@dsmedia
Collaborator Author

dsmedia commented Feb 1, 2025

> Some really great stuff in here @dsmedia!
>
> Let me know if you need any clarifications

Have gone through and addressed all I believe - appreciate the diligent review @dangotbanned!

@dangotbanned
Member

> Some really great stuff in here @dsmedia!
> Let me know if you need any clarifications
>
> Have gone through and addressed all I believe - appreciate the diligent review @dangotbanned!

Thanks @dsmedia, hopefully will have a chance to look it over tomorrow

@dangotbanned
Member


@dsmedia all the new changes are great 🎉

Reading through the new datapackage.md (https://github.com/vega/vega-datasets/blob/dbecc75a57d3a5cc0bbeea1861b0254d3be4e15c/datapackage.md), some more things stood out to me.
Hopefully they should be easy to resolve

- github.csv: move time range to schema
- add categories to schema in seattle-weather.csv
- sp500.csv, udistrict.json, uniform-2d, weather.json: move description content into schema
- reformat usgs disclaimer in us-state-capitals.json
- rerun build_datapackage.py
@dangotbanned
Member

Thanks @dsmedia, feel free to merge whenever you're ready (and #672)

@dsmedia dsmedia merged commit 06a3ade into vega:main Feb 2, 2025
2 checks passed
@dsmedia dsmedia deleted the add-missing-metadata branch February 2, 2025 15:01
mattijn added a commit to vega/altair that referenced this pull request Jul 11, 2025
* feat: Adds `.arrow` support

To support [flights-200k.arrow](https://github.com/vega/vega-datasets/blob/f637f85f6a16f4b551b9e2eb669599cc21d77e69/data/flights-200k.arrow)

* feat: Add support for caching metadata

* feat: Support env var `VEGA_GITHUB_TOKEN`

Not required for these requests, but may be helpful to avoid limits

* feat: Add support for multi-version metadata

As an example, for comparing against the most recent release, I've added the 5 most recent

* refactor: Renaming, docs, reorganize

* feat: Support collecting release tags

See https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repository-tags

* feat: Adds `refresh_tags`

- Basic mechanism for discovering new versions
- Tries to minimise the number and total size of requests

* feat(DRAFT): Adds `url_from`

Experimenting with querying the url cache w/ expressions

* fix: Wrap all requests with auth

* chore: Remove `DATASET_NAMES_USED`

* feat: Major `GitHub` rewrite, handle rate limiting

- `metadata_full.parquet` stores **all known** file metadata
- `GitHub.refresh()` to maintain integrity in a safe manner
- Roughly 3000 rows
- Single release: **9kb** vs 46 releases: **21kb**

* feat(DRAFT): Partial implement `data("name")`

* fix(typing): Resolve some `mypy` errors

* fix(ruff): Apply `3.8` fixes

https://github.com/vega/altair/actions/runs/11495437283/job/31994955413

* docs(typing): Add `WorkInProgress` marker to `data(...)`

- Still undecided exactly how this functionality should work
- Need to resolve `npm` tags != `gh` tags issue as well

* feat(DRAFT): Add a source for available `npm` versions

* refactor: Bake `"v"` prefix into `tags_npm`

* refactor: Move `_npm_metadata` into a class

* chore: Remove unused, add todo

* feat: Adds `app` context for github<->npm

* fix: Invalidate old trees

* chore: Remove early test files

* refactor: Rename `metadata_full` -> `metadata`

Suffix was only added due to *now-removed* test files

* refactor: `tools.vendor_datasets` -> `tools.datasets` package

Will be following up with some more splitting into composite modules

* refactor: Move `TypedDict`, `NamedTuple`(s) -> `datasets.models`

* refactor: Move, rename `semver`-related tools

* refactor: Remove `write_schema` from `_Npm`, `_GitHub`

Handled in `Application` now

* refactor: Rename, split `_Npm`, `_GitHub` into own modules

`tools.datasets.npm` will later be performing the requests that are in `Dataset.__call__` currently

* refactor: Move `DataLoader.__call__` -> `DataLoader.url()`

-`data.name()` -> `data(name)`
- `data.name.url` -> `data.url(name)`

* feat(typing): Generate annotations based on known datasets

* refactor(typing): Utilize `datasets._typing`

* feat: Adds `Npm.dataset` for remote reading

* refactor: Remove dead code

* refactor: Replace `name_js`, `name_py` with `dataset_name`

Since we're just using strings, there is no need for 2 forms of the name.
The legacy package needed this for `__getattr__` access with valid identifiers

* fix: Remove invalid `semver.sort` op

I think this was added in error, since the schema of the file never had `semver` columns

Only noticed the bug when doing a full rebuild

* fix: Add missing init path for `refresh_trees`

* refactor: Move public interface to `_io`

Temporary home, see module docstring

* refactor(perf): Don't recreate path mapping on every attribute access

* refactor: Split `Reader._url_from` into `url`, `_query`

- Much more generic now in what it can be used for
- For the caching, I'll need more columns than just `"url_npm"`
- `"url_github" contains a hash

* feat(DRAFT): Adds `GitHubUrl.BLOBS`

- Common prefix to all rows in `metadata[url_github]`
- Stripping this leaves only `sha`
- For **2800** rows, there are only **109** unique hashes, so these can be used to reduce cache size

* feat: Store `sha` instead of `github_url`

Related 661a385

* feat(perf): Adds caching to `ALTAIR_DATASETS_DIR`

* feat(DRAFT): Adds initial generic backends

* feat: Generate and move `Metadata` (`TypedDict`) to `datasets._typing`

* feat: Adds optional backends, `polars[pyarrow]`, `with_backend`

* feat: Adds `pyarrow` backend

* docs: Update `.with_backend()`

* chore: Remove `duckdb` comment

Not planning to support this anymore, requires `fsspec` which isn't in `dev`

```
InvalidInputException
Traceback (most recent call last)
Cell In[6], line 5
       3 with duck._reader._opener.open(url) as f:
       4     fn = duck._reader._read_fn['.json']
----> 5     thing = fn(f.read())

InvalidInputException: Invalid Input Error: This operation could not be completed because required module 'fsspec' is not installed"
```

* ci(typing): Add `pyarrow-stubs` to `dev` dependencies

Will put this in another PR, but need it here for IDE support

* refactor: `generate_datasets_typing` -> `Application.generate_typing`

* refactor: Split `datasets` into public/private packages

- `tools.datasets`: Building & updating metadata file(s), generating annotations
- `altair.datasets`: Consuming metadata, remote & cached dataset management

* refactor: Provide `npm` url to `GitHub(...)`

* refactor: Rename `ext` -> `suffix`

* refactor: Remove unimplemented `tag="latest"`

Since `metadata.parquet` is sorted, this was already the behavior when not providing a tag

* feat: Rename `_datasets_dir`, make configurable, add docs

Still on the fence about `Loader.cache_dir` vs `Loader.cache`

* docs: Adds examples to `Loader.with_backend`

* refactor: Clean up requirements -> imports

* docs: Add basic example to `Loader` class

Also incorporates changes from previous commit into `__repr__`
4a2a2e0

* refactor: Reorder `alt.datasets` module

* docs: Fill out `Loader.url`

* feat: Adds `_Reader._read_metadata`

* refactor: Rename `(reader|scanner_from()` -> `(read|scan)_fn()`

* refactor(typing): Replace some explicit casts

* refactor: Shorten and document request delays

* feat(DRAFT): Make `[tag]` a `pl.Enum`

* fix: Handle `pyarrow` scalars conversion

* test: Adds `test_datasets`

Initially quite basic; need to add more parametrization and test caching

* fix(DRAFT): hotfix `pyarrow` read

* fix(DRAFT): Treat `polars` as exception, invalidate cache

Possibly fix https://github.com/vega/altair/actions/runs/11768349827/job/32778071725?pr=3631

* test: Skip `pyarrow` tests on `3.9`

Forgot that this gets uninstalled in CI
https://github.com/vega/altair/actions/runs/11768424121/job/32778234026?pr=3631

* refactor: Tidy up changes from last 4 commits

- Rename and properly document "file-like object" handling
  - Also made a bit clearer what is being called and when
- Use a more granular approach to skipping in `@backends`
  - Previously, everything was skipped regardless of whether it required `pyarrow`
  - Now, `polars`, `pandas` **always** run - with `pandas` expected to fail
- I had to clean up `skip_requires_pyarrow` to make it compatible with `pytest.param`
  - It has a runtime check for if `MarkDecorator`, instead of just a callable

bb7bc17, ebc1bfa, fe0ae88,
7089f2a

* refactor: Rework `_readers.py`

- Moved `_Reader._metadata` -> module-level constant `_METADATA`.
  - It was never modified and is based on the relative directory of this module
- Generally improved the readability with more method-chaining (less assignment)
- Renamed, improved doc `_filter_reduce` -> `_parse_predicates_constraints`

* test: Adds tests for missing dependencies

* test: Adds `test_dataset_not_found`

* test: Adds `test_reader_cache`

* docs: Finish `_Reader`, fill parameters of `Loader.__call__`

Still need examples for `Loader.__call__`

* refactor: Rename `backend` -> `backend_name`, `get_backend` -> `backend`

`get_` was the wrong term since it isn't a free operation

* fix(DRAFT): Add multiple fallbacks for `pyarrow` JSON

* test: Remove `pandas` fallback for `pyarrow`

There are enough alternatives here, it only added complexity

* test: Adds `test_all_datasets`

Disabled by default, since there are 74 datasets

* refactor: Remove `_Reader._response`

Can't reproduce the original issue that led to adding this.
All backends support `HTTPResponse` directly

* fix: Correctly handle no remote connection

Previously, `Path.touch()` appeared to be a cache-hit - despite being an empty file.
- Fixes that bug
- Adds tests

* docs: Align `_typing.Metadata` and `Loader.(url|__call__)` descriptions

Related c572180

* feat: Update to `v2.10.0`, fix tag inconsistency

- Noticed one branch that missed the join to `npm`
  - Moved the join to `.tags()` and added a doc
- https://github.com/vega/vega-datasets/releases/tag/v2.10.0

* refactor: Tidying up `tools.datasets`

* revert: Remove tags schema files

* ci: Introduce `datasets` refresh to `generate_schema_wrapper`

Unrelated to schema, but needs to hook in somewhere

* docs: Add `tools.datasets.Application` doc

* revert: Remove comment

* docs: Add a table preview to `Metadata`

* docs: Add examples for `Loader.__call__`

* refactor: Rename `DatasetName` -> `Dataset`, `VersionTag` -> `Version`

* fix: Ensure latest `[tag]` appears first

When updating from `v2.9.0` -> `v2.10.0`, new tags were appended to the bottom.
This invalidated an assumption in `Loader.(dataset|url)` that the first result is the latest

* refactor: Misc `models.py` updates

- Remove unused `ParsedTreesResponse`
- Align more of the doc style
- Rename `ReParsedTag` -> `SemVerTag`

* docs: Update `tools.datasets.__init__.py`

* test: Fix `@datasets_debug` selection

Wasn't being recognised by `-m not datasets_debug` and always ran

* test: Add support for overrides in `test_all_datasets`

vega/vega-datasets#627

* test: Adds `test_metadata_columns`

* fix: Warn instead of raise for hit rate limit

There should be enough handling elsewhere to stop requesting

https://github.com/vega/altair/actions/runs/11823002117/job/32941324941#step:8:102

* feat: Update for `v2.11.0`

https://github.com/vega/vega-datasets/releases/tag/v2.11.0
Includes support for `.parquet` following:
- vega/vega-datasets#628
- vega/vega-datasets#627

* feat: Always use `pl.read_csv(try_parse_dates=True)`

Related #3631 (comment)
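
As a quick illustration of what that flag buys (dataset URL chosen for the example, not taken from the commit):

```python
import polars as pl

url = "https://raw.githubusercontent.com/vega/vega-datasets/main/data/seattle-weather.csv"
# try_parse_dates=True lets polars infer temporal columns without naming
# them, so the `date` column arrives as a Date dtype rather than a string.
df = pl.read_csv(url, try_parse_dates=True)
print(df.schema)
```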

* feat: Adds `_pl_read_json_roundtrip`

First mentioned in #3631 (comment)

Addresses most of the `polars` part of #3631 (comment)

* feat(DRAFT): Adds infer-based `altair.datasets.load`

Requested by @joelostblom in:
#3631 (comment)
#3631 (comment)

* refactor: Rename `Loader.with_backend` -> `Loader.from_backend`

#3631 (comment)

* feat(DRAFT): Add optional `backend` parameter for `load(...)`

Requested by @jonmmease
#3631 (comment)
#3631 (comment)

* feat(DRAFT): Adds `altair.datasets.url`

A dataframe package is still required for now.
Can later be adapted to fit the requirements of (#3631 (comment)).

Related:
- #3631 (comment)
- #3631 (comment)
- #3150 (reply in thread)

@mattijn, @joelostblom

* feat: Support `url(...)` without dependencies

#3631 (comment), #3631 (comment), #3631 (comment)

* fix(DRAFT): Don't generate csv on refresh

https://github.com/vega/altair/actions/runs/11942284568/job/33288974210?pr=3631

* test: Replace rogue `NotImplementedError`

https://github.com/vega/altair/actions/runs/11942364658/job/33289235198?pr=3631

* fix: Omit `.gz` last modification time header

Previously was creating a diff on every refresh, since the current time updated.
https://docs.python.org/3/library/gzip.html#gzip.GzipFile.mtime

https://github.com/vega/altair/actions/runs/11942284568/job/33288974210?pr=3631
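
A minimal sketch of the idea (file name and payload are hypothetical): pinning the mtime header makes the compressed bytes deterministic, so a refresh with identical content produces no diff.

```python
import gzip

payload = b"dataset_name,suffix\ncars,.json\n"  # stand-in content
# mtime=0 writes a constant timestamp into the gzip header, so the same
# input always compresses to byte-identical output.
with gzip.GzipFile("metadata.csv.gz", mode="wb", mtime=0) as f:
    f.write(payload)
```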

* docs: Add doc for `Application.write_csv_gzip`

* revert: Remove `"polars[pyarrow]" backend

Partially related to #3631 (comment)

After some thought, this backend didn't add support for any unique dependency configs.
I've only ever used `use_pyarrow=True` for `pl.DataFrame.write_parquet` to resolve an issue with invalid headers in `"polars<1.0.0;>=0.19.0"`

* test: Add a complex `xfail` for `test_load_call`

Doesn't happen in CI, still unclear why the import within `pandas` breaks under these conditions.
Have tried multiple combinations of `pytest.MonkeyPatch`, hard imports, but had no luck in fixing the bug

* refactor: Renaming/recomposing `_readers.py`

The next commits benefit from having functionality decoupled from `_Reader.query`.
Mainly, keeping things lazy and not raising a user-facing error

* build: Generate `VERSION_LATEST`

Simplifies logic that relies on enum/categoricals that may not be recognised as ordered

* feat: Adds `_cache.py` for `UrlCache`, `DatasetCache`

Docs to follow

* ci(ruff): Ignore `0.8.0` violations

#3687 (comment)

* fix: Use stable `narwhals` imports

narwhals-dev/narwhals#1426, #3693 (comment)

* revert(ruff): Ignore `0.8.0` violations

f21b52b

* revert: Remove `_readers._filter`

Feature has been adopted upstream in narwhals-dev/narwhals#1417

* feat: Adds example and tests for disabling caching

* refactor: Tidy up `DatasetCache`

* docs: Finish `Loader.cache`

Not using doctest style here, none of these return anything but I want them hinted at

* refactor(typing): Use `Mapping` instead of `dict`

Mutability is not needed.
Also see #3573

* perf: Use `to_list()` for all backends

narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment)

* feat(DRAFT): Utilize `datapackage` schemas in `pandas` backends

Provides a generalized solution to `pd.read_(csv|json)` requiring the names of date columns to attempt parsing.
cc @joelostblom

The solution is possible in large part thanks to vega/vega-datasets#631

#3631 (comment)
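
Roughly, the idea looks like this (hand-written Frictionless-style field list standing in for the one read from datapackage.json):

```python
import pandas as pd

# Illustrative schema fragment; the real one comes from datapackage.json.
fields = [
    {"name": "date", "type": "date"},
    {"name": "price", "type": "number"},
]
# Collect date-typed columns so pandas knows what to parse.
date_cols = [f["name"] for f in fields if f["type"] in {"date", "datetime"}]

url = "https://raw.githubusercontent.com/vega/vega-datasets/main/data/stocks.csv"
df = pd.read_csv(url, parse_dates=date_cols)
print(df.dtypes)
```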

* refactor(ruff): Apply `TC006` fixes in new code

Related #3706

* docs(DRAFT): Add notes on `datapackage.features_typing`

* docs: Update `Loader.from_backend` example w/ dtypes

Related 909e7d0

* feat: Use `_pl_read_json_roundtrip` instead of `pl.read_json` for `pyarrow`

Provides better dtype inference

* docs: Replace example dataset

Switching to one with a timestamp that `frictionless` recognises

https://github.com/vega/vega-datasets/blob/8745f5c61ba951fe057a42562b8b88604b4a3735/datapackage.json#L2674-L2689

https://github.com/vega/vega-datasets/blob/8745f5c61ba951fe057a42562b8b88604b4a3735/datapackage.json#L45-L57

* fix(ruff): resolve `RUF043` warnings

https://github.com/vega/altair/actions/runs/12439154550/job/34732432411?pr=3631

* build: run `generate-schema-wrapper`

https://github.com/vega/altair/actions/runs/12439184312/job/34732516789?pr=3631

* chore: update schemas

Changes from vega/vega-datasets#648

Currently pinned on `main` until `v3.0.0` introduces `datapackage.json`
https://github.com/vega/vega-datasets/tree/main

* feat(typing): Update `frictionless` model hierarchy

- Adds some incomplete types for fields (`sources`, `licenses`)
- Misc changes from vega/vega-datasets#651, vega/vega-datasets#643

* chore: Freeze all metadata

Mainly for `datapackage.json`, which is now temporarily stored un-transformed

Using version (vega/vega-datasets@7c2e67f)

* feat: Support and extract `hash` from `datapackage.json`

Related vega/vega-datasets#665

* feat: Build dataset url with `datapackage.json`

New column deviates from original approach, to support working from `main`

https://github.com/vega/altair/blob/e259fbabfc38c3803de0a952f7e2b081a22a3ba3/altair/datasets/_readers.py#L154

* revert: Removes `is_name_collision`

Not relevant following upstream change vega/vega-datasets#633

* build: Re-enable and generate `datapackage_features.parquet`

Eventually, will replace `metadata.parquet`
- But for a single version (current) only
- Paired with a **limited** `.csv.gz` version, to support cases where `.parquet` reading is not available (`pandas` w/o (`pyarrow`|`fastparquet`))

* feat: add temp `_Reader.*_dpkg` methods

- Will be replacing the non-suffixed versions
- Need to do this gradually as `tag` will likely be dropped
  - Breaking most of the tests

* test: Remove/replace all `tag` based tests

* revert: Remove all `tag` based features

* feat: Source version from `tool.altair.vega.vega-datasets`

* refactor(DRAFT): Migrate to `datapackage.json` only

Major switch from multiple github/npm endpoints -> a single file.
Was only possible following vega/vega-datasets#665

Still need to rewrite/fill out the `Metadata` doc, then moving onto features

* docs: Update `Metadata` example

* docs: Add missing descriptions to `Metadata`

* refactor: Renaming/reorganize in `tools/`

Mainly removing `Fl` prefix, as there is no confusion now `models.py` is purely `frictionless` structures

* test: Skip `is_image` datasets

* refactor: Make caching **opt-out**, use `$XDG_CACHE_HOME`

Caching is the more sensible default when considering a notebook environment
Using a standardised path now also https://specifications.freedesktop.org/basedir-spec/latest/#variables
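
For reference, a sketch of the standard resolution order (the function name is illustrative, not altair's API):

```python
import os
from pathlib import Path

def default_cache_dir(app: str = "altair") -> Path:
    # Honour $XDG_CACHE_HOME when set; fall back to ~/.cache per the
    # XDG Base Directory spec.
    base = os.environ.get("XDG_CACHE_HOME") or Path.home() / ".cache"
    return Path(base) / app

print(default_cache_dir())
```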

* refactor(typing): Add `_iter_results` helper

* feat(DRAFT): Replace `UrlCache` w/ `CsvCache`

Now that only a single version is supported, it is possible to mitigate the `pandas` case w/o `.parquet` support (#3631 (comment))

This commit adds the file and some tools needed to implement this - but I'll need to follow up with some more changes to integrate this into `_Reader`

* refactor: Misc reworking caching

- Made paths a `ClassVar`
- Removed unused `SchemaCache` methods
- Replace `_FIELD_TO_DTYPE` w/ `_DTYPE_TO_FIELD`
  - Only one variant is ever used
- Use a `SchemaCache` instance per-`pandas`-based reader
- Make fallback `csv_cache` initialization lazy
  - Only going to use the global when no dependencies found
  - Otherwise, instance-per-reader

* chore: Include `.parquet` in `metadata.csv.gz`

- Readable via url w/ `vegafusion` installed
- Currently no cases where a dataset has both `.parquet` and another extension

* feat: Extend `_extract_suffix` to support `Metadata`

Most subsequent changes are operating on this `TypedDict` directly, as it provides richer info for error handling

* refactor(typing): Simplify `Dataset` import

* fix: Convert `str` to correct types in `CsvCache`

* feat: Support `pandas` w/o a `.parquet` reader

* refactor: Reduce repetition w/ `_Reader._download`

* feat(DRAFT): `Metadata`-based error handling

- Adds `_exceptions.py` with some initial cases
- Renaming `result` -> `meta`
- Reduced the complexity of `_PyArrowReader`
- Generally, trying to avoid exceptions from 3rd parties - to allow suggesting an alternate path that may work

* chore(ruff): Remove unused `0.9.2` ignores

Related #3771

https://github.com/vega/altair/actions/runs/12810882256/job/35718940621?pr=3631

* refactor: clean up, standardize `_exceptions.py`

* test: Refactor decorators, test new errors

* docs: Replace outdated docs

- Using `load` instead of `data`
- Don't mention multi-versions, as that was dropped

* refactor: Clean up `tools.datasets`

- `Application.generate_typing` now mostly populated by `DataPackage` methods
- Docs are defined alongside expressions
- Factored out repetitive code into `spell_literal_alias`
- `Metadata` examples table is now generated inside the doc

* test: `test_datasets` overhaul

- Eliminated all flaky tests
- Mocking more of the internals that is safer to run in parallel
- Split out non-threadsafe tests with `@no_xdist`
- Huge performance improvement for the slower tests
- Added some helper functions (`is_*`) where common patterns were identified
- **Removed skipping from native `pandas` backend**
  - Confirms that its now safe without `pyarrow` installed

* refactor: Reuse `tools.fs` more, fix `app.(read|scan)`

Using only `.parquet` was relevant in earlier versions that produced multiple `.parquet` files
Now these methods safely handle all formats in use

* feat(typing): Set `"polars"` as default in `Loader.from_backend`

Without a default, I found that VSCode was always suggesting the **last** overload first (`"pyarrow"`)
This is a bad suggestion, as it provides the *worst native* experience.

The default now aligns with the backend providing the *best native* experience

* docs: Adds module-level doc to `altair.datasets`

- Multiple **brief** examples, for a taste of the public API
  - See (#3763)
- Refs to everywhere a first-time user may need help from
- Also aligned the (`Loader`|`load`) docs w/ each other and the new phrasing here

* test: Clean up `test_datasets`

- Reduce superfluous docs
- Format/reorganize remaining docs
- Follow up on some comments
- Misc style changes

* docs: Make `sphinx` happy with docs

These changes are very minor in VSCode, but fix a lot of rendering issues on the website

* refactor: Add `find_spec` fastpath to `is_available`

Have a lot of changes locally that use `find_spec`, but would prefer a single name associated with this action
The actual spec is never relevant for this usage

* feat(DRAFT): Private API overhaul

**Public API is unchanged**
Core changes are to simplify testing and extension:

- `_readers.py` -> `_reader.py`
  - w/ two new support modules `_constraints`, and `_readimpl`
- Functions (`BaseImpl`) are declared with what they support (`include`) and restrictions (`exclude`) on that subset
  - Transforms a lot of the imperative logic into set operations
- Greatly improved `pyarrow` support
  - Utilize schema
  - Provides additional fallback `.json` implementations
  - `_stdlib_read_json_to_arrow` finally resolves `"movies.json"` issue

* refactor: Simplify obsolete paths in `CsvCache`

They were an artifact of *previously* using multiple `vega-dataset` versions in `.parquet` - but only the most recent in `.csv.gz`

Currently both store the same range of names, so this error handling never triggered

* chore: add workaround for `narwhals` bug

Opened (narwhals-dev/narwhals#1897)
Marking (#3631 (comment)) as resolved

* feat(typing): replace `(Read|Scan)Impl` classes with aliases

- Shorter names `Read`, `Scan`
- The single unique method is now `into_scan`
- There was no real need to have concrete classes when they behave the same as the parent

* feat: Rename, docs `unwrap_or` -> `unwrap_or_skip`

* refactor: Replace `._contents` w/ `.__str__()`

Inspired by https://github.com/pypa/packaging/blob/8510bd9d3bab5571974202ec85f6ef7b0359bfaf/src/packaging/requirements.py#L67-L71

* fix: Use correct type for `pyarrow.csv.read_csv`

Resolves:
```py
File ../altair/.venv/Lib/site-packages/pyarrow/csv.pyx:1258, in pyarrow._csv.read_csv()
TypeError: Cannot convert dict to pyarrow._csv.ParseOptions
```
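
The shape of the fix, on toy data (not the altair call site): pyarrow wants option objects rather than plain dicts.

```python
from io import BytesIO

import pyarrow.csv as pa_csv

data = BytesIO(b"a,b\n1,2\n")
# Passing a ParseOptions instance avoids the dict-conversion TypeError above.
table = pa_csv.read_csv(data, parse_options=pa_csv.ParseOptions(delimiter=","))
print(table)
```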

* docs: Add docs for `Read`, `Scan`, `BaseImpl`

* docs: Clean up `_merge_kwds`, `_solve`

* refactor(typing): Include all suffixes in `Extension`

Also simplifies and removes outdated `Extension`-related tooling

* feat: Finish `Reader.profile`

- Reduced the scope a bit, now just un/supported
- Added `pprint` option
- Finished docs, including example pointing to use `url(...)`

* test: Use `Reader.profile` in `is_polars_backed_pyarrow`

* feat: Clean up, add tests for new exceptions

* feat: Adds `Reader.open_markdown`

- Will be even more useful after merging vega/vega-datasets#663
- Thinking this is a fair tradeoff vs inlining the descriptions into `altair`
  - All the info is available and it is quicker than manually searching the headings in a browser

* docs: fix typo

Resolves #3631 (comment)

* fix: fix typo in error message

#3631 (comment)

* refactor: utilize narwhals fix

narwhals-dev/narwhals#1934

* refactor: utilize `nw.Implementation.from_backend`

See narwhals-dev/narwhals#1888

* feat(typing): utilize `nw.LazyFrame` working `TypeVar`

Possible since narwhals-dev/narwhals#1930

@MarcoGorelli if you're interested what that PR did (besides fix warnings 😉)

* docs: Show less data in examples

* feat: Update for `vega-datasets@3.0.0-alpha.1`

Made possible via vega/vega-datasets#681

- Removes temp files
- Removes some outdated apis
- Remove test based on removed `"points"` dataset

* refactor: replace `SchemaCache.schema_pyarrow` -> `nw.Schema.to_arrow`

Related
- narwhals-dev/narwhals#1924
- #3631 (comment)

* feat(typing): Properly annotate `dataset_name`, `suffix`

Makes more sense following (755ab4f)

* chore: bump `vega-datasets==3.1.0`

* test(typing): Ignore `_pytest` imports for `pyright`

See microsoft/pyright#10248 (comment)

* feat: Basic `geopandas` impl

Still need to update tests

* fix: Add missing `v` prefix to url

* test: Update `test_spatial`

* ci: Try pinning locked `ruff`

https://github.com/vega/altair/actions/runs/14478364865/job/40609439929

* ci(uv): Add `--group geospatial`

* chore: Reduce `geopandas` pin

* feat: Basic `polars-st` impl

- Seems to work pretty similarly to `geopandas`
- The repr isn't as clean
- Pretty cool that you can get *something* from `load("us-10m").st.plot()`

* ci(typing): `mypy` ignore `polars-st`

https://github.com/vega/altair/actions/runs/14494920661/job/40660098022?pr=3631

* build against vega-datasets 3.2.0

* run generate-schema-wrapper

* prevent infinite recursion in _split_markers

* sync to v6

* resolve doctest on lower python versions

* resolve comment in github action

* changed examples to modern interface to pass docbuild

---------

Co-authored-by: dangotbanned <125183946+dangotbanned@users.noreply.github.com>