feat: adds generation script for income.json #672
    # Repository structure
    REPO_ROOT = Path(__file__).parent.parent
    OUTPUT_DIR = REPO_ROOT / "data"
Note
No need to address in this PR
I'm thinking we should split out some common utilities to another file.
These are some examples of path/filesystem ops in other repos:
- https://github.com/vega/altair/blob/82ec2692efab4ad8f47b39623f1769d90e62f6ae/tools/fs.py
- https://github.com/pypa/hatch/blob/64031c1cf5d02d85203f68cb7a4fd5db2aa7a004/scripts/utils.py
Things I've noticed we have that may be broadly useful:
- vega-datasets/scripts/income.py, lines 202 to 204 in 3e30803
- vega-datasets/scripts/us-state-capitals.py, lines 164 to 172 in dd43f29
- vega-datasets/scripts/build_datapackage.py, lines 583 to 589 in dd43f29
- vega-datasets/scripts/build_datapackage.py, lines 614 to 617 in 3e30803
- vega-datasets/scripts/flights.py, lines 1015 to 1020 in 3e30803
- vega-datasets/scripts/us-state-capitals.py, lines 38 to 40 in dd43f29
- vega-datasets/scripts/us-state-capitals.py, lines 175 to 176 in dd43f29
- vega-datasets/scripts/flights.py, lines 1008 to 1010 in 3e30803
- vega-datasets/scripts/build_datapackage.py, lines 520 to 524 in dd43f29
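For illustration only, a shared module along these lines might be a starting point. This is a rough sketch, not what any of the scripts currently do: the module location (something like `scripts/_fs.py`) and the `write_json` helper are hypothetical, and only `REPO_ROOT` and `OUTPUT_DIR` are taken from the diff under review.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any

# Hypothetical shared module, e.g. scripts/_fs.py.
# The fallback exists only so the sketch also runs in a REPL, where
# __file__ is not defined.
try:
    REPO_ROOT = Path(__file__).parent.parent
except NameError:
    REPO_ROOT = Path.cwd()
OUTPUT_DIR = REPO_ROOT / "data"


def write_json(data: Any, path: Path, *, indent: int = 2) -> Path:
    """Write ``data`` as UTF-8 JSON with LF line endings and a trailing newline.

    Parent directories are created when they do not exist yet.
    """
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8", newline="\n") as f:
        json.dump(data, f, indent=indent)
        f.write("\n")
    return path
```

Each script could then import the constants and helper instead of redefining them, which would also keep newline handling consistent across generated files.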
Replace lambda sort key in process_state_records with a named get_state_income_sort_key function for better readability and maintainability. This makes the sorting logic more explicit and follows Python's guidance on avoiding complex lambdas.
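A minimal sketch of the shape of this change, under stated assumptions: the record fields, the `GROUP_ORDER` table, and the exact sort criteria are hypothetical, since the commit message names only `process_state_records` and `get_state_income_sort_key`.

```python
from __future__ import annotations

from typing import TypedDict


class StateIncomeRecord(TypedDict):
    # Hypothetical record shape; the real script may use different fields.
    name: str
    group: str


# Hypothetical ordering table for income groups.
GROUP_ORDER = {"50000 to 74999": 0, "75000 to 99999": 1}


def get_state_income_sort_key(record: StateIncomeRecord) -> tuple[str, int]:
    """Sort by state name first, then by the position of the income group."""
    return record["name"], GROUP_ORDER[record["group"]]


def process_state_records(
    records: list[StateIncomeRecord],
) -> list[StateIncomeRecord]:
    # Before: sorted(records, key=lambda r: (r["name"], GROUP_ORDER[r["group"]]))
    return sorted(records, key=get_state_income_sort_key)
```

The named function gives the sort rule a docstring and a type signature, and it can be tested on its own.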
3e30803 to c79f12e
- shared `group` field is now hinted for all 4 places it is used
  - Including as a key function
- added `Region` to indicate only `5` unique values
- changed global constants `dict` -> `Mapping` to reflect they are not mutated
- changed `AggregatedIncomeGroup` field to required
  - Otherwise it is exactly `BaseIncomeGroup`
  - The annotation already reflects that `BaseIncomeGroup | AggregatedIncomeGroup`
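Roughly, the typing changes described here could look like the sketch below. Only the names `Region`, `BaseIncomeGroup`, `AggregatedIncomeGroup`, and the `group` field come from the commit message. The region labels, the other fields (including `total`), the `STATE_TO_REGION` constant, and `group_key` are assumptions made for illustration.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Literal, TypedDict

# Assumed labels: the commit message only states that there are 5 unique values.
Region = Literal["northeast", "midwest", "south", "west", "other"]

# Annotated as Mapping (no mutating methods) because the constant is never
# changed after definition. The entries are placeholders, not the real table.
STATE_TO_REGION: Mapping[str, Region] = {"WA": "west", "TX": "south"}


class BaseIncomeGroup(TypedDict):
    # `group` is the shared field; the other keys are hypothetical.
    name: str
    region: Region
    group: str


class AggregatedIncomeGroup(BaseIncomeGroup):
    # Hypothetical extra field. Keeping it required is what distinguishes
    # this type from BaseIncomeGroup.
    total: int


def group_key(item: BaseIncomeGroup | AggregatedIncomeGroup) -> str:
    """Key function that reads the shared `group` field."""
    return item["group"]
```

With the field required, a type checker can tell the two record types apart in the `BaseIncomeGroup | AggregatedIncomeGroup` union.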
Should have been done during #671 (`point.json`)

The diff on `income.json` seems like removing a newline char?
* docs: add sources and license for 7zip resource
  Update datasets.toml with missing source metadata for 7zip.png dataset
* chore: uvx taplo fmt
* docs: add sources and license for ffox.png
* docs: updates zipcodes.csv resource in datapackage_additions.toml
* update world-110m.json
* docs: updates us-10m.json
* docs: updates wheat.json
  - adds citation to protovis in description
  - fixes link to image in sources
  - adds license
* docs: adds missing license data to several
  - fixes bad link in annual-precip.json; adds license
  - adds license to birdstrikes.csv, budget.json, burtin.json, and cars.json
* docs: update metadata for co2-concentration.csv
  - expands description to explain units and seasonal adjustment
  - adds additional source directly to dataset csv
  - adds license details from source
* docs: adds license to crimea.json metadata
* docs: update metadata for earthquakes.json
  - expands description
  - adds license
* docs: complete metadata for flights* datasets
  - Document that data used in flights* datasets are collected under US DOT requirements
  - Add row counts to flight dataset descriptions (2k-3M rows)
  - Note regulatory basis (14 CFR Part 234) while acknowledging unclear license terms
* docs: updates london dataset metadata
  - adds license for londonBoroughs.json
  - adds sources, license for londonCentroids.json (itself derived from londonBoroughs.json)
  - expands description, corrects source URL, updates source title, and adds license for londonTubeLines.json
* docs: adds government and IPUMS license metadata to several
  - global-temp.csv
  - iowa-electricity.csv
  - jobs.json
  - monarchs.json
  - political-contributions.json (also updates link to FEC github), note that FEC provides an explicit underlying license
  - population_engineers_hurricanes.csv
  - seattle-weather-hourly-normals.csv
  - seattle-weather.csv
  - unemployment-across-industries.json
  - unemployment.tsv
  - us-employment.csv
  - weather.csv

  Note that many pages hosting US government datasets do not explicitly grant a license. As a result, when there is a doubt, a link is provided to the USA government works page, which explains the nuances of licenses for data on US government web sites.
* docs: adds 'undetermined' licenses and sources
  - adds license (football.json, la-riots.csv, penguins.json, platformer-terrain.json, population.json, sp500-2000.csv, sp500.csv, volcano.json)
  - airports.csv (adds description, sources, license)
  - barley.csv (updates description and source; adds license)
  - disasters.csv (expands description, updates sources, add license)
  - driving.json (adds description, updates source, adds license)
  - ohlc.json (modifies description, adds additional source, and license)
  - stocks.csv (adds source, license)
  - weekly-weather.json (adds source, license)
  - windvectors.csv (adds source, license)
* docs: completes anscombe.json metadata
  - updates description, adds sources and
* docs: adds budgets.json metadata
  - adds description, source and license
  - makes license title of U.S. Government Datasets consistent for cases specific license terms are undetermined
* docs: adds basic metadata to flare*.json datasets
  - focuses on how data is used in edge bundling example
  - would benefit from additional detail in the description
* docs: completes flights-airport.csv metadata
  - corrects description, adds source, license
* docs: update several file metadata entries
  - ffox.png (updates license)
  - gapminder.json (adds license)
  - gimp.png (updates description, adds source, license)
  - github.csv (adds description, source, license)
  - lookup_groups.csv, lookup_people.csv (adds description, source, license)
  - miserables.json (adds description, source, license)
  - movies.json (adds source, license)
  - normal-2d.json (adds description, source, license)
  - stocks.csv (adds description)
* docs: adds us-state-capitals.json metadata
  - related to #668
* docs: adds uniform-2d.json metadata
* docs: adds obesity.json metadata
* docs: remove points.json metadata
  - dataset was removed from repo in #671
* docs: adds metadata for income.json
  - relies on income.py script from #672
* docs: adds metadata for udistrict.json
* docs: adds, fixes metadata
  - adds description, sources for sp500.csv
  - fixes formatting for weekly-weather.json
* docs: updates datapackage
  - uv run scripts/build_datapackage.py # doctest: +SKIP
* docs: begins to recast in PEP 257 style
  - Partial fix for #663 (comment)
  - edits descriptions through earthquakes.json
* docs: recasts all in PEP 257 format
  - avoids 'this dataset' and similar
  - reruns datapackage script (json, md)
* fix: corrects year in description of obesity.json
  - new source found confirming 1995, not 2008, data is shown, consistent with CDC data
  - removes link to vega example that references wrong source year
* fix: Use correct heading level in `burtin.json`
  Drive-by fix, really been bugging me that this breaks the flow of the navigation
* fix: remove extra space
  Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
* fix: remove extra space from source
  Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
* docs: add column schema to normal-2d.json metadata
  Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
* reformats revision note for monarchs.json
  Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
* fix: typo in monarchs.json metadata
* adjust markdown in anscombe.json
  Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
* adjust punctuation in anscombe.json
* adds column schema for budgets.json, penguins.json
  - runs build_datapackage.py to verify
* docs: removes 'undetermined' source and license info
  - source and license can be clarified in a future PR
* fix: correct lookup example url
* docs: moves gapminder clusters to schema
* update file markdown in flare.json metadata
  Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
* docs: adjust markdown in flare-dependencies.json metadata
  Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
* docs: reformats driving.json metadata
* fix formatting
* adjustments to schemas
  - github.csv: move time range to schema
  - add categories to schema in seattle-weather.csv
  - sp500.csv, udistrict.json, uniform-2d, weather.json: move description content into schema
  - reformat usgs disclaimer in us-state-capitals.json
  - rerun build_datapackage.py
* remove duplication in udistrict description
* uvx run scripts

---------

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
- Prepares for upcoming metadata improvements in "docs: Add missing descriptions, sources, and licenses" #663 by tying the dataset to a specific source: Census Bureau's American Community Survey 3-Year Data (2013), B19001, Household Income in the Past 12 Months (in 2023 Inflation-Adjusted Dollars).
- Generates a clone of the existing `income.json`, including the reversed ordering of two income groups (`75000 to 99999` comes before `50000 to 74999`), in case this was introduced intentionally for educational purposes. If not desired, this can be addressed in a future PR. The only diff appears to be an extra carriage return in the legacy JSON.
- Census API documentation for reference.
- Region mapping and census codes are hard-coded inside the script. Open to recommendations for if/how to move any configuration to a separate TOML or other file.