feat: adds generation script for income.json #672

Merged
dsmedia merged 7 commits into main from feature/add-script-income on Feb 2, 2025

Conversation

@dsmedia (Collaborator) commented Jan 24, 2025

  • Prepares for upcoming metadata improvements in docs: Add missing descriptions, sources, and licenses #663 by tying the dataset to a specific source: the Census Bureau's American Community Survey 3-Year Data (2013), table B19001, Household Income in the Past 12 Months (in 2023 Inflation-Adjusted Dollars).

  • Generates a clone of the existing income.json, including the reversed ordering of two income groups (75000 to 99999 comes before 50000 to 74999), in case this was introduced intentionally for educational purposes. If not desired, this can be addressed in a future PR. The only diff appears to be an extra carriage return in the legacy JSON.

  • Census API documentation for reference.

  • Region mapping and Census codes are hard-coded inside the script. Open to recommendations on if/how to move any configuration to a separate TOML or other file.

Comment on lines +9 to +11
# Repository structure
REPO_ROOT = Path(__file__).parent.parent
OUTPUT_DIR = REPO_ROOT / "data"
Member

Note

No need to address in this PR

I'm thinking we should split out some common utilities to another file.

These are some examples of path/filesystem ops in other repos:

Things I've noticed we have that may be broadly useful:

def write_json(data: list[StateIncome], output: Path) -> None:
    """Writes data to JSON file."""
    output.write_text(json.dumps(data, indent=2), encoding="utf-8")

def write_json(data: Sequence[StateCapitol], output: Path) -> None:
    """Saves ``data`` to ``output`` with consistent formatting."""
    INDENT, OB, CB, NL = " ", "[", "]", "\n"
    to_str = partial(json.dumps, separators=(", ", ":"))
    with output.open("w", encoding="utf-8", newline="\n") as f:
        f.write(f"{OB}{NL}")
        for record in data[:-1]:
            f.write(f"{INDENT}{to_str(record)},{NL}")
        f.write(f"{INDENT}{to_str(data[-1])}{NL}{CB}{NL}")

def read_toml(fp: Path, /) -> dict[str, Any]:
    return tomllib.loads(fp.read_text("utf-8"))

def read_json(fp: Path, /) -> Any:
    with fp.open(encoding="utf-8") as f:
        return json.load(f)

repo_dir: Path = Path(__file__).parent.parent
data_dir: Path = repo_dir / "data"
sources_toml: Path = repo_dir / "_data" / ADDITIONS_TOML
npm_json = repo_dir / NPM_PACKAGE

repo_root = Path(__file__).parent.parent
source_toml = repo_root / "_data" / "flights.toml"
app = Flights.from_toml(
    source_toml,
    input_dir=Path.home() / ".vega_datasets",
    output_dir=repo_root / "data",
)

REPO_ROOT: Path = Path(__file__).parent.parent
INPUT_DIR: Path = REPO_ROOT / "_data"
OUTPUT_DIR: Path = REPO_ROOT / "data"

def _get_args(tp: Any, /) -> tuple[Any, ...]:
    return typing.get_args(getattr(tp, "__value__", tp))

def _get_args(tp: Any, /) -> tuple[Any, ...]:
    unwrapped = getattr(tp, "__value__", tp)
    return _typing_get_args(unwrapped)

def run_check[T: (str, bytes)](
    args: OneOrSeq[str | Path], /, into: type[T] = str
) -> sp.CompletedProcess[T]:
    """
    Run a command in a `subprocess`_, capturing its output.
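If the recurring path and JSON helpers were split out as suggested, a shared module might look roughly like this (a sketch under assumptions; the helper names and simplified signatures are illustrative, not an actual file in the repo):

```python
# Sketch of a consolidated utilities module combining the recurring
# path and JSON helpers shown above. Names and signatures are assumptions.
import json
from pathlib import Path
from typing import Any


def repo_root() -> Path:
    """Resolve the repository root relative to this file's location."""
    return Path(__file__).resolve().parent.parent


def read_json(fp: Path, /) -> Any:
    """Load JSON from ``fp`` using UTF-8."""
    with fp.open(encoding="utf-8") as f:
        return json.load(f)


def write_json(data: Any, output: Path) -> None:
    """Write ``data`` to ``output`` as indented JSON."""
    output.write_text(json.dumps(data, indent=2), encoding="utf-8")
```

Each generation script would then import these instead of redefining REPO_ROOT and its own write_json variant.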

@dangotbanned (Member) commented Jan 24, 2025

@dsmedia I'm going to review this in more detail later today.

Wanted to say I'm really impressed by how quickly you were able to adapt to feedback from #668!

Even in the GitHub view, I'm finding it easy to follow 👏

Replace lambda sort key in process_state_records with a named
get_state_income_sort_key function for better readability and
maintainability. This makes the sorting logic more explicit and
follows Python's guidance on avoiding complex lambdas.
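A minimal sketch of what such a named key function might look like (only the function name comes from the commit message; the record fields, the region ordering, and the sample data are illustrative assumptions):

```python
# Sketch of the named sort key described in the commit message above.
# The record fields ("region", "state") and the region ordering are
# illustrative assumptions, not the script's actual definitions.
REGION_ORDER = {"south": 0, "west": 1, "midwest": 2, "northeast": 3}


def get_state_income_sort_key(record: dict) -> tuple[int, str]:
    """Order records by region, then alphabetically by state."""
    return (REGION_ORDER[record["region"]], record["state"])


records = [
    {"region": "west", "state": "Oregon"},
    {"region": "south", "state": "Texas"},
    {"region": "west", "state": "Alaska"},
]
records.sort(key=get_state_income_sort_key)
```

Compared with an inline lambda, the named function can carry a docstring and a type hint, which is the readability gain the commit message refers to.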
@dsmedia dsmedia force-pushed the feature/add-script-income branch from 3e30803 to c79f12e on January 24, 2025 23:50
@dangotbanned (Member) left a comment

Thanks @dsmedia

- shared `group` field is now hinted for all 4 places it is used
  - Including as a key function
- added `Region` to indicate only `5` unique values
- changed global constants `dict` -> `Mapping` to reflect they are not mutated
- changed `AggregatedIncomeGroup` field to required
   - Otherwise it is exactly `BaseIncomeGroup`
   - The annotation already reflects that `BaseIncomeGroup | AggregatedIncomeGroup`
Should have been done during #671 (`point.json`)

The diff on `income.json` seems like removing a newline char?
@dsmedia dsmedia merged commit 40620d6 into main Feb 2, 2025
@dsmedia dsmedia deleted the feature/add-script-income branch February 2, 2025 14:58
dsmedia added a commit that referenced this pull request Feb 2, 2025
* docs: add sources and license for 7zip resource

Update datasets.toml with missing source metadata for 7zip.png dataset

* chore: uvx taplo fmt

* docs: add sources and license for ffox.png

* docs: updates zipcodes.csv resource in datapackage_additions.toml

* update world-110m.json

* docs: updates us-10m.json

* docs: updates wheat.json
- adds citation to protovis in description
- fixes link to image in sources
- adds license

* docs: adds missing license data to several
- fixes bad link in annual-precip.json; adds license
- adds license to birdstrikes.csv, budget.json, burtin.json, and cars.json

* docs: update metadata for co2-concentration.csv
- expands description to explain units and seasonal adjustment
- adds additional source directly to dataset csv
- adds license details from source

* docs: adds license to crimea.json metadata

* docs: update metadata for earthquakes.json
- expands description
- adds license

* docs: complete metadata for flights* datasets

- Document that data used in flights* datasets are collected under US DOT requirements
- Add row counts to flight dataset descriptions (2k-3M rows)
- Note regulatory basis (14 CFR Part 234) while acknowledging unclear license terms

* docs: updates london dataset metadata
- adds license for londonBoroughs.json
- adds sources, license for londonCentroids.json (itself derived from londonBoroughs.json)
- expands description, corrects source URL, updates source title, and adds license for londonTubeLines.json

* docs: adds government and IPUMS license metadata to several
- global-temp.csv
- iowa-electricity.csv
- jobs.json
- monarchs.json
- political-contributions.json (also updates link to FEC github), note that FEC provides an explicit underlying license
- population_engineers_hurricanes.csv
- seattle-weather-hourly-normals.csv
- seattle-weather.csv
- unemployment-across-industries.json
- unemployment.tsv
- us-employment.csv
- weather.csv

Note that many pages hosting US government datasets do not explicitly grant a license. As a result, when there is doubt, a link is provided to the USA government works page, which explains the nuances of licensing for data on US government websites.

* docs: adds 'undetermined' licenses and sources

- adds license (football.json, la-riots.csv, penguins.json, platformer-terrain.json, population.json, sp500-2000.csv, sp500.csv, volcano.json)
- airports.csv (adds description, sources, license)
- barley.csv (updates description and source; adds license)
- disasters.csv (expands description, updates sources, add license)
- driving.json (adds description, updates source, adds license)
- ohlc.json (modifies description, adds additional source, and license)
- stocks.csv (adds source, license)
- weekly-weather.json (adds source, license)
- windvectors.csv (adds source, license)

* docs: completes anscombe.json metadata
- updates description, adds sources and license

* docs: adds budgets.json metadata
- adds description, source and license
- makes license title of U.S. Government Datasets consistent for cases where specific license terms are undetermined

* docs: adds basic metadata to flare*.json datasets
- focuses on how data is used in edge bundling example
- would benefit from additional detail in the description

* docs: completes flights-airport.csv metadata
- corrects description, adds source, license

* docs: update several file metadata entries
- ffox.png (updates license)
- gapminder.json (adds license)
- gimp.png (updates description, adds source, license)
- github.csv (adds description, source, license)
- lookup_groups.csv, lookup_people.csv (adds description, source, license)
- miserables.json (adds description, source, license)
- movies.json (adds source, license)
- normal-2d.json (adds description, source, license)
- stocks.csv (adds description)

* docs: adds us-state-capitals.json metadata
- related to #668

* docs: adds uniform-2d.json metadata

* docs: adds obesity.json metadata

* docs: remove points.json metadata
- dataset was removed from repo in #671

* docs: adds metadata for income.json
- relies on income.py script from #672

* docs: adds metadata for udistrict.json

* docs: adds, fixes metadata
- adds description, sources for sp500.csv
- fixes formatting for weekly-weather.json

* docs: updates datapackage
- uv run scripts/build_datapackage.py # doctest: +SKIP

* docs: begins to recast in PEP 257 style
- Partial fix for #663 (comment)
- edits descriptions through earthquakes.json

* docs: recasts all in PEP 257 format
- avoids 'this dataset' and similar
- reruns datapackage script (json, md)

* fix: corrects year in description of obesity.json
- a new source confirms the data shown is from 1995, not 2008, consistent with CDC data
- removes link to vega example that references wrong source year

* fix: Use correct heading level in `burtin.json`

Drive-by fix, really been bugging me that this breaks the flow of the navigation

* fix: remove extra space

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* fix: remove extra space from source

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* docs: add column schema to normal-2d.json metadata

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* reformats revision note for monarchs.json

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* fix: typo in monarchs.json metadata

* adjust markdown in anscombe.json

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* adjust punctuation in anscombe.json

* adds column schema for budgets.json, penguins.json
- runs build_datapackage.py to verify

* docs: removes 'undetermined' source and license info
- source and license can be clarified in a future PR

* fix: correct lookup example url

* docs: moves gapminder clusters to schema

* update file markdown in flare.json metadata

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* docs: adjust markdown in flare-dependencies.json metadata

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* docs: reformats driving.json metadata

* fix formatting

* adjustments to schemas
- github.csv: move time range to schema
- add categories to schema in seattle-weather.csv
- sp500.csv, udistrict.json, uniform-2d, weather.json: move description content into schema
- reformat usgs disclaimer in us-state-capitals.json
- rerun build_datapackage.py

* remove duplication in udistrict description

* uvx run scripts

---------

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
