feat: adds generation script for income.json #672

Merged
dsmedia merged 7 commits into main from feature/add-script-income on Feb 2, 2025

Conversation

@dsmedia (Collaborator) commented Jan 24, 2025

  • Prepares for upcoming metadata improvements in docs: Add missing descriptions, sources, and licenses #663 by tying the dataset to a specific source: the Census Bureau's American Community Survey 3-Year Data (2013), table B19001, Household Income in the Past 12 Months (in 2023 Inflation-Adjusted Dollars).

  • Generates a clone of the existing income.json, including the reversed ordering of two income groups (75000 to 99999 comes before 50000 to 74999), in case this was introduced intentionally for educational purposes. If not desired, this can be addressed in a future PR. The only diff appears to be an extra carriage return in the legacy JSON.

  • Census API documentation for reference.

  • Region mapping and Census codes are hard-coded inside the script. Open to recommendations on if/how to move any configuration to a separate TOML or other file.

Comment on lines +9 to +11
# Repository structure
REPO_ROOT = Path(__file__).parent.parent
OUTPUT_DIR = REPO_ROOT / "data"
Member

Note

No need to address in this PR

I'm thinking we should split out some common utilities to another file.

These are some examples of path/filesystem ops in other repos:

Things I've noticed we have that may be broadly useful:

def write_json(data: list[StateIncome], output: Path) -> None:
    """Writes data to JSON file."""
    output.write_text(json.dumps(data, indent=2), encoding="utf-8")

def write_json(data: Sequence[StateCapitol], output: Path) -> None:
    """Saves ``data`` to ``output`` with consistent formatting."""
    INDENT, OB, CB, NL = " ", "[", "]", "\n"
    to_str = partial(json.dumps, separators=(", ", ":"))
    with output.open("w", encoding="utf-8", newline="\n") as f:
        f.write(f"{OB}{NL}")
        for record in data[:-1]:
            f.write(f"{INDENT}{to_str(record)},{NL}")
        f.write(f"{INDENT}{to_str(data[-1])}{NL}{CB}{NL}")

def read_toml(fp: Path, /) -> dict[str, Any]:
    return tomllib.loads(fp.read_text("utf-8"))

def read_json(fp: Path, /) -> Any:
    with fp.open(encoding="utf-8") as f:
        return json.load(f)

repo_dir: Path = Path(__file__).parent.parent
data_dir: Path = repo_dir / "data"
sources_toml: Path = repo_dir / "_data" / ADDITIONS_TOML
npm_json = repo_dir / NPM_PACKAGE

repo_root = Path(__file__).parent.parent
source_toml = repo_root / "_data" / "flights.toml"
app = Flights.from_toml(
    source_toml,
    input_dir=Path.home() / ".vega_datasets",
    output_dir=repo_root / "data",
)

REPO_ROOT: Path = Path(__file__).parent.parent
INPUT_DIR: Path = REPO_ROOT / "_data"
OUTPUT_DIR: Path = REPO_ROOT / "data"

def _get_args(tp: Any, /) -> tuple[Any, ...]:
    return typing.get_args(getattr(tp, "__value__", tp))

def _get_args(tp: Any, /) -> tuple[Any, ...]:
    unwrapped = getattr(tp, "__value__", tp)
    return _typing_get_args(unwrapped)

def run_check[T: (str, bytes)](
    args: OneOrSeq[str | Path], /, into: type[T] = str
) -> sp.CompletedProcess[T]:
    """
    Run a command in a `subprocess`_, capturing its output.
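If the recurring path and JSON helpers were split out as suggested, a shared module might look roughly like this (a sketch under assumptions; the helper names and simplified signatures are illustrative, not an actual file in the repo):

```python
# Sketch of a consolidated utilities module combining the recurring
# path and JSON helpers shown above. Names and signatures are assumptions.
import json
from pathlib import Path
from typing import Any


def repo_root() -> Path:
    """Resolve the repository root relative to this file's location."""
    return Path(__file__).resolve().parent.parent


def read_json(fp: Path, /) -> Any:
    """Load JSON from ``fp`` using UTF-8."""
    with fp.open(encoding="utf-8") as f:
        return json.load(f)


def write_json(data: Any, output: Path) -> None:
    """Write ``data`` to ``output`` as indented JSON."""
    output.write_text(json.dumps(data, indent=2), encoding="utf-8")
```

Each generation script would then import these instead of redefining REPO_ROOT and its own write_json variant.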

@dangotbanned (Member) commented Jan 24, 2025

@dsmedia I'm going to review this in more detail later today.

Wanted to say I'm really impressed by how quickly you were able to adapt to feedback from #668!

Even in the GitHub view, I'm finding it easy to follow 👏

Replace lambda sort key in process_state_records with a named
get_state_income_sort_key function for better readability and
maintainability. This makes the sorting logic more explicit and
follows Python's guidance on avoiding complex lambdas.
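A minimal sketch of what such a named key function might look like (only the function name comes from the commit message; the record fields, the region ordering, and the sample data are illustrative assumptions):

```python
# Sketch of the named sort key described in the commit message above.
# The record fields ("region", "state") and the region ordering are
# illustrative assumptions, not the script's actual definitions.
REGION_ORDER = {"south": 0, "west": 1, "midwest": 2, "northeast": 3}


def get_state_income_sort_key(record: dict) -> tuple[int, str]:
    """Order records by region, then alphabetically by state."""
    return (REGION_ORDER[record["region"]], record["state"])


records = [
    {"region": "west", "state": "Oregon"},
    {"region": "south", "state": "Texas"},
    {"region": "west", "state": "Alaska"},
]
records.sort(key=get_state_income_sort_key)
```

Compared with an inline lambda, the named function can carry a docstring and a type hint, which is the readability gain the commit message refers to.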
@dsmedia dsmedia force-pushed the feature/add-script-income branch from 3e30803 to c79f12e on January 24, 2025 23:50
@dangotbanned (Member) left a comment

Thanks @dsmedia

- shared `group` field is now hinted for all 4 places it is used
  - Including as a key function
- added `Region` to indicate only `5` unique values
- changed global constants `dict` -> `Mapping` to reflect they are not mutated
- changed `AggregatedIncomeGroup` field to required
   - Otherwise it is exactly `BaseIncomeGroup`
   - The annotation already reflects that `BaseIncomeGroup | AggregatedIncomeGroup`
Should have been done during #671 (`point.json`)

The diff on `income.json` seems like removing a newline char?
@dsmedia dsmedia merged commit 40620d6 into main Feb 2, 2025
@dsmedia dsmedia deleted the feature/add-script-income branch February 2, 2025 14:58
dsmedia added a commit that referenced this pull request Feb 2, 2025
* docs: add sources and license for 7zip resource

Update datasets.toml with missing source metadata for 7zip.png dataset

* chore: uvx taplo fmt

* docs: add sources and license for ffox.png

* docs: updates zipcodes.csv resource in datapackage_additions.toml

* update world-110m.json

* docs: updates us-10m.json

* docs: updates wheat.json
- adds citation to protovis in description
- fixes link to image in sources
- adds license

* docs: adds missing license data to several
- fixes bad link in annual-precip.json; adds license
- adds license to birdstrikes.csv, budget.json, burtin.json, and cars.json

* docs: update metadata for co2-concentration.csv
- expands description to explain units and seasonal adjustment
- adds additional source directly to dataset csv
- adds license details from source

* docs: adds license to crimea.json metadata

* docs: update metadata for earthquakes.json
- expands description
- adds license

* docs: complete metadata for flights* datasets

- Document that data used in flights* datasets are collected under US DOT requirements
- Add row counts to flight dataset descriptions (2k-3M rows)
- Note regulatory basis (14 CFR Part 234) while acknowledging unclear license terms

* docs: updates london dataset metadata
- adds license for londonBoroughs.json
- adds sources, license for londonCentroids.json (itself derived from londonBoroughs.json)
- expands description, corrects source URL, updates source title, and adds license for londonTubeLines.json

* docs: adds government and IPUMS license metadata to several
- global-temp.csv
- iowa-electricity.csv
- jobs.json
- monarchs.json
- political-contributions.json (also updates link to FEC github), note that FEC provides an explicit underlying license
- population_engineers_hurricanes.csv
- seattle-weather-hourly-normals.csv
- seattle-weather.csv
- unemployment-across-industries.json
- unemployment.tsv
- us-employment.csv
- weather.csv

Note that many pages hosting US government datasets do not explicitly grant a license. As a result, when there is doubt, a link is provided to the USA government works page, which explains the nuances of licensing for data on US government websites.

* docs: adds 'undetermined' licenses and sources

- adds license (football.json, la-riots.csv, penguins.json, platformer-terrain.json, population.json, sp500-2000.csv, sp500.csv, volcano.json)
- airports.csv (adds description, sources, license)
- barley.csv (updates description and source; adds license)
- disasters.csv (expands description, updates sources, add license)
- driving.json (adds description, updates source, adds license)
- ohlc.json (modifies description, adds additional source, and license)
- stocks.csv (adds source, license)
- weekly-weather.json (adds source, license)
- windvectors.csv (adds source, license)

* docs: completes anscombe.json metadata
- updates description, adds sources and license

* docs: adds budgets.json metadata
- adds description, source and license
- makes license title of U.S. Government Datasets consistent for cases where specific license terms are undetermined

* docs: adds basic metadata to flare*.json datasets
- focuses on how data is used in edge bundling example
- would benefit from additional detail in the description

* docs: completes flights-airport.csv metadata
- corrects description, adds source, license

* docs: update several file metadata entries
- ffox.png (updates license)
- gapminder.json (adds license)
- gimp.png (updates description, adds source, license)
- github.csv (adds description, source, license)
- lookup_groups.csv, lookup_people.csv (adds description, source, license)
- miserables.json (adds description, source, license)
- movies.json (adds source, license)
- normal-2d.json (adds description, source, license)
- stocks.csv (adds description)

* docs: adds us-state-capitals.json metadata
- related to #668

* docs: adds uniform-2d.json metadata

* docs: adds obesity.json metadata

* docs: remove points.json metadata
- dataset was removed from repo in #671

* docs: adds metadata for income.json
- relies on income.py script from #672

* docs: adds metadata for udistrict.json

* docs: adds, fixes metadata
- adds description, sources for sp500.csv
- fixes formatting for weekly-weather.json

* docs: updates datapackage
- uv run scripts/build_datapackage.py # doctest: +SKIP

* docs: begins to recast in PEP 257 style
- Partial fix for #663 (comment)
- edits descriptions through earthquakes.json

* docs: recasts all in PEP 257 format
- avoids 'this dataset' and similar
- reruns datapackage script (json, md)

* fix: corrects year in description of obesity.json
- a new source confirms the data shown is from 1995, not 2008, consistent with CDC data
- removes link to vega example that references wrong source year

* fix: Use correct heading level in `burtin.json`

Drive-by fix, really been bugging me that this breaks the flow of the navigation

* fix: remove extra space

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* fix: remove extra space from source

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* docs: add column schema to normal-2d.json metadata

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* reformats revision note for monarchs.json

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* fix: typo in monarchs.json metadata

* adjust markdown in anscombe.json

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* adjust punctuation in anscombe.json

* adds column schema for budgets.json, penguins.json
- runs build_datapackage.py to verify

* docs: removes 'undetermined' source and license info
- source and license can be clarified in a future PR

* fix: correct lookup example url

* docs: moves gapminder clusters to schema

* update file markdown in flare.json metadata

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* docs: adjust markdown in flare-dependencies.json metadata

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>

* docs: reformats driving.json metadata

* fix formatting

* adjustments to schemas
- github.csv: move time range to schema
- add categories to schema in seattle-weather.csv
- sp500.csv, udistrict.json, uniform-2d, weather.json: move description content into schema
- reformat usgs disclaimer in us-state-capitals.json
- rerun build_datapackage.py

* remove duplication in udistrict description

* uvx run scripts

---------

Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
