feat: Add Species Habitat Dataset for Faceted Map Examples #684
Conversation
- Replace direct URL downloads with ScienceBase API client (sciencebasepy)
- Remove niquests dependency in favor of native ScienceBase download handling
- Enhance logging for better debugging of file downloads and extractions
- Add explicit ZIP file cleanup after TIF extraction
- Update dependencies to include sciencebasepy and setuptools
Just as a reference for development, this will generate a complete list of species and all species-level metadata.

Generation script:

```python
# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "pandas",
#     "sciencebasepy",
#     "tqdm",
#     "requests",
# ]
# ///
"""
This script retrieves and processes species identifier data from ScienceBase.
It fetches all child items under a specified parent item ID, extracts
identifiers like ECOS and ITIS codes, and compiles the data into a CSV file.
The script uses parallel processing to efficiently handle a large number of
items and includes error handling and retry mechanisms for robustness.
"""
import concurrent.futures
import time
import traceback
from collections import defaultdict

import pandas as pd
import requests
from sciencebasepy import SbSession
from tqdm import tqdm


def get_all_item_ids(parent_id: str = "527d0a83e4b0850ea0518326") -> list[str]:
    """
    Retrieves all child item IDs from a given ScienceBase parent item.

    Args:
        parent_id: The ScienceBase item ID of the parent item.
            Defaults to "527d0a83e4b0850ea0518326" (Habitat Map parent item).

    Returns:
        A list of ScienceBase item IDs. Returns an empty list if there are
        errors or no child items are found.
    """
    print("Retrieving species item IDs from ScienceBase...")
    sb = SbSession()
    try:
        parent_item = sb.get_item(parent_id)
        print(f"Found parent item: '{parent_item.get('title', 'No Title')}'")
        all_ids = list(sb.get_child_ids(parent_id))  # Efficiently fetch all child IDs
        print(f"Found {len(all_ids)} species items.")
        return all_ids
    except Exception as e:
        print(f"Error retrieving items from ScienceBase: {e}")
        traceback.print_exc()  # Full traceback for debugging
        return []


def _extract_item_data(item_id: str, item_json: dict) -> dict:
    """Extracts the title and identifier schemes from an item's JSON."""
    item_data = defaultdict(lambda: 'Not Available')  # Default for missing identifiers
    item_data['item_id'] = item_id
    item_data['title'] = item_json.get('title', 'Unknown Title')
    for identifier in item_json.get('identifiers', []):
        scheme = identifier.get('scheme', 'Unknown Scheme')
        key = identifier.get('key', 'No Value')
        clean_scheme = scheme.split('/')[-1]  # Keep only the last part of the scheme URI
        item_data[clean_scheme] = key
    return dict(item_data)


def process_single_item(item_id: str, sb: SbSession) -> dict:
    """
    Processes a single ScienceBase item to extract identifier data.

    Fetches the JSON representation of a ScienceBase item, extracts the title
    and identifiers (like ECOS and ITIS), and returns the data as a
    dictionary. Transient HTTP errors are retried with exponential backoff.

    Args:
        item_id: The ScienceBase item ID to process.
        sb: An SbSession object for interacting with ScienceBase.

    Returns:
        A dictionary containing 'item_id', 'title', and any identifiers found
        (e.g., 'ECOS', 'ITIS'), or None if the item was not found (404) or
        could not be processed after retries.
    """
    try:
        return _extract_item_data(item_id, sb.get_item(item_id))
    except requests.exceptions.HTTPError as e:
        print(f"\nHTTPError processing item {item_id}: {e}")
        if e.response.status_code == 404:
            print(f"Item {item_id} not found (404). Skipping.")
            return None
        retries = 3
        for i in range(retries):
            try:
                print(f"Retrying item {item_id} (attempt {i + 1}/{retries})...")
                time.sleep(2 ** i)  # Exponential backoff
                return _extract_item_data(item_id, sb.get_item(item_id))
            except requests.exceptions.RequestException:
                if i == retries - 1:
                    print(f"Failed to retrieve item {item_id} after {retries} retries.")
        return None
    except Exception as e:
        print(f"\nError processing item {item_id}: {e}")
        return None


def explore_species_identifiers_parallel(item_ids: list[str], max_workers: int = 10) -> pd.DataFrame:
    """
    Explores and extracts identifiers from ScienceBase items in parallel.

    Processes a list of ScienceBase item IDs concurrently with a thread pool,
    fetching each item via `process_single_item` and aggregating the results
    into a pandas DataFrame.

    Args:
        item_ids: A list of ScienceBase item IDs to process.
        max_workers: The maximum number of threads for parallel processing.
            Tune this to your system and network to avoid overloading resources.

    Returns:
        A DataFrame with one row per ScienceBase item and columns for
        'item_id', 'title', and each identifier scheme (e.g., 'ECOS', 'ITIS').
        Returns an empty DataFrame if no data is processed.
    """
    sb = SbSession()  # A single session shared by all threads
    all_schemes = set()
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process_single_item, item_id, sb) for item_id in item_ids]
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(item_ids), desc="Processing species items"):
            result = future.result()
            if result:
                results.append(result)
                all_schemes.update(key for key in result if key not in ('item_id', 'title'))
    if not results:
        return pd.DataFrame()
    df = pd.DataFrame(results)
    # Ensure all identifier scheme columns are present, then reorder
    for scheme in all_schemes:
        if scheme not in df.columns:
            df[scheme] = 'Not Available'
    cols = ['item_id', 'title'] + sorted(all_schemes)
    return df[cols]


def main():
    """
    Orchestrates the species identifier exploration:

    1. Retrieve all species item IDs from ScienceBase.
    2. Process the items in parallel to extract identifiers.
    3. Print summary statistics of the extracted data.
    4. Save the results to 'all_species_identifiers.csv'.
    5. Display the first few rows of the resulting DataFrame.
    """
    all_ids = get_all_item_ids()
    if not all_ids:
        print("No species IDs found. Exiting.")
        return
    print("\nProcessing species item identifiers...")
    df = explore_species_identifiers_parallel(all_ids, max_workers=10)
    if df.empty:
        print("No data was processed successfully. Exiting.")
        return
    print("\nIdentifier Summary:")
    print(f"Total species items processed: {len(df)}")
    print("\nIdentifier columns found and count of non 'Not Available' values:")
    for col in df.columns:
        # Exclude NaN values from the count ('ne' counts NaN as unequal)
        non_empty_count = df[col].ne('Not Available').sum() - df[col].isna().sum()
        print(f"- {col}: {non_empty_count} items have this identifier")
    output_file = 'all_species_identifiers.csv'
    df.to_csv(output_file, index=False)
    print(f"\nResults saved to '{output_file}'")
    print("\nFirst 5 rows of the data:")
    pd.set_option('display.max_columns', None)  # Display all columns
    pd.set_option('display.max_colwidth', None)  # Display full column width
    print(df.head())


if __name__ == "__main__":
    main()
```

USGS report with domain-specific explanations:

💡 Note: the initial character of `GAP_SpeciesCode` indicates the taxon: amphibians, birds, mammals, reptiles.

💡 Note: The dataset includes 1,590 species and 129 subspecies. The hierarchical relationship between species and subspecies can be inferred from the final character of `GAP_SpeciesCode`. For example, `mPASHc` and `mPASHp` refer to *Sorex pacificus cascadensis* and *Sorex pacificus pacificus*, respectively, both part of `mPASHx`, the Pacific Shrew. The species parent should contain the habitat locations for all subspecies.
Restructures the species habitat analysis script:

- Implement modular architecture with ScienceBaseClient, RasterSet, and HabitatDataProcessor classes for better maintainability
- Integrate sciencebasepy for ZIP-based habitat map downloads from USGS ScienceBase, with automatic cleanup of temporary files
- Add multi-format output support (CSV/Parquet/Arrow) with Arrow as default, using dictionary encoding for optimized storage and performance
- Enhance metadata by including species common and scientific names from the ScienceBase API
- Add comprehensive CLI arguments for configuration and debug logging
- Improve robustness with better error handling and type annotations
@dangotbanned would welcome your thoughts on whether I'm on the right track here.
Thanks @dsmedia. I've only skimmed through, but could you switch from `argparse` to a TOML config? You'd be able to remove the defaults and the parsing logic from the script - or at least validate the options in one place.
@dsmedia everything appears to be working so far 🎉
* Switch to using zipfile.Path for more Pythonic ZIP file handling
* Enforce expectation of exactly one TIF file per ZIP
* Add error handling for unexpected file counts
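The single-TIF rule described above can be sketched with `zipfile.Path`. This is a minimal illustration rather than the PR's actual `RasterSet.extract_tifs_from_zips()` implementation, and the function name is hypothetical:

```python
import zipfile
from pathlib import Path

def extract_single_tif(zip_fp: Path, out_dir: Path) -> Path:
    """Hypothetical sketch: extract the one .tif a habitat ZIP is expected to hold."""
    # zipfile.Path gives a pathlib-like view into the archive
    tifs = [p for p in zipfile.Path(zip_fp).iterdir() if p.name.lower().endswith(".tif")]
    if len(tifs) != 1:
        # Enforce the "exactly one TIF per ZIP" expectation
        raise RuntimeError(f"Expected exactly 1 .tif in {zip_fp.name}, found {len(tifs)}")
    target = out_dir / tifs[0].name
    target.write_bytes(tifs[0].read_bytes())
    return target
```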
Refactors `species.py` to use TOML configuration (`_data/species.toml`) instead of `argparse`, improving flexibility and maintainability. Settings include `item_ids`, `vector_fp`, `output_dir`, `output_format`, and `debug`. Relative paths (e.g., `../data/us-10m.json`) are resolved relative to the TOML file, and basic validation is added. `RasterSet.extract_tifs_from_zips()` now uses `zipfile.Path` and enforces a single `.tif` file per ZIP, raising a `RuntimeError` otherwise. Type hinting fixes are also included.
Currently this PR provides county-level percentages of year-round predicted habitat for four species. I'm considering expanding the dataset to include both predicted and known ranges across different seasons (summer-only, winter-only, and year-round) for the four species, matching the scope of the USGS report on this dataset.

This expanded scope would also make the dataset more versatile for the broader Vega community beyond the original faceted map use case. Before formally proposing this expansion, I'll first generate a test dataset to evaluate size, performance, and data quality.
Thanks for being so sophisticated with this PR @dsmedia! I'm personally not really up to date with the Arrow format. If the source can be parsed into a dataframe or referenced by URL in the Vega editor, then it is perfect.

Currently the data is in row-oriented format, as that is the easiest way to facet the chart, but it means we pass the threshold of 5K rows. If we store the data column-oriented we stay within the 5K-row limit, but it means there will be additional melt transforms in the specification. I'm leaning towards keeping it as is to simplify the chart specification.

I understand that you want to provide wider access to all options of the data source, but don't lose yourself in all the available possibilities. What you currently have is really great already. Once this referenced feature request is in, vega/vega-lite#9389, we can revisit this dataset to make a map like your screenshot to highlight pixel-based seasonal differences for a single county or a few counties. But if you do have great ideas to approach this, go for it. I just want to say that what you have is awesome already!
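The row- versus column-oriented trade-off mentioned above can be illustrated with a toy pandas example (the species columns here are hypothetical); `melt` is the dataframe-side equivalent of the fold/melt transform a chart spec would otherwise need:

```python
import pandas as pd

# Column-oriented ("wide"): one row per county, one column per species.
wide = pd.DataFrame({
    "county_id": [1001, 1003],
    "bullfrog_pct": [0.12, 0.30],
    "shrew_pct": [0.05, 0.01],
})

# Row-oriented ("long"): one row per county/species pair. Easier to facet
# directly, but the row count multiplies by the number of species.
long_df = wide.melt(id_vars="county_id", var_name="species", value_name="pct")
print(len(long_df))  # 2 counties x 2 species = 4 rows
```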
- Rename species columns for consistency and clarity:
  - GAP_Species -> gap_species_code
  - percent_habitat -> habitat_yearround_pct
  - CommonName -> common_name
  - ScientificName -> scientific_name
- Expand module docstring with detailed information about:
  - Data sources and resolution
  - Projection details
  - Habitat value meanings
  - Output format options
- Improve code comments for future extensibility
- Reference habitat data source in species.toml
- List alternative output format options in the TOML

The changes prepare for potential future addition of seasonal habitat data (summer/winter habitat data) while maintaining backward compatibility.
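The renames in that commit map the old column names to the new snake_case scheme. A minimal pandas sketch (the actual script may apply these differently):

```python
import pandas as pd

# Column renames from the commit message above; "habitat_yearround_pct"
# leaves room for future habitat_summer_pct / habitat_winter_pct columns.
RENAMES = {
    "GAP_Species": "gap_species_code",
    "percent_habitat": "habitat_yearround_pct",
    "CommonName": "common_name",
    "ScientificName": "scientific_name",
}

# Illustrative single-row frame using the old column names
df = pd.DataFrame(
    [["mPASHx", 12.5, "Pacific Shrew", "Sorex pacificus"]],
    columns=list(RENAMES),
).rename(columns=RENAMES)
print(list(df.columns))
```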
Happy to revise to csv if that is deemed best here. I'm curious though if this could be a nice opportunity to create the first standalone dataset in arrow format in the repo. (
This is great perspective. In aa4516f I've kept it as is (with the all-season habitat data only) but reformulated the column names to permit a backward-compatible update in the future. (The data column is now called Separately, @mattijn, exactextract is generating this warning in the terminal.
If harmless, could you help explain/document why this happens (and what it means practically), or can we revise the code to address the conditions causing the warning?
@mattijn do you face the same issue here?
- Fix BLE001 linting errors by replacing `except Exception` with specific exception types
- Use explicit exception handling for RequestException, ValueError, and KeyError
- Improve error messages with more specific diagnostic information
- Maintain existing error handling logic while conforming to coding standards
- Update ScienceBaseClient methods for better exception specificity
- Address linter warnings flagged in CI pipeline (#684)
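The narrowed exception handling described above can be sketched as follows; `fetch_item_title` is a hypothetical stand-in, not a method from the PR, and the `sb` client is assumed to raise `requests` exceptions on network failures:

```python
import requests

def fetch_item_title(sb, item_id: str) -> "str | None":
    """Hypothetical sketch: catch specific failure modes instead of a bare
    `except Exception`, matching the BLE001 fix described above."""
    try:
        return sb.get_item(item_id)["title"]
    except requests.exceptions.RequestException as e:
        print(f"Network error for {item_id}: {e}")
    except KeyError:
        print(f"Item {item_id} has no title field")
    except ValueError as e:
        print(f"Malformed response for {item_id}: {e}")
    return None
```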
I noticed that last week there was a new release of exactextract, https://github.com/isciences/exactextract/releases/tag/v0.2.1. Is this still an issue for you @dsmedia? Otherwise we can just merge this PR and create another issue for the problem we are currently facing, since the computed CSV file is already in this PR, isn't it?
With 9aade3d I forced exactextract to use version |
Works like a charm; thanks @mattijn. @dangotbanned could you see if the download issues are resolved for you and, if so, whether there's anything else needed here?
Looking better so far @dsmedia 🎉! I hadn't got to this stage in (#684 (comment)). Will let you know how it goes.

Update: It worked! Had an issue with removing the temp files, so I switched to
- Removed paths from config
- Use `FileExtension`
- Reduce comments
- Fix 26/27 typing errors
  - TODO: add the remaining one to narwhals-dev/narwhals#2124
- Avoid some deprecated `pandas` API
- Define the hierarchical config in `TypedDict`(s)
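A hierarchical config expressed as `TypedDict`s, as that commit describes, could look like the sketch below. The section and field names here are assumptions loosely based on the earlier TOML keys, not the PR's actual definitions:

```python
from typing import Literal, TypedDict

# Hypothetical sketch of a hierarchical config typed with TypedDicts;
# the field names echo the TOML settings discussed earlier in the thread.
class ExtractConfig(TypedDict):
    item_ids: list[str]

class OutputConfig(TypedDict):
    format: Literal["csv", "parquet", "arrow"]
    debug: bool

class Config(TypedDict):
    extract: ExtractConfig
    output: OutputConfig

# Type checkers can now verify the nested structure of a parsed TOML dict
cfg: Config = {
    "extract": {"item_ids": ["527d0a83e4b0850ea0518326"]},
    "output": {"format": "arrow", "debug": False},
}
```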
No need for comments; they all have docstrings and clear names.
This isn't blocking, but I'd suggest thinking about caching the downloads instead of using a tempdir. The four files total 1.40 GB, so it could be handy not to need to download them every time.
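The caching suggestion above could be sketched like this; `client.download_zip` is a hypothetical stand-in for the actual ScienceBase download call, not a real sciencebasepy method:

```python
from pathlib import Path

def cached_download(client, item_id: str, cache_dir: Path) -> Path:
    """Download an item's ZIP only if it is not already cached locally.

    `client.download_zip(item_id, dest)` is a hypothetical stand-in for
    whatever performs the actual ScienceBase download.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    dest = cache_dir / f"{item_id}.zip"
    if not dest.exists():  # ~1.4 GB total across the four files, so skip re-downloads
        client.download_zip(item_id, dest)
    return dest
```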
I wish I knew how @mattijn got these numbers 🤯
@dangotbanned I noticed in (6c9e3f7) that you moved `vector_fp` and `output_dir` from the TOML file to hardcoded constants, while keeping other settings like `item_ids` and `output_format` configurable. I'm curious about your thinking behind this separation: is there a general rule you follow for deciding what belongs in configuration versus what should be hardcoded? I'd love to learn from your approach to designing configuration systems in projects like this. Thanks!
Thanks @dsmedia. In terms of a rule, I'd say DRY (Don't Repeat Yourself) is relevant here. How does this apply to the original config?
dangotbanned
left a comment
Thanks @dsmedia 🎉
Sorry for the delays in reviewing - really appreciate your efforts!
See (#684 (comment)) for a big response to (#684 (comment))
* feat: Add faceted map example using Species Habitat dataset

  This commit introduces a new example to the Altair documentation showcasing choropleth maps faceted by category. The example visualizes the distribution of suitable habitat for different species across US counties, using the proposed new Species Habitat dataset from vega-datasets (vega/vega-datasets#684).

  Key features of this example:
  - Demonstrates the use of `alt.Chart.mark_geoshape()` for geographical visualizations.
  - Shows how to create faceted maps for comparing categorical data across geographic regions.
  - Utilizes the `transform_lookup` and `transform_calculate` transforms for data manipulation within Altair.
  - Uses a CSV data file temporarily hosted in the vega-datasets repository branch (pending dataset merge).

  This example addresses issue #1711, which requested a faceted map example for the Altair documentation.

  Co-authored-by: Mattijn van Hoek <mattijn@gmail.com>

* match changes to source dataset
* chore: update url to cdn
* update origin of species dataset
* refactor: use declarative faceting instead of manual chart concatenation
* slightly better comment

---------

Co-authored-by: Mattijn van Hoek <mattijn@gmail.com>
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
…itat data (#9661)

## PR Description

The example visualizes the distribution of suitable habitat for different species across US counties, using the new Species Habitat dataset from vega-datasets (vega/vega-datasets#684). Matches example in vega/altair#3809.

This example addresses issue vega/altair#1711, which requested a faceted map example for the Altair documentation.

- [x] This PR is atomic (i.e., it fixes one issue at a time).
- [x] The title is a concise [semantic commit message](https://www.conventionalcommits.org/) (e.g. "fix: correctly handle undefined properties").
- [x] `npm test` runs successfully

Supersedes #9659. Closing the original PR as it was incorrectly based on the main branch. This new PR is from a dedicated feature branch (ds/geo-facet-viz) to align with the project's contribution guidelines.

---------

Co-authored-by: GitHub Actions Bot <vega-actions-bot@users.noreply.github.com>



will close #683
This PR adds a new dataset to `vega-datasets` containing county-level species habitat distribution data for the United States. The dataset is designed to support categorical faceted map examples in Vega visualization libraries, addressing needs discussed in vega/altair#1711 and vega/vega-datasets#683.

Implementation Status

To Do

- Round `pct` to three decimal places
- Rename the `pct` column

Dataset Details

Current Implementation (`species.csv`)

Structure

- Located in the `data/` directory

Fields

- `item_id`
- `common_name`
- `scientific_name`
- `gap_species_code`
- `county_id`
- `habitat_yearround_pct`

Data Generation

The dataset is generated using `scripts/species.py`, which implements the approach suggested by @mattijn in this comment. The script:

- uses `exactextract`
- outputs `data/species.csv`

Known Issues

- Runtime warning (fixed by 9aade3d): `Spatial reference system of input features does not exactly match raster`

Future work

Consider expanding the dataset (in a backward-compatible way) to include summer and winter habitat data, and summer/winter/all-season range data.

Validation of Habitat Percentage Calculation

This image compares the USGS potential habitat map (within this zip file) for bullfrogs (zoomed into the southeastern United States, for clarity) with the output of our code, demonstrating the correct implementation of zonal statistics.

Left: USGS potential habitat map for bullfrogs. This is a raster dataset showing predicted suitable habitat (purple areas) at a 30-meter resolution. It is not aggregated by any administrative boundaries.

Right: Generated choropleth map showing the percentage of suitable habitat within each county. Lighter colors indicate a higher percentage of the county is classified as suitable habitat, while darker colors indicate a lower percentage.

The code successfully overlays the USGS raster data with county boundaries (from `us-10m.json`, a vector dataset) and uses `exactextract` to calculate the percentage of each county that falls within the USGS-defined habitat. The resulting map visually confirms that the habitat percentages are calculated correctly, with higher percentages in counties that overlap significantly with the USGS's predicted habitat area.

Species Habitat Visualization Code