feat: Add Species Habitat Dataset for Faceted Map Examples #684
Conversation
- Replace direct URL downloads with ScienceBase API client (sciencebasepy)
- Remove niquests dependency in favor of native ScienceBase download handling
- Enhance logging for better debugging of file downloads and extractions
- Add explicit ZIP file cleanup after TIF extraction
- Update dependencies to include sciencebasepy and setuptools
Just as a reference for development, this will generate a complete list of species and all species-level metadata.

Generation script:

```python
# /// script
# requires-python = ">=3.9"
# dependencies = [
#     "pandas",
#     "sciencebasepy",
#     "tqdm",
#     "requests",
# ]
# ///
"""
This script retrieves and processes species identifier data from ScienceBase.
It fetches all child items under a specified parent item ID, extracts
identifiers like ECOS and ITIS codes, and compiles the data into a CSV file.
The script uses parallel processing to efficiently handle a large number of
items and includes error handling and retry mechanisms for robustness.
"""
import concurrent.futures
import time
import traceback
from collections import defaultdict

import pandas as pd
import requests
from sciencebasepy import SbSession
from tqdm import tqdm


def get_all_item_ids(parent_id: str = "527d0a83e4b0850ea0518326") -> list[str]:
    """
    Retrieves all child item IDs from a given ScienceBase parent item.

    Args:
        parent_id: The ScienceBase item ID of the parent item.
            Defaults to "527d0a83e4b0850ea0518326" (Habitat Map parent item).

    Returns:
        A list of ScienceBase item IDs. Returns an empty list if there are
        errors or no child items are found.
    """
    print("Retrieving species item IDs from ScienceBase...")
    sb = SbSession()
    try:
        parent_item = sb.get_item(parent_id)
        print(f"Found parent item: '{parent_item.get('title', 'No Title')}'")
        all_ids = list(sb.get_child_ids(parent_id))  # Efficiently fetch all child IDs
        print(f"Found {len(all_ids)} species items.")
        return all_ids
    except Exception as e:
        print(f"Error retrieving items from ScienceBase: {e}")
        traceback.print_exc()  # Full traceback for debugging
        return []


def _extract_item_data(item_id: str, item_json: dict) -> dict:
    """Extracts the title and identifier schemes from an item's JSON."""
    item_data = defaultdict(lambda: 'Not Available')  # Default for missing identifiers
    item_data['item_id'] = item_id
    item_data['title'] = item_json.get('title', 'Unknown Title')
    for identifier in item_json.get('identifiers', []):
        scheme = identifier.get('scheme', 'Unknown Scheme')
        key = identifier.get('key', 'No Value')
        clean_scheme = scheme.split('/')[-1]  # Keep only the last part of the scheme URI
        item_data[clean_scheme] = key
    return dict(item_data)


def process_single_item(item_id: str, sb: SbSession) -> dict:
    """
    Processes a single ScienceBase item to extract identifier data.

    Fetches the JSON representation of a ScienceBase item, extracts the title
    and identifiers (like ECOS and ITIS), and returns the data as a
    dictionary. Transient HTTP errors are retried with exponential backoff.

    Args:
        item_id: The ScienceBase item ID to process.
        sb: An SbSession object for interacting with ScienceBase.

    Returns:
        A dictionary containing 'item_id', 'title', and any identifiers found
        (e.g., 'ECOS', 'ITIS'), or None if the item was not found (404) or
        could not be processed after retries.
    """
    try:
        return _extract_item_data(item_id, sb.get_item(item_id))
    except requests.exceptions.HTTPError as e:
        print(f"\nHTTPError processing item {item_id}: {e}")
        if e.response.status_code == 404:
            print(f"Item {item_id} not found (404). Skipping.")
            return None
        retries = 3
        for i in range(retries):
            try:
                print(f"Retrying item {item_id} (attempt {i + 1}/{retries})...")
                time.sleep(2 ** i)  # Exponential backoff
                return _extract_item_data(item_id, sb.get_item(item_id))
            except requests.exceptions.RequestException:
                if i == retries - 1:
                    print(f"Failed to retrieve item {item_id} after {retries} retries.")
        return None
    except Exception as e:
        print(f"\nError processing item {item_id}: {e}")
        return None


def explore_species_identifiers_parallel(item_ids: list[str], max_workers: int = 10) -> pd.DataFrame:
    """
    Explores and extracts identifiers from ScienceBase items in parallel.

    Processes a list of ScienceBase item IDs concurrently with a thread pool,
    fetching each item via `process_single_item` and aggregating the results
    into a pandas DataFrame.

    Args:
        item_ids: A list of ScienceBase item IDs to process.
        max_workers: The maximum number of threads for parallel processing.
            Tune this to your system and network to avoid overloading resources.

    Returns:
        A DataFrame with one row per ScienceBase item and columns for
        'item_id', 'title', and each identifier scheme (e.g., 'ECOS', 'ITIS').
        Returns an empty DataFrame if no data is processed.
    """
    sb = SbSession()  # A single session shared by all threads
    all_schemes = set()
    results = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process_single_item, item_id, sb) for item_id in item_ids]
        for future in tqdm(concurrent.futures.as_completed(futures), total=len(item_ids), desc="Processing species items"):
            result = future.result()
            if result:
                results.append(result)
                all_schemes.update(key for key in result if key not in ('item_id', 'title'))
    if not results:
        return pd.DataFrame()
    df = pd.DataFrame(results)
    # Ensure all identifier scheme columns are present, then reorder
    for scheme in all_schemes:
        if scheme not in df.columns:
            df[scheme] = 'Not Available'
    cols = ['item_id', 'title'] + sorted(all_schemes)
    return df[cols]


def main():
    """
    Orchestrates the species identifier exploration:

    1. Retrieve all species item IDs from ScienceBase.
    2. Process the items in parallel to extract identifiers.
    3. Print summary statistics of the extracted data.
    4. Save the results to 'all_species_identifiers.csv'.
    5. Display the first few rows of the resulting DataFrame.
    """
    all_ids = get_all_item_ids()
    if not all_ids:
        print("No species IDs found. Exiting.")
        return
    print("\nProcessing species item identifiers...")
    df = explore_species_identifiers_parallel(all_ids, max_workers=10)
    if df.empty:
        print("No data was processed successfully. Exiting.")
        return
    print("\nIdentifier Summary:")
    print(f"Total species items processed: {len(df)}")
    print("\nIdentifier columns found and count of non 'Not Available' values:")
    for col in df.columns:
        # Exclude NaN values from the count ('ne' counts NaN as unequal)
        non_empty_count = df[col].ne('Not Available').sum() - df[col].isna().sum()
        print(f"- {col}: {non_empty_count} items have this identifier")
    output_file = 'all_species_identifiers.csv'
    df.to_csv(output_file, index=False)
    print(f"\nResults saved to '{output_file}'")
    print("\nFirst 5 rows of the data:")
    pd.set_option('display.max_columns', None)  # Display all columns
    pd.set_option('display.max_colwidth', None)  # Display full column width
    print(df.head())


if __name__ == "__main__":
    main()
```

USGS report with domain-specific explanations:

💡 Note: the initial character of `GAP_SpeciesCode` indicates the taxon: amphibians, birds, mammals, reptiles.

💡 Note: The dataset includes 1,590 species and 129 subspecies. The hierarchical relationship between species and subspecies can be inferred from the final character of `GAP_SpeciesCode`. For example, `mPASHc` and `mPASHp` refer to *Sorex pacificus cascadensis* and *Sorex pacificus pacificus*, respectively, both part of `mPASHx`, the Pacific Shrew. The species parent should contain the habitat locations for all subspecies.
Restructures the species habitat analysis script:

- Implement modular architecture with ScienceBaseClient, RasterSet, and HabitatDataProcessor classes for better maintainability
- Integrate sciencebasepy for ZIP-based habitat map downloads from USGS ScienceBase, with automatic cleanup of temporary files
- Add multi-format output support (CSV/Parquet/Arrow) with Arrow as default, using dictionary encoding for optimized storage and performance
- Enhance metadata by including species common and scientific names from the ScienceBase API
- Add comprehensive CLI arguments for configuration and debug logging
- Improve robustness with better error handling and type annotations
@dangotbanned would welcome your thoughts on whether I'm on the right track here.
Thanks @dsmedia. I've only skimmed through, but could you switch from `argparse` to a TOML config? You'd be able to remove the defaults and the parsing logic from the script - or at least validate the options in one place.
@dsmedia everything appears to be working so far 🎉
* Switch to using zipfile.Path for more Pythonic ZIP file handling
* Enforce expectation of exactly one TIF file per ZIP
* Add error handling for unexpected file counts
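The single-TIF rule described above can be sketched with `zipfile.Path`. This is a minimal illustration rather than the PR's actual `RasterSet.extract_tifs_from_zips()` implementation, and the function name is hypothetical:

```python
import zipfile
from pathlib import Path

def extract_single_tif(zip_fp: Path, out_dir: Path) -> Path:
    """Hypothetical sketch: extract the one .tif a habitat ZIP is expected to hold."""
    # zipfile.Path gives a pathlib-like view into the archive
    tifs = [p for p in zipfile.Path(zip_fp).iterdir() if p.name.lower().endswith(".tif")]
    if len(tifs) != 1:
        # Enforce the "exactly one TIF per ZIP" expectation
        raise RuntimeError(f"Expected exactly 1 .tif in {zip_fp.name}, found {len(tifs)}")
    target = out_dir / tifs[0].name
    target.write_bytes(tifs[0].read_bytes())
    return target
```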
Refactors `species.py` to use TOML configuration (`_data/species.toml`) instead of `argparse`, improving flexibility and maintainability. Settings include `item_ids`, `vector_fp`, `output_dir`, `output_format`, and `debug`. Relative paths (e.g., `../data/us-10m.json`) are resolved relative to the TOML file, and basic validation is added. `RasterSet.extract_tifs_from_zips()` now uses `zipfile.Path` and enforces a single `.tif` file per ZIP, raising a `RuntimeError` otherwise. Type hinting fixes are also included.
Currently this PR provides county-level percentages of year-round predicted habitat for four species. I'm considering expanding the dataset to include both predicted and known ranges across different seasons (summer-only, winter-only, and year-round) for the four species, matching the scope of the USGS report on this dataset.

This expanded scope would also make the dataset more versatile for the broader Vega community beyond the original faceted map use case. Before formally proposing this expansion, I'll first generate a test dataset to evaluate size, performance, and data quality.
Thanks for being so sophisticated with this PR @dsmedia! I'm personally not really up to date with the Arrow format. If the source can be parsed into a dataframe or referenced by URL in the Vega editor, then it is perfect.

Currently the data is in row-oriented format, as that is the easiest way to facet the chart, but it means we pass the threshold of 5K rows. If we store the data column-oriented we stay within the 5K-row limit, but it means there will be additional melt transforms in the specification. I'm leaning towards keeping it as is to simplify the chart specification.

I understand that you want to provide wider access to all options of the data source, but don't lose yourself in all the available possibilities. What you currently have is really great already. Once this referenced feature request is in, vega/vega-lite#9389, we can revisit this dataset to make a map like your screenshot to highlight pixel-based seasonal differences for a single county or a few counties. But if you do have great ideas to approach this, go for it. I just want to say that what you have is awesome already!
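The row- versus column-oriented trade-off mentioned above can be illustrated with a toy pandas example (the species columns here are hypothetical); `melt` is the dataframe-side equivalent of the fold/melt transform a chart spec would otherwise need:

```python
import pandas as pd

# Column-oriented ("wide"): one row per county, one column per species.
wide = pd.DataFrame({
    "county_id": [1001, 1003],
    "bullfrog_pct": [0.12, 0.30],
    "shrew_pct": [0.05, 0.01],
})

# Row-oriented ("long"): one row per county/species pair. Easier to facet
# directly, but the row count multiplies by the number of species.
long_df = wide.melt(id_vars="county_id", var_name="species", value_name="pct")
print(len(long_df))  # 2 counties x 2 species = 4 rows
```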
- Rename species columns for consistency and clarity:
  - GAP_Species -> gap_species_code
  - percent_habitat -> habitat_yearround_pct
  - CommonName -> common_name
  - ScientificName -> scientific_name
- Expand module docstring with detailed information about:
  - Data sources and resolution
  - Projection details
  - Habitat value meanings
  - Output format options
- Improve code comments for future extensibility
- Reference habitat data source in species.toml
- List alternative output format options in the TOML

The changes prepare for potential future addition of seasonal habitat data (summer/winter habitat data) while maintaining backward compatibility.
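The renames in that commit map the old column names to the new snake_case scheme. A minimal pandas sketch (the actual script may apply these differently):

```python
import pandas as pd

# Column renames from the commit message above; "habitat_yearround_pct"
# leaves room for future habitat_summer_pct / habitat_winter_pct columns.
RENAMES = {
    "GAP_Species": "gap_species_code",
    "percent_habitat": "habitat_yearround_pct",
    "CommonName": "common_name",
    "ScientificName": "scientific_name",
}

# Illustrative single-row frame using the old column names
df = pd.DataFrame(
    [["mPASHx", 12.5, "Pacific Shrew", "Sorex pacificus"]],
    columns=list(RENAMES),
).rename(columns=RENAMES)
print(list(df.columns))
```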
Happy to revise to csv if that is deemed best here. I'm curious though if this could be a nice opportunity to create the first standalone dataset in arrow format in the repo. (
This is great perspective. In aa4516f I've kept it as is (with the all-season habitat data only) but reformulated the column names to permit a backward-compatible update in the future. (The data column is now called Separately, @mattijn, exactextract is generating this warning in the terminal.
If harmless, could you help explain/document why this happens (and what it means practically), or can we revise the code to address the conditions causing the warning?
@mattijn do you face the same issue here?
- Fix BLE001 linting errors by replacing `except Exception` with specific exception types
- Use explicit exception handling for RequestException, ValueError, and KeyError
- Improve error messages with more specific diagnostic information
- Maintain existing error handling logic while conforming to coding standards
- Update ScienceBaseClient methods for better exception specificity
- Address linter warnings flagged in CI pipeline (#684)
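The narrowed exception handling described above can be sketched as follows; `fetch_item_title` is a hypothetical stand-in, not a method from the PR, and the `sb` client is assumed to raise `requests` exceptions on network failures:

```python
import requests

def fetch_item_title(sb, item_id: str) -> "str | None":
    """Hypothetical sketch: catch specific failure modes instead of a bare
    `except Exception`, matching the BLE001 fix described above."""
    try:
        return sb.get_item(item_id)["title"]
    except requests.exceptions.RequestException as e:
        print(f"Network error for {item_id}: {e}")
    except KeyError:
        print(f"Item {item_id} has no title field")
    except ValueError as e:
        print(f"Malformed response for {item_id}: {e}")
    return None
```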
I noticed that last week there was a new release of exactextract, https://github.com/isciences/exactextract/releases/tag/v0.2.1. Is this still an issue for you @dsmedia? Otherwise we can just merge this PR and create another issue for the problem we are currently facing, since the computed CSV file is already in this PR, isn't it?
With 9aade3d I forced exactextract to use version |
Works like a charm; thanks @mattijn. @dangotbanned could you see if the download issues are resolved for you and, if so, whether there's anything else needed here?
Looking better so far @dsmedia 🎉! I hadn't got to this stage in (#684 (comment)). Will let you know how it goes.

Update: It worked! Had an issue with removing the temp files, so I switched to
- Removed paths from config
- Use `FileExtension`
- Reduce comments
- Fix 26/27 typing errors
  - TODO: add the remaining one to narwhals-dev/narwhals#2124
- Avoid some deprecated `pandas` API
- Define the hierarchical config in `TypedDict`(s)
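A hierarchical config expressed as `TypedDict`s, as that commit describes, could look like the sketch below. The section and field names here are assumptions loosely based on the earlier TOML keys, not the PR's actual definitions:

```python
from typing import Literal, TypedDict

# Hypothetical sketch of a hierarchical config typed with TypedDicts;
# the field names echo the TOML settings discussed earlier in the thread.
class ExtractConfig(TypedDict):
    item_ids: list[str]

class OutputConfig(TypedDict):
    format: Literal["csv", "parquet", "arrow"]
    debug: bool

class Config(TypedDict):
    extract: ExtractConfig
    output: OutputConfig

# Type checkers can now verify the nested structure of a parsed TOML dict
cfg: Config = {
    "extract": {"item_ids": ["527d0a83e4b0850ea0518326"]},
    "output": {"format": "arrow", "debug": False},
}
```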
No need for comments; they all have docstrings and clear names.
This isn't blocking, but I'd suggest thinking about caching the downloads instead of using a tempdir. The four files total 1.40 GB, so it could be handy not to need to download them every time.
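The caching suggestion above could be sketched like this; `client.download_zip` is a hypothetical stand-in for the actual ScienceBase download call, not a real sciencebasepy method:

```python
from pathlib import Path

def cached_download(client, item_id: str, cache_dir: Path) -> Path:
    """Download an item's ZIP only if it is not already cached locally.

    `client.download_zip(item_id, dest)` is a hypothetical stand-in for
    whatever performs the actual ScienceBase download.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    dest = cache_dir / f"{item_id}.zip"
    if not dest.exists():  # ~1.4 GB total across the four files, so skip re-downloads
        client.download_zip(item_id, dest)
    return dest
```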
I wish I knew how @mattijn got these numbers 🤯
@dangotbanned I noticed in (6c9e3f7) that you moved `vector_fp` and `output_dir` from the TOML file to hardcoded constants, while keeping other settings like `item_ids` and `output_format` configurable. I'm curious about your thinking behind this separation: is there a general rule you follow for deciding what belongs in configuration versus what should be hardcoded? I'd love to learn from your approach to designing configuration systems in projects like this. Thanks!
Thanks @dsmedia. In terms of a rule, I'd say DRY (Don't Repeat Yourself) is relevant here. How does this apply to the original config?
dangotbanned
left a comment
Thanks @dsmedia 🎉
Sorry for the delays in reviewing - really appreciate your efforts!
See (#684 (comment)) for a big response to (#684 (comment))
* feat: Add faceted map example using Species Habitat dataset

  This commit introduces a new example to the Altair documentation showcasing choropleth maps faceted by category. The example visualizes the distribution of suitable habitat for different species across US counties, using the proposed new Species Habitat dataset from vega-datasets (vega/vega-datasets#684).

  Key features of this example:
  - Demonstrates the use of `alt.Chart.mark_geoshape()` for geographical visualizations.
  - Shows how to create faceted maps for comparing categorical data across geographic regions.
  - Utilizes the `transform_lookup` and `transform_calculate` transforms for data manipulation within Altair.
  - Uses a CSV data file temporarily hosted in the vega-datasets repository branch (pending dataset merge).

  This example addresses issue #1711, which requested a faceted map example for the Altair documentation.

  Co-authored-by: Mattijn van Hoek <mattijn@gmail.com>

* match changes to source dataset
* chore: update url to cdn
* update origin of species dataset
* refactor: use declarative faceting instead of manual chart concatenation
* slightly better comment

---------

Co-authored-by: Mattijn van Hoek <mattijn@gmail.com>
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
…itat data (#9661)

## PR Description

The example visualizes the distribution of suitable habitat for different species across US counties, using the new Species Habitat dataset from vega-datasets (vega/vega-datasets#684). Matches example in vega/altair#3809.

This example addresses issue vega/altair#1711, which requested a faceted map example for the Altair documentation.

- [x] This PR is atomic (i.e., it fixes one issue at a time).
- [x] The title is a concise [semantic commit message](https://www.conventionalcommits.org/) (e.g. "fix: correctly handle undefined properties").
- [x] `npm test` runs successfully

Supersedes #9659. Closing the original PR as it was incorrectly based on the main branch. This new PR is from a dedicated feature branch (ds/geo-facet-viz) to align with the project's contribution guidelines.

---------

Co-authored-by: GitHub Actions Bot <vega-actions-bot@users.noreply.github.com>



will close #683
This PR adds a new dataset to `vega-datasets` containing county-level species habitat distribution data for the United States. The dataset is designed to support categorical faceted map examples in Vega visualization libraries, addressing needs discussed in vega/altair#1711 and vega/vega-datasets#683.

Implementation Status

To Do

- Round `pct` to three decimal places
- Rename the `pct` column

Dataset Details

Current Implementation (`species.csv`)

Structure

- Located in the `data/` directory

Fields

- `item_id`
- `common_name`
- `scientific_name`
- `gap_species_code`
- `county_id`
- `habitat_yearround_pct`

Data Generation

The dataset is generated using `scripts/species.py`, which implements the approach suggested by @mattijn in this comment. The script:

- uses `exactextract`
- outputs `data/species.csv`

Known Issues

- Runtime warning (fixed by 9aade3d): `Spatial reference system of input features does not exactly match raster`

Future work

Consider expanding the dataset (in a backward-compatible way) to include summer and winter habitat data, and summer/winter/all-season range data.

Validation of Habitat Percentage Calculation

This image compares the USGS potential habitat map (within this zip file) for bullfrogs (zoomed into the southeastern United States, for clarity) with the output of our code, demonstrating the correct implementation of zonal statistics.

Left: USGS potential habitat map for bullfrogs. This is a raster dataset showing predicted suitable habitat (purple areas) at a 30-meter resolution. It is not aggregated by any administrative boundaries.

Right: Generated choropleth map showing the percentage of suitable habitat within each county. Lighter colors indicate a higher percentage of the county is classified as suitable habitat, while darker colors indicate a lower percentage.

The code successfully overlays the USGS raster data with county boundaries (from `us-10m.json`, a vector dataset) and uses `exactextract` to calculate the percentage of each county that falls within the USGS-defined habitat. The resulting map visually confirms that the habitat percentages are calculated correctly, with higher percentages in counties that overlap significantly with the USGS's predicted habitat area.

Species Habitat Visualization Code