22 changes: 7 additions & 15 deletions MANIFEST.in
@@ -1,16 +1,8 @@
include *.md
include *.sh
include *.txt
include *.py
include README.md
include LICENSE
include license
include .pylintrc
include Makefile
include .editorconfig
recursive-include .github *.yaml
recursive-include invenio_subjects_cessda *.csv
recursive-include invenio_subjects_cessda *.py
recursive-include invenio_subjects_cessda *.yaml
recursive-include tests *.py
include *.lock
include Pipfile
include .vscode/*
include CHANGES.md
include pyproject.toml
include invenio_subjects_cessda/__init__.py
include invenio_subjects_cessda/vocabularies/__init__.py
recursive-include invenio_subjects_cessda/vocabularies *.yaml *.json *.md
5 changes: 4 additions & 1 deletion Makefile
Expand Up @@ -13,11 +13,14 @@ format: ## Format code with Ruff
@uv run ruff format .

lint: ## Lint code with Ruff
@uv run ruff check .
@uv run ruff check . --fix

run: ## Fetch and convert the latest CESSDA vocabularies
@uv run python main.py

run-force-delete: ## Drop legacy subjects from the generated files, then re-fetch and convert the latest CESSDA vocabularies
@uv run python main.py --drop-removed-vocabs

package: ## Build source and wheel distributions
@rm -rf dist
@uvx --from build pyproject-build
77 changes: 28 additions & 49 deletions README.md
Expand Up @@ -73,52 +73,43 @@ Modify `Makefile` to set the DEBUGGER environment variable to False for less det

### Updating CESSDA Versions

Last updated: `2024-02-01`
Check the version date in this README. To fetch the latest CESSDA versions, run:
To fetch the latest CESSDA vocabularies, run:

```bash
make run
```

in [config.py](invenio_subjects_cessda/config.py) you have the ability to modify the preferred language and specify the directory for saving vocabularies.
The endpoint `fullListOfpublishedVocabVersions` includes a full list of all published vocabulary versions enabling you to compare them with the versions that have been installed.
You can change the preferred languages and output locations in
[config.py](invenio_subjects_cessda/config.py). The command downloads all
vocabularies, writes the canonical list to
`invenio_subjects_cessda/vocabularies/cessda_voc.yaml`, and persists a delta
report alongside it. Delta filenames now include the UTC timestamp of the run
(`cessda_voc_delta_YYYYMMDDTHH_MMSS.json`) so each execution leaves an audit
trail you can revisit for curation.
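
As a rough illustration of the naming scheme, the sketch below builds a
timestamped delta filename from a base path. The helper name
`dated_delta_path` is hypothetical; only the filename pattern is taken from the
description above.

```python
from datetime import datetime, timezone
from pathlib import Path

# Format string chosen to reproduce the YYYYMMDDTHH_MMSS pattern.
TIMESTAMP_FORMAT = "%Y%m%dT%H_%M%S"


def dated_delta_path(base_path: Path, timestamp: datetime) -> Path:
    """Insert the formatted timestamp between the stem and the suffix."""
    formatted = timestamp.strftime(TIMESTAMP_FORMAT)
    return base_path.with_name(f"{base_path.stem}_{formatted}{base_path.suffix}")


base = Path("invenio_subjects_cessda/vocabularies/cessda_voc_delta.json")
# A fixed UTC timestamp keeps the example reproducible.
example = dated_delta_path(base, datetime(2024, 2, 1, 9, 5, 7, tzinfo=timezone.utc))
print(example.name)  # cessda_voc_delta_20240201T09_0507.json
```

Because each run gets its own filename, earlier reports are never overwritten.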

The following vocabulary versions are included in this release. Remember to update this list during your next upgrade.
Review the delta report after each update to see which vocabularies were added,
changed, or missing from the latest upstream catalogue. When entries disappear
from the fetch but are still retained in the YAML, they are reported under the
`missing_from_latest` key so it is clear that the export still contains them. If
you run the synchronisation with removals enabled, the report switches back to
the traditional `removed` section.
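
To make the two reporting modes concrete, here is a minimal sketch of how such
a report could be assembled by comparing entries on their `id`. The function
`sketch_delta_report` and the `added` and `changed` key names are assumptions
made for illustration; only `missing_from_latest` and `removed` are named
above, and the real report may carry more detail.

```python
def sketch_delta_report(previous, current, classify_missing_as_removed=False):
    """Compare two snapshots by id and group the differences."""
    prev_by_id = {entry["id"]: entry for entry in previous}
    curr_by_id = {entry["id"]: entry for entry in current}

    added_ids = sorted(curr_by_id.keys() - prev_by_id.keys())
    missing_ids = sorted(prev_by_id.keys() - curr_by_id.keys())
    shared_ids = sorted(prev_by_id.keys() & curr_by_id.keys())

    # The same set of absent entries is filed under a different key
    # depending on whether they were actually pruned from the export.
    missing_key = "removed" if classify_missing_as_removed else "missing_from_latest"
    return {
        "added": [curr_by_id[i] for i in added_ids],
        "changed": [curr_by_id[i] for i in shared_ids if curr_by_id[i] != prev_by_id[i]],
        missing_key: [prev_by_id[i] for i in missing_ids],
    }


previous = [{"id": "a", "subject": "Age"}, {"id": "b", "subject": "Birth"}]
current = [{"id": "b", "subject": "Births"}, {"id": "c", "subject": "Census"}]
report = sketch_delta_report(previous, current)
print(sorted(report))
```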

```console
https://vocabularies.cessda.eu/v2/codes/CdcPublisherNames/6.0.0/en
https://vocabularies.cessda.eu/v2/codes/CessdaPersistentIdentifierTypes/1.0.0/en
https://vocabularies.cessda.eu/v2/codes/CountryNamesAndCodes/1.0.0/en
https://vocabularies.cessda.eu/v2/codes/TopicClassification/4.2.2/en
https://vocabularies.cessda.eu/v2/codes/AggregationMethod/1.1.2/en
https://vocabularies.cessda.eu/v2/codes/AnalysisUnit/2.1.3/en
https://vocabularies.cessda.eu/v2/codes/CharacterSet/1.0.0/en
https://vocabularies.cessda.eu/v2/codes/CommonalityType/1.0.2/en
https://vocabularies.cessda.eu/v2/codes/ContributorRole/1.0.2/en
https://vocabularies.cessda.eu/v2/codes/DataSourceType/1.0.2/en
https://vocabularies.cessda.eu/v2/codes/DataType/1.1.2/en
https://vocabularies.cessda.eu/v2/codes/DateType/1.1.2/en
https://vocabularies.cessda.eu/v2/codes/GeneralDataFormat/2.0.3/en
https://vocabularies.cessda.eu/v2/codes/LanguageProficiency/1.0.2/en
https://vocabularies.cessda.eu/v2/codes/LifecycleEventType/1.0.2/en
https://vocabularies.cessda.eu/v2/codes/ModeOfCollection/4.0.3/en
https://vocabularies.cessda.eu/v2/codes/NumericType/1.1.0/en
https://vocabularies.cessda.eu/v2/codes/ResponseUnit/1.0.2/en
https://vocabularies.cessda.eu/v2/codes/SamplingProcedure/1.1.4/en
https://vocabularies.cessda.eu/v2/codes/SoftwarePackage/1.0.0/en
https://vocabularies.cessda.eu/v2/codes/SummaryStatisticType/2.1.2/en
https://vocabularies.cessda.eu/v2/codes/TimeMethod/1.2.3/en
https://vocabularies.cessda.eu/v2/codes/TimeZone/1.0.0/en
https://vocabularies.cessda.eu/v2/codes/TypeOfAddress/1.1.0/en
https://vocabularies.cessda.eu/v2/codes/TypeOfConceptGroup/1.0.2/en
https://vocabularies.cessda.eu/v2/codes/TypeOfFrequency/1.0.0/en
https://vocabularies.cessda.eu/v2/codes/TypeOfInstrument/1.1.2/en
https://vocabularies.cessda.eu/v2/codes/TypeOfNote/1.1.0/en
https://vocabularies.cessda.eu/v2/codes/TypeOfTelephone/1.0.0/en
https://vocabularies.cessda.eu/v2/codes/TypeOfTranslationMethod/1.0.0/en
https://vocabularies.cessda.eu/v2/codes/Variables-Relations/1.0.0/en
Existing subjects keep their original identifiers so Invenio instances will not
create duplicate records when the upstream catalogue republishes entries with
new IDs.
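
The idea can be sketched as a case-insensitive lookup on the subject label:
when a freshly fetched entry matches a subject that was already exported, it
takes over the earlier identifier. The function name `reuse_ids` and the sample
identifiers are hypothetical, and this simplified version skips details such as
duplicate-subject handling.

```python
def reuse_ids(current, previous):
    """Return current entries, reusing the previous id for matching subjects."""
    previous_ids = {
        entry["subject"].strip().lower(): entry["id"] for entry in previous
    }
    result = []
    for entry in current:
        key = entry["subject"].strip().lower()
        # Fall back to the freshly fetched id when the subject is new.
        result.append({**entry, "id": previous_ids.get(key, entry["id"])})
    return result


previous = [{"id": "old-1", "subject": "Interview"}]
current = [
    {"id": "new-9", "subject": "interview "},
    {"id": "new-10", "subject": "Observation"},
]
stable = reuse_ids(current, previous)
print([entry["id"] for entry in stable])  # ['old-1', 'new-10']
```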

To explicitly prune removed vocabularies from the YAML, call the synchronisation
script with the `--drop-removed-vocabs` flag:

```bash
make run-force-delete
# or
python main.py --drop-removed-vocabs
```

Without the flag, legacy entries remain in the export for backward
compatibility, while the delta report records that they are missing upstream.
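
A minimal sketch of the default, retaining behaviour follows: previously
exported entries are kept as they are, and only entries with unseen identifiers
are appended. The function name `merge_retaining_legacy` and the sample data
are illustrative assumptions, not the project's API.

```python
def merge_retaining_legacy(current, previous):
    """Keep every previous entry and append current entries with new ids."""
    merged = [dict(entry) for entry in previous]
    seen_ids = {entry["id"] for entry in previous}
    for entry in current:
        if entry["id"] in seen_ids:
            continue  # already exported; the earlier copy wins
        merged.append(dict(entry))
        seen_ids.add(entry["id"])
    # Sort by subject, then id, so the export order stays stable.
    return sorted(merged, key=lambda e: (e["subject"].strip().lower(), e["id"]))


previous = [{"id": "a", "subject": "Legacy topic"}, {"id": "b", "subject": "Census"}]
current = [{"id": "b", "subject": "Census"}, {"id": "c", "subject": "Archive"}]
merged = merge_retaining_legacy(current, previous)
print([entry["id"] for entry in merged])  # ['c', 'b', 'a']
```

With `--drop-removed-vocabs`, the export would instead contain only the
`current` list, so entry `a` would disappear from the YAML.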

## Upload to pypi

Publishing will be done automatically by GitHub actions when a new tag is created.
Expand All @@ -127,15 +118,3 @@ Publishing will be done automatically by GitHub actions when a new tag is create
git tag vX.Y.Z
git push origin master vX.Y.Z
```

## manually upload to pypi

```bash
make install-package-tools # this will install twine (install-package-tools-pipenv if you use pipenv)
make package # this will zip the package into dist dir
make package-check # verify if the package pass twine checks

export TWINE_USERNAME=__token__
export TWINE_PASSWORD=pypi-<YOUR_TOKEN>
twine upload dist/*
```
28 changes: 24 additions & 4 deletions invenio_subjects_cessda/config.py
Expand Up @@ -9,10 +9,13 @@
languages = ["en(SL)"]
"""Languages to fetch"""

en_vocabularies_output_path = (
Path.cwd() / "invenio_subjects_cessda" / "vocabularies" / "cessda_voc.yaml"
)
"""CESSDA Vocabularies path destination"""
VOCABULARY_DIR = Path.cwd() / "invenio_subjects_cessda" / "vocabularies"

en_vocabularies_output_path = VOCABULARY_DIR / "cessda_voc.yaml"
"""CESSDA vocabulary export destination."""

en_vocabularies_delta_path = VOCABULARY_DIR / "cessda_voc_delta.json"
"""Delta report file path recording vocabulary changes between runs."""


fullListOfpublishedVocabVersions = [
Expand All @@ -22,3 +25,20 @@
}
]
"""https://api.tech.cessda.eu/#/vocabulary-resource-v-2/getAllVocabulariesUsingGET"""


# # Sort existing CESSDA vocabulary entries by subject and id
# ```bash
# python3 - <<'PY'
# from pathlib import Path
# import yaml

# path = Path("invenio_subjects_cessda/vocabularies/cessda_voc.yaml")
# entries = yaml.safe_load(path.read_text(encoding="utf-8")) or []
# entries.sort(key=lambda item: (item["subject"].lower(), item["id"]))
# path.write_text(
# yaml.safe_dump(entries, allow_unicode=True, sort_keys=False),
# encoding="utf-8",
# )
# PY
# ```
197 changes: 163 additions & 34 deletions invenio_subjects_cessda/convert.py
Expand Up @@ -4,53 +4,182 @@
# invenio-subjects-CESSDA is free software, you can redistribute it and/or
# modify it under the terms of the MIT License; see LICENSE file details.


# from typing import Any, Callable, Dict, Iterator, NoReturn

from collections.abc import Iterable, Sequence
from datetime import datetime, timezone
from pathlib import Path

from click import secho
from yaml import dump
from yaml import safe_dump

from invenio_subjects_cessda.delta import (
build_delta_report,
load_previous_snapshot,
write_delta_report,
)
from invenio_subjects_cessda.schemas import cessda_schema
from invenio_subjects_cessda.utils import logger

VOCAB_FIELDS = ("id", "scheme", "subject")
TIMESTAMP_FORMAT = "%Y%m%dT%H_%M%S"

def sort_vocabularies(data):
"""Sort vocabularies by 'id'."""
logger.debug("Sorting ...")
return sorted(
[(v["name"], s) for v in data for s in v["data"]], key=lambda x: x[1]["id"]
)

def _sort_key(entry: dict) -> tuple[str, str]:
subject = (entry.get("subject") or "").strip().lower()
identifier = entry.get("id") or ""
return subject, identifier

def process_vocabularies(sorted_vocabularies):
"""Process sorted vocabularies."""
logger.debug("Processing schema ...")
return [
cessda_schema((name, entry))
for name, entry in sorted_vocabularies
if cessda_schema((name, entry))
]

def _normalise_entries(data: Sequence[dict]) -> list[dict]:
"""Flatten and normalise raw vocabulary payloads."""
entries: list[dict] = []
for vocabulary in data:
vocabulary_name = vocabulary.get("name", "")
for raw_entry in vocabulary.get("data", []):
normalised = cessda_schema(raw_entry)
if not normalised:
continue
normalised["vocabulary"] = vocabulary_name
entries.append(normalised)

def write_to_file(processed_data, dpath):
"""Write processed data to a file."""
logger.debug("Writing to: '%s'", dpath)
with open(dpath, "w", encoding="utf-8") as yaml_f:
for data in processed_data:
dump(data, yaml_f, allow_unicode=True, sort_keys=False)
entries.sort(key=_sort_key)
logger.debug("Normalised %s vocabulary entries", len(entries))
return entries


def log_conversion_success(dpath):
"""Log success message."""
secho("Converted successfully to ", fg="green", nl=False)
secho(f" {dpath}", fg="yellow", bold=True)
def _prune_for_yaml(entries: Iterable[dict]) -> list[dict]:
"""Project entries to the public YAML representation."""
return [{field: entry[field] for field in VOCAB_FIELDS} for entry in entries]


def _build_dated_delta_path(base_path: Path, timestamp: datetime) -> Path:
"""Return a timestamped filename for the delta report."""

suffix = base_path.suffix
stem = base_path.stem
formatted = timestamp.strftime(TIMESTAMP_FORMAT)
filename = f"{stem}_{formatted}{suffix}"
return base_path.with_name(filename)


def _merge_with_previous_snapshot(
current_entries: Sequence[dict], previous_entries: Sequence[dict]
) -> list[dict]:
"""Keep prior vocab entries and append genuinely new ones."""

if not previous_entries:
# First run - nothing to preserve, return current snapshot verbatim.
return list(current_entries)

merged: list[dict] = [dict(entry) for entry in previous_entries]

seen_ids = {entry["id"] for entry in previous_entries if entry.get("id")}

for entry in current_entries:
vocab_id = entry.get("id")
if vocab_id and vocab_id in seen_ids:
continue

merged.append(entry)
if vocab_id:
seen_ids.add(vocab_id)

merged.sort(key=_sort_key)
return merged


def _write_yaml(entries: Sequence[dict], destination: Path) -> None:
"""Persist YAML vocabulary payload to disk."""
logger.debug("Writing %s entries to '%s'", len(entries), destination)
destination.parent.mkdir(parents=True, exist_ok=True)
with destination.open("w", encoding="utf-8") as yaml_f:
safe_dump(entries, yaml_f, sort_keys=False, allow_unicode=True)


def _reuse_existing_ids(
entries: list[dict], previous_entries: Sequence[dict]
) -> list[dict]:
"""Reuse identifiers for matching subjects to avoid duplicates in updates."""

if not previous_entries:
return entries

previous_by_subject = {
item["subject"].strip().lower(): item
for item in previous_entries
if item.get("subject") and item.get("id")
}

merged: list[dict] = []
seen_subjects: set[str] = set()

for entry in entries:
subject = entry["subject"].strip()
subject_key = subject.lower()

if subject_key in seen_subjects:
continue

previous = previous_by_subject.get(subject_key)
if previous:
entry["id"] = previous["id"]
if previous.get("scheme"):
entry["scheme"] = previous["scheme"]

merged.append(entry)
seen_subjects.add(subject_key)

return merged


def convert_vocabularies(
data: Sequence[dict],
output_path: str | Path,
delta_path: str | Path | None = None,
retain_removed_entries: bool = True,
) -> None:
"""Convert fetched vocabularies to YAML and record a delta report.

Parameters
----------
data:
Raw vocabulary payloads grouped by vocabulary name.
output_path:
Destination for the consolidated YAML export.
delta_path:
Base path for the JSON delta report (timestamp will be appended).
retain_removed_entries:
When ``True`` (default) previously exported entries missing from the
current payload remain in the YAML so they can be reintroduced later.
When ``False`` the output only contains entries present in the latest
fetch, effectively pruning removed vocabularies.
"""

def convert_vocabularies(data, dpath):
"""Convert vocabularies to yaml."""
logger.debug("Convert vocabularies started ...")
sorted_vocabularies = sort_vocabularies(data)
processed_data = process_vocabularies(sorted_vocabularies)
write_to_file(processed_data, dpath)
log_conversion_success(dpath)
destination = Path(output_path)
delta_destination = Path(delta_path) if delta_path else None
timestamp = datetime.now(timezone.utc)

current_entries = _normalise_entries(data)
previous_entries = load_previous_snapshot(destination)

current_entries = _reuse_existing_ids(current_entries, previous_entries)
merged_entries = (
_merge_with_previous_snapshot(current_entries, previous_entries)
if retain_removed_entries
else current_entries
)

_write_yaml(_prune_for_yaml(merged_entries), destination)

if delta_destination is not None:
delta_destination = _build_dated_delta_path(delta_destination, timestamp)
delta_report = build_delta_report(
previous_entries,
current_entries,
classify_missing_as_removed=not retain_removed_entries,
)
delta_report["generated_at"] = timestamp.isoformat()
write_delta_report(delta_destination, delta_report)

secho("Converted successfully to ", fg="green", nl=False)
secho(f" {destination}", fg="yellow", bold=True)