Samk13 · Samk13 · Sep 30, 2025
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1,16 +1,8 @@
-include *.md
-include *.sh
-include *.txt
-include *.py
+include README.md
+include LICENSE
 include license
-include .pylintrc
-include Makefile
-include .editorconfig
-recursive-include .github *.yaml
-recursive-include invenio_subjects_cessda *.csv
-recursive-include invenio_subjects_cessda *.py
-recursive-include invenio_subjects_cessda *.yaml
-recursive-include tests *.py
-include *.lock
-include Pipfile
-include .vscode/*
+include CHANGES.md
+include pyproject.toml
+include invenio_subjects_cessda/__init__.py
+include invenio_subjects_cessda/vocabularies/__init__.py
+recursive-include invenio_subjects_cessda/vocabularies *.yaml *.json *.md
diff --git a/Makefile b/Makefile
@@ -13,11 +13,14 @@ format: ## Format code with Ruff
 	@uv run ruff format .
 
 lint: ## Lint code with Ruff
-	@uv run ruff check .
+	@uv run ruff check . --fix
 
 run: ## Fetch and convert the latest CESSDA vocabularies
 	@uv run python main.py
 
+run-force-delete: ## Fetch and convert the latest CESSDA vocabularies, dropping subjects removed upstream
+	@uv run python main.py --drop-removed-vocabs
+
 package: ## Build source and wheel distributions
 	@rm -rf dist
 	@uvx --from build pyproject-build

diff --git a/README.md b/README.md
@@ -73,52 +73,43 @@ Modify `Makefile` to set the DEBUGGER environment variable to False for less det
 
 ### Updating CESSDA Versions
 
-Last updated: `2024-02-01`
-Check the version date in this README. To fetch the latest CESSDA versions, run:
+To fetch the latest CESSDA vocabularies, run:
 
 ```bash
 make run
 ```
 
-in [config.py](invenio_subjects_cessda/config.py) you have the ability to modify the preferred language and specify the directory for saving vocabularies.
-The endpoint `fullListOfpublishedVocabVersions` includes a full list of all published vocabulary versions enabling you to compare them with the versions that have been installed.
+You can change the preferred languages and output locations in
+[config.py](invenio_subjects_cessda/config.py). The command downloads all
+vocabularies, writes the canonical list to
+`invenio_subjects_cessda/vocabularies/cessda_voc.yaml`, and persists a delta
+report alongside it. Delta filenames now include the UTC timestamp of the run
+(`cessda_voc_delta_YYYYMMDDTHH_MMSS.json`) so each execution leaves an audit
+trail you can revisit for curation.
 
-The following vocabulary versions are included in this release. Remember to update this list during your next upgrade.
+Review the delta report after each update to see which vocabularies were added,
+changed, or missing from the latest upstream catalogue. When entries disappear
+from the fetch but are still retained in the YAML, they are reported under the
+`missing_from_latest` key so it is clear that the export still contains them. If
+you run the synchronisation with removals enabled, the report switches back to
+the traditional `removed` section.
 
-```console
-https://vocabularies.cessda.eu/v2/codes/CdcPublisherNames/6.0.0/en
-https://vocabularies.cessda.eu/v2/codes/CessdaPersistentIdentifierTypes/1.0.0/en
-https://vocabularies.cessda.eu/v2/codes/CountryNamesAndCodes/1.0.0/en
-https://vocabularies.cessda.eu/v2/codes/TopicClassification/4.2.2/en
-https://vocabularies.cessda.eu/v2/codes/AggregationMethod/1.1.2/en
-https://vocabularies.cessda.eu/v2/codes/AnalysisUnit/2.1.3/en
-https://vocabularies.cessda.eu/v2/codes/CharacterSet/1.0.0/en
-https://vocabularies.cessda.eu/v2/codes/CommonalityType/1.0.2/en
-https://vocabularies.cessda.eu/v2/codes/ContributorRole/1.0.2/en
-https://vocabularies.cessda.eu/v2/codes/DataSourceType/1.0.2/en
-https://vocabularies.cessda.eu/v2/codes/DataType/1.1.2/en
-https://vocabularies.cessda.eu/v2/codes/DateType/1.1.2/en
-https://vocabularies.cessda.eu/v2/codes/GeneralDataFormat/2.0.3/en
-https://vocabularies.cessda.eu/v2/codes/LanguageProficiency/1.0.2/en
-https://vocabularies.cessda.eu/v2/codes/LifecycleEventType/1.0.2/en
-https://vocabularies.cessda.eu/v2/codes/ModeOfCollection/4.0.3/en
-https://vocabularies.cessda.eu/v2/codes/NumericType/1.1.0/en
-https://vocabularies.cessda.eu/v2/codes/ResponseUnit/1.0.2/en
-https://vocabularies.cessda.eu/v2/codes/SamplingProcedure/1.1.4/en
-https://vocabularies.cessda.eu/v2/codes/SoftwarePackage/1.0.0/en
-https://vocabularies.cessda.eu/v2/codes/SummaryStatisticType/2.1.2/en
-https://vocabularies.cessda.eu/v2/codes/TimeMethod/1.2.3/en
-https://vocabularies.cessda.eu/v2/codes/TimeZone/1.0.0/en
-https://vocabularies.cessda.eu/v2/codes/TypeOfAddress/1.1.0/en
-https://vocabularies.cessda.eu/v2/codes/TypeOfConceptGroup/1.0.2/en
-https://vocabularies.cessda.eu/v2/codes/TypeOfFrequency/1.0.0/en
-https://vocabularies.cessda.eu/v2/codes/TypeOfInstrument/1.1.2/en
-https://vocabularies.cessda.eu/v2/codes/TypeOfNote/1.1.0/en
-https://vocabularies.cessda.eu/v2/codes/TypeOfTelephone/1.0.0/en
-https://vocabularies.cessda.eu/v2/codes/TypeOfTranslationMethod/1.0.0/en
-https://vocabularies.cessda.eu/v2/codes/Variables-Relations/1.0.0/en
+Existing subjects keep their original identifiers so Invenio instances will not
+create duplicate records when the upstream catalogue republishes entries with
+new IDs.
+
+To explicitly prune removed vocabularies from the YAML, call the synchronisation
+script with the `--drop-removed-vocabs` flag:
+
+```bash
+make run-force-delete
+# or
+python main.py --drop-removed-vocabs
 ```
 
+Without the flag, legacy entries remain in the export for backward
+compatibility, while the delta report records that they are missing upstream.
+
 ## Upload to pypi
 
 Publishing will be done automatically by GitHub actions when a new tag is created.
@@ -127,15 +118,3 @@ Publishing will be done automatically by GitHub actions when a new tag is create
 git tag vX.Y.Z
 git push origin master vX.Y.Z
 ```
-
-## manually upload to pypi
-
-```bash
-make install-package-tools # this will install twine (install-package-tools-pipenv if you use pipenv)
-make package # this will zip the package into dist dir
-make package-check # verify if the package pass twine checks
-
-export TWINE_USERNAME=__token__
-export TWINE_PASSWORD=pypi-<YOUR_TOKEN>
-twine upload dist/*
-```
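The README text in this file describes the sections of the delta report. The code that builds it lives in `invenio_subjects_cessda/delta.py`, which is not part of this excerpt, so the following is only a hypothetical sketch of the classification described. The function name `classify_delta`, the matching by `id`, the report shape, and the sample entries are all assumptions; only the `missing_from_latest` and `removed` keys and the `classify_missing_as_removed` keyword come from the change itself.

```python
def classify_delta(previous, current, classify_missing_as_removed=False):
    """Hypothetical sketch: compare two snapshots keyed by entry id."""
    prev_by_id = {entry["id"]: entry for entry in previous}
    curr_by_id = {entry["id"]: entry for entry in current}

    # Present now, absent from the previous export.
    added = [e for i, e in curr_by_id.items() if i not in prev_by_id]
    # Same id in both snapshots, but some field differs.
    changed = [
        e for i, e in curr_by_id.items() if i in prev_by_id and prev_by_id[i] != e
    ]
    # Present in the previous export, absent from the latest fetch.
    missing = [e for i, e in prev_by_id.items() if i not in curr_by_id]

    # The key name depends on whether missing entries were pruned from the YAML.
    missing_key = "removed" if classify_missing_as_removed else "missing_from_latest"
    return {"added": added, "changed": changed, missing_key: missing}


# Invented sample data, not real CESSDA entries.
previous = [
    {"id": "a", "scheme": "CESSDA", "subject": "Alpha"},
    {"id": "b", "scheme": "CESSDA", "subject": "Beta"},
]
current = [
    {"id": "a", "scheme": "CESSDA", "subject": "Alpha (revised)"},
    {"id": "c", "scheme": "CESSDA", "subject": "Gamma"},
]

report = classify_delta(previous, current)
pruned_report = classify_delta(previous, current, classify_missing_as_removed=True)
```

With the default call, entry `b` would be listed under `missing_from_latest`; with `classify_missing_as_removed=True`, the same entry would be listed under `removed`.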
diff --git a/invenio_subjects_cessda/config.py b/invenio_subjects_cessda/config.py
@@ -9,10 +9,13 @@
 languages = ["en(SL)"]
 """Languages to fetch"""
 
-en_vocabularies_output_path = (
-    Path.cwd() / "invenio_subjects_cessda" / "vocabularies" / "cessda_voc.yaml"
-)
-"""CESSDA Vocabularies path destination"""
+VOCABULARY_DIR = Path.cwd() / "invenio_subjects_cessda" / "vocabularies"
+
+en_vocabularies_output_path = VOCABULARY_DIR / "cessda_voc.yaml"
+"""CESSDA vocabulary export destination."""
+
+en_vocabularies_delta_path = VOCABULARY_DIR / "cessda_voc_delta.json"
+"""Delta report file path recording vocabulary changes between runs."""
 
 
 fullListOfpublishedVocabVersions = [
@@ -22,3 +25,20 @@
     }
 ]
 """https://api.tech.cessda.eu/#/vocabulary-resource-v-2/getAllVocabulariesUsingGET"""
+
+
+# Sort existing CESSDA vocabulary entries by subject and id (run from a shell):
+# ```bash
+# python3 - <<'PY'
+# from pathlib import Path
+# import yaml
+#
+# path = Path("invenio_subjects_cessda/vocabularies/cessda_voc.yaml")
+# entries = yaml.safe_load(path.read_text(encoding="utf-8")) or []
+# entries.sort(key=lambda item: (item["subject"].lower(), item["id"]))
+# path.write_text(
+#     yaml.safe_dump(entries, allow_unicode=True, sort_keys=False),
+#     encoding="utf-8",
+# )
+# PY
+# ```
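Note that `en_vocabularies_delta_path` is only a base path: `convert.py` appends a UTC timestamp to the stem before writing the report. A minimal sketch of the resulting filename, reusing the `TIMESTAMP_FORMAT` string from this change; the helper name `dated_path` and the sample timestamp are illustrative only.

```python
from datetime import datetime, timezone
from pathlib import Path

# Same format string as TIMESTAMP_FORMAT in convert.py.
TIMESTAMP_FORMAT = "%Y%m%dT%H_%M%S"


def dated_path(base_path: Path, timestamp: datetime) -> Path:
    """Insert the formatted timestamp between the stem and the suffix."""
    formatted = timestamp.strftime(TIMESTAMP_FORMAT)
    return base_path.with_name(f"{base_path.stem}_{formatted}{base_path.suffix}")


base = Path("invenio_subjects_cessda/vocabularies/cessda_voc_delta.json")
sample = datetime(2025, 9, 30, 14, 5, 9, tzinfo=timezone.utc)  # arbitrary example
dated = dated_path(base, sample)
print(dated.name)  # cessda_voc_delta_20250930T14_0509.json
```

The underscore sits between the hours and the minutes, which matches the `YYYYMMDDTHH_MMSS` pattern documented in the README.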
diff --git a/invenio_subjects_cessda/convert.py b/invenio_subjects_cessda/convert.py
@@ -4,53 +4,182 @@
 # invenio-subjects-CESSDA is free software, you can redistribute it and/or
 # modify it under the terms of the MIT License; see LICENSE file details.
 
-
-# from typing import Any, Callable, Dict, Iterator, NoReturn
-
+from collections.abc import Iterable, Sequence
+from datetime import datetime, timezone
+from pathlib import Path
 
 from click import secho
-from yaml import dump
+from yaml import safe_dump
 
+from invenio_subjects_cessda.delta import (
+    build_delta_report,
+    load_previous_snapshot,
+    write_delta_report,
+)
 from invenio_subjects_cessda.schemas import cessda_schema
 from invenio_subjects_cessda.utils import logger
 
+VOCAB_FIELDS = ("id", "scheme", "subject")
+TIMESTAMP_FORMAT = "%Y%m%dT%H_%M%S"
 
-def sort_vocabularies(data):
-    """Sort vocabularies by 'id'."""
-    logger.debug("Sorting ...")
-    return sorted(
-        [(v["name"], s) for v in data for s in v["data"]], key=lambda x: x[1]["id"]
-    )
 
+def _sort_key(entry: dict) -> tuple[str, str]:
+    subject = (entry.get("subject") or "").strip().lower()
+    identifier = entry.get("id") or ""
+    return subject, identifier
 
-def process_vocabularies(sorted_vocabularies):
-    """Process sorted vocabularies."""
-    logger.debug("Processing schema ...")
-    return [
-        cessda_schema((name, entry))
-        for name, entry in sorted_vocabularies
-        if cessda_schema((name, entry))
-    ]
 
+def _normalise_entries(data: Sequence[dict]) -> list[dict]:
+    """Flatten and normalise raw vocabulary payloads."""
+    entries: list[dict] = []
+    for vocabulary in data:
+        vocabulary_name = vocabulary.get("name", "")
+        for raw_entry in vocabulary.get("data", []):
+            normalised = cessda_schema(raw_entry)
+            if not normalised:
+                continue
+            normalised["vocabulary"] = vocabulary_name
+            entries.append(normalised)
 
-def write_to_file(processed_data, dpath):
-    """Write processed data to a file."""
-    logger.debug("Writing to: '%s'", dpath)
-    with open(dpath, "w", encoding="utf-8") as yaml_f:
-        for data in processed_data:
-            dump(data, yaml_f, allow_unicode=True, sort_keys=False)
+    entries.sort(key=_sort_key)
+    logger.debug("Normalised %s vocabulary entries", len(entries))
+    return entries
 
 
-def log_conversion_success(dpath):
-    """Log success message."""
-    secho("Converted successfully to ", fg="green", nl=False)
-    secho(f" {dpath}", fg="yellow", bold=True)
+def _prune_for_yaml(entries: Iterable[dict]) -> list[dict]:
+    """Project entries to the public YAML representation."""
+    return [{field: entry[field] for field in VOCAB_FIELDS} for entry in entries]
+
+
+def _build_dated_delta_path(base_path: Path, timestamp: datetime) -> Path:
+    """Return a timestamped filename for the delta report."""
+
+    suffix = base_path.suffix
+    stem = base_path.stem
+    formatted = timestamp.strftime(TIMESTAMP_FORMAT)
+    filename = f"{stem}_{formatted}{suffix}"
+    return base_path.with_name(filename)
+
+
+def _merge_with_previous_snapshot(
+    current_entries: Sequence[dict], previous_entries: Sequence[dict]
+) -> list[dict]:
+    """Keep prior vocab entries and append genuinely new ones."""
+
+    if not previous_entries:
+        # First run - nothing to preserve, return current snapshot verbatim.
+        return list(current_entries)
+
+    merged: list[dict] = [dict(entry) for entry in previous_entries]
+
+    seen_ids = {entry["id"] for entry in previous_entries if entry.get("id")}
+
+    for entry in current_entries:
+        vocab_id = entry.get("id")
+        if vocab_id and vocab_id in seen_ids:
+            continue
+
+        merged.append(entry)
+        if vocab_id:
+            seen_ids.add(vocab_id)
+
+    merged.sort(key=_sort_key)
+    return merged
+
+
+def _write_yaml(entries: Sequence[dict], destination: Path) -> None:
+    """Persist YAML vocabulary payload to disk."""
+    logger.debug("Writing %s entries to '%s'", len(entries), destination)
+    destination.parent.mkdir(parents=True, exist_ok=True)
+    with destination.open("w", encoding="utf-8") as yaml_f:
+        safe_dump(entries, yaml_f, sort_keys=False, allow_unicode=True)
+
 
+def _reuse_existing_ids(
+    entries: list[dict], previous_entries: Sequence[dict]
+) -> list[dict]:
+    """Reuse identifiers for matching subjects to avoid duplicates in updates."""
+
+    if not previous_entries:
+        return entries
+
+    previous_by_subject = {
+        item["subject"].strip().lower(): item
+        for item in previous_entries
+        if item.get("subject") and item.get("id")
+    }
+
+    merged: list[dict] = []
+    seen_subjects: set[str] = set()
+
+    for entry in entries:
+        subject = entry["subject"].strip()
+        subject_key = subject.lower()
+
+        if subject_key in seen_subjects:
+            continue
+
+        previous = previous_by_subject.get(subject_key)
+        if previous:
+            entry["id"] = previous["id"]
+            if previous.get("scheme"):
+                entry["scheme"] = previous["scheme"]
+
+        merged.append(entry)
+        seen_subjects.add(subject_key)
+
+    return merged
+
+
+def convert_vocabularies(
+    data: Sequence[dict],
+    output_path: str | Path,
+    delta_path: str | Path | None = None,
+    retain_removed_entries: bool = True,
+) -> None:
+    """Convert fetched vocabularies to YAML and record a delta report.
+
+    Parameters
+    ----------
+    data:
+        Raw vocabulary payloads grouped by vocabulary name.
+    output_path:
+        Destination for the consolidated YAML export.
+    delta_path:
+        Base path for the JSON delta report (timestamp will be appended).
+    retain_removed_entries:
+        When ``True`` (default), previously exported entries missing from the
+        current payload remain in the YAML for backward compatibility. When
+        ``False``, the output contains only entries present in the latest
+        fetch, effectively pruning removed vocabularies.
+    """
 
-def convert_vocabularies(data, dpath):
-    """Convert vocabularies to yaml."""
     logger.debug("Convert vocabularies started ...")
-    sorted_vocabularies = sort_vocabularies(data)
-    processed_data = process_vocabularies(sorted_vocabularies)
-    write_to_file(processed_data, dpath)
-    log_conversion_success(dpath)
+    destination = Path(output_path)
+    delta_destination = Path(delta_path) if delta_path else None
+    timestamp = datetime.now(timezone.utc)
+
+    current_entries = _normalise_entries(data)
+    previous_entries = load_previous_snapshot(destination)
+
+    current_entries = _reuse_existing_ids(current_entries, previous_entries)
+    merged_entries = (
+        _merge_with_previous_snapshot(current_entries, previous_entries)
+        if retain_removed_entries
+        else current_entries
+    )
+
+    _write_yaml(_prune_for_yaml(merged_entries), destination)
+
+    if delta_destination is not None:
+        delta_destination = _build_dated_delta_path(delta_destination, timestamp)
+        delta_report = build_delta_report(
+            previous_entries,
+            current_entries,
+            classify_missing_as_removed=not retain_removed_entries,
+        )
+        delta_report["generated_at"] = timestamp.isoformat()
+        write_delta_report(delta_destination, delta_report)
+
+    secho("Converted successfully to ", fg="green", nl=False)
+    secho(f" {destination}", fg="yellow", bold=True)
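To illustrate how identifier reuse and retention interact in `convert_vocabularies`, here is a condensed, standalone restatement run on toy data. It is a simplified sketch, not the code from this change: the helper names `reuse_ids` and `merge` are illustrative, the sample entries are invented, and the duplicate-subject filtering and `scheme` reuse of the real helpers are omitted.

```python
def reuse_ids(entries, previous):
    """Give a republished subject the id it had in the previous export."""
    by_subject = {p["subject"].strip().lower(): p for p in previous}
    result = []
    for entry in entries:
        match = by_subject.get(entry["subject"].strip().lower())
        result.append({**entry, "id": match["id"]} if match else dict(entry))
    return result


def merge(current, previous):
    """Keep every previous entry and append entries whose id is unseen."""
    seen = {p["id"] for p in previous}
    merged = [dict(p) for p in previous]
    merged += [c for c in current if c["id"] not in seen]
    return sorted(merged, key=lambda e: (e["subject"].lower(), e["id"]))


# Invented sample data: "census" is republished upstream under a new id,
# "Diary" has disappeared upstream, and "Interview" is new.
previous = [
    {"id": "old-1", "subject": "Census"},
    {"id": "old-2", "subject": "Diary"},
]
fetched = [
    {"id": "new-9", "subject": "census"},
    {"id": "new-3", "subject": "Interview"},
]

current = reuse_ids(fetched, previous)
retained_ids = [e["id"] for e in merge(current, previous)]  # default behaviour
pruned_ids = [e["id"] for e in current]  # with --drop-removed-vocabs
```

In this toy run the republished subject keeps `old-1` in both modes, while `old-2` survives only when removed entries are retained.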