feat: add Pruefidentifikator mapping from AHB database #341
- Bump rebdhuhn>=1.2.0, add fundamend[sqlmodels] and py7zr dependencies
- Add ahb_pruefi module to download AHB DB and query EBD-to-Pruefi mapping
- Integrate mapping into the pipeline: populate pruefidentifikatoren on EbdTableMetaData and pass through to SVG generation
- Add AHB_DB_PATH, GITHUB_TOKEN, FORMAT_VERSION settings

Relates to Hochfrequenz/entscheidungsbaumdiagramme#637

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove backports-zstd, librt, mypy, greenlet etc. which are platform/version-specific transitive deps that fail on Python 3.14. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix dict type annotation to use EbdPruefidentifikator instead of str
- Use EdifactFormatVersion enum properly in ahb_pruefi.py
- Add types-requests to type_check dependencies

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Empty AHB_DB_PATH= in .env was being interpreted as current directory, causing SQLite errors. Now empty strings are coerced to None. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
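The failure mode in this commit message can be reproduced with the standard library alone: `Path("")` normalises to the current directory. The following is only a minimal sketch of the coercion; the helper name `empty_to_none` is hypothetical and not taken from the PR, which presumably does this inside the settings model.

```python
from pathlib import Path
from typing import Optional


def empty_to_none(raw: Optional[str]) -> Optional[Path]:
    """Hypothetical helper: treat a missing or blank setting as 'not configured'."""
    if raw is None or raw.strip() == "":
        return None
    return Path(raw)


# An empty string does not stay empty once it becomes a Path:
# it is normalised to the current directory.
empty_as_path = str(Path(""))

blank = empty_to_none("")              # None: the setting counts as unset
configured = empty_to_none("ahb.db")   # Path("ahb.db")
```

With the coercion in place, a line such as `AHB_DB_PATH=` in `.env` behaves the same as leaving the variable out entirely.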
- Use .get(ebd_key) instead of .get(ebd_key, []) so missing keys return None (= didn't check) rather than [] (= checked, none found)
- Make pruefidentifikatoren assignment consistent (always post-assignment)
- Push EBD qualifier filter into SQL WHERE clause (LIKE 'E_%')
- Delete .7z archive after extraction to free disk space

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
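The first point of this commit, the difference between `None` and `[]`, can be illustrated with a plain dictionary. This is a sketch with made-up keys and values; the real mapping holds EbdPruefidentifikator objects rather than strings.

```python
from typing import Optional

# Illustrative data only: "E_0402" was looked up and has no Prüfidentifikatoren,
# while "E_9999" is not part of the mapping at all.
mapping: dict[str, list[str]] = {
    "E_0401": ["11001", "11002"],
    "E_0402": [],
}


def lookup(ebd_key: str) -> Optional[list[str]]:
    # No default: an unknown key yields None ("didn't check"),
    # a known key with no hits yields [] ("checked, none found").
    return mapping.get(ebd_key)


checked_none_found = lookup("E_0402")   # []
not_checked = lookup("E_9999")          # None

# With a default of [], both cases collapse into the same value.
ambiguous = mapping.get("E_9999", []) == mapping.get("E_0402", [])
```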
hf-kklein left a comment
please update the readme of the repository and explain how to use it together with xml-migs-and-ahbs
src/ebd_toolchain/ahb_pruefi.py
Outdated
if qualifier is not None and _EBD_QUALIFIER_PATTERN.match(qualifier):
    if qualifier not in seen:
        seen[qualifier] = set()
    seen[qualifier].add((EdifactFormatVersion(fv), pruefidentifikator))
Why don't you use the SQL I suggested, where the regex is evaluated directly in SQLite instead of in Python? Wouldn't that dramatically reduce the number of rows read from the DB?
Hochfrequenz/entscheidungsbaumdiagramme#637 (comment)
SELECT v_ahbtabellen.format_version,
       qualifier AS ebd_key,
       JSON_GROUP_ARRAY(DISTINCT pruefidentifikator) AS pruefidentifikatoren
FROM v_ahbtabellen
WHERE qualifier REGEXP 'E_[0-9]+'
GROUP BY v_ahbtabellen.format_version, qualifier
ORDER BY v_ahbtabellen.format_version DESC, ebd_key;
Done in cfd3999 — replaced SQLModel query with the raw SQL from the issue comment. REGEXP + JSON_GROUP_ARRAY + GROUP BY all run in SQLite now.
Note: replaced REGEXP with GLOB 'E_[0-9][0-9][0-9][0-9]' in 6001b3f because SQLite does not ship a built-in REGEXP implementation: it requires a user-defined function to be registered on the connection, which neither sqlmodel nor sqlalchemy does by default. GLOB supports character classes natively and is equivalent here.
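The REGEXP versus GLOB behaviour can be checked with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: `v_ahbtabellen` is modelled as a plain in-memory table with only three columns, and all rows are invented for illustration. It does not use the real AHB database or the PR's `ahb_pruefi` code.

```python
import json
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the real v_ahbtabellen view; the rows are made up.
conn.execute(
    "CREATE TABLE v_ahbtabellen (format_version TEXT, qualifier TEXT, pruefidentifikator TEXT)"
)
conn.executemany(
    "INSERT INTO v_ahbtabellen VALUES (?, ?, ?)",
    [
        ("FV2504", "E_0401", "11001"),
        ("FV2504", "E_0401", "11002"),
        ("FV2504", "E_0401", "11001"),  # duplicate, removed by DISTINCT
        ("FV2504", "Z01", "11003"),     # not an EBD qualifier
        ("FV2504", "E_04", "11004"),    # too short for the four-digit GLOB
    ],
)

# Without a registered user function, REGEXP is rejected by SQLite.
try:
    conn.execute("SELECT 1 FROM v_ahbtabellen WHERE qualifier REGEXP 'E_[0-9]+'").fetchall()
    regexp_available = True
except sqlite3.OperationalError:
    regexp_available = False

# GLOB works out of the box; '_' is a literal character here (unlike in LIKE).
glob_rows = conn.execute(
    """
    SELECT format_version,
           qualifier AS ebd_key,
           JSON_GROUP_ARRAY(DISTINCT pruefidentifikator) AS pruefidentifikatoren
    FROM v_ahbtabellen
    WHERE qualifier GLOB 'E_[0-9][0-9][0-9][0-9]'
    GROUP BY format_version, qualifier
    ORDER BY format_version DESC, ebd_key
    """
).fetchall()

# Alternative: register REGEXP on the connection. SQLite calls it as regexp(pattern, value).
conn.create_function(
    "REGEXP", 2, lambda pattern, value: int(value is not None and re.search(pattern, value) is not None)
)
regexp_keys = sorted(
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT qualifier FROM v_ahbtabellen WHERE qualifier REGEXP 'E_[0-9]+'"
    )
)
```

With this sample data the registered REGEXP also matches the shorter key `E_04`, whereas the GLOB pattern accepts only qualifiers consisting of `E_` plus exactly four digits. The two filters are therefore interchangeable only as long as every EBD key has that shape.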
src/ebd_toolchain/main.py
Outdated
db_path = settings.ahb_db_path
if db_path is None and settings.github_token is not None:
    click.secho("Downloading AHB database from xml-migs-and-ahbs...", fg="cyan")
    db_path = download_ahb_db(settings.github_token)
if db_path is not None:
    click.secho(f"Loading EBD-to-Prüfi mapping from {db_path} (format_version={settings.format_version})")
We don't need settings.ahb_db_path to be configurable. Just assume it's never given and always None when the script starts; as a consequence, you always download it from the release artifact on the first run. YAGNI.
Done in b1e8c62 — removed ahb_db_path, always downloads from release when GITHUB_TOKEN is set.
src/ebd_toolchain/main.py
Outdated
kroki_port: int = Field(alias="KROKI_PORT")
kroki_host: str = Field(alias="KROKI_HOST")
ahb_db_path: Optional[Path] = Field(default=None, alias="AHB_DB_PATH")
we don't need that.
src/ebd_toolchain/main.py
Outdated
kroki_port: int = Field(alias="KROKI_PORT")
kroki_host: str = Field(alias="KROKI_HOST")
ahb_db_path: Optional[Path] = Field(default=None, alias="AHB_DB_PATH")
github_token: Optional[str] = Field(default=None, alias="GITHUB_TOKEN")
Add a description of what it's used for.
Done in b1e8c62 — added description to the github_token field.
Use the suggested SQL query from the issue comment — REGEXP filtering and JSON_GROUP_ARRAY aggregation happen entirely in SQLite, dramatically reducing rows transferred to Python. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
YAGNI — always download the AHB DB from the xml-migs-and-ahbs release when GITHUB_TOKEN is set. Add description to github_token field. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…te assignment

- REGEXP is not natively supported in SQLite; use GLOB instead
- Merge pruefidentifikatoren across format versions instead of silently overwriting when format_version filter is not set
- Consolidate duplicated pruefidentifikatoren assignment into one line after the if/else block

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
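The "merge instead of silently overwriting" point can be sketched as follows. The row shape (format version, EBD key, JSON array) mirrors the grouped SQL result discussed in the review, but the function name `merge_across_versions` and the sample values are assumptions for illustration, and plain strings stand in for EbdPruefidentifikator objects.

```python
import json
from collections import defaultdict

# Made-up rows: the same EBD key appears under two format versions.
rows = [
    ("FV2510", "E_0401", '["11001", "11005"]'),
    ("FV2504", "E_0401", '["11001", "11002"]'),
    ("FV2504", "E_0402", '["11003"]'),
]


def merge_across_versions(result_rows):
    """Union the Prüfidentifikatoren of every format version per EBD key."""
    merged = defaultdict(set)
    for _format_version, ebd_key, pruefis_json in result_rows:
        merged[ebd_key].update(json.loads(pruefis_json))
    return {key: sorted(values) for key, values in merged.items()}


# Naive variant for comparison: a later row replaces an earlier one with the same key.
overwritten = {}
for _format_version, ebd_key, pruefis_json in rows:
    overwritten[ebd_key] = json.loads(pruefis_json)

merged = merge_across_versions(rows)
```

In the naive variant the entry for `E_0401` keeps only the values of whichever row came last, which is the silent data loss the commit describes.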
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- Bump rebdhuhn>=1.2.0 to use the new pruefidentifikatoren field and EbdPruefidentifikator model
- Add fundamend[sqlmodels] and py7zr dependencies
- Add ahb_pruefi module: downloads AHB SQLite DB from xml-migs-and-ahbs releases and queries the v_ahbtabellen view for EBD↔Prüfidentifikator mappings
- Populate pruefidentifikatoren on both EbdNoTableSection and DocxTableConverter paths
- Pass pruefidentifikatoren through to SVG generation for clickable links in the footer
- Add AHB_DB_PATH, GITHUB_TOKEN, FORMAT_VERSION settings

Context
Relates to Hochfrequenz/entscheidungsbaumdiagramme#637
Depends on Hochfrequenz/rebdhuhn v1.2.0
Test plan
- rebdhuhn>=1.2.0 is available on PyPI
- AHB_DB_PATH pointing to a local AHB database
- pruefidentifikatoren field

🤖 Generated with Claude Code