Enhancement: Improve Ontology Discovery and Metadata Validation in DBpedia Archivo #56
Conversation
When an ontology declares its own version via standard RDF properties (`owl:versionInfo`, `schema:version`, `pav:version`, `dcterms:hasVersion`, `void:version`, `dc:version`), use it by prepending the normalized version string to the timestamp-based Databus version ID.

Changes:
- `graph_handling.py`: add `get_version()` to extract the version from an RDF graph
- `string_tools.py`: add `normalize_version_string()` to sanitize raw version strings into safe Databus version ID segments
- `crawling/discovery.py`: apply version extraction in `discover_new_uri()` and `handle_track_this_uri()`
- `update/update_archivo.py`: apply version extraction in `update_for_ontology_uri()` when creating new diff versions
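The prepending step described above can be sketched as follows. `normalize_version_string` is quoted from the PR's review excerpt; the exact separator and timestamp format used when composing the Databus version ID are assumptions for illustration, not Archivo's actual code:

```python
import re
from datetime import datetime, timezone

def normalize_version_string(version: str) -> str:
    # Replace non-alphanumeric characters (except ._-) with "-" (as in the PR)
    normalized = re.sub(r"[^a-zA-Z0-9\._-]", "-", version)
    # Remove leading/trailing separator characters
    normalized = normalized.strip("-._")
    return normalized if normalized else "defaultVersion"

def build_version_id(ontology_version, timestamp=None):
    """Prefix the timestamp-based Databus version ID with the declared
    ontology version, when one exists (hypothetical composition)."""
    if timestamp is None:
        timestamp = datetime.now(timezone.utc).strftime("%Y.%m.%d-%H%M%S")
    if ontology_version:
        return f"{normalize_version_string(ontology_version)}-{timestamp}"
    return timestamp

print(build_version_id("v1.2.3 (beta)", "2024.06.01-120000"))
# → v1.2.3--beta-2024.06.01-120000
```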
📝 Walkthrough

This change introduces ontology versioning support across the codebase: it adds a function to retrieve ontology versions from RDF graphs and a utility to normalize version strings, and integrates both into the crawling and update workflows so that version identifiers are prefixed with normalized ontology versions.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🧹 Nitpick comments (1)
archivo/crawling/discovery.py (1)
727-730: Extract version-id composition into one shared helper. The same “read ontology version → normalize → prefix timestamp” block is duplicated; centralizing it avoids behavioral drift across the discovery and update flows.
Also applies to: 881-884
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@archivo/crawling/discovery.py` around lines 727 - 730, Extract the duplicated "read ontology version → normalize → prefix timestamp" logic into a single helper (e.g., make_version_id or compose_version_id) and replace both occurrences: call graph_handling.get_version(onto_graph) to get ontology_version, then inside the helper call string_tools.normalize_version_string(ontology_version) and prefix the existing version_id/timestamp, returning the final version_id; update the sites where version composition occurs (the block using ontology_version, string_tools.normalize_version_string, and version_id) to call this new helper so discovery and update flows share identical behavior.
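A sketch of what such a shared helper could look like (the helper name and signature are hypothetical; `normalize_version_string` is the PR's own utility, reproduced here so the sketch is self-contained):

```python
import re

def normalize_version_string(version: str) -> str:
    # The PR's utility from archivo/utils/string_tools.py
    normalized = re.sub(r"[^a-zA-Z0-9\._-]", "-", version)
    normalized = normalized.strip("-._")
    return normalized if normalized else "defaultVersion"

def compose_version_id(timestamp_id: str, ontology_version=None) -> str:
    """One shared place for the "normalize -> prefix timestamp" step,
    so the discovery and update call sites cannot drift apart."""
    if not ontology_version:
        return timestamp_id
    return f"{normalize_version_string(ontology_version)}-{timestamp_id}"
```

Both duplicated sites (discovery.py lines 727-730 and 881-884) and the update flow would then call something like `compose_version_id(timestamp_id, graph_handling.get_version(onto_graph))` instead of repeating the block.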
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@archivo/querying/graph_handling.py`:
- Around line 251-259: The queryString currently uses LIMIT 1 without
deterministic ordering; update the SPARQL to select the version property
(?versionProp) alongside ?version and add an ORDER BY that enforces
property-priority (e.g., CASE/IF mapping owl:versionInfo→1, schema:version→2,
pav:version→3, dcterms:hasVersion→4, void:version→5, dc:version→6, else→99) so
the row chosen is deterministic, then keep LIMIT 1; modify the queryString
construction in graph_handling.py (the variable named queryString) to include
?versionProp in the WHERE/SELECT and an ORDER BY with the CASE expression that
implements the listed priority.
In `@archivo/utils/string_tools.py`:
- Around line 82-89: The normalize_version_string function currently returns
unbounded strings; update it to enforce a maximum length (e.g., 32 chars) and,
when the normalized value exceeds that limit, truncate it and append a short
deterministic hash suffix (e.g., first 8 chars of SHA256 of the original
version) so identifiers remain unique and bounded; keep the existing
normalization (re.sub + strip) and fallback to "defaultVersion" for empty input,
and ensure the final returned string length is <= the chosen MAX_VERSION_LEN and
contains only the normalized characters.
---
Nitpick comments:
In `@archivo/crawling/discovery.py`:
- Around line 727-730: Extract the duplicated "read ontology version → normalize
→ prefix timestamp" logic into a single helper (e.g., make_version_id or
compose_version_id) and replace both occurrences: call
graph_handling.get_version(onto_graph) to get ontology_version, then inside the
helper call string_tools.normalize_version_string(ontology_version) and prefix
the existing version_id/timestamp, returning the final version_id; update the
sites where version composition occurs (the block using ontology_version,
string_tools.normalize_version_string, and version_id) to call this new helper
so discovery and update flows share identical behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7d053cb2-4f0e-48ed-8f3d-e8fb0b509515
📒 Files selected for processing (4)
- archivo/crawling/discovery.py
- archivo/querying/graph_handling.py
- archivo/update/update_archivo.py
- archivo/utils/string_tools.py
```python
queryString = (
    "SELECT DISTINCT ?version \n"
    "WHERE {\n"
    " VALUES ?versionProp { owl:versionInfo schema:version pav:version dcterms:hasVersion void:version dc:version }\n"
    " VALUES ?type { owl:Ontology skos:ConceptScheme }\n"
    " ?uri a ?type .\n"
    " ?uri ?versionProp ?version .\n"
    "} LIMIT 1"
)
```
🧩 Analysis chain

🏁 Scripts executed:

```shell
# First, check if the file exists and read the relevant lines
cat -n archivo/querying/graph_handling.py | sed -n '240,270p'

# Also check the overall file structure to ensure we're looking at the right section
wc -l archivo/querying/graph_handling.py

# Check how the query results are used
cat -n archivo/querying/graph_handling.py | sed -n '260,280p'

# Check if there are any tests related to version handling
find . -name "*.py" -type f | xargs grep -l "get_version" | head -5

# Check if the repository uses SPARQL and validate the syntax concept
rg "VALUES.*?prio" --type py -A 3

# Check imports to see what SPARQL/RDF library is used
head -50 archivo/querying/graph_handling.py

# Check requirements or setup.py for the rdflib version
find . -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" | xargs cat 2>/dev/null | grep -i rdf

# Verify SPARQL tuple syntax support in rdflib
rg "VALUES.*\(" --type py | head -10

# Check if other queries in the codebase use similar patterns or ORDER BY
rg "ORDER BY" --type py -B 3 -A 1

# Check if there are tests for the get_version function
grep -r "get_version" --include="*.py" | grep -i test

# Search for SPARQL query patterns in the codebase to understand typical usage
rg "SELECT.*WHERE" --type py -A 5 | head -30

# Look for any SPARQL-related documentation or comments in the file
grep -n "SPARQL\|query\|ORDER BY\|VALUES" archivo/querying/graph_handling.py | head -20

# Check rdflib support for SPARQL 1.1 tuple VALUES syntax
python3 << 'EOF'
# SPARQL 1.1 tuple syntax (VALUES with multiple variables) is part of the standard
version = "6.2.0"
print(f"rdflib version: {version}")
print("SPARQL 1.1 tuple syntax (VALUES with multiple variables) is part of SPARQL 1.1 standard.")
print("rdflib 6.2.0 fully supports SPARQL 1.1 with tuple syntax.")
print("\nProposed fix syntax:")
print("  VALUES (?prio ?versionProp) { (0 owl:versionInfo) ... }")
print("\nThis is valid SPARQL 1.1 syntax supported by rdflib 6.2.0+")
EOF

# Check if there are real scenarios where multiple version properties could exist
# by examining test data or example RDF files
find . -name "*.rdf" -o -name "*.owl" -o -name "*.ttl" | head -5

# Look at the git history or comments to understand why multiple version properties are checked
grep -B 5 -A 5 "versionInfo.*schema:version" archivo/querying/graph_handling.py
```
Make ontology version selection deterministic.
LIMIT 1 without an ORDER BY returns non-deterministic results when multiple version properties are present, causing downstream version IDs to be unstable.
💡 Proposed fix (property-priority ordering)
```diff
 queryString = (
     "SELECT DISTINCT ?version \n"
     "WHERE {\n"
-    " VALUES ?versionProp { owl:versionInfo schema:version pav:version dcterms:hasVersion void:version dc:version }\n"
+    " VALUES (?prio ?versionProp) {\n"
+    "   (0 owl:versionInfo)\n"
+    "   (1 schema:version)\n"
+    "   (2 pav:version)\n"
+    "   (3 dcterms:hasVersion)\n"
+    "   (4 void:version)\n"
+    "   (5 dc:version)\n"
+    " }\n"
     " VALUES ?type { owl:Ontology skos:ConceptScheme }\n"
     " ?uri a ?type .\n"
     " ?uri ?versionProp ?version .\n"
-    "} LIMIT 1"
+    "} ORDER BY ?prio LIMIT 1"
 )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
queryString = (
    "SELECT DISTINCT ?version \n"
    "WHERE {\n"
    " VALUES (?prio ?versionProp) {\n"
    "   (0 owl:versionInfo)\n"
    "   (1 schema:version)\n"
    "   (2 pav:version)\n"
    "   (3 dcterms:hasVersion)\n"
    "   (4 void:version)\n"
    "   (5 dc:version)\n"
    " }\n"
    " VALUES ?type { owl:Ontology skos:ConceptScheme }\n"
    " ?uri a ?type .\n"
    " ?uri ?versionProp ?version .\n"
    "} ORDER BY ?prio LIMIT 1"
)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@archivo/querying/graph_handling.py` around lines 251 - 259, The queryString
currently uses LIMIT 1 without deterministic ordering; update the SPARQL to
select the version property (?versionProp) alongside ?version and add an ORDER
BY that enforces property-priority (e.g., CASE/IF mapping owl:versionInfo→1,
schema:version→2, pav:version→3, dcterms:hasVersion→4, void:version→5,
dc:version→6, else→99) so the row chosen is deterministic, then keep LIMIT 1;
modify the queryString construction in graph_handling.py (the variable named
queryString) to include ?versionProp in the WHERE/SELECT and an ORDER BY with
the CASE expression that implements the listed priority.
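The effect of the proposed priority ordering can be illustrated outside SPARQL with a plain lookup (prefixed property names stand in for the full URIs; this mirrors the `VALUES (?prio ?versionProp)` table from the suggestion, not Archivo's actual code):

```python
# Priority order mirroring the proposed VALUES (?prio ?versionProp) table
VERSION_PROPERTY_PRIORITY = [
    "owl:versionInfo",
    "schema:version",
    "pav:version",
    "dcterms:hasVersion",
    "void:version",
    "dc:version",
]

def pick_version(found_versions: dict):
    """Return the literal of the highest-priority property present,
    the way ORDER BY ?prio LIMIT 1 would."""
    for prop in VERSION_PROPERTY_PRIORITY:
        if prop in found_versions:
            return found_versions[prop]
    return None

# With several version properties present, owl:versionInfo always wins:
print(pick_version({"dc:version": "0.9", "owl:versionInfo": "1.2.0"}))  # → 1.2.0
```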
```python
def normalize_version_string(version: str) -> str:
    # Replace non-alphanumeric characters (except ._- ) with -
    normalized = re.sub(r"[^a-zA-Z0-9\._-]", "-", version)
    # remove leading/trailing non-alphanumeric
    normalized = normalized.strip("-._")
    if normalized == "":
        return "defaultVersion"
    return normalized
```
Bound normalized version length before using it in IDs.
This accepts unbounded ontology literals and forwards them into version identifiers. Very long values can break downstream path/identifier handling.
💡 Proposed fix (truncate with hash suffix)
```diff
+_MAX_VERSION_TOKEN_LENGTH = 96
+
 def normalize_version_string(version: str) -> str:
     # Replace non-alphanumeric characters (except ._- ) with -
     normalized = re.sub(r"[^a-zA-Z0-9\._-]", "-", version)
     # remove leading/trailing non-alphanumeric
     normalized = normalized.strip("-._")
     if normalized == "":
         return "defaultVersion"
+    if len(normalized) > _MAX_VERSION_TOKEN_LENGTH:
+        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:10]
+        head = normalized[: _MAX_VERSION_TOKEN_LENGTH - 11]
+        normalized = f"{head}-{digest}"
     return normalized
```

Note that the new branch also requires `import hashlib` at the top of `string_tools.py`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@archivo/utils/string_tools.py` around lines 82 - 89, The
normalize_version_string function currently returns unbounded strings; update it
to enforce a maximum length (e.g., 32 chars) and, when the normalized value
exceeds that limit, truncate it and append a short deterministic hash suffix
(e.g., first 8 chars of SHA256 of the original version) so identifiers remain
unique and bounded; keep the existing normalization (re.sub + strip) and
fallback to "defaultVersion" for empty input, and ensure the final returned
string length is <= the chosen MAX_VERSION_LEN and contains only the normalized
characters.
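Put together, the bounded function would read as the following self-contained sketch. The constants follow the diff above (96-character cap, 10 hex chars of SHA-1); note the agent prompt instead suggests 32 characters and SHA-256, so one of the two limits would need to be picked before merging:

```python
import hashlib
import re

_MAX_VERSION_TOKEN_LENGTH = 96

def normalize_version_string(version: str) -> str:
    # Replace non-alphanumeric characters (except ._-) with "-"
    normalized = re.sub(r"[^a-zA-Z0-9\._-]", "-", version)
    # Remove leading/trailing separator characters
    normalized = normalized.strip("-._")
    if normalized == "":
        return "defaultVersion"
    if len(normalized) > _MAX_VERSION_TOKEN_LENGTH:
        # Truncate, then append a short deterministic digest so the
        # identifier stays unique and bounded
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:10]
        head = normalized[: _MAX_VERSION_TOKEN_LENGTH - 11]
        normalized = f"{head}-{digest}"
    return normalized
```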
Enhancement: Improve Ontology Discovery and Metadata Validation in DBpedia Archivo
Overview
This pull request introduces improvements to the DBpedia Archivo ontology archiving system.
Archivo is a web-scale ontology interface that discovers, crawls, versions, and archives ontologies from across the web and publishes them on the DBpedia Databus.
Changes Introduced
Key Areas Impacted
Testing
Related Components