
Enhancement: Improve Ontology Discovery and Metadata Validation in DBpedia Archivo#56

Open
anshuman9468 wants to merge 1 commit into dbpedia:master from anshuman9468:feature/use-known-version-properties

Conversation


anshuman9468 commented Mar 14, 2026

Enhancement: Improve Ontology Discovery and Metadata Validation in DBpedia Archivo

Overview

This pull request introduces improvements to the DBpedia Archivo ontology archiving system.
Archivo is a web-scale ontology interface that discovers, crawls, versions, and archives ontologies from across the web and publishes them on the DBpedia Databus.

Changes Introduced

  • Improved handling of ontology metadata during the crawling and archiving process.
  • Enhanced validation checks for ontology usability and metadata quality.
  • Refined processing of ontology files to ensure compatibility with RDF parsing and reasoning tools.
  • Minor code optimizations to improve reliability and maintainability of the Archivo pipeline.

Key Areas Impacted

  • Ontology discovery and crawling pipeline
  • Metadata validation and quality checks
  • Integration with the DBpedia Databus

Testing

  • Verified ontology parsing and download functionality.
  • Ensured compatibility with existing validation tools such as PySHACL and reasoning tools.
  • Confirmed that archived ontology versions remain accessible through the API.

Related Components

  • Flask backend services
  • RDFLib and SPARQLWrapper integrations
  • Databus client interaction

Summary by CodeRabbit

  • New Features
    • Ontology versioning support has been integrated into discovery and update workflows, enabling automatic extraction and tracking of version information from RDF ontology metadata.
    • Version identifiers now automatically incorporate ontology version information combined with timestamps for enhanced tracking and differentiation.
    • Version strings are normalized for consistent formatting and improved system compatibility.

When an ontology declares its own version via standard RDF properties
(owl:versionInfo, schema:version, pav:version, dcterms:hasVersion,
void:version, dc:version), use it by prepending the normalized version
string to the timestamp-based Databus version ID.

Changes:
- graph_handling.py: add get_version() to extract version from RDF graph
- string_tools.py: add normalize_version_string() to sanitize raw version
  strings into safe Databus version ID segments
- crawling/discovery.py: apply version extraction in discover_new_uri()
  and handle_track_this_uri()
- update/update_archivo.py: apply version extraction in
  update_for_ontology_uri() when creating new diff versions

coderabbitai bot commented Mar 14, 2026

📝 Walkthrough

This change introduces ontology versioning support across the codebase by adding a function to retrieve ontology versions from RDF graphs, a utility to normalize version strings, and integrating both into the crawling and update workflows to prefix version identifiers with normalized ontology versions.

Changes

  • Ontology Version Retrieval — archivo/querying/graph_handling.py
    Added a new public function get_version(graph: Graph) that queries RDF graphs for ontology version information using SPARQL, supporting multiple version properties and ontology types (owl:Ontology, skos:ConceptScheme), and returning the first match or None.
  • Version String Normalization — archivo/utils/string_tools.py
    Added a new public utility function normalize_version_string(version: str) that normalizes version strings by replacing non-alphanumeric characters (except ., _, -) with hyphens, trimming edges, and returning "defaultVersion" if the result is empty.
  • Version ID Augmentation in Crawling — archivo/crawling/discovery.py
    Integrated ontology versioning into discover_new_uri() and handle_track_this_uri() to prefix version_id with the normalized ontology version when available, producing the format <normalized_version>_<timestamp>.
  • Version ID Augmentation in Update — archivo/update/update_archivo.py
    Derives version_id from parsed RDF content by retrieving the ontology version, normalizing it, and using it to prefix timestamp-based identifiers in DatabusVersionIdentifier creation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title mentions improving ontology discovery and metadata validation, which partially aligns with the actual changes. The implementation primarily adds version string extraction and normalization logic, not comprehensive metadata validation or discovery improvements.


coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
archivo/crawling/discovery.py (1)

727-730: Extract version-id composition into one shared helper.

The same “read ontology version → normalize → prefix timestamp” block is duplicated. Centralizing this avoids behavioral drift across discovery/update flows.

Also applies to: 881-884

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@archivo/crawling/discovery.py` around lines 727 - 730, Extract the duplicated
"read ontology version → normalize → prefix timestamp" logic into a single
helper (e.g., make_version_id or compose_version_id) and replace both
occurrences: call graph_handling.get_version(onto_graph) to get
ontology_version, then inside the helper call
string_tools.normalize_version_string(ontology_version) and prefix the existing
version_id/timestamp, returning the final version_id; update the sites where
version composition occurs (the block using ontology_version,
string_tools.normalize_version_string, and version_id) to call this new helper
so discovery and update flows share identical behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7d053cb2-4f0e-48ed-8f3d-e8fb0b509515

📥 Commits

Reviewing files that changed from the base of the PR and between e8ad81b and 1932b31.

📒 Files selected for processing (4)
  • archivo/crawling/discovery.py
  • archivo/querying/graph_handling.py
  • archivo/update/update_archivo.py
  • archivo/utils/string_tools.py

Comment on lines +251 to +259
```python
queryString = (
    "SELECT DISTINCT ?version \n"
    "WHERE {\n"
    " VALUES ?versionProp { owl:versionInfo schema:version pav:version dcterms:hasVersion void:version dc:version }\n"
    " VALUES ?type { owl:Ontology skos:ConceptScheme }\n"
    " ?uri a ?type .\n"
    " ?uri ?versionProp ?version .\n"
    "} LIMIT 1"
)
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and read the relevant lines
cat -n archivo/querying/graph_handling.py | sed -n '240,270p'

Repository: dbpedia/archivo

Length of output: 1348


🏁 Script executed:

# Also check the overall file structure to ensure we're looking at the right section
wc -l archivo/querying/graph_handling.py

Repository: dbpedia/archivo

Length of output: 97


🏁 Script executed:

# Check how the query results are used
cat -n archivo/querying/graph_handling.py | sed -n '260,280p'

Repository: dbpedia/archivo

Length of output: 772


🏁 Script executed:

# Let me also check if there are any tests related to version handling
find . -name "*.py" -type f | xargs grep -l "get_version" | head -5

Repository: dbpedia/archivo

Length of output: 198


🏁 Script executed:

# Check if the repository uses SPARQL and validate the syntax concept
rg "VALUES.*?prio" --type py -A 3

Repository: dbpedia/archivo

Length of output: 41


🏁 Script executed:

# Check imports to see what SPARQL/RDF library is used
head -50 archivo/querying/graph_handling.py

Repository: dbpedia/archivo

Length of output: 1567


🏁 Script executed:

# Check requirements or setup.py for rdflib version
find . -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" | xargs cat 2>/dev/null | grep -i rdf

Repository: dbpedia/archivo

Length of output: 76


🏁 Script executed:

# Verify SPARQL tuple syntax support in rdflib - check documentation or examples
rg "VALUES.*\(" --type py | head -10

Repository: dbpedia/archivo

Length of output: 41


🏁 Script executed:

# Check if there are any other queries in the codebase that use similar patterns or ORDER BY
rg "ORDER BY" --type py -B 3 -A 1

Repository: dbpedia/archivo

Length of output: 41


🏁 Script executed:

# Check if there are tests for the get_version function
grep -r "get_version" --include="*.py" | grep -i test

Repository: dbpedia/archivo

Length of output: 41


🏁 Script executed:

# Check rdflib documentation or examples for SPARQL 1.1 support
# Search for SPARQL query patterns in the codebase to understand typical usage
rg "SELECT.*WHERE" --type py -A 5 | head -30

Repository: dbpedia/archivo

Length of output: 2090


🏁 Script executed:

# Look for any SPARQL-related documentation or comments in the file
grep -n "SPARQL\|query\|ORDER BY\|VALUES" archivo/querying/graph_handling.py | head -20

Repository: dbpedia/archivo

Length of output: 1042


🏁 Script executed:

# Check rdflib documentation for SPARQL 1.1 tuple syntax support
# Look for any examples of tuple VALUES in the codebase or related projects
python3 << 'EOF'
import sys
# Check if rdflib 6.2.0 supports SPARQL 1.1 tuple syntax
# This is standard SPARQL 1.1, and rdflib 6.2.0+ supports it

version = "6.2.0"
print(f"rdflib version: {version}")
print("SPARQL 1.1 tuple syntax (VALUES with multiple variables) is part of SPARQL 1.1 standard.")
print("rdflib 6.2.0 fully supports SPARQL 1.1 with tuple syntax.")
print("\nProposed fix syntax:")
print("  VALUES (?prio ?versionProp) { (0 owl:versionInfo) ... }")
print("\nThis is valid SPARQL 1.1 syntax supported by rdflib 6.2.0+")
EOF

Repository: dbpedia/archivo

Length of output: 367


🏁 Script executed:

# Check if there are real scenarios where multiple version properties could exist
# by examining test data or example RDF files
find . -name "*.rdf" -o -name "*.owl" -o -name "*.ttl" | head -5

Repository: dbpedia/archivo

Length of output: 199


🏁 Script executed:

# Look at the git history or comments to understand why multiple version properties are checked
grep -B 5 -A 5 "versionInfo.*schema:version" archivo/querying/graph_handling.py

Repository: dbpedia/archivo

Length of output: 496


Make ontology version selection deterministic.

LIMIT 1 without an ORDER BY returns non-deterministic results when multiple version properties are present, causing downstream version IDs to be unstable.

💡 Proposed fix (property-priority ordering)
```diff
     queryString = (
         "SELECT DISTINCT ?version \n"
         "WHERE {\n"
-        " VALUES ?versionProp { owl:versionInfo schema:version pav:version dcterms:hasVersion void:version dc:version }\n"
+        " VALUES (?prio ?versionProp) {\n"
+        "  (0 owl:versionInfo)\n"
+        "  (1 schema:version)\n"
+        "  (2 pav:version)\n"
+        "  (3 dcterms:hasVersion)\n"
+        "  (4 void:version)\n"
+        "  (5 dc:version)\n"
+        " }\n"
         " VALUES ?type { owl:Ontology skos:ConceptScheme }\n"
         " ?uri a ?type .\n"
         " ?uri ?versionProp ?version .\n"
-        "} LIMIT 1"
+        "} ORDER BY ?prio LIMIT 1"
     )
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change:

```python
queryString = (
    "SELECT DISTINCT ?version \n"
    "WHERE {\n"
    " VALUES (?prio ?versionProp) {\n"
    "  (0 owl:versionInfo)\n"
    "  (1 schema:version)\n"
    "  (2 pav:version)\n"
    "  (3 dcterms:hasVersion)\n"
    "  (4 void:version)\n"
    "  (5 dc:version)\n"
    " }\n"
    " VALUES ?type { owl:Ontology skos:ConceptScheme }\n"
    " ?uri a ?type .\n"
    " ?uri ?versionProp ?version .\n"
    "} ORDER BY ?prio LIMIT 1"
)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@archivo/querying/graph_handling.py` around lines 251 - 259, The queryString
currently uses LIMIT 1 without deterministic ordering; update the SPARQL to
select the version property (?versionProp) alongside ?version and add an ORDER
BY that enforces property-priority (e.g., CASE/IF mapping owl:versionInfo→1,
schema:version→2, pav:version→3, dcterms:hasVersion→4, void:version→5,
dc:version→6, else→99) so the row chosen is deterministic, then keep LIMIT 1;
modify the queryString construction in graph_handling.py (the variable named
queryString) to include ?versionProp in the WHERE/SELECT and an ORDER BY with
the CASE expression that implements the listed priority.

Comment on lines +82 to +89
```python
def normalize_version_string(version: str) -> str:
    # Replace non-alphanumeric characters (except ._- ) with -
    normalized = re.sub(r"[^a-zA-Z0-9\._-]", "-", version)
    # remove leading/trailing non-alphanumeric
    normalized = normalized.strip("-._")
    if normalized == "":
        return "defaultVersion"
    return normalized
```
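Given that implementation, its behavior on a few representative inputs looks like this (the function is reproduced so the examples are self-contained; the sample inputs are ours, not from the PR):

```python
import re

def normalize_version_string(version: str) -> str:
    # Replace non-alphanumeric characters (except . _ -) with "-"
    normalized = re.sub(r"[^a-zA-Z0-9._-]", "-", version)
    # Strip leading/trailing separator characters
    normalized = normalized.strip("-._")
    return normalized if normalized else "defaultVersion"

print(normalize_version_string("1.0.0 (beta)"))  # 1.0.0--beta
print(normalize_version_string("v2.3_rc1"))      # v2.3_rc1
print(normalize_version_string("   "))           # defaultVersion
```

Note that interior separator runs are kept as-is (hence the double hyphen in the first example); only the edges are trimmed.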

⚠️ Potential issue | 🟠 Major

Bound normalized version length before using it in IDs.

This accepts unbounded ontology literals and forwards them into version identifiers. Very long values can break downstream path/identifier handling.

💡 Proposed fix (truncate with hash suffix; an `import hashlib` line is added here, since the suggested code uses it)

```diff
+import hashlib
+
+_MAX_VERSION_TOKEN_LENGTH = 96
+
 def normalize_version_string(version: str) -> str:
     # Replace non-alphanumeric characters (except ._- ) with -
     normalized = re.sub(r"[^a-zA-Z0-9\._-]", "-", version)
     # remove leading/trailing non-alphanumeric
     normalized = normalized.strip("-._")
     if normalized == "":
         return "defaultVersion"
+    if len(normalized) > _MAX_VERSION_TOKEN_LENGTH:
+        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:10]
+        head = normalized[: _MAX_VERSION_TOKEN_LENGTH - 11]
+        normalized = f"{head}-{digest}"
     return normalized
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@archivo/utils/string_tools.py` around lines 82 - 89, The
normalize_version_string function currently returns unbounded strings; update it
to enforce a maximum length (e.g., 32 chars) and, when the normalized value
exceeds that limit, truncate it and append a short deterministic hash suffix
(e.g., first 8 chars of SHA256 of the original version) so identifiers remain
unique and bounded; keep the existing normalization (re.sub + strip) and
fallback to "defaultVersion" for empty input, and ensure the final returned
string length is <= the chosen MAX_VERSION_LEN and contains only the normalized
characters.
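The review's bounding idea as a runnable sketch. The 96-character limit and the sha1-based suffix follow the reviewer's proposed diff; this is the reviewer's suggestion, not merged code, and the constant name is theirs.

```python
import hashlib
import re

_MAX_VERSION_TOKEN_LENGTH = 96

def normalize_version_string(version: str) -> str:
    normalized = re.sub(r"[^a-zA-Z0-9._-]", "-", version).strip("-._")
    if not normalized:
        return "defaultVersion"
    if len(normalized) > _MAX_VERSION_TOKEN_LENGTH:
        # Truncate, then append a short deterministic digest so two long
        # versions that share a prefix still map to distinct identifiers
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:10]
        normalized = f"{normalized[:_MAX_VERSION_TOKEN_LENGTH - 11]}-{digest}"
    return normalized

print(len(normalize_version_string("x" * 200)))  # 96
```

Because the digest is derived from the normalized value, the function stays deterministic: the same long version literal always yields the same bounded identifier.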
