
Enhancement: Improve Ontology Discovery and Metadata Validation in DBpedia Archivo#56

Open
anshuman9468 wants to merge 1 commit into dbpedia:master from anshuman9468:feature/use-known-version-properties

Conversation


anshuman9468 commented Mar 14, 2026

Enhancement: Improve Ontology Discovery and Metadata Validation in DBpedia Archivo

Overview

This pull request introduces improvements to the DBpedia Archivo ontology archiving system.
Archivo is a web-scale ontology interface that discovers, crawls, versions, and archives ontologies from across the web and publishes them on the DBpedia Databus.

Changes Introduced

  • Improved handling of ontology metadata during the crawling and archiving process.
  • Enhanced validation checks for ontology usability and metadata quality.
  • Refined processing of ontology files to ensure compatibility with RDF parsing and reasoning tools.
  • Minor code optimizations to improve reliability and maintainability of the Archivo pipeline.

Key Areas Impacted

  • Ontology discovery and crawling pipeline
  • Metadata validation and quality checks
  • Integration with the DBpedia Databus

Testing

  • Verified ontology parsing and download functionality.
  • Ensured compatibility with existing validation tools such as PySHACL and reasoning tools.
  • Confirmed that archived ontology versions remain accessible through the API.

Related Components

  • Flask backend services
  • RDFLib and SPARQLWrapper integrations
  • Databus client interaction

Summary by CodeRabbit

  • New Features
    • Ontology versioning support has been integrated into discovery and update workflows, enabling automatic extraction and tracking of version information from RDF ontology metadata.
    • Version identifiers now automatically incorporate ontology version information combined with timestamps for enhanced tracking and differentiation.
    • Version strings are normalized for consistent formatting and improved system compatibility.

When an ontology declares its own version via standard RDF properties
(owl:versionInfo, schema:version, pav:version, dcterms:hasVersion,
void:version, dc:version), use it by prepending the normalized version
string to the timestamp-based Databus version ID.

Changes:
- graph_handling.py: add get_version() to extract version from RDF graph
- string_tools.py: add normalize_version_string() to sanitize raw version
  strings into safe Databus version ID segments
- crawling/discovery.py: apply version extraction in discover_new_uri()
  and handle_track_this_uri()
- update/update_archivo.py: apply version extraction in
  update_for_ontology_uri() when creating new diff versions

coderabbitai bot commented Mar 14, 2026

📝 Walkthrough

This change introduces ontology versioning support across the codebase by adding a function to retrieve ontology versions from RDF graphs, a utility to normalize version strings, and integrating both into the crawling and update workflows to prefix version identifiers with normalized ontology versions.

Changes

  • Ontology Version Retrieval — archivo/querying/graph_handling.py
    Added a new public function get_version(graph: Graph) that queries RDF graphs for ontology version information using SPARQL, supporting multiple version properties and ontology types (owl:Ontology, skos:ConceptScheme), and returning the first match or None.
  • Version String Normalization — archivo/utils/string_tools.py
    Added a new public utility function normalize_version_string(version: str) that normalizes version strings by replacing non-alphanumeric characters (except ., _, -) with hyphens, trimming edges, and returning "defaultVersion" if the result is empty.
  • Version ID Augmentation in Crawling — archivo/crawling/discovery.py
    Integrated ontology versioning into discover_new_uri() and handle_track_this_uri() to prefix version_id with the normalized ontology version when available, producing the format <normalized_version>_<timestamp>.
  • Version ID Augmentation in Update — archivo/update/update_archivo.py
    Derives version_id from parsed RDF content by retrieving the ontology version, normalizing it, and using it to prefix timestamp-based identifiers in DatabusVersionIdentifier creation.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 33.33%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title mentions improving ontology discovery and metadata validation, which partially aligns with the actual changes. The implementation primarily adds version string extraction and normalization logic, not comprehensive metadata validation or discovery improvements.


coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
archivo/crawling/discovery.py (1)

727-730: Extract version-id composition into one shared helper.

The same “read ontology version → normalize → prefix timestamp” block is duplicated. Centralizing this avoids behavioral drift across discovery/update flows.

Also applies to: 881-884

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@archivo/crawling/discovery.py` around lines 727 - 730, Extract the duplicated
"read ontology version → normalize → prefix timestamp" logic into a single
helper (e.g., make_version_id or compose_version_id) and replace both
occurrences: call graph_handling.get_version(onto_graph) to get
ontology_version, then inside the helper call
string_tools.normalize_version_string(ontology_version) and prefix the existing
version_id/timestamp, returning the final version_id; update the sites where
version composition occurs (the block using ontology_version,
string_tools.normalize_version_string, and version_id) to call this new helper
so discovery and update flows share identical behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 7d053cb2-4f0e-48ed-8f3d-e8fb0b509515

📥 Commits

Reviewing files that changed from the base of the PR and between e8ad81b and 1932b31.

📒 Files selected for processing (4)
  • archivo/crawling/discovery.py
  • archivo/querying/graph_handling.py
  • archivo/update/update_archivo.py
  • archivo/utils/string_tools.py

Comment on lines +251 to +259
```python
queryString = (
    "SELECT DISTINCT ?version \n"
    "WHERE {\n"
    " VALUES ?versionProp { owl:versionInfo schema:version pav:version dcterms:hasVersion void:version dc:version }\n"
    " VALUES ?type { owl:Ontology skos:ConceptScheme }\n"
    " ?uri a ?type .\n"
    " ?uri ?versionProp ?version .\n"
    "} LIMIT 1"
)
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and read the relevant lines
cat -n archivo/querying/graph_handling.py | sed -n '240,270p'

Repository: dbpedia/archivo

Length of output: 1348


🏁 Script executed:

# Also check the overall file structure to ensure we're looking at the right section
wc -l archivo/querying/graph_handling.py

Repository: dbpedia/archivo

Length of output: 97


🏁 Script executed:

# Check how the query results are used
cat -n archivo/querying/graph_handling.py | sed -n '260,280p'

Repository: dbpedia/archivo

Length of output: 772


🏁 Script executed:

# Let me also check if there are any tests related to version handling
find . -name "*.py" -type f | xargs grep -l "get_version" | head -5

Repository: dbpedia/archivo

Length of output: 198


🏁 Script executed:

# Check if the repository uses SPARQL and validate the syntax concept
rg "VALUES.*?prio" --type py -A 3

Repository: dbpedia/archivo

Length of output: 41


🏁 Script executed:

# Check imports to see what SPARQL/RDF library is used
head -50 archivo/querying/graph_handling.py

Repository: dbpedia/archivo

Length of output: 1567


🏁 Script executed:

# Check requirements or setup.py for rdflib version
find . -name "requirements*.txt" -o -name "setup.py" -o -name "pyproject.toml" | xargs cat 2>/dev/null | grep -i rdf

Repository: dbpedia/archivo

Length of output: 76


🏁 Script executed:

# Verify SPARQL tuple syntax support in rdflib - check documentation or examples
rg "VALUES.*\(" --type py | head -10

Repository: dbpedia/archivo

Length of output: 41


🏁 Script executed:

# Check if there are any other queries in the codebase that use similar patterns or ORDER BY
rg "ORDER BY" --type py -B 3 -A 1

Repository: dbpedia/archivo

Length of output: 41


🏁 Script executed:

# Check if there are tests for the get_version function
grep -r "get_version" --include="*.py" | grep -i test

Repository: dbpedia/archivo

Length of output: 41


🏁 Script executed:

# Check rdflib documentation or examples for SPARQL 1.1 support
# Search for SPARQL query patterns in the codebase to understand typical usage
rg "SELECT.*WHERE" --type py -A 5 | head -30

Repository: dbpedia/archivo

Length of output: 2090


🏁 Script executed:

# Look for any SPARQL-related documentation or comments in the file
grep -n "SPARQL\|query\|ORDER BY\|VALUES" archivo/querying/graph_handling.py | head -20

Repository: dbpedia/archivo

Length of output: 1042


🏁 Script executed:

# Check rdflib documentation for SPARQL 1.1 tuple syntax support
# Look for any examples of tuple VALUES in the codebase or related projects
python3 << 'EOF'
import sys
# Check if rdflib 6.2.0 supports SPARQL 1.1 tuple syntax
# This is standard SPARQL 1.1, and rdflib 6.2.0+ supports it

version = "6.2.0"
print(f"rdflib version: {version}")
print("SPARQL 1.1 tuple syntax (VALUES with multiple variables) is part of SPARQL 1.1 standard.")
print("rdflib 6.2.0 fully supports SPARQL 1.1 with tuple syntax.")
print("\nProposed fix syntax:")
print("  VALUES (?prio ?versionProp) { (0 owl:versionInfo) ... }")
print("\nThis is valid SPARQL 1.1 syntax supported by rdflib 6.2.0+")
EOF

Repository: dbpedia/archivo

Length of output: 367


🏁 Script executed:

# Check if there are real scenarios where multiple version properties could exist
# by examining test data or example RDF files
find . -name "*.rdf" -o -name "*.owl" -o -name "*.ttl" | head -5

Repository: dbpedia/archivo

Length of output: 199


🏁 Script executed:

# Look at the git history or comments to understand why multiple version properties are checked
grep -B 5 -A 5 "versionInfo.*schema:version" archivo/querying/graph_handling.py

Repository: dbpedia/archivo

Length of output: 496


Make ontology version selection deterministic.

LIMIT 1 without an ORDER BY returns non-deterministic results when multiple version properties are present, causing downstream version IDs to be unstable.

💡 Proposed fix (property-priority ordering)
```diff
     queryString = (
         "SELECT DISTINCT ?version \n"
         "WHERE {\n"
-        " VALUES ?versionProp { owl:versionInfo schema:version pav:version dcterms:hasVersion void:version dc:version }\n"
+        " VALUES (?prio ?versionProp) {\n"
+        "  (0 owl:versionInfo)\n"
+        "  (1 schema:version)\n"
+        "  (2 pav:version)\n"
+        "  (3 dcterms:hasVersion)\n"
+        "  (4 void:version)\n"
+        "  (5 dc:version)\n"
+        " }\n"
         " VALUES ?type { owl:Ontology skos:ConceptScheme }\n"
         " ?uri a ?type .\n"
         " ?uri ?versionProp ?version .\n"
-        "} LIMIT 1"
+        "} ORDER BY ?prio LIMIT 1"
     )
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change:

```python
queryString = (
    "SELECT DISTINCT ?version \n"
    "WHERE {\n"
    " VALUES (?prio ?versionProp) {\n"
    "  (0 owl:versionInfo)\n"
    "  (1 schema:version)\n"
    "  (2 pav:version)\n"
    "  (3 dcterms:hasVersion)\n"
    "  (4 void:version)\n"
    "  (5 dc:version)\n"
    " }\n"
    " VALUES ?type { owl:Ontology skos:ConceptScheme }\n"
    " ?uri a ?type .\n"
    " ?uri ?versionProp ?version .\n"
    "} ORDER BY ?prio LIMIT 1"
)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@archivo/querying/graph_handling.py` around lines 251 - 259, The queryString
currently uses LIMIT 1 without deterministic ordering; update the SPARQL to
select the version property (?versionProp) alongside ?version and add an ORDER
BY that enforces property-priority (e.g., CASE/IF mapping owl:versionInfo→1,
schema:version→2, pav:version→3, dcterms:hasVersion→4, void:version→5,
dc:version→6, else→99) so the row chosen is deterministic, then keep LIMIT 1;
modify the queryString construction in graph_handling.py (the variable named
queryString) to include ?versionProp in the WHERE/SELECT and an ORDER BY with
the CASE expression that implements the listed priority.

Comment on lines +82 to +89
```python
def normalize_version_string(version: str) -> str:
    # Replace non-alphanumeric characters (except ._- ) with -
    normalized = re.sub(r"[^a-zA-Z0-9\._-]", "-", version)
    # remove leading/trailing non-alphanumeric
    normalized = normalized.strip("-._")
    if normalized == "":
        return "defaultVersion"
    return normalized
```
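Given that implementation, its behavior on a few representative inputs looks like this (the function is reproduced so the examples are self-contained; the sample inputs are ours, not from the PR):

```python
import re

def normalize_version_string(version: str) -> str:
    # Replace non-alphanumeric characters (except . _ -) with "-"
    normalized = re.sub(r"[^a-zA-Z0-9._-]", "-", version)
    # Strip leading/trailing separator characters
    normalized = normalized.strip("-._")
    return normalized if normalized else "defaultVersion"

print(normalize_version_string("1.0.0 (beta)"))  # 1.0.0--beta
print(normalize_version_string("v2.3_rc1"))      # v2.3_rc1
print(normalize_version_string("   "))           # defaultVersion
```

Note that interior separator runs are kept as-is (hence the double hyphen in the first example); only the edges are trimmed.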

⚠️ Potential issue | 🟠 Major

Bound normalized version length before using it in IDs.

This accepts unbounded ontology literals and forwards them into version identifiers. Very long values can break downstream path/identifier handling.

💡 Proposed fix (truncate with hash suffix; an `import hashlib` line is added here, since the suggested code uses it)

```diff
+import hashlib
+
+_MAX_VERSION_TOKEN_LENGTH = 96
+
 def normalize_version_string(version: str) -> str:
     # Replace non-alphanumeric characters (except ._- ) with -
     normalized = re.sub(r"[^a-zA-Z0-9\._-]", "-", version)
     # remove leading/trailing non-alphanumeric
     normalized = normalized.strip("-._")
     if normalized == "":
         return "defaultVersion"
+    if len(normalized) > _MAX_VERSION_TOKEN_LENGTH:
+        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:10]
+        head = normalized[: _MAX_VERSION_TOKEN_LENGTH - 11]
+        normalized = f"{head}-{digest}"
     return normalized
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@archivo/utils/string_tools.py` around lines 82 - 89, The
normalize_version_string function currently returns unbounded strings; update it
to enforce a maximum length (e.g., 32 chars) and, when the normalized value
exceeds that limit, truncate it and append a short deterministic hash suffix
(e.g., first 8 chars of SHA256 of the original version) so identifiers remain
unique and bounded; keep the existing normalization (re.sub + strip) and
fallback to "defaultVersion" for empty input, and ensure the final returned
string length is <= the chosen MAX_VERSION_LEN and contains only the normalized
characters.
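The review's bounding idea as a runnable sketch. The 96-character limit and the sha1-based suffix follow the reviewer's proposed diff; this is the reviewer's suggestion, not merged code, and the constant name is theirs.

```python
import hashlib
import re

_MAX_VERSION_TOKEN_LENGTH = 96

def normalize_version_string(version: str) -> str:
    normalized = re.sub(r"[^a-zA-Z0-9._-]", "-", version).strip("-._")
    if not normalized:
        return "defaultVersion"
    if len(normalized) > _MAX_VERSION_TOKEN_LENGTH:
        # Truncate, then append a short deterministic digest so two long
        # versions that share a prefix still map to distinct identifiers
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()[:10]
        normalized = f"{normalized[:_MAX_VERSION_TOKEN_LENGTH - 11]}-{digest}"
    return normalized

print(len(normalize_version_string("x" * 200)))  # 96
```

Because the digest is derived from the normalized value, the function stays deterministic: the same long version literal always yields the same bounded identifier.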
