DatAasee Architecture Documentation
Version: 0.9
The principal goal of DatAasee is to provision a library-focused one-stop
shop for research data discovery as well as a library-wide metadata hub.
DatAasee is a Metadata-Lake (MDL) that aggregates and interconnects research
metadata and bibliographic data from various data sources and interacts via a
JSON HTTP API, which in turn is prototypically utilized by a web frontend.
Sections:
Introduction & Goals
Constraints
Context & Scope
Solution Strategy
Building Block View
Runtime View
Deployment View
Crosscutting Concepts
Architectural Decisions
Quality Requirements
Risks & Technical Debt
Glossary
Summary:
Data Architecture: Data-Lake with Metadata Catalog
Software Architecture: 3-Tier Architecture
Data-Tier Model: Graph with star schema node properties
Logic-Tier Type: Semantic layer
Presentation-Tier Type: HTTP API (and Web-Frontend)
NOTE: For the specific data model, see: YASQL schema
For background information on data and software architecture, see: https://arxiv.org/abs/2409.05512 and references therein.
1.1 Requirements Overview
Given: research and bibliographic (meta)data maintained in various
distributed databases and no central access point to browse, search, or locate
data-sets. The metadata-lake ...
... incorporates metadata of research outputs as well as bibliographic metadata.
... cleans, normalizes, and provides metadata.
... allows users to search, filter and browse metadata (and locate underlying data).
... facilitates exports of metadata.
... integrates with other services and processes.
The database is the core component.
The backend encapsulates the database and spans the API.
An optional web frontend uses the API.
All external and internal communication happens via HTTP.
Imports of sources into the database are triggered via the backend.
Exports to services are requested externally.
Users or downstream services can interact through the API.
| Quality Goal | Associated Scenarios |
| --- | --- |
| Functional Suitability | F0 |
| Transferability | T0 |
| Compatibility | C0 |
| Operability | O0 |
| Maintainability | M0, M1 |
2.1 Technical Constraints
| Constraint | Explanation |
| --- | --- |
| Cloud Deployability | To integrate into existing infrastructure and operation environments, a containerized service is required. |
| Interoperability | Data pipelining is required to be compatible with existing database interfaces. |
| Extensibility | Components such as metadata schemas, data pipelines, and metadata exports are required to be extensible. |
2.2 Organizational Constraints
| Constraint | Explanation |
| --- | --- |
| OAI-PMH | Many existing data sources provide an OAI-PMH endpoint, which needs to be supported. |
| XML | All source metadata is expected to be in XML. |
| S3 | File-based ingest must also be possible via object storage, particularly Ceph's S3 API. |
| K8s | If possible, Kubernetes should be supported (in addition to Compose). |

| Standard | Function |
| --- | --- |
| JSON | Serialization language for all external messages |
| JSON:API | External message format standardization |
| JSON Schema | External message content validation |
| YAML | Internal processor (and prototype frontend) declaration language |
| StrictYAML | Preferred declaration language dialect |
| OpenAPI | External API definition and documentation format |
| SHA256 | Identifier Hashing and Checksums |
| Base64URL | Identifier Encoding |
| Naming Things with Hashes | Identifier Marking |
| Compose | Deployment and orchestration |
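The three identifier standards listed above (SHA256, Base64URL, Naming Things with Hashes) combine into one string. The following is a minimal sketch of that combination, not DatAasee's actual code: the helper names are hypothetical, and both the empty authority part and the choice of which bytes are hashed are assumptions.

```python
import base64
import hashlib
import re

# Shape of an authority-less NI URI with a SHA-256 digest (RFC 6920):
# 32 digest bytes become 43 base64url characters once padding is dropped.
NI_SHA256 = re.compile(r"ni:///sha-256;[A-Za-z0-9_-]{43}")


def make_record_id(raw_record: bytes) -> str:
    """Hypothetical helper: hash the bytes and wrap the digest as an NI URI."""
    digest = hashlib.sha256(raw_record).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return "ni:///sha-256;" + encoded


def is_record_id(value: str) -> bool:
    """Hypothetical helper: recognize an identifier by its shape alone."""
    return NI_SHA256.fullmatch(value) is not None


record_id = make_record_id(b"Hello World!")
print(record_id)
print(is_record_id(record_id), is_record_id("https://example.org/record/1"))
```

Because the scheme prefix and the fixed digest length are visible in the string itself, a consumer can tell a record identifier from any other value without further context.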

| Standard | Function |
| --- | --- |
| Tech Stack Canvas | Product tech stack (see README) |
| Diataxis | Software documentation structure (see docs) |
| arc42 | Software architecture documentation (this document) |
| yasql | Database schema documentation (can be rendered with PlantUML) |

| Channel | Description |
| --- | --- |
| Interact | All unprivileged functionality |
| Search | Directly query metadata records (typically privileged) |
| Control | Monitor, trigger ingests and backups (privileged) |
| Import | Ingest metadata records from source system |

| Channel | Description |
| --- | --- |
| Interact | Unprivileged HTTP API |
| Search | Requested and responded through HTTP API |
| Control | Privileged HTTP API |
| Import | Pulled via HTTP |
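As an illustration of the Control channel, a client request to a privileged endpoint could be assembled as sketched below. Host name, base path, endpoint path, and credentials are placeholders chosen for this example, not values taken from DatAasee; the request is only built here, not sent.

```python
import base64
import urllib.request


def build_private_request(base_url: str, endpoint: str,
                          user: str, password: str) -> urllib.request.Request:
    """Hypothetical helper: build a POST request carrying HTTP Basic credentials."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}",
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )


# Placeholder values; a real deployment supplies its own URL and secret.
request = build_private_request("http://dataasee.example/api/v1", "health",
                                "admin", "secret")
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request)
```

Basic credentials are only encoded, not encrypted, so a request like this should travel over a connection protected by TLS in front of the service.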
Three-tier architecture:
HTTP API is the primary presentation layer (part of the backend)
Web frontend (exclusively using API) is secondary presentation tier
Two main components:
Database (data tier)
Backend (stateless application tier)
All components are packaged in containers for:
Infrastructure compatibility
Cloud deployability
Property graph data model:
Metadata records are key-value documents (intra-metadata)
Metadata records are interrelated based on permanent identifiers (inter-metadata)
All messaging happens via HTTP APIs:
Internally between components (containers)
Externally via endpoints (including frontend)
Source codes and external messages are in plain text and in standardized formats:
External messages are in JSON, formatted as JSON:API, and documented by JSON-Schemas.
Declarative sources are in YAML, following StrictYAML.
Further components are optional:
Storage not necessary since only metadata is handled, payload data only referenced
Web-frontend uses HTTP API (prototype is included)
Declarative realization for high level of abstraction via:
Internal Queries: ArcadeDB SQL (external queries may use various query languages)
Processes: Configuration-based + Bloblang (data mapping language)
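To make the external message convention concrete, the sketch below wraps a metadata record in a JSON:API style document. The resource type, the attribute names, and the link are invented for illustration; the actual response layout is defined by the JSON Schemas of the API.

```python
import json


def to_jsonapi(record_id: str, attributes: dict, self_link: str) -> dict:
    """Hypothetical helper: wrap one record as a JSON:API top-level document."""
    return {
        "data": {
            "type": "metadata",        # assumed resource type name
            "id": record_id,
            "attributes": attributes,  # the key-value document of the record
        },
        "links": {"self": self_link},
    }


document = to_jsonapi(
    "ni:///sha-256;f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk",
    {"name": "Example Dataset", "publicationYear": 2024},  # invented fields
    "/api/v1/metadata?id=example",                         # invented link
)
print(json.dumps(document, indent=2))
```

Keeping the record's key-value content under `attributes` and its identity under `id` mirrors the split between intra-metadata and the permanent identifiers used to interrelate records.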
DatAasee uses a three-tier architecture with these separately containerized components, which are orchestrated by Compose:
| Function | Abstraction | Tier | Product |
| --- | --- | --- | --- |
| Metadata Catalog | Multi-Model Database | Data (Database) | ArcadeDB |
| EtLT Processor | Declarative Streaming Processor | Logic (Backend) | Benthos |
| Web Frontend | Declarative Web Framework | Presentation (Frontend) | Lowdefy |
Imports metadata from source systems via pull
Provides API to interact with metadata via endpoints
Frontend translates user input to API calls
Source Databases (External)
Known URLs (i.e., service or database endpoints) holding metadata
Bulk ingested
Pollable regularly for updates
Backup Storage (External)
Latest backup is loaded from here on service startup
Database backup on finished ingest
Database backup on finished interconnect
Prototype Web-Frontend (Optional)
Included prototype frontend
External to core system
Template and documentation for a production frontend
Container holding an ArcadeDB database system
This core component stores and serves all metadata
A system backup saves its database
Container holding a Benthos stream processor
This component exposes the external API endpoints and translates between data formats as well as between API and database
Has no state (except temporary cache, which caches queries and refreshes, as well as ingest status)
Frontend Container (Optional)
Container holding a Lowdefy web-frontend
This optional component renders a web-based user interface
Uses API endpoints (but from the internal network, thus the frontend does not use the external port)
Database Container Internals
The native schema is created via SQL (during build)
Enumerated types are inserted via SQL (during build)
The initialization script restores the database on start from the latest backup.
Backend Container Internals
API schemas are deposited
Custom configurable components (templates) are defined
Reusable fixed components (resources) are defined
Frontend Container Internals
Pages are defined declaratively
Reused template blocks are loaded
Static assets (images and styles) are loaded
NOTE: The api endpoint is implicitly cached, meaning all schema files are opened only once.
See api endpoint documentation and source file.
NOTE: The ready endpoint reports ready if processor and database are ready.
See ready endpoint reference and source file.
/health Endpoint (Private)
NOTE: Since the returned information is only useful to an operator, not to a user, this is a private and thus POST endpoint.
See health endpoint reference and source file.
/ingest Endpoint (Private, External Read)
NOTE: The ingest process is asynchronous; the request returns success if an ingest was started.
See ingest endpoint reference and source file.
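The asynchronous behavior noted for the ingest endpoint can be pictured with the following sketch: the trigger returns immediately and only reports whether work was started. This is an illustration in plain Python threads, not the backend's stream-processor configuration; the class name is hypothetical, and rejecting a second trigger while an ingest is running is an assumption.

```python
import threading


class IngestTrigger:
    """Hypothetical sketch: start at most one background ingest at a time."""

    def __init__(self, run_ingest):
        self._run_ingest = run_ingest  # callable doing the slow import work
        self._lock = threading.Lock()
        self._thread = None

    def trigger(self, source: dict) -> bool:
        """Return True if an ingest was started, False if one is still running."""
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return False  # assumption: concurrent ingests are refused
            self._thread = threading.Thread(
                target=self._run_ingest, args=(source,), daemon=True)
            self._thread.start()
            return True

    def wait(self, timeout=None):
        """Block until the current ingest (if any) has finished."""
        thread = self._thread
        if thread is not None:
            thread.join(timeout)


release = threading.Event()
finished = []


def slow_ingest(source):
    release.wait()                  # stands in for a long-running import
    finished.append(source["url"])


trigger = IngestTrigger(slow_ingest)
first = trigger.trigger({"url": "https://example.org/oai"})     # started
second = trigger.trigger({"url": "https://example.org/other"})  # refused
release.set()
trigger.wait(5)
print(first, second, finished)
```

The caller learns only that the ingest began; completion has to be observed separately, for instance through status information exposed by the service.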
/schema Endpoint (Public, Cached)
See schema endpoint reference and source file.
/metadata Endpoint (Public)
See metadata endpoint reference and source file.
/database Endpoint (Public)
NOTE: This endpoint allows only idempotent read operations since it uses the query endpoint of ArcadeDB.
See database endpoint reference and source file.
See compose.yaml for deployment details.
EtLT (Extract-transform-Load-Transform): Ingest vs. Read
All components are separately containerized.
All communication between components is performed over HTTP using JSON.
HTTP and JSON:API conventions are used and parameters, requests, and responses provide JSON schemas.
Read access is granted to every user without limitation (expects external rate limits).
Write access (trigger ingest or check health) is only granted to the admin user.
Basic authentication is used by the backend for the "admin" user for private endpoints (expects external TLS termination).
Container images are multi-stage with a generic base stage and a custom development and release stage.
All images run their own health check.
The default API base path communicates the API version.
All components provide (internal) ready endpoints and write logs to the standard output.
Secrets are read (safely) from environment variables on the host and mounted as files inside the containers.
Logs are written according to the defaults of the employed container engine.
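Reading a secret that has been mounted as a file might look like the sketch below. The directory `/run/secrets` is the location Compose uses by default for mounted secrets; the secret name and the helper are hypothetical, and the demonstration writes to a temporary directory instead of a real mount.

```python
import tempfile
from pathlib import Path


def read_secret(name: str, secrets_dir: str = "/run/secrets") -> str:
    """Hypothetical helper: return the content of a mounted secret file."""
    return (Path(secrets_dir) / name).read_text(encoding="utf-8").strip()


# Demonstration with a temporary directory standing in for the mount point.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "admin_password").write_text("s3cret\n", encoding="utf-8")
    password = read_secret("admin_password", secrets_dir=tmp)

print(len(password))
```

Reading from a file keeps the value out of the process environment inside the container, where it could otherwise leak through diagnostics or child processes.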
9. Architectural Decisions
Timestamp
Template
Status
...
Decision
...
Consequences
...
2026-04-17
Remove backup endpoint
Status
Approved
Decision
Remove manual backups since data does not change between ingests
Consequences
Backup endpoint not needed any more.
2026-04-08
Remove view tracking
Status
Approved
Decision
Remove the tracking of record views.
Consequences
Simpler and faster database, no data loss between ingests.
2026-02-20
No Backup on Shutdown
Status
Approved
Decision
The database shutdown does not trigger a backup, as it takes too long for large databases.
Consequences
Faster shutdown and simpler database init script; backups after ingest preserve most of state.
2026-01-27
Use NI for record identifiers
Status
Approved
Decision
Record identifiers are prefixed with the NI URI scheme and use base64url-encoded SHA256 hash.
Consequences
Frontends can detect record identifiers without parsing the key field.
2026-01-22
Remove Gremlin query language
Status
Approved
Decision
Remove Gremlin module from ArcadeDB
Consequences
The Gremlin query language is not supported anymore in DatAasee and hence SPARQL will not be overlaid.
2025-12-17
Streamline HTTP API
Status
Approved
Decision
Remove the enums and sources endpoints and integrate their information into the schema endpoint.
Consequences
More uniform API handling and fewer endpoints for easier usability.
2025-09-19
Frontend Container Image
Status
Approved
Decision
Production and development frontend images should be air-gapped after build.
Consequences
More control over dependencies especially during dynamic rebuilds.
2025-04-11
Post-Processing
Status
Approved
Decision
Minimize database response post-processing.
Consequences
Shift transformation workload to ArcadeDB.
2024-10-23
Container Base Images
Status
Approved
Decision
Base containers for database and backend are the current Ubuntu LTS (i.e., 26.04).
Consequences
Full libc support compared to Alpine and obvious release date and support horizon from version number compared to Debian.
2024-07-04
Indirect Processor Dependency Updates
Status
Approved
Decision
Indirect processor dependency updates do not cause a (minor) version update.
Consequences
A release image build (of the current version) can be triggered and processor dependencies are updated in the process.
2024-06-03
API Licensing
Status
Approved
Decision
The OpenAPI license definition is additionally licensed under CC-BY.
Consequences
Easier third-party reimplementation of the DatAasee API.
2024-02-21
Use OAI vs Non-OAI metadata format variants
Status
Approved
Decision
Non-OAI variants of the DC and DataCite formats are supported.
Consequences
More lenient (less strict) handling of fields when configuring an ingest.
2024-01-17
Compose-only Deployment
Status
Approved
Decision
Deployment is solely distributed and initiated by the compose.yaml.
Consequences
The compose file and orchestrator have central importance.
2023-11-20
Database Storage
Status
Approved
Decision
Database uses in-container storage, only backups are stored outside.
Consequences
Faster database at the price of fixed savepoints.
2023-08-24
Record Identifier
Status
Approved (Superseded)
Decision
Use xxhash64 / SHA256 of ingested or inserted raw record.
Consequences
Identifier is reproducible but not a URL.
2023-08-08
Ingest Modularity
Status
Approved
Decision
Ingest sources are passed via API to the backend.
Consequences
Sources can be maintained outside and appended during runtime.
2023-05-16
Graph Edges
Status
Approved
Decision
Graph edges are only set by ingest (or other automatic) processes, not by a user.
Consequences
Edge semantics need to be machine-interpretable.
2022-12-07
Frontend Language
Status
Approved
Decision
Use English language only for frontend and metadata labels and comments.
Consequences
Additional translations (German) are not prepared for now.
2022-10-10
Only Virtual Storage
Status
Approved
Decision
No explicit storage component for data, only metadata is managed.
Consequences
No interface or instance to e.g. Ceph is developed, but URL references (to data storage) are stored.
2022-10-05
API-only Frontend
Status
Approved
Decision
The HTTP API is the sole frontend; further frontends are only expressions of the API.
Consequences
Web frontend can only use the API.
2022-10-04
Declarative First
Status
Approved
Decision
Prefer declarative (YAML-based) approaches for defining processes and interfaces to reduce free coding and increase robustness.
Consequences
The frontrunners are Benthos as backend and Lowdefy (or uteam) as prototype web frontend.
2022-09-16
Multi-model Database
Status
Approved
Decision
Use (property)-graph / document / key-value database as central catalog component for maximal flexible data model.
Consequences
The frontrunner is ArcadeDB (or OrientDB) as database.
10.1 Quality Requirements
| Quality Category | Quality | ID | Description |
| --- | --- | --- | --- |
| Functional Suitability | Appropriateness | F0 | DatAasee should fulfill the expected overall functionality. |
| Transferability | Installability | T0 | Installation should work in various container-based environments. |
| Compatibility | Interoperability | C0 | The available protocols (and format parsers) should fit the most common systems. |
| Operability | Ease of Use | O0 | The API should be self-describing, well documented, and follow standards and best practices. |
| Maintainability | Modularity | M0 | New protocols, format parsers or other pipelines should be implementable without too much effort. |
| Maintainability | Reusability | M1 | The protocol and format parser codes serve as sample and documentation. |

| ID | Scenario |
| --- | --- |
| F0 | Stakeholder project evaluation |
| T0 | Setup of DatAasee by a new operator |
| C0 | Ingesting from a new source system |
| O0 | User and (downstream) developer API usage |
| M0 | Extending the compatibility to new systems |
| M1 | Development of a follow-up project to DatAasee |
11. Risks & Technical Debt
| Risk | Description | Mitigation |
| --- | --- | --- |
| Insecure deployment | There is no built-in TLS termination or rate limiting, and the database endpoint is not meant for public consumption. | Comprehensive documentation with warnings and guidelines. |
| DBMS project might cease | ArcadeDB is a small project, which has small-project risks. | Since SQL is used internally to interact with ArcadeDB, an RDBMS could in principle serve as a replacement; however, ArcadeDB is a core architectural dependency. |
| Processor project might complicate | Benthos was acquired by Redpanda, which may change its license or the licenses of the connectors. | Using the hard fork bento. |
| Term | Acronym | Definition |
| --- | --- | --- |
| Administrative Metadata | | Metadata about accessibility. |
| Application Programming Interface | API | Specification and implementation of a way for software to interact (here HTTP API). |
| Backend | BE | Software component encoding the internal logic. |
| Container | CTR | Software packaged into standardized unit for operating-system-level virtualization. |
| Create-Read-Update-Delete | CRUD | Basic operations when interacting with a database (or storage). |
| Database | DB | Collection of related records. |
| Database Management System | DBMS | The software running the databases. |
| Data Catalog | DCAT | Inventory of databases. |
| Data-Lake | DL | Structured, semi-structured, and unstructured data architecture. |
| Declarative Low-Code | | Defining an application only by configuration of components (and minimal explicit transformations). |
| Declarative Programming | | Programming style of expressing logic without prescribing control flow ("what", not "how"). |
| Descriptive Metadata | | Metadata describing the underlying data. |
| Domain Specific Language | DSL | A formal language designed for a particular application. |
| Extract-Load-Transform | ELT | A typical ingestion process for unstructured data. |
| Extract-Transform-Load | ETL | A typical ingestion process for structured data. |
| Extract-transform-Load-Transform | EtLT | An ingestion process for semi-structured data. |
| Frontend | FE | (Web-based) software component presenting a user interface. |
| Inter-Metadata | | Metadata about data related to the underlying data. |
| Intra-Metadata | | Metadata about the underlying data. |
| Low-Code | | Functionality assembly using high-level prefabricated components. |
| Metadata | MD | All statements about a (tangible or digital) information object. |
| Metadata Catalog | MDCAT | Inventory of metadata databases. |
| Metadata-Lake | MDL | Structured, semi-structured, and unstructured data architecture for metadata management. |
| Metadata-Set | | A record containing metadata. |
| Named Identifier | NI | Protocol for record identifiers. |
| Process Metadata | | Metadata about lineage. |
| Social Metadata | | Metadata about usage and discoverability. |
| Technical Metadata | | Metadata about format and structure. |