DatAasee Architecture Documentation

Version: 0.9

The principal goal of DatAasee is to provision a library-focused one-stop shop for research data discovery as well as a library-wide metadata hub. DatAasee is a Metadata-Lake (MDL) that aggregates and interconnects research metadata and bibliographic data from various data sources and interacts via a JSON HTTP API, which in turn is prototypically utilized by a web frontend.

Sections:

Introduction & Goals
Constraints
Context & Scope
Solution Strategy
Building Block View
Runtime View
Deployment View
Crosscutting Concepts
Architectural Decisions
Quality Requirements
Risks & Technical Debt
Glossary

Summary:

Data Architecture: Data-Lake with Metadata Catalog
Software Architecture: 3-Tier Architecture
- Data-Tier Model: Graph with star schema node properties
- Logic-Tier Type: Semantic layer
- Presentation-Tier Type: HTTP API (and Web-Frontend)

NOTE: For the specific data model, see: YASQL schema

For background information on data and software architecture, see: https://arxiv.org/abs/2409.05512 and references therein.

1. Introduction & Goals

1.1 Requirements Overview

Given: research and bibliographic (meta)data maintained in various distributed databases and no central access point to browse, search, or locate data-sets. The metadata-lake ...

... incorporates metadata of research outputs as well as bibliographic metadata.
... cleans, normalizes, and provides metadata.
... allows users to search, filter and browse metadata (and locate underlying data).
... facilitates exports of metadata.
... integrates with other services and processes.

The database is the core component.
The backend encapsulates the database and spans the API.
An optional web frontend uses the API.
All external and internal communication happens via HTTP.
Imports of sources into the database are triggered via the backend.
Exports to services are requested externally.
Users or downstream services can interact through the API.

1.2 Quality Goals

Quality Goal	Associated Scenarios
Functional Suitability	F0
Transferability	T0
Compatibility	C0
Operability	O0
Maintainability	M0, M1

2. Constraints

2.1 Technical Constraints

Constraint	Explanation
Cloud Deployability	To integrate into existing infrastructure and operation environments, a containerized service is required.
Interoperability	Data pipelining is required to be compatible with existing database interfaces.
Extensibility	Components such as metadata schemas, data pipelines, and metadata exports are required to be extensible.

2.2 Organizational Constraints

Constraint	Explanation
OAI-PMH	Many existing data sources provide an OAI-PMH endpoint which needs to be supported.
XML	All source metadata is expected to be in XML.
S3	File-based ingest also has to be supported via object storage, particularly Ceph's S3 API.
K8s	If possible, Kubernetes should be supported (in addition to Compose).
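
Since OAI-PMH support shapes how imports are pulled, the protocol's paging can be sketched as follows. This is a minimal illustration, not DatAasee's ingest pipeline: the endpoint URL and the sample response are hypothetical, and only the request arguments and the `resumptionToken` element come from the OAI-PMH protocol itself.

```python
from typing import Optional
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace of OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL.

    resumptionToken is an exclusive argument: when it is given,
    metadataPrefix must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def next_token(response_xml: str) -> Optional[str]:
    """Return the resumption token of a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    element = root.find(".//oai:resumptionToken", OAI_NS)
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()


# Hypothetical, heavily shortened response used only for illustration.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    # "https://example.org/oai" is a placeholder, not a real source system.
    print(list_records_url("https://example.org/oai"))
    print(list_records_url("https://example.org/oai",
                           resumption_token=next_token(SAMPLE)))
```

A harvester would repeat the second request until `next_token` returns `None`, which marks the final page.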

2.3 Conventions

Technical

Standard	Function
JSON	Serialization language for all external messages
JSON:API	External message format standardization
JSON Schema	External message content validation
YAML	Internal processor (and prototype frontend) declaration language
StrictYAML	Preferred declaration language dialect
OpenAPI	External API definition and documentation format
SHA256	Identifier Hashing and Checksums
Base64URL	Identifier Encoding
Naming Things with Hashes	Identifier Marking
Compose	Deployment and orchestration

Content

Standard	Function
DataCite	Core metadata vocabulary
OpenWEMI	Entity relationships
Fields of Science	Scientific classification
SPDX License List	Software license names
Creative Commons	License names
RightsStatements.org	Copyright classification
ISO 8601	Date and time formatting
ISO 639-1	Language name abbreviations
DOI	Preferred resource identifier
ORCID	Preferred creator identifier
DublinCore	Import format
MODS	Import format
MARCXML	Import format
LIDO	Import format
BibJSON	Export format

Documentation

Standard	Function
Tech Stack Canvas	Product tech stack (see README)
Diataxis	Software documentation structure (see docs)
arc42	Software architecture documentation (this document)
yasql	Database schema documentation (can be rendered with PlantUML)

3. Context & Scope

3.1 Business Context

Channel	Description
Interact	All unprivileged functionality
Search	Directly query metadata records (typically privileged)
Control	Monitor, trigger ingests and backups (privileged)
Import	Ingest metadata records from source system

3.2 Technical Context

Channel	Description
Interact	Unprivileged `HTTP` API
Search	Requested and responded through `HTTP` API
Control	Privileged `HTTP` API
Import	Pulled via `HTTP`

4. Solution Strategy

Three-tier architecture:
- HTTP API is the primary presentation tier (part of the backend)
- Web frontend (exclusively using the API) is the secondary presentation tier
Two main components:
- Database (data tier)
- Backend (stateless application tier)
All components are packaged in containers for:
- Infrastructure compatibility
- Cloud deployability
Property graph data model:
- Metadata records are key-value documents (intra-metadata)
- Metadata records are interrelated based on permanent identifiers (inter-metadata)
All messaging happens via HTTP APIs:
- Internally between components (containers)
- Externally via endpoints (including frontend)
Source code and external messages are in plain text and in standardized formats:
- External messages are in JSON, formatted as JSON:API, and documented by JSON-Schemas.
- Declarative sources are in YAML, following StrictYAML.
Further components are optional:
- Storage is not necessary since only metadata is handled; payload data is only referenced
- Web-frontend uses HTTP API (prototype is included)
Declarative realization for high level of abstraction via:
- Internal Queries: ArcadeDB SQL (external queries may use various query languages)
- Processes: Configuration-based + Bloblang (data mapping language)
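
To illustrate what "JSON, formatted as JSON:API" means for a client, the following sketch checks the top-level rules of the JSON:API specification on a response body. The sample body, its `type`, `id`, and attribute names are hypothetical and do not reflect the actual DatAasee schema; the authoritative message definitions are the JSON Schemas served by the API.

```python
import json
from typing import Any


def check_jsonapi_document(text: str) -> dict[str, Any]:
    """Parse a response body and check the JSON:API top-level rules."""
    document = json.loads(text)
    if not isinstance(document, dict):
        raise ValueError("a JSON:API document must be a JSON object")
    # At least one of these members has to be present.
    if not document.keys() & {"data", "errors", "meta"}:
        raise ValueError("one of 'data', 'errors' or 'meta' is required")
    # 'data' and 'errors' are mutually exclusive.
    if "data" in document and "errors" in document:
        raise ValueError("'data' and 'errors' must not coexist")
    return document


# Hypothetical response body; field names are assumptions for illustration.
SAMPLE_BODY = json.dumps({
    "data": {
        "type": "records",
        "id": "example-id",
        "attributes": {"name": "Example record"},
    }
})

if __name__ == "__main__":
    document = check_jsonapi_document(SAMPLE_BODY)
    print(document["data"]["type"], document["data"]["id"])
```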

5. Building Block View

DatAasee uses a three-tier architecture with these separately containerized components which are orchestrated by Compose:

Function	Abstraction	Tier	Product
Metadata Catalog	Multi-Model Database	Data (Database)	ArcadeDB
EtLT Processor	Declarative Streaming Processor	Logic (Backend)	Benthos
Web Frontend	Declarative Web Framework	Presentation (Frontend)	Lowdefy

Level 0 (Outside View)

DatAasee

Imports metadata from source systems via pull
Provides API to interact with metadata via endpoints
Frontend translates user input to API calls

Source Databases (External)

Known URLs (i.e., service or database endpoints) holding metadata
Bulk ingested
Pollable regularly for updates

Backup Storage (External)

Latest backup is loaded on service startup
Database backup on finished ingest
Database backup on finished interconnect

Prototype Web-Frontend (Optional)

Included prototype frontend
External to core system
Template and documentation for a production frontend

Level 1 (Inside View)

Database Container

Container holding an ArcadeDB database system
This core component stores and serves all metadata
A system backup saves its database

Backend Container

Container holding a Benthos stream processor
This component exposes the external API endpoints and translates between data formats as well as between API and database
Has no state (except temporary cache, which caches queries and refreshes, as well as ingest status)

Frontend Container (Optional)

Container holding a Lowdefy web-frontend
This optional component renders a web-based user interface
Uses API endpoints (but from the internal network, thus the frontend does not use the external port)

Level 2 (Container View)

Database Container Internals

The native schema is created via SQL (during build)
Enumerated types are inserted via SQL (during build)
The initialization script restores the database on start from the latest backup.

Backend Container Internals

API schemas are deposited
Custom configurable components (templates) are defined
Reusable fixed components (resources) are defined

Frontend Container Internals

Pages are defined declaratively
Reused template blocks are loaded
Static assets (images and styles) are loaded

6. Runtime View

System Endpoints

`/api` Endpoint (Public)

NOTE: This endpoint is implicitly cached, meaning all schema files are opened only once.

See api endpoint documentation and source file.

`/ready` Endpoint (Public)

NOTE: This endpoint reports ready if processor and database are ready.

See ready endpoint reference and source file.
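
A deployment script might wait for this endpoint before triggering further actions. The sketch below assumes that readiness is signalled by a 2xx HTTP status, which is an assumption and not confirmed here; the URL in the usage example is a placeholder as well. See the endpoint reference for the actual response.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status (assumed to mean ready)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 10,
                     delay: float = 1.0) -> bool:
    """Call the probe until it succeeds or the attempts are used up."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Placeholder URL; host, port, and base path depend on the deployment.
    ready_url = "http://localhost:8343/api/v1/ready"
    print(wait_until_ready(lambda: http_probe(ready_url), attempts=3))
```

Passing the probe as a callable keeps the retry loop independent of the transport, so it can be exercised without a running service.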

`/health` Endpoint (Private)

NOTE: Since the returned information is only useful to an operator, not to a user, this is a private and thus POST endpoint.

See health endpoint reference and source file.

`/ingest` Endpoint (Private, External Read)

NOTE: The ingest process is asynchronous; the request returns success if an ingest was started.

See ingest endpoint reference and source file.

Support Endpoints

`/schema` Endpoint (Public, Cached)

See schema endpoint reference and source file.

Data Endpoints

`/metadata` Endpoint (Public)

See metadata endpoint reference and source file.

`/database` Endpoint (Public)

NOTE: This endpoint allows idempotent read operations since it uses the query endpoint of ArcadeDB.

See database endpoint reference and source file.

7. Deployment View

Level 0 (Technical View)

See compose.yaml for deployment details.

Level 1 (Data-Flow View)

EtLT (Extract-transform-Load-Transform): Ingest vs. Read

8. Crosscutting Concepts

Internal Concepts

All components are separately containerized.
All communication between components is performed over HTTP using JSON.
HTTP and JSON:API conventions are used and parameters, requests, and responses provide JSON schemas.

Security Concepts

Read access is granted to every user without limitation (expects external rate limits).
Write access (trigger ingest or check health) is only granted to the admin user.
Basic authentication is used by the backend for the "admin" user for private endpoints (expects external TLS termination).
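
As an illustration of the access model, a client call to a private endpoint could be prepared as sketched below. The base URL and password are placeholders, and sending an empty body is an assumption; only the use of Basic authentication for the "admin" user and the POST method for the private `health` endpoint are taken from this document. Since TLS termination is expected externally, such a request should only be sent over a TLS-terminated connection.

```python
import base64
import urllib.request


def private_request(url: str, user: str, password: str,
                    body: bytes = b"") -> urllib.request.Request:
    """Prepare a POST request carrying an HTTP Basic Authorization header."""
    credentials = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Authorization", "Basic " + credentials.decode("ascii"))
    return request


if __name__ == "__main__":
    # Placeholder URL and password; nothing is sent in this sketch.
    request = private_request("https://dataasee.example.org/api/v1/health",
                              "admin", "secret")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```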

Development Concepts

Container images are multi-stage with a generic base stage and a custom development and release stage.
All images run their own health check.
The default API base path communicates the API version.

Operational Concepts

All components provide (internal) ready endpoints and write logs to the standard output.
Secrets are read (safely) from environment variables on the host and mounted as files inside the containers.
Logs are written according to the defaults of the employed container engine.
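
For a custom component that needs such a secret, reading the mounted file might look like the following sketch. Compose mounts secrets under `/run/secrets/<name>` by default; the secret name used in the comment is hypothetical.

```python
from pathlib import Path


def read_secret(path: str) -> str:
    """Read a secret from a mounted file, dropping the trailing newline."""
    secret = Path(path).read_text(encoding="utf-8").rstrip("\r\n")
    if not secret:
        raise ValueError(f"secret file {path} is empty")
    return secret


# Inside a container this could be called as (hypothetical secret name):
#   password = read_secret("/run/secrets/example_admin_password")
```

Reading from a file rather than from the container's environment keeps the value out of process listings and inspection output.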

9. Architectural Decisions

Timestamp	Template
Status	...
Decision	...
Consequences	...

2026-04-17	Remove backup endpoint
Status	Approved
Decision	Remove manual backups since data does not change between ingests
Consequences	Backup endpoint not needed any more.

2026-04-08	Remove view tracking
Status	Approved
Decision	Remove the tracking of record views.
Consequences	Simpler and faster database, no data loss between ingests.

2026-02-20	No Backup on Shutdown
Status	Approved
Decision	The database shutdown does not trigger a backup, as it takes too long for large databases.
Consequences	Faster shutdown and simpler database init script; backups after ingest preserve most of state.

2026-01-27	Use NI for record identifiers
Status	Approved
Decision	Record identifiers are prefixed with the NI URI scheme and use base64url-encoded SHA256 hash.
Consequences	Frontends can detect record identifiers without parsing the key field.
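
A minimal sketch of such an identifier, following RFC 6920 ("Naming Things with Hashes"): the SHA-256 digest is base64url-encoded without padding and prefixed with `ni:///sha-256;`. Which bytes DatAasee actually hashes, and whether it uses an empty authority as shown here, are assumptions of this example.

```python
import base64
import hashlib

# RFC 6920 form with an empty authority part (assumed for this sketch).
NI_PREFIX = "ni:///sha-256;"
# 32 digest bytes encode to 43 base64url characters without padding.
ENCODED_LENGTH = 43


def ni_identifier(raw: bytes) -> str:
    """Derive an NI URI from raw bytes via SHA-256 and unpadded base64url."""
    digest = hashlib.sha256(raw).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return NI_PREFIX + encoded


def is_ni_identifier(value: str) -> bool:
    """Cheap check a frontend could use to recognize a record identifier."""
    return (value.startswith(NI_PREFIX)
            and len(value) == len(NI_PREFIX) + ENCODED_LENGTH)


if __name__ == "__main__":
    identifier = ni_identifier(b"Hello World!")
    print(identifier, is_ni_identifier(identifier))
```

The fixed prefix is what lets a frontend detect identifiers by inspecting the value alone.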

2026-01-22	Remove Gremlin query language
Status	Approved
Decision	Remove Gremlin module from ArcadeDB
Consequences	The Gremlin query language is not supported anymore in DatAasee and hence SPARQL will not be overlaid.

2025-12-17	Streamline HTTP API
Status	Approved
Decision	Remove `enums` and `sources` endpoints and integrate their information into `schema` endpoint
Consequences	More uniform API handling, and fewer endpoints for easier usability.

2025-09-19	Frontend Container Image
Status	Approved
Decision	Production and development frontend images should be air-gapped after build.
Consequences	More control over dependencies especially during dynamic rebuilds.

2025-04-11	Post-Processing
Status	Approved
Decision	Minimize database response post-processing.
Consequences	Shift transformation workload to ArcadeDB.

2024-10-23	Container Base Images
Status	Approved
Decision	Base containers for database and backend are the current Ubuntu LTS (i.e., 26.04).
Consequences	Full `libc` support compared to Alpine and obvious release date and support horizon from version number compared to Debian.

2024-07-04	Indirect Processor Dependency Updates
Status	Approved
Decision	Indirect processor dependency updates do not cause a (minor) version update.
Consequences	A release image build (of the current version) can be triggered and processor dependencies are updated in the process.

2024-06-03	API Licensing
Status	Approved
Decision	The OpenAPI license definition is additionally licensed under CC-BY.
Consequences	Easier third-party reimplementation of the DatAasee API.

2024-02-21	Use OAI vs Non-OAI metadata format variants
Status	Approved
Decision	Non-OAI variants of the DC and DataCite formats are supported.
Consequences	More lenient, and less strict with fields configuring ingest.

2024-01-17	Compose-only Deployment
Status	Approved
Decision	Deployment is solely distributed and initiated by the `compose.yaml`.
Consequences	The compose file and orchestrator have central importance.

2023-11-20	Database Storage
Status	Approved
Decision	Database uses in-container storage, only backups are stored outside.
Consequences	Faster database at the price of fixed savepoints.

2023-08-24	Record Identifier
Status	Approved (Superseded)
Decision	Use xxhash64 / SHA256 of ingested or inserted raw record.
Consequences	Identifier is reproducible but not a URL.

2023-08-08	Ingest Modularity
Status	Approved
Decision	Ingest sources are passed via API to the backend.
Consequences	Sources can be maintained outside and appended during runtime.

2023-05-16	Graph Edges
Status	Approved
Decision	Graph edges are only set by ingest (or other automatic) processes, not by a user.
Consequences	Edge semantics need to be machine-interpretable.

2022-12-07	Frontend Language
Status	Approved
Decision	Use English language only for frontend and metadata labels and comments.
Consequences	Additional translations (German) are not prepared for now.

2022-10-10	Only Virtual Storage
Status	Approved
Decision	No explicit storage component for data, only metadata is managed.
Consequences	No interface or instance to e.g. Ceph is developed, but URL references (to data storage) are stored.

2022-10-05	API-only Frontend
Status	Approved
Decision	The HTTP API is the sole frontend, further frontends are only expressions of the API.
Consequences	Web frontend can only use the API.

2022-10-04	Declarative First
Status	Approved
Decision	Prefer declarative (YAML-based) approaches for defining processes and interfaces to reduce free coding and increase robustness.
Consequences	Frontrunners Benthos as backend, and Lowdefy (or uteam) as prototype web-frontend.

2022-09-16	Multi-model Database
Status	Approved
Decision	Use (property)-graph / document / key-value database as central catalog component for maximal flexible data model.
Consequences	Frontrunner ArcadeDB (or OrientDB) as database.

10. Quality Requirements

10.1 Quality Requirements

Quality Category	Quality	ID	Description
Functional Suitability	Appropriateness	F0	DatAasee should fulfill the expected overall functionality.
Transferability	Installability	T0	Installation should work in various container-based environments.
Compatibility	Interoperability	C0	The available protocols (and format parsers) should fit the most common systems.
Operability	Ease of Use	O0	The API should be self-describing, well documented, and following standards and best practices.
Maintainability	Modularity	M0	New protocols, format parsers or other pipelines should be implementable without too much effort.
Maintainability	Reusability	M1	The protocol and format parser codes serve as sample and documentation.

10.2 Quality Scenarios

ID	Scenario
F0	Stakeholder project evaluation
T0	Setup of DatAasee by a new operator
C0	Ingesting from a new source system
O0	User and (downstream) developer API Usage
M0	Extending the compatibility to new systems
M1	Development of a follow-up project to DatAasee

11. Risks & Technical Debt

Risk	Description	Mitigation
Insecure deployment	There is no built-in TLS termination or rate limiting, and the `database` endpoint is not meant for public consumption	Comprehensive documentation with warnings and guidelines.
DBMS project might cease	`ArcadeDB` is a small project with the associated small-project risks	`ArcadeDB` is a core architectural dependency; however, since SQL is used internally to interact with it, an RDBMS could in principle serve as a replacement.
Processor project might complicate	`Benthos` was acquired by "Redpanda" who may change its license or licenses of the connectors	Using hard fork `bento`.

12. Glossary

Term	Acronym	Definition
Administrative Metadata		Metadata about accessibility.
Application Programming Interface	API	Specification and implementation of a way for software to interact (here HTTP API).
Backend	BE	Software component encoding the internal logic.
Container	CTR	Software packaged into standardized unit for operating-system-level virtualization.
Create-Read-Update-Delete	CRUD	Basic operations when interacting with a database (or storage).
Database	DB	Collection of related records.
Database Management System	DBMS	The software running the databases.
Data Catalog	DCAT	Inventory of databases.
Data-Lake	DL	Structured, semi-structured, and unstructured data architecture.
Declarative Low-Code		Defining an application only by configuration of components (and minimal explicit transformations).
Declarative Programming		Programming style of expressing logic without prescribing control flow ("what", not "how").
Descriptive Metadata		Metadata describing the underlying data.
Domain Specific Language	DSL	A formal language designed for a particular application.
Extract-Load-Transform	ELT	A typical ingestion process for unstructured data.
Extract-Transform-Load	ETL	A typical ingestion process for structured data.
Extract-transform-Load-Transform	EtLT	An ingestion process for semi-structured data.
Frontend	FE	(Web-based) software component presenting a user interface.
Inter-Metadata		Metadata about data related to the underlying data.
Intra-Metadata		Metadata about the underlying data.
Low-Code		Functionality assembly using high-level prefabricated components.
Metadata	MD	All statements about a (tangible or digital) information object.
Metadata Catalog	MDCAT	Inventory of metadata databases.
Metadata-Lake	MDL	Structured, semi-structured, and unstructured data architecture for metadata management.
Metadata-Set		A record containing metadata.
Named Information	NI	URI scheme for hash-based record identifiers (RFC 6920, "Naming Things with Hashes").
Process Metadata		Metadata about lineage.
Social Metadata		Metadata about usage and discoverability.
Technical Metadata		Metadata about format and structure.
