Data Models

Important

The default branch has been switched to tyu-refresh, but the content is still under review in #43. Once that PR is merged, the default branch will be switched back to main. Feedback would be greatly appreciated.

The Curator-Extension (formerly Schematic) data model is used to create JSON Schemas for Curator, which enable the contribution of valid metadata. See JSON Schema documentation. It is intended for those who prefer working in a tabular format (CSV) over JSON or LinkML. A data model is created in the format specified here. The Curator-Extension in the Synapse Python Client can then be used to convert it to JSON Schema.

This repository recommends three ways to maintain your data model when using the CSV format:

One CSV
Modular CSV
Contextualized CSV

One CSV

example.model.csv is a CSV representation of the example data model.

Modular CSV

As a data model grows, maintaining it all in one CSV becomes overwhelming. In this scenario, users can break the CSV down into smaller, more manageable chunks. The modules folder contains an example of how example.model.csv is broken down.

Contextualized CSV

Motivated by the ARK Portal data model (https://github.com/ARK-Portal/data_model), a data model can use "contexts" to define context-specific conditionally required attributes, with the added benefit of context-specific valid value lists for model attributes and more. A small example of this can be found in the contexts folder. Each data model lives in its own CSV, and a user can modularize these further in whatever way they choose. In this scenario, each "template" has its own data model CSV, and the generate-json-schema command is run for each template CSV instead of concatenating all of the CSVs together.

Descriptions of valid values

The "Valid Values" column for an attribute often contains many values without any descriptions. You can describe these valid values by adding extra rows in which the valid values themselves appear as "Attributes".

Caution

When adding a valid value as an Attribute in order to describe it, that value CANNOT appear in any string in the "DependsOn" column unless you also want it to be a data model attribute.
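The caution above can be checked mechanically. The following is a minimal sketch, not part of this repository: it assumes the column names "Attribute" and "Description" alongside the "Valid Values" and "DependsOn" columns mentioned above, assumes both multi-value cells are comma-separated, and uses invented sample rows (Patient, Diagnosis, Healthy, Cancer) purely for illustration.

```python
import csv
import io


def split_cell(cell):
    """Split a comma-separated cell into stripped, non-empty values."""
    return [value.strip() for value in (cell or "").split(",") if value.strip()]


def described_values_in_depends_on(rows):
    """Return valid values that have their own row AND appear in DependsOn."""
    attributes = {row["Attribute"].strip() for row in rows}
    valid_values = set()
    depends_on = set()
    for row in rows:
        valid_values.update(split_cell(row.get("Valid Values")))
        depends_on.update(split_cell(row.get("DependsOn")))
    # A "described" valid value is listed under "Valid Values" and also
    # has its own "Attribute" row that carries the description.
    described = valid_values & attributes
    return sorted(described & depends_on)


# Hypothetical model: "Healthy" is a described valid value that was also
# (probably by mistake) listed in the DependsOn column of "Patient".
SAMPLE = '''Attribute,Description,Valid Values,DependsOn
Patient,A study participant,,"Diagnosis, Healthy"
Diagnosis,Disease status,"Healthy, Cancer",
Healthy,No disease detected,,
Cancer,Malignant neoplasm,,
'''

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
conflicts = described_values_in_depends_on(rows)
print(conflicts)  # prints ['Healthy']
```

Any value this check reports either needs to be removed from "DependsOn" or is intended to be a real data model attribute.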

Using this template repository

Decide on the way you want to organize and maintain your data model (three options above)
Keep the desired folder (one_csv, modules, contexts) and delete the other two
Keep the corresponding GitHub Action workflow (onecsv-ci.yml, modules-ci.yml, or contexts-ci.yml) and delete the other two. Consider renaming the file to ci.yml for simplicity.

Manually Generating JSON schemas

To manually generate JSON Schemas, you must install the Synapse Python Client along with the curation extension. Each of the data model options above has a slightly different method of generating JSON Schemas.

Note

This section assumes that you already have working proficiency with Python.

pip install "synapseclient[curator]"

One CSV

Generate all data model jsonschemas from one CSV.

synapse generate-json-schema one_csv/example.model.csv --data-model-labels display_label

Modular CSV

Concatenate all CSVs and generate all data model jsonschemas from the assembled CSV.

python scripts/assemble_csv_data_model.py modules assembled.csv
synapse generate-json-schema assembled.csv --data-model-labels display_label
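For readers curious about what the assembly step involves, here is a minimal sketch of concatenating module CSVs with Python's standard library. It is not the repository's scripts/assemble_csv_data_model.py, whose actual behavior may differ. The function name assemble_csvs and the choice to take the union of all headers are assumptions made for illustration.

```python
import csv
from pathlib import Path


def assemble_csvs(folder, output_path):
    """Concatenate every CSV under `folder` into one CSV at `output_path`.

    Returns the number of data rows written.
    """
    output_path = Path(output_path)
    # Skip the output file itself in case it lives inside `folder`.
    paths = [
        path for path in sorted(Path(folder).rglob("*.csv"))
        if path.resolve() != output_path.resolve()
    ]
    header = []
    rows = []
    for path in paths:
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle)
            # Build the union of column names, preserving first-seen order.
            for name in reader.fieldnames or []:
                if name not in header:
                    header.append(name)
            rows.extend(reader)
    with output_path.open("w", newline="", encoding="utf-8") as handle:
        # restval fills columns that a given module did not define.
        writer = csv.DictWriter(handle, fieldnames=header, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Calling assemble_csvs("modules", "assembled.csv") would produce a single file that can then be passed to the generate-json-schema command shown above.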

Contextualized CSV

Generate a jsonschema from each data model CSV.

synapse generate-json-schema contexts/clinical_model.csv --data-model-labels display_label
synapse generate-json-schema contexts/genomic_model.csv --data-model-labels display_label
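When the contexts folder holds more than a couple of templates, the per-file commands above can be generated in a loop. The sketch below only assumes the command line shown above; the helper names build_commands and run_all are hypothetical, and the default is a dry run that prints the commands without executing them.

```python
import subprocess
from pathlib import Path


def build_commands(folder, label_style="display_label"):
    """Return one `synapse generate-json-schema` command per CSV in `folder`."""
    return [
        ["synapse", "generate-json-schema", str(path),
         "--data-model-labels", label_style]
        for path in sorted(Path(folder).glob("*.csv"))
    ]


def run_all(folder, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in build_commands(folder):
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first template that fails to convert.
            subprocess.run(command, check=True)
```

Running run_all("contexts") lists the commands, and run_all("contexts", dry_run=False) executes them, which requires the synapse CLI to be installed as described earlier.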

Best Practices: Operations for Data Models

Purpose

This section describes operational best practices for:

Ensuring day-to-day data model edits reliably produce JSON Schemas that work with Synapse Curator.
Creating official, versioned JSON Schema releases registered in Synapse.
Maintaining clear separation between test (development) and production (released) schema environments.

This guidance focuses on governance, change management, and release discipline.

Guiding Principles

Data models require ownership, review, and lifecycle management.
Schemas used in production must be immutable.
Development and production environments should be separated; maintaining a dedicated development environment is encouraged.
Portals own their data models.

Environment Separation: Recommended Organizational Structure

Each portal is recommended to maintain two separate Synapse schema organizations.

Test Schema Organization (Development)

Purpose

Rapid iteration
Curator compatibility testing
Pre-release schema staging

Characteristics

Schemas may change frequently.
Versions may be overwritten.
Clearly labeled as non-production.

Recommended naming conventions

test.sage.{portal_name}

Production Schema Organization (Released)

Purpose

Official, versioned schema releases
Stable references for Curator

Characteristics

Schemas are immutable once released.
Organized by version.
Ideally never overwritten.

Recommended naming conventions

sage.schemas.{portal_name}
org.synapse.{portal_name}
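The naming conventions for the two organizations can be captured in a small helper so that scripts never mix them up. This is a sketch under stated assumptions: the function name schema_organization is hypothetical, it uses the first of the two production conventions listed above, and the portal-name check (lowercase letters and digits, starting with a letter) is an illustrative guess rather than a documented Synapse rule.

```python
import re


def schema_organization(portal_name, production=False):
    """Build an organization name from the conventions listed above."""
    name = portal_name.strip().lower()
    # Assumed constraint for illustration only; adjust to your portal's rules.
    if not re.fullmatch(r"[a-z][a-z0-9]*", name):
        raise ValueError(f"unexpected portal name: {portal_name!r}")
    prefix = "sage.schemas" if production else "test.sage"
    return f"{prefix}.{name}"


print(schema_organization("ark"))                   # test.sage.ark
print(schema_organization("ark", production=True))  # sage.schemas.ark
```

Defaulting to the test organization means a release to production has to be requested explicitly.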

Daily Model Edits (Test Environment)

Ensure routine data model changes work with Synapse Curator and remain aligned with operational expectations.

Every change should generate JSON Schemas

No model change is considered complete until JSON Schemas can be generated from it and registered in the Test Schema Organization, which confirms compatibility with Synapse Curator.

Official Schema Releases (Production)

Create reproducible, traceable, immutable schema releases registered in Synapse.

Schemas are versioned using explicit release numbers

Recommended: Semantic Versioning

MAJOR – breaking changes
MINOR – backward-compatible additions
PATCH – non-breaking fixes

Versioning applies to the release set, not individual ad hoc files.
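As a small illustration of the scheme, the sketch below computes the next release number for a given kind of change. The function name bump_version is hypothetical, and it assumes plain MAJOR.MINOR.PATCH strings without pre-release or build suffixes.

```python
def bump_version(version, change):
    """Return the next version string for a 'major', 'minor', or 'patch' change."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":
        # Breaking change: reset the lower components.
        return f"{major + 1}.0.0"
    if change == "minor":
        # Backward-compatible addition: reset only the patch component.
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")


print(bump_version("1.4.2", "minor"))  # 1.5.0
```

For example, adding a new optional attribute would move a release set from 1.4.2 to 1.5.0, while removing an attribute would move it to 2.0.0.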

Production releases must be immutable

Once a schema version is registered in the Production Schema Organization, it should never be modified. Corrections to the schema require a new version.
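The difference between the two environments can be stated as a rule in code. The class below is an in-memory stand-in written for illustration only: SchemaRegistry and its methods are hypothetical and do not call Synapse. It simply shows the intended behavior, where test versions may be overwritten and production versions may not.

```python
import json


class SchemaRegistry:
    """In-memory stand-in for a schema organization (hypothetical)."""

    def __init__(self, production):
        self.production = production
        self._schemas = {}

    def register(self, name, version, schema):
        """Store a schema; refuse to overwrite a released production version."""
        key = (name, version)
        if self.production and key in self._schemas:
            raise ValueError(
                f"{name} {version} is already released; publish a new version"
            )
        # Serialize so later edits to the caller's dict cannot alter the record.
        self._schemas[key] = json.dumps(schema, sort_keys=True)

    def get(self, name, version):
        """Return a copy of the stored schema."""
        return json.loads(self._schemas[(name, version)])
```

A correction to a released schema would therefore be registered under a new version number rather than replacing the existing entry.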

Formal Release Process

Each portal should define a lightweight but explicit release process for its data model that can:

Confirm that all changes produce Synapse-compliant JSON Schemas
Create release artifacts within GitHub (e.g. using the GitHub Release + tag feature)
Generate and register versioned JSON Schemas in the production JSON Schema organization

Using GitHub Actions

This repository also contains template GitHub Actions workflows that generate JSON Schemas for each of the recommended data model maintenance approaches. These workflows lightly implement what was described in the "Best Practices: Operations for Data Models" section above, but Sage Portal Owners do NOT have to use them to achieve the best practices.

Avoiding Merge Conflicts with Automated Commits

Try to avoid configuring GitHub Actions to commit generated files (like assembled CSVs or JSON schemas) back to the repository. This practice commonly leads to merge conflicts and complicates collaborative workflows.

Problems with automated commits:

Creates merge conflicts when multiple contributors work simultaneously
Makes git history noisy with automated commits
Complicates branch management and pull request reviews
Can cause infinite loops if not properly configured

Recommended alternative: Use GitHub Artifacts

Store generated files as build artifacts that can be downloaded:

- name: Upload assembled CSV
  uses: actions/upload-artifact@v6
  with:
    name: assembled-data-model
    path: assembled.csv

This approach keeps your repository clean while still providing access to generated files for downstream consumers. For GitHub tagged releases, the artifact is retained "forever".
