
Introducing an updated document format

Martin Winter edited this page Feb 10, 2026 · 9 revisions

OpenBoard 1.8.0 introduces an updated document format that includes a Table of Contents (TOC). This feature offers many benefits:

  • Improved performance when adding, moving, or deleting pages from large documents, even on low-throughput network storage.
  • Reduced storage consumption by avoiding duplicate media files, such as images, videos, and audio files.
  • Safe removal of unreferenced media asset files.

The following sections describe the basic idea, compatibility considerations and technical implementation details.

Basic idea

In versions of OpenBoard up to 1.7.5, the pages in a document are stored as a collection of files called page<nnn>.svg, where <nnn> is a sequential number starting from 000 and counting upwards. When pages are added, removed or moved, the sequential numbering is always maintained. This can result in a large number of renaming operations, especially for large documents.

The new idea is to add a table of contents that stores the filename of each page. Reordering pages simply involves reordering the table of contents (TOC) entries. Similarly, adding or removing a page from the middle of a document simply adds or removes the corresponding TOC entry. Maintaining sequential page numbering is no longer necessary.

The table of contents can also carry additional information. Currently, it keeps a list of the media asset files used by the graphic elements in each scene. This makes it possible to identify and, if appropriate, remove unreferenced media asset files.

Ideas for future use of the TOC include using named pages instead of page numbers or adding bookmarks for quick navigation within a document.

Table of contents

A table of contents (TOC) is a list that stores information about each page. It currently contains:

  • the page ID for each page, which is used to derive the file name (see below);
  • the UUID of the scene;
  • a list of assets referenced by that scene.

In the future, the TOC can easily be expanded to include additional information, such as page titles, which could be displayed beneath the thumbnails, or bookmarks for quick navigation between pages.

Page index and Page ID

The most important feature of the TOC is that it decouples the page index from the page ID.

  • The page index is the zero-based index of a page. The first page in a document has an index of 0, the second has an index of 1, and so on. The index always describes a page's position within a document.
  • The page ID determines the file names used for a page and the associated thumbnail. For example, the file name page123.svg corresponds to page ID 123.

Previously, the page ID and page index were always the same. This required renaming files when pages were inserted, deleted, or moved. Now, the TOC allows you to perform such operations within the TOC. For example, moving a page just means moving a TOC entry in the list. The files remain unchanged.

The page index is now used to address the proper entry in the TOC to determine the corresponding page ID and, from that, the file name.
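The decoupling of page index and page ID can be sketched in a few lines. This is a minimal illustration and not the actual UBDocumentToc code: the class Toc, its member functions, and the way new page IDs are allocated (a simple counter) are assumptions made for this example. Only the file name pattern page<nnn>.svg is taken from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical, simplified TOC entry; the real TOC stores more data.
struct TocEntry {
    int pageId;             // determines the file name, e.g. 123 -> page123.svg
    std::string sceneUuid;  // UUID of the scene stored in that file
};

// Hypothetical TOC: the position of an entry in the vector is the page index.
class Toc {
public:
    std::size_t pageCount() const { return entries_.size(); }

    // Translate a page index (position) into a page ID (file name).
    int pageId(std::size_t index) const { return entries_.at(index).pageId; }

    // Append a page with a fresh page ID (assumed allocation scheme).
    // No existing file has to be renamed.
    int addPage(const std::string& sceneUuid) {
        const int id = nextPageId_++;
        entries_.push_back(TocEntry{id, sceneUuid});
        return id;
    }

    // Moving a page only moves its TOC entry; the files stay untouched.
    void movePage(std::size_t from, std::size_t to) {
        assert(from < entries_.size() && to < entries_.size());
        const TocEntry entry = entries_[from];
        entries_.erase(entries_.begin() + static_cast<std::ptrdiff_t>(from));
        entries_.insert(entries_.begin() + static_cast<std::ptrdiff_t>(to), entry);
    }

    // Deleting a page only removes its TOC entry.
    void removePage(std::size_t index) {
        assert(index < entries_.size());
        entries_.erase(entries_.begin() + static_cast<std::ptrdiff_t>(index));
    }

private:
    std::vector<TocEntry> entries_;
    int nextPageId_ = 0;
};

// File name pattern from the text: page<nnn>.svg with at least three digits.
std::string pageFileName(int pageId) {
    char buffer[32];
    std::snprintf(buffer, sizeof buffer, "page%03d.svg", pageId);
    return buffer;
}
```

After adding three pages and moving the last one to the front, index 0 resolves to page ID 2, while the files on disk keep their names.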

Format

The TOC is stored together with the document as a single JSON file. It would be easy to switch to a different serialization format, for example if a more compact format were required.

Scanning

If no TOC exists, or if the document was modified by a previous version of OpenBoard, which may cause a mismatch between the files and the TOC content, we scan the document to create or recreate a matching TOC.

The scanning algorithm is designed to cope with missing or mismatched TOCs. The following steps are performed:

  • Load the TOC if it exists.
  • Create a set of all scene UUIDs -> tocSceneUuids.
  • Create a directory listing and scan through the pages, starting with the lowest page number and counting up. Do not stop on gaps.
  • Load the scenes and check the media assets (i.e., items that reference a media file or directory). See below for a more detailed description of media assets.
  • Check the media asset UUIDs. If a UUID is not an SHA-1-based UUIDv5, copy the asset file and create a UUIDv5 from the media asset. Use the new UUIDv5 for the item and mark the scene as modified.
  • Check whether the tocSceneUuids already contains this scene.
  • If so, locate and update the TOC entry. Otherwise add the file ID, scene UUID, and asset list to the TOC after the last processed TOC entry or at the beginning, if no entry was processed before.
  • Remove the scene UUID from the tocSceneUuids.
  • When finished, remove all scenes still in tocSceneUuids from the TOC. These scenes may have been deleted using an earlier version of OpenBoard.
  • Go through the TOC and create a list of all referenced media assets.
  • Go through the media asset directories and delete all files that are not referenced.

The TOC is now in sync with the files in the document.
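The TOC-related part of these steps can be sketched as follows. This is a simplified illustration under several assumptions: the types ScannedPage and TocEntry and the function syncToc are hypothetical, the page files are assumed to be already read into memory, and the UUIDv5 conversion and the asset cleanup are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical result of reading one page file from disk.
struct ScannedPage {
    int pageId;
    std::string sceneUuid;
    std::vector<std::string> assets;
};

// Hypothetical, simplified TOC entry.
struct TocEntry {
    int pageId;
    std::string sceneUuid;
    std::vector<std::string> assets;
};

// Synchronizes `toc` with the pages found on disk.
// `pages` must be sorted by ascending page ID; gaps are allowed.
void syncToc(std::vector<TocEntry>& toc, const std::vector<ScannedPage>& pages) {
    const bool sorted = std::is_sorted(pages.begin(), pages.end(),
        [](const ScannedPage& a, const ScannedPage& b) { return a.pageId < b.pageId; });
    assert(sorted);
    (void)sorted;  // avoid an unused-variable warning when asserts are disabled

    // Scene UUIDs known to the TOC that have not been seen on disk yet.
    std::set<std::string> tocSceneUuids;
    for (const TocEntry& entry : toc) tocSceneUuids.insert(entry.sceneUuid);

    std::size_t insertPos = 0;  // position right after the last processed entry
    for (const ScannedPage& page : pages) {
        if (tocSceneUuids.count(page.sceneUuid) > 0) {
            // Known scene: locate and update the existing entry.
            auto it = std::find_if(toc.begin(), toc.end(), [&](const TocEntry& e) {
                return e.sceneUuid == page.sceneUuid;
            });
            it->pageId = page.pageId;
            it->assets = page.assets;
            insertPos = static_cast<std::size_t>(it - toc.begin()) + 1;
        } else {
            // Unknown scene: insert it after the last processed entry.
            toc.insert(toc.begin() + static_cast<std::ptrdiff_t>(insertPos),
                       TocEntry{page.pageId, page.sceneUuid, page.assets});
            ++insertPos;
        }
        tocSceneUuids.erase(page.sceneUuid);
    }

    // Entries still in tocSceneUuids have no file on disk any more.
    toc.erase(std::remove_if(toc.begin(), toc.end(), [&](const TocEntry& e) {
        return tocSceneUuids.count(e.sceneUuid) > 0;
    }), toc.end());
}
```

For example, if the TOC lists scenes A, B, C and an earlier version deleted B, renumbered the files and appended a new scene D, the result is A, C, D with the page IDs found on disk.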

Two-phase scanning

To improve performance and reduce delays when selecting or opening documents, we split the scanning process into two phases.

  • Phase 1 involves scanning the scene UUIDs and occurs when a document is selected in Document mode or when an instance of UBDocument is created by another action.
  • Phase 2 involves reading the scenes and scanning for media assets. Here, we also perform the conversion to UUIDv5, if necessary. Phase 2 begins in the background when a document is opened.

To avoid blocking for indefinite periods of time, all disk I/O is moved out of the main thread. Note that opening a file on a network drive may take much longer than reading the actual data. Since we have to read a potentially large number of files, it is beneficial to move this process out of the main thread and process several files in parallel on multiple threads.

Phase 1

The execution mode of this phase depends on the existence of a TOC and the document version.

  • If no TOC exists, we can start this phase in the background. Using the UBBackgroundLoader, we read the first few bytes of each scene, scan for the scene UUID, and store it in the TOC.
  • If a TOC exists but the document version is below 4.9.0, the document was modified by an earlier version of OpenBoard, so we must synchronize the TOC information with the scene files on disk. This process must finish before we can start loading the thumbnails.

During Phase 1, we check the version of each scene. If the scene's version is below 4.9.0 and there is an associated TOC entry, we invalidate the asset information for that scene to force an update in Phase 2.

Phase 2

During this phase, we create or update information about assets referenced by scenes. While this phase is not essential for working with the document, it is necessary to detect and delete unreferenced asset files. We perform the following steps:

  • We check the table of contents (TOC) to create a list of scenes for which no asset information is available.
  • Then, we start loading the scene files in the background using the UBBackgroundLoader.
  • As each scene file loads, we create the scene on the main thread.
  • Once loading is finished, we review the media asset items of that scene. If the asset UUID is not a UUIDv5, we create a UUIDv5 from the asset file content. Then, we copy the file and update the item accordingly.
  • If the scene was modified, it is persisted again in the background.

Document version

To distinguish the new document format from the previous one, we changed the version number of the document from 4.8.0 to 4.9.0. This number is written in the document's RDF metadata, as well as in each scene. When opening a document, we can check the version number to see if the document requires special handling.
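A version check of this kind should compare the components numerically rather than as strings, so that a later version such as 4.10.0 is not mistaken for one below 4.9.0. The following sketch shows one way to do this; the function names parseVersion and needsTocScan are hypothetical and not taken from the OpenBoard sources.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>

// Parses "major.minor.patch". Missing or malformed components count as 0.
std::array<int, 3> parseVersion(const std::string& text) {
    std::array<int, 3> parts{{0, 0, 0}};
    std::istringstream in(text);
    std::string field;
    for (std::size_t i = 0; i < parts.size() && std::getline(in, field, '.'); ++i) {
        try {
            parts[i] = std::stoi(field);
        } catch (const std::exception&) {
            parts[i] = 0;  // not a number or out of range
        }
    }
    return parts;
}

// True if the document was last written by a version that did not maintain
// the TOC, i.e. its version is below 4.9.0.
bool needsTocScan(const std::string& documentVersion) {
    const std::array<int, 3> firstTocVersion{{4, 9, 0}};
    // std::array compares lexicographically, component by component.
    return parseVersion(documentVersion) < firstTocVersion;
}
```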

Upgrading documents

We know that a document with a version number below 4.9.0 was last opened and modified by an earlier version of OpenBoard. It may contain a table of contents (TOC), but the information in the TOC is likely out of sync with the page data files. This triggers the scanner to collect information from all the page files and merge it with the information from any existing table of contents (TOC).

Downgrading documents

When a document with version number 4.9.0 is opened in an earlier version of OpenBoard, a dialogue box warns the user that some information may not be processed correctly. If the user accepts the risk, the document opens. All sequentially numbered pages are displayed, but they may appear out of order if they were reordered using the table of contents (TOC). If a page was deleted using the TOC, the sequential numbering is interrupted, and subsequent pages may not be visible.

Downgrading will result in loss of information, but if a document has only been slightly modified and no pages have been added, moved, or deleted, it will open correctly in an earlier version of OpenBoard.

Exporting documents

Another way to maintain compatibility with previous OpenBoard versions is to export a document. If the document is version 4.9.0 and has a valid table of contents (TOC), we rename the page files during export to create sequential file names, just as before. We also update the table of contents (TOC) to reflect this change.
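The renaming during export amounts to mapping each page's current file name to a name derived from its position in the TOC. The sketch below computes such a mapping. It assumes that the exported files are written to a separate target location, so that old and new names cannot collide; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// File name pattern from the text: page<nnn>.svg with at least three digits.
std::string pageFileName(int pageId) {
    assert(pageId >= 0);
    char buffer[32];
    std::snprintf(buffer, sizeof buffer, "page%03d.svg", pageId);
    return buffer;
}

// For each page in TOC order, returns (current file name, exported file name).
// In the exported document, the page ID equals the page index again.
std::vector<std::pair<std::string, std::string>>
exportFileNames(const std::vector<int>& pageIdsInTocOrder) {
    std::vector<std::pair<std::string, std::string>> plan;
    for (std::size_t index = 0; index < pageIdsInTocOrder.size(); ++index) {
        plan.emplace_back(pageFileName(pageIdsInTocOrder[index]),
                          pageFileName(static_cast<int>(index)));
    }
    return plan;
}
```

For a TOC listing the page IDs 7, 2, 5, the exported files are page000.svg (from page007.svg), page001.svg (from page002.svg), and page002.svg (from page005.svg).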

Media assets

Pages can contain graphical items that are backed by a media asset file or directory. Such items include:

  • Audio
  • Video
  • SVG images
  • Pixel images in various formats
  • PDF backgrounds
  • Web widgets

The table of contents (TOC) also contains a list of referenced media assets for each page.

UUID (V5)

OpenBoard never changes the media asset files. Therefore, we can identify them by an SHA-1 hash computed from the file content. The UUID specification enables us to create a UUID based on the hash value. This is called a version 5 UUID.

The UUID is then used to determine the asset file's name. This has the following consequences:

  • Different media asset files have different file names with a very high probability (SHA-1 hash conflicts are extremely rare).
  • The same media asset file always has the same UUID. If someone adds the same image more than once, for example, the file is stored only once in the document. The TOC keeps track of the references. This saves disk space.
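The deduplication effect of content-derived file names can be illustrated with a small in-memory store. Note the assumptions: the sketch uses a 64-bit FNV-1a hash as a stand-in for the SHA-1-based UUIDv5 that OpenBoard uses, and the class AssetStore is hypothetical. The principle is the same: identical content yields an identical name and is therefore stored only once.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Stand-in content hash (64-bit FNV-1a). This is NOT what OpenBoard uses;
// it only replaces the SHA-1-based UUIDv5 to keep the example short.
std::uint64_t contentHash(const std::string& bytes) {
    std::uint64_t hash = 0xcbf29ce484222325ULL;  // FNV offset basis
    for (unsigned char byte : bytes) {
        hash ^= byte;
        hash *= 0x100000001b3ULL;  // FNV prime
    }
    return hash;
}

// Hypothetical asset store that names files after their content.
class AssetStore {
public:
    // Stores the content if it is not present yet and returns its name.
    std::string add(const std::string& content, const std::string& suffix) {
        const std::string name = std::to_string(contentHash(content)) + suffix;
        files_.emplace(name, content);  // no effect if the name already exists
        return name;
    }

    std::size_t fileCount() const { return files_.size(); }

private:
    std::map<std::string, std::string> files_;  // file name -> content
};
```

Adding the same image twice returns the same name both times and leaves a single file in the store.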

Removing unreferenced media asset files

The table of contents (TOC) now allows us to identify media asset files that are not referenced by any page in the document. We just have to scan through the media asset folders. We check to see if each file or directory is listed in the TOC. If not, we can safely delete the asset.

This cleanup only occurs after a scan or when the document is closed, i.e., when the corresponding UBDocument instance is deleted. It only occurs when asset information is complete and available for all scenes. Therefore, assets belonging to an item that was recently deleted are retained in case the user undoes the deletion.
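The cleanup can be sketched as a set difference between the files found in the media asset folders and the assets listed in the TOC. The types and names below are hypothetical. The assetsKnown flag models the condition that cleanup only runs when asset information is available for all scenes.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified TOC entry.
struct TocEntry {
    int pageId;
    std::vector<std::string> assets;  // asset file names referenced by the page
    bool assetsKnown;                 // false until the scene has been scanned
};

// Returns the asset files that are not referenced by any page.
// Returns an empty set if asset information is incomplete, because
// deleting files would not be safe in that case.
std::set<std::string> unreferencedAssets(const std::vector<TocEntry>& toc,
                                         const std::set<std::string>& filesOnDisk) {
    std::set<std::string> referenced;
    for (const TocEntry& entry : toc) {
        if (!entry.assetsKnown) return {};
        referenced.insert(entry.assets.begin(), entry.assets.end());
    }

    std::set<std::string> unreferenced;
    for (const std::string& file : filesOnDisk) {
        if (referenced.count(file) == 0) unreferenced.insert(file);
    }
    return unreferenced;
}
```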

Asset UUID and item UUID

Web widgets differ from other media assets in that:

  • They don't consist of a single file, but rather a collection of files in a directory.
  • They can be updated, i.e., their content can be changed.

Therefore, we use a slightly different approach for the asset UUID.

  • Widget applications and interactivities have a unique ID. We create a UUIDv5 based on the widget ID.
  • Web widgets created from a whole webpage now use the webpage's URL as the widget ID. We create the UUIDv5 based on that URL.
  • For web widgets created from an oEmbed code snippet or an iframe, we create the asset UUID based on the snippet's content. We use a URN in the format "urn:uuid:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" for the widget ID.

Note that the widget ID, and therefore the asset UUID, will not change for applications and interactivities, even if the widget is updated to a new version.

Other changes

Remove CFF support

Support for the CFF file format has been removed from OpenBoard because it no longer worked.

Implementation

This section provides additional implementation details to help developers understand the code structure.

UBDocumentToc and the role of UBDocument and UBPersistenceManager

The table of contents is stored in a small class UBDocumentToc. This class provides read and write access to the entries, as well as the ability to load and save the table of contents to disk using the associated class, UBTocSerializer. UBDocumentToc and UBTocSerializer have been designed so that page properties added in the future can be read and written and will not be lost if they are edited by a previous version of OpenBoard.

A UBDocument owns an instance of UBDocumentToc. When a UBDocument is created, the associated UBDocumentToc instance is also created and, if possible, loaded from disk. All functions of the UBDocument use the page index as a parameter when accessing a specific page of the document. These functions use the TOC to translate the page index to a page ID before accessing the UBPersistenceManager.

All functions of the UBPersistenceManager that refer to a specific page of a document now use the page ID to identify the page. The page ID is directly related to the file name of the page files (SVG and thumbnail). These functions are now solely called by the UBDocument. Any other code that accesses a specific page only uses the functions of UBDocument that take a page index as a parameter.

When writing or modifying code, keep the following in mind:

  • When accessing a page, use the functions of UBDocument.
  • Only use the UBPersistenceManager directly for functions that do not refer to a specific page, but rather to the entire document.
  • Always keep in mind the difference between a page index and a page ID: a page index describes the position of the page within the document, while a page ID refers to the file name. UBDocument uses the TOC to translate between the two.

No renaming of page files

Page files are never renamed when pages are inserted, moved, or deleted. Therefore, all related functions can be removed from the UBPersistenceManager. The UBSceneCache was also simplified, because these functions could be removed.

However, this created one issue for the UBSceneCache: The removeAllScenes() function, which was used to delete all scenes of a document, could no longer work as it did before. Previously, it simply iterated over the page count and deleted all possible entries. Now, page file names are not directly related to the page count. We solved this issue by using a QMap for the cache instead of a QHash. A QMap keeps its entries sorted by key, ensuring that all entries for a specific document are adjacent. We can simply use the lowerBound and upperBound functions of QMap to identify the first and last entries belonging to a specific document.
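The range lookup in an ordered map can be illustrated with the standard library. The sketch below uses std::map as a stand-in for QMap, with lower_bound and upper_bound in place of lowerBound and upperBound. The key layout (document ID, page ID) and the string value are assumptions for this example and do not reflect the actual UBSceneCache types.

```cpp
#include <climits>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <utility>

// Hypothetical cache key: (document ID, page ID). Pairs are ordered first by
// document ID, then by page ID, so entries of one document are adjacent.
using SceneKey = std::pair<std::string, int>;
using SceneCache = std::map<SceneKey, std::string>;  // value stands in for a scene

// Removes all cached scenes of one document without knowing its page IDs.
// Returns the number of removed entries.
std::size_t removeAllScenes(SceneCache& cache, const std::string& documentId) {
    const auto first = cache.lower_bound(SceneKey{documentId, INT_MIN});
    const auto last = cache.upper_bound(SceneKey{documentId, INT_MAX});
    const std::size_t removed = static_cast<std::size_t>(std::distance(first, last));
    cache.erase(first, last);
    return removed;
}
```

With a hash-based container, the same operation would require visiting every entry or knowing all page IDs of the document in advance.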

Page count

Previously, the page count of a document was determined by scanning the disk for existing page files and counting up from the first file, page000.svg. During a document's lifetime, the count was stored in the UBDocumentProxy and was incremented or decremented as necessary when pages were inserted or deleted.

Now the TOC is the authoritative source of the page count. We can simply ask the TOC for the number of entries. Therefore, determining the page count when opening a document only involves reading the TOC, instead of scanning the document directory.

Consequently, the function pageCount() is no longer available at the UBDocumentProxy, but has been moved to UBDocument.

The scanner plays an important role here. If we doubt the accuracy of the TOC, we scan the document directory to update it. As mentioned earlier, this occurs when accessing a document without a TOC or when the document version indicates that it was last modified by an earlier version of OpenBoard that was unaware of the TOC.

Scene copy

As discussed in https://github.com/letsfindaway/OpenBoard/issues/198, OpenBoard employed various implementations for copying scenes in different scenarios. With the introduction of the TOC, it became essential that this only happen in one function located in the UBDocument, ensuring that the TOC is in sync with the document.

In the above-linked issue, we also began discussing whether to keep or change UUIDs when copying a scene. The most important outcome of this discussion was the introduction of the UBMediaAsset and the differentiation between item UUID and asset UUID as explained above.

We now have a single function, UBDocument::copyPage(), which can copy a page within a document or between documents. This function creates a new UUID for the copied scene, but leaves all other UUIDs intact:

  • UUIDs for media assets remain unchanged so that media items on the copied page can use the same media asset files without duplicating them.
  • UUIDs for items are also not modified. They only need to be unique on a single page; therefore, reusing the same UUIDs on a copied page is not a problem.
  • However, the UUID of the scene is changed because it identifies the scene and is also used for document scanning.

The call sequence and the responsibilities of the functions are as follows:

  • Both the UBDocument::copyPage() and the UBDocument::duplicatePage() functions call the private function UBDocument::copyPage(), which can handle both.
  • This private function determines the page's relative dependencies, i.e., the media asset files referenced by any of the page's items. Then, it calls the persistence manager to copy the page. After that, it updates the thumbnail and adjusts the table of contents (TOC).
  • UBPersistenceManager::copyDocumentScene() is responsible for the disk operations. It ensures that a modified scene is saved to disk before copying. Then, it calls UBPersistenceManager::copyPage() to perform the actual copy. Next, it copies all the relative dependencies and finally, it updates the document metadata.
  • UBPersistenceManager::copyPage() copies the page files. First, it copies the thumbnail. Next, it creates a new UUID for the copied scene and instructs the UBSvgSubsetAdaptor to perform the copy.
  • The UBSvgSubsetAdaptor is the place where knowledge about the representation of a scene on disk is concentrated. UBSvgSubsetAdaptor::replicateScene() reads the page SVG file, replaces the UUID by searching for the appropriate XML tag, and saves the result to the disk.

This creates clearly defined responsibilities for each class:

  • UBDocument knows the source and target documents and uses the TOC to determine the page IDs and the media assets. It also manages the thumbnail scene.
  • The UBPersistenceManager class performs file operations.
  • UBSvgSubsetAdaptor handles operations related to the internal structure and content of the page SVG file.

Consequently, several code parts that implemented their own copying could be removed. The UBForeignObjectsHandler, for example, was only referenced in dead code and could be completely removed.
