
Reduce the memory usage of the write-ahead log #354

@craigfe


Context

The total IO done by an index writer is roughly proportional to the maximum size of the log file, since the store must be entirely rewritten once per log_size replace operations. As a result, it's worth making log_size as large as possible, and the in-memory mirror of the log file tends to dominate the overall memory usage. When used in irmin-pack, this data structure expands out as follows:

  (key, value) Stdlib.Hashtbl.t
= (string<32>, (int63 * int * char)) Stdlib.Hashtbl.t
= (string<32>, (int63 * int * char)) list2 array

When Sys.word_size = 64, this makes for approximately 14.5 words per binding, broken down into:

  • 6 word string (header + 4 words data + null padding)
  • 4 word triple (header + 3 fields)
  • 4 word list node (header + key pointer + value pointer + tail pointer)
  • ~0.5 word bucket slot (in the best case, there are ~2 bindings per bucket)

In the worst case, there are both reader and writer instances and merges are stacking, meaning there are 4 such hashtables (log and log_async for each of the reader and writer) and the writer also has an array of sorted entries (5 words per binding). When log_size = 500_000, this sums to ~250MB. Here are two (conflicting) optimisations we could make:
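The arithmetic behind that worst-case figure can be checked directly. All constants below come from the per-binding breakdown above; they are estimates of the heap layout, not measurements:

```ocaml
(* Back-of-the-envelope check of the ~250MB worst case, assuming
   Sys.word_size = 64, i.e. 8-byte words. Constants are taken from the
   per-binding breakdown in this issue. *)
let bytes_per_word = 8
let words_per_binding = 14.5  (* string + triple + list node + bucket slot *)
let log_size = 500_000
let tables = 4                (* log and log_async, for reader and writer *)
let sorted_array_words = 5    (* writer's array of sorted entries *)

let total_bytes =
  (words_per_binding *. float_of_int (bytes_per_word * log_size * tables))
  +. float_of_int (sorted_array_words * bytes_per_word * log_size)

let () = Printf.printf "~%.0f MB\n" (total_bytes /. 1e6)
(* prints: ~252 MB *)
```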

Suggestion 1: functorise over a user-defined hashtable implementation

We could provide a functor that allows the user to provide their own Hashtbl implementation specialised to their particular key and value types. In the specific case of irmin-pack, we could do the following:

  • keep string keys in an arena (-2 words)
  • use a small-size optimised list for hashtable buckets (-1.5 words)
  • pack the value triple into a pair of int63s (offset and kind_then_length) and keep them as immediates in the hashtable (-3 words)

This makes for 8 words per binding. By itself, this isn't much of a change since the entries must be unpacked from their compact representation to sort them in an array before the merge. The simple solution here is just to pick the hashtable bucket using the upper bits of key_hash, so that the hashtable is already almost sorted in the correct order (only the bindings within a particular bucket are relatively out-of-order). This means we can expose a to_sorted_seq that takes O(1) additional memory.
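As a sketch of the bucket-selection trick (the names hash_bits, log2_buckets and the bucket representation are illustrative assumptions, not Index's actual code): choosing the bucket from the upper bits of key_hash means that walking the bucket array in index order visits bindings almost sorted by hash, so only each ~2-binding bucket needs sorting on the fly:

```ocaml
(* Bucket by the upper bits of key_hash, so that iterating buckets in
   index order yields bindings almost sorted by hash. *)
let hash_bits = 30  (* assumed width of key_hash *)

let bucket_of_hash ~log2_buckets key_hash =
  key_hash lsr (hash_bits - log2_buckets)

(* Sorting happens one (tiny) bucket at a time, so producing the sorted
   sequence needs only O(1) additional memory. *)
let to_sorted_seq (buckets : (int * 'v) list array) : (int * 'v) Seq.t =
  Array.to_seq buckets
  |> Seq.flat_map (fun bucket ->
         List.to_seq (List.sort (fun (h, _) (h', _) -> compare h h') bucket))
```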

When log_size = 500_000, this works out to ~128MB in the worst case. (Half what we had before.)

Suggestion 2: keep only log offsets in memory

The previous approach effectively lets the user pick an efficient packed representation for entries in the write-ahead log. However, Index can already uniquely identify entries by their offset on disk (in 1 word). If we commit to reading from disk when finding in the log file (and when resizing the hashtable), we can keep only the offsets in memory (in hashset buckets, again keyed by the upper bits of key_hash). This achieves 2 words per binding.

When log_size = 500_000, this is ~32MB in the worst case. The disadvantage is that values in the log can no longer be found without reading from disk, so best-case read performance is worse. If there are many key_hash prefix collisions in a given bucket, we have to read and decode each entry on find.
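A minimal sketch of the offsets-only lookup (all names here — read_entry, bucket_of_hash, log_index — are hypothetical, and plain int stands in for int63): find scans the offsets in the matching bucket and must read and decode each candidate entry from disk:

```ocaml
(* Sketch of Suggestion 2: only 1-word offsets live in memory, bucketed
   by the upper bits of key_hash. [read_entry] stands in for decoding
   the log entry stored at a given offset on disk. *)
type 'v log_index = {
  buckets : int list array;         (* offsets, one list per hash-prefix bucket *)
  bucket_of_hash : int -> int;      (* upper bits of key_hash *)
  read_entry : int -> string * 'v;  (* disk read: offset -> (key, value) *)
}

(* Every find reads from disk; each key_hash prefix collision within the
   bucket costs an extra read-and-decode. *)
let find t ~key_hash key =
  t.buckets.(t.bucket_of_hash key_hash)
  |> List.find_map (fun off ->
         let k, v = t.read_entry off in
         if String.equal k key then Some v else None)
```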

Analysis

|             | Suggestion 1 | Suggestion 2 |
| ----------- | ------------ | ------------ |
| Memory      | ~2x improvement | ~7x improvement |
| Performance | Same as before – fast. | Worse than before: successful log finds must always read from disk. However, the improved memory footprint could allow an LRU to be added, which may be more effective anyway. |
| Interface   | Optimised Index is more complicated and requires that complex code (i.e. an extended hashtable implementation) lives outside Index in the user's code. We can provide helper functors to build compact hashtable implementations, but users won't benefit from them for free. | Same as before – simple. |

Here are some preliminary measurements using the Index replay trace with ~100 million operations. Firstly, internally-measured heap size as a function of operations performed:

[image: internally-measured heap size vs. operations performed]

Secondly, externally-measured maxrss as a function of time:

[image: externally-measured maxrss vs. time]

This shows that the performance hit of a naive implementation of Suggestion 2 is roughly 10% (but my laptop isn't a very precise environment). I suspect it's possible to recover the performance with a relatively small LRU, but haven't done this yet.

Overall, I'm leaning towards #2, depending on the measured benefits of adding an LRU. This should make it possible to substantially increase the default log_size in Tezos (and so avoid stacking merges / improve IO performance). In the medium term, we'll have complementary solutions to the log size issue in the form of reduced index load and/or an alternative index implementation.

CC @pascutto, who kindly volunteered to review the suggestions.
