Reduce the memory usage of the write-ahead log by craigfe · Pull Request #355 · mirage/index

craigfe · 2021-10-04T10:13:07Z

This implements "Suggestion 2" as described in #354, without adding an LRU.

icristescu · 2021-10-05T14:00:12Z

src/small_list.ml

+  | Tuple3 of 'a * 'a * 'a
+  | Tuple4 of 'a * 'a * 'a * 'a
+  | Tuple5 of 'a * 'a * 'a * 'a * 'a
+  | Tuple6 of 'a * 'a * 'a * 'a * 'a * 'a


maybe this is obvious, but why up to 6?

craigfe · 2021-10-05

Not obvious at all; I just got bored after 6 😉 In practice, sizes above about 7 don't really happen with the hash functions we're using, so it's not worth going any further. If we used `Obj` rather than a type-safe implementation, we could push the threshold much higher and also improve the speed.

As it happens, @pascutto has a WIP patch to use open-addressing in the hashtable, which would avoid the need for small_list.ml entirely. For now, this is probably good enough.

pascutto · 2021-10-06

The first experiments showed that open addressing using linear probing is better for allocations, but worse for both memory and performance, so we should probably go with the short list here :)
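
For readers unfamiliar with the idea discussed in this thread, here is a minimal sketch of a type-safe "small list". It is not the code from `src/small_list.ml`: only the `TupleN` constructor naming comes from the diff above, while `Empty`, `Tuple1`, `Tuple2`, `Many`, `cons`, `to_list` and `exists` are assumptions made for illustration, and the sketch stops at 3 elements rather than 6 for brevity. The point is that a short bucket is stored as a single heap block (a `Tuple3` takes a header plus 3 fields) instead of a chain of cons cells (3 cells of a header plus 2 fields each).

```ocaml
(* Hypothetical sketch of a small list; not the implementation from the PR. *)
type 'a t =
  | Empty
  | Tuple1 of 'a
  | Tuple2 of 'a * 'a
  | Tuple3 of 'a * 'a * 'a
  | Many of 'a list (* assumed fallback for buckets longer than 3 *)

let empty = Empty

(* Prepend an element, moving to the next-larger constructor. *)
let cons x = function
  | Empty -> Tuple1 x
  | Tuple1 a -> Tuple2 (x, a)
  | Tuple2 (a, b) -> Tuple3 (x, a, b)
  | Tuple3 (a, b, c) -> Many [ x; a; b; c ]
  | Many l -> Many (x :: l)

(* Convert to an ordinary list, preserving order. *)
let to_list = function
  | Empty -> []
  | Tuple1 a -> [ a ]
  | Tuple2 (a, b) -> [ a; b ]
  | Tuple3 (a, b, c) -> [ a; b; c ]
  | Many l -> l

(* Test a predicate without allocating an intermediate list. *)
let exists f = function
  | Empty -> false
  | Tuple1 a -> f a
  | Tuple2 (a, b) -> f a || f b
  | Tuple3 (a, b, c) -> f a || f b || f c
  | Many l -> List.exists f l
```

Every extra constructor adds one more case to each function, which is the trade-off behind stopping at a fixed size.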

icristescu

A lot of nice optimizations, thanks!

Ngoguey42

LGTM

If we commit to needing to `read` from disk when finding in the log file (and when resizing the hashtable), we can keep only the offsets of each entry in memory, reducing the memory usage of Index significantly.
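
A rough sketch of the idea, under stated assumptions: the "disk" is simulated with a `Buffer.t`, entries are fixed-width (a 4-byte key followed by a 4-byte value), and every name here (`log`, `read`, `replace`, `find`) is hypothetical rather than taken from `Log_file`. The in-memory table holds only integer offsets, so each `find` pays for a read of the candidate entries.

```ocaml
(* Hypothetical sketch: keep only entry offsets in memory. *)
let key_size = 4
let value_size = 4
let entry_size = key_size + value_size

type log = {
  disk : Buffer.t; (* stand-in for the on-disk log file *)
  offsets : (int, int list) Hashtbl.t; (* key hash -> offsets, newest first *)
}

let create () = { disk = Buffer.create 64; offsets = Hashtbl.create 16 }

(* Stand-in for a positioned read from disk. *)
let read log ~off ~len = Buffer.sub log.disk off len

let bucket log h = Option.value (Hashtbl.find_opt log.offsets h) ~default:[]

(* Append the binding to the log and remember only where it was written. *)
let replace log key value =
  assert (String.length key = key_size);
  assert (String.length value = value_size);
  let off = Buffer.length log.disk in
  Buffer.add_string log.disk key;
  Buffer.add_string log.disk value;
  let h = Hashtbl.hash key in
  Hashtbl.replace log.offsets h (off :: bucket log h)

(* Look up a key by reading candidate entries back from "disk". *)
let find log key =
  bucket log (Hashtbl.hash key)
  |> List.find_map (fun off ->
         let entry = read log ~off ~len:entry_size in
         if String.equal (String.sub entry 0 key_size) key then
           Some (String.sub entry key_size value_size)
         else None)
```

The memory cost per binding drops to roughly one integer, at the price of extra IO on lookups, which matches the trade-off recorded in the CHANGES entry.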

Previously, the hash-set bucket used to store binding offsets was determined by the bottom bits of the key's hash. By switching to use the _top_ bits instead, the hashset keeps entries roughly in order (with only the entries within each bucket being relatively out-of-order). This means that we don't need to load all the bindings into an array in order to sort them for merging, reducing the peak memory usage of Index.
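
The effect of bucketing by top bits can be sketched as follows. This is an illustration only: the 30-bit hash width, the 16-bucket table, and the function names are all assumptions, and the sketch stores bare hashes rather than binding offsets. Because `h1 < h2` implies that the top bits of `h1` are no larger than those of `h2`, walking the buckets in index order and sorting each one locally yields a globally sorted stream.

```ocaml
(* Hypothetical sketch: bucket selection by top bits keeps buckets ordered. *)
let hash_bits = 30 (* assumed width of the key hash *)
let bucket_bits = 4 (* 2^4 = 16 buckets *)

(* Old scheme: neighbouring hashes scatter across all buckets. *)
let bucket_of_bottom_bits h = h land ((1 lsl bucket_bits) - 1)

(* New scheme: bucket index is monotonic in the hash. *)
let bucket_of_top_bits h = h lsr (hash_bits - bucket_bits)

(* Produce the hashes in sorted order, sorting one bucket at a time. *)
let sorted_seq_of_hashes hashes =
  let buckets = Array.make (1 lsl bucket_bits) [] in
  List.iter
    (fun h ->
      let i = bucket_of_top_bits h in
      buckets.(i) <- h :: buckets.(i))
    hashes;
  Array.to_seq buckets
  |> Seq.flat_map (fun bucket -> List.to_seq (List.sort compare bucket))
```

Since the sequence is lazy, only one bucket needs to be sorted at a time, rather than materialising every binding in a single array first.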

Now that the merge function re-uses the in-memory log for sorting values, we only need to know if the index is empty before we start.

Co-authored-by: Clément Pascutto <clement@pascutto.fr>

craigfe · 2021-10-07T11:22:13Z

Thanks all for the reviews. Merging now.

mirage#355 introduced a small optimisation to re-use a "local" scratch buffer when decoding values from the log file. This is actually unsafe: during the merge, the asynchronous merge thread and the main writer thread can attempt concurrent reads from the log file, causing contention over the scratch buffer. This can be observed by inserting `Thread.yield` calls inside the `Value.decode` implementation and then stress-testing the interface (e.g. by running the replay benchmarks). This commit adds a lock over the scratch buffer, preventing unsafe concurrent access.

mirage#355 introduced a small optimisation to re-use a "local" scratch buffer when decoding values from the log file. This is actually unsafe: during the merge, the asynchronous merge thread and the main writer thread can attempt concurrent reads from the log file, causing contention over the scratch buffer. This can be observed by inserting `Thread.yield` calls inside the `Value.decode` implementation and then stress-testing the interface (e.g. by running the replay benchmarks). This commit ensures that each call to a `Log_file` function gets its own scratch buffer, ensuring safe concurrent access without introducing potential contention issues.
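
The hazard described in these two commit messages can be reproduced deterministically without threads. The sketch below is an assumption-laden illustration: the log is a constant string of fixed-width entries, and a `yield` callback stands in for the point at which another thread might run (the role played by `Thread.yield` in the description above). None of these names come from `Log_file`.

```ocaml
(* Hypothetical sketch: shared versus per-call scratch buffers. *)
let disk = "aaaabbbbcccc" (* stand-in for the log file *)
let entry_size = 4

(* Unsafe variant: one scratch buffer shared by every call. *)
let shared_scratch = Bytes.create entry_size

let read_shared ~(yield : unit -> unit) off =
  Bytes.blit_string disk off shared_scratch 0 entry_size;
  yield (); (* another reader may overwrite the buffer here *)
  Bytes.to_string shared_scratch

(* Safe variant: each call owns its scratch buffer. *)
let read_private ~(yield : unit -> unit) off =
  let scratch = Bytes.create entry_size in
  Bytes.blit_string disk off scratch 0 entry_size;
  yield ();
  Bytes.to_string scratch

let noop () = ()

(* Interleave a second read of offset 4 inside a read of offset 0. *)
let clobbered =
  read_shared ~yield:(fun () -> ignore (read_shared ~yield:noop 4)) 0

let intact =
  read_private ~yield:(fun () -> ignore (read_private ~yield:noop 4)) 0
```

Here `clobbered` ends up holding the entry at offset 4 rather than offset 0, while `intact` holds the expected entry. A lock around the shared buffer would also fix this, but per-call buffers avoid making readers wait on each other.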

CHANGES:

## Fixed

- Fix stats recording in `Raw.unsafe_write` (mirage/index#351)

## Changed

- Changed the implementation of the write-ahead log to significantly reduce its memory usage (at the cost of some additional disk IO). (mirage/index#355)

craigfe requested a review from pascutto October 4, 2021 10:13

craigfe force-pushed the log-only-offsets-in-memory branch from f2ffb84 to 64b3682 Compare October 4, 2021 12:23

icristescu reviewed Oct 5, 2021

icristescu reviewed Oct 5, 2021

icristescu reviewed Oct 5, 2021

icristescu reviewed Oct 5, 2021

icristescu approved these changes Oct 5, 2021

pascutto reviewed Oct 5, 2021

Ngoguey42 approved these changes Oct 6, 2021

pascutto approved these changes Oct 6, 2021

craigfe and others added 14 commits October 6, 2021 21:47

index: keep only log offsets in memory

1372f53

If we commit to needing to `read` from disk when finding in the log file (and when resizing the hashtable), we can keep only the offsets of each entry in memory, reducing the memory usage of Index significantly.

index: extract a Log_file abstraction

b4ba837

index: optimise the Log hashset resize operation

5e6384f

index: remove unnecessary get_witness logic

c4a0e56

Now that the merge function re-uses the in-memory log for sorting values, we only need to know if the index is empty before we start.

Add a CHANGES entry

f9e644c

unix: improve implementation of buffered IO.read

01b5692

Respond to @icristescu's code review

5aaa483

Update src/log_file.ml

c4da6be

Co-authored-by: Clément Pascutto <clement@pascutto.fr>

index: simplify implementation of Log_file.to_sorted_seq

d83fd44

index: rename Log_file.hashset to Log_file.hashtbl

4563bd2

index: assert that is_empty is called only by RW instances

b5f4865

index: inline internal log_to_list helper function

6b9c7a9

unix: add a comment clarifying short-read RW semantics

3a3bfd8

craigfe force-pushed the log-only-offsets-in-memory branch from 1b60936 to 3a3bfd8 Compare October 6, 2021 20:47

craigfe merged commit fe5e962 into mirage:master Oct 7, 2021

craigfe mentioned this pull request Oct 10, 2021

index: avoid unsafe buffer sharing in Log_file #358

Merged

craigfe mentioned this pull request Oct 10, 2021

Add an LRU to cache finds #359

Closed

icristescu mentioned this pull request Oct 15, 2021

[new release] index and index-bench (1.4.2) ocaml/opam-repository#19785

Merged

craigfe merged 14 commits into mirage:master from craigfe:log-only-offsets-in-memory
