
Modernizing Valkey AOF #3515

@sumitk163

Description


Summary

This issue outlines enhancements to Valkey's Append-Only File (AOF) persistence mechanism. The current AOF implementation suffers from performance bottlenecks related to synchronous I/O and kernel overhead, as well as resilience gaps regarding data integrity and disk capacity management. By introducing backwards-compatible log headers, asynchronous I/O via io_uring, Direct I/O, and proactive disk management, we aim to significantly improve Valkey's performance and reliability.


1. Challenges with Current AOF Implementation

  • Lack of resilience against torn writes and data corruption: During a system crash or power failure, an AOF write might only partially complete, resulting in a "torn write." Currently, the AOF lacks robust built-in data integrity checks (like checksumming per entry). This makes it difficult to reliably detect and safely recover from corrupted AOF entries.
  • Lack of appropriate metadata to support psync: The current AOF format does not contain the necessary sequence metadata to effectively support partial resynchronization (psync) upon server restart. This forces the system into expensive full synchronizations. (Reference: Valkey Issue #2904 Comment)
  • AOF flush blocks the main thread in AOF_FSYNC_ALWAYS mode: When configured for maximum durability, the fdatasync() system call blocks Valkey's main thread. This completely halts command processing until the disk operation completes, severely degrading throughput and increasing latency. This is addressed by "Write-behind log for async AOF-based durability" (#3381).
  • Slow writes due to filesystem journaling and kernel page cache overhead: Standard buffered I/O routes data through the kernel page cache, incurring CPU overhead from data copying and creating memory pressure (double buffering). Furthermore, when data is written to new blocks, filesystems like ext4 must perform heavy metadata updates to initialize the extent tree. These metadata updates trigger filesystem journaling operations, which introduce significant overhead and unpredictable latency spikes during the critical write path.
  • Lack of backpressure when the disk is full: Valkey currently lacks a graceful degradation mechanism for disk capacity. If the disk fills up, the server will encounter hard write failures, which can lead to abrupt crashes, rather than proactively throttling or rejecting writes.

2. Proposed Improvements

To address the challenges above, we propose enhancements that prioritize non-blocking operations, kernel page cache bypass, and robust entry framing.

2.1 Backwards-Compatible AOF Log Headers

To resolve data integrity and psync limitations, we will annotate every AOF log entry with a metadata header.

Header Fields:

  • len: The length of the log entry.
  • lsn: A monotonically increasing Log Sequence Number, incrementing by exactly 1 for every log entry.
  • replid: Replication id to allow psync on server restart.
  • reploff: Replication offset to allow psync on server restart.
  • cksum: A checksum for the log entry to detect torn writes and bit rot.

Format and Compatibility:

The log headers will be added as Valkey protocol annotations to ensure backwards compatibility. Older AOF parsers will simply ignore them. The format will be:
#HDR:len:{$len};lsn:{$lsn};replid:{$replid};reploff:{$reploff};cksum:{$cksum}\r\n

2.2 Asynchronous Writes via io_uring and Dirty Key Tracker

To prevent AOF operations from blocking the main thread when using AOF-based durability mode (appendfsync always), we will transition to asynchronous I/O using Linux's io_uring, combined with a dirty key tracker. The main thread will mark keys being updated as dirty, hand disk writing off to io_uring, and block the acknowledgement of writes (and any read referencing the dirty keys) until the disk write completes. This extends the proposal in #3381 as follows:

  • Instead of performing sequential disk writes by blocking new disk writes until the previous write finishes, perform concurrent, offset-based writes to improve write latency and throughput.
  • Support performing writes using io_uring instead of io-threads to support efficient asynchronous writes on small machines where provisioning io-threads may not be feasible.

2.3 Pre-allocate and Pre-initialize File Blocks

We will pre-allocate space and explicitly pre-initialize the incremental AOF file blocks by writing zeros to them when using AOF-based durability mode (appendfsync always). This ensures that subsequent log writes to these blocks incur zero filesystem journaling overhead on the critical write path, thereby improving write latency. This works by eliminating metadata updates: since the file size and blocks are already defined, the OS does not have to update the inode size or allocate new blocks during the transaction. The write becomes a pure "overwrite" operation.

This also significantly reduces the disk IOPS by eliminating the metadata operations sent to the disk.

Backwards-Compatible Zeroes:
We will write the zeroes using protocol-compliant annotations:
#PAD:{$zeroes}\r\n

2.4 Recycle Log Files [Backwards Incompatible]

Creating new files and zeroing them out consumes valuable disk bandwidth and compute resources. Instead of deleting old incremental AOF files that are no longer needed, we will recycle and reuse them for writing future logs. By reusing already allocated and initialized files, we completely bypass the cost of zeroing out new files, optimizing disk bandwidth and reducing overall system overhead.

Handling Recycled Data During Recovery: Reusing files introduces a complication: a recycled AOF file will contain arbitrary junk data or old log entries from its previous lifecycle. This makes it difficult to differentiate between newly written log entries and leftover data. We resolve this using the log headers proposed in section 2.1.

When reading the log during replay or recovery, the process must be sequential, starting from a known-good checkpoint (or the beginning of the log). The system will read each log header to determine the entry's length, read the corresponding bytes, and perform two critical verifications:

  • Valid CRC: Ensures the data is a legitimate, uncorrupted log entry rather than random garbage bytes.
  • Monotonically Increasing LSN: Validates the sequence. If a consistent log entry is read but its LSN is lower than expected, the system assumes it belongs to the previous lifecycle of the recycled file and treats it as uninitialized data.

To safely recover, the system reads sequentially from a checkpoint until it encounters either a CRC error or an LSN mismatch. At that point, it assumes it has reached the true end of the current AOF.

The incremental AOF file size will be fixed, and logs will continue in a new file once the current one reaches that limit.

  • Note: This specific change breaks backwards compatibility and will require an explicit opt-in.

2.5 Direct I/O (O_DIRECT & O_DSYNC) with Alignment Padding

To eliminate kernel page cache overhead, we will support a configuration that allows writes to be performed using Direct I/O when using AOF-based durability mode (appendfsync always).

Implementation Details:

  • Files will be opened with the O_DIRECT and O_DSYNC flags.
  • This consolidates the write and fdatasync operations into a single system call and ensures data is written directly to the storage device.
  • Alignment Requirement: Direct I/O requires that write start offsets align strictly with the underlying block device's physical block size (e.g., 4KB).

Backwards-Compatible Zero Padding:
To achieve block alignment without breaking older AOF parsers, we will pad writes using protocol-compliant annotations:
#PAD:{$zeroes}\r\n

Since Direct I/O can result in high write amplification for workloads with low write throughput, and given that not all filesystems support O_DIRECT, enabling Direct I/O will require an explicit opt-in.

2.6 Disk Capacity Backpressure

Currently, when using AOF-based durability mode (appendfsync always), if the disk fills up, new writes may fail, resulting in a server crash.
To improve system resilience, we will introduce a configurable disk usage threshold. When disk capacity approaches this watermark (e.g., 95% full), Valkey will proactively trigger backpressure: it will safely reject new write commands with the existing "-MISCONF Errors writing to the AOF file" error, keeping the server alive. This allows AOF rewrite to run and free up disk space, or gives administrators a window to intervene before a hard disk-full crash occurs.


3. Performance Benchmarks

3.1 Benchmark Setup


Valkey 8.0 is run on a 2-vCPU GCP VM with io-threads=1, appendonly=yes, appendfsync=always. The AOF logs are written to a Hyperdisk Balanced volume attached to the VM. AOF rewrite is disabled.

Memtier client is run on another GCP VM in the same availability zone as the Valkey server with the following parameters:

  • Set:Get ratio : 1:3
  • Payload size: 1000 bytes
  • Total keys: 5 million

Data is pre-loaded into the cache so as to achieve a 100% hit rate during the benchmarks. The memtier clients and threads parameters are gradually increased until further raising the connection count no longer improves throughput.

3.2 Summary

| Setup | Op | QPS | P50 Latency (ms) | P99 Latency (ms) |
|---|---|---|---|---|
| Baseline AOF | Get | 99,669 | 5.34 | 11.07 |
| Baseline AOF | Set | 33,226 | 5.34 | 11.13 |
| AOF with Async Writes | Get | 141,244 (+41.7%) | 0.50 (-90.6%) | 1.04 (-90.6%) |
| AOF with Async Writes | Set | 47,082 (+41.7%) | 2.89 (-45.9%) | 4.31 (-61.3%) |
| AOF with Pre-initialized Files | Get | 154,671 (+55.2%) | 0.32 (-94.0%) | 0.47 (-95.8%) |
| AOF with Pre-initialized Files | Set | 51,557 (+55.2%) | 0.79 (-85.2%) | 1.21 (-89.1%) |

We also compare the performance of the optimized AOF against performance with AOF disabled (appendonly=no).

| Setup | Op | QPS | P50 Latency (ms) | P99 Latency (ms) |
|---|---|---|---|---|
| AOF Disabled | Get | 185,057 | 0.37 | 0.49 |
| AOF Disabled | Set | 61,686 | 0.35 | 0.49 |
| AOF with Pre-initialized Files | Get | 154,671 (-16.4%) | 0.32 (-13.5%) | 0.47 (-4.1%) |
| AOF with Pre-initialized Files | Set | 51,557 (-16.4%) | 0.79 (+125.7%) | 1.21 (+146.9%) |

3.3 Baseline AOF Performance


3.4 Performance with async writes

Log header annotations are added to the log entries, using crc64 to compute the checksum. AOF write and fdatasync are performed on bio threads instead of the main thread. A pool of 4 bio threads is used to allow up to 4 concurrent disk writes. A dirty key tracker is used to block sending responses to clients referencing the dirty keys.

Note that the implementation will be revised based on #3381 and we will explore using io_uring to minimize context switching overhead and to support async writes on smaller machine types where provisioning io-threads may not be feasible.


3.5 Performance with pre-initialized file blocks and Direct I/O

Incremental AOF files are pre-initialized and recycled on AOF rewrite. We use an array of 1GB files that essentially acts as a circular buffer. Files are opened with O_DSYNC and O_DIRECT. The AOF writes are 4k-aligned.


3.6 Performance with AOF disabled


4. Technical References

Postgres WAL

Postgres WAL (write-ahead log) data physically manifests as a series of 16MB files (i.e. each physical file covers a ~16MB section of the logical WAL space).

When the files are first allocated, they are zero-filled (https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L3305 ) which generates 2x the bandwidth to disk (once for the zeroes, once for the actual data).

However, Postgres does not keep the physical files for eternity; there is a point past which the old WAL files are no longer needed. Rather than deleting those files, Postgres recycles them (https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L4081), so that they can be re-used without needing to be zero-initialized. So the zeroing penalty is only paid by a new database, until it starts recycling WAL files. Once the DB has initialized N WAL files (where N depends on some configuration parameters of the database), it stops paying the penalty.


5. Open Questions

  • What new configuration flags need to be added? Adding log headers would increase the AOF log size. So, we may want to guard such changes behind configuration flags and disable the functionality by default.
  • Should AOF log headers use text format or a Base64 format? How is the log header format versioned?
