
Modernizing Valkey AOF #3515

@sumitk163

Description


Summary

This issue outlines enhancements to Valkey's Append-Only File (AOF) persistence mechanism. The current AOF implementation suffers from performance bottlenecks related to synchronous I/O and kernel overhead, as well as resilience gaps regarding data integrity and disk capacity management. By introducing backwards-compatible log headers, asynchronous I/O via io_uring, Direct I/O, and proactive disk management, we aim to significantly improve Valkey's performance and reliability.


1. Challenges with Current AOF Implementation

  • Lack of resilience against torn writes and data corruption: During a system crash or power failure, an AOF write might only partially complete, resulting in a "torn write." Currently, the AOF lacks robust built-in data integrity checks (like checksumming per entry). This makes it difficult to reliably detect and safely recover from corrupted AOF entries.
  • Lack of appropriate metadata to support psync: The current AOF format does not contain the necessary sequence metadata to effectively support partial resynchronization (psync) upon server restart. This forces the system into expensive full synchronizations. (Reference: Valkey Issue #2904 Comment)
  • AOF flush blocks the main thread in AOF_FSYNC_ALWAYS mode: When configured for maximum durability, the fdatasync() system call blocks Valkey's main thread. This completely halts command processing until the disk operation completes, severely degrading throughput and increasing latency. This is addressed by "Write-behind log for async AOF-based durability" (#3381).
  • Slow writes due to filesystem journaling and kernel page cache overhead: Standard buffered I/O routes data through the kernel page cache, incurring CPU overhead from data copying and creating memory pressure (double buffering). Furthermore, when data is written to new blocks, filesystems like ext4 must perform heavy metadata updates to initialize the extent tree. These metadata updates trigger filesystem journaling operations, which introduce significant overhead and unpredictable latency spikes during the critical write path.
  • Lack of backpressure when the disk is full: Valkey currently lacks a graceful degradation mechanism for disk capacity. If the disk fills up, the server will encounter hard write failures, which can lead to abrupt crashes, rather than proactively throttling or rejecting writes.

2. Proposed Improvements

To address the challenges above, we propose enhancements that prioritize non-blocking operations, kernel page cache bypass, and robust entry framing.

2.1 Backwards-Compatible AOF Log Headers

To resolve data integrity and psync limitations, we will annotate every AOF log entry with a metadata header.

Header Fields:

  • len: The length of the log entry.
  • lsn: A monotonically increasing Log Sequence Number, incrementing by exactly 1 for every log entry.
  • replid: Replication id to allow psync on server restart.
  • reploff: Replication offset to allow psync on server restart.
  • cksum: A checksum for the log entry to detect torn writes and bit rot.

Format and Compatibility:

The log headers will be added as Valkey protocol annotations to ensure backwards compatibility. Older AOF parsers will simply ignore them. The format will be:
#HDR:len:{$len};lsn:{$lsn};replid:{$replid};reploff:{$reploff};cksum:{$cksum}\r\n

2.2 Asynchronous Writes via io_uring and Dirty Key Tracker

To prevent AOF operations from blocking the main thread when using AOF-based durability mode (appendfsync always), we will transition to asynchronous I/O using Linux's io_uring, combined with a dirty key tracker. The main thread will mark keys being updated as dirty, hand disk writing off to io_uring, and block the acknowledgement of writes (and any read referencing the dirty keys) until the disk write completes. This extends the proposal in #3381 as follows:

  • Instead of performing sequential disk writes by blocking new disk writes until the previous write finishes, perform concurrent, offset-based writes to improve write latency and throughput.
  • Support performing writes using io_uring instead of io-threads to support efficient asynchronous writes on small machines where provisioning io-threads may not be feasible.

2.3 Pre-allocate and Pre-initialize File Blocks

We will pre-allocate space and explicitly pre-initialize the incremental AOF file blocks by writing zeros to them when using AOF-based durability mode (appendfsync always). This ensures that subsequent log writes to these blocks incur zero filesystem journaling overhead on the critical write path, thereby improving write latency. This works by eliminating metadata updates: since the file size and blocks are already defined, the OS does not have to update the inode size or allocate new blocks during the transaction. The write becomes a pure "overwrite" operation.

This also significantly reduces the disk IOPS by eliminating the metadata operations sent to the disk.

Backwards-Compatible Zeroes:
We will write the zeroes using protocol-compliant annotations:
#PAD:{$zeroes}\r\n

2.4 Recycle Log Files [Backwards Incompatible]

Creating new files and zeroing them out consumes valuable disk bandwidth and compute resources. Instead of deleting old incremental AOF files that are no longer needed, we will recycle and reuse them for writing future logs. By reusing already allocated and initialized files, we completely bypass the cost of zeroing out new files, optimizing disk bandwidth and reducing overall system overhead.

Handling Recycled Data During Recovery: Reusing files introduces a complication: a recycled AOF file will contain arbitrary junk data or old log entries from its previous lifecycle. This makes it difficult to differentiate between newly written log entries and leftover data. We resolve this using the log headers proposed in section 2.1.

When reading the log during replay or recovery, the process must be sequential, starting from a known-good checkpoint (or the beginning of the log). The system will read each log header to determine the entry's length, read the corresponding bytes, and perform two critical verifications:

  • Valid CRC: Ensures the data is a legitimate, uncorrupted log entry rather than random garbage bytes.
  • Monotonically Increasing LSN: Validates the sequence. If a consistent log entry is read but its LSN is lower than expected, the system assumes it belongs to the previous lifecycle of the recycled file and treats it as uninitialized data.

To safely recover, the system reads sequentially from a checkpoint until it encounters either a CRC error or an LSN mismatch. At that point, it assumes it has reached the true end of the current AOF.

The incremental AOF file size will be fixed, and logs will continue in a new file once the current one reaches that limit.

  • Note: This specific change breaks backwards compatibility and will require an explicit opt-in.

2.5 Direct I/O (O_DIRECT & O_DSYNC) with Alignment Padding

To eliminate kernel page cache overhead, we will support a configuration that allows writes to be performed using Direct I/O when using AOF-based durability mode (appendfsync always).

Implementation Details:

  • Files will be opened with the O_DIRECT and O_DSYNC flags.
  • This consolidates the write and fdatasync operations into a single system call and ensures data is written directly to the storage device.
  • Alignment Requirement: Direct I/O requires that write start offsets align strictly with the underlying block device's physical block size (e.g., 4KB).

Backwards-Compatible Zero Padding:
To achieve block alignment without breaking older AOF parsers, we will pad writes using protocol-compliant annotations:
#PAD:{$zeroes}\r\n

Since Direct I/O can result in high write amplification for workloads with low write throughput, and given that not all filesystems support O_DIRECT, enabling Direct I/O will require an explicit opt-in.

2.6 Disk Capacity Backpressure

Currently, when using AOF-based durability mode (appendfsync always), if the disk fills up, new writes may fail, resulting in a server crash.
To improve system resilience, we will introduce a configurable disk usage threshold. When disk capacity approaches this watermark (e.g., 95% full), Valkey will proactively trigger backpressure: it will safely reject new write commands with the existing "-MISCONF Errors writing to the AOF file" error, keeping the server alive. This allows AOF rewrite to run and free up disk space, or gives administrators a window to intervene before a hard disk-full crash occurs.


3. Performance Benchmarks

3.1 Benchmark Setup


Valkey 8.0 is run on a 2-vCPU GCP VM with io-threads=1, appendonly=yes, appendfsync=always. The AOF logs are written to a Hyperdisk Balanced volume attached to the VM. AOF rewrite is disabled.

Memtier client is run on another GCP VM in the same availability zone as the Valkey server with the following parameters:

  • Set:Get ratio : 1:3
  • Payload size: 1000 bytes
  • Total keys: 5 million

Data is pre-loaded into the cache so as to achieve a 100% hit rate during the benchmarks. The memtier clients and threads parameters are gradually increased until further raising the connection count no longer improves throughput.

3.2 Summary

| Setup | Op | QPS | P50 Latency (ms) | P99 Latency (ms) |
|---|---|---|---|---|
| Baseline AOF | Get | 99,669 | 5.34 | 11.07 |
| Baseline AOF | Set | 33,226 | 5.34 | 11.13 |
| AOF with Async Writes | Get | 141,244 (+41.7%) | 0.50 (-90.6%) | 1.04 (-90.6%) |
| AOF with Async Writes | Set | 47,082 (+41.7%) | 2.89 (-45.9%) | 4.31 (-61.3%) |
| AOF with Pre-initialized Files | Get | 154,671 (+55.2%) | 0.32 (-94.0%) | 0.47 (-95.8%) |
| AOF with Pre-initialized Files | Set | 51,557 (+55.2%) | 0.79 (-85.2%) | 1.21 (-89.1%) |

We also compare the performance of the optimized AOF against performance with AOF disabled (appendonly=no).

| Setup | Op | QPS | P50 Latency (ms) | P99 Latency (ms) |
|---|---|---|---|---|
| AOF Disabled | Get | 185,057 | 0.37 | 0.49 |
| AOF Disabled | Set | 61,686 | 0.35 | 0.49 |
| AOF with Pre-initialized Files | Get | 154,671 (-16.4%) | 0.32 (-13.5%) | 0.47 (-4.1%) |
| AOF with Pre-initialized Files | Set | 51,557 (-16.4%) | 0.79 (+125.7%) | 1.21 (+146.9%) |

3.3 Baseline AOF Performance


3.4 Performance with async writes

Log header annotations are added to the log entries, using crc64 to compute the checksum. AOF write and fdatasync are performed on bio threads instead of the main thread. A pool of 4 bio threads is used to allow up to 4 concurrent disk writes. A dirty key tracker is used to block sending responses to clients referencing the dirty keys.

Note that the implementation will be revised based on #3381 and we will explore using io_uring to minimize context switching overhead and to support async writes on smaller machine types where provisioning io-threads may not be feasible.


3.5 Performance with pre-initialized file blocks and Direct I/O

Incremental AOF files are pre-initialized and recycled on AOF rewrite. We use an array of 1GB files that essentially acts as a circular buffer. Files are opened with O_DSYNC and O_DIRECT. The AOF writes are 4k-aligned.


3.6 Performance with AOF disabled


4. Technical References

Postgres WAL

Postgres WAL (write-ahead log) data physically manifests as a series of 16MB files (i.e. each physical file covers a ~16MB section of the logical WAL space).

When the files are first allocated, they are zero-filled (https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L3305 ) which generates 2x the bandwidth to disk (once for the zeroes, once for the actual data).

However, Postgres does not keep the physical files for eternity; there is a point past which the old WAL files are no longer needed. Rather than deleting those files, Postgres recycles them (https://github.com/postgres/postgres/blob/master/src/backend/access/transam/xlog.c#L4081), so that they can be re-used without needing to be zero-initialized. So the zeroing penalty is only paid by a new database, until it starts recycling WAL files. Once the DB has initialized N WAL files (where N depends on some configuration parameters of the database), it stops paying the penalty.


5. Open Questions

  • What new configuration flags need to be added? Adding log headers would increase the AOF log size. So, we may want to guard such changes behind configuration flags and disable the functionality by default.
  • Should AOF log headers use text format or a Base64 format? How is the log header format versioned?
