change file save buffer size and use external buffer to avoid neighbor buffer allocation for each vertex when saving #944

SkyInTheSea wants to merge 2 commits into main.
Conversation
Pull request overview
Improves DiskANN provider serialization performance by increasing buffered write capacity and enabling adjacency list serialization without per-vertex allocations.
Changes:
- Introduces a 16 MB `BufWriter` capacity for the `.bin` data and graph save paths.
- Extends `GetAdjacencyList` with `get_adjacency_list_into` to support caller-provided reusable buffers (with optimized impls for async providers).
- Updates `save_graph` to reuse a single neighbor buffer across vertices and perform bulk writes.
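As a minimal sketch of the first change, a `BufWriter` can be constructed with an explicit capacity so that many small vertex records coalesce into few large writes. The 16 MB constant mirrors the value named above; the in-memory `Vec<u8>` sink is illustrative (the real code targets a file):

```rust
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // Hypothetical constant matching the 16 MB capacity described in the PR.
    const SAVE_BUF_CAPACITY: usize = 16 * 1024 * 1024;

    // Write into an in-memory sink here; the real save path wraps a file.
    let mut out = BufWriter::with_capacity(SAVE_BUF_CAPACITY, Vec::new());
    out.write_all(&42u32.to_le_bytes())?;

    // `into_inner` flushes any buffered bytes and returns the sink.
    let inner = out.into_inner().expect("flush failed");
    assert_eq!(inner, 42u32.to_le_bytes());
    println!("wrote {} bytes", inner.len());
    Ok(())
}
```

Small writes land in the buffer and only reach the underlying writer when the buffer fills or is flushed, which is what makes per-vertex writes cheap.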
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `diskann-providers/src/storage/bin.rs` | Adds save buffer sizing, extends adjacency-list trait with buffer-reuse API, and updates graph/data save routines to use larger buffered writes and reusable neighbor buffers. |
| `diskann-providers/src/model/graph/provider/async_/simple_neighbor_provider.rs` | Implements `get_adjacency_list_into` for `SimpleNeighborProviderAsync` and its `DiskAdaptor` to avoid intermediate `AdjacencyList` allocations during graph saves. |
```diff
  out.write_all(&num_neighbors.to_le_bytes())?;

- // Write all the neighbors, applying transformation if provided.
- neighbors
-     .iter()
-     .copied()
-     .try_for_each(|n| out.write_all(&n.to_le_bytes()))?;
+ // Bulk write using bytemuck for zero-copy conversion
+ let neighbor_bytes: &[u8] = bytemuck::must_cast_slice(&neighbor_buffer);
+ out.write_all(neighbor_bytes)?;
```
`save_graph` writes neighbor IDs by casting `&[u32]` to bytes and writing them directly. This relies on the host being little-endian, but the file format docs (and `load_graph`, which uses `ReadBytesExt::<LittleEndian>`) require that the neighbor IDs be encoded as little-endian `u32`. On big-endian targets this will silently produce corrupt graph files. Consider bulk-encoding to little-endian (e.g., via a reusable `[u8]` scratch buffer and byteorder little-endian writes, or by converting `u32` values to LE before writing) while still keeping the per-vertex allocation avoidance.
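One endian-safe variant of the bulk write the comment suggests, sketched with std only (the helper name `encode_neighbors_le` is illustrative, not the PR's actual API): each `u32` is converted with `to_le_bytes` into a reusable scratch buffer, then flushed in a single `write_all` per vertex, so the per-vertex allocation avoidance is preserved without assuming host endianness.

```rust
use std::io::Write;

// Encode neighbor IDs as little-endian u32 into a caller-owned scratch
// buffer. `to_le_bytes` is a byte swap on big-endian hosts and free on
// little-endian ones, so the output format is identical everywhere.
fn encode_neighbors_le(neighbors: &[u32], scratch: &mut Vec<u8>) {
    scratch.clear();
    scratch.reserve(neighbors.len() * 4);
    for &n in neighbors {
        scratch.extend_from_slice(&n.to_le_bytes());
    }
}

fn main() -> std::io::Result<()> {
    let mut scratch = Vec::new(); // reused across vertices
    let mut out: Vec<u8> = Vec::new(); // stand-in for the BufWriter sink

    encode_neighbors_le(&[1, 2, 0x0102_0304], &mut scratch);
    out.write_all(&scratch)?; // one bulk write per vertex

    // The third ID is stored little-endian regardless of host byte order.
    assert_eq!(out[8..12], [0x04, 0x03, 0x02, 0x01]);
    Ok(())
}
```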
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main     #944      +/-   ##
==========================================
- Coverage   89.41%   89.29%   -0.13%
==========================================
  Files         448      447       -1
  Lines       84856    83918     -938
==========================================
- Hits        75878    74937     -941
- Misses       8978     8981       +3
```
hildebrandmw left a comment:
A couple of quick comments on how we can do slightly better without introducing a new method. Longer term, though, this API was always intended as a stopgap until a more disciplined approach to the in-memory providers and their serialization is worked out.
```rust
/// before being populated with the adjacency list data.
///
/// Default implementation falls back to `get_adjacency_list` and copies.
fn get_adjacency_list_into(&self, i: usize, buffer: &mut Vec<Self::Element>) -> ANNResult<()>
```
You can make this even more efficient by not adding `get_adjacency_list_into` and instead changing the `Item` to something like this:

```rust
struct Guarded<'a, T> {
    guard: RwLockReadGuard<'a, ()>,
    slice: &'a [T],
}

impl<T> Deref for Guarded<'_, T> {
    type Target = [T];
    fn deref(&self) -> &[T] {
        self.slice
    }
}
```

This bypasses the copy into the temporary buffer and lets you copy directly from the underlying provider into the destination.
This doesn't work for the `DiskAdaptor`, but presumably you aren't using that? Another option would be to change `get_adjacency_list` to take `&mut self` so that `DiskAdaptor` can allocate a local buffer itself.
I don't quite follow your point here. The optimization is to avoid a per-node memory allocation when flattening each neighbor list into the index file; with a global fixed-size buffer we save the allocation time. If we don't want a new interface that passes a reference to the global buffer, we could allocate it beforehand and make it referenceable from `get_adjacency_list`.
Right. With your API, you first copy the data into the externally allocated buffer, then copy it out of that buffer into the final stream. With the change I showed, using `Item<'a> = Guarded<'a, u32>`, you avoid ever copying into the allocated buffer and instead use the `Deref` implementation of `Guarded` to copy straight from the underlying store. This does not allocate and removes the extra copy step.
I think I see your meaning now: you want to expose the vertex's neighbor array directly to the final `BufWriter` to copy from, which saves one memcpy. But that copy into the `BufWriter` then happens under the read lock, and the `BufWriter` is a black box: we don't know when a flush will happen (part of the copy could fill the buffer and trigger one), and flushing to disk is a blocking operation. Compared with a single memcpy, that could hold the read lock for a long time and block other operations, such as updates to the node.

A memcpy into a small external buffer is very cheap, so in general I prefer to copy, which guarantees the read lock is held only briefly.
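The trade-off described above can be sketched as the "copy under a short lock, write outside it" pattern, using std types only (the `RwLock`-guarded adjacency store here is a hypothetical stand-in for the provider's internals):

```rust
use std::io::Write;
use std::sync::RwLock;

fn main() -> std::io::Result<()> {
    // Hypothetical guarded adjacency store: one neighbor list per vertex.
    let adjacency: RwLock<Vec<Vec<u32>>> = RwLock::new(vec![vec![1, 2, 3]]);

    let mut buffer: Vec<u32> = Vec::new(); // reusable external buffer
    {
        // Hold the read lock only long enough for a cheap memcpy.
        let guard = adjacency.read().unwrap();
        buffer.clear();
        buffer.extend_from_slice(&guard[0]);
    } // lock released before any (potentially blocking) disk I/O

    // Serialize outside the lock; a file-backed BufWriter in the real code.
    let mut out: Vec<u8> = Vec::new();
    for &n in &buffer {
        out.write_all(&n.to_le_bytes())?;
    }
    assert_eq!(out.len(), 12); // 3 neighbors x 4 bytes
    Ok(())
}
```

Writing directly from the guarded slice would skip the `extend_from_slice` copy, but then the loop's `write_all` calls, and any flush they trigger, would run inside the scoped block while the lock is held, which is the concern raised above.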
I sympathize with the desire to minimize the duration under which a lock is held, but this leads me to a bigger concern: why do we need to minimize the duration the lock is held in this API? The only reason I can think of is because doing so would prevent other concurrent writes to the graph, but if writes are happening concurrently to a save using this API, isn't the save going to be incorrect/not a faithful snapshot? If writes are not happening concurrently, then it doesn't matter if we hold onto the lock?
Yes, we need to consider that write traffic may try to update a vertex's neighbors and be blocked by the stream write holding the read lock.

Your concern is a real issue: if a save takes 5 minutes (as in our production case for a large DiskANN index), the neighbor data dumped to disk is, strictly speaking, not a complete graph but an in-flight one, and the impact is hard to evaluate. That means a single ANN object cannot handle this save-during-write case by itself; it has to be resolved by an upper layer.

For example, our ANN service manages multiple indices to avoid this snapshot issue. It hosts more than one index and aggregates search results across them; one index is read/write and all others are read-only, with writes applied only to the read/write index. When a snapshot or save command arrives, the current read/write index is atomically sealed as read-only and a new index is created to serve writes. Deletes are applied not only to the new read/write index but also recorded in a delete list that is replayed against the read-only indices, and a background thread keeps merging read-only indices with the deletion list.
Please do not submit PRs with no description and a blank template. Write an appropriate description of what this PR does and why.
Reference Issues/PRs
What does this implement/fix? Briefly explain your changes.
Any other comments?