@@ -288,6 +288,26 @@ impl storage::bin::GetAdjacencyList for SimpleNeighborProviderAsync<u32> {
Ok(list)
}

/// Optimized version that reuses a pre-allocated buffer to avoid per-call allocation.
///
/// This directly copies the adjacency list into the provided buffer, avoiding the
/// intermediate `AdjacencyList` allocation that `get_adjacency_list` requires.
fn get_adjacency_list_into(&self, i: usize, buffer: &mut Vec<u32>) -> ANNResult<()> {
#[cfg(test)]
self.num_get_calls.increment();

// Lint: We don't have a good way of recovering from lock poisoning anyways.
#[allow(clippy::unwrap_used)]
let _guard = self.locks[i].read().unwrap();

// SAFETY: We are holding the read lock for `i`.
let list = unsafe { self.get_slice(i) };

// Reuse buffer: clear and copy data directly
buffer.clear();
buffer.extend_from_slice(list);
Ok(())
}

fn total(&self) -> usize {
self.locks.len()
}
@@ -346,6 +366,32 @@ impl storage::bin::GetAdjacencyList for DiskAdaptor<'_> {
Ok(list)
}

/// Optimized version that reuses a pre-allocated buffer to avoid per-call allocation.
///
/// This directly reads neighbors into the buffer and performs the start point remapping
/// in-place, avoiding the intermediate `AdjacencyList` allocation.
fn get_adjacency_list_into(&self, i: usize, buffer: &mut Vec<u32>) -> ANNResult<()> {
// Lint: We don't have a good way of recovering from lock poisoning anyways.
#[allow(clippy::unwrap_used)]
let _guard = self.provider.locks[i].read().unwrap();

// SAFETY: We are holding the read lock for `i`.
let list = unsafe { self.provider.get_slice(i) };
SkyInTheSea marked this conversation as resolved.

// Reuse buffer: clear and copy data directly
buffer.clear();
buffer.extend_from_slice(list);

// Remap the in-memory start point to the actual start point
for id in buffer.iter_mut() {
if *id == self.inmem_start_point {
*id = self.actual_start_point;
}
}

Ok(())
}

fn total(&self) -> usize {
// Don't include any start points at the end.
self.provider.locks.len() - self.provider.num_start_points
43 changes: 30 additions & 13 deletions diskann-providers/src/storage/bin.rs
@@ -100,6 +100,23 @@ pub(crate) trait GetAdjacencyList {
/// Retrieve the data stored at index `i`.
fn get_adjacency_list(&self, i: usize) -> ANNResult<Self::Item<'_>>;

/// Retrieve the data stored at index `i` into a pre-allocated buffer.
///
/// This method allows callers to reuse a buffer across multiple calls,
/// avoiding per-call memory allocation overhead. The buffer is cleared
/// before being populated with the adjacency list data.
///
/// Default implementation falls back to `get_adjacency_list` and copies.
fn get_adjacency_list_into(&self, i: usize, buffer: &mut Vec<Self::Element>) -> ANNResult<()>
Contributor:

You can make this even more efficient by not adding get_adjacency_list_into and instead changing the Item to something like this:

struct Guarded<'a, T> {
    guard: RwLockReadGuard<'a, ()>,
    slice: &'a [T],
}

impl<T> Deref for Guarded<'_, T> {
    type Target = [T];
    fn deref(&self) -> &[T] {
        self.slice
    }
}

This bypasses the copy into the temporary buffer and lets you copy directly from the underlying provider into the destination.

This doesn't work for the DiskAdaptor, but presumably you aren't using that? Another option would be to change get_adjacency_list to take &mut self, so that DiskAdaptor can allocate a local buffer itself.
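To make the suggestion concrete, here is a minimal caller-side sketch of the Guarded idea (a standalone toy, not the crate's actual types): the guard keeps the read lock alive for exactly as long as the caller borrows the slice, so the caller can copy straight from the underlying store with no temporary buffer.

```rust
use std::ops::Deref;
use std::sync::{RwLock, RwLockReadGuard};

// Hypothetical guard tying a borrowed slice to the lifetime of a read lock.
struct Guarded<'a, T> {
    _guard: RwLockReadGuard<'a, ()>,
    slice: &'a [T],
}

impl<T> Deref for Guarded<'_, T> {
    type Target = [T];
    fn deref(&self) -> &[T] {
        self.slice
    }
}

// Caller-side demo: copy straight from the guarded slice, no temp buffer.
fn demo() -> Vec<u32> {
    let lock = RwLock::new(());
    let data = vec![1u32, 2, 3];
    let guarded = Guarded {
        _guard: lock.read().unwrap(),
        slice: &data,
    };
    let mut out = Vec::new();
    out.extend_from_slice(&guarded); // deref coercion: &Guarded -> &[u32]
    out
}

fn main() {
    assert_eq!(demo(), vec![1, 2, 3]);
}
```

The lock is released when `guarded` is dropped, which is what makes the lock-hold duration equal to however long the caller keeps the value, the crux of the discussion below.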

Author:

I don't quite catch your point here. The optimization is to avoid a per-node memory allocation when flattening each neighbor list into the index file; with a global fixed-size buffer we save that allocation time. If we don't want a new interface that passes a reference to the global buffer, we could allocate it beforehand and make it referenceable from get_adjacency_list.

Contributor:

Right. With your API, you first copy the data into the externally allocated buffer, then copy it out of that buffer into the final stream. With the change I showed, using Item<'a> = Guarded<'a, u32>, you can avoid ever copying into the allocated buffer and instead use the Deref implementation of Guarded to copy straight from the underlying store. This does not allocate and removes the extra copy step.

Author:

I think I see your meaning now: you want to expose the vertex's neighbor array directly to the final BufWriter for copying. That saves one memcpy, but the copy into the BufWriter then happens under the read lock. The BufWriter is a black box: we don't know when a flush will happen (part of the copy could fill its buffer and trigger one), and a flush writes to disk, a blocking operation that can hold the read lock much longer than a single memcpy and block other operations, such as updates to the node.

A memcpy into a small external buffer is very cheap, so I generally prefer the copy, which guarantees the read lock is held only briefly.
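The trade-off described above (copy briefly under the lock, write outside it) can be sketched as follows; Provider, save_vertex, and the per-vertex lock layout are illustrative assumptions, not the crate's real types:

```rust
use std::io::Write;
use std::sync::RwLock;

// Hypothetical provider: one lock per vertex guarding its neighbor list.
struct Provider {
    lists: Vec<RwLock<Vec<u32>>>,
}

// Copy under the read lock, then write to the (possibly blocking) sink only
// after the lock is released, so a buffered writer's flush cannot extend
// the lock hold.
fn save_vertex<W: Write>(
    p: &Provider,
    i: usize,
    buf: &mut Vec<u32>,
    out: &mut W,
) -> std::io::Result<()> {
    {
        let guard = p.lists[i].read().unwrap();
        buf.clear();
        buf.extend_from_slice(&guard); // cheap memcpy; lock held only here
    } // read lock released before any I/O happens
    for n in buf.iter() {
        out.write_all(&n.to_le_bytes())?;
    }
    Ok(())
}

fn demo() -> Vec<u8> {
    let p = Provider {
        lists: vec![RwLock::new(vec![7u32, 8])],
    };
    let mut buf = Vec::new();
    let mut out = Vec::new();
    save_vertex(&p, 0, &mut buf, &mut out).unwrap();
    out
}

fn main() {
    assert_eq!(demo(), vec![7, 0, 0, 0, 8, 0, 0, 0]);
}
```

Because the lock scope ends before the write loop, a slow flush in `out` can no longer block concurrent updates to vertex `i`.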

Contributor:

I sympathize with the desire to minimize how long a lock is held, but this leads me to a bigger concern: why do we need to minimize the lock duration in this API at all? The only reason I can think of is that holding it longer would block concurrent writes to the graph. But if writes are happening concurrently with a save using this API, isn't the save going to be incorrect, i.e. not a faithful snapshot? And if writes are not happening concurrently, then it doesn't matter if we hold onto the lock.

Author:

Yes, we need to consider that concurrent write traffic may update a vertex's neighbors and be blocked by the stream write while it holds the read lock. And your concern is a real issue: if a save takes 5 minutes (our prod case for a large DiskANN index), the neighbors dumped to disk are, strictly speaking, not a complete graph but a graph mid-update, and the impact is hard to evaluate. That means a single ANN object cannot handle the save-during-write case by itself; it has to be resolved by the layer above.

Our ANN service, for example, manages multiple indices to avoid exactly this snapshot issue: it hosts more than one index and aggregates search results across them, with one read/write index and the rest read-only; writes are applied only to the read/write index. When a snapshot or save command arrives, the current read/write index is atomically sealed as read-only and a new index is created to serve writes. Deletes are applied both to the new read/write index and to a delete list that is replayed against the read-only indices, and a background thread keeps merging the read-only indices with the deletion list.
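The multi-index scheme described above could be modeled roughly like this. This is a toy sketch under invented names (IndexShard, AnnService, and the in-memory delete list are all assumptions), not the actual service code:

```rust
use std::collections::HashSet;

// Toy model of the seal-and-swap snapshot scheme described above.
struct IndexShard {
    ids: Vec<u64>,
    sealed: bool, // read-only once sealed
}

struct AnnService {
    readonly: Vec<IndexShard>,
    rw: IndexShard,
    delete_list: HashSet<u64>, // deletes replayed against sealed shards
}

impl AnnService {
    fn insert(&mut self, id: u64) {
        self.rw.ids.push(id); // writes go only to the read/write shard
    }

    fn delete(&mut self, id: u64) {
        self.rw.ids.retain(|&x| x != id);
        self.delete_list.insert(id); // masks the id in sealed shards
    }

    // On a snapshot/save command: seal the current RW shard atomically
    // and open a fresh one to keep serving writes.
    fn snapshot(&mut self) {
        let mut sealed = std::mem::replace(
            &mut self.rw,
            IndexShard { ids: Vec::new(), sealed: false },
        );
        sealed.sealed = true;
        self.readonly.push(sealed);
    }

    // Search aggregates across all shards, applying the delete list
    // to the sealed read-only ones.
    fn search_all(&self) -> Vec<u64> {
        let mut out: Vec<u64> = self
            .readonly
            .iter()
            .filter(|s| s.sealed)
            .flat_map(|s| s.ids.iter().copied())
            .filter(|id| !self.delete_list.contains(id))
            .collect();
        out.extend(self.rw.ids.iter().copied());
        out
    }
}

fn demo() -> Vec<u64> {
    let mut svc = AnnService {
        readonly: Vec::new(),
        rw: IndexShard { ids: Vec::new(), sealed: false },
        delete_list: HashSet::new(),
    };
    svc.insert(1);
    svc.insert(2);
    svc.snapshot(); // ids 1 and 2 now live in a sealed read-only shard
    svc.insert(3);
    svc.delete(2); // applied via the delete list against the sealed shard
    let mut results = svc.search_all();
    results.sort();
    results
}

fn main() {
    assert_eq!(demo(), vec![1, 3]);
}
```

The sealed shard never changes after `snapshot`, so it can be saved to disk without holding any per-vertex lock for long, which is the point of the design.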

where
Self::Element: Clone,
{
buffer.clear();
let list = self.get_adjacency_list(i)?;
buffer.extend_from_slice(&list);
Ok(())
}

/// Return the total number of elements contained in `self`.
fn total(&self) -> usize;

Expand Down Expand Up @@ -344,31 +361,31 @@ where
let mut observed_max_degree: u32 = 0;

out.write_all(&index_size.to_le_bytes())?;
-out.write_all(&observed_max_degree.to_le_bytes())?; // Will be updated later with correct max_degree
+out.write_all(&observed_max_degree.to_le_bytes())?;
out.write_all(&start_point.to_le_bytes())?;

out.write_all(&graph.additional_points().to_le_bytes())?;

let total = graph.total();

// Pre-allocate a reusable buffer for adjacency lists
let initial_capacity = graph.max_degree().map(|d| d as usize).unwrap_or(128);
let mut neighbor_buffer: Vec<u32> = Vec::with_capacity(initial_capacity);

for i in 0..total {
-let binding = graph.get_adjacency_list(i)?;
-let neighbors: &[u32] = &binding;
-let num_neighbors: u32 = neighbors.len() as u32;
+// Reuse buffer to avoid per-vertex allocation overhead
+graph.get_adjacency_list_into(i, &mut neighbor_buffer)?;
+let num_neighbors: u32 = neighbor_buffer.len() as u32;

// Write the number of neighbors as a `u32`.
out.write_all(&num_neighbors.to_le_bytes())?;

// Write all the neighbors, applying transformation if provided.
-neighbors
-    .iter()
-    .copied()
-    .try_for_each(|n| out.write_all(&n.to_le_bytes()))?;
+// Bulk write using bytemuck for zero-copy conversion
+let neighbor_bytes: &[u8] = bytemuck::must_cast_slice(&neighbor_buffer);
+out.write_all(neighbor_bytes)?;
Copilot AI (Apr 13, 2026), commenting on lines 379 to +383:

save_graph writes neighbor IDs by casting &[u32] to bytes and writing them directly. This relies on the host being little-endian, but the file format docs (and load_graph, which uses ReadBytesExt::<LittleEndian>) require the neighbor IDs be encoded as little-endian u32. On big-endian targets this will silently produce corrupt graph files. Consider bulk-encoding to little-endian (e.g., via a reusable scratch [u8] and byteorder little-endian writes, or by converting u32 values to LE before writing) while still keeping the per-vertex allocation avoidance.
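One way to keep the bulk write while staying little-endian on any host, along the lines suggested above, is to encode into a reusable byte scratch buffer and issue a single write per vertex. encode_le is a hypothetical helper for illustration, not part of the crate:

```rust
// Encode u32 neighbor IDs as little-endian into a reusable scratch buffer,
// then the caller can issue one bulk write_all; this stays correct on
// big-endian hosts, unlike casting &[u32] to bytes directly.
fn encode_le(neighbors: &[u32], scratch: &mut Vec<u8>) {
    scratch.clear();
    scratch.reserve(neighbors.len() * std::mem::size_of::<u32>());
    for n in neighbors {
        scratch.extend_from_slice(&n.to_le_bytes());
    }
}

fn main() {
    let mut scratch = Vec::new();
    encode_le(&[1u32, 256], &mut scratch);
    assert_eq!(scratch, vec![1, 0, 0, 0, 0, 1, 0, 0]);
}
```

This keeps the per-vertex allocation avoidance (the scratch buffer is reused across vertices) while matching the little-endian encoding that load_graph expects.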


observed_max_degree = observed_max_degree.max(num_neighbors);
-index_size += (std::mem::size_of::<u32>() * (1 + neighbors.len())) as u64;
+index_size += (std::mem::size_of::<u32>() * (1 + neighbor_buffer.len())) as u64;
}

// Use configured max degree if provided, otherwise use observed
let max_degree = graph.max_degree().unwrap_or(observed_max_degree);

// Finish up by writing the observed index size and max degree.