
Conversation

tanujnay112 (Contributor) commented Nov 23, 2025

Description of changes

Summarize the changes made by this PR.

This diff persists a backfill signal from a call to `attach_function` through to `run_backfilled_attached_function_workflow`, which was introduced by #5896. It does so by pushing a new `backfill_fn` log to the affected input collection's logs. This log is effectively a no-op on the compactor path, since `materialize_logs` ignores it. In `compact.rs::compact()`, if the initial `run_fetch_logs` call detects the presence of this backfill log, `run_backfilled_attached_function_workflow` is called; if that call succeeds, we return and skip the usual compaction flow. If no backfill was needed, the pre-existing compaction + attached function execution runs as before.
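The dispatch described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual `compact.rs` code: the `Operation` enum, `has_backfill_marker`, and the string return values are simplified placeholders.

```rust
// Illustrative sketch of the compaction dispatch described in this PR.
// All names here are simplified stand-ins for the real Chroma types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operation {
    Add,
    Delete,
    BackfillFn, // new marker pushed to the log by attach_function
}

fn has_backfill_marker(logs: &[Operation]) -> bool {
    logs.iter().any(|op| *op == Operation::BackfillFn)
}

/// If the fetched logs contain a backfill_fn marker, run the backfill
/// workflow and skip normal compaction; otherwise compact as usual.
fn compact(logs: &[Operation]) -> &'static str {
    if has_backfill_marker(logs) {
        // run_backfilled_attached_function_workflow() would be invoked here;
        // on success we return without entering the usual compaction flow.
        "backfill"
    } else {
        "normal_compaction"
    }
}

fn main() {
    assert_eq!(compact(&[Operation::Add, Operation::BackfillFn]), "backfill");
    assert_eq!(compact(&[Operation::Add, Operation::Delete]), "normal_compaction");
}
```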

  • Improvements & Bug fixes
    • ...
  • New functionality
    • ...

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the _docs_ section?

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality? (Readability, Modularity, Intuitiveness)

@tanujnay112 tanujnay112 changed the base branch from backfill2 to graphite-base/5897 November 23, 2025 06:42
@tanujnay112 tanujnay112 changed the base branch from graphite-base/5897 to backfill2 November 23, 2025 06:48
@tanujnay112 tanujnay112 marked this pull request as ready for review November 23, 2025 21:21
propel-code-bot (Contributor) commented Nov 23, 2025

Persist ‘backfill_fn’ log entry and execution path to automatically back-fill newly attached functions

This PR introduces a durable, end-to-end mechanism that guarantees all previously ingested records are re-processed whenever a user attaches a new function to an existing collection. A new log entry type (backfill_fn) is written by the front-end during attach_function(). Compactor/worker components detect this entry and invoke the dedicated backfill workflow instead of the normal compaction path. Because the intent is stored in the collection log, the backfill survives service restarts and requires no manual intervention, closing a long-standing consistency gap for customers who retrofit enrichment functions onto live data.

The change is additive and fully backward compatible: older binaries that do not understand the new enum will simply ignore it, so no wire-protocol or DB migrations are required. Normal operation is unaffected unless the new log entry is present.

Key Changes

• Added backfill_fn operation to protobuf (chroma.proto) and Rust type definitions (api_types.rs, operation.rs, strategies.rs).
• Service front-end (service_based_frontend.rs) now emits a backfill_fn record during attach_function().
• Compactor/worker (log_fetch_orchestrator.rs, compact.rs) short-circuit standard compaction when they encounter a backfill_fn entry and call run_backfilled_attached_function_workflow().
• Updated enum handling across segment implementations (segment, local_hnsw), metadata layers (sqlite_metadata) and tests to recognise the new record.
• Materialiser treats the new log type as a no-op, maintaining normal materialisation behaviour.
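The materialiser behaviour in the last bullet can be pictured as a filter over incoming log records. A minimal sketch, assuming a simplified `Operation` enum rather than the real `materialize_logs` types:

```rust
// Hypothetical sketch: backfill_fn records carry no user data, so the
// materialiser drops them before any segment writes occur.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operation {
    Add,
    Delete,
    BackfillFn,
}

fn materialize(logs: Vec<Operation>) -> Vec<Operation> {
    logs.into_iter()
        .filter(|op| *op != Operation::BackfillFn) // no-op on compaction path
        .collect()
}

fn main() {
    let logs = vec![Operation::Add, Operation::BackfillFn, Operation::Delete];
    assert_eq!(materialize(logs), vec![Operation::Add, Operation::Delete]);
}
```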

Affected Areas

• Protocol definitions (protobuf)
• Front-end service logic (function attachment)
• Log orchestration & compaction pipeline
• Segment/metadata enum handling
• Unit & integration tests

This summary was automatically generated by @propel-code-bot

Comment on lines 1979 to 1994
```rust
self.log_client
    .push_logs(
        &tenant,
        collection_id,
        vec![OperationRecord {
            id: "backfill_id".to_string(),
            embedding: Some(fake_embedding),
            encoding: None,
            metadata: None,
            document: None,
            operation: Operation::BackfillFn,
        }],
    )
    .await?;
```
[CriticalError]

**Resource leak in error path**: The `start_backfill` function pushes a fake backfill record to the log, but if `push_logs` fails, the collection state may become inconsistent, since the function was already registered in step 1. The sysdb update in step 3 won't execute, leaving the system in a partial state.

```rust
// Add cleanup or make this transactional
if let Err(e) = self.log_client.push_logs(...).await {
    // Rollback function registration or mark as failed
    self.sysdb_client.detach_function(attached_function_id).await?;
    return Err(e);
}
```

File: rust/frontend/src/impls/service_based_frontend.rs
Line: 1992
Comment on lines 1982 to 1991
```rust
collection_id,
vec![OperationRecord {
    id: "backfill_id".to_string(),
    embedding: Some(fake_embedding),
    encoding: None,
    metadata: None,
    document: None,
    operation: Operation::BackfillFn,
}],
)
```
[BestPractice]

**Race condition: non-idempotent backfill operation**: This code pushes a backfill record with the hardcoded ID `"backfill_id"`. If `start_backfill` is called multiple times (retry, duplicate request, etc.), it will push duplicate backfill records with the same ID, potentially triggering multiple backfill operations.

```rust
// Use unique ID per invocation
OperationRecord {
    id: format!("backfill_{}_{}", collection_id, Uuid::new_v4()),
    // ... rest of fields
}
```

Or add deduplication logic to check whether a backfill is already in progress before pushing the record.

File: rust/frontend/src/impls/service_based_frontend.rs
Line: 1991

@tanujnay112 tanujnay112 requested a review from rescrv November 23, 2025 22:27
```rust
    }
    Operation::Delete => None,
    Operation::BackfillFn => {
        // TODO(tanujnay112): Can we even reach here with this in prod? Ask @jason.
```
Does this need to be done before PR?

@tanujnay112 tanujnay112 changed the base branch from backfill2 to graphite-base/5897 November 23, 2025 23:14
@rescrv rescrv self-requested a review November 23, 2025 23:51
@tanujnay112 tanujnay112 changed the base branch from graphite-base/5897 to backfill2 November 23, 2025 23:53
Comment on lines 1976 to 1984
```rust
.ok_or_else(|| {
    AttachFunctionError::GetCollectionError(GetCollectionError::CollectionNotFound(
        collection_id.to_string(),
    ))
})?;
let fake_embedding = vec![0.0; embedding_dim as usize];
// TODO(tanujnay112): Make this either a configurable or better yet a separate
// RPC to the logs service.
let num_fake_logs = 250;
```
[PerformanceOptimization]

**Performance issue**: Allocating 250 fake logs with full embedding vectors will consume significant memory. With `embedding_dim` potentially being 1536 (common for OpenAI embeddings), this creates `250 * 1536 * 8 bytes ≈ 3 MB` per call. If multiple backfills run concurrently, this could exhaust memory.

Consider:

  1. Making `num_fake_logs` smaller (e.g., 10-50)
  2. Streaming logs in batches instead of allocating all at once
  3. Using a smaller sentinel embedding for backfill markers

```rust
// Option 3: Use minimal embedding for backfill marker
let fake_embedding = vec![0.0; 1]; // Single sentinel value
```

File: rust/frontend/src/impls/service_based_frontend.rs
Line: 1984

tanujnay112 (Contributor, Author) commented Nov 24, 2025

Merge activity

  • Nov 24, 12:43 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Nov 24, 12:45 AM UTC: Graphite rebased this pull request as part of a merge.
  • Nov 24, 1:11 AM UTC: @tanujnay112 merged this pull request with Graphite.

@tanujnay112 tanujnay112 changed the base branch from backfill2 to graphite-base/5897 November 24, 2025 00:43
@tanujnay112 tanujnay112 changed the base branch from graphite-base/5897 to main November 24, 2025 00:44
```rust
        ctx,
    )
    .await;
if self.has_backfill {
```
[CriticalError]

**Logic bug**: The condition `if self.has_backfill` only checks a boolean flag that's set to `true` if *any* `MaterializeLogOutput` has backfill. However, the code takes *all* materialized outputs (including non-backfill ones) and returns them as `RequireFunctionBackfill`.

This means if you have:

  • 5 normal log records materialized
  • 1 backfill log record materialized

All 6 records are sent to the backfill path, not just the 1 backfill record.

**Fix**: Filter the materialized outputs:

```rust
if self.has_backfill {
    let (backfill_outputs, normal_outputs): (Vec<_>, Vec<_>) =
        materialized.into_iter().partition(|o| o.result.has_backfill());

    if !backfill_outputs.is_empty() {
        self.terminate_with_result(
            Ok(RequireFunctionBackfill::new(backfill_outputs, collection_info).into()),
            ctx,
        ).await;
    }
    // Handle normal_outputs separately or in next iteration
}
```

File: rust/worker/src/execution/orchestration/log_fetch_orchestrator.rs
Line: 809

@tanujnay112 tanujnay112 merged commit fa8dff2 into main Nov 24, 2025
62 checks passed
