
Conversation

tanujnay112 (Contributor) commented Nov 23, 2025

Description of changes

Summarize the changes made by this PR.

This diff persists a backfill signal from a call to `attach_function` through to `run_backfilled_attached_function_workflow`, which was introduced by #5896. It does so by pushing a new `backfill_fn` log to the affected input collection's logs. This log is effectively a no-op on the compactor path, since `materialize_logs` ignores it. In `compact.rs::compact()`, if the initial `run_fetch_logs` call detects the presence of this backfill log, `run_backfilled_attached_function_workflow` is called; if that call succeeds, we return and skip the usual compaction flow. If no backfill was needed, the pre-existing compaction + attached function execution runs as before.
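The dispatch described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual `compact.rs` code: the `Operation` enum, `has_backfill_marker`, and the string return values are simplified placeholders.

```rust
// Illustrative sketch of the compaction dispatch described in this PR.
// All names here are simplified stand-ins for the real Chroma types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operation {
    Add,
    Delete,
    BackfillFn, // new marker pushed to the log by attach_function
}

fn has_backfill_marker(logs: &[Operation]) -> bool {
    logs.iter().any(|op| *op == Operation::BackfillFn)
}

/// If the fetched logs contain a backfill_fn marker, run the backfill
/// workflow and skip normal compaction; otherwise compact as usual.
fn compact(logs: &[Operation]) -> &'static str {
    if has_backfill_marker(logs) {
        // run_backfilled_attached_function_workflow() would be invoked here;
        // on success we return without entering the usual compaction flow.
        "backfill"
    } else {
        "normal_compaction"
    }
}

fn main() {
    assert_eq!(compact(&[Operation::Add, Operation::BackfillFn]), "backfill");
    assert_eq!(compact(&[Operation::Add, Operation::Delete]), "normal_compaction");
}
```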

  • Improvements & Bug fixes
    • ...
  • New functionality
    • ...

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?

Observability plan

What is the plan to instrument and monitor this change?

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the _docs_ section?

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of an unexpectedly high quality? (Readability, Modularity, Intuitiveness)

@tanujnay112 tanujnay112 changed the base branch from backfill2 to graphite-base/5897 November 23, 2025 06:42
@tanujnay112 tanujnay112 changed the base branch from graphite-base/5897 to backfill2 November 23, 2025 06:48
@tanujnay112 tanujnay112 marked this pull request as ready for review November 23, 2025 21:21
propel-code-bot (Contributor) commented Nov 23, 2025

Persist ‘backfill_fn’ log entry and execution path to automatically back-fill newly attached functions

This PR introduces a durable, end-to-end mechanism that guarantees all previously ingested records are re-processed whenever a user attaches a new function to an existing collection. A new log entry type (backfill_fn) is written by the front-end during attach_function(). Compactor/worker components detect this entry and invoke the dedicated backfill workflow instead of the normal compaction path. Because the intent is stored in the collection log, the backfill survives service restarts and requires no manual intervention, closing a long-standing consistency gap for customers who retrofit enrichment functions onto live data.

The change is additive and fully backward compatible: older binaries that do not understand the new enum will simply ignore it, so no wire-protocol or DB migrations are required. Normal operation is unaffected unless the new log entry is present.

Key Changes

• Added backfill_fn operation to protobuf (chroma.proto) and Rust type definitions (api_types.rs, operation.rs, strategies.rs).
• Service front-end (service_based_frontend.rs) now emits a backfill_fn record during attach_function().
• Compactor/worker (log_fetch_orchestrator.rs, compact.rs) short-circuit standard compaction when they encounter a backfill_fn entry and call run_backfilled_attached_function_workflow().
• Updated enum handling across segment implementations (segment, local_hnsw), metadata layers (sqlite_metadata) and tests to recognise the new record.
• Materialiser treats the new log type as a no-op, maintaining normal materialisation behaviour.
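The materialiser behaviour in the last bullet can be pictured as a filter over incoming log records. A minimal sketch, assuming a simplified `Operation` enum rather than the real `materialize_logs` types:

```rust
// Hypothetical sketch: backfill_fn records carry no user data, so the
// materialiser drops them before any segment writes occur.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operation {
    Add,
    Delete,
    BackfillFn,
}

fn materialize(logs: Vec<Operation>) -> Vec<Operation> {
    logs.into_iter()
        .filter(|op| *op != Operation::BackfillFn) // no-op on compaction path
        .collect()
}

fn main() {
    let logs = vec![Operation::Add, Operation::BackfillFn, Operation::Delete];
    assert_eq!(materialize(logs), vec![Operation::Add, Operation::Delete]);
}
```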

Affected Areas

• Protocol definitions (protobuf)
• Front-end service logic (function attachment)
• Log orchestration & compaction pipeline
• Segment/metadata enum handling
• Unit & integration tests

This summary was automatically generated by @propel-code-bot

Comment on lines 1979 to 1994
```rust
self.log_client
    .push_logs(
        &tenant,
        collection_id,
        vec![OperationRecord {
            id: "backfill_id".to_string(),
            embedding: Some(fake_embedding),
            encoding: None,
            metadata: None,
            document: None,
            operation: Operation::BackfillFn,
        }],
    )
    .await?;
```
[CriticalError]

**Resource leak in error path**: The `start_backfill` function pushes a fake backfill record to the log, but if `push_logs` fails, the collection state may become inconsistent, since the function was already registered in step 1. The sysdb update in step 3 won't execute, leaving the system in a partial state.

```rust
// Add cleanup or make this transactional
if let Err(e) = self.log_client.push_logs(...).await {
    // Rollback function registration or mark as failed
    self.sysdb_client.detach_function(attached_function_id).await?;
    return Err(e);
}
```

File: rust/frontend/src/impls/service_based_frontend.rs
Line: 1992
Comment on lines 1982 to 1991
```rust
collection_id,
vec![OperationRecord {
    id: "backfill_id".to_string(),
    embedding: Some(fake_embedding),
    encoding: None,
    metadata: None,
    document: None,
    operation: Operation::BackfillFn,
}],
)
```
[BestPractice]

**Race condition: non-idempotent backfill operation**: This code pushes a backfill record with the hardcoded ID `"backfill_id"`. If `start_backfill` is called multiple times (retry, duplicate request, etc.), it will push duplicate backfill records with the same ID, potentially triggering multiple backfill operations.

```rust
// Use unique ID per invocation
OperationRecord {
    id: format!("backfill_{}_{}", collection_id, Uuid::new_v4()),
    // ... rest of fields
}
```

Or add deduplication logic to check whether a backfill is already in progress before pushing the record.

File: rust/frontend/src/impls/service_based_frontend.rs
Line: 1991

@tanujnay112 tanujnay112 requested a review from rescrv November 23, 2025 22:27
```rust
    }
    Operation::Delete => None,
    Operation::BackfillFn => {
        // TODO(tanujnay112): Can we even reach here with this in prod? Ask @jason.
```
Does this need to be done before PR?

@tanujnay112 tanujnay112 changed the base branch from backfill2 to graphite-base/5897 November 23, 2025 23:14
@rescrv rescrv self-requested a review November 23, 2025 23:51
@tanujnay112 tanujnay112 changed the base branch from graphite-base/5897 to backfill2 November 23, 2025 23:53
Comment on lines 1976 to 1984
```rust
.ok_or_else(|| {
    AttachFunctionError::GetCollectionError(GetCollectionError::CollectionNotFound(
        collection_id.to_string(),
    ))
})?;
let fake_embedding = vec![0.0; embedding_dim as usize];
// TODO(tanujnay112): Make this either a configurable or better yet a separate
// RPC to the logs service.
let num_fake_logs = 250;
```
[PerformanceOptimization]

**Performance issue**: Allocating 250 fake logs with full embedding vectors will consume significant memory. With `embedding_dim` potentially being 1536 (common for OpenAI embeddings), this creates `250 * 1536 * 8 bytes ≈ 3 MB` per call. If multiple backfills run concurrently, this could exhaust memory.

Consider:

  1. Making `num_fake_logs` smaller (e.g., 10-50)
  2. Streaming logs in batches instead of allocating all at once
  3. Using a smaller sentinel embedding for backfill markers

```rust
// Option 3: Use minimal embedding for backfill marker
let fake_embedding = vec![0.0; 1]; // Single sentinel value
```

File: rust/frontend/src/impls/service_based_frontend.rs
Line: 1984

tanujnay112 (Contributor, Author) commented Nov 24, 2025

Merge activity

  • Nov 24, 12:43 AM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Nov 24, 12:45 AM UTC: Graphite rebased this pull request as part of a merge.
  • Nov 24, 1:11 AM UTC: @tanujnay112 merged this pull request with Graphite.

@tanujnay112 tanujnay112 changed the base branch from backfill2 to graphite-base/5897 November 24, 2025 00:43
@tanujnay112 tanujnay112 changed the base branch from graphite-base/5897 to main November 24, 2025 00:44
```rust
        ctx,
    )
    .await;
if self.has_backfill {
```
[CriticalError]

**Logic bug**: The condition `if self.has_backfill` only checks a boolean flag that's set to `true` if *any* `MaterializeLogOutput` has backfill. However, the code takes *all* materialized outputs (including non-backfill ones) and returns them as `RequireFunctionBackfill`.

This means if you have:

  • 5 normal log records materialized
  • 1 backfill log record materialized

All 6 records are sent to the backfill path, not just the 1 backfill record.

**Fix**: Filter the materialized outputs:

```rust
if self.has_backfill {
    let (backfill_outputs, normal_outputs): (Vec<_>, Vec<_>) =
        materialized.into_iter().partition(|o| o.result.has_backfill());

    if !backfill_outputs.is_empty() {
        self.terminate_with_result(
            Ok(RequireFunctionBackfill::new(backfill_outputs, collection_info).into()),
            ctx,
        ).await;
    }
    // Handle normal_outputs separately or in next iteration
}
```

File: rust/worker/src/execution/orchestration/log_fetch_orchestrator.rs
Line: 809

@tanujnay112 tanujnay112 merged commit fa8dff2 into main Nov 24, 2025
62 checks passed
