Skip to content

Conversation

@rescrv
Copy link
Contributor

@rescrv rescrv commented Jun 28, 2025

Description of changes

Wire up the garbage collector to do 3-phase GC.

Test plan

CI

Documentation Changes

N/A

@github-actions
Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@propel-code-bot
Copy link
Contributor

propel-code-bot bot commented Jun 28, 2025

Implement 3-Phase Garbage Collection for WAL Logs and Wire Up End-to-End GC Path

This PR introduces a full 3-phase garbage collection (GC) workflow across the WAL-backed log service and garbage collector in Chroma, extending the API, service implementations, and GC operator logic accordingly. The changes add a new GarbageCollectPhase2 RPC to the logservice API and wire all the GC phases (compute, truncate manifest, delete files) into both the log service and the garbage collector, updating configuration, orchestration, and operator layers to support this across distributed services. It also updates documentation and charts to expose the new API and workflow.

Key Changes:
• Introduces and wires up a new GarbageCollectPhase2 gRPC endpoint in the LogService proto and Rust/Go implementations, allowing phase 2 (manifest truncation) to be performed by the log service.
• Refactors the garbage collector's log deletion operators to use explicit three-phase GC: phase 1 (compute garbage), phase 2 (truncate manifest via RPC call to log service), and phase 3 (physical deletion of log files).
• Extends the log abstraction (Log types in Rust) and their trait implementations to support phase 2 and phase 3 of GC as atomic RPC operations.
• Updates several Kubernetes YAMLs (values/Chart/config/config.yaml) to route and expose the new phase2 GC endpoint, including Helm version bump.
• Documents the rationale for three-phase GC and the responsibilities of each phase in wal3/README.md and code comments, clarifying the architectural division of labor.
• Makes breaking config changes for the garbage collector, requiring explicit log service configuration, and propagates log client configuration throughout the stack.
• Updates test and proptest harness to account for the log API changes.
• Minor: NOP implementations of new endpoint in Go (for go log compatibility), proto/plumbing changes, and CI/test/plumbing fixes.

Affected Areas:
• Distributed garbage collection (Rust GC service, Rust log service, wal3)
• Proto API (logservice.proto, generated Rust/Go code, service interfaces)
• Cluster/k8s configuration (Helm charts, values.yaml, service configs)
• Rust log, log-service, garbage_collector, wal3, and associated glue/operators
• Documentation (wal3/README.md, comment updates)

This summary was automatically generated by @propel-code-bot

@rescrv rescrv requested a review from sanketkedia June 29, 2025 17:49
@rescrv rescrv force-pushed the rescrv/three-phase-gc branch from ba3fe76 to 9bbea49 Compare June 29, 2025 17:51
@rescrv rescrv force-pushed the rescrv/three-phase-gc-service branch from d5d740b to a1786d3 Compare June 29, 2025 17:51
Base automatically changed from rescrv/three-phase-gc to main June 30, 2025 16:06
@rescrv rescrv force-pushed the rescrv/three-phase-gc-service branch from a1786d3 to 5475ab1 Compare June 30, 2025 16:08
// This endpoint should route to the rust log service.
rpc ScrubLog(ScrubLogRequest) returns (ScrubLogResponse) {}
// This endpoint should route to the rust log service.
rpc GarbageCollectPhase2(GarbageCollectPhase2Request) returns (GarbageCollectPhase2Response) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we rename this so that the name explains what this service does in more details?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you name it? It's phase2 of the process known as garbage collect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from my understanding it's truncating the manifest? so maybe TruncateManifest

}

func (s *logServer) GarbageCollectPhase2(ctx context.Context, req *logservicepb.GarbageCollectPhase2Request) (res *logservicepb.GarbageCollectPhase2Response, err error) {
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this return nil or an actual response?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an empty response. I'm copying other empty responses here. Should we set the struct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it's implicitly converting nil to empty response in this case. We can set the struct for clarity or leave it as is for consistency, whichever you prefer

match writer.garbage_collect(&GarbageCollectionOptions::default(), Some(*minimum_log_offset_to_keep)).await {
Ok(()) => Ok(()),
match writer.garbage_collect_phase1(&GarbageCollectionOptions::default(), Some(*minimum_log_offset_to_keep)).await {
Ok(true) => {},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when is the writer expected to return true v/s false. And why do we skip phase2 and 3 in case it returns false?

Copy link
Contributor

@sanketkedia sanketkedia Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is phase1 not done by the log service? It can compute and write out the garbage and send a path to the garbage file as a response of the rpc?

tracing::error!("Unable to garbage collect log for collection [{collection_id}]: {err}");
return Err(DeleteUnusedLogsError::Gc(err));
};
if let Err(err) = writer.garbage_collect_phase3(&GarbageCollectionOptions::default()).await {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also would prefer to name these phases better instead of just phase1, 2 and 3

return Err(DeleteUnusedLogsError::Wal3(err));
}
};
if let Err(err) = logs.garbage_collect_phase2(*collection_id).await {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add comments on why phase3 is done by the GC service i.e. the argument of keeping a single service that deletes all the files

@rescrv rescrv force-pushed the rescrv/three-phase-gc-service branch from c8d817a to 19793f0 Compare June 30, 2025 20:43
@github-actions
Copy link

github-actions bot commented Jun 30, 2025

✅ The Helm chart's version was changed. Your changes to the chart will be published upon merge to main.

@rescrv rescrv merged commit 76de464 into main Jun 30, 2025
57 checks passed
@rescrv rescrv deleted the rescrv/three-phase-gc-service branch June 30, 2025 22:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants