GCU sort: batch leaf-list changes into single REPLACE move#4478
GCU sort: batch leaf-list changes into single REPLACE move#4478rookie-who wants to merge 1 commit intosonic-net:masterfrom
Conversation
|
/azp run |
|
|
@rimunagala This is complementary to your #4476. Your PR optimizes per-move cost (~2x for REMOVE). This one reduces move count from N to 1 for leaf-list changes. Combined, ACL port removal on 512-port systems goes from ~717s to ~1.4s. Would appreciate your review — this also intersects with the performance test in sonic-net/sonic-mgmt#24092. |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@rimunagala — adding some context on why batching leaf-list changes into a single REPLACE is safe and doesn't compromise the correctness guarantees that per-move validation provides. Why individual moves existGCU's DFS + per-move validation exists to discover safe ordering across cross-table dependencies. For example: Swapping the order fails validation (dangling reference). DFS tries different orderings until it finds one that validates at every step. Why leaf-lists don't need thisA YANG leaf-list is a single field containing an array of scalars: leaf-list ports {
type string;
}Config DB: Key properties:
Since every intermediate state (any subset of the target values) is valid, the N individual validations are redundant — they all pass. One REPLACE + one validation produces the same result. Safety netThe generator is conservative: if the single REPLACE fails validation (e.g., some YANG model does have a constraint we didn't anticipate), it falls through to the existing per-item move generators. The batch is an optimization, not a bypass. ImpactFor a 512-port ACL port removal: 256 moves × 2 loadData each = 512 loadData calls → 1 move × 2 loadData = 2 calls. At ~0.5s per loadData at 512 ports, that's ~250s → ~1s. |
|
@rookie-who Thanks for the detailed writeup — the core insight is solid and the arithmetic checks out. A few things worth tightening before implementation: What's confirmed correct:
Three things that need correction before implementation:
Bottom line: Don't rework — just tighten the scope. The ACL_TABLE/ports optimization is valid and the performance case is real. Replace "typically no constraints" with a YANG-model check requirement, fix the fallback description, and add the isolation boundary on the dependency claim. As written, someone could read this and implement a batch optimization that silently breaks ACL_TABLE_TYPE modifications or patches that combine port deletion with ACL unbinding. |
KVM Testbed Performance ResultsTested on SONiC VS (32-port Force10-S6000 KVM, master.1094504-eae46f301) using the perf smoke test from sonic-mgmt PR #24092. Test: Remove 16 ports from a 32-port ACL table via
Key result: #4478 delivers a 4.5x speedup by batching 16 individual REMOVE moves into 1 bulk REPLACE. On the 32-port VS, it nearly passes the budget alone (4.6s vs 4s). Combined with #4476 hash cache (which reduces redundant loadData within the bulk move validation), it comfortably passes. Extrapolation for 512-port devices:
Test PR: sonic-net/sonic-mgmt#24092 |
|
@rimunagala — excellent review. All three points are valid. Here's how I'll address them: 1. min-elements check ✅ AgreedYou're right that 2. Fallback mechanism ✅ AgreedYou're correct that the "falls through" isn't an explicit fallback — it's a consequence of the DFS candidate queue having both bulk and individual moves. The BulkLeafListMoveGenerator is registered as non-extendable and runs before BulkKeyLevelMoveGenerator in 3. Isolation boundary ✅ AgreedThe batch is safe for the leaf-list itself, but you're right that cross-table ordering still matters at the patch level. If the same patch removes PORT entries that the leaf-list references, the DFS still needs to order: clear ports from ACL → delete PORT. The batch REPLACE for the leaf-list produces the correct final list value, and the DFS handles the cross-table ordering around it. But the documentation should be explicit about this scope. I'll push an update with these corrections. The core optimization (and the 4.5x KVM result) remains valid — these are documentation/guardrail improvements. |
|
@rookie-who I think this change should be validated with a proper repro across different GCU operation types and scales before we conclude it is safe. Can we please verify this with:
And for each of the above, test with different port-set sizes, for example:
Similarly for
I’m suggesting this because behavior can differ significantly depending on both the operation type and the scale of the update, so a repro-based validation would help confirm this does not introduce issues in practical scenarios. |
|
@rimunagala Good call — comprehensive testing across operation types and scales is the right approach. Let me run the matrix. I'll test on the KVM testbed (32-port Force10-S6000 VS) with the following matrix:
For each: baseline (no PRs) vs #4478 alone vs #4476+#4478, measuring wall time, move count, and loadData calls. Will post results shortly. |
|
@rookie-who — reviewed Bug —
|
5f99c08 to
d8a4aaa
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@rimunagala Thanks for the thorough review. All three items addressed:
Force-pushed: |
|
@rookie-who The PR still shows only the old commit. Can you confirm the force-push is on branch rookie-who:fix/gcu-batch-leaf-list-moves and share the new HEAD SHA / updated diff? |
|
@rookie-who Thanks, I reviewed the updated version in commit For the updated code in 1.
|
d8a4aaa to
ca9fe9d
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@rimunagala Addressed both items: 1.
|
|
Hey @rookie-who , Two small things before approving, Please validate each point before making any changes — only fix what is actually broken or required:
Code
Code
|
ca9fe9d to
4faf2ee
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Add BulkLeafListMoveGenerator that produces a single REPLACE move for leaf-list fields whose items differ between current and target configs, instead of decomposing into N individual REMOVE/ADD moves. This is registered as a non-extendable generator (tried before individual moves in DFS). If validation fails, DFS falls through to per-item moves. Impact: For a 512-port ACL table where half the ports are removed, this reduces ~256 individual moves (each triggering 2 loadData calls at ~1.4s each = ~717s) to 1 move (1 loadData = ~1.4s). Conservative scope: - Only handles leaf-lists (lists of scalars, not lists of dicts) - Only replaces lists that exist in both current and target - Falls through to individual moves if the bulk replace fails validation Signed-off-by: rookie-who <rookie-who@users.noreply.github.com>
4faf2ee to
1e69b30
Compare
|
/azp run |
Pushed |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Description
Adds
BulkLeafListMoveGenerator— a non-extendable move generator that produces a single REPLACE move for leaf-list fields whose items differ between current and target configs.Problem
When removing N ports from an ACL table, GCU decomposes the change into N individual REMOVE moves. Each move is validated independently, triggering 2
loadData()calls per move. On a 512-port system:loadData()× ~1.4s/load = ~717 secondsSolution
Instead of N individual moves, generate one REPLACE of the entire leaf-list. DFS tries non-extendable generators first, so if the bulk REPLACE validates (one
loadData()), we skip N-1 moves entirely:loadData()× ~1.4s/load = ~2.8 seconds~250x improvement for the ACL port removal scenario.
Conservative scope
How it works
BulkLeafListMoveGeneratortraverses the config tree comparing current vs targetJsonMoveGroupwith a single REPLACE move for the whole listSortAlgorithmFactory.create()as a non-extendable generatorRelated PRs
Why this is complementary
PRs #4466 and #4476 optimize per-move cost. This PR reduces move count from N to 1. The approaches compose: even if a future PR cuts loadData cost by 50%, going from 512 moves to 1 move is still the dominant optimization.
Reproducer: https://github.com/rookie-who/sonic-gcu-issues