OCPBUGS-79354: Canonicalize all IP Address comparisons#1577
OCPBUGS-79354: Canonicalize all IP Address comparisons#1577barbacbd wants to merge 6 commits intoopenshift:mainfrom
Conversation
|
@barbacbd: This pull request references Jira Issue OCPBUGS-79354, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdded an exported IP canonicalization function and replaced many raw IP string comparisons with canonicalized forms across bootstrap, operator controllers, TLS helper, and tests; bootstrap excluded-address logic now parses excluded entries before comparing canonicalized IPs. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/jira refresh |
|
@barbacbd: This pull request references Jira Issue OCPBUGS-79354, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go`:
- Around line 482-483: In removeMemberWithoutMachine(), canonicalize
potentialMemberToRemoveIP before any lookups and comparisons: call
ceohelpers.CanonicalizeIP(potentialMemberToRemoveIP) into a local (e.g.,
canonicalMemberIP) and use that value when calling c.getMachineForMember(),
c.getNodeForMember(), and when comparing against memberIP (replace raw memberIP
!= potentialMemberToRemoveIP with memberIP != canonicalMemberIP) so IPv6
variants match consistently with MemberIPSetFromConfigMap().
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4a32073e-78b4-4510-885f-d63c0ce3ce25
📒 Files selected for processing (5)
pkg/cmd/render/bootstrap_ip.gopkg/operator/ceohelpers/common.gopkg/operator/ceohelpers/common_test.gopkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.gopkg/operator/machinedeletionhooks/machinedeletionhooks.go
pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go
Outdated
Show resolved
Hide resolved
82ef03e to
49cd27e
Compare
|
Before the changes the following information could be interpretted from the logs during failure This shows that the short hand notation are existing and being compared to long format. This is causing issues when directly comparing ip addresses. Now the comparison is between two canonicalized formats. Result: |
|
/retest-required |
|
Logic looks good. Please check the Unit Test and Verify failures,otherwise LGTM from my side |
|
/retest-required |
|
/lgtm |
|
/lgtm cancel |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: barbacbd The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| return nil, fmt.Errorf("failed to get internal IP for node: %w", err) | ||
| } | ||
| if memberInternalIP == internalNodeIP { | ||
| if dnshelpers.CanonicalizeIP(memberInternalIP) == dnshelpers.CanonicalizeIP(internalNodeIP) { |
There was a problem hiding this comment.
internalNodeIP at this point should be canonical already, right?
pkg/tlshelpers/tlshelpers.go
Outdated
| canonicalIPs := make([]string, 0, len(nodeInternalIPs)) | ||
| for _, ip := range nodeInternalIPs { | ||
| canonicalIPs = append(canonicalIPs, dnshelpers.CanonicalizeIP(ip)) | ||
| } |
There was a problem hiding this comment.
this might get a bit tricky on upgrades. Imagine we already have an etcd cluster running with a not-canonical IP as its member address coming from the node internal IP.
When we change the IPs here, we might create a new certificate that would only work with canonical IPs. I'm not sure how smart TLS is, maybe this is not going to be a problem, but it certainly doesn't hurt to add both canonical and "raw" representation here.
The go code seems to kinda get this with the ::1 IP already, definitely worth adding a testcase here.
There was a problem hiding this comment.
After adjusting the function to include both forms of ip addresses (for backwards compatibility) I ran an analysis with Claude to see if I missed anything. Here are the results:
Key Findings:
- The code already handles both forms (lines 64-66 in tlshelpers.go):
completeIPList = append(completeIPList, dnshelpers.CanonicalizeIP(ip), ip) - This adds both canonical (::1) and raw (0:0:0:0:0:0:0:1) forms for each node IP.
- The comment about "::1" is correct but was unclear: When IP addresses are added to TLS certificates, they're parsed to bytes. Both "::1" and
"0:0:0:0:0:0:0:1" parse to the same IP bytes, so TLS verification works with either form even if only one is listed. I've updated the comment to
explain this better. - Tests prove it works: The new tests demonstrate that:
- IPv6 addresses in different formats (compressed/expanded) are recognized as the same IP
- A certificate with "::1" accepts connections to "0:0:0:0:0:0:0:1"
- The code includes both forms for node IPs for maximum compatibility
@tjungblu Please see the changes .
| return err | ||
| } | ||
| if memberIP != potentialMemberToRemoveIP { | ||
| if dnshelpers.CanonicalizeIP(memberIP) != dnshelpers.CanonicalizeIP(potentialMemberToRemoveIP) { |
There was a problem hiding this comment.
also here, memberIP is canonical already.
It seems we also just work-around the fact that the node always returns the non-canonical IP address. Maybe we can have a custom node lister or extended struct that just directly canonicalizes the IP?
That way we don't have to litter these checks everywhere and I'm certain there are a few places we have missed here.
There was a problem hiding this comment.
If an existing etcd member was started with a non-canonical IP in its PeerURL (e.g., https://[0:0:0:0:0:0:0:1]:2380), then:
- The etcd-endpoints configmap stores "0:0:0:0:0:0:0:1" (non-canonical)
- memberIP from the member list is "::1" (canonical)
- Without canonicalizing both sides, they wouldn't match!
Recommendation:
Keep the canonicalization on both sides. While canonicalizing memberIP is redundant, it's:
- Harmless - CanonicalizeIP(canonical) = canonical
- Defensive - Protects against future changes to MemberToNodeInternalIP
- Clear intent - Makes it obvious we're doing canonical IP comparison
- Consistent - Same pattern used in hasInternalIP() at line 529
I was inclined to think of this as harmless too just to be explicit with it too. I don't know your opinions on it but this was the claude analysis.
1. pkg/dnshelpers/util.go
- Added CanonicalizeIP() function (lines 37-46): Converts IP addresses to canonical form using net.ParseIP().String()
- Normalizes IPv6 addresses (e.g., 2001:0db8::1 → 2001:db8::1)
- Auto-converts IPv4-mapped IPv6 to IPv4 (e.g., ::ffff:192.0.2.1 → 192.0.2.1)
- Returns original string if invalid IP
2. pkg/operator/ceohelpers/common.go
- Updated 4 functions to use canonicalization:
- MemberToNodeInternalIP() - canonicalizes IP extracted from etcd member
- IndexMachinesByNodeInternalIP() - canonicalizes map keys
- FindMachineByNodeInternalIP() - canonicalizes both sides of comparison
- VotingMemberIPListSet() - canonicalizes IPs before inserting into set
- MemberIPSetFromConfigMap() - canonicalizes IPs from configmap
2. pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go
- Updated hasInternalIP() function (line 528): Canonicalizes both machine and member IPs before comparison
3. pkg/operator/machinedeletionhooks/machinedeletionhooks.go
- Updated line 149: Canonicalizes machine address before set lookup
4. pkg/operator/ceohelpers/common_test.go
- Added comprehensive TestCanonicalizeIP() with 12 test cases covering:
- IPv4 addresses
- IPv6 compressed and full forms
- IPv6 with leading zeros and mixed case
- IPv4-mapped IPv6 conversion
- Loopback addresses
- Invalid IP handling
Impact
This fixes potential bugs where IPv6 addresses in different representations (e.g., 2001:db8::1 vs 2001:0db8:0000:0000:0000:0000:0000:0001) would fail to match, causing:
- Machine lookups to fail
- Member removal decisions to be incorrect
- Deletion hooks to be misapplied
Note: This PR is opened with claude based on findings by the openshift-installer team. Review of claude's work and instructions given by @barbacbd.
pkg/cmd/render/bootstrap_ip_test.go
Outdated
| }, | ||
| { | ||
| name: "IPv4-mapped IPv6 address", | ||
| excludedIPs: []string{"::ffff:192.168.1.1"}, |
There was a problem hiding this comment.
should this be 192.0.2.1 ? the excludedIPs is totally equal to testAddr
as u explained in the comment:
Converting IPv4-mapped IPv6 addresses to IPv4 (e.g., "::ffff:192.0.2.1" → "192.0.2.1")
- Add unit tests for the AddressNotIn() function.
Validates that the function includes both canonical and raw IP forms in the certificate All tests verify that static entries (localhost, 127.0.0.1, ::1, etcd service names) are always present. 2. TestIPCanonicalizeInCertificate (5 test cases) Proves the critical TLS behavior: IP addresses in certificates are compared by their byte representation, not string representation.
1. Comprehensive IPv6 Coverage
- Compressed (::1)
- Expanded (0:0:0:0:0:0:0:1)
- With leading zeros (2001:0db8:0000:...)
- IPv4-mapped (::ffff:192.0.2.1)
2. Dual-Stack Testing
- Both IPv4 and IPv6 in same node
- Ensures both are handled correctly
3. Edge Cases
- Empty address lists
- Non-internal address types
- Invalid data
4. Real-World Validation
- Actual etcd member removal scenario
- Proves the helpers solve the original problem
This test suite proves that the helpers correctly handle the core issue: etcd clusters with non-canonical IPv6 addresses won't break when
certificates are regenerated, because both canonical and raw forms are now properly compared.
Tests are generated with claude and reviewed by @barbacbd
|
@barbacbd: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
pkg/dnshelpers/util.go
pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go
Impact
This fixes potential bugs where IPv6 addresses in different representations (e.g., 2001:db8::1 vs 2001:0db8:0000:0000:0000:0000:0000:0001) would fail to match, causing:
Note: This PR is opened with claude based on findings by the openshift-installer team. Review of claude's work and instructions given by @barbacbd.