Skip to content

OCPBUGS-79354: Canonicalize all IP Address comparisons#1577

Open
barbacbd wants to merge 6 commits intoopenshift:mainfrom
barbacbd:OCPBUGS-79354
Open

OCPBUGS-79354: Canonicalize all IP Address comparisons#1577
barbacbd wants to merge 6 commits intoopenshift:mainfrom
barbacbd:OCPBUGS-79354

Conversation

@barbacbd
Copy link
Copy Markdown

@barbacbd barbacbd commented Mar 24, 2026

  1. pkg/dnshelpers/util.go

    • Added CanonicalizeIP() function (lines 37-46): Converts IP addresses to canonical form using net.ParseIP().String()
      • Normalizes IPv6 addresses (e.g., 2001:0db8::1 → 2001:db8::1)
      • Auto-converts IPv4-mapped IPv6 to IPv4 (e.g., ::ffff:192.0.2.1 → 192.0.2.1)
      • Returns original string if invalid IP
    1. pkg/operator/ceohelpers/common.go
    • Updated 4 functions to use canonicalization:
      • MemberToNodeInternalIP() - canonicalizes IP extracted from etcd member
      • IndexMachinesByNodeInternalIP() - canonicalizes map keys
      • FindMachineByNodeInternalIP() - canonicalizes both sides of comparison
      • VotingMemberIPListSet() - canonicalizes IPs before inserting into set
      • MemberIPSetFromConfigMap() - canonicalizes IPs from configmap
  2. pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go

  • Updated hasInternalIP() function (line 528): Canonicalizes both machine and member IPs before comparison
  1. pkg/operator/machinedeletionhooks/machinedeletionhooks.go
  • Updated line 149: Canonicalizes machine address before set lookup
  1. pkg/operator/ceohelpers/common_test.go
  • Added comprehensive TestCanonicalizeIP() with 12 test cases covering:
    • IPv4 addresses
    • IPv6 compressed and full forms
    • IPv6 with leading zeros and mixed case
    • IPv4-mapped IPv6 conversion
    • Loopback addresses
    • Invalid IP handling

Impact

This fixes potential bugs where IPv6 addresses in different representations (e.g., 2001:db8::1 vs 2001:0db8:0000:0000:0000:0000:0000:0001) would fail to match, causing:

  • Machine lookups to fail
  • Member removal decisions to be incorrect
  • Deletion hooks to be misapplied

Note: This PR is opened with claude based on findings by the openshift-installer team. Review of claude's work and instructions given by @barbacbd.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 24, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@barbacbd: This pull request references Jira Issue OCPBUGS-79354, which is invalid:

  • expected the bug to target either version "4.22." or "openshift-4.22.", but it targets "4.23" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

  1. pkg/operator/ceohelpers/common.go
  • Added CanonicalizeIP() function (lines 37-46): Converts IP addresses to canonical form using net.ParseIP().String()
    • Normalizes IPv6 addresses (e.g., 2001:0db8::1 → 2001:db8::1)
    • Auto-converts IPv4-mapped IPv6 to IPv4 (e.g., ::ffff:192.0.2.1 → 192.0.2.1)
    • Returns original string if invalid IP
  • Updated 4 functions to use canonicalization:
    • MemberToNodeInternalIP() - canonicalizes IP extracted from etcd member
    • IndexMachinesByNodeInternalIP() - canonicalizes map keys
    • FindMachineByNodeInternalIP() - canonicalizes both sides of comparison
    • VotingMemberIPListSet() - canonicalizes IPs before inserting into set
    • MemberIPSetFromConfigMap() - canonicalizes IPs from configmap
  1. pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go
  • Updated hasInternalIP() function (line 528): Canonicalizes both machine and member IPs before comparison
  1. pkg/operator/machinedeletionhooks/machinedeletionhooks.go
  • Updated line 149: Canonicalizes machine address before set lookup
  1. pkg/operator/ceohelpers/common_test.go
  • Added comprehensive TestCanonicalizeIP() with 12 test cases covering:
    • IPv4 addresses
    • IPv6 compressed and full forms
    • IPv6 with leading zeros and mixed case
    • IPv4-mapped IPv6 conversion
    • Loopback addresses
    • Invalid IP handling

Impact

This fixes potential bugs where IPv6 addresses in different representations (e.g., 2001:db8::1 vs 2001:0db8:0000:0000:0000:0000:0000:0001) would fail to match, causing:

  • Machine lookups to fail
  • Member removal decisions to be incorrect
  • Deletion hooks to be misapplied

Note: This PR is opened with claude based on findings by the openshift-installer team. Review of claude's work and instructions given by @barbacbd.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Added an exported IP canonicalization function and replaced many raw IP string comparisons with canonicalized forms across bootstrap, operator controllers, TLS helper, and tests; bootstrap excluded-address logic now parses excluded entries before comparing canonicalized IPs.

Changes

Cohort / File(s) Summary
Canonicalization helper
pkg/dnshelpers/util.go
Added exported CanonicalizeIP(string) string that returns net.ParseIP(ip).String() for valid IPs, otherwise returns the original input. Also canonicalizes selected node-internal IPs in GetPreferredInternalIPAddressForNodeName.
Operator helpers & tests
pkg/operator/ceohelpers/common.go, pkg/operator/ceohelpers/common_test.go
Replaced raw IP strings with dnshelpers.CanonicalizeIP(...) in member/node helper functions and indexing; added TestCanonicalizeIP table-driven unit tests covering IPv4, IPv6 variants, IPv4-mapped IPv6, loopback, and invalid inputs.
Bootstrap IP validation
pkg/cmd/render/bootstrap_ip.go
AddressNotIn computes candidate canonical IP and parses each excluded entry with net.ParseIP; an exclusion matches only if the excluded entry parses and its canonical string equals the candidate.
Cluster member removal controller
pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go
IP comparisons in member removal logic now canonicalize candidate and stored internal IPs before matching (e.g., removeMemberWithoutMachine, getNodeForMember, hasInternalIP).
Machine deletion hooks
pkg/operator/machinedeletionhooks/machinedeletionhooks.go
When checking pending-deletion Machine addresses, Machine.Status.Addresses[].Address values are canonicalized before membership checks against current etcd member IP set.
TLS helper hostnames
pkg/tlshelpers/tlshelpers.go
getServerHostNames now canonicalizes nodeInternalIPs into a new slice and appends the canonical forms to the certificate SAN host list.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from dusk125 and hasbro17 March 24, 2026 18:00
@barbacbd
Copy link
Copy Markdown
Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 24, 2026
@openshift-ci-robot
Copy link
Copy Markdown

@barbacbd: This pull request references Jira Issue OCPBUGS-79354, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go`:
- Around line 482-483: In removeMemberWithoutMachine(), canonicalize
potentialMemberToRemoveIP before any lookups and comparisons: call
ceohelpers.CanonicalizeIP(potentialMemberToRemoveIP) into a local (e.g.,
canonicalMemberIP) and use that value when calling c.getMachineForMember(),
c.getNodeForMember(), and when comparing against memberIP (replace raw memberIP
!= potentialMemberToRemoveIP with memberIP != canonicalMemberIP) so IPv6
variants match consistently with MemberIPSetFromConfigMap().

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4a32073e-78b4-4510-885f-d63c0ce3ce25

📥 Commits

Reviewing files that changed from the base of the PR and between b91eb44 and 4e85398.

📒 Files selected for processing (5)
  • pkg/cmd/render/bootstrap_ip.go
  • pkg/operator/ceohelpers/common.go
  • pkg/operator/ceohelpers/common_test.go
  • pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go
  • pkg/operator/machinedeletionhooks/machinedeletionhooks.go

@barbacbd barbacbd force-pushed the OCPBUGS-79354 branch 2 times, most recently from 82ef03e to 49cd27e Compare March 24, 2026 20:40
@barbacbd
Copy link
Copy Markdown
Author

Before the changes the following information could be interpretted from the logs during failure

Dual-Stack Certificate Regeneration Loop (Recent)                                                                                               
                                                                                                                                                     
  "fd20:4c1:4cdd::2:0:0" are existing and not required,                                                                                              
  "fd20:4c1:4cdd:0:0:2:0:0" are required and not existing                                                                                            
  - IPv6 addresses are changing format (compressed vs expanded notation)                                                                             
  - Certificates keep getting regenerated with new IP formats                                                                                        
  - This suggests an IPv6 normalization issue in the dual-stack implementation

This shows that the short hand notation are existing and being compared to long format. This is causing issues when directly comparing ip addresses. Now the comparison is between two canonicalized formats.

Result:
Cluster creation succeeds, but more importantly was the etcd operator succeeding in coming up and no longer in a loop.

@barbacbd
Copy link
Copy Markdown
Author

/retest-required

@lance5890
Copy link
Copy Markdown
Contributor

Logic looks good. Please check the Unit Test and Verify failures,otherwise LGTM from my side

@dusk125
Copy link
Copy Markdown
Contributor

dusk125 commented Mar 25, 2026

/retest-required

@lance5890
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 26, 2026
@lance5890
Copy link
Copy Markdown
Contributor

/lgtm cancel
add some comments that should be addressed firstly

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 26, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 26, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: barbacbd
Once this PR has been reviewed and has the lgtm label, please assign dusk125 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

return nil, fmt.Errorf("failed to get internal IP for node: %w", err)
}
if memberInternalIP == internalNodeIP {
if dnshelpers.CanonicalizeIP(memberInternalIP) == dnshelpers.CanonicalizeIP(internalNodeIP) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internalNodeIP at this point should be canonical already, right?

Comment on lines +63 to +66
canonicalIPs := make([]string, 0, len(nodeInternalIPs))
for _, ip := range nodeInternalIPs {
canonicalIPs = append(canonicalIPs, dnshelpers.CanonicalizeIP(ip))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might get a bit tricky on upgrades. Imagine we already have an etcd cluster running with a not-canonical IP as its member address coming from the node internal IP.

When we change the IPs here, we might create a new certificate that would only work with canonical IPs. I'm not sure how smart TLS is, maybe this is not going to be a problem, but it certainly doesn't hurt to add both canonical and "raw" representation here.
The go code seems to kinda get this with the ::1 IP already, definitely worth adding a testcase here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After adjusting the function to include both forms of ip addresses (for backwards compatibility) I ran an analysis with Claude to see if I missed anything. Here are the results:

Key Findings:

  1. The code already handles both forms (lines 64-66 in tlshelpers.go):
    completeIPList = append(completeIPList, dnshelpers.CanonicalizeIP(ip), ip)
  2. This adds both canonical (::1) and raw (0:0:0:0:0:0:0:1) forms for each node IP.
  3. The comment about "::1" is correct but was unclear: When IP addresses are added to TLS certificates, they're parsed to bytes. Both "::1" and
    "0:0:0:0:0:0:0:1" parse to the same IP bytes, so TLS verification works with either form even if only one is listed. I've updated the comment to
    explain this better.
  4. Tests prove it works: The new tests demonstrate that:
    - IPv6 addresses in different formats (compressed/expanded) are recognized as the same IP
    - A certificate with "::1" accepts connections to "0:0:0:0:0:0:0:1"
    - The code includes both forms for node IPs for maximum compatibility

@tjungblu Please see the changes .

return err
}
if memberIP != potentialMemberToRemoveIP {
if dnshelpers.CanonicalizeIP(memberIP) != dnshelpers.CanonicalizeIP(potentialMemberToRemoveIP) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also here, memberIP is canonical already.

It seems we also just work-around the fact that the node always returns the non-canonical IP address. Maybe we can have a custom node lister or extended struct that just directly canonicalizes the IP?

That way we don't have to litter these checks everywhere and I'm certain there are a few places we have missed here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If an existing etcd member was started with a non-canonical IP in its PeerURL (e.g., https://[0:0:0:0:0:0:0:1]:2380), then:

  • The etcd-endpoints configmap stores "0:0:0:0:0:0:0:1" (non-canonical)
  • memberIP from the member list is "::1" (canonical)
  • Without canonicalizing both sides, they wouldn't match!

Recommendation:

Keep the canonicalization on both sides. While canonicalizing memberIP is redundant, it's:

  1. Harmless - CanonicalizeIP(canonical) = canonical
  2. Defensive - Protects against future changes to MemberToNodeInternalIP
  3. Clear intent - Makes it obvious we're doing canonical IP comparison
  4. Consistent - Same pattern used in hasInternalIP() at line 529

I was inclined to think of this as harmless too just to be explicit with it too. I don't know your opinions on it but this was the claude analysis.

1. pkg/dnshelpers/util.go
  - Added CanonicalizeIP() function (lines 37-46): Converts IP addresses to canonical form using net.ParseIP().String()
    - Normalizes IPv6 addresses (e.g., 2001:0db8::1 → 2001:db8::1)
    - Auto-converts IPv4-mapped IPv6 to IPv4 (e.g., ::ffff:192.0.2.1 → 192.0.2.1)
    - Returns original string if invalid IP

2. pkg/operator/ceohelpers/common.go
  - Updated 4 functions to use canonicalization:
    - MemberToNodeInternalIP() - canonicalizes IP extracted from etcd member
    - IndexMachinesByNodeInternalIP() - canonicalizes map keys
    - FindMachineByNodeInternalIP() - canonicalizes both sides of comparison
    - VotingMemberIPListSet() - canonicalizes IPs before inserting into set
    - MemberIPSetFromConfigMap() - canonicalizes IPs from configmap

  2. pkg/operator/clustermemberremovalcontroller/clustermemberremovalcontroller.go

  - Updated hasInternalIP() function (line 528): Canonicalizes both machine and member IPs before comparison

  3. pkg/operator/machinedeletionhooks/machinedeletionhooks.go

  - Updated line 149: Canonicalizes machine address before set lookup

  4. pkg/operator/ceohelpers/common_test.go

  - Added comprehensive TestCanonicalizeIP() with 12 test cases covering:
    - IPv4 addresses
    - IPv6 compressed and full forms
    - IPv6 with leading zeros and mixed case
    - IPv4-mapped IPv6 conversion
    - Loopback addresses
    - Invalid IP handling

  Impact

  This fixes potential bugs where IPv6 addresses in different representations (e.g., 2001:db8::1 vs 2001:0db8:0000:0000:0000:0000:0000:0001) would fail to match, causing:
  - Machine lookups to fail
  - Member removal decisions to be incorrect
  - Deletion hooks to be misapplied

Note: This PR is opened with claude based on findings by the openshift-installer team. Review of claude's work and instructions given by @barbacbd.
},
{
name: "IPv4-mapped IPv6 address",
excludedIPs: []string{"::ffff:192.168.1.1"},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be 192.0.2.1 ? the excludedIPs is totally equal to testAddr

as u explained in the comment:

Converting IPv4-mapped IPv6 addresses to IPv4 (e.g., "::ffff:192.0.2.1" → "192.0.2.1")

- Add unit tests for the AddressNotIn() function.
  Validates that the function includes both canonical and raw IP forms in the certificate

All tests verify that static entries (localhost, 127.0.0.1, ::1, etcd service names) are always present.

2. TestIPCanonicalizeInCertificate (5 test cases)

  Proves the critical TLS behavior: IP addresses in certificates are compared by their byte representation, not string representation.
1. Comprehensive IPv6 Coverage
    - Compressed (::1)
    - Expanded (0:0:0:0:0:0:0:1)
    - With leading zeros (2001:0db8:0000:...)
    - IPv4-mapped (::ffff:192.0.2.1)
  2. Dual-Stack Testing
    - Both IPv4 and IPv6 in same node
    - Ensures both are handled correctly
  3. Edge Cases
    - Empty address lists
    - Non-internal address types
    - Invalid data
  4. Real-World Validation
    - Actual etcd member removal scenario
    - Proves the helpers solve the original problem

This test suite proves that the helpers correctly handle the core issue: etcd clusters with non-canonical IPv6 addresses won't break when
  certificates are regenerated, because both canonical and raw forms are now properly compared.

Tests are generated with claude and reviewed by @barbacbd
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Mar 31, 2026

@barbacbd: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/unit 6b5113b link true /test unit
ci/prow/e2e-gcp-operator 6b5113b link true /test e2e-gcp-operator
ci/prow/verify 6b5113b link true /test verify
ci/prow/e2e-aws-ovn-single-node 6b5113b link true /test e2e-aws-ovn-single-node

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants