-
Notifications
You must be signed in to change notification settings - Fork 33
feat: azure maintenance events #382
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: azure maintenance events #382
Conversation
WalkthroughAdds Azure CSP support: new Changes
Sequence Diagram(s)sequenceDiagram
participant Main as Monitor Init
participant AzureClient as Azure Client
participant IMDS as Instance Metadata
participant Creds as Azure Credential
participant K8s as Kubernetes API
participant Updates as Azure Updates API
participant Chan as Event Channel
Main->>AzureClient: NewClient(ctx, AzureConfig)
AzureClient->>Creds: DefaultAzureCredential()
alt subscription in config
AzureClient-->>AzureClient: use config.SubscriptionID
else
AzureClient->>IMDS: GET /metadata/instance?api-version=...
IMDS-->>AzureClient: { "subscriptionId": "..." }
end
AzureClient->>Updates: New UpdatesClient(subscriptionID)
AzureClient->>K8s: Build Kubernetes client / informer
AzureClient-->>Main: Client ready
Main->>AzureClient: StartMonitoring(ctx, eventChan)
loop every PollingIntervalSeconds
AzureClient->>K8s: List Nodes
par per-node
AzureClient->>Updates: ListUpdates(resourceID)
Updates-->>AzureClient: []Update
AzureClient->>AzureClient: filter & normalize
AzureClient->>Chan: send MaintenanceEvent
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain main module or its selected dependencies" Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
39-40: CSPAzure enum addition is good; fixNodeNameBSON tag typoThe new
CSPAzureconstant wiring looks correct and consistent with the other CSP values.However,
NodeName’s BSON tag has a typo:NodeName string `json:"nodeName,omitempty" bson:"nodeName,omitemtpy"`Because
omitemtpyis not recognized, emptynodeNamevalues will not be omitted as intended and may affect how events are stored/queried.Consider correcting it to:
- NodeName string `json:"nodeName,omitempty" bson:"nodeName,omitemtpy"` + NodeName string `json:"nodeName,omitempty" bson:"nodeName,omitempty"`Also applies to: 55-59
🧹 Nitpick comments (6)
health-monitors/csp-health-monitor/cmd/csp-health-monitor/main.go (1)
37-43: Azure monitor init path is consistent; consider doc/log clarifications and confirm CSP priorityThe Azure branch in
initActiveMonitoris wired consistently with GCP/AWS: it respectscfg.Azure.Enabled, incrementsCSPMonitorErrorson init failure, logs, and returns the Azure monitor. A few minor points to consider:
- The comment above
initActiveMonitorstill says “GCP/AWS”; it would be clearer to mention Azure as well.- CSP selection remains “first enabled wins” with priority
GCP -> AWS -> Azure. That matches the existing pattern but means Azure is ignored if GCP or AWS are enabled; confirm this is the intended behavior.- The Azure init log uses
cfg.Azure.SubscriptionID, which will be empty if the subscription ID is discovered via IMDS. If you care about logging the effective subscription ID, you may want to expose/log it from the Azure client instead.None of these are blockers, but updating them would improve clarity and observability.
Also applies to: 218-220, 260-279
health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go (1)
1-181: Azure polling tests are well‑structured and exercise key pathsBoth tests use the k8s and Azure fakes effectively to cover the “no updates” and “single pending update” cases, and they validate the important fields on the emitted
MaintenanceEvent(CSP, cluster, node, status, type, and non‑emptyEventID). This gives good confidence in the Azure polling + normalization flow. If you find yourself adding more scenarios later, you might consider extracting a small helper to build the fake client setup, but it’s fine as‑is.health-monitors/csp-health-monitor/pkg/event/azure_normalizer_test.go (1)
28-173: Azure normalizer tests validate key paths; consider extending for additional statusesThe two tests exercise the main happy paths (Pending + InProgress with impact) and validate CSP, resource fields, status mappings, metadata, and EventID format; they align with the current Normalize implementation. It would be useful to add cases for
Completedand retry-related statuses (and perhapsImpactTypeNone) to lock in the mapping behavior for those edge cases as well.health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (1)
52-95: Event construction and status mappings look good; double‑check EventID semantics vs provider IDsThe MaintenanceEvent construction and
mapAzureStatusToInternal/mapAzureStatusToCSPStatuslogic look reasonable, and the tests cover the Pending/InProgress cases. One thing to re‑evaluate isEventID: it’s currently a syntheticazure-<resourceGroup>-<vmName>-<timestamp>string, while in the AWS pathEventIDis intentionally set to the provider entity ARN and reused downstream as the stable resource identifier. If any consumer assumesEventIDis a provider‑native identifier, you may want to instead use the Azure resource ID or a stable update identifier (and move this synthetic value intoMetadataif still needed). Based on learnings.Also applies to: 100-118
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (2)
111-135: Guard StartMonitoring against invalid polling intervals
time.NewTicker(time.Duration(c.config.PollingIntervalSeconds) * time.Second)will panic ifPollingIntervalSeconds <= 0. Today config validation should prevent that, butClientcan be constructed with arbitraryAzureConfig(e.g., tests or future callers) and would panic on a bad value. It’s safer to clamp or default the interval (and log a warning) before creating the ticker so the client fails gracefully instead of panicking if an invalid interval slips through.
157-171: Limit per‑node goroutine fan‑out when polling Azure updates
pollForMaintenanceEventsspawns one goroutine per node, and each goroutine pages through the Azure Updates API. On large clusters this can create significant parallel load against Azure and the Kubernetes API, and rely onUpdatesClientbeing fully concurrency‑safe. Consider introducing a bounded worker pool or semaphore (e.g., max N concurrent nodes per poll) to cap concurrency while still keeping per‑node isolation.Also applies to: 190-216
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
health-monitors/csp-health-monitor/go.sumis excluded by!**/*.sum
📒 Files selected for processing (10)
.gitignore(2 hunks)health-monitors/csp-health-monitor/cmd/csp-health-monitor/main.go(2 hunks)health-monitors/csp-health-monitor/go.mod(4 hunks)health-monitors/csp-health-monitor/pkg/config/config.go(4 hunks)health-monitors/csp-health-monitor/pkg/csp/azure/azure.go(1 hunks)health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go(1 hunks)health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go(1 hunks)health-monitors/csp-health-monitor/pkg/event/azure_normalizer_test.go(1 hunks)health-monitors/csp-health-monitor/pkg/event/normalizer.go(1 hunks)health-monitors/csp-health-monitor/pkg/model/maintenance_event.go(1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
health-monitors/csp-health-monitor/pkg/model/maintenance_event.gohealth-monitors/csp-health-monitor/pkg/event/azure_normalizer.gohealth-monitors/csp-health-monitor/pkg/csp/azure/azure.go
🧬 Code graph analysis (6)
health-monitors/csp-health-monitor/pkg/event/normalizer.go (2)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
CSPAzure(56-56)health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (1)
AzureNormalizer(26-26)
health-monitors/csp-health-monitor/cmd/csp-health-monitor/main.go (3)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)
NewClient(39-105)health-monitors/csp-health-monitor/pkg/metrics/metrics.go (1)
CSPMonitorErrors(62-68)health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
CSPAzure(56-56)
health-monitors/csp-health-monitor/pkg/event/azure_normalizer_test.go (3)
health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (1)
AzureNormalizer(26-26)health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
CSPAzure(56-56)data-models/pkg/protos/health_event.pb.go (1)
RecommendedAction_RESTART_VM(46-46)
health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (2)
health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
CSPAzure(56-56)data-models/pkg/protos/health_event.pb.go (1)
RecommendedAction_RESTART_VM(46-46)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (4)
health-monitors/csp-health-monitor/pkg/config/config.go (2)
AzureConfig(67-71)Config(39-48)health-monitors/csp-health-monitor/pkg/event/normalizer.go (2)
Normalizer(25-30)GetNormalizer(34-45)health-monitors/csp-health-monitor/pkg/metrics/metrics.go (4)
CSPMonitorErrors(62-68)CSPPollingDuration(39-46)CSPAPIErrors(47-53)CSPEventsReceived(32-38)health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (3)
CSPAzure(56-56)CSP(43-43)MaintenanceEvent(22-40)
health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go (4)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)
Client(28-36)health-monitors/csp-health-monitor/pkg/config/config.go (1)
AzureConfig(67-71)health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (1)
AzureNormalizer(26-26)health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
CSPAzure(56-56)
🔇 Additional comments (5)
.gitignore (2)
10-11: LGTM: Version manager support.Adding
mise.tomlto the ignore list is appropriate for projects using the mise version manager. This is a reasonable housekeeping addition.
456-456: No duplicate entry found—the review comment's concern is unfounded.The verification confirms that
platform-connectors/platform-connectorsappears only once in the.gitignorefile (at line 456). There is no duplicate entry elsewhere in the file. The review comment's assertion that this is a duplicate ignore entry is incorrect.Likely an incorrect or invalid review comment.
health-monitors/csp-health-monitor/go.mod (1)
10-11: Azure SDK/auth dependency additions look coherent; please verify versions/security externallyThe added Azure SDK/auth/jwt/browser dependencies and versions line up with the new Azure maintenance client usage and test fakes; they look reasonable and narrowly scoped to the Azure feature. Please just double‑check that these specific versions match your target Azure SDK minor line and have no known security advisories in your environment/tooling before merging.
Also applies to: 42-45, 81-81, 92-92, 97-97
health-monitors/csp-health-monitor/pkg/event/normalizer.go (1)
35-41: Azure normalizer wiring into factory looks correctAdding the
model.CSPAzurebranch that returns&AzureNormalizer{}is consistent with the existing GCP/AWS pattern and matches howNewClientrequests the Azure normalizer.health-monitors/csp-health-monitor/pkg/config/config.go (1)
39-48: Azure config wiring and CSP validation look consistentAzureConfig is correctly added to Config, and validateCSPConfig now enforces a minimum Azure polling interval and single-CSP enablement in line with existing GCP/AWS behavior. Just ensure the new
azure.*keys are covered in operator-facing config docs so users understand how to enable Azure monitoring.Also applies to: 66-71, 185-223
| go func(node v1.Node) { | ||
| defer wg.Done() | ||
|
|
||
| c.processNodeMaintenanceEvents(ctx, node, eventChan) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, I was under the impression that with the updates list api we would not need to query for each individual node. It seems that is not supported (we are back to the same original problem of having to query the api for each VM in the cluster). I looked into some other options to query for VM maintenance events at a cluster-level in azure, and it seems Azure resource graph: https://learn.microsoft.com/en-us/azure/virtual-machines/resource-graph-availability, this is the sdk: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resourcegraph/[email protected]#Client.Resources
Can this be used to query for the VM maintenance events? (we are doing something similar in GCP with the cloudlogging query)
aa3ff61 to
6244c8a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
health-monitors/csp-health-monitor/go.sumis excluded by!**/*.sum
📒 Files selected for processing (2)
.gitignore(2 hunks)health-monitors/csp-health-monitor/go.mod(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- .gitignore
🧰 Additional context used
📓 Path-based instructions (1)
**/go.mod
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Use
go.modfor each service as a separate Go module with semantic import versioning
Files:
health-monitors/csp-health-monitor/go.mod
🧠 Learnings (3)
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/go.mod : Use `go.mod` for each service as a separate Go module with semantic import versioning
Applied to files:
health-monitors/csp-health-monitor/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use `client-go` for Kubernetes API interactions in Go code
Applied to files:
health-monitors/csp-health-monitor/go.mod
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Keep Go dependencies minimal and up-to-date
Applied to files:
health-monitors/csp-health-monitor/go.mod
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (4)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (3)
195-221: Consider Azure Resource Graph for cluster-level queries.The current implementation queries maintenance updates for each VM individually, which may not scale well for large clusters. A previous review suggested using Azure Resource Graph to query maintenance events at the cluster level, similar to the GCP implementation's use of cloud logging queries.
See: https://learn.microsoft.com/en-us/azure/virtual-machines/resource-graph-availability
SDK: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resourcegraph/armresourcegraph
150-156: Use informer instead of direct API calls for listing nodes.Direct API calls to list nodes can overload the Kubernetes API server, especially in production environments. The coding guidelines specify using informers to cache and watch resources instead.
Refer to the AWS CSP monitor implementation for the recommended informer pattern.
284-319: Complete the IMDS hardening—add timeout, check all errors, validate response.While
NewRequestWithContextwas added (line 294), the remaining hardening items from the previous review are still missing:
- No timeout: HTTP client (line 292) can hang indefinitely outside Azure
- Ignored error: Line 294 discards the error from
NewRequestWithContext- No status validation: No check for
resp.StatusCodeafter line 302- Empty subscriptionId accepted: Line 318 returns empty string if JSON field is missing
Apply this diff to complete the hardening:
var PTransport = &http.Transport{Proxy: nil} - client := http.Client{Transport: PTransport} + client := http.Client{ + Transport: PTransport, + Timeout: 5 * time.Second, + } - req, _ := http.NewRequestWithContext(context.Background(), "GET", "http://169.254.169.254/metadata/instance", nil) + req, err := http.NewRequestWithContext(context.Background(), "GET", "http://169.254.169.254/metadata/instance", nil) + if err != nil { + return "", fmt.Errorf("failed to construct IMDS request: %w", err) + } req.Header.Add("Metadata", "True") q := req.URL.Query() q.Add("format", "json") q.Add("api-version", "2021-02-01") req.URL.RawQuery = q.Encode() resp, err := client.Do(req) if err != nil { return "", err } defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + return "", fmt.Errorf("unexpected IMDS status code: %d", resp.StatusCode) + } // now that we have the response get the subscription ID from it var result struct { Compute struct { SubscriptionID string `json:"subscriptionId"` } `json:"compute"` } if err := json.NewDecoder(resp.Body).Decode(&result); err != nil { return "", fmt.Errorf("failed to decode IMDS response: %w", err) } + if result.Compute.SubscriptionID == "" { + return "", fmt.Errorf("subscriptionId missing from IMDS response") + } return result.Compute.SubscriptionID, nilhealth-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go (1)
38-183: Replace fake Kubernetes client with envtest.Both test functions use
fakek8s.NewSimpleClientset, which violates the established coding guidelines and a previous review comment. Tests should useenvtestto provide a more realistic Kubernetes API server for controller testing.Based on coding guidelines and learnings, refer to the AWS CSP monitor tests for the recommended pattern.
🧹 Nitpick comments (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)
123-127: Redundant context check before select.The context cancellation is already handled in the
selectstatement at line 132. The check at lines 123-127 is redundant and can be removed.Apply this diff:
for { - if ctx.Err() != nil { - slog.Info("Azure monitoring stopping due to context cancellation.") - return ctx.Err() - } - c.pollForMaintenanceEvents(ctx, eventChan) select {
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go(1 hunks)health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.gohealth-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go
**/*_test.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*_test.go: Useenvtestfor testing Kubernetes controllers instead of fake clients
Usetestify/assertandtestify/requirefor assertions in Go tests
Write table-driven tests when testing multiple scenarios in Go
Name Go tests descriptively using format:TestFunctionName_Scenario_ExpectedBehavior
Files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go
🧠 Learnings (4)
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*_test.go : Use `envtest` for testing Kubernetes controllers instead of fake clients
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go
🧬 Code graph analysis (2)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (2)
health-monitors/csp-health-monitor/pkg/config/config.go (2)
AzureConfig(67-71)Config(39-48)health-monitors/csp-health-monitor/pkg/event/normalizer.go (2)
Normalizer(25-30)GetNormalizer(34-45)
health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go (4)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)
Client(42-50)health-monitors/csp-health-monitor/pkg/config/config.go (1)
AzureConfig(67-71)health-monitors/csp-health-monitor/pkg/event/azure_normalizer.go (1)
AzureNormalizer(26-26)health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (1)
CSPAzure(56-56)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
♻️ Duplicate comments (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)
318-353: IMDS hardening issues remain unresolved.A previous review flagged four hardening issues in
getSubscriptionID, and while the comment was marked as addressed, all four issues are still present:
- No timeout (line 326): HTTP client can hang indefinitely outside Azure
- Ignored error (line 328):
NewRequestWithContexterror is discarded with_- No status code check (after line 336):
resp.StatusCodeis not validated- Empty subscriptionId accepted (line 352): Returns empty string as success if JSON decodes but field is missing
Apply the previously suggested diff:
func getSubscriptionID(cfg config.AzureConfig) (string, error) { if cfg.SubscriptionID != "" { return cfg.SubscriptionID, nil } - // pulled from https://github.com/Microsoft/azureimds/blob/master/imdssample.go var PTransport = &http.Transport{Proxy: nil} - client := http.Client{Transport: PTransport} + client := http.Client{ + Transport: PTransport, + Timeout: 5 * time.Second, + } - req, _ := http.NewRequestWithContext(context.Background(), "GET", "http://169.254.169.254/metadata/instance", nil) + req, err := http.NewRequestWithContext(context.Background(), "GET", "http://169.254.169.254/metadata/instance", nil) + if err != nil { + return "", fmt.Errorf("failed to construct IMDS request: %w", err) + } req.Header.Add("Metadata", "True") q := req.URL.Query() q.Add("format", "json") q.Add("api-version", "2021-02-01") req.URL.RawQuery = q.Encode() resp, err := client.Do(req) if err != nil { return "", err } defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + return "", fmt.Errorf("unexpected IMDS status code: %d", resp.StatusCode) + } - // now that we have the response get the subscription ID from it var result struct { Compute struct { SubscriptionID string `json:"subscriptionId"` } `json:"compute"` } if err := json.NewDecoder(resp.Body).Decode(&result); err != nil { return "", fmt.Errorf("failed to decode IMDS response: %w", err) } + if result.Compute.SubscriptionID == "" { + return "", fmt.Errorf("subscriptionId missing from IMDS response") + } return result.Compute.SubscriptionID, nil }Consider also accepting a
context.Contextparameter (fromNewClient'sctx) to respect caller cancellation.
🧹 Nitpick comments (3)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (3)
119-119: Simplify resync period expression.
0*time.Secondis unnecessarily verbose.Apply this diff:
- factory := informers.NewSharedInformerFactory(k8sClient, 0*time.Second) + factory := informers.NewSharedInformerFactory(k8sClient, 0)
157-172: Remove redundant context check.The context check at lines 158-161 is redundant since the
selectblock at lines 165-170 already handlesctx.Done(). Based on coding guidelines, this makes the flow more elegant.Apply this diff:
for { - if ctx.Err() != nil { - slog.Info("Azure monitoring stopping due to context cancellation.") - return ctx.Err() - } - c.pollForMaintenanceEvents(ctx, eventChan) select {
260-263: Remove obvious comment.As per coding guidelines, the comment doesn't add significant value beyond what the code clearly expresses.
Apply this diff:
func (c *Client) shouldReportUpdate(update *armmaintenance.Update) bool { - // These are the two fields we need to determine if a maintenance event needs to be reported if update.Status == nil || update.ImpactType == nil { return false }
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go(1 hunks)health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- health-monitors/csp-health-monitor/pkg/csp/azure/azure_test.go
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
🧠 Learnings (3)
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
🧬 Code graph analysis (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (4)
health-monitors/csp-health-monitor/pkg/config/config.go (2)
AzureConfig(67-71)Config(39-48)health-monitors/csp-health-monitor/pkg/event/normalizer.go (2)
Normalizer(25-30)GetNormalizer(34-45)health-monitors/csp-health-monitor/pkg/metrics/metrics.go (3)
CSPPollingDuration(39-46)CSPAPIErrors(47-53)CSPEventsReceived(32-38)health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (3)
CSPAzure(56-56)CSP(43-43)MaintenanceEvent(22-40)
🔇 Additional comments (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)
56-56: Add godoc for exported function.As per coding guidelines, all exported functions require documentation comments.
Add godoc before the function:
+// NewClient creates and initializes a new Azure health monitoring client. +// It resolves the subscription ID from config or IMDS, creates Azure SDK clients, +// initializes the Kubernetes client and node informer, and returns a ready-to-use Client. func NewClient(Likely an incorrect or invalid review comment.
| // See the License for the specific language governing permissions and | ||
| // limitations under the License. | ||
|
|
||
| package azure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Add package-level godoc.
As per coding guidelines, all Go packages require package-level documentation describing the package's purpose.
Add godoc before the package declaration:
+// Package azure implements Azure CSP health monitoring by polling the Azure Maintenance Updates API
+// for VM maintenance events and forwarding them to the main event pipeline.
package azure📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| package azure | |
| // Package azure implements Azure CSP health monitoring by polling the Azure Maintenance Updates API | |
| // for VM maintenance events and forwarding them to the main event pipeline. | |
| package azure |
🤖 Prompt for AI Agents
In health-monitors/csp-health-monitor/pkg/csp/azure/azure.go around line 15,
there is no package-level godoc; add a descriptive package comment immediately
above the "package azure" line that succinctly explains the package's purpose,
what it provides (e.g., Azure CSP health-check utilities/clients/helpers), and
any high-level usage notes or important package responsibilities so the package
conforms to the project's Go documentation guidelines.
|
|
||
| slog.Info("Azure Client: Successfully initialized Kubernetes client") | ||
|
|
||
| nodeInformer, _ := newNodeInformer(k8sClient) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle error from newNodeInformer.
The error from newNodeInformer is silently discarded. If informer initialization fails, nodeInformer will be nil, causing a nil pointer dereference at line 184 when c.nodeInformer.Lister().List() is called.
Apply this diff:
- nodeInformer, _ := newNodeInformer(k8sClient)
+ nodeInformer, stopCh := newNodeInformer(k8sClient)
+ if nodeInformer == nil {
+ metrics.CSPMonitorErrors.WithLabelValues(string(model.CSPAzure), "node_informer_error").Inc()
+ return nil, fmt.Errorf("failed to initialize node informer")
+ }Note: You'll also need to store stopCh in the Client struct and close it during shutdown to properly clean up the informer.
Committable suggestion skipped: line range outside the PR's diff.
| _, err := nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ | ||
| AddFunc: func(obj interface{}) {}, | ||
| DeleteFunc: func(obj interface{}) {}, | ||
| UpdateFunc: func(oldObj, newObj interface{}) {}, | ||
| }) | ||
| if err != nil { | ||
| fmt.Fprintf(os.Stderr, "failed to add event handler: %v", err) | ||
| return nil, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Remove no-op event handlers and use structured logging.
The event handlers are empty and serve no purpose. If handlers aren't needed, don't register them. Additionally, error handling should use slog for consistency with the rest of the codebase.
Apply this diff:
- _, err := nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
- AddFunc: func(obj interface{}) {},
- DeleteFunc: func(obj interface{}) {},
- UpdateFunc: func(oldObj, newObj interface{}) {},
- })
- if err != nil {
- fmt.Fprintf(os.Stderr, "failed to add event handler: %v", err)
+ // No event handlers needed; the informer is used only for listing/caching nodes
+ if err := nodeInformer.Informer().GetController(); err != nil {
+ slog.Error("failed to initialize node informer controller", "error", err)
return nil, nil
}Or simply remove the handler registration entirely if not required:
- _, err := nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
- AddFunc: func(obj interface{}) {},
- DeleteFunc: func(obj interface{}) {},
- UpdateFunc: func(oldObj, newObj interface{}) {},
- })
- if err != nil {
- fmt.Fprintf(os.Stderr, "failed to add event handler: %v", err)
- return nil, nil
- }
-📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| _, err := nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ | |
| AddFunc: func(obj interface{}) {}, | |
| DeleteFunc: func(obj interface{}) {}, | |
| UpdateFunc: func(oldObj, newObj interface{}) {}, | |
| }) | |
| if err != nil { | |
| fmt.Fprintf(os.Stderr, "failed to add event handler: %v", err) | |
| return nil, nil | |
| } | |
| // No event handlers needed; the informer is used only for listing/caching nodes |
🤖 Prompt for AI Agents
In health-monitors/csp-health-monitor/pkg/csp/azure/azure.go around lines
124-132, remove the no-op AddEventHandler registration (or omit calling
AddEventHandler entirely) since empty AddFunc/DeleteFunc/UpdateFunc serve no
purpose; if you do need handlers implement meaningful logic instead of empty
closures. Also replace the fmt.Fprintf error handling with the structured slog
logger (e.g., slog.Error with a descriptive message and the err value) to match
the codebase logging; preserve the current return behavior after logging.
| synced := factory.WaitForCacheSync(stopCh) | ||
| for v, ok := range synced { | ||
| if !ok { | ||
| fmt.Fprintf(os.Stderr, "caches failed to sync: %v", v) | ||
| return nil, nil | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Use structured logging for cache sync errors.
Replace fmt.Fprintf(os.Stderr, ...) with slog.Error for consistency with the codebase's structured logging approach.
Apply this diff:
synced := factory.WaitForCacheSync(stopCh)
for v, ok := range synced {
if !ok {
- fmt.Fprintf(os.Stderr, "caches failed to sync: %v", v)
+ slog.Error("node informer cache failed to sync", "informer", v)
return nil, nil
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| synced := factory.WaitForCacheSync(stopCh) | |
| for v, ok := range synced { | |
| if !ok { | |
| fmt.Fprintf(os.Stderr, "caches failed to sync: %v", v) | |
| return nil, nil | |
| } | |
| } | |
| synced := factory.WaitForCacheSync(stopCh) | |
| for v, ok := range synced { | |
| if !ok { | |
| slog.Error("node informer cache failed to sync", "informer", v) | |
| return nil, nil | |
| } | |
| } |
🤖 Prompt for AI Agents
In health-monitors/csp-health-monitor/pkg/csp/azure/azure.go around lines 135 to
141, replace the fmt.Fprintf(os.Stderr, ...) call used when a cache fails to
sync with a structured slog.Error call; call slog.Error with a clear message
like "caches failed to sync" and include the failing informer key/value (v) as a
field (e.g., slog.String("informer", fmt.Sprintf("%v", v)) or
slog.Any("informer", v)) so the error is logged in structured form, and ensure
the function returns an appropriate error (not nil) after logging; also add any
missing slog import if needed.
| func (c *Client) GetName() model.CSP { | ||
| return model.CSPAzure | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Add godoc for exported method.
As per coding guidelines, all exported functions require documentation comments.
Apply this diff:
+// GetName returns the CSP identifier for this client.
func (c *Client) GetName() model.CSP {
return model.CSPAzure
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func (c *Client) GetName() model.CSP { | |
| return model.CSPAzure | |
| } | |
| // GetName returns the CSP identifier for this client. | |
| func (c *Client) GetName() model.CSP { | |
| return model.CSPAzure | |
| } |
🤖 Prompt for AI Agents
In health-monitors/csp-health-monitor/pkg/csp/azure/azure.go around lines 146 to
148, the exported method GetName lacks a godoc comment; add a one-line
documentation comment immediately above the function that briefly describes what
GetName returns (e.g., "GetName returns the CSP identifier for Azure") and
mention that it returns model.CSPAzure, following Go doc comment style for
exported functions.
| return model.CSPAzure | ||
| } | ||
|
|
||
| func (c *Client) StartMonitoring(ctx context.Context, eventChan chan<- model.MaintenanceEvent) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
Add godoc for exported method.
As per coding guidelines, all exported functions require documentation comments.
Apply this diff:
+// StartMonitoring begins polling Azure for VM maintenance events at the configured interval.
+// It sends normalized events to eventChan and respects context cancellation.
func (c *Client) StartMonitoring(ctx context.Context, eventChan chan<- model.MaintenanceEvent) error {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func (c *Client) StartMonitoring(ctx context.Context, eventChan chan<- model.MaintenanceEvent) error { | |
| // StartMonitoring begins polling Azure for VM maintenance events at the configured interval. | |
| // It sends normalized events to eventChan and respects context cancellation. | |
| func (c *Client) StartMonitoring(ctx context.Context, eventChan chan<- model.MaintenanceEvent) error { |
🤖 Prompt for AI Agents
In health-monitors/csp-health-monitor/pkg/csp/azure/azure.go around line 150,
the exported method StartMonitoring lacks a godoc comment; add a documentation
comment immediately above the function that begins with "StartMonitoring" and
concisely describes what the method does, its context parameter and the
eventChan parameter (chan<- model.MaintenanceEvent), and what the returned error
signifies, following standard Go godoc style and punctuation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (7)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (7)
137-143: Use structured logging for cache sync errors.Replace
fmt.Fprintfwithslog.Errorfor consistency with the codebase's structured logging.Apply this diff:
synced := factory.WaitForCacheSync(stopCh) for v, ok := range synced { if !ok { - fmt.Fprintf(os.Stderr, "caches failed to sync: %v", v) + slog.Error("node informer cache failed to sync", "informer", v) return nil, nil } }As per coding guidelines, use structured logging via
log/slogin Go code.
15-15: Add package-level godoc.As per coding guidelines, all Go packages require package-level documentation.
Apply this diff:
+// Package azure implements Azure CSP health monitoring by polling the Azure Maintenance Updates API +// for VM maintenance events and forwarding them to the main event pipeline. package azureAs per coding guidelines, package-level godoc is required for all Go packages.
152-152: Add godoc for exported method.As per coding guidelines, all exported functions require documentation comments.
Apply this diff:
+// StartMonitoring begins polling Azure for VM maintenance events at the configured interval. +// It sends normalized events to eventChan and respects context cancellation. func (c *Client) StartMonitoring(ctx context.Context, eventChan chan<- model.MaintenanceEvent) error {As per coding guidelines, function comments are required for all exported Go functions.
148-150: Add godoc for exported method.As per coding guidelines, all exported functions require documentation comments.
Apply this diff:
+// GetName returns the CSP identifier for this client. func (c *Client) GetName() model.CSP { return model.CSPAzure }As per coding guidelines, function comments are required for all exported Go functions.
320-355: Critical issues in getSubscriptionID remain unaddressed.Despite being marked as addressed in a previous commit, four critical issues remain:
- No timeout: HTTP call can hang indefinitely outside Azure
- Ignored error: Line 330 discards error from
NewRequestWithContext- No status code validation: No check for
resp.StatusCodeafterclient.Do- Empty subscriptionId accepted: Line 354 returns empty string as success if JSON decodes but
subscriptionIdis missing, causing downstream failures inUpdatesClientconstructionApply this diff:
func getSubscriptionID(cfg config.AzureConfig) (string, error) { if cfg.SubscriptionID != "" { return cfg.SubscriptionID, nil } // pulled from https://github.com/Microsoft/azureimds/blob/master/imdssample.go var PTransport = &http.Transport{Proxy: nil} - client := http.Client{Transport: PTransport} + client := http.Client{ + Transport: PTransport, + Timeout: 5 * time.Second, + } - req, _ := http.NewRequestWithContext(context.Background(), "GET", "http://169.254.169.254/metadata/instance", nil) + req, err := http.NewRequestWithContext(context.Background(), "GET", "http://169.254.169.254/metadata/instance", nil) + if err != nil { + return "", fmt.Errorf("failed to construct IMDS request: %w", err) + } req.Header.Add("Metadata", "True") q := req.URL.Query() q.Add("format", "json") q.Add("api-version", "2021-02-01") req.URL.RawQuery = q.Encode() resp, err := client.Do(req) if err != nil { return "", err } defer resp.Body.Close() + + if resp.StatusCode != http.StatusOK { + return "", fmt.Errorf("unexpected IMDS status code: %d", resp.StatusCode) + } // now that we have the response get the subscription ID from it var result struct { Compute struct { SubscriptionID string `json:"subscriptionId"` } `json:"compute"` } if err := json.NewDecoder(resp.Body).Decode(&result); err != nil { return "", fmt.Errorf("failed to decode IMDS response: %w", err) } + if result.Compute.SubscriptionID == "" { + return "", fmt.Errorf("subscriptionId missing from IMDS response") + } return result.Compute.SubscriptionID, nil }Consider also accepting a
context.Contextparameter from the caller to respect cancellation.
101-101: Handle error from newNodeInformer and store stopCh for proper cleanup.The error and stopCh from
newNodeInformerare discarded. If informer initialization fails,nodeInformerwill be nil, causing a nil pointer dereference at line 186. Additionally, the stopCh is never stored or closed, causing a goroutine leak.Apply this diff to handle the error and store stopCh:
- nodeInformer, _ := newNodeInformer(k8sClient) + nodeInformer, stopCh := newNodeInformer(k8sClient) + if nodeInformer == nil { + metrics.CSPMonitorErrors.WithLabelValues(string(model.CSPAzure), "node_informer_error").Inc() + return nil, fmt.Errorf("failed to initialize node informer") + }Then add
stopCh chan struct{}to the Client struct (line 54) and close it in a cleanup method to prevent resource leaks.
125-133: Remove no-op event handlers.The empty event handlers serve no purpose since the informer is only used for caching/listing nodes, not for reacting to node changes.
Apply this diff:
- _, err := nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{ - AddFunc: func(obj interface{}) {}, - DeleteFunc: func(obj interface{}) {}, - UpdateFunc: func(oldObj, newObj interface{}) {}, - }) - if err != nil { - fmt.Fprintf(os.Stderr, "failed to add event handler: %v", err) - return nil, nil - } -As per coding guidelines, extract informer event handler setup into helper methods only when handlers are meaningful.
🧹 Nitpick comments (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (1)
159-174: Refactor to eliminate redundant context checks.The context cancellation is checked twice: once at line 160 and again in the select at line 168. This is redundant.
Apply this diff:
ticker := time.NewTicker(time.Duration(c.config.PollingIntervalSeconds) * time.Second) defer ticker.Stop() for { - if ctx.Err() != nil { - slog.Info("Azure monitoring stopping due to context cancellation.") - return ctx.Err() - } - c.pollForMaintenanceEvents(ctx, eventChan) select { case <-ctx.Done(): slog.Info("Azure monitoring stopping due to context cancellation.") return ctx.Err() case <-ticker.C: } }Based on learnings, this pattern is more idiomatic and avoids duplicate checks.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.go
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.go: Follow standard Go conventions with gofmt and golint
Use structured logging vialog/slogin Go code
Wrap errors with context usingfmt.Errorf("context: %w", err)in Go code
Withinretry.RetryOnConflictblocks, return errors without wrapping to preserve retry behavior
Use meaningful variable names such assyncedoverokfor cache sync checks
Useclient-gofor Kubernetes API interactions in Go code
Prefer informers over direct API calls for watching Kubernetes resources
Implement proper shutdown handling with context cancellation in Go code
Package-level godoc required for all Go packages
Function comments required for all exported Go functions
Use inline comments for complex logic only in Go code
TODO comments should reference issues in Go code
Extract informer event handler setup into helper methods
Use separate informers for different Kubernetes resource types
Files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
🧠 Learnings (9)
📚 Learning: 2025-10-29T12:40:29.621Z
Learnt from: KaivalyaMDabhadkar
Repo: NVIDIA/NVSentinel PR: 143
File: fault-quarantine-module/pkg/informer/k8s_client.go:52-67
Timestamp: 2025-10-29T12:40:29.621Z
Learning: The clientcmd.BuildConfigFromFlags function in k8s.io/client-go/tools/clientcmd automatically handles in-cluster configuration as a fallback. When both masterUrl and kubeconfigPath parameters are empty strings, it internally attempts rest.InClusterConfig() before falling back to default config loading rules. No explicit in-cluster fallback logic is needed when using this function.
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Prefer informers over direct API calls for watching Kubernetes resources
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Package-level godoc required for all Go packages
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Function comments required for all exported Go functions
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use separate informers for different Kubernetes resource types
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Extract informer event handler setup into helper methods
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use meaningful variable names such as `synced` over `ok` for cache sync checks
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-24T22:20:48.152Z
Learnt from: CR
Repo: NVIDIA/NVSentinel PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T22:20:48.152Z
Learning: Applies to **/*.go : Use structured logging via `log/slog` in Go code
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
📚 Learning: 2025-11-12T14:08:15.229Z
Learnt from: tanishagoyal2
Repo: NVIDIA/NVSentinel PR: 333
File: health-monitors/csp-health-monitor/pkg/csp/aws/aws.go:624-632
Timestamp: 2025-11-12T14:08:15.229Z
Learning: In the AWS health monitor codebase (health-monitors/csp-health-monitor), the EventID field in model.MaintenanceEvent stores the AWS entity ARN. This is set during normalization in aws_normalizer.go where EventID is assigned from EventMetadata.EntityArn. Therefore, when processing active events, using activeEvent.EventID as the EntityArn value is correct and intentional.
Applied to files:
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go
🧬 Code graph analysis (1)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (3)
health-monitors/csp-health-monitor/pkg/config/config.go (2)
AzureConfig(67-71)Config(39-48)health-monitors/csp-health-monitor/pkg/event/normalizer.go (2)
Normalizer(25-30)GetNormalizer(34-45)health-monitors/csp-health-monitor/pkg/model/maintenance_event.go (3)
CSPAzure(56-56)CSP(43-43)MaintenanceEvent(22-40)
🔇 Additional comments (2)
health-monitors/csp-health-monitor/pkg/csp/azure/azure.go (2)
260-318: LGTM!The filtering logic in
shouldReportUpdatecorrectly validates required fields and filters for non-completed events with actual impact. ThenormalizeAndSendEventmethod properly handles normalization, error cases, and context cancellation.
357-383: LGTM!The
parseAzureProviderIDfunction correctly handles the Azure provider ID format, validates the structure, and extracts the required resource group and VM name with appropriate error handling.
| pager := c.updatesClient.NewListPager( | ||
| resourceGroup, | ||
| "Microsoft.Compute", | ||
| "virtualMachines", | ||
| vmName, | ||
| nil, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🌐 Web query:
Azure Resource Graph API query maintenance updates events
💡 Result:
You can use Azure Resource Graph (ARG) to query maintenance notifications, planned maintenance windows, and update/patch events. Key ARG tables: maintenanceresources, ServiceHealthResources (Microsoft.ResourceHealth/events), patchassessmentresources/patchinstallationresources, and Resources. Examples and notes below.
Sample: find maintenance updates (from notifications / microsoft.maintenance/updates)
resources
| project resource = tolower(id)
| join kind=inner (
maintenanceresources
| where type == "microsoft.maintenance/updates"
| extend p = parse_json(properties)
| mv-expand d = p.value
| where d has 'notificationId' and d.notificationId == ''
| project resource = tolower(name), status = d.status, resourceGroup, location, startTimeUtc = d.startTimeUtc, endTimeUtc = d.endTimeUtc, impactType = d.impactType
) on resource
| project resource, status, resourceGroup, location, startTimeUtc, endTimeUtc, impactType. [1]
Sample: list active Service Health events (planned maintenance, service issues, advisories)
ServiceHealthResources
| where type =~ 'Microsoft.ResourceHealth/events'
| extend eventType = properties.EventType, status = properties.Status, title = properties.Title, trackingId = properties.TrackingId, start = properties.ImpactStartTime, mitigation = properties.ImpactMitigationTime
| where (eventType in ('HealthAdvisory','SecurityAdvisory','PlannedMaintenance') and mitigation > now()) or (eventType == 'ServiceIssue' and status == 'Active'). [4]
Patch / Update Manager logs (assessment/install results) are available in patchassessmentresources and patchinstallationresources (30-day / 7-day retention details in docs). Use these tables to query per-VM patch status and installationState. [3]
Notes / troubleshooting
- Some fields (maintenanceStartTime/maintenanceEndTime/info) show null unless event is PlannedMaintenance or you have proper permissions; event details may be populated later in the event lifecycle. [5]
- ARG table reference and supported resource-type tables listed in docs. Use ARG Explorer, az graph query or the Resource Graph REST API to run these queries at scale across subscriptions. [2]
Useful docs
- Maintenance updates example (ARG query). [1]
- ARG table list / query language reference. [2]
- Update Manager / patch tables and schema. [3]
- Service Health sample queries. [4]
- Troubleshooting null maintenance fields. [5]
References:
[1] Configure advance notifications for planned maintenance events (sample ARG query).
[2] Understand the query language / ARG table reference.
[3] Query Resources with Azure Update Manager (patchassessmentresources / patchinstallationresources).
[4] Azure Resource Graph sample queries for Azure Service Health.
[5] Azure Resource Graph Overview — troubleshooting maintenance fields.
🏁 Script executed:
cat -n health-monitors/csp-health-monitor/pkg/csp/azure/azure.go | sed -n '200,240p'Repository: NVIDIA/NVSentinel
Length of output: 178
Consider using Azure Resource Graph API for cluster-level maintenance event queries.
The current implementation queries the Azure Maintenance Updates API for each VM individually (lines 231-237), which may lead to API rate limiting and performance issues in large clusters. Azure Resource Graph API supports querying maintenance events at subscription and resource group level using the maintenanceresources and ServiceHealthResources tables, enabling more efficient cluster-wide queries similar to GCP's cloudlogging approach.
Note: Some maintenance event fields may be null depending on permissions and event lifecycle timing, so implementation should handle these cases appropriately.
🤖 Prompt for AI Agents
In health-monitors/csp-health-monitor/pkg/csp/azure/azure.go around lines
231-237 the code creates a per-VM pager via c.updatesClient.NewListPager(...)
which will hit the Maintenance Updates API for each VM and can cause rate
limiting and poor performance for large clusters; replace this per-VM approach
with an Azure Resource Graph query against the maintenanceresources and/or
ServiceHealthResources tables at the subscription or resource-group scope to
fetch maintenance events in bulk (implement a Resource Graph client call,
construct a query that filters by subscription/resourceGroup and relevant
resource types, handle pagination of results and API throttling backoff), map
Resource Graph results to the existing internal event model (safely handling
nullable fields and missing attributes), and only fall back to per-VM
Maintenance Updates API when Resource Graph returns incomplete data or
permissions prevent access.
Summary
Implements #124
AI Disclosure: Cursor helped quite a bit here once I got the hand-made skeleton outlined. Broad strokes are me but the normalization is mostly Cursor - I had to tone down the amount of information pushed to the event (metadata) though. I almost always use gen-ai for testing and thats the same here. The testing seems quite good - solid test cases that do not use mocking and walk through the most important logic of the new pkg/normalizer.
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit
New Features
Bug Fixes / Validation
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.