Translate Instance Store NVME EKS metrics#1943
Conversation
| NvmeInstanceStoreWriteTime = "diskio_instance_store_total_write_time" | ||
| NvmeInstanceStoreExceededEC2IOPSTime = "diskio_instance_store_ec2_instance_performance_exceeded_iops" | ||
| NvmeInstanceStoreExceededEC2TPTime = "diskio_instance_store_ec2_instance_performance_exceeded_tp" | ||
| NvmeInstanceStoreVolumeQueueLength = "diskio_instance_store_volume_queue_length" |
There was a problem hiding this comment.
Please note there are 2 EBS NVMe metrics that do not have an equivalent metric for LIS NVMe metrics. This is called out in the design, and communicated by the team owning the LIS CSI that these volume performance metrics are not relevant for LIS CSI. Additionally we have existing unit tests confirming those 2 metrics are not supported.
// relevant EBS iops/tp metrics supported
NvmeExceededIOPSTime = "diskio_ebs_volume_performance_exceeded_iops"
NvmeExceededTPTime = "diskio_ebs_volume_performance_exceeded_tp"
NvmeExceededEC2IOPSTime = "diskio_ebs_ec2_instance_performance_exceeded_iops"
NvmeExceededEC2TPTime = "diskio_ebs_ec2_instance_performance_exceeded_tp"
//relevant LIS iops/tp metrics supported
NvmeInstanceStoreExceededEC2IOPSTime = "diskio_instance_store_ec2_instance_performance_exceeded_iops"
NvmeInstanceStoreExceededEC2TPTime = "diskio_instance_store_ec2_instance_performance_exceeded_tp"
| github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awscloudwatchlogsexporter => github.com/amazon-contributing/opentelemetry-collector-contrib/exporter/awscloudwatchlogsexporter v0.0.0-20251103165826-c3d3976f1fa0 | ||
| github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter => github.com/amazon-contributing/opentelemetry-collector-contrib/exporter/awsemfexporter v0.0.0-20251103165826-c3d3976f1fa0 | ||
| github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsxrayexporter => github.com/amazon-contributing/opentelemetry-collector-contrib/exporter/awsxrayexporter v0.0.0-20251103165826-c3d3976f1fa0 | ||
| github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awscloudwatchlogsexporter => github.com/amazon-contributing/opentelemetry-collector-contrib/exporter/awscloudwatchlogsexporter v0.0.0-20251118152320-cd34549ef7ed |
There was a problem hiding this comment.
is this from this one ? otel contrib PR - amazon-contributing/opentelemetry-collector-contrib#382
| transformRules = append(transformRules, map[string]interface{}{ | ||
| "include": oldNvmeInstanceStoreMetric, | ||
| "action": "update", | ||
| "new_name": containerinsightscommon.MetricName(containerinsightscommon.TypeNode, newNvmeInstanceStoreMetric), |
There was a problem hiding this comment.
should the type be TypeNodeInstanceStore here instead of Node ?
There was a problem hiding this comment.
No, the prefix is actually "node". The TypeNodeInstanceStoreis used as the MetricType label on line 141
|
There are some failing tests, but I don't think they are related to these changes (xray and selinux/emf). Probably just flaky. Give them a retry. We should look at those though (outside this PR of course). Approving |
Tests succeeded on rerun after github internal issues were resolved |
| "node_diskio_instance_store_total_read_ops", | ||
| "node_diskio_instance_store_total_write_ops", | ||
| "node_diskio_instance_store_total_read_bytes", | ||
| "node_diskio_instance_store_total_write_bytes", | ||
| "node_diskio_instance_store_total_read_time", | ||
| "node_diskio_instance_store_total_write_time", | ||
| "node_diskio_instance_store_ec2_instance_performance_exceeded_iops", | ||
| "node_diskio_instance_store_ec2_instance_performance_exceeded_tp", | ||
| "node_diskio_instance_store_volume_queue_length", |
There was a problem hiding this comment.
nit: Is there a reason we aren't using the consts defined in containerinsightscommon?
There was a problem hiding this comment.
Yeah good question, I decided to follow the existing convention in this file, as all the other kubernetes metrics have the same format. I'm looking to update the entire file to depend on constants in a future cr as a refactor
| "aws_ebs_csi_volume_queue_length": containerinsightscommon.NvmeVolumeQueueLength, | ||
| } | ||
|
|
||
| var renameMapForNvmeInstanceStore = map[string]string{ |
There was a problem hiding this comment.
Oh interesting. Had no idea the mapping was being done in the processor and not the prometheus receiver, but this makes sense.
| if result != tc.expectedName { | ||
| t.Errorf("MetricName(%q, %q) = %q, want %q", tc.metricType, tc.metricName, result, tc.expectedName) | ||
| } |
There was a problem hiding this comment.
nit: For consistency, consider using github.com/stretchr/testify/assert.
There was a problem hiding this comment.
thanks, will update in a follow up
| } | ||
| } | ||
|
|
||
| func TestMetricNameWithRealConstants(t *testing.T) { |
There was a problem hiding this comment.
nit: What is this testing that TestMetricName isn't?
There was a problem hiding this comment.
its testing the method called MetricName to ensure it actually performs prefix injection appropriately.
Description of the issue
Scrape and translate NVME metrics for Instance Store on EKS. This leverages the awscontainerinsightsreceiver from the otel contrib package
Description of changes
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Unit testing
Validated EMF Log generated from manual test a test EKS cluster vending Instance Store NVME metrics.
Requirements
Before commiting your code, please do the following steps.
make fmtandmake fmt-shmake lintIntegration Tests
To run integration tests against this PR, add the
ready for testinglabel.