Skip to content

Comments

Translate Instance Store NVME EKS metrics#1943

Merged
agarakan merged 10 commits intomainfrom
lis-csi-nvme-metric-support
Nov 19, 2025
Merged

Translate Instance Store NVME EKS metrics#1943
agarakan merged 10 commits intomainfrom
lis-csi-nvme-metric-support

Conversation

@agarakan
Copy link
Contributor

@agarakan agarakan commented Nov 15, 2025

Description of the issue

Scrape and translate NVME metrics for Instance Store on EKS. This leverages the awscontainerinsightsreceiver from the otel contrib package

Description of changes

  • Leverage the updated awscontainerinsightsreceiver to scrape Instance Store metrics from EKS
  • Add metric name mapping to conform to cloudwatch agent metric naming convention
  • Define dimension set
  • Export metrics in EMF Log format.

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Unit testing

make fmt
make fmt-sh
make lint
make test

Validated EMF Log generated from manual test a test EKS cluster vending Instance Store NVME metrics.

{
    "AutoScalingGroupName": "",
    "CloudWatchMetrics": [
        {
            "Namespace": "ContainerInsights",
            "Dimensions": [
                [
                    "ClusterName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName"
                ],
                [
                    "ClusterName",
                    "InstanceId",
                    "NodeName",
                    "VolumeId"
                ]
            ],
            "Metrics": [
                {
                    "Name": "node_diskio_instance_store_total_write_time",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_instance_store_ec2_instance_performance_exceeded_iops",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_instance_store_total_read_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_instance_store_total_write_ops",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_instance_store_total_read_time",
                    "Unit": "Second",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_instance_store_total_read_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_instance_store_volume_queue_length",
                    "Unit": "Count",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_instance_store_total_write_bytes",
                    "Unit": "Bytes",
                    "StorageResolution": 60
                },
                {
                    "Name": "node_diskio_instance_store_ec2_instance_performance_exceeded_tp",
                    "Unit": "Second",
                    "StorageResolution": 60
                }
            ]
        }
    ],
    "ClusterName": "instance-store-test",
    "InstanceId": "i-05a21e33f03fe16b5",
    "InstanceType": "i3en.24xlarge",
    "NodeName": "ip-192-168-31-5.us-west-2.compute.internal",
    "PlatformType": "AWS::EKS",
    "Timestamp": "1763321337000",
    "Type": "NodeInstanceStore",
    "Version": "0",
    "VolumeId": "/dev/nvme4n1",
    "http.scheme": "http",
    "instance_id": "ip-192-168-31-5.us-west-2.compute.internal",
    "k8s.namespace.name": "kube-system",
    "kubernetes": {
        "host": "ip-192-168-31-5.us-west-2.compute.internal"
    },
    "net.host.name": "instance-store-csi-metrics.kube-system.svc",
    "net.host.port": "8101",
    "server.address": "instance-store-csi-metrics.kube-system.svc",
    "server.port": "8101",
    "service.instance.id": "instance-store-csi-metrics.kube-system.svc:8101",
    "service.name": "containerInsightsNVMeLISScraper",
    "url.scheme": "http",
    "volume_id": "/dev/nvme4n1",
    "node_diskio_instance_store_ec2_instance_performance_exceeded_iops": 0,
    "node_diskio_instance_store_ec2_instance_performance_exceeded_tp": 0,
    "node_diskio_instance_store_total_read_bytes": 73728,
    "node_diskio_instance_store_total_read_ops": 32,
    "node_diskio_instance_store_total_read_time": 0.0008310000000015805,
    "node_diskio_instance_store_total_write_bytes": 0,
    "node_diskio_instance_store_total_write_ops": 0,
    "node_diskio_instance_store_total_write_time": 0,
    "node_diskio_instance_store_volume_queue_length": 1
}

Requirements

Before commiting your code, please do the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

Integration Tests

To run integration tests against this PR, add the ready for testing label.

@agarakan agarakan marked this pull request as ready for review November 18, 2025 19:18
@agarakan agarakan requested a review from a team as a code owner November 18, 2025 19:18
@agarakan agarakan added the ready for testing Indicates this PR is ready for integration tests to run label Nov 18, 2025
dricross
dricross previously approved these changes Nov 18, 2025
NvmeInstanceStoreWriteTime = "diskio_instance_store_total_write_time"
NvmeInstanceStoreExceededEC2IOPSTime = "diskio_instance_store_ec2_instance_performance_exceeded_iops"
NvmeInstanceStoreExceededEC2TPTime = "diskio_instance_store_ec2_instance_performance_exceeded_tp"
NvmeInstanceStoreVolumeQueueLength = "diskio_instance_store_volume_queue_length"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note there are 2 EBS NVMe metrics that do not have an equivalent metric for LIS NVMe metrics. This is called out in the design, and communicated by the team owning the LIS CSI that these volume performance metrics are not relevant for LIS CSI. Additionally we have existing unit tests confirming those 2 metrics are not supported.

// relevant EBS iops/tp metrics supported
NvmeExceededIOPSTime = "diskio_ebs_volume_performance_exceeded_iops"
NvmeExceededTPTime = "diskio_ebs_volume_performance_exceeded_tp"
NvmeExceededEC2IOPSTime = "diskio_ebs_ec2_instance_performance_exceeded_iops"
NvmeExceededEC2TPTime = "diskio_ebs_ec2_instance_performance_exceeded_tp"

//relevant LIS iops/tp metrics supported
NvmeInstanceStoreExceededEC2IOPSTime = "diskio_instance_store_ec2_instance_performance_exceeded_iops"
NvmeInstanceStoreExceededEC2TPTime = "diskio_instance_store_ec2_instance_performance_exceeded_tp"

github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awscloudwatchlogsexporter => github.com/amazon-contributing/opentelemetry-collector-contrib/exporter/awscloudwatchlogsexporter v0.0.0-20251103165826-c3d3976f1fa0
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsemfexporter => github.com/amazon-contributing/opentelemetry-collector-contrib/exporter/awsemfexporter v0.0.0-20251103165826-c3d3976f1fa0
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awsxrayexporter => github.com/amazon-contributing/opentelemetry-collector-contrib/exporter/awsxrayexporter v0.0.0-20251103165826-c3d3976f1fa0
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/awscloudwatchlogsexporter => github.com/amazon-contributing/opentelemetry-collector-contrib/exporter/awscloudwatchlogsexporter v0.0.0-20251118152320-cd34549ef7ed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this from this one ? otel contrib PR - amazon-contributing/opentelemetry-collector-contrib#382

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes correct

transformRules = append(transformRules, map[string]interface{}{
"include": oldNvmeInstanceStoreMetric,
"action": "update",
"new_name": containerinsightscommon.MetricName(containerinsightscommon.TypeNode, newNvmeInstanceStoreMetric),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the type be TypeNodeInstanceStore here instead of Node ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the prefix is actually "node". The TypeNodeInstanceStoreis used as the MetricType label on line 141

@dricross
Copy link
Contributor

There are some failing tests, but I don't think they are related to these changes (xray and selinux/emf). Probably just flaky. Give them a retry. We should look at those though (outside this PR of course). Approving

@agarakan
Copy link
Contributor Author

There are some failing tests, but I don't think they are related to these changes (xray and selinux/emf). Probably just flaky. Give them a retry. We should look at those though (outside this PR of course). Approving

Tests succeeded on rerun after github internal issues were resolved

Comment on lines +710 to +718
"node_diskio_instance_store_total_read_ops",
"node_diskio_instance_store_total_write_ops",
"node_diskio_instance_store_total_read_bytes",
"node_diskio_instance_store_total_write_bytes",
"node_diskio_instance_store_total_read_time",
"node_diskio_instance_store_total_write_time",
"node_diskio_instance_store_ec2_instance_performance_exceeded_iops",
"node_diskio_instance_store_ec2_instance_performance_exceeded_tp",
"node_diskio_instance_store_volume_queue_length",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is there a reason we aren't using the consts defined in containerinsightscommon?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good question, I decided to follow the existing convention in this file, as all the other kubernetes metrics have the same format. I'm looking to update the entire file to depend on constants in a future cr as a refactor

"aws_ebs_csi_volume_queue_length": containerinsightscommon.NvmeVolumeQueueLength,
}

var renameMapForNvmeInstanceStore = map[string]string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting. Had no idea the mapping was being done in the processor and not the prometheus receiver, but this makes sense.

Comment on lines +62 to +64
if result != tc.expectedName {
t.Errorf("MetricName(%q, %q) = %q, want %q", tc.metricType, tc.metricName, result, tc.expectedName)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: For consistency, consider using github.com/stretchr/testify/assert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, will update in a follow up

}
}

func TestMetricNameWithRealConstants(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: What is this testing that TestMetricName isn't?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its testing the method called MetricName to ensure it actually performs prefix injection appropriately.

@agarakan agarakan merged commit 7bf323e into main Nov 19, 2025
491 of 497 checks passed
@agarakan agarakan deleted the lis-csi-nvme-metric-support branch November 19, 2025 00:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready for testing Indicates this PR is ready for integration tests to run

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants