Skip to content

Add bundle version of utf8_range to validate attributes#3512

Open
owent wants to merge 41 commits intoopen-telemetry:mainfrom
owent:validate_utf8_string_in_setters
Open

Add bundle version of utf8_range to validate attributes#3512
owent wants to merge 41 commits intoopen-telemetry:mainfrom
owent:validate_utf8_string_in_setters

Conversation

@owent
Copy link
Copy Markdown
Member

@owent owent commented Jul 1, 2025

Fixes #3491

Changes

  • Add bundle version of utf8_range to validate attribute values
  • Update the implementation for the new SPEC(2026-03)

For significant contributions please make sure you have completed the following items:

  • CHANGELOG.md updated for non-trivial changes
  • Unit tests have been added
  • Changes in public API reviewed

Copilot AI review requested due to automatic review settings July 1, 2025 12:40
@owent owent requested a review from a team as a code owner July 1, 2025 12:40
@netlify
Copy link
Copy Markdown

netlify bot commented Jul 1, 2025

Deploy Preview for opentelemetry-cpp-api-docs canceled.

Name Link
🔨 Latest commit 2ab31e8
🔍 Latest deploy log https://app.netlify.com/projects/opentelemetry-cpp-api-docs/deploys/68a8d5196a46540008c6326f

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a bundled UTF-8 validation library (utf8_range) and integrates it into attribute validation throughout the SDK, ensuring that span, resource, and instrumentation-scope attributes containing invalid UTF-8 are filtered out with warnings.

  • Add attribute_validity module wrapping the new utf8_range library
  • Integrate attribute validation/filtering into Span, Resource, and InstrumentationScope
  • Add build rules for utf8_range and update CMake/Bazel files accordingly

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
sdk/src/common/internal/utf8_range/uft8_range.cc Added bundled UTF-8 validation implementation
sdk/src/common/internal/utf8_range/utf8_range.h Declared public API for UTF-8 validation (IsValid, ValidPrefix)
sdk/src/common/attribute_validity.cc Wrapped validation logic around SDK attribute values
sdk/src/common/BUILD Added utf8_range and attribute_validity libraries
sdk/src/common/CMakeLists.txt Updated to include new source in common build
sdk/src/trace/span.cc Applied attribute validation in Span::SetAttribute and AddLink
sdk/src/resource/resource.cc Applied attribute validation in Resource constructor
sdk/include/opentelemetry/sdk/instrumentationscope/instrumentation_scope.h Applied attribute validation in instrumentation scope factories and setters
sdk/test/instrumentationscope/CMakeLists.txt Updated linkage to include opentelemetry_common
sdk/test/instrumentationscope/BUILD Added dependency on attribute_validity for tests
Comments suppressed due to low confidence (2)

sdk/src/common/internal/utf8_range/uft8_range.cc:1

  • The source file is named uft8_range.cc, which looks like a typo. Renaming it to utf8_range.cc would align with the header (utf8_range.h) and improve consistency.
// Copyright 2023 Google LLC

sdk/src/common/BUILD:6

  • The new utf8_range library adds core validation logic but there are no accompanying unit tests verifying valid and invalid UTF-8 sequences. Consider adding tests under sdk/test/common/ to cover edge cases.
cc_library(

@codecov
Copy link
Copy Markdown

codecov bot commented Jul 1, 2025

Codecov Report

❌ Patch coverage is 58.79828% with 96 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.14%. Comparing base (2eba094) to head (49b474a).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
sdk/src/common/attribute_validity.cc 56.10% 36 Missing ⚠️
sdk/src/common/internal/utf8_range/uft8_range.cc 50.75% 33 Missing ⚠️
...lude/opentelemetry/sdk/common/attribute_validity.h 63.64% 12 Missing ⚠️
...y/sdk/instrumentationscope/instrumentation_scope.h 61.12% 7 Missing ⚠️
sdk/src/trace/span.cc 64.71% 6 Missing ⚠️
...ntelemetry/sdk/metrics/view/attributes_processor.h 66.67% 1 Missing ⚠️
sdk/src/resource/resource.cc 92.31% 1 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3512      +/-   ##
==========================================
- Coverage   90.12%   89.14%   -0.98%     
==========================================
  Files         227      230       +3     
  Lines        7261     7480     +219     
==========================================
+ Hits         6543     6667     +124     
- Misses        718      813      +95     
Files with missing lines Coverage Δ
...ntelemetry/sdk/metrics/view/attributes_processor.h 85.00% <66.67%> (ø)
sdk/src/resource/resource.cc 95.13% <92.31%> (-1.65%) ⬇️
sdk/src/trace/span.cc 85.57% <64.71%> (-4.08%) ⬇️
...y/sdk/instrumentationscope/instrumentation_scope.h 87.50% <61.12%> (-12.50%) ⬇️
...lude/opentelemetry/sdk/common/attribute_validity.h 63.64% <63.64%> (ø)
sdk/src/common/internal/utf8_range/uft8_range.cc 50.75% <50.75%> (ø)
sdk/src/common/attribute_validity.cc 56.10% <56.10%> (ø)

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@owent owent force-pushed the validate_utf8_string_in_setters branch 2 times, most recently from 71e6ce2 to f02ac94 Compare July 18, 2025 09:24
@owent owent changed the title [Draft] Add bundle version of utf8_range to validate attribute values [Draft] Add bundle version of utf8_range to validate attributes Jul 19, 2025
@owent owent force-pushed the validate_utf8_string_in_setters branch from f02ac94 to 9fe92a8 Compare July 19, 2025 08:47
@owent owent changed the title [Draft] Add bundle version of utf8_range to validate attributes Add bundle version of utf8_range to validate attributes Jul 19, 2025
@owent owent force-pushed the validate_utf8_string_in_setters branch from 69c0259 to d0b4bfe Compare July 22, 2025 13:48
@owent owent force-pushed the validate_utf8_string_in_setters branch from f9e1da8 to c804ec7 Compare November 1, 2025 17:02
dependabot bot and others added 11 commits November 2, 2025 23:02
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.30.9 to 4.31.0.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@16140ae...4e94bd1)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.31.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 5.0.0 to 6.0.0.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](actions/download-artifact@634f93c...018cc2c)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: 6.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
- prevent superfluous copy in construction of OwnedAttributeValue with std::string or span
- combined identical lambdas of both lamdas
- use const-ref
- removed unneeded (stateless) members
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.31.0 to 4.31.2.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@4e94bd1...0499de3)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.31.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@owent owent force-pushed the validate_utf8_string_in_setters branch from c804ec7 to 08e80f3 Compare November 2, 2025 15:02
@owent
Copy link
Copy Markdown
Member Author

owent commented Nov 29, 2025

The clang-tidy check failure does not appear to be caused by this PR.

owent added 2 commits March 21, 2026 21:57
…string_in_setters

# Conflicts:
#	CHANGELOG.md
#	functional/configuration/shelltests/kitchen-sink.test
#	sdk/src/common/BUILD
base64.cc
disabled.cc
attribute_validity.cc
internal/utf8_range/uft8_range.cc)
Copy link
Copy Markdown
Member

@dbarker dbarker Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vendoring this library and always running validation by default adds some maintenance and performance overhead. Ideally we can just get the required package/targets along with protobuf and then string validation can be tested as a preview feature and controlled by the users.

Can we require the utf8_range package to be found/fetched like the other dependencies?

The package is within protobuf source and built with protobuf. For CMake builds it can be installed using the utf8_range_ENABLE_INSTALL CMake option (See protobuf/third_party/utf8_range/CMakeLists.txt#L10 ) and then found with find_package. If protobuf is fetched then we can use FetchContent_Declare on the source directory ${protobuf_SOURCE_DIR}/third_party/utf8_range and build the target directly. Bazel builds relying on the protobuf packaged utf8_range also seem doable.

What do you think?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utf8_range was moved into protobuf’s source tree starting from protobuf v22, and it is only included in CMake installation when the protobuf binaries are built.
If an older protobuf version is used, or if only the libraries are built for cross-compilation, utf8_range may not be installed and therefore cannot be found by find_package.

Enforcing the use of FetchContent_Declare may also introduce ABI compatibility issues when users link our project against a prebuilt version of protobuf.

For this reason, I imported it in the same way as the bundled version of abseil-cpp.

Alternatively, do you think we can disable these checks, even though this may result in some records being dropped?

Copy link
Copy Markdown
Member

@dbarker dbarker Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

utf8_range was moved into protobuf’s source tree starting from protobuf v22, and it is only included in CMake installation when the protobuf binaries are built.
If an older protobuf version is used, or if only the libraries are built for cross-compilation, utf8_range may not be installed and therefore cannot be found by find_package.

We currently have v3.21 as the minimum supported/tested protobuf version (it is a 4 year old release) and we can bump it to 3.22 if needed, but only supporting utf8 validation for otlp attributes if utf8_range is installed or fetched seems reasonable.

If protobuf (v3.22+) is fetched now (by default otel-cpp will fetch protobuf v6.31) the utf_range CMake targets (utf8_validity) will be available to link to.

Enforcing the use of FetchContent_Declare may also introduce ABI compatibility issues when users link our project against a prebuilt version of protobuf.

The utf8_range package is independent and used as a static dependency in protobuf and will be used as a private dependency in otel-cpp OTLP exporters. ABI compatibility with protobuf may not be of concern in this case. Either way I think we use the same version bundled with protobuf.

For this reason, I imported it in the same way as the bundled version of abseil-cpp.

Makes sense. Just raising the alternative to optionally enable UTF-8 validation when the SDK is built with OTLP support and installing/importing/fetching utf8_range along with the protobuf dependency. This could be less maintenance for otel-cpp.

Alternatively, do you think we can disable these checks, even though this may result in some records being dropped?

The spec on OTLP attribute type mapping (specification/common/attribute-type-mapping.md) uses Should for UTF8 validation so it appears to be an optional requirement of the SDK. I think we should make it a compile time option that is off by default for now and evaluate performance/stability (similar to other _PREVIEW options).

{
OTEL_INTERNAL_LOG_WARN("[Resource] Invalid attribute value for "
<< kv.first << ". This attribute will be ignored.");
continue;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these attribute values that are not valid UTF-8 sequences be set to byte values here?

Spec language (specification/common/attribute-type-mapping.md#L109):

String Values
String values which are valid UTF-8 sequences SHOULD be converted to AnyValue's string_value field.

String values which are not valid Unicode sequences SHOULD be converted to AnyValue's bytes_value with the bytes representing the string in the original order and format of the source string.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, the spec seems changed after the first version of this PR.I will change it to follow the latest spec.


if (!sdk::common::AttributeValidator::IsValid(value))
{
OTEL_INTERNAL_LOG_WARN("[Trace Span] Invalid span attribute value for "
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as above: ignore or set as bytes?

};

/**
* Supports internal iteration over a collection of key-value pairs and filtering of invalid
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have to place invalid utf8 sequences into byte attribute values per the spec should the IsValid check always be performed when constructing an attribute map (attribute_utils.h)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it looks like the spec changed after the initial version of this PR.
In the earlier version, bytes were allowed in log attributes, but not in traces, metrics, or resources.
I’ll review the spec again over the next few days and update the implementation to match the latest version.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the spec it may only apply utf8 validation to OTLP type mapping. If that is the case we may be able to make a more isolated change to OtlpPopulateAttributeUtils::PopulateAnyValue to support this as @lalitb suggests.

@dbarker
Copy link
Copy Markdown
Member

dbarker commented Mar 25, 2026

Hey @owent, sorry for the long delay in review. I've added some comments above. Two main questions:

  • Can we treat utf8_range as a standard dependency (whose source comes with protobuf and can be installed with protobuf)?
  • Do we always need to put invalid utf8 strings into byte values of attributes?

@lalitb
Copy link
Copy Markdown
Member

lalitb commented Mar 26, 2026

@owent - Is it possible to scope this fix within OTLP instead of adding SDK-wide validation/framework changes?

The original issue seems much narrower: OTLP/protobuf serialization is where invalid UTF-8 actually becomes a problem. So I’d expect a smaller fix closer to the exporter boundary, for example validating right before writing protobuf string fields / set_string_value(...), and handling invalid data there, rather than introducing a broader SDK-wide mechanism. If we go that route, I’m also not sure we need to bring in utf8_range in this broader way.

@owent
Copy link
Copy Markdown
Member Author

owent commented Mar 26, 2026

@owent - Is it possible to scope this fix within OTLP instead of adding SDK-wide validation/framework changes?

The original issue seems much narrower: OTLP/protobuf serialization is where invalid UTF-8 actually becomes a problem. So I’d expect a smaller fix closer to the exporter boundary, for example validating right before writing protobuf string fields / set_string_value(...), and handling invalid data there, rather than introducing a broader SDK-wide mechanism. If we go that route, I’m also not sure we need to bring in utf8_range in this broader way.

Thanks. This is also due to a change in the spec.
The latest baggage API spec (https://opentelemetry.io/docs/specs/otel/baggage/api/) also requires UTF-8 strings. Do you think we can ignore this requirement?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Need validation/debug logs for protobuf setters

7 participants