Add bundle version of utf8_range to validate attributes#3512
Add bundle version of utf8_range to validate attributes#3512owent wants to merge 41 commits intoopen-telemetry:mainfrom
Conversation
✅ Deploy Preview for opentelemetry-cpp-api-docs canceled.
|
There was a problem hiding this comment.
Pull Request Overview
This PR introduces a bundled UTF-8 validation library (utf8_range) and integrates it into attribute validation throughout the SDK, ensuring that span, resource, and instrumentation-scope attributes containing invalid UTF-8 are filtered out with warnings.
- Add
attribute_validitymodule wrapping the newutf8_rangelibrary - Integrate attribute validation/filtering into
Span,Resource, andInstrumentationScope - Add build rules for
utf8_rangeand update CMake/Bazel files accordingly
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| sdk/src/common/internal/utf8_range/uft8_range.cc | Added bundled UTF-8 validation implementation |
| sdk/src/common/internal/utf8_range/utf8_range.h | Declared public API for UTF-8 validation (IsValid, ValidPrefix) |
| sdk/src/common/attribute_validity.cc | Wrapped validation logic around SDK attribute values |
| sdk/src/common/BUILD | Added utf8_range and attribute_validity libraries |
| sdk/src/common/CMakeLists.txt | Updated to include new source in common build |
| sdk/src/trace/span.cc | Applied attribute validation in Span::SetAttribute and AddLink |
| sdk/src/resource/resource.cc | Applied attribute validation in Resource constructor |
| sdk/include/opentelemetry/sdk/instrumentationscope/instrumentation_scope.h | Applied attribute validation in instrumentation scope factories and setters |
| sdk/test/instrumentationscope/CMakeLists.txt | Updated linkage to include opentelemetry_common |
| sdk/test/instrumentationscope/BUILD | Added dependency on attribute_validity for tests |
Comments suppressed due to low confidence (2)
sdk/src/common/internal/utf8_range/uft8_range.cc:1
- The source file is named
uft8_range.cc, which looks like a typo. Renaming it toutf8_range.ccwould align with the header (utf8_range.h) and improve consistency.
// Copyright 2023 Google LLC
sdk/src/common/BUILD:6
- The new
utf8_rangelibrary adds core validation logic but there are no accompanying unit tests verifying valid and invalid UTF-8 sequences. Consider adding tests undersdk/test/common/to cover edge cases.
cc_library(
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3512 +/- ##
==========================================
- Coverage 90.12% 89.14% -0.98%
==========================================
Files 227 230 +3
Lines 7261 7480 +219
==========================================
+ Hits 6543 6667 +124
- Misses 718 813 +95
🚀 New features to boost your workflow:
|
71e6ce2 to
f02ac94
Compare
f02ac94 to
9fe92a8
Compare
69c0259 to
d0b4bfe
Compare
f9e1da8 to
c804ec7
Compare
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.30.9 to 4.31.0. - [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](github/codeql-action@16140ae...4e94bd1) --- updated-dependencies: - dependency-name: github/codeql-action dependency-version: 4.31.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 5.0.0 to 6.0.0. - [Release notes](https://github.com/actions/download-artifact/releases) - [Commits](actions/download-artifact@634f93c...018cc2c) --- updated-dependencies: - dependency-name: actions/download-artifact dependency-version: 6.0.0 dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
- prevent superfluous copy in construction of OwnedAttributeValue with std::string or span - combined identical lambdas of both lamdas - use const-ref - removed unneeded (stateless) members
Bumps [github/codeql-action](https://github.com/github/codeql-action) from 4.31.0 to 4.31.2. - [Release notes](https://github.com/github/codeql-action/releases) - [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md) - [Commits](github/codeql-action@4e94bd1...0499de3) --- updated-dependencies: - dependency-name: github/codeql-action dependency-version: 4.31.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
c804ec7 to
08e80f3
Compare
…string_in_setters # Conflicts: # CHANGELOG.md
|
The clang-tidy check failure does not appear to be caused by this PR. |
…string_in_setters # Conflicts: # CHANGELOG.md # functional/configuration/shelltests/kitchen-sink.test # sdk/src/common/BUILD
| base64.cc | ||
| disabled.cc | ||
| attribute_validity.cc | ||
| internal/utf8_range/uft8_range.cc) |
There was a problem hiding this comment.
Vendoring this library and always running validation by default adds some maintenance and performance overhead. Ideally we can just get the required package/targets along with protobuf and then string validation can be tested as a preview feature and controlled by the users.
Can we require the utf8_range package to be found/fetched like the other dependencies?
The package is within protobuf source and built with protobuf. For CMake builds it can be installed using the utf8_range_ENABLE_INSTALL CMake option (See protobuf/third_party/utf8_range/CMakeLists.txt#L10 ) and then found with find_package. If protobuf is fetched then we can use FetchContent_Declare on the source directory ${protobuf_SOURCE_DIR}/third_party/utf8_range and build the target directly. Bazel builds relying on the protobuf packaged utf8_range also seem doable.
What do you think?
There was a problem hiding this comment.
utf8_range was moved into protobuf’s source tree starting from protobuf v22, and it is only included in CMake installation when the protobuf binaries are built.
If an older protobuf version is used, or if only the libraries are built for cross-compilation, utf8_range may not be installed and therefore cannot be found by find_package.
Enforcing the use of FetchContent_Declare may also introduce ABI compatibility issues when users link our project against a prebuilt version of protobuf.
For this reason, I imported it in the same way as the bundled version of abseil-cpp.
Alternatively, do you think we can disable these checks, even though this may result in some records being dropped?
There was a problem hiding this comment.
utf8_range was moved into protobuf’s source tree starting from protobuf v22, and it is only included in CMake installation when the protobuf binaries are built.
If an older protobuf version is used, or if only the libraries are built for cross-compilation, utf8_range may not be installed and therefore cannot be found by find_package.
We currently have v3.21 as the minimum supported/tested protobuf version (it is a 4 year old release) and we can bump it to 3.22 if needed, but only supporting utf8 validation for otlp attributes if utf8_range is installed or fetched seems reasonable.
If protobuf (v3.22+) is fetched now (by default otel-cpp will fetch protobuf v6.31) the utf_range CMake targets (utf8_validity) will be available to link to.
Enforcing the use of FetchContent_Declare may also introduce ABI compatibility issues when users link our project against a prebuilt version of protobuf.
The utf8_range package is independent and used as a static dependency in protobuf and will be used as a private dependency in otel-cpp OTLP exporters. ABI compatibility with protobuf may not be of concern in this case. Either way I think we use the same version bundled with protobuf.
For this reason, I imported it in the same way as the bundled version of abseil-cpp.
Makes sense. Just raising the alternative to optionally enable UTF-8 validation when the SDK is built with OTLP support and installing/importing/fetching utf8_range along with the protobuf dependency. This could be less maintenance for otel-cpp.
Alternatively, do you think we can disable these checks, even though this may result in some records being dropped?
The spec on OTLP attribute type mapping (specification/common/attribute-type-mapping.md) uses Should for UTF8 validation so it appears to be an optional requirement of the SDK. I think we should make it a compile time option that is off by default for now and evaluate performance/stability (similar to other _PREVIEW options).
| { | ||
| OTEL_INTERNAL_LOG_WARN("[Resource] Invalid attribute value for " | ||
| << kv.first << ". This attribute will be ignored."); | ||
| continue; |
There was a problem hiding this comment.
Should these attribute values that are not valid UTF-8 sequences be set to byte values here?
Spec language (specification/common/attribute-type-mapping.md#L109):
String Values
String values which are valid UTF-8 sequences SHOULD be converted to AnyValue's string_value field.String values which are not valid Unicode sequences SHOULD be converted to AnyValue's bytes_value with the bytes representing the string in the original order and format of the source string.
There was a problem hiding this comment.
Thanks, the spec seems changed after the first version of this PR.I will change it to follow the latest spec.
|
|
||
| if (!sdk::common::AttributeValidator::IsValid(value)) | ||
| { | ||
| OTEL_INTERNAL_LOG_WARN("[Trace Span] Invalid span attribute value for " |
There was a problem hiding this comment.
Same question as above: ignore or set as bytes?
| }; | ||
|
|
||
| /** | ||
| * Supports internal iteration over a collection of key-value pairs and filtering of invalid |
There was a problem hiding this comment.
If we have to place invalid utf8 sequences into byte attribute values per the spec should the IsValid check always be performed when constructing an attribute map (attribute_utils.h)?
There was a problem hiding this comment.
Thanks, it looks like the spec changed after the initial version of this PR.
In the earlier version, bytes were allowed in log attributes, but not in traces, metrics, or resources.
I’ll review the spec again over the next few days and update the implementation to match the latest version.
There was a problem hiding this comment.
Looking at the spec it may only apply utf8 validation to OTLP type mapping. If that is the case we may be able to make a more isolated change to OtlpPopulateAttributeUtils::PopulateAnyValue to support this as @lalitb suggests.
|
Hey @owent, sorry for the long delay in review. I've added some comments above. Two main questions:
|
|
@owent - Is it possible to scope this fix within OTLP instead of adding SDK-wide validation/framework changes? The original issue seems much narrower: OTLP/protobuf serialization is where invalid UTF-8 actually becomes a problem. So I’d expect a smaller fix closer to the exporter boundary, for example validating right before writing protobuf string fields / set_string_value(...), and handling invalid data there, rather than introducing a broader SDK-wide mechanism. If we go that route, I’m also not sure we need to bring in |
Thanks. This is also due to a change in the spec. |
Fixes #3491
Changes
For significant contributions please make sure you have completed the following items:
CHANGELOG.mdupdated for non-trivial changes