Skip to content

rpk: fix concurrent map writes in proto serde encodefn#26619

Merged
r-vasquez merged 2 commits intoredpanda-data:devfrom
hunterjm:proto-encoder-concurrency
Jul 3, 2025
Merged

rpk: fix concurrent map writes in proto serde encodefn#26619
r-vasquez merged 2 commits intoredpanda-data:devfrom
hunterjm:proto-encoder-concurrency

Conversation

@hunterjm
Copy link
Copy Markdown
Contributor

When serde.EncodeRecord is called on a cached serde function across multiple goroutines, dynamicpb.Set panics with a concurrent map writes fatal. This is caused by the message previously being created outside of the returned encodefn and attempting to unmarshal the byte slice on the same dynamicpb message.

fatal error: concurrent map writes

goroutine 841 [running]:
internal/runtime/maps.fatal({0xf321b6?, 0xdf17a0?})
	/usr/local/go/src/runtime/panic.go:1058 +0x20
google.golang.org/protobuf/types/dynamicpb.(*Message).Set(0x40000a3c00, {0x102b548, 0x4000216c90}, {{}, 0xd6fd20, 0x40003446c0, 0x4})
	/go/pkg/mod/google.golang.org/protobuf@v1.36.6/types/dynamicpb/dynamic.go:271 +0x300
google.golang.org/protobuf/encoding/protojson.decoder.unmarshalSingular({0x400046a8c0, {{}, 0x0, 0x0, {0xffff6c2dba98, 0x4000618c60}, 0x270f}}, {0x1028850, 0x40000a3c00}, {0x102b548, ...})
	/go/pkg/mod/google.golang.org/protobuf@v1.36.6/encoding/protojson/decode.go:277 +0x228
google.golang.org/protobuf/encoding/protojson.decoder.unmarshalMessage({0x400046a8c0, {{}, 0x0, 0x0, {0xffff6c2dba98, 0x4000618c60}, 0x270f}}, {0x1028850, 0x40000a3c00}, 0x0)
	/go/pkg/mod/google.golang.org/protobuf@v1.36.6/encoding/protojson/decode.go:243 +0x13ac
google.golang.org/protobuf/encoding/protojson.UnmarshalOptions.unmarshal({{}, 0x0, 0x0, {0xffff6c2dba98, 0x4000618c60}, 0x2710}, {0x4000165220, 0x19, 0x20}, {0x1015500, ...})
	/go/pkg/mod/google.golang.org/protobuf@v1.36.6/encoding/protojson/decode.go:80 +0x140
google.golang.org/protobuf/encoding/protojson.UnmarshalOptions.Unmarshal({{}, 0x0, 0x0, {0xffff6c2dba98, 0x4000618c60}, 0x0}, {0x4000165220, 0x19, 0x20}, {0x1015500, ...})
	/go/pkg/mod/google.golang.org/protobuf@v1.36.6/encoding/protojson/decode.go:63 +0x70
github.com/redpanda-data/redpanda/src/go/rpk/pkg/serde.newProtoEncoder.func1({0x4000165220, 0x19, 0x20})
	/go/pkg/mod/github.com/redpanda-data/redpanda/src/go/rpk@v0.0.0-20250616153404-fffb583072aa/pkg/serde/proto.go:52 +0x94
github.com/redpanda-data/redpanda/src/go/rpk/pkg/serde.(*Serde).EncodeRecord(0x4000502c70, {0x4000165220, 0x19, 0x20})
	/go/pkg/mod/github.com/redpanda-data/redpanda/src/go/rpk@v0.0.0-20250616153404-fffb583072aa/pkg/serde/serde.go:100 +0x78

While not an issue for rpk per-se, if this package is imported into another project which does attempt concurrent encodes, this fails. By moving the creation of the dynamicpb object inside the returned encodefn, it does not impact the performance of rpk but allows concurrent encoding to happen if other packages chose to import it.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

  • none

When serde.EncodeRecord is called on a cached serde function across
multiple goroutines, dynamicpb.Set panics with a concurrent map writes
fatal. This is caused by the message previously being created outside
of the returned encodefn and attempting to unmarshal the byte slice
on the same dynamicpb message.

While not an issue for rpk per-se, if this package is imported into
another project which does attempt concurrent encodes, this fails.
By moving the creation of the dynamicpb object inside the returned
encodefn, it does not impact the performance of rpk but allows
concurrent encoding to happen if other packages chose to import it.
@hunterjm hunterjm requested review from a team, kbatuigas and r-vasquez as code owners June 27, 2025 20:07
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Jun 27, 2025

CLA assistant check
All committers have signed the CLA.

@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Copy Markdown
Contributor

@r-vasquez r-vasquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks for the PR.

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

Retry command for Build#68124

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/archival_test.py::ArchivalTest.test_isolate@{"cloud_storage_type":2}

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

vbotbuildovich commented Jun 28, 2025

CI test results

test results on build#68124
test_class test_method test_arguments test_kind job_url test_status passed reason
ArchivalTest test_isolate {"cloud_storage_type": 2} ducktape https://buildkite.com/redpanda/redpanda/builds/68124#0197b387-f02d-4327-aefb-021270bffe82 FAIL 0/1
CloudStorageChunkReadTest test_read_chunks ducktape https://buildkite.com/redpanda/redpanda/builds/68124#0197b38f-0909-47ba-882f-5405cb2010c5 FLAKY 19/21 upstream reliability is '98.89807162534436'. current run reliability is '90.47619047619048'. drift is 8.42188 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": false, "mixed_versions": true, "with_iceberg": false} ducktape https://buildkite.com/redpanda/redpanda/builds/68124#0197b387-f029-4866-9a6e-2570782f80fe FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
DisablingPartitionsTest test_disable ducktape https://buildkite.com/redpanda/redpanda/builds/68124#0197b38f-0904-4593-8cbe-c6c236212255 FLAKY 19/21 upstream reliability is '91.63120567375887'. current run reliability is '90.47619047619048'. drift is 1.15502 and the allowed drift is set to 50. The test should PASS
test results on build#68364
test_class test_method test_arguments test_kind job_url test_status passed reason
NodesDecommissioningTest test_decommissioning_cancel_ongoing_movements ducktape https://buildkite.com/redpanda/redpanda/builds/68364#0197cd5b-626a-4b51-8a87-2f430f2c6c96 FLAKY 20/21 upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS
RecreateTopicMetadataTest test_recreated_topic_metadata_are_valid {"replication_factor": 3} ducktape https://buildkite.com/redpanda/redpanda/builds/68364#0197cd5b-6270-4a3c-99f1-cf97814e2052 FLAKY 19/21 upstream reliability is '99.22178988326849'. current run reliability is '90.47619047619048'. drift is 8.7456 and the allowed drift is set to 50. The test should PASS

@hunterjm
Copy link
Copy Markdown
Contributor Author

I am unable to see the results of the failed build to verify it is not my test that is flakey in CI. Let me know if I need to make any modifications.

@r-vasquez
Copy link
Copy Markdown
Contributor

@hunterjm Sorry, I was out this week and didn't check the last CI jobs. I just approved the CI workflow. I'll check for failures and report back once it's done. There is no need to rebase for now unless there is an issue. thanks!

@r-vasquez r-vasquez merged commit 4a328aa into redpanda-data:dev Jul 3, 2025
23 checks passed
@r-vasquez
Copy link
Copy Markdown
Contributor

/backport v25.1.x

@vbotbuildovich
Copy link
Copy Markdown
Collaborator

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-26619-v25.1.x-433 remotes/upstream/v25.1.x
git cherry-pick -x d40a21152b c5276c8e78

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants