migrations: deflake data_migrations_api_test#28325

Merged
joe-redpanda merged 1 commit into redpanda-data:dev from joe-redpanda:migration_error_mapping
Dec 4, 2025
Conversation

@joe-redpanda
Contributor

@joe-redpanda joe-redpanda commented Nov 3, 2025

Myriad fixes for data_migrations_api_test.

  1. Catch sleep aborted exceptions and map them to the appropriate HTTP error.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.3.x
  • v25.2.x
  • v25.1.x
  • v24.3.x

Release Notes

Improvements

  • reduce test failures on data_migrations_api_test

@joe-redpanda joe-redpanda changed the title migrations: add error handling to list_data_migrations migrations: deflake data_migrations_api_test Nov 3, 2025
@joe-redpanda
Contributor Author

/ci-repeat 1

@vbotbuildovich
Collaborator

vbotbuildovich commented Nov 4, 2025

CI test results

test results on build#75513
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ShadowLinkConsumeGroupsMirroringTest | test_continuous_group_sync | {"source_cluster_spec": {"cluster_type": "kafka", "kafka_quorum": "COMBINED_KRAFT", "kafka_version": "3.8.0"}, "with_failures": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75513#019a4c70-72db-40ef-88cd-66a17f94f401 | FLAKY | 14/21 | upstream reliability is '89.38461538461539'. current run reliability is '66.66666666666666'. drift is 22.71795 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkConsumeGroupsMirroringTest&test_method=test_continuous_group_sync |
| ShadowLinkingReplicationTests | test_auto_prefix_trimming | {"source_cluster_spec": {"cluster_type": "redpanda"}, "with_failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/75513#019a4c70-72de-4b7c-95a0-2adf320a0e8c | FLAKY | 19/21 | upstream reliability is '92.2514619883041'. current run reliability is '90.47619047619048'. drift is 1.77527 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming |
| LogCompactionTxRemovalUpgradeTest | test_tx_control_batch_removal_with_upgrade | {"test_case_name": "All aborts"} | integration | https://buildkite.com/redpanda/redpanda/builds/75513#019a4c79-013b-4f89-ba5d-2486ccae8fd7 | FLAKY | 20/21 | upstream reliability is '97.35099337748345'. current run reliability is '95.23809523809523'. drift is 2.1129 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalUpgradeTest&test_method=test_tx_control_batch_removal_with_upgrade |
| SimpleEndToEndTest | test_relaxed_acks | {"write_caching": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75513#019a4c79-013a-4508-968d-76964ac8bdfc | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SimpleEndToEndTest&test_method=test_relaxed_acks |
| TieredStorageTest | test_tiered_storage | {"cloud_storage_type_and_url_style": [1, "path"], "test_case": {"name": "(TS_Read == True, TS_Timequery == True, SpilloverManifestUploaded == True)"}} | integration | https://buildkite.com/redpanda/redpanda/builds/75513#019a4c79-013b-4f89-ba5d-2486ccae8fd7 | FLAKY | 20/21 | upstream reliability is '99.4949494949495'. current run reliability is '95.23809523809523'. drift is 4.25685 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=TieredStorageTest&test_method=test_tiered_storage |
test results on build#75613
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ShadowLinkingReplicationTests | test_auto_prefix_trimming | {"source_cluster_spec": {"cluster_type": "redpanda"}, "with_failures": true} | integration | https://buildkite.com/redpanda/redpanda/builds/75613#019a51c5-f4f9-4c57-8ed9-51d65ca81141 | FLAKY | 19/21 | upstream reliability is '92.46298788694482'. current run reliability is '90.47619047619048'. drift is 1.9868 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_auto_prefix_trimming |
| DataMigrationsApiTest | test_creating_and_listing_migrations | null | integration | https://buildkite.com/redpanda/redpanda/builds/75613#019a51c5-f4f7-4d7c-8907-578e8e649488 | FLAKY | 16/21 | upstream reliability is '99.02912621359224'. current run reliability is '76.19047619047619'. drift is 22.83865 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_creating_and_listing_migrations |
| LogCompactionTxRemovalTest | test_tx_control_batch_removal | null | integration | https://buildkite.com/redpanda/redpanda/builds/75613#019a51c5-f4f5-4efc-b771-5f2d4cc48add | FLAKY | 14/21 | upstream reliability is '86.7612293144208'. current run reliability is '66.66666666666666'. drift is 20.09456 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalTest&test_method=test_tx_control_batch_removal |
| LogCompactionTxRemovalUpgradeTest | test_tx_control_batch_removal_with_upgrade | {"test_case_name": "All commits"} | integration | https://buildkite.com/redpanda/redpanda/builds/75613#019a51cb-4cc1-43c2-828e-892f61652159 | FLAKY | 20/21 | upstream reliability is '97.28601252609603'. current run reliability is '95.23809523809523'. drift is 2.04792 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=LogCompactionTxRemovalUpgradeTest&test_method=test_tx_control_batch_removal_with_upgrade |
| NodesDecommissioningTest | test_decommissioning_finishes_after_manual_cancellation | {"cloud_topic": false, "delete_topic": true} | integration | https://buildkite.com/redpanda/redpanda/builds/75613#019a51cb-4cc1-43c2-828e-892f61652159 | FLAKY | 20/21 | upstream reliability is '98.63013698630137'. current run reliability is '95.23809523809523'. drift is 3.39204 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=NodesDecommissioningTest&test_method=test_decommissioning_finishes_after_manual_cancellation |
| ScalingUpTest | test_fast_node_addition | null | integration | https://buildkite.com/redpanda/redpanda/builds/75613#019a51cb-4cbe-4f20-92a3-1c7ecc2767da | FLAKY | 20/21 | upstream reliability is '98.43260188087774'. current run reliability is '95.23809523809523'. drift is 3.19451 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ScalingUpTest&test_method=test_fast_node_addition |
| ShadowLinkingRandomOpsTest | test_node_operations | {"failures": false} | integration | https://buildkite.com/redpanda/redpanda/builds/75613#019a51cb-4cc0-496b-9a29-833d772254b5 | FLAKY | 19/21 | upstream reliability is '99.69135802469135'. current run reliability is '90.47619047619048'. drift is 9.21517 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingRandomOpsTest&test_method=test_node_operations |
test results on build#76177
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ShadowLinkingReplicationTests | test_replication_basic | {"shuffle_leadership": true, "source_cluster_spec": {"cluster_type": "redpanda"}} | integration | https://buildkite.com/redpanda/redpanda/builds/76177#019a7aaf-4539-42d0-a75b-6fc651f46397 | FLAKY | 14/21 | upstream reliability is '88.58418367346938'. current run reliability is '66.66666666666666'. drift is 21.91752 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=ShadowLinkingReplicationTests&test_method=test_replication_basic |
| MountUnmountIcebergTest | test_simple_remount | {"cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/76177#019a7aaf-4532-4b9b-869e-0127b3580769 | FLAKY | 18/21 | upstream reliability is '86.37510513036165'. current run reliability is '85.71428571428571'. drift is 0.66082 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=MountUnmountIcebergTest&test_method=test_simple_remount |
| SegmentMsTest | test_segment_rolling_with_retention_consumer | null | integration | https://buildkite.com/redpanda/redpanda/builds/76177#019a7aaf-4533-4465-a817-8b212ffb5cbb | FLAKY | 20/21 | upstream reliability is '92.67241379310344'. current run reliability is '95.23809523809523'. drift is -2.56568 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=SegmentMsTest&test_method=test_segment_rolling_with_retention_consumer |
| src/v/storage/tests/segment_appender_rpbench_test | src/v/storage/tests/segment_appender_rpbench_test | | unit | https://buildkite.com/redpanda/redpanda/builds/76177#019a7a77-e5b3-4898-90ea-52c180759d0a | FAIL | 0/1 | | |
test results on build#76685
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AuditLogTestKafkaApi | test_no_auth_enabled | {"audit_transport_mode": "kclient"} | integration | https://buildkite.com/redpanda/redpanda/builds/76685#019a9e5b-5fa1-499a-a861-9384ba1026a3 | FLAKY | 20/21 | upstream reliability is '95.20639147802929'. current run reliability is '95.23809523809523'. drift is -0.0317 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=AuditLogTestKafkaApi&test_method=test_no_auth_enabled |
| DatalakeClusterRestoreTest | test_restore_partition_spec | {"catalog_type": "rest_hadoop", "cloud_storage_type": 1} | integration | https://buildkite.com/redpanda/redpanda/builds/76685#019a9e5b-5fa5-446a-8904-741fd4870a63 | FLAKY | 20/21 | upstream reliability is '99.16142557651992'. current run reliability is '95.23809523809523'. drift is 3.92333 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DatalakeClusterRestoreTest&test_method=test_restore_partition_spec |
test results on build#77205
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PartitionBalancerTest | test_rack_awareness | null | integration | https://buildkite.com/redpanda/redpanda/builds/77205#019ae0ba-ce00-4b4e-b52c-068b8419406d | FLAKY | 20/21 | upstream reliability is '99.08675799086758'. current run reliability is '95.23809523809523'. drift is 3.84866 and the allowed drift is set to 50. The test should PASS | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=PartitionBalancerTest&test_method=test_rack_awareness |

@joe-redpanda joe-redpanda marked this pull request as ready for review November 5, 2025 01:16
Contributor

Copilot AI left a comment

Pull Request Overview

This PR fixes flakiness in the data_migrations_api_test by improving error handling and shutdown behavior in the data migrations API. The changes ensure that sleep aborted exceptions during shutdown are properly caught and converted to appropriate HTTP errors, and that HTTP requests don't hang during shutdown.

Key Changes:

  • Added error handling for sleep_aborted exceptions in list_migrations() to map them to cluster::errc::shutting_down
  • Updated list_migrations() return type to include error handling via result<> wrapper
  • Modified admin server endpoint to handle errors from list_migrations() and convert them to proper HTTP responses
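
The three changes above can be sketched with the standard library alone. This is a simplified illustration and not the code from the PR: `sleep_aborted`, `errc`, `list_result`, `fetch_migrations`, and `http_status_for` are hypothetical stand-ins for the Seastar and Redpanda types, a `std::variant` stands in for the `result<>` wrapper, and the synchronous functions stand in for coroutines.

```cpp
#include <exception>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for the exception raised when a sleep is aborted
// during shutdown.
struct sleep_aborted : std::exception {
    const char* what() const noexcept override { return "sleep aborted"; }
};

// Hypothetical, heavily reduced error enum.
enum class errc { success, shutting_down };

using migration_list = std::vector<std::string>;
// Simplified result<>: holds either the value or an error code.
using list_result = std::variant<migration_list, errc>;

// Hypothetical backend call that is interrupted while shutting down.
migration_list fetch_migrations(bool shutting_down) {
    if (shutting_down) {
        throw sleep_aborted{};
    }
    return {"migration-1", "migration-2"};
}

// Frontend: translate the shutdown exception into an error code.
// Any other exception propagates to the caller unchanged.
list_result list_migrations(bool shutting_down) {
    try {
        return fetch_migrations(shutting_down);
    } catch (const sleep_aborted&) {
        return errc::shutting_down;
    }
}

// Admin endpoint: convert the result into an HTTP status code.
int http_status_for(const list_result& res) {
    if (std::holds_alternative<migration_list>(res)) {
        return 200; // OK
    }
    switch (std::get<errc>(res)) {
    case errc::shutting_down:
        return 503; // Service Unavailable
    default:
        return 500; // Internal Server Error
    }
}
```

With this shape, a caller that hits a node mid-shutdown receives a 503 it can retry against, rather than an unhandled exception surfacing from the endpoint.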

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/v/cluster/data_migration_frontend.h | Changed list_migrations() return type to include error result wrapper |
| src/v/cluster/data_migration_frontend.cc | Added try-catch for sleep_aborted exception and converted to error result |
| src/v/redpanda/admin/migrations.cc | Added error handling in admin endpoint to process errors from list_migrations() |

@joe-redpanda joe-redpanda force-pushed the migration_error_mapping branch from 0d38bce to da1bdf6 Compare November 7, 2025 20:17
@joe-redpanda
Contributor Author

Rebase atop Michal's changes

@joe-redpanda
Contributor Author

Not sure how the "stop" change fell off in the rebase, readded and testing

@joe-redpanda joe-redpanda force-pushed the migration_error_mapping branch from 50b2052 to f2ca443 Compare November 12, 2025 23:47
@joe-redpanda
Contributor Author

no changes, rebased onto dev to reduce my build times

@bashtanov
Contributor

As far as I understand router needs to be requested to stop early, in controller shutdown input. That's for Admin API to be able to stop. See #28491

@joe-redpanda
Contributor Author

> As far as I understand router needs to be requested to stop early, in controller shutdown input. That's for Admin API to be able to stop. See #28491

makes sense, left a comment. Should we still have a stop that awaits the completion of the shutdown of the contained routers?

@joe-redpanda joe-redpanda force-pushed the migration_error_mapping branch from f2ca443 to 6d41416 Compare November 19, 2025 22:08
@joe-redpanda joe-redpanda force-pushed the migration_error_mapping branch from 6d41416 to 4242b7b Compare December 2, 2025 19:17
Invoke on instance can fail on shutdown with a sleep_aborted exception.
Catch the exception and translate it to an errc::shutting_down code so
that the admin server can handle it correctly by turning it into a
service unavailable response.
@joe-redpanda joe-redpanda force-pushed the migration_error_mapping branch from 4242b7b to 28f67e4 Compare December 2, 2025 19:22
co_return cluster::errc::shutting_down;
}
vlog(dm_log.error, "unexpected exception on list_migrations {}", eptr);
throw;
Contributor

use rethrow_exception?

Contributor Author

Throw with no expression just rethrows the current exception

Am I missing something?

Contributor

was a nitpick, I thought since eptr was already captured in a local variable, we may as well just rethrow that.
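
For readers following the exchange above, here is a minimal standalone sketch comparing the two forms. It is an illustration only: the function names are hypothetical, and the `std::runtime_error` stands in for whatever unexpected exception the real handler would see. Inside a handler, a bare `throw;` re-raises the exception currently being handled, and `std::rethrow_exception(eptr)` raises the exception captured in `eptr`; in both cases the caller can still catch it by its original type.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Rethrow with a bare `throw;` after capturing the exception pointer,
// as a handler might do in order to log it first.
void rethrow_bare() {
    try {
        throw std::runtime_error("boom");
    } catch (...) {
        std::exception_ptr eptr = std::current_exception();
        (void)eptr; // a real handler would pass this to its logger
        throw;      // re-raises the exception currently being handled
    }
}

// Rethrow through the captured exception pointer instead.
void rethrow_from_eptr() {
    try {
        throw std::runtime_error("boom");
    } catch (...) {
        std::exception_ptr eptr = std::current_exception();
        std::rethrow_exception(eptr);
    }
}

// Helper: run fn and report the message of the exception that escapes.
std::string message_of(void (*fn)()) {
    try {
        fn();
    } catch (const std::runtime_error& e) {
        return e.what();
    }
    return "no exception";
}
```

Calling `message_of(rethrow_bare)` and `message_of(rethrow_from_eptr)` both yield the original message, so in a handler like this one the choice between the two is largely stylistic.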

@joe-redpanda joe-redpanda merged commit 6121646 into redpanda-data:dev Dec 4, 2025
19 checks passed
@vbotbuildovich
Collaborator

/backport v25.3.x

6 participants