
tests/kgo: handle errors when polling for verifier status#27013

Merged
nvartolomei merged 2 commits into redpanda-data:dev from mmaslankaprv:kgo-verifier-status on Jul 30, 2025

Conversation

@mmaslankaprv
Member

@mmaslankaprv mmaslankaprv commented Jul 28, 2025

The verifier status request sometimes returns an error because of a transient network issue or because the application is slow to start. In that case the test failed outright, because the producer status was never reported even when the producer finished successfully. This change adds error handling to the status loop so that such errors no longer fail the tests.
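
As an illustration of the idea (not the actual patch, which is not shown in this conversation), a status loop that tolerates transient errors might look like the sketch below. The names `poll_status`, `fetch_status`, and `is_done`, the default timeout and interval, and the choice of `ConnectionError`/`TimeoutError` as the "transient" exceptions are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


def poll_status(
    fetch_status: Callable[[], dict],
    is_done: Callable[[dict], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until is_done(status) holds, ignoring transient fetch errors.

    Hypothetical helper: fetch_status stands in for the HTTP call to the
    verifier's status endpoint.
    """
    deadline = clock() + timeout_s
    last_error: Optional[Exception] = None
    while clock() < deadline:
        try:
            status = fetch_status()
        except (ConnectionError, TimeoutError) as e:
            # Transient failure: remember it and retry on the next tick
            # instead of failing the whole test.
            last_error = e
        else:
            if is_done(status):
                return status
        sleep(interval_s)
    # The deadline passed without a finished status; surface the last error.
    raise TimeoutError(
        f"status not ready after {timeout_s}s; last error: {last_error!r}"
    )
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; a caller would normally leave them at their defaults.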

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

  • none

nvartolomei
nvartolomei previously approved these changes Jul 28, 2025
Contributor

@nvartolomei nvartolomei left a comment


I'm thinking about the case where kgo crashes or doesn't start: this change will probably cause longer waits, i.e. a minute instead of failing in a few seconds.

Should we limit the number of retries? E.g. if we can't reach the status endpoint in 3 tries one second apart, it is not worth trying anymore. Or something like that.

The same goes for HTTP error codes: we shouldn't retry on them at all because we don't expect any.

This is from ChatGPT and looks suspicious, but I wanted to show the connect/read params of the Retry object.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry configuration
retry_strategy = Retry(
    total=3,                    # total number of retries
    connect=3,                  # retries on connection errors
    read=3,                     # retries on read errors (socket errors)
    status=0,                   # no retries on specific HTTP status codes
    backoff_factor=1,           # exponential backoff: the delay doubles with each retry
    allowed_methods=["GET", "POST"], # methods to retry (e.g., GET, POST, PUT, etc.)
    raise_on_status=False,      # don't raise error on HTTP errors (since status=0)
)

adapter = HTTPAdapter(max_retries=retry_strategy)
session = requests.Session()
session.mount("http://", adapter)
session.mount("https://", adapter)

try:
    response = session.get("https://example.com", timeout=5)
    response.raise_for_status()
    print(response.status_code)
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
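
For comparison, the "3 tries one second apart, no retries on HTTP error codes" policy suggested above can also be written by hand, without a Retry object. The following is a minimal sketch under stated assumptions: `fetch_with_bounded_retries` and `HttpStatusError` are hypothetical names, and `fetch` stands in for whatever call reads the verifier status.

```python
import time
from typing import Callable, Optional


class HttpStatusError(Exception):
    """Hypothetical error: the server answered with an unexpected status code."""

    def __init__(self, code: int):
        super().__init__(f"unexpected HTTP status {code}")
        self.code = code


def fetch_with_bounded_retries(
    fetch: Callable[[], dict],
    attempts: int = 3,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch up to `attempts` times, retrying only connection-level errors."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except HttpStatusError:
            # The endpoint is reachable but misbehaving; fail immediately.
            raise
        except (ConnectionError, TimeoutError) as e:
            last_error = e
            if attempt < attempts:
                sleep(delay_s)  # fixed delay between tries, no backoff
    raise ConnectionError(
        f"status endpoint unreachable after {attempts} attempts"
    ) from last_error
```

With these defaults a process that never starts is reported after roughly two seconds of waiting rather than after the full polling timeout, which is the trade-off raised in the comment above.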

nvartolomei
nvartolomei previously approved these changes Jul 29, 2025
mmaslankaprv and others added 2 commits July 30, 2025 21:32
It sometimes happens that the verifier status request returns an error
due to transient network issue or application being slow to start. In
this case the test failed completely as the producer status wasn't
reported even if the producer finished successfully. Added error
handling to the status loop to prevent those errors from failing the
tests.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@nvartolomei nvartolomei force-pushed the kgo-verifier-status branch from 67598f8 to c7237ba Compare July 30, 2025 20:33
@nvartolomei nvartolomei requested a review from a team as a code owner July 30, 2025 20:33
@nvartolomei nvartolomei requested review from rpdevmp and removed request for a team July 30, 2025 20:33
@nvartolomei nvartolomei enabled auto-merge July 30, 2025 20:43
@vbotbuildovich
Collaborator

CI test results

test results on build#69963

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "compaction_mode": "sliding_window", "enable_failures": true, "mixed_versions": false, "with_iceberg": false} | integration | https://buildkite.com/redpanda/redpanda/builds/69963#01985d21-1a7f-47e2-9d32-544adf18c59c | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 50. The test should PASS |

@nvartolomei nvartolomei merged commit ee6c7f6 into redpanda-data:dev Jul 30, 2025
20 checks passed
@vbotbuildovich
Collaborator

/backport v25.2.x

@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v25.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27013-v25.1.x-270 remotes/upstream/v25.1.x
git cherry-pick -x 9dff3196ee c7237ba7dd

Workflow run logs.

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-27013-v24.3.x-374 remotes/upstream/v24.3.x
git cherry-pick -x 9dff3196ee c7237ba7dd

Workflow run logs.


4 participants