Skip to content

Conversation

@BewareMyPower
Copy link
Contributor

Motivation

All the retryable operations share the same Backoff object in RetryableLookupService, so if the reconnection happens for some times, the delay of retrying will keeps the maximum value (30 seconds).

Modifications

Refactor the design of the RetryableLookupService:

  • Add a RetryableOperation class to represent a retryable operation, each instance has its own Backoff object. The operation could only be executed once.
  • Add a RetryableOperationCache class to represent a map that maps a specific name to its associated operation. It's an optimization that if an operation (e.g. find the owner topic of topic A) was not complete while the same operation was executed, the future would be reused.
  • In RetryableLookupService, just maintain some caches for different operations.
  • Add RetryableOperationCacheTest to verify the behaviors.

Documentation

  • doc-required
    (Your PR needs to update docs and you will update later)

  • doc-not-needed
    (Please explain why)

  • doc
    (Your PR contains doc changes)

  • doc-complete
    (Docs have been already added)

@BewareMyPower BewareMyPower added the bug Something isn't working label Jun 29, 2023
@BewareMyPower BewareMyPower added this to the 3.3.0 milestone Jun 29, 2023
@BewareMyPower BewareMyPower self-assigned this Jun 29, 2023
@BewareMyPower BewareMyPower marked this pull request as draft June 30, 2023 02:26
@BewareMyPower
Copy link
Contributor Author

BasicEndToEndTest.testLookupThrottling is stuck, I will fix this test ASAP.

### Motivation

All the retryable operations share the same `Backoff` object in
`RetryableLookupService`, so if the reconnection happens for some times,
the delay of retrying will keeps the maximum value (30 seconds).

### Modifications

Refactor the design of the `RetryableLookupService`:
- Add a `RetryableOperation` class to represent a retryable operation,
  each instance has its own `Backoff` object. The operation could only
  be executed once.
- Add a `RetryableOperationCache` class to represent a map that maps a
  specific name to its associated operation. It's an optimization that
  if an operation (e.g. find the owner topic of topic A) was not
  complete while the same operation was executed, the future would be
  reused.
- In `RetryableLookupService`, just maintain some caches for different
  operations.
- Add `RetryableOperationCacheTest` to verify the behaviors.
@BewareMyPower BewareMyPower force-pushed the bewaremypower/fix-retry-backoff branch from 1146d02 to 78bb0be Compare July 4, 2023 11:21
@BewareMyPower BewareMyPower marked this pull request as ready for review July 4, 2023 12:57
Comment on lines 70 to 72
size_t getNumberOfPendingTasks() const {
return lookupCache_->size() + partitionLookupCache_->size() + namespaceLookupCache_->size() +
getSchemaCache_->size();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's only used for testing. Do we need to expose it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed them. PTAL again.

Copy link
Member

@shibd shibd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch! Left some small comments.

@BewareMyPower
Copy link
Contributor Author

It seems some tests failed after merging the main branch. Mark it as drafted currently.

    1278 ms: ./pulsar-tests ClientTest.testConnectTimeout (try #1)
    1259 ms: ./pulsar-tests ClientTest.testConnectTimeout (try #2)
     237 ms: ./pulsar-tests ClientTest.testCloseClient (try #1)
    1280 ms: ./pulsar-tests ClientTest.testConnectTimeout (try #3)
     241 ms: ./pulsar-tests ClientTest.testCloseClient (try #2)
     242 ms: ./pulsar-tests ClientTest.testCloseClient (try #3)
     311 ms: ./pulsar-tests ClientTest.testCloseClient (try #4)
    1410 ms: ./pulsar-tests ClientTest.testConnectTimeout (try #4)
    1056 ms: ./pulsar-tests Pulsar/MessageChunkingTest.testSeekChunkMessages/3 (try #1)

@BewareMyPower BewareMyPower marked this pull request as draft July 6, 2023 11:11
@BewareMyPower BewareMyPower marked this pull request as ready for review July 6, 2023 11:20
@BewareMyPower
Copy link
Contributor Author

@shibd @RobertIndie Now all tests passed, PTAL again.

@BewareMyPower BewareMyPower merged commit e737716 into apache:main Jul 7, 2023
@BewareMyPower BewareMyPower deleted the bewaremypower/fix-retry-backoff branch July 7, 2023 08:23
BewareMyPower added a commit to BewareMyPower/pulsar-client-cpp that referenced this pull request Sep 8, 2023
### Motivation

apache#296 introduced a
regression for GCC <= 7.

> lib/RetryableOperation.h:109:66: error: 'pulsar::RetryableOperation<T>::runImpl(pulsar::TimeDuration)::<lambda(pulsar::Result, const T&)> [with T = pulsar::LookupService::LookupResult]::<lambda(const boost::system::error_code&)>' declared with greater visibility than the type of its field 'pulsar::RetryableOperation<T>::runImpl(pulsar::TimeDuration)::<lambda(pulsar::Result, const T&)> [with T = pulsar::LookupService::LookupResult]::<lambda(const boost::system::error_code&)>::<this capture>' [-Werror=attributes]

It seems to be a bug for GCC <= 7 abort the visibility of the lambda
expression might not be affected by the `-fvisibility=hidden` option.

### Modifications

Add `__attribute__((visibility("hidden")))` to
`RetryableOperation::runImpl` explicitly.
BewareMyPower added a commit that referenced this pull request Sep 9, 2023
### Motivation

#296 introduced a
regression for GCC <= 7.

> lib/RetryableOperation.h:109:66: error: 'pulsar::RetryableOperation<T>::runImpl(pulsar::TimeDuration)::<lambda(pulsar::Result, const T&)> [with T = pulsar::LookupService::LookupResult]::<lambda(const boost::system::error_code&)>' declared with greater visibility than the type of its field 'pulsar::RetryableOperation<T>::runImpl(pulsar::TimeDuration)::<lambda(pulsar::Result, const T&)> [with T = pulsar::LookupService::LookupResult]::<lambda(const boost::system::error_code&)>::<this capture>' [-Werror=attributes]

It seems to be a bug for GCC <= 7 abort the visibility of the lambda
expression might not be affected by the `-fvisibility=hidden` option.

### Modifications

Add `__attribute__((visibility("hidden")))` to
`RetryableOperation::runImpl` explicitly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants