Page MenuHomePhabricator

[services] Tunnelbroker - Fix AMQP client reconnection algorithm
ClosedPublic

Authored by max on Aug 4 2022, 11:23 AM.
Tags
None
Referenced Files
F2183855: D4744.diff
Wed, Jul 3, 8:14 PM
F2181167: D4744.id15330.diff
Wed, Jul 3, 5:09 PM
Unknown Object (File)
Sun, Jun 30, 8:34 PM
Unknown Object (File)
Sat, Jun 29, 5:08 PM
Unknown Object (File)
Wed, Jun 26, 4:16 PM
Unknown Object (File)
Wed, Jun 26, 2:41 PM
Unknown Object (File)
Wed, Jun 26, 6:16 AM
Unknown Object (File)
Wed, Jun 26, 5:48 AM

Details

Summary

This diff is a part of the stack.

This diff introduces a fix to the AMQP-Client reconnection algorithm.

  1. In the current implementation uses AMQP_SHORTEST_RECONNECTION_ATTEMPT_INTERVAL constant and the current timestamp to make a delay.

That was changed to use AMQP_RECONNECT_MAX_ATTEMPTS and AMQP_RECONNECT_ATTEMPT_INTERVAL instead.
AMQP_RECONNECT_MAX_ATTEMPTS reflects the maximum attempts before we exit with the error. The value is 10 attempts.
AMQP_RECONNECT_ATTEMPT_INTERVAL reflects the interval between reconnect attempts. The value is 3 seconds.
The maximum waiting time is 30 seconds with 3 seconds intervals and 10 attempts maximum. Which looks enough to me to reconnect in case of network issues.

  1. The current implementation in AmqpManager::connect() -> while(true) loop doesn't work.

In case of the channel/connection is closed it just throws an error or segmentation fault due to the access to the this->amqpChannel which is null in that case.

The loop was changed to use a local atomic reconnectAttempt counter and AMQP_RECONNECT_MAX_ATTEMPTS maximum attempts constant. As long as throw a fatal error only once outside of the loop when the maximum reconnect attempts were reached.

  1. On the successful reconnect clearing the reconnectAttempt counter was added.
  1. A waiter method introduced in D4741 updated to use AMQP_RECONNECT_ATTEMPT_INTERVAL to wait for another attempt to check if the connection/channel is ready.

Related linear task: ENG-1495

Test Plan

Successfully built using yarn run-tunnelbroker-service-in-sandbox command.
Passing all AMQP unit tests in the last diff D4749 in a stack.

Diff Detail

Repository
rCOMM Comm
Branch
fix-amqp-reconnect-logic
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

max held this revision as a draft.
max retitled this revision from [services] Tunnelbroker - Fix Amqp client reconnection logic to [services] Tunnelbroker - Fix AMQP client reconnection algorithm.Aug 4 2022, 12:17 PM
max edited the summary of this revision. (Show Details)
max edited the test plan for this revision. (Show Details)
max added reviewers: karol, tomek.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
49 ↗(On Diff #15330)

Clear the counter on successful reconnect.

96 ↗(On Diff #15330)

We should not throw a fatal error here. In case of the connection is lost it will throw a fatal error instead of going forward to the reconnection loop. It was changed to a log error.

max published this revision for review.Aug 4 2022, 4:53 PM
max edited the test plan for this revision. (Show Details)
karol added inline comments.
services/tunnelbroker/src/Constants.h
40–41

where do these numbers come from?

tomek added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
103–104

Shouldn't we increase the counter before calling connectInternal? Currently, the following can happen:

  1. connect is called
  2. connectInternal is called and it works - reconnectAttempt is set to 0 in onReady
  3. reconnectAttempt is increased
  4. next loop iterations with unsuccessful connects

The starting value will be reconnectAttempt = 1, so effectively we will try to reconnect only AMQP_RECONNECT_MAX_ATTEMPTS - 1 times.

108–109

What do you think about having exponential backoff also here? It should be really easy to implement, because x * 2 ^ reconnectAttempt can be used as a sleep duration.

services/tunnelbroker/src/Constants.h
40

Having a unit as a part of name is a lot better than comments with explanation - they will go out of sync.

This revision is now accepted and ready to land.Aug 9 2022, 9:08 AM
services/tunnelbroker/src/Constants.h
40

Another idea is to store std::chrono::milliseconds(3000)) directly as a constant - that is probably the best solution

Rebase/Merge on master changes.
Change constant name from AMQP_RECONNECT_ATTEMPT_INTERVAL to AMQP_RECONNECT_ATTEMPT_INTERVAL_MS.

max added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
103–104

Shouldn't we increase the counter before calling connectInternal? Currently, the following can happen:

  1. connect is called
  2. connectInternal is called and it works - reconnectAttempt is set to 0 in onReady
  3. reconnectAttempt is increased
  4. next loop iterations with unsuccessful connects

I think it doesn't matter here to swap these lines because of:
In case of a successful connection, we are resetting the counter inside the connectInternal() but we have a condition to check it in a while loop above/outside it. Also, if we swap them the log message below will have a 0 attempt after the first connection lost.

The starting value will be reconnectAttempt = 1, so effectively we will try to reconnect only AMQP_RECONNECT_MAX_ATTEMPTS - 1 times.

Seems you're right. We can fix it by changing the condition above from this->reconnectAttempt < AMQP_RECONNECT_MAX_ATTEMPTS to this->reconnectAttempt <= AMQP_RECONNECT_MAX_ATTEMPTS.

108–109

What do you think about having exponential backoff also here? It should be really easy to implement, because x * 2 ^ reconnectAttempt can be used as sleep duration.

There is already a follow-up task ENG-1381 to switch exponential backoff.

services/tunnelbroker/src/Constants.h
40

Having a unit as a part of name is a lot better than comments with explanation - they will go out of sync.

Agree! I've changed to use AMQP_RECONNECT_ATTEMPT_INTERVAL_MS.

40

Another idea is to store std::chrono::milliseconds(3000)) directly as a constant - that is probably the best solution

This constant will be removed by the exponential backoff algorithm in the future by ENG-1381 so we can omit to make now. Thanks for the suggestion!

40–41

where do these numbers come from?

The reasonable maximum wait timeout to reconnect is around 30 seconds, and the reasonable attempt wait to reconnect is 2-5 seconds speaking for the rabbitMQ. That's why I went to 10 attempts by 3 seconds.

max marked 4 inline comments as done.

Minor refactoring to the AmqpManager::connect() to connect after waiting and run first connect outside of the reconnection loop.

services/tunnelbroker/src/Amqp/AmqpManager.cpp
103–104

Shouldn't we increase the counter before calling connectInternal? Currently, the following can happen:

  1. connect is called
  2. connectInternal is called and it works - reconnectAttempt is set to 0 in onReady
  3. reconnectAttempt is increased
  4. next loop iterations with unsuccessful connects

The starting value will be reconnectAttempt = 1, so effectively we will try to reconnect only AMQP_RECONNECT_MAX_ATTEMPTS - 1 times.

I think the better way is a minor refactoring of AmqpManager::connect() to wait before connecting:

void AmqpManager::connect() {
  this->connectInternal();
  while (this->reconnectAttempt < AMQP_RECONNECT_MAX_ATTEMPTS) {
    this->reconnectAttempt++;
    LOG(INFO) << "AMQP: Attempt " << this->reconnectAttempt
              << " to reconnect in " << AMQP_RECONNECT_ATTEMPT_INTERVAL_MS
              << " ms";
    std::this_thread::sleep_for(
        std::chrono::milliseconds(AMQP_RECONNECT_ATTEMPT_INTERVAL_MS));
    this->connectInternal();
  }
  LOG(FATAL) << "Cannot connect to AMQP server after "
             << AMQP_RECONNECT_MAX_ATTEMPTS << " attempts";
}

I've made an update with this.