Page MenuHomePhabricator

[services] Tunnelbroker - Fix AMQP client reconnection algorithm
ClosedPublic

Authored by max on Aug 4 2022, 11:23 AM.
Tags
None
Referenced Files
F2155069: D4744.id.diff
Sun, Jun 30, 8:34 PM
Unknown Object (File)
Sat, Jun 29, 5:08 PM
Unknown Object (File)
Wed, Jun 26, 4:16 PM
Unknown Object (File)
Wed, Jun 26, 2:41 PM
Unknown Object (File)
Wed, Jun 26, 6:16 AM
Unknown Object (File)
Wed, Jun 26, 5:48 AM
Unknown Object (File)
Wed, Jun 26, 5:48 AM
Unknown Object (File)
Wed, Jun 26, 5:48 AM

Details

Summary

This diff is a part of the stack.

This diff introduces a fix to the AMQP-Client reconnection algorithm.

  1. In the current implementation uses AMQP_SHORTEST_RECONNECTION_ATTEMPT_INTERVAL constant and the current timestamp to make a delay.

That was changed to use AMQP_RECONNECT_MAX_ATTEMPTS and AMQP_RECONNECT_ATTEMPT_INTERVAL instead.
AMQP_RECONNECT_MAX_ATTEMPTS reflects the maximum attempts before we exit with the error. The value is 10 attempts.
AMQP_RECONNECT_ATTEMPT_INTERVAL reflects the interval between reconnect attempts. The value is 3 seconds.
The maximum waiting time is 30 seconds with 3 seconds intervals and 10 attempts maximum. Which looks enough to me to reconnect in case of network issues.

  1. The current implementation in AmqpManager::connect() -> while(true) loop doesn't work.

In case of the channel/connection is closed it just throws an error or segmentation fault due to the access to the this->amqpChannel which is null in that case.

The loop was changed to use a local atomic reconnectAttempt counter and AMQP_RECONNECT_MAX_ATTEMPTS maximum attempts constant. As long as throw a fatal error only once outside of the loop when the maximum reconnect attempts were reached.

  1. On the successful reconnect clearing the reconnectAttempt counter was added.
  1. A waiter method introduced in D4741 updated to use AMQP_RECONNECT_ATTEMPT_INTERVAL to wait for another attempt to check if the connection/channel is ready.

Related linear task: ENG-1495

Test Plan

Successfully built using yarn run-tunnelbroker-service-in-sandbox command.
Passing all AMQP unit tests in the last diff D4749 in a stack.

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

max held this revision as a draft.
max retitled this revision from [services] Tunnelbroker - Fix Amqp client reconnection logic to [services] Tunnelbroker - Fix AMQP client reconnection algorithm.Aug 4 2022, 12:17 PM
max edited the summary of this revision. (Show Details)
max edited the test plan for this revision. (Show Details)
max added reviewers: karol, tomek.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
49 ↗(On Diff #15330)

Clear the counter on successful reconnect.

96 ↗(On Diff #15330)

We should not throw a fatal error here. In case of the connection is lost it will throw a fatal error instead of going forward to the reconnection loop. It was changed to a log error.

max published this revision for review.Aug 4 2022, 4:53 PM
max edited the test plan for this revision. (Show Details)
karol added inline comments.
services/tunnelbroker/src/Constants.h
40–41 ↗(On Diff #15408)

where do these numbers come from?

tomek added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
103–104 ↗(On Diff #15408)

Shouldn't we increase the counter before calling connectInternal? Currently, the following can happen:

  1. connect is called
  2. connectInternal is called and it works - reconnectAttempt is set to 0 in onReady
  3. reconnectAttempt is increased
  4. next loop iterations with unsuccessful connects

The starting value will be reconnectAttempt = 1, so effectively we will try to reconnect only AMQP_RECONNECT_MAX_ATTEMPTS - 1 times.

108–109 ↗(On Diff #15408)

What do you think about having exponential backoff also here? It should be really easy to implement, because x * 2 ^ reconnectAttempt can be used as a sleep duration.

services/tunnelbroker/src/Constants.h
40 ↗(On Diff #15408)

Having a unit as a part of name is a lot better than comments with explanation - they will go out of sync.

This revision is now accepted and ready to land.Aug 9 2022, 9:08 AM
services/tunnelbroker/src/Constants.h
40 ↗(On Diff #15408)

Another idea is to store std::chrono::milliseconds(3000)) directly as a constant - that is probably the best solution

Rebase/Merge on master changes.
Change constant name from AMQP_RECONNECT_ATTEMPT_INTERVAL to AMQP_RECONNECT_ATTEMPT_INTERVAL_MS.

max added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
103–104 ↗(On Diff #15408)

Shouldn't we increase the counter before calling connectInternal? Currently, the following can happen:

  1. connect is called
  2. connectInternal is called and it works - reconnectAttempt is set to 0 in onReady
  3. reconnectAttempt is increased
  4. next loop iterations with unsuccessful connects

I think it doesn't matter here to swap these lines because of:
In case of a successful connection, we are resetting the counter inside the connectInternal() but we have a condition to check it in a while loop above/outside it. Also, if we swap them the log message below will have a 0 attempt after the first connection lost.

The starting value will be reconnectAttempt = 1, so effectively we will try to reconnect only AMQP_RECONNECT_MAX_ATTEMPTS - 1 times.

Seems you're right. We can fix it by changing the condition above from this->reconnectAttempt < AMQP_RECONNECT_MAX_ATTEMPTS to this->reconnectAttempt <= AMQP_RECONNECT_MAX_ATTEMPTS.

108–109 ↗(On Diff #15408)

What do you think about having exponential backoff also here? It should be really easy to implement, because x * 2 ^ reconnectAttempt can be used as sleep duration.

There is already a follow-up task ENG-1381 to switch exponential backoff.

services/tunnelbroker/src/Constants.h
40 ↗(On Diff #15408)

Having a unit as a part of name is a lot better than comments with explanation - they will go out of sync.

Agree! I've changed to use AMQP_RECONNECT_ATTEMPT_INTERVAL_MS.

40 ↗(On Diff #15408)

Another idea is to store std::chrono::milliseconds(3000)) directly as a constant - that is probably the best solution

This constant will be removed by the exponential backoff algorithm in the future by ENG-1381 so we can omit to make now. Thanks for the suggestion!

40–41 ↗(On Diff #15408)

where do these numbers come from?

The reasonable maximum wait timeout to reconnect is around 30 seconds, and the reasonable attempt wait to reconnect is 2-5 seconds speaking for the rabbitMQ. That's why I went to 10 attempts by 3 seconds.

max marked 4 inline comments as done.

Minor refactoring to the AmqpManager::connect() to connect after waiting and run first connect outside of the reconnection loop.

services/tunnelbroker/src/Amqp/AmqpManager.cpp
103–104 ↗(On Diff #15408)

Shouldn't we increase the counter before calling connectInternal? Currently, the following can happen:

  1. connect is called
  2. connectInternal is called and it works - reconnectAttempt is set to 0 in onReady
  3. reconnectAttempt is increased
  4. next loop iterations with unsuccessful connects

The starting value will be reconnectAttempt = 1, so effectively we will try to reconnect only AMQP_RECONNECT_MAX_ATTEMPTS - 1 times.

I think the better way is a minor refactoring of AmqpManager::connect() to wait before connecting:

void AmqpManager::connect() {
  this->connectInternal();
  while (this->reconnectAttempt < AMQP_RECONNECT_MAX_ATTEMPTS) {
    this->reconnectAttempt++;
    LOG(INFO) << "AMQP: Attempt " << this->reconnectAttempt
              << " to reconnect in " << AMQP_RECONNECT_ATTEMPT_INTERVAL_MS
              << " ms";
    std::this_thread::sleep_for(
        std::chrono::milliseconds(AMQP_RECONNECT_ATTEMPT_INTERVAL_MS));
    this->connectInternal();
  }
  LOG(FATAL) << "Cannot connect to AMQP server after "
             << AMQP_RECONNECT_MAX_ATTEMPTS << " attempts";
}

I've made an update with this.