Details

Reviewers

• karol
tomek

Commits

rCOMM5dfe7615251c: [services] Tunnelbroker - Fix AMQP client reconnection algorithm

Summary

This diff is a part of the stack.

This diff introduces a fix to the AMQP-Client reconnection algorithm.

The current implementation uses the AMQP_SHORTEST_RECONNECTION_ATTEMPT_INTERVAL constant and the current timestamp to compute the delay.

This diff replaces that with AMQP_RECONNECT_MAX_ATTEMPTS and AMQP_RECONNECT_ATTEMPT_INTERVAL.
AMQP_RECONNECT_MAX_ATTEMPTS is the maximum number of attempts before we exit with an error. The value is 10 attempts.
AMQP_RECONNECT_ATTEMPT_INTERVAL is the interval between reconnect attempts. The value is 3 seconds.
With 10 attempts at 3-second intervals, the maximum waiting time is 30 seconds, which looks like enough to me to reconnect in case of network issues.
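
The 30-second figure can be checked at compile time. This is a minimal sketch, assuming the constants are plain integers and that the interval is stored in milliseconds (the summary only states "3 seconds"). AMQP_RECONNECT_MAX_WAIT_MS is a hypothetical name used for illustration and is not part of this diff.

```cpp
#include <cstddef>

// Values taken from the summary; the exact types used in Constants.h are an
// assumption.
constexpr std::size_t AMQP_RECONNECT_MAX_ATTEMPTS = 10;
// Assumed to be expressed in milliseconds (3 seconds).
constexpr std::size_t AMQP_RECONNECT_ATTEMPT_INTERVAL = 3000;

// Hypothetical helper constant: worst-case total time spent waiting between
// attempts before giving up.
constexpr std::size_t AMQP_RECONNECT_MAX_WAIT_MS =
    AMQP_RECONNECT_MAX_ATTEMPTS * AMQP_RECONNECT_ATTEMPT_INTERVAL;

static_assert(
    AMQP_RECONNECT_MAX_WAIT_MS == 30000,
    "10 attempts at 3 s each should add up to 30 s");
```

If either constant changes, the static_assert makes the mismatch with the documented 30 seconds visible at build time.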

The current while(true) loop in AmqpManager::connect() doesn't work.

If the channel/connection is closed, it just throws an error or hits a segmentation fault because it accesses this->amqpChannel, which is null in that case.

The loop was changed to use a local atomic reconnectAttempt counter and the AMQP_RECONNECT_MAX_ATTEMPTS constant. A fatal error is now thrown only once, outside of the loop, when the maximum number of reconnect attempts is reached.

The reconnectAttempt counter is now cleared on a successful reconnect.
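
The loop shape described above can be sketched in isolation. This is not the AmqpManager code: ReconnectLoop, connectOnce and wait are hypothetical stand-ins, an exception replaces the fatal log so the behavior can be observed, and the ordering (first connect outside the loop, then wait before each retry) follows the refactoring discussed later in this review.

```cpp
#include <atomic>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical stand-in for the reconnection part of AmqpManager.
class ReconnectLoop {
public:
  // connectOnce stands in for connectInternal(): it returns true if the
  // connection became ready before it was lost, false if it never came up.
  // wait stands in for sleeping between attempts.
  ReconnectLoop(
      std::size_t maxAttempts,
      std::function<bool()> connectOnce,
      std::function<void()> wait)
      : maxAttempts_(maxAttempts),
        connectOnce_(std::move(connectOnce)),
        wait_(std::move(wait)) {}

  // Returns only by throwing, once the attempts are exhausted.
  void run() {
    this->attemptConnect();
    while (this->reconnectAttempt_ < this->maxAttempts_) {
      ++this->reconnectAttempt_;
      this->wait_();
      this->attemptConnect();
    }
    // Raised exactly once, outside of the loop.
    throw std::runtime_error("Cannot connect after the maximum attempts");
  }

private:
  void attemptConnect() {
    if (this->connectOnce_()) {
      // Clear the counter on a successful (re)connect.
      this->reconnectAttempt_ = 0;
    }
  }

  const std::size_t maxAttempts_;
  std::function<bool()> connectOnce_;
  std::function<void()> wait_;
  std::atomic<std::size_t> reconnectAttempt_{0};
};
```

With a connectOnce that always fails and a limit of 10, this makes one initial attempt plus 10 retries. A success at any point resets the counter, so the next connection loss again gets the full 10 retries.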

The waiter method introduced in D4741 was updated to use AMQP_RECONNECT_ATTEMPT_INTERVAL as the wait between checks of whether the connection/channel is ready.

Related Linear task: ENG-1495

Test Plan

Successfully built using the yarn run-tunnelbroker-service-in-sandbox command.
All AMQP unit tests pass in D4749, the last diff in the stack.

Diff Detail

Repository

rCOMM Comm

Branch

fix-amqp-reconnect-logic

Lint

No Lint Coverage

Unit

No Test Coverage

Event Timeline

• max created this revision. Aug 4 2022, 11:23 AM

• max held this revision as a draft.

Herald added subscribers: abosh, karol, atul and 3 others. Aug 4 2022, 11:23 AM

Harbormaster completed remote builds in B11133: Diff 15330. Aug 4 2022, 11:33 AM

• max retitled this revision from [services] Tunnelbroker - Fix Amqp client reconnection logic to [services] Tunnelbroker - Fix AMQP client reconnection algorithm. Aug 4 2022, 12:17 PM

• max edited the summary of this revision.

• max edited the test plan for this revision.

• max added reviewers: karol, tomek.

• max added parent revisions: D4743: [services] Tunnelbroker - Refactor handler names in Amqp `connectInternal()`, D4742: [services] Tunnelbroker - Fix `amqpReady` assignment, D4741: [services] Tunnelbroker - Add `waitUntilReady` function in AmqpManager, D4740: [services] Tunnelbroker - Wrap `connect()` into `init()` in AmqpManager.

• max added inline comments. Aug 4 2022, 12:21 PM

services/tunnelbroker/src/Amqp/AmqpManager.cpp
49 ↗	(On Diff #15330)	Clear the counter on successful reconnect.
96 ↗	(On Diff #15330)	We should not throw a fatal error here. If the connection is lost, this would throw a fatal error instead of proceeding to the reconnection loop. It was changed to log an error.

• max added a child revision: D4746: [services] Tunnelbroker - Changes in Amqp messages testing to send and listen in different threads. Aug 4 2022, 4:28 PM

• max added a child revision: D4749: [services] Tunnelbroker - Add timeout for a pop message waiting in AmqpManager tests. Aug 4 2022, 4:45 PM

• max published this revision for review. Aug 4 2022, 4:53 PM

• max edited the test plan for this revision.

• max mentioned this in D4741: [services] Tunnelbroker - Add `waitUntilReady` function in AmqpManager. Aug 5 2022, 6:07 AM

Rebase on parent changes.

• max added a parent revision: D4767: [services] Tunnelbroker - Add AMQP shared channel locking. Aug 6 2022, 11:52 AM

Harbormaster failed remote builds in B11194: Diff 15402! Aug 6 2022, 11:53 AM

Rebase on master changes.

Harbormaster completed remote builds in B11199: Diff 15408. Aug 6 2022, 12:56 PM

• max added a child revision: D4768: [services] Tunnelbroker - Amqp Manager in parallel threads messages throughput test. Aug 6 2022, 1:06 PM

• karol accepted this revision. Aug 8 2022, 6:58 AM

• karol added inline comments.

services/tunnelbroker/src/Constants.h
40–41	where do these numbers come from?

tomek accepted this revision. Aug 9 2022, 9:08 AM

tomek added inline comments.

services/tunnelbroker/src/Amqp/AmqpManager.cpp
103–104	Shouldn't we increase the counter before calling `connectInternal`? Currently, the following can happen:
1. `connect` is called
2. `connectInternal` is called and it works - `reconnectAttempt` is set to 0 in `onReady`
3. `reconnectAttempt` is increased
4. next loop iterations with unsuccessful connects
The starting value will be `reconnectAttempt = 1`, so effectively we will try to reconnect only `AMQP_RECONNECT_MAX_ATTEMPTS - 1` times.
108–109	What do you think about having exponential backoff also here? It should be really easy to implement, because `x * 2 ^ reconnectAttempt` can be used as a sleep duration.
services/tunnelbroker/src/Constants.h
40	Having a unit as a part of name is a lot better than comments with explanation - they will go out of sync.
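
The exponential backoff suggested for lines 108–109 can be sketched as a small helper. This is only an illustration: backoffDelay, the cap parameter and the exponent limit of 30 are assumptions and not part of this diff. Note that `^` is bitwise XOR in C++, so `x * 2 ^ reconnectAttempt` has to be written with a shift (or a multiplication loop) rather than literally.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>

// Hypothetical helper: returns base * 2^attempt, capped at maxDelay.
inline std::chrono::milliseconds backoffDelay(
    std::chrono::milliseconds base,
    std::size_t attempt,
    std::chrono::milliseconds maxDelay) {
  using Rep = std::chrono::milliseconds::rep;
  // Clamp the exponent so the shift below stays well inside the range of Rep.
  const std::size_t exponent = std::min<std::size_t>(attempt, 30);
  // 2^exponent, built with a shift because `^` is XOR in C++.
  const Rep factor = static_cast<Rep>(1) << exponent;
  // Compare before multiplying so that base * factor cannot overflow.
  if (base.count() > 0 && factor > maxDelay.count() / base.count()) {
    return maxDelay;
  }
  return std::min(base * factor, maxDelay);
}
```

For example, with an assumed base of 500 ms and a 30-second cap, attempts 0, 1, 2 and 3 wait 500, 1000, 2000 and 4000 ms, and later attempts level off at the cap.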

This revision is now accepted and ready to land. Aug 9 2022, 9:08 AM

tomek added inline comments. Aug 9 2022, 9:12 AM

services/tunnelbroker/src/Constants.h
40	Another idea is to store `std::chrono::milliseconds(3000)` directly as a constant - that is probably the best solution
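
A sketch of what that suggestion could look like, assuming the constant keeps its original name and the 3-second value from the summary. waitBeforeReconnect is a hypothetical helper that only shows the call site.

```cpp
#include <chrono>
#include <thread>

// The unit is carried by the type, so neither a _MS suffix nor a comment is
// needed to explain it. The name and value mirror the summary.
constexpr std::chrono::milliseconds AMQP_RECONNECT_ATTEMPT_INTERVAL{3000};

// Durations in different units compare by value, not by raw count.
static_assert(
    AMQP_RECONNECT_ATTEMPT_INTERVAL == std::chrono::seconds(3),
    "the interval is expected to be 3 seconds");

// Hypothetical helper: the constant is passed to sleep_for as-is, without
// wrapping it in std::chrono::milliseconds(...) at the call site.
inline void waitBeforeReconnect() {
  std::this_thread::sleep_for(AMQP_RECONNECT_ATTEMPT_INTERVAL);
}
```

The trade-off is that every consumer of the constant has to work with std::chrono types instead of a plain integer.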

• max mentioned this in D4740: [services] Tunnelbroker - Wrap `connect()` into `init()` in AmqpManager. Aug 12 2022, 7:22 AM

Rebase/Merge on master changes.
Change constant name from AMQP_RECONNECT_ATTEMPT_INTERVAL to AMQP_RECONNECT_ATTEMPT_INTERVAL_MS.

Harbormaster completed remote builds in B11506: Diff 15827. Aug 22 2022, 8:52 AM

Rebase on parent changes.

Harbormaster completed remote builds in B11507: Diff 15828. Aug 22 2022, 10:14 AM

• max marked 5 inline comments as done. Aug 22 2022, 10:43 AM

• max added inline comments.

services/tunnelbroker/src/Amqp/AmqpManager.cpp
103–104	> Shouldn't we increase the counter before calling `connectInternal`? Currently, the following can happen:
> 1. `connect` is called
> 2. `connectInternal` is called and it works - `reconnectAttempt` is set to 0 in `onReady`
> 3. `reconnectAttempt` is increased
> 4. next loop iterations with unsuccessful connects
I think swapping these lines doesn't matter here: on a successful connection we reset the counter inside `connectInternal()`, but the condition that checks it is in the while loop above/outside it. Also, if we swap them, the log message below will report attempt 0 after the first connection loss.
> The starting value will be `reconnectAttempt = 1`, so effectively we will try to reconnect only `AMQP_RECONNECT_MAX_ATTEMPTS - 1` times.
Seems you're right. We can fix it by changing the condition above from `this->reconnectAttempt < AMQP_RECONNECT_MAX_ATTEMPTS` to `this->reconnectAttempt <= AMQP_RECONNECT_MAX_ATTEMPTS`.
108–109	> What do you think about having exponential backoff also here? It should be really easy to implement, because `x * 2 ^ reconnectAttempt` can be used as sleep duration.
There is already a follow-up task, ENG-1381, to switch to exponential backoff.
services/tunnelbroker/src/Constants.h
40	> Having a unit as a part of name is a lot better than comments with explanation - they will go out of sync.
Agree! I've changed it to use `AMQP_RECONNECT_ATTEMPT_INTERVAL_MS`.
40	> Another idea is to store `std::chrono::milliseconds(3000)` directly as a constant - that is probably the best solution
This constant will be removed when the exponential backoff algorithm is introduced in ENG-1381, so we can skip this for now. Thanks for the suggestion!
40–41	> where do these numbers come from?
For RabbitMQ, a reasonable maximum wait to reconnect is around 30 seconds, and a reasonable wait between attempts is 2-5 seconds. That's why I went with 10 attempts at 3-second intervals.

Minor refactoring of AmqpManager::connect() to connect after waiting and to run the first connect outside of the reconnection loop.

• max added inline comments. Aug 22 2022, 11:37 AM

services/tunnelbroker/src/Amqp/AmqpManager.cpp
103–104	> Shouldn't we increase the counter before calling `connectInternal`? Currently, the following can happen:
> 1. `connect` is called
> 2. `connectInternal` is called and it works - `reconnectAttempt` is set to 0 in `onReady`
> 3. `reconnectAttempt` is increased
> 4. next loop iterations with unsuccessful connects
> The starting value will be `reconnectAttempt = 1`, so effectively we will try to reconnect only `AMQP_RECONNECT_MAX_ATTEMPTS - 1` times.
I think the better way is a minor refactoring of `AmqpManager::connect()` to wait before connecting:

  void AmqpManager::connect() {
    this->connectInternal();
    while (this->reconnectAttempt < AMQP_RECONNECT_MAX_ATTEMPTS) {
      this->reconnectAttempt++;
      LOG(INFO) << "AMQP: Attempt " << this->reconnectAttempt
                << " to reconnect in " << AMQP_RECONNECT_ATTEMPT_INTERVAL_MS
                << " ms";
      std::this_thread::sleep_for(
          std::chrono::milliseconds(AMQP_RECONNECT_ATTEMPT_INTERVAL_MS));
      this->connectInternal();
    }
    LOG(FATAL) << "Cannot connect to AMQP server after "
               << AMQP_RECONNECT_MAX_ATTEMPTS << " attempts";
  }

I've made an update with this.