[services] Tunnelbroker - Add timeout for a pop message waiting in AmqpManager tests
Abandoned · Public

Authored by tomek on Aug 4 2022, 11:31 PM.

Details

Reviewers

• karol
• max

Summary

In our AMQP tests, we send a message to RabbitMQ using the AMQPManager.send(...) method and then wait for the message to 'pop' using the blocking AMQPManager.pop(…) method.

If something goes wrong with sending the message or with the RabbitMQ server, or if the toDeviceID is wrong, the blocking pop method waits indefinitely and deadlocks the test, which then never produces a result (FAIL or SUCCESS).

This diff adds an observer with a maximum execution time to the AMQP tests. It prevents deadlocks in the CI environment by failing the test if it takes more than MESSSAGE_MAX_WAIT_TIME to consume a message from RabbitMQ.
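The observer idea described above can be sketched roughly as follows. This is not the code from this diff: `popWithTimeout` and its parameters are hypothetical names, and the blocking call is passed in as a plain function so the sketch stays independent of the real AmqpManager and DeliveryBroker classes.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <string>
#include <thread>

// Hypothetical helper (an assumption, not part of the diff): runs a blocking
// call on a detached thread and gives up waiting after maxWait.
// Returns true and fills `out` only if the call finished in time.
bool popWithTimeout(
    std::function<std::string()> blockingPop,
    std::chrono::milliseconds maxWait,
    std::string &out) {
  // The shared_ptr keeps the promise alive even if we return before the
  // worker finishes.
  auto promise = std::make_shared<std::promise<std::string>>();
  std::future<std::string> future = promise->get_future();

  std::thread worker(
      [promise, blockingPop]() { promise->set_value(blockingPop()); });
  // Detach so that a hung worker cannot block the caller at scope exit.
  worker.detach();

  if (future.wait_for(maxWait) == std::future_status::timeout) {
    return false; // a test would report a failure here
  }
  out = future.get();
  return true;
}
```

A future obtained from a `std::promise` does not wait in its destructor, which is why the function can return while the worker is still blocked. The cost is that a worker that never finishes keeps running in the background until the process exits.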

Related Linear task: ENG-1703

Test Plan

Built successfully and passed all tests using the `yarn run-unit-tests` command.

Diff Detail

Repository

rCOMM Comm

Branch

add-timeout-to-amqp-tests

Lint

No Lint Coverage

Unit

No Test Coverage

Event Timeline

• max created this revision. · Aug 4 2022, 11:31 PM

• max held this revision as a draft.

Herald added subscribers: abosh, karol, atul and 3 others. · Aug 4 2022, 11:31 PM

Harbormaster failed remote builds in B11140: Diff 15340! · Aug 4 2022, 11:32 PM

• max edited the summary of this revision. · Aug 4 2022, 11:42 PM

• max edited the test plan for this revision.

• max added reviewers: karol, tomek.

• max added parent revisions: D4746: [services] Tunnelbroker - Changes in Amqp messages testing to send and listen in different threads, D4745: [services] Tunnelbroker - Fix message `from` and `to` swap in AMQP tests, D4744: [services] Tunnelbroker - Fix AMQP client reconnection algorithm. · Aug 4 2022, 11:45 PM

• max mentioned this in D4740: [services] Tunnelbroker - Wrap `connect()` into `init()` in AmqpManager. · Aug 4 2022, 11:48 PM

• max mentioned this in D4741: [services] Tunnelbroker - Add `waitUntilReady` function in AmqpManager. · Aug 4 2022, 11:51 PM

• max mentioned this in D4742: [services] Tunnelbroker - Fix `amqpReady` assignment.

• max mentioned this in D4743: [services] Tunnelbroker - Refactor handler names in Amqp `connectInternal()`.

• max mentioned this in D4744: [services] Tunnelbroker - Fix AMQP client reconnection algorithm. · Aug 4 2022, 11:53 PM

• max mentioned this in D4746: [services] Tunnelbroker - Changes in Amqp messages testing to send and listen in different threads.

CI build error is unrelated to the changes. After triggering the rebuild it is successful.

• max published this revision for review. · Aug 4 2022, 11:59 PM

CI build error is unrelated to the changes.

Yup, they are unrelated to your changes but if you're referring to the ShellCheck CI build failure, that is probably because you have not rebased your changes on top of origin/master yet. The failures in that CI build were fixed by diffs pushed today, so that's probably why your stack doesn't have the fixes in yet. If you rebase and update, the CI build should pass.

Rebase on master.

Harbormaster completed remote builds in B11147: Diff 15350. · Aug 5 2022, 10:00 AM

In D4749#136724, @abosh wrote:

CI build error is unrelated to the changes.

Yup, they are unrelated to your changes but if you're referring to the ShellCheck CI build failure, that is probably because you have not rebased your changes on top of origin/master yet. The failures in that CI build were fixed by diffs pushed today, so that's probably why your stack doesn't have the fixes in yet. If you rebase and update, the CI build should pass.

Fixed by rebasing, thanks for the ShellCheck pointer, @abosh!

• max mentioned this in D4753: [services] Tunnelbroker - Change RabbitMQ queue name for a sandbox mode. · Aug 5 2022, 12:44 PM

Rebase on parent changes.

• max added a parent revision: D4767: [services] Tunnelbroker - Add AMQP shared channel locking. · Aug 6 2022, 7:00 PM

Harbormaster completed remote builds in B11196: Diff 15405. · Aug 6 2022, 7:12 PM

Rebase on master changes.

Harbormaster failed remote builds in B11201: Diff 15410! · Aug 6 2022, 7:49 PM

• max added a child revision: D4768: [services] Tunnelbroker - Amqp Manager in parallel threads messages throughput test. · Aug 6 2022, 8:06 PM

The keyserver build failure is not related to this diff:

"https://registry.yarnpkg.com/is-string/-/is-string-1.0.6.tgz: Request failed \"503 Service Unavailable\"".

This logic is complicated and duplicating it isn't a good idea. Could you extract the common part?

services/tunnelbroker/test/AmqpManagerTest.cpp
49–52 ↗	(On Diff #15410)	This probably can be simplified when `std::async` is used
59 ↗	(On Diff #15410)	Could you explain why `detach` is necessary? Does the code break without it?

This revision now requires changes to proceed. · Aug 9 2022, 4:51 PM

• max planned changes to this revision. · Aug 9 2022, 7:37 PM

• max mentioned this in rCOMM0aa684380275: [services] Tunnelbroker - Change RabbitMQ queue name for a sandbox mode. · Aug 11 2022, 7:32 AM

• max requested review of this revision. · Aug 22 2022, 10:39 AM

• max marked 2 inline comments as done.

• max added inline comments.

services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗	(On Diff #15410)	"Could you explain why `detach` is necessary? Does the code break without it?"

"This probably can be simplified when `std::async` is used"

Unfortunately, we can't use `std::async` in this scenario because the async thread is joined to the parent test thread. The async thread will wait forever on the blocking `DeliveryBroker::getInstance().pop(toDeviceID)` call, which never returns if there are no messages. With `std::async` or `join`, the actual flow would be as follows. First, we report that the timeout was reached by calling FAIL. Then the test thread hangs forever, because the async or joined thread never ends (`DeliveryBroker::getInstance().pop(toDeviceID)` is a blocking read) and the parent test thread has to wait for it.

I don't see any other options ((
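The `std::async` point can be checked with a small sketch. It uses a 300 ms sleep as a stand-in for the blocking pop (an assumption made so the example terminates) and a hypothetical function name, `asyncScopeMillis`. The timeout is noticed after about 10 ms, but the enclosing scope still takes as long as the task, because the destructor of a future returned by `std::async` with `std::launch::async` waits for the task to finish.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// Returns how many milliseconds the scope took when a 300 ms task is started
// via std::async and the caller only waits 10 ms for it.
long long asyncScopeMillis() {
  const auto start = std::chrono::steady_clock::now();
  {
    std::future<void> result = std::async(std::launch::async, [] {
      // Stand-in for a blocking pop that takes a long time to return.
      std::this_thread::sleep_for(std::chrono::milliseconds(300));
    });
    // The timeout itself is detected on time...
    const auto status = result.wait_for(std::chrono::milliseconds(10));
    assert(status == std::future_status::timeout);
    (void)status;
    // ...but the future's destructor waits for the task at scope exit.
  }
  const auto elapsed = std::chrono::steady_clock::now() - start;
  return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed)
      .count();
}
```

With a pop that never returns, the same scope exit would never complete, which matches the hang described in the comment.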

tomek requested changes to this revision. · Aug 22 2022, 1:26 PM

tomek added inline comments.

services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗	(On Diff #15410)	Could you explain why `detach` is necessary here? Can't we just delete this call? What is the difference between having and not having a detach here?

This revision now requires changes to proceed. · Aug 22 2022, 1:26 PM

• max mentioned this in rCOMMa767de4a3107: [services] Tunnelbroker - Add `waitUntilReady` function in AmqpManager. · Aug 22 2022, 1:37 PM

• max mentioned this in rCOMM6b6d2067a16f: [services] Tunnelbroker - Fix `amqpReady` assignment. · Aug 22 2022, 1:59 PM

• max mentioned this in rCOMM31bec4fa28ef: [services] Tunnelbroker - Refactor handler names in Amqp `connectInternal()`. · Aug 22 2022, 2:12 PM

• max planned changes to this revision. · Aug 22 2022, 10:12 PM

• max marked an inline comment as done.

• max requested review of this revision. · Aug 23 2022, 1:26 AM

• max marked an inline comment as done.

• max added inline comments.

services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗	(On Diff #15410)	"Could you explain why `detach` is necessary here? Can't we just delete this call? What is the difference between having and not having a detach here?"

Sorry, I thought you were asking about `join` vs `detach` here, not about simply removing this call. According to the `std::thread` documentation, thread objects that are joinable shall either be joined or detached before they are destroyed, so we should join or detach the thread before it goes out of scope. If we don't, we get an error at the end of main, because the destructor of a still-joinable `std::thread` calls `std::terminate`. Here is the playground example. Without calling join or detach, the thread still runs in the background and the caller thread does not wait for it to end, but the program is terminated once the thread object is destroyed. That's why a joinable thread should be `joined` or `detached`.
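As a small illustration of the joinable state discussed here, the sketch below records what `std::thread::joinable()` reports at each stage. The struct and function names are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <thread>

// Hypothetical record of joinable() at different points in a thread's life.
struct JoinableStates {
  bool defaultConstructed;
  bool afterStart;
  bool afterDetach;
  bool afterJoin;
};

JoinableStates observeJoinable() {
  JoinableStates states{};

  // A default-constructed thread has nothing to execute.
  std::thread empty;
  states.defaultConstructed = empty.joinable();

  // A started thread is joinable until join() or detach() is called.
  std::thread detached([] {});
  states.afterStart = detached.joinable();
  detached.detach();
  states.afterDetach = detached.joinable();

  std::thread joined([] {});
  joined.join();
  states.afterJoin = joined.joinable();

  // Every thread object is non-joinable here, so the destructors are safe.
  return states;
}
```

Only `afterStart` is true. If either started thread were still joinable when its destructor ran, the destructor would call `std::terminate`.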

tomek requested changes to this revision. · Aug 23 2022, 2:01 PM

tomek added inline comments.

services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗	(On Diff #15410)	Why do we have to use a joinable thread?

This revision now requires changes to proceed. · Aug 23 2022, 2:01 PM

tomek accepted this revision. · Aug 23 2022, 4:57 PM

tomek added inline comments.

services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗	(On Diff #15410)	It seems like `joinable` isn't a special type of thread: it is a thread that has something to execute. So threads that aren't joinable are the ones that were already joined or detached, or the ones that don't have anything to execute. So it makes sense to either `join` or `detach`. We have code that might block indefinitely, so joining isn't an option. What happens with the thread if we detach it and the execution hangs? Is it going to be stopped and destructed when the test ends? If not, and we have a couple of tests like that, we have a resource leak.

This revision is now accepted and ready to land. · Aug 23 2022, 4:57 PM

• max marked 3 inline comments as done. · Aug 23 2022, 6:06 PM

• max added inline comments.

services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗	(On Diff #15410)	"It seems like `joinable` isn't a special type of thread - it is a thread that has something to execute. So threads that aren't joinable are the ones that were already joined / detached, or the ones that don't have anything to execute. So it makes sense to either `join` or `detach`. We have a code that might block indefinitely, so joining isn't an option."

The design principle of the C++ `std::thread` is that after construction the thread is in the `joinable` state, which means it can be joined or detached. The developer chooses where to call join or detach, or passes the thread object to another thread and joins it there. That's why the constructor has no flag to force detach or join at construction time. The thread stays in the joinable state until it is joined or detached. In our case only detach can be used to get past a possible deadlock in a test.

"What happens with the thread if we detach it and the execution hangs? Is it going to be stopped and destructed when the test ends? If not, and we have a couple of tests like that, we have a resource leak."

A detached thread stuck in the deadlock is destroyed only when the app's main exits; until then it keeps running in the background. If we have many tests with deadlocked threads then yes, we will have a bunch of background threads running until the test app exits. The problem is that we can't kill one specific deadlocked thread. There is the destructor and `pthread_cancel`, but the problem with using them is that they destroy all the threads, not just that one thread. Replit playground code to demonstrate that. This is not C++-specific behavior: it comes from the POSIX thread design, and the same problem exists in Go with goroutines, for example.

In our test scenario, it is not a big deal if some threads stay in the background until the tests end, because the tests are short-lived and all of the threads are killed at the end. Maybe the better solution here is to have a timeout for the test app instead of a timeout for a specific action. Unfortunately, gtest doesn't provide timeouts for a test, but there are workarounds. For example, we can use a simple solution: run the tests in CI using the `timeout` command with a 1-5 minute timeout. This will cover all of the tests instead of adding a specific watchdog for every possible deadlock.

tomek added inline comments. · Aug 24 2022, 8:16 AM

services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗	(On Diff #15410)	I think the proper solution would be to modify the delivery broker API so that it is possible to provide a timeout, instead of having a method that can hang forever.
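One way such a timeout-aware pop could look is sketched below. This is only an illustration under assumptions: `TimedQueue` is a hypothetical stand-in and not the real DeliveryBroker class, and it holds plain strings rather than the broker's message type. It relies on `std::condition_variable::wait_for` with a predicate, which returns false if the predicate is still unsatisfied when the timeout expires.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical queue with a bounded wait (an assumption, not DeliveryBroker).
class TimedQueue {
public:
  void push(std::string message) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      messages_.push(std::move(message));
    }
    ready_.notify_one();
  }

  // Returns true and fills `out` if a message arrived within maxWait;
  // returns false on timeout instead of blocking forever.
  bool pop(std::string &out, std::chrono::milliseconds maxWait) {
    std::unique_lock<std::mutex> lock(mutex_);
    const bool hasMessage = ready_.wait_for(
        lock, maxWait, [this] { return !messages_.empty(); });
    if (!hasMessage) {
      return false;
    }
    out = std::move(messages_.front());
    messages_.pop();
    return true;
  }

private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::queue<std::string> messages_;
};
```

With an API of this shape the test could call `pop` directly and fail when it returns false, so no detached watchdog thread would be needed.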