Page MenuHomePhabricator

[services] Tunnelbroker - Add timeout for a pop message waiting in AmqpManager tests
AbandonedPublic

Authored by tomek on Aug 4 2022, 4:31 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Dec 18, 1:30 PM
Unknown Object (File)
Wed, Dec 18, 1:56 AM
Unknown Object (File)
Sun, Dec 15, 10:18 PM
Unknown Object (File)
Tue, Nov 26, 12:54 PM
Unknown Object (File)
Tue, Nov 26, 12:54 PM
Unknown Object (File)
Nov 22 2024, 8:04 PM
Unknown Object (File)
Nov 22 2024, 4:02 PM
Unknown Object (File)
Oct 27 2024, 6:16 AM

Details

Reviewers
karol
max
Summary

In our AMQP tests, we are sending the message to the RabbitMQ using the AMQPManager.send(...) method and then waiting for a message to 'pop' using the blocking AMQPManager.pop(…) method.

In case something is wrong with the message sending, RabbitMQ server, or wrong toDeviceID, the blocking pop method will wait indefinitely and cause a deadlock in a test without any result (FAIL or SUCCESS).

This diff adds an observer with a maximum execution time for AMQP tests to prevent deadlocks in a CI environment and fail the test if it takes more than MESSSAGE_MAX_WAIT_TIME to consume a message from a RabbitMQ.

Related Linear task: ENG-1703

Test Plan

Successfully built and pass all tests using yarn run-unit-tests command.

Diff Detail

Repository
rCOMM Comm
Branch
add-timeout-to-amqp-tests
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

max held this revision as a draft.
max edited the test plan for this revision. (Show Details)
max added reviewers: karol, tomek.

CI build error is unrelated to the changes. After triggering the rebuild it is successful.

max published this revision for review.Aug 4 2022, 4:59 PM

CI build error is unrelated to the changes.

Yup, they are unrelated to your changes but if you're referring to the ShellCheck CI build failure, that is probably because you have not rebased your changes on top of origin/master yet. The failures in that CI build were fixed by diffs pushed today, so that's probably why your stack doesn't have the fixes in yet. If you rebase and update, the CI build should pass.

In D4749#136724, @abosh wrote:

CI build error is unrelated to the changes.

Yup, they are unrelated to your changes but if you're referring to the ShellCheck CI build failure, that is probably because you have not rebased your changes on top of origin/master yet. The failures in that CI build were fixed by diffs pushed today, so that's probably why your stack doesn't have the fixes in yet. If you rebase and update, the CI build should pass.

Fixed by rebasing it, thanks for a ShellCheck @abosh!

Keyserver built fail is not related to this diff:

"https://registry.yarnpkg.com/is-string/-/is-string-1.0.6.tgz: Request failed \"503 Service Unavailable\"".
tomek requested changes to this revision.Aug 9 2022, 9:51 AM

This logic is complicated and duplicating it isn't a good idea. Could you extract the common part?

services/tunnelbroker/test/AmqpManagerTest.cpp
49–52 ↗(On Diff #15410)

This probably can be simplified when std::async is used

59 ↗(On Diff #15410)

Could you explain why detach is necessary? Does the code break without it?

This revision now requires changes to proceed.Aug 9 2022, 9:51 AM
max marked 2 inline comments as done.
max added inline comments.
services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗(On Diff #15410)

Could you explain why detach is necessary? Does the code break without it?
This probably can be simplified when std::async is used

Unfortunately, we can't use std::async in this scenario because the async thread is joined to the parent test thread. The problem is that the async thread will wait infinitely for the blocking DeliveryBroker::getInstance().pop(toDeviceID) because this call will never ends if there are no messages.

The actual flow using async or join will be like this:

  • We are printing that the timeout is reached by calling FAIL,
  • The test thread will hang and wait infinitely because of async or join thread will never end (because DeliveryBroker::getInstance().pop(toDeviceID) is blocking read) and the parent test thread should wait.

I don't see any other options ((

tomek requested changes to this revision.Aug 22 2022, 6:26 AM
tomek added inline comments.
services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗(On Diff #15410)

Could you explain why detach is necessary here? Can't we just delete this call? What is the difference between having and not having a detach here?

This revision now requires changes to proceed.Aug 22 2022, 6:26 AM
max marked an inline comment as done.
max marked an inline comment as done.
max added inline comments.
services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗(On Diff #15410)

Could you explain why detach is necessary here? Can't we just delete this call? What is the difference between having and not having a detach here?

Sorry, I thought you asking about join vs detach here and not about just removing this call.

According to the std::thread documentation: thread objects that are joinable shall either be joined or detached before they are destroyed. We should join or detach the thread before it goes out of scope. If we don't do this and go with the default constructor we will get the undefined behavior and error at the end of the main.

Here is the playground example. By default without calling join or detach the thread is running as detached: the caller thread is not going to wait until it ends, but at the end of the main we will get the exception.

That's why the joinable thread should be joined or detached.

tomek requested changes to this revision.Aug 23 2022, 7:01 AM
tomek added inline comments.
services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗(On Diff #15410)

Why do we have to use joinable thread?

This revision now requires changes to proceed.Aug 23 2022, 7:01 AM
tomek added inline comments.
services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗(On Diff #15410)

It seems like joinable isn't a special type of thread - it is a thread that has something to execute. So threads that aren't joinable are the ones that were already joined / detached, or the ones that don't have anything to execute. So it makes sense to either join or detach. We have a code that might block indefinitely, so joining isn't an option.

What happens with the thread if we detach it and the execution hangs? Is it going to be stopped and destructed when the test ends? If not, and we have a couple of tests like that, we have a resource leak.

This revision is now accepted and ready to land.Aug 23 2022, 9:57 AM
max added inline comments.
services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗(On Diff #15410)

It seems like joinable isn't a special type of thread - it is a thread that has something to execute. So threads that aren't joinable are the ones that were already joined / detached, or the ones that don't have anything to execute. So it makes sense to either join or detach. We have a code that might block indefinitely, so joining isn't an option.

The design principle of the C++ std::thread: after construction thread is in joinable state which is mean that it can be joined or detached and the developer should choose what and where to call either join or detach, or pass the thread object to another thread and join it to another thread.
That's why in a constructor we don't have a flag for imperative detach or join on a construction step.
The thread is in a joinable state until it is joined or detached.

In our case only detach can be used to go beyond possible deadlock in a test.

What happens with the thread if we detach it and the execution hangs? Is it going to be stopped and destructed when the test ends? If not, and we have a couple of tests like that, we have a resource leak.

The detached thread with the deadlock will be destroyed only on the app main exit until that detached thread will be running in the background. If we have many tests with the deadlock threads - yes we will have a bunch of the background threads running until the test app exit in the end.

The problem is - we can't kill a certain thread with the deadlock. There is a destructor and pthread_cancel but the problem using of them is that they are destroying all the threads, not the one certain thread.
Replit playground code to demonstrate that.

It's not a C++-specific behavior where we can't kill a certain thread, it's a POSIX thread design and the same problem exists in a Golang with the goroutines, for example.

In our test scenario, this is not big deal if we have some threads in the background until the test ends, because the tests are short lived and all of the threads will be killed at the end.

Maybe the better solution here is to have a timeout for the test app instead of a timeout for a certain action. Unfortunately, gtest doesn't provide timeouts for a test, but there are workarounds.

For example, we can use a simple solution:
run tests in CI using the timeout command with 1-5 minute timeout. This will cover all of the tests instead of adding a specific watchdog for every possible deadlock.

services/tunnelbroker/test/AmqpManagerTest.cpp
59 ↗(On Diff #15410)

I think the proper solution would be to modify delivery broker api so that it is possible to provide a timeout, instead of having a method which will hang forever.

max marked an inline comment as done.

This diff will wait for a decision for all the services in ENG-1703.

tomek edited reviewers, added: max; removed: tomek.
This revision now requires review to proceed.Apr 9 2024, 10:55 AM

We don't have the tests anymore and the service is now written in Rust.