Page MenuHomePhabricator

[services] Tunnelbroker - Wrap `connect()` into `init()` in AmqpManager
ClosedPublic

Authored by max on Aug 4 2022, 10:39 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Nov 9, 5:29 PM
Unknown Object (File)
Sun, Oct 27, 4:35 AM
Unknown Object (File)
Sun, Oct 27, 4:35 AM
Unknown Object (File)
Sun, Oct 27, 4:35 AM
Unknown Object (File)
Sun, Oct 27, 4:35 AM
Unknown Object (File)
Sun, Oct 27, 4:34 AM
Unknown Object (File)
Sun, Oct 27, 4:34 AM
Unknown Object (File)
Sun, Oct 27, 4:32 AM

Details

Summary

This diff is a part of the stack.

This diff introduces changes to move the creation of the internal thread for the AMQP client inside the init() function instead of using it outside and repeating the code in the main server.cpp and AMQP tests in test/AmqpManagerTest.cpp because of we need to initialize AMQP thread and connection on tests too.

Creating init() function following the existing approach that AWS-CPP client already uses. We are callinginit() to initialize AWS thread/client.

Related linear task: ENG-1495

Test Plan
  1. Successfully built using yarn run-tunnelbroker-service-in-sandbox command.
  1. Passing all AMQP unit tests in the last diff in a stack D4768. Switch to D4768 and run yarn run-unit-tests tunnelbroker. The tests will pass.

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
max edited the test plan for this revision. (Show Details)
tomek requested changes to this revision.Aug 5 2022, 5:25 AM
tomek added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
20–28 ↗(On Diff #15321)

This code isn't secure:

  1. We don't synchronize access to amqpChannel, but use it from multiple threads.
  2. init method can be called multiple times and that would result in multiple threads being spawned.

First of all, what is the reason behind introducing init method? Can't we just perform this initialization in the constructor? This should be our default strategy and we would need a really good reason to use a different approach.

This revision now requires changes to proceed.Aug 5 2022, 5:25 AM
max added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
20–28 ↗(On Diff #15321)

This code isn't secure:

  1. We don't synchronize access to amqpChannel, but use it from multiple threads.

The AMQP-CPP documentation says:

AMQP-CPP is fully asynchronous and does not do any blocking (system) calls, so it can be used in high performance applications without the need for threads.

Regarding the channels:

A channel is a sort of virtual connection, and it is possible to create many channels that all use the same connection...
All operations that you can perform on a channel are non-blocking....

From the above, I don't think we need to synchronize access to the amqpChannel.

  1. init method can be called multiple times and that would result in multiple threads being spawned.

Actually no. The check in the condition of

if (this->amqpChannel != nullptr)

handles this. When init will call multiple times on the already initialized connection it will be skipped because amqpChannel can be null only when it is not initialized. It was tested in D4749 when we run unit tests init calls in every test, but the initialization is skipped as expected here.

First of all, what is the reason behind introducing init method? Can't we just perform this initialization in the constructor? This should be our default strategy and we would need a really good reason to use a different approach.

According to the description of this diff (this is a follow-up for the current using approach):

Creating init() function following the existing approach that AWS-CPP client already uses. We are callinginit() to initialize AWS thread/client.

max added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
20–28 ↗(On Diff #15321)

This code isn't secure:

  1. We don't synchronize access to amqpChannel, but use it from multiple threads.

Anyway, I should recheck this. I'll report on this, and defer this diff until.

max marked an inline comment as done.
max added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
20–28 ↗(On Diff #15321)
  1. We don't synchronize access to amqpChannel, but use it from multiple threads.

I've added D4767 (synchronization) along with the stress test in D4768 (testing) to test in 100 threads.

It doesn't relate to this certain diff which is about the wrapping of spawning of the amqp client thread into the init() instead of repeating it in the codebase, but anyway thanks for an improvement catch!

max added 1 blocking reviewer(s): tomek.

Switching to using std::call_once instead of checking a channel for a null pointer.

services/tunnelbroker/src/Amqp/AmqpManager.cpp
21 ↗(On Diff #15413)

Guarantee to spawn a new amqp thread only once.

Rebase on master to fix shellcheck.

tomek added 1 blocking reviewer(s): karol.

I've seen @karol had some comments about detach in another diff, so adding him as blocking.
Other than that, it looks like the code is now a lot safer. As you noticed, the reason behind my comment was to protect against initialization not being safe and not the usage of the manager itself.

According to the description of this diff (this is a follow-up for the current using approach):

Creating init() function following the existing approach that AWS-CPP client already uses. We are callinginit() to initialize AWS thread/client.

I don't think that following all the approaches is a good idea - we need to be sure that other approaches aren't better. In this case, is there anything wrong with moving this logic to the constructor? Doing that would simplify this significantly, as exactly once semantic would be basically free.

karol added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
23 ↗(On Diff #15414)

Thanks for pointing out @tomek. Btw, what do you think about detach?

I'm just pasting my response from the other diff: https://phab.comm.dev/D4768#137425. I still think we should avoid using detach as much as possible. Quoting SO:

So, should you use join or detach?

  • Use join
  • Unless you need to have more flexibility AND are willing to provide a synchronization mechanism to wait for the thread completion on your own, in which case you may use detach

A question for @max: Do you provide a synchronization mechanism to wait for the thread completion on your own here?

This revision now requires changes to proceed.Aug 11 2022, 6:55 AM
max marked an inline comment as done.
max added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
23 ↗(On Diff #15414)

Thanks for pointing out @tomek. Btw, what do you think about detach?

I'm just pasting my response from the other diff: https://phab.comm.dev/D4768#137425. I still think we should avoid using detach as much as possible. Quoting SO:

So, should you use join or detach?

  • Use join
  • Unless you need to have more flexibility AND are willing to provide a synchronization mechanism to wait for the thread completion on your own, in which case you may use detach

I'm sure we should use detach() in such initialization functions. The reason is that the initialization can be called from the thread which just can initialize and be gone instead of making it join and make alive the parent caller thread. I agree in some cases forward in the stack that detach() may not be necessary, but in the case when we are calling initialization the client thread which must live until the program exit - we can/must use detach().

A question for @max: Do you provide a synchronization mechanism to wait for the thread completion on your own here?

The AMQP client thread should be completed along with the main program and spawned only once. Looking forward the init() calls AmqpManager::connect() which connects and reconnects to the AMQP server in a loop and throws a fatal if the connection attempts are exhausted.

In D4740#137928, @tomek wrote:

I've seen @karol had some comments about detach in another diff, so adding him as blocking.
Other than that, it looks like the code is now a lot safer. As you noticed, the reason behind my comment was to protect against initialization not being safe and not the usage of the manager itself.

According to the description of this diff (this is a follow-up for the current using approach):

Creating init() function following the existing approach that AWS-CPP client already uses. We are callinginit() to initialize AWS thread/client.

I don't think that following all the approaches is a good idea - we need to be sure that other approaches aren't better. In this case, is there anything wrong with moving this logic to the constructor? Doing that would simplify this significantly, as exactly once semantic would be basically free.

I agree that the constructor may look clearer, but I think mixing init() and constructors will make a mess. We can't change the AWS-SDK approach, not sure mixing both is a good way.

karol added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
23 ↗(On Diff #15414)

I'm sure we should use detach() in such initialization functions.

I see. Is it based on something?

Interesting. What if an error occurs during the initialization? We'll not see it and the amqp will not be running. So we'll be pretty much in a state like "I have no clue what's going on".

This revision now requires changes to proceed.Aug 11 2022, 7:36 AM
max marked an inline comment as done.
max added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
23 ↗(On Diff #15414)

I'm sure we should use detach() in such initialization functions.

I see. Is it based on something?

I've described my thoughts on why initialization functions should be detached above:

I'm sure we should use detach() in such initialization functions. The reason is that the initialization can be called from the thread which just can initialize and be gone instead of making it join and make alive the parent caller thread. I agree in some cases forward in the stack that detach() may not be necessary, but in the case when we are calling initialization the client thread which must live until the program exit - we can/must use detach().

Interesting. What if an error occurs during the initialization? We'll not see it and the amqp will not be running. So we'll be pretty much in a state like "I have no clue what's going on".

If the unrecoverable error occurs during the client thread initialization we will throw an unrecoverable error with the message what's happened and exit the main program. In case of the error is recoverable (in our scenario in this case we need to reconnect) the thread needs to be run with the reconnection loop. To handle it properly we must clearly distinguish which errors are recoverable and handled properly inside the thread and which are unrecoverable and we should exit.

In our certain case if the error occurs during the initialization is a connection error only that handles properly with the waiting loop and amqpReady assignment in D4742.

I'm sure that detaching the thread whose purpose is to run all the time which the main program runs is the correct way because we don't know where init() will be invoked, from a long-lived thread or not. In case we don't use detach and call init() from the thread which will die earlier than the main we will get a bug, segfaults etc. So that's why detaching here is necessary for my thoughts.

karol added 1 blocking reviewer(s): tomek.

Adding @tomek as blocking so he can see it.

services/tunnelbroker/src/Amqp/AmqpManager.cpp
23 ↗(On Diff #15414)

I've described my thoughts on why initialization functions should be detached above:

My question was more if there are any external resources to support this idea, we already know that in your opinion it is ok, you may be wrong though. It's sometimes good to go through the Internet and see what other people think about this.

In our certain case if the error occurs during the initialization is a connection error only that handles properly with the waiting loop and amqpReady assignment in D4742.

Even if this works that way, I don't think detach's good. It's not very maintainable, if we ever change the connect logic, we'll have to remember to reconsider using detach. I don't think that's safe. It may result in a bug that may be hard to find.

If the unrecoverable error occurs during the client thread initialization we will throw an unrecoverable error with the message what's happened and exit the main program.

I don't understand how this would work, sorry. From what I understand, you're only relying here on the aqmpReady flag. So if it's not set, then you'll throw an error in a different place. But I believe that if the throw happens inside of the detached thread, we'll not be able to catch it. So, essentially, we'll only have that flag, nothing more. This may make debugging a lot harder.

To support what I said, I'm going to provide two examples, let's have a look:

  • The first one (with join)
#include <iostream>

#include <thread>
#include <chrono>

using namespace std;

int main()
{
  cout << "MAIN BEGIN" << endl;
  thread th([](){
    cout << "TH BEGIN" << endl;
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    throw std::runtime_error("err");
    cout << "TH END" << endl;
  });
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  cout << "MAIN END" << endl;
  th.join();
}

The output is

MAIN BEGIN
TH BEGIN
MAIN END
libc++abi: terminating with uncaught exception of type std::runtime_error: err
./build_and_run.sh: line 15: 71224 Abort trap: 6           ./cmake/bin/main
  • Now, the second one (with detach)
#include <iostream>

#include <thread>
#include <chrono>

using namespace std;

int main()
{
  cout << "MAIN BEGIN" << endl;
  thread th([](){
    cout << "TH BEGIN" << endl;
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    throw std::runtime_error("err");
    cout << "TH END" << endl;
  });
  th.detach();
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  cout << "MAIN END" << endl;
}

The output is

MAIN BEGIN
TH BEGIN
MAIN END

Conclusion: we were not able to catch the error with the detached thread, but with the joined one, we could do that.

To illustrate what you're doing I prepared yet one more example:

#include <iostream>

#include <thread>
#include <chrono>

using namespace std;

int main()
{
  bool ready = false;
  cout << "MAIN BEGIN" << endl;
  thread th([&ready](){
    cout << "TH BEGIN" << endl;
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    throw std::runtime_error("err");
    ready = true;
    cout << "TH END" << endl;
  });
  th.detach();
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  // th.join();
  if (!ready) {
    throw std::runtime_error("not ready");
  }
  std::cout << "ready, all good" << std::endl;
  cout << "MAIN END" << endl;
}

There are two options:

  • you can detach - you will just see that the flag is not set, so you will not know anything about the actual error that's occurred,
  • you can join just before checking for the ready flag - we are able to see the real error.

One more concern: What if the initializing thread takes a long time (for some reason) and we check for the ready flag before the initialization terminates? In this case, we're going to throw an error even though the initialization would succeed (there might be one millisecond of a difference). I'm not going to post yet one more code snippet for this, you can use the ones above just removing this line throw std::runtime_error("err");.

To summarize: As you can see I have a lot of concerns about why using detach and relying on assumptions based on some external state is not a good idea at all. I think other people should speak up too with their opinions about this. The best shot is probably @tomek.

This revision now requires changes to proceed.Aug 12 2022, 4:42 AM
max marked an inline comment as done.

Requesting review for @tomek comments.

services/tunnelbroker/src/Amqp/AmqpManager.cpp
23 ↗(On Diff #15414)

I've described my thoughts on why initialization functions should be detached above:

My question was more if there are any external resources to support this idea, we already know that in your opinion it is ok, you may be wrong though. It's sometimes good to go through the Internet and see what other people think about this.

Of course if you google something there is a pro and cons materials and we should consider of using or not something in a certain scenarios instead of just throw it out and create a possibility of another bug. There is a scenario where we can decide which way to go and there is no certain wrong way. Both ways have its pros and cons. That's why we listening each other and make a decisions as a team. Because of the code must be maintained not by the only person in the team.

When the discussion became a time consumer we just vote for the best solution and that's it. This can take a huge amount of time to discuss which can be spent better I think.
I respect your concerns and thoughts but here is a two ways and how the @tomek will vote I'll just go with it. So let's the third voice decide )

Detaching should be used carefully - yes, correct, but we shouldn't throw it when it needs to be used. Detaching a thread is a common practice BUT it should be used carefully. That's right.
Quick example, we are heavily using AWS-CPP client, it's detaching threads in an executor because the developer don't know from where the init() will be called. The same is in our scenario. We must be sure that the client thread will run until the main exit.

In our certain case if the error occurs during the initialization is a connection error only that handles properly with the waiting loop and amqpReady assignment in D4742.

Even if this works that way, I don't think detach's good. It's not very maintainable, if we ever change the connect logic, we'll have to remember to reconsider using detach. I don't think that's safe. It may result in a bug that may be hard to find.

Using initialization in a no long-lived thread otherwise resulting in a bug. If we simplify all of this there are two ways:

  • We are using a detached thread for the initialization and we don't need to keep in mind that we should run the init() only from the long-lived thread or main (as an AWS SDK do), because of the amqp-client thread should live all the time that app lives.
  • We are not using a detaching, but we should keep in mind that initialization must be done only in a long-lived thread or we will have a bug where the parent thread exits along with the children ampq-client thread and we will have possible segfaults.

If the unrecoverable error occurs during the client thread initialization we will throw an unrecoverable error with the message what's happened and exit the main program.

I don't understand how this would work, sorry. From what I understand, you're only relying here on the aqmpReady flag. So if it's not set, then you'll throw an error in a different place. But I believe that if the throw happens inside of the detached thread, we'll not be able to catch it. So, essentially, we'll only have that flag, nothing more. This may make debugging a lot harder.

To support what I said, I'm going to provide two examples, let's have a look:

  • The first one (with join)
#include <iostream>

#include <thread>
#include <chrono>

using namespace std;

int main()
{
  cout << "MAIN BEGIN" << endl;
  thread th([](){
    cout << "TH BEGIN" << endl;
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    throw std::runtime_error("err");
    cout << "TH END" << endl;
  });
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  cout << "MAIN END" << endl;
  th.join();
}

The output is

MAIN BEGIN
TH BEGIN
MAIN END
libc++abi: terminating with uncaught exception of type std::runtime_error: err
./build_and_run.sh: line 15: 71224 Abort trap: 6           ./cmake/bin/main
  • Now, the second one (with detach)
#include <iostream>

#include <thread>
#include <chrono>

using namespace std;

int main()
{
  cout << "MAIN BEGIN" << endl;
  thread th([](){
    cout << "TH BEGIN" << endl;
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    throw std::runtime_error("err");
    cout << "TH END" << endl;
  });
  th.detach();
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  cout << "MAIN END" << endl;
}

The output is

MAIN BEGIN
TH BEGIN
MAIN END

Conclusion: we were not able to catch the error with the detached thread, but with the joined one, we could do that.

To illustrate what you're doing I prepared yet one more example:

#include <iostream>

#include <thread>
#include <chrono>

using namespace std;

int main()
{
  bool ready = false;
  cout << "MAIN BEGIN" << endl;
  thread th([&ready](){
    cout << "TH BEGIN" << endl;
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    throw std::runtime_error("err");
    ready = true;
    cout << "TH END" << endl;
  });
  th.detach();
  std::this_thread::sleep_for(std::chrono::milliseconds(100));
  // th.join();
  if (!ready) {
    throw std::runtime_error("not ready");
  }
  std::cout << "ready, all good" << std::endl;
  cout << "MAIN END" << endl;
}

There are two options:

  • you can detach - you will just see that the flag is not set, so you will not know anything about the actual error that's occurred,

You would, because of the errors are logging inside the connect functions. Quick example how it works now: The tunnelbroker app starts, the flag is not set, the errors prints to the logs in case of the connection error, not established and etc. It's not silent as you think. We should use logging in a multi threaded app instead of to be scarry that the detached thread will not produce any logs. This scenario is for a single thread app. When we are creating app with multiple threads which are can be different (long lived, short lived) we should think of the proper logging instead of not using detach in any case. As of inter thread communication, race conditions etc.

  • you can join just before checking for the ready flag - we are able to see the real error.

That's not true. you will see the real error because of the logging inside of the detached thread.
In your example code the log from the detached thread is not produced because of the main thread exit, not because of detached/joined thread.

One more concern: What if the initializing thread takes a long time (for some reason) and we check for the ready flag before the initialization terminates? In this case, we're going to throw an error even though the initialization would succeed (there might be one millisecond of a difference). I'm not going to post yet one more code snippet for this, you can use the ones above just removing this line throw std::runtime_error("err");.

It's not because of we are checking the flag with the reconnection time intervals in waitUntilReady() function in D4741. We will throw an fatal error only in case when the reconnection attempts are exceeded and we print a log errors on reconnect attempt, and other errors. We have callback with logging on onError in D4744 -> AmqpManager.cpp:51.

tomek requested changes to this revision.Aug 12 2022, 8:53 AM

This discussion is really long and complicated, so I will try to ask a couple of simple questions to reduce confusion:

  1. Why do we need to call this->connect() on a separate thread?
  2. What is the downside of calling this->connect() on main thread?
  3. What is the benefit of main thread exiting before this->connect() is completed?
  4. What would be a downside of main thread waiting for amqpClientThread to complete?
  5. If the main thread ends, and there's a detached thread, does the app continue to run? Is it still running in a terminal, or just in the background?

Overall, without reading too deeply the discussion, I feel like we should either call this->connect() from main thread or join amqpClientThread. The reason is that the main thread would be able to tell what is the result of the initialization and take appropriate action, e.g. retry, report it somewhere, etc. The thing we should avoid is returning non-error state when an error occurs. The answers might make me change my opinion, though.

As a side note, I think that voting is almost never a good solution. It is ideal when we want to have a solution as soon as possible. In other cases, where we care about the quality, it is either that we don't understand each others points, or we value our own points too much. It is especially risky, because sometimes a single dev might see something really important but has difficulty in convincing the team - we should aim to agree about the solution.

This revision now requires changes to proceed.Aug 12 2022, 8:53 AM
max marked an inline comment as done.

Thanks, @tomek, and @karol for the comments! I'll try to describe this case further...

In D4740#139426, @tomek wrote:

This discussion is really long and complicated, so I will try to ask a couple of simple questions to reduce confusion:

  1. Why do we need to call this->connect() on a separate thread?

We need to have a separate thread for the AMQP client, this thread should be alive all the time that the app is alive. Inside this thread connect and re-connect (in D4744) algorithm is implemented in a loop with the delay time and maximum attempts, if they are exceeded the thread will throw a fatal error. Because there is no sense to run the app if we don't have a connection to the AMQP server and can't connect to it. Also, the AMQP-client is reused from this thread in any others.
The same algorithm is used by the AWS-CPP SDK where we are calling Aws::InitAPI({}) and the AWS client makes connect/re-connect in a separate thread and doesn't block anything. We are using non-blocking multithreading pattern here.

  1. What is the downside of calling this->connect() on main thread?

The main downside is that we should always need to remember that init should be called only from the long-running or only from the main thread otherwise the client thread will die and we get errors or should run init again. Which doesn't make sense.

Let's imagine if AWS::InitAPI() joins... this will cause a lot of bugs because developers should always keep in mind that the caller thread must be only the main or long-live thread.
In complicated multithreading apps, it's hard to predict from where it will be called. Also, we are given the freedom to call it from anywhere, even from another short-lived thread.

In the case of using detach and std::call_once to spawn a client thread we are always sure that the thread will not die unexpectedly and that only one thread will exist no matter how much time and from the init() will be called.

  1. What is the benefit of main thread exiting before this->connect() is completed?

The main thread can exit before amqp->connect() in a few cases which are located at the main():

  • Config file was not found or can't be read.
  • gRPC server can't start because the port is not available.

In all of these cases, we will have fatal errors and the detached threads will be killed too. That's why there is no makes sense.

  1. What would be a downside of main thread waiting for amqpClientThread to complete?

Or AMQP client must be asynchronous, we never should use a blocking wait for a connection or re-connection. Instead, we have a WaitUntilReady method in D4741 that checks the current connection state.
If we use a blocking connection waiting in a main we should maintain the reconnection algorithm also in a main, which will add a messy and a bad pattern when creating multi-threading apps because we will block other threads.

The responsibility of the connection/re-connection should belong to the AMQP thread itself without exposing the algorithm to another or main. It should be independent of the main. Like AWS SDK does the same.

  1. If the main thread ends, and there's a detached thread, does the app continue to run? Is it still running in a terminal, or just in the background?

In case the main thread ends, the detached thread will be killed too and the app will not run in the background. We will not have "ghost" threads, which is the main concern about detaching. It doesn't work so.
A simple example of this case in a playground at replit.

The reason is that the main thread would be able to tell what is the result of the initialization and take appropriate action, e.g. retry, report it somewhere, etc.

The connection and re-connection must be done asynchronous, because of how do we handle the case when we need to reconnect if do a blocking in a main?
The app will just die in case of the connection is not available. We should handle the reconnection in a parallel thread instead of just dying how it was before D4744.

report it somewhere, etc.

We have proper logging for the retrying and the state in amqp client in a thread, there are no "silent" black boxes. Also, we have an amqpReady state in D4742 which updates on native amqp-cpp client callback events and we can handle it asynchronously instead of blocking.

This comment explains a lot, thanks!

I think I slowly start to understand the reason here: we want to run the initialization in another, detached thread because we think that this init might be called from a short lived thread and we want the init thread to live as long as it needs. When we call from the main thread, the detaching shouldn't make a difference.

I think that we should be able to assume that the init will be called from main thread, but if that assumption is incorrect, detaching sounds like a reasonable strategy.

One thing that we could consider doing is joining the amqpClientThread at the end of the main, so that we can gracefully finish it. Does that make sense?

Yet one more reason why I'd avoid detach here: https://stackoverflow.com/a/54207121/15854120

Long story short, accessing external resources from the detached thread causes undefined behavior. And here, in connect(), we use fields like: this->lastConnectionTimestamp, this->lastConnectionTimestamp (EDIT: Ok, sorry I missed the "static storage" thing, but still it brings a danger to the house).

You yourself said that

The main thread can exit before amqp->connect() in a few cases

If the main exits before connect(), we may get the UB. This single argument alone should be sufficient to never use detach here.

As you noticed

we will have fatal errors

I don't think we should be okay with that. I think we should always gracefully release all resources and exit the app without allowing raw crashes.

karol requested changes to this revision.EditedAug 16 2022, 7:17 AM

this init might be called from a short lived thread and we want the init thread to live as long as it needs

What if this "short-lived thread" hangs for some reason and is no longer short (just a reminder: we're dealing with network connections here)?

I don't really understand why exactly having explicit control over the threads' lives is so problematic for @max. I provided a bunch of examples and arguments which make me think that detach is not good for this. I think it is dangerous and generates scenarios hard to debug resulting in undefined behaviors. Simply using joins in right places eliminates all these threats right away. Not saying that there may be more scenarios that I couldn't see.

I think I said all I wanted, hard to come up with something else without repeating myself, I'm not going to accept this, if you guys feel like it, you can remove me from the reviewers list and land this.

Thanks for the process.

This revision now requires changes to proceed.Aug 16 2022, 7:17 AM
tomek requested changes to this revision.Aug 16 2022, 8:54 AM

So there are a lot of potential issues with detaching, and it doesn't seem necessary at this point. @max could you try to find a solution where detach isn't used? I think it isn't a bad idea for main thread to wait for the initialization to finish. Also, if the init is called multiple times, only the first invocation has any meaning, so no short-lived threads should ever call it, right?

In D4740#140130, @karol wrote:

Yet one more reason why I'd avoid detach here: https://stackoverflow.com/a/54207121/15854120

Long story short, accessing external resources from the detached thread causes undefined behavior. And here, in connect(), we use fields like: this->lastConnectionTimestamp, this->lastConnectionTimestamp (EDIT: Ok, sorry I missed the "static storage" thing, but still it brings a danger to the house).

I think you are speaking about this. In our scenario, we are not going out of the thread's scope, you can check the same in a simple playground. There is no danger in this scenario.

You yourself said that

The main thread can exit before amqp->connect() in a few cases

If the main exits before connect(), we may get the UB. This single argument alone should be sufficient to never use detach here.

I think you got this from this StackOverflow. But, you should check the rest of the answers. Simple playground with the detached thread and main exits before it demonstrates that when the main exit the detached thread will be killed too and resources will be free.

By the way, this StackOverflow answer mentioned that the detached thread will run when the main exits, but it's not (run the playground code), there are staments down on the discussion there.

As you noticed

we will have fatal errors

I don't think we should be okay with that. I think we should always gracefully release all resources and exit the app without allowing raw crashes.

Few questions to @karol:

  1. In a scenario when the max reconnection attempts are reached why we should not throw a fatal and stop the execution?

How you should free the resources and shut down other threads?
In this case, you should use inter-thread communication and use something like conditional variables to send the exit to other threads and close the connections, etc. In case of the main exits before the detached thread joining will do the same that OS will do - free the resources. That's all. If we are speaking of the "real" synchronization and releasing: we need to use conditional variables and etc to inform that we need to close all the connections from other threads, clean buffers and etc. it's not about joining.

  1. If we rewrite to use join how we can handle this in tests?

We will join the thread to what exactly? If we do this in SetUp the test hangs infinitely because it will wait for the thread to exit. The only way to use joining in the test I see is:
In every test using init(), then run join in the test itself, and run close() to finish the thread at the end of every test. So for every test, we should create and end a new client instead of reusing it, right?
So actually we are shooting to ourselves here.

In D4740#140197, @tomek wrote:

So there are a lot of potential issues with detaching, and it doesn't seem necessary at this point. @max could you try to find a solution where detach isn't used?

There is a bunch of issues to use joins too. Ok, we can join at the end of the main, but how we can reuse the client in the tests if we are joining? The main question is (from the comment above):

If we rewrite to use join how we can handle this in tests?
We will join the thread to what exactly? If we do this in SetUp the test hangs infinitely because it will wait for the thread to exit. The only way to use joining in the test I see is:
In every test using init(), then run join in the test itself, and run close() to finish the thread at the end of every test. So for every test, we should create and end a new client instead of reusing it, right?
So actually we are shooting to ourselves here.

I think it isn't a bad idea for main thread to wait for the initialization to finish.

This is a good idea, it can be done pretty simple just by adding the call of waitUntilReady from the D4741 at the end of init.

Also, if the init is called multiple times, only the first invocation has any meaning, so no short-lived threads should ever call it, right?

For example, the tests are short-lived.

There is way too much churn happening on this diff. Have the people involved scheduled a meeting yet?

There is way too much churn happening on this diff. Have the people involved scheduled a meeting yet?

I think other team members can have their thoughts about this discussion and we should preserve it for the future. I'm sure this question will be risen in a future too.
That's why I've created ENG-1679 as a possible follow-up for the 0.5 version. Because of switching to join() can cause a big refactoring as I've mentioned above.

As I've created a ENG-1679 for the discussion about the detach() vs join() problem and this current solution in this diff passed tests and solved the main problem we can go with it now and defer our discussion and decision to the ENG-1679 .
What do you think @tomek and @karol ?

Ok, we can join at the end of the main, but how we can reuse the client in the tests if we are joining?

We should probably create a new client for each test to avoid tests affecting each other.

As I've created a ENG-1679 for the discussion about the detach() vs join() problem and this current solution in this diff passed tests and solved the main problem we can go with it now and defer our discussion and decision to the ENG-1679 .
What do you think @tomek and @karol ?

After this whole discussion I still don't see a good reason forcing us to use detach, but just for sake of unblocking I can accept and we will follow up in the task.

This revision is now accepted and ready to land.Aug 22 2022, 6:16 AM