Page MenuHomePhabricator

[services] Tunnelbroker - Add AMQP shared channel locking
ClosedPublic

Authored by max on Aug 6 2022, 11:32 AM.
Tags
None
Referenced Files
F3389274: D4767.id15817.diff
Fri, Nov 29, 6:21 PM
Unknown Object (File)
Tue, Nov 26, 9:22 AM
Unknown Object (File)
Tue, Nov 26, 9:19 AM
Unknown Object (File)
Tue, Nov 26, 9:16 AM
Unknown Object (File)
Thu, Nov 7, 10:59 AM
Unknown Object (File)
Oct 27 2024, 6:16 AM
Unknown Object (File)
Oct 27 2024, 6:16 AM
Unknown Object (File)
Oct 27 2024, 6:16 AM

Details

Summary

Following comments in D4740 from the synchronization question and forward, ENG-1495 task, we should use a synchronization mechanism in access to the AMQP Client channel which is accessed from the different threads (gRPC threads + DeliveryBroker thread) by calling send() and ack() methods.
When the number of threads grows non-synchronized access to the shared channel causes a segmentation fault or AMQP-CPP TCP Connection buffer error.

This diff introduces the use of C++17 std::scoped_lock which provides a convenient RAII-style mechanism for owning a shared channel.
Usage of the std::scoped_lock was added to the send() and ack() methods which are called from the threads.

Related Linear task: ENG-1495

Test Plan

Successfully pass a stress test in D4768 with multiple threads sending messages at once.

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

max held this revision as a draft.

Mutex include was added to the header file.

max edited the test plan for this revision. (Show Details)

Name for a scoped_lock was added.

max published this revision for review.Aug 6 2022, 1:08 PM
max edited the summary of this revision. (Show Details)
tomek requested changes to this revision.Aug 10 2022, 5:44 AM

I'm slightly confused now. After this diff we can access the amqpChannel only from a single thread at a time. Is it going to degrade the performance? If the channel has to be accessed only from a single thread, maybe we should make it thread local and have an instance for each thread?

services/tunnelbroker/src/Amqp/AmqpManager.cpp
135 ↗(On Diff #15406)

I guess we should use curly brackets initialization

This revision now requires changes to proceed.Aug 10 2022, 5:44 AM

This diff introduces the use of C++17 std::scoped_lock

Does our compilation process support C++17?

In D4767#138282, @tomek wrote:

I'm slightly confused now. After this diff we can access the amqpChannel only from a single thread at a time. Is it going to degrade the performance? If the channel has to be accessed only from a single thread, maybe we should make it thread local and have an instance for each thread?

This degrades performance but in a stress test, in D4768 it shows that we have enough throughput of nearly 2000 messages per second during send/receive in this blocking model. We should consider using a pool of connections which is a common practice. This is a future task for the next version ENG-1562 because creating and managing a pool of connections will take no small amount of time and the current performance looks ok for our needs at the moment.

have an instance for each thread?

Creating a new instance with a separate TCP connection/channel to the AMQP server on each gRPC thread is a bad practice and will cause flooding in TCP connections and a network stack. RabbitMQ servers also have a limit of 100 connections. Creating a pool of connections and using a pool in ENG-1562 is a best solution not to flood the TCP connection and not to block on the single one.

In D4767#138286, @tomek wrote:

This diff introduces the use of C++17 std::scoped_lock

Does our compilation process support C++17?

We are using C++17 in other services as well. We have these changes a long time ago and the compiler supports C++17.

Change to curly braces initialization.

max added inline comments.
services/tunnelbroker/src/Amqp/AmqpManager.cpp
135 ↗(On Diff #15406)

I guess we should use curly brackets initialization

Sure! Thanks.

Does our compilation process support C++17?

I added C++17 support for clang+MacOS in https://phab.comm.dev/D4766

This degrades performance but in a stress test, in D4768 it shows that we have enough throughput of nearly 2000 messages per second during send/receive in this blocking model.

the current performance looks ok for our needs at the moment.

What is the estimate for the throughput we should be able to handle? Could you explain how it was estimated that the current performance is ok?

Even if the throughput is high enough, blocking model should be replaced soon, because one slow operation can block all the others, and that isn't acceptable, so I think prioritizing ENG-1562 might be a good idea @ashoat

Accepting, as we have a ticket to handle the improvement.

This revision is now accepted and ready to land.Aug 11 2022, 4:55 AM

Agree with prioritizing ENG-1562. It's slated for the next release of Tunnelbroker right now (0.5)

In D4767#138735, @tomek wrote:

This degrades performance but in a stress test, in D4768 it shows that we have enough throughput of nearly 2000 messages per second during send/receive in this blocking model.

the current performance looks ok for our needs at the moment.

What is the estimate for the throughput we should be able to handle?

I have a full description of the throughout test in ENG-1562:

...It's pretty enough throughput shown by the stress test from D4768 (100 threads sending 10 messages in parallel) which is around 2000 messages per second in a testing sandbox environment:

MacBook M1, Docker limits: CPU 4 cores, RAM 3 GB.
Docker container with the Tunnelbroker, compiler in debug mode.
Docker container with the RabbitMQ image.

Could you explain how it was estimated that the current performance is ok?

From my previous experiences, 2000 requests per second is a good number as a starting point. Not sure the other infrastructure services like DynamoDB instance can handle more for one connection.

Even if the throughput is high enough, blocking model should be replaced soon, because one slow operation can block all the others, and that isn't acceptable, so I think prioritizing ENG-1562 might be a good idea @ashoat

The blocking model in a certain scenario will not be a bottleneck. Because of message broker throughput is very very high compared to the databases. If there is a blocking connection to the message broker - actually something not good happened like the message broker can't handle the messages at all. The current blocking model in the message broker client will be a latency bottleneck, the latency will grow along with the number of messages per second - that's true. Also, we have a dynamodb database which I'm sure will be a much bigger bottleneck in our scenario.

Agree with prioritizing ENG-1562. It's slated for the next release of Tunnelbroker right now (0.5)

Looking at the 0.5 backlog it's a good idea to prioritize it.

This revision was landed with ongoing or failed builds.Aug 22 2022, 7:04 AM
This revision was automatically updated to reflect the committed changes.