Page MenuHomePhabricator

[services] Tunnelbroker - Put and get messages from database in grpc handler
ClosedPublic

Authored by max on Apr 25 2022, 7:19 AM.
Tags
None
Referenced Files
F3379402: D3833.diff
Wed, Nov 27, 3:43 PM
Unknown Object (File)
Wed, Nov 13, 4:38 AM
Unknown Object (File)
Wed, Nov 13, 4:38 AM
Unknown Object (File)
Sun, Nov 10, 5:17 PM
Unknown Object (File)
Sun, Nov 10, 5:13 PM
Unknown Object (File)
Sun, Nov 10, 5:13 PM
Unknown Object (File)
Sun, Nov 10, 5:13 PM
Unknown Object (File)
Sun, Nov 10, 5:12 PM

Details

Summary

Message persisting logic in this diff:

  • On gRPC Send request we are storing a message to the dynamoDB database before the AMQP message is pushed to the RabbitMQ. In this case, we are making sure that the message is persisted and we don't rely on the RabbitMQ delivery only.
  • On gRPC Get request, we are retrieving the messages for the deviceID from the dynamoDB database first and delivering them to the device.
    • After retrieving messages from the database, wait for the messages from the AMQP message broker (RabbitMQ);
    • Follow-up messages are delivered as Get stream request messages;
    • On message broker (RabbitMQ) message delivery, delete that message from the database by unique messageID.
Test Plan
  1. Run yarn run-tunnelbroker-service and successfully built the service without any warnings and errors.
  2. Message successfully delivered when clients are online:
    • Client with deviceID#1 sends the message to deviceID#2 using the Send request;
    • deviceID#2 is online and listening for the new messages using the Get stream;
    • deviceID#2 instantly and successfully gets the message from deviceID#1 using the Get stream;
  3. Message successfully delivered after client reconnects/offline:
    • Client with deviceID#1 sends the message to deviceID#2 using the Send request;
    • deviceID#2 is offline at that time;
    • deviceID#2 connects to the tunnelbroker server and successfully gets the message from deviceID#1 using the Get stream;
  4. Message successfully delivered when RabbitMQ server was restarted and the queue was cleared when the messages were sent, but the device was offline at that time and then reconnects after:
    • Client with deviceID#1 sends the message to deviceID#2 using the Send request;
    • deviceID#2 is offline at that time;
    • RabbitMQ server goes down, the messages queue was cleared;
    • RabbitMQ server goes up;
    • deviceID#2 connects to the tunnelbroker server and successfully gets the message from deviceID#1 using the Get stream.

Diff Detail

Repository
rCOMM Comm
Branch
messages-put-get-from-database
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Apr 25 2022, 7:20 AM
Harbormaster failed remote builds in B8507: Diff 11866!
max edited the test plan for this revision. (Show Details)
max added a reviewer: tomek.

I've tested it and it can be pushed to the review now as a part of the message persistence stack.

Fixing possible double delivery message bug, by clearing the delivery broker queue in case we have the messages in the database on the device new
connection.

Rebase and merge on master due to the stack changes.

ashoat requested changes to this revision.May 4 2022, 11:03 AM
ashoat added inline comments.
services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
215 ↗(On Diff #12222)

I'm confused, can you remind me why we erase here? If we want to clear the messages after they are sent, shouldn't we do that after we have a successful send?

225–226 ↗(On Diff #12222)

How is this different from the erase above?

241–243 ↗(On Diff #12222)

What is the point of erasing if something is empty?

This revision now requires changes to proceed.May 4 2022, 11:03 AM

I wanted to see the changes locally to perform what's in the test plan but I cannot pull this change.

arc patch D3833 results in errors. Can you please fix this?

Also, the test plan is a bit unclear to me. I don't get points 2 and 3. I was hoping that it becomes clear once I do it locally but that is impossible. Please, fix this revision.

In D3833#109876, @karol-bisztyga wrote:

I wanted to see the changes locally to perform what's in the test plan but I cannot pull this change.

arc patch D3833 results in errors. Can you please fix this?

I have created a new arc patch D3833 clear the build cache and the build is successful without errors. @karol-bisztyga can you please send an error that you have?

services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
215 ↗(On Diff #12222)

I'm confused, can you remind me why we erase here? If we want to clear the messages after they are sent, shouldn't we do that after we have a successful send?

At this point, when a client connects and requests GET for the messages first we check if there are undelivered messages in the database. If so, we erase the messages to deliver from rabbitMQ which are handled by DeliveryBroker.
We are doing so to get rid of possible double delivery because we can have messages in the database and same in the rabbitMQ queue (we have a low TTL in rabbitMQ messages, but the client can reconnect in a lower timeframe). Because all the undelivered messages exist in the database we can clear the rabbitMQ queue for the device.

225–226 ↗(On Diff #12222)

How is this different from the erase above?

Here we are erasing the message from the database. We are removing here from the database only after successfully delivering to the client, which in our case is successfully written to writer.
That's why we can be sure that the message was delivered to the client before it was removed.

241–243 ↗(On Diff #12222)

What is the point of erasing if something is empty?

To handle multithreading we are storing messages in DeliveryBroker structure which is:
folly:ConcurrentHashMap[deviceID] -> folly::MPMCQueue
If deviceID messages queue is empty we don't need to store folly::MPMCQueue for it and can free memory to fix possible 'ghost' queues.
That's why we are removing it by calling erase method which removes the MPMCQueue from ConcurrentHashMap.

services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
215 ↗(On Diff #12222)

Can you add a code comment explaining this? It seems non-obvious

225–226 ↗(On Diff #12222)

We should wait until we can confirm delivery to the client (when the client responded with an updated checkpoint) before we delete from the database. I recognize that this is not the purpose of this diff... can you link a Linear task tracking the suggestion I made? Hopefully one already exists, but if not we should create one

241–243 ↗(On Diff #12222)

Got it! Can you introduce a new method in DeliveryBroker called deleteQueueIfEmpty that runs these three lines? Then the code will be more self-documenting. It might still be good to add a code comment to deleteQueueIfEmpty to clarify that we're doing it to free memory

max added inline comments.
services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
225–226 ↗(On Diff #12222)

We should wait until we can confirm delivery to the client (when the client responded with an updated checkpoint) before we delete from the database. I recognize that this is not the purpose of this diff... can you link a Linear task tracking the suggestion I made? Hopefully one already exists, but if not we should create one

Sure! I've created ENG-1158 task to follow on this comment.

241–243 ↗(On Diff #12222)

Got it! Can you introduce a new method in DeliveryBroker called deleteQueueIfEmpty that runs these three lines? Then the code will be more self-documenting. It might still be good to add a code comment to deleteQueueIfEmpty to clarify that we're doing it to free memory

Makes sense to me. I've created ENG-1146 task.

Accepting as my concerns have been addressed, but I think @karol-bisztyga's still need to be addressed too

In D3833#109876, @karol-bisztyga wrote:

I wanted to see the changes locally to perform what's in the test plan but I cannot pull this change.

arc patch D3833 results in errors. Can you please fix this?

Hey @karol-bisztyga, can you please be more precise about what kind of build error you have?
I have created a new arc patch D3833 clear the build cache and the build is successful without errors. Seems our CI built it without errors too.

In D3833#115283, @geekbrother wrote:
In D3833#109876, @karol-bisztyga wrote:

I wanted to see the changes locally to perform what's in the test plan but I cannot pull this change.

arc patch D3833 results in errors. Can you please fix this?

Hey @karol-bisztyga, can you please be more precise about what kind of build error you have?
I have created a new arc patch D3833 clear the build cache and the build is successful without errors. Seems our CI built it without errors too.

If it says parent commit could not be found, try running git pull --all --tags first


edit: looks like that should work

3f53.png (302×1 px, 79 KB)

ashoat requested changes to this revision.May 19 2022, 8:43 AM

According to @geekbrother, @karol-bisztyga has been unresponsive over Comm. I'd like to unblock this diff.

I wanted to see the changes locally to perform what's in the test plan but I cannot pull this change.

arc patch D3833 results in errors. Can you please fix this?

Please see @atul's comment above. git pull --all --tags && arc patch D3833 works for me, @atul, and @geekbrother.

Also, the test plan is a bit unclear to me. I don't get points 2 and 3. I was hoping that it becomes clear once I do it locally but that is impossible. Please, fix this revision.

This comment from 2 weeks ago was never responded to. It should've been responded to 2 weeks ago.

Rebase on master.
Update with using deleteQueueIfEmpty() DeliveryBroker method.
Code comments was added.

max edited the test plan for this revision. (Show Details)

Rebase on master and tools namespace updates.

In D3833#109876, @karol-bisztyga wrote:

I wanted to see the changes locally to perform what's in the test plan but I cannot pull this change.

arc patch D3833 results in errors. Can you please fix this?

Sorry about this, fixed!

Also, the test plan is a bit unclear to me. I don't get points 2 and 3. I was hoping that it becomes clear once I do it locally but that is impossible. Please, fix this revision.

I've updated the description and test plan, hope now it's more clear.

According to @geekbrother, @karol-bisztyga has been unresponsive over Comm. I'd like to unblock this diff.

Not sure what you're talking about. I got pinged just today about this on the comm app. I'm not recalling any messages from the past though. I try to respond to all messages, @geekbrother what's this about?

services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
220 ↗(On Diff #13117)

Nit: we could define this outside of the loop.

Add comment for the deleteQueueIfEmpty().
tunnelbroker::GetResponse in loop usage optimization.

max added inline comments.
services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
220 ↗(On Diff #13117)

Nit: we could define this outside of the loop.

Nice optimization catch! We can use a single instance and use grpc generated Clear() method on it.

tomek added inline comments.
services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
165

Maybe we should consider changing this method's argument to be MessageItem? It might require moving MessageItem from database namespace.

Pinging @ashoat for a re-review of this diff.

services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
165

Maybe we should consider changing this method's argument to be MessageItem? It might require moving MessageItem from database namespace.

I was thinking about this too. I'll create a task for this in the backlog.

This revision is now accepted and ready to land.May 30 2022, 12:14 PM