Page MenuHomePhabricator

[services] Tunnelbroker - Add ping implementation to the `Get` handler
ClosedPublic

Authored by max on Sep 19 2022, 5:57 AM.
Tags
None
Referenced Files
F3376842: D5174.diff
Wed, Nov 27, 2:32 AM
Unknown Object (File)
Tue, Nov 5, 9:45 AM
Unknown Object (File)
Tue, Nov 5, 9:45 AM
Unknown Object (File)
Tue, Nov 5, 9:45 AM
Unknown Object (File)
Tue, Nov 5, 9:45 AM
Unknown Object (File)
Tue, Nov 5, 9:45 AM
Unknown Object (File)
Tue, Nov 5, 9:45 AM
Unknown Object (File)
Tue, Nov 5, 9:43 AM

Details

Summary

This diff introduces adding of ping thread into the GetResponse handler to check if the device is online.
The ping message from the D5173 will be sent to the client in a ping interval introduced in D5151. By pinging the client in a unidirectional Get stream we will be sure that the client is online and update the state of the client in the database using the method from D5165.
We can use the same approach when changing to the bidirectional stream in a near future in v.0.5.

This is a part of the stack.

Related Linear task: ENG-1766

Test Plan

Successfully built the server and the follow-up diffs in the stack.

Diff Detail

Repository
rCOMM Comm
Branch
arcpatch-D5174
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

max held this revision as a draft.
max edited the test plan for this revision. (Show Details)
max added a reviewer: tomek.
max published this revision for review.Sep 19 2022, 6:40 AM

I'm checking what's wrong with the Android and iOS builds (maybe the protobuf changes broke some old client-side API code). Requesting review to unblock it for possible comments.

tomek requested changes to this revision.Sep 20 2022, 6:44 AM
tomek added inline comments.
services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
202 ↗(On Diff #16828)

Why do we need to use a mutex?

203 ↗(On Diff #16828)

Is it safe to capture by reference?

217 ↗(On Diff #16828)

What does this mean? When do we want to have set_ping(false)?

256 ↗(On Diff #16828)

We're joining a thread with infinite loop. Shouldn't we cancel it in some way?

This revision now requires changes to proceed.Sep 20 2022, 6:44 AM
services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
220 ↗(On Diff #16828)

Not sure how strong are gRPC guarantees, but canceling the connection when one ping fails sounds rather extreme - but might be the right solution.

Rebased on using the empty instead of bool for the ping.
Using of TryCancel() was removed in a favor of the atomic_bool.
Some small code reordering.

max marked 5 inline comments as done.
max added inline comments.
services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
202 ↗(On Diff #16828)

Why do we need to use a mutex?

In our scenario we have two threads:

  • The main thread which writes the messages when the message comes;
  • The pinging thread which send pings;

That's why we are writing through the respondToWriter function from both threads. To synchronise access to the writer we are using the mutex and lock_guard.

In the last update we are using the atomic_bool to handle the current writer state instead of the TryCancel() in IsCancelled() because after the testing and investigation what actually it does - it's sending the gRPC return and don't update the connection state immediately which is important for us to know.

We will update the connection state to false if we fail to write to the response channel - in this case we'll assume that the client is gone (and update the device online state later in the diffs stack in D5175).

203 ↗(On Diff #16828)

Is it safe to capture by reference?

We are using the mutex and lock_guard in this function, I assume it's safe.

217 ↗(On Diff #16828)

What does this mean? When do we want to have set_ping(false)?

This is deprecated and we are [[ https://phab.comm.dev/D5173#152545 | using the Empty instead ]] of bool.

220 ↗(On Diff #16828)

Not sure how strong are gRPC guarantees, but canceling the connection when one ping fails sounds rather extreme - but might be the right solution.

It's deprecated in a favor of std::atomic_bool writerIsReady as I've discovered that we can't rely on it (please check my comment above for context).

256 ↗(On Diff #16828)

We're joining a thread with infinite loop.

After reorganizing the code here on the latest update we are joining only on the one occasion: if the write fails and we are returning an error code from the loop. Seems that there is an only place for it.

Shouldn't we cancel it in some way?

We are waiting for a new messages here from the MPMCQueue blocking read, we can't cancel this thread, because we can't killing threads ((.

services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
196–204 ↗(On Diff #16946)

Feel like we can get rid of the if clause. Also should move the Logging into the respondToWriter function, then we can differiente if the writer or write failed.

Also, we should probably use LOG(ERROR) if there is an error, unless it's expected for the write to fail normally (lack of network for mobile?).

This revision now requires changes to proceed.Sep 22 2022, 11:28 AM
max marked 5 inline comments as done.

Removing of the if condition.

max added inline comments.
services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
196–204 ↗(On Diff #16946)

Feel like we can get rid of the if clause.

Nice catch! I've removed it. Thanks, @jon !

Also should move the Logging into the respondToWriter function, then we can differiente if the writer or write failed.

I think in this case we can't distinguish in debug where exactly we had a write error: in the ping thread or in the handler thread. The better way is to log where it fails to distinguish.

Also, we should probably use LOG(ERROR) if there is an error, unless it's expected for the write to fail normally (lack of network for mobile?).

We are pinging to detect when the client goes offline so this isn't an exact error in our case. It makes sense to log it only for debug purposes using the INFO.

We are pinging to detect when the client goes offline so this isn't an exact error in our case. It makes sense to log it only for debug purposes using the INFO.

Yea, makes sense

tomek requested changes to this revision.Sep 23 2022, 4:21 AM

I'm not really sure about the architecture here. Wy is ping a part of get method? Shouldn't we start sending pings when a new session is created? Do we stop sending pings when get method ends? Maybe I don't see something or don't understand the requirements, but doing pings in get sounds a bit useless - we're sending other messages there and based on the result can determine if a client is online. Pinging mechanism should work even when there is no other communication.

Maybe I'm missing something. @max could you explain why we're doing this in get? Is there something wrong in starting pinging when a new session is established?

services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
198 ↗(On Diff #17010)

Maybe we can change type of DEVICE_ONLINE_PING_INTERVAL_MS to be duration and store std::chrono::milliseconds(DEVICE_ONLINE_PING_INTERVAL_MS) in the variable - that would mean that the unit suffix is not needed anymore.

This revision now requires changes to proceed.Sep 23 2022, 4:21 AM
In D5174#153312, @tomek wrote:

I'm not really sure about the architecture here. Why is ping a part of get method? Shouldn't we start sending pings when a new session is created? Maybe I don't see something or don't understand the requirements, but doing pings in get sounds a bit useless - we're sending other messages there and based on the result can determine if a client is online. Pinging mechanism should work even when there is no other communication.

Maybe I'm missing something. @max could you explain why we're doing this in get? Is there something wrong in starting pinging when a new session is established?

In our current 0.4 architecture, we have only one long-lived request: this is Get. The client creates the session to get the sessionID and then the client connects to Get unidirectional stream to get new messages, so we must use the same stream to send pings instead of creating a separate request and maintaining a separate thread in the client and server. We can reuse the single stream request. This is also actual as we are switching to the bidirectional stream in 0.5, so we can send ping in this unified stream as well without any significant changes.

but doing pings in get sounds a bit useless - we're sending other messages there and based on the result can determine if a client is online

It looks like yes at the first sight, but it's not. The gRPC stream didn't ping the client when there are no messages on the stream. The client requesting Get and waiting for new messages in a stream (this is a long-lived connection). If the client closing connection, has some network error, the gRPC server is not notified about this and the request is still considered alive. We can consider that the client goes offline only by the error during the write operation to the channel. If the for some time there are no new messages, the server will not know about the client's online status. That's why we are pinging the client in the stream to make sure it is online.

I've tested this approach in an isolated server-client gRPC test. This is a repo.

In the future, there is a possibility that we can use the gRPC internal keep-alive mechanism which is off by default. The task that tracking is ENG-1842. At the moment I can't make it work as expected and deferred because of the deadline, but that would be a good idea to use this not only in Tunnelbroker and in any other gRPC connections to make sure that the connection is alive. And possibly we can get rid of pinging by ourselfs.

Do we stop sending pings when get method ends?

Yes, we have a writerIsReady atomic variable that reflects the current writer connection state and if it's false the ping loop and thread will exit.

max marked an inline comment as done.
max added inline comments.
services/tunnelbroker/src/Service/TunnelbrokerServiceImpl.cpp
198 ↗(On Diff #17010)

Maybe we can change type of DEVICE_ONLINE_PING_INTERVAL_MS to be duration and store std::chrono::milliseconds(DEVICE_ONLINE_PING_INTERVAL_MS) in the variable - that would mean that the unit suffix is not needed anymore.

I've followed the current convention in the Constants.h, that makes sense to change all possible occurrences to such an approach as a follow-up.

Thanks for this great explanation! It makes sense now.

This revision is now accepted and ready to land.Sep 23 2022, 7:39 AM
max marked an inline comment as done.

Resolving merge conflict.

This revision was landed with ongoing or failed builds.Sep 23 2022, 1:51 PM
This revision was automatically updated to reflect the committed changes.