
[services] Backup - make blob client instances separate for every thread
ClosedPublic

Authored by karol on Mar 22 2022, 2:47 AM.
Tags
None

Details

Summary

Depends on D3487

https://linear.app/comm/issue/ENG-905/fix-the-threading-problem - the problem and approaches are described in the task

Test Plan

You need 3 terminals
1:

cd services
yarn run-backup-service

2:

cd services
yarn run-blob-service

3:
You can use https://github.com/karol-bisztyga/grpc-playground/tree/backup-async

./build.sh
./cmake/build/bin/client

then in the GUI use the n option (just type n and hit enter; this stands for creating a new backup).

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

karol retitled this revision from [services] Backup - make blob client instances separate for every thread to [draft] [services] Backup - make blob client instances separate for every thread. Mar 22 2022, 2:53 AM
karol edited the summary of this revision. (Show Details)
karol edited the test plan for this revision. (Show Details)
karol added reviewers: max, tomek, varun, jim.
karol edited the summary of this revision. (Show Details)
karol edited the test plan for this revision. (Show Details)

update

build fix needed

karol added inline comments.
services/backup/docker-server/contents/server/src/Reactors/client/blob/ServiceBlobClient.h
30 ↗(On Diff #10762)

remove this

tomek requested changes to this revision. Edited Mar 31 2022, 9:37 AM

Could you explain why we need to use this synchronization?

Is there a way to avoid the synchronization?

What are the consequences of using synchronization for the handleRequest method? Can we handle a couple of connections at the same time?

edit:
I just noticed that there's a lot of description in the issue. Going to read through that before requesting changes.

This revision now requires changes to proceed. Mar 31 2022, 9:37 AM
This revision now requires review to proceed. Mar 31 2022, 9:39 AM
karol retitled this revision from [draft] [services] Backup - make blob client instances separate for every thread to [services] Backup - make blob client instances separate for every thread. Apr 1 2022, 5:02 AM

rebase needed

tomek requested changes to this revision. Apr 6 2022, 11:28 PM

Before analyzing if this code handles synchronization correctly, I'd like to check if it is really the best option.

In the task you described a possible solution:

store them as reactor's members - we could store the client reactor objects as members of CreateNewBackupReactor, since we need one client reactor per server reactor (because every backup server reactor needs to call the blob service through the client reactor just once). I tried to do it this way but encountered problems. The main problem is (I think) that the client reactor objects are being destroyed while gRPC still tries to perform some operations on them.

Is it possible to fix the main problem by postponing object deletion until the communication is done, e.g. by replacing delete this with setting a callback in which the object is deleted and giving that callback to the other reactor?

Overall, it's not obvious to me why we need complicated synchronization here.
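
A minimal sketch of the callback-based deletion idea above, assuming a hypothetical ClientReactor with a doneCallback member (names are illustrative, not taken from the diff): instead of calling delete this in its completion hook, the client reactor invokes a callback supplied by its owner, and the owner decides when it is safe to release the object.

#include <functional>
#include <utility>

class ClientReactor {
public:
  // The owner registers this before starting the gRPC call.
  void setDoneCallback(std::function<void()> callback) {
    this->doneCallback = std::move(callback);
  }

  // Invoked once gRPC has finished all operations on this reactor;
  // the owner deletes/releases the reactor from inside the callback
  // instead of the reactor calling `delete this`.
  void onAllOperationsDone() {
    if (this->doneCallback) {
      this->doneCallback();
    }
  }

private:
  std::function<void()> doneCallback;
};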

This revision now requires changes to proceed. Apr 6 2022, 11:28 PM

Is it possible to fix the main problem by postponing object deletion until the communication is done, e.g. by replacing delete this with setting a callback in which the object is deleted and giving that callback to the other reactor?

The solution you proposed seems very hacky. First, delete this lies in the base reactor's code. Second, I don't understand what the "other reactor" is here - did you mean putReactor?

Overall, it's not obvious to me why we need complicated synchronization here.

Not sure what is unclear here. We have a server reactor that creates a client reactor inside. We need to wait for the client reactor to complete all its operations before we terminate the (parent) server reactor.
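
A minimal sketch of the synchronization being described, with hypothetical member names (blobDoneCVMutex, blobDoneCV, blobDone) rather than the exact ones in the diff: the server reactor blocks until the client (blob) reactor reports that it has finished.

#include <condition_variable>
#include <mutex>

class CreateNewBackupReactorSketch {
  std::mutex blobDoneCVMutex;
  std::condition_variable blobDoneCV;
  bool blobDone = false;

public:
  // Called from the client (blob) reactor when it has finished all work.
  void notifyBlobDone() {
    {
      std::lock_guard<std::mutex> lock(this->blobDoneCVMutex);
      this->blobDone = true;
    }
    this->blobDoneCV.notify_one();
  }

  // Called by the server reactor before it terminates; blocks until the
  // client reactor is done, so it is not destroyed mid-operation.
  void waitForBlobClient() {
    std::unique_lock<std::mutex> lock(this->blobDoneCVMutex);
    this->blobDoneCV.wait(lock, [this] { return this->blobDone; });
  }
};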

karol added inline comments.
services/backup/docker-server/contents/server/src/Reactors/server/CreateNewBackupReactor.h
34 ↗(On Diff #11034)

rename

tomek requested changes to this revision. Apr 7 2022, 1:28 AM

We've discussed this offline with @karol-bisztyga. It looks like the main issue is that we need to call one reactor from another, but in our current solution, when the caller is done it deletes itself, which results in the callee being deleted before it can finish its work. We've considered a couple of other options that don't involve mutexes; one of them is promising, but it needs to be verified. Even if we agree to change the approach, there are a lot of diffs in this stack, so it will be a lot cheaper to make such a change in a follow-up.

Requesting changes because of inline comments, but we will probably keep the overall approach for now.

services/backup/docker-server/contents/server/src/Reactors/server/CreateNewBackupReactor.h
35–36 ↗(On Diff #11136)

Can you rename it so that it's obvious what it's doing? E.g. blobDoneCV.
Also wondering if we can use a higher-level mechanism, e.g. std::latch.

91–96 ↗(On Diff #11136)

As discussed, we should try to verify whether we can block in this method call.

This revision now requires changes to proceed. Apr 7 2022, 1:28 AM
services/backup/docker-server/contents/server/src/Reactors/server/CreateNewBackupReactor.h
35–36 ↗(On Diff #11136)

Can you rename it so that it's obvious what it's doing? E.g. blobDoneCV.

sure

Also wondering if we can use a higher-level mechanism, e.g. std::latch.

I think we can. The only difference is that instead of using mutex + CV we just use a single latch, right? I think we could cover this in a follow-up; it will probably be easier/faster that way.
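
A minimal sketch of the std::latch variant mentioned above, assuming a C++20 toolchain; the names are illustrative, not taken from the diff. The latch replaces the mutex + CV + flag trio with a single counter.

#include <latch>

class CreateNewBackupReactorSketch {
  // Initialized to 1: one count_down() call from the blob client reactor
  // releases anyone waiting on the latch.
  std::latch blobDoneLatch{1};

public:
  // Called from the client (blob) reactor when it has finished all work.
  void notifyBlobDone() {
    this->blobDoneLatch.count_down();
  }

  // Called by the server reactor before it terminates; blocks until
  // count_down() has been called.
  void waitForBlobClient() {
    this->blobDoneLatch.wait();
  }
};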

91–96 ↗(On Diff #11136)

I tried to look at the source code and google a bit, but I haven't found anything... As I think about this, I cannot see why there could be anything wrong with it. All gRPC operations are done, so at this point it's just a regular object hanging around for a bit before it deletes itself. If your intuition still tells you that it could be dangerous for some reason, we can create a task and follow up there, trying to squeeze some information about this out of the web.

Accepting, but please create a task where we can discuss the alternatives.

services/backup/docker-server/contents/server/src/Reactors/server/CreateNewBackupReactor.h
36 ↗(On Diff #11140)

We can skip CV in the mutex name

93 ↗(On Diff #11140)

Could you remind me why we're scheduling sending an empty string here?

35–36 ↗(On Diff #11136)

The only difference is that instead of using mutex + CV we just use a single latch, right?

Yes

I think we could cover this in a follow-up; it will probably be easier/faster that way.

I don't think it will be hard to update in this diff, but adding it as a followup also doesn't hurt - so up to you.

91–96 ↗(On Diff #11136)

I don't expect that we will find the answers easily. I think it would be a lot better to invest in checking whether the other solution might work, and only if it doesn't should we continue the verification here.

This revision is now accepted and ready to land. Apr 7 2022, 1:57 AM
services/backup/docker-server/contents/server/src/Reactors/server/CreateNewBackupReactor.h
36 ↗(On Diff #11140)

For me, it's clearer because this CV and mutex are tightly bound, so having a name that's a subset of another name makes sense. If you feel strongly, I can change it, but I'd leave it as is.

93 ↗(On Diff #11140)

To let the client reactor know that there are no more chunks. We operate on a concurrent queue here, so we have to enqueue something that indicates we're finished; otherwise, the queue read on the other side is going to hang forever. An alternative here is to use queue<unique_ptr<string>> instead of queue<string> so we can enqueue a nullptr. I don't personally mind, but I can understand that an empty string may not be crystal clear (although it is also logical, since we would never have an empty chunk - that would be pointless).
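
A minimal, self-contained illustration of the empty-string sentinel pattern described above. The diff uses its own concurrent queue type; the small blocking queue below is only a stand-in so the example compiles on its own.

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Stand-in blocking queue (not the real type used in the services code).
class BlockingQueue {
  std::queue<std::string> items;
  std::mutex mtx;
  std::condition_variable cv;

public:
  void enqueue(std::string item) {
    {
      std::lock_guard<std::mutex> lock(this->mtx);
      this->items.push(std::move(item));
    }
    this->cv.notify_one();
  }

  std::string dequeue() {
    std::unique_lock<std::mutex> lock(this->mtx);
    this->cv.wait(lock, [this] { return !this->items.empty(); });
    std::string item = std::move(this->items.front());
    this->items.pop();
    return item;
  }
};

int main() {
  BlockingQueue dataChunks;

  // Consumer: reads chunks until it sees the empty-string sentinel.
  std::thread consumer([&dataChunks]() {
    while (true) {
      std::string chunk = dataChunks.dequeue();
      if (chunk.empty()) {
        break; // empty string means "no more chunks"
      }
      std::cout << "sending chunk of size " << chunk.size() << std::endl;
    }
  });

  // Producer: enqueues data chunks, then the sentinel.
  dataChunks.enqueue("chunk-1");
  dataChunks.enqueue("chunk-2");
  dataChunks.enqueue(""); // signal that we're finished

  consumer.join();
  return 0;
}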

35–36 ↗(On Diff #11136)

OK, I'm landing this now. @palys-swm, if you feel like something here is unfinished, please let me know.