
[services] Backup - Stop suppressing invalid chunk structure
Closed, Public

Authored by karol on Jun 7 2022, 6:52 AM.
Tags
None

Details

Summary

Depends on D4228

We should not terminate the reactor connection just because we stored a chunk in the database. If we do, we never catch the error when the client tries to push two consecutive chunks below the database limit; that case should be detected and reported back to the client. If we suppress it, the client may never learn that the second chunk was lost.

Test Plan

On the client, send two log chunks below the database limit (400 KB) and confirm that an error is returned once the second chunk is pushed.

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

when the client tries to push two consecutive chunks below the database limit

Can you clarify what you mean by this? I'm a bit confused

Yes, I was afraid this wouldn't be super clear right away.

So, we have a DB limit of 400 KB. If the uploaded data exceeds this limit, it is stored on S3; otherwise, we save it in the DB.

Now, when we push a single chunk < 400 KB, it is simply saved in the DB, and that works fine.

But the scenario covered here is when a client pushes two (or more) chunks below the DB limit, let's say two chunks of 100 KB each for a single log.
Without this change, we receive the first chunk, store it in the DB, and return the OK status. Other chunks may still be pushed, but they are never used, so they are lost, and the client still sees the OK status as the result of the operation.
With this change, we keep the reactor running and listen for more chunks. If the client pushes more chunks after a first one that is under the DB limit, we return an error. This is the correct behavior: pushing multiple small chunks produces redundant gRPC round trips and means the client code is poorly designed. Basically, in this case, the chunks should be merged on the client side.
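
For illustration, here is a minimal sketch of that size-based routing; storeInDatabase and storeInS3 are placeholder names, not the real backup-service API:

```cpp
// Illustrative only: route a chunk to the DB or to S3 based on the 400 KB
// limit described above.
#include <cstddef>
#include <string>

constexpr std::size_t kDbLimit = 400 * 1024; // 400 KB

void storeInDatabase(const std::string &chunk); // placeholder
void storeInS3(const std::string &chunk);       // placeholder

void storeChunk(const std::string &chunk) {
  if (chunk.size() <= kDbLimit) {
    storeInDatabase(chunk); // small chunk: fits under the DB limit
  } else {
    storeInS3(chunk); // big chunk: exceeds the DB limit, goes to S3
  }
}
```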

Ah okay that explanation makes sense! Now I'm just a bit confused about how this change achieves what you described...

  1. Does return nullptr cause the error to be propagated?
  2. How do we make sure we cause the error to be propagated only on the second chunk, rather than on the first chunk?
  1. Does return nullptr cause the error to be propagated?

It causes the error to be triggered when a situation like I described above occurs. Previously, it would never be triggered.

  2. How do we make sure we cause the error to be propagated only on the second chunk, rather than on the first chunk?

We perform checks on the persistenceMethod (by "small chunk" I mean a chunk below the DB limit, and by "big chunk" one above it):

  • if you first push a small chunk and then a big one - there will be an error, because we already set the persistenceMethod to DB and are then trying to push further.
  • if you push small and then small - the error will be triggered - that's the case we're handling in this diff
  • if you push a big chunk and then small/big - it'll be successfully stored on S3

Basically, if you push a small chunk, you should stop right there, because we don't allow any further data. We could handle the situation where you push a couple of really small chunks that eventually sum to a small chunk (< 400 KB), but I don't see a point in doing so; we can just push 400 KB over gRPC and avoid making this more complicated than it has to be.
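
To make the three cases above concrete, here is a rough sketch of such a persistenceMethod check, with an illustrative class and handler name rather than the exact code in this diff:

```cpp
// Illustrative sketch only; enum, field, and method names are placeholders.
#include <cstddef>
#include <memory>
#include <string>

#include <grpcpp/grpcpp.h>

constexpr std::size_t kDbLimit = 400 * 1024; // the 400 KB database limit

enum class PersistenceMethod { UNKNOWN, DB, S3 };

class SendLogReactorSketch {
  PersistenceMethod persistenceMethod = PersistenceMethod::UNKNOWN;

public:
  // Called once per incoming chunk. Returning nullptr keeps the reactor
  // alive and waiting for more data; returning a status terminates it.
  std::unique_ptr<grpc::Status> handleChunk(const std::string &chunk) {
    if (persistenceMethod == PersistenceMethod::DB) {
      // A small chunk was already stored in the DB, so any further data is
      // an invalid chunk structure and is reported back to the client.
      return std::make_unique<grpc::Status>(
          grpc::StatusCode::INVALID_ARGUMENT,
          "another chunk sent after a chunk below the database limit");
    }
    if (chunk.size() <= kDbLimit) {
      persistenceMethod = PersistenceMethod::DB;
      // ...store the chunk in the database here...
      // Return nullptr instead of an OK status (the change in this diff):
      // the reactor stays open, so a follow-up chunk hits the error branch
      // above instead of being silently dropped.
      return nullptr;
    }
    persistenceMethod = PersistenceMethod::S3;
    // ...push the chunk to S3 here; further chunks are fine in this mode...
    return nullptr;
  }
};
```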

Thanks, that mostly makes sense, but one more question:

In D4232#119697, @karol-bisztyga wrote:
  1. Does return nullptr cause the error to be propagated?

It causes the error to be triggered when a situation like I described above occurs. Previously, it would never be triggered.

Are you saying return nullptr will cause the error to be propagated? Or are you saying that return nullptr will put the Reactor in such a state that if it were to receive another chunk, it would then crash?

If it's the latter, it would seem like return nullptr is basically saying "we are done, and don't expect to receive anything else". But normally I would expect std::make_unique<grpc::Status>(grpc::Status::OK) (the old code) to be saying "we are done".

Can you clarify what return nullptr does and what std::make_unique<grpc::Status>(grpc::Status::OK) does?

ashoat requested changes to this revision. Jun 14 2022, 9:58 AM

With questions

This revision now requires changes to proceed. Jun 14 2022, 9:58 AM

Are you saying return nullptr will cause the error to be propagated? Or are you saying that return nullptr will put the Reactor in such a state that if it were to receive another chunk, it would then crash?

The latter, but not a crash per se, just an error that will be sent to the user.

If it's the latter, it would seem like return nullptr is basically saying "we are done, and don't expect to receive anything else". But normally I would expect std::make_unique<grpc::Status>(grpc::Status::OK) (the old code) to be saying "we are done".

This is correct; we basically say: "we are done, and don't expect to receive anything else".

Can you clarify what return nullptr does and what std::make_unique<grpc::Status>(grpc::Status::OK) does?

  • return nullptr - we continue the reactor's work, expecting more data
  • return X, where X is a non-null status - we terminate the reactor with the given status.
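
As a minimal sketch of that convention (the callback name and request type below are assumptions, not the actual Comm reactor interface):

```cpp
// Illustrative only: the per-message callback contract described above.
#include <memory>

#include <grpcpp/grpcpp.h>

struct LogChunkRequest {
  bool isLastChunk; // placeholder field standing in for the real request
};

std::unique_ptr<grpc::Status> handleRequest(const LogChunkRequest &request) {
  if (!request.isLastChunk) {
    // nullptr: keep the reactor running and wait for the next message.
    return nullptr;
  }
  // A non-null status terminates the reactor: grpc::Status::OK ends the
  // stream cleanly, while a non-OK status is delivered to the client as an
  // error.
  return std::make_unique<grpc::Status>(grpc::Status::OK);
}
```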

The thing here is that if we terminate the reactor with an OK status, it works just fine when we send a single small chunk. However, if we send small+big, then, as I said two comments back, this will happen:

Other chunks may still be pushed, but they are never used, so they are lost, and the client still sees the OK status as the result of the operation.

I realize the state machine isn't perfectly clean here, but I don't think it's that hacky either. It's fine with me: the connection can easily be terminated by the user, and by leaving the reactor "hanging" like this we avoid losing chunks while telling the client that everything is okay, which would be even worse (that would be really bad, I think).

ashoat requested changes to this revision. Jun 15 2022, 4:41 PM

Your answers seem inconsistent to me. On one hand, you say:

The latter, but not a crash per se, just an error that will be sent to the user.

So return nullptr; will send an error. But then you say:

return nullptr - we continue the reactor's work, expecting more data

This revision now requires changes to proceed. Jun 15 2022, 4:41 PM

This:

The latter, but not a crash per se, just an error that will be sent to the user.

is a response to this:

Are you saying return nullptr will cause the error to be propagated? Or are you saying that return nullptr will put the Reactor in such a state that if it were to receive another chunk, it would then crash?

So, the latter (aka the second one) is:

return nullptr will put the Reactor in such a state that if it were to receive another chunk, it would then crash

So I responded that yes, this is correct, except that:

not a crash per se, just an error that will be sent to the user.

But the part saying that it puts the reactor into a state in which receiving anything more results in an error is correct.

Now:

return nullptr - we continue the reactor's work, expecting more data

This was a response to this:

Can you clarify what return nullptr does and what std::make_unique<grpc::Status>(grpc::Status::OK) does?

which was a question I understood as being about reactors in general. In that general sense, the answer is correct.

You're confused because you mixed two answers: one answering the question about this specific case, and the other answering a general question.

In reactors, when we return nullptr, we expect more data. In this specific case, we put the reactor into an "I expect more data" state, but if it ever actually gets more data, it responds with an error.

This revision is now accepted and ready to land. Jun 20 2022, 10:48 AM