
[services] Backup - Stop suppressing invalid chunk structure
Closed, Public

Authored by karol on Jun 7 2022, 6:52 AM.
Tags
None

Details

Summary

Depends on D4228

We should not terminate the reactor connection just because we stored a chunk in the database. If we do, we never catch the error when the client tries to push two consecutive chunks below the database limit; that case should be detected and reported back to the client. If we suppress it, the client may never learn that the second chunk was lost.

Test Plan

On the client, send two log chunks below the database limit (400 KB) and confirm that an error is returned once the second chunk is pushed.

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

when the client tries to push two consecutive chunks below the database limit

Can you clarify what you mean by this? I'm a bit confused

Yes, I was afraid this wouldn't be super clear right away.

So, we have a DB limit of 400 KB. If the uploaded data exceeds this limit, it is stored on S3; otherwise, we save it in the DB.

Now, when we push a single chunk < 400 KB, it is simply saved in the DB, and that works fine.

But the scenario covered here is when a client pushes two (or more) chunks below the DB limit, let's say two chunks of 100 KB each for a single log.
Without this change, we receive the first chunk, store it in the DB, and return the OK status. Other chunks may still be pushed, but they are never used, so they are lost, and the client still sees the OK status as the result of the operation.
With this change, we keep the reactor running and listen for more chunks. If the client pushes more chunks after a first one that is under the DB limit, we return an error. This is the correct behavior: pushing multiple small chunks produces redundant gRPC round trips and means the client code is poorly designed. Basically, in this case, the chunks should be merged on the client side.
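
For illustration, here is a minimal sketch of that size-based routing; storeInDatabase and storeInS3 are placeholder names, not the real backup-service API:

```cpp
// Illustrative only: route a chunk to the DB or to S3 based on the 400 KB
// limit described above.
#include <cstddef>
#include <string>

constexpr std::size_t kDbLimit = 400 * 1024; // 400 KB

void storeInDatabase(const std::string &chunk); // placeholder
void storeInS3(const std::string &chunk);       // placeholder

void storeChunk(const std::string &chunk) {
  if (chunk.size() <= kDbLimit) {
    storeInDatabase(chunk); // small chunk: fits under the DB limit
  } else {
    storeInS3(chunk); // big chunk: exceeds the DB limit, goes to S3
  }
}
```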

Ah okay that explanation makes sense! Now I'm just a bit confused about how this change achieves what you described...

  1. Does return nullptr cause the error to be propagated?
  2. How do we make sure we cause the error to be propagated only on the second chunk, rather than on the first chunk?
  1. Does return nullptr cause the error to be propagated?

It causes the error to be triggered when a situation like I described above occurs. Previously, it would never be triggered.

  2. How do we make sure we cause the error to be propagated only on the second chunk, rather than on the first chunk?

We perform checks on the persistenceMethod (by "small chunk" I mean a chunk below the DB limit, and by "big chunk" one above it):

  • if you first push a small chunk and then a big one - there will be an error, because we already set the persistenceMethod to DB and are then trying to push further.
  • if you push small and then small - the error will be triggered - that's the case we're handling in this diff
  • if you push a big chunk and then small/big - it'll be successfully stored on S3

Basically, if you push a small chunk, you should stop right there, because we don't allow any further data. We could handle the situation where you push a couple of really small chunks that eventually sum to a small chunk (< 400 KB), but I don't see a point in doing so; we can just push 400 KB over gRPC and avoid making this more complicated than it has to be.
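
To make the three cases above concrete, here is a rough sketch of such a persistenceMethod check, with an illustrative class and handler name rather than the exact code in this diff:

```cpp
// Illustrative sketch only; enum, field, and method names are placeholders.
#include <cstddef>
#include <memory>
#include <string>

#include <grpcpp/grpcpp.h>

constexpr std::size_t kDbLimit = 400 * 1024; // the 400 KB database limit

enum class PersistenceMethod { UNKNOWN, DB, S3 };

class SendLogReactorSketch {
  PersistenceMethod persistenceMethod = PersistenceMethod::UNKNOWN;

public:
  // Called once per incoming chunk. Returning nullptr keeps the reactor
  // alive and waiting for more data; returning a status terminates it.
  std::unique_ptr<grpc::Status> handleChunk(const std::string &chunk) {
    if (persistenceMethod == PersistenceMethod::DB) {
      // A small chunk was already stored in the DB, so any further data is
      // an invalid chunk structure and is reported back to the client.
      return std::make_unique<grpc::Status>(
          grpc::StatusCode::INVALID_ARGUMENT,
          "another chunk sent after a chunk below the database limit");
    }
    if (chunk.size() <= kDbLimit) {
      persistenceMethod = PersistenceMethod::DB;
      // ...store the chunk in the database here...
      // Return nullptr instead of an OK status (the change in this diff):
      // the reactor stays open, so a follow-up chunk hits the error branch
      // above instead of being silently dropped.
      return nullptr;
    }
    persistenceMethod = PersistenceMethod::S3;
    // ...push the chunk to S3 here; further chunks are fine in this mode...
    return nullptr;
  }
};
```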

Thanks, that mostly makes sense, but one more question:

In D4232#119697, @karol-bisztyga wrote:
  1. Does return nullptr cause the error to be propagated?

It causes the error to be triggered when a situation like I described above occurs. Previously, it would never be triggered.

Are you saying return nullptr will cause the error to be propagated? Or are you saying that return nullptr will put the Reactor in such a state that if it were to receive another chunk, it would then crash?

If it's the latter, it would seem like return nullptr is basically saying "we are done, and don't expect to receive anything else". But normally I would expect std::make_unique<grpc::Status>(grpc::Status::OK) (the old code) to be saying "we are done".

Can you clarify what return nullptr does and what std::make_unique<grpc::Status>(grpc::Status::OK) does?

ashoat requested changes to this revision. Jun 14 2022, 9:58 AM

With questions

This revision now requires changes to proceed. Jun 14 2022, 9:58 AM

Are you saying return nullptr will cause the error to be propagated? Or are you saying that return nullptr will put the Reactor in such a state that if it were to receive another chunk, it would then crash?

The latter, but not a crash per se, just an error that will be sent to the user.

If it's the latter, it would seem like return nullptr is basically saying "we are done, and don't expect to receive anything else". But normally I would expect std::make_unique<grpc::Status>(grpc::Status::OK) (the old code) to be saying "we are done".

This is correct; we basically say: "we are done, and don't expect to receive anything else".

Can you clarify what return nullptr does and what std::make_unique<grpc::Status>(grpc::Status::OK) does?

  • return nullptr - we continue the reactor's work, expecting more data
  • return X, where X is a non-null status - we terminate the reactor with the given status.
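
As a minimal sketch of that convention (the callback name and request type below are assumptions, not the actual Comm reactor interface):

```cpp
// Illustrative only: the per-message callback contract described above.
#include <memory>

#include <grpcpp/grpcpp.h>

struct LogChunkRequest {
  bool isLastChunk; // placeholder field standing in for the real request
};

std::unique_ptr<grpc::Status> handleRequest(const LogChunkRequest &request) {
  if (!request.isLastChunk) {
    // nullptr: keep the reactor running and wait for the next message.
    return nullptr;
  }
  // A non-null status terminates the reactor: grpc::Status::OK ends the
  // stream cleanly, while a non-OK status is delivered to the client as an
  // error.
  return std::make_unique<grpc::Status>(grpc::Status::OK);
}
```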

The thing here is that if we terminate the reactor with an OK status, it works just fine when we send a single small chunk. However, if we send small+big, then, as I said two comments back, this will happen:

Other chunks may still be pushed, but they are never used, so they are lost, and the client still sees the OK status as the result of the operation.

I realize the state machine isn't perfectly clean here, but I don't think it's that hacky either. It's fine with me: the connection can easily be terminated by the user, and by leaving the reactor "hanging" like this we avoid losing chunks while telling the client that everything is okay, which would be even worse (that would be really bad, I think).

ashoat requested changes to this revision. Jun 15 2022, 4:41 PM

Your answers seem inconsistent to me. On one hand, you say:

The latter, but not a crash per se, just an error that will be sent to the user.

So return nullptr; will send an error. But then you say:

return nullptr - we continue the reactor's work, expecting more data

This revision now requires changes to proceed. Jun 15 2022, 4:41 PM

This:

The latter, but not a crash per se, just an error that will be sent to the user.

is a response to this:

Are you saying return nullptr will cause the error to be propagated? Or are you saying that return nullptr will put the Reactor in such a state that if it were to receive another chunk, it would then crash?

So, the latter (aka the second one) is:

return nullptr will put the Reactor in such a state that if it were to receive another chunk, it would then crash

So I responded that yes, this is correct, except that:

not a crash per se, just an error that will be sent to the user.

But the part saying that it puts the reactor into a state in which receiving anything more results in an error is correct.

Now:

return nullptr - we continue the reactor's work, expecting more data

This was a response to this:

Can you clarify what return nullptr does and what std::make_unique<grpc::Status>(grpc::Status::OK) does?

which was a question I understood as being about reactors in general. In that general sense, the answer is correct.

You're confused because you mixed two answers: one answering the question about this specific case, and the other answering a general question.

In reactors, when we return nullptr, we expect more data. In this specific case, we put the reactor into an "I expect more data" state, but if it ever actually gets more data, it responds with an error.

This revision is now accepted and ready to land. Jun 20 2022, 10:48 AM