Page MenuHomePhabricator

[services] Backup - Add hash to send log request
AbandonedPublic

Authored by karol on Mar 30 2022, 5:04 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 4, 8:29 PM
Unknown Object (File)
Sat, Jan 4, 8:29 PM
Unknown Object (File)
Sat, Jan 4, 4:26 PM
Unknown Object (File)
Fri, Dec 20, 4:04 PM
Unknown Object (File)
Fri, Dec 20, 4:04 PM
Unknown Object (File)
Fri, Dec 20, 4:00 PM
Unknown Object (File)
Dec 6 2024, 8:17 PM
Unknown Object (File)
Dec 4 2024, 12:49 PM

Details

Summary

Depends on D3561

I think for every operation when we possibly will update something to the blob service, we need to require a hash of the data that's about to be uploaded. This is because in the blob service we require a hash for the Put operation.

Since we decided to store large logs in the blob service, we have to also require a hash here.

Test Plan
cd services
yarn run-backup-service

Diff Detail

Repository
rCOMM Comm
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

karol added a reviewer: ashoat.
ashoat requested changes to this revision.Mar 30 2022, 10:35 PM

Can you clarify how the API changes? Is the logHash always sent, or only for "large logs" that get sent to the blob service? If we expect each log to be unique, then what's the point here – won't we need to upload all of the logData to the backup service anyways, at which point the backup service can directly take the hash? Is there any risk to taking the client's "word" that the logHash correspond to the logData? Do we have any related documents / diagrams that need to be updated?

This revision now requires changes to proceed.Mar 30 2022, 10:35 PM
In D3562#97531, @ashoat wrote:

Can you clarify how the API changes? Is the logHash always sent, or only for "large logs" that get sent to the blob service? If we expect each log to be unique, then what's the point here – won't we need to upload all of the logData to the backup service anyways, at which point the backup service can directly take the hash? Is there any risk to taking the client's "word" that the logHash correspond to the logData? Do we have any related documents / diagrams that need to be updated?

The API changes just adding one more step - for sending a hash.
There are two reasons to have this:

  • As I said in the description for every set of data that is supposed to be passed along to the blob we need a hash anyway so I think it's better to calculate in on the client instead of in the backup service.
  • gRPC's OnReadDone callback is a little retarded to me (or I'm behind somehow...) because it doesn't recognize a successfully ended connection from the broken read. In both cases, the only thing we get is a bool flag: !ok inside of this callback. Maybe we could check the state in OnDone, but still, I guess it's safe to calculate the hash and check with the one sent to us by the client - we do not know how many chunks of data we're going to receive, so if we get !ok we cannot know if it's a broken connection or just the last chunk.

Is the logHash always sent, or only for "large logs" that get sent to the blob service?

This is a good point. Originally, my idea was to send it for every request but I guess we can skip this for small parts of data. Although I can see an edge-case already - what if data has 4MB in total. 4MB is also a gRPC limit for a data chunk. So we cannot know if it is just one chunk and the second chunk failed to read or there is just 4MB of data. I think the safest way is to always calculate and compare the hash.

ashoat requested changes to this revision.Mar 31 2022, 8:26 PM

Still doesn't make sense to me. I don't understand why we would trust the client to tell us this info. Don't we need to download the whole log to the backup service anyways? Can't the backup service generate the hash itself?

This revision now requires changes to proceed.Mar 31 2022, 8:26 PM

The idea is to be sure that we received all the bytes and they're valid. How else could we guarantee this if the grpc itself cannot recognize a successfully ended connection from a reading error?
So the data's journey begins on the client's device and ends on the blob/backup service. If the hash doesn't match it means something got lost on the way through the network and we should raise an error.

Don't we need to download the whole log to the backup service anyways?

Yes, I'm not sure how that's related here, it's kind of obvious.

Can't the backup service generate the hash itself?

It can but how would we know whether we have the same data in the blob service that was sent from the client?

Here's how the data travels:

client => backup => blob

So if we calculate hash on backup and it's asd123 and on blob we also obtain asd123 but on the user device we get data hash qwe456, then it doesn't seem like we have valid data on the services' side, does it?

ashoat requested changes to this revision.Apr 1 2022, 10:38 AM

How else could we guarantee this if the grpc itself cannot recognize a successfully ended connection from a reading error?

I'm sure there are lots of ways to do that. As a basic example (and I'm not saying this is the best thing to do), you could simply have a boolean that gets sent called transmissionFinished or something. Alternately, you could have the userID sent at the end. And maybe you could just gracefully close the connection? I suspect there is a way to distinguish an error from a graceful close.

Yes, I'm not sure how that's related here, it's kind of obvious.

It's related because there is literally no reason to send the logHash up to the server from the client if the server can simply generate it itself.

It can but how would we know whether we have the same data in the blob service that was sent from the client?

TCP guarantees ordering and consistency. I don't understand what problem you're solving here. Does gRPC actually have this issue? How do you think we guarantee the same thing for file uploads today without a hash?

This revision now requires changes to proceed.Apr 1 2022, 10:38 AM

If you're still confused about my feedback here, can you try to either check with @palys-swm to understand it better, or reach out over chat? (Rather than re-requesting review again)

I'm sure there are lots of ways to do that. As a basic example (and I'm not saying this is the best thing to do), you could simply have a boolean that gets sent called transmissionFinished or something. Alternately, you could have the userID sent at the end.

Bool flag will do.

And maybe you could just gracefully close the connection? I suspect there is a way to distinguish an error from a graceful close.

I've not found anything like it. But maybe I missed something, all we get is a bool flag from grpc and a proper stream ending is treated the same way as any error on the client side.

TCP guarantees ordering and consistency. I don't understand what problem you're solving here. Does gRPC actually have this issue? How do you think we guarantee the same thing for file uploads today without a hash?

Fair enough. But then, the only thing that feels weird is that we agreed and implemented sending hash in the blob service, calculating it again, and comparing results if they match. Given what you said, what is the purpose of that? I think it would be good to be consistent so either we check hashes or not. Now it feels like we send a blob like this: client => backup => blob and we only check the hash between the backup and the blob.

For me it looks like there's a valid reason for sending the hash. Before client starts sending the data, it declares what the hash value would be and the server starts saving in a place defined by the hash. We need to do that, because in worst case, the data might be so big to never fit into the memory. At the end, we should verify if the declared hash matches the actual. This verification is important because 1. a malicious client could try to lie about the value (but I don't see any useful attack possible by this) 2. a client might have some bugs (e.g. sending logic skips some bytes) and we want to catch that to avoid storing corrupted data and giving the client an impression that the data was successfully saved.

So basically, the valid use case is to protect against bugs in client code which might result in corrupted data.

Regarding ending of the connection: there's some need for a research about distinguishing between success and failure. It looks like the current agreement is that there's no way for grpc to give that info, but in this case it would make sense to follow @ashoat suggestion of e.g. a boolean flag. I'd even consider extending it, by adding a new state to the reactors which would mean that all the data was sent by a client.

From my point of view, this diff looks ok and there's a valid reason for always sending the hash, so I'm going to accept.

Thanks Tomek, that perspective convinced me. While I think the mechanism you are suggesting here is rather complex (saving the pending data directly as it streams in), it does seem better in the long-term, and I guess it's better to include this data now in case we want to use it in the future.

For now, I think it's very important that we do not save pending data as it streams in. Before we could do something like that, we would need to make sure that we could invalidate the data if the hash doesn't match, BEFORE anybody would read the data. I don't believe we have a way to handle that right now, and I'm not sure it makes sense to prioritize that work currently. So for now I would probably just have the implementation code completely ignore this value, except as a signal that the upload is complete.

Before client starts sending the data, it declares what the hash value would be and the server starts saving in a place defined by the hash. We need to do that, because in worst case, the data might be so big to never fit into the memory.

Is that actually how it works today? The server immediately starts saving, without receiving the full contents?

At the end, we should verify if the declared hash matches the actual.

What happens if it doesn't match? We're already saving the data, right? We would have to save it to a temporary place first, or make the save "incomplete" initially, in order to be able to invalidate it before anybody reads it.

This revision is now accepted and ready to land.Apr 5 2022, 7:38 PM

Is that actually how it works today? The server immediately starts saving, without receiving the full contents?

Yes. Yes and no actually. Let me explain, so we have 2 scenarios, either we receive:

  • a lot of data
  • small amount of data

For a small amount of data, we just save and that's it. For a big one, we receive consecutive chunks of data (let's say of size 4MB - that's the limit for gRPC). So when we receive 10 chunks, 4MB each, what happens is, we initialize the MPU (multi part uploader) and write the chunks to it (underneath MPU keeps the parts in some kind of tmp either removing them at the end(fail) or flushing to the actual destination(success)). So yes, we start saving via MPU but not in the final destination right away.

So I think this:

Before client starts sending the data, it declares what the hash value would be and the server starts saving in a place defined by the hash

is not a correct statement.

What happens if it doesn't match? We're already saving the data, right? We would have to save it to a temporary place first, or make the save "incomplete" initially, in order to be able to invalidate it before anybody reads it.

I think we can simply cancel the MPU operation and the AWS will handle this for us automatically.

Can we land this? Does anyone feel like the discussion should be continued somewhere?

The conclusion is, that we keep sending hashes, right?

looks like this change has already been added so we can abandon here I think.