Page MenuHomePhabricator

[services] Blob - Add append holder to proto
AbandonedPublic

Authored by karol on Apr 19 2022, 1:56 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jun 30, 10:02 AM
Unknown Object (File)
Sun, Jun 30, 10:02 AM
Unknown Object (File)
Sun, Jun 30, 10:00 AM
Unknown Object (File)
Thu, Jun 27, 11:20 AM
Unknown Object (File)
Fri, Jun 21, 3:36 PM
Unknown Object (File)
Fri, Jun 21, 12:32 PM
Unknown Object (File)
Wed, Jun 19, 4:46 PM
Unknown Object (File)
Tue, Jun 18, 5:57 AM

Details

Summary
Test Plan
cd services
yarn run-blob-service

Diff Detail

Repository
rCOMM Comm
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

  1. Any proto changes need to be reviewed by a lot more people
  2. Can you provide additional context about this new API? What's the difference between addReferenceToHashByHolder and AppendHolder?
  3. I keep coming back to this, but I do think it was a mistake to use "holder" as a unique ID like you decided to do a while back. We had a disagreement about it then, but the end result unfortunately is that we weren't able to prioritize a fix and we ended up getting stuck with an unideal API. When I see new APIs like this it feels like we're just digging ourselves further into a hole. Are you sure we can't use the blobID (aka blobHash) to set the new holder here?
  4. What is this API used for? The Linear task does not clarify the caller of the API, in what situation it will be used, etc.

In general, would be great if I didn't need to leave comments like this, and you addressed them all preemptively with a more detailed explanation

ashoat requested changes to this revision.Apr 19 2022, 10:48 AM
This revision now requires changes to proceed.Apr 19 2022, 10:48 AM
  1. Any proto changes need to be reviewed by a lot more people

Sure.

  1. Can you provide additional context about this new API? What's the difference between addReferenceToHashByHolder and AppendHolder?

There is no difference. it's the same method. Sorry for the confusion, I just wanted a shorter name.

  1. I keep coming back to this, but I do think it was a mistake to use "holder" as a unique ID like you decided to do a while back. We had a disagreement about it then, but the end result unfortunately is that we weren't able to prioritize a fix and we ended up getting stuck with an unideal API. When I see new APIs like this it feels like we're just digging ourselves further into a hole. Are you sure we can't use the blobID (aka blobHash) to set the new holder here?

I think this place isn't good for this. If you feel like it, we can prioritize the discussion in the associated task https://linear.app/comm/issue/ENG-430/consider-holder-to-not-be-unique

  1. What is this API used for? The Linear task does not clarify the caller of the API, in what situation it will be used, etc.

This method should be used when there's already an item in the database with a given hash so we will not have to attempt to upload the data again, but rather we just assign a new holder to the existing hash.

I updated the task with proper explanations

tomek requested changes to this revision.Apr 21 2022, 1:47 AM
tomek added inline comments.
native/cpp/CommonCpp/grpc/protos/blob.proto
44

Maybe we should explain the purpose of this api in a comment?

47–48

I'm slightly confused. In the linear task the these fields are named:

AppendHolder(existingHolder, newHolder)

In a comment here @ashoat suggested:

Are you sure we can't use the blobID (aka blobHash) to set the new holder here?

And @karol-bisztyga answered that:

I think this place isn't good for this.

But at the same time I see blobHash being used.

Can we clarify what's going on here?

It looks like it is not obvious which part of this request corresponds to an existing holder and which is new. Can we have names that explain it?

This revision now requires changes to proceed.Apr 21 2022, 1:47 AM
karol added inline comments.
native/cpp/CommonCpp/grpc/protos/blob.proto
47–48

I understand it this way: Ashoat said that we may want to reconsider having unique holders (unique across the system). For that, I think we shouldn't go here and instead, we can discuss it once more in the task I linked and do it separately.

We will always need a blob hash for this function regardless we use unique holders or not - we have to know where do we want to append a new holder.

It looks like it is not obvious which part of this request corresponds to an existing holder and which is new. Can we have names that explain it?

This function appends new holder to an existing hash. I see that it may not be clear, how about this then:

message AppendHolderRequest {
  string newHolder = 1;
  string existingBlobHash = 2;
}

?

tomek added 1 blocking reviewer(s): ashoat.
tomek added inline comments.
native/cpp/CommonCpp/grpc/protos/blob.proto
47–48

Ok, thanks!

Yeah, I think it's a lot more readable - I like it.

ashoat requested changes to this revision.Apr 21 2022, 1:35 PM

Confusing process

This has been a really confusing process.

In ENG-672, @karol-bisztyga proposed:

AppendHolder(existingHolder, newHolder)

This is why I asked:

Are you sure we can't use the blobID (aka blobHash) to set the new holder here?

Turns out at some point @karol-bisztyga changed his mind, and decided to make the API take blobHash exactly as I suggested! But instead of telling me something like "yes, I agree, I've changed it" he responded with "I think this place isn't good for this."

We can just use Put

Now that I understand what's actually going on here, I have another piece of feedback: I'm pretty sure we already have this API. We've explicitly talked about this scenario with Put, and have designed it to "short-circuit" if the blob already exists in the system and to skip re-uploading it.

Can't we simply use Put in all cases we want to use AppendHolder? We specify holder and blobHash, and see if the server "short-circuits". If the server doesn't "short-circuit", then we should conclude that in fact the server doesn't have the blob in the question, and the blob must be reuploaded.

In that scenario, what do we want the caller to do? I think we want the caller to reupload the blob, so continuing to use Put (as initially designed) seems like the right decision.

This revision now requires changes to proceed.Apr 21 2022, 1:36 PM

Yes, now I see what went wrong. It was totally my fault and I apologize. I understood this task wrong because I didn't read the context carefully.

I thought here we want to avoid re-uploading the same data. That was my mistake.

The method from this task has a different purpose:

As for keeping references, I talked with Tomek and we came to the conclusion that those references to attachments are only for the cleanup. We do not need them to handle backup operations at all.
This is because we want to keep track of which backup uses which blobs so we never remove blobs that are used by any backups.

To make a cleanup effective, we should store proper values in the blob service table so every new attachment reference here generates a new record in the blob table. Since blob holders are unique across the system, we should add a new method for this in blob.

This method will take a holder, read what hash it is assigned to, and then it will create a new holder pointing to that hash and put it in the database. After this, the method will return the newly created holder. This new holder we then will store in the backup table. In this case, when the backup will be removed, we're going to remove all the holders from the blob that will be listed here as references to attachments.

So it is more connected to handling attachments and the cleanup properly.

I'm going to review the following diffs, some of them should probably be abandoned.

About using put - yes, we should do this in this case, I created a task for this: https://linear.app/comm/issue/ENG-1030/make-sure-duplicate-hashes-are-added-to-the-database because looking at the code I'm not sure whether it does the job properly.

I also created a task for handling attachments properly https://linear.app/comm/issue/ENG-1029/handle-attachments with a task I linked before https://linear.app/comm/issue/ENG-672/create-appendholder-in-the-blob-service as a child task.

Overall, it's not a huge mistake but it is one for sure. It was just me not understanding some task I created a while ago. The method I added could be used but using put is just simpler and better.

Once again, sorry for this, I'm going to try to pay more attention next time.

As I said, this code is an effect of my not fully understanding the linear task. Sorry about that, abandoning.