Page MenuHomePhabricator

[blob] Add function to perform cleanup operation
ClosedPublic

Authored by bartek on Oct 4 2023, 3:18 AM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Jan 6, 4:13 AM
Unknown Object (File)
Sat, Jan 4, 11:56 PM
Unknown Object (File)
Tue, Dec 31, 4:09 AM
Unknown Object (File)
Tue, Dec 31, 4:09 AM
Unknown Object (File)
Tue, Dec 31, 4:09 AM
Unknown Object (File)
Tue, Dec 31, 4:09 AM
Unknown Object (File)
Tue, Dec 31, 4:08 AM
Unknown Object (File)
Tue, Dec 31, 4:07 AM
Subscribers

Details

Summary

This diff adds a function that performs the core cleanup logic. Its logic should be fail-proof: failure of any steps should prevent blob deletion.

The algorithm (also documented in code comments):

  1. Get both blobs and holders rows that are marked as "unchecked".
  2. Organize them into structures that have information about blob-holder assignments and their existence.
  3. Filter out entries that contain both blob and holders - they're valid.
  4. Get more information about entries that contain only blob or only holders:
    • For holders, get the blob hash and check if it exists in DB
    • For blobs, check if at least 1 holder exists in DB
  5. Update the struct with query results and repeat step 3.
  6. Delete rows that still contain only blob or only holders. Mark others as "checked".

Depends on D9351

Test Plan

Ran this function on a dataset that contained the following combinations:

  • Blobs having blob hashes and holders, none marked as unchecked
  • Blobs having blob hashes and holders, blob item marked as unchecked, holders not marked
  • Blobs having blob hashes and holders, some holders marked as unchecked, blob hash not marked
  • Blobs having blob hashes and holders, some holders marked as unchecked, blob hash marked too
  • Blobs having blob hashes and holders, all holders marked as unchecked, blob hash not marked
  • Blobs having blob hashes and holders, blob hash and all holders marked as unchecked
  • Blobs having only blob hash row (marked as unchecked), no holders.
  • Blobs having only holder rows (marked as unchecked), no blob hash.

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

bartek held this revision as a draft.
bartek published this revision for review.Oct 4 2023, 3:32 AM
bartek added inline comments.
services/blob/src/service.rs
278–286 ↗(On Diff #31624)

This part couldn't be done in batch because it's a Query operation (not GetItem)

301–306 ↗(On Diff #31624)

S3 deletion will be handled in the next diff


Failure handling note: if any of the DB operations fails, that's no problem, they still will remain as 'unchecked' and will be handled during the next cleanup operation,

The only potential problem here is that S3 deletion might fail while DB rows are already deleted. Unfortunately, DDB transactions don't give us much control so we cannot rollback later.
One solution is to periodically (much less often than cleanup task) list S3 objects and check if their counterparts exist in DDB. This could be considered as a follow-up, it's only leaving junk in S3, but doesn't break anything

Do we also plan to uncheck items in some cases? It looks like we're handling a case where an item wasn't created properly. I think that we also sometimes should look for a case where an item was deleted only partially (e.g. only a holder was left).

This revision is now accepted and ready to land.Oct 6 2023, 6:38 AM
In D9352#275828, @tomek wrote:

Do we also plan to uncheck items in some cases? It looks like we're handling a case where an item wasn't created properly. I think that we also sometimes should look for a case where an item was deleted only partially (e.g. only a holder was left).

Deleting holders already marks related rows as unchecked so they can be processed by the cleanup procedure as soon as their protection period ends. It's done here

Rebase. Use HashSet instead of Vec to avoid duplicates

michal added inline comments.
services/blob/src/service.rs
230 ↗(On Diff #31788)
232–237 ↗(On Diff #31788)

Possibly could be done with one db query (requires changes to find_unchecked_items to only check if unchecked != null)? We could filter the items inside the service

298 ↗(On Diff #31788)
services/blob/src/service.rs
232–237 ↗(On Diff #31788)

You're right - this distinction is not even needed at all. IIRC I introduced it only for indexing performance - a boolean cannot be the sparse index key in DDB, but the Query operation could be potentially modified to look for both strings. I'll look into this

Rebase on parent diff feedback