Page MenuHomePhabricator

[services] Backup - Add DB operations for Backup
ClosedPublic

Authored by karol on Feb 10 2022, 6:28 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Dec 20, 6:36 PM
Unknown Object (File)
Fri, Dec 20, 6:36 PM
Unknown Object (File)
Fri, Dec 20, 6:36 PM
Unknown Object (File)
Fri, Dec 20, 6:36 PM
Unknown Object (File)
Fri, Dec 20, 6:35 PM
Unknown Object (File)
Fri, Dec 20, 6:35 PM
Unknown Object (File)
Fri, Dec 20, 6:31 PM
Unknown Object (File)
Wed, Dec 4, 1:04 PM

Details

Summary

Database operations on BackupItems. We probably will not use remove method on production but it will be needed for tests.

There were some custom steps so I didn't fully use internal methods but after I implement operations for BackupItems and LogItems, I'm going to look at how we can take as much advantage of them as possible.

Depends on D3159

Test Plan

Backup service builds

Diff Detail

Repository
rCOMM Comm
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

karol edited the test plan for this revision. (Show Details)
karol added reviewers: tomek, max, varun.
tomek requested changes to this revision.Feb 11 2022, 2:15 AM
tomek added inline comments.
services/backup/docker-server/contents/server/src/DatabaseManager.cpp
100–105

We're deleting here by primary-sort key pair, so there should be always only one match, right? If that's not the case, should we delete by FIELD_BACKUP_ID?

This revision now requires changes to proceed.Feb 11 2022, 2:15 AM
karol added inline comments.
services/backup/docker-server/contents/server/src/DatabaseManager.cpp
100–105

Yes, the purpose of this method is to remove only one specific item(please note the singular name)

We should not perform lookups by the backup id field. Rather than that, we should use the created field and identify backups by their creation date and user id(PK).

This revision is now accepted and ready to land.Feb 11 2022, 3:13 AM
This revision now requires review to proceed.Feb 11 2022, 3:13 AM
ashoat requested changes to this revision.Feb 13 2022, 8:29 PM
ashoat added inline comments.
services/backup/docker-server/contents/server/src/DatabaseManager.cpp
90–94

This seems wasteful... it looks like we're getting every single backup and then just using the last one. Can we make the query only match a single result?

100–105

We should not perform lookups by the backup id field.

  1. Is this for performance reasons?
  2. Our API for secondary auth uses the backup ID, and doesn't provide the creation date. This is a huge problem if what you're saying is true. If you are sure that we cannot perform backups based on the backup ID field, please bring this up in the architecture review.
This revision now requires changes to proceed.Feb 13 2022, 8:29 PM
services/backup/docker-server/contents/server/src/DatabaseManager.cpp
90–94

We can do it, sorry I didn't do it earlier.

100–105

I think we can handle this.

Let's break this down.

For now, we have something like this:

* backup
*  userID[PK]            string
*  created[SK]           timestamp
*  backupID              string
*  recoveryData          bytes
*  compactionHolder      string
*  attachmentHolders     list<string>

so the primary key is (userID, created).

For efficient queries:

  • we definitely need the userID as a partition key - we're always going to query the database by the user id because we want to operate on the specific user's backups.
  • we need created as well since we want to be able to fetch the newest backups

If we need also to query the database by the backupID effectively, I do not see why we don't create an index for it. I thought about this for a second and it seems like for the simple auth we only need to get that backupID. The way indexes work in dynamo DB is, they copy certain data to "a new place" and we can manually specify what data are going to be copied. So this index could be really small and consist of just the primary key(which is always projected) and the backupID.

If you are sure that we cannot perform backups based on the backup ID field, please bring this up in the architecture review.

I think we can handle this in the DB settings here without affecting the architecture of the whole service.

Regarding the original issue raised by Tomek here: I think every time we want to look up a backup in the database, we should identify it by the userID and created. backupID is more like a secret(https://phabricator.ashoat.com/D2950#83151). I realize this is misleading as usually a name like xxxID indicates an identifier by which we can query the database. I think it might be good then to rename this field.

Conclusion:

  • we want to create an index for the backupID that would project backupID and the primary key(userID,created)
  • we may rename backupID so it's less misleading, maybe something like backupSecret

What do you think @ashoat @palys-swm ?

add limit 1 to the query for latest backup

services/backup/docker-server/contents/server/src/DatabaseManager.cpp
100–105

We probably also don't want to replace the sort key with the backupID, because the queries that we need are going to need all the data of a backup when querying by the created value and we only need(I think) the minimum of the data when querying by the backupID. So making the backupID a sort key and indexing created would cause more data duplication than if we go the way I proposed above, which is: created stays as a sort key and we add an index for the backupID. Just clarifying.

@karol-bisztyga, thanks for the thorough comment. It made it really easy for me to review.

Regarding backupID vs. backupSecret, I worry that backupSecret would get confused with the backup key itself, since that can also be described as the backup secret. I'm open to a different name, but I personally don't mind the fact that it is named backupID but is a secret. I don't think being an ID necessarily implies being public knowledge.

Sort key vs. secondary index

I think your idea of an index makes sense. The only question I have is whether we want to replace the sort key with the backupID. I thought through your argument about efficiency and I'm not sure it applies... here's my analysis:

  1. CreateNewBackup: this is just an insert, and does not require a query
  2. SendLog: this just needs to check the backupID to see if it is valid. It can either query an "index" as you described, or the primary table – either way should work
  3. RecoverBackupKey: this will need to query for the most recent backupID and then see some backup data. So it is more efficient if the primary table has created as the sort key, since that way it doesn't have to do a secondary query.
  4. PullBackup: but this one is the opposite of RecoveryBackupKey. It also needs to see some backup data, and it would be more efficient if it could query the primary table directly with backupID as the sort key, instead of first having to do a secondary query.

It seems like this can all be optimized by making sure that the secondary index has all the info necessary to respond to RecoveryBackupKey or PullBackup (whichever one we choose to use the secondary index for). But I can't see an advantage to using created vs.backupID as the primary index.

Either way, I think we can proceed with the design as suggested. But I'm curious for @karol-bisztyga and @palys-swm's take here: is there a clear advantage to using created as the sort key instead of backupID?

This revision is now accepted and ready to land.Feb 14 2022, 8:07 PM

Ok, let's stay with the backupID name, I'm ok either way.

Regarding which field should be a sort key and which should be an index, we should look at the following methods then:

  • RecoverBackupKey
query bycreated
needsrecovery data, backup id
  • PullBackup
query bybackup id
needscompactionHolder, attachmentHolders

If we make created an index, then backupID would be a sort key, so for this index, we would need only to project recoveryData.
If we make backupID an index, then created would be a sort key, but it wouldn't be needed for the PullBackup method. So we'd have to project compactionHolder and attachmentHolders anyway.

I think you're right that it's better to have created as a secondary index and backupID as a sort key then. I'm going to change that in D2950.

now backupID is the sort key

This revision is now accepted and ready to land.Feb 16 2022, 2:43 AM
tomek requested changes to this revision.Feb 16 2022, 6:30 AM

Requesting changes just to make sure we don't land the code with build issues

This revision now requires changes to proceed.Feb 16 2022, 6:30 AM
This revision is now accepted and ready to land.Feb 16 2022, 7:12 AM
ashoat added inline comments.
services/backup/docker-server/contents/server/src/DatabaseManager.cpp
46 ↗(On Diff #9728)

This automatically inserts the new item into the secondary index, right?

101 ↗(On Diff #9728)

This automatically removes the item from the secondary index, right?

Just like in D3087, this diff was landed before addressing my comments.

I want to reiterate: this is deeply concerning. Behavior like this forces me to have to check every single diff when it's landed because I'm paranoid people are ignoring my comments. Behavior like this makes me wonder if I should ever accept a diff with comments.

@karol-bisztyga, you must ALWAYS check a diff before landing to make sure you have addressed all comments. Better yet would be if you addressed comments and answered questions when they are asked instead of waiting until the end and forgetting.

I'll repeat my question from D3087: in the future, would you prefer that I request changes any time I have a small question?

Hey, sorry about this. I think we can keep the current pattern of accepting with comments, I address everything most of the time, not sure why I missed these.

services/backup/docker-server/contents/server/src/DatabaseManager.cpp
46 ↗(On Diff #9728)

Yes

Whenever the data in the table is modified, the index is automatically changed to reflect the changes in the table.

101 ↗(On Diff #9728)

Yes

Whenever the data in the table is modified, the index is automatically changed to reflect the changes in the table.