Details

Reviewers

tomek
• max
varun
ashoat
• jim

Commits

rCOMMdd7ba24b6b1f: [services] Backup - Add DB operations for Backup

Summary

Database operations on BackupItems. We probably will not use remove method on production but it will be needed for tests.

There were some custom steps so I didn't fully use internal methods but after I implement operations for BackupItems and LogItems, I'm going to look at how we can take as much advantage of them as possible.

Depends on D3159

Test Plan

Backup service builds

Diff Detail

Repository

rCOMM Comm

Lint

No Lint Coverage

Unit

No Test Coverage

Event Timeline

• karol created this revision.Feb 10 2022, 2:28 PM

Herald added subscribers: atul, • adrian, tomek, ashoat. · View Herald TranscriptFeb 10 2022, 2:28 PM

• karol edited the summary of this revision. (Show Details)Feb 10 2022, 2:38 PM

• karol edited the test plan for this revision. (Show Details)

• karol added reviewers: tomek, • max, varun.

• karol added a parent revision: D3159: [services] Backup - Add Tools.

• karol added a child revision: D3161: [services] Backup - Add DB test for Backup.

Harbormaster completed remote builds in B6710: Diff 9520.Feb 10 2022, 2:40 PM

• karol requested review of this revision.Feb 10 2022, 2:40 PM

tomek requested changes to this revision.Feb 11 2022, 10:15 AM

tomek added inline comments.

services/backup/docker-server/contents/server/src/DatabaseManager.cpp
100–105 ↗	(On Diff #9520)	We're deleting here by primary-sort key pair, so there should be always only one match, right? If that's not the case, should we delete by `FIELD_BACKUP_ID`?

This revision now requires changes to proceed.Feb 11 2022, 10:15 AM

• karol requested review of this revision.Feb 11 2022, 10:56 AM

• karol added inline comments.

services/backup/docker-server/contents/server/src/DatabaseManager.cpp
100–105 ↗	(On Diff #9520)	Yes, the purpose of this method is to remove only one specific item(please note the singular name) We should not perform lookups by the backup id field. Rather than that, we should use the `created` field and identify backups by their creation date and user id(`PK`).

tomek accepted this revision.Feb 11 2022, 11:13 AM

This revision is now accepted and ready to land.Feb 11 2022, 11:13 AM

Herald added a reviewer: ashoat. · View Herald TranscriptFeb 11 2022, 11:13 AM

Herald added a subscriber: • benschac. · View Herald Transcript

This revision now requires review to proceed.Feb 11 2022, 11:13 AM

ashoat requested changes to this revision.Feb 14 2022, 4:29 AM

ashoat added inline comments.

services/backup/docker-server/contents/server/src/DatabaseManager.cpp
90–94 ↗	(On Diff #9520)	This seems wasteful... it looks like we're getting every single backup and then just using the last one. Can we make the query only match a single result?
100–105 ↗	(On Diff #9520)	We should not perform lookups by the backup id field. Is this for performance reasons? Our API for secondary auth uses the backup ID, and doesn't provide the creation date. This is a huge problem if what you're saying is true. If you are sure that we cannot perform backups based on the backup ID field, please bring this up in the architecture review.

This revision now requires changes to proceed.Feb 14 2022, 4:29 AM

ashoat mentioned this in D3177: [services] Backup - Fix primary key logic.Feb 14 2022, 4:45 AM

ashoat mentioned this in D3175: [services] Backup - Add DB operations for Logs.

• karol added inline comments.Feb 14 2022, 10:33 AM

services/backup/docker-server/contents/server/src/DatabaseManager.cpp
90–94 ↗	(On Diff #9520)	We can do it, sorry I didn't do it earlier.
100–105 ↗	(On Diff #9520)	I think we can handle this. Let's break this down. For now, we have something like this: * backup * userID[PK] string * created[SK] timestamp * backupID string * recoveryData bytes * compactionHolder string * attachmentHolders list<string> so the primary key is `(userID, created)`. For efficient queries: we definitely need the `userID` as a partition key - we're always going to query the database by the user id because we want to operate on the specific user's backups. we need `created` as well since we want to be able to fetch the newest backups If we need also to query the database by the `backupID` effectively, I do not see why we don't create an index for it. I thought about this for a second and it seems like for the simple auth we only need to get that `backupID`. The way indexes work in dynamo DB is, they copy certain data to "a new place" and we can manually specify what data are going to be copied. So this index could be really small and consist of just the primary key(which is always projected) and the `backupID`. If you are sure that we cannot perform backups based on the backup ID field, please bring this up in the architecture review. I think we can handle this in the DB settings here without affecting the architecture of the whole service. Regarding the original issue raised by Tomek here: I think every time we want to look up a backup in the database, we should identify it by the `userID` and `created`. `backupID` is more like a secret(https://phabricator.ashoat.com/D2950#83151). I realize this is misleading as usually a name like `xxxID` indicates an identifier by which we can query the database. I think it might be good then to rename this field. Conclusion: we want to create an index for the `backupID` that would project `backupID` and the primary key(`userID`,`created`) we may rename `backupID` so it's less misleading, maybe something like `backupSecret` What do you think @ashoat @palys-swm ?

add limit 1 to the query for latest backup

Harbormaster completed remote builds in B6773: Diff 9595.Feb 14 2022, 10:43 AM

• karol added inline comments.Feb 14 2022, 11:34 AM

services/backup/docker-server/contents/server/src/DatabaseManager.cpp
100–105 ↗	(On Diff #9520)	We probably also don't want to replace the sort key with the `backupID`, because the queries that we need are going to need all the data of a backup when querying by the `created` value and we only need(I think) the minimum of the data when querying by the `backupID`. So making the `backupID` a sort key and indexing `created` would cause more data duplication than if we go the way I proposed above, which is: `created` stays as a sort key and we add an index for the `backupID`. Just clarifying.

@karol-bisztyga, thanks for the thorough comment. It made it really easy for me to review.

Regarding backupID vs. backupSecret, I worry that backupSecret would get confused with the backup key itself, since that can also be described as the backup secret. I'm open to a different name, but I personally don't mind the fact that it is named backupID but is a secret. I don't think being an ID necessarily implies being public knowledge.

Sort key vs. secondary index

I think your idea of an index makes sense. The only question I have is whether we want to replace the sort key with the backupID. I thought through your argument about efficiency and I'm not sure it applies... here's my analysis:

CreateNewBackup: this is just an insert, and does not require a query
SendLog: this just needs to check the backupID to see if it is valid. It can either query an "index" as you described, or the primary table – either way should work
RecoverBackupKey: this will need to query for the most recent backupID and then see some backup data. So it is more efficient if the primary table has created as the sort key, since that way it doesn't have to do a secondary query.
PullBackup: but this one is the opposite of RecoveryBackupKey. It also needs to see some backup data, and it would be more efficient if it could query the primary table directly with backupID as the sort key, instead of first having to do a secondary query.

It seems like this can all be optimized by making sure that the secondary index has all the info necessary to respond to RecoveryBackupKey or PullBackup (whichever one we choose to use the secondary index for). But I can't see an advantage to using created vs.backupID as the primary index.

Either way, I think we can proceed with the design as suggested. But I'm curious for @karol-bisztyga and @palys-swm's take here: is there a clear advantage to using created as the sort key instead of backupID?

This revision is now accepted and ready to land.Feb 15 2022, 4:07 AM

ashoat added a reviewer: • jim.Feb 15 2022, 4:50 AM

Ok, let's stay with the backupID name, I'm ok either way.

Regarding which field should be a sort key and which should be an index, we should look at the following methods then:

RecoverBackupKey

query by	created
needs	recovery data, backup id

PullBackup

query by	backup id
needs	compactionHolder, attachmentHolders

If we make created an index, then backupID would be a sort key, so for this index, we would need only to project recoveryData.
If we make backupID an index, then created would be a sort key, but it wouldn't be needed for the PullBackup method. So we'd have to project compactionHolder and attachmentHolders anyway.

I think you're right that it's better to have created as a secondary index and backupID as a sort key then. I'm going to change that in D2950.

now backupID is the sort key

• karol requested review of this revision.Feb 15 2022, 12:34 PM

Harbormaster failed remote builds in B6817: Diff 9656!Feb 15 2022, 12:42 PM

• karol mentioned this in D3087: [services] Backup - Database - add Backup Item.Feb 16 2022, 8:24 AM

tomek accepted this revision.Feb 16 2022, 10:43 AM

This revision is now accepted and ready to land.Feb 16 2022, 10:43 AM

• karol mentioned this in D3161: [services] Backup - Add DB test for Backup.Feb 16 2022, 2:26 PM

Requesting changes just to make sure we don't land the code with build issues

This revision now requires changes to proceed.Feb 16 2022, 2:30 PM

hopefully fix CI

Harbormaster completed remote builds in B6880: Diff 9728.Feb 16 2022, 2:40 PM

ashoat mentioned this in D3076: [services] Backup - redesign proto.Feb 16 2022, 3:08 PM

tomek accepted this revision.Feb 16 2022, 3:12 PM

This revision is now accepted and ready to land.Feb 16 2022, 3:12 PM

ashoat accepted this revision.Feb 22 2022, 7:37 AM

ashoat added inline comments.

services/backup/docker-server/contents/server/src/DatabaseManager.cpp
46	This automatically inserts the new item into the secondary index, right?
101	This automatically removes the item from the secondary index, right?

Closed by commit rCOMMdd7ba24b6b1f: [services] Backup - Add DB operations for Backup. · Explain WhyMar 4 2022, 12:02 PM

This revision was automatically updated to reflect the committed changes.

• karol added a commit: rCOMMdd7ba24b6b1f: [services] Backup - Add DB operations for Backup.

Just like in D3087, this diff was landed before addressing my comments.

I want to reiterate: this is deeply concerning. Behavior like this forces me to have to check every single diff when it's landed because I'm paranoid people are ignoring my comments. Behavior like this makes me wonder if I should ever accept a diff with comments.

@karol-bisztyga, you must ALWAYS check a diff before landing to make sure you have addressed all comments. Better yet would be if you addressed comments and answered questions when they are asked instead of waiting until the end and forgetting.

I'll repeat my question from D3087: in the future, would you prefer that I request changes any time I have a small question?

Hey, sorry about this. I think we can keep the current pattern of accepting with comments, I address everything most of the time, not sure why I missed these.

services/backup/docker-server/contents/server/src/DatabaseManager.cpp
46	Yes Whenever the data in the table is modified, the index is automatically changed to reflect the changes in the table.
101	Yes Whenever the data in the table is modified, the index is automatically changed to reflect the changes in the table.

Thanks for clarifying!!

[services] Backup - Add DB operations for Backup
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Sort key vs. secondary index

Revision Contents
Changeset List

Diff 9728

services/backup/docker-server/contents/server/src/DatabaseManager.h

services/backup/docker-server/contents/server/src/DatabaseManager.cpp

[services] Backup - Add DB operations for BackupClosedPublicActions

Details

Diff Detail

Event Timeline

Sort key vs. secondary index

Revision ContentsChangeset List

Diff 9728

services/backup/docker-server/contents/server/src/DatabaseManager.h

services/backup/docker-server/contents/server/src/DatabaseManager.cpp

[services] Backup - Add DB operations for Backup
ClosedPublic
Actions

Revision Contents
Changeset List