
[services] Dev Mode - create backups for AWS S3 and dynamoDB
AbandonedPublic

Authored by karol on Mar 23 2022, 7:56 AM.
Tags
None
Referenced Files
F1772729: D3495.diff
Wed, May 15, 8:29 PM
F1772693: D3495.diff
Wed, May 15, 8:09 PM

Details

Summary

I think it would be good to keep track of dumps of the DynamoDB table and S3 bucket structure in our repository.
Every time someone changes a table, they should run the backup script, so the values get updated in our repository.
We can then easily recreate the S3 bucket structure and all the tables (without data) from the dynamo DB.
This can be useful for testing as well as for the tasks like setting up a local cloud (this is done in the upcoming diffs).
In general, I think keeping a backup of the structure of these services like this is beneficial. What do you think?

Test Plan
cd services
yarn backup-aws

Diff Detail

Repository
rCOMM Comm
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

karol edited the test plan for this revision.
karol added reviewers: jim, tomek, max, varun.

TODO: It would be good to mark all the generated files as @generated so that Phabricator "ignores" them. But first, I'd like to know your opinion of the idea itself.

jim requested changes to this revision. Apr 4 2022, 12:46 PM

First, there should probably be separate diffs for S3 and DynamoDB as these are very different use cases in my opinion.

Starting with dynamodb, let me make sure I understand -- database schema is implicitly defined through requests to insert items and in the source code through the database entity classes, right? How are indexes defined? In the description, you say "Every time someone changes a table, they should run the backup script". What constitutes changing a table? How will I know if I change a table?

Similarly, where is S3 bucket structure defined?

I don't really understand the justification either. Is this just to help developers see the structure? Shouldn't they be looking at the Database entity classes for this? Or is this actually going to be used in scripts as implied by "This can be useful for testing as well as for the tasks like setting up a local cloud (this is done in the upcoming diffs)."

Overall, I don't really like this -- there should be a canonical representation of the database and filesystem (S3) schema in the source code, which will make updating the schema more visible and less error-prone.

This revision now requires changes to proceed. Apr 4 2022, 12:46 PM

First, there should probably be separate diffs for S3 and DynamoDB as these are very different use cases in my opinion.

Yes, I know; the title even contains "and", which indicates it should be split. I didn't split it carefully because I wanted to get your opinion on the general idea before spending too much time on this, as I knew it might be rejected in the first place.

Starting with dynamodb, let me make sure I understand -- database schema is implicitly defined through requests to insert items and in the source code through the database entity classes, right?

Right. I mean, these things are more like helpers for developers. There is no real schema: because it's NoSQL, you could start assigning values to totally new fields at any time. It's more of a structure we agreed on, and I think we should use tools like entities to maintain that structure and avoid errors.

How are indexes defined?

You can see that we use some indexes in our code; check DatabaseManager.cpp in the different services. Other than that, indexes are not persisted anywhere except in the cloud, and that's the problem I'm trying to solve.

In the description, you say "Every time someone changes a table, they should run the backup script". What constitutes changing a table? How will I know if I change a table?

What I meant was changing a table in the cloud: you'd log in to AWS, go to the DynamoDB console, and change anything about the table (table name, partition key name, sort key name, etc.). The point is to keep track of this in our code so that if we somehow lose the DB one day, we can at least recover the structure and apply it to another instance of the cloud (like the local cloud).
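
To illustrate what "recovering the structure" could look like, here is a minimal sketch, not the script from this diff. It assumes the dump has the shape of the `Table` object returned by DynamoDB's DescribeTable API and reduces it to the structural fields that CreateTable accepts. The function names, the table name `example-table`, and the attribute names are hypothetical.

```python
import json
from typing import Any, Dict


def _throughput(desc: Dict[str, Any]) -> Dict[str, int]:
    # Keep only the capacity settings; drop counters such as NumberOfDecreasesToday.
    return {
        "ReadCapacityUnits": desc["ReadCapacityUnits"],
        "WriteCapacityUnits": desc["WriteCapacityUnits"],
    }


def table_structure(description: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a DescribeTable-style dict to name, keys, indexes and billing."""
    # BillingModeSummary can be absent on provisioned tables, so default to that.
    billing = description.get("BillingModeSummary", {}).get(
        "BillingMode", "PROVISIONED"
    )
    spec: Dict[str, Any] = {
        "TableName": description["TableName"],
        "AttributeDefinitions": description["AttributeDefinitions"],
        "KeySchema": description["KeySchema"],
        "BillingMode": billing,
    }
    if billing == "PROVISIONED":
        spec["ProvisionedThroughput"] = _throughput(
            description["ProvisionedThroughput"]
        )
    indexes = []
    for index in description.get("GlobalSecondaryIndexes", []):
        entry = {
            "IndexName": index["IndexName"],
            "KeySchema": index["KeySchema"],
            "Projection": index["Projection"],
        }
        if billing == "PROVISIONED":
            entry["ProvisionedThroughput"] = _throughput(
                index["ProvisionedThroughput"]
            )
        indexes.append(entry)
    if indexes:
        spec["GlobalSecondaryIndexes"] = indexes
    return spec


# Hypothetical DescribeTable-style input, trimmed for illustration.
example_description = {
    "TableName": "example-table",
    "TableStatus": "ACTIVE",
    "ItemCount": 42,
    "AttributeDefinitions": [
        {"AttributeName": "exampleID", "AttributeType": "S"},
        {"AttributeName": "created", "AttributeType": "N"},
    ],
    "KeySchema": [
        {"AttributeName": "exampleID", "KeyType": "HASH"},
        {"AttributeName": "created", "KeyType": "RANGE"},
    ],
    "ProvisionedThroughput": {
        "NumberOfDecreasesToday": 0,
        "ReadCapacityUnits": 1,
        "WriteCapacityUnits": 1,
    },
}

# Sorted keys keep the committed file stable between runs.
print(json.dumps(table_structure(example_description), indent=2, sort_keys=True))
```

Runtime details such as `TableStatus` and `ItemCount` are dropped on purpose, so the committed file only changes when the structure itself changes.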

Similarly, where is S3 bucket structure defined?

We use S3 bucket names in our code (for now it's just commapp-blob, as we decided we'll only access S3 from the blob service, but there are other buckets for different purposes). Beyond that, it's not defined anywhere, and that's what I'm trying to change. As with DynamoDB, every time someone modifies the S3 structure, they should run the backup script to update the S3 structure in our code.
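
As a rough illustration of the S3 side, the sketch below assumes that "structure" means just the set of bucket names and that the input has the shape of an S3 ListBuckets response. This is not the script from this diff; the function names and the second bucket name are hypothetical, while commapp-blob is the bucket mentioned above.

```python
import json
from typing import Any, Dict


def bucket_structure(list_buckets_response: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only bucket names, sorted so the output is stable between runs."""
    names = sorted(
        bucket["Name"] for bucket in list_buckets_response.get("Buckets", [])
    )
    return {"buckets": names}


def render(structure: Dict[str, Any]) -> str:
    """Serialize deterministically so that diffs show only real changes."""
    return json.dumps(structure, indent=2, sort_keys=True) + "\n"


# Hypothetical ListBuckets-style input; creation dates are dropped on purpose.
example_response = {
    "Buckets": [
        {"Name": "example-other-bucket", "CreationDate": "2022-01-01T00:00:00Z"},
        {"Name": "commapp-blob", "CreationDate": "2021-12-01T00:00:00Z"},
    ]
}

print(render(bucket_structure(example_response)))
```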

I don't really understand the justification either. Is this just to help developers see the structure? Shouldn't they be looking at the Database entity classes for this? Or is this actually going to be used in scripts as implied by "This can be useful for testing as well as for the tasks like setting up a local cloud (this is done in the upcoming diffs)."

Please read the description:

We can then easily recreate the S3 bucket structure and all the tables (without data) from the dynamo DB.
This can be useful for testing as well as for the tasks like setting up a local cloud (this is done in the upcoming diffs).

So, as I said above, every time we need to recreate the structure for some reason, this will be useful.

Overall, I don't really like this -- there should be a canonical representation of the database and filesystem (S3) schema in the source code, which will make updating the schema more visible and less error-prone.

Sorry, I don't understand. What do you mean by "canonical"? What alternatives do you see for this? Please remember that the goal here is to keep track of what's in the cloud.

@karol-bisztyga, I think @jimpo is suggesting having the production database schema defined in code, and the schema setup being automatically handled based on that code using an infrastructure-as-code platform such as Terraform. That is in contrast with your approach here, where you are instead having the schema setup handled by the individual developer, and then backed up as code. @jimpo correct me if I'm wrong?
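
To make the contrast concrete, here is a minimal sketch of the "schema defined in code" direction. It is plain Python rather than Terraform and only illustrates the idea: the declaration is the source of truth, and the live table is checked against it instead of being backed up. `TableSchema`, `drift`, and all table and attribute names are hypothetical; the live description is assumed to have the DescribeTable shape.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class TableSchema:
    """Declared (canonical) structure of one table, kept in source control."""

    name: str
    partition_key: str
    sort_key: Optional[str] = None

    def key_schema(self) -> List[Dict[str, str]]:
        keys = [{"AttributeName": self.partition_key, "KeyType": "HASH"}]
        if self.sort_key is not None:
            keys.append({"AttributeName": self.sort_key, "KeyType": "RANGE"})
        return keys


def drift(declared: TableSchema, live_description: Dict[str, Any]) -> List[str]:
    """List the differences between the declaration and a live table description."""
    problems: List[str] = []
    live_name = live_description.get("TableName")
    if live_name != declared.name:
        problems.append(f"name: declared {declared.name!r}, live {live_name!r}")
    live_keys = {
        key["KeyType"]: key["AttributeName"]
        for key in live_description.get("KeySchema", [])
    }
    declared_keys = {
        key["KeyType"]: key["AttributeName"] for key in declared.key_schema()
    }
    for key_type in ("HASH", "RANGE"):
        if live_keys.get(key_type) != declared_keys.get(key_type):
            problems.append(
                f"{key_type} key: declared {declared_keys.get(key_type)!r}, "
                f"live {live_keys.get(key_type)!r}"
            )
    return problems


# Hypothetical declaration and a live description that was changed by hand.
declared = TableSchema(name="example-table", partition_key="exampleID", sort_key="created")
live = {
    "TableName": "example-table",
    "KeySchema": [
        {"AttributeName": "exampleID", "KeyType": "HASH"},
        {"AttributeName": "updated", "KeyType": "RANGE"},
    ],
}

for problem in drift(declared, live):
    print(problem)
```

With this direction, a schema change is a reviewed code change, and a manual edit in the console shows up as drift rather than silently becoming the new backup.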

Yes. I didn't realize that in this diff, but I assumed we could use Terraform in https://phabricator.ashoat.com/D3496#100509. If we both agree on that, I think we can do this.
https://linear.app/comm/issue/ENG-989/use-terraform-to-set-up-the-cloud

I changed my approach and now use Terraform for this, so I'm abandoning this diff. I decided it would be faster to set up a new stack, as I reordered the diffs and a lot of them differ from the ones in this stack. Let's follow up in the stack beginning at D3695.