
[terraform] Migration script for self-hosted keyserver
ClosedPublic

Authored by will on Jul 11 2024, 11:50 AM.
Tags
None
Subscribers

Details

Summary

This runs a migration on the primary node by doing the following:

  1. Disables traffic to load balancer
  2. Grabs the domain and desired secondary node count from terraform.tfvars
  3. Stops all primary and secondary nodes
  4. Waits for health check to fail
  5. Redeploys the primary node, causing the migration to run
  6. Redeploys the secondary nodes
  7. Waits for health check to succeed
  8. Re-enables traffic to load balancer

When I ran the script against my personal setup, the migration took 15 minutes.
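The sequence above could be sketched roughly as follows. This is only an illustration of the flow, not the actual script: the cluster name, service names, and secondary count are placeholders.

```shell
#!/usr/bin/env bash
# Rough sketch of the migration flow; all resource names are hypothetical.
set -uo pipefail

cluster="keyserver-cluster"            # placeholder
primary_service="primary-service"      # placeholder
secondary_service="secondary-service"  # placeholder

scale_service() {
  # Set the desired task count for an ECS service
  aws ecs update-service --cluster "$cluster" \
    --service "$1" --desired-count "$2" > /dev/null
}

run_migration() {
  # 1-2. Disable public traffic to the load balancer, read tfvars (elided)
  # 3. Stop all primary and secondary nodes
  scale_service "$primary_service" 0
  scale_service "$secondary_service" 0
  # 4-5. Wait for the health check to fail, then redeploy the primary
  #      node, which runs the migration on startup
  aws ecs update-service --cluster "$cluster" --service "$primary_service" \
    --desired-count 1 --force-new-deployment > /dev/null
  # 6. Redeploy the secondary nodes
  scale_service "$secondary_service" 2
  # 7-8. Wait for the health check to pass, then re-enable traffic (elided)
}
```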

Test Plan

The script runs successfully. I monitored the AWS console while the script ran and confirmed that tasks were being stopped, the primary node was reinitialized, and finally the secondary nodes were set to reinitialize.

I set my keyserver nodes to use wyilio/keyserver:latest. I then updated latest to include a test migration that created a new table and populated it with a single random-number entry.
I ran my migration script and verified, by connecting to MariaDB remotely with mysql, that the table was created and contained only a single entry.

Diff Detail

Repository
rCOMM Comm
Branch
keyserver_to_aws
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

will requested review of this revision.Jul 11 2024, 12:07 PM
will planned changes to this revision.Jul 11 2024, 1:44 PM

include logic to disable traffic until all nodes available

services/terraform/self-host/aws-run-migration.sh
82–96 ↗(On Diff #42358)
  • Is basing this on an HTTP health check more reliable than counting running tasks with aws ecs describe-services, the same way you did below?
  • Does it even work if you previously revoked security group ingress traffic?
98–103 ↗(On Diff #42358)

Setting primary node desired count to 0, then back to 1 is not necessary when doing --force-new-deployment afterwards. I think you can get rid of these steps.

The "force new deployment" replaces the old task with the new one for you.
However, it does a no-downtime update (the old task is killed only after the new task is ready). If this is not desired, you can try merging --desired-count 1 and --force-new-deployment into one command.
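As a sketch, that merged command might look like the following; the function and variable names here are placeholders, not the actual script:

```shell
# Hypothetical merged form: set desired count and force a redeploy
# in a single aws ecs update-service call.
redeploy_primary() {
  aws ecs update-service \
    --cluster "$1" \
    --service "$2" \
    --desired-count 1 \
    --force-new-deployment
}
```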

services/terraform/self-host/aws-run-migration.sh
82–96 ↗(On Diff #42358)
  • I originally used aws ecs describe-services but found that it was still possible for the health check to succeed even when aws ecs describe-services reported no running tasks. Using the HTTP health check ensures that the check below for primary service health doesn't immediately launch the secondary nodes while the primary service is still being relaunched.
  • If allowed_ip is properly configured then yes. I ensure that only traffic from the allowed_ip is accepted with the below ingress rule:
ingress {
  from_port   = 443
  to_port     = 443
  protocol    = "tcp"
  cidr_blocks = ["${var.allowed_ip}/32"]
}
services/terraform/self-host/aws-run-migration.sh
98–103 ↗(On Diff #42358)

From what I understand, by setting the maximum deployment percent to 100 and the minimum to 0, I've disabled no-downtime updates. With my current config, I monitored task deployment in the console while the script ran and confirmed there was only ever 0 or 1 task in progress at any moment.
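For context, the deployment percentages being discussed can also be set via the AWS CLI rather than Terraform. A sketch, with hypothetical cluster/service names:

```shell
# Cap in-flight tasks at 100% of desired count and tolerate 0% healthy,
# which makes ECS stop the old task before starting the new one
# (i.e. disables the no-downtime rollover).
set_stop_first_deploy() {
  aws ecs update-service \
    --cluster "$1" \
    --service "$2" \
    --deployment-configuration "maximumPercent=100,minimumHealthyPercent=0"
}
```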

I do believe your single command is cleaner. Including it in the next rebase.

review feedback, additional checks and comments

Can you address the CI issues?

services/terraform/self-host/aws-run-migration.sh
1 ↗(On Diff #42358)

I'm a bit confused about the term "migration" here. Is it referring to MariaDB migrations, and implying the script only needs to be used when the deployer knows that there is a pending MariaDB migration?

If that's the case, I have a couple of questions:

  1. What's the workflow for deploying an update when there isn't a pending MariaDB migration?
  2. How do we mitigate the risk of somebody deploying an update with a pending MariaDB migration without using this script? It seems like this could cause some DB corruption issues, so I'm a bit concerned

On the other hand, if this script is meant to be run for all AWS keyserver deployments, then perhaps it should be renamed to use "deploy" instead of "migrate". Maybe the script could be named aws-deploy.sh?

9 ↗(On Diff #42358)

I might call this num_desired_secondary_nodes... it's unclear from context that it's a number, and could be interpreted as a boolean (whether we desire secondary nodes)

16–18 ↗(On Diff #42358)

Are these copy-pasted from Terraform somewhere? Prefer to avoid copy-paste when possible, but if it's difficult to factor out given Terraform vs. bash, we should at least add a reference to both sides making sure that a dev doesn't update one without updating the other

21–23 ↗(On Diff #42358)

I'm not 100% sure, but my impression was that it's safer to add quotes when making an assignment in bash

That said if Shellcheck didn't complain maybe it's safe?

38 ↗(On Diff #42358)

You use a mix of bash conditionals: (( )), [ ], and [[ ]]. Do we really need all three? It seems like we should at least drop [ ] (my impression was that it's replaced by [[ ]]). Would be good to understand why you chose each of them
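For reference, the three forms do behave differently, which may explain the mix: (( )) is for arithmetic, [[ ]] is the bash-native conditional with safer string handling, and [ ] is the POSIX test builtin kept mainly for portability. A quick illustration:

```shell
count=3
name="primary node"

# (( )) — arithmetic evaluation; C-like operators, no $ needed on variables
if (( count > 2 )); then echo "arith: count > 2"; fi

# [[ ]] — bash keyword; unquoted variables are safe (no word splitting)
if [[ $name == "primary node" ]]; then echo "bash-test: matched"; fi

# [ ] — POSIX test builtin; variables must be quoted to avoid splitting
if [ "$name" = "primary node" ]; then echo "posix-test: matched"; fi
```

In a bash-only script, (( )) plus [[ ]] covers everything, so dropping [ ] seems reasonable.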

52–56 ↗(On Diff #42358)

Nice work figuring this out!

Question for you – if somebody runs a terraform apply when this script is running, will that mess things up? And vice versa – what happens if terraform apply is running, and then somebody runs this script?

72 ↗(On Diff #42358)

Nit: seems like extraneous newline

82–96 ↗(On Diff #42358)

Much of this feels copy-pasted from check_health. Is there a way to factor it out?

98–103 ↗(On Diff #42358)

Setting primary node desired count to 0, then back to 1 is not necessary when doing --force-new-deployment afterwards. I think you can get rid of these steps.

A bit confused reading this – is @bartek's implication that we can skip line 71?

119 ↗(On Diff #42358)

Do we really need to wait for all of the secondary nodes to come online before re-enabling public access to the load balancer?

119–124 ↗(On Diff #42358)

Seems like we could just break here, and move these lines to the bottom of the script. I think that would improve readability

services/terraform/self-host/aws_lb.tf
13–18 ↗(On Diff #42358)

What's this for? Are you basically making sure that the calling IP can get through the load balancer, even when we've otherwise disabled ingress? If that's the case, wondering if you considered hitting the primary node directly for the health check instead of going through the load balancer

ashoat requested changes to this revision.Jul 17 2024, 12:01 PM

(Passing back to @will)

This revision now requires changes to proceed.Jul 17 2024, 12:01 PM
services/terraform/self-host/aws-run-migration.sh
16–18 ↗(On Diff #42358)

Talked to Ashoat about this. Instead of copy and pasting this, we might be able to use the aws-cli to determine what the names of these services are.
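One possible shape for that lookup; the cluster argument and the ARN-stripping approach here are assumptions, not the final implementation:

```shell
# List the ARNs of all services in the cluster, then strip each down to
# the bare service name (everything after the last '/').
get_service_names() {
  aws ecs list-services --cluster "$1" \
    --query 'serviceArns[]' --output text \
    | tr '\t' '\n' \
    | sed 's|.*/||'
}
```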

21–23 ↗(On Diff #42358)

Going to go back through the script and add quotations where necessary on next rebase

38 ↗(On Diff #42358)

Will pick one kind in the next rebase

52–56 ↗(On Diff #42358)

To mitigate the risk of users incorrectly running terraform apply and the script at the same time, we've decided to remove the option of running terraform apply directly and have users only use the script.

82–96 ↗(On Diff #42358)

We're planning to resolve the code duplication by putting all of this in its own function and passing in some kind of comparator function that returns a boolean depending on whether the HTTP status code is valid.
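A sketch of that refactor might look like the following. The function names, retry counts, and health_url are hypothetical, not the actual implementation:

```shell
health_url="https://example.com/health"   # placeholder

# Poll the health endpoint until the given predicate accepts the
# HTTP status code, or give up after 30 attempts.
wait_for_status() {
  local predicate=$1
  local attempts=0
  while (( attempts < 30 )); do
    local status
    status=$(curl -s -o /dev/null -w '%{http_code}' "$health_url" || echo 000)
    if "$predicate" "$status"; then
      return 0
    fi
    sleep 5
    attempts=$((attempts + 1))
  done
  return 1
}

is_healthy() { [[ $1 == 200 ]]; }
is_unhealthy() { [[ $1 != 200 ]]; }

# Usage: wait_for_status is_unhealthy   # wait for nodes to go down
#        wait_for_status is_healthy     # wait for primary to come back
```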

98–103 ↗(On Diff #42358)

The problem I encountered when not setting the primary node desired count to 0 was that ECS automatically restarted the primary node after I shut down the task. My understanding is that we need to keep this in order to run the redeployment explicitly ourselves with

aws ecs update-service --cluster "$cluster_name" --service "$primary_service_name" --force-new-deployment --desired-count 1 > /dev/null
119 ↗(On Diff #42358)

After discussion with @ashoat, we established that we can actually re-enable internet traffic right after we've ensured that the primary node health check passes, triggering the start of the secondary nodes and reducing downtime.

To explain the original concern, we depend on a health check to the primary service to deploy the secondary nodes (this ensures that the migration has run).

We were worried that by having the primary node accessible on startup, the moment the primary node starts up, the health check is lost in the flood of traffic and the secondary nodes never deploy.

What I did in this current iteration is have the flood of traffic only hit the load balancer once all the secondary and primary nodes have started up. This is unnecessary. Users should be able to access the primary node; it just needs to happen after the health check is successful.

119–124 ↗(On Diff #42358)

including in next rebase

services/terraform/self-host/aws_lb.tf
13–18 ↗(On Diff #42358)

Yes, you're correct. This ensures the user is always able to access the load balancer, even after we've disabled public ingress to it.

Taking down the load balancer and having users hit the primary node directly is definitely a possible solution. The reason I went with the ingress approach was ease of implementation. It might be overkill, but I'll file a follow-up task where we consider going with a direct hit to the primary node. Otherwise, I think the ingress approach is okay for now.

services/terraform/self-host/aws-run-migration.sh
1 ↗(On Diff #42358)

Addressing this in https://linear.app/comm/issue/ENG-8835/resolve-risks-in-user-deploying-an-update-with-a-pending-mariadb.

For anyone reading, high level overview is:

  1. Having the user always go through the script instead of ever running terraform apply themselves
  2. Running terraform apply and image version control is handled by the script instead of the user
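A minimal sketch of that workflow, assuming a hypothetical wrapper function and variable name for the image tag:

```shell
# The script, not the user, runs terraform apply and pins the image
# version, so deploys and migrations always go through one path.
deploy_keyserver() {
  local image_tag=$1
  terraform apply -auto-approve \
    -var "keyserver_image_tag=${image_tag}"
}
```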

review feedback with minimum requirements to land

ashoat added inline comments.
services/terraform/self-host/aws-deploy.sh
111–123 ↗(On Diff #42433)

Seems like still some code duplication here with check_health, but it's not super important

131 ↗(On Diff #42433)

Is this true? I think it's only the primary node that's successfully running at this point

services/terraform/self-host/aws-run-migration.sh
16–18 ↗(On Diff #42358)

Looks like you didn't have time to address this with the aws-cli commands. We should still do the minimum described above:

we should at least add a reference to both sides making sure that a dev doesn't update one without updating the other

This revision is now accepted and ready to land.Jul 17 2024, 7:34 PM
services/terraform/self-host/aws-deploy.sh
111–123 ↗(On Diff #42433)
131 ↗(On Diff #42433)

That is correct. Fixing this in last rebase