Page MenuHomePhabricator

[terraform] Migration script for for self-hosted keyserver
ClosedPublic

Authored by will on Jul 11 2024, 11:50 AM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Nov 20, 9:59 PM
Unknown Object (File)
Wed, Nov 20, 9:58 PM
Unknown Object (File)
Fri, Nov 15, 5:40 PM
Unknown Object (File)
Fri, Nov 8, 11:05 PM
Unknown Object (File)
Fri, Nov 8, 12:25 PM
Unknown Object (File)
Fri, Nov 8, 12:25 PM
Unknown Object (File)
Fri, Nov 8, 12:25 PM
Unknown Object (File)
Fri, Nov 8, 12:25 PM
Subscribers

Details

Summary

This runs a migration on the primary node by doing the following:

  1. Disables traffic to load balancer
  2. Grab the domain and desired secondary node count from terraform.tfvars
  3. Stops all primary and secondary nodes
  4. Waits for health check to fail
  5. Redeploys the primary node, causing the migration to run
  6. Redeploys the secondary nodes
  7. Waits for health check to succeed
  8. Re-enables traffic to load balancer

On running the script on my personal setup, the migration script took 15 minutes

Test Plan

Script runs successfully. Monitored aws console through the running of the script and confirmed that tasks were being stopped. The primary node was reinitialized and finally the secondary nodes were set to reinitialize.

I set my keyserver nodes to use wyilio/keyserver:latest. I then updated latest to include a test migration which created a new table and populated it with a single random number entry.
Ran my migration script and verified by connecting mysql remotely to MariaDB that the table was being created and there was only a single entry

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

will requested review of this revision.Jul 11 2024, 12:07 PM
will planned changes to this revision.Jul 11 2024, 1:44 PM

include logic to disable traffic until all nodes available

services/terraform/self-host/aws-run-migration.sh
82–96 ↗(On Diff #42358)
  • Is basing on HTTP health check more reliable than counting running tasks with aws ecs describe-services the same way you did below?
  • Does it even work if you previously revoked security group ingress traffic?
98–103 ↗(On Diff #42358)

Setting primary node desired count to 0, then back to 1 is not necessary when doing --force-new-deployment afterwards. I think you can get rid of these steps.

The "force new deployment" replaces old task with the new one for you.
However it does a no-downtime update (the old is killed after the new task is ready). If this is not desired, you can try merging --desired-count 1 and --force-new-deployment into one command

services/terraform/self-host/aws-run-migration.sh
82–96 ↗(On Diff #42358)
  • I originally used aws ecs describe-services but found that it was still possibly for the health check to be successful even when aws ecs describe-services resulted supposedly in no running tasks. Using the HTTP health check ensures that the below check for primary service health doesn't immediately launch the secondary nodes while the primary service is being relaunched
  • If allowed_ip is properly configured then yes. I ensure that only traffic from the allowed_ip is accepted with the below ingress rule:
ingress {
  from_port   = 443
  to_port     = 443
  protocol    = "tcp"
  cidr_blocks = ["${var.allowed_ip}/32"]
}
services/terraform/self-host/aws-run-migration.sh
98–103 ↗(On Diff #42358)

From what I understand, by setting the maximum deployment percent to 100 and minimum to 0, I've disabled no-downtime updates. With my current config, I monitored task deployment on console during the running of the script and confirmed there was only 0 or 1 tasks in progress at any moment

I do believe your single command is cleaner. Including in next rebase

review feedback, additional checks and comments

Can you address the CI issues?

services/terraform/self-host/aws-run-migration.sh
1 ↗(On Diff #42358)

I'm a bit confused about the term "migration" here. Is it referring to MariaDB migrations, and implying the script only needs to be used when the deployer knows that there is a pending MariaDB migration?

If that's the case, I have a couple of questions:

  1. What's the workflow for deploying an update when there isn't a pending MariaDB migration?
  2. How do we mitigate the risk of somebody deploying an update with a pending MariaDB migration without using this script? It seems like this could cause some DB corruption issues, so I'm a bit concerned

On the other hand, if this script is meant to be run for all AWS keyserver deployments, then perhaps it should be renamed to use "deploy" instead of "migrate". Maybe the script could be named aws-deploy.sh?

9 ↗(On Diff #42358)

I might call this num_desired_secondary_nodes... it's unclear from context that it's a number, and could be interpreted as a boolean (whether we desire secondary nodes)

16–18 ↗(On Diff #42358)

Are these copy-pasted from Terraform somewhere? Prefer to avoid copy-paste when possible, but if it's difficult to factor out given Terraform vs. bash, we should at least add a reference to both sides making sure that a dev doesn't update one without updating the other

21–23 ↗(On Diff #42358)

I'm not 100% sure, but my impression was that it's safer to add quotes when making an assignment in bash

That said if Shellcheck didn't complain maybe it's safe?

38 ↗(On Diff #42358)

You use a mix of bash conditionals: (( )), [ ], and [[ ]]. Do we really need all three? It seems like we should at least drop [ ] (my impression was that it's replaced by [[ ]]). Would be good to understand why you chose each of them

52–56 ↗(On Diff #42358)

Nice work figuring this out!

Question for you – if somebody runs a terraform apply when this script is running, will that mess things up? And vice versa – what happens if terraform apply is running, and then somebody runs this script?

72 ↗(On Diff #42358)

Nit: seems like extraneous newline

82–96 ↗(On Diff #42358)

Much of this feels copy-pasted from check_health. Is there a way to factor it out?

98–103 ↗(On Diff #42358)

Setting primary node desired count to 0, then back to 1 is not necessary when doing --force-new-deployment afterwards. I think you can get rid of these steps.

A bit confused reading this – is @bartek's implication that we can skip line 71?

119 ↗(On Diff #42358)

Do we really need to wait for all of the secondary nodes to come online before re-enabling public access to the load balancer?

119–124 ↗(On Diff #42358)

Seems like we could just break here, and move these lines to the bottom of the script. I think that would improve readability

services/terraform/self-host/aws_lb.tf
13–18 ↗(On Diff #42358)

What's this for? Are you basically making sure that the calling IP can get through the load balancer, even when we've otherwise disabled ingress? If that's the case, wondering if you considered hitting the primary node directly for the health check instead of going through the load balancer

ashoat requested changes to this revision.Jul 17 2024, 12:01 PM

(Passing back to @will)

This revision now requires changes to proceed.Jul 17 2024, 12:01 PM
services/terraform/self-host/aws-run-migration.sh
16–18 ↗(On Diff #42358)

Talked to Ashoat about this. Instead of copy and pasting this, we might be able to use the aws-cli to determine what the names of these services are.

21–23 ↗(On Diff #42358)

Going to go back through the script and add quotations where necessary on next rebase

38 ↗(On Diff #42358)

Will pick one kind in the next rebase

52–56 ↗(On Diff #42358)

To mitigate the risk of users running terraform apply and the script incorrectly at the same time, we've decided to remove the option of the user running terraform apply and only have them use the script

82–96 ↗(On Diff #42358)

We're planning to resolve the code duplication by putting all of this in its own function and then passing some kind of comparator function which will return a boolean depending on whether the http status code is valid.

98–103 ↗(On Diff #42358)

The problem I was encountering not setting primary node desired count to 0 was it automatically restarting the primary node after I shut down the task. My understanding is that we need to keep this in order to run the redeployment explicitly ourselves with

aws ecs update-service --cluster $cluster_name --service $primary_service_name --force-new-deployment --desired-count 1 > /dev/null
119 ↗(On Diff #42358)

After discussion with @ashoat, established that we can actually reenable internet traffic right after we've ensured that the primary node health check passes, triggering the start of the secondary nodes and reducing downtime.

To explain the original concern, we depend on a health check to the primary service to deploy the secondary nodes (this ensures that the migration has run).

We were worried that by having the primary node accessible on startup, the moment the primary node starts up, the health check is lost in the flood of traffic and the secondary nodes never deploy.

What I did in this current iteration is have the flood of traffic only hit the load balancer once all the secondary and primary nodes started up. This is unnecessary and we should have users able to access the primary node, just that it should happen after the health check is successful

119–124 ↗(On Diff #42358)

including in next rebase

services/terraform/self-host/aws_lb.tf
13–18 ↗(On Diff #42358)

Yes, you're correct. This ensures the user is always able to access the load balancer even after we've turned off the load balancer

Taking down the load balancer and having users directly hit the primary node is definitely a possible solution. The reason I went with the ingress approach was just due to ease of implementation. Might be overkill, but I'll include a follow up task where we consider going with a direct hit to the primary node. Otherwise, I think the ingress approach is okay for now

services/terraform/self-host/aws-run-migration.sh
1 ↗(On Diff #42358)

Addressing this in https://linear.app/comm/issue/ENG-8835/resolve-risks-in-user-deploying-an-update-with-a-pending-mariadb.

For anyone reading, high level overview is:

  1. Having the user always go through the script instead of ever running terraform apply themselves
  2. Running terraform apply and image version control is handled by the script instead of the user

review feedback with minimum requirements to land

ashoat added inline comments.
services/terraform/self-host/aws-deploy.sh
111–123 ↗(On Diff #42433)

Seems like still some code duplication here with check_health, but it's not super important

131 ↗(On Diff #42433)

Is this true? I think it's only the primary node that's successfully running at this point

services/terraform/self-host/aws-run-migration.sh
16–18 ↗(On Diff #42358)

Looks like you didn't have time to address this with the aws-cli commands. We should still do the minimum described above:

we should at least add a reference to both sides making sure that a dev doesn't update one without updating the other

This revision is now accepted and ready to land.Jul 17 2024, 7:34 PM
services/terraform/self-host/aws-deploy.sh
111–123 ↗(On Diff #42433)
131 ↗(On Diff #42433)

That is correct. Fixing this in last rebase