Page MenuHomePhabricator

[keyserver] run a temporary express health check server during migration for AWS load balancer health checks
ClosedPublic

Authored by will on Aug 26 2024, 2:06 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Oct 26, 8:28 PM
Unknown Object (File)
Tue, Oct 22, 6:30 PM
Unknown Object (File)
Tue, Oct 22, 6:30 PM
Unknown Object (File)
Tue, Oct 22, 6:30 PM
Unknown Object (File)
Tue, Oct 22, 6:29 PM
Unknown Object (File)
Tue, Oct 22, 1:18 PM
Unknown Object (File)
Tue, Oct 22, 1:18 PM
Unknown Object (File)
Tue, Oct 22, 10:22 AM
Subscribers

Details

Summary

We had a problem when running our last keyserver migration on Ashoat's keyserver where the migration took much longer than prior migrations. The AWS load balancer's default grace period for the ecs task ran out and
the migration process was cut off in the middle. This diff runs a temporary health check express server during the migration process so AWS can establish the ecs task as healthy

Was originally worried that we would need to fully stop the health check express server, but given that it's only on the master process, I don't believe this is necessary. Additionally, the default close provided function prevents new connections

Test Plan

Tested this locally by creating a dummy migration that slept for 10 seconds. During this time, keyserver endpoints were unavailable but the health check returned an OK response. Upon completing the migration,
both the health route and keyserver endpoints were available.

Also temporarily modified the health check route to timeout infinitely. Upon finishing the migration, the ongoing health check was terminated and the express server keyserver endpoints was able to be initialized.

Will not land without testing on AWS, ensuring a healthy primary node goes online with a dummy migration running past the default AWS grace period

Diff Detail

Repository
rCOMM Comm
Branch
add_migration_health_check
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

will requested review of this revision.Aug 26 2024, 2:23 PM
will planned changes to this revision.Aug 26 2024, 2:46 PM

Unfortunately, hanging connections prevents the keyserver express server from properly listening and accepting traffic

will planned changes to this revision.Aug 27 2024, 8:06 AM
will edited the summary of this revision. (Show Details)
will edited the test plan for this revision. (Show Details)

Still needs flow typing

i don't know flow too well, unfortunately

ashoat requested changes to this revision.Sep 4 2024, 11:16 PM

It's unclear to me why you need to stop the Express server at all. Why not take a similar approach to D13213?

keyserver/flow-typed/npm/stoppable_vx.x.x.js
4–14 ↗(On Diff #43700)

Should we remove this?

18 ↗(On Diff #43700)

Why are we supporting T being null or undefined?

19 ↗(On Diff #43700)

Nits: make it read-only, always use trailing commas

24–28 ↗(On Diff #43700)

Should we remove this?

This revision now requires changes to proceed.Sep 4 2024, 11:16 PM

It's unclear to me why you need to stop the Express server at all. Why not take a similar approach to D13213?

Explained some of my reasoning in D13213. We initialize an express server for the master process and we initialize express servers for non-master processes. In order for them not to conflict, we need to use stoppable to forcibly stop the master process's express server so that non-master processes can have their express servers listen properly on the port.

We need the express server in the master process as to make the health check available during the migration process.

There's another possible solution I didn't think of until now. We can possibly fork non-master processes earlier than the migration, start their express servers with only health checks, and then stall? This would have the added benefit of having multiple processes handling health check requests. Unsure if this is feasible or more desirable than my current solution

Three options here:

  1. Add a code comment explaining what you explained above.
  2. Your suggestion of starting the secondaries right away so they can respond to the health check. I think that's a little bit more clear / readable than what we're doing here.
  3. We could run all of the routes on the isMaster thread as well. But I'm a little confused how Express works when there is a server listening on both isMaster and non-isMaster... I get the impression that the non-isMaster threads might get "ignored" in some special way, based on the testing you've done.
This revision is now accepted and ready to land.Sep 5 2024, 1:12 PM
keyserver/flow-typed/npm/stoppable_vx.x.x.js
4–14 ↗(On Diff #43700)

Removing on next rebase

18 ↗(On Diff #43700)

I took this from the type script definitions for stoppable. It also looks like that's what http$Server's listen method gives us.

24–28 ↗(On Diff #43700)

Removing on next rebase

This revision was landed with ongoing or failed builds.Sep 8 2024, 9:57 PM
This revision was automatically updated to reflect the committed changes.