
[terraform] resolve faulty ecs task stop alarm
ClosedPublic

Authored by will on Aug 6 2024, 10:25 AM.
Tags
None

Details

Summary

Our ECS Task Stop alarm did not fire in previous instances of our identity service failing. This diff fixes that by no longer filtering on the ECS event rules directly. Instead, the events are redirected to a CloudWatch log group and filtered there.
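
For readers unfamiliar with the pattern, here is a rough Terraform sketch of routing ECS events into a CloudWatch log group and alarming on a metric filter. It is not the code in this diff: every resource name, the log group name, the filter pattern, the metric namespace, and the `alarm_topic_arn` variable are assumptions made for illustration.

```hcl
# Hypothetical sketch: all names and values below are illustrative, not taken from the diff.

variable "alarm_topic_arn" {
  description = "ARN of an existing SNS topic with an email subscription (assumed to exist)"
  type        = string
}

# Log group that receives the raw ECS events.
resource "aws_cloudwatch_log_group" "ecs_task_stop" {
  name              = "/aws/events/ecs-task-stop"
  retention_in_days = 7
}

# Allow EventBridge to write into the log group.
data "aws_iam_policy_document" "events_to_logs" {
  statement {
    actions   = ["logs:CreateLogStream", "logs:PutLogEvents"]
    resources = ["${aws_cloudwatch_log_group.ecs_task_stop.arn}:*"]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com", "delivery.logs.amazonaws.com"]
    }
  }
}

resource "aws_cloudwatch_log_resource_policy" "events_to_logs" {
  policy_name     = "ecs-task-stop-events-to-logs"
  policy_document = data.aws_iam_policy_document.events_to_logs.json
}

# Broad rule: forward every ECS task state change; the filtering happens on the logs.
resource "aws_cloudwatch_event_rule" "ecs_task_state_change" {
  name = "ecs-task-state-change"
  event_pattern = jsonencode({
    source        = ["aws.ecs"]
    "detail-type" = ["ECS Task State Change"]
  })
}

resource "aws_cloudwatch_event_target" "to_log_group" {
  rule = aws_cloudwatch_event_rule.ecs_task_state_change.name
  arn  = aws_cloudwatch_log_group.ecs_task_stop.arn
}

# Count logged events whose task reached STOPPED.
resource "aws_cloudwatch_log_metric_filter" "task_stopped" {
  name           = "ecs-task-stopped"
  log_group_name = aws_cloudwatch_log_group.ecs_task_stop.name
  pattern        = "{ $.detail.lastStatus = \"STOPPED\" }"

  metric_transformation {
    name      = "EcsTaskStopped"
    namespace = "Custom/ECS"
    value     = "1"
  }
}

# Alarm as soon as at least one stop is counted within a minute.
resource "aws_cloudwatch_metric_alarm" "task_stopped" {
  alarm_name          = "ecs-task-stopped"
  namespace           = "Custom/ECS"
  metric_name         = "EcsTaskStopped"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.alarm_topic_arn]
}
```

A filter this broad would also count intentional stops such as deploys, so a real pattern would likely narrow on fields like the cluster, task group, or stop reason. Keeping the raw events in a log group also makes it possible to inspect what was actually emitted when an alarm fails to fire.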

This addresses https://linear.app/comm/issue/ENG-8847/investigate-why-alarms-dont-trigger-when-identity-service-ooms and https://linear.app/comm/issue/ENG-8156/investigate-why-ecs-task-change-didnt-trigger

Test Plan

Created an identity service copy on staging, reduced its memory allocation and downgraded the image, then called GetDeviceListForUsers with ~12,000 userIDs, causing an OOM error and the task to fail.

The alarm triggered successfully and sent an email.

Diff Detail

Repository
rCOMM Comm
Branch
ecs_alarms
Lint
No Lint Coverage
Unit
No Test Coverage