
[terraform] resolve faulty ecs task stop alarm
ClosedPublic

Authored by will on Aug 6 2024, 10:25 AM.
Tags
None
Referenced Files
F2766240: D13002.diff
Thu, Sep 19, 3:25 PM

Details

Summary

Our ECS Task Stop alarm did not fire in previous instances of our identity service failing. This diff addresses the problem by no longer filtering on ECS event rules directly. Instead, the events are redirected to a CloudWatch log group and filtered there.
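
For readers unfamiliar with the pattern, here is a minimal Terraform sketch of routing ECS task state change events into a CloudWatch log group and alarming on a metric filter there. It is not the code from this diff: every resource name, the log group path, the metric namespace, the `service:identity-service` group value, and the `alarm_sns_topic_arn` variable are hypothetical placeholders, and the thresholds are illustrative only.

```hcl
# Hypothetical sketch; names and values are assumptions, not taken from the diff.
variable "alarm_sns_topic_arn" {
  type        = string
  description = "SNS topic that delivers the alarm notification (assumed to exist)"
}

locals {
  metric_name      = "IdentityTaskStopCount" # hypothetical
  metric_namespace = "Example/ECS"           # hypothetical
}

# Log group that receives the raw ECS events.
resource "aws_cloudwatch_log_group" "ecs_task_stop" {
  name              = "/aws/events/ecs-task-stop"
  retention_in_days = 7
}

# EventBridge needs permission to write into the log group.
data "aws_iam_policy_document" "events_to_logs" {
  statement {
    actions   = ["logs:CreateLogStream", "logs:PutLogEvents"]
    resources = ["${aws_cloudwatch_log_group.ecs_task_stop.arn}:*"]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com", "delivery.logs.amazonaws.com"]
    }
  }
}

resource "aws_cloudwatch_log_resource_policy" "events_to_logs" {
  policy_name     = "events-to-ecs-task-stop-logs"
  policy_document = data.aws_iam_policy_document.events_to_logs.json
}

# Broad rule: forward every stopped-task event instead of filtering here.
resource "aws_cloudwatch_event_rule" "ecs_task_stop" {
  name = "ecs-task-stop"
  event_pattern = jsonencode({
    source        = ["aws.ecs"]
    "detail-type" = ["ECS Task State Change"]
    detail        = { lastStatus = ["STOPPED"] }
  })
}

resource "aws_cloudwatch_event_target" "to_logs" {
  rule = aws_cloudwatch_event_rule.ecs_task_stop.name
  arn  = aws_cloudwatch_log_group.ecs_task_stop.arn
}

# Filtering happens on the logged events, not on the event rule.
resource "aws_cloudwatch_log_metric_filter" "identity_task_stop" {
  name           = "identity-task-stop"
  log_group_name = aws_cloudwatch_log_group.ecs_task_stop.name
  pattern        = "{ $.detail.group = \"service:identity-service\" }"

  metric_transformation {
    name      = local.metric_name
    namespace = local.metric_namespace
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "identity_task_stop" {
  alarm_name          = "identity-task-stop"
  namespace           = local.metric_namespace
  metric_name         = local.metric_name
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [var.alarm_sns_topic_arn]
}
```

One benefit of this layout is that the logged events stay available for inspection, so a filter pattern that fails to match can be debugged against real event payloads.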

This addresses https://linear.app/comm/issue/ENG-8847/investigate-why-alarms-dont-trigger-when-identity-service-ooms and https://linear.app/comm/issue/ENG-8156/investigate-why-ecs-task-change-didnt-trigger

Test Plan

Created an identity service copy on staging, reduced its memory allocation and downgraded the image, and then called GetDeviceListForUsers with ~12,000 userIDs, causing an OOM error and the task to fail.

The alarm successfully triggered and sent an email.

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable