Diffusion Comm 5c106b1d9067

[terraform] resolve faulty ecs task stop alarm

Description

[terraform] resolve faulty ecs task stop alarm

Summary:
Our ECS Task Stop alarm did not fire during previous failures of our identity service. This diff fixes that: instead of filtering directly in the ECS event rule, it routes the events to a CloudWatch log group and filters on them there.
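
For readers unfamiliar with the pattern, here is a minimal Terraform sketch of the approach described above. It is not the contents of this diff: every resource name, the log group name, the metric namespace, the filter pattern, and the email address are hypothetical, and the real configuration may differ.

```hcl
# Hypothetical log group that receives ECS task state change events.
resource "aws_cloudwatch_log_group" "ecs_task_stop" {
  name              = "/aws/events/ecs-task-stop"
  retention_in_days = 7
}

# Allow EventBridge to write into the log group.
data "aws_iam_policy_document" "events_to_logs" {
  statement {
    actions   = ["logs:CreateLogStream", "logs:PutLogEvents"]
    resources = ["${aws_cloudwatch_log_group.ecs_task_stop.arn}:*"]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com", "delivery.logs.amazonaws.com"]
    }
  }
}

resource "aws_cloudwatch_log_resource_policy" "events_to_logs" {
  policy_name     = "ecs-task-stop-events-to-logs"
  policy_document = data.aws_iam_policy_document.events_to_logs.json
}

# Broad rule: forward every ECS task state change, with no fine-grained filtering here.
resource "aws_cloudwatch_event_rule" "ecs_task_state_change" {
  name = "ecs-task-state-change"
  event_pattern = jsonencode({
    source        = ["aws.ecs"]
    "detail-type" = ["ECS Task State Change"]
  })
}

resource "aws_cloudwatch_event_target" "to_log_group" {
  rule = aws_cloudwatch_event_rule.ecs_task_state_change.name
  arn  = aws_cloudwatch_log_group.ecs_task_stop.arn
}

# The filtering happens here, on the logged events (assumed pattern).
resource "aws_cloudwatch_log_metric_filter" "task_stopped" {
  name           = "ecs-task-stopped"
  log_group_name = aws_cloudwatch_log_group.ecs_task_stop.name
  pattern        = "{ $.detail.lastStatus = \"STOPPED\" }"

  metric_transformation {
    name      = "EcsTaskStopped"
    namespace = "Custom/ECS" # hypothetical namespace
    value     = "1"
  }
}

# Hypothetical SNS topic and email subscription for notifications.
resource "aws_sns_topic" "ecs_alarms" {
  name = "ecs-task-stop-alarms"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.ecs_alarms.arn
  protocol  = "email"
  endpoint  = "alerts@example.com" # placeholder address
}

# Alarm whenever at least one stopped task is logged within a minute.
resource "aws_cloudwatch_metric_alarm" "task_stopped" {
  alarm_name          = "ecs-task-stopped"
  namespace           = "Custom/ECS"
  metric_name         = "EcsTaskStopped"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
  alarm_actions       = [aws_sns_topic.ecs_alarms.arn]
}
```

One benefit of this layout is that the raw events remain in the log group, so a filter pattern that fails to match can be debugged against the events that were actually delivered.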

This addresses https://linear.app/comm/issue/ENG-8847/investigate-why-alarms-dont-trigger-when-identity-service-ooms and https://linear.app/comm/issue/ENG-8156/investigate-why-ecs-task-change-didnt-trigger

Test Plan:
Created a copy of the identity service on staging, reduced its memory allocation, and downgraded the image. Then called GetDeviceListForUsers with ~12,000 userIDs, which caused an OOM error and made the task fail.

The alarm triggered successfully and sent an email.

Reviewers: varun, bartek

Reviewed By: bartek

Subscribers: ashoat, tomek

Differential Revision: https://phab.comm.dev/D13002

Details

Provenance
will authored on Aug 6 2024, 9:57 AM
Reviewer
bartek
Differential Revision
D13002: [terraform] resolve faulty ecs task stop alarm
Parents
rCOMM09df98f2c578: [lib] fix processing DM ops by adding `MessageSourceMetadata` param