Page MenuHomePhabricator

[identity] Add exponential backoff for one-time key retries
ClosedPublic

Authored by bartek on Sep 16 2024, 11:16 PM.
Tags
None
Referenced Files
F3009497: D13356.diff
Fri, Oct 18, 10:23 PM
F3004280: D13356.id.diff
Fri, Oct 18, 1:58 PM
Unknown Object (File)
Sat, Oct 5, 3:48 PM
Unknown Object (File)
Thu, Oct 3, 7:29 PM
Unknown Object (File)
Thu, Oct 3, 6:45 PM
Unknown Object (File)
Wed, Oct 2, 10:19 PM
Unknown Object (File)
Wed, Oct 2, 2:37 AM
Unknown Object (File)
Mon, Sep 30, 10:43 PM
Subscribers

Details

Summary

Patrially addresses ENG-9264.
Used exponential backoff utility from our batch helper in our OTK logic, without introducing dedicated transact write helper, which would be much more involved.

Test Plan

Started local Identity. Registered a user device and uploaded a bunch of OTKs.
Added the following test to Commtest = it spawns 9 simultaneous GetOutboundKeys RPC calls:

#[tokio::test]
async fn otks() {
  // the device you uploaded OTKs for
  let user_id = "** paste here **";
  let device_id = "** paste here **";
  let access_token = "** paste here **";

  let request = OutboundKeysForUserRequest {
    user_id: user_id.to_string(),
  };

  // spawn 9 simultaneous GetOutboundKeys calls
  for i in 1..10 {
    let request = request.clone();
    let user_id = request.user_id.clone();
    let device_id = device_id.to_string();
    let access_token = access_token.to_string();
    tokio::spawn(async move {
      let mut client = get_auth_client(
        &service_addr::IDENTITY_GRPC.to_string(),
        user_id,
        device_id.clone(),
        access_token,
        PlatformMetadata::new(PLACEHOLDER_CODE_VERSION, DEVICE_TYPE),
      )
      .await
      .expect("Couldn't connect to identity service");

      let result = client
        .get_outbound_keys_for_user(request.clone())
        .await
        .unwrap()
        .into_inner();

      let device = result.devices.get(&device_id).unwrap();
      println!(
        "Thread {}: content={:?}, notif={:?}",
        i, device.one_time_content_prekey, device.one_time_notif_prekey
      );
    });
  }

  println!("Now sleeping");
  tokio::time::sleep(Duration::from_secs(3)).await;
  println!("Done");
}

Before introducing the backoff, was able to reproduce the issue (lowered max retries to 3, localstack is too fast, it was rarely failing with 8).

---- otks stdout ----
Now sleeping
Thread 3: content=Some("pVPfPv8GwO6A1nmh9sTCTtgXY1SgTpkg1axzSLfpKYO"), notif=Some("0glisLhk42UpSlSzPvRG1Ni54QOdrDnBiG22MI6c0V8")
Thread 4: content=Some("6s8AJVc21lEo5b7O70HlpsHJQI0xTMLNCRhERDBI0fC"), notif=None
Thread 7: content=None, notif=Some("pVPfPv8GwO6A1nmh9sTCTtgXY1SgTpkg1axzSLfpKYO")
Thread 8: content=Some("oTPvgqPb159Xrc3lcT9AxqTZ0KmhXDQS9BCIm6NkV1Q"), notif=None
Thread 2: content=None, notif=None
Thread 5: content=None, notif=None
Thread 9: content=None, notif=None
Thread 1: content=None, notif=Some("6s8AJVc21lEo5b7O70HlpsHJQI0xTMLNCRhERDBI0fC")
Thread 6: content=Some("mUlLHPepdsZtVD9FEkwgxuuLJAOO7rwiJrG4nrllsEy"), notif=None
Done

Also saw several MaxRetriesExceeded in Identity console

After introducing the backoff, the OTKs were always returned for all threads:

---- otks stdout ----
Now sleeping
Thread 7: content=Some("2LCsQHnU7CtVRbmrtK9U4hASqrauEz0xumpoikNX1XH"), notif=Some("Udmu7PoHUkP9VvrcHObwlJ69AwTcdk13OfidsHgzC9l")
Thread 3: content=Some("q2AeT2WSq591qLaMP91guNmrPXnWsQp0JQUzKdtnImC"), notif=Some("wCZ558PHRHYinPWGqCxUUE2BnjpesN03IMdLMplV9ii")
Thread 5: content=Some("KedtvWnxaeBqZkA0l52H9yZFAPaaMJuyDfgQQNJ1aIn"), notif=Some("u2wNG3GhPzVWaWaeYauSvfed9RR6x8i2zCsRuQMd9lv")
Thread 4: content=Some("9LzxGjX47mr3w6MObSG30URQ6LdLOBFOpgmeh6yLn1w"), notif=Some("2LCsQHnU7CtVRbmrtK9U4hASqrauEz0xumpoikNX1XH")
Thread 2: content=Some("jhketAygIKVLRWFm454ceAjxAQOQVlZerbGL2mtOP8Y"), notif=Some("q2AeT2WSq591qLaMP91guNmrPXnWsQp0JQUzKdtnImC")
Thread 1: content=Some("Z76yL9r8hlKJ0XoVUAV9CENU8IFIoUTWOo74iuHyNK8"), notif=Some("KedtvWnxaeBqZkA0l52H9yZFAPaaMJuyDfgQQNJ1aIn")
Thread 6: content=Some("0glisLhk42UpSlSzPvRG1Ni54QOdrDnBiG22MI6c0V8"), notif=Some("9LzxGjX47mr3w6MObSG30URQ6LdLOBFOpgmeh6yLn1w")
Thread 8: content=Some("pVPfPv8GwO6A1nmh9sTCTtgXY1SgTpkg1axzSLfpKYO"), notif=Some("jhketAygIKVLRWFm454ceAjxAQOQVlZerbGL2mtOP8Y")
Thread 9: content=Some("6s8AJVc21lEo5b7O70HlpsHJQI0xTMLNCRhERDBI0fC"), notif=Some("Z76yL9r8hlKJ0XoVUAV9CENU8IFIoUTWOo74iuHyNK8")
Done

Diff Detail

Repository
rCOMM Comm
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

bartek held this revision as a draft.
bartek published this revision for review.Sep 16 2024, 11:38 PM
varun added inline comments.
shared/comm-lib/src/database.rs
693 ↗(On Diff #44259)

Should we update this comment?

705 ↗(On Diff #44259)

Do we call this anywhere?

This revision is now accepted and ready to land.Sep 17 2024, 2:09 AM
shared/comm-lib/src/database.rs
693 ↗(On Diff #44259)

yeah

705 ↗(On Diff #44259)

Not outside batch helpers, but it's a part of the now-public API so it should be pub too