HomePhabricator
Diffusion Comm 3bee0f5fc9bf

[identity] Add exponential backoff for one-time key retries

Description

[identity] Add exponential backoff for one-time key retries

Summary:
Patrially addresses ENG-9264.
Used exponential backoff utility from our batch helper in our OTK logic, without introducing dedicated transact write helper, which would be much more involved.

Test Plan:
Started local Identity. Registered a user device and uploaded a bunch of OTKs.
Added the following test to Commtest = it spawns 9 simultaneous GetOutboundKeys RPC calls:

#[tokio::test]
async fn otks() {
  // the device you uploaded OTKs for
  let user_id = "** paste here **";
  let device_id = "** paste here **";
  let access_token = "** paste here **";

  let request = OutboundKeysForUserRequest {
    user_id: user_id.to_string(),
  };

  // spawn 9 simultaneous GetOutboundKeys calls
  for i in 1..10 {
    let request = request.clone();
    let user_id = request.user_id.clone();
    let device_id = device_id.to_string();
    let access_token = access_token.to_string();
    tokio::spawn(async move {
      let mut client = get_auth_client(
        &service_addr::IDENTITY_GRPC.to_string(),
        user_id,
        device_id.clone(),
        access_token,
        PlatformMetadata::new(PLACEHOLDER_CODE_VERSION, DEVICE_TYPE),
      )
      .await
      .expect("Couldn't connect to identity service");

      let result = client
        .get_outbound_keys_for_user(request.clone())
        .await
        .unwrap()
        .into_inner();

      let device = result.devices.get(&device_id).unwrap();
      println!(
        "Thread {}: content={:?}, notif={:?}",
        i, device.one_time_content_prekey, device.one_time_notif_prekey
      );
    });
  }

  println!("Now sleeping");
  tokio::time::sleep(Duration::from_secs(3)).await;
  println!("Done");
}

Before introducing the backoff, was able to reproduce the issue (lowered max retries to 3, localstack is too fast, it was rarely failing with 8).

---- otks stdout ----
Now sleeping
Thread 3: content=Some("pVPfPv8GwO6A1nmh9sTCTtgXY1SgTpkg1axzSLfpKYO"), notif=Some("0glisLhk42UpSlSzPvRG1Ni54QOdrDnBiG22MI6c0V8")
Thread 4: content=Some("6s8AJVc21lEo5b7O70HlpsHJQI0xTMLNCRhERDBI0fC"), notif=None
Thread 7: content=None, notif=Some("pVPfPv8GwO6A1nmh9sTCTtgXY1SgTpkg1axzSLfpKYO")
Thread 8: content=Some("oTPvgqPb159Xrc3lcT9AxqTZ0KmhXDQS9BCIm6NkV1Q"), notif=None
Thread 2: content=None, notif=None
Thread 5: content=None, notif=None
Thread 9: content=None, notif=None
Thread 1: content=None, notif=Some("6s8AJVc21lEo5b7O70HlpsHJQI0xTMLNCRhERDBI0fC")
Thread 6: content=Some("mUlLHPepdsZtVD9FEkwgxuuLJAOO7rwiJrG4nrllsEy"), notif=None
Done

Also saw several MaxRetriesExceeded in Identity console

After introducing the backoff, the OTKs were always returned for all threads:

---- otks stdout ----
Now sleeping
Thread 7: content=Some("2LCsQHnU7CtVRbmrtK9U4hASqrauEz0xumpoikNX1XH"), notif=Some("Udmu7PoHUkP9VvrcHObwlJ69AwTcdk13OfidsHgzC9l")
Thread 3: content=Some("q2AeT2WSq591qLaMP91guNmrPXnWsQp0JQUzKdtnImC"), notif=Some("wCZ558PHRHYinPWGqCxUUE2BnjpesN03IMdLMplV9ii")
Thread 5: content=Some("KedtvWnxaeBqZkA0l52H9yZFAPaaMJuyDfgQQNJ1aIn"), notif=Some("u2wNG3GhPzVWaWaeYauSvfed9RR6x8i2zCsRuQMd9lv")
Thread 4: content=Some("9LzxGjX47mr3w6MObSG30URQ6LdLOBFOpgmeh6yLn1w"), notif=Some("2LCsQHnU7CtVRbmrtK9U4hASqrauEz0xumpoikNX1XH")
Thread 2: content=Some("jhketAygIKVLRWFm454ceAjxAQOQVlZerbGL2mtOP8Y"), notif=Some("q2AeT2WSq591qLaMP91guNmrPXnWsQp0JQUzKdtnImC")
Thread 1: content=Some("Z76yL9r8hlKJ0XoVUAV9CENU8IFIoUTWOo74iuHyNK8"), notif=Some("KedtvWnxaeBqZkA0l52H9yZFAPaaMJuyDfgQQNJ1aIn")
Thread 6: content=Some("0glisLhk42UpSlSzPvRG1Ni54QOdrDnBiG22MI6c0V8"), notif=Some("9LzxGjX47mr3w6MObSG30URQ6LdLOBFOpgmeh6yLn1w")
Thread 8: content=Some("pVPfPv8GwO6A1nmh9sTCTtgXY1SgTpkg1axzSLfpKYO"), notif=Some("jhketAygIKVLRWFm454ceAjxAQOQVlZerbGL2mtOP8Y")
Thread 9: content=Some("6s8AJVc21lEo5b7O70HlpsHJQI0xTMLNCRhERDBI0fC"), notif=Some("Z76yL9r8hlKJ0XoVUAV9CENU8IFIoUTWOo74iuHyNK8")
Done

Reviewers: varun, kamil

Reviewed By: varun, kamil

Subscribers: ashoat, tomek

Differential Revision: https://phab.comm.dev/D13356

Details

Provenance
bartekAuthored on Mon, Sep 16, 11:00 PM
Reviewer
varun
Differential Revision
D13356: [identity] Add exponential backoff for one-time key retries
Parents
rCOMM1a92d95b7f42: [terraform] Deploy Tunnelbroker 0.16 to production
Branches
Unknown
Tags
Unknown