Page MenuHomePhabricator

[identity] otk db changes
ClosedPublic

Authored by varun on Apr 9 2024, 9:21 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 10, 1:58 AM
Unknown Object (File)
Fri, Jan 10, 1:55 AM
Unknown Object (File)
Fri, Jan 10, 1:49 AM
Unknown Object (File)
Sun, Dec 29, 4:35 AM
Unknown Object (File)
Sun, Dec 29, 4:35 AM
Unknown Object (File)
Sun, Dec 29, 4:35 AM
Unknown Object (File)
Sun, Dec 29, 4:35 AM
Unknown Object (File)
Sun, Dec 29, 4:35 AM
Subscribers

Details

Summary

change how we store and retrieve OTKs from DynamoDB

Test Plan

commtest passes, manual testing

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

varun held this revision as a draft.
services/identity/src/constants.rs
122–125 ↗(On Diff #38965)

To be fully correct, these should also be added to the DeviceRow struct in device_list.rs.
But as they're only internal helper counters, I'm wondering if that makes sense (gives no real benefit).
But at least a comment should be added to the struct to indicate they exist but are not handled there.

services/identity/src/database.rs
465 ↗(On Diff #38965)

Maybe we should add a retry counter to these loops? see mod batch_operations in comm_lib/database.rs

Ideally, we could extract sth like transact_write_helper similar to batch write, but this is a follow-up, not sure if worth doing now

507–511 ↗(On Diff #38965)

The AWS guide here https://aws.amazon.com/blogs/database/implement-resource-counters-with-amazon-dynamodb/ (Approach 1 - Atomic counters)
says that this might be unreliable due to transient AWS errors.
Not sure how it behaves in transactions, though. Hopefully it fails the whole transaction

614–616 ↗(On Diff #38965)

I guess we have two separate transactions because we want to track counts of content and notif keys separately.

Alternative approach would be

  1. Prepare put requests for content and notifs
  2. Map them into tuples ("content/notif", PutRequest)
  3. Merge these tuples together
  4. Batch them into 98 chunks (100 - 2)
  5. For each chunk,
    • count let content_count = tuples.iter().filter(|v| v.0 == "content").count();
    • count let notif_count = tuples.iter().filter(|v| v.0 == "notif").count();
    • requests: let requests = tuples.into_iter().map(|v| v.1).collect::<Vec<_>>();
    • single transaction:
      • requests
      • update_content_otk_operation
      • update_notif_otk_operation

Having single (and less transient-error-prone) DDB transaction at a cost of code complexity

690–697 ↗(On Diff #38965)

Same as above - we have helpers for that

711–719 ↗(On Diff #38965)

Builder pattern helps to avoid some memory operations and simplify code ;)

726–733 ↗(On Diff #38965)
  1. I know this is ugly, I haven't found any nice enough way of dealing with DDB error handling
  2. Maybe it's possible to do sth like this
fn is_transaction_canceled(err: &DynamoDBError, reason: &str) -> bool {
  match err {
    DynamoDBError::TransactionCanceledException(
              TransactionCanceledException {
                cancellation_reasons: Some(reasons),
                ..
              },
            ) if reasons
              .iter()
              .any(|reason| reason.code() == Some(reason)) => true,
    _ => false
}
759–760 ↗(On Diff #38965)

It's good to add .projection_expression(devices_table::ATTR_CONTENT_OTK_COUNT) to avoid unnecessarily getting other fields.

Otherwise, you can just use db_client.get_device_data() (it's in device_list.rs)

765–772 ↗(On Diff #38965)

We have dedicated helper struct that does this (device_list.rs)

780 ↗(On Diff #38965)
781–809 ↗(On Diff #38965)

Simplify

783–786 ↗(On Diff #38965)

Unfortunately this is the correct way now, because there's a problem with TryFromAttribute for numbers - https://linear.app/comm/issue/ENG-4658/improve-database-type-conversions-for-numbers

services/identity/src/ddb_utils.rs
25–30 ↗(On Diff #38965)

nit

services/identity/src/database.rs
588–594 ↗(On Diff #38965)

One thing that came to my mind - we should not allow uploading more than 98 keys in total.
The client should batch keys into chunks, not server. But I guess we never exceed this threshold anyway (have even lower limits).

If ever uploading more at once, the UploadOneTimeKeys RPC should return the keys that were successfully uploaded (or inverse - failed). And this would be too much complexity. This way, the caller will know which keys were successfully uploaded.
Example:
Uploading 250 keys is 3 transactions. Let's say 1st succeeds but 2nd fails. Some keys are uploaded, but others not

Also, this limit will let us avoid splitting into chunks server-side.

services/identity/src/database.rs
575 ↗(On Diff #38965)

We don't need this anymore, that was to monitor the previous inelegant solution

579–585 ↗(On Diff #38965)

Same as above

services/identity/src/database.rs
588–594 ↗(On Diff #38965)

olm only remembers the last 100 one-time keys so clients will never upload more than 100

614–616 ↗(On Diff #38965)

From the aws docs:

Don't group operations together in a transaction if it's not necessary. For example, if a single transaction with 10 operations can be broken up into multiple transactions without compromising the application correctness, we recommend splitting up the transaction. Simpler transactions improve throughput and are more likely to succeed.

Following this principle, I think we should avoid combining content and notif updates in a single transaction

services/identity/src/constants.rs
122–125 ↗(On Diff #38965)

good point. going to just add a comment for now, since i can't think of a good reason to add these fields to the struct

services/identity/src/database.rs
465 ↗(On Diff #38965)

i'll add a basic retry counter that returns an error after the 8th try. i'll also leave a todo comment to introduce and use a transact write helper

services/identity/src/database.rs
507–511 ↗(On Diff #38965)

i think it should fail the whole transaction

588–594 ↗(On Diff #38965)

I think we can assume that the list of OTKs that the client provides is sorted (i think oldest to newest, but need to confirm this). so if the client provides 150 content otks, we can just throw away the first 50 in the list since the olm account will have thrown them away already

588–594 ↗(On Diff #38965)

And this would be too much complexity

Did you mean wouldn't be too much complexity?

588–594 ↗(On Diff #38965)

olm only remembers the last 100 one-time keys so clients will never upload more than 100

*should never, i suppose it's possible though

588–594 ↗(On Diff #38965)

Ah I understand your concern now. Let's just immediately fail any request that comes in (register, upload_one_time_keys) that has more than 98 keys total so that we can do everything in a single transaction and avoid the scenario where, say, 99 content keys are uploaded but 0 notif keys are

services/identity/src/database.rs
614–616 ↗(On Diff #38965)

After thinking about this a little more, the alternative you propose seems like the right way to go (except we don't need chunks since we'll just fail the request immediately if the caller provides more than 49 content or notif otks)

services/identity/src/database.rs
711–719 ↗(On Diff #38965)

transact_items() expects a single TransactWriteItem object, so we have to use set_transact_item() since we have potentially 100 items here

varun edited the summary of this revision. (Show Details)
varun edited the test plan for this revision. (Show Details)

re-run commtest

varun edited the test plan for this revision. (Show Details)

get commtest to pass

varun published this revision for review.Apr 13 2024, 11:40 AM
bartek added inline comments.
services/identity/src/database.rs
661–662 ↗(On Diff #39095)

Should work

588–594 ↗(On Diff #38965)

Ah I understand your concern now. Let's just immediately fail any request that comes in (register, upload_one_time_keys) that has more than 98 keys total so that we can do everything in a single transaction and avoid the scenario where, say, 99 content keys are uploaded but 0 notif keys are

Yeah this is what I meant

614–616 ↗(On Diff #38965)

After thinking about this a little more, the alternative you propose seems like the right way to go (except we don't need chunks since we'll just fail the request immediately if the caller provides more than 49 content or notif otks)

Yes, if we don't need chunks, this gets much simpler

711–719 ↗(On Diff #38965)

transact_items() expects a single TransactWriteItem object, so we have to use set_transact_item() since we have potentially 100 items here

Right, I missed it

services/terraform/modules/shared/dynamodb.tf
241–242 ↗(On Diff #39095)

Good idea with informative column names

This revision is now accepted and ready to land.Apr 15 2024, 1:31 AM
This revision was automatically updated to reflect the committed changes.