Page MenuHomePhabricator

[Keyserver] Upload new onetime keys to identity service when requested
ClosedPublic

Authored by kamil on Aug 7 2023, 8:25 AM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Jan 6, 3:59 PM
Unknown Object (File)
Mon, Jan 6, 3:59 PM
Unknown Object (File)
Mon, Jan 6, 3:59 PM
Unknown Object (File)
Mon, Jan 6, 3:59 PM
Unknown Object (File)
Mon, Jan 6, 3:59 PM
Unknown Object (File)
Mon, Jan 6, 3:59 PM
Unknown Object (File)
Mon, Jan 6, 3:58 PM
Unknown Object (File)
Mon, Jan 6, 3:43 PM
Subscribers

Details

Summary

When receiving a "refreshKeysRequest" from tunnelbroker, keyserver
should upload new onetime keys to identity service.

https://linear.app/comm/issue/ENG-4602

Depends on D8751

Test Plan

Integration test to follow, wanted to keep this JS only

Diff Detail

Repository
rCOMM Comm
Branch
arcpatch-D8752 (branched from master)
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

ashoat requested changes to this revision.Aug 7 2023, 11:02 AM

Continued issues with async / await... at this point I feel like a broken record saying "I feel like a broken record"

When are you going to brush up on this? Am I going to be leaving the exact same feedback on every one of your JS diffs for the next year?

keyserver/src/socket/tunnelbroker.js
28 ↗(On Diff #29592)

Whenever you use async / await, you should ask yourself if you need to be using it.

Do you need to be using it here?

29 ↗(On Diff #29592)

If we always expect event.data to be a string, we should use an invariant

33 ↗(On Diff #29592)

Are we using a union-of-disjoint-types for this? If no, please do that (with eg. a type field or something). If yes, let's use an invariant on the type, which can replace this comment

keyserver/src/utils/olm-utils.js
181–182 ↗(On Diff #29592)

These can just be inlined below

200 ↗(On Diff #29592)

This must be the 10th time you've done this. You're failing to await a promise, causing yet another potential Node.js crash

This revision now requires changes to proceed.Aug 7 2023, 11:02 AM

need to do a another async/await pass, and hit on commentary

jon marked 5 inline comments as done.

Address feedback. Correct async/await usage

marcin requested changes to this revision.Aug 21 2023, 2:17 AM
marcin added inline comments.
keyserver/src/socket/tunnelbroker.js
30

Nit: Can we enforce event.data to be a string by using flow types? It is generally better to avoid using invariant if we can achieve the same effect by proper typing.

35

Nit: even if now there is only one message type it is still better to check for it and return or throw in case unsupported message type is received. This way code is safer, more readable and maintainable.

keyserver/src/utils/olm-utils.js
192

Starting transactional db update inside another transactional db update looks a little bit suspicious to me. I think it would be better to create two independent promises one for content and another for notif account and let them run in parallel (await Promise.all). One thing that you will need to change is that in both of those promises you will call mark_keys_as_published. Then once both of the promises complete you will run the final rustApi.uploadOneTimeKeys.

It is actually safer to call mark_keys_as_published before they are actually uploaded since imagine a case when the keyserver crashes after rustAPI.uploadOneTimeKeys completes and before you call mark_keys_as_published. In such a case the same one time keys might be uploaded twice.

This revision now requires changes to proceed.Aug 21 2023, 2:17 AM

One nit about deviceID vs deviceId, and one request to check message.type (agreeing with @marcin). My other two comments are responses to @marcin

keyserver/src/socket/tunnelbroker.js
30

This type comes from here, which I believe is based on the WebSocket spec

35

I agree – now that we have a union-of-disjoint-types, it seems like we just need to check that message.type === 'refreshKeysRequest'

keyserver/src/utils/olm-utils.js
192

It is actually safer to call mark_keys_as_published before they are actually uploaded since imagine a case when the keyserver crashes after rustAPI.uploadOneTimeKeys completes and before you call mark_keys_as_published. In such a case the same one time keys might be uploaded twice.

What's the risk of uploading the same one-time keys twice? Ideally, the identity service will just dedup.

If we preemptively mark the keys as published, I'm worried that we'll end up having these keys permanently stored in the Olm account. They'll never get used because nobody knows about them. This pattern could lead to unconstrained growth in the size of the Olm account.

Starting transactional db update inside another transactional db update looks a little bit suspicious to me. I think it would be better to create two independent promises one for content and another for notif account and let them run in parallel (await Promise.all). One thing that you will need to change is that in both of those promises you will call mark_keys_as_published. Then once both of the promises complete you will run the final rustApi.uploadOneTimeKeys.

I'd like to understand more about what you're worried about in the current design. It seems like the advantage of doing it in one transaction is that we make sure we don't "save" the new one-time keys until we've confirmed that the identity service RPC is successful.

lib/types/tunnelbroker-messages.js
34
ashoat requested changes to this revision.Aug 21 2023, 2:14 PM
keyserver/src/utils/olm-utils.js
192

You mean like?

// For brevity I left out Promise.all
 const { deviceID, otkKeys: contentOneTimeKeys } = await fetchCallUpdateOlmAccount('content', generate_one_time_keys_and_mark_as_published);
 const { otkKeys: notifOneTimeKeys } = await fetchCallUpdateOlmAccount('notifications', generate_one_time_keys_and_mark_as_published);

await rustAPI.uploadOneTimeKeys(
          identityInfo.userId,
          deviceID,
          identityInfo.accessToken,
          contentOneTimeKeys,
          notifOneTimeKeys,
        );

Not a big fan of this, as rustAPI.uploadOneTimeKeys is the most likely to fail. And mark_keys_as_published does have a cost to it filling up the published keys buffer.

It is actually safer to call mark_keys_as_published before they are actually uploaded since imagine a case when the keyserver crashes after rustAPI.uploadOneTimeKeys completes and before you call mark_keys_as_published

On the identity side, each one-time key is an unique item in DDB. So the keys should not be duplicated, but rather overwritten. The potential concern in the current "transaction within a transaction" is that if mark_keys_as_published() fails (or keyserver crashes after upload). There's a chance that those keys were minted to other clients by identity service; before keyserver tries to upload them again. In which case, two or more clients to receive the same one-time-key, which will cause the subsequent create_inbound_session calls by the keyserver to fail.

Likewise, if we do mark_keys_as_published before we upload. Then crashes to keyserver after generation of the keys, but before the upload of the keys, will cause unreachable keys to be put into the one-time-key buffer. If these keys start to overflow the 100 key buffer in Olm, then Olm will start dropping keys which were published. This creates a similar situation to above, where create_inbound_session calls by the keyserver will fail as the one-time-key issued by identity service won't exist in the keyserver's one-time-keys buffer.

Either way, it seems like we will need some way to handle "the one time key you used in this create_outbound_message is invalid, try again". Eventually the invalid keys should be pruned back to a healthy state.

In practice, I don't think this will be much of an issue. A lot of these scenarios require keyserver to be crashing which should be the exception.

My preference is to keep the transaction logic as it is now, because the failure states are a bit more "all or nothing", and the most likely to fail action of rustAPI.uploadOneTimeKeys can fail multiple times without causing the olm accounts getting into a bad state.

keyserver/src/utils/olm-utils.js
192

The logic of fetchCallUpdateOlmAccount is based on spinlock with limited amount time of runs (we don't lock the call to fetchCallUpdateOlmAccount if there is another one going on - we wait and retry later keeping limited amount of retries.). If we perform too much work inside one fetchCallUpdateOlmAccount we increase the chance that other calls to fetchCallUpdateOlmAccount will fail. This it the reason why I was concerned about making a bested fetchCallUpdateOlmAccount with network request since it looks like a costly operation.

However your comments express that waisting one time key is a bigger concern than increasing the chance of failure of fetchCallUpdateOlmAccount so I will not block this differential on this change.

However one thing that occurred to me: where are the corresponding calls to account.remove_one_time_keys?

After reading @jon 's reply to my main comment I reached the conclusion that he was thoughtful about the "nested" code he wrote so I am not going to block this diff on my "nested spinlock" concerns.

@kamil, would you mind commandeering this revision?

kamil edited reviewers, added: jon; removed: kamil.
  • updated event type
  • update message type value to reflect value defined in Rust
  • test with real keyserver, tests are not accurate
lib/types/tunnelbroker-messages.js
38 ↗(On Diff #30872)

same reason as here: https://phab.comm.dev/D7691#inline-58055 - I would prefer to fix this once across the entire tunnelbroker codebase

lib/types/tunnelbroker-messages.js
38 ↗(On Diff #30872)
ashoat added inline comments.
keyserver/src/socket/tunnelbroker.js
34 ↗(On Diff #30872)

TBRefreshKeysRequest includes a deviceID parameter, which does not appear to be used here.

If the parameter isn't used, should it be excluded? Or is there a plan to use it for some validation step later or something?

lib/types/tunnelbroker-messages.js
44 ↗(On Diff #30872)

I think it would be good to differentiate messages TO Tunnelbroker (TBConnectionInitializationMessage) against messages FROM Tunnelbroker (TBMessage). Otherwise it's confusing why TBMessage doesn't include TBConnectionInitializationMessage

Can you rename these to add some clarity?

This revision is now accepted and ready to land.Sep 12 2023, 6:03 AM

address review

keyserver/src/socket/tunnelbroker.js
34 ↗(On Diff #30872)

That's a good question - and I am afraid only Jon knows the answer.

My guess:
It's not used in JS and I think it will not be used, but it was declared in Rust (D7480) and we tried to make JS types match Rust definition. For now, I don't see any reason why we need deviceID in RefreshKeyRequest, maybe we can remove it from both JS and Rust? But I don't have context as to why it was introduced.

lib/types/tunnelbroker-messages.js
44 ↗(On Diff #30872)

renaming to MessageFromTunnelbroker and MessageToTunnelbroker

keyserver/src/socket/tunnelbroker.js
34 ↗(On Diff #30872)

Can you create a follow-up task to consider removing this parameter if it's truly unused?

keyserver/src/socket/tunnelbroker.js
34 ↗(On Diff #30872)

Sure, I was planning to do it: ENG-4928