Page MenuHomePhabricator

Implement prekey rotation on native
ClosedPublic

Authored by marcin on Jan 11 2024, 3:43 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 5, 4:20 AM
Unknown Object (File)
Wed, Dec 25, 4:13 AM
Unknown Object (File)
Wed, Dec 25, 4:13 AM
Unknown Object (File)
Wed, Dec 25, 4:13 AM
Unknown Object (File)
Wed, Dec 25, 4:13 AM
Unknown Object (File)
Wed, Dec 25, 4:13 AM
Unknown Object (File)
Wed, Dec 25, 4:13 AM
Unknown Object (File)
Wed, Dec 25, 4:13 AM
Subscribers

Details

Summary

This differential implements prekey rotation on native that allows to rollback changes if upload to
identity service fails.

Test Plan

Diff Detail

Repository
rCOMM Comm
Branch
marcin/eng-6405
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

Upload content prekey signature, notifications prekey and notifications prekey signature.

native/cpp/CommonCpp/CryptoTools/CryptoModule.cpp
106 ↗(On Diff #35611)

This is a weird name... how about prekeyExistsAndOlderThan?

471 ↗(On Diff #35611)

I don't think we should be marking the prekey as published until it's published to the identity service

native/cpp/CommonCpp/NativeModules/CommCoreModule.cpp
707 ↗(On Diff #35611)

I don't understand why we persist that it's published and then unpersist that, instead of first persisting that it's not published, and then updating later

native/cpp/CommonCpp/NativeModules/CommCoreModule.cpp
707 ↗(On Diff #35611)

The reason is that if we persist unpublished then if a call to identity fails we are left with a discrepancy between the state on identity service and the state on native. If we rotate prekey and fail to upload to identity service then if someone asks the identity service about our prekey they will receive prekey that we no longer have and won't be able to instantiate a session until we eventually upload the right prekey and they ask for our prekeys again. This was our initial way of thinking.
On the other hand now I am wondering if inability to instantiate a session after failure to upload prekey isn't actually desired. In current approach we make it possible to instantiate a session with a user using prekey that shouldn't be in usage anymore.

native/cpp/CommonCpp/CryptoTools/CryptoModule.cpp
471 ↗(On Diff #35611)

I think my comment below explains what is it implemented this way and why I am considering to change this approach.

native/cpp/CommonCpp/NativeModules/CommCoreModule.cpp
707 ↗(On Diff #35611)

If we rotate prekey and fail to upload to identity service then if someone asks the identity service about our prekey they will receive prekey that we no longer have and won't be able to instantiate a session until we eventually upload the right prekey and they ask for our prekeys again.

My understanding is that Olm has the ability to hold onto the old prekey even after introducing a new prekey.

We should take advantage of this functionality. The approach we take to prekey rotation on the keyserver considers this, and makes sure to never rotate the prekey when the old one hasn't been published. (Olm can only remember one old prekey, so we can't rotate if the current one hasn't been published yet, or else the scenario you describe can happen.)

On the other hand now I am wondering if inability to instantiate a session after failure to upload prekey isn't actually desired. In current approach we make it possible to instantiate a session with a user using prekey that shouldn't be in usage anymore.

It's okay to instantiate a session with an old prekey.

native/cpp/CommonCpp/NativeModules/CommCoreModule.cpp
707 ↗(On Diff #35611)

On the other hand if we first upload and then mar as published we are running into a case where application could crash after prekeys are correctly uploaded to identity service but before we mark them as published. The current implementation is resilient to such scenario. If we want to first publish and then mark as published we need to additionally check if the prekey is unpublished in case rotation doesn't return new prekey and try to upload it if it is not. Then the application will be able to heal itself.

Change in logics:

  1. Try to rotate prekey.
  2. If prekey rotation doesn't return new prekey check if we have an unpublished prekey.
  3. If prekey rotation returns new prekey, persist crypto module state.
  4. If prekey rotation returned new prekey or we have an unpublished prekey then upload it to identity service.
  5. If upload to identity service was succesfull then mark prekey as published and persist crypto module state.

Looking at the keyserver code for prekey rotation, it looks like it only happens if the user isn't logged into identity. Can you please:

  1. Go through the keyserver code and make sure it matches up with the new code you're introducing here.
  2. Create a follow-up task to introduce prekey rotation for the keyserver.

On the other hand if we first upload and then mar as published we are running into a case where application could crash after prekeys are correctly uploaded to identity service but before we mark them as published. The current implementation is resilient to such scenario. If we want to first publish and then mark as published we need to additionally check if the prekey is unpublished in case rotation doesn't return new prekey and try to upload it if it is not. Then the application will be able to heal itself.

I think this is a valid concern. Can you think through what the "worst case" is in this scenario, and what we need to do to protect against it? Then if there are any follow-ups, maybe you can document them on Linear.

One thing I'm curious about is if Olm will allow a session initialization that uses an unpublished prekey (I think yes).

native/cpp/CommonCpp/CryptoTools/CryptoModule.cpp
108

Can you add the same comment as we have here?

native/cpp/CommonCpp/NativeModules/CommCoreModule.cpp
693

Typo

This revision is now accepted and ready to land.Jan 17 2024, 8:07 AM

Go through the keyserver code and make sure it matches up with the new code you're introducing here.

The keyserver prekey rotation logic doesn't match native prekey rotation in this diff. Let's summarise how prekey rotation works on both platforms:

  1. The keyserver:

Three operations:

  • new prekey creation (if necessary)
  • identity upload
  • marking new prekey as published
  • new prekey persistence

are transactional. If any step fails we are left with initial prekey state

  1. Native (implementation in this diff):

First we check if there is an unpublished prekey. If it is we publish to identity and then persist as published. If any of those steps fails the app will heal itself on next run. If the prekey was published we check if it needs rotation (new prekey creation) and create new prekey if it does. At this point it is persisted as unpublished. Then we try to upload to identity service and once it succeeds we persist new prekey as published. If any of those two steps fail new prekey remains persisted as unpublished and next run will heal this state.

I think this is a valid concern. Can you think through what the "worst case" is in this scenario, and what we need to do to protect against it? Then if there are any follow-ups, maybe you can document them on Linear.

The worst case scenario with native approach implemented in this diff is that if we rotate prekey, publish to identity service and fail to mark as published then we run into a scenario when prekey is published but the device thinks it is unpublished and will keep thinking so until user opens the app again and publishing and marking as published is retried. Assuming that user keep app closed for days then the actual time it will take until next prekey rotation will be longer than 30 days period since prekey rotation is based on the timestap when the device marks prekey as published. Unfortunately this is also the case if the user doesn't open the app in general since we might not even have a chance to run prekey rotation. The only solution here is to periodically schedule prekey rotation on behalf of user activity using OS primitives like WorkManager on Android and NSURLBackgroundSession on iOS. This is tracked here: https://linear.app/comm/issue/ENG-6491/implement-scheduling-prekey-rotation-periodically-on-behalf-of-user

Create a follow-up task to introduce prekey rotation for the keyserver.

As stated above this diff would introduce discrepancy between the keyserver prekey rotation logic and native prekey rotation logic. Here is the task to refactor the keyserver to match native logics: https://linear.app/comm/issue/ENG-6490/resolve-discrepancy-between-prekey-rotation-logics-on-native-and-the

  1. Fix typo
  2. Add comment
  3. Rebase before landing

Neither response was sufficient.

Go through the keyserver code and make sure it matches up with the new code you're introducing here.

Can you please identify the differences between native and keyserver concisely, in one or two sentences?

I think this is a valid concern. Can you think through what the "worst case" is in this scenario, and what we need to do to protect against it? Then if there are any follow-ups, maybe you can document them on Linear.

You talk about " a scenario when prekey is published but the device thinks it is unpublished", but you don't give any details about how this would be bad. I asked a question above that doesn't appear to be addressed:

One thing I'm curious about is if Olm will allow a session initialization that uses an unpublished prekey (I think yes).

Please try to think about this like a tech lead... your goal is to understand the system, make it consistent, and make sure that normal failures won't break the system.

Can you please identify the differences between native and keyserver concisely, in one or two sentences?

If new prekey upload fails or marking new prekey as published fails then the keyserver rollbacks everything to the state before rotation while native persists new prekey which is marked as unpublished. Prekey rotation on native tries first to publish prekey if it is initially unpublished while keyserver doesn't.

You talk about " a scenario when prekey is published but the device thinks it is unpublished", but you don't give any details about how this would be bad.

Details why is it bad are given here:

Assuming that user keep app closed for days then the actual time it will take until next prekey rotation will be longer than 30 days period since prekey rotation is based on the timestap when the device marks prekey as published.

In summary we are risking a case that prekey is used longer than it should be (30 days) because next prekey rotation will be delayed.

One thing I'm curious about is if Olm will allow a session initialization that uses an unpublished prekey (I think yes).

In this code (https://github.com/CommE2E/olm/blob/main/src/session.cpp#L135-L218) prekey is fetched from account during inbound session instantiation. In this code (https://github.com/CommE2E/olm/blob/main/src/account.cpp#L374-L383) we can see that fetching prekey from account doesn't involve checking whether it was published or not. Additionally I recall Annunay saying (verbally) that publishing is just for networking and doesn't matter cryptographically. So I think you are right that we can instantiate a session with unpublished prekey.

Thanks for responding! I think this diff is good to land, but I followed up on Linear with some further things we'll need to do to make sure prekey lifecycle code is consistent. Most of it is covered in ENG-6490, but also I created ENG-6494 and added a comment to ENG-4351.

This revision was automatically updated to reflect the committed changes.