Details

Reviewers

tomek
ashoat
bartek

Commits

rCOMMaea53f271c7d: Implement synchronization mechanisms to address process, thread and class level…

Summary

This differential implements synchronisation mechanisms in NSE so that it is safe for two different threads to use the same NSE instance at the same time to process different notifications.

Test Plan

I was never able to observe NotificationService running in parallel on separate threads to process different notifications, so I am not sure whether it is actually possible to test this differential properly. I tested that notifications work correctly in normal circumstances, and that forcing a call to serviceExtensionTimeWillExpire by inserting sleep() works correctly: the notification is displayed either decrypted or with a proper error message.

The discussion here: https://stackoverflow.com/questions/62566948/notification-service-extension-lifecycle mentions that it is possible to launch two NSE processes with a debugger. However, inter-process safety is already tested by the parent differentials. For this differential it would be necessary to have two NSE threads operate on one NSE class instance to process two different notifications. I think it is not possible to recreate such conditions.

Diff Detail

Repository

rCOMM Comm

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

tomek added inline comments.Aug 16 2023, 2:29 AM

native/ios/NotificationService/NotificationService.mm
55 ↗	(On Diff #29855)	I don't think so. Not sure what the goal was, but: If the goal was to set this value in one thread and then read it in multiple threads, that would probably be incorrect, as only one call of `threadSafeContentHandlerCall` would result in calling `contentHandlerCalled`, so we would ignore all but one notification. If the goal is to set this value in one thread and read it in the same thread, then it is already guaranteed that this value is visible later. So I don't think we need to synchronize, but I also don't know why it is needed at all.
152 ↗	(On Diff #29855)	We probably don't need this comment anymore.
153 ↗	(On Diff #29855)	Shouldn't we unregister every time we `return` from this method? If that's the case, we should create a separate method that wraps this one so that we can `unregister` in one place, instead of multiple ones.
360–368 ↗	(On Diff #29855)	What are we trying to achieve here? If we assume that an instance of this class can be used by multiple threads, this code won't solve that: it would still be possible for two threads to call `self.contentHandlerCalled = NO;` and then to call `threadSafeContentHandlerCall`, which would make the second call invalid. If we are sure that the instance is used by one thread, then synchronizing the block is not necessary: the whole method is executed on one thread and we have a happens-before relationship.

This revision now requires changes to proceed.Aug 16 2023, 2:29 AM

marcin added inline comments.Aug 16 2023, 3:56 AM

native/ios/NotificationService/NotificationService.mm
55 ↗	(On Diff #29855)	I hope that the answer to the last @tomek 's comment answers this as well
153 ↗	(On Diff #29855)	That is correct - I forgot to add unregister calls before all returns.
360–368 ↗	(On Diff #29855)	There is more to the picture here, and the question "can this class be accessed from multiple threads or not" is not exhaustive enough. There are two "types" of threads here. The first is the thread that iOS creates to call the `didReceive...` callback. The second is the thread that we introduce by calling the `registerForMemoryEvents` method, which will call the `completionHandler` of this instance (technically we are not creating an additional thread; we just let the already existing main thread of the process call a lambda that accesses this instance).

That said, we might have two cases:
1. iOS always creates a new instance of the NSE class for each thread that processes a notification, or reuses the instance but sequentially.
2. iOS might let two different threads access the same NSE class instance simultaneously.

If case 1 is true, then this synchronization is necessary and correct, since we have two threads that can call the completion handler: the thread iOS created for this NSE class and the thread that responds to memory events. The line at 55 does not need synchronization, since it will be called only from the thread that iOS created for this NSE instance.

If case 2 is true, then this synchronization is not correct and @ashoat is right that this diff may introduce regressions. However, if 2 is correct, our code was in fact invalid from the start, even before we introduced e2e notifs. e2e notifs only revealed the issue because we are now actually mutating notification content (previously we only persisted it in a file).

Why did I think that 1 is correct? An example from the Apple docs: https://developer.apple.com/documentation/usernotifications/modifying_content_in_newly_delivered_notifications?language=objc shows usage of NSE that is not thread-safe if multiple threads access the same NSE instance.

What makes me think that 2 might actually be correct? The preamble of Signal's NSE code: https://github.com/signalapp/Signal-iOS/blob/main/SignalNSE/NotificationService.swift#L10-L30 and the fact that they use atomic variables for contentHandler instead of directly accessing `self`: https://github.com/signalapp/Signal-iOS/blob/main/SignalNSE/NotificationService.swift#L37 Additionally, the Element app keeps a hash table of contents and handlers that correspond to each other: https://github.com/vector-im/element-ios/blob/develop/RiotNSE/NotificationService.swift#L135-L136

I hope that the above answers @ashoat's questions:

Provide more detail (including a link to the code) about Signal's beliefs?
I linked the code.

If you are confident that this diff won't introduce additional regressions in the case that Signal is right, can you please talk through the various "race" scenarios here, and explain why the behavior will be unchanged?
If Signal is right, then this diff might introduce some regressions.

In summary: I wrote this differential prioritising the Apple docs over Signal/Telegram/Element and believing that 1 is true. I hope I have justified that this diff is correct if 1 is true and we trust the docs. However, yesterday during our 1:1 @ashoat advised me to prioritise the code over the docs. That said, this differential needs a refactor to adopt the threading patterns of Signal and Telegram.
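
As an illustration of the case 1 scenario discussed above, where the thread created by iOS and the memory-event handler may both try to call the completion handler, here is a minimal C++ sketch of a call-once guard. It is not the code from this diff, which is Objective-C++ and relies on `@synchronized`; `OnceHandler` and `ContentHandler` are hypothetical names, and the atomic flag is a swapped-in technique chosen only to keep the example portable.

```cpp
#include <atomic>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for the system-provided content handler block.
using ContentHandler = std::function<void(const std::string &)>;

// Wraps a handler so that only the first caller actually invokes it.
class OnceHandler {
public:
  explicit OnceHandler(ContentHandler handler)
      : handler_(std::move(handler)) {}

  // Returns true if this call invoked the handler, false if another
  // caller had already claimed it.
  bool call(const std::string &content) {
    // exchange() atomically sets the flag and returns the previous value,
    // so exactly one caller observes `false` here.
    if (called_.exchange(true)) {
      return false;
    }
    handler_(content);
    return true;
  }

private:
  ContentHandler handler_;
  std::atomic<bool> called_{false};
};
```

With a guard like this, whichever of the two threads calls `call` first delivers its content and the later call becomes a no-op. Note that this only covers one notification per guard; it does not address the case 2 scenario of several notifications sharing one instance.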

Introduce thread safety after dropping assumptions about sequentiality of notification processing by NSE.

Harbormaster completed remote builds in B21938: Diff 30130.Aug 21 2023, 4:11 AM

marcin added inline comments.Aug 21 2023, 4:24 AM

native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp
97–101 ↗	(On Diff #30130)	After discussion with @tomek we agreed that this code might be our primary suspect. One thread/process creates a temporary file and starts writing to it. Then another thread/process starts to execute this code and removes the temporary file the first thread is writing to, which causes the first thread to crash.
native/ios/NotificationService/NotificationService.mm
40 ↗	(On Diff #30130)	If we assume that it is possible for the same NSE instance to be used by many threads at once to process different notifications, then we have to keep a collection of handlers and contents, and accesses to this collection must be synchronized. I considered folly's implementation of `ConcurrentHashMap`: https://github.com/facebook/folly/blob/main/folly/concurrency/ConcurrentHashMap.h, however it was not sufficient for our case. In `serviceExtensionTimeWillExpire` we must atomically retrieve all contents and handlers and remove them from the hash maps to prevent other threads from accessing them. `ConcurrentHashMap` does not offer `getAndDeleteAll`-like functionality. Therefore this code uses a custom mutex-based locking scheme to ensure atomicity of any block of code.
430 ↗	(On Diff #30130)	I want to make sure that whatever occurs inside `@synchronized` block is as fast as possible. That said the only operations that take place in `@synchronized` blocks are hash map keys additions/deletions. Content handler blocks are provided by the system so they might be potentially costly to call so I want to avoid calling them in `@synchronized` blocks.
480 ↗	(On Diff #30130)	Memory events are delivered at process level. Therefore if we decide to create separate `memorySource` for each thread/NSE instance we could see redundant messages about memory issue in case one happens. That said I decided that `memorySource` will be set once per process. In case memory event occurs the event handler will substitute a string value that is initially null. Each NSE instance belonging to the process will check for this value before displaying notification. The first one that sees non-null value will replace the notification content it was about to display with memory error notification.
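
The locking scheme described in the comments on lines 40 and 430 can be sketched in portable C++ as follows. This is only an approximation under stated assumptions: the actual diff uses `@synchronized` and Objective-C collections, and `PendingRegistry`, `PendingNotification`, `put`, `take` and `takeAll` are hypothetical names. The sketch shows the two properties the comments ask for: all entries can be retrieved and removed in one atomic step, and handlers are invoked only after the lock has been released.

```cpp
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the system-provided content handler block.
using ContentHandler = std::function<void(const std::string &)>;

struct PendingNotification {
  std::string content;
  ContentHandler handler;
};

class PendingRegistry {
public:
  // Registers a notification that is currently being processed.
  void put(const std::string &id, PendingNotification pending) {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_[id] = std::move(pending);
  }

  // Removes one entry and hands it to the caller. Returns false if the
  // entry is gone, e.g. because takeAll() already claimed it.
  bool take(const std::string &id, PendingNotification &out) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = pending_.find(id);
    if (it == pending_.end()) {
      return false;
    }
    out = std::move(it->second);
    pending_.erase(it);
    return true;
  }

  // Atomically retrieves and removes every entry (the "getAndDeleteAll"
  // operation). Only the swap happens under the lock.
  std::vector<PendingNotification> takeAll() {
    std::unordered_map<std::string, PendingNotification> drained;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      drained.swap(pending_);
    }
    std::vector<PendingNotification> result;
    result.reserve(drained.size());
    for (auto &entry : drained) {
      result.push_back(std::move(entry.second));
    }
    return result;
  }

private:
  std::mutex mutex_;
  std::unordered_map<std::string, PendingNotification> pending_;
};
```

A timeout callback would call `takeAll()` and then invoke each returned handler outside the critical section, so a slow system-provided handler never blocks other threads that are adding or removing entries.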

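The process-wide memory flag described in the comment on line 480 could look roughly like the C++ sketch below. `MemoryEventFlag`, `report` and `claim` are hypothetical names, and resetting the value after the first claim is an assumption made here so that only one notification is replaced; the comment itself only says that the first instance to see a non-null value replaces its content.

```cpp
#include <mutex>
#include <string>

// One instance per process: the memory event handler writes the message,
// and each notification being processed checks it before display.
class MemoryEventFlag {
public:
  // Called from the process-wide memory event handler.
  void report(const std::string &message) {
    std::lock_guard<std::mutex> lock(mutex_);
    message_ = message;
    pending_ = true;
  }

  // If a memory event is pending, copies its message into `content`,
  // clears the flag and returns true. Otherwise leaves `content`
  // untouched and returns false.
  bool claim(std::string &content) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!pending_) {
      return false;
    }
    content = message_;
    pending_ = false;
    return true;
  }

private:
  std::mutex mutex_;
  std::string message_;
  bool pending_ = false;
};
```

Because the check and the reset happen under the same lock, two notifications processed at the same time cannot both claim the same memory event.
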
It feels like this diff should be broken down into at least 3: changes to temporary file name, monitoring memory usage, and changes to support multiple contentHandlers. Can you break it down to make it easier to review? (Feel free to break it down into more than 3 if you prefer.)

I didn't have time to look into the concurrency / locking stuff – hoping that @tomek and @bartek can help there, as my time is increasingly limited starting now.

If I am unable to follow up on this Request Changes after your next update, then feel free to have one of the other reviewers confirm that the feedback was addressed, and then remove me from the list of reviewers.

native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp
97–101 ↗	(On Diff #30130)	Good theory! I'm not sure why I saw no logs for the failure case... but perhaps after a crash, Apple "penalizes" us and doesn't start the NSE for some time. That would also explain why we tend to see several of these failures in a row. It appears that you're addressing the potential issue here with two changes. One question for each change:
- Making sure each thread/process uses a distinct file with a random suffix: Do we need any corresponding code change for the code that reads the file(s)? Or is that code just reading any file within a target directory?
- Removing the line that deletes the file: Is there any concern about unrestrained growth in the size of this file if we're not deleting it?

This revision now requires changes to proceed.Aug 21 2023, 2:25 PM

marcin added inline comments.Aug 22 2023, 2:36 AM

native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp
97–101 ↗	(On Diff #30130)	Making sure each thread/process uses a distinct file with a random suffix: The reason for introducing a temporary file here is to have atomic reads and writes of the main olm session file. The content of the main olm file is read into memory, the in-memory version is modified, the modified content is written to a temporary file, and finally the temporary file is renamed to replace the main olm session file. That said, there is no code reading this file, and no code accesses this file outside of this method. Did I understand your question correctly?

Removing the line that deletes the file: The temporary file is removed each time an error occurs and is deleted at the end of this method (it is not currently deleted, but I will add this line; I forgot to do it, thanks for catching). Is there any concern then? Although unlikely, some uncaught error (or something like a power-off) might occur just before we remove the temporary file. It will not grow, since any subsequent call to this method will create a new temporary file, but it may just stay there undeleted. To be honest, I don't have a better idea for solving this problem than some scheduled cleanup. Android and iOS have tools for this. Perhaps we could do it together with the downloaded blobs cleanup? Alternatively, we could use a directory search API to check for files that are older than some threshold and delete them. Such code could be executed at the beginning of this function or in the NSE code just before the decrypted and persisted notification is displayed.
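
The write-then-rename pattern described in this reply can be sketched in C++ as below. This is a simplified illustration and not the code in `NotificationsCryptoModule.cpp`: `atomicWrite` and the `.tmp` naming are hypothetical, the caller is assumed to pass a suffix that is unique per thread/process, and the sketch relies on `std::rename` replacing the destination in one step, which holds on POSIX systems such as iOS and Android but not on every platform.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Replaces the file at `path` with `content` by writing to a uniquely
// named temporary file first and then renaming it over the target.
// Returns true on success; on failure the temporary file is removed.
bool atomicWrite(const std::string &path, const std::string &content,
                 const std::string &uniqueSuffix) {
  // Hypothetical naming scheme for the temporary file.
  const std::string tempPath = path + "." + uniqueSuffix + ".tmp";

  bool written = false;
  {
    std::ofstream temp(tempPath, std::ios::binary | std::ios::trunc);
    if (temp) {
      temp << content;
      temp.flush();
      written = static_cast<bool>(temp);
    }
  } // The stream is closed here, before the rename.

  if (!written) {
    std::remove(tempPath.c_str());
    return false;
  }

  // On POSIX, rename() atomically replaces an existing destination, so
  // readers see either the old content or the new content.
  if (std::rename(tempPath.c_str(), path.c_str()) != 0) {
    std::remove(tempPath.c_str());
    return false;
  }
  return true;
}
```

Since every caller writes to its own temporary file, no caller can delete a file that another one is still writing to. A crash between the write and the rename can still leave a stale temporary file behind, which is the cleanup concern raised in the reply.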

I didn't have time to look into the concurrency / locking stuff – hoping that @tomek and @bartek can help there, as my time is increasingly limited starting now.

Sure! But it will probably be more efficient after the diff gets split.

Temporary file is removed each time an error occurs and is deleted at the end of this method (it is not currently deleted but I will add this line - forgot to do it thanks for catching). Is there any concern then? Although unlikely some uncaught error (or sth like power-off) might occur just before we remove the temporary file. It will not grow since any subsequent call to this method will create new temporary file but it may just stay there undeleted. To be honest I don't have a better idea on how to solve this problem than some scheduled cleanup. Android and iOS have tools for this. Perhaps we could do it together with downloaded blobs cleanup? Alternatively we could use directory search API to check for files that are older than some threshold and delete them. Such code could be executed at the beginning of this function or in the NSE code just before decrypted and persisted notification is displayed.

Maybe we can consider using system's temp directory?

native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp
27–28 ↗	(On Diff #30130)	Seems like a really huge value. If we're choosing from alphanumeric values this gives more than 10^60 possible suffixes. Choosing 8 chars feels big enough, and avoids depleting entropy too much.
94 ↗	(On Diff #30130)	This random string may contain a space, but I guess it's safe on both platforms.
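
One way to address both inline comments above is to draw a short suffix from an alphanumeric alphabet that contains no space. The C++ sketch below is an illustration only: `randomSuffix` is a hypothetical helper, not the function in the diff, and seeding `std::mt19937` once from `std::random_device` is an assumption made here to limit how often the entropy source is read. With 62 possible characters, 8 positions give 62^8, roughly 2.2 × 10^14, distinct suffixes.

```cpp
#include <cstddef>
#include <random>
#include <string>

// Returns a random suffix made only of ASCII letters and digits, so it
// can never contain a space or a path separator.
std::string randomSuffix(std::size_t length = 8) {
  static const std::string alphabet =
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      "abcdefghijklmnopqrstuvwxyz"
      "0123456789";

  // Seed a pseudo-random generator once per call from the system
  // entropy source instead of reading that source for every character.
  std::mt19937 generator(std::random_device{}());
  std::uniform_int_distribution<std::size_t> pick(0, alphabet.size() - 1);

  std::string suffix;
  suffix.reserve(length);
  for (std::size_t i = 0; i < length; ++i) {
    suffix.push_back(alphabet[pick(generator)]);
  }
  return suffix;
}
```

A 32-bit seed limits how many distinct suffixes this particular sketch can actually produce, so code that needs stronger uniqueness guarantees would want a better-seeded generator.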

marcin mentioned this in D8905: Introduce thread-safety to NotificationsCryptoModule during cuncurrent access.Aug 22 2023, 3:54 AM

Make this diff contain synchronization implementation only.

marcin added a parent revision: D8907: Introduce memory monitoring in NSE.Aug 22 2023, 4:03 AM

marcin removed a parent revision: D8638: Possible solution for broken iOS notifications.

marcin removed a parent revision: D8803: Implement isStaffRelease method in PlatformSpecificTools.

marcin retitled this revision from Modify notification to display debugging messages in case an error in the NSE occurs to Implement synchronization mechanisms to address process, thread and class level concurrency..

Rebase

Harbormaster failed remote builds in B21996: Diff 30205!Aug 22 2023, 4:17 AM

marcin added inline comments.Aug 22 2023, 4:24 AM

native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp
94 ↗	(On Diff #30130)	This is a valid observation. I will check since I am not sure.

Harbormaster failed remote builds in B22000: Diff 30209!Aug 22 2023, 4:33 AM

In D8795#262588, @tomek wrote:

Maybe we can consider using system's temp directory?

I did some research and here are my findings:

iOS:
- There is a method to get the temporary directory for an app/app group, NSTemporaryDirectory: https://developer.apple.com/documentation/foundation/1409211-nstemporarydirectory?language=objc
- Found an article that states those files are deleted by system after some time of inactivity: https://nshipster.com/temporary-files/
Android:
- There is a method getCacheDir(): https://developer.android.com/reference/android/content/Context#getCacheDir()
- The system can delete those files when it needs storage space, but it is not clear whether they will be deleted on some regular basis.

Looks like it might be worth using the methods above. However, we will need to introduce a new method in PlatformSpecificTools that calls the relevant method on each platform. Confirmed with @tomek that it should be a follow-up task.

native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp
94 ↗	(On Diff #30130)	I couldn't find any resources for this so I think it is better to remove potential spaces.

Rebase to fix CI

Harbormaster completed remote builds in B22004: Diff 30213.Aug 22 2023, 5:44 AM

marcin edited the test plan for this revision.Aug 22 2023, 7:13 AM

Mostly nits and minor requests

Still hoping somebody else can review the synchronization more closely, but it seems rather uncontroversial in this diff

native/ios/NotificationService/NotificationService.mm
123 ↗	(On Diff #30213)	Nit: should this line be realigned now that the line above has been shortened?
157 ↗	(On Diff #30213)	Nit: I think a space should be added here, so that it's clear that lines 158-164 are not part of step 5
167 ↗	(On Diff #30213)	This function has been significantly changed. I'm assuming the changes are inspired by another implementation – if so, can you link the other code that inspired you here?
398 ↗	(On Diff #30213)	I think it's a little confusing that we return `content` here given that we modify it in-place. Seeing it returned seems to imply we create a new one. But then again, maybe it's more confusing to skip an assignment at the callsite... it might not be clear that any mutation is occurring. I would probably suggest removing the `return` and renaming the function to make it clearer that it mutates the `content` in-place, but I'm curious about other people's perspectives.
401 ↗	(On Diff #30213)	Typo
405 ↗	(On Diff #30213)	Might be worth explaining why we don't trust it – it looks like other open-source NSE implementations don't trust it either (Signal, Telegram, Element)

In D8795#262651, @marcin wrote:

Make this diff contain synchronization implementation only.

Should we update the summary?

Overall looks good and correct, but I have a couple of important questions.

native/ios/NotificationService/NotificationService.mm
53 ↗	(On Diff #30213)	Shouldn't we set up these in `init`?
176–226 ↗	(On Diff #30213)	This approach is risky because we can delete a handler without calling it. It can happen e.g. when an exception would be thrown somewhere inside the `for` loop. We can consider synchronizing this whole block which might be expensive for other threads - and still, they should expect to see an empty collection of handlers. Also, we don't have any way of making read-callHandler-delete a truly atomic operation, because it might fail in the middle. So it's a tradeoff between sometimes not calling the handler and sometimes calling it twice. We probably don't know if Apple would penalize us for not calling a handler... Overall, it seems like the chosen approach is the best, but it has some disadvantages.
420 ↗	(On Diff #30213)	Can't we always synchronize on `self`?

This revision now requires changes to proceed.Aug 22 2023, 10:09 AM

marcin added inline comments.Aug 23 2023, 7:19 AM

native/ios/NotificationService/NotificationService.mm
53 ↗	(On Diff #30213)	I would be scared to override `init` for this class. The Apple docs are rather strict that we should not aim to instantiate NSE ourselves, and that `didReceive...` and `serviceExtensionTimeWillExpire` are the only API we should use. Additionally, Signal does global setup in the `didReceive..` callback: https://github.com/signalapp/Signal-iOS/blob/6cc18788850558251db5f772ca62bce6b576dfc4/SignalNSE/NotificationService.swift#L111. This is the particular reason I made this method idempotent.
123 ↗	(On Diff #30213)	Unfortunately there is nothing we can do about it since it was what the formatter gives us.
167 ↗	(On Diff #30213)	Actually this implementation is our own invention. I examined the Signal and Element code. Signal just silences the current content: https://github.com/signalapp/Signal-iOS/blob/6cc18788850558251db5f772ca62bce6b576dfc4/SignalNSE/NotificationService.swift#L166. This suggests that they don't believe the same NSE can be used by two threads at the same time. On the other hand, they keep the `contentHandler` of the current NSE as an atomic variable: https://github.com/signalapp/Signal-iOS/blob/6cc18788850558251db5f772ca62bce6b576dfc4/SignalNSE/NotificationService.swift#L37. Element does a similar thing to us (they keep a collection of handlers): https://github.com/vector-im/element-ios/blob/develop/RiotNSE/NotificationService.swift#L38 but their way of handling the NSE timeout is just to give up: https://github.com/vector-im/element-ios/blob/develop/RiotNSE/NotificationService.swift#L147 After discussion with @tomek we agreed that it looks like those apps were aware of the possibility of two threads accessing the same NSE instance at a time, but didn't have a good idea of how to handle it in the case of `serviceExtensionTimeWillExpire`, since we don't know which notification it is called for. I therefore concluded that, since we don't know which notification `serviceExtensionTimeWillExpire` is called for, we will try to handle all notifications that are currently being processed.
176–226 ↗	(On Diff #30213)	We probably don't know if Apple would penalize us for not calling a handler... It is not penalized - notification is just displayed unchanged.
420 ↗	(On Diff #30213)	`setUpNSEInstance` is synchronized on `self` so it is a minor optimization to use different lock.
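
The idempotent setup mentioned in the reply on line 53 can be illustrated with the following C++ sketch. `SetupGuard`, `setUp` and `setupRuns` are hypothetical names, and a `std::mutex` stands in for the `@synchronized (self)` block that the reply on line 420 says guards `setUpNSEInstance`; the real method's body is not shown in this thread, so the setup work here is a placeholder.

```cpp
#include <mutex>

// Setup that is safe to request at the start of every callback: the
// work runs once, and later calls return immediately.
class SetupGuard {
public:
  // Returns true only for the call that actually performed the setup.
  bool setUp() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (ready_) {
      return false;
    }
    // Placeholder for the real work, e.g. allocating the collections
    // that hold contents and handlers.
    ++setupRuns_;
    ready_ = true;
    return true;
  }

  // Number of times the setup work ran; stays at 1 after the first call.
  int setupRuns() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return setupRuns_;
  }

private:
  mutable std::mutex mutex_;
  bool ready_ = false;
  int setupRuns_ = 0;
};
```

Calling `setUp()` from a callback such as `didReceive...` rather than from `init` keeps the class within the API surface that the reply says Apple expects, at the cost of one short lock acquisition per notification.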

Reply to review

Harbormaster completed remote builds in B22042: Diff 30258.Aug 23 2023, 7:40 AM

marcin edited the summary of this revision.Aug 24 2023, 8:03 AM

Makes sense to me but might be a good idea for @bartek to also review this

native/ios/NotificationService/NotificationService.mm
172–185 ↗	(On Diff #30258)	Might be worth considering having a single array of pair-like structures.

tomek accepted this revision.Aug 24 2023, 8:52 AM

As far as I'm familiar with Obj-C synchronization mechanisms, this code makes sense to me

native/ios/NotificationService/NotificationService.mm
172–185 ↗	(On Diff #30258)	I think it could degrade readability

This revision is now accepted and ready to land.Aug 25 2023, 7:42 AM

Refactor to reflect changes in parent differential

Harbormaster completed remote builds in B22214: Diff 30490.Aug 29 2023, 5:10 AM

Rebase to reflect changes in parent differential

Harbormaster completed remote builds in B22309: Diff 30649.Aug 31 2023, 5:49 AM

Rebase before landing

Harbormaster completed remote builds in B22429: Diff 30841.Sep 7 2023, 8:59 AM

Closed by commit rCOMMaea53f271c7d: Implement synchronization mechanisms to address process, thread and class level… (authored by marcin).Sep 7 2023, 9:03 AM

This revision was automatically updated to reflect the committed changes.

marcin mentioned this in rCOMM70cf45852540: Introduce thread-safety to NotificationsCryptoModule during cuncurrent access.

marcin added a commit: rCOMMaea53f271c7d: Implement synchronization mechanisms to address process, thread and class level….

ashoat added inline comments.Oct 3 2023, 11:08 PM

native/ios/NotificationService/NotificationService.mm
147	This line doesn't do anything. To have the same effect as updating `self.bestAttemptContent` previously, you would need to call your `putContent` method again

marcin added inline comments.Oct 6 2023, 9:22 AM

native/ios/NotificationService/NotificationService.mm
147	You are right that this line is not necessary. However, there is no need to call `putContent` here. Unless we exceed the time limit, the actual content displayed to the user is going to be the `publicUserContent` variable. The contents that we store in the synchronized dictionary (stored there by `putContent`) are only used in `serviceExtensionTimeWillExpire`, to know what the notification type was so that we can display an appropriate timeout error message. In `serviceExtensionTimeWillExpire`, if we detect that the content was badgeOnly content, we will construct badgeOnlyNotif again. Therefore we only have to delete this line.

marcin mentioned this in D9400: Minor improvments to NSE code.Oct 6 2023, 9:31 AM

marcin mentioned this in rCOMM6835d8e3a1e6: Minor improvments to NSE code.Oct 19 2023, 8:54 AM
