This differential implements synchronisation mechanisms in NSE so that it is safe for two different threads to use the same NSE instance at the same time to process different notifications.
Details
I was never able to observe NotificationService run in parallel on separate threads to process different notifications so I am not sure if it is actually possible to properly test this differential. I tested that notifications work correctly in normal circumstances and that enforcing to call serviceExtensionTimeWillExpire by putting sleep() works correctly - notification is either displayed decrypted or with proper error message.
Discussion here: https://stackoverflow.com/questions/62566948/notification-service-extension-lifecycle, mentions that it is possible to launch two NSE processes with debugger. However inter-process safety is already tested by parent differentials. For this differential it would be necessary to have two NSE thread operate on one NSE class to process two different notifications. I think it is not possible to recreate such conditions.
Diff Detail
- Repository
- rCOMM Comm
- Branch
- marcin/eng-4453
- Lint
No Lint Coverage - Unit
No Test Coverage
Event Timeline
native/ios/NotificationService/NotificationService.mm | ||
---|---|---|
55 | I don't think so. Not sure what the goal was, but:
So I don't think we need to synchronize, but I also don't know why it is needed at all. | |
152 | We probably don't need this comment anymore. | |
153 | Shouldn't we unregister every time we return from this method? If that's the case, we should create a separate method that wraps this one so that we can unregister in one place, instead of multiple ones. | |
360–368 | What do we try to achieve here? If we assume that an instance of this class can be used by multiple threads, this code won't solve that - it would be still possible for two threads to call self.contentHandlerCalled = NO; and then to call threadSafeContentHandlerCall, which would make the second call invalid. If we are sure that the instance is used by one thread, then synchronizing the block is not necessary - the whole method is executed on one thread and we have a happen-before relationship. |
native/ios/NotificationService/NotificationService.mm | ||
---|---|---|
55 | I hope that the answer to the last @tomek 's comment answers this as well | |
153 | That is correct - I forgot to add unregister calls before all returns. | |
360–368 | There is more to the picture here and the question "can this class be accessed from multiple threads or not" is not exhaustive enough. There are two "types" of threads here, The first is the thread that iOS creates to call didReceive... callback on it. The second is the thread that we introduce by calling registerForMemoryEvents method that will be calling completionHandler of this instance (technically we are not creating additional thread - we just let the already existing main thread of the process call lambda that access this instance).
In case 1 is true then this synchronization is necessary and is correct since we have two threads that can call completion handler - the thread iOS created for this NSE class and the thread that will respond to memory events. And the line at 55 does not need synchronization since it will be called only from the thread that iOS created for this NSE instance. In case 2 is not true then this synchronization is not correct and @ashoat is right that this diff may introduce regressions. However if 2 is correct our code was, in fact, invalid from the start even before we introduced e2e notifs. e2e notifs only revealed the issue as we are actually mutatin notification content (previously we only used to persist it in a file). Why did I think that 1 is correct? An example from Apple docs: https://developer.apple.com/documentation/usernotifications/modifying_content_in_newly_delivered_notifications?language=objc shows usage of NSE that is not thread-safe in case of multiple threads accessing the same NSE instance. What makes me think that 2 might actually be correct? Additionally element app keeps a hash table of contents and handlers that correspond to each other: https://github.com/vector-im/element-ios/blob/develop/RiotNSE/NotificationService.swift#L135-L136 I hope that the above answers @ashoat's questions:
I linked the code.
If Signal's is right then this diff might introduce some regressions. In summary: That said this differential needs refactor to adopt the threading patterns of Signal's and Telegram |
Introduce thread safety after dropping assumptions about sequentiality of notification procesing by NSE.
native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp | ||
---|---|---|
97–101 ↗ | (On Diff #30130) | After discussion with @tomek we agreed that this code might be our primary suspect. One thread/process creates a temporary file and starts writing to it. Then another thread/process starts to execute this code and removes temporary file the first thread is writing to which causes the first thread to crash. |
native/ios/NotificationService/NotificationService.mm | ||
40 ↗ | (On Diff #30130) | I we assume that it is possible for the same NSE instance to be used by many thread at once to process different notifications then we have to keep a collection on handlers and contents and accesses to this collection must be synchronized. I considered folly's implementation of ConcurrentHashMap: https://github.com/facebook/folly/blob/main/folly/concurrency/ConcurrentHashMap.h, however it was not sufficient for our case. In serviceExtensionTimeWillExpire we must atomically retrieve all contents and handlers and remove them from hash maps to prevent other threads from accessing them. ConcurrentHashMap does not offer the getAndDeleteAll- like functionality. Therefore this code uses custom mutex-based locking scheme to ensure atomicity of any block of code. |
430 ↗ | (On Diff #30130) | I want to make sure that whatever occurs inside @synchronized block is as fast as possible. That said the only operations that take place in @synchronized blocks are hash map keys additions/deletions. Content handler blocks are provided by the system so they might be potentially costly to call so I want to avoid calling them in @synchronized blocks. |
480 ↗ | (On Diff #30130) | Memory events are delivered at process level. Therefore if we decide to create separate memorySource for each thread/NSE instance we could see redundant messages about memory issue in case one happens. That said I decided that memorySource will be set once per process. In case memory event occurs the event handler will substitute a string value that is initially null. Each NSE instance belonging to the process will check for this value before displaying notification. The first one that sees non-null value will replace the notification content it was about to display with memory error notification. |
It feels like this diff should be broken down into at least 3: changes to temporary file name, monitoring memory usage, and changes to support multiple contentHandlers. Can you break it down to make it easier to review? (Feel free to break it down into more than 3 if you prefer.)
I didn't have time to look into the concurrency / locking stuff – hoping that @tomek and @bartek can help there, as my time is increasingly limited starting now.
If I am unable to follow-up on this Request Changes after your next update, then feel free to have one of the other reviews confirm that the feedback was addressed, and then remove me from the list of reviewers.
native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp | ||
---|---|---|
97–101 ↗ | (On Diff #30130) | Good theory! I'm not sure why I saw no logs for the failure case... but perhaps after a crash, Apple "penalizes" us and doesn't start the NSE for some time. That would also explain why we tend to see several of these failures in a row. It appears that you're addressing the potential issue here with two changes. One question for each change:
|
native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp | ||
---|---|---|
97–101 ↗ | (On Diff #30130) |
The reason to introduce temporary file here is to have atomic reads and writes to main olm session file. The content of main olm file is read into memory, in-memory version is modified, modified content is written to temporary file and finally temporary file is renamed with the main olm session file. That said there is no code reading this file and there is no code accessing this file outside of this method. Did I correctly understand your question?
Temporary file is removed each time an error occurs and is deleted at the end of this method (it is not currently deleted but I will add this line - forgot to do it thanks for catching). Is there any concern then? Although unlikely some uncaught error (or sth like power-off) might occur just before we remove the temporary file. It will not grow since any subsequent call to this method will create new temporary file but it may just stafy there undeleted. To be honest I don't have a better idea on how to solve this problem than some scheduled cleanup. Android and iOS have tools for this. Perhaps we could do it together with downloaded blobs cleanup? Alternatively we could use directory search API to check for files that are older that some threshold and delete them. Such code could be executed at the beginning of this function or in the NSE code just before decrypted and persisted notification is displayed. |
I didn't have time to look into the concurrency / locking stuff – hoping that @tomek and @bartek can help there, as my time is increasingly limited starting now.
Sure! But probably it will be more efficient after the diff get split
Temporary file is removed each time an error occurs and is deleted at the end of this method (it is not currently deleted but I will add this line - forgot to do it thanks for catching). Is there any concern then? Although unlikely some uncaught error (or sth like power-off) might occur just before we remove the temporary file. It will not grow since any subsequent call to this method will create new temporary file but it may just stafy there undeleted. To be honest I don't have a better idea on how to solve this problem than some scheduled cleanup. Android and iOS have tools for this. Perhaps we could do it together with downloaded blobs cleanup? Alternatively we could use directory search API to check for files that are older that some threshold and delete them. Such code could be executed at the beginning of this function or in the NSE code just before decrypted and persisted notification is displayed.
Maybe we can consider using system's temp directory?
native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp | ||
---|---|---|
27–28 ↗ | (On Diff #30130) | Seems like a really huge value. If we're choosing from alphanumeric values this gives more than 10^60 possible suffixes. Choosing 8 chars feels big enough, and avoids depleting entropy too much. |
94 ↗ | (On Diff #30130) | This random string may contain a space, but I guess it's safe on both platforms. |
native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp | ||
---|---|---|
94 ↗ | (On Diff #30130) | This is valid observation. I will check since I am not sure. |
I did some research and here are my findings:
- iOS:
- There is a method to get temporary directory for an app/app group NSTemporaryDirectory:https://developer.apple.com/documentation/foundation/1409211-nstemporarydirectory?language=objc
- Found an article that states those files are deleted by system after some time of inactivity: https://nshipster.com/temporary-files/
- Android:
- There is a method getCacheDir(): https://developer.android.com/reference/android/content/Context#getCacheDir()
- The system can delete those files when it needs memory but it is not clear whether they will be deleted on some regular basis.
Looks like it might be worth to use methods above. However we will need to introduce new method in PlatformSpecificTools that would call relevant method on each platform. Confirmed with @tomek
that it should be a follow-up task.
native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp | ||
---|---|---|
94 ↗ | (On Diff #30130) | I couldn't find any resources for this so I think it is better to remove potential spaces. |
Mostly nits and minor requests
Still hoping somebody else can review the synchronization more closely, but it seems rather uncontroversial in this diff
native/ios/NotificationService/NotificationService.mm | ||
---|---|---|
123 ↗ | (On Diff #30213) | Nit: should this line be realigned now that the line above has been shortened? |
157 ↗ | (On Diff #30213) | Nit: I think a space should be added here, so that it's clear that lines 158-164 are not part of step 5 |
167 ↗ | (On Diff #30213) | This function has been significantly changed. I'm assuming the changes are inspired by another implementation – if so, can you link the other code that inspired you here? |
398 ↗ | (On Diff #30213) | I think it's a little confusing that we return content here given that we modify it in-place. Seeing it returned seems to imply we create a new one But then again, maybe it's more confused to skip an assignment at the callsite... it might not be clear that any mutation is occurring I would probably suggest removing the return and renaming the function to be more clear that it mutates the content in-place, but I'm curious for other people's perspectives |
401 ↗ | (On Diff #30213) | Typo |
405 ↗ | (On Diff #30213) | Might be worth explaining why we don't trust it – it looks like other open-source NSE implementations don't trust it either (Signal, Telegram, Element) |
Should we update the summary?
Overall looks good and correct, but I have a couple of important questions.
native/ios/NotificationService/NotificationService.mm | ||
---|---|---|
53 ↗ | (On Diff #30213) | Shouldn't we set up these in init? |
176–226 ↗ | (On Diff #30213) | This approach is risky because we can delete a handler without calling it. It can happen e.g. when an exception would be thrown somewhere inside the for loop. We can consider synchronizing this whole block which might be expensive for other threads - and still, they should expect to see an empty collection of handlers. Also, we don't have any way of making read-callHandler-delete a truly atomic operation, because it might fail in the middle. So it's a tradeoff between sometimes not calling the handler and sometimes calling it twice. We probably don't know if Apple would penalize us for not calling a handler... Overall, it seems like the chosen approach is the best, but it has some disadvantages. |
420 ↗ | (On Diff #30213) | Can't we always synchronize on self? |
native/ios/NotificationService/NotificationService.mm | ||
---|---|---|
53 ↗ | (On Diff #30213) | I would be scared to override init for this class. Apple docs are rather strict that we should no aim to instantiate NSE ourselves and didReceive... with serviceExtensionTimeWillExpire are the only API we should use. Additionally signal does global setup in didReceive.. callback: https://github.com/signalapp/Signal-iOS/blob/6cc18788850558251db5f772ca62bce6b576dfc4/SignalNSE/NotificationService.swift#L111. This is the particular reason I made this method idempotent. |
123 ↗ | (On Diff #30213) | Unfortunately there is nothing we can do about it since it was what the formatter gives us. |
167 ↗ | (On Diff #30213) | Actually this implementation is our own invention. I examined Signal and Element code.
After discussion with @tomek we agreed that it looks like those apps were aware of the possibility of two threads accessing the same NSE instance at a time but didn't have a good idea on how to handle it in case serviceExtensionTimeWillExpire since we don't know which notification is it called for. That said I concluded that since we don't know which notification serviceExtensionTimeWillExpire is called for we will try to handle all notifications that are currently being processed. |
176–226 ↗ | (On Diff #30213) |
It is not penalized - notification is just displayed unchanged. |
420 ↗ | (On Diff #30213) | setUpNSEInstance is synchronized on self so it is a minor optimization to use different lock. |
As far as I'm familiar with Obj-C synchronization mechanisms, this code makes sense to me
native/ios/NotificationService/NotificationService.mm | ||
---|---|---|
172–185 ↗ | (On Diff #30258) | I think it could degrade readibility |
native/ios/NotificationService/NotificationService.mm | ||
---|---|---|
147 ↗ | (On Diff #30845) | This line doesn't do anything. To have the same effect as updating self.bestAttemptContent previously, you would need to call your putContent method again |
native/ios/NotificationService/NotificationService.mm | ||
---|---|---|
147 ↗ | (On Diff #30845) | You are right that this line is not necessary. However there is no need to call putContent here. Unless we exceed time limit the actual content displayed to the user is going to be publicUserContent variable. Contents that we store in the synchronized dictionary (that are stored there by putContent) are only used in serviceExtensionTimeWillExpire to know what the notification type was to display appropriate timeout error message. In serviceExtensionTimeWillExpire if we detect that content was badgeOnly content we will construct badgeOnlyNotif again. Therefore we only have to delete this line. |