Page MenuHomePhabricator

Implement synchronization mechanisms to address process, thread and class level concurrency.
ClosedPublic

Authored by marcin on Aug 11 2023, 5:25 AM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, May 7, 8:03 AM
Unknown Object (File)
Mon, May 6, 4:54 AM
Unknown Object (File)
Sat, May 4, 4:40 AM
Unknown Object (File)
Thu, May 2, 5:22 AM
Unknown Object (File)
Apr 7 2024, 6:21 PM
Unknown Object (File)
Apr 7 2024, 6:21 PM
Unknown Object (File)
Apr 7 2024, 6:21 PM
Unknown Object (File)
Apr 7 2024, 6:21 PM
Subscribers
None

Details

Summary

This differential implements synchronisation mechanisms in NSE so that it is safe for two different threads to use the same NSE instance at the same time to process different notifications.

Test Plan

I was never able to observe NotificationService run in parallel on separate threads to process different notifications so I am not sure if it is actually possible to properly test this differential. I tested that notifications work correctly in normal circumstances and that enforcing to call serviceExtensionTimeWillExpire by putting sleep() works correctly - notification is either displayed decrypted or with proper error message.

Discussion here: https://stackoverflow.com/questions/62566948/notification-service-extension-lifecycle, mentions that it is possible to launch two NSE processes with debugger. However inter-process safety is already tested by parent differentials. For this differential it would be necessary to have two NSE thread operate on one NSE class to process two different notifications. I think it is not possible to recreate such conditions.

Diff Detail

Repository
rCOMM Comm
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
native/ios/NotificationService/NotificationService.mm
55 ↗(On Diff #29855)

I don't think so. Not sure what the goal was, but:

  1. If the goal was to set this value in one thread and then read in multiple, that would be probably incorrect, as only the result of one call of threadSafeContentHandlerCall would result in calling contentHandlerCalled - so we would ignore all but one notifications.
  2. If the goal is to set this value in one thread and read in the same, then it should be guaranteed that this value is visible later.

So I don't think we need to synchronize, but I also don't know why it is needed at all.

152 ↗(On Diff #29855)

We probably don't need this comment anymore.

153 ↗(On Diff #29855)

Shouldn't we unregister every time we return from this method? If that's the case, we should create a separate method that wraps this one so that we can unregister in one place, instead of multiple ones.

360–368 ↗(On Diff #29855)

What do we try to achieve here? If we assume that an instance of this class can be used by multiple threads, this code won't solve that - it would be still possible for two threads to call self.contentHandlerCalled = NO; and then to call threadSafeContentHandlerCall, which would make the second call invalid. If we are sure that the instance is used by one thread, then synchronizing the block is not necessary - the whole method is executed on one thread and we have a happen-before relationship.

This revision now requires changes to proceed.Aug 16 2023, 2:29 AM
native/ios/NotificationService/NotificationService.mm
55 ↗(On Diff #29855)

I hope that the answer to the last @tomek 's comment answers this as well

153 ↗(On Diff #29855)

That is correct - I forgot to add unregister calls before all returns.

360–368 ↗(On Diff #29855)

There is more to the picture here and the question "can this class be accessed from multiple threads or not" is not exhaustive enough. There are two "types" of threads here, The first is the thread that iOS creates to call didReceive... callback on it. The second is the thread that we introduce by calling registerForMemoryEvents method that will be calling completionHandler of this instance (technically we are not creating additional thread - we just let the already existing main thread of the process call lambda that access this instance).
That said we might have two cases:

  1. The iOS always creates new instance of NSE class for each thread that processes notification or reuse the instance but sequentially.
  2. The iOS might let two different threads access the same NSE class instance simultaneously.

In case 1 is true then this synchronization is necessary and is correct since we have two threads that can call completion handler - the thread iOS created for this NSE class and the thread that will respond to memory events. And the line at 55 does not need synchronization since it will be called only from the thread that iOS created for this NSE instance.

In case 2 is not true then this synchronization is not correct and @ashoat is right that this diff may introduce regressions. However if 2 is correct our code was, in fact, invalid from the start even before we introduced e2e notifs. e2e notifs only revealed the issue as we are actually mutatin notification content (previously we only used to persist it in a file).

Why did I think that 1 is correct? An example from Apple docs: https://developer.apple.com/documentation/usernotifications/modifying_content_in_newly_delivered_notifications?language=objc shows usage of NSE that is not thread-safe in case of multiple threads accessing the same NSE instance.

What makes me think that 2 might actually be correct?
Signal's NSE code preambule: https://github.com/signalapp/Signal-iOS/blob/main/SignalNSE/NotificationService.swift#L10-L30
and the fact that they use atomic variables for contentHandler instead of directly accessing self: https://github.com/signalapp/Signal-iOS/blob/main/SignalNSE/NotificationService.swift#L37

Additionally element app keeps a hash table of contents and handlers that correspond to each other: https://github.com/vector-im/element-ios/blob/develop/RiotNSE/NotificationService.swift#L135-L136

I hope that the above answers @ashoat's questions:

Provide more detail (including a link to the code) about Signal's beliefs?

I linked the code.

If you are confident that this diff won't introduce additional regressions in the case that Signal is right, can you please talk through the various "race" scenarios here, and explain why the behavior will be unchanged?

If Signal's is right then this diff might introduce some regressions.

In summary:
I wrote this differential prioritising Apple docs more than Signal/Telegram/Element and believing that 1 is true. I hope that I justified that this diff is correct in case 1 is true and we trust the docs. However yesterday during our 1:1 @ashoat advised me to prioritise the code more than the docs.

That said this differential needs refactor to adopt the threading patterns of Signal's and Telegram

Introduce thread safety after dropping assumptions about sequentiality of notification procesing by NSE.

native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp
97–101 ↗(On Diff #30130)

After discussion with @tomek we agreed that this code might be our primary suspect. One thread/process creates a temporary file and starts writing to it. Then another thread/process starts to execute this code and removes temporary file the first thread is writing to which causes the first thread to crash.

native/ios/NotificationService/NotificationService.mm
40 ↗(On Diff #30130)

I we assume that it is possible for the same NSE instance to be used by many thread at once to process different notifications then we have to keep a collection on handlers and contents and accesses to this collection must be synchronized.

I considered folly's implementation of ConcurrentHashMap: https://github.com/facebook/folly/blob/main/folly/concurrency/ConcurrentHashMap.h, however it was not sufficient for our case. In serviceExtensionTimeWillExpire we must atomically retrieve all contents and handlers and remove them from hash maps to prevent other threads from accessing them. ConcurrentHashMap does not offer the getAndDeleteAll- like functionality. Therefore this code uses custom mutex-based locking scheme to ensure atomicity of any block of code.

430 ↗(On Diff #30130)

I want to make sure that whatever occurs inside @synchronized block is as fast as possible. That said the only operations that take place in @synchronized blocks are hash map keys additions/deletions. Content handler blocks are provided by the system so they might be potentially costly to call so I want to avoid calling them in @synchronized blocks.

480 ↗(On Diff #30130)

Memory events are delivered at process level. Therefore if we decide to create separate memorySource for each thread/NSE instance we could see redundant messages about memory issue in case one happens. That said I decided that memorySource will be set once per process. In case memory event occurs the event handler will substitute a string value that is initially null. Each NSE instance belonging to the process will check for this value before displaying notification. The first one that sees non-null value will replace the notification content it was about to display with memory error notification.

ashoat requested changes to this revision.Aug 21 2023, 2:25 PM

It feels like this diff should be broken down into at least 3: changes to temporary file name, monitoring memory usage, and changes to support multiple contentHandlers. Can you break it down to make it easier to review? (Feel free to break it down into more than 3 if you prefer.)

I didn't have time to look into the concurrency / locking stuff – hoping that @tomek and @bartek can help there, as my time is increasingly limited starting now.

If I am unable to follow-up on this Request Changes after your next update, then feel free to have one of the other reviews confirm that the feedback was addressed, and then remove me from the list of reviewers.

native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp
97–101 ↗(On Diff #30130)

Good theory! I'm not sure why I saw no logs for the failure case... but perhaps after a crash, Apple "penalizes" us and doesn't start the NSE for some time. That would also explain why we tend to see several of these failures in a row.

It appears that you're addressing the potential issue here with two changes. One question for each change:

  1. Making sure each thread/process uses a distinct file with a random suffix: Do we need any corresponding code change for the code that reads the file(s)? Or is that code just reading any file within a target directory?
  2. Removing the line that deletes the file: Is there any concern about unrestrained growth in the size of this file if we're not deleting it?
This revision now requires changes to proceed.Aug 21 2023, 2:25 PM
native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp
97–101 ↗(On Diff #30130)

Making sure each thread/process uses a distinct file with a random suffix

The reason to introduce temporary file here is to have atomic reads and writes to main olm session file. The content of main olm file is read into memory, in-memory version is modified, modified content is written to temporary file and finally temporary file is renamed with the main olm session file. That said there is no code reading this file and there is no code accessing this file outside of this method. Did I correctly understand your question?

Removing the line that deletes the file:

Temporary file is removed each time an error occurs and is deleted at the end of this method (it is not currently deleted but I will add this line - forgot to do it thanks for catching). Is there any concern then? Although unlikely some uncaught error (or sth like power-off) might occur just before we remove the temporary file. It will not grow since any subsequent call to this method will create new temporary file but it may just stafy there undeleted. To be honest I don't have a better idea on how to solve this problem than some scheduled cleanup. Android and iOS have tools for this. Perhaps we could do it together with downloaded blobs cleanup? Alternatively we could use directory search API to check for files that are older that some threshold and delete them. Such code could be executed at the beginning of this function or in the NSE code just before decrypted and persisted notification is displayed.

tomek requested changes to this revision.Aug 22 2023, 3:30 AM

I didn't have time to look into the concurrency / locking stuff – hoping that @tomek and @bartek can help there, as my time is increasingly limited starting now.

Sure! But probably it will be more efficient after the diff get split

Temporary file is removed each time an error occurs and is deleted at the end of this method (it is not currently deleted but I will add this line - forgot to do it thanks for catching). Is there any concern then? Although unlikely some uncaught error (or sth like power-off) might occur just before we remove the temporary file. It will not grow since any subsequent call to this method will create new temporary file but it may just stafy there undeleted. To be honest I don't have a better idea on how to solve this problem than some scheduled cleanup. Android and iOS have tools for this. Perhaps we could do it together with downloaded blobs cleanup? Alternatively we could use directory search API to check for files that are older that some threshold and delete them. Such code could be executed at the beginning of this function or in the NSE code just before decrypted and persisted notification is displayed.

Maybe we can consider using system's temp directory?

native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp
27–28 ↗(On Diff #30130)

Seems like a really huge value. If we're choosing from alphanumeric values this gives more than 10^60 possible suffixes. Choosing 8 chars feels big enough, and avoids depleting entropy too much.

94 ↗(On Diff #30130)

This random string may contain a space, but I guess it's safe on both platforms.

Make this diff contain synchronization implementation only.

marcin retitled this revision from Modify notification to display debugging messages in case an error in the NSE occurs to Implement synchronization mechanisms to address process, thread and class level concurrency..
native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp
94 ↗(On Diff #30130)

This is valid observation. I will check since I am not sure.

In D8795#262588, @tomek wrote:

Maybe we can consider using system's temp directory?

I did some research and here are my findings:

Looks like it might be worth to use methods above. However we will need to introduce new method in PlatformSpecificTools that would call relevant method on each platform. Confirmed with @tomek
that it should be a follow-up task.

native/cpp/CommonCpp/Notifications/BackgroundDataStorage/NotificationsCryptoModule.cpp
94 ↗(On Diff #30130)

I couldn't find any resources for this so I think it is better to remove potential spaces.

Mostly nits and minor requests

Still hoping somebody else can review the synchronization more closely, but it seems rather uncontroversial in this diff

native/ios/NotificationService/NotificationService.mm
123 ↗(On Diff #30213)

Nit: should this line be realigned now that the line above has been shortened?

157 ↗(On Diff #30213)

Nit: I think a space should be added here, so that it's clear that lines 158-164 are not part of step 5

167 ↗(On Diff #30213)

This function has been significantly changed. I'm assuming the changes are inspired by another implementation – if so, can you link the other code that inspired you here?

398 ↗(On Diff #30213)

I think it's a little confusing that we return content here given that we modify it in-place. Seeing it returned seems to imply we create a new one

But then again, maybe it's more confused to skip an assignment at the callsite... it might not be clear that any mutation is occurring

I would probably suggest removing the return and renaming the function to be more clear that it mutates the content in-place, but I'm curious for other people's perspectives

401 ↗(On Diff #30213)

Typo

405 ↗(On Diff #30213)

Might be worth explaining why we don't trust it – it looks like other open-source NSE implementations don't trust it either (Signal, Telegram, Element)

tomek requested changes to this revision.Aug 22 2023, 10:09 AM

Make this diff contain synchronization implementation only.

Should we update the summary?

Overall looks good and correct, but I have a couple of important questions.

native/ios/NotificationService/NotificationService.mm
53 ↗(On Diff #30213)

Shouldn't we set up these in init?

176–226 ↗(On Diff #30213)

This approach is risky because we can delete a handler without calling it. It can happen e.g. when an exception would be thrown somewhere inside the for loop. We can consider synchronizing this whole block which might be expensive for other threads - and still, they should expect to see an empty collection of handlers. Also, we don't have any way of making read-callHandler-delete a truly atomic operation, because it might fail in the middle. So it's a tradeoff between sometimes not calling the handler and sometimes calling it twice. We probably don't know if Apple would penalize us for not calling a handler...

Overall, it seems like the chosen approach is the best, but it has some disadvantages.

420 ↗(On Diff #30213)

Can't we always synchronize on self?

This revision now requires changes to proceed.Aug 22 2023, 10:09 AM
native/ios/NotificationService/NotificationService.mm
53 ↗(On Diff #30213)

I would be scared to override init for this class. Apple docs are rather strict that we should no aim to instantiate NSE ourselves and didReceive... with serviceExtensionTimeWillExpire are the only API we should use. Additionally signal does global setup in didReceive.. callback: https://github.com/signalapp/Signal-iOS/blob/6cc18788850558251db5f772ca62bce6b576dfc4/SignalNSE/NotificationService.swift#L111. This is the particular reason I made this method idempotent.

123 ↗(On Diff #30213)

Unfortunately there is nothing we can do about it since it was what the formatter gives us.

167 ↗(On Diff #30213)

Actually this implementation is our own invention. I examined Signal and Element code.

  1. Signal just silences current content: https://github.com/signalapp/Signal-iOS/blob/6cc18788850558251db5f772ca62bce6b576dfc4/SignalNSE/NotificationService.swift#L166. This suggests that they don't believe the same NSE can be used by two threads at the same time. On the other hand they keep contentHandler of current NSE as an atomic variable: https://github.com/signalapp/Signal-iOS/blob/6cc18788850558251db5f772ca62bce6b576dfc4/SignalNSE/NotificationService.swift#L37.
  2. Element on the other hand does a similar thing to us (they keep a collection of handlers): https://github.com/vector-im/element-ios/blob/develop/RiotNSE/NotificationService.swift#L38 but their way of handling NSE timeout is just to give up: https://github.com/vector-im/element-ios/blob/develop/RiotNSE/NotificationService.swift#L147

After discussion with @tomek we agreed that it looks like those apps were aware of the possibility of two threads accessing the same NSE instance at a time but didn't have a good idea on how to handle it in case serviceExtensionTimeWillExpire since we don't know which notification is it called for.

That said I concluded that since we don't know which notification serviceExtensionTimeWillExpire is called for we will try to handle all notifications that are currently being processed.

176–226 ↗(On Diff #30213)

We probably don't know if Apple would penalize us for not calling a handler...

It is not penalized - notification is just displayed unchanged.

420 ↗(On Diff #30213)

setUpNSEInstance is synchronized on self so it is a minor optimization to use different lock.

Makes sense to me but might be a good idea for @bartek to also review this

native/ios/NotificationService/NotificationService.mm
172–185 ↗(On Diff #30258)

Might be worth considering having a single array of pair-like structures.

As far as I'm familiar with Obj-C synchronization mechanisms, this code makes sense to me

native/ios/NotificationService/NotificationService.mm
172–185 ↗(On Diff #30258)

I think it could degrade readibility

This revision is now accepted and ready to land.Aug 25 2023, 7:42 AM
  1. Refactor to reflect changes in parent differential

Rebase to reflect changes in parent differential

native/ios/NotificationService/NotificationService.mm
147

This line doesn't do anything. To have the same effect as updating self.bestAttemptContent previously, you would need to call your putContent method again

native/ios/NotificationService/NotificationService.mm
147

You are right that this line is not necessary. However there is no need to call putContent here. Unless we exceed time limit the actual content displayed to the user is going to be publicUserContent variable. Contents that we store in the synchronized dictionary (that are stored there by putContent) are only used in serviceExtensionTimeWillExpire to know what the notification type was to display appropriate timeout error message. In serviceExtensionTimeWillExpire if we detect that content was badgeOnly content we will construct badgeOnlyNotif again. Therefore we only have to delete this line.