Details

Reviewers

kamil
bartek
kuba
ashoat

Commits

rCOMM4ddc00224434: [keyserver] Process new messages for search

Summary

issues: https://linear.app/comm/issue/ENG-3314/implement-a-function-for-processing-messages-for-search-and-storing,
https://linear.app/comm/issue/ENG-3315/implement-adding-processed-messages-to-db-when-new-messages-are
tokenizeAndStem splits the text into words, removes stopwords, and stems the remaining words. It returns an array. The second parameter is a boolean that tells it whether to keep the stopwords. Source
code: https://github.com/NaturalNode/natural/blob/master/lib/natural/stemmers/stemmer.js

Test Plan

Tested that when a new message is created, a proper field appears in the search table. Tested that when a message is edited (by passing an artificial edit message to processMessagesForSearch
function) a proper field is edited in the search table.
Run yarn jest search-utils

Diff Detail

Repository

rCOMM Comm

Lint

No Lint Coverage

Unit

No Test Coverage

Event Timeline

inka created this revision.Mar 15 2023, 11:31 AM

Herald added subscribers: atul, tomek, ashoat.Mar 15 2023, 11:31 AM

inka added parent revisions: D7076: [keyserver] Add search table to the db, D6962: [lib][keyserver] Add editing message type to messageTypes.Mar 15 2023, 11:32 AM

Harbormaster completed remote builds in B17341: Diff 23751.Mar 15 2023, 11:46 AM

inka requested review of this revision.Mar 15 2023, 11:46 AM

ashoat added inline comments.Mar 15 2023, 2:41 PM

keyserver/src/database/search_utils.js
36–38 ↗	(On Diff #23751)	Please format this like other SQL queries. Note the indentation and the spacing we use in the codebase. Please make sure you don't have any lines longer than 80 chars.

Address review and add early return when there is nothing to insert

Harbormaster completed remote builds in B17394: Diff 23834.Mar 20 2023, 10:53 AM

Looks good, but a couple notes in message-creator.js

keyserver/src/creators/message-creator.js
186 ↗	(On Diff #23834)	We should only pass `newMessageInfos` in here. The messages that are in `returnMessageInfos` but not `newMessageInfos` are messages that have already been created. Those messages should have already been indexed. This brings to mind a question: is the indexing process idempotent? Meaning, if I index the same message twice, will it be the same as if I indexed that message once?
207 ↗	(On Diff #23834)	I don't think we should block the return on this. Search indexing is usually implemented as a "post-processing step"... the user creating the message shouldn't need to wait on the search indexing to complete for the endpoint to return. Instead, I think we should include this in `postMessageSendPromise`. Can you move the call to `processMessagesForSearch` into `postMessageSend`? You can use the `messageInfos` parameter (`stripLocalIDs` should have no effect on indexing I think)
keyserver/src/database/search_utils.js
1 ↗	(On Diff #23834)	Can you name this file `search-utils.js` to match the naming convention in the codebase?

This revision now requires changes to proceed.Mar 20 2023, 9:14 PM

inka mentioned this in D7117: [keyserver] Add migration - process messages in our db for search.Mar 21 2023, 10:12 AM

inka added a child revision: D7117: [keyserver] Add migration - process messages in our db for search.Mar 21 2023, 10:12 AM

inka added inline comments.Mar 24 2023, 12:34 PM

keyserver/src/creators/message-creator.js
186 ↗	(On Diff #23834)	The indexing process is idempotent, but if we indexed a message that has a later edit, then this later edit needs to be indexed as well. Otherwise we would have the outdated content in the search table.
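
The idempotency point above can be illustrated with a minimal in-memory sketch. Everything in it is hypothetical: the `searchIndex` map, the `indexMessages` helper, and the message shape are illustrations only, not the keyserver's schema or the real `processMessagesForSearch`. The assumption is that each row is keyed by the original message's ID, so indexing a message twice overwrites the same row, and an edit replaces the text stored for the message it targets.

```javascript
// Hypothetical in-memory stand-in for the search table:
// original message ID -> processed text.
const searchIndex = new Map();

// Hypothetical message shape:
// { id, type: 'text' | 'edit', text, targetMessageID? }
function indexMessages(messages) {
  for (const message of messages) {
    // An edit is stored under the ID of the message it edits,
    // so it overwrites the outdated content.
    const key =
      message.type === 'edit' ? message.targetMessageID : message.id;
    searchIndex.set(key, message.text.toLowerCase());
  }
}

const original = { id: '1', type: 'text', text: 'Hello' };
const edit = {
  id: '2',
  type: 'edit',
  targetMessageID: '1',
  text: 'Hello world',
};

indexMessages([original]);
indexMessages([original]); // same state as indexing once
indexMessages([edit]); // row '1' now holds the edited text

// Indexing only the original again would restore stale text,
// which is why the later edit has to be indexed along with it.
indexMessages([original, edit]);
```

In this sketch the final state depends on the order of the input, so the edit has to come after the message it targets.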

Rename file, move processMessagesForSearch to postMessageSend

Harbormaster completed remote builds in B17584: Diff 24072.Mar 24 2023, 12:52 PM

ashoat requested changes to this revision.Mar 24 2023, 7:02 PM

ashoat mentioned this in D7076: [keyserver] Add search table to the db.

ashoat added inline comments.

keyserver/src/database/search-utils.js
23–26	I think we'll need to revisit this following discussion in D7076. Arguably that discussion maybe should be happening here...

This revision now requires changes to proceed.Mar 24 2023, 7:03 PM

Changes due to changes in D7076
I will continue the discussion on parsing after I scope my next goal

Harbormaster completed remote builds in B17652: Diff 24159.Mar 27 2023, 8:17 AM

Requesting changes for inline comment, and for the discussion previously in D7076

keyserver/src/creators/message-creator.js
301 ↗	(On Diff #24159)	I realized something: this is actually very risky because if it throws an exception, nothing will catch it. The issue is that if the promise rejects without being caught, Node.js treats this a bit like an exception being thrown and not caught. You'll see an "unhandled promise rejection" warning in our current version of Node. But in more recent versions of Node, this would crash the whole app. We have two options here: (1) Wrap this call with `handleAsyncPromise`. (2) (Probably better) Make sure to await the Promise returned by this call (probably in the `Promise.all` at the bottom of the function). This works because then the promise rejection will cause `postMessageSend` to reject, which will be handled by the `handleAsyncPromise` on line 199.
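
A minimal sketch of the second option, under stated assumptions: the `handleAsyncPromise` below is a simplified stand-in written for this example (the codebase's helper may differ), and the bodies of `processMessagesForSearch` and `postMessageSend` are placeholders. It shows that awaiting the indexing promise inside `Promise.all` lets its rejection propagate to the single handler wrapped around `postMessageSend`.

```javascript
// Simplified stand-in for the codebase's handleAsyncPromise helper.
// It awaits the promise and records the error instead of rethrowing.
const handledErrors = [];
async function handleAsyncPromise(promise) {
  try {
    await promise;
  } catch (error) {
    handledErrors.push(error.message);
  }
}

// Placeholder indexing step that fails.
async function processMessagesForSearch() {
  throw new Error('indexing failed');
}

// Risky variant: the returned promise is dropped, so nothing handles
// its rejection.
//   async function postMessageSend() { processMessagesForSearch(); }

// Safer variant: await it together with the other work, so a rejection
// makes postMessageSend itself reject.
async function postMessageSend() {
  const otherWork = Promise.resolve('other post-send work');
  await Promise.all([otherWork, processMessagesForSearch()]);
}

// The caller wraps postMessageSend once, which covers the indexing too.
const handled = handleAsyncPromise(postMessageSend());
```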

This revision now requires changes to proceed.Mar 27 2023, 1:59 PM

Answering to comments on D7076:

natural.PorterStemmer.tokenizeAndStem uses AggressiveTokenizer by default (PorterStemmer gets this function from Stemmer, which it extends, and Stemmer uses AggressiveTokenizer).

Of the other tokenizers, the only one that looks potentially useful for us is RegexpTokenizer, which lets us define our own regex.

const tokenizer = new natural.RegexpTokenizer({
  pattern: /([A-Za-zÀ-ÿ-'"]+|[0-9._]+|.|!|\?|:|;|,|-)/iu,
});
console.log(tokenizer.tokenize(text));

produces

[ 'Hello', 'test', "it's", 'working', '😄', 'test' ]

But it would be a bit difficult to have it parse it's as it's and 'hello' as hello. Generally we would have to come up with a good regex.

The disadvantage of Intl.Segmenter is that there seems to be some problem with types: Intl is typed in flow builtin definitions, but it does not contain Segmenter https://github.com/facebook/flow/blob/main/lib/intl.js
But it works, so if I work around the flow problem it should be fine.

Although it's worth noting that the stemmer will remove the s from it's anyway, since it's treating it as a modifying suffix.
So after applying Intl.Segmenter.segment and natural.PorterStemmer.stem we get

['hello', 'test', "it'", 'work', '😄', 'test']
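
For reference, the segmentation step on its own can be sketched as follows. This is only an illustration: the `segmentWords` helper and the sample sentence are assumptions, stemming with `natural` is left out so the snippet has no dependencies, and it assumes a runtime that ships `Intl.Segmenter` (for example Node.js 16 or later).

```javascript
// Word-granularity segmenter; the 'en' locale is an assumption.
const segmenter = new Intl.Segmenter('en', { granularity: 'word' });

// Hypothetical helper: lowercases the segments and drops whitespace.
function segmentWords(text) {
  const words = [];
  // Iterate the segments directly; each item has a `segment` string.
  for (const { segment } of segmenter.segment(text)) {
    // Keep words, punctuation, and emoji; skip whitespace-only segments.
    if (segment.trim() !== '') {
      words.push(segment.toLowerCase());
    }
  }
  return words;
}

console.log(segmentWords("Hello test it's working 😄 test"));
```

The segmenter keeps it's as one segment and returns the emoji as its own segment, so neither is lost before the stemmer runs.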

Intl.Segmenter seems like the best bet! We can figure out the Flow types... worst case we might need to do something like const OurIntl: OurIntlType = (Intl: any); or something.

In D7077#214454, @inka wrote:
Although it's worth noting that the stemmer will remove the s from it's anyway, since it's treating it as a modifying suffix.
So after applying Intl.Segmenter.segment and natural.PorterStemmer.stem we get
['hello', 'test', "it'", 'work', '😄', 'test']

That seems good to me! It makes sense that it would stem "it's" into "it". One thing that's weird is that it's not stripping the apostrophe... ideally it would just be "it", which will likely be filtered because it's a stop word. I'm not sure if "it'" (with apostrophe) counts as a stop word.

One easy solution would be to add some manual code that strips punctuation (non-letter characters) after the stemmer runs.

I don't think we want to strip all non-letter characters, because then we lose the emojis again.
Here is the punctuation regex defined by Wikipedia:

[][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]

maybe we could use that?

I'm also wondering if we should remove the punctuation from inside of words. Maybe only strip the punctuation at the beginning and at the end? Since it makes sense to leave o'clock and 100,000 and such.

I ran a test: for hello{item}bye, the segmenter split the input into hello {item} bye for every symbol except ., _, ', and @, which left hello{item}bye as a single segment. For 100{item}000, it split the input into 100 {item} 000 for every symbol except ., ,, ;, _, ', and @. So the separated punctuation symbols would be removed, and only the special cases listed above would be kept.
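
A test along those lines could be scripted roughly like this. The `staysJoined` helper and the symbol list are assumptions made for the sketch, and the exact results depend on the ICU data bundled with the runtime, so the loop prints what it finds rather than assuming it.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'word' });

// Hypothetical helper: true when the segmenter keeps
// `left + symbol + right` as a single segment.
function staysJoined(left, symbol, right) {
  const input = `${left}${symbol}${right}`;
  const segments = Array.from(segmenter.segment(input), s => s.segment);
  return segments.length === 1;
}

// ASCII punctuation characters to probe, one at a time.
const asciiPunctuation = '!"#$%&\'()*+,./:;<=>?@[\\]^_`{|}~-';
for (const symbol of asciiPunctuation) {
  console.log(
    JSON.stringify(symbol),
    'letters joined:',
    staysJoined('hello', symbol, 'bye'),
    'digits joined:',
    staysJoined('100', symbol, '000'),
  );
}
```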

inka mentioned this in D7238: [keyserver] Add type workaround for Intl.Segmenter.Mar 29 2023, 4:27 PM

I added D7238 to type Intl to have the Segmenter field.

inka requested review of this revision.Mar 29 2023, 6:42 PM

I'm also wondering if we should remove the punctuation from inside of words. Maybe only strip the punctuation at the beginning and at the end? Since it makes sense to leave o'clock and 100,000 and such.

I think it would be fine to map o'clock to oclock. The idea of removing the punctuation inside of words makes sense to me. I think it would be good to have some unit tests to confirm that our tokenizer/stemmer works how we expect it to.

keyserver/src/database/search-utils.js
23–26 ↗	(On Diff #24159)	I think we still need to change this, right?

This revision now requires changes to proceed.Mar 29 2023, 8:09 PM

inka added inline comments.Mar 30 2023, 9:39 AM

keyserver/src/database/search-utils.js
23–26 ↗	(On Diff #24159)	Yes, I requested review to get the answer, I will update this now

Await the processMessagesForSearch promise, use Intl for tokenizing, add tests

inka edited the test plan for this revision.Mar 30 2023, 1:13 PM

Harbormaster completed remote builds in B17815: Diff 24381.Mar 30 2023, 1:26 PM

It would be good to have unit tests for segmentAndStem, but I don't want to block you from landing this. Can you create a follow-up task for that before landing?

Please also make sure to address my inline comments before landing.

keyserver/src/database/search-utils.js
12 ↗	(On Diff #24381)	Can you link where you got this from?
15 ↗	(On Diff #24381)	Shorthand
18 ↗	(On Diff #24381)	I would add a `.map(({ segment }) => segment)` here before the `filter`, since we only seem to need the `segment` property on the result
23 ↗	(On Diff #24381)	Please try to stick to a consistent convention of `camelCase` for variable names, and please try to catch this before putting diffs up to avoid unnecessary review churn!
26 ↗	(On Diff #24381)	[No action needed] It is risky to reuse a RegExp with the `/g` global setting, because `RegExp.exec()` and `RegExp.test()` are stateful. `String.replace` does not have the same statefulness problem (`RegExp.lastIndex` is always reset to 0 after calling), so it's "safe"... but this is a potential "footgun" to keep in mind
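
The `/g` footgun mentioned in the last inline comment can be reproduced in a few lines. This is a generic JavaScript illustration, not code from the diff; the regex and strings are arbitrary.

```javascript
const globalRegex = /a/g;

// test() advances lastIndex on a global regex, so the same call can
// give different answers on consecutive invocations.
const first = globalRegex.test('a'); // matches, lastIndex becomes 1
const second = globalRegex.test('a'); // fails, search started at index 1
const third = globalRegex.test('a'); // matches, lastIndex was reset to 0

// String.prototype.replace starts from the beginning every time,
// so reusing the same global regex gives the same result.
const replacedOnce = 'banana'.replace(globalRegex, 'o');
const replacedTwice = 'banana'.replace(globalRegex, 'o');

console.log(first, second, third, replacedOnce, replacedTwice);
```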

This revision is now accepted and ready to land.Mar 30 2023, 9:22 PM

@ashoat I did add search-utils.test.js that tests segmentAndStem. It's at the bottom of this diff, I think you maybe didn't see it? Please tell me if I'm misunderstanding.
The cases I tested are:

if it removes punctuation
if it removes uppercase
if it removes stopwords (on example stopwords)
if it removes excess whitespace (like from "word word")

keyserver/src/database/search-utils.js
12 ↗	(On Diff #24381)	Source: Wikipedia. Those are ASCII punctuation chars. But I found that there are Unicode property escapes that allow matching all Unicode punctuation (MDN docs): `const punctuationRegex = /\p{General_Category=Punctuation}/u;` There are 7 punctuation categories defined by Unicode. Here are their lists: list1 list2 list3 list4 list5 list6 list7. I tested 3 symbols from each list, and they all matched this regex. I tested using `jest` in keyserver/. I checked that emojis are not matched by this regex (with an example). I tested some letters like ł, Å, Ø, ķ, and they don't match the regex either. On the other hand, it doesn't match any of + < = > ^ ` | ~, which were matched by the ASCII regex. I think this would still be a better option. I'm going to use it, and if you disagree please let me know. I'm requesting review for this.
23 ↗	(On Diff #24381)	Sorry about doing this a lot recently, I will try to catch this better
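
The behavior described for the Unicode property escape can be checked with a short script. The sample characters are chosen for this sketch and are not the ones used in the actual tests.

```javascript
// Non-global on purpose, so test() is stateless here.
const punctuationRegex = /\p{General_Category=Punctuation}/u;

// Illustrative samples: ASCII and non-ASCII punctuation, ASCII symbols
// that are not in the Punctuation category, an emoji, and letters.
const samples = ['!', '’', '«', '+', '<', '~', '😄', 'ł', 'Å'];
for (const sample of samples) {
  console.log(JSON.stringify(sample), punctuationRegex.test(sample));
}
```

The characters + < ~ fall under Unicode symbol categories rather than punctuation, which is why the regex leaves them alone.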

Address review, change punctuation regex

Harbormaster completed remote builds in B17887: Diff 24502.Mar 31 2023, 11:54 AM

It's at the bottom of this diff, I think you maybe didn't see it?

Oops!! Not sure how I missed that, sorry about it.

I think it would be good to have unit tests of the exact result of segmentAndStem, rather than just checking the result with a RegExp. That way the reader can tell what segmentAndStem is doing exactly, and we can test it end-to-end.

Regarding the punctuation, are we deciding between "hello+goodbye" being treated as one word "hellogoodbye", versus being split into two words "hello goodbye"? If so, I think it makes sense to split into two words in that case. (Whereas I would prefer for cases like "Ashoat's" to be stemmed into "Ashoat" if possible.)

(If we had a unit test of hello+goodbye and Ashoat's showing what we expect the exact result to be, that would answer my question!)

Reaccepting to unblock for now, in case you want to land and address later

This revision is now accepted and ready to land.Mar 31 2023, 3:44 PM

Remove spreading iterator to array, to fix allocation error https://linear.app/comm/issue/ENG-3575/nodejs-allocation-error
Remove regex tests, add segmentAndStem tests.
Use replaceAll instead of replace

are we deciding between "hello+goodbye" being treated as one word "hellogoodbye", versus being split into two words "hello goodbye"?

No, with this approach "hello+bye" is parsed to "hello + bye", because segmenter splits them, and .replaceAll doesn't remove the +, since it doesn't match the punctuationRegex. So any of

+ < = > ^ ` | ~

would be left

I'm requesting review, since my goal cannot be landed anyway until I solve setting db variables for all team members

inka added inline comments.Apr 3 2023, 10:41 AM

keyserver/src/database/search-utils.js
12 ↗	(On Diff #24562)	To use `replaceAll`, the regex has to be global (source: MDN docs, and I was indeed getting errors when it wasn't)
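
A small sketch of both points: `replaceAll` only accepts a regex with the global flag, and the Unicode punctuation regex leaves + and emoji untouched. The word list is made up for the example, and it assumes a runtime that supports `String.prototype.replaceAll` (for example Node.js 15 or later).

```javascript
// Global flag is required for use with replaceAll.
const punctuationRegex = /\p{General_Category=Punctuation}/gu;

// Illustrative segments, as if they came out of the segmenter.
const stripped = ["it's", 'hello', '+', '😄'].map(word =>
  word.replaceAll(punctuationRegex, ''),
);
// "it's" loses its apostrophe; "+" and the emoji are left as they are.

// Passing a non-global regex to replaceAll throws a TypeError.
let nonGlobalError = null;
try {
  "it's".replaceAll(/\p{General_Category=Punctuation}/u, '');
} catch (error) {
  nonGlobalError = error;
}

console.log(stripped, nonGlobalError instanceof TypeError);
```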

Harbormaster completed remote builds in B17932: Diff 24562.Apr 3 2023, 10:55 AM

In D7077#216716, @inka wrote:
are we deciding between "hello+goodbye" being treated as one word "hellogoodbye", versus being split into two words "hello goodbye"?

No, with this approach "hello+bye" is parsed to "hello + bye", because segmenter splits them, and .replaceAll doesn't remove the +, since it doesn't match the punctuationRegex. So any of
+ < = > ^ ` | ~
would be left

I'm requesting review, since my goal cannot be landed anyway until I solve setting db variables for all team members

Ah, okay – thanks for clarifying. I think turning hello+bye into hello + bye is probably fine. In some sense I wonder if + should be a stop word, but I'd rather move forward here than nitpick too much over the details.

I'm honestly still confused about the punctuation question. I think the issue is that I naturally assume that the stemmer will remove most punctuation (eg. I would expect the stemmer to turn brother’s into brother), so I'm not sure which cases the punctuation RegExp is necessary for.

I assume there's some good reason for it, though – and the unit tests seem good!

This revision is now accepted and ready to land.Apr 3 2023, 8:50 PM

Closed by commit rCOMM4ddc00224434: [keyserver] Process new messages for search.Apr 7 2023, 1:21 PM

This revision was automatically updated to reflect the committed changes.

• InkaSokolowska mentioned this in rCOMM24f517176988: [keyserver] Add type workaround for Intl.Segmenter.

• InkaSokolowska added a commit: rCOMM4ddc00224434: [keyserver] Process new messages for search.

• InkaSokolowska mentioned this in rCOMM19613c3beb46: [keyserver] Add migration - process messages in our db for search.