Page MenuHomePhabricator

[keyserver] Process new messages for search
ClosedPublic

Authored by inka on Mar 15 2023, 4:31 AM.
Tags
None
Referenced Files
F3374132: D7077.diff
Tue, Nov 26, 2:00 PM
Unknown Object (File)
Fri, Nov 8, 5:04 PM
Unknown Object (File)
Fri, Nov 8, 5:04 PM
Unknown Object (File)
Fri, Nov 8, 5:04 PM
Unknown Object (File)
Fri, Nov 8, 5:04 PM
Unknown Object (File)
Fri, Nov 8, 5:04 PM
Unknown Object (File)
Fri, Nov 8, 5:04 PM
Unknown Object (File)
Fri, Nov 8, 5:04 PM
Subscribers

Details

Summary

issues: https://linear.app/comm/issue/ENG-3314/implement-a-function-for-processing-messages-for-search-and-storing,
https://linear.app/comm/issue/ENG-3315/implement-adding-processed-messages-to-db-when-new-messages-are
tokenizeAndStem splits the text into words, removes stopwords, and stemms the remaining words. It returns an array. The second paramether is a boolean tellinng whether to keep the stopwords. source
code:https://github.com/NaturalNode/natural/blob/master/lib/natural/stemmers/stemmer.js

Test Plan

Tested that when a new message is created, a proper field appears in the search table. Tested that when a message is edited (by passing an artificial edit message to processMessagesForSearch
function) a proper field is edited in the search table.
run yarn jest search-utils

Diff Detail

Repository
rCOMM Comm
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

inka requested review of this revision.Mar 15 2023, 4:46 AM
keyserver/src/database/search_utils.js
36–38 ↗(On Diff #23751)
  1. Please format this like other SQL queries. Note the indentation we use in the codebase, and note the spacing we use in the codebase
  2. Please make sure you don't have any lines longer than 80 chars

Address review and add early return when there is nothing to insert

ashoat requested changes to this revision.Mar 20 2023, 2:14 PM

Looks good, but a couple notes in message-creator.js

keyserver/src/creators/message-creator.js
186 ↗(On Diff #23834)

We should only pass newMessageInfos in here. The messages that are in returnMessageInfos but not newMessageInfos are messages that have already been created. Those messages should have already been indexed

This brings to mind a question: is the indexing process idempotent? Meaning, if I index the same message twice, will it be the same as if I indexed that message once?

207 ↗(On Diff #23834)

I don't think we should block the return on this. Search indexing is usually implemented as a "post-processing step"... the user creating the message shouldn't need to wait on the search indexing to complete for the endpoint to return.

Instead, I think we should include this in postMessageSendPromise. Can you move the call to processMessagesForSearch into postMessageSend? You can use the messageInfos parameter (stripLocalIDs should have no effect on indexing I think)

keyserver/src/database/search_utils.js
1 ↗(On Diff #23834)

Can you name this file search-utils.js to match the naming convention in the codebase?

This revision now requires changes to proceed.Mar 20 2023, 2:14 PM
keyserver/src/creators/message-creator.js
186 ↗(On Diff #23834)

The indexing process is idempotent, but if we indexed a message that has a later edit, than this later edit needs to be indexed as well. Otherwise we would have the outdated content in the search table.

Rename file, move processMessagesForSearch to postMessageSend

ashoat requested changes to this revision.Mar 24 2023, 12:02 PM
ashoat added inline comments.
keyserver/src/database/search-utils.js
23–26

I think we'll need to revisit this following discussion in D7076. Arguably that discussion maybe should be happening here...

This revision now requires changes to proceed.Mar 24 2023, 12:03 PM

Changes due to changes in D7076
I will continue the discussion on parsing after I scope my next goal

ashoat requested changes to this revision.Mar 27 2023, 6:59 AM

Requesting changes for inline comment, and for the discussion previously in D7076

keyserver/src/creators/message-creator.js
301 ↗(On Diff #24159)

I realized something: this is actually very risky because if it throws an exception, nothing will catch it.

The issue is that if the promise rejects without being caught, Node.js treats this a bit like an exception being thrown and not caught.

You'll see an "unhandled promise rejection" warning in our current version of Node. But in more recent versions of Node, this would crash the whole app.

We have two options here:

  1. Wrap this call with handleAsyncPromise
  2. (Probably better) Make sure to await the Promise returned by this call (probably in the Promise.all at the bottom of the function). This works because then the promise rejection will cause postMessageSend to reject, which will be handled by the handleAsyncPromise on line 199
This revision now requires changes to proceed.Mar 27 2023, 6:59 AM

Answering to comments on D7076:

natural.PorterStemmer.tokenizeAndStem uses AggressiveTokenizer by default (PorterStemmer gets this function from Stemmer that it extendes, and Stemmer uses AggressiveTokenizer).

From the other tokanizers the only one that looks potentially useful for us is RegexpTokenizer, that allows to define own regex.

const tokenizer = new natural.RegexpTokenizer({
      pattern: /([A-Za-zÀ-ÿ-'"]+|[0-9._]+|.|!|\?|:|;|,|-)/iu,
    });
    console.log(tokenizer.tokenize(text));

produces

[ 'Hello', 'test', "it's", 'working', '😄', 'test' ]

But it would be a bit difficult to have it parse it's as it's and 'hello' as hello. Generally we would have to come up with a good regex.

The disadvantage of Intl.Segmenter is that there seems to be some problem with types: Intl is typed in flow builtin definitions, but it does not contain Segmenter https://github.com/facebook/flow/blob/main/lib/intl.js
But it works, so if I work around the flow problem it should be fine.

Although it's worth noting that the stemmer will remove the s from it's anyway, since it's treating it as a modifying suffix.
So after applying Intl.Segmenter.segment and natural.PorterStemmer.stem we get

['hello', 'test', "it'", 'work', '😄', 'test']

Intl.Segmenter seems like the best bet! We can figure out the Flow types... worst case we might need to do something like const OurIntl: OurIntlType = (Intl: any); or something.

In D7077#214454, @inka wrote:

Although it's worth noting that the stemmer will remove the s from it's anyway, since it's treating it as a modifying suffix.
So after applying Intl.Segmenter.segment and natural.PorterStemmer.stem we get

['hello', 'test', "it'", 'work', '😄', 'test']

That seems good to me! It makes sense that it would stem "it's" into "it". One thing that's weird is that it's not stripping the apostrophe... ideally it would just be "it", which will likely be filtered because it's a stop word. I'm not sure if "it'" (with apostrophe) counts as a stop word.

One easy solution would be to add some manual code that strips punctuation (non-letter characters) after the stemmer runs.

I don't think we want to strip all non-letter characters, because then we loose the emojis again.
Here is punctuation regex defined by wikipedia:

[][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]

maybe we could use that?

I'm also wondering if we should remove the punctuation from inside of words. Maybe only strip the punctuation at the beginning and at the end? Since it makes sense to leave o'clock and 100,000 and such.

I run a test, and for hello{item}bye the segmenter parsed it to hello {item} bye for all symbols apart from ., _ , ', @, which were left as hello{item}bye . And 100{item}000 it parsed to 100 {item} 000 for all symbols apart from ., ,, ;, _, ', @. So the separated punctuation symbols would be removed, and only those special cases I listed would be left.

inka requested review of this revision.Mar 29 2023, 11:42 AM
ashoat requested changes to this revision.Mar 29 2023, 1:09 PM

I'm also wondering if we should remove the punctuation from inside of words. Maybe only strip the punctuation at the beginning and at the end? Since it makes sense to leave o'clock and 100,000 and such.

I think it would be fine to map o'clock to oclock. The idea of removing the punctuation inside of words makes sense to me. I think it would be good to have some unit tests to confirm that our tokenizer/stemmer works how we expect it to.

keyserver/src/database/search-utils.js
23–26 ↗(On Diff #24159)

I think we still need to change this, right?

This revision now requires changes to proceed.Mar 29 2023, 1:09 PM
keyserver/src/database/search-utils.js
23–26 ↗(On Diff #24159)

Yes, I requested review to get the answer, I will update this now

Await for processMessagesForSearch promise, Use Intl for tokenizing, add tests

It would be good to have unit tests for segmentAndStem, but I don't want to block you from landing this. Can you create a follow-up task for that before landing?

Please also make sure to address my inline comments before landing.

keyserver/src/database/search-utils.js
12 ↗(On Diff #24381)

Can you link where you got this from?

15 ↗(On Diff #24381)

Shorthand

18 ↗(On Diff #24381)

I would add a .map(({ segment }) => segment) here before the filter, since we only seem to need the segment property on the result

23 ↗(On Diff #24381)

Please try to stick to a consistent convention of camelCase for variable names, and please try to catch this before putting diffs up to avoid unnecessary review churn!

26 ↗(On Diff #24381)

[No action needed]

It is risky to reuse a RegExp with the /g global setting, because RegExp.exec() and RegExp.test() are stateful. String.replace does not have the same statefulness problem (RegExp.lastIndex is always reset to 0 after calling), so it's "safe"... but this is a potential "footgun" to keep in mind

This revision is now accepted and ready to land.Mar 30 2023, 2:22 PM
inka requested review of this revision.Mar 31 2023, 4:20 AM

@ashoat I did add search-utils.test.js that tests segmentAndStem. It's at the bottom of this diff, I think you maybe didn't see it? Please tell me if I'm misunderstanding.
The cases I tased are:

  • if it removes punctuation
  • if it removes uppercase
  • if if removes stopwords (on example stopwords)
  • if it removes excess whitespace (like from "word word")
keyserver/src/database/search-utils.js
12 ↗(On Diff #24381)

Source: wikipedia
Those are ASCII punctuation chars.
But I found that there exists a Unicode property escapes that allows to match all of unicode punctuation:MDN docs
const punctuationRegex = /\p{General_Category=Punctuation}/u;

There are 7 punctuation categories as defined by unicode. Here are their lists:
list1
list2
list3
list4
list5
list6
list7
I tested 3 symbols from each list, they all matched this regex. I tested using jest in keyserver/.
I checked that emojis are not matched by this regex (with an example). I tested some letters like ł, Å, Ø, ķ - they don't match the regex either.

On the other hand it doesn't match any of

+ < = > ^ ` | ~

that were matched by the ASCII regex

I think this would still be a better option. I'm going to use it and if you disagree please let me know. I'm requesting review for this

23 ↗(On Diff #24381)

Sorry about doing this a lot recently, I will try to catch this better

Address review, change punctuation regex

It's at the bottom of this diff, I think you maybe didn't see it?

Oops!! Not sure how I missed that, sorry about it.

I think it would be good to have unit tests of the exact result of segmentAndStem, rather than just checking the result with a RegExp. That way the reader can tell what segmentAndStem is doing exactly, and we can test it end-to-end.

Regarding the punctuation, are we are deciding between "hello+goodbye" being treated as one word "hellogoodbye", versus being split into two words "hello goodbye"? If so, I think it makes sense to split into two words in that case. (Whereas I would prefer for cases like "Ashoat's" to be stemmed into "Ashoat" if possible.)

(If we had a unit test of hello+goodbye and Ashoat's showing what we expect the exact result to be, that would answer my question!)

Reaccepting to unblock for now, in case you want to land and address later

This revision is now accepted and ready to land.Mar 31 2023, 8:44 AM

Remove spreading iterator to array, to fix allocation error https://linear.app/comm/issue/ENG-3575/nodejs-allocation-error
Remove regex tests, add segmentAndStem tests.
Use replaceAll instead of replace

inka requested review of this revision.EditedApr 3 2023, 3:39 AM

are we are deciding between "hello+goodbye" being treated as one word "hellogoodbye", versus being split into two words "hello goodbye"?

No, with this approach "hello+bye" is parsed to "hello + bye", because segmenter splits them, and .replaceAll doesn't remove the +, since it doesn't match the punctuationRegex. So any of

+ < = > ^ ` | ~

would be left

I'm requesting review, since my goal cannot be landed anyway until I solve setting db variables for all team members

keyserver/src/database/search-utils.js
12 ↗(On Diff #24562)

To use replaceAll, the regex has to be global (source: MDN docs, and I was indeed getting errors when it wasn't)

In D7077#216716, @inka wrote:

are we are deciding between "hello+goodbye" being treated as one word "hellogoodbye", versus being split into two words "hello goodbye"?

No, with this approach "hello+bye" is parsed to "hello + bye", because segmenter splits them, and .replaceAll doesn't remove the +, since it doesn't match the punctuationRegex. So any of

+ < = > ^ ` | ~

would be left

I'm requesting review, since my goal cannot be landed anyway until I solve setting db variables for all team members

Ah, okay – thanks for clarifying. I think turning hello+bye into hello + bye is probably fine. In some sense I wonder if + should be a stop word" but I'd rather move forward here than nitpick too much over the details.

I'm honestly still confused about the punctuation question. I think the issue is that I naturally assume that the stemmer will remove most punctuation (eg. I would expect the stemmer to take brother’s, into brother), so I'm not sure which cases the punctuation RegExp is necessary for.

I assume there's some good reason for it, though – and the unit tests seem good!

This revision is now accepted and ready to land.Apr 3 2023, 1:50 PM