issues: https://linear.app/comm/issue/ENG-3314/implement-a-function-for-processing-messages-for-search-and-storing,
https://linear.app/comm/issue/ENG-3315/implement-adding-processed-messages-to-db-when-new-messages-are
tokenizeAndStem splits the text into words, removes stopwords, and stems the remaining words. It returns an array. The second parameter is a boolean telling whether to keep the stopwords.
Source code: https://github.com/NaturalNode/natural/blob/master/lib/natural/stemmers/stemmer.js
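To illustrate the contract described above, here is a minimal sketch. This is a toy stand-in, not the natural library's actual implementation: the stopword list and the suffix-stripping "stemmer" are hypothetical.

```javascript
// Toy illustration of the tokenizeAndStem contract described above:
// split into words, optionally drop stopwords, then stem what remains.
// The stopword list and toyStem are hypothetical stand-ins for natural's.
const stopwords = new Set(['the', 'is', 'a', 'an']);
const toyStem = word => word.toLowerCase().replace(/(ing|ed|s)$/, '');

function tokenizeAndStemSketch(text, keepStops = false) {
  const tokens = text.split(/\W+/).filter(Boolean);
  const kept = keepStops
    ? tokens
    : tokens.filter(token => !stopwords.has(token.toLowerCase()));
  return kept.map(toyStem);
}

console.log(tokenizeAndStemSketch('The dog is barking')); // → ['dog', 'bark']
console.log(tokenizeAndStemSketch('The dog is barking', true)); // stopwords kept
```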
Details
Tested that when a new message is created, a proper field appears in the search table. Tested that when a message is edited (by passing an artificial edit message to the processMessagesForSearch function), a proper field is edited in the search table.
run yarn jest search-utils
Diff Detail
Repository: rCOMM Comm
keyserver/src/database/search_utils.js | ||
---|---|---|
36–38 ↗ | (On Diff #23751) | |
Looks good, but a couple notes in message-creator.js
keyserver/src/creators/message-creator.js | ||
---|---|---|
186 ↗ | (On Diff #23834) | We should only pass newMessageInfos in here. The messages that are in returnMessageInfos but not newMessageInfos are messages that have already been created. Those messages should have already been indexed. This brings to mind a question: is the indexing process idempotent? Meaning, if I index the same message twice, will it be the same as if I indexed that message once? |
207 ↗ | (On Diff #23834) | I don't think we should block the return on this. Search indexing is usually implemented as a "post-processing step"... the user creating the message shouldn't need to wait on the search indexing to complete for the endpoint to return. Instead, I think we should include this in postMessageSendPromise. Can you move the call to processMessagesForSearch into postMessageSend? You can use the messageInfos parameter (stripLocalIDs should have no effect on indexing I think) |
keyserver/src/database/search_utils.js | ||
1 ↗ | (On Diff #23834) | Can you name this file search-utils.js to match the naming convention in the codebase? |
keyserver/src/creators/message-creator.js | ||
---|---|---|
186 ↗ | (On Diff #23834) | The indexing process is idempotent, but if we indexed a message that has a later edit, then this later edit needs to be indexed as well. Otherwise we would have outdated content in the search table. |
Changes due to changes in D7076
I will continue the discussion on parsing after I scope my next goal
Requesting changes for inline comment, and for the discussion previously in D7076
keyserver/src/creators/message-creator.js | ||
---|---|---|
301 ↗ | (On Diff #24159) | I realized something: this is actually very risky because if it throws an exception, nothing will catch it. The issue is that if the promise rejects without being caught, Node.js treats this a bit like an exception being thrown and not caught. You'll see an "unhandled promise rejection" warning in our current version of Node. But in more recent versions of Node, this would crash the whole app. We have two options here:
|
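To make the risk concrete, here is a minimal sketch (processMessagesForSearch and postMessageSend are hypothetical stand-ins for the functions discussed above). Attaching a .catch keeps a rejected indexing promise from becoming an unhandled rejection, while the caller still doesn't need to await it:

```javascript
// Hypothetical stand-in for the real indexing function: always rejects,
// simulating a DB failure during search indexing.
function processMessagesForSearch(messageInfos) {
  return Promise.reject(new Error('db unavailable'));
}

let loggedError = null;

// Fire-and-forget from postMessageSend: the endpoint doesn't await indexing,
// but the .catch guarantees the rejection is handled instead of crashing Node.
function postMessageSend(messageInfos) {
  processMessagesForSearch(messageInfos).catch(err => {
    loggedError = err.message;
    console.warn('search indexing failed:', err.message);
  });
}

postMessageSend([]);
```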
Answering to comments on D7076:
natural.PorterStemmer.tokenizeAndStem uses AggressiveTokenizer by default (PorterStemmer gets this function from Stemmer, which it extends, and Stemmer uses AggressiveTokenizer).
Of the other tokenizers, the only one that looks potentially useful for us is RegexpTokenizer, which allows defining your own regex.
const tokenizer = new natural.RegexpTokenizer({
  pattern: /([A-Za-zÀ-ÿ-'"]+|[0-9._]+|.|!|\?|:|;|,|-)/iu,
});
console.log(tokenizer.tokenize(text));
produces
[ 'Hello', 'test', "it's", 'working', '😄', 'test' ]
But it would be a bit difficult to have it parse it's as it's while also parsing 'hello' as hello. Generally, we would have to come up with a good regex.
The disadvantage of Intl.Segmenter is that there seems to be some problem with types: Intl is typed in Flow's builtin definitions, but they do not include Segmenter: https://github.com/facebook/flow/blob/main/lib/intl.js
But it works, so if I work around the Flow problem it should be fine.
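For reference, Intl.Segmenter works out of the box in Node (no dependency needed), e.g.:

```javascript
// Word-granularity segmentation; non-word segments (spaces, commas) are
// filtered out via the isWordLike flag on each segment object.
const segmenter = new Intl.Segmenter('en', { granularity: 'word' });

const words = Array.from(segmenter.segment("Hello, it's working"))
  .filter(({ isWordLike }) => isWordLike)
  .map(({ segment }) => segment);

console.log(words); // ['Hello', "it's", 'working']
```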
Although it's worth noting that the stemmer will remove the s from it's anyway, since it's treating it as a modifying suffix.
So after applying Intl.Segmenter.segment and natural.PorterStemmer.stem we get
['hello', 'test', "it'", 'work', '😄', 'test']
Intl.Segmenter seems like the best bet! We can figure out the Flow types... worst case we might need to do something like const OurIntl: OurIntlType = (Intl: any); or something.
That seems good to me! It makes sense that it would stem "it's" into "it'". One thing that's weird is that it's not stripping the apostrophe... ideally it would just be "it", which would likely be filtered because it's a stop word. I'm not sure if "it'" (with apostrophe) counts as a stop word.
One easy solution would be to add some manual code that strips punctuation (non-letter characters) after the stemmer runs.
I don't think we want to strip all non-letter characters, because then we lose the emojis again.
Here is the punctuation regex as defined by Wikipedia:
[][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]
maybe we could use that?
I'm also wondering if we should remove the punctuation from inside of words. Maybe only strip the punctuation at the beginning and at the end? Since it makes sense to leave o'clock and 100,000 and such.
I ran a test, and for hello{item}bye the segmenter parsed it to hello {item} bye for all symbols apart from ., _, ', and @, which were left as hello{item}bye. And 100{item}000 it parsed to 100 {item} 000 for all symbols apart from ., ,, ;, _, ', and @. So the separated punctuation symbols would be removed, and only the special cases I listed would be left.
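A quick check of that behavior: the segmenter splits on symbols like +, but keeps a comma between digits and a period between letters inside one word-like segment.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'word' });
const wordsOf = text =>
  Array.from(segmenter.segment(text))
    .filter(({ isWordLike }) => isWordLike)
    .map(({ segment }) => segment);

console.log(wordsOf('hello+bye')); // + splits: ['hello', 'bye']
console.log(wordsOf('100,000'));   // comma between digits joins: ['100,000']
console.log(wordsOf('hello.bye')); // period between letters joins: ['hello.bye']
```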
I'm also wondering if we should remove the punctuation from inside of words. Maybe only strip the punctuation at the beginning and at the end? Since it makes sense to leave o'clock and 100,000 and such.
I think it would be fine to map o'clock to oclock. The idea of removing the punctuation inside of words makes sense to me. I think it would be good to have some unit tests to confirm that our tokenizer/stemmer works how we expect it to.
keyserver/src/database/search-utils.js | ||
---|---|---|
23–26 ↗ | (On Diff #24159) | I think we still need to change this, right? |
keyserver/src/database/search-utils.js | ||
---|---|---|
23–26 ↗ | (On Diff #24159) | Yes, I requested review to get the answer, I will update this now |
It would be good to have unit tests for segmentAndStem, but I don't want to block you from landing this. Can you create a follow-up task for that before landing?
Please also make sure to address my inline comments before landing.
keyserver/src/database/search-utils.js | ||
---|---|---|
12 | Can you link where you got this from? | |
15 | Shorthand | |
18 | I would add a .map(({ segment }) => segment) here before the filter, since we only seem to need the segment property on the result | |
23 | Please try to stick to a consistent convention of camelCase for variable names, and please try to catch this before putting diffs up to avoid unnecessary review churn! | |
26 | [No action needed] It is risky to reuse a RegExp with the /g global setting, because RegExp.exec() and RegExp.test() are stateful. String.replace does not have the same statefulness problem (RegExp.lastIndex is always reset to 0 after calling), so it's "safe"... but this is a potential "footgun" to keep in mind |
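The footgun in question, as a minimal standalone example:

```javascript
// A /g regex is stateful across .test()/.exec() calls via lastIndex.
const re = /a/g;

console.log(re.test('abc')); // true  (match at 0; lastIndex becomes 1)
console.log(re.test('abc')); // false (search resumes at 1; lastIndex resets to 0)

// String.prototype.replace resets lastIndex, so repeated calls are safe:
console.log('abc'.replace(re, 'x')); // 'xbc'
console.log('abc'.replace(re, 'x')); // 'xbc'
```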
@ashoat I did add search-utils.test.js that tests segmentAndStem. It's at the bottom of this diff, I think you maybe didn't see it? Please tell me if I'm misunderstanding.
The cases I tested are:
- if it removes punctuation
- if it removes uppercase
- if it removes stopwords (on example stopwords)
- if it removes excess whitespace (like from "word  word")
keyserver/src/database/search-utils.js | ||
---|---|---|
12 | Source: Wikipedia. There are 7 punctuation categories as defined by Unicode. Here are their lists: On the other hand, it doesn't match any of + < = > ^ ` | ~, which were matched by the ASCII regex. I think this would still be a better option. I'm going to use it, and if you disagree please let me know. I'm requesting review for this | |
23 | Sorry about doing this a lot recently, I will try to catch this better |
It's at the bottom of this diff, I think you maybe didn't see it?
Oops!! Not sure how I missed that, sorry about that.
I think it would be good to have unit tests of the exact result of segmentAndStem, rather than just checking the result with a RegExp. That way the reader can tell what segmentAndStem is doing exactly, and we can test it end-to-end.
Regarding the punctuation, are we deciding between "hello+goodbye" being treated as one word "hellogoodbye", versus being split into two words "hello goodbye"? If so, I think it makes sense to split into two words in that case. (Whereas I would prefer for cases like "Ashoat's" to be stemmed into "Ashoat" if possible.)
(If we had a unit test of hello+goodbye and Ashoat's showing what we expect the exact result to be, that would answer my question!)
Remove spreading iterator to array, to fix allocation error https://linear.app/comm/issue/ENG-3575/nodejs-allocation-error
Remove regex tests, add segmentAndStem tests.
Use replaceAll instead of replace
are we deciding between "hello+goodbye" being treated as one word "hellogoodbye", versus being split into two words "hello goodbye"?
No, with this approach "hello+bye" is parsed to "hello + bye", because segmenter splits them, and .replaceAll doesn't remove the +, since it doesn't match the punctuationRegex. So any of
+ < = > ^ ` | ~
would be left
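This can be checked with a Unicode property-escape regex. The sketch below assumes punctuationRegex matches the Unicode punctuation categories (\p{P}), which is my reading of the change; + is in the Sm (math symbol) category, so it survives.

```javascript
// \p{P} covers the 7 Unicode punctuation general categories (Pc, Pd, Ps,
// Pe, Pi, Pf, Po) but not symbols like + < = > ^ ` | ~ (S* categories).
const punctuationRegex = /\p{P}/gu;

console.log('hello + bye'.replaceAll(punctuationRegex, '')); // 'hello + bye'
console.log('hello, bye!'.replaceAll(punctuationRegex, '')); // 'hello bye'
```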
I'm requesting review, since my goal cannot be landed anyway until I solve setting db variables for all team members
Ah, okay – thanks for clarifying. I think turning hello+bye into hello + bye is probably fine. In some sense I wonder if + should be a "stop word", but I'd rather move forward here than nitpick too much over the details.
I'm honestly still confused about the punctuation question. I think the issue is that I naturally assume that the stemmer will remove most punctuation (eg. I would expect the stemmer to turn brother's into brother), so I'm not sure which cases the punctuation RegExp is necessary for.
I assume there's some good reason for it, though – and the unit tests seem good!