Details

Reviewers

kamil
bartek
ashoat

Commits

rCOMMadb573ce6a25: [keyserver] Add search table to the db

Summary

issue: https://linear.app/comm/issue/ENG-3313/create-a-table-in-the-db-for-holding-processed-messages-for-search
We will hold messages processed for search in a new db table. This table will have:

processed_content - the message content, stemmed and stripped of stopwords. When the message changes, this field will be updated accordingly.
message_id - the id of the message. When the message gets edited, this id will be updated to the id of the edit message. Otherwise fetching the correct full content would be difficult.
original_message_id - the id of the original message. First of all, this gives the table a primary key that doesn't change (since message_id will be changing); the id of the original message was a natural candidate for this. Secondly, since we already have it, it will be used to fetch the original message (e.g. to access the creation time or the author).

For now we will be only keeping here processed text messages, but we might extend this to e.g. calendar entries in the future.

Test Plan

Ran `yarn dev` in keyserver/ and checked that the migration succeeds. Created a new db, attached it in db_config.json, ran `yarn dev` in keyserver/, and checked that the db is created correctly.

Diff Detail

Repository

rCOMM Comm

Lint

No Lint Coverage

Unit

No Test Coverage

Event Timeline

inka created this revision.Mar 15 2023, 11:11 AM

Herald added subscribers: atul, tomek. Mar 15 2023, 11:11 AM

inka added a parent revision: D7075: [keyserver] Add natural library for stemming and removing stopwords.Mar 15 2023, 11:11 AM

I'm wondering if I shouldn't change "message" to some other word, since we might want to include calendar entries in the future. But the word "entry" already has a special meaning in our code, and I don't have good ideas how to call a super class of both message and entry

Harbormaster completed remote builds in B17340: Diff 23750.Mar 15 2023, 11:26 AM

inka requested review of this revision.Mar 15 2023, 11:26 AM

inka added a child revision: D7077: [keyserver] Process new messages for search.Mar 15 2023, 11:32 AM

ashoat requested changes to this revision.Mar 15 2023, 2:40 PM

ashoat added inline comments.

keyserver/src/database/setup-db.js
252 ↗	(On Diff #23750)	In the diff description, you say: "When the message gets edited, this id will be updated to the id of the edit message. Otherwise fetching the correct full content would be difficult." Can you clarify how / why it would be difficult?
253 ↗	(On Diff #23750)	Is `processed_content` a string `tokens.join(' ')`, eg. looks like "after stemming / stop words" here?
254 ↗	(On Diff #23750)	What does this do? Does it create a new column, or is it a modifier on an existing column? Should it be given a name?
255 ↗	(On Diff #23750)	Should we use `utf8mb4` like we use elsewhere? I think it has some better behavior for emojis or something, I don't recall exactly
375 ↗	(On Diff #23750)	Did you actually test this like you claim in the test plan? I assume this comma is not valid SQL, so I'm confused as to how this didn't come up in your testing...

This revision now requires changes to proceed.Mar 15 2023, 2:40 PM

inka added inline comments.Mar 20 2023, 8:57 AM

keyserver/src/database/setup-db.js
252 ↗	(On Diff #23750)	"Can you clarify how / why it would be difficult?"
Suppose we kept the id of the original message. When we run a full text search on this table, we only get the processed message, so we need to fetch the full message content from the `messages` table. To fetch the most recent edit, we would have to query for all messages with `id` or `target_message` matching the `id` and pick the most recent one.
@tomek and I discussed this and figured that if we save the id of the edited message, then we can very easily fetch it and the original message and send them to the client (in some single data structure), and the client will have all the information about this message that it needs.
For the approach with changing the id in the message_id column (and using the fact that we keep the old id as the primary key), the query would be something like:
  SELECT * FROM messages m
  JOIN search s ON s.original_message_id = m.id OR m.id = s.message_id
  WHERE MATCH(s.processed_content) AGAINST(query);
And we would get two messages for each match: the original and the most recent edit.
But for the approach where we keep only the old id, we would need to either run two queries to get the original and the newest edit separately, or have the query return all edits and handle that in js, or make one query return all originals but only the newest edits. I can only think of a way to do that which requires a nested SELECT:
  SELECT * FROM messages m
  JOIN search s ON s.original_message_id = m.id OR m.target_message = s.original_message_id
  WHERE MATCH(s.processed_content) AGAINST('friend')
    AND (m.target_message IS NULL OR m.time=(SELECT MAX(mm.time) FROM messages mm WHERE mm.target_message = m.target_message));
253 ↗	(On Diff #23750)	Yes
254 ↗	(On Diff #23750)	It is a modifier on an existing column. It can be given a name, but this also works - it gives the index the name of the column. This is how it's done in docs. Should I give it a name anyway?
255 ↗	(On Diff #23750)	I suppose I should use the same value as for the `messages` table, yes. `utf8mb4` can store 4-byte chars, which include musical symbols, some historic alphabets, some emoji and some other symbols.
375 ↗	(On Diff #23750)	I tested this diff as described in the test plan (that's how I encountered the problem with migrations), but I think I forgot to check the effects of this last statement. Sorry
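
The reasoning in the comment on line 252 can be illustrated with a small in-memory sketch. This is not the keyserver implementation: the `Map`-based "tables" and the names `addMessage` and `search` are hypothetical, and full-text matching is reduced to a simple token lookup. It only shows why a stable `original_message_id` plus a moving `message_id` lets one lookup return both the original message and its latest edit.

```javascript
// Hypothetical in-memory stand-ins for the `messages` table and the search table.
const messages = new Map(); // id -> { id, content, time, targetMessage }
const searchRows = new Map(); // original_message_id -> row

// Store a message. An edit (targetMessage set) keeps the row keyed by the
// original id but moves message_id and processed_content to the edit.
function addMessage(message, processedContent) {
  messages.set(message.id, message);
  if (message.targetMessage == null) {
    searchRows.set(message.id, {
      originalMessageID: message.id,
      messageID: message.id,
      processedContent,
    });
    return;
  }
  const row = searchRows.get(message.targetMessage);
  if (!row) {
    throw new Error(`no search row for message ${message.targetMessage}`);
  }
  row.messageID = message.id;
  row.processedContent = processedContent;
}

// Simplified stand-in for MATCH ... AGAINST: exact token membership.
function search(term) {
  const results = [];
  for (const row of searchRows.values()) {
    if (row.processedContent.split(' ').includes(term)) {
      results.push({
        original: messages.get(row.originalMessageID),
        latest: messages.get(row.messageID),
      });
    }
  }
  return results;
}

addMessage({ id: 1, content: 'hello friend', time: 100, targetMessage: null }, 'hello friend');
addMessage({ id: 2, content: 'hi friends', time: 200, targetMessage: 1 }, 'hi friend');
```

After these two calls, `search('friend')` yields a single result whose `original` is message 1 and whose `latest` is message 2, with no extra query over all edits.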

inka added inline comments.Mar 20 2023, 11:10 AM

keyserver/src/database/setup-db.js
255 ↗	(On Diff #23750)	Although the natural library treats everything not in [a-zA-Z0-9] as a separator, so emojis are not indexed (source code)

Address review. Tested the new db creation (in prod mode, because I'm having weird problems with my setup; for more info see https://linear.app/comm/issue/ENG-3325/clean-database-migration-fail) and made sure the primary key is created.

Harbormaster completed remote builds in B17395: Diff 23837.Mar 20 2023, 11:29 AM

Almost there!

keyserver/src/database/setup-db.js
250 ↗	(On Diff #23837)	Can we name this `message_search`, since the columns seem specific to messages?
255 ↗	(On Diff #23837)	For the `messages` table, we set `COLLATE=utf8mb4_bin` here as well. Can we add that here? It seems like a good idea in case we ever add another column but forget to set the `COLLATE` mode
252 ↗	(On Diff #23750)	Thanks for explaining!
254 ↗	(On Diff #23750)	It looks like in the codebase we have a convention to name all indices. Can you give it a name?
255 ↗	(On Diff #23750)	Thanks for investigating the `natural` library! Did you also investigate how the `\W` character class is treated specifically in Node.js's RegExp implementation? Wondering where you got the `[a-zA-Z0-9]` part

This revision now requires changes to proceed.Mar 20 2023, 9:30 PM

Address review, rebase

keyserver/src/database/setup-db.js
250 ↗	(On Diff #23837)	When I talked to @tomek about this task initially, we thought that in the future we might want to use this for searching over calendar entries as well. I was referring to that in my comment on this diff: "I'm wondering if I shouldn't change 'message' to some other word, since we might want to include calendar entries in the future. But the word 'entry' already has a special meaning in our code, and I don't have good ideas how to call a super class of both message and entry", and in the summary: "For now we will be only keeping here processed text messages, but we might extend this to e.g. calendar entries in the future." I can change it either way, but we should decide whether we possibly want to use this table for calendar entries in the future. The benefit of using the same table for messages and entries is that if we fetch them by, for example, time, we will get them sorted correctly right away, and won't have to think about how many of each we should fetch.
255 ↗	(On Diff #23750)	I got it from the regex page on Wikipedia and from regex101. They both say that `\W` matches exactly `[^A-Za-z0-9_]` (`_` is then added back as a separator by natural). It is the same in the MDN docs. Typing in "sokołowska" shows "soko owska" in the search table with this implementation, and typing "hello😄hello" shows "hello hello".
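
The `\W` behavior described in the comment on line 255 can be reproduced with plain JavaScript regular expressions. This sketch does not call the `natural` library; `asciiSplit` and `unicodeSplit` are hypothetical helper names, and the Unicode-aware pattern is only one possible alternative, not something proposed in this diff.

```javascript
// Without the `u` flag, \W is the complement of [A-Za-z0-9_], so non-ASCII
// letters such as "ł" and both halves of an emoji surrogate pair act as separators.
const asciiSplit = (text) => text.split(/\W+/).filter(Boolean);

// A Unicode-aware alternative (an assumption for comparison): split on
// anything that is not a letter, a number, or an underscore.
const unicodeSplit = (text) => text.split(/[^\p{L}\p{N}_]+/u).filter(Boolean);

console.log(asciiSplit('sokołowska')); // [ 'soko', 'owska' ]
console.log(asciiSplit('hello😄hello')); // [ 'hello', 'hello' ]
console.log(unicodeSplit('sokołowska')); // [ 'sokołowska' ]
```

The first two results match the "soko owska" and "hello hello" rows reported for the search table.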

Harbormaster completed remote builds in B17460: Diff 23919.Mar 21 2023, 11:20 AM

"I can change it either way, but we should decide whether we possibly want to use this table for calendar entries in the future. The benefit of using the same table for messages and entries is that if we fetch them by for example time, we will get them sorted correctly right away, and won't have to think about how many of each we should fetch"

Defer to you on this! If you decide to name it in such a way to support calendar entries in the future, then please rename the columns as well. On the other hand, if you decide to name it in a message-specific way, then please rename the table.

Keep in mind it's very easy to rename a table / columns later!

(My comments are all repeats of my previous review)

keyserver/src/database/setup-db.js
254 ↗	(On Diff #23750)	It doesn't appear that you responded to this feedback?
255 ↗	(On Diff #23750)	Please investigate how the `\W` character class is treated specifically in Node.js's RegExp implementation

This revision now requires changes to proceed.Mar 21 2023, 2:17 PM

ashoat added inline comments.Mar 24 2023, 4:22 PM

keyserver/src/database/setup-db.js
255 ↗	(On Diff #23750)	I did some playing around with V8 and generally confirmed @inka's research. I didn't get to the point where I was able to precisely identify that `\W` matches `[^A-Za-z0-9_]`... instead, I gave up early after I found that the behavior isn't really what we want, e.g.: (screenshot attached)
However, after some Googling and playing around I was able to get `Intl.Segmenter` to handle this in a much better way: (screenshot attached)
@inka, do you think we could use `Intl.Segmenter` for tokenization, and then the `natural` library for stemming? I'm open to `patch-package` on `natural` if necessary (but ideally perhaps we can avoid it).
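
A minimal sketch of tokenizing with `Intl.Segmenter`, as suggested in the comment above. The helper name `tokenize`, the `'en'` default locale, and the lowercasing step are assumptions made for illustration; the screenshots from the review are not reproduced here, and stemming with `natural` is left out. It assumes a Node.js version that ships `Intl.Segmenter`.

```javascript
// Split text into word-like segments using the built-in Intl.Segmenter.
// Segments that are not word-like (whitespace, punctuation) are dropped.
function tokenize(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const tokens = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (isWordLike) {
      tokens.push(segment.toLowerCase());
    }
  }
  return tokens;
}

console.log(tokenize('Sokołowska says hello'));
```

Unlike the `\W`-based split, this keeps "sokołowska" as a single token, which is the property the review discussion is after.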

ashoat attached a referenced file: F443323: Screenshot 2023-03-24 at 12.18.25 PM.png. Mar 24 2023, 4:22 PM

ashoat attached a referenced file: F443322: Screenshot 2023-03-24 at 10.28.41 AM.png.

ashoat added inline comments.Mar 24 2023, 7:01 PM

keyserver/src/database/setup-db.js
255 ↗	(On Diff #23750)	There also seem to be a variety of tokenizers available in the `natural` library. Can you clarify:
1. What is the default tokenization algorithm used by `natural.PorterStemmer.tokenizeAndStem`? I'm guessing it's `AggressiveTokenizer` since you linked that, but I want to make sure.
2. Have you looked into the other tokenization algorithms? Are any of them better than `AggressiveTokenizer` for the string I shared in my screenshot above? Are any of them better than `Intl.Segmenter`?

In retrospect, D7077 is the right place for this discussion – @inka, would you mind responding to my questions in this diff there?

This revision is now accepted and ready to land.Mar 24 2023, 7:02 PM

Ah but actually some of my other feedback hasn't been addressed yet

This revision now requires changes to proceed.Mar 24 2023, 7:03 PM

Address review: rename the table and give a name to the fulltext index. All indexes are named by concatenating the names of the columns they index, so I'm using this convention.
Going to continue the discussion on parsing in D7077
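
For orientation, the table shape after this round of review might look roughly like the sketch below. Only the table name `message_search`, the three column names, the `utf8mb4` / `utf8mb4_bin` settings, and the index named after its column come from this thread. The column types, `ENGINE=InnoDB`, and the constant names are assumptions, and the actual statements in setup-db.js and migration-config.js may differ.

```javascript
// Sketch only: column types and engine are assumptions, not taken from the diff.
const createMessageSearchTable = `
  CREATE TABLE IF NOT EXISTS message_search (
    original_message_id bigint(20) NOT NULL,
    message_id bigint(20) NOT NULL,
    processed_content mediumtext COLLATE utf8mb4_bin
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin;
`;

// Indexes are added in a separate statement; the FULLTEXT index is named
// after the single column it covers.
const addMessageSearchIndexes = `
  ALTER TABLE message_search
    ADD PRIMARY KEY (original_message_id),
    ADD FULLTEXT INDEX processed_content (processed_content);
`;
```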

inka added inline comments.Mar 27 2023, 7:46 AM

keyserver/src/database/migration-config.js
230–232 ↗	(On Diff #24158)	It seems we always create indexes in separate statements in setup-db.js. I followed this convention here. If this is incorrect please let me know.

Harbormaster completed remote builds in B17651: Diff 24158.Mar 27 2023, 7:52 AM

inka mentioned this in D7077: [keyserver] Process new messages for search.Mar 27 2023, 8:01 AM

ashoat accepted this revision.Mar 27 2023, 1:54 PM

ashoat added inline comments.

keyserver/src/database/migration-config.js
230–232 ↗	(On Diff #24158)	Seems fine! I would maybe move the `PRIMARY KEY` here too for consistency with your code in `setup-db.js`, but it doesn't matter much

This revision is now accepted and ready to land.Mar 27 2023, 1:54 PM

Rebase, move creating PRIMARY KEY

Harbormaster completed remote builds in B17725: Diff 24260.Mar 28 2023, 8:12 AM

inka added inline comments.Mar 28 2023, 12:13 PM

keyserver/src/database/setup-db.js
255 ↗	(On Diff #23750)	`natural.PorterStemmer.tokenizeAndStem` uses `AggressiveTokenizer` by default (PorterStemmer gets this function from Stemmer, which it extends, and Stemmer uses AggressiveTokenizer). Of the other tokenizers, the only one that looks potentially useful for us is `RegexpTokenizer`, which allows defining our own regex.
  const tokenizer = new natural.RegexpTokenizer({
    pattern: /([A-Za-zÀ-ÿ-'"]+|[0-9._]+|.|!|\?|:|;|,|-)/iu,
  });
  console.log(tokenizer.tokenize(text));
produces
  [ 'Hello', 'test', "it's", 'working', '😄', 'test' ]
But it would be a bit difficult to have it parse `it's` as `it's` and `'hello'` as `hello`. Generally we would have to come up with a good regex.
The disadvantage of `Intl.Segmenter` is that there seems to be some problem with types: `Intl` is typed in the flow builtin definitions, but the definition does not contain `Segmenter`: https://github.com/facebook/flow/blob/main/lib/intl.js
But it works, so if I work around the flow problem it should be fine.