Page MenuHomePhabricator

[keyserver] Add search table to the db
ClosedPublic

Authored by inka on Mar 15 2023, 4:11 AM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Dec 19, 5:59 PM
Unknown Object (File)
Tue, Dec 10, 2:46 AM
Unknown Object (File)
Fri, Dec 6, 9:47 PM
Unknown Object (File)
Fri, Dec 6, 9:47 PM
Unknown Object (File)
Fri, Dec 6, 5:54 PM
Unknown Object (File)
Thu, Dec 5, 8:24 PM
Unknown Object (File)
Thu, Dec 5, 8:24 PM
Unknown Object (File)
Thu, Dec 5, 8:24 PM
Subscribers

Details

Summary

issue: https://linear.app/comm/issue/ENG-3313/create-a-table-in-the-db-for-holding-processed-messages-for-search
We will hold messages processed for search in a new bd table. This table will has:

  • processed_content - stemmed and stripped of stopwords content. When the message changes, this filed will be updated accordingly.
  • message_id the id of the message. When the message gets edited, this id will be updated to the id of the edit message. Otherwise fetching the correct full content would be difficult.
  • original_message_id - the id of the original message. This is first of all to have a primary key in this table that doesn't change (since message_id will be changing). The id of the original

message was just a natural candidate for this. Secondly, since we already have this, this will be used to fetch the original message (to ex access creation time or the author)

For now we will be only keeping here processed text messages, but we might extend this to ex calendar entries in the future.

Test Plan

Run yarn dev in keyserver/ checked that the migration succeeds. Created a new db, attached it in db_config.json, run yarn dev in keyserver/, cheked that the db is created correctly.

Diff Detail

Repository
rCOMM Comm
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

I'm wondering if I shouldn't change "message" to some other word, since we might want to include calendar entries in the future. But the word "entry" already has a special meaning in our code, and I don't have good ideas how to call a super class of both message and entry

inka requested review of this revision.Mar 15 2023, 4:26 AM
ashoat requested changes to this revision.Mar 15 2023, 7:40 AM
ashoat added inline comments.
keyserver/src/database/setup-db.js
252 ↗(On Diff #23750)

In the diff description, you say:

When the message gets edited, this id will be updated to the id of the edit message. Otherwise fetching the correct full content would be difficult.

Can you clarify how / why it would be difficult?

253 ↗(On Diff #23750)

Is processed_content a string tokens.join(' '), eg. looks like "after stemming / stop words" here?

254 ↗(On Diff #23750)

What does this do? Does it create a new column, or is it a modifier on an existing column? Should it be given a name?

255 ↗(On Diff #23750)

Should we use utf8mb4 like we use elsewhere? I think it has some better behavior for emojis or something, I don't recall exactly

375 ↗(On Diff #23750)

Did you actually test this like you claim in the test plan? I assume this comma is not valid SQL, so I'm confused as to how this didn't come up in your testing...

This revision now requires changes to proceed.Mar 15 2023, 7:40 AM
keyserver/src/database/setup-db.js
252 ↗(On Diff #23750)

Can you clarify how / why it would be difficult?

Suppose we kept the id of the original message. When we run a full text search on this table, we only get the processed message, so we need to fetch the full message content form the messages table. To fetch the most recent edit, we would have to query for all messages with id or target_message matching the id and pick the most recent one.
Me and @tomek discussed this and figured that if we save the id of the edited message, then we can very easily fetch this and the original message, and send them to the client (in some single data structure), and they will have all the information about this message that they need.

For the approach with changing the id in message_id column (and using the fact that we keep the old id as the primary key) the query would be something like:

SELECT * FROM messages m JOIN search s ON s.original_message_id = m.id OR m.id = s.message_id WHERE MATCH(s.processed_content) AGAINST(query);

And we would get the two messages for each match- the original and the most recent edit.

But for the approach where we keep only the old id we would need to either run two queries to get the original and the newest edit separately, or have the query return all edits and handle that in js, or make one query return all originals but only the newest edits (I can only think of a way to do that that requires a nested SELECT:

SELECT * FROM messages m JOIN search s ON s.original_message_id = m.id OR m.target_message = s.original_message_id WHERE MATCH(s.processed_content) AGAINST('friend') AND (m.target_message IS NULL OR m.time=(SELECT MAX(mm.time) FROM messages mm WHERE mm.target_message = m.target_message));

)

253 ↗(On Diff #23750)

Yes

254 ↗(On Diff #23750)

It is a modifier on an existing column. It can be given a name, but this also works - it gives the index the name of the column. This is how it's done in docs. Should I give it a name anyway?

255 ↗(On Diff #23750)

I suppose I should use the same value as for messages table, yes. utf8mb4 can store 4-byte chars, that include musical symbols, some historic alphabets, some emoji’s and some other symbols.

375 ↗(On Diff #23750)

I tested this diff as described in the test plan (that's how I encountered the problem with migrations), but I think I forgot to check the effects of this this last statement. Sorry

keyserver/src/database/setup-db.js
255 ↗(On Diff #23750)

Although the natural library treats everything not in [a-zA-Z0-9] as a separator, so emojis are not indexed (source code)

Address review. Tested the new db creation (in prod mode, because I'm having weird problems with my setup. For more info see https://linear.app/comm/issue/ENG-3325/clean-database-migration-fail),
made sure the primary key is created.

ashoat requested changes to this revision.Mar 20 2023, 2:30 PM

Almost there!

keyserver/src/database/setup-db.js
250 ↗(On Diff #23837)

Can we name this message_search, since the columns seem specific to messages?

255 ↗(On Diff #23837)

For the messages table, we set COLLATE=utf8mb4_bin here as well. Can we add that here? It seems like a good idea in case we ever add another column but forget to set the COLLATE mode

252 ↗(On Diff #23750)

Thanks for explaining!

254 ↗(On Diff #23750)

It looks like in the codebase we have a convention to name all indices. Can you give it a name?

255 ↗(On Diff #23750)

Thanks for investigating the natural library! Did you also investigate how the \W character class is treated specifically in Node.js's RegExp implementation? Wondering where you got the [a-zA-Z0-9] part

This revision now requires changes to proceed.Mar 20 2023, 2:30 PM

Address review, rebase

keyserver/src/database/setup-db.js
250 ↗(On Diff #23837)

When I talked to @tomek about this task initially, we thought that in the future we might want to use this for searching over calendar entries as well. I was referring to that in my comment on this diff

I'm wondering if I shouldn't change "message" to some other word, since we might want to include calendar entries in the future. But the word "entry" already has a special meaning in our code, and I don't have good ideas how to call a super class of both message and entry

and in the summary

For now we will be only keeping here processed text messages, but we might extend this to ex calendar entries in the future.

I can change it either way, but we should decide whether we possibly want to use this table for calendar entires in the future. The benefit of using the same table for messages and entries is that if we fetch them by for example time, we will get them sorted correctly right away, and won't have to think about how many of each we should fetch

255 ↗(On Diff #23750)

I got it from regex page on wikipedia, and regex101. They both say that \W matches exactly to [^A-Za-z0-9_] but _ is added back by natural. It is also the same in MDN docs
Typing in "sokołowska" shows "soko owska" in the search table with this implementation, and typing "hello😄hello" shows "hello hello"

I can change it either way, but we should decide whether we possibly want to use this table for calendar entires in the future. The benefit of using the same table for messages and entries is that if we fetch them by for example time, we will get them sorted correctly right away, and won't have to think about how many of each we should fetch

Defer to you on this! If you decide to name it in such a way to support calendar entries in the future, then please rename the columns as well. On the other hand, if you decide to name it in a message-specific way, then please rename the table.

Keep in mind it's very easy to rename a table / columns later!

ashoat requested changes to this revision.Mar 21 2023, 7:17 AM

(My comments are all repeats of my previous review)

keyserver/src/database/setup-db.js
254 ↗(On Diff #23750)

It doesn't appear that you responded to this feedback?

255 ↗(On Diff #23750)

Please investigate how the \W character class is treated specifically in Node.js's RegExp implementation

This revision now requires changes to proceed.Mar 21 2023, 7:17 AM
keyserver/src/database/setup-db.js
255 ↗(On Diff #23750)

I did some playing around with V8 and generally confirmed @inka's research. I didn't get to the point where I was able to precisely identify that \W matches [^A-Za-z0-9_]... instead, I gave up early after I found that the behavior isn't really what we want, eg.:

Screenshot 2023-03-24 at 10.28.41 AM.png (177×431 px, 35 KB)

However, after some Googling and playing around I was able to get Intl.Segmenter to handle this in a much better way:

Screenshot 2023-03-24 at 12.18.25 PM.png (144×897 px, 44 KB)

@inka, do you think we could use Intl.Segmenter for tokenization, and then the natural library for stemming? I'm open to patch-package on natural if necessary (but ideally perhaps we can avoid it).

keyserver/src/database/setup-db.js
255 ↗(On Diff #23750)

There also seem to be a variety of tokenizers available in the natural library. Can you clarify:

  1. What is the default tokenization algorithm used by natural.PorterStemmer.tokenizeAndStem? I'm guessing it's AggressiveTokenizer since you linked that, but I want to make sure.
  2. Have you looked into the other tokenization algorithms? Are any of them better than AggressiveTokenizer for the string I shared in my screenshot above? Are any of them better than Intl.Segmenter?

In retrospect, D7077 is the right place for this discussion – @inka, would you mind responding to my questions in this diff there?

This revision is now accepted and ready to land.Mar 24 2023, 12:02 PM
ashoat requested changes to this revision.Mar 24 2023, 12:03 PM

Ah but actually some of my other feedback hasn't been addressed yet

This revision now requires changes to proceed.Mar 24 2023, 12:03 PM

Address review: rename the table and give a name to the fulltext index. All indexes are named by concatenating the names of the columns they index, so I'm using this convention.
Goinng to continue the discussion on parsing in D7077

keyserver/src/database/migration-config.js
230–232 ↗(On Diff #24158)

It seems we always create indexes in separate statements in setup-db.js. I followed this convention here. If this is incorrect please let me know.

ashoat added inline comments.
keyserver/src/database/migration-config.js
230–232 ↗(On Diff #24158)

Seems fine! I would maybe move the PRIMARY KEY here too for consistency with your code in setup-db.js, but it doesn't matter much

This revision is now accepted and ready to land.Mar 27 2023, 6:54 AM

Rebase, move creating PRIMARY KEY

keyserver/src/database/setup-db.js
255 ↗(On Diff #23750)

natural.PorterStemmer.tokenizeAndStem uses AggressiveTokenizer by default (PorterStemmer gets this function from Stemmer that it extendes, and Stemmer uses AggressiveTokenizer).

From the other tokanizers the only one that looks potentially useful for us is RegexpTokenizer, that allows to define own regex.

const tokenizer = new natural.RegexpTokenizer({
      pattern: /([A-Za-zÀ-ÿ-'"]+|[0-9._]+|.|!|\?|:|;|,|-)/iu,
    });
    console.log(tokenizer.tokenize(text));

produces

[ 'Hello', 'test', "it's", 'working', '😄', 'test' ]

But it would be a bit difficult to have it parse it's as it's and 'hello' as hello. Generally we would have to come up with a good regex.

The disadvantage of Intl.Segmenter is that there seems to be some problem with types: Intl is typed in flow builtin definitions, but it does not contain Segmenter https://github.com/facebook/flow/blob/main/lib/intl.js
But it works, so if I work around the flow problem it should be fine.

keyserver/src/database/setup-db.js
255 ↗(On Diff #23750)

(sorry, was supposed to answer elsewhere, going to comment on D7077 as well )