Paths

Table of Contentst

[services] Backup - Add data hash to the log item
ClosedPublic
Actions

Authored by • karol on Jun 22 2022, 3:30 AM.

Details

Reviewers

tomek
ashoat

Commits

rCOMM2dc9d5f4c27b: [services] Backup - Add data hash to the log item

Summary

Depends on D4277

linear task: https://linear.app/comm/issue/ENG-1288/properly-handle-the-dynamodb-item-size-limit

If we want to be able to move a log item's data dynamically after creation, from the database to the S3, we need to store the data hash in the backup log's database.

Test Plan

Services still build

Diff Detail

Repository

rCOMM Comm

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

• karol created this revision.Jun 22 2022, 3:30 AM

Herald added subscribers: • abosh, atul, • adrian and 2 others. · View Herald TranscriptJun 22 2022, 3:30 AM

Harbormaster returned this revision to the author for changes because remote builds failed.Jun 22 2022, 3:32 AM

Harbormaster failed remote builds in B9869: Diff 13654!

fix build

Harbormaster completed remote builds in B9876: Diff 13661.Jun 22 2022, 4:10 AM

• karol requested review of this revision.Jun 22 2022, 4:10 AM

• karol mentioned this in D4233: [services] Backup - Fix data size database limit.Jun 22 2022, 5:37 AM

• karol edited the summary of this revision. (Show Details)Jun 22 2022, 5:39 AM

• karol edited the test plan for this revision. (Show Details)

• karol added reviewers: tomek, ashoat.

• karol added a parent revision: D4277: [services] Tests - Blob - Add test logic.

• karol added a child revision: D4321: [services] Backup - Send Log - Fix logic in terminate callback.Jun 22 2022, 5:50 AM

• karol mentioned this in D4322: [services] Backup - Add data hash to database manager.Jun 22 2022, 5:54 AM

ashoat added inline comments.Jun 22 2022, 7:47 AM

services/backup/src/DatabaseEntities/LogItem.cpp
129–134 ↗	(On Diff #13661)	Is there a way we can avoid hardcoding these? I'm worried that somebody will add a field and forget to update this

Back to you with question

This revision now requires changes to proceed.Jun 22 2022, 7:48 AM

tomek requested changes to this revision.Jun 22 2022, 8:21 AM

tomek added inline comments.

services/backup/src/DatabaseEntities/LogItem.cpp
129 ↗	(On Diff #13661)	Which encoding is used by our DB? Are we sure that every char takes exactly 1 byte there?

• karol requested review of this revision.Jun 29 2022, 7:26 AM

• karol added inline comments.

services/backup/src/DatabaseEntities/LogItem.cpp
129 ↗	(On Diff #13661)	Based on this: Strings are Unicode with UTF-8 binary encoding. The size of a string is (length of attribute name) + (number of UTF-8-encoded bytes). from googling UTF-8 is an 8-bit character encoding for Unicode so 1 char = 8bits = 1byte, so the answer is yes.
129–134 ↗	(On Diff #13661)	I do not see a clear way to do this, sorry :c I understand your concern and I share it too. We may try to wrap these values into some kind of array and then iterate here, what do you think?

ashoat accepted this revision.Jun 29 2022, 9:34 AM

ashoat added inline comments.

services/backup/src/DatabaseEntities/LogItem.cpp
129–134 ↗	(On Diff #13661)	Happy to address this a later diff. Maybe we can create a Linear task to discuss? One potential idea I have – could we do something similar to the "message specs" we have on native C++ here? The idea would be, if somebody wants to add a new `LogItemField` they would have to create a new subclass. That might be overkill, though. I'm definitely open to other solutions!

ashoat added inline comments.Jun 29 2022, 9:38 AM

services/backup/src/DatabaseEntities/LogItem.cpp
129 ↗	(On Diff #13661)	In Unicode (including UTF-8), some chars can take more than others. In particular, UTF-8 takes 1-4 bytes for a single char. This can be found with a simple Google search: Only the first 128 UTF-8 codepoints take a single byte. That said, it's probably fair to assume that the chars in `FIELD_BACKUP_ID` are simple ones that only take a single byte. If we end up creating "specs" as I suggest below, we could set up a unit test that goes through every single spec and asserts that every char in every field name is one of the first 128 UTF-8 codepoints. Alternately, we could simply use a more robust method for counting the size of a UTF-8 string. I'm sure something is available in `std`.

ashoat attached a referenced file: F87117: Screen Shot 2022-06-29 at 12.36.26 PM.png. (Show Details)Jun 29 2022, 9:38 AM

tomek accepted this revision.Jun 30 2022, 3:33 AM

tomek added inline comments.

services/backup/src/DatabaseEntities/LogItem.cpp
129–134 ↗	(On Diff #13661)	This approach worked well for messages, so I agree that this might be a good idea

This revision is now accepted and ready to land.Jun 30 2022, 3:33 AM

• karol added inline comments.Jul 1 2022, 4:57 AM

services/backup/src/DatabaseEntities/LogItem.cpp
129 ↗	(On Diff #13661)	Oh snap, sorry about that, you're right, it can be 1-4 bytes long. I think it's better to use a library than play with the bytes ourselves checking with the docs, etc. How about this one https://github.com/nemtrif/utfcpp? I quickly tested it and it worked just fine. As a side note: I think that this mistake wouldn't break anything. Assuming there may be some utf-8 encoded data, the utf-8 size will always be smaller than the normal size. So at most, we'd move the data to the S3 sooner than needed.
129–134 ↗	(On Diff #13661)	https://linear.app/comm/issue/ENG-1336/avoid-hardcoding-database-items-fields-names

ashoat added inline comments.Jul 1 2022, 10:40 AM

services/backup/src/DatabaseEntities/LogItem.cpp
129 ↗	(On Diff #13661)	Are you sure there's no way to count the length of a UTF-8 string in `std`? How about this approach?

• karol added inline comments.Jul 8 2022, 2:14 AM

services/backup/src/DatabaseEntities/LogItem.cpp
129 ↗	(On Diff #13661)	I've never said there's no way to do this with std. I know we can do this with std. As I said: I think it's better to use a library than play with the bytes ourselves checking with the docs, etc. You didn't really reply to that idea/opinion. As I looked on the Internet there are numerous ways to do this. I'm going to create a task for this as: As a side note: I think that this mistake wouldn't break anything. Assuming there may be some utf-8 encoded data, the utf-8 size will always be smaller than the normal size. So at most, we'd move the data to the S3 sooner than needed. https://linear.app/comm/issue/ENG-1367/properly-handle-dynamodb-encoding-in-services

Closed by commit rCOMM2dc9d5f4c27b: [services] Backup - Add data hash to the log item (authored by • karol). · Explain WhyJul 8 2022, 2:30 AM

This revision was automatically updated to reflect the committed changes.

• karol added a commit: rCOMM2dc9d5f4c27b: [services] Backup - Add data hash to the log item.

• karol mentioned this in rCOMM764dee263ce7: [services] Backup - Add data hash to database manager.

I've never said there's no way to do this with std. I know we can do this with std. As I said:

I think it's better to use a library than play with the bytes ourselves checking with the docs, etc.

You didn't really reply to that idea/opinion.

@karol-bisztyga, I'm really not sure what you're stressing about here, and what the reason is for this aggressive comment.

The std suggestion I made is a direct response to the quote you shared, that you felt I didn't respond to. The std suggestion was, in fact, my response to your "idea/opinion". I think we should use a library, and that library should be std (the C++ standard library).

Hopefully this clarifies any confusion?