
[services] Backup - Add data hash to the log item
ClosedPublic

Authored by karol on Jun 22 2022, 3:30 AM.

Details

Summary

Depends on D4277

Linear task: https://linear.app/comm/issue/ENG-1288/properly-handle-the-dynamodb-item-size-limit

If we want to be able to move a log item's data dynamically after creation, from the database to S3, we need to store the data's hash in the backup log's database.
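
For context, a minimal sketch of the shape of the change; the class layout and names below are assumptions for illustration, not the exact code in this diff.

```cpp
// Hypothetical LogItem entity with the new hash field; only the parts
// relevant to this diff are shown.
#include <string>

class LogItem {
  std::string backupID;
  std::string logID;
  std::string value;    // the log data itself (or a blob holder once moved)
  std::string dataHash; // hash of the data, stored so the data can be moved
                        // from DynamoDB to S3 after the item is created

public:
  const std::string &getDataHash() const {
    return this->dataHash;
  }
};
```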

Test Plan

Services still build

Diff Detail

Repository
rCOMM Comm
Lint
Not Applicable
Unit
Not Applicable

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed. Jun 22 2022, 3:32 AM
Harbormaster failed remote builds in B9869: Diff 13654!
karol edited the test plan for this revision.
karol added reviewers: tomek, ashoat.
services/backup/src/DatabaseEntities/LogItem.cpp
129–134 ↗(On Diff #13661)

Is there a way we can avoid hardcoding these? I'm worried that somebody will add a field and forget to update this

ashoat requested changes to this revision. Jun 22 2022, 7:48 AM

Back to you with a question

This revision now requires changes to proceed. Jun 22 2022, 7:48 AM
tomek requested changes to this revision. Jun 22 2022, 8:21 AM
tomek added inline comments.
services/backup/src/DatabaseEntities/LogItem.cpp
129 ↗(On Diff #13661)

Which encoding is used by our DB? Are we sure that every char takes exactly 1 byte there?

karol added inline comments.
services/backup/src/DatabaseEntities/LogItem.cpp
129 ↗(On Diff #13661)

Based on this:

Strings are Unicode with UTF-8 binary encoding. The size of a string is (length of attribute name) + (number of UTF-8-encoded bytes).

and from googling:

UTF-8 is an 8-bit character encoding for Unicode

so 1 char = 8 bits = 1 byte, and the answer is yes.
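
For example, under that assumption, a hypothetical string attribute named backupID (8 characters) holding the 6-character value abc123 would count as 8 + 6 = 14 bytes toward the item size.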

129–134 ↗(On Diff #13661)

I do not see a clear way to do this, sorry :c I understand your concern and I share it. We could try to wrap these values into some kind of array and then iterate here. What do you think?
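
As a sketch of that idea, the size computation could iterate over name/value pairs built in one place; the function name, field names, and getters below are assumptions, not code from this diff.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Estimate the DynamoDB size of an item as the sum of
// (attribute name length) + (UTF-8 bytes of the value) over all fields.
std::size_t estimateItemSize(
    const std::vector<std::pair<std::string, std::string>> &fields) {
  std::size_t size = 0;
  for (const auto &[name, value] : fields) {
    // std::string::size() is a byte count, which matches DynamoDB's formula
    // as long as the attribute names are plain ASCII.
    size += name.size() + value.size();
  }
  return size;
}

// Usage: build the list once per item, so a new field only has to be added
// in one place instead of being hardcoded in the size computation.
// std::size_t size = estimateItemSize({
//     {"backupID", item.getBackupID()},
//     {"logID", item.getLogID()},
//     {"dataHash", item.getDataHash()},
// });
```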

ashoat added inline comments.
services/backup/src/DatabaseEntities/LogItem.cpp
129–134 ↗(On Diff #13661)

Happy to address this in a later diff. Maybe we can create a Linear task to discuss?

One potential idea I have: could we do something similar here in C++ to the "message specs" we have on native? The idea would be that if somebody wants to add a new LogItemField, they would have to create a new subclass.

That might be overkill, though. I'm definitely open to other solutions!
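
A rough sketch of what the "spec per field" idea could look like, loosely modeled on the message specs pattern; every name here is hypothetical.

```cpp
#include <memory>
#include <string>
#include <vector>

// Stand-in for the existing services/backup entity.
struct LogItem {
  std::string backupID;
  std::string value;
  std::string dataHash;
};

// One spec per persisted field; adding a field to LogItem means adding a
// subclass, so the size computation cannot silently miss it.
struct LogItemFieldSpec {
  virtual ~LogItemFieldSpec() = default;
  virtual std::string fieldName() const = 0;
  virtual const std::string &fieldValue(const LogItem &item) const = 0;
};

struct BackupIDSpec : LogItemFieldSpec {
  std::string fieldName() const override { return "backupID"; }
  const std::string &fieldValue(const LogItem &item) const override {
    return item.backupID;
  }
};

struct DataHashSpec : LogItemFieldSpec {
  std::string fieldName() const override { return "dataHash"; }
  const std::string &fieldValue(const LogItem &item) const override {
    return item.dataHash;
  }
};

// Registry that the size computation (and a unit test) can iterate over.
std::vector<std::unique_ptr<LogItemFieldSpec>> makeLogItemFieldSpecs() {
  std::vector<std::unique_ptr<LogItemFieldSpec>> specs;
  specs.push_back(std::make_unique<BackupIDSpec>());
  specs.push_back(std::make_unique<DataHashSpec>());
  return specs;
}
```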

services/backup/src/DatabaseEntities/LogItem.cpp
129 ↗(On Diff #13661)

In Unicode (including UTF-8), some chars can take more bytes than others. In particular, UTF-8 takes 1-4 bytes for a single char. This can be found with a simple Google search:

[Screenshot: Google search result describing UTF-8 byte lengths]

Only the first 128 Unicode codepoints take a single byte in UTF-8. That said, it's probably fair to assume that the chars in FIELD_BACKUP_ID are simple ones that only take a single byte.

If we end up creating "specs" as I suggest above, we could set up a unit test that goes through every single spec and asserts that every char in every field name is one of the first 128 Unicode codepoints.

Alternately, we could simply use a more robust method for counting the size of a UTF-8 string. I'm sure something is available in std.
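
As a sketch of the two options mentioned above, both are doable with nothing but std; these helpers are illustrative and are not code from this diff.

```cpp
#include <cstddef>
#include <string>

// If the std::string holds UTF-8-encoded data, its byte count (which is
// what DynamoDB's formula asks for) is simply size().
std::size_t utf8ByteCount(const std::string &s) {
  return s.size();
}

// Number of Unicode code points: count every byte that is not a UTF-8
// continuation byte (10xxxxxx).
std::size_t utf8CodepointCount(const std::string &s) {
  std::size_t count = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) {
      ++count;
    }
  }
  return count;
}

// The unit-test assertion suggested above: every char in a field name is
// one of the first 128 codepoints, i.e. a single-byte ASCII char.
bool isAsciiOnly(const std::string &fieldName) {
  for (unsigned char c : fieldName) {
    if (c > 0x7F) {
      return false;
    }
  }
  return true;
}
```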

tomek added inline comments.
services/backup/src/DatabaseEntities/LogItem.cpp
129–134 ↗(On Diff #13661)

This approach worked well for messages, so I agree that this might be a good idea.

This revision is now accepted and ready to land. Jun 30 2022, 3:33 AM
services/backup/src/DatabaseEntities/LogItem.cpp
129 ↗(On Diff #13661)

Oh snap, sorry about that, you're right, it can be 1-4 bytes long.

I think it's better to use a library than to play with the bytes ourselves, checking against the docs, etc.

How about this one: https://github.com/nemtrif/utfcpp? I quickly tested it and it worked just fine.

As a side note: I think this mistake wouldn't break anything. Assuming there may be some UTF-8-encoded data, the UTF-8 size will always be smaller than the normal size, so at worst we'd move the data to S3 sooner than needed.
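
For illustration, a minimal sketch of counting code points with utfcpp, assuming the header-only utf8.h is on the include path; not code from this diff.

```cpp
#include <cstddef>
#include <string>

#include "utf8.h" // nemtrif/utfcpp, header-only

// Count the Unicode code points in a UTF-8 encoded std::string;
// std::string::size() would instead return the number of bytes. The
// checked utfcpp functions throw if the input is not valid UTF-8.
std::size_t codepointCount(const std::string &s) {
  return static_cast<std::size_t>(utf8::distance(s.begin(), s.end()));
}
```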

services/backup/src/DatabaseEntities/LogItem.cpp
129 ↗(On Diff #13661)

Are you sure there's no way to count the length of a UTF-8 string in std? How about this approach?

services/backup/src/DatabaseEntities/LogItem.cpp
129 ↗(On Diff #13661)

I've never said there's no way to do this with std. I know we can do this with std. As I said:

I think it's better to use a library than to play with the bytes ourselves, checking against the docs, etc.

You didn't really reply to that idea/opinion.

From what I found on the Internet, there are numerous ways to do this.

I'm going to create a task for this, since:

As a side note: I think this mistake wouldn't break anything. Assuming there may be some UTF-8-encoded data, the UTF-8 size will always be smaller than the normal size, so at worst we'd move the data to S3 sooner than needed.

https://linear.app/comm/issue/ENG-1367/properly-handle-dynamodb-encoding-in-services

I've never said there's no way to do this with std. I know we can do this with std. As I said:

I think it's better to use a library than to play with the bytes ourselves, checking against the docs, etc.

You didn't really reply to that idea/opinion.

@karol-bisztyga, I'm really not sure what you're stressing about here, and what the reason is for this aggressive comment.

The std suggestion I made is a direct response to the quote you shared, that you felt I didn't respond to. The std suggestion was, in fact, my response to your "idea/opinion". I think we should use a library, and that library should be std (the C++ standard library).

Hopefully this clears up any confusion?