Page MenuHomePhabricator

[services] Backup - Add function to get utf8 string length
AbandonedPublic

Authored by jakub on Jul 15 2022, 3:01 AM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, May 14, 6:41 AM
Unknown Object (File)
Thu, May 9, 3:57 PM
Unknown Object (File)
Tue, Apr 30, 6:18 PM
Unknown Object (File)
Tue, Apr 30, 6:18 PM
Unknown Object (File)
Tue, Apr 30, 6:18 PM
Unknown Object (File)
Tue, Apr 30, 6:18 PM
Unknown Object (File)
Tue, Apr 30, 6:18 PM
Unknown Object (File)
Tue, Apr 30, 6:17 PM

Details

Summary

Depends on D4505

Fixing function for getting utf8 string length. Now it returns proper length of string with multibytes characters, not a count of bytes that string takes.

Test Plan

cd services && yarn run-unit-tests backup

Diff Detail

Repository
rCOMM Comm
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

Harbormaster returned this revision to the author for changes because remote builds failed.Jul 15 2022, 3:03 AM
Harbormaster failed remote builds in B10551: Diff 14516!

Add missing <algorithm> library

jakub edited the test plan for this revision. (Show Details)

This looks much cleaner than before! Can you either explain how this works, or link a resource that does?

karol added inline comments.
services/backup/src/Tools.cpp
58 ↗(On Diff #14520)

So, the assumption here is that every starting byte (a byte that starts a multi-byte/single-byte sign) starts with something else than 0x80 and all other bytes start with 0x80?

This revision is now accepted and ready to land.Jul 18 2022, 2:01 AM
tomek requested changes to this revision.Jul 18 2022, 5:53 AM
tomek added inline comments.
services/backup/src/Tools.cpp
58 ↗(On Diff #14520)

This makes sense: each code point starts with either 0 and has 1 byte or with 11 and has 2 or more bytes.

Please add a comment that explains what's going on here.

This revision now requires changes to proceed.Jul 18 2022, 5:53 AM

Add comment with explanations of getUtf8Length function

tomek added inline comments.
services/backup/src/Tools.cpp
60 ↗(On Diff #14948)
This revision is now accepted and ready to land.Jul 26 2022, 4:09 AM
jakub retitled this revision from [services] Backup - Fix function to get utf8 string length to [services] Backup - Add function to get utf8 string length.Jul 26 2022, 8:32 AM
ashoat requested changes to this revision.Jul 26 2022, 11:16 AM
ashoat added inline comments.
services/backup/src/Tools.cpp
56 ↗(On Diff #14950)

"Length" is a bit ambiguous, but to me it seems to imply we are counting the number of characters rather than the size (in bytes) of the string

62 ↗(On Diff #14950)

It's confusing to me that you say we "skip these bytes" when the purpose is to measure the size (in bytes) of the string. Are you counting the length of the strength or the size (in bytes) of the string?

This revision now requires changes to proceed.Jul 26 2022, 11:16 AM
services/backup/src/Tools.cpp
56 ↗(On Diff #14950)

D4505 has been abandoned, and yet this diff appears to be modifying an existing getUtf8Length function. Where is that function introduced? You're not still planning on landing D4505, are you?

jakub marked an inline comment as done.

Rebase

ashoat requested changes to this revision.Jul 27 2022, 6:27 AM

My questions have not been answered, but this diff has appeared in my queue. If you update a diff but aren't yet ready for review, but please hit "Plan Changes" to keep the diff off of your reviewer's queues

This revision now requires changes to proceed.Jul 27 2022, 6:27 AM

This function seems to be redundant, because the standard c++ string method "length" or "size" return exact size of string in bytes.
source

Example code, using the UTF8-CPP library for comparison and conversion from string to hex values

#include "utf8/utf8.h"

#include <iostream>
#include <string>

using namespace std;

void show(string str) {
  cout << "calculate for [" << str << "]" << endl;
  cout << "normal size: " << str.size() << endl;
  cout << "utf-8 size: " << utf8::distance(str.begin(), str.end()) << endl;
}

std::string string_to_hex(const std::string& input) {
  static const char hex_digits[] = "0123456789ABCDEF";
  std::string output;
  output.reserve(input.length() * 2);
  for (unsigned char c : input) {
    output.push_back(hex_digits[c >> 4]);
    output.push_back(hex_digits[c & 15]);
    output.push_back(' ');
  }
  return output;
}

int main()
{
  string s = "zasiąść";
  show(s);
  // the same output as from https://mothereff.in/utf-8
  cout << string_to_hex(s) << endl;
}

/**
 * output:

calculate for [zasiąść]
normal size: 10
utf-8 size: 7
7A 61 73 69 C4 85 C5 9B C4 87 
*/
/*
 * bytes from https://mothereff.in/utf-8
\x7A\x61\x73\x69\xC4\x85\xC5\x9B\xC4\x87
*/

The result was completely the same.

services/backup/src/Tools.cpp
56 ↗(On Diff #14950)

N

62 ↗(On Diff #14950)

Currently this function count the length. Previous one count size of the string in bytes, but it crashed unexpected in docker container during unit tests.

services/backup/src/Tools.cpp
62 ↗(On Diff #14950)

The fact that your two iterations were doing two very different things should've been a red flag (to you and the reviewers) that something was wrong here. Glad to hear that it's possible to measure the length of a std::string via generic functions in the std library... I was very surprised initially when it was suggested that that was not possible

jakub marked an inline comment as not done.Aug 17 2022, 2:31 AM
jakub added inline comments.
services/backup/src/Tools.cpp
56 ↗(On Diff #14950)

It should be introduced in this commit, because I'm not planning on landing D4505.