refactor(string): Add functions for handling UTF8 encoded strings #2045
return charsWritten * sizeof(wchar_t);
}
#else
#error "UTF-8 conversion functions not implemented for this platform"
I think ICU can replace the existing Windows-specific WideCharToMultiByte and MultiByteToWideChar functions, but it's probably easier to hold off on integrating until VC6 is dropped.
/**
	convert the given UnicodeString to UTF-8 and store it in self.
*/
void convertToUtf8(const UnicodeString& unicodeStr);
Perhaps this should merge with the established translate functions? Or would this cause trouble for some callers?
See:
void AsciiString::translate(const UnicodeString& stringSrc)
{
	validate();
	/// @todo srj put in a real translation here; this will only work for 7-bit ascii
	clear();
	Int len = stringSrc.getLength();
	for (Int i = 0; i < len; i++)
		concat((char)stringSrc.getCharAt(i));
	validate();
}

void UnicodeString::translate(const AsciiString& stringSrc)
{
	validate();
	/// @todo srj put in a real translation here; this will only work for 7-bit ascii
	clear();
	Int len = stringSrc.getLength();
	for (Int i = 0; i < len; i++)
		concat((WideChar)stringSrc.getCharAt(i));
	validate();
}
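To illustrate why the @todo above matters for a merge, here is a minimal sketch of what the per-character cast does to non-ASCII input. It uses std::wstring and std::string as stand-ins for UnicodeString and AsciiString, and narrow_by_cast is a hypothetical name, not a function from the codebase.

```cpp
#include <string>

// Hypothetical stand-in for the cast loop in AsciiString::translate.
// Each wide character is narrowed to a single char, so only the low
// byte survives on common compilers.
std::string narrow_by_cast(const std::wstring& src)
{
	std::string out;
	for (wchar_t ch : src)
		out.push_back(static_cast<char>(ch));
	return out;
}
```

Plain ASCII round-trips unchanged, but a character such as U+4E2D typically collapses to its low byte 0x2D, which is an unrelated ASCII character. Callers that rely on the current translate behavior for ASCII-only data would not notice a switch to a real UTF-8 conversion, while callers passing non-ASCII text would see different output.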
// Single-character functions

// Returns the number of bytes in a UTF-8 character based on the lead byte. Returns 0 if invalid lead byte.
Something reads wrong in this sentence.
size_t utf8_num_bytes(char lead);
// Returns number of bytes to truncate if str contains incomplete UTF-8 character at the end.
// 0 if no truncation needed. Assumes correct UTF-8 up to length - 1.
int utf8_truncate_if_incomplete(const char* str, size_t length);
Instead of using the term "truncate", it would be better to just return the number of invalid bytes at the end. This function does not need to care about the caller's use case; it should just report the facts.
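A rough sketch of what that could look like. The names utf8_sequence_length and utf8_count_incomplete_trailing_bytes are hypothetical, not taken from the PR, and the sketch only checks the structure of the last sequence (lead byte and continuation bytes); it does not reject overlong encodings or out-of-range code points.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: sequence length implied by a lead byte, 0 if the
// byte is a continuation byte or not a valid lead byte.
static std::size_t utf8_sequence_length(unsigned char lead)
{
	if (lead < 0x80) return 1;            // 0xxxxxxx (ASCII)
	if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
	if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
	if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
	return 0;
}

// Hypothetical name. Returns how many bytes at the end of str do not form
// a complete UTF-8 sequence; 0 if the string ends on a complete character.
// Assumes everything before the last sequence is valid UTF-8.
std::size_t utf8_count_incomplete_trailing_bytes(const char* str, std::size_t length)
{
	assert(str != nullptr || length == 0);

	// Walk back over up to three continuation bytes (10xxxxxx).
	std::size_t i = length;
	std::size_t continuation = 0;
	while (i > 0 && continuation < 3 &&
		(static_cast<unsigned char>(str[i - 1]) & 0xC0) == 0x80)
	{
		--i;
		++continuation;
	}
	if (i == 0)
		return continuation; // empty string, or nothing but stray continuation bytes

	const std::size_t expected = utf8_sequence_length(static_cast<unsigned char>(str[i - 1]));
	const std::size_t actual = continuation + 1; // lead byte plus continuation bytes found
	if (expected == actual)
		return 0;
	if (expected != 0 && expected < actual)
		return actual - expected; // complete character followed by stray continuation bytes
	return actual; // truncated sequence or invalid lead byte
}
```

The caller decides what to do with the count, for example shortening a buffer by that many bytes before passing it on.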
// Validation functions

// Validates whether the given string is valid UTF-8.
bool utf8_validate_string(const char* str);
Perhaps this function, and the function below, can be combined with the utf8_truncate_if_incomplete. Return 0 on success, otherwise return position of first invalid character beginning from the end of the string (excl. null terminator).
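One possible reading of that suggestion, as a sketch only: a single function that returns 0 for a valid string and otherwise the distance in bytes from the first invalid sequence to the end of the string. The name utf8_invalid_tail_length is hypothetical, and the check is structural (lead and continuation bytes), so overlong encodings and surrogate code points are not rejected here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical name. Returns 0 if str is structurally valid UTF-8.
// Otherwise returns the number of bytes from the first invalid sequence
// to the end of the string (excluding the null terminator).
std::size_t utf8_invalid_tail_length(const char* str)
{
	assert(str != nullptr);
	const std::size_t length = std::strlen(str);
	std::size_t i = 0;
	while (i < length)
	{
		const unsigned char lead = static_cast<unsigned char>(str[i]);
		std::size_t count;
		if (lead < 0x80) count = 1;
		else if ((lead & 0xE0) == 0xC0) count = 2;
		else if ((lead & 0xF0) == 0xE0) count = 3;
		else if ((lead & 0xF8) == 0xF0) count = 4;
		else return length - i; // stray continuation byte or invalid lead byte

		if (count > length - i)
			return length - i; // sequence is cut off by the end of the string

		for (std::size_t k = 1; k < count; ++k)
		{
			if ((static_cast<unsigned char>(str[i + k]) & 0xC0) != 0x80)
				return length - i; // expected a continuation byte
		}
		i += count;
	}
	return 0;
}
```

With this shape, a validity check is `utf8_invalid_tail_length(str) == 0`, and a caller that wants to drop an incomplete trailing character can subtract the result from the string length.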
// Gets the size in bytes required to hold the widechar representation of the given UTF-8 string, including null terminator.
size_t get_size_as_widechar(const char* s);
// Converts a widechar string to UTF-8. Assumes tgt has enough space (use get_size_as_utf8 to determine size).
size_t convert_widechar_to_utf8(const wchar_t* orig, char* tgt, size_t tgtsize);
Can shorten function name to wchar_to_utf8
// Single-character functions

// Returns the number of bytes in a UTF-8 character based on the lead byte. Returns 0 if invalid lead byte.
size_t utf8_num_bytes(char lead);
WWVegas code uses Snake_Case for function names. Otherwise use camelCase like GameEngine does.
// Gets the size in bytes required to hold the widechar representation of the given UTF-8 string, including null terminator.
size_t get_size_as_widechar(const char* s);
// Converts a widechar string to UTF-8. Assumes tgt has enough space (use get_size_as_utf8 to determine size).
size_t convert_widechar_to_utf8(const wchar_t* orig, char* tgt, size_t tgtsize);
The common naming convention for arguments like these is src and dest.
Also, dest often comes first, then src. See void* memcpy( void* dest, const void* src, std::size_t count )
The reason is that it matches the "LeftSide = RightSide" assignment convention more closely.
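For illustration, a sketch of how the declaration might read with the shorter name and the dest-first argument order. The signature, the return convention (bytes written including the null terminator, 0 on failure) and the portable encoding loop are assumptions for this sketch, not the implementation in the PR, which uses platform conversion functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical signature: dest first, then src, like memcpy.
// Returns the number of bytes written including the null terminator,
// or 0 if dest is too small or src contains an invalid code unit.
std::size_t wchar_to_utf8(char* dest, const wchar_t* src, std::size_t destSize)
{
	assert(dest != nullptr && src != nullptr);
	std::size_t written = 0;
	while (*src != L'\0')
	{
		std::uint32_t cp = static_cast<std::uint32_t>(*src++);
		if (cp >= 0xD800 && cp <= 0xDBFF)
		{
			// 16-bit wchar_t: combine a UTF-16 surrogate pair into one code point.
			const std::uint32_t low = static_cast<std::uint32_t>(*src);
			if (low < 0xDC00 || low > 0xDFFF)
				return 0; // unpaired high surrogate
			cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
			++src;
		}
		else if ((cp >= 0xDC00 && cp <= 0xDFFF) || cp > 0x10FFFF)
		{
			return 0; // unpaired low surrogate or out-of-range value
		}

		unsigned char buf[4];
		std::size_t n;
		if (cp < 0x80)
		{
			buf[0] = static_cast<unsigned char>(cp);
			n = 1;
		}
		else if (cp < 0x800)
		{
			buf[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
			buf[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
			n = 2;
		}
		else if (cp < 0x10000)
		{
			buf[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
			buf[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
			buf[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
			n = 3;
		}
		else
		{
			buf[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
			buf[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
			buf[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
			buf[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
			n = 4;
		}

		if (written + n >= destSize)
			return 0; // no room left for this character plus the terminator
		for (std::size_t k = 0; k < n; ++k)
			dest[written++] = static_cast<char>(buf[k]);
	}
	if (written >= destSize)
		return 0;
	dest[written++] = '\0';
	return written;
}
```

A call then reads like an assignment, with the destination on the left: `wchar_to_utf8(buffer, wideText, sizeof(buffer));`.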
This PR adds some functions for working with UTF-8 encoded strings. It originated from #1119 where a requirement to deal with UTF8-encoded multi-byte characters arose.
Previously, the GameSpy-specific MultiByteToWideCharSingleLine and WideCharStringToMultiByte functions were used for converting between Unicode and UTF8. This change moves the functionality to WWLib and updates the previously mentioned functions to utilize the WWLib implementation.
The following sets of functions are added
char* and wchar_t*
The conversion between UTF8-encoded char* and wchar_t* is also exposed through AsciiString and UnicodeString.