refactor(string): Add functions for handling UTF8 encoded strings #2045


Open

slurmlord wants to merge 4 commits into TheSuperHackers:main from slurmlord:slurmlord/utf8-functions

slurmlord commented Jan 1, 2026

This PR adds functions for working with UTF-8 encoded strings. It originated from #1119, where a requirement to handle UTF-8 encoded multi-byte characters arose.

Previously, the GameSpy-specific MultiByteToWideCharSingleLine and WideCharStringToMultiByte functions were used for converting between Unicode and UTF-8. This change moves that functionality to WWLib and updates those two functions to use the WWLib implementation.

The following sets of functions are added:

Validation of UTF-8 encoded strings
Conversion between UTF-8 encoded char* and wchar_t*
Size requirement calculations for conversions
Size parsing of multi-byte characters from the lead byte
Seeking to the end of the previous character in case of an incomplete ending sequence (truncation)

The conversions between UTF-8 encoded char* and wchar_t* are also exposed through AsciiString and UnicodeString.
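To illustrate the lead-byte size parsing listed above, here is a minimal sketch. The name `sketch_utf8_num_bytes` is hypothetical and the body is not taken from the PR; it only assumes the standard UTF-8 lead-byte ranges and returns 0 for any byte that cannot start a sequence.

```cpp
#include <cstddef>

// Hypothetical helper, not the PR's code: number of bytes in a UTF-8
// sequence as implied by its lead byte, or 0 if the byte cannot be a lead.
inline std::size_t sketch_utf8_num_bytes(char lead)
{
    const unsigned char c = static_cast<unsigned char>(lead);
    if (c < 0x80) return 1;                // 0xxxxxxx: single-byte ASCII
    if (c >= 0xC2 && c <= 0xDF) return 2;  // 110xxxxx (0xC0 and 0xC1 would be overlong)
    if (c >= 0xE0 && c <= 0xEF) return 3;  // 1110xxxx
    if (c >= 0xF0 && c <= 0xF4) return 4;  // 11110xxx, limited to U+10FFFF
    return 0;                              // continuation byte or invalid lead
}
```

Casting to `unsigned char` first avoids surprises on platforms where plain `char` is signed.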

slurmlord added 4 commits

January 1, 2026 16:40


          Add base UTF8 functions to WWLib

0d95c02


          Add UTF8 functions to AsciiString and UnicodeString

4c3a537


          Use new UTF8 functions in ParseAsciiStringToGameInfo

e7eb7d2


          Happy new year!

d83e5ee

slurmlord commented


Core/Libraries/Source/WWVegas/WWLib/utf8.cpp

    
                return charsWritten * sizeof(wchar_t);

              }

              #else

              #error "UTF-8 conversion functions not implemented for this platform"

Author

slurmlord Jan 1, 2026

I think ICU can replace the existing Windows-specific WideCharToMultiByte and MultiByteToWideChar functions, but it's probably easier to hold off on integrating until VC6 is dropped.

xezon reviewed


Core/GameEngine/Include/Common/AsciiString.h

    
              	/**

              		convert the given UnicodeString to UTF-8 and store it in self.

              	*/

              	void convertToUtf8(const UnicodeString& unicodeStr);

xezon Jan 3, 2026

Perhaps this should merge with the established translate functions? Or would this cause trouble for some callers?

See:

void AsciiString::translate(const UnicodeString& stringSrc)
{
	validate();
	/// @todo srj put in a real translation here; this will only work for 7-bit ascii
	clear();
	Int len = stringSrc.getLength();
	for (Int i = 0; i < len; i++)
		concat((char)stringSrc.getCharAt(i));
	validate();
}

void UnicodeString::translate(const AsciiString& stringSrc)
{
	validate();
	/// @todo srj put in a real translation here; this will only work for 7-bit ascii
	clear();
	Int len = stringSrc.getLength();
	for (Int i = 0; i < len; i++)
		concat((WideChar)stringSrc.getCharAt(i));
	validate();
}

Core/Libraries/Source/WWVegas/WWLib/utf8.h

    
              // Single-character functions

              // Returns the number of bytes in a UTF-8 character based on the lead byte. Returns 0 if invalid lead byte.

xezon Jan 3, 2026

Something reads wrong in this sentence.

Core/Libraries/Source/WWVegas/WWLib/utf8.h

    
              size_t utf8_num_bytes(char lead);

              // Returns number of bytes to truncate if str contains incomplete UTF-8 character at the end.

              // 0 if no truncation needed. Assumes correct UTF-8 up to length - 1.

              int utf8_truncate_if_incomplete(const char* str, size_t length);

xezon Jan 3, 2026

Instead of using the term "truncate", it would be better to just return the number of invalid bytes at the end. This function does not need to care about the caller's use case; it should just report the facts.
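One way to read this suggestion is sketched below. The name `sketch_utf8_incomplete_tail` is hypothetical and this is not the PR's implementation. Like the original comment, it assumes everything before the final sequence is valid UTF-8, and it uses loose lead-byte masks rather than full validation.

```cpp
#include <cstddef>

// Hypothetical sketch: returns how many bytes at the end of str[0..length)
// belong to an incomplete UTF-8 sequence, or 0 if the string ends on a
// complete character. Assumes the bytes before the final sequence are valid.
inline std::size_t sketch_utf8_incomplete_tail(const char* str, std::size_t length)
{
    std::size_t back = 0;
    // A sequence is at most 4 bytes long, so look at no more than 4 bytes.
    while (back < length && back < 4) {
        ++back;
        const unsigned char c = static_cast<unsigned char>(str[length - back]);
        if ((c & 0xC0) == 0x80)
            continue; // continuation byte, keep walking back to the lead

        // c is ASCII or a lead byte: work out how long its sequence should be.
        const std::size_t expected =
            c < 0x80            ? 1 :
            (c & 0xE0) == 0xC0  ? 2 :
            (c & 0xF0) == 0xE0  ? 3 :
            (c & 0xF8) == 0xF0  ? 4 : 0;
        if (expected == 0)
            return back;                   // invalid lead byte
        return back < expected ? back : 0; // incomplete if too few bytes follow
    }
    return back; // only continuation bytes were found (or length is 0)
}
```

A caller that wants to truncate can subtract the result from `length`; the function itself only reports the count.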

Core/Libraries/Source/WWVegas/WWLib/utf8.h

    
              // Validation functions

              // Validates whether the given string is valid UTF-8.

              bool utf8_validate_string(const char* str);

xezon Jan 3, 2026

Perhaps this function, and the function below, can be combined with utf8_truncate_if_incomplete. Return 0 on success; otherwise return the position of the first invalid character, counted from the end of the string (excluding the null terminator).
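A combined check along these lines could look roughly like the sketch below. The name `sketch_utf8_invalid_tail` is hypothetical, and the interpretation of "position from the end" as "number of bytes from the first invalid sequence to the end of the string" is an assumption. As a simplification, it checks only the lead/continuation structure and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical combined validation: returns 0 if str is structurally valid
// UTF-8, otherwise the number of bytes from the first invalid sequence to the
// end of the string (null terminator excluded).
inline std::size_t sketch_utf8_invalid_tail(const char* str)
{
    const std::size_t length = std::strlen(str);
    std::size_t i = 0;
    while (i < length) {
        const unsigned char c = static_cast<unsigned char>(str[i]);
        std::size_t n;
        if (c < 0x80)                n = 1;
        else if ((c & 0xE0) == 0xC0) n = 2;
        else if ((c & 0xF0) == 0xE0) n = 3;
        else if ((c & 0xF8) == 0xF0) n = 4;
        else return length - i;      // stray continuation or invalid lead byte

        if (n > length - i)
            return length - i;       // sequence is cut off by the end of the string

        for (std::size_t k = 1; k < n; ++k) {
            const unsigned char t = static_cast<unsigned char>(str[i + k]);
            if ((t & 0xC0) != 0x80)
                return length - i;   // expected a continuation byte
        }
        i += n;
    }
    return 0; // whole string is structurally valid
}
```

With this shape, a result of 0 doubles as the "valid" answer, and a non-zero result tells the caller how many bytes to drop or replace.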

Core/Libraries/Source/WWVegas/WWLib/utf8.h

    
              // Gets the size in bytes required to hold the widechar representation of the given UTF-8 string, including null terminator.

              size_t get_size_as_widechar(const char* s);

              // Converts a widechar string to UTF-8. Assumes tgt has enough space (use get_size_as_utf8 to determine size).

              size_t convert_widechar_to_utf8(const wchar_t* orig, char* tgt, size_t tgtsize);

xezon Jan 3, 2026

Can shorten function name to wchar_to_utf8

Core/Libraries/Source/WWVegas/WWLib/utf8.h

    
              // Single-character functions

              // Returns the number of bytes in a UTF-8 character based on the lead byte. Returns 0 if invalid lead byte.

              size_t utf8_num_bytes(char lead);

xezon Jan 3, 2026

WWVegas code uses Snake_Case for function names. Otherwise use camelCase like GameEngine does.

Core/Libraries/Source/WWVegas/WWLib/utf8.h

    
              // Gets the size in bytes required to hold the widechar representation of the given UTF-8 string, including null terminator.

              size_t get_size_as_widechar(const char* s);

              // Converts a widechar string to UTF-8. Assumes tgt has enough space (use get_size_as_utf8 to determine size).

              size_t convert_widechar_to_utf8(const wchar_t* orig, char* tgt, size_t tgtsize);

xezon Jan 3, 2026

The common naming convention for arguments like this is src, dest.

Also, dest often comes first, then src. See void* memcpy( void* dest, const void* src, std::size_t count )

The reason is that it more closely matches the "LeftSide = RightSide" convention.
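For illustration, a dest-first signature using the shorter name suggested earlier might look like the sketch below. Everything here is hypothetical: the name `sketch_wchar_to_utf8`, the return convention (bytes written including the null terminator, 0 on failure), and the portable encoder body, which is not the PR's Windows-based implementation. As a simplification, unpaired surrogates and values above U+10FFFF are not rejected.

```cpp
#include <cstddef>

// Hypothetical dest-first conversion. Returns the number of bytes written to
// dest including the null terminator, or 0 if dest is too small.
inline std::size_t sketch_wchar_to_utf8(char* dest, const wchar_t* src, std::size_t destSize)
{
    if (destSize == 0)
        return 0;

    std::size_t out = 0;
    for (; *src != L'\0'; ++src) {
        unsigned long cp = static_cast<unsigned long>(*src);

        // Combine a UTF-16 surrogate pair (relevant when wchar_t is 16 bits).
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            const unsigned long lo = static_cast<unsigned long>(src[1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++src;
            }
        }

        const std::size_t n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (out + n + 1 > destSize)
            return 0; // not enough room for this character plus the terminator

        if (n == 1) {
            dest[out++] = static_cast<char>(cp);
        } else if (n == 2) {
            dest[out++] = static_cast<char>(0xC0 | (cp >> 6));
            dest[out++] = static_cast<char>(0x80 | (cp & 0x3F));
        } else if (n == 3) {
            dest[out++] = static_cast<char>(0xE0 | (cp >> 12));
            dest[out++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            dest[out++] = static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            dest[out++] = static_cast<char>(0xF0 | (cp >> 18));
            dest[out++] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            dest[out++] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            dest[out++] = static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    dest[out++] = '\0';
    return out;
}
```

At the call site the argument order then reads like an assignment: `sketch_wchar_to_utf8(buffer, wideText, sizeof(buffer))`.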

