@slurmlord

This PR adds some functions for working with UTF-8 encoded strings. It originated from #1119, where a requirement to handle UTF-8 encoded multi-byte characters arose.

Previously, the GameSpy-specific MultiByteToWideCharSingleLine and WideCharStringToMultiByte functions were used for converting between Unicode and UTF-8. This change moves that functionality to WWLib and updates those two functions to use the WWLib implementation.

The following sets of functions are added:

  • Validation of UTF-8 encoded strings
  • Conversion between UTF-8 encoded char* and wchar_t*
  • Size requirement calculations for conversions
  • Size parsing of multi-byte characters from lead byte
  • Seeking to end of previous character in case of an incomplete ending sequence (truncation)
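
For illustration, the lead-byte size parsing in the list above could look roughly like the following. This is a minimal sketch and not the PR's code: the name sketch_utf8_num_bytes is hypothetical, and rejecting 0xC0, 0xC1 and anything above 0xF4 as lead bytes is an assumption based on the UTF-8 definition rather than on the diff.

```cpp
#include <cstddef>

// Hypothetical sketch, not the PR's implementation: returns the sequence
// length implied by a UTF-8 lead byte, or 0 if the byte cannot start a sequence.
std::size_t sketch_utf8_num_bytes(char lead)
{
    const unsigned char b = static_cast<unsigned char>(lead);
    if (b < 0x80) return 1;                // 0xxxxxxx: ASCII
    if (b >= 0xC2 && b <= 0xDF) return 2;  // 110xxxxx (0xC0 and 0xC1 would be overlong)
    if (b >= 0xE0 && b <= 0xEF) return 3;  // 1110xxxx
    if (b >= 0xF0 && b <= 0xF4) return 4;  // 11110xxx, capped at U+10FFFF
    return 0;                              // continuation byte (10xxxxxx) or invalid
}
```

The cast to unsigned char matters because plain char may be signed, in which case every byte of a multi-byte sequence would compare as negative.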

The conversions between UTF-8 encoded char* and wchar_t* are also exposed through AsciiString and UnicodeString.

return charsWritten * sizeof(wchar_t);
}
#else
#error "UTF-8 conversion functions not implemented for this platform"
Author

I think ICU can replace the existing Windows-specific WideCharToMultiByte and MultiByteToWideChar functions, but it's probably easier to hold off on integrating until VC6 is dropped.

/**
convert the given UnicodeString to UTF-8 and store it in self.
*/
void convertToUtf8(const UnicodeString& unicodeStr);

Perhaps this should be merged with the established translate functions? Or would that cause trouble for some callers?

See:

void AsciiString::translate(const UnicodeString& stringSrc)
{
	validate();
	/// @todo srj put in a real translation here; this will only work for 7-bit ascii
	clear();
	Int len = stringSrc.getLength();
	for (Int i = 0; i < len; i++)
		concat((char)stringSrc.getCharAt(i));
	validate();
}

void UnicodeString::translate(const AsciiString& stringSrc)
{
	validate();
	/// @todo srj put in a real translation here; this will only work for 7-bit ascii
	clear();
	Int len = stringSrc.getLength();
	for (Int i = 0; i < len; i++)
		concat((WideChar)stringSrc.getCharAt(i));
	validate();
}
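
To make the 7-bit limitation concrete: the (char) cast in translate keeps only the low byte of each character, so U+00E9 becomes the single byte 0xE9, which is not valid UTF-8 on its own. A real translation has to emit one to four bytes per code point. The following is a hedged sketch of that encoding step only; sketch_append_utf8 is a hypothetical helper, it uses std::string and char32_t (C++11) instead of the engine's string classes, and it does not combine UTF-16 surrogate pairs.

```cpp
#include <string>

// Hypothetical helper, not part of the PR: appends the UTF-8 encoding of one
// Unicode code point to out. Returns false for surrogates and for values
// above U+10FFFF, leaving out unchanged in that case.
bool sketch_append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 1 byte
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 2 bytes
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;        // lone surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));           // 3 bytes
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 4 bytes
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;                                          // out of Unicode range
    }
    return true;
}
```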


// Single-character functions

// Returns the number of bytes in a UTF-8 character based on the lead byte. Returns 0 if invalid lead byte.

Something reads wrong in this sentence.

size_t utf8_num_bytes(char lead);
// Returns number of bytes to truncate if str contains incomplete UTF-8 character at the end.
// 0 if no truncation needed. Assumes correct UTF-8 up to length - 1.
int utf8_truncate_if_incomplete(const char* str, size_t length);

Instead of using the term "truncate", it would be better to just return the number of invalid bytes at the end. This function does not need to care about the caller's use case; it should just report the facts.
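
A rough sketch of what such a "just report the facts" function could look like. The name sketch_utf8_incomplete_tail and the size_t return type are assumptions for illustration, not the PR's API. Like the original comment, it assumes everything before the final sequence is valid UTF-8, so it only inspects the last few bytes.

```cpp
#include <cstddef>

// Hypothetical sketch: returns how many bytes at the end of str[0..length)
// do not form a complete UTF-8 character. Returns 0 if the string ends on a
// complete character or is empty. Only the last (up to 4) bytes are examined.
std::size_t sketch_utf8_incomplete_tail(const char* str, std::size_t length)
{
    std::size_t seen = 0; // bytes examined so far, counted from the end
    while (seen < length && seen < 4) {
        const unsigned char b = static_cast<unsigned char>(str[length - 1 - seen]);
        ++seen;
        if ((b & 0xC0) == 0x80)
            continue; // continuation byte, keep looking for the lead byte

        std::size_t expected = 0; // sequence length announced by the lead byte
        if (b < 0x80) expected = 1;
        else if ((b & 0xE0) == 0xC0) expected = 2;
        else if ((b & 0xF0) == 0xE0) expected = 3;
        else if ((b & 0xF8) == 0xF0) expected = 4;

        if (expected == 0) return seen;   // invalid lead byte
        if (seen < expected) return seen; // sequence was cut short
        return seen - expected;           // 0 if complete, else stray continuation bytes
    }
    return seen; // empty input, or nothing but continuation bytes
}
```

A caller that wants to truncate can subtract the result from length; a caller that only wants to know whether the tail is intact can compare it to 0.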

// Validation functions

// Validates whether the given string is valid UTF-8.
bool utf8_validate_string(const char* str);

Perhaps this function and the function below can be combined with utf8_truncate_if_incomplete: return 0 on success, otherwise return the position of the first invalid character, counted from the end of the string (excluding the null terminator).
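
One possible reading of that suggestion, as a sketch only. The name sketch_utf8_invalid_tail is hypothetical, and the check is deliberately simplified: it verifies lead bytes and continuation bytes but does not reject overlong three- and four-byte forms, encoded surrogates, or values above U+10FFFF that start with 0xF4.

```cpp
#include <cstddef>

// Hypothetical sketch of a combined validate/truncate query: returns 0 if
// str[0..length) is well-formed, otherwise the number of bytes from the first
// invalid sequence to the end of the string.
// Simplified: overlong E0/F0 forms and encoded surrogates are not detected.
std::size_t sketch_utf8_invalid_tail(const char* str, std::size_t length)
{
    std::size_t i = 0;
    while (i < length) {
        const unsigned char b = static_cast<unsigned char>(str[i]);
        std::size_t n = 0; // sequence length announced by the lead byte
        if (b < 0x80) n = 1;
        else if (b >= 0xC2 && b <= 0xDF) n = 2;
        else if (b >= 0xE0 && b <= 0xEF) n = 3;
        else if (b >= 0xF0 && b <= 0xF4) n = 4;

        if (n == 0 || n > length - i)
            return length - i; // invalid lead byte, or sequence cut short

        for (std::size_t k = 1; k < n; ++k) {
            if ((static_cast<unsigned char>(str[i + k]) & 0xC0) != 0x80)
                return length - i; // expected a continuation byte
        }
        i += n;
    }
    return 0; // every sequence was well-formed
}
```

With this shape, a string that is merely cut off mid-character and a string with garbage in the middle are reported the same way, which may or may not be what callers want.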

// Gets the size in bytes required to hold the widechar representation of the given UTF-8 string, including null terminator.
size_t get_size_as_widechar(const char* s);
// Converts a widechar string to UTF-8. Assumes tgt has enough space (use get_size_as_utf8 to determine size).
size_t convert_widechar_to_utf8(const wchar_t* orig, char* tgt, size_t tgtsize);

The function name can be shortened to wchar_to_utf8.

// Single-character functions

// Returns the number of bytes in a UTF-8 character based on the lead byte. Returns 0 if invalid lead byte.
size_t utf8_num_bytes(char lead);

WWVegas code uses Snake_Case for function names. Otherwise use camelCase like GameEngine does.

// Gets the size in bytes required to hold the widechar representation of the given UTF-8 string, including null terminator.
size_t get_size_as_widechar(const char* s);
// Converts a widechar string to UTF-8. Assumes tgt has enough space (use get_size_as_utf8 to determine size).
size_t convert_widechar_to_utf8(const wchar_t* orig, char* tgt, size_t tgtsize);

The common naming convention for arguments like these is src and dest.

Also, dest often comes first, then src. See void* memcpy( void* dest, const void* src, std::size_t count )

The reason is that this more closely matches the "LeftSide = RightSide" convention.
