Fix: Use encoding detection for file reading to handle non-UTF-8 files #8551
+57
−65
What?
Replaces
String::from_utf8_lossy()withbytes_to_string_smart()in theread_filehandler to properly decode files encoded in Windows code pages and other non-UTF-8 encodings.Why?
Fixes #7582. On Windows, files are often saved in legacy code page encodings (e.g., Windows-1255 for Hebrew, Windows-1251 for Cyrillic). The
read_filehandler was usingString::from_utf8_lossy()which assumes UTF-8 and replaces invalid byte sequences with replacement characters (), causing non-English text to appear as gibberish. For example, Hebrew text "שלום" would be rendered as "x^x•x(xŸ xHx•x©x™x¡x•xª...".The codebase already has a robust encoding detection solution (
bytes_to_string_smart()intext_encoding.rs) that useschardetngto automatically detect and decode various encodings. This fix applies that same logic to file reading, ensuring consistent behavior across the codebase.How?
bytes_to_string_smart()fromcrate::text_encodinginread_file.rsString::from_utf8_lossy()calls in two locations:format_line()function (for displaying file lines)collect_file_lines()function (for the raw line content)reads_windows_1255_hebrew_text()to verify Hebrew text encoded in Windows-1255 is correctly decodedreads_non_utf8_lines()test to be more flexible with invalid byte sequencesThe
bytes_to_string_smart()function:All existing tests pass, and the new test confirms the fix works correctly.
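For reviewers who want to see the failure mode in isolation, here is a minimal, std-only sketch. It is not the code in this PR: `decode_windows_1255_hebrew`, `decode_lossy_utf8`, and `decode_with_fallback` are hypothetical names, and the hand-written table covers only ASCII and the Hebrew letter block of Windows-1255 (bytes `0xE0..=0xFA`). The real `bytes_to_string_smart()` relies on `chardetng` for detection; the sketch simply tries strict UTF-8 first and falls back to the partial table.

```rust
/// Hypothetical helper, not part of this PR. Decodes only ASCII and the
/// Hebrew letter block of Windows-1255 (0xE0..=0xFA -> U+05D0..=U+05EA).
/// Every other byte is replaced with U+FFFD.
fn decode_windows_1255_hebrew(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| match b {
            0x00..=0x7F => b as char,
            0xE0..=0xFA => {
                char::from_u32(0x05D0 + u32::from(b - 0xE0)).unwrap_or('\u{FFFD}')
            }
            _ => '\u{FFFD}',
        })
        .collect()
}

/// The old behavior: assume UTF-8 and substitute U+FFFD for invalid sequences.
fn decode_lossy_utf8(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// Assumed stand-in for "detect, then decode": valid UTF-8 is returned
/// unchanged, anything else goes through the partial legacy table above.
fn decode_with_fallback(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(text) => text.to_owned(),
        Err(_) => decode_windows_1255_hebrew(bytes),
    }
}
```

Feeding the Windows-1255 bytes for "שלום" (`[0xF9, 0xEC, 0xE5, 0xED]`) to `decode_lossy_utf8` yields replacement characters, while `decode_with_fallback` recovers the original text. Input that is already valid UTF-8 passes through `decode_with_fallback` untouched.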