@danielrafailov1

What?

Replaces String::from_utf8_lossy() with bytes_to_string_smart() in the read_file handler to properly decode files encoded in Windows code pages and other non-UTF-8 encodings.

Why?

Fixes #7582. On Windows, files are often saved in legacy code page encodings (e.g., Windows-1255 for Hebrew, Windows-1251 for Cyrillic). The read_file handler was using String::from_utf8_lossy() which assumes UTF-8 and replaces invalid byte sequences with replacement characters (), causing non-English text to appear as gibberish. For example, Hebrew text "שלום" would be rendered as "x^x•x(xŸ xHx•x©x™x¡x•xª...".
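To illustrate the failure mode, here is a minimal standalone sketch (not code from this PR) that feeds the Windows-1255 bytes for "שלום" through String::from_utf8_lossy(). The constant and function names are made up for the example; the byte values follow the Windows-1255 layout, where the Hebrew letters alef through tav occupy 0xE0 through 0xFA.

```rust
/// The Hebrew word "שלום" as it would be stored in a Windows-1255 file:
/// shin = 0xF9, lamed = 0xEC, vav = 0xE5, final mem = 0xED.
/// (Illustrative constant, not part of the codebase.)
const SHALOM_WINDOWS_1255: [u8; 4] = [0xF9, 0xEC, 0xE5, 0xED];

/// Decodes bytes by assuming UTF-8 and substituting U+FFFD for anything
/// invalid. Hypothetical helper that wraps the standard library call.
fn decode_assuming_utf8(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

None of the four bytes forms a valid UTF-8 sequence, so every letter is replaced and the original text cannot be recovered from the result. The same word stored as UTF-8 passes through unchanged.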

The codebase already has a robust encoding detection solution (bytes_to_string_smart() in text_encoding.rs) that uses chardetng to automatically detect and decode various encodings. This fix applies that same logic to file reading, ensuring consistent behavior across the codebase.

How?

  • Import and use bytes_to_string_smart() from crate::text_encoding in read_file.rs
  • Replace String::from_utf8_lossy() calls in two locations:
    • format_line() function (for displaying file lines)
    • collect_file_lines() function (for the raw line content)
  • Add test reads_windows_1255_hebrew_text() to verify Hebrew text encoded in Windows-1255 is correctly decoded
  • Update existing reads_non_utf8_lines() test to be more flexible with invalid byte sequences

The bytes_to_string_smart() function:

  • Fast-paths UTF-8 files (no encoding detection overhead)
  • Uses encoding detection for non-UTF-8 files
  • Supports Windows code pages (1255, 1251, 1252, etc.), ISO-8859 variants, and many other encodings
  • Gracefully falls back to lossy UTF-8 for truly invalid sequences
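The control flow described above can be sketched roughly as follows. This is not the implementation in text_encoding.rs: the real function relies on chardetng for detection, whereas this standard-library-only sketch swaps in a hypothetical hard-coded decoder that assumes Windows-1255 and handles only ASCII and the Hebrew letters. Both function names are invented for the example.

```rust
/// Hypothetical stand-in for encoding detection plus decoding. It assumes
/// the input is Windows-1255 and covers only ASCII and the Hebrew letters
/// alef..tav (0xE0..=0xFA map to U+05D0..=U+05EA). Any other byte becomes
/// U+FFFD, which mirrors the lossy fallback for undecodable input.
fn decode_windows_1255_subset(bytes: &[u8]) -> String {
    bytes
        .iter()
        .map(|&b| match b {
            0x00..=0x7F => b as char,
            0xE0..=0xFA => {
                char::from_u32(0x05D0 + u32::from(b - 0xE0)).unwrap_or('\u{FFFD}')
            }
            _ => '\u{FFFD}',
        })
        .collect()
}

/// Sketch of the "fast path, then fall back" shape: valid UTF-8 is returned
/// as-is without any detection work; everything else goes to the decoder.
fn decode_smart_sketch(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(text) => text.to_owned(),
        Err(_) => decode_windows_1255_subset(bytes),
    }
}
```

With this shape, UTF-8 input such as "héllo" is returned untouched, while the Windows-1255 bytes [0xF9, 0xEC, 0xE5, 0xED] decode to "שלום". A real detector has to guess the encoding from the bytes, so short inputs may be classified differently than this fixed mapping suggests.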

All existing tests pass, and the new test confirms the fix works correctly.

Fixes issue openai#7582 where non-English characters (e.g., Hebrew) were
converted to gibberish when reading files. The read_file handler was
using String::from_utf8_lossy() which assumes UTF-8 and mangles text
encoded in Windows code pages.

- Replace from_utf8_lossy with bytes_to_string_smart in read_file.rs
- bytes_to_string_smart uses encoding detection to properly decode
  Windows code pages (Windows-1255 for Hebrew, Windows-1251 for
  Cyrillic, etc.)
- Add test to verify Hebrew text (Windows-1255) is decoded correctly
- Update existing test to be more flexible with invalid byte sequences
@etraut-openai
Collaborator

@codex review

@chatgpt-codex-connector
Contributor

Codex Review: Didn't find any major issues. Hooray!


Development

Successfully merging this pull request may close these issues.

Character encoding issues