Fix: Use encoding detection for file reading to handle non-UTF-8 files #8551

danielrafailov1 · 2025-12-26T05:21:03Z

What?

Replaces String::from_utf8_lossy() with bytes_to_string_smart() in the read_file handler to properly decode files encoded in Windows code pages and other non-UTF-8 encodings.

Why?

Fixes #7582. On Windows, files are often saved in legacy code page encodings (e.g., Windows-1255 for Hebrew, Windows-1251 for Cyrillic). The read_file handler was using String::from_utf8_lossy() which assumes UTF-8 and replaces invalid byte sequences with replacement characters (), causing non-English text to appear as gibberish. For example, Hebrew text "שלום" would be rendered as "x^x•x(xŸ xHx•x©x™x¡x•xª...".
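To make the failure mode concrete, here is a minimal std-only sketch of what a UTF-8-only decode does to Windows-1255 bytes. The names decode_lossy and SHALOM_WINDOWS_1255 are hypothetical and exist only for this illustration; they are not taken from read_file.rs.

```rust
/// Decode bytes the way a UTF-8-only code path does: assume UTF-8 and
/// substitute U+FFFD for every invalid byte sequence.
/// (Hypothetical helper for illustration, not the handler's real code.)
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

/// The Hebrew word shin-lamed-vav-final mem ("shalom") encoded as
/// Windows-1255, which stores each Hebrew letter in a single byte.
const SHALOM_WINDOWS_1255: [u8; 4] = [0xF9, 0xEC, 0xE5, 0xED];
```

None of these four bytes forms a valid UTF-8 sequence, so decode_lossy(&SHALOM_WINDOWS_1255) contains only replacement characters and the original letters cannot be recovered from the resulting string.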

The codebase already has a robust encoding detection solution (bytes_to_string_smart() in text_encoding.rs) that uses chardetng to automatically detect and decode various encodings. This fix applies that same logic to file reading, ensuring consistent behavior across the codebase.

How?

- Import and use bytes_to_string_smart() from crate::text_encoding in read_file.rs
- Replace String::from_utf8_lossy() calls in two locations:
  - format_line() function (for displaying file lines)
  - collect_file_lines() function (for the raw line content)
- Add test reads_windows_1255_hebrew_text() to verify Hebrew text encoded in Windows-1255 is correctly decoded
- Update existing reads_non_utf8_lines() test to be more flexible with invalid byte sequences

The bytes_to_string_smart() function:

- Fast-paths UTF-8 files (no encoding detection overhead)
- Uses encoding detection for non-UTF-8 files
- Supports Windows code pages (1255, 1251, 1252, etc.), ISO-8859 variants, and many other encodings
- Gracefully falls back to lossy UTF-8 for truly invalid sequences
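The control flow described above can be sketched with the standard library alone. This is not the implementation in text_encoding.rs: the function names, the Cow return type, and the hand-rolled Windows-1255 subset decoder are all assumptions made for illustration, with the subset decoder standing in for the real chardetng-based detection step.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for the detection step: decodes only ASCII plus
/// the Hebrew letter range of Windows-1255 (0xE0..=0xFA maps to
/// U+05D0..=U+05EA). Returns None if any other byte is present.
fn decode_windows_1255_subset(bytes: &[u8]) -> Option<String> {
    bytes
        .iter()
        .map(|&b| match b {
            0x00..=0x7F => Some(b as char),
            0xE0..=0xFA => char::from_u32(0x05D0 + u32::from(b - 0xE0)),
            _ => None,
        })
        .collect()
}

/// Sketch of the three-stage strategy: UTF-8 fast path, then a legacy
/// decoder, then lossy UTF-8 as the last resort.
fn decode_smart_sketch(bytes: &[u8]) -> Cow<'_, str> {
    // Fast path: valid UTF-8 is borrowed as-is, with no detection work.
    if let Ok(text) = std::str::from_utf8(bytes) {
        return Cow::Borrowed(text);
    }
    // Non-UTF-8 input: try the (stand-in) legacy decoder.
    if let Some(text) = decode_windows_1255_subset(bytes) {
        return Cow::Owned(text);
    }
    // Nothing matched: fall back to lossy UTF-8 with U+FFFD substitution.
    String::from_utf8_lossy(bytes)
}
```

With this sketch, the bytes [0xF9, 0xEC, 0xE5, 0xED] decode to "שלום", valid UTF-8 input is returned without copying, and input that matches neither path still yields a string rather than an error.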

All existing tests pass, and the new test confirms the fix works correctly.

Fixes issue openai#7582 where non-English characters (e.g., Hebrew) were converted to gibberish when reading files. The read_file handler was using String::from_utf8_lossy() which assumes UTF-8 and mangles text encoded in Windows code pages.

- Replace from_utf8_lossy with bytes_to_string_smart in read_file.rs
- bytes_to_string_smart uses encoding detection to properly decode Windows code pages (Windows-1255 for Hebrew, Windows-1251 for Cyrillic, etc.)
- Add test to verify Hebrew text (Windows-1255) is decoded correctly
- Update existing test to be more flexible with invalid byte sequences

etraut-openai · 2025-12-26T08:01:38Z

@codex review

chatgpt-codex-connector · 2025-12-26T08:03:42Z

Codex Review: Didn't find any major issues. Hooray!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

testing fix

9e31e6d
