fix: force UTF-8 encoding for Python bridge on Windows #2
Open
quotentiroler wants to merge 1 commit into blueraai:main
On Windows, Python defaults to the system ANSI code page (e.g. cp1252) for stdio streams. When Node.js sends UTF-8 JSON containing multi-byte characters (such as the smart quotes U+201C/U+201D, encoded as \xe2\x80\x9c and \xe2\x80\x9d) over stdin pipes, Python decodes the bytes as cp1252. The \x9d byte has no cp1252 mapping, so it comes through as the surrogate \udc9d. This later crashes json.dumps with 'surrogates not allowed' at the finalization step of AST parsing.
Fix applied in two layers (belt and suspenders):
1. python/ast_worker.py: Wrap sys.stdin/sys.stdout with an explicit UTF-8 TextIOWrapper on Windows, ensuring correct decoding regardless of system locale settings.
2. src/crawl/bridge.ts: Pass the PYTHONUTF8=1 and PYTHONIOENCODING=utf-8 environment variables when spawning the Python child process, which forces Python to use UTF-8 for all IO operations.
Fixes indexing failures on Windows where source files contain Unicode characters like smart quotes, em dashes, or other non-ASCII text.
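The failure mode described above can be reproduced in isolation on any platform. The following is a minimal sketch, not code from this PR: `decode_as` and `try_encode` are hypothetical helpers, and it assumes the stdin text layer uses the surrogateescape error handler and that the error surfaces when the serialized JSON is encoded back to UTF-8.

```python
import json


def decode_as(raw: bytes, encoding: str) -> str:
    """Decode pipe bytes as a text-mode stdin with this encoding would.

    Assumption: the stream uses the surrogateescape error handler.
    """
    return raw.decode(encoding, errors="surrogateescape")


def try_encode(obj):
    """Serialize obj and encode it as UTF-8; return the error reason, or None."""
    try:
        json.dumps(obj, ensure_ascii=False).encode("utf-8")
        return None
    except UnicodeEncodeError as exc:
        return exc.reason


# UTF-8 JSON as the Node.js side would send it: smart quotes around "hi".
raw = json.dumps({"text": "\u201chi\u201d"}, ensure_ascii=False).encode("utf-8")

good = json.loads(decode_as(raw, "utf-8"))   # decoded correctly
bad = json.loads(decode_as(raw, "cp1252"))   # \x9d becomes the surrogate \udc9d

print(repr(good["text"]), try_encode(good))
print(repr(bad["text"]), try_encode(bad))
```

The byte \x9c happens to be defined in cp1252, so only the closing quote's \x9d turns into a surrogate; the rest of the text is silently mis-decoded instead.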
Problem
On Windows, indexing repositories containing source files with Unicode characters (smart quotes “”, em dashes, etc.) fails at 99% with a "surrogates not allowed" error.
Root Cause
PythonBridge spawns ast_worker.py and writes UTF-8 JSON to its stdin pipe. However, Python on Windows defaults to the system ANSI code page (e.g. cp1252) for stdio streams, not UTF-8.
Multi-byte UTF-8 sequences like \xe2\x80\x9d (right double quote, U+201D) get misread: the \x9d byte is undefined in cp1252 and produces the surrogate character \udc9d via Python's surrogateescape error handler. When json.dumps() later tries to encode this surrogate back to UTF-8, it crashes with "surrogates not allowed".
Fix
Applied a belt-and-suspenders fix in two layers:
1. python/ast_worker.py: On Windows, wraps sys.stdin/sys.stdout with an explicit io.TextIOWrapper(encoding='utf-8') to ensure correct decoding regardless of system locale settings.
2. src/crawl/bridge.ts: Passes the PYTHONUTF8=1 and PYTHONIOENCODING=utf-8 environment variables when spawning the Python child process, which forces Python to use UTF-8 for all IO operations.
Testing
Tested on Windows 11 with German locale (cp1252). Before the fix, indexing the huggingface/lerobot repository (517 files) failed 4 consecutive times at 99%. After the fix, indexing completes successfully (517/517 files) and search returns results.
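For reference, the ast_worker.py layer of the fix could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in this PR: `utf8_text_stream` and `force_utf8_stdio` are hypothetical names, and the newline and buffering choices are guesses suited to a line-oriented JSON protocol.

```python
import io
import sys


def utf8_text_stream(binary_stream, *, write=False):
    """Wrap a binary stream in a UTF-8 text layer, ignoring the locale default."""
    return io.TextIOWrapper(
        binary_stream,
        encoding="utf-8",
        # Assumption: write "\n" untranslated and flush after each line.
        newline="\n" if write else None,
        line_buffering=write,
    )


def force_utf8_stdio():
    """Hypothetical helper: replace stdin/stdout with UTF-8 wrappers on Windows."""
    if sys.platform == "win32":
        sys.stdin = utf8_text_stream(sys.stdin.buffer)
        sys.stdout = utf8_text_stream(sys.stdout.buffer, write=True)


# Usage with in-memory pipes standing in for the real stdin/stdout.
reader = utf8_text_stream(io.BytesIO("\u201chi\u201d\n".encode("utf-8")))
line = reader.readline()

sink = io.BytesIO()
writer = utf8_text_stream(sink, write=True)
writer.write(line)
writer.flush()
```

A worker would call `force_utf8_stdio()` once at startup, before reading any input. On Python 3.7 and later, `sys.stdin.reconfigure(encoding="utf-8")` is another way to get the same effect.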