fix: sanitize lone Unicode surrogates in YAML serialization #39316

Open
alexmakeev wants to merge 1 commit into microsoft:main from alexmakeev:fix/sanitize-lone-surrogates-yaml

@alexmakeev

Problem

yamlEscapeValueIfNeeded() in packages/injected/src/yaml.ts escapes control characters (\x00-\x1f, \x7f-\x9f) but does not handle lone Unicode surrogates (\uD800-\uDFFF).

When a page assigns text containing lone surrogates to DOM nodes (e.g. via element.textContent = 'text\uD800more'), the aria snapshot captures them verbatim. Once the resulting YAML string is serialized to JSON, it carries an unpaired surrogate, which RFC 8259 §8.2 describes as leading to unpredictable behavior in receivers. In practice, MCP clients reject it with "no low surrogate in string" errors.
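To make the failure mode concrete, here is a minimal sketch of how an unpaired surrogate survives JSON serialization. The helper name hasLoneSurrogate is hypothetical and not part of this PR. The claim that strict downstream parsers fail on the resulting escape is an assumption based on the error message quoted above.

```typescript
// Hypothetical helper (not in the PR): returns true if the string contains
// a surrogate code unit that is not part of a valid high+low pair.
function hasLoneSurrogate(value: string): boolean {
  for (let i = 0; i < value.length; i++) {
    const code = value.charCodeAt(i);
    if (code >= 0xd800 && code <= 0xdbff) {
      // High surrogate: must be followed by a low surrogate.
      const next = value.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    if (code >= 0xdc00 && code <= 0xdfff) {
      // Low surrogate with no preceding high surrogate.
      return true;
    }
  }
  return false;
}

// Text as it might be captured from element.textContent.
const captured = 'text\uD800more';

// JSON.stringify (ES2019+) emits the lone surrogate as a \ud800 escape.
// Assumption: a strict parser on the receiving side then expects a low
// surrogate escape to follow and rejects the payload.
const wire = JSON.stringify(captured);
```

The string is perfectly legal in JavaScript, so nothing fails on the Playwright side. The problem only shows up in whichever consumer decodes the payload.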

Fix

Apply the same String.prototype.toWellFormed() pattern already used in packages/playwright-core/src/cli/driver.ts for non-JavaScript language bindings, with a regex fallback for environments without toWellFormed().

Lone surrogates are replaced with U+FFFD (Unicode replacement character). Valid surrogate pairs (emoji, etc.) are preserved.
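The approach described above could look roughly like the following sketch. The function names sanitizeLoneSurrogates and replaceLoneSurrogatesWithRegex are hypothetical, and the fallback regex is an assumption, so the actual diff in yaml.ts may differ.

```typescript
// Hypothetical fallback (assumption, not the PR's actual regex).
// The first alternative consumes valid high+low pairs so they are kept;
// any surrogate code unit left over is lone and becomes U+FFFD.
// No `u` flag, so the regex operates on UTF-16 code units.
function replaceLoneSurrogatesWithRegex(value: string): string {
  return value.replace(
    /[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDFFF]/g,
    match => (match.length === 2 ? match : '\uFFFD'),
  );
}

// Hypothetical wrapper: prefer the native String.prototype.toWellFormed()
// when the runtime provides it, otherwise use the regex fallback.
function sanitizeLoneSurrogates(value: string): string {
  const candidate = value as string & { toWellFormed?: () => string };
  if (typeof candidate.toWellFormed === 'function')
    return candidate.toWellFormed();
  return replaceLoneSurrogatesWithRegex(value);
}
```

Matching valid pairs first avoids lookbehind assertions, which some older engines do not support. Both paths replace each lone code unit with exactly one U+FFFD, so the string length in UTF-16 code units is unchanged.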

Changes

  • packages/injected/src/yaml.ts — sanitize lone surrogates in yamlEscapeValueIfNeeded() before escaping control characters
  • tests/mcp/snapshot-unicode.spec.ts — test coverage for lone high/low surrogates, valid emoji preservation, and mixed content

Test plan

  • Lone high surrogate (\uD800) → replaced with \uFFFD
  • Lone low surrogate (\uDC00) → replaced with \uFFFD
  • Valid emoji (\uD83D\uDE00 = 😀) → preserved
  • Mixed content (emoji + lone surrogates + text) → correct handling

Ref: microsoft/playwright-mcp#1410

Lone surrogates (U+D800–U+DFFF) in DOM textContent are valid in JavaScript
strings but produce invalid JSON when serialized. The yamlEscapeValueIfNeeded()
function escapes control characters but does not handle lone surrogates,
causing 'no low surrogate in string' errors in MCP clients.

Apply the same toWellFormed() pattern already used in driver.ts for
non-JavaScript language bindings, with a regex fallback for older environments.

Fixes microsoft/playwright-mcp#1410