-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Add tests for common TextDecoder implementation mistakes #56892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add tests for common TextDecoder implementation mistakes #56892
Conversation
d412234 to
3cf258f
Compare
3cf258f to
b132d29
Compare
annevk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing work. Thanks so much!
If have two questions:
- I guess the lack of semicolons was maybe copied from existing tests? It would be preferable to have them, but it's not required.
- Can we perhaps split this file by encoding type? E.g., UTF-8, UTF-16, single-byte encodings, and multi-byte encodings?
| // Bun is incorrect | ||
| test(() => { | ||
| // This is the only decoder which does not clear internal state before throwing in stream mode (non-EOF throws) | ||
| // So the internal state of this decoder can legitimately persist after an error was thrown |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems very sketchy. I wonder if we should change this somehow in the standard. No need to change the test for now though as it indeed appears correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My take is that continuing streaming after fatal has thrown an error is a very bad idea, regardless of the encoding, and should be banned.
I will elaborate more in whatwg/encoding#358
This testsuite collects cases where at least one implementation fails
This is an export of https://github.com/ExodusOSS/bytes/blob/9c8c9baa/tests/encoding/mistakes.test.js
It demonstrates bugs in all major implementations - not a single of the existing ones was correct for all encodings.
In total, 10 different implementations were tested + my implementation in js.
Important
All of the three major browser engines got UTF-8 decoding wrong.
#56799 and #56844 were tiny parts of this.
On no testcase all implementations agree on an incorrect behavior: at least one is consistent with the spec.
Perhaps the most controversial part of this is a test for whatwg/encoding#115, but Firefox, Deno, Servo agree with spec and the spec is clear on what should happen and documents that specific case. Also, WebKit and Chrome now treat concatenated input as invalid too, but perform replacement on it incorrectly.
On some tests,
For data, see https://docs.google.com/spreadsheets/d/1pdEefRG6r9fZy61WHGz0TKSt8cO4ISWqlpBN5KntIvQ/edit
Legend:
FAILmeans returning incorrect results on some inputs, i.e. not matching the spec on a simple.decode()call.STATEis more serious in some cases - it means internal inconsistency, non-streaming.decode()result depends on previous call, state leaks between calls.STREAMalso could be more serious in some cases and demonstrates an internal inconsistencyDecoding result depends on the buffers shape:
decode(a, { stream }) + decode(b) !== decode(a + b)With Dynamic Record Sizing, this could potentially be controlled with a MitM and cause different decoding results for the same bytes, all without affecting TLS (as it verifies bytes).
Demo for
utf-8mishandling on WebKit and Chrome: https://tmp-demo.rray.org/utf-8 (small static html over https)Due to the nature of cross-tests, implementations not behaving correctly per spec on single-shot
new TextDecoder(encoding).decode(arg)calls were not fully tested forSTATE/STREAM, soFAILtakes priority (even in cases where it's less significant).The rest is explained in the table and the testcases
Bug refs, found with these cross-tests:
For a clean implementation in js, see
@exodus/bytes/encoding.js