[CDRIVER-6017] Tweak visitor behavior when finding corrupt BSON#2019
[CDRIVER-6017] Tweak visitor behavior when finding corrupt BSON#2019vector-of-bool wants to merge 5 commits intomongodb:masterfrom
Conversation
We now invoke `visit_corrupt` in more cases where `bson_visit_all` cannot properly decode elements, rather than just stopping with the boolean error code.
| * | ||
| * Passing this as a flag has no effect. | ||
| */ | ||
| BSON_VALIDATE_CORRUPT = (1 << 30), |
There was a problem hiding this comment.
Can we reuse BSON_VALIDATE_UTF8 instead? It already "has no effect" as a behavior control flag and already describes the very UTF-8 validity checks which BSON_VALIDATE_CORRUPT is meant to describe (appears redundant).
s/BSON_VALIDATE_CORRUPT/BSON_VALIDATE_UTF8/g
There was a problem hiding this comment.
Maybe, but VISIT_CORRUPT is now (as of this PR) also used for when a subdocument is invalid (previously this was also just return true;, meaning that bson_validate was also missing this case).
I considered adding a .visit_invalid_utf8 callback, which is potentially useful in general, but there's ambiguity in the semantics of "corruption", because the spec technically says that BSON strings are UTF-8, so a non-utf-8 string could arguably be considered corruption.
| while (_bson_iter_next_internal (iter, 0, &key, &bson_type, &unsupported)) { | ||
| if (*key && !bson_utf8_validate (key, strlen (key), false)) { | ||
| iter->err_off = iter->off; | ||
| VISIT_CORRUPT (iter, data); |
There was a problem hiding this comment.
Can we take this opportunity to extend test suite coverage of the .visit_corrupt callback function to account for these new checks? atm there only seems to be a single test case. Perhaps the VALIDATE_TEST macro can be extended to also check .visit_corrupt callback behavior?
There was a problem hiding this comment.
Yes, I think this is a good idea.
There was a problem hiding this comment.
This change results in rejecting inserts that previously succeeded:
bson_t to_insert = BSON_INITIALIZER;
ASSERT (bson_append_utf8 (&to_insert, "v", 1, "a\xFF\b", 3));
bool ok = mongoc_collection_insert_one (coll, &to_insert, NULL, NULL, &error);
ASSERT_OR_PRINT (ok, error);
// Before PR : OK
// After PR : invalid document for insert: corrupt BSONWrite helpers validate UTF-8 by default. However, this validation does not appear documented, and appears to have always been broken.
I would rather remove the UTF-8 validation in write helpers. I expect that would be a performance improvement, and remove the (low?) risk of breaking applications inserting invalid UTF-8.
Notably: UTF-8 key validation appeared to work (and is tested in PHP mongodb/mongo-php-driver#1819 (comment)).
It seems like most drivers do not validate UTF-8 on write. See survey + slack discussion.
Proposal:
- Do not validate UTF-8 values (but validate keys) if
BSON_VALIDATE_UTF8is not set. - Remove
BSON_VALIDATE_UTF8from the default write helper flags (since UTF-8 validation never worked and the write helpers did not document expecting UTF-8 validation). - Do validate UTF-8 in
bson_validateifBSON_VALIDATE_UTF8is set (to match documented behavior ofbson_validate).
This is a minimal fix for some edge cases around BSON iteration/visitation and text validation. A larger change could be made to make the APIs more complete and easier to use, but this is enough to fix a bug related to UTF-8 validation.
We now invoke
visit_corruptin more cases wherebson_visit_allcannot properly decode elements, rather than just stopping with the boolean error code.This also adds an error code to
bson_validatefor encountering corruption. Previously, it would just seterror->code = 0and the message tocorrupt BSON, which is counter-intuitive to those that expecterror->code == 0to indicate success.See also: CDRIVER-4448