-
Notifications
You must be signed in to change notification settings - Fork 10
Initial implementation of encode() using simdjson's string_builder #109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…ua 5.1 to 5.4 and LuaJIT
…o the compiler syntax
|
Thanks for taking the time to work on the encoder! It's looking pretty good. I am planning to leave a few comments inline as well to help me understand the code a little better. Clarifications
Yeah, it's for allowing a wide array of setups as possible, even if it's a little annoying at times...
Much appreciated! Were these tests adapted from somewhere? Generated from something? Just need to know if we need to update the license section in the README.
I think this should probably be required for now. I know in Lua it's easy to not have UTF-8 strings, and if using JSON as a serialization format, whatever other thing that takes in JSON likely will want it as UTF-8, we might as well ensure that before outputting it. Changes
I like the global setting for setting the -- instead of this:
json = simdjson.encode(deepData, 10)
-- do something like this:
json = simdjson.encode(deepData, {maxDepth=10})
That way any future options can go into the table and not affect backwards compatibility. Build
@tobil4sk has done almost all of the windows build config. I might defer to them on that one if they can take a look! |
|
Also just saw the mingw test failing... Keeping it running across the board is always a fun challenge! |
| if (lua_isboolean(L, lindex)) { | ||
| if (lua_toboolean(L, lindex)) { | ||
| // Use append_raw with string_view for batched append (more efficient than multiple char appends) | ||
| #if __cplusplus >= 202002L |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was playing with this and builder.append(true) actually works in C++17! Based on https://isocpp.org/files/papers/CppDevSurvey-2025-summary.pdf (page 10). I would consider moving the project to C++17.
Would moving to C++17 also allow using builder.append for floats? That way the string formatting likely wouldn't be needed and simdjson can handle it for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw the same with newer versions of C++. The default in my IDE is 20; it was working, but the builds failed using the Makefile. The only difference was the C++ version, so I went back to re-implement it using only the C++11-compatible features.
The template is for number-compatible types in append(), which means the following:
enum class number_type {
floating_point_number=1, /// a binary64 number
signed_integer, /// a signed integer that fits in a 64-bit word using two's complement
unsigned_integer, /// a positive integer larger or equal to 1<<63
big_integer /// a big integer that does not fit in a 64-bit word
};It also accepts a single char inputs.
Much appreciated! Were these tests adapted from somewhere? Generated from something? Just need to know if we need to update the license section in the README.
I have previously implemented an encoder; these tests are implementations of the tests I created for that purpose. I used Copilot to generate the boilerplate for Busted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've spent some time looking more closely at the different compiler versions, it depends on what "json standard" we want to be compatible with. At the moment your parsing checks against cjson in tests so I tried to maintain that compatibility/standardisation.
When I use the built-in float handling on my Mac M-series (arm) with C++11 and 20, I get different outputs, specifically, the padding is different.
encode numbers correctly should encode array of numbers
./spec/encode_spec.lua:51: Expected objects to be the same.
Passed in:
(string) '{"numbers":[1,2,3,4,5,-1,-2,0,1.500000,2.700000]}'
Expected:
(string) '{"numbers":[1,2,3,4,5,-1,-2,0,1.5,2.7]}'
I will look into this further and see if there is something that I have missed/done incorrectly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the moment your parsing checks against cjson in tests so I tried to maintain that compatibility/standardisation.
I wasn't necessarily going for full compatibility with lua-cjson. It was just easy enough to implement tests originally. But yeah, I'd be curious what simdjson's builder append does with floats.
| static int setMaxEncodeDepth(lua_State *); | ||
| static int getMaxEncodeDepth(lua_State *); | ||
| static int setEncodeBufferSize(lua_State *); | ||
| static int getEncodeBufferSize(lua_State *); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The older functions here are in snake case (except when referring to the ParsedObject type), so I think it's cleanest to stick with snake case for the new ones as well.
…ed when formatting and broke the builds
…g passed into encode()
…ake the code easier to read
|
@FourierTransformer I've moved the configuration for encode() to being a lua table (currently) with two values I updated I've updated the Windows imports and reverted the It seems that I've introduced a flaky test. This is based on unchanged code randomly failing between builds. I'll investigate this and see if there's anything obvious that I need to fix. |
…y encode() call, potentially many times over
Hello,
As discussed in #102, I have wanted to implement an encode() feature in your lua-simdjson package/module/library. I've finally had the time to draft my first version for you to review.
It has taken me a few attempts to implement this in a way that demonstrates the functionality and logic end-to-end with tests. I felt that this would be a reasonable approach that could then be iterated on as required. I have not bounced the version in the files for this reason.
Maintaining compatibility with C++11 was taken seriously; I assume that's for maximum compatibility with Lua 5.1?
encode()has up to two inputs: the first is the Lua type, and the second is an optionalmax_depththat can override a global setting if required.The following additional methods have also been added
setMaxEncodeDepth()getMaxEncodeDepth()setEncodeBufferSize()getEncodeBufferSize()These are getters and setters for the global buffer and encoding depth values. I wasn't sure of a standardised/idiomatic way to do this in Lua; it seems that there's very little consistency in approaches, happy to adjust this if there are patterns that you prefer.
I wasn't able to get the Windows tests to pass without applying an architecture flag to the builds. Maybe that's something that you have configured in your build variables? I've added it to the
nmakeMakefile.win.I added a
.clang-formatfile; our local code style guidelines are not compatible, so adding the clang-format should overwrite our local settings and not break your formatting.There is an assortment of new tests for encoding, as well as some encoding-related performance comparisons and basic security tests. I do validate that the output is valid Unicode (UTF-8) by default. Should that be optional?
Thank you in advance. I look forward to your feedback/comments.