Skip to content

Conversation

@jlambatl
Copy link
Contributor

@jlambatl jlambatl commented Jan 2, 2026

Hello,

As discussed in #102, I have wanted to implement an encode() feature in your lua-simdjson package/module/library. I've finally had the time to draft my first version for you to review.

It has taken me a few attempts to implement this in a way that demonstrates the functionality and logic end-to-end with tests. I felt that this would be a reasonable approach that could then be iterated on as required. I have not bounced the version in the files for this reason.

Maintaining compatibility with C++11 was taken seriously; I assume that's for maximum compatibility with Lua 5.1?

encode() has up to two inputs: the first is the Lua type, and the second is an optional max_depth that can override a global setting if required.

The following additional methods have also been added
setMaxEncodeDepth()
getMaxEncodeDepth()
setEncodeBufferSize()
getEncodeBufferSize()

These are getters and setters for the global buffer and encoding depth values. I wasn't sure of a standardised/idiomatic way to do this in Lua; it seems that there's very little consistency in approaches, happy to adjust this if there are patterns that you prefer.

I wasn't able to get the Windows tests to pass without applying an architecture flag to the builds. Maybe that's something that you have configured in your build variables? I've added it to the nmake Makefile.win.

I added a .clang-format file; our local code style guidelines are not compatible, so adding the clang-format should overwrite our local settings and not break your formatting.

There is an assortment of new tests for encoding, as well as some encoding-related performance comparisons and basic security tests. I do validate that the output is valid Unicode (UTF-8) by default. Should that be optional?

Thank you in advance. I look forward to your feedback/comments.

@FourierTransformer
Copy link
Owner

Thanks for taking the time to work on the encoder! It's looking pretty good. I am planning to leave a few comments inline as well to help me understand the code a little better.

Clarifications

Maintaining compatibility with C++11 was taken seriously; I assume that's for maximum compatibility with Lua 5.1?

Yeah, it's for allowing a wide array of setups as possible, even if it's a little annoying at times...

There is an assortment of new tests for encoding, as well as some encoding-related performance comparisons and basic security tests.

Much appreciated! Were these tests adapted from somewhere? Generated from something? Just need to know if we need to update the license section in the README.

I do validate that the output is valid Unicode (UTF-8) by default. Should that be optional?

I think this should probably be required for now. I know in Lua it's easy to not have UTF-8 strings, and if using JSON as a serialization format, whatever other thing that takes in JSON likely will want it as UTF-8, we might as well ensure that before outputting it.

Changes

encode() has up to two inputs: the first is the Lua type, and the second is an optional max_depth that can override a global setting if required.

I like the global setting for setting the max_depth and buffer_size. For the second argument in encode, if you could have it take in a table instead of a single value, it ends up making it a lot more extensible with other options later.

-- instead of this:
json = simdjson.encode(deepData, 10)

-- do something like this:
json = simdjson.encode(deepData, {maxDepth=10})

That way any future options can go into the table and not affect backwards compatibility.

Build

I wasn't able to get the Windows tests to pass without applying an architecture flag to the builds. Maybe that's something that you have configured in your build variables? I've added it to the nmake Makefile.win.

@tobil4sk has done almost all of the windows build config. I might defer to them on that one if they can take a look!

@FourierTransformer
Copy link
Owner

Also just saw the mingw test failing... Keeping it running across the board is always a fun challenge!

if (lua_isboolean(L, lindex)) {
if (lua_toboolean(L, lindex)) {
// Use append_raw with string_view for batched append (more efficient than multiple char appends)
#if __cplusplus >= 202002L
Copy link
Owner

@FourierTransformer FourierTransformer Jan 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was playing with this and builder.append(true) actually works in C++17! Based on https://isocpp.org/files/papers/CppDevSurvey-2025-summary.pdf (page 10). I would consider moving the project to C++17.

Would moving to C++17 also allow using builder.append for floats? That way the string formatting likely wouldn't be needed and simdjson can handle it for us.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw the same with newer versions of C++. The default in my IDE is 20; it was working, but the builds failed using the Makefile. The only difference was the C++ version, so I went back to re-implement it using only the C++11-compatible features.

The template is for number-compatible types in append(), which means the following:

enum class number_type {
    floating_point_number=1, /// a binary64 number
    signed_integer,          /// a signed integer that fits in a 64-bit word using two's complement
    unsigned_integer,        /// a positive integer larger or equal to 1<<63
    big_integer              /// a big integer that does not fit in a 64-bit word
};

It also accepts a single char inputs.

Much appreciated! Were these tests adapted from somewhere? Generated from something? Just need to know if we need to update the license section in the README.

I have previously implemented an encoder; these tests are implementations of the tests I created for that purpose. I used Copilot to generate the boilerplate for Busted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've spent some time looking more closely at the different compiler versions, it depends on what "json standard" we want to be compatible with. At the moment your parsing checks against cjson in tests so I tried to maintain that compatibility/standardisation.

When I use the built-in float handling on my Mac M-series (arm) with C++11 and 20, I get different outputs, specifically, the padding is different.

encode numbers correctly should encode array of numbers
./spec/encode_spec.lua:51: Expected objects to be the same.
Passed in:
(string) '{"numbers":[1,2,3,4,5,-1,-2,0,1.500000,2.700000]}'
Expected:
(string) '{"numbers":[1,2,3,4,5,-1,-2,0,1.5,2.7]}'

I will look into this further and see if there is something that I have missed/done incorrectly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment your parsing checks against cjson in tests so I tried to maintain that compatibility/standardisation.

I wasn't necessarily going for full compatibility with lua-cjson. It was just easy enough to implement tests originally. But yeah, I'd be curious what simdjson's builder append does with floats.

static int setMaxEncodeDepth(lua_State *);
static int getMaxEncodeDepth(lua_State *);
static int setEncodeBufferSize(lua_State *);
static int getEncodeBufferSize(lua_State *);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The older functions here are in snake case (except when referring to the ParsedObject type), so I think it's cleanest to stick with snake case for the new ones as well.

@jlambatl
Copy link
Contributor Author

jlambatl commented Jan 4, 2026

@FourierTransformer I've moved the configuration for encode() to being a lua table (currently) with two values

{
  maxDepth = 10,
  bufferSize = 8192
}

I updated max_depth and buffer_size to be camelCase based on your example. I can revert this to snake_case if required.

I've updated the Windows imports and reverted the Makefile.win to its previous state.

It seems that I've introduced a flaky test. This is based on unchanged code randomly failing between builds. I'll investigate this and see if there's anything obvious that I need to fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants