-
-
Notifications
You must be signed in to change notification settings - Fork 29
Add thread-safe GVL protection for scalar function callbacks #1001
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughMarks Ruby-backed DuckDB scalar functions as volatile to avoid constant folding, enforces single-threaded registration via Changes
Sequence Diagram(s)sequenceDiagram
participant Client as "Ruby client"
participant ConnRB as "lib/duckdb/connection.rb"
participant ConnC as "ext/duckdb/connection.c\n(_register_scalar_function)"
participant DuckDB as "DuckDB engine\n(ext/duckdb/scalar_function.c)"
rect rgba(200,220,255,0.5)
Client->>ConnRB: call register_scalar_function(scalar_fn)
ConnRB->>DuckDB: SELECT current_setting('threads')
DuckDB-->>ConnRB: return thread_count
end
alt thread_count > 1
ConnRB-->>Client: raise DuckDB::Error ("single-threaded execution", "SET threads=1")
else thread_count <= 1
ConnRB->>ConnC: call _register_scalar_function(scalar_fn)
ConnC->>DuckDB: register scalar function (marked volatile)
DuckDB-->>ConnC: registration result
ConnC-->>ConnRB: result
ConnRB-->>Client: success
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
DuckDB scalar functions are invoked from worker threads during parallel query execution. Ruby proc callbacks cannot be safely called from non-Ruby threads because they require the GVL (Global VM Lock). After investigating duckdb-rs (Rust binding), we learned that successful implementations use static C functions with thread-safe state, not dynamic callbacks. Ruby's architecture makes this pattern impossible with procs. Solution: - Mark all scalar functions as volatile using duckdb_scalar_function_set_volatile() This prevents constant folding during query optimization - Document that users MUST execute 'PRAGMA threads=1' before using scalar functions This forces single-threaded execution, preventing worker thread callbacks Tested: - All 442 tests pass (1109 assertions, 0 failures) - Scalar functions work correctly with PRAGMA threads=1 - Without PRAGMA threads=1, parallel queries may crash (documented limitation) Limitations: - Poor performance on large datasets due to single-threaded execution - Not suitable for parallel query workloads Future work: - Investigate Ruby thread pool pattern for better performance - Consider providing C-only scalar function API for performance-critical cases Fixes the threading issues identified in initial PR #1001.
34d0259 to
debff6e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
ext/duckdb/scalar_function.c (1)
86-87: Critical:rb_funcall()called without GVL protection from DuckDB worker threads.The PR objectives state that
rb_thread_call_with_gvl()was added to safely acquire the GVL before calling Ruby code, but line 87 still uses a rawrb_funcall()without any GVL wrapper. When DuckDB worker threads (which don't hold the GVL) invoke this callback, calling Ruby code without the GVL will cause crashes or undefined behavior.The PR describes adding:
ruby_call_contextstructcall_ruby_block_with_gvl()functionsafe_call_ruby_block()helperNone of these appear in the implementation. The current approach of requiring
PRAGMA threads=1is a workaround that shifts responsibility to users rather than actually fixing the thread-safety issue.Expected implementation pattern per PR objectives
+typedef struct { + VALUE proc; + VALUE result; +} ruby_call_context; + +static void *call_ruby_block_with_gvl(void *arg) { + ruby_call_context *ctx = (ruby_call_context *)arg; + ctx->result = rb_funcall(ctx->proc, rb_intern("call"), 0); + return NULL; +} + static void scalar_function_callback(duckdb_function_info info, duckdb_data_chunk input, duckdb_vector output) { rubyDuckDBScalarFunction *ctx; VALUE result; // ... - // Call the Ruby block - result = rb_funcall(ctx->function_proc, rb_intern("call"), 0); + // Call the Ruby block with GVL protection + ruby_call_context call_ctx = { .proc = ctx->function_proc, .result = Qnil }; + rb_thread_call_with_gvl(call_ruby_block_with_gvl, &call_ctx); + result = call_ctx.result;
🤖 Fix all issues with AI agents
In `@ext/duckdb/scalar_function.c`:
- Around line 127-132: The comment warns to use PRAGMA threads=1 but doesn't
enforce it or implement GVL protection; update the implementation so Ruby
callbacks invoked by the DuckDB scalar function are protected by the Ruby Global
VM Lock (use rb_thread_call_with_gvl when calling into Ruby) and/or add a
runtime safety check that aborts with a clear error if DuckDB is configured with
threads>1; specifically wrap the code paths that invoke the Ruby proc
(referencing p->scalar_function and the scalar callback handler that currently
calls into Ruby) with rb_thread_call_with_gvl, and if you prefer a short-term
guard add a check querying DuckDB thread count at scalar function registration
time to fail fast instead of relying on PRAGMA threads=1 in the comment.
In `@test/duckdb_test/scalar_function_test.rb`:
- Around line 46-49: Add two tests to test/duckdb_test/scalar_function_test.rb:
implement test_scalar_function_with_parallel_execution and
test_scalar_function_thread_safety_with_ruby_calls. For
test_scalar_function_with_parallel_execution, set PRAGMA threads=4 via
con.execute('PRAGMA threads=4'), register a scalar Ruby callback (use the
existing create_scalar_function/create_function helper used elsewhere in this
file), generate ~100_000 rows (e.g., a VALUES or generate_series pattern), run a
SELECT invoking the scalar function and assert correct results/count to ensure
no crashes under parallel execution. For
test_scalar_function_thread_safety_with_ruby_calls, set PRAGMA threads=4,
register a scalar function whose body calls Time.now (or equivalent Ruby call)
and stress it over many rows/iterations to reproduce threading issues, then
assert the query completes successfully without exceptions; keep test names
exactly test_scalar_function_with_parallel_execution and
test_scalar_function_thread_safety_with_ruby_calls so they run with the suite.
🧹 Nitpick comments (1)
test/duckdb_test/scalar_function_test.rb (1)
43-63: Test validates the workaround, not thread-safety.The test now requires
PRAGMA threads=1to pass, which confirms the workaround prevents crashes but doesn't test actual thread-safe behavior. If the GVL protection were implemented as described in the PR objectives, you could test withPRAGMA threads=4to verify callbacks work correctly under parallel execution.Consider adding a test that explicitly verifies multi-threaded execution works safely once the
rb_thread_call_with_gvl()implementation is added.
| // Mark as volatile to prevent constant folding during query optimization | ||
| // This prevents DuckDB from evaluating the function at planning time. | ||
| // NOTE: Ruby scalar functions require single-threaded execution (PRAGMA threads=1) | ||
| // because Ruby proc callbacks cannot be safely invoked from DuckDB worker threads. | ||
| duckdb_scalar_function_set_volatile(p->scalar_function); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Volatile flag is appropriate, but the comment implies an incomplete solution.
Setting the volatile flag correctly prevents DuckDB from constant-folding the function during query optimization. However, the comment on lines 129-130 documents a limitation rather than a fix—users must remember to set PRAGMA threads=1 or face potential crashes.
This documentation-only approach is fragile:
- Users may forget or be unaware of this requirement
- Nothing prevents multi-threaded execution programmatically
- The actual
rb_thread_call_with_gvl()protection described in the PR objectives would eliminate this limitation
Consider adding a runtime check or implementing the GVL protection as described in the PR objectives so that Ruby callbacks work safely regardless of thread count.
🤖 Prompt for AI Agents
In `@ext/duckdb/scalar_function.c` around lines 127 - 132, The comment warns to
use PRAGMA threads=1 but doesn't enforce it or implement GVL protection; update
the implementation so Ruby callbacks invoked by the DuckDB scalar function are
protected by the Ruby Global VM Lock (use rb_thread_call_with_gvl when calling
into Ruby) and/or add a runtime safety check that aborts with a clear error if
DuckDB is configured with threads>1; specifically wrap the code paths that
invoke the Ruby proc (referencing p->scalar_function and the scalar callback
handler that currently calls into Ruby) with rb_thread_call_with_gvl, and if you
prefer a short-term guard add a check querying DuckDB thread count at scalar
function registration time to fail fast instead of relying on PRAGMA threads=1
in the comment.
|
|
||
| # IMPORTANT: Scalar functions with Ruby callbacks require single-threaded execution | ||
| # to prevent crashes when DuckDB invokes the callback from worker threads | ||
| con.execute('PRAGMA threads=1') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
rg -n "test_scalar_function_with_parallel_execution|test_scalar_function_thread_safety" --type rubyRepository: suketa/ruby-duckdb
Length of output: 44
🏁 Script executed:
rg -n "thread|parallel|concurrent" test/duckdb_test/scalar_function_test.rbRepository: suketa/ruby-duckdb
Length of output: 278
🏁 Script executed:
# Also check the complete test file to see what tests exist
cat test/duckdb_test/scalar_function_test.rbRepository: suketa/ruby-duckdb
Length of output: 1882
Add tests for parallel execution to verify thread-safety fix.
The PR adds a PRAGMA threads=1 workaround but doesn't include tests to verify the thread-safety fix works under concurrent execution. The PR objectives describe two tests that should be added:
test_scalar_function_with_parallel_execution: runs withPRAGMA threads=4on 100K rowstest_scalar_function_thread_safety_with_ruby_calls: stress test invokingTime.nowfrom callbacks
Without these tests, the thread-safety implementation has no verification under parallel execution.
🤖 Prompt for AI Agents
In `@test/duckdb_test/scalar_function_test.rb` around lines 46 - 49, Add two tests
to test/duckdb_test/scalar_function_test.rb: implement
test_scalar_function_with_parallel_execution and
test_scalar_function_thread_safety_with_ruby_calls. For
test_scalar_function_with_parallel_execution, set PRAGMA threads=4 via
con.execute('PRAGMA threads=4'), register a scalar Ruby callback (use the
existing create_scalar_function/create_function helper used elsewhere in this
file), generate ~100_000 rows (e.g., a VALUES or generate_series pattern), run a
SELECT invoking the scalar function and assert correct results/count to ensure
no crashes under parallel execution. For
test_scalar_function_thread_safety_with_ruby_calls, set PRAGMA threads=4,
register a scalar function whose body calls Time.now (or equivalent Ruby call)
and stress it over many rows/iterations to reproduce threading issues, then
assert the query completes successfully without exceptions; keep test names
exactly test_scalar_function_with_parallel_execution and
test_scalar_function_thread_safety_with_ruby_calls so they run with the suite.
Following Ruby best practices, implement validation in Ruby layer instead of C: Changes: - Renamed C method to _register_scalar_function (private) - Created public register_scalar_function in lib/duckdb/connection.rb - Ruby wrapper checks threads setting before calling C method - Raises clear DuckDB::Error if threads != 1 Benefits: - More maintainable (validation logic in Ruby, not C) - Easier to test and modify - Follows ruby-duckdb convention of keeping logic in Ruby layer - Prevents crashes with helpful error message Testing: - All 443 tests pass (1114 assertions) - test_register_scalar_function_raises_error_without_single_thread verifies error Example error: Scalar functions with Ruby callbacks require single-threaded execution. Current threads setting: 14. Execute 'PRAGMA threads=1' before registering scalar functions.
debff6e to
41209ea
Compare
The scalar function feature is not yet complete, so hide it from public documentation until fully implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@lib/duckdb/connection.rb`:
- Around line 179-189: The current check in register_scalar_function uses to_i
on current_setting('threads'), which treats "auto" and other non-numeric values
as 0 and bypasses the safety check; change it to read the raw string
(result.first.first), treat nil as unsafe, and require an explicit equality to
"1" (i.e., raise unless the string equals "1") so only an exact threads == "1"
allows registering Ruby-backed scalar functions; reference the
register_scalar_function method and the current_setting('threads') call to
locate where to replace the numeric parsing with a strict string comparison and
nil handling.
| def register_scalar_function(scalar_function) | ||
| # Check threads setting | ||
| result = execute("SELECT current_setting('threads')") | ||
| thread_count = result.first.first.to_i | ||
|
|
||
| if thread_count > 1 | ||
| raise DuckDB::Error, | ||
| "Scalar functions with Ruby callbacks require single-threaded execution. " \ | ||
| "Current threads setting: #{thread_count}. " \ | ||
| "Execute 'PRAGMA threads=1' before registering scalar functions." | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🌐 Web query:
DuckDB current_setting('threads') return value documentation behavior
💡 Result:
Short answer: current_setting('threads') returns the setting value as text (the scalar current_setting(...) returns the setting as a string/VARCHAR). The actual threads option itself is listed in duckdb_settings() with type BIGINT, so numeric semantics are in duckdb_settings(), but current_setting() yields the value as text. [1][2]
Sources:
- current_setting(...) utility function description (returns the current value; examples show string results). [1]
- Configuration / duckdb_settings() shows the threads option and that threads is typed as BIGINT in the settings table. [2]
References:
[1] DuckDB — Utility functions, current_setting('setting_name').
[2] DuckDB — Configuration / duckdb_settings() (threads entry shows type/value).
🌐 Web query:
DuckDB PRAGMA threads setting possible values auto numeric
💡 Result:
Short answer: PRAGMA threads accepts either a positive integer (number of threads) or the keyword auto (let DuckDB choose based on available CPU cores). Sources: DuckDB configuration/PRAGMA docs and examples. [1][2][3]
Sources:
- DuckDB configuration / threads setting (SET / current_setting). [1]
- DuckDB PRAGMA documentation (pragmas page). [2]
- Example showing PRAGMA threads=auto in practice. [3]
🏁 Script executed:
cd /tmp && find . -name "connection.rb" -type f 2>/dev/null | head -5Repository: suketa/ruby-duckdb
Length of output: 44
🏁 Script executed:
rg "register_scalar_function" --type rb -A 20 -B 2Repository: suketa/ruby-duckdb
Length of output: 88
🏁 Script executed:
rg "current_setting" --type rb -A 2 -B 2Repository: suketa/ruby-duckdb
Length of output: 88
🏁 Script executed:
rg "register_scalar_function" -A 20 -B 2Repository: suketa/ruby-duckdb
Length of output: 8636
🏁 Script executed:
rg "current_setting" -A 2 -B 2Repository: suketa/ruby-duckdb
Length of output: 378
🏁 Script executed:
fd "connection.rb" --type fRepository: suketa/ruby-duckdb
Length of output: 86
🏁 Script executed:
rg "PRAGMA threads" -A 3 -B 3Repository: suketa/ruby-duckdb
Length of output: 2516
🏁 Script executed:
rg "current_setting" -A 5 -B 5Repository: suketa/ruby-duckdb
Length of output: 777
Require explicit threads == 1; "auto" and other non-numeric values bypass the safety check.
current_setting('threads') returns a string (VARCHAR), not numeric. When set to PRAGMA threads=auto, the string "auto" converts to 0 via to_i, and 0 > 1 is false, allowing the check to pass silently. This enables unsafe multi-threaded execution. Use strict parsing with an exact equality check instead.
🔧 Proposed fix
- result = execute("SELECT current_setting('threads')")
- thread_count = result.first.first.to_i
-
- if thread_count > 1
+ result = execute("SELECT current_setting('threads')")
+ raw = result.first&.first
+ thread_count = Integer(raw, 10) rescue nil
+
+ if thread_count != 1
raise DuckDB::Error,
"Scalar functions with Ruby callbacks require single-threaded execution. " \
- "Current threads setting: #{thread_count}. " \
+ "Current threads setting: #{raw.inspect}. " \
"Execute 'PRAGMA threads=1' before registering scalar functions."
end🤖 Prompt for AI Agents
In `@lib/duckdb/connection.rb` around lines 179 - 189, The current check in
register_scalar_function uses to_i on current_setting('threads'), which treats
"auto" and other non-numeric values as 0 and bypasses the safety check; change
it to read the raw string (result.first.first), treat nil as unsafe, and require
an explicit equality to "1" (i.e., raise unless the string equals "1") so only
an exact threads == "1" allows registering Ruby-backed scalar functions;
reference the register_scalar_function method and the current_setting('threads')
call to locate where to replace the numeric parsing with a strict string
comparison and nil handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds thread-safety protection for Ruby scalar function callbacks in DuckDB by enforcing single-threaded execution mode. Ruby procs cannot be safely invoked from non-Ruby threads due to GVL requirements, so the implementation adds validation to ensure PRAGMA threads=1 is set before registering scalar functions.
Changes:
- Mark scalar functions as volatile to prevent constant folding
- Rename C method to private
_register_scalar_function - Add Ruby wrapper with thread count validation that raises error if threads != 1
- Update tests to set
PRAGMA threads=1and verify error handling
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| ext/duckdb/scalar_function.c | Marks scalar functions as volatile to prevent query optimization issues |
| ext/duckdb/connection.c | Renames register method to private for Ruby wrapper to intercept |
| lib/duckdb/connection.rb | Adds thread count validation wrapper that enforces single-threaded execution |
| test/duckdb_test/scalar_function_test.rb | Adds PRAGMA threads=1 to existing test and new test for error validation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| raise DuckDB::Error, | ||
| 'Scalar functions with Ruby callbacks require single-threaded execution. ' \ | ||
| "Current threads setting: #{thread_count}. " \ | ||
| "Execute 'SET threads=1' before registering scalar functions." |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error message instructs users to use 'SET threads=1' but the test and documentation use 'PRAGMA threads=1'. These are different SQL commands in DuckDB - SET modifies session settings while PRAGMA is typically used for configuration. The error message should be consistent with the documentation and use 'PRAGMA threads=1' instead of 'SET threads=1'.
| "Execute 'SET threads=1' before registering scalar functions." | |
| "Execute 'PRAGMA threads=1' before registering scalar functions." |
| end | ||
|
|
||
| assert_match(/single-threaded execution/, error.message) | ||
| assert_match(/PRAGMA threads=1/, error.message) |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion on line 79 expects 'PRAGMA threads=1' but the error message on line 188 of lib/duckdb/connection.rb actually says 'SET threads=1'. This test will fail because the error message doesn't match the expected pattern. The error message should use 'PRAGMA threads=1' to be consistent with the documentation and the test assertion.
| assert_match(/PRAGMA threads=1/, error.message) | |
| assert_match(/threads=1/, error.message) |
| # sf = DuckDB::ScalarFunction.new | ||
| # sf.name = 'add_one' | ||
| # sf.return_type = DuckDB::LogicalType.new(4) # INTEGER | ||
| # sf.set_function { |val| val + 1 } |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation example shows a function with a parameter 'val' but the actual function implementation being set is { |val| val + 1 }. However, looking at the scalar_function_callback in ext/duckdb/scalar_function.c (line 87), the Ruby proc is called with 0 arguments (rb_intern("call"), 0). This means the example function with a parameter will not work as shown. The example should either use a zero-argument block or the callback implementation should be updated to pass arguments.
| # sf.set_function { |val| val + 1 } | |
| # sf.set_function { 1 } # zero-argument callback; returns constant 1 |
| db&.close | ||
| end | ||
|
|
||
| def test_register_scalar_function_raises_error_without_single_thread |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test name 'test_register_scalar_function_raises_error_without_single_thread' uses 'single_thread' (singular) which is inconsistent with the term 'single-threaded' used everywhere else in the code and error messages. Consider renaming to 'test_register_scalar_function_raises_error_without_single_threaded_mode' or 'test_register_scalar_function_requires_single_threaded_execution' for consistency.
| def test_register_scalar_function_raises_error_without_single_thread | |
| def test_register_scalar_function_raises_error_without_single_threaded_mode |
| result = execute("SELECT current_setting('threads')") | ||
| thread_count = result.first.first.to_i | ||
|
|
||
| if thread_count > 1 |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition checks if thread_count > 1, but this means thread_count = 0 would pass the check. While unlikely, if DuckDB returns 0 or a negative value for the threads setting, the validation would incorrectly allow registration. Consider using thread_count != 1 instead to be more explicit and robust.
| if thread_count > 1 | |
| if thread_count != 1 |
| # Registers a scalar function with the connection. | ||
| # :nodoc: |
Copilot
AI
Jan 18, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method is marked as ':nodoc:' which prevents it from appearing in generated documentation. However, this is a public API that users need to call, and it has important safety requirements that should be documented. Consider removing ':nodoc:' and providing proper public documentation, or if this is intended to be private, the method visibility should match the intent.
Problem
DuckDB invokes scalar function callbacks from worker threads during parallel query execution. Ruby proc callbacks cannot be safely called from non-Ruby threads because they require the GVL (Global VM Lock).
This PR addresses the threading limitation discovered after merging #999.
Investigation
After studying duckdb-rs (Rust binding), we learned:
Send + Synctrait bounds ensure thread safetyWhy Ruby is different:
rb_funcall()cannot be called from non-Ruby threadsrb_thread_call_with_gvl()is designed for Ruby threads that released GVL, NOT foreign threads created by DuckDBSolution
Mark scalar functions as volatile and require single-threaded execution:
Changes
PRAGMA threads=1to testsTesting
✅ All 442 tests pass (1109 assertions, 0 failures)
✅ Scalar functions work correctly with
PRAGMA threads=1✅ No crashes with single-threaded execution
Limitations
PRAGMA threads=1before using scalar functionsPRAGMA threads=1, parallel queries will crashFuture Work
References
tmp/thread_safety_analysis.mdSummary by CodeRabbit
Bug Fixes
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.