Skip to content

Add precompile workload for Dual and SubArray broadcast operations#1291

Merged
ChrisRackauckas merged 4 commits intoSciML:masterfrom
ChrisRackauckas-Claude:precompile-subarray-dual-broadcast
Mar 1, 2026
Merged

Add precompile workload for Dual and SubArray broadcast operations#1291
ChrisRackauckas merged 4 commits intoSciML:masterfrom
ChrisRackauckas-Claude:precompile-subarray-dual-broadcast

Conversation

@ChrisRackauckas-Claude
Copy link
Contributor

Summary

  • Adds a PrecompileTools.@compile_workload block to DiffEqBaseForwardDiffExt.jl that exercises common scalar, array, and SubArray operations on Dual{Tag{OrdinaryDiffEqTag, Float64}, Float64, 1}
  • The extension already defined this dualT type but had no precompilation, causing ~2.5s of broadcast compilation overhead at runtime for ODE functions using views

What's precompiled

Scalar Dual operations:

  • Arithmetic: +, -, *, /, ^, negation, abs
  • Math functions: exp, log, sin, cos, tan, sqrt, cbrt, asin, acos, atan, sinh, cosh, tanh
  • Comparisons/predicates: <, >, min, max, isnan, isinf, isfinite
  • Conversion: zero, one, float, ForwardDiff.value, ForwardDiff.partials

Vector{Dual} operations:

  • Broadcast: .+, .-, .*, ./, .^
  • In-place broadcast: out .= v1 .* s .+ v2, etc.
  • Reductions: sum, sum(abs2, ...), maximum(abs, ...)
  • LinearAlgebra: dot, norm (1, 2, Inf)
  • copy, fill!

SubArray broadcast patterns (Float64 and Dual):

  • dst .= -k .* src1 .+ k .* src2 .* src3 (linear combination of views)
  • dst .= k .* src1 .- k .* src2 .^ 2 .- k .* src2 .* src3 (subtraction chain with power)
  • dst .= k .* src .^ 2 (scaled power)
  • Simple patterns: assignment, scaling, element-wise ops, negation

Benchmark

Testing with the ROBER problem from DifferentialEquations.jl#1125:

Scenario First solve
Baseline (views, no pre-warming) 3.05s
With SubArray+Dual pre-warming 0.80s
Direct indexing (no views) 0.25s

73% reduction in first-solve overhead for view-based ODE functions.

Test plan

  • All DiffEqBase tests pass locally
  • Precompile workload code verified independently
  • CI passes

🤖 Generated with Claude Code

…ions

The ForwardDiff extension defines the OrdinaryDiffEqTag Dual type but had no
precompile workload. ODE functions using @view with broadcast operations
(e.g. `dy .= k .* y1 .+ k .* y2 .* y3`) trigger ~2.5s of compilation at
runtime for SubArray{Dual{...}} broadcast type trees.

This adds comprehensive precompilation of:
- Scalar Dual arithmetic (+, -, *, /, ^, negation, abs)
- Scalar Dual math functions (exp, log, sin, cos, tan, sqrt, etc.)
- Scalar Dual comparisons and predicates (min, max, isnan, isfinite)
- Vector{Dual} broadcast operations (.+, .-, .*, ./, .^)
- Vector{Dual} reductions (sum, norm, dot)
- SubArray{Float64} and SubArray{Dual} broadcast patterns matching common
  ODE right-hand-side functions

Testing shows this reduces first-solve time for view-based ODE functions
from ~3.0s to ~0.8s (73% reduction).

Addresses SciML/DifferentialEquations.jl#1125

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
The previous commit only exercised `dst .= .-k .* sv1 .+ k .* sv2 .* sv3`
(negated first term), but ODE functions commonly use positive first terms
like `dst .= k .* sv1 .+ k .* sv2 .* sv3`. These create different
Broadcasted type trees that weren't being pre-warmed.

Adding this pattern reduces first-solve time from 0.80s to 0.31s,
now nearly matching the 0.25s direct-indexing baseline.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
…ding blocks

Fused multi-operand broadcast expressions (e.g. `dy .= k .* y1 .+ k .* y2 .* y3`)
create unique nested Broadcasted types per expression and cannot be generically
precompiled. Only the primitive SubArray operations (copy, scale, multiply, add,
subtract, power, negate) are truly generic building blocks.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
@ChrisRackauckas-Claude
Copy link
Contributor Author

Updated: Removed expression-specific fused broadcast patterns (e.g. dsv1 .= k .* sv1 .+ k .* sv2 .* sv3).

These create unique nested Broadcasted type trees per expression and can't be generically precompiled. The dominant TTFX cost for view-based ODE functions (~2.5s of 3.0s) comes from these fused trees, which is a Julia broadcast fusion design constraint.

The precompile workload now contains only generic building blocks:

  • Scalar Dual arithmetic, math functions, comparisons
  • Vector{Dual} broadcast, reductions, LinearAlgebra ops
  • Primitive SubArray operations (copy, scale, multiply, add, subtract, power, negate)

These provide modest TTFX improvement (~13% for view-based functions) but are universally useful operations that benefit all ForwardDiff-based differentiation.

The VectorContinuousCallback termination time can vary slightly across
platforms. Use atol=1e-4 instead of exact floating point comparison.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
@ChrisRackauckas ChrisRackauckas merged commit d202eff into SciML:master Mar 1, 2026
39 of 46 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants