
Implement Bluestein's algorithm for large primes (>100)#107

Open
wheeheee wants to merge 5 commits into JuliaMath:main from wheeheee:bluestein

Conversation

@wheeheee
Contributor

In the process, I noticed mul! doesn't return the output array, which it is supposed to do according to the docs.

Local benchmarks indicate that this is close enough to FFTW's performance, so not too bad: a big improvement on the previous O(N^2) DFT. It allocates, but that's not too important overall.
Perhaps Rader's algorithm could do better here, but Bluestein's is much easier to write. This doesn't really affect anything else.
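For reference, Bluestein's trick rests on the identity nk = (n^2 + k^2 - (k-n)^2)/2, which rewrites a length-N DFT as a circular convolution that power-of-two FFTs can compute. A minimal, self-contained sketch (not this PR's implementation; it uses a toy radix-2 FFT for the inner transforms so the example stands alone):

```julia
# Toy radix-2 FFT for power-of-two lengths, for illustration only.
function fft_pow2(x::Vector{ComplexF64})
    n = length(x)
    n == 1 && return copy(x)
    even = fft_pow2(x[1:2:end])       # FFT of even-indexed samples
    odd  = fft_pow2(x[2:2:end])       # FFT of odd-indexed samples
    t = [cispi(-2k / n) * odd[k + 1] for k in 0:(n ÷ 2 - 1)]
    return vcat(even .+ t, even .- t)
end
ifft_pow2(x) = conj.(fft_pow2(conj.(x))) ./ length(x)

# Sketch of Bluestein's algorithm; not the PR's actual code.
function bluestein_dft(x::Vector{ComplexF64})
    N = length(x)
    # chirp w[n] = exp(-im*pi*n^2/N); n^2 reduced mod 2N for accuracy
    w = [cispi(-(n^2 % (2N)) / N) for n in 0:(N - 1)]
    M = nextpow(2, 2N - 1)            # convolution length, a power of two
    a = zeros(ComplexF64, M)
    a[1:N] = x .* w                   # pre-multiplied, zero-padded input
    b = zeros(ComplexF64, M)          # wrapped conjugate chirp
    b[1] = 1
    for n in 1:(N - 1)
        b[n + 1] = b[M - n + 1] = conj(w[n + 1])
    end
    conv = ifft_pow2(fft_pow2(a) .* fft_pow2(b))  # circular convolution
    return w .* conv[1:N]             # post-multiply to recover the DFT
end
```

Because M is rounded up to a power of two, the whole prime-length transform costs O(M log M) instead of O(N^2).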

@wheeheee wheeheee mentioned this pull request Feb 15, 2026
@codecov

codecov bot commented Feb 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.40%. Comparing base (fc584f3) to head (412bbf5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #107      +/-   ##
==========================================
- Coverage   96.22%   95.40%   -0.83%     
==========================================
  Files           4        4              
  Lines         424      457      +33     
==========================================
+ Hits          408      436      +28     
- Misses         16       21       +5     


@dannys4
Collaborator

dannys4 commented Feb 15, 2026

this is actually really impressive and gives me some hope for Rader if we decide to implement it. Bluestein, to my knowledge, is considered relatively impractical outside the discrete-transform world (i.e., people like using it for chirp transforms and z-transforms?) because of the big allocations. In fact, FFTW hadn't actually implemented it when the 2005 FFTW paper was released.

I agree that Bluestein's is significantly easier than Rader's to implement and, for that reason, significantly more maintainable. So this is definitely a good first step.

@dannys4
Collaborator

dannys4 commented Feb 15, 2026

One thing to consider, however, is the allocation. In an ideal world, I would like to have the allocations all take place when the FFTAPlan is created so that the actual execution of the FFT is entirely arithmetic and memory-bound operations. Further, if you do several FFTs of the same size, you should absolutely pre-allocate the space. Unfortunately, this is really tricky due to Julia's behavior with parallelism. On the other hand, this algorithm is really intended for larger primes, so I don't believe it's fair to consider the allocation time as "free" (at least, the penalty incurred when bouncing around all different areas of memory).

I'd like, if possible, to call in the wisdom of @andreasnoack to ask: What is the "best" way of working with pre-allocations for an FFTAPlan? Should we be doing so, or is it better right now to just allocate on-the-fly? Any thoughts?
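For concreteness, plan-time allocation could look something like the sketch below. All names here are hypothetical and do not correspond to FFTA's actual types; the idea is just that the plan owns the scratch buffers, so execution is pure arithmetic:

```julia
# Hypothetical sketch: stash Bluestein's scratch buffers in the plan at
# creation time. Names are illustrative, not FFTA's actual API.
struct BluesteinWorkspace{T}
    chirp::Vector{Complex{T}}    # w[n] = exp(-im*pi*n^2/N), length N
    a::Vector{Complex{T}}        # padded input scratch, length M
    b::Vector{Complex{T}}        # wrapped-chirp scratch, length M
end

function BluesteinWorkspace{T}(N::Int) where {T}
    M = nextpow(2, 2N - 1)       # convolution length, a power of two
    chirp = [Complex{T}(cispi(-(n^2 % (2N)) / N)) for n in 0:(N - 1)]
    return BluesteinWorkspace{T}(chirp, zeros(Complex{T}, M), zeros(Complex{T}, M))
end
```

Per-plan buffers sidestep the parallelism concern as long as a single plan is not executed from multiple threads at once; concurrent execution would need per-thread or per-task workspaces.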

Collaborator

@dannys4 dannys4 left a comment


I'm not not approving it; I just wanted to give some feedback and ask questions while considering the allocation thing.

src/callgraph.jl Outdated
push!(workspace, T[])
push!(nodes, CallGraphNode(0, 0, DFT, N, s_in, s_out, w))
# use Bluestein's algorithm for big primes
LEAF_ALG = N < 100 ? DFT : BLUESTEIN
Collaborator

Is 100 benchmarked, arbitrary, or in "the literature"? I'd like, if possible, to make this a bit configurable, because I'd bet that this is heavily influenced by system.

Contributor Author

It's from a bit of eyeballing the benchmarks, yes, and it could easily be made configurable. It works decently well without tuning; maybe it could be theoretically justified by counting muls and adds (I did not actually count...)

Collaborator

Yeah, it's definitely tricky without auto-tuning for per-system performance. I personally think that an atomic global (to ensure thread safety) that is configurable and only checked here at planning time would be much better without any real overhead.

Contributor Author

I was gonna go the keyword argument route, but out of curiosity, how do you do that? Not too important because the cost is easily amortised, but doesn't that still incur the normal penalties for accessing globals?

Collaborator

Okay, it occurred to me that atomics are only first class in Julia 1.11+, so definitely just do a kwarg for now. For future reference, you can do something like

mutable struct AtomicSwitchover
    @atomic n::Int
end
const BLUESTEIN_SWITCHOVER = AtomicSwitchover(100)

function change_bluestein_switch!(N::Int)
    return (@atomic BLUESTEIN_SWITCHOVER.n = N)
end

Contributor Author

Btw, I changed the default to 73, because that's the break-even point on my pretty old machine, and it appears to be good on the GitHub runner too. How is it on yours?
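The keyword-argument route discussed above could be sketched like this (hypothetical function and kwarg names, not the PR's actual signature):

```julia
# Hypothetical sketch of a configurable switchover via keyword argument.
# The default of 73 is the benchmarked break-even point mentioned above.
function choose_leaf_alg(N::Int; bluestein_cutoff::Int = 73)
    return N < bluestein_cutoff ? :DFT : :BLUESTEIN
end
```

Since planning happens once per size, the cost of the extra branch is amortized across every subsequent execution of the plan.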

src/algos.jl Outdated
Xk = (@. tmp = conj(b_series) * conv_a_b)

out_inds = range(start_out; step=stride_out, length=N)
out[out_inds] = Xk
Collaborator

I'm a little confused by lines 90-98; is there any reason you're not just using a for loop to fill out? I feel like that would avoid a bunch of unnecessary operations (and also make the code clearer).
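The suggestion amounts to something like the following (a sketch, wrapped in a function so it stands alone; variable names mirror the snippet above, and this is not the merged code):

```julia
# Sketch of the loop-based fill: write each output element directly
# instead of materializing Xk and an index range. Names follow the
# reviewed snippet; this is illustrative, not the merged code.
function fill_output!(out, b_series, conv_a_b, start_out, stride_out, N)
    j = start_out
    @inbounds for k in 1:N
        out[j] = conj(b_series[k]) * conv_a_b[k]
        j += stride_out
    end
    return out
end
```

This avoids allocating both the temporary Xk array and the range object, and makes the stride arithmetic explicit.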

Contributor Author

Yeah, it is a bit messy; I should have.

@dannys4
Collaborator

dannys4 commented Feb 15, 2026

Okay, I think a better plan here is probably: 1) make the 100 configurable, 2) add a test that actually calls Bluestein, 3) I'll approve the PR, and then 4) add issues on improving the performance. The allocations (at least in the 2D case) can easily be a real slog, so I want to ensure that the user can keep the old behavior if they would like it.

@wheeheee
Contributor Author

No problem. I did think about pre-allocating, but couldn't immediately see a good place to do it, so I put up the PR first in the hope that you would. Also, I suspected that since Bluestein is so heavy, the memory allocation would be mostly amortised; hopefully GC is now good enough not to free those arrays across a 2D run and to hand back the same memory, so the overhead would still be acceptable despite an alarming number of allocations on paper.

w2 = w * w1
w3 = w * w2
w4 = w2 * w2   # was: w4 = w * w3
Contributor Author

@wheeheee wheeheee Feb 16, 2026

The test for N=103 reveals quite a bit of error, well above FFTW's, although the atol of that test may be a bit more stringent than needed. I can't really spot the problem in the current implementation, so I'm just grasping in the dark.
This way of getting w4 seems to slightly decrease the error, but not significantly.

Anyway, Bluestein's algorithm does naturally introduce extra numerical error, so it may just be that FFTW has special tricks to claw back some accuracy that this doesn't.
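One standard accuracy lever here (whether FFTW does exactly this is an assumption on my part) is argument reduction in the chirp: exp(-im*pi*n^2/N) is periodic in n^2 with period 2N, so reducing n^2 mod 2N before forming the exponential keeps the phase argument small. For N near 100, n^2/N reaches ~100, so the unreduced form gives up several bits of the fractional phase:

```julia
# The chirp phase is periodic in n^2 with period 2N, so n^2 can be
# reduced mod 2N before calling cispi. The naive form feeds cispi a
# large argument and loses low-order phase bits; the reduced form
# keeps the argument in [0, 2), preserving precision.
chirp_naive(n, N)   = cispi(-n^2 / N)
chirp_reduced(n, N) = cispi(-(n^2 % (2N)) / N)
```

If the error gap against FFTW persists, comparing the chirp values produced both ways for the failing N=103 case would be a cheap first diagnostic.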

