
Implement Bluestein's algorithm for large primes (>100)#107

Open
wheeheee wants to merge 5 commits into JuliaMath:main from wheeheee:bluestein

Conversation

@wheeheee
Contributor

In the process, I noticed mul! doesn't return the output array, which it is supposed to do according to the docs.

Local benchmarks indicate that this is close enough to FFTW's performance, so not too bad: a big improvement on the previous O(N^2) DFT. It allocates, but that's not too important overall.
Perhaps Rader's algorithm could do better here, but Bluestein's is much easier to write. This doesn't really affect anything else.
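For reference, Bluestein's trick rests on the identity nk = (n^2 + k^2 - (k-n)^2)/2, which rewrites a length-N DFT as a circular convolution that power-of-two FFTs can compute. A minimal, self-contained sketch (not this PR's implementation; it uses a toy radix-2 FFT for the inner transforms so the example stands alone):

```julia
# Toy radix-2 FFT for power-of-two lengths, for illustration only.
function fft_pow2(x::Vector{ComplexF64})
    n = length(x)
    n == 1 && return copy(x)
    even = fft_pow2(x[1:2:end])       # FFT of even-indexed samples
    odd  = fft_pow2(x[2:2:end])       # FFT of odd-indexed samples
    t = [cispi(-2k / n) * odd[k + 1] for k in 0:(n ÷ 2 - 1)]
    return vcat(even .+ t, even .- t)
end
ifft_pow2(x) = conj.(fft_pow2(conj.(x))) ./ length(x)

# Sketch of Bluestein's algorithm; not the PR's actual code.
function bluestein_dft(x::Vector{ComplexF64})
    N = length(x)
    # chirp w[n] = exp(-im*pi*n^2/N); n^2 reduced mod 2N for accuracy
    w = [cispi(-(n^2 % (2N)) / N) for n in 0:(N - 1)]
    M = nextpow(2, 2N - 1)            # convolution length, a power of two
    a = zeros(ComplexF64, M)
    a[1:N] = x .* w                   # pre-multiplied, zero-padded input
    b = zeros(ComplexF64, M)          # wrapped conjugate chirp
    b[1] = 1
    for n in 1:(N - 1)
        b[n + 1] = b[M - n + 1] = conj(w[n + 1])
    end
    conv = ifft_pow2(fft_pow2(a) .* fft_pow2(b))  # circular convolution
    return w .* conv[1:N]             # post-multiply to recover the DFT
end
```

Because M is rounded up to a power of two, the whole prime-length transform costs O(M log M) instead of O(N^2).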

@wheeheee wheeheee mentioned this pull request Feb 15, 2026
@codecov

codecov bot commented Feb 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.40%. Comparing base (fc584f3) to head (412bbf5).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #107      +/-   ##
==========================================
- Coverage   96.22%   95.40%   -0.83%     
==========================================
  Files           4        4              
  Lines         424      457      +33     
==========================================
+ Hits          408      436      +28     
- Misses         16       21       +5     


@dannys4
Collaborator

dannys4 commented Feb 15, 2026

this is actually really impressive and gives me some hope for Rader if we decide to implement it. Bluestein, to my knowledge, is considered relatively impractical outside the discrete-transform world (i.e., people like using it for chirp transforms and z-transforms?) because of the big allocations. In fact, FFTW hadn't actually implemented it when the 2005 FFTW paper was released.

I agree that Bluestein's is significantly easier than Rader's to implement and, for that reason, significantly more maintainable. So this is definitely a good first step.

@dannys4
Collaborator

dannys4 commented Feb 15, 2026

One thing to consider, however, is the allocation. In an ideal world, I would like to have the allocations all take place when the FFTAPlan is created so that the actual execution of the FFT is entirely arithmetic and memory-bound operations. Further, if you do several FFTs of the same size, you should absolutely pre-allocate the space. Unfortunately, this is really tricky due to Julia's behavior with parallelism. On the other hand, this algorithm is really intended for larger primes, so I don't believe it's fair to consider the allocation time as "free" (at least, the penalty incurred when bouncing around all different areas of memory).

I'd like, if possible, to call in the wisdom of @andreasnoack to ask: What is the "best" way of working with pre-allocations for an FFTAPlan? Should we be doing so, or is it better right now to just allocate on-the-fly? Any thoughts?
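For concreteness, plan-time allocation could look something like the sketch below. All names here are hypothetical and do not correspond to FFTA's actual types; the idea is just that the plan owns the scratch buffers, so execution is pure arithmetic:

```julia
# Hypothetical sketch: stash Bluestein's scratch buffers in the plan at
# creation time. Names are illustrative, not FFTA's actual API.
struct BluesteinWorkspace{T}
    chirp::Vector{Complex{T}}    # w[n] = exp(-im*pi*n^2/N), length N
    a::Vector{Complex{T}}        # padded input scratch, length M
    b::Vector{Complex{T}}        # wrapped-chirp scratch, length M
end

function BluesteinWorkspace{T}(N::Int) where {T}
    M = nextpow(2, 2N - 1)       # convolution length, a power of two
    chirp = [Complex{T}(cispi(-(n^2 % (2N)) / N)) for n in 0:(N - 1)]
    return BluesteinWorkspace{T}(chirp, zeros(Complex{T}, M), zeros(Complex{T}, M))
end
```

Per-plan buffers sidestep the parallelism concern as long as a single plan is not executed from multiple threads at once; concurrent execution would need per-thread or per-task workspaces.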

Collaborator

@dannys4 dannys4 left a comment


I'm not not approving it; I just wanted to give some feedback and ask questions while considering the allocation thing.

src/callgraph.jl Outdated
push!(workspace, T[])
push!(nodes, CallGraphNode(0, 0, DFT, N, s_in, s_out, w))
# use Bluestein's algorithm for big primes
LEAF_ALG = N < 100 ? DFT : BLUESTEIN
Collaborator

Is 100 benchmarked, arbitrary, or in "the literature"? I'd like, if possible, to make this a bit configurable, because I'd bet that this is heavily influenced by system.

Contributor Author

It's from a bit of eyeballing the benchmarks, yes, and it could easily be made configurable. It works decently well without tuning; maybe it could be theoretically justified by counting muls and adds (I did not actually count...)

Collaborator

Yeah, it's definitely tricky without auto-tuning for per-system performance. I personally think that an atomic global (to ensure thread safety) that is configurable and only checked here at planning time would be much better without any real overhead.

Contributor Author

I was gonna go the keyword argument route, but out of curiosity, how do you do that? Not too important because the cost is easily amortised, but doesn't that still incur the normal penalties for accessing globals?

Collaborator

Okay, it occurred to me that atomics are only first class in Julia 1.11+, so definitely just do a kwarg for now. For future reference, you can do something like

mutable struct AtomicSwitchover
    @atomic n::Int
end
const BLUESTEIN_SWITCHOVER = AtomicSwitchover(100)

function change_bluestein_switch!(N::Int)
    return (@atomic BLUESTEIN_SWITCHOVER.n = N)
end

Contributor Author

Btw, I changed the default to 73, because that's the break-even point on my pretty old machine, and it appears to be good on the GitHub runner too. How is it on yours?
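The keyword-argument route discussed above could be sketched like this (hypothetical function and kwarg names, not the PR's actual signature):

```julia
# Hypothetical sketch of a configurable switchover via keyword argument.
# The default of 73 is the benchmarked break-even point mentioned above.
function choose_leaf_alg(N::Int; bluestein_cutoff::Int = 73)
    return N < bluestein_cutoff ? :DFT : :BLUESTEIN
end
```

Since planning happens once per size, the cost of the extra branch is amortized across every subsequent execution of the plan.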

src/algos.jl Outdated
Xk = (@. tmp = conj(b_series) * conv_a_b)

out_inds = range(start_out; step=stride_out, length=N)
out[out_inds] = Xk
Collaborator

I'm a little confused by lines 90-98; is there any reason you're not just using a for loop to fill out? I feel like that would avoid a bunch of unnecessary operations (and also make the code clearer).
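The suggestion amounts to something like the following (a sketch, wrapped in a function so it stands alone; variable names mirror the snippet above, and this is not the merged code):

```julia
# Sketch of the loop-based fill: write each output element directly
# instead of materializing Xk and an index range. Names follow the
# reviewed snippet; this is illustrative, not the merged code.
function fill_output!(out, b_series, conv_a_b, start_out, stride_out, N)
    j = start_out
    @inbounds for k in 1:N
        out[j] = conj(b_series[k]) * conv_a_b[k]
        j += stride_out
    end
    return out
end
```

This avoids allocating both the temporary Xk array and the range object, and makes the stride arithmetic explicit.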

Contributor Author

Yeah, it is a bit messy; I should have.

@dannys4
Collaborator

dannys4 commented Feb 15, 2026

Okay, I think a better plan here is probably: 1) make the 100 configurable, 2) add a test that actually calls Bluestein, 3) I'll approve the PR, and then 4) add issues on improving the performance. The allocations (at least in the 2D case) can easily be a real slog, so I want to ensure that the user can keep the old behavior if they would like it.

@wheeheee
Contributor Author

No problem. I did think about pre-allocating, but couldn't immediately see a good place to do it, so I put up the PR first in the hope that you would. Also, I suspected that since Bluestein is so heavy, the memory allocation would be mostly amortised; hopefully GC is now good enough not to free those arrays across a 2D run and to hand back the same memory, so the overhead would still be acceptable despite an alarming number of allocations on paper.

w2 = w * w1
w3 = w * w2
w4 = w2 * w2   # was: w4 = w * w3
Contributor Author

@wheeheee wheeheee Feb 16, 2026

The test for N=103 reveals quite a bit of error, well above FFTW's, although the atol of that test may be a bit more stringent than needed. I can't really spot the problem in the current implementation, so I'm just grasping in the dark.
This way of getting w4 seems to slightly decrease the error, but not significantly.

Anyway, Bluestein's algorithm does naturally introduce extra numerical error, so it may just be that FFTW has special tricks to claw back some accuracy that this doesn't.
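One standard accuracy lever here (whether FFTW does exactly this is an assumption on my part) is argument reduction in the chirp: exp(-im*pi*n^2/N) is periodic in n^2 with period 2N, so reducing n^2 mod 2N before forming the exponential keeps the phase argument small. For N near 100, n^2/N reaches ~100, so the unreduced form gives up several bits of the fractional phase:

```julia
# The chirp phase is periodic in n^2 with period 2N, so n^2 can be
# reduced mod 2N before calling cispi. The naive form feeds cispi a
# large argument and loses low-order phase bits; the reduced form
# keeps the argument in [0, 2), preserving precision.
chirp_naive(n, N)   = cispi(-n^2 / N)
chirp_reduced(n, N) = cispi(-(n^2 % (2N)) / N)
```

If the error gap against FFTW persists, comparing the chirp values produced both ways for the failing N=103 case would be a cheap first diagnostic.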

