Implement Bluestein's algorithm for large primes (>100) #107
wheeheee wants to merge 5 commits into JuliaMath:main
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main     #107      +/-   ##
==========================================
- Coverage   96.22%   95.40%    -0.83%
==========================================
  Files           4        4
  Lines         424      457      +33
==========================================
+ Hits          408      436      +28
- Misses         16       21       +5
```
This is actually really impressive and gives me some hope for Rader's algorithm if we decide to implement it. Bluestein is, to my knowledge, considered relatively impractical outside the discrete transform world (i.e., people like using it for chirp transforms and z-transforms?) because of the big allocations. In fact, FFTW hadn't actually implemented it when the 2005 FFTW paper was released. I agree that Bluestein is significantly easier than Rader's to implement and, for that reason, significantly more maintainable. So this is definitely a good first step.
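For reference, a minimal sketch of the idea (illustration only, not this PR's code; `FFTW.fft`/`ifft` stand in for any power-of-two transform, and the function name is made up):

```julia
using FFTW  # stands in for any power-of-two FFT backend

# Bluestein's identity nk = (n² + k² - (k-n)²)/2 turns a length-N DFT into a
# circular convolution of chirp-modulated sequences, which can be computed
# with FFTs of any convenient length M ≥ 2N - 1.
function bluestein_dft(x::AbstractVector{<:Complex})
    N = length(x)
    M = nextpow(2, 2N - 1)
    w = [cispi(-n^2 / N) for n in 0:N-1]   # chirp: e^{-iπn²/N}
    a = zeros(ComplexF64, M)
    b = zeros(ComplexF64, M)
    a[1:N] .= x .* w
    b[1:N] .= conj.(w)
    b[M-N+2:M] .= conj.(w[N:-1:2])         # mirror the chirp for the wraparound
    conv_ab = ifft(fft(a) .* fft(b))       # circular convolution via FFT
    return w .* conv_ab[1:N]
end

# quick check: x = randn(ComplexF64, 103); bluestein_dft(x) ≈ fft(x)
```

The three length-M transforms plus the padded buffers are exactly where the "big allocations" come from.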
One thing to consider, however, is the allocation. In an ideal world, all allocation would happen when the FFTAPlan is created, so that the actual execution of the FFT is entirely arithmetic and memory-bound operations. Further, if you do several FFTs of the same size, you should absolutely pre-allocate the space. Unfortunately, this is really tricky due to Julia's behavior with parallelism. On the other hand, this algorithm is really intended for larger primes, so I don't believe it's fair to consider the allocation time as "free" (at least, the penalty incurred when bouncing around all different areas of memory). I'd like, if possible, to call in the wisdom of @andreasnoack to ask: what is the "best" way of working with pre-allocations for an FFTAPlan? Should we be doing so, or is it better right now to just allocate on-the-fly? Any thoughts?
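One possible shape for that, as a sketch (all names here are hypothetical, not FFTA's actual types): compute the chirp and allocate every buffer when the plan is built, so executing the transform touches no new memory.

```julia
using FFTW  # stands in for the plan's own power-of-two kernels

# Hypothetical pre-allocated workspace for one Bluestein leaf of size N.
struct BluesteinWorkspace{T<:Complex}
    chirp::Vector{T}    # w[n] = e^{-iπn²/N}, length N
    b_fft::Vector{T}    # FFT of the zero-padded, mirrored conjugate chirp, length M
    scratch::Vector{T}  # length-M buffer reused on every execution
end

function BluesteinWorkspace{T}(N::Integer) where {T<:Complex}
    M = nextpow(2, 2N - 1)
    chirp = T[cispi(-n^2 / N) for n in 0:N-1]
    b = zeros(T, M)
    b[1:N] .= conj.(chirp)
    b[M-N+2:M] .= conj.(chirp[N:-1:2])
    return BluesteinWorkspace{T}(chirp, fft(b), zeros(T, M))
end

# built once per plan: ws = BluesteinWorkspace{ComplexF64}(103)
```

The parallelism concern would then become managing one workspace per task, rather than allocating per call.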
dannys4 left a comment
I'm not not approving it; I just wanted to give some feedback and ask questions while considering the allocation thing.
src/callgraph.jl (Outdated)

```julia
push!(workspace, T[])
push!(nodes, CallGraphNode(0, 0, DFT, N, s_in, s_out, w))
# use Bluestein's algorithm for big primes
LEAF_ALG = N < 100 ? DFT : BLUESTEIN
```
Is 100 benchmarked, arbitrary, or from "the literature"? I'd like, if possible, to make this a bit configurable, because I'd bet that this is heavily influenced by the system.
It's from a bit of eyeballing the benchmarks, yes, and it could easily be made configurable. It works decently well without tuning; maybe it's theoretically justifiable by counting muls and adds (I did not actually count...).
Yeah, it's definitely tricky without auto-tuning for per-system performance. I personally think that an atomic global (to ensure thread safety) that is configurable and only checked here at planning time would be much better without any real overhead.
I was gonna go the keyword argument route, but out of curiosity, how do you do that? Not too important because the cost is easily amortised, but doesn't that still incur the normal penalties for accessing globals?
Okay, it occurred to me that atomics are only first class in Julia 1.11+, so definitely just do a kwarg for now. For future reference, you can do something like

```julia
mutable struct AtomicSwitchover
    @atomic n::Int
end

const BLUESTEIN_SWITCHOVER = AtomicSwitchover(100)

function change_bluestein_switch!(N::Int)
    return (@atomic BLUESTEIN_SWITCHOVER.n = N)
end
```
Btw, I changed the default to 73, because that's the break-even point for my pretty old machine, and it appears to be good on the GitHub runner too. How is it on yours?
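For what it's worth, the kwarg route could look something like this (a sketch; `plan_fft_sketch` and `bluestein_cutoff` are made-up names, not FFTA's API):

```julia
# Hypothetical planning-time kwarg: the cutoff is resolved once while the
# call graph is built, so execution pays no global-access or branching cost.
function plan_fft_sketch(x::AbstractVector; bluestein_cutoff::Int=73)
    N = length(x)
    leaf_alg = N < bluestein_cutoff ? :DFT : :BLUESTEIN
    # ... build the call graph with `leaf_alg` at the prime-sized leaves ...
    return leaf_alg  # stand-in for the real plan object
end
```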
src/algos.jl (Outdated)

```julia
Xk = (@. tmp = conj(b_series) * conv_a_b)

out_inds = range(start_out; step=stride_out, length=N)
out[out_inds] = Xk
```
I'm a little confused by 90-98: is there any reason you're not just using a for loop to fill `out`? I feel like that would avoid a bunch of unnecessary operations (and also make the code clearer).
Yeah, it is a bit messy; I should have.
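Something like this, presumably (a sketch that assumes the same variable names as the hunk above, so it's a fragment rather than standalone code):

```julia
# write the chirp-demodulated result straight into `out`, skipping the
# temporary `Xk` and the fancy-index assignment
@inbounds for k in 1:N
    out[start_out + (k - 1) * stride_out] = conj(b_series[k]) * conv_a_b[k]
end
```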
okay I think a better plan here is probably: 1) make the …
No problem. I did think about pre-allocating stuff, but couldn't immediately see a good place to do that, so I put up the PR first in hopes that you would see one. Also, I suspected that since Bluestein is so heavy, the memory allocation would be mostly amortised. Hopefully the GC is now good enough not to actually free those arrays across a 2D run, and to hand back the same memory on re-allocation, so the overhead would still be acceptable despite an alarming number of allocations on paper.
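One way to sanity-check that amortisation claim (a sketch with BenchmarkTools; it assumes FFTA exports `fft`, so adjust to the real entry point if not):

```julia
using BenchmarkTools, FFTA

x = randn(ComplexF64, 103)  # large prime length, so the Bluestein path is hit
@btime fft($x)              # @btime reports runtime and total allocations
```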
Left `_ispow24` alone just in case.
```diff
  w2 = w * w1
  w3 = w * w2
- w4 = w * w3
+ w4 = w2 * w2
```
The test for N=103 reveals there's quite a bit of error, much above FFTW's, although the atol of that test may be a bit more stringent than needed. I can't really spot the problem in the current implementation, so... just grasping in the dark.
This way of getting w4 seems to slightly decrease the error, but it's not at all significant.
Anyway, Bluestein's algorithm does naturally introduce extra numerical error, so it may just be that FFTW has special tricks to get some accuracy back that this doesn't.
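One accuracy-related guess (a standalone sketch, not necessarily FFTW's trick): twiddles built by a running product drift, because every multiplication compounds a rounding error, whereas direct evaluation from a reduced phase stays at about 1 ulp each:

```julia
N = 103
w = cispi(-2 / N)  # e^{-2πi/N}, the basic twiddle factor

# running products, as in `w2 = w * w1; w3 = w * w2; ...` above
tw = Vector{ComplexF64}(undef, N)
tw[1] = one(ComplexF64)
for k in 2:N
    tw[k] = w * tw[k-1]   # each step adds a rounding error
end

# direct evaluation: each twiddle computed independently
direct = [cispi(-2k / N) for k in 0:N-1]

maximum(abs, tw .- direct)  # the accumulated drift; grows with N
```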
In the process, I noticed `mul!` doesn't return the output array, which it is supposed to do according to the docs.

Local benchmarks indicate that this is close enough to FFTW's performance, so not too bad; a big improvement on the previous O(N²) DFT. It allocates, but that's not too important overall.
Perhaps Rader's algorithm could be better here, but Bluestein's is much easier to write. Doesn't really affect anything else.
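For reference, the documented contract being pointed at (a generic LinearAlgebra example, not FFTA's method):

```julia
using LinearAlgebra

A = randn(3, 3); x = randn(3); y = similar(x)
y2 = mul!(y, A, x)  # the docstring promises mul! returns its output argument
y2 === y            # true; generic code relies on this
```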