CamillaDSP - Cross-platform IIR and FIR engine for crossovers, room correction etc.

I just updated the "develop" branch with a new and improved version.

This introduces a new synchronous resampler that replaces the previous FastSync, BalancedSync and AccurateSync types with a single one called Synchronous. This uses FFT for a major speedup. The resampling quality is equivalent to the old AccurateSync while being faster than the old FastSync.


The Async resamplers have also been optimized and are now roughly twice as fast as before.


I also added a fix for the problem ChrisPatlach reported.


If you used any of the Sync resamplers before, you need to change the resampler type in the config to "Synchronous" in order to run this version.
 
Time for another update! I am now also using the faster FFT routines I developed for the resampler in the convolver. This gave a nice speedup of up to a factor of 2, and it made the difference between RustFFT and FFTW much smaller. I ran a test with a dummy pipeline consisting of upsampling 44.1 -> 192 kHz and 8 FIR filters of 65k taps each per channel. These are the CPU load numbers from my Ryzen laptop:
"old" RustFFT: 62%
"new" RustFFT: 37%
FFTW: 34%


The new version, 0.2.1, is in the develop branch.
 
Great work, thanks a lot. Rust seems much more reliable and maintainable than C. The latest audit of the Rust TLS library is extraordinarily positive about that code, and probably not just because of the expertise of the library authors: rustls/TLS-01-report.pdf at master · ctz/rustls · GitHub

How does your FFT compare to FFTW in non power-of-two FFT lengths? E.g. samplerate-long FFT which centers bins for every integer-Hz frequency. Thanks.
 
Yes, I think it's easier to maintain good-quality code in Rust than in C. The borrow checker especially makes a huge difference. You can't get away with sloppy code that leaks memory or keeps dangling pointers, it just won't allow it. The built-in testing framework also makes unit testing quick and easy. And for some reason that is hard to explain, it's a really fun language to work with.
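To illustrate the built-in testing framework mentioned above, here is a minimal sketch. The function `db_to_linear` is a hypothetical helper made up for this example and is not taken from CamillaDSP. The test module sits in the same file as the code it checks and only gets compiled when running `cargo test`.

```rust
/// Converts a gain in dB to a linear amplitude factor.
/// (Hypothetical helper for illustration, not CamillaDSP code.)
fn db_to_linear(db: f64) -> f64 {
    10f64.powf(db / 20.0)
}

// Everything in this module is compiled only for `cargo test`.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn zero_db_is_unity() {
        assert!((db_to_linear(0.0) - 1.0).abs() < 1e-12);
    }

    #[test]
    fn twenty_db_is_factor_ten() {
        assert!((db_to_linear(20.0) - 10.0).abs() < 1e-12);
    }
}
```

No extra test runner or dependency is needed for this, which is a large part of why adding tests feels cheap.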

It seems that more and more of the big players are starting to use it. Microsoft, for example, really seems to like it: Microsoft: Rust Is the Industry’s ‘Best Chance’ at Safe Systems Programming – The New Stack


The numbers I gave above are for a chunksize of 4096. FFTW is much less sensitive to the length, so it wins easily when the chunksize is a weird number. Here are a few more numbers (RealFFT is the "new" RustFFT):
Chunksize 4567, same pipeline but with 44.1kHz:
FFTW: 15%
RealFFT: 35%


Chunksize 44100, same pipeline, back at 192kHz:
FFTW: 22%
RealFFT: 34%
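One way to see what makes a chunksize "weird" is to look at its prime factors, since FFT algorithms are generally at their best when the length splits into small primes. The sketch below only factorizes the three chunk sizes used in this thread. It does not model how either library picks its algorithm, and `factorize` is a name made up for this example.

```rust
/// Returns the prime factorization of `n` as (prime, exponent) pairs,
/// found by simple trial division.
fn factorize(mut n: u64) -> Vec<(u64, u32)> {
    let mut factors = Vec::new();
    let mut p = 2;
    while p * p <= n {
        let mut exp = 0;
        while n % p == 0 {
            n /= p;
            exp += 1;
        }
        if exp > 0 {
            factors.push((p, exp));
        }
        p += 1;
    }
    // Whatever is left above 1 is a single remaining prime factor.
    if n > 1 {
        factors.push((n, 1));
    }
    factors
}
```

With this, 4096 comes out as 2^12, 44100 as 2^2 * 3^2 * 5^2 * 7^2, and 4567 turns out to be prime. That would fit the numbers above: 44100 still splits into small factors, while 4567 has none at all.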


For those who try the new version, could you please give some feedback? Most importantly of course if there is some problem, but a simple confirmation that it works is also useful.
 
Quick test on my raspberry pi 4 with the same pipeline. Well not so quick, compiling on the Pi takes forever, especially with FFTW...


I had to go down to 96kHz since 192k was a bit much.

I'm running with resampling 44.1 -> 96kHz, chunksize 4096, 8 FIR filters per channel of 65k taps each.

RealFFT: 62%
FFTW: 58%


Pushing it a bit more:
Resampling 44.1 -> 192kHz, chunksize 4096, 6 FIR filters per channel of 65k taps each.

RealFFT: 95%

FFTW: 87%


I think most people would run far fewer than 6 x 65k taps, so I think it's safe to say the Pi4 runs just fine at 192kHz with resampling enabled.
 
Nice measurements, thanks. The difference on x86 for non-pow2 FFTs is not so big, FFTW has been heavily optimized for years on x86.

On the ARM side, I've read that Rust uses NEON instructions, and I believe FFTW does too. The nearly identical values suggest that the FFTW code is not as heavily optimized on ARM, and that your library is perfectly on par. Great result, kudos.
 
Did some tests with my setup on a Raspberry Pi 4.


Running JRiver Media Center 26, which converts from 44.1 kHz to 96 kHz using SoX.


CamillaDSP is set up as a 3-way crossover using FIR filters with 65k taps. Sample rate 96 kHz, chunksize 8192, in/out S32LE.

Output is 6 channels to the Raspberry Pi HDMI.

Good results, thanks Henrik! CPU load according to the task manager is about 15% (JRiver 5%, CamillaDSP 10%). JRiver runs well and is reasonably responsive, with no sound dropouts when interacting with it.




Christian
 
Great! Thanks for sharing :)
 
I made a simple test program to see how the speed is changing with FFT length. The code is here: GitHub - HEnquist/fftbenching
For a perfect FFT the processing time would increase as N*logN, but in reality it's of course much more complicated.
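To put a number on the N*logN statement, here is a small sketch of that idealized cost model. The function names are made up for this example, and the model says nothing about the real timings, which depend heavily on how the length factorizes.

```rust
/// Idealized FFT cost, proportional to N * log2(N).
/// Only meaningful for n >= 2, since the cost for n = 1 is zero.
fn ideal_cost(n: usize) -> f64 {
    let n = n as f64;
    n * n.log2()
}

/// How many times longer length `to` should take than length `from`
/// if the N * log2(N) model held exactly.
fn ideal_ratio(from: usize, to: usize) -> f64 {
    ideal_cost(to) / ideal_cost(from)
}
```

Under this model, going from 4096 to 8192 points costs 13/6 (about 2.17) times as much, so doubling the length should always cost slightly more than double the time.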
First, this is how the processing time increases as the FFT length increases from 2 to 255.
My laptop with a Ryzen 2700u
time_per_iter_x64.png


Raspberry Pi 4

time_per_iter_arm.png





Since the first plots became a bit messy, I also divided the time for RustFFT by the time for FFTW.
Laptop with Ryzen 2700u:
fft_comparison_x64.png


Raspberry Pi 4:
fft_comparison_arm.png




I think the main thing we can see here is that FFTW has many more code paths to handle different lengths in an optimal or close to optimal way. This is one of the reasons why the FFTW library is so huge.

RustFFT is much smaller and implements far fewer special cases. Therefore it has to rely on more general algorithms for many lengths. These general algorithms are much slower than the specialized ones FFTW has. On the other hand, when RustFFT also has a specialized algorithm, like for powers of 2, there isn't much difference.




Rust can use SIMD in two ways. The simple one is to rely on the code optimization of LLVM to use SIMD. This is enabled by default and works for pretty much all instruction sets like AVX, SSE and NEON. This is what I am using.


The advanced way is to use intrinsics. This is very low level, you essentially tell the compiler exactly which instructions it should generate. The downside is that the code becomes specific to a certain instruction set, so you need to write many versions of everything. This way is still very new in Rust and only works for SSE and AVX so far, while NEON is still unfinished. I played with the idea of using it for the asynchronous resampler, but as it is now it's too much work. Once there are some libraries that abstract over many instruction sets, I will take another look.
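A minimal sketch of the two approaches, using element-wise addition of two `f32` slices. This is only an illustration and not code from CamillaDSP or the resampler. The function names are made up, and the intrinsics version is written for x86_64 only (where SSE is always available), with a plain fallback on every other architecture.

```rust
/// The simple way: an ordinary loop that LLVM may auto-vectorize
/// for whatever instruction set the target offers.
fn add_auto(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// The advanced way: explicit SSE intrinsics, four floats per instruction.
#[cfg(target_arch = "x86_64")]
fn add_explicit(a: &[f32], b: &[f32], out: &mut [f32]) {
    use std::arch::x86_64::{_mm_add_ps, _mm_loadu_ps, _mm_storeu_ps};
    let n = out.len().min(a.len()).min(b.len());
    let chunks = n / 4;
    for i in 0..chunks {
        // SAFETY: SSE is part of the x86_64 baseline, and the four
        // elements starting at i * 4 are within all three slices.
        unsafe {
            let va = _mm_loadu_ps(a.as_ptr().add(i * 4));
            let vb = _mm_loadu_ps(b.as_ptr().add(i * 4));
            _mm_storeu_ps(out.as_mut_ptr().add(i * 4), _mm_add_ps(va, vb));
        }
    }
    // Handle the leftover elements one at a time.
    for i in chunks * 4..n {
        out[i] = a[i] + b[i];
    }
}

/// On other architectures this sketch just falls back to the plain loop.
#[cfg(not(target_arch = "x86_64"))]
fn add_explicit(a: &[f32], b: &[f32], out: &mut [f32]) {
    add_auto(a, b, out);
}
```

Even for something this trivial, the explicit version needs `unsafe`, a separate tail loop and one variant per instruction set, which shows where the extra work comes from.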
 
Just for fun, here are two more Rust FFT libraries compared to FFTW. This is done on my Ryzen laptop.



"fourier":
fft_comparison_fourier.png


"chfft"
fft_comparison_chfft.png




Compared to RustFFT they are sometimes faster and sometimes slower. But when they are slower, they are sometimes a lot slower. In total, RustFFT seems to be the best compromise. Maybe someone could combine the best parts of all three...
 
Would it make sense to use FFTW where possible and for the rest use your library?
The real-to-complex transform of FFTW does exactly the same as my RealFFT wrapper for RustFFT. So it's always possible to use either one. The RealFFT and FFTW versions of the CamillaDSP convolver are nearly the same except for the imports and a few lines where the ffts are set up.
The single reason to use FFTW is that it's very fast. Reasons not to are that it's so huge, and that the GPL license limits its use to GPL projects. That's not a problem in CamillaDSP, but it doesn't work in the resampling lib (Rubato), which is MIT.
The reasons to use RustFFT (via RealFFT) are that it's small and fast to compile, fast enough for the majority of uses, and has an easier MIT license, which means I can have Rubato under MIT.
I prefer the MIT license for my libraries since I want people to use them, and I think the GPL license might limit that.
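For anyone wondering what a real-to-complex transform actually saves: for real input, the upper half of the spectrum is the complex conjugate of the lower half, so only N/2 + 1 bins need to be computed and stored. The sketch below shows this with a naive O(N^2) DFT using only the standard library. It is not how RealFFT or FFTW are implemented, and all function names are made up for this example.

```rust
use std::f64::consts::PI;

/// Naive O(N^2) DFT of a real signal, returning all N bins as (re, im).
fn dft_real(x: &[f64]) -> Vec<(f64, f64)> {
    let n = x.len();
    (0..n)
        .map(|k| {
            x.iter().enumerate().fold((0.0, 0.0), |(re, im), (i, &v)| {
                let phase = -2.0 * PI * (k * i) as f64 / n as f64;
                (re + v * phase.cos(), im + v * phase.sin())
            })
        })
        .collect()
}

/// Keeps only the N/2 + 1 non-redundant bins, which is the output
/// size of a real-to-complex transform.
fn rfft_naive(x: &[f64]) -> Vec<(f64, f64)> {
    let mut bins = dft_real(x);
    bins.truncate(x.len() / 2 + 1);
    bins
}

/// Checks that X[N-k] equals the conjugate of X[k] within `tol`.
fn is_hermitian(bins: &[(f64, f64)], tol: f64) -> bool {
    let n = bins.len();
    (1..n).all(|k| {
        let (re, im) = bins[k];
        let (mre, mim) = bins[n - k];
        (re - mre).abs() < tol && (im + mim).abs() < tol
    })
}
```

For an 8-sample real input, `dft_real` gives 8 bins that pass `is_hermitian`, and `rfft_naive` keeps the first 5 of them without losing any information.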
 
This guy (Raspberry Pi 4 PCI-Express Bridge “Chip” – Zak's Electronics Blog) has designed a replacement board for the RPi4 USB3 controller chip which feeds the PCI-e x1 lines to the USB3 ports. Add a $5 PCI-e riser using a USB3 cable (USB 3.0 Pcie PCI-E Express 1x To 16x Extender Riser Card Adapter Power BTC Cable | eBay) and the peripheral options extend to PCI-e.

Add a nice 8ch PCI-e Xonar D2X (ASUS Xonar D2X Sound Card | eBay) and voila:

* standalone stereo analog- camilladsp - 8ch analog Xover
* standalone SPDIF - camilladsp - 8ch analog Xover
* USB-audio 2ch - camilladsp - 8ch analog xover using USB-C gadget mode
* USB-audio 2ch in/8 ch out soundcard with any advanced DSP using USB-C gadget mode

Or use a PCI-e to PCI riser with the same cable instead (PCI-E Express X1 to Dual PCI Riser Extend Adapter Card With USB 3.0 Cable | eBay), or the PCI-e extension above plus a PCI-e to PCI adapter (PCI Express PCIE To PCI Adapter Card Asmedia 1083 Chip Riser Extender 32bit | eBay), together with any of the multichannel PCI soundcards sold on eBay (the vast majority of current PCI-e soundcards are just the old PCI designs with a PCI/PCI-e bridge onboard).

Things are moving ahead :)
 
Yes, but what about the latest kernel and USB-gadget support from the current Linux code base? Where does one get support for the HW? On the RPi, USB-gadget issues are discussed and solved directly with the developers of the dwc2 IP in the Broadcom SoC, and patches go directly to the upstream kernel repository. Is there access to such key know-how on the NanoPi M4?
 
OK, but that is kernel 4.4. If you hit a problem in ALSA, who will you talk to when everyone at alsa-devel asks you to use the latest code first? What about all the driver changes since 4.4? Who will you send a patch to?

I see there are some individually maintained kernels (e.g. GitHub - heiher/linux at nanopi-m4), but it says: 206 commits ahead, 15346 commits behind torvalds:master.