CamillaDSP - Cross-platform IIR and FIR engine for crossovers, room correction etc.

Just to recap.

To mimic a brutefir setup running a 65536 tap Dirac test filter with 8 partitions for a left and right channel I'd configure this in the latest develop branch:

Code:
devices:
  samplerate: 44100
  chunksize: 8192
  ...

Code:
filters:
  r_fir:
    type: Conv
    parameters:
      type: Values
      length: 65536
  l_fir:
    type: Conv
    parameters:
      type: Values
      length: 65536

Is that it?

Thx.
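
For reference, the arithmetic behind that mapping as a minimal sketch. It assumes that CamillaDSP splits a long FIR filter into segments of chunksize samples, so that the brutefir partition count corresponds to taps divided by chunksize. The function name is made up for illustration.

```python
def chunksize_for(taps: int, partitions: int) -> int:
    """Return the chunk size that splits `taps` coefficients into `partitions` equal segments."""
    if partitions <= 0 or taps % partitions != 0:
        raise ValueError("taps must divide evenly into partitions")
    return taps // partitions


# 65536 taps in 8 partitions, as in the brutefir setup described above
print(chunksize_for(65536, 8))
```

With these numbers the result matches the chunksize: 8192 used in the config above.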
 
Alright! I'm very interested in this! Please tell more :)
I used the obscure pcm_hook feature of ALSA. Compiling only against libasound2 (meaning not having to bring in the full alsa-lib sources) it allows 4 hooks involving PCM devices. Open (really hook install), modify hw_params, hw_free, and close.

I use the hw_free call to send a stop signal to camilla so that camilla closes the loopback device and it can then be reopened with any parameters by the playback program. I use the hw_params hook to send camilla a new config with the updated parameters telling it to reopen the loopback with the now appropriate settings.

Sadly hook install is too late, in that the loopback device is already locked down if camilla has it open. I send stop to camilla here anyway though, because it at least causes camilla to free the connection, so that the playback program only fails once if camilla was already reading from loopback. (This is when I ran into the "double stop breaks camilla" issue.)

The annoying thing is that I can't figure out how to get the new hw_params inside the hw_params hook function. The actual ALSA callback sends them along but the hook code stashes them in a private field that I don't know how to access without compiling against the full alsa-lib code. libasound2 has a bunch of declared but undefined structures that are defined internally to the alsa-lib code but not published in libasound2. Someone better with ALSA than me might know how to get them (I've asked on alsa-devel but no solutions yet.) Fortunately the parameters are available from the proc file system.

So what I really do is call an external program for hook creation and hw_free that does the websocket stop command and another one that parses the proc settings and configures camilla however it wants.
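
A minimal sketch of the proc parsing step, not the actual program described above. It assumes the usual key: value layout of an ALSA hw_params proc file (for example /proc/asound/card0/pcm0p/sub0/hw_params, where the card and device numbers depend on the system) and the single word "closed" when no stream is open. The function name and the sample text are illustrative.

```python
def parse_hw_params(text: str) -> dict:
    """Parse the contents of an ALSA hw_params proc file into a dict.

    Returns an empty dict when the device is closed.
    """
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # e.g. the single word "closed"
        params[key.strip()] = value.strip()
    # "rate: 44100 (44100/1)" -> keep only the leading integer
    if "rate" in params:
        params["rate"] = int(params["rate"].split()[0])
    if "channels" in params:
        params["channels"] = int(params["channels"])
    return params


# Sample contents, typical of what the proc file shows while a stream is open
sample = """access: RW_INTERLEAVED
format: S32_LE
subformat: STD
channels: 2
rate: 44100 (44100/1)
period_size: 1024
buffer_size: 16384
"""
print(parse_hw_params(sample)["rate"])
```

The resulting rate, channels and format values are what would go into the new config sent to camilla.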

I actually think the external callback is nice as people might have very different things they want to do with camilla at different sample rates. Different FIR taps being the obvious one, but maybe they have a selection of them for different house curves, etc.
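
As a rough illustration of that idea, an external callback could pick a config per sample rate. This is only a sketch: the file names below are hypothetical, and the lookup function is not part of camilla or of the hook code.

```python
# Hypothetical mapping from sample rate to a config file; the file names are made up.
CONFIGS_BY_RATE = {
    44100: "/etc/camilladsp/configs/fir_44100.yml",
    48000: "/etc/camilladsp/configs/fir_48000.yml",
    96000: "/etc/camilladsp/configs/fir_96000.yml",
}


def select_config(rate: int) -> str:
    """Return the config path for `rate`, or raise if no config was prepared for it."""
    try:
        return CONFIGS_BY_RATE[rate]
    except KeyError:
        raise ValueError(f"no config prepared for {rate} Hz") from None


print(select_config(48000))
```

A second table keyed by house curve could be layered on top in the same way.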

The two commands could of course be made one with an argument. I think it's even possible to pass in the scripts you want run in the asound.conf/.asoundrc files but I didn't implement that.

In terms of being more robust, it would be helpful if camilla did not just die when the loopback is opened with a rate that isn't supported by the output hardware. Instead it could return an error code (hopefully matching ALSA's numbering scheme) that the hook could pass back to the playback program as an error, while remaining running and ready to accept a new request. One problem with loopback is that it doesn't reflect the actual hardware's capabilities.
 
RPi?

What about CamillaDSP and RPi4 (or another HW) as USB soundcard? I gave up on the notebook (pulseaudio sees only 2ch) and am still struggling with my PC. Camilla does not work on my daily Ubuntu 18.04 and I am too scared to upgrade to 20.04 (PulseAudio Crossover Rack does not work, there are differences in the debug log, but not enough to make a bug report) or Fedora 32 (alsa plays only 2 channels on ATI Ellesmere).


Is the RPi4 powerful enough for a 3 way FIR and EQ?
 
I used the obscure pcm_hook feature of ALSA. Compiling only against libasound2 (meaning not having to bring in the full....
Alright! Yes I think having it separate like this is the way to go, there is no reasonable way to include something like this in CamillaDSP without causing tons of issues for every other use case.
This whole thing would certainly become easier if somebody fixed the broken pcm_notify feature of the loopback. But there doesn't seem to be anything happening with that unfortunately.

If you use the wait flag, -w, then CamillaDSP will stay running and waiting for a new command if opening the playback device fails. I'll have to think a bit about how the error code could be made available.

Is your code available somewhere? It would be fun to take a look.
 
What about CamillaDSP and RPi4 (or another HW) as USB soundcard? I gave up on the notebook (pulseaudio sees only 2ch) and am still struggling with my PC. Camilla does not work on my daily Ubuntu 18.04 and I am too scared to upgrade to 20.04 (PulseAudio Crossover Rack does not work, there are differences in the debug log, but not enough to make a bug report) or Fedora 32 (alsa plays only 2 channels on ATI Ellesmere).


Is the RPi4 powerful enough for a 3 way FIR and EQ?

Is it the old pulseaudio of 18.04 that is stopping it from working? Do you absolutely need pulse? In my main system I simply let pulse output to an alsa loopback and then I capture from there. Then the pulse backend of CamillaDSP isn't needed.

The pi4 is pretty good, it can handle quite long filters at high sample rates. How long are the filters you use, and at what rate?

There is something strange going on with the latest pulseaudio. I can capture audio from pulse on my fedora 32 laptop, but with very high cpu usage. This wasn't a problem before.
 
Didn't work as I suggested.

When checked with --check, an error about the missing "values" field was issued.

Below worked:

Code:
  l_fir:
    type: Conv
    parameters:
      type: Values
      values: [ 0.0 ]
      length: 65536
It seems the values field is still required, or the check function still expects it.
Oops sorry, the values field is still required. You can leave out the entire parameters: block to make a dirac spike, but that of course means you can't specify the length. I'll make the values field optional as well.

If you put "values: [ 1.0 ]" instead it will make a dirac, [1.0, 0.0, 0.0 ..... 0.0] of the length you want. Now you just have zeros (which works fine for just testing cpu load, but is not very useful if you want to look at the output).
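
To illustrate the difference between the two filters, here is a small sketch using plain direct-form convolution. It is not how CamillaDSP computes things (that uses FFT), and it uses 4 coefficients instead of 65536 to keep the example readable. The function name is made up.

```python
def convolve(signal, coeffs):
    """Direct-form FIR convolution, truncated to the length of the input."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if k > n:
                break  # no input samples before the start of the signal
            acc += c * signal[n - k]
        out.append(acc)
    return out


signal = [0.5, -0.25, 0.75, 1.0]
dirac = [1.0, 0.0, 0.0, 0.0]  # shape that "values: [ 1.0 ]" expands to
zeros = [0.0, 0.0, 0.0, 0.0]  # shape that "values: [ 0.0 ]" expands to

print(convolve(signal, dirac))  # same samples as the input
print(convolve(signal, zeros))  # silence
```

So the all-zero filter loads the cpu like a real one but mutes the output, while the dirac passes the audio through unchanged.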
 
So what you're saying is that the function as implemented wouldn't act like a "transparent" filter!?!
Meaning - no audio playback is possible?


I now fired up the DSP for the first time.

It immediately locked up the CPU at 100%. (brutefir runs with dirac idle at 2-3%.)

The PI4 runs at 1500 MHz on PIOS64.
For the test I pipe squeezelite >> CDSP >> aplay.

systemd:
Code:
ExecStart=/bin/sh -c "/usr/local/bin/squeezelite -n slcdsp -b 20000:20000 -a 32 -o - | /usr/local/bin/camilladsp /etc/camilladsp/configs/config.yml | /usr/bin/aplay --quiet -D hw:0,0 -r 44100 -f S32_LE -c 2"

It's basically the same systemd setup as brutefir.

My config:

Code:
---
devices:
  samplerate: 44100
  chunksize: 8192
  silence_threshold: -60
  silence_timeout: 3.0
  capture:
    type: Stdin
    channels: 2
    format: S32LE
  playback:
    type: Stdout
    channels: 2
    format: S32LE

filters:
  r_fir:
    type: Conv
    parameters:
      type: Values
      values: [ 1.0 ]
      length: 65536

  l_fir:
    type: Conv
    parameters:
      type: Values
      values: [ 1.0 ]
      length: 65536

mixers:
  mono:
    channels:
      in: 2
      out: 2
    mapping:
      - dest: 0
        sources:
          - channel: 0
            gain: -6
            inverted: false
          - channel: 1
            gain: -6
            inverted: false
      - dest: 1
        sources:
          - channel: 0
            gain: -6
            inverted: false
          - channel: 1
            gain: -6
            inverted: false

pipeline:
  - type: Mixer
    name: mono
  - type: Filter
    channel: 0
    names:
      - r_fir
  - type: Filter
    channel: 1
    names:
      - l_fir

What am I doing wrong?
 
Add this to the devices section:
Code:
queuelimit: 1
The 100% cpu happens when you have a source that will feed data at an unlimited rate, and a playback device with a limited rate. Then CamillaDSP will read as fast as it can until it has filled its internal queues. At that point it settles down and you should see a low cpu usage. Setting queuelimit to 1 makes the max internal queue length 1 element, so the queues get filled right away.
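
A toy model of that behaviour, using Python's queue.Queue as a stand-in. This is not CamillaDSP's actual internals, just a sketch of a fast reader filling a bounded queue, plus the amount of audio a full queue holds. The function names are made up.

```python
import queue


def chunks_read_before_blocking(queuelimit: int) -> int:
    """Count how many chunks a fast reader can enqueue before a bounded queue is full."""
    if queuelimit < 1:
        raise ValueError("queuelimit must be at least 1")
    q = queue.Queue(maxsize=queuelimit)
    count = 0
    while True:
        try:
            q.put_nowait(b"chunk")  # nothing is consuming in this sketch
        except queue.Full:
            return count
        count += 1


def buffered_seconds(queuelimit: int, chunksize: int, samplerate: int) -> float:
    """Audio duration held in a full queue."""
    return queuelimit * chunksize / samplerate


print(chunks_read_before_blocking(1))
print(round(buffered_seconds(1, 8192, 44100), 3))
```

With chunksize 8192 at 44100 Hz each queued chunk is a bit under 0.2 seconds of audio, so a limit of 1 means the reader has to wait for the playback side almost immediately.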


To make a transparent filter you give "values: [1.0]" and "length: 65536". That takes all the values from "values" and adds zeroes after to make it 65536 elements long in total. If you give "values: [0.0]", then the first element also becomes zero so the whole vector becomes just 65536 zeroes.

If you leave out the whole parameters block, it generates a filter that is a single one, followed by chunksize-1 zeroes.
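
The padding rule described above, sketched in Python. This is an illustration of the described behaviour and not CamillaDSP's own code. The function names are made up, and raising an error when there are more values than the requested length is an assumption of this sketch.

```python
def expand_values(values, length):
    """Zero-pad `values` to `length` coefficients."""
    if len(values) > length:
        # Assumption of this sketch; the thread does not say what happens here.
        raise ValueError("more values than the requested length")
    return list(values) + [0.0] * (length - len(values))


def default_filter(chunksize):
    """Filter for a Conv with no parameters block: a one, then chunksize-1 zeroes."""
    return expand_values([1.0], chunksize)


dirac = expand_values([1.0], 65536)
silent = expand_values([0.0], 65536)
print(len(dirac), dirac[0], sum(dirac))  # 65536 1.0 1.0
print(len(silent), sum(silent))          # 65536 0.0
print(len(default_filter(8192)))         # 8192
```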
 
I just figured something out. Playback works. :D

However.

There's more.

If idle, CDSP goes through the roof starting at 106% then goes down to around 100% after a couple of seconds. And it stays that way.
As soon as I start playback it drops to 3%. Not bad.
As soon as I stop playback it goes up again to around 100%.

There's something wrong if no signal is present on stdin.
 
Code:
---
devices:
  samplerate: 44100
  chunksize: 8192
  silence_threshold: -60
  silence_timeout: 3.0
  queuelimit: 1
  capture:
    type: File
    filename: /dev/stdin
    channels: 2
    format: S32LE
  playback:
    type: File
    filename: /dev/stdout
    channels: 2
    format: S32LE


Same situation.

I also just noticed a kind of scrambled mess (distortions) for half a second after starting the service and playing back for the first time. After that all is OK. Next tracks run without issues.
 
I'd like to share some early benchmarks I just did.

I've been testing
* the RUSTFFT performance vs. FFTW3
* CamillaDSP vs. Brutefir
* with and without NEON/march optimizations

I've been offline convolving from file to file stored on tmpfs using 2^16 taps.


Code:
Performance Test

RPI4 4GB @1500MHz
PIOS64
SSD as boot and root device

camilladsp develop branch - Oct-07-2020

Testfile: .wav 44100/16bit - 00:03:06.16


TC
1. camilladsp-fftw $CONFIG
2. camilladsp $CONFIG
3. brutefir -nodefault $CONFIG_BF
4. TC1 without Rust opt flags (supposedly enabling NEON)
5. TC2 without Rust opt flags (supposedly enabling NEON)

##PI4OPTS-NEON-FFTW3
real	0m8,176s
user	0m9,272s
sys	0m0,692s

##PI4OPTS-NEON-RUSTFFT
real	0m8,787s
user	0m9,815s
sys	0m0,712s

>> +7.5%

##Brutefir
real	0m7,297s
user	0m0,114s
sys	0m0,382s

>> -12.2%


TC4/TC5 - Test w/o RUSTFLAGS (NEON and march=native) did not show any differences.

What I found surprising is that there seems no change with NEON in or out.
And 2nd that brutefir was still quite a bit faster on the job.
RUSTFFT seems to be about 7.5% slower than FFTW. That's not surprising
since Henrik mentioned a slight advantage of FFTW over RustFFT earlier.

The Q that IMO remains - What's going on with NEON?


Enjoy.
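
For anyone who wants to redo the arithmetic, a small sketch that parses the `time` output above (note the comma decimals) and derives the RustFFT vs. FFTW difference and the speed relative to real time. The helper names are made up, and only the "real" figures from the table are used.

```python
def parse_time(field: str) -> float:
    """Convert a `time` field such as '0m8,176s' (comma decimal) to seconds."""
    minutes, seconds = field.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds.replace(",", "."))


def percent_change(value: float, baseline: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


fftw = parse_time("0m8,176s")
rustfft = parse_time("0m8,787s")
brutefir = parse_time("0m7,297s")
track = 3 * 60 + 6.16  # test file length 00:03:06.16 in seconds

print(f"RustFFT vs FFTW: {percent_change(rustfft, fftw):+.1f}%")
for name, t in (("FFTW", fftw), ("RustFFT", rustfft), ("brutefir", brutefir)):
    print(f"{name}: {track / t:.1f}x real time")
```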
 
The differences between RustFFT and FFTW are as expected. Same goes for CamillaDSP vs brutefir. I'm getting closer and closer to brutefir in speed, but I will probably never beat it.



For this test the majority of cpu time is used by the FFT/iFFT. It seems like RustFFT doesn't have any loops that the compiler manages to vectorize. And then there would be no speed advantage with neon. There is ongoing work to support AVX in RustFFT, but nothing for NEON yet. Proper support for NEON is very new in Rust so I expect things to change in this area.


When FFTW is used there shouldn't be any difference at all, since it's compiled by another compiler that doesn't care about the rustflags.


I think you would see a difference if you benchmark with a config that uses the asynchronous resampler. That one uses a lot of "simple" loops. IIRC this is where I saw some improvement (but still not a lot).



I started looking into how to properly use SIMD in the resampler, but it's a lot of work since a lot of code has to be rewritten as separate versions for SSE, AVX, NEON etc.
 
I just figured something out. Playback works. :D

However.

There's more.

If idle, CDSP goes through the roof starting at 106% then goes down to around 100% after a couple of seconds. And it stays that way.
As soon as I start playback it drops to 3%. Not bad.
As soon as I stop playback it goes up again to around 100%.

There's something wrong if no signal is present on stdin.
Should be fixed now. Can you try this again with the new version in develop?