CamillaDSP - Cross-platform IIR and FIR engine for crossovers, room correction etc

spfenwick · 2023-09-24 9:33 am

@phofman - yes, it's odd but that's what happens with v1.0.3.

phofman · 2023-09-24 12:11 pm

It's expected the major load comes from the processing thread, but it may not hurt to confirm, e.g. https://unixhealthcheck.com/blog?id=465

spfenwick · 2023-09-25 9:16 pm

phofman said:
It's expected the major load comes from the processing thread, but it may not hurt to confirm, e.g. https://unixhealthcheck.com/blog?id=465

These are the threads created by v2.0.0a3 while encountering the performance issue:

Three threads are using CPU time. Two are clearly the capture and playback threads. The third - the one with the high CPU usage - isn't labelled but is presumably the processing thread.

I also ran the test on an rpi3b+ and didn't find a performance issue. It looks like it's specific to the combination Intel + Linux.

phofman · 2023-09-26 7:41 am

Thanks. You already used perf, maybe compiling --release with debug info https://nnethercote.github.io/perf-book/profiling.html#debug-info and trying to get more perf details about method calls, like in https://rust-lang.github.io/packed_simd/perf-guide/prof/linux.html would tell more.

spfenwick · 2023-09-26 11:00 pm

phofman said:
Thanks. You already used perf, maybe compiling --release with debug info https://nnethercote.github.io/perf-book/profiling.html#debug-info and trying to get more perf details about method calls, like in https://rust-lang.github.io/packed_simd/perf-guide/prof/linux.html would tell more.

Here is some more information from perf showing the functions executed in the processing thread. I haven't yet explored the extra debug options, but will do that. For these tests I've gone back to 1.0.3 as that lets me compare the two performance states by just changing the config.

First is 1.0.3 from github (default compiler settings) with the Gain filter - i.e. good performance:

Next is the exact same binary without the Gain filter - i.e. bad performance

The two look very similar except:
- Gain::process_waveform disappears from the second one - as expected as I removed it from the config
- The number of "event cycles" goes up. I assume this reflects the overall runtime being longer
- The percentage of time spent in Biquad::process_waveform increases dramatically - presumably indicating this is the function responsible for the longer runtime.

I'm not sure what to make of this, other than the exact same code is running slower for no apparent reason. The "perf stat" results I posted earlier are for these same scenarios - so the slowdown seems to be due to reduced IPC but there's no obvious reason for that. Biquad::process_waveform is a very simple function with a tight inner loop that is completely linear with no branches, so there's very little room for its behaviour to change.

Finally for comparison here's v1.0.3 compiled with target-cpu=x86-64-v2 without the Gain filter (i.e. the same config as test 2)

The performance is pretty much the same as test 1, so it's good to see the problem has gone away, but again there's no obvious reason why.

I've also opened a thread over on the Rust forum to see if anyone can explain this (https://users.rust-lang.org/t/unexplained-order-of-magnitude-drop-in-performance/100406). One suggestion was it might be related to spectre/meltdown mitigations which was my thought too. It looks like "target-cpu=x86-64-v2" will avoid the problem but I'm still curious what is going on.

UniQ · 2023-09-26 11:54 pm

@spfenwick
ipsum lorem test test.. this is an 'inline-code' marked text to avoid :P becoming a 😛 smiley....

ps. The post is editable for 30 minutes from posting it, after 30 minutes the edit button disappears, except for post #1 which can be edited indefinitely by the thread starter.

Sorry for the off-topic, cheers.

spfenwick · 2023-09-27 1:54 am

Ultima Thule said:
Sorry for the off-topic, cheers.

Thanks. I didn't realise my post was being "smilied" when I wrote it.

UniQ · 2023-09-27 2:35 am

@spfenwick
I forgot to mention one can press the 'Preview' button to the right in the post editing window (desktop PC browser) to see how the post comes out before posting it, eventual smileys and other artifacts will show up.

phofman · 2023-09-27 12:46 pm

@spfenwick : did you do the tests above on linux, or on the virtualized VSL2?

spfenwick · 2023-09-28 7:44 am

phofman said:
@spfenwick : did you do the tests above on linux, or on the virtualized VSL2?

I assume rather than VSL2 you mean WSL2. The tests above were on Linux, but the same issues show up on WSL2 as well.

The thread over at the Rust forum (https://users.rust-lang.org/t/unexplained-order-of-magnitude-drop-in-performance/100406) may be starting to shed some light. The issue seems to be related to denormal floating point numbers, which have a significant performance penalty on Intel but not AMD. Setting some flags in the x86 MXCSR register to treat denormal numbers as zero makes the problem go away. That does affect the semantics of floating point numbers so it may not be an appropriate solution to use everywhere, especially as CamillaDSP uses FFTs for convolutions.
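For anyone else who hasn't met denormals (also called subnormals) in a while, here is a minimal sketch of what they are, using only the Rust standard library. The helper names `describe` and `flush_to_zero` are made up for illustration; `flush_to_zero` only imitates in software what the hardware FTZ/DAZ flags do and does not touch any CPU register.

```rust
use std::num::FpCategory;

/// Name the IEEE 754 category of a double.
fn describe(x: f64) -> &'static str {
    match x.classify() {
        FpCategory::Zero => "zero",
        FpCategory::Subnormal => "subnormal",
        FpCategory::Normal => "normal",
        FpCategory::Infinite => "infinite",
        FpCategory::Nan => "nan",
    }
}

/// Software stand-in for flush-to-zero: subnormal values become 0.0,
/// everything else passes through unchanged.
fn flush_to_zero(x: f64) -> f64 {
    if x.is_subnormal() {
        0.0
    } else {
        x
    }
}
```

`f64::MIN_POSITIVE` (about 2.2e-308) is the smallest normal double, so `describe(f64::MIN_POSITIVE / 2.0)` reports "subnormal" while `describe(f64::MIN_POSITIVE)` reports "normal". Values in that lowest range keep working, but with fewer significant bits and, on some CPUs, much slower arithmetic.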

I'm not sure that explains everything - especially why the problem is Linux-only or why it doesn't occur with "target-cpu=x86-64-v2" - but at least it's a start.

phofman · 2023-09-28 8:16 am

Amazing catch, hats off! That discussion on rust forum shows that rust devs are truly low-level CPU connoisseurs. Actually I have never heard of denormals, feeling like a freshman 🙂

This article https://www.earlevel.com/main/2019/04/19/floating-point-denormals/ suggests adding tiny alternating positive and negative DC offsets to the DSP buffer to avoid the denormals. But I sort of do not understand how a tiny offset removes the denormals when the stream contains both negative and positive numbers. Perhaps because the chances of a negative value hitting the denormals range after adding the positive DC shift are much lower than processing a whole stream of samples close to zero where basically all values in the stream fall into the denormals?
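One way to look at it, as a rough sketch rather than anything taken from the article or from CamillaDSP: the trouble comes from recursive filter state decaying towards zero during silence, and the offset gives that state a floor well above the denormal range. The toy below uses a one-pole decay with a feedback factor of 0.5 and an arbitrary offset of 1e-20; both numbers and the function name `count_subnormals` are assumptions for illustration.

```rust
/// Toy recursion y[n] = 0.5 * y[n-1] + offset[n] with silent input,
/// where offset[n] alternates between +offset and -offset.
/// Returns how many of the `steps` outputs were subnormal.
fn count_subnormals(steps: usize, offset: f64) -> usize {
    let mut y = 1.0_f64; // state left over from an earlier signal
    let mut sign = 1.0_f64;
    let mut count = 0;
    for _ in 0..steps {
        y = 0.5 * y + sign * offset;
        sign = -sign;
        if y.is_subnormal() {
            count += 1;
        }
    }
    count
}
```

With `offset = 0.0` the state halves on every step and has to pass through the subnormal range on its way to zero. With `offset = 1.0e-20` it settles into a tiny alternating value of about the same size as the offset, which is inaudible but still a normal number.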

spfenwick · 2023-09-28 8:21 am

phofman said:
Actually I have never heard of denormals, feeling like a freshman

It took me back to studying Computer Science at university too many years ago. I don't think I've come across them since then.

HenrikEnquist · 2023-09-28 12:49 pm

Oh it's denormals! I did not expect those to pop up while processing a non-zero signal.
What it does now is to flush any stored denormals after each chunk:

Code:

    /// Flush stored subnormal numbers to zero.
    fn flush_subnormals(&mut self) {
        if self.s1.is_subnormal() {
            trace!("Biquad filter '{}', flushing subnormal s1", self.name);
            self.s1 = 0.0;
        }
        if self.s2.is_subnormal() {
            trace!("Biquad filter '{}', flushing subnormal s2", self.name);
            self.s2 = 0.0;
        }
    }

That solved an issue where the cpu load increased a few seconds after the signal goes quiet.
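To make the decay-after-silence case concrete, here is a self-contained sketch of a biquad with the same kind of per-chunk flush. It is not the CamillaDSP implementation: the struct layout, the transposed direct form II update, the RBJ cookbook peaking formulas, the filter settings (100 Hz, +6 dB, Q = 1), the chunk size of 1024 and the helper `run` are all assumptions chosen for illustration.

```rust
use std::f64::consts::PI;

/// Minimal transposed direct form II biquad, normalised so that a0 = 1.
struct Biquad {
    b0: f64,
    b1: f64,
    b2: f64,
    a1: f64,
    a2: f64,
    s1: f64,
    s2: f64,
}

impl Biquad {
    /// Peaking EQ coefficients (RBJ Audio EQ Cookbook formulas).
    fn peaking(fs: f64, f0: f64, gain_db: f64, q: f64) -> Self {
        let a = 10f64.powf(gain_db / 40.0);
        let w0 = 2.0 * PI * f0 / fs;
        let alpha = w0.sin() / (2.0 * q);
        let a0 = 1.0 + alpha / a;
        Biquad {
            b0: (1.0 + alpha * a) / a0,
            b1: -2.0 * w0.cos() / a0,
            b2: (1.0 - alpha * a) / a0,
            a1: -2.0 * w0.cos() / a0,
            a2: (1.0 - alpha / a) / a0,
            s1: 0.0,
            s2: 0.0,
        }
    }

    /// Filter one chunk in place. Returns the number of samples after
    /// which s1 or s2 held a subnormal value.
    fn process_chunk(&mut self, chunk: &mut [f64], flush: bool) -> usize {
        let mut subnormal_samples = 0;
        for x in chunk.iter_mut() {
            let y = self.s1 + self.b0 * *x;
            self.s1 = self.s2 + self.b1 * *x - self.a1 * y;
            self.s2 = self.b2 * *x - self.a2 * y;
            *x = y;
            if self.s1.is_subnormal() || self.s2.is_subnormal() {
                subnormal_samples += 1;
            }
        }
        if flush {
            // Same idea as flush_subnormals() above: clear stored
            // subnormals once per chunk.
            if self.s1.is_subnormal() {
                self.s1 = 0.0;
            }
            if self.s2.is_subnormal() {
                self.s2 = 0.0;
            }
        }
        subnormal_samples
    }
}

/// About one second of a 1 kHz sine at 48 kHz, then 300 chunks of silence.
/// Returns (subnormal sample count during the silence, final s1, final s2).
fn run(flush: bool) -> (usize, f64, f64) {
    let fs = 48_000.0;
    let mut filt = Biquad::peaking(fs, 100.0, 6.0, 1.0);
    let mut chunk = vec![0.0_f64; 1024];
    let mut n = 0u64;
    for _ in 0..47 {
        for x in chunk.iter_mut() {
            *x = (2.0 * PI * 1000.0 * n as f64 / fs).sin();
            n += 1;
        }
        filt.process_chunk(&mut chunk, flush);
    }
    let mut count = 0;
    for _ in 0..300 {
        chunk.fill(0.0);
        count += filt.process_chunk(&mut chunk, flush);
    }
    (count, filt.s1, filt.s2)
}
```

Calling `run(false)` lets the state decay into the subnormal range a few seconds into the silence, while `run(true)` ends with both state variables at exactly zero. The flush only runs at chunk boundaries, so it cannot catch subnormals that appear and vanish inside a chunk.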

But apparently these nasty things can show up during the biquad calculations, without ending up in s1 or s2. This will need some investigation...

TNT · 2023-09-28 2:58 pm

"File" tab... 2.0.... I had high hopes for dates in a column and if the header was clicked, the list of files would be sorted on date... 🙂

//

HenrikEnquist · 2023-09-28 3:37 pm

Yes sorting would be nice. The current file listing is very basic, but I'll see if I can find some ready-made table component to use instead. That could enable stuff like filtering too.

HenrikEnquist · 2023-09-28 9:31 pm

I implemented a biquad in Python to make it easier to track the values. It doesn't help, I can't explain how there can be denormals while a sine is playing.
This is the output, and the internal variables while processing a 1 kHz sine at 48 kHz sampling rate. The filter is the same Peaking filter as @spfenwick used in the example.
Blue is the signal, green and orange are the s1 and s2 state variables, all shown on a dB scale. While the sine wave is playing, they all stay above -150 dB which is well within what can be represented with normal numbers (with a margin of over 1000 dB!).
Then the signal ends at 3 seconds, and the variables all start dropping until they are low enough to need denormals, some seconds after the plot ends. That is the problem that was solved by flushing denormals after each chunk.
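The size of that margin can be checked with a few lines of arithmetic. This is only a sketch, with 0 dB taken as a sample magnitude of 1.0 and the helper names `to_db` and `margin_to_subnormal_db` invented for the purpose.

```rust
/// Level of a sample magnitude in dB, with 1.0 as the 0 dB reference.
fn to_db(x: f64) -> f64 {
    20.0 * x.abs().log10()
}

/// Distance in dB from a given level down to the smallest normal f64.
fn margin_to_subnormal_db(level_db: f64) -> f64 {
    level_db - to_db(f64::MIN_POSITIVE)
}
```

The smallest normal double, about 2.2e-308, sits at roughly -6150 dB on this scale, so a state variable at -150 dB is still around 6000 dB away from needing denormals.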

This is as far as I have gotten today. To be continued..

spfenwick · 2023-09-28 9:50 pm

I'm having trouble understanding it too. Even without the analysis you've done it seems intuitively that any audio signal would be way above the level of double precision denormals. Aside from the decay after the signal ends, a sample randomly falling below 10^-308 wouldn't be expected within the lifetime of the universe.

And even if denormals are somehow happening, the Gain of 0dB is just a multiply by 1.0, so should make no difference.
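As a quick sanity check of that claim, here is a sketch of a gain stage. I'm assuming the usual dB-to-linear conversion; the function names are made up and this is not the CamillaDSP Gain code.

```rust
/// Convert a gain in dB to a linear multiplier.
fn db_to_linear(gain_db: f64) -> f64 {
    10f64.powf(gain_db / 20.0)
}

/// Multiply every sample by the linear gain, in place.
fn apply_gain(samples: &mut [f64], gain_db: f64) {
    let g = db_to_linear(gain_db);
    for s in samples.iter_mut() {
        *s *= g;
    }
}
```

At 0 dB the multiplier is exactly 1.0, and with default floating point settings multiplying by 1.0 returns every value bit for bit, subnormals included. So the Gain stage shouldn't change the data that reaches the Biquad.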

Let me know if there are any tests you'd like me to run on an Intel CPU.

HenrikEnquist · 2023-09-28 10:03 pm

spfenwick said:
Aside from the decay after the signal ends, a sample randomly falling below 10^-308 wouldn't be expected within the lifetime of the universe.

An occasional denormal doesn't cause any trouble, just a temporary slowdown for one or a few math operations. It's not a problem until there are lots of them.

spfenwick said:
Let me know if there are any tests you'd like me to run on an Intel CPU.

Will do! But at the moment I have no idea what to try next 🤔

roderickvd · 2023-09-29 7:49 am

Maybe to offer a suggestion, wrong as it may be, just to see if it ignites any spark when troubleshooting:

Could it be that it’s not the filter that causes the denormals, but that it’s already so in the input thread, spuriously or not?

Did not yet take the time to do a code run-through.

phofman · 2023-09-29 10:28 am

The input question is interesting, since just multiplying the input values with 1.0 float fixes the issue, according to @spfenwick 's measurements, IIUC.
