Open Source DSP XOs

Trevor White · 2012-05-12 3:21 am

steph_tsf said:
Interesting discussion. Let's focus on a Butterworth highpass 8th-order 30 Hz -3dB. Sampling frequency 96 kHz. What method would you advise, on a DSP56K, for getting rid of the noise & distorsion produced by the native (naïve ?) series IIR BiQuad implementation ? Keep in mind we need the exact same Bode plot. What would be most exact and efficient way ?
- Boosting the precision using more elaborate arithmetic routines for the multiply-accumulation (say, reaching a 32-bit data path inside the BiQuad, operating with 64-bit accumulator) ? What would be the computing time penalty ?
- Decimating the signal, applying a series IIR BiQuad in a lower sampling frequency context, upsampling the signal for getting back a 96 kHz sampling frequency ? What decimation factor to use ? Can we guarantee a Bode plot that's exact ?
- Any other method welcome

For low frequency filters I used state variable filters on an Analog Devices SHARC because the direct form and SIMD optimized library functions produced to much noise. I assume the state variable filters would work equally well for fixed point 32 bit and 24 bit DSP's.

steph_tsf · 2012-05-12 8:52 am

Trevor White said:
For low frequency filters I used state variable filters on an Analog Devices SHARC because the direct form and SIMD optimized library functions produced to much noise.

Hi Trevor, would you dare saying that for the low frequencies (like a 2nd-order 30Hz highpass on a 96 kHz system with Q in the range of 0.3 to 3.0), that the output from a hand-coded Chamberlin Digital State Variable Filter will be cleaner than the output from a hand-coded Direct Form IIR BiQuad ?

Trevor White · 2012-05-13 3:15 am

steph_tsf said:
Hi Trevor, would you dare saying that for the low frequencies (like a 2nd-order 30Hz highpass on a 96 kHz system with Q in the range of 0.3 to 3.0), that the output from a hand-coded Chamberlin Digital State Variable Filter will be cleaner than the output from a hand-coded Direct Form IIR BiQuad ?

yes it should be 😉

steph_tsf · 2012-05-13 11:26 pm

chaparK said:
Finally, 24 bits is 144 dB of dynamic range: that covers everything from ants walking to the heart of the space shuttle's engines. If you can't do proper audio processing on a 24-bit platform then maybe you should indeed stick to analogue filters and let dsp to dsp guys.

A friendly reply is to recommend the Handbook of Digital Signal Processing Engineering Applications (Elliott, Vaidynathan, Harris, Pasthoon, Silva, +)

steph_tsf · 2012-05-14 12:33 am

rsdio said:
By the way, if anyone knows of a professionally-designed DSP product that has an interrupt per audio sample, please let me know. That would be a very interesting design choice considering the number of cycles of overhead per interrupt.

Being from the old (naïve ?) "one interrupt per sample" audio DSP school, I feel great respect for any system that is not "one interrupt per sample", thus needing audio buffers (size from 256 to 1024 samples) like the Steinberg ASIO for MS Windows, that some skilled people can persuade to never crash during a live performance.
I feel the same respect for ALSA and JACK running on GNU/Linux and Mac OS X.
With the advent of multicore GHz-class CPUs, in a system featuring a single sampling frequency, I feel there should be a possibility to run ALSA and JACK with a "buffer_size = 1" setting.
Within ALSA and JACK, in a system featuring a single sampling frequency, a nice feature would be to get a CPU core and its associated cache memory, entirely dedicated to the audio occuring at Fs, thus one interrupt per sample. Is this currently feasible ? Knowing the 2.6 Kernel may not digest a 10 µs interrupt (96 kHz audio), it may be necessary to let the kernel think that he is dealing with an audio system operating at Fs/256 with a 256-sized buffer, allowing an other CPU core to deal with the details. The Fs/256 "software interrupt" would only be seen and processed by the main CPU cores running the kernel (like taking the user inputs), while the Fs "hardware interrupt" would only be seen and processed by the CPU core dedicated to ALSA and JACK.

rsdio · 2012-05-14 2:34 am

steph_tsf said:
With the advent of multicore GHz-class CPUs, in a system featuring a single sampling frequency, I feel there should be a possibility to run ALSA and JACK with a "buffer_size = 1" setting.
Within ALSA and JACK, in a system featuring a single sampling frequency, a nice feature would be to get a CPU core and its associated cache memory, entirely dedicated to the audio occuring at Fs, thus one interrupt per sample. Is this currently feasible ?

The question is moot. Whether the OS can handle the interrupts or not, the audio hardware is not designed for this. USB Audio Class cannot interrupt any faster than 1 ms at Full Speed, which is 100 times slower than what you suggest for 96 kHz audio. I do not know the limits of FireWire audio, off hand, but they are probably similar due to the packet nature of most bus designs. PCI Audio designs might be capable of one interrupt per sample, but most are designed to use the system DMA rather than an interrupt.

Another consideration is that systems like OSX are designed around symmetric multi-processing, but you describe an asymmetric system where one processor handles audio exclusively while others handle general non-audio tasks. It would require quite a lot of kernel changes, I'd imagine, to implement what you suggest.

Basically, general purpose processors are probably always going to be limited to dealing with buffers of multiple audio samples. Meanwhile, embedded processors can be dedicated to audio and thus might be able to handle one interrupt per sample, but they will probably still need to communicate with general processors via buffers, unless the embedded system has its own media source (such as a CD Player).

abraxalito · 2012-05-14 2:46 am

rsdio said:
Meanwhile, embedded processors can be dedicated to audio and thus might be able to handle one interrupt per sample, but they will probably still need to communicate with general processors via buffers, unless the embedded system has its own media source (such as a CD Player).

In my own implementations of DSP on the ARM Cortex M0 and M3, I've managed so far to avoid using interrupts entirely. Since processing audio is entirely deterministic, I can't see the point myself. Interrupts for me are when you can't really be sure when an event is going to happen - in contrast, digital audio is 100% predictable. Why spend the overhead of context save and restore when not absolutely essential?

Trevor White · 2012-05-14 3:09 am

abraxalito said:
In my own implementations of DSP on the ARM Cortex M0 and M3, I've managed so far to avoid using interrupts entirely. Since processing audio is entirely deterministic, I can't see the point myself. Interrupts for me are when you can't really be sure when an event is going to happen - in contrast, digital audio is 100% predictable. Why spend the overhead of context save and restore when not absolutely essential?

The problem is if you use the DSP for anything else such as updating a display, reading buttons, serial comms, USB etc then you could run into trouble. The idea of using interrupts is to make sure the audio processing always gets priority over anything else. All of the lower priority tasks can be polled as usual.

abraxalito · 2012-05-14 3:16 am

Trevor White said:
The problem is if you use the DSP for anything else such as updating a display, reading buttons, serial comms, USB etc then you could run into trouble.

Yep, I don't do that with my designs. All user interface I/O I'd typically offload to another CPU. That's one reason I really love the LPC43XX dual core approach. In the future, tablets will be so ubiquitous that they'll be the first choice for any user interfaces then all user I/O will happen over the interface to the tablet which might well be USB (or perhaps Bluetooth for wireless).

The idea of using interrupts is to make sure the audio processing always gets priority over anything else. All of the lower priority tasks can be polled as usual.

Yep, its one way to handle it. The way I prefer is to make sure the audio processing gets done first, then beyond that there's time for those less urgent tasks, most of which could be handled by polling.

Trevor White · 2012-05-14 3:57 am

abraxalito said:
Yep, I don't do that with my designs. All user interface I/O I'd typically offload to another CPU. That's one reason I really love the LPC43XX dual core approach. In the future, tablets will be so ubiquitous that they'll be the first choice for any user interfaces then all user I/O will happen over the interface to the tablet which might well be USB (or perhaps Bluetooth for wireless).

Yep, its one way to handle it. The way I prefer is to make sure the audio processing gets done first, then beyond that there's time for those less urgent tasks, most of which could be handled by polling.

But unless you have something to interrupt those lower priority tasks then you run into trouble if they stretch out to long. This leads you back to a preemptive multitasking OS and that means interrupts. You can't escape it 😉

abraxalito · 2012-05-14 4:04 am

Trevor White said:
But unless you have something to interrupt those lower priority tasks then you run into trouble if they stretch out to long.

Yep, I agree - so I make sure they don't stretch out too long. I write in assembler and count every cycle 😀

This leads you back to a preemptive multitasking OS and that means interrupts. You can't escape it 😉

But I can and do - a RTOS is a medicine which is worse than the disease. I think of an RTOS as rather like government - the smaller the better, ideally none.😛

Where an RTOS makes sense (I've used OS-9/68k back in the days of Microware) is where you need to make changes on the fly. Or where you have a team of programmers who need to dovetail their efforts together. But in a dedicated embedded system done by a single DIYer, you just don't need it.

rsdio · 2012-05-14 4:04 am

abraxalito said:
In my own implementations of DSP on the ARM Cortex M0 and M3, I've managed so far to avoid using interrupts entirely. Since processing audio is entirely deterministic, I can't see the point myself. Interrupts for me are when you can't really be sure when an event is going to happen - in contrast, digital audio is 100% predictable. Why spend the overhead of context save and restore when not absolutely essential?

Why spend the overhead of polling? I can guarantee that the last firmware that I wrote spends fewer cycles on interrupts than cycles for polling.

Invariably, digital audio is handled by some peripheral hardware that is independent of the CPU instruction sequence, so you need either interrupts or polling of hardware status to synchronize the two (unless you're building custom hardware in an FPGA where there is no distinction between I/O and instructions). You can't really predict 100%, polling is not the same as predicting.

When you have a loop with polling, whatever you think that you have coded "first" in your loop can easily become inverted to be the "last" thing that gets done, unless every one of your operations short-circuits back to the first item to guarantee its priority sequence.

abraxalito · 2012-05-14 4:08 am

rsdio said:
Why spend the overhead of polling?

Because its less than the overhead of interrupt context save and restore. Simple really. Most of the time just a read of one status register.

I can guarantee that the last firmware that I wrote spends fewer cycles on interrupts than cycles for polling.

Your choice 😀

Invariably, digital audio is handled by some peripheral hardware that is independent of the CPU instruction sequence, so you need either interrupts or polling of hardware status to synchronize the two (unless you're building custom hardware in an FPGA where there is no distinction between I/O and instructions). You can't really predict 100%, polling is not the same as predicting.

I poll for a new sample to arrive once I've completed doing everything for that sample cycle.

rsdio · 2012-05-14 4:10 am

abraxalito said:
Yep, I don't do that with my designs. All user interface I/O I'd typically offload to another CPU. That's one reason I really love the LPC43XX dual core approach.

Not every design should be solved with a dual core CPU. Also, not everybody wants to write two separate firmwares, one for user interface and one for signal I/O. There are still some advantages to a single core with a single firmware, especially if there is enough DMA and other systems that can operate in parallel with the "code."

The TMS320 firmware that I keep referring to as an example has 5 or 6 channels of DMA active, each triggered directly by the hardware, and the "code" does very little but keep track so that USB transfers reflect the asynchronous conversions. There is still some polling, so that the interrupts can be trimmed to the minimum number of cycles necessary.

abraxalito · 2012-05-14 4:13 am

rsdio said:
Not every design should be solved with a dual core CPU.

Tilting at windmills.

Also, not everybody wants to write two separate firmwares, one for user interface and one for signal I/O. There are still some advantages to a single core with a single firmware, especially if there is enough DMA and other systems that can operate in parallel with the "code."

I don't disagree. But I tend to work with very small systems - the smallest LPCs don't even have microDMA controllers.

rsdio · 2012-05-14 4:17 am

abraxalito said:
Because its less than the overhead of interrupt context save and restore. Simple really. Most of the time just a read of one status register.

Except that the one status register might be read hundreds of times before it's status is "true" - in that case you've used more cycles than an interrupt. Granted, I've seen TMS320 interrupts implemented in the C Language which have a ridiculous number of cycles of overhead, which is why I code the interrupts in assembly.

Your choice 😀

I poll for a new sample to arrive once I've completed doing everything for that sample cycle.

In the simplest cases, it really is just a choice. I thought of saying that it's "six of one or a half dozen of the other." In your case, you simply burn all the spare cycles you might have in a tight little loop polling for the next sample, but those same cycles could be burned in an interrupt context save and potentially introduce less random jitter. In other words, the time between a hardware event and the first assembly instruction in an interrupt is way more regular (perhaps perfectly regular) compared to the time between that same hardware event and a tight polling loop that might hit on a different cycle offset every sample. Your solution may introduce more jitter than an interrupt-based solution, unless the jitter is alleviated elsewhere.

p.s. I agree with you that an RTOS is basically not necessary for a single firmware developer (and I don't think it matters whether it's DIY or professional).

EDIT: I also wanted to point out that some designs with multiple CPUs require two separate firmware storage memories, which complicates manufacturing by requiring that two firmware images be loaded onto a new board before it will function. There are solutions, but a single-code, single-firmware design is the easiest to manufacture. The dual-code CPUs that you like probably make this easy, but they're not the only option.

abraxalito · 2012-05-14 4:40 am

rsdio said:
Except that the one status register might be read hundreds of times before it's status is "true" - in that case you've used more cycles than an interrupt.

I believe you've employed some faulty reasoning.

Reading the status register once per audio sample uses say 3 cycles (read, test, conditional branch). Whereas which CPU do you know of only has a 3 cycle overhead for interrupts? With an interrupt on every audio sample that overhead is on every sample, just as my polling is. Perhaps you're still thinking big buffers with an interrupt on every 256 samples? In which case you'd have a point.

In the simplest cases, it really is just a choice. I thought of saying that it's "six of one or a half dozen of the other." In your case, you simply burn all the spare cycles you might have in a tight little loop polling for the next sample, but those same cycles could be burned in an interrupt context save and potentially introduce less random jitter.

I'm not sure the jitter's better with interrupts. But there is a fixed interrupt latency mode for some of the Cortex M series CPUs, that would potentially have jitter of less than one CPU clock cycle. But the penalty is higher latency overall. In any case, I'm not worried about jitter so its not a criteria I bother with here. Is there a good reason you know of that I should bother with it? (The audio output doesn't depend on it, I load audio samples into a FIFO and that's clocked out under hardware control).

rsdio · 2012-05-14 5:42 am

abraxalito said:
I believe you've employed some faulty reasoning.

Reading the status register once per audio sample uses say 3 cycles (read, test, conditional branch). Whereas which CPU do you know of only has a 3 cycle overhead for interrupts? With an interrupt on every audio sample that overhead is on every sample, just as my polling is.

Ha! I would say that you've employed faulty assumptions.

Reading a status register does not force it to be ready the first time it is read, it merely loads the current status whether the audio sample is ready or not. Your code loops back to read the status register again and again until the hardware says that an audio sample is actually there. Between the point where your code first accesses the status register and when the audio sample actually arrives could easily be hundreds of cycles. If you need those cycles for non-audio use, then they've been wasted.

The TMS320 has, I believe, only 1 or 2 cycles of interrupt overhead. Of course, the instruction pipeline must finish executing anything that is in process, but very few instructions are more than 1 cycle. Then, the TMS320 actually pushes values to two stacks in parallel. Being an enhanced Harvard architecture, the TMS320 has 3 read busses and 2 write busses, allowing a single cycle to push two values onto two stacks simultaneously. From there, additional overhead depends upon your assembly interrupt implementation.

Perhaps you're still thinking big buffers with an interrupt on every 256 samples? In which case you'd have a point.

The elephant in the room is that SIMD vector processing is more efficient when working on multiple samples. Not only are the arithmetic unit operations more efficient, but systems like AltiVec make the process of streaming data into the SIMD unit more efficient than auto-increment address register accesses. Thus, processing one audio sample at a time can be a terrible waste of power, which is really important on a battery-operated platform.

In my case, I'm dealing with 64-point or 128-point FFT calculations, so it is pointless to work on 1 sample at a time. Similarly, many crossover implementations will already have latency of more than 1 sample, depending upon the code. This isn't to say that all crossovers might as well work on buffers, but a great many should.

But the penalty is higher latency overall.

Not necessarily. It's highly dependent upon the code, whether interrupt-based or polled. If you can prove that your code never loops when polling, or basically never had a conditional execution path, then you might be able to claim minimum latency with no wasted cycles.

Is there a good reason you know of that I should bother with it? (The audio output doesn't depend on it, I load audio samples into a FIFO and that's clocked out under hardware control).

It depends upon the hardware. It's highly unlikely that there would be a problem, but it's possible in rare cases. In your specific case with a FIFO, there's no problem at all, so you're good!

abraxalito · 2012-05-14 6:06 am

rsdio said:
Reading a status register does not force it to be ready the first time it is read, it merely loads the current status whether the audio sample is ready or not.

Oh we seem to have gotten crossed wires here in our discussion - the audio sample is being polled for separately, the status register is there to tell us whether we need to service the less urgent I/O tasks (like buttons, display updates etc). Most of the times the status reg is going to say there's nothing to do for them as they're on 10s of mS timeframe, or longer.

Your code loops back to read the status register again and again until the hardware says that an audio sample is actually there. Between the point where your code first accesses the status register and when the audio sample actually arrives could easily be hundreds of cycles. If you need those cycles for non-audio use, then they've been wasted.

Ah you've made an important mis-assumption - the code I write is tight, so if I'm waiting in a loop for 100s of cycles, I am indeed throwing away very useful CPU cycles and I don't generally wish to do that - its a sub-optimal system. In a typical digital filter I might only have 1536 CPU cycles to play with for each audio sample, hundreds thrown away in a redundant loop is indeed a big waste.

The TMS320 has, I believe, only 1 or 2 cycles of interrupt overhead. Of course, the instruction pipeline must finish executing anything that is in process, but very few instructions are more than 1 cycle. Then, the TMS320 actually pushes values to two stacks in parallel. Being an enhanced Harvard architecture, the TMS320 has 3 read busses and 2 write busses, allowing a single cycle to push two values onto two stacks simultaneously. From there, additional overhead depends upon your assembly interrupt implementation.

So then in hardware its taken 2 cycles, how much context saving needs to go on in software? At very least a couple of register pushes right? So we're already beyond the 3 cycles on the LPC. But the TMS320 is in an entirely different league to the Cortex M0 in terms of power consumption, given its multiple buses.

The elephant in the room is that SIMD vector processing is more efficient when working on multiple samples. Not only are the arithmetic unit operations more efficient, but systems like AltiVec make the process of streaming data into the SIMD unit more efficient than auto-increment address register accesses. Thus, processing one audio sample at a time can be a terrible waste of power, which is really important on a battery-operated platform.

Well then, let's stack up my power requirements (around 11mA at 1.8V so 20mW) against yours? How much current does the TMS320 take and at what supply voltages?

Trevor White · 2012-05-14 2:41 pm

abraxalito said:
I believe you've employed some faulty reasoning.

Reading the status register once per audio sample uses say 3 cycles (read, test, conditional branch). Whereas which CPU do you know of only has a 3 cycle overhead for interrupts? With an interrupt on every audio sample that overhead is on every sample, just as my polling is. Perhaps you're still thinking big buffers with an interrupt on every 256 samples? In which case you'd have a point.

I'm not sure the jitter's better with interrupts. But there is a fixed interrupt latency mode for some of the Cortex M series CPUs, that would potentially have jitter of less than one CPU clock cycle. But the penalty is higher latency overall. In any case, I'm not worried about jitter so its not a criteria I bother with here. Is there a good reason you know of that I should bother with it? (The audio output doesn't depend on it, I load audio samples into a FIFO and that's clocked out under hardware control).

why do you have to interrupt the processor for each sample ?

Surely this is not a requirement for an active crossover where a bit of latency is neither here nor there.

Search

Amplifiers

Source & Line

Loudspeakers

Design & Build

General Interest

Live Sound

Member Areas

Site

Featured Vendors

Members Market

Vendors Market

Vendors

Search

Open Source DSP XOs

Trevor White

steph_tsf

Attachments

Trevor White

steph_tsf

Attachments

steph_tsf

rsdio

abraxalito

Trevor White

abraxalito

Trevor White

abraxalito

rsdio

abraxalito

rsdio

abraxalito

rsdio

abraxalito

rsdio

abraxalito

Trevor White