Open Source DSP XOs

There's a variety of 16 bit multiplies, including SISD and SIMD. The multiply section of the instruction set docs is here.

Using SMLAL is several instructions faster than 32 + 16 = 48 bit extended precision---the multiplies aren't too much slower but shifting costs for 16 bit alignment are higher than those for 32 bit. For IIR 32 bit samples with 64 bit feedback is more accurate than 48 bit as well as faster. For FIR, 32 bit samples with 64 bit MAC seems to be fine. So, IMO, 16 bit operations aren't very interesting given 19 bit DACs cost a few dollars. Once one adds a few bits to the DAC resolution to avoid fixed point saturation and a reasonable number of guard bits a 32 bit data path seems most attractive.
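Roughly what that looks like for the FIR case, sketched in plain C; the function, coefficient format, and shift amounts here are my own illustration rather than anyone's library code, and a decent compiler turns the 64 bit accumulations into SMULL/SMLAL on the M4:

#include <stdint.h>

// Illustrative only: Q2.29 coefficients (two guard MSBs) against Q0.31 samples,
// accumulated at 64 bits, then shifted once back down to Q0.31. No saturation handling.
static int32_t fir_q31(const int32_t *coef, const int32_t *x, int taps)
{
    int64_t acc = 0;                        // products are Q2.60
    for (int i = 0; i < taps; i++)
        acc += (int64_t)coef[i] * x[i];     // 32x32->64 multiply-accumulate
    return (int32_t)(acc >> 29);            // 60 - 31 = 29 bit shift back to Q0.31
}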
 
There's a variety of 16 bit multiplies, including SISD and SIMD. The multiply section of the instruction set docs is here.

Thanks for the useful link - I see I need to anorak-up on the full spectrum of the M4's math capabilities :cool:

Using SMLAL is several instructions faster than 32 + 16 = 48 bit extended precision---the multiplies aren't too much slower but shifting costs for 16 bit alignment are higher than those for 32 bit. For IIR 32 bit samples with 64 bit feedback is more accurate than 48 bit as well as faster. For FIR, 32 bit samples with 64 bit MAC seems to be fine. So, IMO, 16 bit operations aren't very interesting given 19 bit DACs cost a few dollars. Once one adds a few bits to the DAC resolution to avoid fixed point saturation and a reasonable number of guard bits a 32 bit data path seems most attractive.

I just bought a few tubes of 16bit DACs (8 cents a piece) so I might find those 16 bit ops useful yet :p
 
The M4 lacks saturating instructions. So most biquad implementations (including the ones in the CMSIS DSP library) require two guard MSBs to prevent overflow. And even if saturating instructions are available, the two guards are rather useful to minimize errors from internal clipping. I'm not up on precision requirements for FIRs---don't use them so haven't done numerical studies---but for IIRs typically somewhere around 8 to 13 guard bits are desirable in the data path at 44.1kHz, depending on the XO and EQ being done on the subwoofer end of things and just how low frequency it goes. Call it 2 + 16 + 10 = 28 bits to do a reasonable job of holding the numerical noise floor below the precision of a 16 bit DAC. On a 32 bit processor that implies a 32 bit data path and ops.

There's a fair bit of wiggle room in this as to just how inaccurate one allows the filter responses to become due to coefficient truncation and feedback errors. A bottom up way of answering the question of how many bits is good enough is to consider that the larger the number of biquads the lower their individual numerical noise floor needs to be. If one makes the inaccurate but convenient simplifying assumption that noise from each biquad in the filter bank is IID then somewhere around 1.5 guard bits tend to be desirable to handle noise combination. My experience is the noise floor is about double that in practice (though, again, it depends on what the filters are doing at what frequency). And, by definition, the noise floor is half of the LSB. So three or four LSB guard bits plus the two MSBs is a reasonable baseline.

Fitting all that into 16 bit operations is equivalent to using a 10 or 11 bit DAC, which is something like 63dB DNR. Given decent analog volume management that's no great loss for MP3s of loudness war pop music. For critical listening I come up with something around 80 or 90dB DNR being desirable when I run the subjective maths, mostly depending on the allowances one chooses to make for crest factor in the music and the ear's ability to pull out artifacts from well under the noise floor. Assuming bang on analog volume and filter scaling that's 14 bits out of the DAC, suggesting 19 bits minimum with 21 bits being preferable to have a little margin for numerical noise crests and scaling issues. That also implies a 32 bit data path and ops on a 32 bit processor.
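Spelling out the guard bit arithmetic above (my own working, under the stated IID assumption): summed noise power grows with the number of biquads N, so the noise amplitude grows as sqrt(N) and the extra LSB guard bits needed are 0.5*log2(N)---1.5 bits at N = 8---with roughly one more bit if the noise floor comes in double.

#include <math.h>
#include <stdio.h>

// Quick numeric check of the guard bit estimate under the IID assumption.
int main(void)
{
    for (int n = 2; n <= 16; n *= 2)
        printf("%2d biquads: %.1f guard LSBs (IID), ~%.1f allowing for doubling\n",
               n, 0.5 * log2(n), 0.5 * log2(n) + 1.0);
    return 0;
}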

So one might as well set up the input shift from Q15 to Q31 and output shift to Q15 or Q23 to take advantage of the full bit depth. It's easy enough to synthesize a filter bank this way in both 16 and 32 bit and ABX the two output streams. I've done this a few different ways to test the above and got 100% discrimination in favour of 32 bit for both Q15 and Q23 output.
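The shifts themselves are trivial; a sketch in plain C, with helper names that are mine:

#include <stdint.h>

// Widen Q15 input for the 32 bit filter bank, then narrow the result for the DAC.
static inline int32_t q15_to_q31(int16_t x) { return (int32_t)x << 16; }
static inline int16_t q31_to_q15(int32_t y) { return (int16_t)(y >> 16); }
static inline int32_t q31_to_q23(int32_t y) { return y >> 8; }    // for a 24 bit DAC word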

If one's using DSPs there's certainly room for debate as to whether options such as 26 bit with 52 bit MAC or 28 bit with 56 bit MAC are good enough. When choosing a DSC like the M4 it's pretty well irrelevant as basically all current processors are 32 bit.
 
32bit which can generate either a 32bit or 64bit result. There's no 16bit multiply, though from memory there might be a dual (i.e. in parallel) 16bit multiply instruction.



Where did you get your quote?

http://www.iar.com/Global/Resources/Developers_Toolbox/Building_and_debugging/Designing_advanced_DSP_applications_on_the_Kinetis_ARM_Cortex-M4_MCU_part1.pdf

Also I recall reading on a forum how someone had to switch to a SHARC from an M4 because it didn't have enough floating point throughput.
 
Hmm, post 439 for M4 versus SHARC tradeoffs? IAR's paper seems a little wonky as it kind of contradicts itself in places---suspect that's probably mostly due to English as a second language---but it's generally in the right direction. Not sure why one would choose to do embedded audio DSP in floating point. Doubles on a PC CPU, eh, why not? Perhaps this someone on some forum didn't want to learn fixed point. ;)

I should clarify my remark about saturating math above. The M4 single cycle MAC instructions one would normally use for IIR or FIR implementations aren't saturating. For about triple the cycles one can do MUL+QADD instead of SMLAD to get a saturating 32 bit accumulate. There is no saturating 64 bit accumulate, which is what I was speaking to at the start of post 443. Plus (unless the docs are missing something) SMLAL doesn't set the Q flag on overflow, so it's not possible to check for overflow and go fail safe on the sample in the event of a scaling error.
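For concreteness, the MUL + QADD pattern looks something like this, assuming the CMSIS core intrinsics; the helper itself is my own sketch rather than anything from the library:

#include <stdint.h>
#include "arm_math.h"   // CMSIS; pulls in the core __QADD intrinsic

// One 16x16 product folded into a 32 bit accumulator with saturation.
// Unlike the SMLAD accumulate, which wraps, QADD clips at INT32_MAX/INT32_MIN.
static inline int32_t mac16_saturating(int32_t acc, int16_t c, int16_t x)
{
    int32_t prod = (int32_t)c * x;   // MUL (or SMULBB)
    return __QADD(acc, prod);        // saturating add
}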

Also, I suppose if one was clever enough there's probably a way to reduce the number of shifts required to implement 48 bit MAC and feedback with SMLALDX.

Thanks for the useful link
Actually, sorry, looks like the link in post 441 isn't persistent. So it comes back to a more general ARM section which doesn't quite match the M4 in all instructions, though it's the same for most of them. Collapse up the tree and go Cortex-M -> M4 -> r0p1 -> User Guide -> Instruction Set to get the M4 instruction set.
 
Hmm, post 439 for M4 versus SHARC tradeoffs? IAR's paper seems a little wonky as it kind of contradicts itself in places---suspect that's probably mostly due to English as a second language---but it's generally in the right direction. Not sure why one would choose to do embedded audio DSP in floating point. Doubles on a PC CPU, eh, why not? Perhaps this someone on some forum didn't want to learn fixed point. ;)

So what do I do with a fixed point processor if I need 20 dB of gain ?
 
Depends on the nature of the gain. Fixed point scaling is probably the area to read up on to answer your question. Given a 32 bit number, treating it as Q6.25 rather than Q2.29 makes it 24dB bigger without touching the bits. So, for optimal bit depth utilization, for 20dB gain one would typically attenuate by 4dB in the math and move the point over by four bits. If it's known that MSBs in the input are unused then the point may not need to move as much.
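To make that concrete, a sketch in plain C---the constant, helper name, and Q formats are my own illustration: multiply by -4dB in Q0.31 and let the downstream code read the same 32 bits as Q6.25 instead of Q2.29, which supplies the other +24dB.

#include <stdint.h>

#define GAIN_MINUS_4DB_Q31  ((int32_t)(0.63095734 * 2147483648.0))   // 10^(-4/20) as Q0.31

// Input read as Q2.29; reinterpreting the result as Q6.25 downstream gives
// a net 24dB - 4dB = 20dB of gain. No overflow or saturation handling shown.
static int32_t gain_20db(int32_t x_q2_29)
{
    int64_t y = (int64_t)x_q2_29 * GAIN_MINUS_4DB_Q31;   // Q2.29 * Q0.31 = Q2.60
    return (int32_t)(y >> 31);                           // back to 32 bits, same bit positions
}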

If it's a consistent 20dB gain I'd probably do that in analog in the DAC output buffers rather than changing the definition of dBFS between DSP stages. If it's, say, a 20dB peaking EQ to compensate for dipole roll off, point motion may not be necessary. If applied prior to the EQ patch the lowpass for a sub or bandpass for a mid typically reduces the signal level enough that a right shift isn't needed to prevent overflow. Relying on that headroom is dangerous without saturating math and requires a few extra cycles for safety checks. So there's a performance versus noise floor tradeoff. The simple way to apply the gain safely is to right shift by enough bits to guarantee overflow won't occur. But that means loss of LSBs and a corresponding increase in noise floor if MSBs go unused.

If it's 20dB gain on one channel and 0dB on another with a requirement that both channels' output buffers have the same gain then I'd be tempted to implement that as a 20dB attenuation on the 0dB channel in the DAC's volume register. In principle that yields optimal bit depth utilization but, given a DSP with 32 bit math and a DAC with 24 bit math, it can be preferable to do the attenuation on the DSP as more guard bits are available and hence roundoff errors are lower. Depending on the DAC innards putting, say, 2dB of the attenuation on the DSP and right shifting by three bits in the DAC might be optimal.

If you're more confused than when you started reading this you're probably on the right track. The arbitrariness of the point takes a little getting used to and, if one's really trying to get every bit of resolution possible, it tends to come down to trying a few different implementations and measuring the DAC output in the analog domain to see what's best. Throwing guard bits at the problem has a way of being more convenient. That's one reason why the first paragraph in post 443 comes up with 28 bits whilst the second says 21 bits.
 
Depends on the nature of the gain. Fixed point scaling is probably the area to read up on to answer your question. Given a 32 bit number, treating it as Q6.25 rather than Q2.29 makes it 24dB bigger without touching the bits. So, for optimal bit depth utilization, for 20dB gain one would typically attenuate by 4dB in the math and move the point over by four bits. If it's known that MSBs in the input are unused then the point may not need to move as much.

If it's a consistent 20dB gain I'd probably do that in analog in the DAC output buffers rather than changing the definition of dBFS between DSP stages. If it's, say, a 20dB peaking EQ to compensate for dipole roll off, point motion may not be necessary. If applied prior to the EQ patch the lowpass for a sub or bandpass for a mid typically reduces the signal level enough that a right shift isn't needed to prevent overflow. Relying on that headroom is dangerous without saturating math and requires a few extra cycles for safety checks. So there's a performance versus noise floor tradeoff. The simple way to apply the gain safely is to right shift by enough bits to guarantee overflow won't occur. But that means loss of LSBs and a corresponding increase in noise floor if MSBs go unused.

If it's 20dB gain on one channel and 0dB on another with a requirement that both channels' output buffers have the same gain then I'd be tempted to implement that as a 20dB attenuation on the 0dB channel in the DAC's volume register. In principle that yields optimal bit depth utilization but, given a DSP with 32 bit math and a DAC with 24 bit math, it can be preferable to do the attenuation on the DSP as more guard bits are available and hence roundoff errors are lower. Depending on the DAC innards putting, say, 2dB of the attenuation on the DSP and right shifting by three bits in the DAC might be optimal.

If you're more confused than when you started reading this you're probably on the right track. The arbitrariness of the point takes a little getting used to and, if one's really trying to get every bit of resolution possible, it tends to come down to trying a few different implementations and measuring the DAC output in the analog domain to see what's best. Throwing guard bits at the problem has a way of being more convenient. That's one reason why the first paragraph in post 443 comes up with 28 bits whilst the second says 21 bits.

you can't move the point on a Q0.31 fixed point DSP so it's not possible to efficiently multiply by a number greater than 1 !!

The Sigma DSP and the TI TAS DSPs are the only DSPs with the ability to scale numbers by 1 or more, but only up to a limit of, say, +/-32 due to their Q5.23 fixed point formats.
 
Sorry, no. The location of the point is a matter of interpretation. The bits in the math stay the same. Hardware support for normalization of certain formats allows saving a cycle for a shift operation and may make better use of guard bits within MACs. I suggest working through a few examples to understand how the shifting works; you may find it helpful to look at how GetOptimalNumberOfFractionalBits() is used in this implementation.
 
Sorry, no. The location of the point is a matter of interpretation. The bits in the math stay the same. Hardware support for normalization of certain formats allows saving a cycle for a shift operation and may make better use of guard bits within MACs. I suggest working through a few examples to understand how the shifting works; you may find it helpful to look at how GetOptimalNumberOfFractionalBits() is used in this implementation.

So how do I multiply a Q0.31 number by 10.12 using a single cycle multiply instruction ?
 
Optimal utilization of fractional bits for just the one multiply would be Q0.31 * Q4.27 = Q4.59. See linked code, the instruction set section of the manual for your processor of choice, and disassembly of your compiler's output for implementation details. ;)
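For example---the constant, helper, and output format choice below are my own illustration, not the linked code: 10.12 stored as Q4.27, a 64 bit product from the single multiply, and one shift to pick the output format.

#include <stdint.h>

#define C_10P12_Q27  ((int32_t)(10.12 * 134217728.0 + 0.5))   // 10.12 as Q4.27 (2^27)

// Q0.31 * Q4.27 gives a 64 bit product with 58 fractional bits (SMULL on the M4).
// Shifting right by 31 leaves a Q4.27 result, so magnitudes up to +/-16 fit.
static int32_t mul_by_10p12(int32_t x_q31)
{
    int64_t p = (int64_t)x_q31 * C_10P12_Q27;
    return (int32_t)(p >> 31);
}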

I use a SHARC DSP where the compiler has native support for Q0.31 fracts. There is no way that I know of to represent any number above 1 or below -1 using this fractional format, which is OK for your plain vanilla IIR and FIR filters, but if you need a volume control etc. then you need to multiply by a number greater than 1 !!

i.e.

fract X=10.21r; // produces error: fractional constant out of range

fract Y=0.05r; // this is ok

Y=Y*10.21r; // produces error: fractional constant out of range

and guess what happens if I multiply by a floating point number instead

Y=Y*10.21; // produces zero even when the arithmetic is not saturated

sorry no dice :(
 
I'm not familiar with VisualDSP++ as I don't have the $3600 lying around for a license. But it's unsurprising that range checking against strongly typed fixed point values would require constants to be prescaled before presentation to the compiler. That's just good tool behaviour. Weaker typing like you get in C# due to its lack of fixed point support makes it easier to slip the point, but it has all the usual disadvantages of reduced typing. Encapsulating scaling in a biquad object such as the example I linked to mitigates this; the same approach works for FIR.

0.05r * 10.21 = 0r seems like a bug. I'd expect to get 0x41581062 back, give or take a little depending on how exactly the multiply ends up being performed. Or at least an error flag getting set. It's likely you'd have better luck inquiring in Analog's forums than on DIYA.
 
I'm not familiar with VisualDSP++ as I don't have the $3600 lying around for a license. But it's unsurprising that range checking against strongly typed fixed point values would require constants to be prescaled before presentation to the compiler. That's just good tool behaviour. Weaker typing like you get in C# due to its lack of fixed point support makes it easier to slip the point, but it has all the usual disadvantages of reduced typing. Encapsulating scaling in a biquad object such as the example I linked to mitigates this; the same approach works for FIR.

0.05r * 10.21 = 0r seems like a bug. I'd expect to get 0x41581062 back, give or take a little depending on how exactly the multiply ends up being performed. Or at least an error flag getting set. It's likely you'd have better luck inquiring in Analog's forums than on DIYA.

why don't you try it on an M4 and let us know ;)

I can't see how you can do any sort of efficient multiplication for numbers greater than 1 on a DSP that is designed for a Q0.31 fixed point format. You may say that the position of the point is arbitrary but to the hardware it is not !! The only way around it is to either add something N times or do shift-left-and-adds for the whole number part and shift-right-and-adds for the fractional part. Both are inefficient processes and defeat the benefits of single cycle hardware multiplication. This is why the Sigma and TAS DSPs have allocated extra left bits to handle values greater than one ;)

and 0.05r * 10.21 is no bug simply because 10.21 cannot be converted to a valid fractional value for a Q0.31 dsp.
 
Example: you wish to multiply by 3.8

1. multiply by (3.8/4) = 0.95 < 1
2. Shift left 2x

In practice, you 'de-normalize' the data only when you leave the routine, so the shifting part does not really impact performance.

For instance, if you implement a biquad, you will have coefs that are greater than 1. So you divide all your coefs (by 2, 4, 8 - you're the one who decides). At the end of the biquad routine, you perform 1 single shifting operation on the data to undo the above coefficient scaling.
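A sketch of that in plain C, with structure and names of my own and no overflow checks: the five coefficients are stored pre-divided by 4, and the single left shift folded into the final scaling undoes it.

#include <stdint.h>

typedef struct {
    int32_t b0, b1, b2, a1, a2;   // true Q0.31 coefficients divided by 4 so they all fit below 1
    int32_t x1, x2, y1, y2;       // delay line, Q0.31
} biquad_q31;

static int32_t biquad_q31_run(biquad_q31 *f, int32_t x0)
{
    int64_t acc = (int64_t)f->b0 * x0
                + (int64_t)f->b1 * f->x1
                + (int64_t)f->b2 * f->x2
                - (int64_t)f->a1 * f->y1
                - (int64_t)f->a2 * f->y2;     // 64 bit accumulate of Q0.62 products
    int32_t y0 = (int32_t)(acc >> (31 - 2));  // >>31 back to Q0.31, <<2 undoes the /4
    f->x2 = f->x1;  f->x1 = x0;
    f->y2 = f->y1;  f->y1 = y0;
    return y0;
}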
 
Example: you wish to multiply by 3.8

1. multiply by (3.8/4) = 0.95 < 1
2. Shift left 2x

In practice, you 'de-normalize' the data only when you leave the routine, so the shifting part does not really impact performance.

For instance, if you implement a biquad, you will have coefs that are greater than 1. So you divide all your coefs (by 2, 4, 8 - you're the one who decides). At the end of the biquad routine, you perform 1 single shifting operation on the data to undo the above coefficient scaling.

OK I see what you are doing. For a biquad you would just multiply the numerator and denominator by 0.5, 0.25 or 0.125 etc so that all coefficients are less than one and then shift the result left by the required amount at the end of it.

Thanks for clarifying that.

But then why is the Q5.23 format used on the Sigma DSP if the standard Q0.31 DSP can achieve the same results? Is it about saving this extra shift operation? Surely there must be a catch here?
 
you 'de-normalize' the data only when you leave the routine, so the shifting part does not really impact performance
Yes, that's what the code I linked on the last page does, Trevor; it seems you haven't internalized it. But if you opt to, it's straightforward to see how to use SMLAL on the M4 to perform single cycle MACs for the biquad kernel, the same way multifunction operations are used on the SHARC. At least if one refers to the instruction set references in the processor manuals as suggested earlier---the SHARC programming manual I just glanced at provides the assembly for a biquad, for example. And if you spend a little time with the CMSIS DSP links provided in post 439 you'll see q31 is just a typedef for int32.

But then why is the Q5.23 format used on the Sigma DSP if the standard Q0.31 DSP can achieve the same results?
I assume you meant Q0.27---the SigmaDSPs are 28 bit, after all. While the point floats you do have to keep track of its position to correctly shift and truncate the accumulator at the end of a biquad. Last I checked Analog didn't publish the SigmaDSP instruction set as SigmaStudio is pretty hands off. So it's difficult to speculate meaningfully about how the platform implements biquads. The Sigmas are low end enough I would guess it's probably just the usual shift at the end of the routine; certainly it's my impression that if one wants the scaling control to move the numeric noise floor down a couple bits the story is to change to a more open platform. Like the M4.

Silently ignoring input is just about always a bug. There are numerous ways to generate correct code for 0.05r * 10.21, such as casting to float and back to Q31 or shifting from Q4.59 to truncate to Q0.31. If VisualDSP++ doesn't support that the correct behaviour would be a compile error.
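In plain C rather than VisualDSP++ fract syntax, the float round trip would look something like this (the helper is mine, and there's deliberately no saturation check):

#include <stdint.h>

// Q0.31 -> double, apply the out-of-range constant, re-quantize to Q0.31.
static int32_t q31_mul_by_real(int32_t x_q31, double k)
{
    double y = (x_q31 / 2147483648.0) * k;
    return (int32_t)(y * 2147483648.0);
}

// q31_mul_by_real(0.05 in Q0.31, 10.21) comes back in the neighbourhood of 0x41581062.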
 
Yes, that's what the code I linked on the last page does, Trevor; it seems you haven't internalized it. But if you opt to, it's straightforward to see how to use SMLAL on the M4 to perform single cycle MACs for the biquad kernel, the same way multifunction operations are used on the SHARC. At least if one refers to the instruction set references in the processor manuals as suggested earlier---the SHARC programming manual I just glanced at provides the assembly for a biquad, for example. And if you spend a little time with the CMSIS DSP links provided in post 439 you'll see q31 is just a typedef for int32.


I assume you meant Q0.27---the SigmaDSPs are 28 bit, after all. While the point floats you do have to keep track of its position to correctly shift and truncate the accumulator at the end of a biquad. Last I checked Analog didn't publish the SigmaDSP instruction set as SigmaStudio is pretty hands off. So it's difficult to speculate meaningfully about how the platform implements biquads. The Sigmas are low end enough I would guess it's probably just the usual shift at the end of the routine; certainly it's my impression that if one wants the scaling control to move the numeric noise floor down a couple bits the story is to change to a more open platform. Like the M4.

Silently ignoring input is just about always a bug. There are numerous ways to generate correct code for 0.05r * 10.21, such as casting to float and back to Q31 or shifting from Q4.59 to truncate to Q0.31. If VisualDSP++ doesn't support that the correct behaviour would be a compile error.

My understanding of the hardware difference between integer multiplication and fractional multiplication is that integer multiplication involves shift-left-and-adds whereas fractional multiplication involves shift-right-and-adds.

From the ADAU1701 Sigma DSP data sheet.

NUMERIC FORMATS
DSP systems commonly use a standard numeric format. Fractional number systems are specified by an A.B format, where A is the number of bits to the left of the decimal point and B is the number of bits to the right of the decimal point.
The ADAU1701 uses the same numeric format for both the parameter and data values. The format is as follows.
Numerical Format: 5.23
Linear range: −16.0 to (+16.0 − 1 LSB)

One of the advantages of the 5.23 format is that you can specify a fixed point number or filter coefficients up to +/-16 without worrying about any prescaling or bit fiddling.
 