Cheap ARM MCUs for RBCD audio

I've been spending time writing assembly code for the HC32F460 Cortex M4 (at last!) these past couple of weeks and have been finding this book extremely useful: https://www.amazon.com/Definitive-Guide-Cortex®-M3-Cortex®-M4-Processors/dp/0124080820. Given that ARM's documentation of the M4 instruction set is spread over at least two documents, this tome brings the useful info together nicely. I did find, though, that it has significant errors in its description of one family of FPU instructions: all the VCVT instructions that I tried are described back to front. That had me scratching my head for a few days until I dove into ARM's FPU instruction encoding details. The M4's FPU turns out to be useful even when not doing any floating point arithmetic, as it contains a large register bank (twice as many registers as in the M4's 'core') which can be valuable when all the core registers are maxed out (which is often the case when I'm coding assembler). At first I was worried that storing integers in FP regs would give an error when the stored value wasn't a valid FP number, but this turns out not to matter.
 
I'm guessing there are a few isolated pockets of the art being perpetuated, perhaps mainly amongst those who cut their teeth at a level below assembler (toggle switches). Certainly from my perspective it makes the most sense when writing DSP code in the way I do it - which involves paying a lot of attention to execution time so I have no need to use interrupts. I'd not recommend assembler for most programming though, particularly not user interfaces.
 
I use assembler simply because I enjoy the process much more. C coding is too abstract, and even after coding there's the inevitable process of finding out how the compiler's too dumb to minimize the execution time. I do have a useful benchmark for C: Paul Beckmann's AES paper on DSP is an excellent resource (I'll link to it at the end) and he gives a figure of 2.65 CPU cycles per FIR tap for Cortex M4. I assume he got that with 16-bit coefficients and 16-bit data. I am reasonably confident a well-honed assembly routine will be able to beat that but I'm not sure by how much.

Having looked again at Mr Beckmann's comparisons I see I had missed important context on the slide before the one where he cites the 2.65 cycles figure: he's running floating point to get it. The FPU's MAC instructions all take 3 cycles so he can't be using them; he must be splitting the multiply from the accumulate, as each of those instructions takes only a single cycle. Running fixed point with only 16-bit data and coefficients makes it possible to use the SIMD dual-MAC instruction 'SMLALD', which surprisingly still takes only 1 cycle. Getting the data and coefficients into this instruction then becomes the bottleneck, but 1.65 CPU cycles per tap should be achievable based on PB's number, meaning that the SAA7220's 120-tap OS filter at 44k1 would need about 17M CPU ticks per second for the filter itself. Adding on overheads suggests a clock rate of around 20MHz might do the trick. Such a modest (early-1990s) CPU clock frequency would have the F460 sipping ~4mA, which is a huge power saving over the 180mA of the SAA7220.
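To make the dual-MAC idea and the cycle budget concrete, here is a rough Python model. It is only a sketch: `smlald` mimics what the instruction computes (two signed 16x16 products added into a wide accumulator) but not its 64-bit wrap-around, and the budget assumes stereo with every one of the 120 taps evaluated once per 44.1kHz input sample per channel. That assumption is not stated above; it is simply one reading that lands near the 17M figure. All function names are made up for illustration.

```python
def s16(value):
    """Interpret the low 16 bits of value as a signed halfword."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def pack(low, high):
    """Pack two 16-bit values into one 32-bit word, as SMLALD expects them."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)

def smlald(acc, rn, rm):
    """Model of SMLALD: acc + bottom*bottom + top*top.
    The real instruction wraps at 64 bits; that is not modelled here."""
    return acc + s16(rn) * s16(rm) + s16(rn >> 16) * s16(rm >> 16)

def fir_dot(samples, coeffs):
    """One FIR output using two taps per 'instruction' (even tap count assumed)."""
    acc = 0
    for i in range(0, len(samples), 2):
        acc = smlald(acc,
                     pack(samples[i], samples[i + 1]),
                     pack(coeffs[i], coeffs[i + 1]))
    return acc

def filter_cycles_per_second(taps, input_rate_hz, channels, cycles_per_tap):
    """CPU cycles per second spent on the filter alone (assumed workload model)."""
    return taps * input_rate_hz * channels * cycles_per_tap

print(fir_dot([1, -2, 3, 4], [5, 6, -7, 8]))            # same as the plain dot product
print(filter_cycles_per_second(120, 44100, 2, 1.65) / 1e6)  # in millions of cycles/s
```

Under those assumptions the budget comes out at roughly 17.5M cycles per second, which is consistent with a clock of around 20MHz once overheads are added.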
 
I've just ordered one of these STM32H750 boards as the price is a tad over $7.

STM32H750VBT6 board listing (QSPI, CubeMX, STM32H7)

This CPU is something of a beast in the world of M-series. It runs at 400 (or maybe even 480) MHz and is dual-issue, meaning some instructions can run in parallel. It even has a double-precision FPU. What's particularly unusual for an ARM SoC is it has an on-chip S/PDIF-AES/EBU receiver and also 4 channels of S/PDIF output. The reference manual is equally a monster, running to over 3300 pages.

I've been looking around for some time for more in-depth information on the Cortex M7, in particular how to optimize assembly code for it to make use of the dual-issue capability. Today I found this, which acknowledges that such info on the M7 is something ARM hasn't (yet) made public: https://www.quinapalus.com/cm7cycles.html. It also reveals that the double precision FP operations take 2 cycles in their simplest form, vs 1 cycle for single precision.
 
I could probably translate the assembler into C but it would require a substantial re-write because in C it's not permitted to use all the CPU registers. For example, the stack pointer and link register aren't available as they're dedicated to use by C, but I make good use of both of those.

Here's the frequency response of Pacific Microsonics' PMD100 digital filter chip, which is generally regarded as the best sounding of its genre. I figured I'd have a go at working out whether a cheap Cortex M4 could implement it.

[Image: PMD100 frequency response plot]


Looks as though the stopband starts just below 23kHz from eyeballing the plot. When I put the parameters into http://t-filter.engineerjs.com I found it needed in the region of 250 taps to implement. Yet the DS for the chip says the latency is 83 input samples (independent of OS ratio selected). That says their filter - which I'm assuming is linear phase - is 166 taps long. So something doesn't stack up, either T-filter's not giving me an optimized set of coefficients or PMI have some secret sauce?
 
I spent quite a bit of time with it and found it's quite easy to send the Javascript into weird spasms where it ignores one or more of the parameters and/or gives a non-flat stopband. I wanted a second opinion so I downloaded the 'Anaconda' package which has a Remez exchange function. It's even more pessimistic about what can be achieved with 166 taps than T-filter, but it's very early days for me in running Python, I'm a total noob.

[Image: first Remez design attempt]


My first stab using Spyder Python, 166 taps, transition band 3kHz.
 
Quoting my earlier post: "Yet the DS for the chip says the latency is 83 input samples (independent of OS ratio selected). That says their filter - which I'm assuming is linear phase - is 166 taps long. So something doesn't stack up, either T-filter's not giving me an optimized set of coefficients or PMI have some secret sauce?"

Having looked in a little more depth into implementing a 2X OS filter, I now doubt that my translation of '83 input sample delay' into a filter length of twice that number is correct. Here's my updated thinking. Take the case of 2XOS, where the output samples are generated at 88.2kHz. If the filter's 166 taps long, its total length represents 83 samples @ 44k1, but the group delay of a linear phase filter is half of its length (the impulse response is symmetrical about the centre point). So that group delay would correspond to only 41.5 input samples. It therefore looks as though '83 input sample delay' actually corresponds to 4X that number of taps in a 2XOS filter, i.e. 332.
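The relationship between tap count, oversampling ratio and delay can be put in a couple of lines of Python. This is just a sketch of the reasoning above: the function names are invented, and it uses the exact (N-1)/2 group delay of a linear phase FIR, so the numbers differ slightly from the N/2 approximation used in the text.

```python
def group_delay_input_samples(num_taps, os_ratio):
    """Group delay of a linear phase FIR running at os_ratio times the
    input rate, expressed in input-sample periods."""
    delay_in_output_samples = (num_taps - 1) / 2
    return delay_in_output_samples / os_ratio

def taps_for_delay(delay_input_samples, os_ratio):
    """Approximate tap count giving the stated delay (N/2 approximation)."""
    return round(2 * delay_input_samples * os_ratio)

print(group_delay_input_samples(166, 2))  # close to the 41.5 worked out above
print(taps_for_delay(83, 2))              # tap count for an 83-sample delay at 2X OS
```

For 2X OS this gives 332 taps for an 83 input sample delay. The same formula implies the tap count scales with the OS ratio if the delay really is independent of it.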
 
I have, at long last, a first approximation to a 2X OS digital filter with 256 taps running on my HC32F460 hardware now. There are a few hidden gotchas in getting this part going - like the protection registers - but I am able to confirm that running at 100MHz, the current draw is approximately 10mA. Which is pretty impressive and in line with what the datasheet says for a 120MHz clock rate. From a debug output which I observe on my scope, I have about 4µs idle time per output sample, which tends to indicate I could run at quite a bit slower a clock, maybe more like 64MHz. Going to give that a try.
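The 64MHz guess can be sanity-checked with a little arithmetic. A rough Python sketch, assuming 'per output sample' means one 88.2kHz output period and that the number of cycles needed per sample doesn't change with the clock rate (for instance, no change in flash wait states). The function name is only illustrative.

```python
def min_clock_hz(clock_hz, output_rate_hz, idle_seconds):
    """Lowest clock that still fits the measured workload into one output period."""
    period = 1.0 / output_rate_hz                   # time available per output sample
    busy_cycles = (period - idle_seconds) * clock_hz  # cycles actually used per sample
    return busy_cycles * output_rate_hz

# 100MHz clock, 88.2kHz output rate, 4 microseconds idle per output sample
print(min_clock_hz(100e6, 88200, 4e-6) / 1e6)
```

With those inputs the result is a little under 65MHz with zero idle time left, so 64MHz would be right on the edge rather than comfortable.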

This is the evaluation module I'm using: https://www.aliexpress.com/item/1005004405232164.html. I don't recommend buying from Aliexpress because that $17 price is unbelievably high. Over on Taobao, it goes for just under $3.
 

mkc

Member
Joined 2002
Paid Member
Hi abraxalito,

Just lurking around here.

You might already know this, but I just want to draw your attention to the DWT module which most Cortex-M3 and M4 parts implement. It has, among other things, a cycle counter that can be used to profile parts of your code. You will have to read the DWT_CYCCNT register before and after the function or piece of code whose execution time you want to measure.

The result can be read depending on which debugger you use.
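Once the two DWT_CYCCNT readings have been pulled out via the debugger, turning them into an execution time is simple host-side arithmetic. A minimal Python sketch, assuming the counter is the usual free-running 32-bit register (so one wrap between the readings is handled by the mask). The function names and the 100MHz clock are example values only.

```python
def elapsed_cycles(before, after):
    """Cycles between two DWT_CYCCNT readings, tolerating one 32-bit wrap."""
    return (after - before) & 0xFFFFFFFF

def cycles_to_us(cycles, core_clock_hz):
    """Convert a cycle count to microseconds at the given core clock."""
    return cycles * 1e6 / core_clock_hz

# Example readings where the counter wrapped between the two reads
cycles = elapsed_cycles(0xFFFFFF00, 0x00000064)
print(cycles, cycles_to_us(cycles, 100e6))
```

The mask only copes with a single wrap, so at 100MHz the measured section has to be shorter than about 43 seconds.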

Cheers,
Mogens
 
Thanks Mogens, I didn't know about that. Since you mentioned debuggers, I'll just dish a little bit of dirt here on Keil's simulator as regards execution times. Overall, I love the simulator (it is free after all, for <32kbytes of object code) and it has saved me a LOT of debug time in terms of getting my algorithm sorted out. But there is one place where it sucks badly, and that's on the execution time it attributes to Cortex M4. The 'native' M4 instructions are fine but those instructions also implemented by the M3 run at the timings of the M3, not the M4. One example is the SMLAL 32*32 with 64bit accumulate - it takes a variable number of cycles on M3 (from memory 4-7) but executes in just 1 cycle on M4. The simulator is told that the CPU's an M4 but it insists on the timings being those of the M3.

I opened a case with ARM about this and the response from them was, in short, 'That's a feature, not a bug'.
 

mkc

Hi abraxalito,

I haven't worked much with the Keil tools. They appear to be well regarded. But not all simulators are created equal and some are not cycle-accurate. There is also Segger Embedded Studio, which is a renamed Rowley CrossWorks IDE. I have never used the simulator in that, but it's free for non-commercial work so you could give it a try.

Segger also offer a tool that allows the ST-Link on Nucleo boards to be updated to act as a J-Link, although not with the same performance. They also provide a reasonably cheap J-Link debugger for educational purposes.

Mogens
 
Taobao has the HC32F460JETA on a module for 14.9RMB, that's just a smidgen over $2 so getting close to 100MIPS/USD, not bad at all: https://item.taobao.com/item.htm?sp...1mx4ix&id=675002044133&ns=1&abbucket=8#detail

I've been trying to figure out why some of my estimated timings for parts of the filter code haven't matched up with reality and I'm seriously wondering if there's actually a significant error in the HC32F460 reference manual. It says that the FIFOs in the I2S peripherals are 2 words deep, yet from my experiments poking bytes through my J-Link, it's looking more like they're 4 words deep. Which has confused my code no end as I was polling for the FIFO to be full, assuming that meant 2 words present... Oh and I2S 'master' mode turns out to be not so useful when there's a non-bypassable divide-by-4 between the external clock pin and the peripheral. Slave mode it is then!
 
I have now gotten music playing sweetly through the 2XOS filter: https://www.diyaudio.com/community/...-rbcd-multibit-dac-design.324933/post-7593453

The biggest challenge - beyond the algorithm itself for convolution - has been with the clocking. The MCU has a couple of PLLs on-chip but I've been loath to use a PLL out of concern for jitter on the multiplied-up BCK needed to run 2XOS. So given that initially I'm only really interested in 2XOS and nothing higher, I have decided to repurpose the input 64fs BCK (2.82MHz) and run the 2X output with only a 16-bit frame size, meaning the input BCK can be used as-is. This does rather limit the DAC that can be driven to being only a 16-bit one. To go beyond that limit I'll need I2S to be fed in with an MCLK on the 4th wire (11.28MHz typically).
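The clock arithmetic behind that choice is easy to check. A small Python sketch with an invented helper name. The 256fs figure for MCLK is an inference from the 11.28MHz quoted above, not something stated in the text.

```python
FS = 44100  # RBCD sample rate in Hz

def bck_needed(sample_rate_hz, bits_per_channel, channels=2):
    """Bit clock needed to shift out one frame per sample period."""
    return sample_rate_hz * bits_per_channel * channels

bck_in = 64 * FS                      # incoming 64fs bit clock
print(bck_in)                         # about 2.82MHz
print(bck_needed(2 * FS, 16))         # 2XOS output with 16 bits per channel
print(bck_needed(2 * FS, 32))         # 2XOS output with 32 bits per channel
print(256 * FS)                       # assumed 256fs MCLK, about 11.29MHz
```

With 16 bits per channel the 2XOS output needs exactly the incoming 64fs bit clock. A wider frame at 2XOS would need 128fs, which is where a PLL or an external MCLK comes in.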

The MCU's output port running in slave mode needs to be fed with both BCK and WS. BCK is fine as that's the input BCK, but creating the 2XOS (88.2kHz) WS was a bit of a puzzle. I didn't want an external FF or counter chip if that could be avoided, not just for the additional BOM but also because it would need a means to synchronize it, which could be rather fiddly. In the end I settled on creating the WS synthetically from the spare output side of the input port. It means an increase in the housekeeping duties to keep it topped up but brings some nice flexibility, meaning that the TDA1545's non-I2S input format can be handled with just a change to the data pattern sent to the output port. The Tx port to the DAC runs from the input WS because that gives the most FIFO space, so only the DAC sees the synthesized WS. With this arrangement I don't have to worry about the channels getting out of sync with each other on the output, as both are written to the port in the same 32-bit word.
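As an illustration of how a data line can stand in for a word select, here is a Python sketch. It assumes the spare transmitter shifts out 32-bit words MSB first at the 64fs BCK, so a word that is half zeros and half ones looks like one period of an 88.2kHz square wave. The actual pattern, polarity and alignment used on the hardware aren't given above, so treat the values and names here as hypothetical.

```python
def ws_word(slot_bits=32, high_first=False):
    """Word whose bits form one square-wave period: half low, half high."""
    half = slot_bits // 2
    ones = (1 << half) - 1
    return (ones << half) if high_first else ones

def bits_msb_first(word, width=32):
    """Expand a word into the bit sequence seen on the wire, MSB first."""
    return [(word >> (width - 1 - i)) & 1 for i in range(width)]

def rising_edges(bits):
    """Count 0 -> 1 transitions in a bit sequence."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a == 0 and b == 1)

# Two 32-bit slots make up one 64-BCK input frame at 44.1kHz
frame = bits_msb_first(ws_word()) * 2
print(hex(ws_word()), rising_edges(frame))
```

Two rising edges per 44.1kHz input frame means the synthesized signal toggles at 88.2kHz. A different DAC input format would, on this model, just need a different word.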

Next up, I want to be able to bypass 2XOS and revert to NOS in real-time with a switch for A/B comparison. I figure the most straightforward way to do this would be to allocate some RAM space to an additional set of coefficients which don't do any filtering at all, just a single '1' in the middle and all others zero. That won't be 100% pure NOS as the DAC still runs at 88k2 but it'll get two adjacent samples the same which for a first pass is close enough I reckon.
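A quick way to check the pass-through idea is to model it in Python. This sketch uses the textbook zero-stuffing form of a 2X OS filter rather than the actual assembler routine, and the names are invented. In this form, repeating each sample takes two adjacent unit taps in the middle, which amounts to a single '1' in each of the two polyphase coefficient sets.

```python
def upsample2x(samples, coeffs):
    """2X oversampling: insert a zero after every sample, then FIR filter."""
    stuffed = []
    for s in samples:
        stuffed += [s, 0]
    out = []
    for i in range(len(stuffed)):
        acc = 0
        for k, c in enumerate(coeffs):
            if i - k >= 0:          # samples before the start count as zero
                acc += c * stuffed[i - k]
        out.append(acc)
    return out

def passthrough_coeffs(num_taps):
    """All-zero coefficient set apart from two adjacent unit taps in the middle."""
    h = [0] * num_taps
    mid = num_taps // 2
    h[mid - 1] = 1
    h[mid] = 1
    return h

print(upsample2x([1, 2, 3, 4], passthrough_coeffs(8)))
```

After the initial filter delay each input sample comes out twice in a row, which is the 'two adjacent samples the same' behaviour described above, with the same latency as a real filter of that length.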
 