Arduino Spectrum Analyser 24bit 256KS/s

Looking at the touchscreen displays, the Zero has compatibility issues with some of them. The Pi 4 seems best placed for compatibility - of those options the QLED doesn't need GPIO ports (7" QLED IPS Capacitive Touch Display for Raspberry Pi (1024x600)– The Pi Hut), which means you get a good, responsive display, although processing power is needed for the screen size... so it's resolution vs speed.

Just need to work out the maximum GPIO 32-bit read speed.
The Zero seems to manage 20M transfers/sec, i.e. SMI 4-byte GPIO transfers max out at 80MByte/sec. The 4 seems to be about the same but has more hardware contention, so more like 45MByte/sec. So we obviously have to be concerned with bus saturation.

4MByte/sec is all we need (1Msps at 24 bits + 8 bits of padding = 4 bytes per sample). However running a DFT on that stream will add to the saturation, and rendering on top is likely to max out the bus.
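As a sanity check on the arithmetic (a throwaway C++ snippet - the 45MByte/sec budget figure is the SMI estimate above):

Code:
// Back-of-envelope bus budget: 24-bit samples padded to 32 bits.
#include <cstdio>

int main() {
    constexpr double sampleRate = 1'000'000.0;  // 1Msps
    constexpr double bytesPerSample = 4.0;      // 24-bit sample + 8-bit pad
    const double rawMBps = sampleRate * bytesPerSample / 1e6;
    std::printf("raw stream: %.1f MByte/sec\n", rawMBps);    // 4.0
    std::printf("share of a 45 MByte/sec budget: %.1f%%\n",
                100.0 * rawMBps / 45.0);                     // ~8.9%
    return 0;
}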

An alternative approach (and one used by CERN originally) is to use an FPGA to DFT the signal stream in real time. The difficulty is that there's more data in the rolling DFT per sample! So the option here is to buffer the HDMI and mix the DFT into the image. However it is not particularly flexible and far too complex.

The last resort for the Pi is to use one of the newer developments that allows a graphics card to be added, providing a GPU and GDDR RAM that doesn't share the Pi's memory. All the Pi then needs to do is upload the data, orchestrate the GPU kernels (programs), and - without leaving the graphics card - render the result to the screen.

Or USB the data and let a larger PC do the work.
 
NickKUK: 1Msps 24 bits can easily be handled by the Linux USB UAC2 gadget running on RPi4 (tested) or very likely the new RPi Zero 2 (not tested). Linux and Win10 stock UAC2 drivers handle any reasonable samplerate/sample size/channel count (linux tested up to 4MHz, win10 wasapi exclusive tested up to 1.5MHz samplerate). Once the stream is in the PC, many options are available. Stock REW runs at 1.5MHz samplerate on linux/768kHz on Win. Linux has other analyzers available which run at any samplerate (e.g. jaaa tested to run OK at 19MHz samplerate). You may convince REW author to add support for non-standard samplerates read from some ENV variable.

IMO the only issue is getting the stream from your board into RPi reliably. Once the data are there, it's (relatively) simple.

I would be cautious about sampling + processing + displaying in RPi(4). The VideoCore4 GPU is nothing special, A72 Neon acceleration is quite weak for f32 (and none for f64) https://www.diyaudio.com/forums/pc-...rossovers-correction-etc-236.html#post6798125
 
Just looking at a real nasty option:

ADC -> CPLD -> FPGA -- CAMERA PORT --> PI

The concept here is that the CPLD manages the serial-to-parallel conversion, reducing the load on the FPGA. The FPGA then acts as a store-and-forward stage - building a frame of a 24-bit image from successive sample outputs. You then build up a large number of samples (1M/sec at 30 or 60fps, so for 30fps that's 33,333 pixels per frame) which are sent to the camera port of the Pi.
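As a rough model of the FPGA's framing job (plain C++ standing in for the HDL; the 184x184 geometry and the RGB888 mapping are my assumptions, not a worked design):

Code:
// Pack 24-bit ADC samples into an RGB888 'camera' frame: one sample per pixel.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Frame {
    // 184 x 184 = 33,856 pixels, enough for the ~33,334 samples per 30fps frame
    static constexpr int width = 184, height = 184;
    std::vector<uint8_t> rgb = std::vector<uint8_t>(width * height * 3, 0);
};

// R = bits 23..16, G = bits 15..8, B = bits 7..0 of the sample.
void packSample(Frame& f, std::size_t index, uint32_t sample24) {
    uint8_t* p = &f.rgb[index * 3];
    p[0] = (sample24 >> 16) & 0xFF;
    p[1] = (sample24 >> 8) & 0xFF;
    p[2] = sample24 & 0xFF;
}

int main() {
    Frame f;
    for (std::size_t i = 0; i < 33'334; ++i)    // one 30fps frame's worth
        packSample(f, i, static_cast<uint32_t>(i) & 0xFFFFFF);
    return 0;
}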

The Raspberry Pi has the camera port connected directly to the GPU via its own port.

Details here: 6. Camera Hardware — Picamera 1.13 Documentation

This means it's a DMA into the GPU RAM section, held as GPU buffers (i.e. OpenGL textures) that can be processed using OpenCL computation, including FFT.

This bypasses all of the peripherals but still has the shared memory contention to cope with.
 
NickKUK: 1Msps 24 bits can easily be handled by the Linux USB UAC2 gadget running on RPi4 (tested) or very likely the new RPi Zero 2 (not tested). Linux and Win10 stock UAC2 drivers handle any reasonable samplerate/sample size/channel count (linux tested up to 4MHz, win10 wasapi exclusive tested up to 1.5MHz samplerate). Once the stream is in the PC, many options are available. Stock REW runs at 1.5MHz samplerate on linux/768kHz on Win. Linux has other analyzers available which run at any samplerate (e.g. jaaa tested to run OK at 19MHz samplerate). You may convince REW author to add support for non-standard samplerates read from some ENV variable.

IMO the only issue is getting the stream from your board into RPi reliably. Once the data are there, it's (relatively) simple.

I would be cautious about sampling + processing + displaying in RPi(4). The VideoCore4 GPU is nothing special, A72 Neon acceleration is quite weak for f32 (and none for f64) https://www.diyaudio.com/forums/pc-...rossovers-correction-etc-236.html#post6798125

I've focused on minimising the transfer cost - although 4MB/sec is not much, I want to keep as much bus bandwidth as possible for the actual processing, with transfers as efficient as possible.

Agreed, the GPUs in embedded systems aren't great, but they do offer a forward strategy - such as using a PCI-E 1x based graphics card with little change. A small discrete GPU would wipe the floor with the Pi in terms of GPU memory and parallel processing bandwidth.

In the case of an external GPU, I'd expect the camera frames to be mapped into shared memory space (in GPU terms, "shadow RAM"); currently that then requires a shift into the external GPU's shadow RAM before being uploaded into the external GPU over PCI-E.

Historically GPUs are excruciatingly slow at data upload into GPU memory, the bus being balanced in favour of operation and download speed, although compute systems like the Nvidia Tesla have rebalanced the bus memory controller to allow better two-way speed.

If you can use a discrete GPU then the RPi is simply coordinating, which is a very low load - and the display can then be connected to the external GPU.
 
Also the use of buffers in the GPU means you can build overlays quickly into the final rendered image. Considering the two approaches, using common components ((c) indicates coding/development needed):

ADC->CPLD(c)->FX3(c)-USB->PC/Mac/Linux user mode USB driver -> software (c)
ADC->CPLD(c)->RPi Linux (c) -> software (c)
ADC->CPLD(c)->RPi Linux (c) ->PCI-E GPU, software (c)
ADC->CPLD(c)->FPGA(c)->RPi Linux->....

So there's plenty of options to make things faster/complicate things! I think the first step is to keep it simple initially - the less (c) the better.

The main purpose is for audio - amplifier design and build testing. Two channels would be useful in the longer term, with maths between them. I prefer realtime - more of a preference than a requirement. This would be a step up from a sound card ADC or a 10-bit ADC.

The reason for suggesting USB is simply to offer a fast development environment and then processing power in a VM/natively. In the past I've tried using VNC over WiFi but it's really too slow for development.

However I also like the standalone option with a touchscreen - once the thing is built and running you don't have to worry about remembering how to connect it up... just use it.
 
Just checking the camera port option.
1. The RPi camera port is MIPI CSI-2 with two data lanes (~2Gbit/sec); a sensor such as an OmniVision OV5647 plugs in with a ZIF15 cable. An FPGA could spoof being a simple camera to provide the streamed data as normal 'pixels'.
2. The GPU, or VideoCore, handles that connection and runs a proprietary pipeline that has been reverse engineered, including an understanding of the supported assembler operations.
The issue here is that the pipeline does some optical processing, including chromatic processing - all defined by what the GPU reads from the camera IC registers. Hence the I2C implementation would need to return data that essentially nullifies any post-processing done by the GPU.
The kernel driver then does the work of making the frames available to operating system applications. The transfers of completed frames from the GPU shared memory segment to the CPU shared memory segment are done by DMA.

So in theory it would be possible to spoof a CSI-2 camera and feed data as 24-bit pixels into the GPU. The spoofed I2C registers are configured so that the GPU applies no post-processing and passes the frames through via DMA into main user space unaltered.

That is a shedload of FPGA work. Plus, as the information is guesswork, there's no guarantee it won't fail the next time an update comes out, without warning.

Although it represents the fastest option, it's also tied together with baling twine.

So it looks like a USB/parallel connection to the RPi is the next possible step for consideration.

The GPIO pins and their documentation are relatively straightforward; mapping the pins and then using an interrupt on a parallel clock signal would be the obvious way to progress.
It would need some driver work, and CPLD work. Just as the FX3 USB would need work.

The main concern is the number of GPIO pins. A CPLD could be made to shift the data in, with a '1' in the top (25th or 31st) bit enabling the clocking and raising an interrupt in the RPi for reading (in theory it should be possible to trigger a DMA with a control block that reads the data from the memory-mapped pin location and transfers it to the stream buffer).
There may be a more efficient way to do that using an intelligent DMA but I need to find that.
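For reference, this is roughly what the memory-mapped pin read looks like from user space on the BCM283x parts - polling rather than the DMA control block, and the bit mapping (data on GPIO 0-23, DATA READY on GPIO 24) is an assumption:

Code:
// Poll the GPIO level register (GPLEV0, byte offset 0x34 in the block that
// /dev/gpiomem exposes) - the same address a DMA control block would read.
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    int fd = open("/dev/gpiomem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/gpiomem"); return 1; }

    void* base = mmap(nullptr, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    volatile uint32_t* gplev0 =
        reinterpret_cast<volatile uint32_t*>(static_cast<uint8_t*>(base) + 0x34);

    uint32_t levels = *gplev0;                  // all 32 pin levels in one read
    uint32_t sample24 = levels & 0x00FFFFFF;    // assumed: data on GPIO 0-23
    int dataReady = (levels >> 24) & 1;         // assumed: DATA READY on GPIO 24
    std::printf("ready=%d sample=0x%06X\n", dataReady, sample24);

    munmap(base, 4096);
    close(fd);
    return 0;
}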
 
Just editing the previous post but ... got locked out so more:

GPIO DMA interrupt example: Raspberry-Pi-DMA-Example/dma-gpio.c at master · Wallacoloo/Raspberry-Pi-DMA-Example · GitHub

Also if we don't use an FX3 we don't need the 32-bit transfer width, so we can make use of the upper 8 signal lines.

[p1:24] => GPIO data pins (used only for the continuous high-speed 24-bit sample reads)
[p25] => DATA READY -> Interrupt -> DMA [p0:24] => memory buffer
[p26] => SPI <-> RPi port I
[p27] <= SPI <-> RPi port O
[p28] <= SPI <-> RPi port MLCK
[p29] <= SPI <-> RPi port CS (received but ignored)
[p30] <= ENABLE CONTINUOUS DATA drive pins (p1-p25). This would be clocked at a higher speed.
[p31] reserved


p30 = LOW: the SPI pins p26-29 are passed through to the SPI port on the ADC and clocked by the RPi.
p30 = HIGH: the SPI pins are not connected - the serial data read from the ADC is driven by the ADC's data-ready pin. The data shifted through p1:p24 is clocked, then completed with p25 = HIGH to signal data available. The clocking is done entirely by the ADC data-ready pin until termination is sent (p30 goes low). The p25 L-to-H transition causes an interrupt to read the data; on the next clock it is set low again.

p30 transition L to H causes a start-streaming command (hard coded in CPLD logic) to be sent to the ADC - it then streams continuously based on the set configuration. The CPLD-RPi SPI pins will be held low.
p30 transition H to L causes a 'termination' command to be sent (hard coded in the CPLD) to the ADC, cancelling the continuous streaming. The first operations over RPi SPI would probably be to clear any errors before making the port available to the application.
p30 L-L and H-H transitions are ignored.

Then it's up to the RPi to configure the ADC using native SPI (the ADC registers are all 8 bits wide so this is straightforward; only the ADC DATA register is 24-bit). Once ready for data, the RPi sets pin 30 high and hangs on... when the system no longer needs data or requires SPI communications, pin 30 is dropped LOW (bearing in mind there may be some operation/DMA in progress). After a sample period of time, SPI operations can resume (I don't like timed waits in concurrency but here it's fair).

Given the SPI clocking on the ADC is done at the SCLK rate, the RPi SPI port and normal libraries can be used. It's only when we want to blast data that we leave the clocking to the ADC.
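A sketch of that host-side flow using the stock spidev interface - the register addresses, values and SPI mode here are placeholders, not the real ADC's map:

Code:
// Configure the ADC over /dev/spidev0.0, then hand clocking over to the
// CPLD/ADC by raising p30 (p30 handling via libgpiod or similar not shown).
#include <cstdint>
#include <fcntl.h>
#include <linux/spi/spidev.h>
#include <sys/ioctl.h>
#include <unistd.h>

static int spiWriteReg(int fd, uint8_t reg, uint8_t value) {
    uint8_t tx[2] = { reg, value };             // 8-bit register, 8-bit value
    spi_ioc_transfer tr{};
    tr.tx_buf = reinterpret_cast<uintptr_t>(tx);
    tr.len = 2;
    tr.speed_hz = 1'000'000;
    tr.bits_per_word = 8;
    return ioctl(fd, SPI_IOC_MESSAGE(1), &tr);
}

int main() {
    int fd = open("/dev/spidev0.0", O_RDWR);
    if (fd < 0) return 1;

    uint8_t mode = SPI_MODE_1;                  // assumption: check ADC datasheet
    ioctl(fd, SPI_IOC_WR_MODE, &mode);

    spiWriteReg(fd, 0x01, 0x80);                // placeholder CONFIG register
    spiWriteReg(fd, 0x02, 0x0F);                // placeholder RATE register

    // ...raise p30 here and leave SPI alone until streaming is cancelled by
    // dropping p30 low again (plus the settling wait discussed above).
    close(fd);
    return 0;
}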

This means we have simple SPI integration, minimal CPLD work and minimal Linux GPIO/DMA work. I just need to decide whether to make it a callback with a file-based stream for the data output, or a callback with frames like the sound interface. My preference is the latter.
 
Woke up at 4am and sat in the dark thinking. The three main points of thought, which are strongly related:

1. Realtime - more accurately what is realtime to a user.
We have 1M samples/sec. For a small Arduino that's quite a bit of data. Why? For a number of reasons - some obvious and some not.
a) The sample rate max is 1M/sec. The conditional trigger rate would be 1MHz. The screen update rate is 50/60/100Hz. The maximum rate someone absorbs data is far lower - maybe 1-4 pieces of information per second.
b) FFTW requires static buffers (the same goes for most fast FFT algorithms) as the CPU instructions are compiled with respect to memory word boundaries. Therefore you would lose time simply DMAing into a memory area, copying into the input buffer of the FFT, FFTing, and then copying out the output to make room for the next. That would be a large waste of bus bandwidth, and so would simply DMAing in and shifting the data along inside the FFT input buffer. What is needed is an FFT mechanism that can take a sample at a time, add it to the current history, and FFT to give the frequency-domain output. As a first pass that simply removes the input memory moves: the DMA can add data into the FFTW input buffer in a circular fashion, and the processing takes the correct circular input and outputs the result correctly.
An extended form could be that the FFT code never 'completes' but simply maintains the current state and adds each sample (or block of samples).
This got me thinking about the speed - given we can't run an FFT on every one of 1M samples per second, we can't support immediate realtime triggering - but we can support triggering if it's seen in the data. If our screen refresh rate is, say, 1/16,666th of the sample rate, then we can batch the samples: let the DMA feed the circular buffer and run an FFT every N samples, where N could be 1000 samples or more. It also means we don't miss a trigger condition (I'm thinking a trigger condition would be something like: apply a spectrum-wide correlation and search for a threshold) - as long as the data is processed before the circular buffer overwrites it.
However we don't need to worry about losing the data - the GPIO bus can broadcast the data to multiple RPis: one may simply be FFTing for a trigger condition while another does a different task. The 'sample time' remains the same across the RPis.
Multiple RPis can then reduce the data and provide it to the main screen-rendering RPi if that complexity is needed. This could allow detection of aliases or harmonics etc.

2. The processing can be parallelised
The CPLD outputs p1:24 are broadcast across multiple GPIO connectors, with the 'data ready' pin clocking all the RPis.
For example - one process could filter the FFT to track a given signal and harmonics, providing the data (or even a compressed image, such as a bit field) to the main RPi for rendering with text numbers etc.
If we develop any functionality that is broken into:
* processControl = userinputprocesscontrol() - ie twiddling of widgets or parameters
* currentProcessState = process(currentProcessState, newSample, processControl)
* renderProcessState(currentWindow, currentProcessState)

Breaking things up this way means we can have one RPi doing control and rendering to the screen and one that simply processes, with an async common layer of data transport between them. Input control changes should effect immediate change, the processing will update as close to immediately as possible, and the rendering of the output state can be done on each update/trigger, or from current state on a screen redraw.
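A minimal single-machine sketch of that split (all names illustrative; across two RPis the shared state would be replaced by a socket or similar transport):

Code:
#include <atomic>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

struct ProcessState { std::vector<float> spectrum; };
struct ProcessControl { float triggerLevel = 0.0f; };  // widget twiddling lands here

static std::mutex stateLock;
static ProcessState sharedState;
static std::atomic<bool> running{true};

// currentProcessState = process(currentProcessState, newSample, processControl)
static ProcessState process(ProcessState s, float /*sample*/, const ProcessControl&) {
    return s;   // placeholder: fold the sample into the running spectrum here
}

static void processingLoop(const ProcessControl& control) {
    ProcessState local;
    while (running) {
        float sample = 0.0f;                    // would come from the GPIO/DMA buffer
        local = process(std::move(local), sample, control);
        std::lock_guard<std::mutex> g(stateLock);
        sharedState = local;                    // publish the latest state
    }
}

int main() {
    ProcessControl control;                     // stand-in for userinputprocesscontrol()
    std::thread proc(processingLoop, std::cref(control));
    for (int frame = 0; frame < 60; ++frame) {  // stand-in for the render loop
        ProcessState snapshot;
        { std::lock_guard<std::mutex> g(stateLock); snapshot = sharedState; }
        // renderProcessState(currentWindow, snapshot) would draw here
    }
    running = false;
    proc.join();
    return 0;
}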

Data logging could be achieved using a separate RPi that simply listens and records each sample - the fun part is that a separate class could simply replay a selected history into the processing in place of the realtime GPIO input. Same with stored and retrieved data.

The loose coupling of task components means we can be flexible without losing too much processing time.

3. Hellbent on reducing data transfers - ring buffers and delta processing.
My work with FFTW and GPU clFFT in the past means I'm aware of some of the complexities within them. These libraries compile execution plans - basically dynamically created functions that are memory-alignment aware, to make the most of the CPU's parallel instruction set or the GPU architecture. The downside is that you can't simply hand them an arbitrary area of memory containing the buffer - they're essentially hardcoded to a specific memory buffer. We don't want 1 million FFT plans prepared for execution, one per sample position in a circular buffer, but we can simplify in a couple of ways:
a) Break the buffer into 1000 blocks and have 1000 execution plans; this means we batch-feed the data into the FFT and the output jumps every 1000 samples. We don't miss processing any data - it's just no longer immediate in realtime terms. It would be sufficient for our needs but is a brute-force approach - costly in terms of memory and resulting 'glitchiness'. (See the FFTW plan sketch after this list.)

b) FFTW doesn't have the concept of ring/circular data. Images in 2D or 3D are handled not by processing a single 1D wrapped image but through decomposition - 2D is essentially a set of 1D transforms along X and a set along Y, and the same applies for 3D. That means the FFT is 1D applied along each axis in turn.
This gets complex very quickly - the memory-alignment compilation would need to be altered to cope with fetching data from different wrap-around boundary alignments for the ring buffer. The result would be a larger compilation output and the same static hard-coded addressing. Not something I really want to get distracted with.

c) You can 'slide' the DFT by adjusting the existing bins and then simply updating, as shown here: c - Doing FFT in realtime - Stack Overflow
This is the method I'd choose for now (a sketch follows after this list) - the same mechanism can also be applied in the downstream processing of the FFTs.
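For option (a), a sketch of the per-block FFTW plans - block size and count are illustrative, and FFTW_ESTIMATE keeps planning time sane with this many plans (build with -lfftw3):

Code:
// One plan per ring-buffer block, since FFTW plans are bound to the buffer
// addresses they were created with.
#include <cstddef>
#include <fftw3.h>
#include <vector>

int main() {
    constexpr int N = 1024;          // FFT length per block
    constexpr int blocks = 1000;     // ~1M samples resident in the ring

    double* ring = fftw_alloc_real(static_cast<std::size_t>(N) * blocks);
    fftw_complex* out = fftw_alloc_complex(N / 2 + 1);
    for (std::size_t i = 0; i < static_cast<std::size_t>(N) * blocks; ++i)
        ring[i] = 0.0;               // the DMA would fill this in real use

    std::vector<fftw_plan> plans(blocks);
    for (int b = 0; b < blocks; ++b)
        plans[b] = fftw_plan_dft_r2c_1d(N, ring + b * N, out, FFTW_ESTIMATE);

    // When block b completes, run its plan:
    int b = 0;                        // index of the block just filled
    fftw_execute(plans[b]);           // out[] now holds block b's spectrum

    for (auto p : plans) fftw_destroy_plan(p);
    fftw_free(ring); fftw_free(out);
    return 0;
}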
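And for option (c), a minimal sliding-DFT sketch along the lines of that Stack Overflow answer: subtract the sample leaving the window, add the new one, and rotate each bin by its twiddle factor - O(N) work per sample instead of an O(N log N) FFT per sample:

Code:
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

class SlidingDFT {
public:
    explicit SlidingDFT(std::size_t n) : N(n), history(n, 0.0), bins(n) {
        const double w = 2.0 * M_PI / static_cast<double>(n);
        twiddles.reserve(n);
        for (std::size_t k = 0; k < n; ++k)
            twiddles.push_back(std::polar(1.0, w * static_cast<double>(k)));
    }

    // Push one sample; bins[] then holds the DFT of the last N samples.
    void push(double x) {
        const double leaving = history[pos];    // sample falling out of the window
        history[pos] = x;
        pos = (pos + 1) % N;
        const double delta = x - leaving;
        for (std::size_t k = 0; k < N; ++k)
            bins[k] = (bins[k] + delta) * twiddles[k];  // rotate + update bin k
    }

    const std::vector<std::complex<double>>& spectrum() const { return bins; }

private:
    std::size_t N, pos = 0;
    std::vector<double> history;
    std::vector<std::complex<double>> bins;
    std::vector<std::complex<double>> twiddles;
};

int main() {
    SlidingDFT sdft(1024);
    for (int n = 0; n < 4096; ++n)              // feed a bin-50 test tone
        sdft.push(std::sin(2.0 * M_PI * 50.0 * n / 1024.0));
    return sdft.spectrum()[50].real() == 0.0;   // bin 50 should be non-zero
}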

This makes the entire system expandable - output from multiple channels could be supported, for example, although hard-wired between ADC and processing RPis unless a more complex CPLD acts as a switching fabric.

I want something simple. This can start simple; with this architecture it can be expanded later.
As long as I can (as a slow user) set up:
* FFT with the normal power based calculations (SNR etc)
* Auto-track tones input and identify harmonics (THD etc)
* trigger on threshold events that appear anywhere on the spectrum (even if I'm not zoomed into it).

It's open ended but those three basics are enough!
 
IMO 1Msps at 1ch/24bits is not any major data rate to receive, if IRQs are kept at bay. I have not studied the DMA example in detail - how are the IRQs thrown for the transfer from paralleled GPIOs to RAM? I assume the DMA controller increments the destination address and throws an IRQ after a batch of transfers. Otherwise linux serving 1M IRQs per second reliably is unrealistic.

Have you considered converting SPI to I2S in your FPGA/CPLD? IME RPi manages 768kHz/24bits/2ch I2S output/input (I tested an I2S loopback to work fine). Also there are other ARM boards with multichannel I2S, allowing to receive much higher bitrates reliably. That would spare you the driver work as alsa I2S drivers are available and well tested for all major platforms (RPi, Rock64, etc.).

Personally I would try hard to avoid parallel processing by multiple RPis. Serial processing could be quite easy, even if simply sending the samples with netcat over the gigabit interface. One RPi for receiving, another for processing and visualizing.

Still for that I would use a regular PC running some existing analyzer, e.g. the aforementioned REW. On the other hand REW runs on RPi4 fine. I have not tested arm64 java yet, but if it works, the 8GB RPi4 should handle it OK. Adding 10"+ touch screen + onscreen keyboard could yield a standalone device. Again the easiest transfer would be the USB audio gadget in this case, or maybe networked pulseaudio + alsa pulse plugin to feed java.
 
IMO 1Msps at 1ch/24bits is not any major data rate to receive, if IRQs are kept at bay. I have not studied the DMA example in detail - how are the IRQs thrown for the transfer from paralleled GPIOs to RAM? I assume the DMA controller increments the destination address and throws an IRQ after a batch of transfers. Otherwise linux serving 1M IRQs per second reliably is unrealistic.

The GPIOs would be mapped as volatile physical memory. An interrupt is set to trigger on the pin transition L->H, and the DMA - set up previously for a 4-byte move (the 24-bit sample plus pad) from the GPIO address to the buffer - is triggered; the controller then automatically increments and moves on. The fun part is how you end - that's provided by a length, so I suspect a new DMA is used each time. I will need to investigate further.
The pin toggling will then drive the data transfer.

Have you considered converting SPI to I2S in your FPGA/CPLD? IME RPi manages 768kHz/24bits/2ch I2S output/input (I tested an I2S loopback to work fine). Also there are other ARM boards with multichannel I2S, allowing to receive much higher bitrates reliably. That would spare you the driver work as alsa I2S drivers are available and well tested for all major platforms (RPi, Rock64, etc.).

I hadn't, but looking through the information I've noted that it's available. However, being available and knowing what processing/delays happen as part of the pipeline are different things. The Python Bode plot for the 1102X-E, acting as a software AWG signal generator, works through the Linux sound subsystem, and experience tells me that the streaming isn't trouble-free (you get some noises etc.) and is also subject to volume scaling etc.

Personally I would try hard to avoid parallel processing by multiple RPis. Serial processing could be quite easy, even if simply sending the samples with netcat over the gigabit interface. One RPi for receiving, another for processing and visualizing.

The ethernet port did occur to me too :) However if the data is coming in through the GPIO, the work could be sliced up in a multitude of ways - I did think about splitting frequency ranges across the GPIO-paralleled Pis, and it's also possible to build a hard-core option with a bespoke RTOS that performs FFTs and broadcasts the results on the local network. There are plenty of ways to break down a very scalable operation such as an FFT (and maths functions).

Still for that I would use a regular PC running some existing analyzer, e.g. the aforementioned REW. On the other hand REW runs on RPi4 fine. I have not tested arm64 java yet, but if it works, the 8GB RPi4 should handle it OK. Adding 10"+ touch screen + onscreen keyboard could yield a standalone device. Again the easiest transfer would be the USB audio gadget in this case, or maybe networked pulseaudio + alsa pulse plugin to feed java.

I have a 'little' Mac mini that is basically heat limited, but I got sick of Apple when they dumped OpenCL for their mess of a compute mechanism. Overnight my code stopped working. And that's not the only time it's caused a problem.

What is interesting is that the C++ code speed on a 4-core Linux VM with the shifting FFT is slow, as the compiler doesn't optimise it for parallel computation. Perhaps Boost or another fast vector library would help. That's not surprising given that for each sample there's a million multiply-add operations.

There's some examples of pipeline FPGA FFT generator here: An Open Source Pipelined FFT Generator
And an FPGA OpenCL FFT note here: Designing a 2 million-point frequency domain filter using OpenCL for FPGA - Military Embedded Systems


I'll try REW on Linux in a VM and see how it goes.
 
I've been doing some coding.

Even with a 32-bit fixed-point FFT, the problem requires too much CPU power between samples, perhaps too much for a single embedded chip. It's possible to use vector SIMD operations and threading to process the 1M operations per sample in blocks - I thought about splitting the FFT across multiple CPUs for this reason, each processing a chunk.

A couple of options stand out in my mind - it would be possible to use a window to limit the spread of influence of each added sample (i.e. so a signal does not influence all N/2 samples either side of it).

I'll look at the 'adding a block' option combined with the shifting FFT - that way you can precalculate the multiplier for each additional sample in a block and then apply them as you go.

The concept of a shifting DFT works well with graphics too - each gather is deterministic based on the thread position (i.e. sample position). Actually the GPU would probably go faster calculating the sin() and cos() as part of the processing rather than fetching from a table. Sometimes there is such a thing as too simple.
 
I'll code up the following versions of the shifting FFT - a GNU SIMD instruction-set version and an OpenCL version.

Two annoying points. First, GNU SIMD performance sucks compared to the Intel compiler (for Intel CPUs), which means the SIMD performance on the RPi4 will suck too; it requires a code rewrite between the two, and it doesn't support gather. Second, the OpenCL option I'm still exploring - there's a difference between the RPi1-3 and the RPi4 GPU which makes them incompatible at the moment. There's an experimental compiler (basically it has to go via Vulkan to compile across - sort of going around the houses). That environment is currently downloading - 1.4GB so far - and then it's got to compile. If it does work it's a bit of a wobbly starting point for OpenCL.

I suspect a hand-coded ARM routine may be in order - I used to code ARM2/3/610/710/StrongARM, and the basics are similar on the RPi4. So I'll see what can be done.
 
One thing to point out here is that it is relatively simple to downsample and average into smaller bins.
I could take 2^20 samples (i.e. ~1M, and radix-2 compatible) but reduce to a 1024-bin FFT - that's relatively simple, and I get a maximum rate spread of 470,000 to 700,000 samples per second with a single thread. The spread in performance is due to Ubuntu's scheduling - this is running on a 4-CPU VM inside the mini.
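The downsample-and-average step really is simple - a boxcar average over each run of consecutive samples (crude as an anti-alias filter, but it matches the binning described above):

Code:
// Fold 2^20 raw samples into 1024 points: 1,048,576 / 1024 = 1024 samples per bin.
#include <array>
#include <cstddef>
#include <vector>

std::array<double, 1024> downsampleAverage(const std::vector<double>& raw) {
    constexpr std::size_t bins = 1024;
    const std::size_t perBin = raw.size() / bins;
    std::array<double, bins> out{};
    for (std::size_t b = 0; b < bins; ++b) {
        double acc = 0.0;
        for (std::size_t i = 0; i < perBin; ++i)
            acc += raw[b * perBin + i];
        out[b] = acc / static_cast<double>(perBin);  // mean of this run
    }
    return out;                                      // feed to a 1024-point FFT
}

int main() {
    std::vector<double> raw(std::size_t(1) << 20, 0.5);  // 2^20 dummy samples
    auto binned = downsampleAverage(raw);
    return binned[0] == 0.5 ? 0 : 1;
}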

SIMD is a pain because the code expects same-shape blocks, without the option to stride for example.
OpenCL is progressing - the compile ran out of disk space at 96%, so I need more than 30GB to build it.

In terms of fixed point - I wanted to see if 32-bit fixed point worked well enough to compute an FFT without loss of SNR from quantisation. I suspect the minimum I'll need is a 48-bit number, at which point you may as well go to 64-bit fixed point. All good on an Intel i7 I suspect, but on the RPi that may cause an issue.

Lastly - resolving from 1Hz down towards DC means we'll probably need at least 2-5x the buffer size.

So if 1M/sec is too much - the option is to log all 1M samples/sec for 10 seconds but downsample and bin to 1024 live, then allow the user to zoom into any area in realtime to process the offline data. A trigger could also be set on a specific area with a sub-FFT.

However I'm not done yet with the 1M/sec FFT. I also have some ideas around the SFFT that should greatly improve performance and make it scalable.
 
Some results from the 4-CPU Ubuntu VM on the Mac mini.

A 256k-bin run (with matching sample count), compiling with c++ -O3 -fopenmp -msse, load at 400% and #pragma omp parallel for simd on the SFFT loop: the result is 721.901 samples/sec.
A 25.6k bin shows at about 8Ks/sec.
A 1024 bin shows as 85.3831K/sec.

However switching off OpenMP and running single-threaded shows the 1024 bin at 462.449K samples/sec. I've tried this with the work split into 1/4 blocks running on 4 threads, and even sin/cos rather than the lookup table to reduce bus use - that came out worse, which implies the trig is slower than the bus. There's not much to simplify in an A[] = A[] + B[] * C operation.
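For context, this is the shape of the SFFT inner loop being benchmarked - my reading of the A[] = A[] + B[] * C structure, split into re/im arrays. That the single-threaded build wins suggests the per-sample loop body is too small to amortise OpenMP's fork/join overhead:

Code:
// One complex multiply-add per bin per sample (compile with -O3 -fopenmp).
#include <cstddef>
#include <vector>

void sfftUpdate(double* re, double* im,
                const double* twRe, const double* twIm,
                double delta, std::size_t n) {
    #pragma omp parallel for simd
    for (std::size_t k = 0; k < n; ++k) {
        const double r = re[k] + delta;         // add the new-minus-old delta
        const double i = im[k];
        re[k] = r * twRe[k] - i * twIm[k];      // rotate bin k by its twiddle
        im[k] = r * twIm[k] + i * twRe[k];
    }
}

int main() {
    const std::size_t n = 1024;
    std::vector<double> re(n, 0.0), im(n, 0.0), twRe(n, 1.0), twIm(n, 0.0);
    for (int s = 0; s < 1'000'000; ++s)         // 1M sample updates, as timed above
        sfftUpdate(re.data(), im.data(), twRe.data(), twIm.data(), 1.0, n);
    return 0;
}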

So I was under the impression that an i7 had a bit of poke... well, it doesn't seem to have as much as I thought. So I'm feeling a little deflated at being beaten.

Given that research papers on 1M+sps FFT FPGAs mention running on Virtex-4s or 7s - that's a £4-7K minimum for a development board - that's not happening. The parallel RPi approach seems the best option - it scales, and 4 RPis running on the same data input makes a full 1MSps 1024-bin analyser in realtime, using floating point and not relying on fixed-point maths.

There are a number of backplanes you can get for RPis that support multiple compute modules and Zeros. Literally stack and rack. The obvious piece here is that if we tie the GPIO ports to the CPLD, we solve our issue: 4 compute modules supporting 1MS/sec with a 1024-bin window that we could scroll around, using built-in averaging or other sampling functions. The 1024-bin output could then simply be taken by an RPi 4 and away you go.

So we're looking at
* £140 for the ADC and frontend
* £40 for the CPLD
* £40 for each compute module (£120 for 4) or £13 for zeros (£52 - assuming same performance)
* £30 for a RPI multi-compute Hat
* £50-70 for a RPi 4 8-16GB
* £60 for a large touchscreen (HDMI rather than GPIO based)
====
£460 (4 compute modules). A shedload of processing in a box.

Or £310 without the parallel computation, dropping the compute modules + backplane.

So it's got competition from the likes of the Digilent Analog Discovery etc. The difference is that this would be realtime rather than record-and-process - the latter is easy, the user simply waits...