Arduino Spectrum Analyser 24bit 256KS/s

I’ve been considering ideas to improve performance by reducing the calculations required.

Mark's point about a flat-top window is easily implementable using a window lookup table. The cost is an extra fetch and multiplication. However, given that the width of the function being applied is a small number of samples, this substantially reduces the required processing per sample.
In the Astro processing I used the same mechanism to limit the computation required, to prevent image border issues causing correlation errors.
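As a sketch of the lookup-table idea (my own sketch, not the project's code), here's a pre-computed flat-top window table in Python. I've assumed the common 5-term flat-top coefficients (the set SciPy/MATLAB use); the project may settle on a different variant:

```python
import math

def flattop_table(width):
    """Pre-compute a flat-top window lookup table of 'width' samples.

    Assumes the common 5-term flat-top coefficients (as in SciPy's
    scipy.signal.windows.flattop) - not necessarily the final choice.
    """
    a = [0.21557895, 0.41663158, 0.277263158, 0.083578947, 0.006947368]
    return [sum((-1) ** m * a[m] * math.cos(2 * math.pi * m * n / (width - 1))
                for m in range(5))
            for n in range(width)]

# a 72-entry table as discussed below: peaks near 1.0 in the centre and
# falls to ~0 at the edges, so samples outside it need no multiply at all
w = flattop_table(72)
```

The table is built once, so the per-sample cost is just the fetch and multiply mentioned above.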

I’ll code up a simple real-time implementation on the CPU. It should have a marked impact on performance.

Additionally I want to see the performance impact of splitting the FFT into frequency-range ‘slices’ so each process has a smaller lookup table, and see how large four slices can be while still benefiting from the CPU’s 2MB L2 cache.

I also had a look at what is available for PC components and cost. A 12-physical-core i7 (ie 20+ threads with HT) with 25MB cache is ~£500. You can get GPUs with 24GB onboard for ~£2500 :eek:. With that kind of horsepower I suspect full-spectrum GPU audio processing would be easily available in the 2-5K bracket in volume.
 
So with a flat top window 72 samples wide, loaded from a table and multiplied in, from my understanding there's no reason to multiply beyond the transition to zero in the window. Therefore you end up with a 72-point window impact across a 1Mpt FFT. The i7 CPU operates at 2.6M samples/second realtime per process, or 2.8M/second for four parallel processes. The RP4 CPU manages 283,399/sec, and 577K/sec for four parallel processes.

I'll need to set things up so that the results are verified against FFTW in single-float mode.
 
I've been looking at this: http://www.ee.ic.ac.uk/pcheung/teaching/de2_ee/Lecture 5 - DFT & Windowing (x2).pdf

Normally I've scaled the window with respect to the value - thus you get a varying width of application.

The link above seems to suggest that it would be possible to make a 0dB window width at a particular frequency range, then apply that to reduce the processing.

If 1Hz requires 1M samples, and that is therefore the width of the ring buffer - ie the width of the window - then we don't get any saving; instead we need to process more (the main lobe on a flat top being 4*PI, or more depending on the window), so two cycles and thus two seconds of buffer are required per sample with a window.
 
I think the real-time processing option may need to be limited to 10Hz at the lowest point to keep within (a) a small 1MB ring buffer and (b) a realistic processing time. The lower the frequency, the more samples (given we have a constant 1M per sec) we need to process, which equals more processing time. Thus the FFT slicing frequency range for each process isn’t linear but depends on the processing time for those frequencies.
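The non-linear slicing falls out of a rough cost model (my assumption, not a measurement): resolving a frequency f needs at least one full cycle of samples at the fixed 1MS/s rate, so the per-bin workload grows as fs/f and the low-frequency bins dominate:

```python
# Rough cost model (assumption): one full cycle of f Hz must sit in the
# buffer at the fixed sample rate, so per-bin sample count grows as fs/f.
FS = 1_000_000  # samples per second (the 1 MS/s rate used above)

def samples_needed(f_hz, cycles=1):
    """Samples that must be processed to cover 'cycles' cycles of f_hz."""
    return cycles * FS // f_hz

low_band = samples_needed(10)       # a 10 Hz bin
high_band = samples_needed(10_000)  # a 10 kHz bin
# a 10 Hz bin costs 1000x the samples of a 10 kHz bin
assert low_band == 1000 * high_band
```

This is why equal-width frequency slices would give very unequal processing times per process.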

Offline processing: the same applies, although given we have processing time available we can upscale the buffer 10x to 10MB and process down to 1Hz, with the penalty that at 18 FFT/sec we’re waiting ~1 sec to display.

Busy day today but I’ll try to read more.
 
So a bit of pen and paper this morning over coffee whilst the bread is cooking. What I'm attempting to do is identify whether there is a way to reduce (a) the number of samples processed, and (b) the processing per sample. This reduces bus utilisation and thus improves performance.

time period = 1 second / Frequency
sample rate = number of samples / 1 second
We need 2x samples for a given frequency due to Nyquist.
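Putting the three relations together in code (the example tone is mine; fs is the project's 1MS/s rate):

```python
# The three relations above as code.
fs = 1_000_000          # sample rate = number of samples / 1 second
f = 250_000             # an example tone, in Hz (cycles/sec)
period = 1.0 / f        # time period = 1 second / frequency
nyquist_limit = fs / 2  # need >= 2 samples per cycle, so max frequency is fs/2

assert f <= nyquist_limit
```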

The window chosen sets the effective upper and lower frequency boundaries. There are a few differences between online and offline processing. For both options there is averaging (or taking the highest) of samples if we don't need the resolution in the span chosen.

Offline, using vkFFT we're constrained to the buffers configured when the DFT is set up, and we have to accept its processing - that includes the radix-2 stages, which mean some zero padding and processing we don't strictly need, in favour of computing the total required DFT faster. It's possible to build a new FFT for the current span and view - however that would likely need memory alignment for the input buffer (and for discrete GPUs the data needs to be copied into the 'shadow' buffer in the GPU's system-memory location, which the GPU then uploads into graphics RAM). This gives the opportunity to 'gather' or reformat the resolution data by averaging.

Online, our sample count is 1: we process each sample as it arrives, then apply it to N frequency bins based on a coefficient. I can tailor the number of bin calculations, and therefore the number of samples that need to be processed back within the ring buffer. A further optimisation that can be done per view change is to gather the calculation coefficients into a smaller table once (assuming a complete cycle); this means we don't have to stride within the larger array, which makes the cache more effective for the calculation. Additionally, averaging means we can also create smaller tables based on the point of measurement.

For averaging - the ADC samples, and its natural point of measurement is the end of the sampling period at its native speed, so the measurement points are equidistant. With averaging it's assumed that the average sits at the centre point of the averaged period of time (aligned with a natural measurement point), so changes in resolution remain equidistant.
 
I had a chance to sit down last night and go through both the maths for the frequency resolution and the windowing.

It looks like a few windows will be of use:

  • ADC characterisation - Blackman Harris
  • Sinewave (amplitude important) - Flat top
  • Accurate single tone sine wave measurements - Flat top
  • Sine wave/multiple waves - Hann
  • Closely spaced sine waves - Uniform/Hamming

So it probably makes sense to implement flat top and Blackman-Harris initially. A Texas Instruments ADC presentation had an easy-to-understand overview of characterising ADCs, where they use multiple terms, ie a 7-term BH window for a -163dB largest lobe, so later I'll look at multi-term windows.

In terms of samples and frequency resolution, I'm at the wobbly point of having my brain span both the time and frequency domains. Watching one YouTube video it made no sense. Watching something that simply put the maths on the wall helped a lot. I'm aware that more samples/sec offers more resolution in the waveforms present, and that samples/sec is not the same as frequency (cycles/sec), but I'd hit a mental block. So it was back to basics:

X[k] = SUM n=0 to N-1 of x[n].exp( -i2PIkn / N )

So X is the frequency-domain value at frequency index k; x is the time-domain sample at time index n within the total number of time samples N. The complex exp(), or e^(-i...), is the interesting part as my brain shifts to circles, and so it's easy to visualise in circular form:
1. more frequency resolution = more steps around the circle - so instead of 0, 45, 90 degrees it could be 0.1-degree steps. This is given by the ratio n / N above; thus kn/N dictates the resolution by the number of steps at the frequency.
2. higher frequency = faster rate of movement per step around the circle. Just as sin(wt) gives you a sine wave at frequency 'w' over time, the same role is played in the maths by 'k'.

In the realtime processing example, the code finds the difference between x[n] and the incoming sample (which will overwrite x[n]) and then applies that delta to the bins. The exp part of the function is pre-calculated in the table, but still allows frequency k to step around the circle at the rate required: coefficient[k] and coefficient[2k] come from the same table, the code simply strides through it, adding two to the index each time and wrapping around. This works because the exp() term at a given table position is the same whether you reach it in two small steps or one larger stride. The code wraps because it's all on a normalised circle, 2PI mapped onto 0 to 1 by the n/N term. Note the real code has some additional wrapping logic, but I'm showing the general approach:

pre-calculate coefficient[ ] for all values of the circle at the resolution required. So a 1MHz sampling frequency would need 1M entries, one for each step.
pre-calculate window[ ] for the sample range required (in this case we're using N, but we can reduce N due to the zero points of the window - this means we can reduce N to Nwindow).

n = 0

for each sample value 's' received:
    delta = s - x[n]
    x[n] = s

    coefficientStride = 0
    for each bin 'k' between 0 and Nwindow-1:
        bin[k] = bin[k] + delta * coefficient[coefficientStride] * window[n + k]
        coefficientStride = (coefficientStride + n) mod N
    next k

    ready for the next sample:
    n = (n + 1) mod N
next sample
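The delta update above can be sketched in runnable Python (my sketch, not the project's actual code). I've dropped the window term so the delta update stays exact, and kept N tiny so the running bins can be checked against a direct DFT of the ring buffer:

```python
import cmath

# Minimal sliding-DFT-style delta update: each incoming sample replaces
# one ring-buffer entry, and only the difference is folded into the bins.
N = 8
coefficient = [cmath.exp(-2j * cmath.pi * j / N) for j in range(N)]  # circle steps

x = [0.0] * N      # ring buffer of the last N samples
bins = [0j] * N    # running DFT bins
n = 0              # ring-buffer write index

def push(s):
    """Fold one incoming sample into all N bins via the delta update."""
    global n
    delta = s - x[n]
    x[n] = s
    stride = 0
    for k in range(N):
        bins[k] += delta * coefficient[stride]
        stride = (stride + n) % N  # stride tracks (k*n) mod N as k increments
    n = (n + 1) % N

for s in [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.0, 3.0, 1.0, -2.0]:
    push(s)

# verify: the running bins equal a direct DFT of the current ring buffer
direct = [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / N) for m in range(N))
          for k in range(N)]
assert all(abs(bins[k] - direct[k]) < 1e-9 for k in range(N))
```

In the real code coefficient[ ] would be the 1M-entry table and the inner loop would run over Nwindow bins rather than all N.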


Averaging simply uses a power-of-two number of samples so that we can do a binary shift to divide by 2, 4, etc, which is faster than a divide. Also, if the window and everything else are sized as powers of two, this avoids any remainders. Changing the averaging means we would clear x[n] rather than attempt to reprocess.
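A one-line illustration of the shift trick (the sample values are made up):

```python
# Power-of-two averaging: with 2^k samples the divide becomes a right shift.
samples = [100, 104, 98, 102, 101, 99, 103, 97]  # 8 = 2^3 ADC readings
total = sum(samples)
average = total >> 3  # identical to total // 8, but shift-cheap on any CPU
assert average == total // 8
```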

A frequency resolution change, such as zero padding, means more samples; this has a few effects (assuming we're no longer averaging samples):
  • our sample buffer now has to be larger and zero padded, OR we keep the same sample buffer and, beyond the end of the data, simply use zero within the calculation. For realtime we're receiving a sample at a point in time rather than having all our samples in a block we can zero-pad at the end of, so more samples per second need to be accounted for.
  • our pre-calculated coefficient now needs to be larger as we have more steps in the rotation of the circle, plus our absolute stride is now increased due to more steps. It still remains that 1 revolution of the coefficient circle per second is 1Hz, it's just more steps to move over.
  • the window needs to be larger (more samples) as the function is applied to time domain samples.
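To make the "more steps in the circle" effect concrete, here's a small check (my example): zero-padding to double length doubles the bin density, and the even bins of the padded DFT are exactly the original bins - the odd bins are the new interpolated points in between.

```python
import cmath

def dft(x):
    """Direct DFT - slow, but enough to demonstrate the effect at small sizes."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

x = [0.5, 1.0, -0.25, 2.0, -1.5, 0.75, 0.0, 1.25]  # 8 arbitrary samples
X = dft(x)
X2 = dft(x + [0.0] * len(x))  # zero-pad to 16 points: double the bin density

# every even bin of the padded DFT equals an original bin: more resolution
# in the display, but no new information about the signal
assert all(abs(X2[2 * k] - X[k]) < 1e-9 for k in range(len(x)))
```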

It's possible to create the largest tables for the maximum sample rate available and then sub-sample the coefficients and window etc to match the required resolution. Modern caches can cope with some sparse data retrieval, but caches are filled in blocks, so spanning 10MB sparsely with 4K pages in a 2MB cache results in cache thrash where reads are not hitting the cache. For this reason it's normally better to recreate the blocks of data so that the strides are not required.
In the code above we could create a data block that has all the data structured so that the code only needs to fetch the next location rather than stride, but the size of that data would be large.
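A sketch of the gather-once-per-view-change idea (my sketch; table sizes shrunk for illustration) - the gathered table gives the same answer as striding through the big table, just with a cache-friendly sequential walk:

```python
import cmath

N = 1024  # the full coefficient table: one entry per step of the circle
table = [cmath.exp(-2j * cmath.pi * j / N) for j in range(N)]

k = 37    # bin of interest: accessing it means striding k entries per sample
M = 64    # samples contributing at the current view/resolution
x = [((n * 7919) % 97) / 97.0 for n in range(M)]  # arbitrary test samples

# strided access: (k*n) mod N jumps all over the big table
strided = sum(x[n] * table[(k * n) % N] for n in range(M))

# gathered once per view change: a compact table walked sequentially
compact = [table[(k * n) % N] for n in range(M)]
gathered = sum(s * c for s, c in zip(x, compact))

assert abs(strided - gathered) < 1e-12
```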
 
I’ve spent the holidays studying Vulkan - horrendously over-complicated with little documentation. It’s almost a rite of passage to get something done. To make matters more fun, Vulkan texture ids don’t work with ImGui without some jiggery-pokery.
I’m currently writing the pipelines for rendering. Once this is done the initial capability of FFT bins can be rendered to the screen.
 
An interesting unofficial driver for the Broadcom VideoCore IV could make Vulkan on the Zero 2 possible: https://github.com/Yours3lf/rpi-vk-driver

If this works (and it's a big if) then it’s likely to be slow, as the Zero 2 figures above showed, but given the form factor etc it could be possible to pack a large number of Zeros in, with the bonus that each is running a separate bus and memory - slower individually, but on a task with high parallelism that may be really useful.
 
Reading the FAQ (https://github.com/Yours3lf/rpi-vk-driver/wiki/FAQ), the current VC4 RPi3 driver doesn’t support compute pipelines, which vkFFT uses. There may be a flag for non-compute pipeline use, but for now it looks incompatible.
That leaves the realtime CPU option, and given the results we’ve seen, we’re back to better performance per physical size using RP4s.
 
I don't know if this would help, but I found this Android app: Spectroid (https://download.cnet.com/Spectroid/3000-20432_4-77833231.html), which seems to do a lot of what would be useful. It uses some open source components, and it uses several band-limited FFTs to assemble the response more quickly.
A similar idea did make me think.
The FFTs only need to be 1024 wide (screen resolution) and it’s possible to target centre and span recursively like a tree. However switching the configuration is not guaranteed not to cause discontinuities or worse.
I have a texture image displaying now; next up is the FFT-to-bin piece.
I have the details worked out to parallelise an FFT both across frequency bins and within each bin. I will have a Lego set of classes that make up the steps in an FFT pipeline - some do simple things like transfer from CPU to GPU memory, others can then be replaced by parallel-processing versions. For now the task is to get a basic FFT pipeline manually configured to display data.
I can hack up a sound card “sample source” using the same mechanism as the 1104x-e bode plot function generator software. Just the clocking is at the mercy of the system clock and the operating system scheduling.
 
Progressing :)

I was looking at how to optimise FFTs for high sample rates - both DFT delta techniques and FFTs with larger radix. You can decompose the FFT in both the time and frequency domain into slices for processing, which makes it very adaptable in terms of processing.

In the end the FFT is more a bandwidth-limited operation (the maths functions are relatively simple). The RPi and RPi Compute have low memory bandwidth - 0.6GB/sec - whereas the Intel i7 is in the order of 30GB/sec. However that got me looking into what compute modules you can now get. There are embedded i7s, and Intel make a "compute module" which is really a computer in a metal block with pin-outs from the south bridge for USB/network/HD etc. It's not really a compute module.
Now interestingly nVidia also make a compute module. It's a hard-core compute module (yay), essentially a GPU with 512 parallel processors (ie fragment or "compute" shaders). This is the good bit: the baby model, the Jetson Nano, has a respectable 25.6GB/sec memory bandwidth and 472GFlops of processing performance... but, and this is the interesting bit, the AGX model has 136.5GB/sec memory bandwidth (227x the RPi) and a hefty 32TFlops (ie 32,000GFlops). That is one mean device and something that could/should make a massive leap in the FFT performance. The AGX starts at 7GB at $675, which compared to graphics cards is quite low.

It wouldn't be realtime as such, given the OS and the drivers aren't realtime. However it would absolutely fly creating FFTs - with 4.5x the memory bandwidth of the i7 and more cores (GPU style), its performance is likely to be at least 4x an i7.