Linux USB-Audio Gadget (RPi4 OTG)

Continuing from https://www.diyaudio.com/forums/pc-based/341590-using-raspberry-pi-4-usb-dsp-dac-7.html#post5898526

Last posts:
phofman:
Keeping the async mode for the audio gadget would be technically best, but quite difficult to implement; see the alsa-devel thread "[alsa-devel] Options for ASYNC feedback source in USB-audio gadget (USB OTG)?".




Tfive:
I read your posts on the alsa-devel mailing list. IMO all the stuff you describe is pretty complicated. My opinion:
a) adaptive in this case is perfectly fine if you have proper adaptive resampling, i.e. a PLL implemented digitally with smooth, long-term averaged adjustment of the resampling factor.
b) if you want to go async and at the same time avoid all the userspace feedback hassle and the daemon keeping an eye on this, why not use an existing alsa device as clocking master and slave the gadget to it. And then base the clock/timing feedback channel off of the master alsa device's clock/timing? The "master" device could be specified as module parameter.
 
I very much appreciate you are thinking about the issues.

a) adaptive in this case is perfectly fine if you have proper adaptive resampling, i.e. a PLL implemented digitally with smooth, long-term averaged adjustment of the resampling factor.

Certainly true for audio processing/listening purposes. I am aiming at a "smart" USB soundcard for measurements with internal harmonic distortion compensation (see the thread "Digital Distortion Compensation for Measurement Setup"). When compensating harmonics down to -150 dB, any resampling is undesirable; the chain must be bit-perfect. Any data discontinuity (buffer under/overflow) will also ruin the measurement.


b) if you want to go async and at the same time avoid all the userspace feedback hassle and the daemon keeping an eye on this, why not use an existing alsa device as clocking master and slave the gadget to it. And then base the clock/timing feedback channel off of the master alsa device's clock/timing? The "master" device could be specified as module parameter.

Do you have a specific implementation in mind? I have not come up with anything less complicated than the user-space solution.

A naive implementation of async mode on a hardware USB device: new data arrives every USB frame and is fetched into a short buffer. The fill level of that buffer is converted to the feedback format and sent back to the host. Rather straightforward to implement.
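To make the naive device-side scheme concrete, here is a minimal sketch of turning a buffer fill level into a feedback word. It assumes a high-speed endpoint, where the feedback value is the number of frames per 125 us microframe in 16.16 fixed point, sent as 4 little-endian bytes. The proportional gain and the target fill level are hypothetical tuning values for illustration, not anything taken from the gadget driver.

```python
MICROFRAMES_PER_SECOND = 8000  # USB high-speed: one microframe every 125 us


def feedback_value(nominal_rate_hz, fill_frames, target_fill_frames, gain=0.001):
    """Return a 16.16 fixed-point feedback word (frames per microframe).

    gain and target_fill_frames are hypothetical tuning parameters.
    """
    # Positive error: the buffer is fuller than desired, so request less data.
    error = fill_frames - target_fill_frames
    requested_rate = nominal_rate_hz * (1.0 - gain * error / target_fill_frames)
    frames_per_microframe = requested_rate / MICROFRAMES_PER_SECOND
    return int(round(frames_per_microframe * (1 << 16)))


def encode_feedback(value):
    """Pack the feedback word as 4 little-endian bytes."""
    return value.to_bytes(4, "little")


if __name__ == "__main__":
    on_target = feedback_value(48000, fill_frames=480, target_fill_frames=480)
    too_full = feedback_value(48000, fill_frames=960, target_fill_frames=480)
    print(hex(on_target), hex(too_full), encode_feedback(on_target))
```

At exactly the target fill the word is 6.0 frames per microframe (0x00060000 at 48 kHz); a fuller buffer yields a slightly smaller value, asking the host to slow down.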

But in software the situation is different. The master clock of the physical device projects into the kernel as a sequence of more or less frequent IRQs, depending on the configured period time. Plus the more or less reliable query about the current location of the DMA reading/writing pointer (with some processing delay).

We cannot monitor level of the DMA buffer fill like the USB device does (e.g. halfway-full => no change in feedback), the buffer fill changes all the time. A single look at the buffer does not tell anything about whether we should speed up or slow down. We can only monitor the average data flow (based on timestamps and pointer positions passed to/from alsa-lib to the kernel, the actual data are already copied to the shared DMA buffer by the playback app/alsa-lib), compare the inflow and outflow figure (related to common time base - system time), and generate the corrective action to align the two flows in the long run.


As for kernel vs. userspace: alsa-lib operates with the same data and methods the kernel does, the buffer is shared, and the key methods are propagated to user space. IMO the principle would be basically identical. Coding in userspace is MUCH easier: floating point is available, proper malloc is available, deployment and debugging are easy, there are no hidden bugs in IRQ handlers with half of the resources unavailable, and a major failure does not lock up the OS. It would also be quite difficult to get such a corner-case change into the kernel, while a change to user-space alsa-lib could be accepted.

As for the daemon: it would be very simple code, even in python. Again, user space means easy testing, troubleshooting, and easy changes. Some simple stepped implementation, a PID regulator, anything is possible.
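As an illustration of how small such a regulator could be, here is a sketch of a PI controller driving a simulated buffer fill back to its target. Everything in it is an assumption made for the example: the gains, the 1 s control period, the 1.5 frames/s clock offset and the simple model in which the buffer fill integrates the rate difference. It does not talk to ALSA or to the gadget.

```python
class PIRegulator:
    """Proportional-integral regulator; output is a rate correction in Hz."""

    def __init__(self, kp, ki, limit_hz):
        self.kp = kp
        self.ki = ki
        self.limit_hz = limit_hz
        self.integral = 0.0

    def update(self, fill_error_frames, dt_s):
        self.integral += fill_error_frames * dt_s
        correction = self.kp * fill_error_frames + self.ki * self.integral
        # Clamp so that a single bad reading cannot request a wild rate change.
        return max(-self.limit_hz, min(self.limit_hz, correction))


def simulate(seconds=2000, dt_s=1.0, clock_offset_hz=-1.5, target_fill=12000.0):
    """Toy model: fill changes by (clock offset + requested correction) * dt."""
    reg = PIRegulator(kp=0.05, ki=0.0005, limit_hz=10.0)
    fill = target_fill
    correction = 0.0
    for _ in range(int(seconds / dt_s)):
        fill += (clock_offset_hz + correction) * dt_s
        correction = reg.update(target_fill - fill, dt_s)
    return fill, correction


if __name__ == "__main__":
    final_fill, final_correction = simulate()
    print(f"fill = {final_fill:.2f} frames, correction = {final_correction:.3f} Hz")
```

In this toy model the correction settles at the value that cancels the clock offset while the fill returns to its target. A real daemon would replace the simulated fill with measured inflow/outflow figures.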

This would definitely be a long-term project but a useful one, IMO. There is no proper async USB-audio device with linux onboard, all firmware is currently coded in FPGA/CPLD.

Thanks for any comments/ideas.
 
The first step is accomplished. A hacked alsa-lib generates 10 s averages of the current throughput measured against the system clock (nanoseconds) in both directions, for both RW and MMAP access modes. In the end it was quite simple: I hooked into the corresponding methods of the HW plugin (pcm_hw.c in the alsa-project/alsa-lib repository on GitHub).

Values clearly hold for both directions (single clock), each soundcard is slightly different:

PCI Infrasonic Quartet - playback
Code:
MMAP: time: 7464.31, card = 3, device = 0, averaged samplerate: 48000.425255
MMAP: time: 7474.36, card = 3, device = 0, averaged samplerate: 48000.426115
MMAP: time: 7484.41, card = 3, device = 0, averaged samplerate: 48000.431125
MMAP: time: 7494.46, card = 3, device = 0, averaged samplerate: 48000.426516
MMAP: time: 7504.51, card = 3, device = 0, averaged samplerate: 48000.454509
MMAP: time: 7514.56, card = 3, device = 0, averaged samplerate: 48000.129543
MMAP: time: 7524.61, card = 3, device = 0, averaged samplerate: 48000.719691
MMAP: time: 7534.66, card = 3, device = 0, averaged samplerate: 48000.434230
MMAP: time: 7544.71, card = 3, device = 0, averaged samplerate: 48000.179946
MMAP: time: 7554.76, card = 3, device = 0, averaged samplerate: 48000.507986
MMAP: time: 7564.81, card = 3, device = 0, averaged samplerate: 48000.601536
MMAP: time: 7574.86, card = 3, device = 0, averaged samplerate: 48000.445306
MMAP: time: 7584.91, card = 3, device = 0, averaged samplerate: 48000.141746
MMAP: time: 7594.96, card = 3, device = 0, averaged samplerate: 48000.697811
MMAP: time: 7605.01, card = 3, device = 0, averaged samplerate: 48000.445344
MMAP: time: 7615.06, card = 3, device = 0, averaged samplerate: 48000.312700
MMAP: time: 7625.11, card = 3, device = 0, averaged samplerate: 48000.353650
MMAP: time: 7635.16, card = 3, device = 0, averaged samplerate: 48000.619305
MMAP: time: 7645.21, card = 3, device = 0, averaged samplerate: 48000.183977
MMAP: time: 7655.26, card = 3, device = 0, averaged samplerate: 48000.319778
MMAP: time: 7665.31, card = 3, device = 0, averaged samplerate: 48000.786663
MMAP: time: 7675.36, card = 3, device = 0, averaged samplerate: 48000.329274

PCI Infrasonic Quartet - capture
Code:
MMAP: time: 7473,41, card = 3, device = 0, averaged samplerate: 48000,139744
MMAP: time: 7483,46, card = 3, device = 0, averaged samplerate: 48000,545291
MMAP: time: 7493,46, card = 3, device = 0, averaged samplerate: 48000,771819
MMAP: time: 7503,56, card = 3, device = 0, averaged samplerate: 48000,243504
MMAP: time: 7513,66, card = 3, device = 0, averaged samplerate: 48000,205902
MMAP: time: 7523,71, card = 3, device = 0, averaged samplerate: 48000,723531
MMAP: time: 7533,81, card = 3, device = 0, averaged samplerate: 48000,110984
MMAP: time: 7543,91, card = 3, device = 0, averaged samplerate: 48000,678882
MMAP: time: 7553,96, card = 3, device = 0, averaged samplerate: 48000,506135
MMAP: time: 7564,06, card = 3, device = 0, averaged samplerate: 48000,360787
MMAP: time: 7574,06, card = 3, device = 0, averaged samplerate: 48000,376825
MMAP: time: 7584,16, card = 3, device = 0, averaged samplerate: 48000,501185
MMAP: time: 7594,21, card = 3, device = 0, averaged samplerate: 48000,480220
MMAP: time: 7604,31, card = 3, device = 0, averaged samplerate: 48000,406392
MMAP: time: 7614,41, card = 3, device = 0, averaged samplerate: 48000,387192
MMAP: time: 7624,51, card = 3, device = 0, averaged samplerate: 48000,356472
MMAP: time: 7634,61, card = 3, device = 0, averaged samplerate: 48000,573921
MMAP: time: 7644,71, card = 3, device = 0, averaged samplerate: 48000,094037
MMAP: time: 7654,76, card = 3, device = 0, averaged samplerate: 48000,695609
MMAP: time: 7664,86, card = 3, device = 0, averaged samplerate: 48000,371363
MMAP: time: 7674,91, card = 3, device = 0, averaged samplerate: 48000,411589
MMAP: time: 7684,96, card = 3, device = 0, averaged samplerate: 48000,573967
MMAP: time: 7695,06, card = 3, device = 0, averaged samplerate: 48000,411573

Intel HDA - playback
Code:
MMAP: time: 7505.41, card = 1, device = 0, averaged samplerate: 47998.770252
MMAP: time: 7515.41, card = 1, device = 0, averaged samplerate: 47999.158003
MMAP: time: 7525.41, card = 1, device = 0, averaged samplerate: 47998.736536
MMAP: time: 7535.41, card = 1, device = 0, averaged samplerate: 47999.004453
MMAP: time: 7545.41, card = 1, device = 0, averaged samplerate: 47998.869876
MMAP: time: 7555.41, card = 1, device = 0, averaged samplerate: 47999.087398
MMAP: time: 7565.41, card = 1, device = 0, averaged samplerate: 47998.786917
MMAP: time: 7575.41, card = 1, device = 0, averaged samplerate: 47998.958394
MMAP: time: 7585.41, card = 1, device = 0, averaged samplerate: 47999.057015
MMAP: time: 7595.41, card = 1, device = 0, averaged samplerate: 47998.778014
MMAP: time: 7605.41, card = 1, device = 0, averaged samplerate: 47999.262400
MMAP: time: 7615.41, card = 1, device = 0, averaged samplerate: 47998.634468
MMAP: time: 7625.41, card = 1, device = 0, averaged samplerate: 47998.957631
MMAP: time: 7635.41, card = 1, device = 0, averaged samplerate: 47998.897422
MMAP: time: 7645.46, card = 1, device = 0, averaged samplerate: 47999.043195
MMAP: time: 7655.46, card = 1, device = 0, averaged samplerate: 47999.066140
MMAP: time: 7665.46, card = 1, device = 0, averaged samplerate: 47998.680223
MMAP: time: 7675.46, card = 1, device = 0, averaged samplerate: 47999.094982
MMAP: time: 7685.46, card = 1, device = 0, averaged samplerate: 47998.785280
MMAP: time: 7695.46, card = 1, device = 0, averaged samplerate: 47999.110548
MMAP: time: 7705.46, card = 1, device = 0, averaged samplerate: 47998.818020

Intel HDA - capture
Code:
MMAP: time: 7528.49, card = 1, device = 0, averaged samplerate: 47998.937688
MMAP: time: 7538.49, card = 1, device = 0, averaged samplerate: 47998.814948
MMAP: time: 7548.49, card = 1, device = 0, averaged samplerate: 47999.176253
MMAP: time: 7558.49, card = 1, device = 0, averaged samplerate: 47998.751358
MMAP: time: 7568.49, card = 1, device = 0, averaged samplerate: 47998.915809
MMAP: time: 7578.49, card = 1, device = 0, averaged samplerate: 47998.907137
MMAP: time: 7588.49, card = 1, device = 0, averaged samplerate: 47999.214837
MMAP: time: 7598.49, card = 1, device = 0, averaged samplerate: 47998.606434
MMAP: time: 7608.49, card = 1, device = 0, averaged samplerate: 47998.986621
MMAP: time: 7618.49, card = 1, device = 0, averaged samplerate: 47998.933411
MMAP: time: 7628.49, card = 1, device = 0, averaged samplerate: 47999.018881
MMAP: time: 7638.49, card = 1, device = 0, averaged samplerate: 47998.778986
MMAP: time: 7648.49, card = 1, device = 0, averaged samplerate: 47999.132910
MMAP: time: 7658.49, card = 1, device = 0, averaged samplerate: 47998.818048
MMAP: time: 7668.49, card = 1, device = 0, averaged samplerate: 47998.899001
MMAP: time: 7678.49, card = 1, device = 0, averaged samplerate: 47998.964163
MMAP: time: 7688.49, card = 1, device = 0, averaged samplerate: 47999.087897
MMAP: time: 7698.49, card = 1, device = 0, averaged samplerate: 47998.785300
MMAP: time: 7708.49, card = 1, device = 0, averaged samplerate: 47998.909897
MMAP: time: 7718.49, card = 1, device = 0, averaged samplerate: 47999.067181

The clock difference is about 1.5 samples per second. This shows why reclocking with a plain buffer is a technically incorrect solution: e.g. a 500 ms buffer filled to half at the start (12,000 frames of headroom at 48 kHz) would over/underflow in about 2.2 hours.
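The arithmetic behind such an estimate is short enough to write down. The sketch below assumes a 48 kHz stream, the roughly 1.5 frames/s drift between the two cards seen in the logs above, and a buffer that starts exactly half full; the function name is made up for the example.

```python
def hours_until_xrun(headroom_frames, drift_frames_per_s):
    """Time until a buffer with the given headroom over/underflows."""
    return headroom_frames / abs(drift_frames_per_s) / 3600.0


RATE_HZ = 48000
DRIFT_FRAMES_PER_S = 1.5               # approximate difference between the cards

buffer_frames = int(0.5 * RATE_HZ)     # 500 ms buffer = 24000 frames
half_headroom = buffer_frames // 2     # starting half full: 12000 frames

if __name__ == "__main__":
    # Half-full 500 ms buffer: about 2.2 hours.
    print(hours_until_xrun(half_headroom, DRIFT_FRAMES_PER_S))
    # A full 500 ms of headroom: about 4.4 hours.
    print(hours_until_xrun(buffer_frames, DRIFT_FRAMES_PER_S))
```

Either way the buffer only postpones the xrun; it does not remove the rate mismatch.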

Charlie experienced glitches after 2 hours of running the adaptive gadget to soundcard without any adaptive resampling. The assumption that available buffers over/underflowed due to the clock difference is valid, IMO.

Now time to look at the async feedback options.
 
It seems to me strange that the rate is changing up and down every 10 seconds. This does not correspond to what I would expect for a clock! The rate should be very steady, but each clock will not be exactly the same as the next one. As an example I have plotted the first set of data as the deviation from 48kHz vs time (see attached figure). You can see that the long-time average is very close to 0.425, however, the value reported each 10 seconds is quite different.

Only at "startup", when the clock info must be built from scratch, would info on the timescale of a few seconds be useful. You might consider starting there and then building up better estimates of the clock rate difference, averaged over a period of several minutes, and then not doing 100% correction on the error. This will prevent "whiplash" corrections to the rate.
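A small sketch of that idea, using the Infrasonic Quartet playback figures posted above (expressed as deviation from 48 kHz): an exponential moving average stands in for the long-term estimate, and only a fraction of the estimated error is applied per control cycle. The smoothing factor 0.1 and the 25% correction fraction are arbitrary assumptions chosen for illustration.

```python
# Deviation from 48000 Hz of the 10 s averages (Quartet playback log above).
DEVIATIONS_HZ = [
    0.425255, 0.426115, 0.431125, 0.426516, 0.454509, 0.129543,
    0.719691, 0.434230, 0.179946, 0.507986, 0.601536, 0.445306,
    0.141746, 0.697811, 0.445344, 0.312700, 0.353650, 0.619305,
    0.183977, 0.319778, 0.786663, 0.329274,
]


def ema(values, alpha):
    """Exponential moving average, seeded with the first value."""
    estimate = values[0]
    smoothed = []
    for value in values:
        estimate += alpha * (value - estimate)
        smoothed.append(estimate)
    return smoothed


def partial_correction(estimated_error_hz, fraction):
    """Apply only part of the estimated error per control cycle."""
    return fraction * estimated_error_hz


mean_deviation = sum(DEVIATIONS_HZ) / len(DEVIATIONS_HZ)
smoothed = ema(DEVIATIONS_HZ, alpha=0.1)

if __name__ == "__main__":
    print(f"long-term mean deviation: {mean_deviation:.3f} Hz")
    print(f"last smoothed estimate:   {smoothed[-1]:.3f} Hz")
    print(f"correction this cycle:    {partial_correction(smoothed[-1], 0.25):.3f} Hz")
```

The raw 10 s figures swing by several tenths of a hertz around the mean, while the smoothed estimate stays within a few hundredths of it.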
 

Attachment: sample_rate_deviation.PNG

There is no clock available to the OS. The precise soundcard clock is timing the DMA transfer. The soundcard/USB controller/anything doing DMA is instructed to throw an interrupt request when the reading/writing pointer reaches certain address/threshold so that the kernel -> alsa-lib -> application knows it should supply one or more periods of samples to the DMA buffer. The exact time at which this supply occurs is unimportant as long as enough data are always prepared in memory for the DMA reading/writing pointer to avoid xruns.

And the time of writing/reading samples to the buffer (or of changing the buffer boundaries in the case of mmap access) is all the timing I have; my code is added to the respective methods. At these times I read the cached hw pointer and store the previous time, and every 10 seconds (when current_time > last_time + 10 seconds) I calculate the difference in the hw pointer divided by the time difference. As the time delay varies, the up-to-dateness of the hw pointer information varies too. The resulting variations of the calculated average rate are tiny: a timing imprecision of a few milliseconds against a total of 10 seconds.
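The averaging described here can be sketched as follows. This is not the actual alsa-lib patch (which is C code inside pcm_hw.c); it is a stand-alone illustration with made-up names. It assumes the caller passes a monotonic timestamp in nanoseconds together with the hw pointer in frames, and optionally the boundary at which the hw pointer wraps.

```python
class RateAverager:
    """Average sample rate from (timestamp, hw pointer) pairs over a window."""

    def __init__(self, window_s=10.0, boundary=None):
        self.window_ns = int(window_s * 1e9)
        self.boundary = boundary        # wrap point of the hw pointer, if known
        self.last_time_ns = None
        self.last_ptr = None

    def update(self, now_ns, hw_ptr):
        """Return the averaged rate in Hz once per window, otherwise None."""
        if self.last_time_ns is None:
            self.last_time_ns, self.last_ptr = now_ns, hw_ptr
            return None
        elapsed_ns = now_ns - self.last_time_ns
        if elapsed_ns <= self.window_ns:
            return None
        frames = hw_ptr - self.last_ptr
        if self.boundary is not None:
            frames %= self.boundary     # handle hw pointer wrap-around
        self.last_time_ns, self.last_ptr = now_ns, hw_ptr
        return frames * 1e9 / elapsed_ns


if __name__ == "__main__":
    averager = RateAverager(window_s=10.0)
    averager.update(0, 0)
    averager.update(5_000_000_000, 240_000)            # inside the window
    print(averager.update(10_050_000_000, 482_404))    # frames / 10.05 s
```

Because the window only closes on the first call after 10 seconds have elapsed, the reporting interval is slightly longer than 10 s, which matches the 10.05 s steps in the logs.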

I am discussing this with specialists on the alsa-devel list; maybe re-reading the current hw pointer at that time will be recommended. But I doubt it will actually be necessary. We do not need a very precise figure, the feedback loop will run continuously, trying to drive the rate difference between USB and output soundcard to zero every control cycle. I think that cycle will take more than 10 seconds, maybe a minute, maybe more. The buffers throughout the chain are large enough (it took 2 hours in your case to exhaust the buffers), so we do not need ultra-low latency of milliseconds.
 
We do not need a very precise figure, the feedback loop will run continuously, trying to drive the rate difference between USB and output soundcard to zero every control cycle.

OK. I would not try to make it zero every cycle. Whether or not you have access to the clock(s) the data you presented is very noisy and you risk constantly overcorrecting. That was what I was trying to point out with my plot. Like you said, it would take a long time to create over/under runs, and that is really all you are trying to prevent. A subtle change is all that is (likely) needed.
 
Exactly, no intention of using these 10sec averages for exact calculation of the feedback action. They are for checking if the rate measurement actually works. Which it seems it does.

The control change will have further propagation delays. Various control algorithms (regulators) can be used - that is why I do all of this in userspace, experimenting in kernel is clumsy.

I will have to dust off my faint recollections of control systems, my master's major a few decades ago...
 
The alsa subsystem admin recommended asking the driver directly for the current soundcard pointer and the corresponding timestamp (method snd_pcm_hw_status in pcm_hw.c of the alsa-project/alsa-lib repository on GitHub). This method yields much lower variation in the 10 s averages:

Intel HDA
Code:
time: 7110.17, card = 1, device = 0, averaged samplerate: 	191995.605796
time: 7120.19, card = 1, device = 0, averaged samplerate: 	191995.630191
time: 7130.22, card = 1, device = 0, averaged samplerate: 	191995.736022
time: 7140.25, card = 1, device = 0, averaged samplerate: 	191995.521832
time: 7150.27, card = 1, device = 0, averaged samplerate: 	191995.666591
time: 7160.30, card = 1, device = 0, averaged samplerate: 	191995.704428
time: 7170.33, card = 1, device = 0, averaged samplerate: 	191995.512622
time: 7180.35, card = 1, device = 0, averaged samplerate: 	191995.711321
time: 7190.38, card = 1, device = 0, averaged samplerate: 	191995.697803
time: 7200.41, card = 1, device = 0, averaged samplerate: 	191995.658741
time: 7210.44, card = 1, device = 0, averaged samplerate: 	191995.403172
time: 7220.46, card = 1, device = 0, averaged samplerate: 	191995.795381
time: 7230.49, card = 1, device = 0, averaged samplerate: 	191995.691082
time: 7240.52, card = 1, device = 0, averaged samplerate: 	191995.590803
time: 7250.54, card = 1, device = 0, averaged samplerate: 	191995.688861

Infrasonic Quartet:
Code:
time: 7077.50, card = 3, device = 0, averaged samplerate: 	192001.549402
time: 7087.50, card = 3, device = 0, averaged samplerate: 	192001.413235
time: 7097.60, card = 3, device = 0, averaged samplerate: 	192001.553629
time: 7107.60, card = 3, device = 0, averaged samplerate: 	192001.383925
time: 7117.70, card = 3, device = 0, averaged samplerate: 	192001.476731
time: 7127.70, card = 3, device = 0, averaged samplerate: 	192001.603120
time: 7137.80, card = 3, device = 0, averaged samplerate: 	192001.598430
time: 7147.79, card = 3, device = 0, averaged samplerate: 	192001.716339
time: 7157.90, card = 3, device = 0, averaged samplerate: 	192001.446606
time: 7167.90, card = 3, device = 0, averaged samplerate: 	192001.477986
time: 7177.99, card = 3, device = 0, averaged samplerate: 	192001.509265
time: 7187.99, card = 3, device = 0, averaged samplerate: 	192001.524726
time: 7198.09, card = 3, device = 0, averaged samplerate: 	192001.621344
time: 7208.09, card = 3, device = 0, averaged samplerate: 	192001.662404
time: 7218.09, card = 3, device = 0, averaged samplerate: 	192001.721233
time: 7228.19, card = 3, device = 0, averaged samplerate: 	192001.310009
time: 7238.19, card = 3, device = 0, averaged samplerate: 	192001.436610
time: 7248.29, card = 3, device = 0, averaged samplerate: 	192001.639014
time: 7258.29, card = 3, device = 0, averaged samplerate: 	192001.644186
These values are OK for further processing.
 
USB 3.1 includes dual-role data (DRD) functionality, the replacement for OTG. Probably not all chips support it, but the Intel Apollo/Gemini Lake series has one USB DRD port (see the patch "usb: xhci: pci: Enable Intel USB role mux on GLK platforms" on Patchwork). I could not find any motherboard supporting this mode, but we can likely expect more and more standard hardware to support the DRD (formerly OTG) mode.

MS Windows, as usual, does not care about regular hardware: the Microsoft Docs page "USB Dual Role Driver Stack Architecture" states "Function drivers are not available on Windows 10 for desktop editions". Linux does, though (see "USB OTG/dual-role framework" on LWN.net). That development will allow all sorts of exciting projects.

All the more reason to implement the async mode in the audio v2 composite function.
 
Since PA can resample for you, the gadget at its current status switched to adaptive mode should be OK for your project. Lucky you :)

I don't quite understand how PA is going to improve the problems I encountered with a slightly different clock and sample rate on the Pi... After a couple of hours time, will there not be the same under/overrun problems?
 
Thanks a lot, very interesting reading. I do admire the PLL work in USB-adaptive chips. IMO the engineering and know-how behind top quality USB adaptive PLL are much more impressive than async setups which just report fill level of some short buffer back to the sender.

Fortunately, in our case we do not have to worry about jitter since everything is inside the PC. The output clock is independent of the processing (be it the soundcard clock in async or the USB clock in adaptive). All that counts is that the number of samples buffered within the chain stays stable, so that the buffers never hit their limits.
 
With the kind help of Minas Harutyunyan from Synopsys (authors of the dwc2 USB host/gadget IP in the Broadcom SoC), I am now running both directions (duplex) at 64kHz/32bit/32channels, bit-perfect and with no xruns, between the RPi4 and my linux workstation. The gadget sends/receives 1024 bytes of data every USB high-speed microframe of 125us (about 8MB/s in each direction), which is the maximum rate achievable with USB audio v.2 (1024 bytes is the max packet size for one isochronous endpoint).

RPi overall (4cores) load - idle 98%.

Minor changes in code, suboptimal device-tree config of dwc2.

First step - tick. Now avoiding the gadget alsa devices stall when the USB side is disconnected/idle.
 
Nice work by everyone involved! This is really great news. I think the USB audio gadget has some real promise, once all of these kinks are ironed out of the code. This is on the Pi 4, correct? The Pi zero also has OTG but I do not think that is the "Pi" you are testing.

I see you mentioned 64k sample rate. [...I believe that is the default rate of the gadget. I never knew why that was... any ideas? Maybe Minas can provide an answer to that question...] Have you been able to try higher rates over fewer channels?
 
I used that samplerate/samplesize/channel count because that combination maxes out the USB2 limit on one isochronous endpoint (in each direction). The gadget does not care about the params, as long as the bitrate is within this limit. It could be e.g. 8.192MHz/8bits/1ch (if that samplerate number fits the usb config descriptor slot).
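A quick way to check which parameter combinations hit that limit, as a sketch: it assumes one 1024-byte packet per 125 us microframe on a single isochronous endpoint and ignores any per-packet rounding or extra headroom a real driver may need. The helper names are made up for the example.

```python
MAX_PACKET_BYTES = 1024          # max packet size used in this thread
MICROFRAMES_PER_SECOND = 8000    # one high-speed microframe every 125 us
ENDPOINT_LIMIT = MAX_PACKET_BYTES * MICROFRAMES_PER_SECOND  # 8,192,000 B/s


def bytes_per_second(rate_hz, bits, channels):
    """Raw PCM data rate for the given stream parameters."""
    return rate_hz * (bits // 8) * channels


def fits_one_iso_endpoint(rate_hz, bits, channels):
    """True if the stream fits within one packet per microframe."""
    return bytes_per_second(rate_hz, bits, channels) <= ENDPOINT_LIMIT


if __name__ == "__main__":
    for params in [(64000, 32, 32), (512000, 16, 8), (8192000, 8, 1), (96000, 32, 32)]:
        print(params, bytes_per_second(*params), fits_one_iso_endpoint(*params))
```

The first three combinations all come out at exactly 8,192,000 bytes per second, i.e. right at the limit; 96kHz/32bit/32channels would exceed it.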

It's on Pi4. But all the Pis have the same OTG controller, IIUC.
 
Ah, I understand now.

But what about the 64kHz rate? That seems to be the default rate for the OTG audio gadget, however, that rate is not supported by any other audio device that I am aware of... it's usually a multiple of 44.1 or 48kHz. Do you think this rate came about because it is the rate that, when used at 32bits and with 32 channels will max out the throughput capability?
 
I did not care about the default rate value in the driver; I always specified the params I needed. No idea why the authors used this rate as the default. But certainly not because of that maximization: the driver needs a few minor changes to max it out, and you can reach it with many other combinations, e.g. 512kHz/16bits/8channels.

The reason I needed to test the maximum bitrate was a device-tree misconfiguration which prevented packet sizes larger than approx. 900 bytes (out of the 1024 limit). I will push all the patches upstream when they are ready.