synchronized streaming to multiple clients/loudspeakers

Here is some data showing how important QoS and WMM traffic prioritization can be on your network. Below, I am pinging one of the Intel boxes that is set up to deliver audio to a loudspeaker.

First, I pinged the machine when no audio data was flowing to the client:
Code:
ping 192.168.1.231
PING 192.168.1.231 (192.168.1.231) 56(84) bytes of data.
64 bytes from 192.168.1.231: icmp_seq=1 ttl=64 time=48.1 ms
64 bytes from 192.168.1.231: icmp_seq=2 ttl=64 time=74.4 ms
64 bytes from 192.168.1.231: icmp_seq=3 ttl=64 time=90.9 ms
64 bytes from 192.168.1.231: icmp_seq=4 ttl=64 time=114 ms
64 bytes from 192.168.1.231: icmp_seq=5 ttl=64 time=136 ms
64 bytes from 192.168.1.231: icmp_seq=6 ttl=64 time=160 ms
64 bytes from 192.168.1.231: icmp_seq=7 ttl=64 time=182 ms
64 bytes from 192.168.1.231: icmp_seq=8 ttl=64 time=204 ms
^C
--- 192.168.1.231 ping statistics ---
8 packets transmitted, 8 received, 0% packet loss, time 7010ms
rtt min/avg/max/mdev = 48.084/126.117/203.893/50.656 ms

I immediately began audio streaming, and pinged the client again:
Code:
ping 192.168.1.231
PING 192.168.1.231 (192.168.1.231) 56(84) bytes of data.
64 bytes from 192.168.1.231: icmp_seq=1 ttl=64 time=2.35 ms
64 bytes from 192.168.1.231: icmp_seq=2 ttl=64 time=3.17 ms
64 bytes from 192.168.1.231: icmp_seq=3 ttl=64 time=4.00 ms
64 bytes from 192.168.1.231: icmp_seq=4 ttl=64 time=1.49 ms
64 bytes from 192.168.1.231: icmp_seq=5 ttl=64 time=2.10 ms
64 bytes from 192.168.1.231: icmp_seq=6 ttl=64 time=1.58 ms
64 bytes from 192.168.1.231: icmp_seq=7 ttl=64 time=2.99 ms
64 bytes from 192.168.1.231: icmp_seq=8 ttl=64 time=2.03 ms
64 bytes from 192.168.1.231: icmp_seq=9 ttl=64 time=1.75 ms
64 bytes from 192.168.1.231: icmp_seq=10 ttl=64 time=1.80 ms
64 bytes from 192.168.1.231: icmp_seq=11 ttl=64 time=2.58 ms
64 bytes from 192.168.1.231: icmp_seq=12 ttl=64 time=1.58 ms
^C
--- 192.168.1.231 ping statistics ---
12 packets transmitted, 12 received, 0% packet loss, time 11017ms
rtt min/avg/max/mdev = 1.494/2.286/4.004/0.740 ms

The difference, in terms of both average latency and jitter, is huge. The new router prioritizes the traffic to the client because it recognizes it as audio/voice data and knows that it requires low latency.

My previous WiFi setup lacked the QoS feature, and the ping results were always similar to the first set above.
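
As an aside, a sender application can also mark its own packets so that WMM-capable gear classifies them as voice traffic, rather than relying on the router to recognize the stream. A minimal sketch in Python (the address and port are placeholders; DSCP 46, "Expedited Forwarding", is the value commonly mapped to the WMM voice queue):

Code:
import socket

# Mark outgoing UDP packets as Expedited Forwarding (DSCP 46), which
# WMM-capable WiFi gear typically maps to the high-priority voice queue.
# The ToS byte carries the DSCP in its upper six bits, hence 46 << 2.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 46 << 2)
sock.sendto(b"audio payload goes here", ("192.168.1.231", 5002))  # placeholder address/port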
 
You mention the WiSA products as if they were available to the public. I thought this was more or less a proprietary commercial product that required licensing. Is that not correct?

I talked to these guys, since I live in Europe: Wireless - WiFi, WiSA and DLNA (Buy Online) | Profusion. The RX/TX boards were about 40 euros apiece with no minimum quantity, and you can see the price of the USB dongle there. Then there is the more expensive option for a transmitter, Amazon.com: Klipsch Axiim Link Wireless Transmitter, Black: Home Audio & Theater, but since the technology is gaining momentum I suspect there will be more coming. There might be a licensing fee if you produce the modules, but I don't think you need a license to use them.
It's not open source, but it is designed for the purpose and doesn't require any additional infrastructure. It is also range-limited to one room, with a single transmitter per room.
 
Interesting project.

I am wondering how you properly sync the left and right sides.

I'd expect quite a bit of inter-channel variation without a master clock in place.


Done any measurements?

I'd like to do some measurements but have not had the time yet. Let me give an overview of my current understanding of how the playback rate is updated.

In my system, audio flows from a single "sender" (computer) to multiple clients. On the sender, the audio stream is packaged into RTP packets, which carry packet-by-packet timestamps. The RTP packets are sent to the client and received into a buffer, called the jitter buffer, which attempts to stitch the audio data from the incoming packets back into a continuous stream. RTCP "control" packets are also exchanged, from the sender to the client and back from the client to the sender, so there is a two-way exchange of playback timing data.
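
To make that concrete, here is a rough sketch of what such a sender/receiver pair can look like when built around Gstreamer's rtpbin, which bundles the RTP session, the RTCP exchange, and the jitter buffer. This is illustrative only, not my exact pipeline; the addresses, ports, test-tone source, and uncompressed L16 payload are all placeholder choices:

Code:
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# Sender: payload raw (L16) audio as RTP, send RTCP sender reports, and
# listen for the client's RTCP receiver reports coming back.
sender = Gst.parse_launch(
    "rtpbin name=rtpbin "
    "audiotestsrc ! audioconvert ! audio/x-raw,format=S16BE,rate=44100,channels=2 "
    "! rtpL16pay ! rtpbin.send_rtp_sink_0 "
    "rtpbin.send_rtp_src_0 ! udpsink host=192.168.1.231 port=5002 "
    "rtpbin.send_rtcp_src_0 ! udpsink host=192.168.1.231 port=5003 sync=false async=false "
    "udpsrc port=5007 ! rtpbin.recv_rtcp_sink_0"
)

# Receiver (runs on the client): the jitter buffer lives inside rtpbin,
# which stitches the RTP packets back into a continuous stream and sends
# RTCP receiver reports back to the sender.
receiver = Gst.parse_launch(
    "rtpbin name=rtpbin "
    'udpsrc port=5002 caps="application/x-rtp,media=audio,clock-rate=44100,'
    'encoding-name=L16,channels=2,payload=96" ! rtpbin.recv_rtp_sink_0 '
    "udpsrc port=5003 ! rtpbin.recv_rtcp_sink_0 "
    "rtpbin.send_rtcp_src_0 ! udpsink host=192.168.1.100 port=5007 sync=false async=false "
    "rtpbin. ! rtpL16depay ! audioconvert ! autoaudiosink"
)

sender.set_state(Gst.State.PLAYING)
receiver.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()  # keep the pipelines running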

NTP (the Network Time Protocol) is a sophisticated algorithm for synchronizing clocks over communication links, and it is the basis for obtaining accurate time info, e.g. to set your computer clock from a remote "time server". When NTP is running, it "polls" the time server for the current time every 2^M seconds, where M is the poll exponent. M is typically on the order of 6 to 12, depending on how established and stable the synchronization happens to be between the time server and the local clock. This translates to anywhere from about one minute (2^6 = 64 seconds) to over an hour (2^12 = 4096 seconds) between updates.

Under Gstreamer, an NTP-like algorithm is implemented on the client side. There is one major difference: the update interval is much shorter, e.g. on the order of a few seconds between updates. This helps bring the two endpoints into synchrony much faster at startup, and keeps the synchrony tighter.

Of course there is always jitter to contend with. Here, jitter is the variation in the time it takes the timing info to travel from sender to receiver, and it clouds the receiver's ability to know the exact time. But with many, many updates the jittery data can be averaged, and the receiver can obtain a good estimate of the correct playback rate, adjusting around the "nominal" rate that is sent at the beginning of the streaming session as part of the SDP info for the stream.
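
To illustrate the averaging principle (this is just a sketch of the idea, not Gstreamer's actual code): if the client collects pairs of (local clock time, sender timestamp), a simple least-squares fit gives the rate ratio between the two clocks, with the jitter largely averaged out.

Code:
import random

# Sketch: estimate the sender/receiver clock-rate ratio by least-squares
# regression over jittery (local_time, sender_time) observation pairs.
def estimate_rate_ratio(observations):
    """observations: list of (local_time, sender_time) pairs, in seconds."""
    n = len(observations)
    mean_l = sum(t for t, _ in observations) / n
    mean_s = sum(s for _, s in observations) / n
    cov = sum((t - mean_l) * (s - mean_s) for t, s in observations)
    var = sum((t - mean_l) ** 2 for t, _ in observations)
    return cov / var  # slope; e.g. 1.0001 means the sender runs ~100 ppm fast

# Example: a sender clock running 50 ppm fast, observed every 2 seconds
# for 10 minutes, with +/-1 ms of network jitter on each observation.
obs = [(t, t * 1.00005 + random.uniform(-0.001, 0.001)) for t in range(0, 600, 2)]
print(estimate_rate_ratio(obs))  # converges toward 1.00005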

Exactly HOW the client adjusts its playback rate under Gstreamer is not clear to me, but I believe this is done within the jitter buffer. Samples are spooled out from the jitter buffer to the rest of the playback pipeline leading to the audio renderer (e.g. DAC), at a rate based on the internal audio pipeline clock that has been constructed, which is completely independent of the machine's local clock. In the past, the playback timing adjustment mechanisms were more exposed to the user, and included skewing (abruptly moving) the playback pointer within the buffer and asynchronous sample rate conversion. In the form I am currently using, the jitter buffer is buried within a wrapper, and I do not set these parameters directly nor know exactly what they are.
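
That said, a few of these knobs are still exposed as ordinary element properties for anyone who wants to experiment. A hedged sketch of a stripped-down receiver (a standalone rtpjitterbuffer, without the full rtpbin/RTCP machinery); the property names come from the Gstreamer documentation, but the values are examples only, not what my system uses:

Code:
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# latency: jitter-buffer depth in milliseconds
# mode=slave: slave the receiver to the sender's timestamps
# slave-method=skew: let the ALSA sink skew its playback position to
#   track the pipeline clock (alternatives: resample, none, custom)
receiver = Gst.parse_launch(
    'udpsrc port=5002 caps="application/x-rtp,media=audio,clock-rate=44100,'
    'encoding-name=L16,channels=2,payload=96" '
    "! rtpjitterbuffer latency=50 mode=slave "
    "! rtpL16depay ! audioconvert ! alsasink slave-method=skew"
)
receiver.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()  # keep the pipeline running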

The principle and behavior of this type of clock mechanism are very different from a crystal clock running a hardware DSP. There is always a pool of audio data to work with (lying in the jitter buffer), some tens of milliseconds deep, and the playback clock is constantly being updated. With a sufficiently low-jitter network link, the timing/clock-rate info seems to be sufficiently accurate that multiple clients hardly drift apart at all.

Is it "perfect"? No, certainly not, it can't be. But do we need some kind of "perfect" reference? No, it just has to be good enough. Because it is constantly being updated for drift and error it's very unlike a crystal reference clock. The stereo image may sometimes move slightly to the left or right but eventually moves back again. I really should do some measurements to confirm the magnitude of the timing differences, but it seems to work pretty well on an empirical level based on extended listening tests on a couple of different systems in my home.

I should stress that this RTP+RTCP mechanism was not developed specifically for audio playback. It's a general scheme for synchronizing data flow across the internet, and it can be applied to one sender with multiple receivers, multiple senders with one receiver, or a mix of the two. It happens to work well enough for audio use, so I am happy to be using a platform that supports it.
 
This seems to be a VERY relevant paper https://lup.lub.lu.se/luur/download?func=downloadFile&recordOId=8052964&fileOId=8052965

Page 30 (PDF page 44) describes the algorithm used by Gstreamer, and page 46 (PDF page 60) describes an improved method whose results the authors go on to measure. Unfortunately it still uses Gstreamer's simplistic linear interpolation algorithm, which the authors concede is suboptimal.

Gstreamer offers an API for implementing a custom resampling method (GST_AUDIO_BASE_SINK_SLAVE_CUSTOM), but I could not find any implementation of it as a plugin on GitHub.
 

Wow, great find! There is a very good overview of things in the first part of the text.

I noticed that the researchers used Gstreamer version 1.4.x, which is a bit antiquated now. The version obtainable via the OS is, I believe, 1.12 or 1.14, or possibly even more recent in some cases. For example, on a machine that I recently converted to Ubuntu 20.04, Gstreamer is version 1.16. On a Raspberry Pi running a relatively recent version of Raspberry Pi OS, Gstreamer is version 1.14.

I believe that some improvements have been made since version 1.4, perhaps the very ones suggested in the 2015 thesis, specifically in the rtpbin composite element.

I have been experimenting with audio streaming on an isolated network, which makes it impossible to use NTP with an external time server, so I don't run any system-wide clock synchronization on that network. Instead I let Gstreamer do all of that for me as part of the streaming audio connection between sender and receiver over WiFi, and that has been working well for me.
 
I find this an interesting concept, but feel that the fundamental challenge is not really being addressed: a system-wide "master clock" or equivalent.

Timing precision in the region of 1 ms is "pretty good", but in terms of achieving a stable and accurate stereo image it falls a fair way short by hifi standards: 1 ms is 180 degrees of phase error at 500 Hz.

I expect that if this error were stable, and it were only an L-to-R error, it would sound fine. But the "sweet spot" would move around.

If this error were between speakers within one channel, it would lead to the crossover misbehaving in a serious way! I will add one qualifier though: a 1 ms error on a sub would be in the noise, so there is a good application of this right there.

I have been engineering far too long to trust network latency except where I control all traffic with an iron fist. In the past I have used "ethernet", but at the hardware level; this worked a treat. I suspect that WiFi will always throw in delays and latencies you don't plan for, so the good work you have done might suffer from occasional glitches as a result of external interference or co-channel traffic.

Have you tried the following experiment...
1. Put a mono 1 kHz sine wave into the system.
2. Hook a CRO to the L and R channels, triggering off one channel.
3. Measure the inter-channel delay.
4. Power the system up and down and repeat regularly.

This will give you a feel for how well the system is achieving that "system clock".

Either way, I think you have created a fine way of remoting a sub; I think L and R precision will remain a challenge. But I hope to be surprised!
 
And XMOS has an AVB stack for their chips, so a very low-power and clean receiver can be made.


From what I understand, AVB/TSN is tied to Ethernet and doesn't work over wireless.


That said, it seems there is work being done on synchronizing wireless devices for audio applications. See for example this proprietary solution, which is accessible to DIYers and still based on Linux: https://www.ti.com/lit/an/swaa162a/swaa162a.pdf
 

Thanks for posting about that technology. To me it seems similar to GPS PPS, which I know works very well, but you need a good, relatively unobstructed sky view for the GPS receiver. In the past, in another home, I set up a stratum 1 NTP server using GPS, and it was excellent for synchronizing the machines on my LAN. But NTP takes a long time to stabilize, and a GPS-based system is just not practical everywhere.

So if this is like a local PPS beacon for your home, I think it will be very successful.
 
I find this an interesting concept, but feel that the fundamental challenge is not really being addressed: a system-wide "master clock" or equivalent.

Timing precision in the region of 1 ms is "pretty good", but in terms of achieving a stable and accurate stereo image it falls a fair way short by hifi standards: 1 ms is 180 degrees of phase error at 500 Hz.

I expect that if this error were stable, and it were only an L-to-R error, it would sound fine. But the "sweet spot" would move around.
Correct. I am currently using only two separate broadcasts: the left speaker and the right speaker (but see below about subwoofers). DSP is done on the machine receiving the left or right audio stream to create the bands for the drivers (i.e., the crossover). So the only effect you can perceive is a wandering of the stereo image, if/when the synchronization error grows large enough for you to detect it.

If this error were between speakers within one channel, it would lead to the crossover misbehaving in a serious way! I will add one qualifier though: a 1 ms error on a sub would be in the noise, so there is a good application of this right there.
You are correct. There is no way you could broadcast the individual bands, because the timing error would definitely be too large at 1 kHz. The exception is for subwoofers: at e.g. 100 Hz, a 1 msec error amounts to only 36 degrees of phase rotation, which is not enough to worry about, and this becomes even more true as the crossover frequency to the sub is lowered. It should be possible to implement wireless, distributed subwoofer systems with this approach.

I have been engineering far too long to trust network latency except where I control all traffic with an iron fist. In the past I have used "ethernet", but at the hardware level; this worked a treat.

I suspect that WiFi will always throw in delays and latencies you don't plan for, so the good work you have done might suffer from occasional glitches as a result of external interference or co-channel traffic.
This is why I am using a dedicated, isolated WiFi channel for my three-Pi test system. There is no other traffic; latency is low, and repeatably so. But, sure, on a general network this might not be the case. I chose a channel in the 5.8 GHz band that is not typically used because it requires DFS/TPC, i.e. in the range of channels 100-144. See:
List of WLAN channels - Wikipedia

Have you tried the following experiment...
1. Put a mono 1 kHz sine wave into the system.
2. Hook a CRO to the L and R channels, triggering off one channel.
3. Measure the inter-channel delay.
4. Power the system up and down and repeat regularly.

This will give you a feel for how well the system is achieving that "system clock".

Either way, I think you have created a fine way of remoting a sub; I think L and R precision will remain a challenge. But I hope to be surprised!

I have a two-channel audio measurement setup that should be fine for this type of measurement. Because either channel can lead or lag, I will have to force one channel to lag by adding a large fixed delay, e.g. 20 msec (I can do this easily in the DSP part of the signal chain). Then I can use the other channel as the reference and take a measurement now and then of an impulse sent through the system. This should give me the time offset. I will have to manually record and tabulate the offset data after each measurement.
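
For the analysis step, something like the following sketch should work, assuming the two channels are captured into a stereo WAV file and using cross-correlation to find the lag. The file name is a placeholder, and the forced 20 msec offset has to be subtracted from the result:

Code:
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate

# Sketch: estimate the inter-channel delay from a stereo capture of the
# same impulse played through both speakers. "capture.wav" is a
# placeholder file name.
rate, data = wavfile.read("capture.wav")     # data shape: (samples, 2)
left = data[:, 0].astype(float)
right = data[:, 1].astype(float)

xcorr = correlate(left, right, mode="full")  # peak index encodes the lag
lag_samples = int(np.argmax(np.abs(xcorr))) - (len(right) - 1)
lag_ms = 1000.0 * lag_samples / rate
print(f"left-vs-right delay: {lag_ms:.3f} ms (includes the forced 20 msec offset)")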

Like I said before, this system is not and cannot be "perfect". It probably has just barely enough synchronization accuracy to keep the stereo image mostly centered. I am using only freely available software, and just seeing what it can do.
 
Charlie, have you tried playing with PTP? It seems all "official" implementations explore it to some extent.

No, because PTP requires hardware (e.g. routers, etc.) that supports PTP timestamping. I am not even certain that PTP can be implemented over a wireless network.

Remember, I am trying to keep this as simple and low cost as possible. If someone wants to get cutting edge performance there are hardware based solutions for that (e.g. WiSA, etc.).
 
OK, thanks, I have not learnt PTP details, my bad.

Have you considered a GPS clock on each of your RPi clients, instead of network NTP? This tutorial looks quite simple: NTP Server via GPS on a Raspberry Pi | Weberblog.net, and the GPS module costs about 6 USD: Flight Control GPS Module with APM2.5 Flight Control Accessories GY-GPS6MV2 | eBay

Like I mentioned in an earlier post, I have some experience with GPS-based NTP, and I had a stratum 1 GPS-based NTP time server serving the computers in my home for several years. The problem is that you need to receive strong signals from several GPS satellites to get reliable timing. This means you need to position your GPS receiver outside, or very near a window, so that it can see a large portion of the sky. Even a large tree overhead will attenuate the signals so that they come and go, and that does not lead to good timing information. So IMHO GPS-based NTP is not really practical in many situations, and certainly not for EVERY computer. Even getting one receiver set up to serve your entire home may not be possible, and even when you do, you still need to distribute its signal over your LAN, which has jitter. What happens when you want to take the system elsewhere, e.g. to demo it? Even if you can set up the GPS receiver and restart NTP at the new location, it takes hours to stabilize.
 
The comments on that tutorial discuss the stabilization delay; IIUC it is caused by mixing other network NTP servers with the GPS source in the NTP server config.

I have no experience with the technology, I just found it worth exploring. Maybe the technology has advanced a bit since then. Maybe it could help solve the moving stereo image, in combination with the existing network NTP, if it were made to adjust the time several minutes after the chain starts. Also, the module could be kept powered to avoid cold starts. Many maybes :)
 