idea for homebrew recording studio over ethernet

This old topic is closed. If you want to reopen this topic, contact a moderator using the "Report Post" button.
Hey guys,

I have been pondering an idea recently, and thought I would bounce it off some of you.

My idea is to develop a cheap solution for a multitrack digital home recording studio.

It is broken down into these pieces:

- Analog to Digital Converters
- digital audio to ethernet converters
- PC with an ethernet card

Basically, you have a bunch of mics set up in a recording studio. Each mic is connected to an analog to digital converter. This ADC outputs a stream of digital samples, in S/PDIF or similar.

Each ADC connects to a "black box" which takes in digital audio samples, strips off the S/PDIF headers and such, wraps them up in an IP packet, and sends them onto the ethernet.

The PC is receiving all these packets from all the black boxes, and is writing the audio tracks to disk.
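
To make the black box / PC split a bit more concrete, here is a minimal sketch of the packing and unpacking steps, written in Python for readability (a real black box would be microcontroller firmware). The header layout (box id, sequence number, sample count) is a hypothetical one made up for illustration, not part of S/PDIF, IP, or any other standard, and the function names are my own.

```python
import struct

# Hypothetical header: box id (1 byte), sequence number (4 bytes),
# sample count (2 bytes), all in network byte order.
HEADER = struct.Struct("!BIH")

def pack_samples(box_id, seq, samples):
    """Black box side: wrap signed 16-bit samples in a payload for the network layer."""
    body = struct.pack("!%dh" % len(samples), *samples)
    return HEADER.pack(box_id, seq, len(samples)) + body

def unpack_samples(payload):
    """PC side: recover (box_id, seq, samples) from a received payload."""
    box_id, seq, count = HEADER.unpack_from(payload)
    samples = struct.unpack_from("!%dh" % count, payload, HEADER.size)
    return box_id, seq, list(samples)

payload = pack_samples(3, 7, [0, -1, 32767, -32768])
print(len(payload), unpack_samples(payload))
```

The box id lets the PC sort incoming payloads into tracks, and the sequence number lets it notice a missing packet before writing to disk.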

This idea is cheap, because you can use commodity PC hardware and networking hardware.

This idea allows you to have complete control over the audio quality of your recordings, as you decide what ADC to purchase (this is the only piece of the equation which affects audio quality).

This idea is scalable, in that you can set up a 4 track recording studio, only to later decide you need 16 tracks. You simply add more equipment, and don't lose any of your original investment.

Also, there is no reason you can't have more than one PC receiving packets, as long as you have some method of synchronizing the black boxes. In fact, there should be no theoretical limit to the number of concurrent tracks you can record.

This idea has been sitting in the back of my mind for probably a year now, but only recently have I decided it might be feasible to implement. Take a look at the picoweb server prototype:

There, they demonstrate the feasibility of interfacing a microcontroller to a standard ISA ethernet card, including a full TCP/IP stack.

anyone's thoughts on this are welcome.



assuming 44k/16bit audio tracks,

44100 x 16 = 705600 bits/sec, just for the raw samples.


8 tracks would require more than 5.6 Mbit of bandwidth.

this assumes zero collisions, and does not take into account the overhead of ethernet packet headers.

so, with several NICs in one machine, you could provide more tracks than you will probably need.

however, 100Mbit might be a better way, to account for header overhead and network collisions, etc.
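
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the raw-sample calculation; the 10 Mbit link speed in the last line is my reading of what the post is comparing against, and the function name is made up.

```python
def raw_bitrate(sample_rate_hz, bits_per_sample, tracks=1):
    """Raw audio payload rate in bits per second, with no packet overhead."""
    return sample_rate_hz * bits_per_sample * tracks

one_track = raw_bitrate(44100, 16)        # 705600 bits/sec
eight_tracks = raw_bitrate(44100, 16, 8)  # 5644800 bits/sec, a bit over 5.6 Mbit
print(one_track, eight_tracks)

# Share of an (assumed) 10 Mbit link used by 8 tracks, before any headers.
print("%.0f%% of 10 Mbit" % (100.0 * eight_tracks / 10e6))
```

Changing the arguments to 96000 and 24 gives the same figures for higher resolution tracks.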
ethernet headers

Standard networking follows the OSI 7 layer model.

each layer is an abstraction, so that each layer doesn't have to know anything about the layers below it.

each layer wraps the data in more headers.

for an entertaining tutorial, see:

7 ...
6 ...
5 ...
4 transport (TCP)
3 network (IP)
2 data link (802.3 ethernet frames)
1 physical

for efficiency, let's say for now we only use up to the IP layer.

from this:

we see that an ethernet header is 26 bytes:

7 byte preamble
1 byte start of frame
6 byte MAC address (destination)
6 byte MAC address (source)
2 byte frame type
up to 1500 bytes (data)
4 byte CRC footer.

however, apparently the first 8 bytes (preamble and start of frame) are considered to be at the physical layer, and are not counted in the typical 1518 byte maximum ethernet packet?

then we have the IP header,

which is at least 20 bytes.

thus, even with no samples, our packet is 46 bytes if I am correct.

if we were going for a little better than 50% efficiency, we would need 24 samples at 16 bits each, thus 48 bytes of data.

for 75% efficiency, we need 72 16 bit samples in each packet (190 byte packets)

100Mbit is looking like the way to go.
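
A quick sketch of the efficiency numbers, using the same 26 byte ethernet overhead and 20 byte minimum IP header as above. It ignores the gap ethernet leaves between frames, so real-world figures would be slightly worse; the names are my own.

```python
ETHERNET_OVERHEAD = 7 + 1 + 6 + 6 + 2 + 4   # preamble, SOF, two MACs, type, CRC
IP_HEADER = 20                               # minimum IP header, no options
OVERHEAD = ETHERNET_OVERHEAD + IP_HEADER     # 46 bytes with no samples at all

def efficiency(samples, bytes_per_sample=2):
    """Fraction of the bytes on the wire that are actual audio data."""
    data = samples * bytes_per_sample
    return data / (data + OVERHEAD)

for n in (24, 72):
    total = n * 2 + OVERHEAD
    print(n, "samples:", total, "byte packet,", round(100 * efficiency(n), 1), "% audio")
```

With 24 samples this comes out just over 50%, and with 72 samples (a 190 byte packet) just over 75%.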
uh, ok sorry........had to interject here.

First, you should ***REALLY*** try to aim for something better than 16 bit / 44.1kHz, in all seriousness.

Secondly, debating whether or not to use 100MBit is pointless, as you can't buy anything slower anymore (or if you can, it's obsolete and the 100MBit hardware is no more expensive, if not actually cheaper).

So do the calculations again, assuming a 100MBit network and say 24 bit / 192kHz optimally, but at least 24/96.
calm down there, my plans are to make it scalable to 96/24, but I was just doing feasibility calculations.

the 10mb or 100mb thing is a debate over "can I get away with using ISA NICs" or "do I have to use PCI NICs".

I think I effectively demonstrated that ISA NICs are not the way to go, as they would strain to do 8 tracks of 44/16, making the whole solution not worthwhile.

The implication here is that with PCI, not only could you use more than 8 tracks of 44/16, but the possibility of higher sample rates and resolutions is opened up.

but besides this, debating over 100 vs 10 is not "pointless", as I am sure it would be much easier to learn how to interface with an ISA card vs a PCI card, not to mention ISA NICs are cheaper.

hopefully future posts in this thread will be a little more constructive and polite.
interface to S/PDIF or AES/EBU

to interface with the ADC, the blackbox will need a receiver chip capable of understanding S/PDIF, AES/EBU, or both.

some notes on these standards:

this document mentions some receiver chips:

searching on google for each of those reveals the cs8412 seems to be the more popular of those.

however, the cs8414 and cs8413 are capable of taking 96k/24bit input.

the cs8414 will operate without being controlled by a microcontroller. This may make for a simpler design.
cellular, ok I'm sorry, I really didn't mean to come across as rude.

I'll go back and clarify some things as well as respond.

check these two links:
ISA Price check

PCI Price Check

As you can see, PCI NICs start at $3, ISA at $7. Big deal, either way they are both insanely cheap.

Also keep in mind that more and more newer motherboards don't actually have ISA slots anymore, so if you want this to work on all new computers, that's effectively out. Interfacing the card is fine; you can make your thing run over TCP/IP, it's really not that difficult to do.

The card's drivers add an abstraction, removing any difference between PCI and ISA.

ISA is an old, outdated standard that should not be used for any new products. It's similar to the thread about people using 2N3904s etc. in new designs: it's really not something you should do.

I'm sorry I offended you with that, it wasn't my intention, I'm just trying to set you straight.
sorry, i just noticed that i think you were referring to the network on the adc end. the interfacing here is an issue, i'll give you that and i apologize for some of what i said before. however, if you're going to use multiple 10 MBit cards, that's pretty ridiculous.

additionally, now that we're talking about the adc end, use an all-in-one ethernet ASIC.

it's much cooler.
Sounds like a nice idea. I just have one question:

How do you synchronize the sources? I mean, when you have, say, a drum machine and a couple of mikes on vocals and guitars, and these send stuff to the network and collisions occur, they will retransmit at various points in time, and the arrival of individual packets from individual sources is not synchronized. This is even without thinking about the start times for initial transmitting. Are you going to encapsulate an absolute time reference in your packet and have all sources set to reference time? Am I missing something here?

(SoundWeb seems to be for "broadcast" type transmissions which means one source goes to one or many receivers which needs no sync so they would not have any problems.)

I would use a dedicated ethernet for this system, to ensure that there is no other traffic to congest the network and cause unnecessary collisions. The second thing I would do is a few calculations to ensure that the required bandwidth will fit nicely within the capabilities of your 100Mbps limit, along with a simple statistical analysis of collision behaviour. This will ensure that all packets are delivered without errors (to a very high probability). The analysis would also help determine the size of buffers you'll need for worst-case, out-of-order packet reception. Since I'm feeling lazy right now, I'm not going to do a sample calculation. I'm sure it's not that hard to do.

Synchronization over an asynchronous link can be a problem, though not insurmountable. One method is to use an absolute time reference system, similar to what is employed in isochronous firewire links. The master controller (eg your primary ADC) will have a high-quality, stable crystal oscillator and a set of counters keeping a large absolute time count (say 48 bit or so). Periodically, this time count is captured and sent out on the network to the other nodes, which compare it against their internal counts, and use the difference to make small adjustments to their own clocks. Unfortunately, this is not easy to implement over ethernet, since the precise timing of the timebase packet transmission is a factor in accurately correcting the other clocks.
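
As a rough illustration of the "compare counts and make small adjustments" idea, here is a toy simulation in Python. It is a sketch under generous assumptions: the timebase packet arrives with zero delay and zero jitter (exactly the part that is hard over ethernet), and the correction rule is a simple proportional-plus-integral loop I picked for illustration, with made-up gains. It is not how firewire or any particular product does it.

```python
def run(intervals=200, ticks=48000, drift_ppm=50.0, kp=0.5, ki=0.1, correct=True):
    """Return the final (master - slave) count difference after `intervals` timebase packets."""
    master = slave = 0.0
    integral = 0.0      # learned long-term rate correction, in ticks per interval
    correction = 0.0    # adjustment the slave applies during the next interval
    for _ in range(intervals):
        master += ticks
        # The slave's crystal runs slightly fast; the correction is added on top.
        slave += ticks * (1.0 + drift_ppm * 1e-6) + correction
        error = master - slave   # seen when the master's count arrives
        if correct:
            integral += ki * error
            correction = kp * error + integral
    return master - slave

print("uncorrected difference:", run(correct=False))
print("corrected difference:  ", run())
```

Without correction the two counts drift apart steadily; with it, the slave settles onto the master's count. Any real delay variation in delivering the timebase packet shows up directly as error in `error`, which is why the precise transmission timing matters so much.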

Perhaps the best method would be to use a single remote box for all of your audio I/O. That way, a single master clock can reside in the box, and data returning from the computer just has to be sent back at the same rate it is coming in from the remote box. You could equip the remote box with ASRCs to handle data format and clock rate mismatches on data coming in from other sources.

I recall discussing this very idea once before on this forum, while also talking about building a standalone DSP... perhaps there's some info in an old post that you may find useful.
I would take a look at the Klotz Vladis system. While it's not really expensive (that's a relative thing), it is a pretty damn good system. It is used all over the world and seems to have a good technical base. They interface to all sorts of automation, production and storage systems. That being said, start cloning.
indeed, synchronization is the main problem i have been thinking about.

The way I was originally going to do it was that all the ADCs would be streaming samples to my black boxes, and the black boxes would be waiting for the "go" signal to start transmitting those samples to the central computer. Since they would all be on the same physical link, the time difference between when each of them received the go signal would be so small that it would be largely undetectable. Even if it were, the user would be given the option to pad some of the tracks with empty samples at the beginning (after everything has been recorded).

However, this is sort of a moot point, because even if they all start at exactly the same instant, there is no guarantee that each ADC has a perfect 44.1k oscillator in it. Being even just slightly off would turn into big synchronization problems if you were recording for an hour or so.
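
To put a number on "slightly off", here is a small sketch. The 100 ppm clock error is an assumed figure for illustration, not a measurement of any particular ADC.

```python
def drift_samples(nominal_hz, ppm_error, seconds):
    """How many samples a clock that is off by ppm_error gains or loses over a recording."""
    return nominal_hz * ppm_error * 1e-6 * seconds

off = drift_samples(44100, 100, 3600)   # one hour at 44.1k, 100 ppm off (assumed)
print(off, "samples, or", off / 44100, "seconds")
```

Under that assumption, two tracks recorded for an hour would end up about a third of a second apart, which is easily audible.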

I have decided to do things another way, which should simultaneously solve the synchro problem, and at the same time make more efficient use of the bandwidth.

I thought about what the main advantage of TCP was. That is, it guarantees reliable in-order delivery over a lossy, reordering channel.

However, the re-ordering only occurs when you go over more than one physical connection. (One physical ethernet link is all the computers connected on a hub or switch, or a series of hubs or switches. If you don't get routed through something with an IP address, you are on the same physical link. Yes, I know, some switches have IP addresses, but that's different: they are for maintenance, and you don't route normal traffic to that IP address.)

I originally intended this project to only work over one physical ethernet link, so I don't need to worry about reordering. All I need is raw IP.

Using the "everyone talks at once, and if there is a collision, retransmit" idea from ethernet (CSMA/CD) really makes no sense in this application, because this is a very specific circumstance. Each host has an identical load, so it makes most sense for them to simply take turns. So you simply do a round robin type communication. Box 1 sends a group of samples, then box 2, then 3, etc.

We fix the number of samples they can send at a time. We also set it up such that if it is their turn and they don't have that many samples yet, they wait until they do.

This forces everyone to transmit at the rate of the slowest link in the chain. Thus, if one black box is getting 45k samples every second, and another is getting 44.1k samples per second, the faster box will simply fill up its buffers and drop enough samples such that it delivers exactly 44.1k samples per second.

One side of me thinks I should worry about making an intelligent algorithm to drop samples, such that if you have 100 in the buffer and another 10 come in, you drop every tenth sample instead of the 10 consecutive ones at the end. However, this is ridiculous, as the difference between the fastest and the slowest clocks in the ADCs will probably be very small, such that you drop 1 in thousands of samples at most.
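
Here is a small simulation of the take-turns scheme, as a sketch only. The 72 sample packet and the two-packet buffer are assumed values, the 45k vs 44.1k clocks are the exaggerated example from above, and the buffer must be at least one packet long for the model to make sense.

```python
def round_robin(rates_hz, packet_samples, buffer_samples, rounds):
    """Each round, every box takes one turn and sends exactly packet_samples."""
    slowest = min(rates_hz)
    n = len(rates_hz)
    arrived = [0] * n    # samples produced by each ADC so far
    buffered = [0] * n   # samples waiting in each black box
    sent = [0] * n
    dropped = [0] * n
    for k in range(1, rounds + 1):
        # One round lasts as long as the slowest ADC needs to fill a packet.
        for i, rate in enumerate(rates_hz):
            total = rate * k * packet_samples // slowest
            new = total - arrived[i]
            arrived[i] = total
            room = buffer_samples - buffered[i]
            dropped[i] += max(0, new - room)   # buffer full: extra samples are lost
            buffered[i] += min(new, room)
            buffered[i] -= packet_samples      # this box's turn to transmit
            sent[i] += packet_samples
    return sent, dropped

# About one second of audio: 612 rounds of 72 samples at 44.1k.
sent, dropped = round_robin([44100, 45000], packet_samples=72,
                            buffer_samples=144, rounds=612)
print("sent:", sent, "dropped:", dropped)
```

Both boxes end up delivering the same number of samples, and only the fast box drops any. With clocks that differ by parts per million rather than 2%, the drop count falls accordingly.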

Now, since we are all taking turns, there are zero collisions. The fact that the packets are a well known size causes another interesting phenomenon.

Normally, you have to listen to the network to see when it is available before you start transmitting. However, there is a small propagation delay involved in the signals traveling over the copper wire. Box 1 transmits his last bit at time unit 10, but I don't receive his last bit until time unit 13 because of the propagation delay. I then see that in time unit 14 there was no data being transmitted, and so I start transmitting at time unit 15. As the size of messages gets smaller, the time wasted waiting on the propagation delay grows larger.

Now, if you are transmitting to each other, you can both talk at the same time, as there are separate send and receive wires. But what about when you both want to talk to the same third party?

If you knew the size of everyone's packets in advance, you could start transmitting as you are receiving the last few bits of the other box's packet.

Let's say we have box A, box B, and the central computer, C. The delay between A and B is 10. The delay between the boxes and the computer is 5. This means that if B is transmitting, A can start transmitting when he knows that B's transmission is 5 time units away from completion, and there will be no collisions at the central computer. Thus, we could tweak even more bandwidth out of the network than in most other situations.
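
The timing can be sanity-checked with a few lines of Python. This is a sketch under my reading of the example: "5 time units away from completion" is taken to mean 5 units before A would hear B's last bit. The packet length of 100 time units is a made-up value.

```python
def arrival_window(start, length, delay_to_c):
    """(first bit, last bit) arrival times of a packet at the central computer C."""
    return (start + delay_to_c, start + length + delay_to_c)

def windows_overlap(w1, w2):
    """True if two packets would be arriving at C at the same time (a collision)."""
    return w1[0] < w2[1] and w2[0] < w1[1]

DELAY_AB = 10     # delay between box A and box B
DELAY_TO_C = 5    # delay between either box and the computer
PACKET = 100      # hypothetical packet length in time units

b_start = 0
b_end_heard_at_a = b_start + PACKET + DELAY_AB   # when A hears B's last bit
a_start = b_end_heard_at_a - 5                   # A starts 5 units before that

b_at_c = arrival_window(b_start, PACKET, DELAY_TO_C)
a_at_c = arrival_window(a_start, PACKET, DELAY_TO_C)
print(b_at_c, a_at_c, "collision:", windows_overlap(b_at_c, a_at_c))
```

With these numbers B's packet occupies C from 5 to 105 and A's from 110 to 210, so there is no collision, and A has saved 5 time units per turn compared with waiting to hear the end of B's packet.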

by the way, I am in a networks class this semester, so if anyone has any questions about networking stuff discussed in this thread, fire away. If I don't know, I can look it up or ask my professor.
more thoughts on dropping samples

simply dropping a sample could cause audible distortion.

Perhaps a better way to do it would be to go ahead and transmit the extra sample in the packet, but set a flag which tells the central computer "shorten this packet by one sample".

Then, the computer could do it with a little more finesse, like averaging the distance between the samples on either side of the sample it decides to drop, etc.

in order to keep our synchronization, we would have to build in room for this extra sample, such that everyone either uses it to transmit an extra sample, or transmits all zeros in its place, etc, as long as the packets all stay the same size.
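
A minimal sketch of how the computer side might handle the flag. Merging two neighbouring samples into their average is just one possible reading of "averaging", and dropping in the middle of the packet is an arbitrary choice; both are assumptions for illustration, as are the function names.

```python
def shorten_by_one(samples, index):
    """Drop one sample by merging samples[index] and samples[index + 1] into their average."""
    merged = (samples[index] + samples[index + 1]) // 2
    return samples[:index] + [merged] + samples[index + 2:]

def receive(slots, has_extra, nominal):
    """slots always holds nominal + 1 values; the last is padding unless has_extra is set."""
    if not has_extra:
        return slots[:nominal]          # ignore the zero padding in the spare slot
    return shorten_by_one(slots, nominal // 2)

print(receive([1, 2, 3, 4, 0], False, 4))        # normal packet, padding stripped
print(receive([10, 20, 30, 40, 50], True, 4))    # extra sample smoothed away
```

Either way the computer ends up with the same number of samples from every box, and every packet on the wire stays the same size.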
sync, sampling rate, delay, etc.

TCP/IP (basically TCP at layer 4) is not completely sufficient to keep the packets in order and account for packet delays, variable delays, etc. for voice applications, let alone for music at hi-fidelity sampling rates. Gigabit ethernet, or at least the 802.1ad extensions, would be needed at layer 2 to get better control of the packet flow and to support multiple packet streams on the "ethernet backbone", avoiding the lowest probable bit rate problems one previous poster mentioned.

Therefore, additional protocols on top of IP have been specified, such as RTP (Real-time Transport Protocol), RTCP (RTP Control Protocol), and RTSP (Real Time Streaming Protocol). This is just the start of the necessary protocol extensions. The umbrella term is VoIP, voice over IP, which involves a number of competing standards that are fighting for market dominance; SIP, MEGACO, MGCP, BICC, H.323, etc. have been some of the competing standards in the past several years.

All of this work is evolving now in the telecom - networking field where there is a need and interest to replace digital circuit based call control and switching with (digital) packet based streams and routing. Telecom systems also have the generic problem of needing to efficiently carry multiple streams from any point to any point, and to also collect accounting and billing information, and to make use of network services such as call forwarding, local number portability, IP Mobility, etc., etc.

Billions of dollars are being spent now to work all this out and to make the "next generation" of telecom and networking equipment. Needless to say, any one of these topics will bring up half a dozen new engineering textbooks in a search on general terms for the subject, like VoIP or voice/video/data convergence.

I'd recommend books by Uyless Black, an older one called "Voice over IP", and "IP Telephony". He is a prolific author of these and many other titles.

I think your idea is an interesting one overall, and when the a/d-to-networking hardware becomes cheap enough and embeddable enough we will definitely see these sorts of systems, with perhaps additionally wireless networking interfaces as well.

One idea along these lines that I had was that MIDI data as well should be able to run over ethernet, instead of that low performance serial cable. I believe Yamaha has a non-standards-based or proprietary system available for high end studios, but it's not an industry standard in my understanding. With MIDI, you don't have the basic problem of the end to end analog to digital conversion; you just have the problem of encoding MIDI data into a single standardized (digital code word) format, which based on my research a few years ago was resisted by the supporting members of the MIDI organization. I think control over the MIDI standard should be taken away from the music equipment makers and turned over to an IETF committee, if it has not been already. Also, such a networked MIDI implementation would have to implement the higher layer IP support protocols I mentioned above, like RTCP, RTSP, etc., to support the real time aspect of musical performance.