Impulse response, FFTs, deconvolution

Status
This old topic is closed. If you want to reopen this topic, contact a moderator using the "Report Post" button.
Hi

I'm trying to understand the theory and practice of room correction, starting with the measurement of impulse response, then hopefully moving on to deriving the inverse filter to perform a 'perfect' correction of some music recorded at the same position as the impulse response measurement (listening in headphones). At this point, I may feel that I have some grasp of how this stuff works, and can then experiment with 'tempering' the correction to make it work in practice.

I was wondering if anyone could help with some of the real 'nuts and bolts' of how this works:

As I understand it, as a first approximation, we can regard the room as 'smearing' the sound from each speaker by adding spurious echoes, reflections and reverberation that all reach our ears at different times after the direct sound; the audio from the speakers is 'convolved' with the room characteristic. This characteristic can be captured in its entirety (at one listening position) as the impulse response, and inverted to 'correct' the source audio for the room. (With various caveats - such as the assumption that the system is linear, which in practice we may not achieve - but it's near enough for us to do something with.) This goes beyond normal frequency-selective equalisation, and allows us to effectively suppress echoes and reverberation - which seems rather miraculous, and I can't wait to try it for myself, even if, in practice, it wouldn't be a good idea to take it as far as theoretically possible.

Capturing the impulse response can be done as simply as making a 'click' and recording the sound in the room for a couple of seconds, but a better way is to use a longer-duration 'probe' signal, which helps to reduce the effects of ambient noise and is more predictable and easier for speakers to reproduce. A swept sine wave is good - especially, for various reasons, one that increases in frequency exponentially. However, any signal that contains the full range of audio frequencies could be used - white noise, for example.
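For anyone wanting to generate such a sweep themselves, here is a minimal sketch in Python/NumPy of the standard exponential (log) sine sweep formula; the frequency range, duration and sample rate below are just illustrative values, not a recommendation:

```python
import numpy as np

def exp_sweep(f1, f2, duration, fs):
    """Exponential (log) sine sweep: instantaneous frequency rises
    from f1 to f2 Hz over 'duration' seconds at sample rate fs."""
    t = np.arange(int(duration * fs)) / fs
    L = duration / np.log(f2 / f1)          # sweep 'rate' constant
    return np.sin(2 * np.pi * f1 * L * (np.exp(t / L) - 1))

sweep = exp_sweep(20, 20000, 5.0, 48000)    # 5 s, 20 Hz - 20 kHz at 48 kHz
```

Differentiating the phase shows the instantaneous frequency is f1·e^(t/L), which lands exactly on f2 at the end of the sweep.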

The impulse response can supposedly be derived by deconvolving the recorded signal with the 'dry' test signal. In the frequency domain, convolution is equivalent to multiplying the spectra of two signals together, and deconvolution is simply obtained by dividing one by the other.

I have made some recordings of the room, with a loopback of the swept sine wave test signal recorded simultaneously, so in theory all I have to do is take the forward FFT of both signals (with suitable zero padding at the end of each, no windowing necessary(?)), divide the spectrum of the recorded signal by that of the test signal bin by bin (using complex arithmetic), take the inverse FFT and, hey presto, I should have the impulse response. I think. The usual technique seems to be to load the FFT's real input with the audio signal and set the imaginary values to zero; then, after the frequency-domain processing, the output is again taken from the real values only. I might have expected the imaginary values to be almost zero, but in my experiments I am finding that the impulse response output is complex, i.e. both real and imaginary values are present at equal levels on average. Does this mean that I definitely have a problem somewhere in my calculations, or is there a way of extracting the 'real' impulse response from this?
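For what it's worth, here is a minimal sketch of that divide-in-the-frequency-domain idea in Python/NumPy; the one-tap 'room' and signal lengths are a toy check, not a real measurement:

```python
import numpy as np

def estimate_ir(recorded, reference):
    """Deconvolve: divide the spectrum of the room recording by the
    spectrum of the loopback reference, then inverse-transform.
    Zero-padding to the combined length avoids circular wrap-around."""
    n_fft = len(recorded) + len(reference)
    R = np.fft.rfft(recorded, n_fft)   # rfft assumes real input
    S = np.fft.rfft(reference, n_fft)
    H = R / S                          # bin-by-bin complex division
    return np.fft.irfft(H, n_fft)      # purely real by construction

# Toy check: a 'room' that is a pure 10-sample delay with gain 0.8
rng = np.random.default_rng(0)
sweep = rng.standard_normal(1024)
room = np.zeros(64); room[10] = 0.8
recorded = np.convolve(sweep, room)
ir = estimate_ir(recorded, sweep)      # ir[10] should come back as ~0.8
```

Note that using the real-input transforms (`rfft`/`irfft`) rather than a full complex FFT guarantees a real-valued result, which is relevant to the complex-output question above.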

The usual way of correcting a room seems to be by using a convolution processor that convolves a FIR correction filter with the audio (again by FFT). For experimental purposes, the filter could be the perfect inverse of the impulse response, so convolving it with the signal would be the same as deconvolving the signal with the impulse response. Again, I find that this filter (obtained by inverting the division in the impulse response extraction stage above) is complex. How do I derive from this a simple kernel that can be convolved with the signal to effectively deconvolve the impulse response from the audio?

Many thanks to anyone who can help with this.
 
...in my experiments I am finding that the impulse response output is complex i.e. both real and imaginary values are present at equal levels on average.

In case anyone's interested (probably not!), the answer to this is that a forward DFT of real data exhibits conjugate symmetry, i.e. the output is mirrored around the half-sample frequency, with every complex value a+jb in the left half having a corresponding value a-jb in the right half. When multiplying or dividing by another DFT derived from real data, the result should remain symmetric in this form, so the backwards transform after convolution or deconvolution should again produce real data only. That's why I couldn't understand where I was going wrong.

My mistake was to use a very poor FFT that, for large FFT sizes, produced only approximate conjugate symmetry (up to 1% error or so) and generally inaccurate values, leading to complex values at the final output. I changed to the open source library from the FFTW Home Page, and suddenly my impulse response outputs are almost purely real.
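As a quick numerical sanity check of the symmetry argument (just a sketch, using NumPy's FFT rather than FFTW), dividing the spectra of two real signals preserves conjugate symmetry, so the inverse transform's imaginary part should be down at rounding-error level with an accurate FFT:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(512)
X = np.fft.fft(x)

# Conjugate symmetry of the DFT of real data: X[k] == conj(X[N-k])
assert np.allclose(X[1:], np.conj(X[:0:-1]))

# Dividing two such spectra preserves the symmetry, so the inverse
# transform of the quotient is (numerically) real again:
y = rng.standard_normal(512)
h = np.fft.ifft(X / np.fft.fft(y))
imag_leakage = np.max(np.abs(h.imag))   # essentially zero with a good FFT
```

With a sloppy FFT implementation, `imag_leakage` is exactly where the error shows up.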
 
Hi,
I'm interested in room correction and have just started to investigate. I've found that it is a very interesting topic.

To my limited understanding, I'm not sure an FIR filter can really cancel out the echo/reverberation; that seems to cross between the frequency domain and the time domain. I picture it as something like subtracting a faint, barely recognisable image of an earlier stretch of the sound, possibly more than once.

I think it is mostly just correcting the frequency/phase response of the system.
 
Hi saigenku

My experiments so far have been useless. I got nowhere with the DRC: Digital Room Correction software, experiencing a strongly unpleasant effect even when using the 'minimal' configuration, hence my attempts to write my own software just to get a handle on what I am doing wrong.

The way I try to imagine what the FIR filter should be doing is this: suppose you were standing in a field with a speaker in front of you, and a concrete wall behind you. An impulse from the speaker would reach your ears after a certain delay, carry on to the wall, and then be reflected back to your ears as a single echo, after another delay. In theory, by sending a negative impulse from the speaker at just the right moment after the initial impulse, the reflection from the wall could be cancelled out at your ears. If, instead of impulses, we were listening to steady sine waves, then we would get partial reinforcement and cancellation at different frequencies, appearing to give us a wobbly frequency response. We might be tempted to think that a graphic equaliser would fix the problem, but it would not work with transient signals. Suppose the echo came 10 seconds after the original transient. Clearly no 'tone control' could fix it.
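That wall-echo picture can be put into a few lines of code. A single echo of relative amplitude a at delay d has the exact inverse 1 - a·z^(-d) + a²·z^(-2d) - ..., i.e. an alternating, decaying train of cancellation impulses. A toy sketch (the delay and amplitude are made-up numbers):

```python
import numpy as np

delay, a = 50, 0.6          # echo: 50 samples late, 60% of the direct level

# 'Room': direct sound plus one echo
h = np.zeros(delay + 1)
h[0], h[delay] = 1.0, a

# Truncated exact inverse: 1 - a z^-d + a^2 z^-2d - ...
n_terms = 30
inv = np.zeros((n_terms - 1) * delay + 1)
for k in range(n_terms):
    inv[k * delay] = (-a) ** k

# Room convolved with its inverse collapses to (almost) a single impulse;
# the only residual is a tiny a**n_terms term from the truncation.
combined = np.convolve(h, inv)
```

The alternating impulse train is exactly the "negative impulse at the right moment" idea, applied recursively because each cancellation impulse also bounces off the wall.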

I'm not so naive that I believe that we can, or should, achieve perfect 'correction', but I would love to think that I had improved the general coherency of my system e.g. aligning the 'group delays' of my speakers as a by-product of the processing. Denis Sbragion who designed the DRC system above seems to have done a great job of ensuring that the processing can be subtle, and he has included ways to progressively reduce the correction at high frequencies with time, for example. If only I could make it work!

I have, in fact, managed to correct some music recorded at the microphone using my own deconvolution software. The straight recording sounds hollow and 'boxy' the way it always does when the microphone is some distance from the source (the problem we are trying to correct for), but when deconvolved with the measured impulse response (or almost, using the 'Inverse Kirkeby' method, microphone in the same location), the recording sounds much closer to the original - in headphones. But I have not got it to work 'for real'. When your head is in the same position as the microphone, you just don't perceive the same hollow 'boxy' sound without the correction, anyway. And if I pre-deconvolve the source audio with the measured IR, then it doesn't suddenly appear to my ears as though I'm listening with headphones or in an anechoic chamber, as I had expected! Instead, I hear strong 'pre-echoes' and colouration. Maybe I'm not understanding what I'm supposed to be doing with convolution and deconvolution. I have yet to 'close the loop' and record the sound of the corrected audio using the microphone and listen to it with headphones. If it sounds OK, but doesn't when 'live' with my head in the same position as the microphone, I'm not sure how to proceed from there.
 
The first issue that gives you trouble is that you are assuming a linear system. Even a simple loudspeaker system is not linear. The voice coil heats in use and so the coil resistance goes up and the efficiency down. Air also has limits to linearity.

The second issue is that you can only do this for one location. As you have two ears, that places a second limit. A localization error of 10 µs is significant.

Third is that recordings, being binaural representations of three-dimensional pressure modulations, are often improved by adding additional early reflections.

There are several commercial products that try this and although they sound different they don't particularly sound better.

But it is fun to play.
 
I think it is possible to correct the non-linearity to some extent, for example by combining multiple correction FIRs created at different sound levels and doing some kind of interpolation.

Localization is going to be a big problem, especially if more than one person is listening.
 
How are you handling the noise in your room measurements? As you pointed out, deconvolution is a division in Fourier space, so any noise in the transfer function will have a devastating effect on the outcome (sometimes blowing the whole calculation up to infinity), especially in regions where the amplitude is low. There are many scientific papers on handling the noise effectively, but in essence you have to fight it with the amount of data you take.

The "pre-echoes" you are hearing may well be derived from "ringing" as a result of deconvolution.

"Pulse" measurement to get a system's transfer function is, in theory, at its best when the width of the pulse can be infinitely narrow and the slew rate infinitely high. If you cannot get close enough to that in practice, then measuring the steady-state condition is a better way to go. Given that a lot of us can hear differences in the -60 dB region, you have to suppress noise even lower. As the noise goes down with the square root of the number of measurements, you would want to measure at least a million sine-wave cycles. Even at 20 kHz that is 50 seconds, while you would need more than an hour for middle C (261 Hz). At the least, you can try to increase the number of measurements in a practical manner and see whether it improves your deconvolution.
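The square-root law mentioned above is easy to verify with a toy simulation (the signal, noise level and measurement counts here are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
true_signal = np.sin(2 * np.pi * np.arange(1000) / 50)

def one_measurement():
    # Each capture is the true signal buried in additive noise
    return true_signal + 0.5 * rng.standard_normal(1000)

errs = {}
for n in (1, 100, 10000):
    acc = np.zeros(1000)
    for _ in range(n):
        acc += one_measurement()
    # RMS error of the n-measurement average against the true signal
    errs[n] = np.sqrt(np.mean((acc / n - true_signal) ** 2))
# errs should drop roughly tenfold for every hundredfold increase in n
```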

By the way, are you doing your deconvolution in double precision floating point??

Good luck!

Regards,
Satoru
 
Thanks for the comments.

The microphone I'm using is a simple Panasonic WM61 capsule with a 10k bias resistor and a 9V battery. I did build my own Veroboard pre-amp, but I'm now using a Tascam mixer's microphone input. I have yet to build the capsule into a housing, so it currently exists at the end of a stalk of screened cable, reasonably far from any nearby surfaces.

Yes, it has occurred to me that the deconvolution process can involve division of something by nothing, or of nothing by nothing. I wasn't sure whether the best thing was to catch potential divide-by-(almost)-zeroes and discard them, or to pre-empt the problem to some degree by only dividing the bins within my sweep's frequency range. In theory the Kirkeby inverse filter is supposed to suppress huge boosts as well. (I am carrying out all the calculations in double precision.)
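My understanding of that kind of regularised (Kirkeby-style) suppression, sketched here with a made-up toy response, is to replace the raw 1/H with conj(H)/(|H|² + ε), so bins where |H| is tiny get attenuated instead of exploding:

```python
import numpy as np

def regularised_inverse(ir, n_fft, eps=1e-3):
    """Instead of 1/H, use conj(H) / (|H|^2 + eps): where |H| is large
    this is ~1/H; where |H| -> 0 it rolls off to 0 instead of blowing up."""
    H = np.fft.rfft(ir, n_fft)
    return np.fft.irfft(np.conj(H) / (np.abs(H) ** 2 + eps), n_fft)

# Toy check on a short, well-behaved response
ir = np.array([1.0, 0.5, 0.25])
inv = regularised_inverse(ir, 256)
check = np.convolve(ir, inv)   # should be close to a unit impulse
```

The ε can also be made frequency-dependent, e.g. large outside the sweep's frequency range, which has the same effect as only dividing the in-band bins.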

I realise that the system isn't strictly linear, so a single filter can only be an approximation, but I was hoping for something reasonably close which I could then refine. I have a feeling I'm making an error somewhere, though, because although I can 'correct' a recording of uncorrected audio made with the mic at the listening position by multiplying its FFT by the FFT of the Kirkeby filter and taking the inverse FFT, I can't seem to achieve the same result by saving the Kirkeby's real part as a wav file, then later convolving it with music using one of the VST-type convolvers and recording the result with the mic. I'm doing something stupid somewhere.

I have just discovered this nice looking GUI front end for the Sbragion DRC system (don't know how I missed it before) which looks to simplify the process quite a lot, so I will be trying this ASAP.

Digital Room Correction Designer Help
 
One of the very basic points of 'room correction' is that it's really a three part problem. Both from a measurement and a "music listening" PoV.

You seem to be at home with the math involved, so that base is covered.

I'll just tell you about some of my personal experiences since starting out with digital direct/reverberant sound correction some ten years ago (my - how time flies....). Split into the three separate points, excluding the gory innards of the works:

1) Loudspeaker direct sound
This has to be considered the defining base point for everything from here on. Get this wrong, and it doesn't matter what you do in the room correction - it's still a crappy speaker! I always do this as a separate step (since the later convolutions are easily overlaid onto the base-point correction). My reasoning, and my own understanding of this point, is that:
Most of the sound from the lower midrange and up is psycho-acoustically dominated by the non-delayed direct radiation - the sound arriving at your ears without bouncing off anything, just a straight line through the air between the driver and the ear/measuring point.

It's very important to "voice" the speaker at this stage, doing the initial EQ and corrections. I go very easy on the phase corrections, since these have already been partly taken care of in the digital crossover (6x 8192-tap FIR convolutions). I seem to prefer a correction with very little pre-ringing present; some small time/amount is OK, but not like in a totally phase-linear FFT filter. I have no scientific or other reason for this; it's just something I've gradually grown to know.

I do this with the impulse average of three measurements at 2 m distance, at (approximately) 0, 15 and 30 degrees, as reflection-free as possible - I gate the measurement, and limit the correction applied to the region below 300 Hz to a very small amount. The correction for the region below 300 Hz I get from close-mic measurements, which I multiply in on top of the far-field correction (which should tend towards zero correction below 300 Hz). A nice, even frequency response here is just as important as a good room correction. Already after this step you should notice a very marked improvement in the soundstage; otherwise something is very wrong. The only tricky point here is doing the switch from near-field to far-field correction correctly, and that is more often than not manual labour (automating, say, a 12 dB/oct correction switchover often leaves a level difference between the two spliced parts, akin to a "shelf" EQ step - you have to get the curves to blend nicely, with constant sound pressure as the result).

(I'll split this to make it easier to read...) >>>
 
Part 2:

The actual physical placement of the speakers. You just can't get away with doing something stupid here... It's better to have as little NEED for correction as possible. The ways to achieve this are very personal and practical, and they do of course vary with the application - but please refrain from pointing your speakers straight into a concrete wall or something like that. Avoid early "hard" reflections, and try to get a subwoofer-region / room interaction that works as well as possible. There's a limit to what levels of correction you can ask from the next stage:

3) The actual room correction!
Since you already have a well-corrected SPEAKER, and a reasonably set-up ROOM, the need for correction should be as low as possible.
You should by now already be playing your measured test signals through the loudspeaker correction convolution (I export the test signals and play them via Foobar + convolution...).

The best measurement signal that I've found is the one you're already using: the swept sine + deconvolution.

Now Mr Sbragion's software gets to work again. I usually use a very soft correction setting, very close to the standard "soft" setup with linear-phase response. Since the actual speaker correction is already made, there should be very little, if any, pre-ringing in the correction signal. I've never found a "perfect for everything" setting here, even though I've helped quite a few friends get started by now.

Always save your base-point loudspeaker correction! Don't overwrite and forget - or you'll have to do the near-field measurements all over again...

Convolve the base point impulse with the room correction impulse, and save the result as your intended correction signal...
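In code terms, cascading the two corrections is a single convolution of their impulse responses (toy lengths and random data below, purely for illustration; real correction filters are much longer), and the combined filter behaves identically to running the two in series:

```python
import numpy as np

rng = np.random.default_rng(2)
speaker_corr = rng.standard_normal(1024)   # base-point speaker correction (toy)
room_corr = rng.standard_normal(2048)      # room correction (toy)

# One combined filter = the two impulse responses convolved together
combined = np.convolve(speaker_corr, room_corr)

# Applying the combined filter equals applying the two in series,
# because convolution is associative
audio = rng.standard_normal(500)
one_pass = np.convolve(audio, combined)
two_pass = np.convolve(np.convolve(audio, speaker_corr), room_corr)
```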

Then listen to the setup...!

......................

I have some small technical knowledge about the programming and the math behind it, but I am in no way any kind of expert in the area. But if you have any specific questions, feel free to ask. I just can't promise to be able to help you... :)
 
Almost all off-the-shelf software; the only part that I've not assembled from existing parts is a slight rewrite of the sliding-window filter in the room correction (I use less of the direct signal, and concentrate on the room reverberation).

When I started out with DRC, I got the same first impression as you. It did sound like crap, most of the time. I guess the "big thing" is to find a solution, a general recipe, that improves everything slightly - without being too intrusive on the original signal. AND to have reasonably good speakers and a good setup as a starting point.
 
@theSuede

So how do you blend the near and far field speaker correction? Is this a case of mixing (you mention multiplying - is this in the time domain, or convolution?) two wav files, maybe having aligned the phase in the overlapping frequency regions 'by eye'?

I maybe did get a half reasonable correction last week when I measured the IR with a 2 minute sweep on each speaker and used the DRC 'normal' setup. Although I wouldn't leave it on all the time, it didn't offend me, and seemed to improve the 'separation' both between left and right channels, and also 'within the mix'. And I seemed to get deeper, clearer bass notes that really stood out on a couple of tracks. The problems, if there were any, were a cut in the top end brightness (maybe it's now flat!), and a hint of 'smearing' (pre-echo) at the start of certain sounds.

And there's another factor that I'd very much like your opinion on: my understanding of the way room correction works is that, when performing time-domain correction, the FIR filter is in essence following the 'dry' signal with a delayed version which cancels out echoes and reverberation. I'm wondering if there is a subtle problem-ette in this: while the audio that reaches your ears is corrected, there is also a path through the floor which you feel in your feet, or posterior (if sitting down!). Is it possible that the correction may actually increase the vibration level you feel while reducing the audio level you hear, giving rise to a strange sensation of loud music being made quieter?
 
I blend by convolution. That's the rewrite: I make the initial pulse of the measured room response converge into a perfect one-sample pulse, to make the room correction "let the speaker do its thing".

The main problem with a "one-step" room correction is that you don't separate the integrated frequency response from the time response - and they are not entirely dependent on each other. The "room response" sum over a long integration window does not have to be "flat" in any way for the sound of the speaker to be balanced. The main coherency and "integrity" of the music/signal comes from the direct radiation.

So the ideal is to have a good, flat frequency response within a (2/f + 10) ms time window - this is where you have the "speaker sound". The "reverberation" or "room response" after this initial pulse should not count very heavily towards the frequency response correction; it just has to be damped.
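As a concrete illustration of that window (this just encodes the rule of thumb quoted above, nothing more):

```python
def gate_ms(f_hz):
    """Suggested 'speaker sound' analysis window: two periods at the
    frequency of interest plus a fixed 10 ms."""
    return 2.0 / f_hz * 1000.0 + 10.0

# 30 ms at 100 Hz, 12 ms at 1 kHz, 10.2 ms at 10 kHz: the window
# shrinks towards a pure direct-sound gate as frequency rises.
examples = {f: gate_ms(f) for f in (100, 1000, 10000)}
```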

Basically - the frequency response of the system, in-room, does not have to be flat. It doesn't matter.
If you DO push a correction system into making the (long-t) gated room response "flat", then you have probably made the speaker's direct sound misbehave. You take a (maybe) perfectly OK speaker, and make it pre-compensate for errors that will happen LATER in the reverberation - to make the sum flat. Psycho-acoustically this is not very good behaviour, and it can make the sound very "wrong" even though the general impression is that the sound is "balanced".

The FIR/convolution approach to room correction is, just as you say, dependent on a cancellation effect, and this is also why stronger correction means more "hot-spot" sensitivity. Partly because of this, I prefer to adjust the ROOM for a good response in the 400-500 Hz+ region, and limit the correction to the lower frequencies (which have less directivity and phase sensitivity relative to the real distance relations between origin and receiver).
 