Blind Listening Tests & Amplifiers

frugal-phile™
nw_avphile said:
Hmmm... do you have references on that? I have a feeling we're talking about a different definition of "the same"? I agree a blind test cannot prove two things are the same IN ALL WAYS. It can only prove they're the same in the ways the test is measuring (or judging) them.

If i could find the references buried in my emails i'd have made a web page. As Steve has said, as designed in an ABX test, if you can not distinguish the difference, you have a null result from which nothing can be concluded.

You (planet10) said earlier:

Not me

dave
 
diyAudio Senior Member
OUCH...

Hi,

For example, in double blind drug studies they compare the drug to a placebo. If the results of the study don't show a statistically significant effectiveness for the drug, it flunks the study.

In French: Oufti.

Please, talk about whatever you're comfortable with and, I hope, know something about.

Pancakes perhaps?

Or will that be a placebo?:rolleyes:

P.S. The quote I used isn't necessarily wrong; it's the idea of relating it to audio that is rather far-fetched.

Hold on, let me get my aspirin (the most underestimated drug, BTW).

Mamma mia, ;)
 
planet10 said:
If i could find the references buried in my emails i'd have made a web page. As Steve has said, as designed in an ABX test, if you can not distinguish the difference, you have a null result from which nothing can be concluded.

Not me
Well please let us know if you find them. I've looked at a bunch of blind test documentation and have not seen anything that supports that (but I can't say I have anything to disprove it either).

As for the quote, sorry, you're right it was Rob M in post #58.
 
frugal-phile™
Joined 2001
Paid Member
nw_avphile said:
Well please let us know if you find them. I've looked at a bunch of blind test documentation and have not seen anything that supports that (but I can't say I have anything to disprove it either).

Don't hold your breath -- i've been looking on & off for a year or so... it is in the archives of the hi-fidelity mail list on Yahoo... the posters were David Klein & Timothy Bailey.

dave
 
nw_avphile said:

Why do really large numbers make a difference in this case, beyond the study gradually becoming more reliable as you add samples? Do we have any statistical/logic experts here to help out with this part of the discussion?

There's nothing statistical about it. It's just that there are lots of ways to screw up a listening test. The more different tests get done, run by different people at different sites, the greater the chance that you'll get some that weren't screwed up. And the same's true for positive results -- if one test shows an audible difference, it might be that there really is a difference, or it might mean that the levels weren't really matched, or that the listeners figured out the pattern, or any of the other ways that they could guess right without actually hearing anything. More successful tests means less likelihood that the "success" was just a fluke.
 
Re: $$$$

fdegrove said:
In recording just as in hi-fi reproduction, I have one word of advice: keep it simple, keep the signal path as short as possible, don't use feedback to impress others or the datasheets, try to be faithful to the music and yourself.
At least there's something we agree 100% on! I very much agree with the above, but sadly, your advice isn't widely followed in the recording industry these days. As I was saying earlier, the audiophile recordings are often nicely done but the performance itself usually leaves a lot to be desired. There are also just not that many of them even if you did like the obscure performers.

I also agree there are some stunning classical recordings from the late 50's and early 60's. They have lots of tape hiss, but they're otherwise impressive. They're all the more impressive when you think about what the "state of the art" was back then.

As for the rest of your comments, I agree good doesn't have to mean expensive, but I still know there's a big double standard. There's all sorts of stuff in the studio signal path that should make audiophiles make a funny face and cover their ears--especially when you consider it greatly outnumbers the stuff in the playback chain. You yourself have said that just one part--Wondercaps--can ruin the sound of a system. So something doesn't add up here because audiophiles seem perfectly happy with the sound from their systems playing recordings that should be offensive to their ears.
 
My blind procedures (long!)

Rob M said:
There's nothing statistical about it. It's just that there are lots of ways to screw up a listening test. The more different tests get done, run by different people at different sites, the greater the chance that you'll get some that weren't screwed up.
Well I was expecting something like that. I guess I'd like to be able to put some numbers to it if that's possible? In other words, if you run one test, it's 60% accurate, if two tests agree it's 65% accurate, if ten different tests agree, it's 90% accurate, etc. Is that possible? I realize the quality of the test is an issue, but keep reading...

I still have to come back to the common sense side of this. If, as you suggest, the levels weren't matched, that would actually bias the test towards HEARING A DIFFERENCE. If someone guessed the pattern, that would also bias toward hearing a difference. If someone could see the leads or had other clues, that would also bias the test towards hearing a difference. In other words, all the likely errors I can think of FAVOR HEARING A DIFFERENCE! It's hard for me to come up with things that would actually obscure differences between the amps and cause a "false null".

I'll outline my procedures, and if someone can tell me how this test would mask the audible differences between the amps, I'd love to hear it. The only credible concerns I'm aware of are the time it takes to do the switch if you're cable swapping, and the switching hardware if you're using a switcher.

One can argue the time shouldn't be a big deal as most golden ears claim to have a great "sonic memory". Reviewers frequently compare products they're reviewing to others they haven't heard for several months. We're only talking a few minutes here.

The Onkyo/Bryston test was run as follows (loosely patterned after the Sunshine Challenge):

The test takes place in the listener's own home using their own system and music. We'll call their amp (the Bryston combo) "Amp A".

The listener (or listeners) pick a playback level they're most comfortable with on their system while playing a familiar recording. Tell them you're not going to speak to them once the test starts; it's their responsibility to ask you when they want a "swap", and that's also when they should tell you which amp they think has been playing. I also ask them, in advance, what percentage of correct votes they should get to indicate they can really tell the difference. Some are so confident they say 100%. The lowest I've heard is 2 out of 3, or 66%. Then ask them to leave the room for a moment.

Leave the gain set where they like it and play a -20 dB test sine wave on a diagnostic CD (you can buy one for $6 if you don't have one). The frequency isn't terribly critical as we're already assuming the amps have flat response in the audible range. I would suggest 400 Hz or 1 kHz.

Use a DMM (you can buy one of those for $25) to read the level at each of the speaker terminals while the sine wave is playing. Write down the voltages.

Connect the new amp or receiver (Amp B--the Onkyo) to the CD player. The purest way to do this is simply to unplug the CD interconnects from the listener's amp/preamp and plug them into the evaluation amp. It's a good idea to temporarily ground the CD player to both amps so that you don't get any nasty hum or other weirdness when you swap cables. You can disconnect the ground once the connections are swapped so there's no chance of it altering the sound. You can also use the preamp outputs if you want to remove the listener's preamp from the evaluation.

Move the speaker leads from Amp A to Amp B. It's best to make the line level connections with the speakers disconnected from both amps. I've had to make adapters if Amp A has really huge speaker wire terminations that don't fit Amp B. The adapters, if anything, would make Amp B easier to pick out.

Play the same test track and use the DMM to set the voltage to be the same as what you wrote down previously for each speaker (adjusting balance if necessary). The two amps are now closely level matched. The accuracy of the DMM isn't an issue (unless it has huge drift problems--very unlikely) as you're only concerned about the relative match. You should be able to get the two amps within a few millivolts of each other, which is a tiny fraction of a dB at typical listening levels.
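As a quick sanity check on that claim: the level difference in dB is just 20·log10 of the voltage ratio. A minimal Python sketch; the voltage readings are made-up example values, not from any actual test:

```python
import math

# Hypothetical DMM readings (volts RMS) at one speaker terminal
# while the -20 dB test tone plays -- example values only.
v_amp_a = 2.000   # listener's own amp (Amp A)
v_amp_b = 2.003   # evaluation amp (Amp B), after adjusting its gain

diff_db = 20 * math.log10(v_amp_b / v_amp_a)
print(f"Level mismatch: {diff_db:+.4f} dB")   # prints +0.0130 dB
```

A 3 mV mismatch on a 2 V signal is about 0.013 dB, far below anything that could plausibly be heard as a level difference.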

Flip a coin to pick Amp A or Amp B to be up first and invite your listener back in and play the music of their choice. Two things are important here. First, the listener should not have any visual clues as to which amplifier is connected either from the listening position or as they enter and leave the room. Most equipment racks allow for this but I've had to do things like stand the DUT up on end behind the rack, set up a bit of audibly transparent camouflage (I bring some pieces of speaker grill cloth), etc.

Second, because this isn't a double blind test, the person running the test needs to avoid giving any clues to the listener: avoid eye contact, don't talk or ask questions, etc. I usually sit off to the side, out of sight of the listener, when I'm not needed, to let them focus on the task at hand.

When the person is ready for a "switch", I turn around so they can't see my face and they tell me which amp they thought was playing and I record their answer. They leave the room briefly and I flip the coin again to decide if I swap the leads or only pretend to swap the leads and invite them back in when I'm done. Again, it's obviously important they can't hear or see what you're doing at swap time. If they like (and they usually do) we replay the same music.

When they request the next swap, they again "vote" for which amp was just playing and again I record their answer.

You repeat the above few steps using different music as desired, for as long as the listener desires. If they're scoring really well, I usually terminate the test early as it's obvious they can hear a difference--again, this is biasing the test TOWARDS hearing a difference. If they're scoring about random, in my experience, they usually know they're only guessing and they tend to terminate the test fairly early. It's when they think they hear a difference but are not sure that things drag on.

When the test is over, you tally up the right and wrong answers and compare them to the total number of votes. I then share the results with the person and we discuss them against what they expected to hear. The Onkyo/Bryston guy admitted he was only guessing. He did, however, want to verify that he was indeed listening to both amps--i.e. that I was really swapping sometimes (which is easy enough to verify afterwards).
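For anyone who wants to automate the bookkeeping, here's a rough Python sketch of the coin-flip/vote logic described above. The function name and prompts are just illustrative; the test conductor would still do the real (or sham) swaps by hand:

```python
import random

def run_session(n_votes: int) -> None:
    # Opening coin flip decides which amp is up first.
    playing = random.choice("AB")
    correct = 0
    sequence = []
    for _ in range(n_votes):
        vote = input("Which amp was playing (A/B)? ").strip().upper()
        sequence.append(playing)
        correct += (vote == playing)
        # Coin flip at each swap request: really swap, or only pretend to.
        if random.random() < 0.5:
            playing = "B" if playing == "A" else "A"
    print(f"{correct}/{n_votes} correct; actual sequence: {''.join(sequence)}")

# run_session(10)
```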

All that said, for my own personal tests, I built a "toggle" (momentary contact) switcher that simply switches the inputs and speaker leads through very high quality relays and wiring. This allows instant A/B switching without the listener necessarily knowing which is A and which is B. Because it's an instant swap, it's more sensitive to differences. Some, of course, argue the extra contacts and wiring degrade the sound and mask any differences, but I've not found that to be the case in doing blind tests comparing the sound with and without the switcher in place.

Anyway... there's a DIY procedure for a simple blind home test. Comments anyone?
 
Disabled Account
Steve Eddy said:

That's all well and good from a purely objective, emotionless point of view.

However, the sole purpose of the amplifier is ultimately to serve the subjective, emotional human beings at the end of the chain. And their single most important criterion is their own subjective satisfaction. Which may be at odds with an amplifier designed by wholly objective criteria.

....must be transparent, neutral, i.e. not bend the signal... if you want an emotional twist, distortion, etc. added to your music, then introduce it at line level, with an 'emotion-adder'


Steve Eddy said:

You seem to be saying that objective criteria are the only valid criteria to use when it comes to designing an amplifier.

if by objective you mean sound, proven, and well-founded scientific engineering principles, guilty as charged....

Steve Eddy said:

If that is what you're saying, then I'm afraid I will have to disagree. The equipment serves us, not the other way around.

se


...indeed the equipment serves us, but we surely must have some unambiguous and transparent method of determining the quality of service?
 
Re: My blind procedures (long!)

nw_avphile said:

Well I was expecting something like that. I guess I'd like to be able to put some numbers to it if that's possible? In other words, if you run one test, it's 60% accurate, if two tests agree it's 65% accurate, if ten different tests agree, it's 90% accurate, etc. Is that possible?

Not really, no. You want to set the test up so that any factors you think might be important are incorporated into the design. That leaves all the stuff you didn't think of (solar flares? seismic activity?) which is going to be pretty hard to assign a number to. And then there's the human factor: actual people will be performing the tests, and they make mistakes. They also cheat, sometimes without even realizing it. So, it's up to you: how many passed listening tests would you need to see to be completely convinced that resistor end cap material audibly changes the sound? And how many would it take before you decided to spend a little more on resistors for your next project? (Personally, I'd say "two" and "one", if they were done by the right people.)

As for the test you outline, I'd say it's got a few problems, but I'll let someone who knows more about psychoacoustics than I do comment on it.
 
Statistics 101 ..... A REALLY BORING POST

I've skimmed the last few pages and there are a number of points needing clarification.

For a start, please see my earlier post:
http://www.diyaudio.com/forums/showthread.php?postid=149253#post149253

Statistics cannot prove anything

In case you are not listening, let me repeat that more clearly:

STATISTICS DO NOT PROVE ANYTHING!!!!!!!

And remember, I am one of the proponents of the DBRCT process.;)

What statistics tells us is the probability that we would observe the given set of results by chance alone. That is it .... finito!

The element of proof comes from agreed convention. For example, in most biostatistical studies we accept (defined before we start the study!!) that if we see a set of observations which have less than 5% probability of occurring when no difference exists, we will accept the result as "significant".

By definition we are now going to have a "significant" result in 5% of cases when there is no real effect. This is called alpha error.

A significant result is interesting as it may indicate there is a real effect and we should look closer. This may mean repeating the study with some variation, in a different population, etc.

If the effect is explicable by process, the study (or studies) well designed, and the effect reproducible, we "accept" that the effect is real.

NB: Nothing has actually been proven, we simply accept that the observations are so unlikely by chance that there must be something real.

Now, if we are talking about a drug or an intervention which carries risks, as well as a benefit, we might set the alpha level to 1%, by convention.

OK ........ something a little closer to the thread (you guys still awake??).

Let's take "differentiation of audible difference". How unlikely, by chance, would we like the results to be for us to accept that the result is significant ?

That's right folks, we get to choose before we start :eek:

I would suggest that we would actually set an alpha level higher than 0.05 (5%). The "downside" is not going to hurt anyone and we are more interested in "finding" differences if they really are there (I hope).

SO, for our DBRCT, I would recommend alpha = 0.2 (just my opinion, nothin' factual here!)

BTW, if the stereopile reference from Fred was all true, the probability of guessing 5 heads is 3% and I would be kind of prepared to accept that this was a significant result.

Now, we have the ever present statement, "the null result says nothing". Let me be quite clear about this, WRONG!

A "non-significant" result tells us exactly the same thing, the probability of observing the given results by chance alone .... blah, blah, ... persistent, aren't I :devilr:

Now, for us to say that the "non-significant" result actually means "there is no difference", we have to design the trial to another agreed set of rules, so we have sufficient confidence to accept that this is true.

Here we are talking about "power", which is based on the reverse error to setting the alpha level. Alpha is accepting a difference when none exists; beta is accepting no difference when one really does exist (see, Fred, I'm thinking of you!).

Normal biostatistical power is accepted (again it is a convention thing) at 80%. That is, if a real difference exists, the trial has an 80% chance of detecting it, so a "no difference" finding carries real weight.

Back to our DBRCT (the one I prepared earlier). We might say that we are really in search of difference; therefore we want 90% power, so we can be more confident a "no difference" finding is real.

To calculate power is slightly more complicated and depends upon numbers and trial design, but this reduces to math in the end.
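To make that concrete: for a simple forced-choice trial, alpha and power are both just binomial tail sums. A minimal Python sketch, using 20 trials with a pass mark of 14 (numbers that come up later in this thread); the listener's assumed "true" hit rate of 0.7 is purely an illustration, not a measured figure:

```python
from math import comb

def tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, k_pass = 20, 14
alpha = tail(n, k_pass, 0.5)   # chance of "passing" by pure guessing
power = tail(n, k_pass, 0.7)   # chance of passing IF the listener truly
                               # hears a difference on 70% of trials (assumed)
print(f"alpha = {alpha:.3f}, power = {power:.2f}")  # alpha = 0.058, power = 0.61
```

So even a listener who genuinely hears a difference on 70% of trials would fail this 20-trial test almost 40% of the time -- which is exactly the beta error described above.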

Then we come to what question are we actually asking (see post reference above), as this will alter the trial design.

The results of 1 trial are "interesting" and may indicate an effect, or lack thereof.

If several trials line up along the same planetary axis (in respect for Galileo) then I am getting interested.

If these results are reproducible in different locations with variations of trial design, then I'm prepared to call effect/no-effect.

AGAIN, and I'm getting tired of saying this, properly designing a DBRCT is actually a very difficult thing. Done properly, however, it is the gold standard for assessing an effect which is under the influence of psychological factors.

The "double-edged" comments following my last post did very little to credit their authors.

cheers
mark
 
Re: Statistics 101 ..... A REALLY BORING POST

mefinnis said:
What statistics tells us is the probability that we would observe the given set of results by chance alone.

I'll buy that. "Prove" is a strong word I suppose. With respect to probabilities, I have the following information on audio blind testing from people that make a business out of it so hopefully the math/theory is accurate?

The chances of a coin coming up heads on two successive tosses is 1 in 4; on three successive tosses, 1 in 8; on four successive tosses, 1 in 16. Getting tosses of heads ten times in a row is likely only one time in 1024. In other words, large samples tend to smooth out the aberrations of randomness that occur.

Therefore, not only do you need to conduct a fair number of trials in a session, but the results have to be substantially better than 50% to mean that there is an audible difference between the units under test. The very minimum number of trials you should do in a session is ten. Out of ten trials, if the listener chose the correct amplifier seven times, he or she might have heard some difference between the two, but it's really too inconclusive. Eight correct choices would indicate a probable audible difference between the amplifiers, with about a 95% level of confidence. Nine out of ten would be an even stronger indication. Ten out of ten would almost certainly indicate the listener heard some sonic differences between the amps, since there's only a 1 in 1024 chance of it happening randomly.

What we’re really after in these statistics is a high degree of confidence that the results show real conditions and not random occurrences. For example, a series of 25 trials has 33,554,432 possible combinations of right/wrong answers, ranging from 0 correct/25 wrong, up to 25 correct/0 wrong. There are 5,200,300 possible combinations of 12 correct/13 wrong, and an identical number of possible 13 correct/12 wrong. There is only one combination of 25 correct/0 wrong, and while it is possible that a random sequence of responses could be right 25 out of 25 times, there is only a 1 in 33,554,432 chance of it happening. Therefore, we can say that there is a 33,554,431/33,554,432 (99.999997%) chance of it not happening; that would also be our level of confidence in the results: 99.999997%.

We won’t be quite as picky with the listening tests; a 95% minimum level of confidence will be good enough. That is, there should be less than a 5% chance that the results can be attributable to chance.

Trials/Minimum Correct
10/8
12/9
14/10
16/11
18/12
20/14
22/15
24/16

So, if your listener correctly “guesses” the identity at least the minimum number of times in the session of trials, you can confidently estimate that there is an audible difference between the amplifiers under test. On the other hand, if the listener gets fewer than the minimum number correct, it means that you can’t confidently say there’s an audible difference.


So, according to these guys, for 95% confidence you need anywhere from 66% correct to 80% correct depending on the number of trials. According to them a null result means "you can't confidently say there's an audible difference."

So, for example... If I go visit Fred and bring my trusty Onkyo with me to compare to his amp in a blind test. We run say 20 trials and he would have to get at least 14 right for there to be at least a 95% probability he can hear the difference between his amp and the Onkyo. If he gets fewer than 14 of the trials right, can we say there's at least a 95% probability he cannot hear an audible difference?

As I understand it, the test can prove to 95% that Fred cannot hear the differences he might claim to be able to hear, but a statistically insignificant result (i.e. fewer than 14 correct) doesn't prove the amplifiers are necessarily identical. Do I have this right?
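The quoted table is easy to check with an exact one-sided binomial tail. A short Python sketch (just a check of their arithmetic, nothing more):

```python
from math import comb

def p_at_least(n: int, k: int) -> float:
    """Probability of k or more correct out of n by guessing (p = 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# The quoted Trials / Minimum Correct table
for n, k in [(10, 8), (12, 9), (14, 10), (16, 11),
             (18, 12), (20, 14), (22, 15), (24, 16)]:
    print(f"{k:2d}/{n:2d}: p = {p_at_least(n, k):.3f}")
```

Run it and the rows come out between roughly 0.05 and 0.12, so the table is a bit looser than a strict 5% criterion in most cases; the 14-of-20 entry works out to 0.058.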
 
mikek said:
....must be transparent, neutral, i.e. not bend the signal... if you want an emotional twist, distortion, etc. added to your music, then introduce it at line level, with an 'emotion-adder'

Why? Why can't I simply build an amplifier that I'm happy with?

if by objective you mean sound, proven, and well-founded scientific engineering principles, guilty as charged....

By objective I mean universally applicable. Ohm's Law for example. It applies to me, you and everyone else. It doesn't change depending on the individual.

...indeed the equipment serves us, but we surely must have some unambiguous and transparent method of determining the quality of service?

We already do. It's called... listening. :bigeyes:

se
 
Tom Nousaine results

Being as we're talking numbers... I thought I'd share the results of some of Tom Nousaine's blind testing of amps:

B&K ST140 vs Parasound HCA800II - 4 listeners picked correct 31 out of 66 total trials

Sumo Andromeda vs HCA800II - 10 listeners, 49 correct out of 102.

Andromeda vs HCA800II - Long term test over 5 weeks, 1 listener, 5 correct out of 10.

All 3 of the above are at or below 50% correct (chance), which means no difference was found. I'm familiar with the Parasound and the B&K but have no personal experience with the Andromeda. Most audiophiles would consider the Parasound (which I believe is a John Curl design?) to be a much better sounding amp than the ST140, which is a very old design dating back to the late 70's or early 80's. Not as dramatic as an Onkyo receiver against a Bryston, or a Yamaha integrated against big $15,000 monoblocks, but still interesting.

Mr. Nousaine also tested WonderCaps against Radio Shack caps and the result was 38 correct out of 94 with 7 listeners. This is much worse than random, so it was concluded the capacitors sounded the same. The amplifier was a Bryston 2B.

He also tested speaker wire and interconnects from AudioQuest, MIT, Monster Cable, and H.E.A.R. against Radio Shack blister-pack $2.50 RCA cables and 16 gauge zip cord. In all cases, the results with as many as 7 listeners were 50% or worse, so they were deemed to sound the same.
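Putting exact chance probabilities on those tallies takes only a few lines of Python; a sketch, with labels abbreviated from the results above:

```python
from math import comb

def p_at_least(n: int, k: int) -> float:
    """Probability of k or more correct out of n by guessing (p = 0.5)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

results = [("ST140 vs HCA800II",      31,  66),
           ("Andromeda vs HCA800II",  49, 102),
           ("Long-term, 1 listener",   5,  10),
           ("WonderCaps vs RS caps",  38,  94)]
for label, k, n in results:
    print(f"{label}: {k}/{n} correct, p = {p_at_least(n, k):.2f}")
```

All four come out above 0.6, i.e. scores at or below what coin flipping would be expected to produce.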

References for the above can be found in:

Nousaine, Thomas, "Wired Wisdom: The Great Chicago Cable Caper", Sound and Vision, Vol. 11 No. 3 (1995)

Nousaine, Thomas, "Flying Blind: The Case Against Long Term Testing", Audio, pp. 26-30, Vol. 81 No. 3 (March 1997)

Nousaine, Thomas, "Can You Trust Your Ears?", Stereo Review, pp. 53-55, Vol. 62 No. 8 (August 1997)

Some other interesting references:

Toole, Floyd E., "Listening Tests - Turning Opinion Into Fact", Journal of the Audio Engineering Society, Vol. 30, No. 6, June 1982, pp. 431-445.

Toole, Floyd E., and Olive, Sean E., "Hearing is Believing vs. Believing is Hearing: Blind vs. Sighted Tests, and Other Interesting Things", 97th AES Convention (San Francisco, Nov. 10-13, 1994), Preprint 3893 (H-5), 20 pages.

Burstein, Herman, "Approximation Formulas for Error Risk and Sample Size in ABX Testing", Journal of the Audio Engineering Society, Vol. 36, p. 879 (1988)
 
planet10 said:

* i did recall reading about an extensive double blind test done somewhere in Europe (The Netherlands?) where 3 systems were used: an analogue, tube system; a digital, tube system; and a digital, SS system. Each participant was measured before and after for the level of "relaxation/stress". System 1 left people more relaxed, System 3 left people less relaxed, and System 2 was in between.

This kind of test is probably MUCH more meaningful -- and requires no effort on the part of the subject -- than any blind AB test.

dave


here:

http://www.stereophile.com/fullarchives.cgi?203



I think a double blind test isn't the best solution for evaluating the sound of audio components.
The major drawback is that one cannot listen to two amps at the same time, so everything is left to audio memory. Being exposed to relatively short music intervals doesn't leave enough time for the brain to adjust, so usually only large differences in sound are easily perceived.
Long-term listening is my preferred audio evaluation method, and yes, it is subjective, as music and human senses are.

A null test also doesn't tell much about the sound of the DUT, because not everything is in the quantity of distortion: spectral content vs. frequency, power, etc. also tell a lot.

I think Vladimir Lamm of Lamm Industries tells a lot about this subject - his interviews are fun to read:
http://www.lammindustries.com/interviews.html
 
Binomial probability distribution ....... Stats 102 ??

The binomial probability for obtaining r successes in N trials is:

P(r) = [N! / (r! (N-r)!)] · p^r · (1-p)^(N-r)

where p is the probability of success on each trial (0.5 in our case, under H0 = no difference)

We run say 20 trials and he would have to get at least 14 right for there to be at least a 95% probability he can hear the difference between his amp and the Onkyo.

P(14) = (20! / (14! 6!)) · 0.5^14 · 0.5^6 = 0.037

But we really want the probability of 14 or greater, which is 0.058 .... near enough!
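Those two figures are easy to reproduce; a quick check in Python:

```python
from math import comb

p_exact = comb(20, 14) / 2**20                             # exactly 14 of 20
p_tail = sum(comb(20, k) for k in range(14, 21)) / 2**20   # 14 or more of 20
print(f"P(14) = {p_exact:.3f}, P(>=14) = {p_tail:.3f}")    # 0.037 and 0.058
```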

I return however to the question of where we set the alpha level, and if we chose 0.2 (or even 0.15, the math barely changes), Fred needs 13 or better (P(>=13) comes out around 0.13).

As I understand it, the test can prove to 95% that Fred cannot hear the differences he might claim to be able to hear, but a statistically insignificant result (i.e. fewer than 14 correct) doesn't prove the amplifiers are necessarily identical.
Almost, we can be 95% sure they are DIFFERENT if he gets >=14. If he gets < 14 then our confidence they are the SAME is not the same 95%, sorry.

To work this out I need to know what the probability of guessing A/B would be if they were different .... OUCH :scratch:

I might stop at that point, as you see it becomes none too simple.

cheers
mark
 
Disabled Account
Steve Eddy said:

Why? Why can't I simply build an amplifier that I'm happy with?

By objective I mean universally applicable. Ohm's Law for example. It applies to me, you and everyone else. It doesn't change depending on the individual.

We already do. It's called... listening. :bigeyes:

se

...someone is going round in circles and it's not yours truly...
 