Blind Listening Tests & Amplifiers

Stats 103 ...... I promise I will stop after this!

In an attempt to give some perspective to numbers, let me try this:

Let's assume/agree that if there IS a difference, then we should be able to guess correctly 70% of the time (taken indirectly from our 14/20 example)

Case 1. We use "standard values" (alpha=0.05/power=0.8)

We need 37 tries to be 80% confident a "null result" really indicates no difference.

Case 2. We use my suggested values, being more interested in "finding a result" (alpha = 0.2/power=0.9)

We need 30 tries. Again this is to be confident of the negative result.

We can find a positive result (ie. the devices are different) with smaller numbers, as indicated.

This is for a single person/multiple "tosses of the coin".
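For anyone who wants to check these figures, here is a minimal sketch (Python with scipy). The exact N depends on one- vs. two-sided testing and on rounding, so treat the output as approximate rather than definitive:

```python
# Smallest N for an exact one-sided binomial test of H0: p = 0.5 (guessing)
# against a listener who really detects the difference 70% of the time.
from scipy.stats import binom

def required_trials(p1=0.7, alpha=0.05, power=0.80, n_max=500):
    for n in range(5, n_max):
        # fewest correct answers that count as "significant" at level alpha
        k_crit = next(k for k in range(n + 2) if binom.sf(k - 1, n, 0.5) <= alpha)
        if binom.sf(k_crit - 1, n, p1) >= power:
            return n, k_crit

print(required_trials(alpha=0.05, power=0.80))  # Case 1: should land near 37
print(required_trials(alpha=0.20, power=0.90))  # Case 2: should land near 30
```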

NB: The commonest mistake in biostatistical literature is accepting a negative result, when the study is under-powered to make any such comment :nod:

cheers
mark
 
Re: Stats 103 ...... I promise I will stop after this!

mefinnis said:
In an attempt to give some perspective to numbers, let me try this:

We need 30 tries. Again this is to be confident of the negative result.

We can find a positive result (ie. the devices are different) with smaller numbers, as indicated.

Mark: Thanks for the above and for puzzling over this! Assuming what you've come up with is accurate (I'm not in a position to evaluate it), this is very helpful! I can see how the results of these studies can be "abused", but it sounds like it's not that difficult to test for what we want. It would seem Nousaine did just that with at least some of his comparisons.

It's a bit odd that the logic isn't symmetrical, but I can see how that makes some sense. In one direction, as the listener gets more than the required number of votes correct, the probability that they hear a difference approaches 100%.

In the other direction, as the listener does worse than the required number, I would expect the probability there is NO difference to eventually approach 100% with a sufficient number of trials. So in the Nousaine trials where they scored much worse than random, and had a larger number of votes, isn't that a fairly strong indicator there was no difference?

In summary: If Fred flunks the basic 95% positive trial numbers, we're left with Fred unable to demonstrate he can hear a difference, but we're unable to demonstrate that he can't either. If we want to conclude they very likely sound the same, we have to use more votes and the poorer he does, the higher the probability. Correct?

Because of the higher numbers of trials in (most of) the Nousaine comparisons, and the fact that the scores were often below random (less than 50% correct), that should indicate a high probability of the devices being audibly the same? Those are certainly the conclusions drawn in his published articles.
 
Re: Tom Nousaine results

nw_avphile said:
Being as we're talking numbers... I thought I'd share the results of some of Tom Nousaine's blind testing of amps:

B&K ST140 vs Parasound HCA800II - 4 listeners picked correct 31 out of 66 total trials

Sumo Andromeda vs HCA800II - 10 listeners, 49 correct out of 102.

Andromeda vs HCA800II - Long term test over 5 weeks, 1 listener, 5 correct out of 10.

All 3 of the above are at or below 50% correct (random), which means no difference was demonstrated.

At a minimum, I'd want to see these scores broken down by listener before I concluded anything.
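That said, here is a quick sketch (Python with scipy) of how far those pooled totals sit from chance; it says nothing about any individual listener, which is exactly why the breakdown matters:

```python
# Two-sided exact binomial test of each pooled Nousaine score against p = 0.5.
from scipy.stats import binomtest

for correct, total in [(31, 66), (49, 102), (5, 10)]:
    p = binomtest(correct, total, p=0.5).pvalue
    print(f"{correct}/{total} correct: p = {p:.2f}")  # none approach significance
```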
 
vuki said:
I think a double blind test isn't the best solution for evaluating the sound of audio components.
The major drawback is that one cannot listen to two amps at the same time, so everything is left to audio memory. Being exposed to a relatively short music interval doesn't leave enough time for the brain to adjust, so usually only large differences in sound are easily perceived.
Long-term listening is my preferred audio evaluation method, and yes, it is subjective, as music and human senses are.

This is a common misunderstanding about double blind listening tests.

There is absolutely no requirement whatsoever that one listen only to short intervals of music. This erroneous notion seems to stem from the desirability of fast switching times. Fast switching time does not mean short intervals of music. The intervals of music may be as long as the listener desires. You can listen to A for a minute, an hour, a day, a week, whatever. Then switch to B and listen to it for a minute, an hour, a day, a week, whatever. Then go back to A, etc.

Fast switching times only means that when you DO make the switch from A to B, that the switching take place as fast as possible.

This is because the more subtle the difference, the worse our audio memory is. The longer it takes to switch from A to B, the less reliably one can detect a difference.

So a short switching time actually IMPROVES the sensitivity of the test.

A null test also doesn't tell much about the sound of the DUT, because not everything is in the quantity of distortion - spectral content vs. frequency, power, etc. also tells a lot.

But the argument isn't about how something sounds. It's about simply being able to discern a difference. Any difference.

se
 
vuki said:
A null test also doesn't tell much about the sound of the DUT, because not everything is in the quantity of distortion - spectral content vs. frequency, power, etc. also tells a lot.
But a null test shows any errors, including those in frequency response. You can evaluate the spectral content of the null difference to get a very good idea of what sorts of distortions and errors the amplifier has. ALL errors.

As SE says, if you're only looking to detect a difference, it can be a very sensitive test for that (using real music driving real speakers). As I said earlier, if you capture the peak and average null difference level and spectrum with one brand of capacitors, change the capacitors, repeat the test, and compare the results, you can be fairly certain if they make ANY difference at all--and that includes transient performance, phase, harmonic distortions and all other errors or distortions.
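As a minimal sketch of that capture-and-compare idea (Python; the filenames are hypothetical and the captures are assumed mono, sample-aligned and level-matched, which in practice is the hard part):

```python
# Null-test sketch: subtract two captures and report the residual's level.
import numpy as np
from scipy.io import wavfile

rate_a, a = wavfile.read("capture_caps_brand_a.wav")  # hypothetical files
rate_b, b = wavfile.read("capture_caps_brand_b.wav")
assert rate_a == rate_b

n = min(len(a), len(b))
a, b = a[:n].astype(float), b[:n].astype(float)
null = a - b

peak_db = 20 * np.log10(np.max(np.abs(null)) / np.max(np.abs(a)))
rms_db = 20 * np.log10(np.std(null) / np.std(a))
print(f"null peak: {peak_db:.1f} dB, null RMS: {rms_db:.1f} dB (re. signal)")

# The residual's spectrum shows WHERE the two differ (response, harmonics, noise).
spectrum_db = 20 * np.log10(np.abs(np.fft.rfft(null * np.hanning(n))) + 1e-12)
```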

If you are saying that you like distortion (as a single-ended tube amp might produce) then I agree you are better off listening for the kind of distortion you want and the null test is less useful.
 
Which way is up??

nw_avphile said:
Because of the higher numbers of trials in (most of) the Nousaine comparisons, and the fact that the scores were often below random (less than 50% correct), that should indicate a high probability of the devices being audibly the same?

Hidden in here is a very important point. Scoring less than 50% is equally important if our question is based around H0 = "there is no difference".

We are not assuming we know the direction of which is "better" ..... for all I know there may be some relatively inexpensive amplifiers which sound "better" than some more expensive ones.

I would also agree with the statements around wanting to see the individual scores from Tom Nousaine's work.

If we are looking for a small difference, then the measuring instrument has to be adequately sensitive. If we try to measure the difference with a device lacking sufficient sensitivity, then the results will not mean anything.

I re-state from my earlier post, it depends upon the question asked:

1. Can 10 average people hear a difference?
2. Can 10 average "golden ears" hear a difference?

1 & 2 => "pooled results are just perfect"

3. Can any/some specific "golden ear" hear a difference?

This said, the sorts of numbers used start to make one wonder.

regards
mark
 
Steve Eddy said:
However, the sole purpose of the amplifier is ultimately to serve the subjective, emotional human beings at the end of the chain. And their single most important criterion is their own subjective satisfaction. Which may be at odds with an amplifier designed by wholly objective criteria.
I think this excellent point, which has come up a variety of times in various ways, is worth a bit more attention.

Perhaps euphonic distortion really is at the root of much of the difference between the objective/subjective schools of amplifier thought? There are many examples where we humans seem to prefer having certain kinds of distortion present in our audio.

For example, I agreed with fdegrove that the classical recordings from the late 50's/early 60's sound amazingly good. Is that perhaps because they're likely recorded with an all tube signal chain with some euphonic distortions that are not present in newer recordings?

Why are "tube simulators" of various kinds as popular as they are? There is all sorts of studio DSP gear now that uses advanced 32bit DSP processing with "tube algorithms" to simulate tubes and, from what I've heard, some recording engineers are using them. Of course there are real boxes with real tubes you can put into the studio signal chain as well. It's likely most of these devices genuinely create a unique sound that will easily survive a blind listening test. They're all ADDING distortion.

I used to sell Conrad Johnson and Audio Research tube gear. The Conrad Johnson gear stood out in a blind test we did. It didn't measure very well either. But I can't say it sounded bad--especially if you weren't asking it to deliver gut slamming deep bass. I still liked the way it sounded even after I knew it was "flawed".

Even in the digital world, the introduction of dither, and other signals, can be beneficial from a listening standpoint.

So perhaps more of the amplifier differences we're discussing come down to euphonic distortion than I first thought? In comparing the Bryston to the Onkyo, both probably have such low levels of distortion that one can argue that's why they sound the same in a blind test. The same might be true of the Nousaine B&K/Parasound/Sumo trial (I don't know much about the Sumo).

Mind you this is only a partial explanation. If you think the Onkyo, with all its cheap parts, should sound worse than the premium Bryston, then we still have a problem. Likewise, if you think that two brands of similar capacitors sound different, we also still have a problem (unless we can measure some difference in a null test).

But if you simply believe that many amplifiers--especially somewhat esoteric ones--sound different, you might be right. This doesn't change my advocacy of blind listening or null testing, but it might be meeting the subjectivists half way on some of the arguments?
 
nw_avphile said:
1 - The Cheever document is essentially a way to quantify "euphonic distortion"--distortion we find more pleasing (or less objectionable). There's nothing wrong with that, and I never said that low NFB amps don't have higher levels of distortion (which some seem to enjoy). We're back to amplifiers as art. If you WANT more distortion, and less accuracy, then by all means use less NFB as that's one way to get it.

The psychoacoustic research shows that we cannot distinguish quite high levels of low-order harmonic distortion. It would be interesting to run a DBT with you, adding a small amount (below the perception threshold, say <3%) to your beloved Onkyo, to see whether you could actually hear it. I bet you couldn't, because if you could, it would fly in the face of 70 years of research by such 'tweakers' as Shorter.

2 - The main premise of the Cheever document was to attempt to resolve the apparent conflict with what the high-end tweak press reports (in this case Stereophile) and with what can be objectively measured. With respect to euphonic distortion, that's fine, but with respect to low distortion amps, the real issue seems to be the high-end press and their psychologically biased non-blind listening (the very subject of this thread).

The main premise of the document was to reconcile what is heard with what is measured. Just because a meter reads a certain value of something does not mean that value has any relevance at all to the purpose of the device.
And as it was passed by MIT - hardly a tweak school, I would have thought - I think it is fair to assume that the technical analysis is in fact correct. How about you approach it on point, instead of taking the easy way out and simply bashing Stereophile or a connection to them?
I have been reading Crowhurst's writings since I was at Uni two decades ago. The engineer I trained under had a huge catalogue of articles in technical magazines from that time (40's - 80's). I have yet to see you or anyone else successfully discredit what he (and others such as Shorter) wrote then. Shorter, working for the BBC labs, was part of the great tradition of researchers which developed the low-distortion, low-colouration speakers that still stand up well today, so simply to dismiss them based upon your perception of what you think is correct only serves to illuminate your closed mind. In those days the BBC was one of the pre-eminent acoustic research facilities that has ever existed, and all their work was based upon rigorous investigation. Yet simply because it disagrees with your worldview, it has no merit. Hmmm.
The reference to Stereophile at the head was merely a segue into the technical discussion. If the technicalities didn't hold up to scrutiny, do you think it would have passed?

3 - If you look at the credibility of Stereophile, as a whole, much of what they publish really doesn't stand up very well to even common sense. They claim the most outrageous differences from things that have been shown time and time again don't make any difference at all. Obviously not everyone will agree with me on this point, but if you look at the greater community at large, and not just within the esoteric crowd that has a big investment in high-end audio, my views are in the majority.

My reading of your response is that as soon as you saw reference to Stereophile, you turned off your thinking and went into Borg mode, and failed to read the later technical analysis.
I am no great fan of Stereophile, or any of the magazines. I subscribe to the UK hifi+ because it's entertainment (also partly a gift), and has great music reviews and articles.
As for majorities, most people drive cheap cars, watch too much trash TV, listen to boomboxes and eat too much junk food. So should we take a majority view on what a performance automobile is, a great movie, a realistic-sounding hifi or fine cuisine? Most people aren't interested in what is primarily a hobby to a small section of the population. My brother is into surfing in a big way, and whilst I can (just) ride a board, I can tell little difference between them. That does not mean that he can't, with about a million waves behind him, and numerous boards ridden.
If your point was to try to discredit me, you failed. I have nothing tied up in the hi end industry at all.

4 - So, IMHO, the Cheever document is built on a very shaky foundation to begin with--subjective biased comments that don't stand up well to blind listening results. In other words: Garbage In-Garbage Out.

In a word, bull. For ANY person to undertake an analysis of a subject, first they must make a premise. You notice something that doesn't tally, and then you look for reasons why. It's irrelevant whether you hear a difference or measure it on a meter as the starting point for an investigation. In this case it was that what is heard doesn't reconcile with a simple single-figure THD reading. The analysis makes a good case as to why, and it ties in auditory research from other fields unrelated to 'hi fi', as well as a fair amount of previous research. It is primarily a technical document, and has a ton of references to previous academic research. Take it apart on point, or you'll simply show that you are merely full of hot air.
 
Re: Which way is up??

mefinnis said:

Hidden in here is a very important point. Scoring less than 50% is equally important if our question is based around H0 = "there is no difference".

We are not assuming we know the direction of which is "better" ..... for all I know there may be some relatively inexpensive amplifiers which sound "better" than some more expensive ones.

Thanks for the comments. It shouldn't matter which is better, only that the person can hear a difference or they can't? Or am I missing something?

If the person gets 14 out of 20 correct, we agree there's roughly a 95% chance they can hear a difference.

If the person gets 13 out of 20 correct, there's a somewhat lower probability they can hear a difference.

If they get 10 out of 20 correct, that's a perfect random or "null" result. It shows they cannot reliably hear any difference.

If they get 5 out of 20 correct, are you saying that doesn't tell us anything different than a 10/20 result? It would seem to me we're on the other side of the fence now and headed in the direction of the probability showing they do indeed sound the same?
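A quick sketch (Python with scipy) of both tails of the 20-trial binomial shows that "other side of the fence" directly:

```python
# Tail probabilities for k correct out of 20 under pure guessing (p = 0.5).
from scipy.stats import binom

for k in [14, 13, 10, 5]:
    upper = binom.sf(k - 1, 20, 0.5)  # P(at least k correct by luck alone)
    lower = binom.cdf(k, 20, 0.5)     # P(at most k correct by luck alone)
    print(f"{k}/20: P(>= k) = {upper:.3f}, P(<= k) = {lower:.3f}")
```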
 
Re: Re: Re: Re: Re: Re: Harold S. Black.

mikek said:
The simplest analogy I can use is a hose filled with water (to represent electrons in our linear negative feedback audio amp), attached to a tap at one end (which then represents our signal source).

As soon as the tap is opened, water must flow out the open end of the hose instantaneously...there can be no delay in the initial response....

No, it must not flow out the open end of the hose instantaneously. And indeed it does not flow out the open end of the hose instantaneously. That's because instantaneous is impossible over any distance with a finite propagation velocity.

This does not happen with water and it does not happen to electrons. Both acoustic and electromagnetic waves propagate at finite velocities. And because of this, there will ALWAYS be a delay.
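To put a rough, hypothetical number on that delay (a 10 cm path and an assumed ~2/3 c velocity factor):

```python
# Back-of-envelope propagation delay over a short feedback path.
c = 3.0e8            # speed of light, m/s
velocity = 0.66 * c  # assumed velocity factor for a typical conductor
length = 0.10        # hypothetical 10 cm path, in metres
print(f"delay = {length / velocity * 1e9:.2f} ns")  # about half a nanosecond
```

Tiny, but never zero.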

se
 
Re: Re: Which way is up??

nw_avphile said:

Thanks for the comments. It shouldn't matter which is better, only that the person can hear a difference or they can't? Or am I missing something?

If the person gets 14 out of 20 correct, we agree there's roughly a 95% chance they can hear a difference.

If the person gets 13 out of 20 correct, there's a somewhat lower probability they can hear a difference.

If they get 10 out of 20 correct, that's a perfect random or "null" result. It shows they cannot reliably hear any difference.

If they get 5 out of 20 correct, are you saying that doesn't tell us anything different than a 10/20 result? It would seem to me we're on the other side of the fence now and headed in the direction of the probability showing they do indeed sound the same?


It's been a while since I took statistics, but for a random event with two possible outcomes and a large number of trials, how can you get anything but 50% if there is no difference? That sounds like too few samples, or an indication that it is not a random outcome.
 
Directions ......

nw_avphile said:
If they get 5 out of 20 correct, are you saying that doesn't tell us anything different than a 10/20 result? It would seem to me we're on the other side of the fence now and headed in the direction of the probability showing they do indeed sound the same?

Remember our binomial probability thingo. If we take N = 20 tries, then the probability of r successes was given by our formula, with the results:

r probability
1 0.000019
2 0.000181
3 0.001087
4 0.004621
5 0.014786
6 0.036964
7 0.073929
8 0.120134
9 0.160179
10 0.176197
11 0.160179
12 0.120134
13 0.073929
14 0.036964
15 0.014786
16 0.004621
17 0.001087
18 0.000181
19 0.000019
20 0.000001

So while it is very unlikely by chance we would get 15 successes, it is equally unlikely we would see 5 successes. If we do see this then we need to look further into the trial design, because it is a very unlikely thing to happen solely by chance. (It can ..... just very odd/worrying)
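The table above is just the binomial pmf; a short sketch (Python with scipy) reproduces it, including the r = 0 row, which rounds to 0.000001:

```python
# P(r correct out of 20) under pure guessing - reproduces the table above.
from scipy.stats import binom

for r in range(21):
    print(f"{r:2d}  {binom.pmf(r, 20, 0.5):.6f}")
```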

Mark
 
Re: Re: Which way is up??

nw_avphile said:

If they get 5 out of 20 correct, are you saying that doesn't tell us anything different than a 10/20 result? It would seem to me we're on the other side of the fence now and headed in the direction of the probability showing they do indeed sound the same?

No, because getting only 5 right out of 20 would be surprising if they were only guessing. You'd expect just by chance they'd be right more often than that. And, if they're not guessing, and you're sure nothing fishy is going on, then they must be able to hear a difference.
 
Christer, Sy et al:
originally posted by nw_avphile: Hey! We're getting closer... now what we need are two members here who actually live halfway close to each other to make this sort of challenge a reality.
The test that was planned by dorkus never materialized as far as I know. For those of you who want to see its origin and the discussions in that thread, go here.
 
nw_avphile said:

For example, I agreed with fdegrove that the classical recordings from the late 50's/early 60's sound amazingly good. Is that perhaps because they're likely recorded with an all tube signal chain with some euphonic distortions that are not present in newer recordings?

I also think that this time period was a golden era for classical recordings as regards sound quality. Obviously the equipment was already quite good then. However, I think the main reason for the good sound quality had more to do with recording procedures. One, two or, at most, three microphones were used. Multitrack tape recorders were not used. I guess all mixing usually took place at recording time, and there wasn't much to mix anyway. It was also unusual to cut and paste the tape to fix musicians' mistakes. Of course, simpler and more straightforward recording procedures were naturally accompanied by simpler and more straightforward electronics.
 
It's that instantaneous feedback thing everyone keeps talking about.

As long as everyone keeps talking about keeping names out of this, can we pick someone other than Fred for the hypothetical listening test examples? I feel like a character in one of those books that teach kids to read in school.

See Fred listen. Listen, Fred, Listen!

Maybe we could use Dick and Jane as our hypothetical listeners......
 
Re: Directions ......

mefinnis said:

Remember our binomial probability thingo. If we take N = 20 tries, then the probability of r successes was given by our formula, with the results:

r probability
1 0.000019
2 0.000181
3 0.001087
4 0.004621
5 0.014786
6 0.036964
7 0.073929
8 0.120134
9 0.160179
10 0.176197
11 0.160179
12 0.120134
13 0.073929
14 0.036964
15 0.014786
16 0.004621
17 0.001087
18 0.000181
19 0.000019
20 0.000001

So while it is very unlikely by chance we would get 15 successes, it is equally unlikely we would see 5 successes. If we do see this then we need to look further into the trial design, because it is a very unlikely thing to happen solely by chance. (It can ..... just very odd/worrying)
Good point! Like I said, statistics aren't a strength of mine :)

Sorry for asking what, in hindsight, was a dumb question. A really low result WOULD be worrying. Most of the Nousaine results are within the 40-60% range, so they're not worrying in that sense. I was just extrapolating in the wrong way. It's a sort of bell curve, which makes perfect sense now.
 