The double blind auditions thread

Sorry, don't have Toole's book. Not sure which Lipshitz paper you're referring to but I assume it's not the one linked previously in this thread as that makes only passing mention of line equalization.

By "DBT amp tests" are you referring to the 1982 tests mentioned in post 2 of this thread? If so, I don't have that reference handy either. I also checked the ABX Bibliography and the only link there is broken. I'm not adverse to pulling paper references if they're worthwhile, but that's not something easily done after hours on a Friday. :p
 
No, they level-matched in that experiment, but the paper I was referring to was the landmark JAES paper. Worth reading. VERY worth reading. It's also discussed in the Clark JAES paper.

From the BAS paper:
"The gains of the "A" and "B" paths were matched in both left and right channels to within 0.05 dB at 1 kHz using the PCM-F1's gain controls."

Really, what else did they need to say?
 
Hi,

Double blind testing is a standard scientific method for evaluating effects on humans while preventing subjective opinions from corrupting the data. It is used, for example, for evaluating the efficacy of new medicines.

Correct. Double Blind testing is an important tool. In many cases it is not possible to have any reasonable certainty about possible preferences while testing sighted.

However, like any form of test, blind testing is subject to a number of issues that have the potential to produce errors, errors that can be so large that they completely invalidate the actual test and make it useless.

These range from experimenter's bias and the nocebo effect to simple statistical errors and a lack of "testing the test" against known audible stimuli.

As blind testing employs statistics, that is, we compare the results of the test to the likelihood of their happening "by accident", the best way of guarding against statistical errors is simply to use a large dataset from a large number of participants. This is one of the reasons why it is extremely rare to see, for example, clinical trials of new drugs with only a handful of participants.
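To make the "by accident" likelihood concrete, here is a minimal sketch (Python, my own illustration with assumed trial counts): the same 60% correct rate that is meaningless over a handful of trials becomes quite unlikely to arise by guessing over a large run.

```python
from math import comb

def p_chance(n, k):
    """One-sided probability of scoring k or more correct out of n
    trials by pure guessing (p = 0.5 per trial)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

# The same 60% hit rate at two sample sizes:
print(f"6/10 correct:   p = {p_chance(10, 6):.3f}")    # ~0.377, easily chance
print(f"60/100 correct: p = {p_chance(100, 60):.3f}")  # ~0.028, unlikely by chance
```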

Other arguments against the validity of a given test may be refuted by demonstrating the ability of the test (that is, the whole test set-up, the chosen test subjects, test program etc.) to detect phenomena that have been confirmed as giving positives in similar tests.

Doing so, and showing that the test is capable of confirming the presence of phenomena known to cause differences in testing, in essence addresses ANY objection to the test and would therefore be the preferred method of confirming the validity of the test, to any reasonable person.

Unfortunately double blind auditioning of audio equipment is a controversial topic with some audiophiles.

This thread is for those who believe in its usefulness.

Its purpose is to put in one place for easy access the available info on such tests.

Controversy usually arises when tests that are fatally flawed (in other words, have to be rejected as biased or inconclusive) are used to publicly promote certain agendas. The mere existence of such a test rarely even draws notice.

I doubt, for example, that anyone would dispute a given double blind test that was carried out with care to minimise biases and other sources of error.

To give an example of such a test, I would recommend the test carried out by Juergen Ackerman in Germany (where, incidentally, I observe a lot more serious research into controversial audibility issues than in the English-speaking world). It is described in some detail in Markus Sauer's article "God is in the Nuances" in Stereophile (start from "Expert Testimony"). It may be considered an exercise in how to do it right; I recommend it to anyone performing blind tests, and other published tests should be evaluated against it.

On the other hand, there are tests that attempt to clothe themselves in "Scientific Vestments" by proclaiming themselves to be "Double Blind" and so try to enhance their credibility, yet pay no attention to minimising biases (on the contrary, they often seek to maximise them) or to appropriate statistics. Worse, they do not disclose these facts, but instead are conducted and published by people with known agendas.

For example, in debates between myself and the well-known audio researcher James D. (JJ) Johnston of AT&T-Bell Labs on RAHE (if memory serves), JJ was under the simple impression that because some such tests had been published by a "reputable" publication and were claimed to be "Double Blind Tests", they were meaningful and scientific.

So he came out in support of these tests (when I criticised them), without actually having looked in detail at the way the tests had been set up and how the results had been statistically evaluated. When I spelled out the actual test set-up etc., he was considerably taken aback that anyone would attempt to pass off such utter tosh as a serious test.

To not criticise and deprecate such tests, and to not lay open the deliberate or self-deception practised in these tests and by those who conduct them, would be to acquiesce to the abuse of science and of good will towards it.

Sadly, those with a public agenda and sufficiently lacking in conscience tend to be far more vocal than those who have even a minimal interest in the truth of the matter, so the majority of the DB tests that have received public interest are precisely the ones that have such fatal flaws, not least because they are often publicised in a quite spectacular manner, strongly reminiscent of carnival barkers and other such mountebanks.

There have been a number of publications addressing some of the issues around audio DB tests, especially the preponderance of null results in many of the published tests. Among others:

How Conventional Statistical Analyses Can Prevent Finding Audible Differences in Listening Tests
Author: Leventhal, Les
Affiliation: University of Manitoba, Winnipeg, Manitoba, Canada
AES Convention:79 (October 1985) Paper Number:2275

Type 1 and Type 2 Errors in the Statistical Analysis of Listening Tests
Author: Leventhal, Les
Affiliation: Department of Psychology, University of Manitoba, Winnipeg, Man. R3T 2N2, Canada
JAES Volume 34 Issue 6 pp. 437-453; June 1986

Statistically Significant Poor Performance in Listening Tests
Author: Leventhal, Les
Affiliation: Department of Psychology, University of Manitoba, Winnipeg, Manitoba, Canada
JAES Volume 42 Issue 7/8 pp. 585-587; July 1994

Analyzing Listening Tests with the Directional Two-Tailed Test
Authors: Leventhal, Les; Huynh, Cam-Loi
Affiliation: Department of Psychology, University of Manitoba, Winnipeg, Man., Canada
JAES Volume 44 Issue 10 pp. 850-863; October 1996

Also worth reading for historical interest is the following, posted on Stereophile's website, which introduces many of the dramatis personae in this decades-long Greek tragedy:

The Highs & Lows of Double-Blind Testing
Authors: Larry Archibald, J. Gordon Holt, C.J. Huss
Stereophile May 5, 1985

Another useful item from Stereophile's site is the account of a large-scale blind listening test Stereophile organised in 1989, covering the surrounding statistics and many other issues.

Blind Listening
Authors: John Atkinson, Will Hammond
Stereophile Jul 9, 1989

So, I am greatly in support of blind testing as a tool; however, I am equally greatly in opposition to its misuse as a propaganda tool, in the likeness of what old Bill once described as:

"And thus I clothe my naked villainy
With old odd ends, stol'n forth of holy writ;
And seem a saint, when most I play the devil."

Ciao T
 
Thorsten, thanks for the info you posted.

One important difference from medicine testing is the fact that you can repeat the audition tests any number of times using the same listener (subject), which obviously you cannot do with patients, since you cannot restore the "sickness state" that existed before the medicine was administered. There are also other important differences we can talk about.

With double blind auditions, you can repeat a simple test a number of times (10-20 or more) with the same subject. The test can be the one mentioned before: you have a switch with 3 positions: A, B and X. The subject controls the switch, and can switch between the positions any number of times, then he must say if X is the same as A or as B.

Therefore I believe one subject is enough for a DBT listening test. Of course the test will only show if that particular subject can distinguish between A and B. Other subjects might score better or worse.
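As a concrete illustration of that protocol, here is a minimal single-subject ABX session sketch (Python, my own illustration; the simulated listener and its per-trial detection rate are assumptions standing in for a real subject):

```python
import random

def abx_session(trials=16, p_detect=0.7, seed=1):
    """Simulate one ABX session: X is randomly A or B on each trial;
    the 'listener' identifies X correctly with probability p_detect
    and guesses otherwise. Returns the number of correct answers."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        x = rng.choice("AB")           # hidden assignment of X
        if rng.random() < p_detect:    # the difference is heard this trial
            answer = x
        else:                          # otherwise the listener guesses
            answer = rng.choice("AB")
        correct += answer == x
    return correct

print(abx_session(), "correct out of 16")
```

With 16 trials, 12 or more correct corresponds to roughly p < 0.05 under pure guessing, which is the criterion commonly applied to such a session.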
 
fw, it depends on what you're testing for. If the question is, "Can listener X hear variable Y?", the experimental design to answer that question is different from "Can an average listener hear variable Y?" or "What is the range of thresholds for detecting variable Y?" And these are again different from tests designed to tease out hedonic issues ("Of the choices for variable Y, which are most preferred?").

A Procrustean approach to experimental design, complete with inapt analogies, is not likely to clarify. So, one subject is enough for some questions, not enough for others.
 
Hi,

With double blind auditions, you can repeat a simple test a number of times (10-20 or more) with the same subject.

Indeed. However, the more trials are strung together in one sequence, the lower the actual attention, at least for many people. I personally found that even with a lot of effort my rate of correct identification is slightly worse with five trials in a row (without a break) than with three, and with something like ten trials in a row my later trials in the sequence are no better than random.
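This matters statistically: once attention fades, the later trials add guessing noise rather than information. A toy model (entirely my own, with invented numbers, just to show the shape of the effect):

```python
# Toy model: per-trial probability of a correct answer decays from
# 0.85 towards the 0.5 chance level as fatigue sets in.
def expected_correct(trials, start=0.85, decay=0.05):
    p, total = start, 0.0
    for _ in range(trials):
        total += p
        p = max(0.5, p - decay)  # never below pure guessing
    return total

for n in (3, 5, 10):
    rate = expected_correct(n) / n
    print(f"{n:2d} trials in a row: expected hit rate {rate:.0%}")
# 3 trials: 80%; 5 trials: 75%; 10 trials: 64% - long runs dilute the score.
```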

The test can be the one mentioned before: you have a switch with 3 positions: A, B and X. The subject controls the switch, and can switch between the positions any number of times, then he must say if X is the same as A or as B.

The ABX test protocol, such as it is normally promoted, has been severely criticised for the last two and a half decades. This alone should suffice to require, at the very least, modifications. I would suggest, in fact, that unless your aim is to produce deliberately false negatives as a propaganda tool, other test protocols (see Ackerman, Leventhal and others) may be better able not only to produce results that allow high degrees of statistical confidence but also to produce additional data that makes the tests much more meaningful.

Therefore I believe one subject is enough for a DBT listening test. Of course the test will only show if that particular subject can distinguish between A and B. Other subjects might score better or worse.

Let me be clear. You can do any test you like, any which way you like.

However, if you wish for the results of the test to be considered as offering any information about circumstances outside the specific test, I suggest you would be required to demonstrate that:

A) Your specific test can confirm the presence of differences previously demonstrated to be audible.

B) Your specific test has a balanced risk of Type I and Type II statistical errors (see the sketch after this list).

C) Your specific test is repeatable with identical outcome.
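To illustrate point B: with the classic 16-trial ABX criterion of 12 correct, the Type I risk is held to about 4%, but a listener who genuinely hears the difference on a substantial fraction of trials still fails the test more often than not. A minimal sketch (Python, with an assumed 70%-correct listener; the numbers are mine, not from any published test):

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, k = 16, 12                      # classic criterion: 12 of 16 correct
alpha = binom_tail(n, k, 0.5)      # Type I risk: passing by pure guessing
beta = 1 - binom_tail(n, k, 0.7)   # Type II risk for a 70%-correct listener
print(f"Type I risk:  {alpha:.1%}")   # ~3.8%
print(f"Type II risk: {beta:.1%}")    # ~55%, far from balanced
```

This asymmetry is exactly what the Leventhal papers listed earlier in the thread analyse.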

Ciao T
 
@ Thorsten_L,

while the ABX-protocol was criticized a lot, it is not impossible to get useful results by using it.
But participants in a controlled test using the ABX-protocol might need a longer training time to get used to it.

Training of listeners and the use of properly established positive and negative controls are the main factors (assuming that the experimenters are able to control the other variables) in producing a test that is objective, valid and reliable.

In this regard JJ was quite ahead, because he was always emphasizing the need for positive and negative controls and was nearly shouting that "training of listeners is mandatory".

BTW, the now freely accessible ITU Recommendations provide a lot of useful information on controlled tests:

ITU-R BS.1116-1 (10/97) "Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems "
http://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1116-1-199710-I!!PDF-E.pdf

All recommendations for broadcast sound (BS) are listed on this page:

Broadcasting service (sound)

Just a quote from the 1116 recommendation that deserves much more attention:

It should be understood that the topics of experimental design, experimental execution, and statistical analysis are complex, and that only the most general guidelines can be given in a Recommendation such as this. It is recommended that professionals with expertise in experimental design and statistics should be consulted or brought in at the beginning of the planning for the listening test.

A good reference for subjective evaluation tests is:

Soren Bech, Nick Zacharov, Perceptual Audio Evaluation - Theory, Method and Application

BTW, I am still wondering how the Meyer/Moran paper slipped through peer review.

A couple of interesting papers:

Dominik Blech, Min-Chi Yang, "DVD-Audio versus SACD: Perceptual Discrimination of Digital Audio Coding Formats"

http://old.hfm-detmold.de/eti/projekte/diplomarbeiten/dsdvspcm/aes_paper_6086.pdf

The complete thesis (unfortunately written in German):

http://old.hfm-detmold.de/eti/projekte/diplomarbeiten/2004/dsdpcm/pdf/Gesamtarbeit neu.pdf

A. Pras and C. Guastavino, "Sampling rate discrimination: 44.1 kHz vs. 88.2 kHz":
http://mil.mcgill.ca/wp-content/papercite-data/pdf/pras_sampling_2010.pdf

Oohashi et al., "Inaudible High-Frequency Sounds Affect Brain Activity: Hypersonic Effect"
 
Hi,

while the ABX-protocol was criticized a lot, it is not impossible to get useful results by using it.

While this is correct, my main personal criticism is that, unlike other forms of blind testing, it provides no additional information, such as a preference, which with all due respect is at least as important to advancing the state of the art in audio, if not more so, than merely establishing that a difference exists (at which ABX is also not that good).

Maybe what also puts me off is that it was explicitly designed by a certain group as an instrument to produce proof of their hypothesis that the sonic differences in HiFi systems reported by many were in fact illusory... As a result, ABX as it is published and often practised has a strong statistical bias against positive results.

But participants in a controlled test using the ABX-protocol might need a longer training time to get used to it.

Training of listeners and the use of properly established positive and negative controls are the main factors (assuming that the experimenters are able to control the other variables) in producing a test that is objective, valid and reliable.

In this regard JJ was quite ahead, because he was always emphasizing the need for positive and negative controls and was nearly shouting that "training of listeners is mandatory".

I agree on this; however, most tests, and all those published by the ABX Mafia, lack training, controls and many other things, to the point that one may seriously question their motives.

BTW, the now freely accessible ITU Recommendations provide a lot of useful information on controlled tests

<heavy snipping for brevity>

Oohashi et al., "Inaudible High-Frequency Sounds Affect Brain Activity: Hypersonic Effect"

These are all very good and useful references, thank you for adding them.

Ciao T
 
Oohashi et al., "Inaudible High-Frequency Sounds Affect Brain Activity: Hypersonic Effect"

That one has had a lot of criticism:
1. No one could notice any difference between normal and hypersonic sound.
2. It takes 20 seconds before an effect is measured.
3. It has not been replicated by anyone but Oohashi.
4. A quote from page 3550: "Two recording sessions were repeated for each condition in the following order: baseline–FRS–HCS–FRS–HCS–baseline."
If you know in what order things are being tested, bias is inevitable.
More info here: Hypersonic effect - Wikipedia

@SY: You beat me to it.
 
Hi,

That one has had a lot of criticism:

First, I believe it was listed to illustrate the kind of test set-up / procedure one should use.

1. No one could notice any difference between normal and hypersonic sound.

Yet brainwave analysis showed high correlation...

I believe Mr. Oohashi did several follow-ups addressing the criticisms, not that it matters in this context.

Ciao T
 
I had the privilege to participate in some of the first tests of the ABX system, put on by those in the "ABX mafia". When I ran the ABX box, i.e. participated in the test, between two amplifiers I knew beforehand to have huge differences, I was shocked that I could not tell the difference between the two amps. My initial reaction was to move firmly into the no-differences camp. After more years than I care to admit, I do think there are differences between amplifiers. These differences, FOR ME, tend to be listening fatigue issues, which I have largely eliminated by moving to class A and class D devices.
I still don't think I could tell a difference using an ABX box, if levels were matched, and similar or identical frequency responses were obtained.
 
Hi,

I had the privilege to participate in some of the first tests of the ABX system, put on by those in the "ABX mafia". When I ran the ABX box, i.e. participated in the test, between two amplifiers I knew beforehand to have huge differences, I was shocked that I could not tell the difference between the two amps.

To put this into different words, you fell for the con-job.

Would you have found equally "no difference" had you been:

1) Ignorant that you were comparing two items that you had previously perceived as "very different"?

2) Simply asked to rank Item A and Item B by preference, while being ignorant not only of the identity of Item A and Item B, but also of their very nature?

My initial reaction was to move firmly into the no-differences camp. After more years than I care to admit, I do think there are differences between amplifiers. These differences, FOR ME, tend to be listening fatigue issues, which I have largely eliminated by moving to class A and class D devices.
I still don't think I could tell a difference using an ABX box, if levels were matched, and similar or identical frequency responses were obtained.

So, in other words, the ABX test itself (I will grant that the ABX device was largely transparent) introduced an additional variable, which resulted in the obscuring of sonic differences that, you claim, long-term sighted listening reveals.

BTW, I once demonstrated that with the right kind of "challenge" to the right kind of "subject" it is possible to cause apparent inaudibility of gross sonic differences (polarity reversal of one channel).

In the end, the ABX protocol as normally applied already contains a strong bias against non-null results. This can be made far greater by exposing the subject to the tests without training and by making sure the subject has significant emotional involvement (that is, the subject is convinced of the presence of a difference and is especially eager to show that such a difference exists).

By handling things right, even the grossest sonic differences may be made inaudible, never mind any that should be considered subtle.

Now, I am unsure if the gentlemen behind this were just displaying a mixture of ignorance coupled with a desire to make very sure they were not falsely accepting posited differences, or if they had intentions to deceive from day one.

However, their obstinate persistence in promoting their methodology, commercial devices etc., disregarding all objections and failing to correct the methodology in the light of reasonable criticism, combined with their obvious agenda, clearly indicts them.

To quote SY's signature...

"In science, contrary evidence causes one to question a theory. In religion, contrary evidence causes one to question the evidence."

Ciao T
 