The double blind auditions thread

Status
This old topic is closed. If you want to reopen this topic, contact a moderator using the "Report Post" button.
How did this thread dedicated to double blind test results get hijacked? If you don't believe in DBTs (because they prove your ignorance), go read another thread. We all know DBTs aren't perfect, but they sure beat the ego-driven spouting off that seems to be the alternative.
 
Hi,

As I read it, you have mis-stated Thorsten_L's position.

Not quite. The statement was incomplete.

To put it simply: if a person has a strong opinion about the experiment's real situation (be it that no difference exists or that a difference exists), they will tend to have their prejudice override what they hear. The result is a strong randomising factor.

This can be overcome somewhat by using a methodology that both allows more extensive analysis than ABX and is in itself less obscuring.

This, incidentally, has nought to do with test-situation stress or the attention-span issues that arise when many presentations (e.g. 16) are made in a row; those are separate matters.

My position is that if we consider all the issues that can affect the sensitivity of the ABX test (that is, the usually very high risk of Type II statistical errors due to small sample size, the issues of attention span and test-situation stress, AND the impact of the prejudices held by individuals), it is surprising that the kind of ABX tests promoted by the ABX Mafia ever return ANY positive result.
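To illustrate the Type II risk with a minimal sketch (Python with assumed figures, not data from any actual test): even a listener who genuinely hears the difference 70% of the time will, more often than not, fail a single 16-trial session at the usual 5% criterion.

    from scipy.stats import binom

    n_trials = 16   # one typical ABX session
    alpha = 0.05    # conventional significance level

    # Smallest score that is significant under pure guessing (p = 0.5).
    threshold = min(k for k in range(n_trials + 1)
                    if binom.sf(k - 1, n_trials, 0.5) <= alpha)

    p_true = 0.7    # assumed hit rate for a small but real difference
    power = binom.sf(threshold - 1, n_trials, p_true)

    print(threshold)        # 12: need 12/16 correct
    print(round(power, 2))  # ~0.45, i.e. a Type II error of ~0.55

In other words, the deck is stacked against a positive result before stress, attention span or prejudice even enter into it.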

As the ABX Mafia has routinely refused to apply controls in their experiments as well, we simply have no indication whether their tests can detect anything more subtle than turning the system on/off or gross distortion (like clipping or large level differences).

So the result is that these tests are not useful for actually understanding anything about the audibility, or not, of relatively small changes, including for example uncompressed audio vs. high-bitrate MP3, where we find that more sophisticated double blind tests do show material differences.

Hence I personally draw some simple conclusions from the fact that a given individual supports, cites and promotes the specific style of ABX Testing as promoted by the ABX Mafia.

They are either incredibly naive and gullible, taken in by the false mantle of being scientific that this method is wrapped in. These usually demonstrate little actual knowledge of statistics or the general methods of sensory testing, and normally respond to criticism of the method with bafflement, having been taken in by the claim that this method (and not, for example, the ITU recommendations) is the gold standard in testing.

The other group one cannot fail to consider as being interested not in facts and truth, but only in results that support their position, which is best summed up as "everything in hi-end is 'snake oil' and must be demonstrated as such at all cost".

I cannot see how anyone who understands sensory testing and statistics and has any interest in the truth can support the kind of test that is promoted by the ABX Mafia or consider it to be "good science".

Ciao T
 
Hi,

Training='power of suggestion' ?

If the following test was not blind, then there would be some merit to your consideration.

However, if the test is double blind, then no power of suggestion can induce people to generate positive results.

In virtually any test (even non-blind) in any other field, it is considered best practice to familiarise the participating volunteers with the test and allow them a few "dry runs", to make sure the test circumstances themselves are not contributing to poor result accuracy.

Equally, the use of positive and negative controls in such tests is extremely well established; it is actually extremely baffling to see any test presented as "scientific" without such controls.
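As a sketch of what I mean (the layout and numbers are assumed, not a description of anyone's actual protocol), controls can simply be hidden among the real trials:

    import random

    def build_trial_list(n_real=12, n_pos=2, n_neg=2):
        """Mix hidden controls in among the real comparisons."""
        trials = (["real"] * n_real        # the condition under test
                  + ["positive"] * n_pos   # known-audible change, e.g. a small level shift
                  + ["negative"] * n_neg)  # A and B actually identical
        random.shuffle(trials)
        return trials

    print(build_trial_list())

A subject who misses the positive controls, or who "hears" differences in the negative ones, tells us something about the setup (or the subject) before we draw any conclusions from the real trials.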

So I find it extremely dubious for anyone to claim that the training of test subjects was actually "a bad thing".

What would be your stance on using controls in the tests then?

And would it be better for the relevance (and statistical power) of the test to use, in the actual tests, a number of subjects who have tested well in the positive and negative control tests and NOT inform them of the difference being evaluated, or would it be better for the relevance (and statistical power) to issue a challenge regarding the difference (say, cables) and thus explicitly select people who have a great deal of bias in the matter?

Ciao T
 
I think I understand the 'training' issue.
Thanks.

However, I am baffled by the 'biased subject' issue. Now that you have re-stated your position, it seems that you are saying that:

Whether a subject believes that he will hear a difference or not hear a difference between the test samples makes no difference. The mere fact of having an opinion on the topic will somehow ensure that the listener's auditory/neurological apparatus will be rendered incapable of hearing any differences.

But, a listener who has no opinion on the topic will have a superior-functioning listening apparatus and will be able to discriminate between the test samples.
First: Is this what you are saying?
If it is, then can you explain the mechanism for this effect?

It sounds preposterous to me, but I'm quite ignorant in these matters, apparently.

Select two groups of subjects: one group is sure that they can see the difference between shades of green, the other group is sure that they cannot. Both groups will be unable to see any differences in the test.
BUT, people who (somehow) have no opinions about their ability to see colours will do much better on the test. Their eyes and brains work better because they are not opinionated.

Is this what you are saying?
 
Hi,

However, I am baffled by the 'biased subject' issue.

See Placebo/Nocebo effects.

They work just as well in auditioning audio as in Medical trials.

If you believe there is a difference, you are very likely to hear one, even though there is no difference. Equally, if you believe there is no difference, you will not hear one, even though there is one. In either case around half your test will be "random", and with the stereotypical ABX test it will be impossible to attain a statistically significant result if the difference is reasonably small, in some cases even if it is quite large.
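The arithmetic of this randomising factor is easy to sketch (assumed figures): if prejudice overrides perception on some fraction of the trials, those answers become coin flips with respect to the stimulus, and the effective hit rate collapses toward chance.

    from scipy.stats import binom

    p_hear = 0.9   # assumed hit rate when actually listening
    for bias in (0.0, 0.25, 0.5, 0.75):
        # On a fraction 'bias' of trials the answer is a coin flip (p = 0.5).
        p_eff = bias * 0.5 + (1 - bias) * p_hear
        p_pass = binom.sf(11, 16, p_eff)  # chance of scoring 12+/16
        print(bias, round(p_eff, 2), round(p_pass, 2))

The chance of passing a 16-trial session falls from roughly 0.98 with no bias to about 0.45 once half the answers are prejudice-driven, and to under 0.2 beyond that.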

I once told a "cable [and general high end] skeptic" that I was testing mains cables under the standard ABX protocol. Overall, four people were in the test, me excluded. In reality I did not change cables; I simply wired one channel in opposite polarity. The three test subjects who had no definite opinion (i.e. they agreed there may or may not be a difference and that this test was a good idea to find out) had a "perfect" score. The (dis)believer had a scorecard that can only be considered completely random.

So, for this individual, in the context of my experiment, the various conditions of the experiment made a very gross distortion of the audio system actually inaudible. We cannot, as such, generalise from my one single experiment, which was less an experiment and more a very graphic demonstration (the "skeptic" is still not on speaking terms with me over a decade later). However, the underlying causes of the effects are well known and studied and do not, in themselves, need more empirical validation than is already in existence.

Whether a subject believes that he will hear a difference or not hear a difference between the test samples makes no difference. The mere fact of having an opinion on the topic will somehow ensure that the listener's auditory/neurological apparatus will be rendered incapable of hearing any differences.

But, a listener who has no opinion on the topic will have a superior-functioning listening apparatus and will be able to discriminate between the test samples.

Yes, insofar as the point is that whether an audible difference exists or not, the unbiased will have a much greater likelihood of correct identification.

If it is, then can you explain the mechanism for this effect?

It sounds preposterous to me, but I'm quite ignorant in these matters, apparently.

Placebo/Nocebo.

It may also sound preposterous that taking sugar pills one believes to be medicine can produce the improvements in a patient's condition that the real medicine would, and that being wrongly diagnosed with a terminal illness could cause the death of the patient, yet both of these effects are quite real and well documented.

Ciao T
 
It begs the question... why include biased participants as anything other than a type of control subject?
I'm also not fully sure about medicine being a fit analogy, i.e. "wrongly diagnosed with a terminal illness would cause the death of the patient": is this what the death certificate states? Yes, there's likely some flippant sarcasm there, but it's just to decorate the seriousness beneath that thin exterior. This to me is similar to the "died of a broken heart" type of event. Sure, it happens. How it is applicable to DBT I don't know.
 
Hi,

It begs the question... why include biased participants as anything other than a type of control subject?

You need to ask that of the experimenters, more specifically those whom I call the ABX Mafia.

I could offer some reasonable suggestions...

They carry out experiments for which they are looking for unpaid volunteers.

The likeliest applicants in such a case are precisely the people who also have the greatest bias, as people who have no real interest, nothing to prove and no emotional stake in this are the least likely to waste their time, unpaid, on such tests.

In fact, I would go further: it seems that at least this group of experimenters actively seeks to entice the most biased individuals they can find, by issuing challenges that will likely be answered by those holding a defined view on the topic.

OF COURSE, if my purpose were to produce "null" results, this would be an excellent approach, and should, against all odds, someone show a statistically significant result (which is rather unlikely to start with), we just use the statistical device of declaring such a case a "lucky coin" and excluding it from the dataset.

When you ask the experimenters for their justification for relying primarily on people with a high likelihood of personal bias in the issue, you may also enquire about the patent lack of control experiments.

I'm also not fully sure about medicine being a fit analogy, i.e. "wrongly diagnosed with a terminal illness would cause the death of the patient": is this what the death certificate states? Yes, there's likely some flippant sarcasm there, but it's just to decorate the seriousness beneath that thin exterior. This to me is similar to the "died of a broken heart" type of event. Sure, it happens. How it is applicable to DBT I don't know.

Nocebo is covered here:

Nocebo - Wikipedia, the free encyclopedia

If you do some searching, you will find precisely such cases (not many, thankfully) documented.

The underlying principles of placebo and nocebo are applicable to blind testing, especially as the classic "nocebo" response is the opposite of placebo: the placebo effect happens when the patient receives the sugar pill and gets better because he "believes" in the medicine, while the nocebo effect occurs when the patient receives the real medicine but fails to improve because he does not believe that the medicine will help.

The analogies in blind testing of audio should be patently obvious.

Ciao T
 
'widely accepted'..?
'we'...?

This just sounds like a re-statement of "Golden ears better than instruments".....

Quite the contrary, as is stated for example in the ITU recommendations. Although it was first related to compression methods, the idea has now been expanded to the case of "analog or digital processing". If you think along that line, it seems a safe conclusion that every recording is a sort of heavy "analog processing" of the real sound field.

Another example would be the Geddes/Lee approach; they researched the relation between conventional distortion measurements and perceived sound quality and came to the conclusion that the correlation is low and that new metrics should therefore be used.
If of interest, I'll cite the papers.

BTW, I think it would be better to refrain from thinking in categories like "golden ears", "subjectivists" or "objectivists". The scientific requirements are universal, and it is better to establish a good understanding of the quite complex topic of sensory testing first.

<snip>
But with hearing, where the biology predicts that our unaided senses would be even worse, the opposite argument is routinely made.

Comparisons of the vision and hearing senses are quite complex too, and normally nobody would consider hearing to be "worse" overall. But the comparison does not really help in our discussion here.

How many tests of hifi topics require an audiologist report on the subject's hearing before the tests begin?

It totally depends on the questions/hypothesis you want to research with a specific test.
Once a test is done, running statistical tests against the results shows whether the null hypothesis (the null hypothesis here being that the test result could have been reached by random guessing) is rejected or not. Any further conclusion is only valid if the specific variables were controlled. So, an audiologist's report might sometimes be needed as a basic pretest screening.
During the test protocol, the positive control(s) should ensure internal validity.
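For a simple forced-choice test, that check is a one-liner (a sketch; the 12/16 score is made up):

    from scipy.stats import binomtest

    # Did 12 correct answers out of 16 beat random guessing?
    result = binomtest(k=12, n=16, p=0.5, alternative='greater')
    print(result.pvalue)  # ~0.038: the null hypothesis is rejected at the 5% level

Rejecting the null hypothesis only says the score was unlikely under guessing; it says nothing about which variable produced it, which is where the pretest screening and the positive controls come in.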

How many silver-maned audio salesmen post their latest hearing test on the wall of the showroom?

Evaluation of test routines is totally independent of any salesman's actions, and the two topics should not be mixed.
 
@SY
Yes, I understand that perspective, but if nothing else it would quash this aspect of the debate, if that's possible in any case. The inaudible bias doesn't really work the same way, so doing away with it altogether was the heart of my proposal.
And to be honest, I don't see how it can be irrelevant and preferable both.
@ThorstenL
OK, placebo and nocebo are opposite sides of the same coin. And in a sense bias is the underlying factor. I can connect that with the listening tests. If bias can be accounted for and analyzed statistically then it shouldn't be a problem. I know not the ABX Mafia, but they must be a burr in your saddle.
I have to say, though, that when I read this:
"They carry out experiments for which they are looking for unpaid volunteers.
The likeliest applicants in such a case are precisely the people who also have the greatest bias, as people who have no real interest, nothing to prove and no emotional stake in this are the least likely to waste their time, unpaid, on such tests."
I have to think... this is the land of 4-figure cables and 6-figure speakers. Now suddenly the money has dried up? It makes me highly suspicious.
DBT itself doesn't seem to have reasonable detractors, but it seems there will always be someone claiming "it wasn't good enough." If the environment (I don't want to use the phrase SOTA) is the same after the test as it was before, that'd be an incentive to not bother with DBTs.
I find statistics boring as all get-out. A big reason is what you mentioned ("...just use the statistical device...."): they can be tweaked and twisted with bias.
But here is the issue as I see it. The most biased individuals aren't suitable for DBT due to bias effects, but that leaves them free to declare superiority with golden ear listening skills. They become untouchable. That can't be what you're proposing. Or is it?
 
And to be honest, I don't see how it can be irrelevant and preferable both.

Thus my qualifier "in some cases."

Interestingly, in the Lipshitz BAS article, the person who proclaimed himself biased toward hearing the differences under discussion couldn't, and the one who is accused of being biased against hearing the differences under discussion could.:D
 
You mean by 'can't normally repeat it' that your results cannot be replicated by other researchers, or even by yourself?

In that specific case, as we ran a paired comparison as a discrimination test, it would be difficult to repeat. You need a group with the same preference, and as we are dealing with smart people, it will not take long until they notice that something special is going on, and then the "no test" illusion will no longer exist.

That would neatly side-step one of the essential steps in validating results, no? Assuming this is still in the world of science...

Perhaps you mean something different by 'repeat' ?

If we are talking about science, then I don't understand your post on "why all this talk about training etc.", because the scientific requirements are that a test has to be objective, reliable and valid.
Without using positive controls it is impossible to show internal validity.

If an experimenter uses positive controls, he normally realises quite quickly that it is better to train the participants, but of course it sometimes depends on the objectives of a test.
 
As you well know, Bennett et al went after those papers as a class, not one at a time.

As you know (at least by now), Oohashi et al. did not use a method of their own, but an established method of correction for the multiple-comparison problem, so that fits the "class idea" quite well.
In what paper did Bennett et al. deal with the Oohashi "class" (i.e. the Gaussian filter approach)?
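For anyone following along, the underlying multiple-comparison problem is easy to demonstrate with synthetic data (an illustrative sketch only; this is neither Oohashi's correction method nor his data):

    import numpy as np
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(0)
    pvals = rng.uniform(size=1000)  # 1000 tests where the null is true everywhere

    print((pvals < 0.05).sum())     # ~50 uncorrected "significant" results
    reject = multipletests(pvals, alpha=0.05, method='fdr_bh')[0]
    print(reject.sum())             # after correction: typically none

Uncorrected, roughly 5% of pure-noise tests come out "significant"; the dead salmon study dramatised exactly this, and any established correction method addresses it.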

You would not rely only on the dead salmon case (which is not related to Oohashi's approach), would you?

I eagerly await Oohashi or one of his fans demonstrating that their data meets modern standards.

Let's first see your answers to the questions above, and your argument as to how Oohashi et al. handled the data of their subjective evaluation wrongly.
 
Sy,

Interestingly, in the Lipshitz BAS article, the person who proclaimed himself biased toward hearing the differences under discussion couldn't, and the one who is accused of being biased against hearing the differences under discussion could.:D

Actually, to be precise, the person who was put "on the spot" failed to produce a result that allowed the rejection of the "null hypothesis". We can work out how likely this would be for a modest or small sonic difference that exists but goes undetected (in other words, the risk of Type II statistical errors), and we would find that the likelihood of "missing" small differences approaches a much greater value than the likelihood of accepting that a difference was heard when in fact the result was down to chance.
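To sketch that calculation (assumed figures, not the actual trial counts from the article):

    from scipy.stats import binom

    n, crit = 10, 9                      # e.g. require 9/10 correct
    alpha = binom.sf(crit - 1, n, 0.5)   # false-positive risk: ~0.011
    beta = binom.cdf(crit - 1, n, 0.7)   # missing a real 70% detector: ~0.85
    print(round(alpha, 3), round(beta, 3))

The asymmetry is stark: such a test is vastly more likely to miss a real, modest difference than to hand a lucky guesser a pass.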

So bias was not really needed, but the test subject was clearly a great believer in the difference (I may have many issues with Ivor Tiefenbrun and a very low opinion of him as a person, but that does not come into this), so this would have been another factor making a null result more certain. We may speculate whether the experimenters appreciated this fact and used it deliberately, or not.

Secondly, you are clearly misrepresenting what Mr Lipshitz could hear and what he could not hear.

Mr. Lipshitz was able to identify the insertion of the Sony PCM F-1 processor by the increase in noise floor (which was not noticeable when music from LP was playing), but made no claim that he could identify the PCM F-1 with music.

Another person, clearly very familiar with the ABX comparator, suggested using the sound of the relays to aid identification.

Looking at the "Challenge" from a scientific viewpoint and considering the requirements of scientific tests, I do not think the whole episode paints Mr Lipshitz and the others in a good light; instead it places them with a large number of other confidence tricksters who use the gullibility of their "mark" to their own ends.

Ciao T
 
Hi,

And to be honest, I don't see how it can be irrelevant and preferable both.

Oh, that is very easy.

It is preferable when selecting subjects for a blind test if we desire to show a null result, regardless of the actual facts.

It is irrelevant (or at least can be claimed to be) when the subsequent test results are criticised for having bias.

I would have thought that much was obvious.


OK, placebo and nocebo are opposite sides of the same coin. And in a sense bias is the underlying factor. I can connect that with the listening tests. If bias can be accounted for and analyzed statistically then it shouldn't be a problem.

Yes. Or if the presence of bias can be minimised.

In the blind tests I run quite regularly, I generally try to keep the participants ignorant of what they are hearing, and by using a protocol that does not rely on ABX but is instead a double blind preference test with multiple answers, I minimise the impact of the bias.

Of course, I am interested in finding an answer to my question, which is "between multiple items, circuit configurations, PCB layouts, passive parts etc. presented, is there a persistent preference for some over others that cannot be explained by chance?" I do not directly seek to force an AB choice.

However, when multiple independent listeners all show a marked preference for one of several items under blind conditions, I do feel it is also reasonable to conclude a difference was heard.
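As an illustration of how such pooled preferences can be checked against chance (a sketch with invented counts, not data from my tests):

    from scipy.stats import chisquare

    # Assumed: 40 blind presentations across listeners, four circuit variants;
    # counts of which variant was preferred each time.
    observed = [22, 7, 6, 5]
    stat, p = chisquare(observed)  # H0: every variant equally likely
    print(round(stat, 1), round(p, 4))  # chi2 ~19.4, p < 0.001

A small p-value says the preference pattern is very unlikely under pure chance, which is precisely the question the protocol asks.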

I know not the ABX Mafia, but they must be a burr in your saddle.

Well, I ride bareback, and the ABX Mafia is easy to spot. They like to issue challenges of the "you must prove to me that you hear what you claim" kind, insisting on using the ABX protocol in the tests, carrying out their tests mostly and preferably with biased subjects, rejecting the need for controls in their experiments, and just generally behaving in a way that is highly unscientific and ethically questionable.

I have to say, though, that when I read this:
"They carry out experiments for which they are looking for unpaid volunteers.
The likeliest applicants in such a case are precisely the people who also have the greatest bias, as people who have no real interest, nothing to prove and no emotional stake in this are the least likely to waste their time, unpaid, on such tests."
I have to think... this is the land of 4-figure cables and 6-figure speakers. Now suddenly the money has dried up? It makes me highly suspicious.

The "ABX Mafia" actually are "debunkers". They operate from the a-priory position that the differences are imaginary and they concocted a test that makes pretty sure they get null results.

When an initial version resulted in too many positives for their liking, they adjusted the statistics and number of trials (claiming, of course, good reasons) until the results were as desired. I am willing to grant that this was not an intentional, conscious and actively directed process; however, they were so mortally afraid of false positives (as their agenda was not to confirm that what they saw as "unreason" in high end actually made a difference) that they maximised the chance for their tests to produce null results in the face of small and even moderate REAL AUDIBLE differences.

They have since then steadfastly promoted their tests as the Gold Standard, refused to take on board valid criticisms, and have for the last 30 years been berating anyone who does not apply their methods or who criticises them as "in league with the high end and only out to make big profit from unwary consumers" and the like.

In doing so they have effectively closed the door on any sensible blind testing and have, for the last three decades, largely contributed to the growth of "unreason" in high end audio.

In the circumstances, manufacturers are obviously unwilling to fund independent experiments that they may (accurately or not) perceive as biased and (given, for example, the "digital challenge" affair) deliberately intended as entrapment.

The ABX Mafia, having failed to see their "ABX comparator" sell widely to all magazines and audiophiles (as they no doubt hoped), had to close their business, and as a result are broke and bitter (okay, I'm being cynical here) and have absolutely no incentive to do their ongoing tests with due care (the last thing they want to learn is that they were actually wrong).

So, no money to hire paid listeners and do real tests.

DBT itself doesn't seem to have reasonable detractors, but it seems there will always be someone claiming "it wasn't good enough." If the environment (I don't want to use the phrase SOTA) is the same after the test as it was before, that'd be an incentive to not bother with DBTs.

Equally, if the test heavily pushed as the "way" to test audibility (despite being widely criticised) is fundamentally flawed, why should anyone bother?

I find statistics boring as all get-out. A big reason is what you mentioned ("...just use the statistical device...."): they can be tweaked and twisted with bias.

As it stands, DB Testing is inextricably linked with statistics.

I generally agree; I personally do not trust other people's statistics. Hence serious experimenters would include the actual data with their papers.

But here is the issue as I see it. The most biased individuals aren't suitable for DBT due to bias effects, but that leaves them free to declare superiority with golden ear listening skills. They become untouchable. That can't be what you're proposing. Or is it?

First, I have not created this situation; I merely have to live with it. Of course, in an ideal world all information and all beer would be free, and we could just choose what pleases us and could do serious and rigorous tests without worrying about economics, "trade secrets" etc... But the world ain't like that.

Second, just because someone says "I find that X sounds better than Y" without presenting an ABX test showing that this person does indeed hear it, I do not get hissy or apoplectic fits, nor blind stammers etc., and then demand that they provide proof or forever shut up. I take their comments for what they are worth and conduct my own experiments if I am interested.

I fail to see the point of all this obsession that people "should not declare their superiority with golden ear listening skills". We are all adults here; it is a hobby. When someone makes an observation that they personally find item A preferable to item B, we all understand that these are opinions, not rigorous science.

If you want to practice rigorous science, don't hang out with people who mess with audio for a hobby, or at least don't expect them to.

Ciao T
 
Sy,

IT heard a difference until the selection was blinded. Then he couldn't.

Actually, you simply do not know that.

The article does not include any positive statement that I.T. actually heard any differences under the non-blind conditions.

The BAS-reported experiment was flawed in so many areas, with cues from excessive hiss, relay noises etc., and in one round of tests the experimenters had even "forgotten" to enable the actual acoustic output, that any conclusion drawn from it is simply unsafe.

The best we can conclude from it is that, within the rules set for the experiment, there was no proof of any audibility, nor of any absence thereof.

It's really not that complicated.

You are right, it is not complicated.

I recommend that everyone read the account of the experiment Sy champions here as the paragon and paramount of scientific listening tests, and conclude for themselves whether the test is more akin to what one expects to be treated to in a circus sideshow, or to what one expects in serious science...

Boston Audio Society - ABX Testing article

It would be especially illuminating to compare the experiment's setup against the ITU recommendations, as the ITU is interested in precisely this kind of test, where the insertion of the DUT (AD/DA converter, audio codec, et al.) is compared to the unprocessed (unimpaired) signal.

Ciao T
 
I recommend that everyone read the account of the experiment Sy champions here as the paragon and paramount of scientific listening tests

Really? Where did I say that? You do have a consistent manner of argumentation, which involves attributing things to people that they never said.

It was a well-designed test to answer the stated hypothesis (IT can easily hear the insertion of a Sony PCM F1 into an analog chain of components of his choosing); please read the exchange of letters in HFNRR which led to this visit from IT. IT has not subsequently demonstrated the audibility under any other conditions, and shortly thereafter, started selling digital components. And unlike anything you've ever posted or published, the Tiefenbrun experiment was reported in full, warts and all.
 
Hi Sy,

Really? Where did I say that? You do have a consistent manner of argumentation, which involves attributing things to people that they never said.

Forgive me for drawing inferences from your persistent and consistent advocacy of this particular test in this thread in a very positive way, while attacking far better implemented tests as fundamentally flawed. So, no, you did not say that; however, your behaviour in this thread gave me this impression.

It was a well-designed test to answer the stated hypothesis (IT can easily hear the insertion of a Sony PCM F1 into an analog chain of components of his choosing); please read the exchange of letters in HFNRR which led to this visit from IT. And unlike anything you've ever posted or published, the experiment was reported in full, warts and all.

The raw data for the test was not presented. What was presented was "X out of Y = no better than chance".

The actual experiment as implemented was fundamentally flawed in several areas that should have (and could have) produced a "false positive". The subject was clearly severely biased and more bent on "beating the con" than on actually listening.

In fact, I would suggest that this way of conducting a test is a textbook piece on how not to do it. Moreover, I am concerned even with the framing of the "hypothesis".

However, as expected, you continue to champion the test as "a well-designed test" (with a stated qualification regarding the hypothesis). It is clear where you stand on the subject of scientific testing...

Ciao T
 