What kind of evidence do you consider as sufficient?

Status
Not open for further replies.
YOU don't constitute a statistic, so what you can or can't do is irrelevant. Of course, one can willingly ignore anything he's hearing, even 20 dB level differences, and just pick randomly. This would guarantee a null result. For such a subject, a "positive control" won't help at all. Or maybe some subjects are drunk. Or maybe some subjects are tired. Or maybe some subjects don't care. This is where statistics comes into play, and determining the audience sample size is a critical step.
Of course a positive control will help - it is intended to reveal such situations as those above, but more importantly a subject may be tired, his attention may wander, or the equipment & setup may not be conducive to revealing small differences.

It has nothing to do with statistics - that's a red herring. In fact, lumping all the invalid results (drunk, tired, don't care, etc.) into the mix will trend the overall statistical result towards a null.

But at least you are trying to address points instead of trying to score points ;)

Once again, you seem to be testing the audience's sensitivity and not the DUTs' audible differences.
The sensitivity of the test itself & its participants has to be determined before any results are investigated. Why do you think the ITU recommendations state that certain individuals' full sets of results may be discarded?

Eliminating extreme results, as indicated by the ITU, is common practice in any statistical analysis. There are specialized statistical/mathematical methods for that; the chi-squared test comes to mind.
It also states that incorrectly evaluating the hidden controls & anchors leads to possible elimination

As I said, you are swinging in the wind if you don't have some evaluation of the sensitivity of the test itself - it's trying to measure centimeters with meter rulers - you simply don't know how sensitive your measurement setup is.
 
I'm also mainly referring to Foobar ABX tests which is what is mostly used on forums as 'proof'

These home-run tests are usually individual & the conditions are highly variable - different equipment, listening environments, listeners, etc.

We simply don't know the validity of the results

I'm saying that I would require, at the very least, some hidden controls in the results before I would even begin to treat these as other than anecdotal listening impressions
 
No, I'm not suggesting that.
You seem unable to get your head around the fact that the test itself (including its participants) needs to be verified as capable of differentiating small differences - just like I'm sure you have different measurement equipment whose capabilities differ in this regard.

Do you blindly accept that this equipment is sensitive enough to measure at the sensitivity written on the spec? No, you trust that this has been calibrated correctly & you run routine re-calibrations on scopes, etc.

Here we are using a test with unknown sensitivity & assuming its results have validity - are we trying to measure centimeters with a ruler which only shows meters?

Maybe it's your assumption that auditory perception is like a measuring device - that it always delivers the same output for the same input? And you believe that the only change a blind test makes to auditory perception is the removal of knowledge? I'm not sure where your problem in comprehension lies.

I'm trying to explain it as best I can but if you prefer Jakob's explanations that's fine.
Here are what you keep omitting (either intentionally or you aren't aware of) when it comes to reported audible difference among audio electronic components.
1. Common cause of the difference people hear when comparing DACs and amps is the voltage level difference at the output terminal which causes volume level difference.
2. Listening position change. You can move your head a couple of inches and you will hear a difference even when no component is changed, especially in a room without good acoustics (common among consumers). So a listener listens to a DAC for a while, then goes to the equipment rack, changes to another DAC, comes back to his seat and listens again. He is subjected not to a double whammy (level difference & listening position change) but a triple whammy (level difference, listening position change & aural memory fade), the last being list item number 3.
3. Aural memory fade from lack of quick switch between components.

Contemporary DACs and amps are hi-fi enough to pass our ears. Other than a malfunctioning unit, the audible difference people report isn't due to some magical property of audio-replaying electronic components that is yet to be measured or discovered. It's much more straightforward than you are trying to make it.
 
I'm also mainly referring to Foobar ABX tests which is what is mostly used on forums as 'proof'


Those using individual Foobar ABX test results as "proof" are as ignorant as those claiming they hear the difference between two mains cables in their home setup.


1000 individuals (picking a random number here) taking a Foobar ABX test, out of which 990 (making up a number here) ended up with a null result - that is called a statistic.


If you want to understand why those 10 did hear a difference, you could repeat the test on them and see if there's indeed something they hear, or whether their results were simply statistical variance (read: lucky guesses).
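To put rough numbers on the "lucky guesses" point, here is a minimal sketch using the binomial distribution. The 16-trial / 12-correct passing criterion is an assumed example of mine, not a number from this thread:

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of getting k or more
    trials correct by pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assumed example criterion: 12 or more correct out of 16 ABX trials.
p_lucky = binom_tail(16, 12)    # probability a pure guesser still "passes" (~0.038)
expected = 1000 * p_lucky       # expected "lucky" positives out of 1000 guessers (~38)

print(f"p = {p_lucky:.4f}, expected lucky passers out of 1000: {expected:.0f}")
```

So even if nobody hears anything at all, a few dozen positive runs out of 1000 is exactly what chance predicts at this criterion, which is why retesting the positives is the right move.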
 
Are we using a test which can't differentiate known audible differences to evaluate audible differences in the DUT?

Again, what are the known audible differences in "good" DACs you will use to establish controls? You quote those ITU specs that use terms like "impairment" or "artifact" without specifying or quantifying them. I suspect the ITU, being engineers, would measure distortion, etc. and quote some numbers that you might not like. Like no one is expected to hear <1% low-order 2nds and 3rds.

There is an element of this thread that seems to aim at dismissing DBT for audio altogether (Mark's fear factor). The DAC thing is especially low on my list; for real enjoyment I listen to LPs exclusively. If you have some spare time you should read Romy the Cat's GoodSoundClub site and Arthur Salvatore's HIGH-END AUDIO site to get the extremes of listening pleasure.
 
Wally, for clarification and from my understanding:

Foobar ABX does not provide a proof.

It does give a probability that the person undergoing the test was able to hear a difference in the two files and identify which of the two (A or B) was presented as X.

This may lead to a consensus that people can hear a difference and identify the files.

However, this is not a formal proof.
 
On another point I wish folks would stop acting like Foobar ABX is something special. You could write a script in any number of languages that plays the .wavs at random and queries the user. I'm certain Foobar uses the same IEEE library to randomize as just about any program out there and the statistical computations are documented in numerous places.
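As a sketch of how little machinery such a script actually needs, here is a minimal ABX core in Python. The function names are my own invention, the stdlib `random` module stands in for whatever RNG Foobar really uses, and actually playing the .wav for each trial is left out:

```python
import random

def make_trials(n_trials, seed=None):
    """Hidden X assignments for an ABX run: on each trial, X is secretly A or B."""
    rng = random.Random(seed)
    return [rng.choice('AB') for _ in range(n_trials)]

def score(trials, guesses):
    """Count how many times the listener's guess matched the hidden X."""
    return sum(x == g for x, g in zip(trials, guesses))

# In a real script you would play file X for each trial and prompt the user;
# a fixed seed here just makes the run reproducible for illustration.
trials = make_trials(16, seed=42)
```

Everything beyond this (the p-value on the final score, the UI) is the standard binomial arithmetic documented in many places, as the post says.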
 
Pretty much what I would expect from psychology students.

Provided were results from students, scientists in the field, and scientists/teachers lecturing on statistics. Your comment only mentioned the students; does that mean you do not wonder about the misconceptions apparently part of the "common knowledge" of the other scientists and lecturers?

Of course it would explain why you aren't surprised by the students' errors. ;)

You can not use that to infer anything whatsoever with regards to engineering students.

I used my own experience (although admittedly from a long time ago :) ) to infer something about the state of real understanding among engineering students. Further, I use my experience from the ongoing discussions about "statistical things" in audio forums in which engineers were participating.

And of course I'm using these results alongside the others coming from other fields; whenever someone studies the quality of medical research papers wrt statistics, the verdict is mostly not very positive either. Not related to student work but to people doing post-thesis work. Biology? Not so different.

So of course it might be that a lot of other fields suffer from these effects but not the engineering department, besides those participating in audio forums, but that hypothesis I'd consider not so likely.... :)

But anyway, as you might see on rereading my original post, my comment was directed at those engineers participating in audio forums and the "dbt discussions".
 
Yeah, I've never understood why Foobar's ABX is some boogeyman. It's a tool to conduct computer-randomized ABX trials. How the samples are prepared, the end goals/etc are on the investigator.

We have addressed this too in the past; the main reason is that in audio forums the ABX tests were more often mentioned, I suspect mainly because the free software available favoured the ABX protocol, starting with Arny Krueger's ABX tool, followed by another free tool that even provided the ABC/HR protocol as an alternative (although, as far as I have seen, rarely used), and ending in today's quite popular Foobar ABX tool. People are demanding "blind tests" from other members, throwing around "ABX" and "Foobar", but never mentioning the pitfalls of using such a tool. Over at hydrogenaudio, for example, you'll find it mentioned in every thread (more precisely, in every thread I've read) that is related to controlled listening tests.

I hope we have already offered sufficient evidence that the proportion of correct answers is significantly lower in ABX trials than in other protocols when testing the same sensory difference. The internal "work load" for listeners seems to be higher than in other tests.
 
Statistics as a field of mathematics was developed sometime after Calculus. It took about 20 years to get statistics pretty well figured out. Some observers have suggested that statistics is somehow more foreign to the way human minds work than Calculus, making it perhaps easier to understand why it took so long to get it figured out. By the way, I don't make this stuff up. Just reporting some things I have read about it.
 
Foobar ABX was once very popular around here, that is, until I described how its verification system could be beaten. But since many here have had an opportunity to try it, they may know what it is when it is referred to. Nothing more to it than that.

Also, I noticed it has some features that most similar programs don't have. Just not enough features to finish the job of making it work the way it needs to.
 
Wally, for clarification and from my understanding:

Foobar ABX does not provide a proof.


Loose example, to clarify:


X claims he can hear a difference between mains cable A and mains cable B in a sighted test. He agrees to be subject of a Foobar ABX test. Two outcomes:


- With a generally accepted probability of error, he scores positive. This does NOT mean that everybody will hear the difference between A and B, only that either X has extremely good hearing, or that the differences between cables A and B are not trivial after all. More (statistical) testing is required to figure out which, but engineers/scientists now have good grounds to step in and find the root cause of this Foobar ABX outcome.


- The result is null. This doesn't mean that nobody can hear a difference between A and B, only that there are no grounds to take X's sighted test result as fact. The burden of proof is on X to further substantiate his findings from sighted tests. For example, it is X that should eventually finance a large-scale statistical Foobar test (say, 1000 participants) to establish, at some level of confidence, that differences between A and B can be heard. Until such proof is provided, engineers/scientists can rest.
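On the "large-scale statistical test" idea, one can sketch how many trials per listener the binomial distribution demands. The 90% and 60% "true ability" figures below are illustrative assumptions of mine, not numbers from this thread:

```python
from math import comb

def tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def needed_trials(p_true, alpha=0.05, power=0.8, max_n=300):
    """Smallest number of forced-choice trials at which a listener who answers
    correctly with probability p_true is detected (guessing rejected at level
    alpha) with at least the given power. Returns None if max_n is too small."""
    for n in range(1, max_n + 1):
        # Find the smallest passing score k whose false-positive rate is <= alpha.
        pmf = [comb(n, i) * 0.5**n for i in range(n + 1)]
        k, acc = n + 1, 0.0
        for i in range(n, -1, -1):          # walk the upper tail downwards
            if acc + pmf[i] > alpha:
                break
            acc += pmf[i]
            k = i
        if k <= n and tail(n, k, p_true) >= power:
            return n
    return None

print(needed_trials(0.9))   # -> 8: a strong listener needs very few trials
print(needed_trials(0.6))   # well over a hundred trials for a marginal listener
```

The asymmetry is the point: a short ABX run can confirm a large ability but has almost no power against a small one, so a null from a handful of trials says very little on its own.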
 
<snip>
Jakob, does this fit into the use of controls in your experience?

The JND for level differences is the result of tests where participants had to reply to the question "which stimulus is louder".

If you use the same level difference but ask your participants "which stimulus sounds better" or "do the stimuli sound different", it depends on the experience and abilities of the listeners whether they will be able to identify a level difference as the reason for the difference.
Inexperienced listeners will often have problems identifying even larger level differences.

But even experienced listeners will report much smaller level differences (meaning below 0.3 dB) just as sound differences, and will most likely not report one stimulus as louder.

If the experimenter uses this kind of control at different levels, he will get additional information about possible learning effects.
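For scale, that 0.3 dB figure converts to an amplitude ratio by the standard dB definition; nothing here is specific to the tests under discussion:

```python
def db_to_amplitude_ratio(db):
    """Amplitude (voltage/pressure) ratio corresponding to a level
    difference in dB: ratio = 10 ** (dB / 20)."""
    return 10 ** (db / 20)

# A 0.3 dB level difference is only about a 3.5% change in amplitude,
# versus roughly 12% for a 1 dB difference.
r_control = db_to_amplitude_ratio(0.3)   # ~1.035
r_one_db  = db_to_amplitude_ratio(1.0)   # ~1.122
```

Which is why experimenters level-match to hundredths of a dB before attributing anything else to the equipment under test.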


EDIT - On further thought, this just opens up a cheat like the relay click. Identifying the ±1 dB trials 100% of the time would enable manipulation of the statistics if one cannot distinguish the other trials.

Still amazing how strong bias works :)

No, it would not, as the control results are not part of the analysis wrt the EUTs but are used exclusively to analyse whether the test works as intended.
 
Unfortunately we can't force others to actually read references or recommendations (just kidding, forcing others will rarely result in positive progress), but I encourage everybody interested in controlled listening tests to read ITU-R BS.1116-3, as it mentions a lot of important points. And it does not fail to mention that the complex matter of sensory testing can't be thoroughly covered within the 26 pages of this publication. (Inclusion of external special expertise is recommended.)

It might be surprising for some, but you'll not find the assertion that nobody should start testing with participant numbers of 1000 or below. ;)
Instead you'll find that the minimum number would be 20 listeners.

Some excerpts related to controls (denoted as "anchors" in this document) and about what to do if too many transparent systems are found (which translates to: can't reject the null hypothesis in simple ABX and other forced-choice hypothesis tests):

It must be empirically and statistically shown that any failure to find differences among systems is not due to experimental insensitivity because of poor choices of audio material, or any other weak aspects of the experiment, before a “null” finding can be accepted as valid. In the extreme case where several or all systems are found to be fully transparent, then it may be necessary to program special trials with low or medium anchors for the explicit purpose of examining subject expertise (see Attachment 1).
These anchors must be known, (e.g. from previous research), to be detectable to expert listeners but not to inexpert listeners. These anchors are introduced as test items to check not only for listener expertise but also for the sensitivity of all other aspects of the experimental situation.

and

If these anchors, either embedded unpredictably within the context of apparently transparent items or else in a separate test, are correctly identified by all listeners in a standard test method (see § 3 of this Annex) by applying the statistical considerations outlined in Attachment 1, this may be used as evidence that the listener’s expertise was acceptable and that there were no sensitivity problems in other aspects of the experimental situation.

and

On the other hand, if these anchors fail such correct identification by any listeners, then this suggests that either these listeners lacked sufficient expertise, or else that there were sensitivity flaws in the situation, or both.

(Bold emphasis added by me)
(Source: ITU-R BS.1116-3, Methods for the subjective assessment of small impairments in audio systems, pages 8-9)
 
It really is amazing how much denial & reluctance there is to having the test itself evaluated for how fit for purpose it is. It's especially amazing given that there are documents detailing the correct approach to blind testing, and those arguing against it have patently never read the documents, or read them but failed to understand them.

Those that argue against this so vehemently are the self-same ones who claim that modern audio electronics are "hi-fi enough to pass our ears"

One can't help but notice how the predominant null results of such 'tests' fit this agenda, so their reluctance is understandable, but not excusable if they claim to be interested in scientific rigor & truth.
 
Still amazing how strong bias works :)

No, it would not, as the control results are not part of the analysis wrt the EUTs but are used exclusively to analyse whether the test works as intended.

That was not how the purpose of these hidden controls was stated. As I read it, they were proposed to show that blind ABX somehow removes the listener's ability to discriminate small differences, i.e. that blind ABX has some hidden flaw.

So I ask, with respect to the two DACs level-matched to 0.01 dB: what do you propose as controls, and how are they used?
 