Hi-res 96/24 listening test of opamps

Which of the files do you prefer when listening?

  • rr = LM4562

    Votes: 1 4.5%
  • ss = OPA2134

    Votes: 2 9.1%
  • tt = MA1458

    Votes: 2 9.1%
  • uu = TL072

    Votes: 9 40.9%
  • vv = OPA2134

    Votes: 1 4.5%
  • I cannot hear a difference

    Votes: 7 31.8%

  • Total voters: 22
  • Poll closed.
Nonsense! The only other explanation is that the person is deliberately trying to undermine the test by choosing the wrong result.


Oh, DF, I don't know what to say. You are of course quite wrong about that. Probably better for you and everyone else if you would read Thinking, Fast and Slow, including the extensive footnotes. It would likely turn out to be very useful for you, and hopefully you would be happy you did.
 
It takes a minimum of 30 trials per sample to approximate a normal distribution and to say anything statistically significant about this test. I didn't see any instruction addressing that point in the test. Each person should take the test a minimum of thirty times and then find the sample mean ... unless you want to argue how this test follows a normal distribution.

It is an interesting test but no statistically significant information can be taken from it from what I can tell.

The way I understand this, it is a three-way interplay: the number of trials, the result, and the confidence that the result is valid.

The more trials, the more confidence you can have in the result. But even with fewer trials, if the result is very strongly to one side, you can still have good confidence.

Say 100 trials where 65% of the guesses are correct gives confidence X. But even with 20 trials, if 85% of the guesses are correct, you can also have confidence X.

Is that correct reasoning?

Jan
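
(A quick check of those two scenarios with an exact binomial tail sum, assuming pure guessing behaves like a fair coin with p = 0.5; a Python sketch, with the numbers plugged in from the post above.)

```python
from math import comb

def p_value(n, k):
    """Probability of getting at least k correct out of n trials by pure guessing."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

# Jan's two scenarios: 65 of 100 correct vs. 17 of 20 (85%) correct.
print(p_value(100, 65))  # ~0.0018
print(p_value(20, 17))   # ~0.0013 -- comparable confidence from far fewer trials
```

So under this simple model the reasoning does hold: a stronger result can buy roughly the same confidence as a much larger number of trials.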
 
To get the confidence interval you need to know the distribution. If we know this gathered data follows a specific distribution, then we can determine the confidence interval from that distribution.

But if you don't know the distribution, you do know that with enough trials the sample mean approximates a normal distribution. I learned that the minimum number of trials needed to approximate a normal distribution is 30 per sample, but more will give you a higher level of confidence.

I can't say any more without dragging out my statistics textbook :)

I don't remember learning what you mention; can't say you're wrong, though.
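
(For what it's worth, a minimal sketch of the normal-approximation confidence interval for a proportion of correct answers; the 30-trial rule of thumb mentioned above is roughly about when this approximation becomes trustworthy. Plain Python, my own illustration, not anything from the test.)

```python
from math import sqrt

def wald_ci(n, k, z=1.96):
    """Normal-approximation (Wald) 95% confidence interval for the true
    proportion correct, given k correct out of n trials. Only reasonable
    once n is large enough -- the rule-of-thumb 30 trials."""
    p = k / n
    half = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

print(wald_ci(30, 21))  # 21/30 correct -> roughly (0.54, 0.86)
```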
 
If we are just talking right now about the probability that one person is guessing in one ABX test involving some number of trials (and neglecting possible situations where someone hears correctly some of the time but not all of the time, and sometimes might even choose backwards, and what that might mean), then you might want to play around with a binomial calculator, which intuition might suggest is a good model: Binomial Calculator. However, the problem we have isn't exactly according to this model, but it's the kind of thing you may be thinking of.
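
(For anyone who prefers code to a web calculator, the same computation in Python with SciPy, under the simple fair-coin guessing model; the function name is mine.)

```python
from scipy.stats import binom

# P(at least k correct out of n trials) under pure guessing (p = 0.5),
# which is what a binomial calculator reports for this kind of question.
def prob_of_guessing(n, k):
    return binom.sf(k - 1, n, 0.5)  # survival function: P(X >= k)

print(prob_of_guessing(16, 12))  # ~0.038: 12/16 correct is unlikely by chance
```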

EDIT: It's probably also the model that the Foobar ABX test uses to calculate probabilities, which we know is wrong for that test.

If, on the other hand, one wants to know something like: we tested so many people in some way or other, and we want to know what we can infer from that about the general population; and if we further want to allow for the possibility of hearing sometimes but not always, maybe choosing backwards sometimes, etc., then it can get much, much more complicated, and probably requires more than one kind of test.
 
If we want to talk about how to set up a proper trial, then we're going to need said stats books. And do a fair bit of literature review about experimental design.

Better, and FAR more important, is to first determine the question at hand. This supersedes everything else.
 
Markw4 said:
neglecting possible situations where someone hears correctly some of the time, but not all of the time
Why should we neglect what is almost certainly the most common situation for small differences? If a difference is blindingly obvious then ABX will show this with a high score. A smaller difference will result in someone hearing correctly some of the time. As ABX tends only to be used for small differences (or no difference) then hearing correctly some of the time may be what happens nearly all the time. Why neglect it? Have I misunderstood you?
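
(A toy numerical illustration of that situation, not anything from the test itself: assume a hypothetical listener who genuinely detects the difference on a fraction h of trials and guesses on the rest, so the per-trial success rate is p = h + (1 - h)/2.)

```python
from math import comb

def tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 16  # trials; 12/16 correct is a common "passing" score at the 5% level
for h in (0.0, 0.3, 0.6, 1.0):  # fraction of trials genuinely heard
    p = h + (1 - h) / 2
    print(f"h = {h:.1f}: per-trial p = {p:.2f}, P(score >= 12) = {tail(n, 12, p):.3f}")
```

Someone who hears correctly only some of the time lands between the pure-guess and always-hears extremes, and can easily fail to reach the pass mark even though they hear something real.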
 
DF, I do not suggest neglecting anything at this point for actual analysis of listening testing.

In the quoted text I was speaking in response to some basic questions about statistical analysis from some people who seemed relatively new to it. For developing some initial understanding, it might make sense to start by looking at a simplified model, which is what I was suggesting.
 
OK, but simplified models of statistics can be deeply misleading. Most people have a very poor intuitive idea of what 'random' means so they think that if they toss a coin 10 times they will get 4-6 heads (and not all sequential). Many people think that if they have got 8 heads from 9 throws then it is almost certain that the 10th one will be a tail. These false intuitions can be deeply ingrained so even people who can do the maths correctly can suffer from them.
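
(That intuition gap is easy to demonstrate by brute force; a small Python enumeration over all 2^10 coin-toss sequences, my own illustration.)

```python
from itertools import product

n = 10
seqs = ["".join(s) for s in product("HT", repeat=n)]  # all 2**10 outcomes

# The "intuitive" result: between 4 and 6 heads.
mid = sum(4 <= s.count("H") <= 6 for s in seqs) / len(seqs)
# A run of 4 or more consecutive heads somewhere in the sequence.
streak = sum("HHHH" in s for s in seqs) / len(seqs)

print(f"P(4-6 heads)       = {mid:.3f}")    # ~0.66: likely, but far from certain
print(f"P(run of 4+ heads) = {streak:.3f}") # ~0.25: long streaks are not rare
```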
 
Yes. So I also stated that the simple model is wrong for our purposes, but that it might be close to what they were talking about in the discussion so far. I was expecting there might be subsequent questions about a more accurate model for our purposes, but am still waiting. One question might have to do with trying to understand the assumptions underlying the statistics displayed by Foobar ABX testing, and what might or might not be more proper.

If there continues to be interest in and discussion of the subject, then I think people will start to see how complicated it can be, and understand the need to carefully formulate a statement of what they want to find out about listening testing.

Then we could see if there is still interest in undertaking a project. My guess would be that there might not be a lot of interest, but maybe there will be.
 
<snip>
Regarding "categorical messages", for the reason that no one, not a single participant of the test was able to post a valid ABX result that would show and confirm he was able to hear the difference between the opamps, more than this, not even between the D/A - opamp - A/D chain and the original sound file, I dared to make my conclusions.

Have a nice day :).

I understand, but I think we'd agree that the two "categorical messages" were just valid hypotheses, because we don't know whether people couldn't hear a difference or actually haven't even tried.
 
I am going to hazard a guess that the people who developed whatever statistics are used by Foobar know more about statistics than almost all the ABX critics. People knock it not because they have deeply analysed their analysis and found errors in it, but because it gives answers they find unexpected and/or unwelcome. I may be wrong; statistics was always my weakest end in maths.
 
We are talking about an add-on for Foobar that may have been developed by one person.

If one guesses with Foobar ABX, such as by selecting the exact same choice each time without listening, then obviously, on average, about half the answers should be right and half wrong.

Therefore half right, half wrong should be calculated as the most likely outcome of guessing.

However, Foobar ABX statistics report the maximum likelihood of guessing when all the answers are wrong.

Sure, just try getting all answers wrong by guessing. One would have to be very lucky to get such a result. It would be just as unlikely as getting all answers right by guessing.
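
(The point is easy to see from the binomial probabilities themselves; a minimal sketch under the fair-coin guessing model, with n = 16 chosen arbitrarily.)

```python
from math import comb

n = 16
for k in range(n + 1):
    # probability of exactly k correct answers under pure guessing
    pmf = comb(n, k) * 0.5**n
    print(f"{k:2d} correct: P = {pmf:.5f}")
# The distribution peaks at k = n/2 (half right, half wrong), and
# k = 0 (all wrong) is exactly as improbable as k = n (all right).
```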
 
@jcx, Jakob2 has stated there is some research regarding ABX listening tests that he thinks applies to this situation. He said that ABX can work well only if there is significant training and practice with it.

Sort of; it's all about probabilities, and it seems that overall, due to the different mental processes involved, differences occur in the results when using different test protocols, which is a bit alarming.

The original ABX test was introduced around 1950, and some comparisons to another protocol were done:
"Although the instructions to the subject in ABX can be very
simple, it is clear that a really very involved judgmental process is
called for. "

and
"In this laboratory we have made some comparisons among DLs
for pitch as measured by the ABX technique and by a twocategory
forced-choice judgment variation of the constants method
(tones AB, subject forced to guess B "higher" or "lower"). Judgments
were subjectively somewhat easier to make with the AB
than with the ABX method, but a greater difference appeared in
that the DLs were uniformly smaller by AB than by ABX. On a
recent visit to this laboratory, Professor W. A. Rosenblith and Dr.
Stevens collected some DLs by the AB method with similar results.
The case seems to be that ABX is too complicated to yield the
finest measures of sensitivity. "

(J. Donald Harris, "Remarks on the Determination of a Differential Threshold by the So-Called ABX Technique," Journal of the Acoustical Society of America, Vol. 24, 1952, p. 417.)

That is of course related to the ABX protocol then used, which didn't allow the user control over switching and replay of the alternatives. But if the listener follows the original instruction, the different internal process still exists, so the argument is IMO still valid compared to an A/B forced choice where the listener has the control too.

As some people have reported quite impressive results using ABX (see for example the list that Paul Frindle provided in his AES convention paper), I concluded that at least some listeners are able to perform under ABX conditions once they get used to it. (Maybe some even when using it for the first time; humans are different beasts.)

I tried ABX myself and didn't like it (but did not invest any additional time to train), used it with two different listeners, and noticed for both a not-so-good performance, confirmed by statements from both that they felt uncomfortable, so I dropped it.
 
So what determines the distribution? Can you influence that with test design, or is it something that 'just happens' in the results?

Jan

It is just something that happens, which statisticians worked out a long time ago: the central limit theorem, which says that with enough trials the distribution of the sample mean approaches a normal distribution. I don't think the design of the test has much to do with it. You just want to gather data with an idea of what kind of information you want to get from it, and then do the statistical analysis.
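
(A quick simulation, assuming NumPy is available, shows this 'just happening': the sample mean of pure-guess trials settles into a tight, roughly normal shape as the number of trials per test grows.)

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (5, 30, 100):  # trials per test
    # 100,000 simulated tests of n fair-coin (pure guessing) trials each
    means = rng.binomial(n, 0.5, size=100_000) / n
    print(f"n = {n:3d}: mean = {means.mean():.3f}, "
          f"sd = {means.std():.3f} (theory: {0.5 / n**0.5:.3f})")
```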
 
Why should we neglect what is almost certainly the most common situation for small differences? If a difference is blindingly obvious then ABX will show this with a high score.

First of all we need to know what constitutes a small difference (or a big difference), and after that we have to be sure that a listener reaches a sufficient sensitivity level under the specific test conditions.

And we should be clear about the research hypothesis (or simply the question to be examined); it makes a difference whether one wants to infer from test results to the abilities of the underlying population, or whether one wants to use just one listener (or a small group of listeners) as a detector for the perceptibility of a (possibly existing) difference.

As said already, if an experimenter uses stimuli with unknown differences, in a test with a detector (listener) of unknown capabilities, under a test protocol with unknown impact, we shouldn't be surprised if the results are mainly of uncertain relevance/importance/correctness.

A smaller difference will result in someone hearing correctly some of the time. As ABX tends only to be used for small differences (or no difference) then hearing correctly some of the time may be what happens nearly all the time. Why neglect it? Have I misunderstood you?

Leventhal wrote a JAES article (1990s?) about the significance of statistically poor performance (AFAIR he coined the acronym SSPP :) ), because, as jcx already mentioned, it might be that a listener truly hears something but gets the attribution wrong.
And Jon Boley pointed out in another AES convention paper, from 2009, that additional analysis of ABX results according to SDT (signal detection theory) could give further insight even in the case of inconclusive results.
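
(Boley's analysis is more involved than this, but for a flavour of SDT: the basic sensitivity index d' compares hit and false-alarm rates. The counts below are made up for illustration, and this is the simple yes/no form rather than a full ABX model.)

```python
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Basic signal-detection sensitivity: d' = z(hit rate) - z(false-alarm rate)."""
    hit_rate = hits / (hits + misses)
    fa_rate = false_alarms / (false_alarms + correct_rejections)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

print(d_prime(18, 7, 6, 19))  # hypothetical same/different counts -> d' ~ 1.3
```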

We should remember that the typical listening test is a small-sample experiment, mostly neglecting any power calculation.
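
(A power calculation need not be elaborate. Here is a minimal exact-binomial sketch for a hypothetical listener who truly hears the difference on 70% of trials, asking how many trials a one-sided 5% test needs for 80% power; the specific numbers are assumptions for illustration only.)

```python
from math import comb

def tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p0, p1, alpha, power = 0.5, 0.7, 0.05, 0.80
for n in range(5, 200):
    # smallest passing score that keeps the false-positive rate below alpha
    k = next(k for k in range(n + 1) if tail(n, k, p0) <= alpha)
    if tail(n, k, p1) >= power:  # does the true listener usually pass?
        print(f"{n} trials needed, pass mark {k} correct")
        break
```

Far more trials than a casual 10- or 16-trial session, which is the point: small-sample tests are often underpowered for the effect sizes people care about.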
 