John Curl's Blowtorch preamplifier part III

Status
Not open for further replies.
Actually, there is still a problem with Foobar ABX and how it calculates probability of guessing. Guessing should give a score of 4 out of 8 (on average). The probability of guessing in that case should be at a maximum. However, Foobar ABX says there is a 50% chance of guessing when the score is 4 out of 8.

Foobar ABX calculates the probability exactly according to the mathematical definition of probability. The probability of 4 successes in 8 coin flips is NOT 50%! For n = 10 attempts with k successes, the probability that you were guessing (that is, of scoring at least k by chance) is

k- probability you were guessing
1- 99%
2- 99%
3- 95%
4- 83%
5- 62%
6- 38%
7- 17%
8- 5%
9- 1%

The probability of exactly k successes in n attempts, each with two possible outcomes (the so-called Bernoulli scheme), is

P(X = k) = (n over k) * p^k * (1 - p)^(n - k)

p .... probability of success in a single attempt
n .... number of attempts
k .... number of successes
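As a cross-check, the table can be reproduced in a few lines. This is an illustrative sketch in Python; the function names are mine, not anything from Foobar:

```python
from math import comb

def p_exact(n: int, k: int, p: float = 0.5) -> float:
    """P(X = k): chance of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_guessing(n: int, k: int, p: float = 0.5) -> float:
    """P(X >= k): chance of scoring k or better by pure guessing."""
    return sum(p_exact(n, i, p) for i in range(k, n + 1))

# Reproduce the n = 10 table above (with one more decimal).
for k in range(1, 10):
    print(f"{k} - {100 * p_guessing(10, k):.1f}%")
```

Running it gives 99.9%, 98.9%, 94.5%, 82.8%, 62.3%, 37.7%, 17.2%, 5.5%, 1.1%, which rounds to the table above.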
 
Actually, there is still a problem with Foobar ABX and how it calculates probability of guessing. Guessing should give a score of 4 out of 8 (on average). The probability of guessing in that case should be at a maximum. However, Foobar ABX says there is a 50% chance of guessing when the score is 4 out of 8.

No, it does not. Foobar ABX says this in the 4/8 case:

Code:
Output:
WASAPI (event) : OUT (DUO-CAPTURE EX), 24-bit
Crossfading: NO

08:38:52 : Test started.
08:39:07 : 01/01
08:39:10 : 02/02
08:39:14 : 03/03
08:39:20 : 04/04
08:39:25 : 04/05
08:39:29 : 04/06
08:39:34 : 04/07
08:39:38 : 04/08
08:39:38 : Test finished.

 ---------- 
Total: 4/8
Probability that you were guessing: 63.7%
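That 63.7% is just the cumulative binomial tail P(X >= 4) for 8 fair-coin trials. A quick standard-library check (an illustrative sketch, not Foobar's actual code):

```python
from math import comb

# P(X >= 4) for n = 8 fair-coin trials: the figure Foobar ABX
# reports as "probability that you were guessing" for a 4/8 score.
n = 8
p_value = sum(comb(n, k) for k in range(4, n + 1)) / 2**n
print(f"Probability that you were guessing: {p_value:.1%}")  # 63.7%
```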
 
The probability of a coin flip and the probability of guessing are two different things. On average, X=A should happen in ABX as often as X=B. If one doesn't know the correct answer and therefore always chooses X=A, the answers should be right half the time; one should score 50% correct.

This is the same type of problem that comes up in trying to penalize guessing on multiple-choice exams. The usual formula is that if there are 4 choices on each question, the exam taker is penalized 1/3 point for each incorrect answer. The purpose of applying a penalty is to discourage guessing.

However, the approach has been criticized for various reasons, including, IIRC, whether or not the formula is the most correct one to use.
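For reference, the usual "formula scoring" rule with m choices penalizes each wrong answer by 1/(m-1) points, which makes a pure guesser's expected score exactly zero. A minimal sketch (the function names are mine, purely illustrative):

```python
import random
from fractions import Fraction

def expected_guess_score(m: int) -> Fraction:
    """Expected per-question score of a pure guesser with m choices:
    +1 for a correct answer, -1/(m-1) for a wrong one."""
    return Fraction(1, m) - Fraction(m - 1, m) * Fraction(1, m - 1)

def simulate(m: int, questions: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo check of the same quantity."""
    rng = random.Random(seed)
    score = 0.0
    for _ in range(questions):
        if rng.randrange(m) == 0:    # guessed the single correct option
            score += 1.0
        else:
            score -= 1 / (m - 1)     # guessing penalty
    return score / questions

print(expected_guess_score(4))  # 0 (the penalty exactly cancels guessing)
print(simulate(4))              # close to 0
```

Whether this is the "most correct" formula is exactly the criticism mentioned above; the sketch only shows what the rule is designed to do.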

In addition, with ABX testing it has not been researched whether there is in fact a reverse correlation effect due to System 1 operations in instances of weak or unfamiliar signal detection. I think there might be.

Further, coin flipping assumes that each trial is a random event, uncorrelated with the other trials. For a human who guesses systematically, or whose errors are correlated through memory, fatigue, and other factors while trying to answer the same question over and over without quite getting it, it might be more like having a skilled coin flipper who can affect the outcome of trials, making them partially deterministic rather than purely random and independent.

For the above reasons it might be better to skip trying to assign guessing probabilities for ABX testing of small differences. Equating it to a process like coin flipping may be too much of an oversimplification.
 
The probability of a coin flip and the probability of guessing are two different things.

Of course. The probability of exactly 5 successes in 10 attempts is not the same as the probability that you were guessing if you had 5 successes from 10 attempts.

The probability that you were guessing, given k successes from n attempts, is the probability of "at least k successes".
For our example of 5/10, it is calculated as the probability of exactly 5 successes out of 10, plus the probability of exactly 6 out of 10, and so on: a sum of the probabilities of 5/10 + 6/10 + 7/10 + 8/10 + 9/10 + 10/10. And that is exactly how Foobar calculates the result; there is no problem, no mistake.
 
That illustrates one of the problems with null-hypothesis testing as it is usually done.
The observed data, the number of correct answers, is summed up and used as the test statistic.
The analysis is done under the assumption that the null hypothesis is true (the null hypothesis is stated as random guessing) and uses the exact binomial test.


A significance level is set before doing the experiment (often at SL = 0.05) and the accumulated probabilities for each possible result are compared to this significance criterion.
Foobar shows these accumulated probabilities for the result chain rounded to one digit.

The chance to get 10 correct answers in 10 trials is P(10 | 10) = 0.00098 = A (rounded to the 5th decimal).
The chance to get 9 correct answers in 10 trials is P(9 | 10) = 0.0098 = B.

Given the often-used SL = 0.05 we would accept each of these results, rejecting the null hypothesis, but that means (according to the Kolmogorov axioms) that we have to add the probabilities, as the probability for getting the result P(A or B) = P(A) and P(B).
In this example it gives P(A or B) = 0.01074, so it is still below our criterion.

The next would be P(8 | 10) = 0.04395 = C, but now the cumulated probability is
P(A or B or C) = 0.05469.
The probability for that result would be higher than our criterion of SL = 0.05, so in a formal decision process we would no longer reject the null hypothesis.

Actually we are evaluating how compatible the observed data is with our null hypothesis, but as we are not really examining whether the null hypothesis is true, we can't conclude that a negative test result establishes or corroborates the null hypothesis; the real reason might be different, and other hypotheses might be even more compatible with the observed data than the "random guessing" assumption.
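The step-down decision described above can be written out directly. A sketch assuming SL = 0.05 and n = 10, as in the example:

```python
from math import comb

SL = 0.05   # significance level chosen before the experiment
N = 10      # number of trials

def p_exact(k: int, n: int = N) -> float:
    """P(X = k) under the null hypothesis of random guessing (p = 0.5)."""
    return comb(n, k) / 2**n

# Accumulate tail probabilities from the best score downward and compare
# the running sum to the criterion, as in the worked numbers above.
cumulative = 0.0
for k in range(N, 6, -1):
    cumulative += p_exact(k)
    decision = "reject H0" if cumulative <= SL else "fail to reject H0"
    print(f"k >= {k}: cumulative p = {cumulative:.5f} -> {decision}")
# k >= 10: 0.00098 -> reject H0
# k >= 9:  0.01074 -> reject H0
# k >= 8:  0.05469 -> fail to reject H0 (above SL)
```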
 
Actually we are evaluating how compatible the observed data is with our null hypothesis, but as we are not really examining whether the null hypothesis is true, we can't conclude that a negative test result establishes or corroborates the null hypothesis; the real reason might be different, and other hypotheses might be even more compatible with the observed data than the "random guessing" assumption.

Exactly! Thank you. A calculation is being made on the probability of a particular score occurring based on the assumption that all answers are random guesses, but in fact answers probably are not all random guesses, so the probability of a particular score occurring is likely not being correctly calculated.

That's not all, either. Probably no point in going on, though.
 
The next would be P(8 | 10) = 0.04395 = C, but now the cumulated probability is
P(A or B or C) = 0.05469.
The probability for that result would be higher than our criterion of SL = 0.05, so in a formal decision process we would no longer reject the null hypothesis.


The same results I spoke about; the only difference is "exactly 8 of 10" versus "at least 8 of 10". We are not interested in "exactly 8 of 10" alone.
 
Hope that there are enough people who are not partisan in the discussion that we might actually make some progress, rather than just keep pushing the boulder up the hill each day.

Sadly there are those that think there is a problem, but are not prepared to produce a test process that can be used by all, but just keep avoiding doing so.

.. you must look to your own designs as possibly lacking something, IF people do not prefer them over others, not that everyone is fooling themselves over what something looks like or some sales pitch from another.

While some people still think that if it doesn't sound different it's bad, that rock is going to keep getting pushed...
 
The same results I spoke about; the only difference is "exactly 8 of 10" versus "at least 8 of 10". We are not interested in "exactly 8 of 10" alone.

Actually no difference, as you cited it already:
"The next would be P(8 | 10) = 0.04395 = C, but now the cumulated probability is
P(A or B or C) = 0.05469.

The probability for that result would be higher than our criterion of SL = 0.05, so in a formal decision process we would no longer reject the null hypothesis.
"
(emphasis added)

P(A or B or C) = P(A) + P(B) + P(C), which is the probability of at least 8 out of 10.

It is better to use the expression P(X >= 8 | 10), with X representing the number of correct trials.
I remember that Excel, for example, uses the expression "at least" in the explanation of its probability calculation but actually calculates P(X > 8 | 10), which is a problem in experiments with small trial numbers.
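The difference between the strict and non-strict tail is easy to demonstrate (a sketch; `tail` is my own name, and which variant any particular tool computes should be checked against its own documentation):

```python
from math import comb

def tail(n: int, k: int, strict: bool = False) -> float:
    """P(X >= k), or P(X > k) when strict, for n fair-coin trials."""
    lo = k + 1 if strict else k
    return sum(comb(n, i) for i in range(lo, n + 1)) / 2**n

# With small n the off-by-one changes the decision at SL = 0.05:
print(tail(10, 8))               # 0.0546875    (>= 8 of 10, above SL)
print(tail(10, 8, strict=True))  # 0.0107421875 (>  8 of 10, below SL)
```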

Edit: in my post with the explanation I used the line "P(A or B) = P(A) and P(B)" but meant P(A or B) = P(A) plus P(B), so it is the mathematical operator "+", not the logical "and".
 
Sadly there are those that think there is a problem, but are not prepared to produce a test process that can be used by all, but just keep avoiding doing so.

While some people still think that if it doesn't sound different it's bad, that rock is going to keep getting pushed...

I have proposed a change to ABX as it currently exists in Foobar ABX that I think would move it in the direction of being more useful, maybe even the whole way to being definitely fair and useful. It would involve adding a loop checkbox, and single-button switching between samples with eyes closed, such as a hotkey on the keyboard. Those two things should help a lot. At that point I would want to test again. I am not sure about the question at the end of how the answer choices are presented; I don't remember whether it's okay or not, since I haven't tried it for quite a while. A programmer here in the forum did contact me by PM at one time and offered to do it so we would have something a little better than Foobar ABX, but he eventually decided it was more than he could take on.

With regard to "if it doesn't sound different it's bad," I don't know what that is supposed to mean. Some things are indistinguishable from one another, just not usually the things in PMA's listening tests, although they can be quite hard to use Foobar ABX on. My only complaint is that if I can reliably hear a difference blind using a different protocol myself, I would like to see us find a protocol as good that we can agree on and that everybody here can use. The problem is that Foobar ABX can't be changed, and it is the only program with a validation test system (although it can be cheated). Like many things, it would appear to take funding to fix, and nobody wants to pay. They only want to argue.

Once again, I would like to step back as this is taking up too much time. Bye.
 
See, I had hoped that an interesting discussion could come out of the fact that we have an interesting result to discuss. Not the score in isolation but the way he trained himself to hear the difference and the fact that he was honest enough to admit he didn't have a preference and wouldn't guarantee that he could tell the difference if just presented with one (so effectively transparent for normal use).



The method was a variant of fast switching on a single event. This may increase the sensitivity of the test, but it is at odds with the hardline subjectivists and their view that you need hours or days listening to one before switching in order to hear the differences. This is fascinating, though not unexpected.



My takeaway is that an open and honest mind can get a lot from these sorts of comparisons. Personal learning rather than data to solve an unsolvable argument, but still good :)


Back to pushing your boulders. The vulture will be along to eat your liver in a while!
 