But this is an ABX test, not AB.
dave
It's still a binomial distribution. ABX is a methodology; it does not change the distribution. Getting to hear A and B a priori, before the test sample X, does not change the fact that you must pick X = A or X = B.
If anyone is interested: Statistical Analysis of ABX Results Using Signal Detection Theory | Jon Boley - Academia.edu
More than three? Do I work while I'm sleeping? I didn't get the memo.
Didn't get the memo? You wrote the memo.
I may have missed some, but I count (in no particular order):
1. “DAC blind test: NO audible difference whatsoever.”
2. “Hi-rez: you can't hear the difference...”
3. “World's best midrange Blind Testing...” - World's Best Midranges - SHOCKING Results & Conclusions.
4. “World's best Tweeters Blind Testing”
Unclear if this one actually happened…
5. “The Ultimate Blind Test - The whole speakers/system.”
Additionally, you once tried to make a bet on the outcome of a blind test you proposed to carry out re: passive versus active crossovers. You structured the bet, unsurprisingly, so that if no difference was heard, you won! (This would have been the first test with an outcome that didn't SHOCK you.)
So either you had done the test before and knew the outcome, or you were willing to bet on the null outcome without having pre-tested.
>>>>>
I wonder who would pay real money for the results of what now appears to be a single-blind, ABX “test” run by a person who doesn’t believe in audible differences in high end audio?
The pro guys need bulletproof gear, so that would leave out a lot of what you tested. In electronics, the costs to meet the build quality and noise requirements would dwarf the potential savings on a DAC chipset, so they wouldn't risk sales or reputation on a small cost saving, in case the test results proved bogus...
That seems to lead back to consumer audio?
Single-blind tests can leak and potentially give false positive results, although I can't think of any way they could give false negatives.
A long time ago, I took part in a single-blind loudspeaker cable listening test with a couple of (then) colleagues and a friend of a colleague. One colleague flipped a coin in another room, came into the listening room, disconnected the loudspeaker cables, moved both pairs of cables around a bit and then connected one pair depending on the coin flip. Two colleagues of mine and a friend of one of them listened to the music and tried to decide which cable they had listened to. I was the sceptic: I filled in which loudspeaker cable I thought it was before the music even started. We had decided in advance to do 10 trials and to calculate the one-sided probability of getting at least that many correct answers by chance; anything below 5 % one-sided would be considered a significant outcome.
Results:
One of the listeners had 3 out of 10 correct.
Two others had 4 out of 10 correct, but had given the exact same answers.
I had 0 out of 10 correct.
Had we decided to do a two-sided test, my result would have been significant, even though I only listened to the sound of the cables being swapped.
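For anyone who wants to check the arithmetic, here is a minimal Python sketch (mine, not part of the original test) of the exact binomial p-values for 10 trials under pure guessing, comparing the one-sided criterion we had agreed on with a two-sided one:

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k correct answers when guessing with probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_one_sided(k, n):
    """P(X >= k): chance of doing at least this well by pure guessing."""
    return sum(binom_pmf(i, n) for i in range(k, n + 1))

def p_two_sided(k, n):
    """Doubled smaller tail: also flags suspiciously *bad* scores."""
    lower = sum(binom_pmf(i, n) for i in range(0, k + 1))
    upper = sum(binom_pmf(i, n) for i in range(k, n + 1))
    return min(1.0, 2 * min(lower, upper))

for k in (0, 3, 4):
    print(k, round(p_one_sided(k, 10), 4), round(p_two_sided(k, 10), 4))
```

With 0 out of 10 correct, the one-sided p-value (for an excess of correct answers) is 1.0, so the test we had planned shows nothing, while the two-sided p-value is about 0.002, well below 5 %. The scores of 3 and 4 out of 10 are nowhere near significance either way.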
As an audio hobbyist there is no obligation to optimize the price-performance ratio. I spent 3.5 years and over 1000 euros on the DAC I built, but it was fun, I learned a lot and I have a good sounding DAC now, so why should I care whether or not it is audibly different from other DACs?
? I think you have the two terms confused - false positives are when something is perceived as audible which isn't; false negatives are when something is perceived as inaudible when there are actually audible differences.
An ABX test needs a minimum of 16 trials for statistical significance - it's easy to miss audible differences (false negatives) for many reasons - lack of suitable training, tiredness, unsuitable equipment, unsuitable test environment, bias that all sound the same & the test is a waste of time, etc.
That's the issue with statistical analysis - there's the lucky-coin (or, in your case, unlucky-coin) problem, which has to be dealt with by doing enough trials to swamp any such lucky-coin results.
? I think you have the two terms confused - false positives are when something is perceived as audible which isn't; false negatives are when something is perceived as inaudible when there are actually audible differences.
That's exactly what I mean. The example I gave suggests that there was a difference between the sounds of the cables that got connected, or between the expression on my colleague's face when he connected one cable or the other. Such a difference could be incorrectly attributed to the cable's sound quality.
See (again): they'll produce statistical evidence, yet there is no way to prove the accuracy of the statistical evidence that is produced by such a test.
You mean you used a "tell" (your friend's face when connecting the cable)?
So you effectively "cheated" the blind test, as it wasn't blind - your friend was telling you which cable he was connecting through his facial expression.
What's your point - that it's too easy to cheat ABX tests?
My colleague was doing his very best not to give any clues at all as to what cable he connected, but still, apparently I could somehow hear or see a difference without listening to the music. That shows that it's difficult to get anything sensible out of a single-blind test, or at least that a positive result won't mean much.
By the way, the two guys who gave exactly the same answers were sitting right next to each other during the test, but they did not deliberately cheat or anything. They were simply trying to listen to the music, but apparently still influenced each other.
No, Doppler. Only three have been made so far.
Made = Organized + Done + Results.
It's very (VERY) time-consuming to run such a test. At least in a correct, serious way.
As an audio hobbyist there is no obligation to optimize the price-performance ratio. I spent 3.5 years and over 1000 euros on the DAC I built, but it was fun, I learned a lot and I have a good sounding DAC now, so why should I care whether or not it is audibly different from other DACs?
That's an entirely different question.
Of course, this thread is not debating the DIY hobby per se, the FUN of building a project yourself. It's only about the sonic result.
There are, indeed, plenty of other factors to consider: for example, durability. Could the FiiO last as long as the Forssell? Probably not. Maybe. But that's not the point here.
There is another member here who helps me organize those tests and he really wants to test other components: Tweeters and Subwoofers.
A while back I was interested, but now I'm not sure at all. I personally don't think subwoofers (even ones vastly different on paper) would leave us any real chance of positive identification, and tweeters might be better but are more difficult to organize.
No, I think the fourth test I'm planning is the most interesting, because it's a massive shortcut and will, finally, give us a chance to find an identification threshold.
I think it's time to seek a test that will land on a full positive identification, or a threshold, right from the start.
<snip>
An ABX test needs a minimum of 16 trials for statistical significance .....
Not necessarily; that depends on the criterion for the "guessing probability" the experimenter is willing to accept when rejecting the null hypothesis.
That means that if the experimenter chooses the historical 0.05 criterion (SL = 0.05), only 5 trials are needed to reach significance.
The chance of getting 5 correct answers in a 5-trial experiment by guessing is p = 0.5^5 ≈ 0.031, which is below the 0.05 criterion threshold.
But there is another problem to consider, and that is the beta error (not rejecting the null hypothesis although it is false).
- it's easy to miss audible differences (false negatives) for many reasons - lack of suitable training, tiredness, unsuitable equipment, unsuitable test environment, bias that all sound the same & the test is a waste of time, etc.
The degree of difference is important and - as we have to rely on human listeners - the detection ability of these listeners is quite important too.
As you said above there are numerous confounders still at work and the experimenter only knows upfront that the measured differences are quite small.
He has to calculate the statistical power for the test - assuming something for the combined difference/detection ability - to get a useful estimate of the number of trials/samples needed.
Statistical power is (1 - beta error), and today's common scientific goal is to reach a statistical power of at least 0.8 (which, by the way, means that the experimenter is willing to accept a beta error four times as large as the alpha error).
So if the experimenter assumes a small to medium effect and wants to reach statistical power of 0.8 then he needs an experiment with 37 trials.
If the experimenter wants that risks for beta and alpha error should be the same then he needs (for the same assumed small to medium effect) an experiment with 67 trials.
If the experimenter thinks the difference will be detected with a probability of 0.6 only and wants to get the same balance of error risks he even needs an experiment of 268 trials (158 trials for statistical power of 0.8 instead of 0.95).
That was the reason for Les Leventhal's article in the JAES pointing out that the usual 16-trial ABX isn't a good choice, and mentioning that proper training of listeners would help to lower the number of trials needed.
Les Leventhal, Type 1 and Type 2 Errors in the Statistical Analysis of Listening Tests, JAES Volume 34 Issue 6 pp. 437-453; June 1986
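For those who want to play with these numbers, here is a rough Python sketch (my own, under stated assumptions, not taken from Leventhal's paper) of an exact-binomial power calculation for a forced-choice ABX test. Here detect_prob is the assumed probability that the listener answers a single trial correctly (0.5 = pure guessing); with values around 0.6-0.7 it lands in the same ballpark as the trial counts quoted above.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def critical_hits(n, alpha=0.05):
    """Smallest number of correct answers that reaches significance at level alpha under guessing."""
    for k in range(n + 1):
        if binom_sf(k, n, 0.5) <= alpha:
            return k
    return None  # even n out of n would not be significant

def power(n, detect_prob, alpha=0.05):
    """Chance of reaching significance if the listener really detects with probability detect_prob."""
    k = critical_hits(n, alpha)
    return 0.0 if k is None else binom_sf(k, n, detect_prob)

def trials_needed(detect_prob, target_power=0.8, alpha=0.05, n_max=1000):
    """Smallest number of trials whose power reaches the target (None if never reached by n_max)."""
    for n in range(1, n_max + 1):
        if power(n, detect_prob, alpha) >= target_power:
            return n
    return None

# Examples: assumed per-trial detection probability vs. target power
print(trials_needed(0.7, target_power=0.80))
print(trials_needed(0.7, target_power=0.95))
print(trials_needed(0.6, target_power=0.80))
print(trials_needed(0.6, target_power=0.95))
```

The exact figures jump around a bit because the binomial distribution is discrete, but the point stands: the smaller the assumed detection probability and the stricter the power requirement, the more trials a meaningful test needs.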
<snip>
I feel it is unfair to ask so much scientific rigor for tests of this kind done by individuals.
On the other hand I think Dave is insisting on statistical discipline because such tests shouldn’t be aiming for proving something.
These tests can't _prove_ anything; all that is delivered are probabilities.
To ask for complete scientific rigor from an individual is of course a bit unfair, but that depends on the conclusions drawn from the tests, and in this case the "touted" conclusions are quite categorical; therefore it is important to point out where the tests are a bit weak.
Furthermore, it is imo quite difficult to get a comprehensive description of what was really done (and in which way) in these tests.
I consider such tests very good for the participating individuals, as they provide them with adequate empirical evidence about their own discrimination abilities.
I've always promoted the idea that members should do their own controlled listening tests, but combined with the warning that it takes some time/effort to get good/correct results from these tests, and that nobody should routinely expect to find the truth at the first attempt.
I have quite a bit of experience with testing and comparing audio devices; I run a modest studio and share rooms with a seasoned mastering engineer. The mastering guy is on a never-ending search for better-sounding components, and we pretty often critically compare audio components.
With this background, it really baffles me that NONE of the participants was able to detect any difference, because most probably there are some.
In my experience, there are most often detectable differences between audio devices, however small. It might be hard to say which device is better, or closer to the truth, but we were able to detect them in a blind test, with enough hits to be confident it was not coincidence (randomly played files from a DAW).
Well, I do not know the FiiO (but I own one of their headphone amps and can attest to its very good quality), and it is obviously not a bad converter. The Forssell is famous in music studios, but I don't know it either. There is a chance both converters really sound the same, why not? That would make the FiiO the bargain of the century.
But then, all the converters we compared sounded different (Antelope Audio Pure 2 vs Metric Halo vs dam1021 🙂 vs Benchmark vs Mytek vs RME ...).
To detect converter differences you obviously need a good room and good speakers; we listen on Geithain 904 active speakers, and the room is treated very well.
The biggest problem of the test in this thread might be the room. The photos show a room with little real bass trapping (some couches might help 🙂) and a bit of absorption on the walls, but that kind of treatment stops working below something like 1 kHz, so you have all the critical lower midrange bouncing around the room uncontrolled; you just hear the room and certainly not the fine details of converter imaging. Most living rooms have better treatment, with books and records on the walls and a couch or two in the corners.
The speakers though might be fine, I don't know.
And then you really have to train your listening skills. I guess most of the difference between converters is due to jitter, filter quality and the analogue back end. None of those differences jumps out at you; they are very subtle. You have to concentrate on imaging, listening fatigue, emotional impact and midrange coherency, things like that, not bass and treble or linear distortion.
You can learn to listen for that stuff, and then you detect the differences, I am pretty confident.
The argument that those differences are meaningless for the casual listener because he cannot detect them is imho flawed, because music consumption is all about emotion and not about analytical thinking. A lesser converter is simply less fun to listen to, like compressed music; it is not "oh, that sounds broken", more like "ok, turn the music down, it bothers me".
My hypothesis is: the brain simply does not like, e.g., the effects of jitter on the music, even though those effects are very small (barely detectable, or not at all) in an analytical listening session. I would even go further and say that those digital artifacts are worse than a flawed frequency response or high linear distortion, because those things are easily analyzed and compensated for by the brain.
I love to listen to classical music on my 1960s tube mono kitchen radio, I love the sound of its wideband speaker.
But as soon as, e.g., the crossover of a modern speaker is not totally perfect and there is just a little bit of phase bull.. in it (barely detectable in an analytical way), listening is no fun anymore, no matter how great the rest of the components are.
Sample size determination - Wikipedia
Statistical power - Wikipedia
The most critical variable in our case here is the number of participants. Then, the type of participant. Then the number of trials.
If the equipment/system set-up is questionable, then it could become a critical variable.
So far, the bottleneck, the weakness, is the number of participants.
The FiiO has been running in my system for 3 days now. WOW!
Thank You, JonBocani.
Umm..... doesn't that miss the point? Isn't the thread about DACs not being different enough from each other to matter?
It's not saying that the FiiO DAC is a unique outlier and is as good as all $3000 DACS.
That someone would then go and buy the specific DAC mentioned in the test defeats the whole point, no? 😕 Use any good DAC that is convenient and fits well with your system, including aesthetics.
Look at training participants, but that is difficult if you don't know what differences to focus on. Did you say that there were a significant number of tests where 0.2 dB was differentiated?
I would look at the Toslink connection as a possible weakness - the quality of the optical cable can be significant (plastic/glass?). Can you try another digital connection from the FiiO?