What kind of evidence do you consider as sufficient?

Status
Not open for further replies.

TNT

Member
Joined 2003
Paid Member
....snip... The last paragraph: "Thus, at that cocktail party that seems to go on all night, the barman has to drown out your colleagues’ endless, superimposed banter to hear your request for something stronger. In doing so, he must capture the statistics of the complex background signal, collapse it into a texture, and subtract the resultant model’s predictions from the ongoing babble. Barmen are pretty good at that"

... and lip reading ;)

//
 
Statistics as a field of mathematics was developed sometime after Calculus. It took about 20 years to get statistics pretty well figured out. Some observers have suggested that statistics is somehow more foreign to the way human minds work than Calculus, making it perhaps easier to understand why it took so long to get it figured out. By the way, I don't make this stuff up. Just reporting some things I have read about it.

There was a course at the Met Museum given by a gal who had a PhD in math, had been a mathematician for IBM when she decided to chuck it and get a PhD in Art History. "Art in Context".

So one Saturday morning she lectured on Gothic architecture starting with St. Denis and moving onto Amiens. Amiens was interesting because Columbia University's Architecture School had just put on the web "The Amiens Project" and it took forever to just download one drawing! I recall that she said "The architects of Amiens understood the calculus, they just couldn't codify it." Meaning I guess that the language of mathematics hadn't progressed much beyond Euclid.

Amiens was built, what, 500 years before Newton, 200 years before Galileo?
 
So I ask, with respect to the two DACs level-matched to 0.01 dB: what do you propose as controls, and how are they used?

Good luck, I've been trying to get the same information for about two years now.

And yes, quoting out of context is a method of choice. Nowhere do the existing standards require a "positive control" for completing a credible ABX test.
 
It really is amazing how much denial & reluctance there is to having the test itself evaluated for how fit for purpose it is.

Really, I just keep asking: outline the process and propose these "anchors" for this specific DAC test. I mean a detailed description of what is involved and what is supposed to happen. Why do I need to read the ITU spec for the purpose of this test? I accept it based on its provenance (I know members of various standards committees). Where did I ever disagree with it? I'm sure it covers no peeking or cheats and I'm happy with that.

These things are work and they are not fun.
 
That was not as stated. These hidden controls had a purpose: as I read it, they were proposed to show that the blind ABX somehow removed the listener's ability to discriminate small differences, i.e. that blind ABX has some hidden flaw.

Let me repeat it: it is amazing how strongly bias works. :) (SCR)
I'm not sure mmerill99 stated it the way that you claim. Strong bias leads, IMO, to a less positive interpretation (or, when stronger, to a more negative interpretation) of written statements by a member seen as an opponent. But the goal of using a positive control is to show that a test works the way it should, and I'm pretty sure that mmerill99 already wrote that.

And please think about it and tell me your thoughts; if an ABX test is used to test a given fixed sensory difference but the proportion of correct answers in the ABX is significantly lower than in an A/B test used to test the same given fixed sensory difference, could it mean (everything else equal, especially the number of trials) that the ABX has a hidden flaw?

So I ask, with respect to the two DACs level-matched to 0.01 dB: what do you propose as controls, and how are they used?

I guess you don't like to read the ITU-R BS.1116-3?

I assume you want to test the two DACs with music samples. For the normal trials you use the same music sample as the stimulus for both DACs; for the control trials you prepare a copy of the music sample at a slightly higher level (around 0.2 dB or lower). One DAC is fed the sample at the normal level while the other DAC is fed the sample at the slightly higher level. Your listeners still answer the same question as before.
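
To make the size of such a control offset concrete, here is a minimal sketch of preparing the louder copy. The helper names and the placeholder sample values are assumptions for illustration only; a real test would work on actual audio files and would need to check that the boosted copy does not clip.

```python
def db_to_gain(db: float) -> float:
    """Convert a level change in dB to a linear amplitude factor."""
    return 10 ** (db / 20)


def make_control_samples(samples: list[float], offset_db: float = 0.2) -> list[float]:
    """Return a copy of the samples raised by offset_db (hypothetical helper)."""
    gain = db_to_gain(offset_db)
    return [s * gain for s in samples]


# Placeholder values standing in for audio samples, not real music.
normal = [0.0, 0.25, -0.5, 0.1]
control = make_control_samples(normal)  # same material, 0.2 dB louder

print(f"gain for +0.2 dB: {db_to_gain(0.2):.5f}")
```

A 0.2 dB offset is an amplitude change of only about 2.3 %, which is why it can serve as a small but known difference.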

There are two possibilities to choose from; as usual, both have their advantages and disadvantages. In principle, you can either use control trials within your normal test routine (which requires more trials overall) or run a completely separate test (with the same number of trials as you normally would use) of the control stimulus/stimuli against the normal/unaltered stimulus.
 
Foobar ABX was once very popular around here, that is, until I described how its verification system could be beaten. But, since many here have had an opportunity to try it, they may know what it is when it is referred to. Nothing more to it than that.

Also, I noticed it has some features that most similar programs don't have. Just not enough features to finish the job of making it the way it needs to be.

Huh?

ABX Binomial Probability Table

The stats reported from Foobar fall in line with these.
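
For anyone who wants to check numbers like these, a table of this kind can be reproduced from the one-sided binomial tail. This is only a sketch under the usual assumption that every trial is an independent guess with a 50 % chance of being right; `abx_p_value` is a name made up here, not anything taken from Foobar.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of scoring at least `correct` out of `trials` by pure guessing (p = 0.5)."""
    tail = sum(comb(trials, k) for k in range(correct, trials + 1))
    return tail / 2 ** trials


# Print a few rows for a 16-trial run.
for correct in (8, 12, 14):
    print(f"{correct}/16 correct: p = {abx_p_value(correct, 16):.4f}")
```

For 16 trials, 12 correct gives 2517/65536, roughly 0.038, which is the kind of figure such tables list.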
 

TNT

Member
Joined 2003
Paid Member
No, I'm not suggesting that.
You seem unable to get your head around the fact that the test itself (including its participants) needs to be verified as capable of differentiating small differences - just like I'm sure you have different measurement equipment whose capabilities differ in this regard.

Do you blindly accept that this equipment is sensitive enough to measure at the sensitivity written on the spec? No, you trust that this has been calibrated correctly & you run routine re-calibrations on scopes, etc.

Here we are using a test with unknown sensitivity & assuming its results have validity - are we trying to measure centimeters with a ruler which only shows meters?

Maybe it's your assumption that auditory perception is like a measuring device - always delivers the same output with the same input? And you believe that the only change being made in a blind test to auditory perception is the removal of knowledge? I'm not sure where your problem in comprehension is?

I'm trying to explain it as best I can but if you prefer Jakob's explanations that's fine.

And I thought that it was the human that was the DUT ;-D

//
 

Not the stats, which would be a separate issue if one wanted to get into that. They did and presumably still do describe what the displayed stats mean about the chances of someone guessing. The description is incorrect, but it is not a big deal.

The story I was referring to is that Foobar could print out your score with a checksum included. If you fed the whole printout into a website they put up, it could verify that the score numbers were not altered by recalculating and verifying the checksum. This was believed to be of value to prevent cheating when reporting scores. However, cheating was still quite possible, which I generally alluded to but didn't describe in any detail. A diyaudio moderator then said I would be doing a service to the community if I would disclose details of the potential cheat because, he said, they suspected there had been some cheating going on and they didn't know how it was possible. So, I complied with the request.
 
There are two possibilities to choose from; as usual, both have their advantages and disadvantages. In principle, you can either use control trials within your normal test routine (which requires more trials overall) or run a completely separate test (with the same number of trials as you normally would use) of the control stimulus/stimuli against the normal/unaltered stimulus.

Ok, I’ll bite.

Assuming a separate control test, the combination of the two tests has four possible outcomes. Please state the interpretation and the actions to follow for each of the four outcomes.
 
I guess you don't like to read the ITU-R BS.1116-3?

I assume you want to test the two DACs with music samples. For the normal trials you use the same music sample as the stimulus for both DACs; for the control trials you prepare a copy of the music sample at a slightly higher level (around 0.2 dB or lower).

3.1 Expert listeners
It is important that data from listening tests assessing small impairments in audio systems should
come exclusively from subjects who have expertise in detecting these small impairments. The
higher the quality reached by the systems to be tested, the more important it is to have expert
listeners.

That leaves me out from the start. BTW I don't see a 0.2 dB level change as fitting on their 5-point annoyance scale. In fact the whole document seems centered around codecs and how much information you can remove before annoyance sets in. As I said, their reproduction system has very modest requirements and no qualification other than THD is applied to the distortion spectra.
 
That leaves me out from the start. BTW I don't see a 0.2 dB level change as fitting on their 5-point annoyance scale. In fact the whole document seems centered around codecs and how much information you can remove before annoyance sets in. As I said, their reproduction system has very modest requirements and no qualification other than THD is applied to the distortion spectra.

Is the concept of abstraction really totally outdated? :confused:
We are/were talking about sensory tests in general and about certain test protocols in detail.
The ITU Recommendations explain a lot of things that are generally important for "double blind listening tests" that should be able to detect differences that are not so obvious or big.
That includes emphasizing the use of controls (their anchors) and the importance of checking for possible factors that might lead to lower sensitivity.

And some of the points are directly related to their specific test protocol, which was designed especially for their needs in testing codecs.
Could we use the ABC/HR protocol for other EUTs (than codecs)? Of course we can, but we might make some adaptations. It is not mandatory to use a grading scale that includes terms like "annoying"; we could use another attribute that fits our demands better. But the basic requirements for controls (their anchors) and checks for sufficient sensitivity are still the same. The same holds true for the training and accommodation time.

Do we need to use the ABC/HR method? No, of course not; there are other protocols that we could choose if we need no grading. Use, for example, the A/B paired comparison method.

Do we still have to use controls (their anchors)? Of course, as the underlying principles are still the same. Do we still have to check for sufficient sensitivity? Of course we have to.

It's probably the 1000th repetition, but any test has to be objective, valid and reliable.
Validity means that the test really tests what you (the experimenter) claim it does. If it is supposed to test whether a difference between two EUTs exists, it should _really_ test that; it shall not instead test (hidden, because the experimenter is ignorant of the fact) whether your detectors were desensitized by the specific test conditions. To find out if this desensitization happens, a smart and honest experimenter (who wants to know if his test is a good one) uses positive controls (and, for completeness, negative controls as well).

Just as a reminder, if the same test person gives a lower percentage of correct answers in an ABX than in an A/B test when testing the exact same sensory difference, then the validity of your test is in question when you are using an ABX without guards against the drawbacks of this test protocol.
 
.....
There are two possibilities to choose from; as usual, both have their advantages and disadvantages. In principle, you can either use control trials within your normal test routine (which requires more trials overall) or run a completely separate test (with the same number of trials as you normally would use) of the control stimulus/stimuli against the normal/unaltered stimulus.

I wouldn't agree - the highlighted test is pre-training & not a hidden anchor or control

I maintain that the hidden anchor should be within the normal test in a random number of trials.

If it is included as a separate test then the participant is aware & this awareness may affect attention, alertness, etc.

When included as random trials within the actual test, the results can be analyzed & if the participant failed to score these control trials correctly (at some agreed level, maybe 4 or 5 correct out of 6 control trials) then that participant's complete results are discarded. If this is happening with many participants then the test itself has to be examined as regards fitness for purpose.
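
As a rough illustration of that screening rule, here is a sketch that applies a pass threshold to control-trial scores and also shows how often a pure guesser would clear it. The function names and the listener scores are invented for the example, and the guessing figure assumes a two-alternative forced choice with a 50 % chance per trial.

```python
from math import comb


def chance_of_passing(n_controls: int, threshold: int) -> float:
    """Probability that a pure guesser (p = 0.5 per trial) reaches the threshold."""
    hits = sum(comb(n_controls, k) for k in range(threshold, n_controls + 1))
    return hits / 2 ** n_controls


def screen_participants(results: dict[str, int], threshold: int) -> dict[str, bool]:
    """Map each participant to True (keep results) or False (discard results)."""
    return {name: correct >= threshold for name, correct in results.items()}


# Hypothetical control-trial scores out of 6, for illustration only.
control_scores = {"listener_a": 6, "listener_b": 3, "listener_c": 5}
kept = screen_participants(control_scores, threshold=5)

print(kept)
print(f"guesser passes at 5 of 6: {chance_of_passing(6, 5):.3f}")
print(f"guesser passes at 4 of 6: {chance_of_passing(6, 4):.3f}")
```

With only 6 control trials, a guesser clears a 5-of-6 threshold 7 times in 64 (about 11 %) and a 4-of-6 threshold 22 times in 64 (about 34 %), so the choice of threshold and number of control trials matters.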
 
I wouldn't agree - the highlighted test is pre-training & not a hidden anchor or control

I maintain that the hidden anchor should be within the normal test in a random number of trials.

If it is included as a separate test then the participant is aware & this awareness may affect attention, alertness, etc.

As usual, it depends. If the listeners don't know about the EUT, they only know that two runs of trials are requested.

A random number of control trials is questionable, as a minimum set is required for statistical reasons. Further, you have to carefully check whether the control trials themselves lead to distraction, as in these trials the differences might be suddenly ... different (assuming that a real audible difference between the DUTs exists).

Edit: So I would choose to include controls when some sort of (hedonic) rating is used and several stimuli are rated, but use separate tests when only two items are under test and the same stimulus is used in every trial.
If different stimuli are used, the controls can be used again within the "normal" trials.
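
One way to keep a fixed minimum set of control trials while still hiding them from the listener is to fix their number and randomize only their positions. A minimal sketch; `build_schedule` and the trial counts are assumptions chosen for illustration, not part of any standard.

```python
import random


def build_schedule(n_normal: int, n_control: int, seed=None) -> list[str]:
    """Return a shuffled trial list with exactly n_control hidden control trials."""
    rng = random.Random(seed)  # seeded so a schedule can be reproduced later
    schedule = ["normal"] * n_normal + ["control"] * n_control
    rng.shuffle(schedule)
    return schedule


# Example: 16 normal trials with 6 control trials mixed in at random positions.
schedule = build_schedule(n_normal=16, n_control=6, seed=1)
print(schedule)
```

The control trials are then scored separately from the normal trials, so they only say something about whether the test worked, not about the devices under test.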
 
As usual, it depends. If the listeners don't know about the EUT, they only know that two runs of trials are requested.

A random number of control trials is questionable, as a minimum set is required for statistical reasons. Further, you have to carefully check whether the control trials themselves lead to distraction, as in these trials the differences might be suddenly ... different (assuming that a real audible difference between the DUTs exists).

Yes, for sure, it gets complicated for those that are interested in conducting valid tests but for others, who claim it's just a simple test, close your eyes, no peeking, blah, blah. As long as they are getting null results sure what could possibly be the problem ;)
 
No, it would not, as the control results are not part of the analysis wrt the EUTs but are used exclusively to analyze whether the test works as intended.

When included as random trials within the actual test, the results can be analyzed & if the participant failed to score these control trials correctly (at some agreed level, maybe 4 or 5 correct out of 6 control trials) then that participant's complete results are discarded. If this is happening with many participants then the test itself has to be examined as regards fitness for purpose.


Will you guys please stop the word salad and propose something tangible and in detail. In case you have not figured it out, no one is arguing anymore. You can discard my results ahead of time; I'm only interested in seeing others' claims substantiated. I'll even let you set the rules.
 
Will you guys please stop the word salad and propose something tangible and in detail. In case you have not figured it out, no one is arguing anymore. You can discard my results ahead of time; I'm only interested in seeing others' claims substantiated. I'll even let you set the rules.

Yes you are still arguing & after interminable posts misinterpreting what I clearly said & arguing as a result, you are now moving into stage 2 argumentum

If you're not interested (& this is plainly obvious) why not leave the thread?
 