Hi,
2 posts so far and no details.
It is really simple, to be honest.
Before attempting any given test on an unknown quantity one tests known ones.
So if, for example, I wish to examine the effect of a given digital-domain modification on the analogue output, I would carry out the test using calibrated and well-maintained gear with a known confidence limit (e.g. a Miller Analyser or an AP2). Failing that, I would use a loop-back test or a known-good comparison item (e.g. a low-jitter CD player etc.) to establish the confidence levels.
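Just to illustrate the loop-back idea, here is a minimal sketch of how one might estimate the measurement chain's own THD+N floor from a loop-back capture before trusting it on an unknown device. The array and function names are purely illustrative assumptions, not any analyser's actual procedure:

```python
# Hypothetical sketch: estimate THD+N from a loop-back capture of a sine tone.
# "capture" is assumed to be a 1-D numpy array of samples at rate "fs";
# both names are illustrative only.
import numpy as np

def thd_n_db(capture: np.ndarray, fs: float, exclude_bins: int = 5) -> float:
    """Rough THD+N: power of everything except the fundamental,
    relative to the fundamental, in dB."""
    windowed = capture * np.hanning(len(capture))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    spectrum[0] = 0.0                        # ignore DC
    peak = int(np.argmax(spectrum))          # fundamental bin
    fund = spectrum[max(peak - exclude_bins, 0):peak + exclude_bins + 1].sum()
    residual = spectrum.sum() - fund         # harmonics + noise
    return 10 * np.log10(residual / fund)

# Example with a synthetic 1 kHz tone plus a little noise standing in
# for a real loop-back recording:
fs = 48000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t) + 1e-4 * np.random.randn(fs)
print(f"Loop-back THD+N: {thd_n_db(tone, fs):.1f} dB")
```

Whatever number comes out of the loop-back sets the floor below which any DUT measurement is meaningless.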
And I would select a test regime that is appropriate to the area effect I am expecting.
In the case of the SPDIF attenuators etc. here, we can be clear that there is no data manipulation (oversampling, ASRC etc.) in the device itself. However, we do have to make sure that the computer the device is attached to sends bit-perfect data, which is surprisingly hard to achieve unless using ASIO or WASAPI exclusive mode.
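One way to convince yourself the playback chain really is bit-perfect is to record the digital output back (e.g. via a digital loop-back) and compare it sample for sample against the source. A minimal sketch, assuming both files can be read as integer samples; the file names and the use of the soundfile package are my own assumptions:

```python
# Hypothetical sketch: check that a digital loop-back capture is bit-identical
# to the source file. File names and the "soundfile" package are assumptions
# for illustration only.
import numpy as np
import soundfile as sf

source, fs_src = sf.read("test_signal.wav", dtype="int16")
capture, fs_cap = sf.read("spdif_loopback.wav", dtype="int16")
assert fs_src == fs_cap, "sample rates differ - something is resampling"

# The capture usually starts with some silence, so find the alignment offset
# by locating the first occurrence of the source's opening samples.
probe = source[:256]
offset = next(
    i for i in range(len(capture) - len(probe))
    if np.array_equal(capture[i:i + len(probe)], probe)
)

aligned = capture[offset:offset + len(source)]
if len(aligned) == len(source) and np.array_equal(aligned, source):
    print("Bit-perfect: every sample of the capture matches the source.")
else:
    print("NOT bit-perfect: a volume control, dither or sample-rate converter is likely in the path.")
```

If that check fails, any jitter measurement downstream is measuring the player's data mangling as much as the device under test.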
Given that we know there should be no data manipulation, we need to test for jitter. Of course, we could compare THD, but even a few nanoseconds of jitter seem to leave the calculated THD & Noise in both RMAA and on an AP2 essentially unaffected.
So it follows that we need to use a test that highlights jitter, commonly called the J-Test. In order to get measurements that give confidence, I need to get the AP2 at the lab to run large numbers of FFTs and average them to reduce the AP2's own noise floor, so a single jitter measurement that has a sufficiently low noise floor, and hence a high enough confidence level, may require 1024 averaged FFTs that take over 30 seconds each to run.
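To show why the averaging matters (this is just the underlying idea, not the AP2's actual procedure): a single power spectrum of noise scatters by several dB from bin to bin, and averaging N spectra shrinks that scatter by roughly sqrt(N), which is what lets low-level jitter sidebands near the carrier be told apart from the noise. A purely illustrative sketch:

```python
# Illustration of why many averaged FFTs are needed: the bin-to-bin spread of
# a noise floor shrinks roughly as sqrt(N) with N averaged spectra.
import numpy as np

fs, n_fft = 48000, 8192
rng = np.random.default_rng(0)

def averaged_noise_floor(n_avg: int) -> np.ndarray:
    """Average n_avg Hann-windowed power spectra of white noise."""
    acc = np.zeros(n_fft // 2 + 1)
    for _ in range(n_avg):
        x = 1e-4 * rng.standard_normal(n_fft)
        acc += np.abs(np.fft.rfft(x * np.hanning(n_fft))) ** 2
    return acc / n_avg

for n_avg in (1, 16, 1024):
    floor_db = 10 * np.log10(averaged_noise_floor(n_avg))
    print(f"{n_avg:4d} averages: bin-to-bin spread of the floor = {floor_db.std():.2f} dB")
# The spread drops from several dB to a fraction of a dB, which is what allows
# a jitter sideband at, say, -120 dBc to be distinguished from random noise.
```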
If we instead wanted to conduct a listening test, not a casual one for personal use, but one to publish and to attach any significance to, I would feel it incumbent on the experimenter to first conduct some tests on items that have a known audible difference, to ascertain the lowest level of sensitivity.
Stimuli that may be employed in these "calibration tests" include reversing the polarity of one stereo channel (you may be surprised how many "flunk out" at that test) and a channel level difference of 1dB or of 0.3dB, etc. That way we have at least some confidence that our test resolves known audible differences (again, you would be surprised if you knew how often that is not the case).
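Preparing such calibration stimuli from any stereo reference track is straightforward. A sketch, assuming the soundfile package and illustrative file names, neither of which is a requirement; any tool that preserves the samples exactly would do:

```python
# Sketch: derive "calibration" stimuli (polarity flip of one channel,
# -1 dB and -0.3 dB level offsets on one channel) from a reference track.
import numpy as np
import soundfile as sf

audio, fs = sf.read("reference.wav")          # float array, shape (frames, 2)

def with_right_channel(audio: np.ndarray, transform) -> np.ndarray:
    out = audio.copy()
    out[:, 1] = transform(out[:, 1])
    return out

variants = {
    "ref_inverted_R.wav":  with_right_channel(audio, lambda ch: -ch),
    "ref_R_minus_1dB.wav": with_right_channel(audio, lambda ch: ch * 10 ** (-1.0 / 20)),
    "ref_R_minus_0p3dB.wav": with_right_channel(audio, lambda ch: ch * 10 ** (-0.3 / 20)),
}
for name, data in variants.items():
    sf.write(name, data, fs)   # each is then compared blind against reference.wav
```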
Finally, while designing my test it would be good to employ assumptions in our statistics that ensure the risks of Type I and Type II statistical errors are equalised, which is a great problem with small-scale listening tests.
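One way to see the problem, sketched for a simple ABX-style forced choice where guessing gives 50% and the listener's true detection rate is assumed (here 0.7, purely an assumption of mine): alpha is the chance of passing by luck, beta the chance of a real but modest ability failing the criterion, and balancing the two takes far more trials than most casual tests use.

```python
# Sketch: how alpha (Type I) and beta (Type II) trade off against trial count
# for an ABX-style test. p_true = 0.7 is an assumed real detection rate.
from math import comb

def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_true = 0.7
for n in range(10, 61, 5):
    # smallest pass criterion that keeps alpha <= 0.05 under pure guessing
    k = next(k for k in range(n + 1) if binom_tail(n, k, 0.5) <= 0.05)
    alpha = binom_tail(n, k, 0.5)
    beta = 1 - binom_tail(n, k, p_true)
    print(f"n={n:2d}  pass at >= {k:2d} correct  alpha={alpha:.3f}  beta={beta:.3f}")
# With a handful of trials, alpha can only be forced low at the price of a huge
# beta, i.e. real differences get "tested away".
```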
Based on my own experience in such tests, I would also suggest that the number of trials in a row be limited to no more than five, as with many individuals attention and acuity fall off with much greater numbers, something that can also be tested during the "calibration" phase (e.g. run the same test as 15 trials in one run versus three runs of five trials).
I actually found my work on establishing the sensitivity of my tests far more interesting than the tests and results themselves (which were needed for commercial reasons), especially the way in which it illustrated exactly WHY, as one proponent of ABX/DBT testing remarked, "perfectly obvious differences when testing sighted evaporated when the test was done blind".
Actually, the differences did not evaporate; they were still there, but the blind tests and attached statistics simply pushed the "noisefloor" up so much that previously clear differences were obscured, at least to the eye of the statistician and the ear/brain system of the participants.
So, my point in all this is that there are ways to test seriously, and then there is simply wanking around, which can be fun but is ultimately futile.
A serious test demonstrates first what it can and cannot resolve, that it has sufficient resolution to show the debated phenomenon, and the result of testing a "known good" item. All this establishes the limits of the test and thus provides the context within which the results may be interpreted.
This last item, testing a "known good" item, is ideally included even when the test gear itself is "known good", as it will highlight any issues in the test that may have escaped the attention of the person doing the testing. I have had several instances of magazines testing my gear, also using an AP2 (and where the testers had used the AP for many years on a very regular basis), initially coming to results that were dramatically divergent from those that I got. In all cases we were able to resolve these as being due to a small fault in the test or its interpretation, be it faulty calculations outside the AP2 or simply issues in the test set-up.
Ciao T