Double Blind Testing

Status
This old topic is closed. If you want to reopen this topic, contact a moderator using the "Report Post" button.
Someone inspired me to write a post about double-blind testing and its implications for audio reviewing, or, IMHO, its lack of implications. I think there is a great deal of misunderstanding about this term and about when it is supposed to be used in an experiment. It really would not make the most sense in audio reviewing. As for my qualifications to make this statement, I am a recent graduate of a Clinical Psychology program, where I spent most of my time working on things pertaining to research design. I now work for an agency as a Research Associate under Cornell University; most of my work is in conjunction with other big universities, and some of it is on some pretty high-profile cases. I won't tell you that I am an expert in experimental design, but it is what I went to 8 years of college for, and I probably know more about it than most. I feel that experimental psychology is the most relevant field to the subjective nature of audio equipment review. I do not think that engineering experimental designs would be appropriate, and I do not think that even medical experimental designs would be the most appropriate. When someone wants to test the effect that something has on human senses and human perception, which is what I believe listening to music falls under, they would turn to someone like me, not a medical doctor or an engineer.

I pulled out some textbooks from my undergraduate years to get the most basic explanations I could, as some I thought would just not be fair to write up here. Single-blind testing is an experiment in which the treatment group is not informed as to the nature of the experiment. Double-blind is when both the treatment group and the experimenter are not informed of the experimental condition. This is used to remove experimenter bias, when it's believed that the experimenter could introduce an extraneous variable through his own bias. So let's look at an example. Say we are testing the difference between amp "A" and amp "B". The experimenter has, say, an "A/B" switch hooked between a preamp and the amps, and the treatment group, a listener or multiple listeners, sits and listens. The listener then, through some measure, say a checklist, or possibly just his educated opinion, tries to identify the differences he hears. With double-blind, the experimenter would not know which amp he is switching to, and the idea is that he could not influence the listener into thinking he is hearing differences that he is not really hearing. Here I will agree that a double-blind test looks like a good idea.
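
To make the blinding step concrete, here is a minimal Python sketch, not a real test protocol. It assumes the answer key is generated and held by a third party or a computer, so neither the listener nor the person operating the switch knows which amp is playing. The function names `make_blind_schedule` and `score` are hypothetical, made up for this illustration.

```python
import random

def make_blind_schedule(n_trials, seed=None):
    """Return a hidden answer key: which amp ("A" or "B") plays on each trial.

    Hypothetical helper: in a double-blind run this key is held by a third
    party or a computer, never by the listener or the switch operator.
    """
    rng = random.Random(seed)
    return [rng.choice("AB") for _ in range(n_trials)]

def score(key, answers):
    """Count the trials on which the listener named the amp that was playing."""
    return sum(k == a for k, a in zip(key, answers))

key = make_blind_schedule(10, seed=1)
answers = ["A"] * 10  # a listener who always answers "A"
print(score(key, answers), "correct out of", len(key))
```

The key is only compared with the answers after the session is over, which is the point of the second "blind".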

However, we just introduced a whole bunch of extraneous variables that we missed, and we took certain things for granted, including somehow turning the test subject into an objective measure, which he or she is not. First, the A/B box is assumed to be transparent in this experiment; in most of science it would be, but in audio it probably is not. Second, we assume that the listener has the ability to hear all the differences between these amps within a matter of seconds or minutes. Third, we assume that the subject has a strong sense memory, yet research in this area suggests that our short-term sense memory is quite poor, so that would not be true either. Fourth, we assume that the subject has the ability to objectify, to the best of his ability, the subjective differences he is experiencing. Finally, we are not comparing against a control group, so we have a subject bias, though we are controlling within subjects by using one amplifier as the control, hopefully.

How does that compare with the experiments that reviewers do in real life? Not at all. We get on their cases and gripe about the need for double-blind testing. However, that is not enough: a simple double-blind test does not take into account all the variables needed, and the naturalistic experimental design that reviewers use, I think, actually does a better job. It takes into account environment and sense memory, minimizes issues like the transparency of the switch, and takes advantage of subject bias rather than being hurt by it. In my opinion, a true experiment on whether amps sound different would use a highly controlled design that eliminated as many variables as possible, was counterbalanced in nature, and involved not one person or one set of experiments, but many. It would take into account all the variables of the subject as well, such as sense memory, quality of the senses, etc. Another major issue is that our reviewers do not go through training to standardize how they objectify what they hear, so what is bright to one may be detailed to another. They shoot from the hip; they need actual measures capable of turning subjective experience into objective data. Such instruments exist; they usually use 5-point Likert-scale questions, and would require training to ensure consistency. Given that this will probably never materialize, and is probably overkill, I really do believe that audio review is not something that lends itself well to the experimental paradigm, but it can be accurately explained through the simple introduction of reliable measures instead.
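
As a rough illustration of what "counterbalanced" means here, the following Python sketch assigns presentation orders so that each amp is heard first about equally often across listeners. The function name `counterbalanced_orders` and the labels "amp A" and "amp B" are assumptions for the example, not part of any established procedure.

```python
from itertools import permutations

def counterbalanced_orders(conditions, n_listeners):
    """Cycle through every presentation order so each is used about equally often."""
    orders = list(permutations(conditions))
    return [orders[i % len(orders)] for i in range(n_listeners)]

# Four listeners, two amps: two hear A first, two hear B first.
for listener, order in enumerate(counterbalanced_orders(["amp A", "amp B"], 4), start=1):
    print(listener, order)
```

With more than a few conditions the number of full orderings grows quickly, so a real study would likely use a reduced scheme; this sketch only covers the simple case.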
 
AX tech editor
Joined 2002
Paid Member
pjpoes,

Nice write-up of a subject that is close to my heart. But I have a few comments ;). I feel you are mixing up a few things.

Firstly, we are NOT trying to assess the impact of music on the subjects. We are trying to find out whether two pieces of equipment reproduce the same signal differently.

Now, your case that DBTs are flawed because the A/B switch can influence the result, and that subjects lack sound memory etc., is certainly true, but I can immediately use that to kill the value of ANY *subjective* judgement, because by the same reasoning a subjective judgement is ONLY valid for the person making the judgement, and not for me or you. So the validity of subjective judgements for, say, purchase advice is non-existent.

Toward the end you say (if I understand your drift): why not accept subjective judgements as valid judgements? Well, aside from the above objection, there is another one. You agree that extraneous factors determine the result, things that are documented elsewhere also, like the colour, shape, design, brand name, reputation etc. The problem is, we don't want to assess the impact of reputation on component sound perception (unless we are doing a market survey); we are trying, as I said before, to assess whether a component reproduces the signal differently.

So, to keep this discussion clean, let's first try to make clear what we are trying to say:

Are we indeed trying to assess the difference in sound reproduction? Why then should we accept that, say, physical design has at least as big or even bigger influence on that perceived (or not) difference? Would you accept the outcome of a clinical experiment trying to find out if a certain new medical drug works if you KNOW that the pills' shape and color have AT LEAST the same or even bigger influence on the patient's report on its effectiveness???

Jan Didden
 
To be honest, I think you confused a great deal of what I was trying to say, but I may not have been clear. I had not thought this out the way I might for work, so I see that what I wrote is a bit mixed up.

First, you may want to ***** if two amps reproduce sound differently, but it is my contention that we can't actually measure that. If you don't believe me, then try to think through what it is you are looking at in the subjective assessment of a product. If we wanted to know only what difference a sound product makes, and not the interpretation of the listener, then we would have to eliminate the listener. Measure it, as we do; but we have realized that we cannot measure all that we can hear. Then you have the issue of how sound works on a human. We take sound in through our ears and then process it in our brains, with a great many factors influencing how we perceive that sound. It is not a very straightforward process, which makes assessment very difficult. So instead, what I was suggesting was that if a truly scientific experiment were to be done, you would have to exhaustively test every aspect of the process to eliminate extraneous variables, including using many, many different listeners. However, I don't believe that to be practical, and I feel that the naturalistic observation method in use really does a perfectly adequate job. We can, ourselves, adjust for listener bias and other variables just by being familiar with the listener.

As for the objective vs. subjective matter, it's important to understand those terms. Basically, subjective means the interpretation of a response to a stimulus, and objective means the actual response. If we can't measure the actual response, and in audio we cannot, then we have to objectify a subjective interpretation. We can do that to a point, though not completely. First, you use a special question format known as a Likert scale; those are the rating scales: (1) poor, (2) fair, (3) good, etc., and you usually use an odd number of choices, like 5 or 7. Then, in order to avoid differences between reviewers or observers, and to avoid ceiling effects or central tendencies, you train the observer to match some standard you set up, and you do this with every observer until they rate to within 1% of that standard, if possible. The closer the better. This keeps everything as consistent as possible and allows you to compare the numbers to each other. This is basically the method for objectifying a subjective matter. It's not 100% accurate, but it does a very good job nonetheless. It's really all we can do.
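
A minimal Python sketch of that calibration check, under stated assumptions: each trainee rates the same items on a 5-point scale, and those ratings are compared against a reference set. The function name `agreement_with_standard` and the numbers are made up for illustration, and the two summary figures (exact-match rate and mean absolute difference) are just one simple way to express "how close to the standard".

```python
def agreement_with_standard(rater, standard):
    """Compare one rater's Likert scores with the reference scores.

    Returns (fraction of exact matches, mean absolute difference in scale points).
    """
    if len(rater) != len(standard) or not standard:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(standard)
    exact = sum(r == s for r, s in zip(rater, standard)) / n
    mean_abs_diff = sum(abs(r - s) for r, s in zip(rater, standard)) / n
    return exact, mean_abs_diff

standard = [3, 4, 4, 2]   # reference ratings on a 5-point scale (made-up data)
trainee = [3, 4, 5, 2]    # one trainee's ratings of the same four items
print(agreement_with_standard(trainee, standard))
```

Training would continue until the figures for each observer fall inside whatever tolerance the study has set.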

See, what I think people forget is that we really can't measure how something actually sounds to people, because we have the unfortunate problem of having people. People don't all hear things the same way, and simple measurements don't tell us what the human brain will do to our interpretation of the sound.

One example of how powerful the human psyche's influence on what we hear can be, and one that may help you understand the problem here, is the effect that happens with two notes very close together. When two notes are played that are very close to each other, say one is slightly flat relative to the other, they tend to warble or modulate. The closer to each other they get, the slower that modulation gets. Once you have them at the same frequency, it stops, and you simply hear the two tones layered on each other. Now, and I might have this detail wrong, as it's something I was told in a recording class (free elective :D), that is not actually happening. You could not measure the warble, as nothing is warbling. However, our brains cannot process the two dissimilar but close notes, and so they create the warble; it's a psychoacoustic effect. Musicians, myself included, use it to tune instruments, as it's very accurate. What this tells us is that our brain, universally so, has an effect on the sound, in many cases an effect that we all share collectively but cannot measure; we can only express it. So the key is to find a way to objectify our subjective experience as much as possible, and that starts with consistency. The other way, and it's what we do with reviewers now, is simply to get to know the reviewer and all of his or her quirks, styles, terms, tastes, biases, etc. Then when they review something, we can remove all of those from the review equation, and what is left is what actually exists for everyone who listens, or as close as you are going to get.
 
No, the warble is actually there. When you linearly mix two frequencies that are very close together, the amplitude appears to modulate. In fact, AM itself is a carrier with upper and lower sidebands, so it is no surprise that the waveform is very similar. The difference is most easily seen at maximum modulation, where the envelope defined by the peak voltage appears as rectified halves of a sine wave.

The following graph was produced with the equation:
f(x) = A * sin(Theta * C) + B * sin(Theta * D)
Where Theta = Pi * (1 + x/2).
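
For anyone who wants to check this numerically, here is a small Python sketch. It is a simplification of the equation above, assuming equal amplitudes (A = B = 1) and frequencies given in Hz; the names `two_tone`, `envelope` and `beat_frequency` are hypothetical. It relies on the identity sin(a) + sin(b) = 2 sin((a+b)/2) cos((a-b)/2), which puts the summed signal inside a slow envelope that repeats at the difference of the two frequencies.

```python
import math

def two_tone(t, f1, f2):
    """Linear sum of two equal-amplitude sine tones at time t (seconds)."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

def envelope(t, f1, f2):
    """Slowly varying envelope 2*|cos(pi*(f1 - f2)*t)| of that sum."""
    return 2 * abs(math.cos(math.pi * (f1 - f2) * t))

def beat_frequency(f1, f2):
    """Rate in Hz at which the envelope (the warble) repeats."""
    return abs(f1 - f2)

f1, f2 = 440.0, 442.0
samples = [i / 48000 for i in range(48000)]  # one second at 48 kHz
# The measured sum never exceeds the envelope, so the warble is in the signal itself.
inside = all(abs(two_tone(t, f1, f2)) <= envelope(t, f1, f2) + 1e-9 for t in samples)
print(inside, beat_frequency(f1, f2))
```

Moving `f2` toward `f1` lowers `beat_frequency`, so the warble slows down and disappears when the two tones match.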

Tim
 
Attachment: interference.gif
pjpoes said:
First, you may want to ***** if two amps reproduce sound differently, but it is my contention that we can't actually measure that. If you don't believe me, then try to think through what it is you are looking at in the subjective assessment of a product. If we wanted to know only what difference a sound product makes, and not the interpretation of the listener, then we would have to eliminate the listener. Measure it, as we do; but we have realized that we cannot measure all that we can hear. Then you have the issue of how sound works on a human.

Sounds to me like you're almost leading up to Occam's Razor: the simplest explanation is probably the right one. In this case, psychoacoustics. Except you didn't; you went off on a far less probable tangent.

I mean, give me a break here... if you can't measure a correlation using today's sophisticated equipment, what the heck makes you think the human subjects are going on anything perceived aurally?

The ONLY barrier is that people REFUSE to believe that they are listening to their own PREJUDICES, not their cables or markers-on-CD-edges hacks (especially the $500/ft+ cables!).

Tim
 
pjpoes said:
[snip]First, you may want to ***** if two amps reproduce sound differently, but it is my contention that we can't actually measure that. If you don't believe me, then try to think through what it is you are looking at in the subjective assessment of a product. If we wanted to know only what difference a sound product makes, and not the interpretation of the listener, then we would have to eliminate the listener. Measure it, as we do; but we have realized that we cannot measure all that we can hear. Then you have the issue of how sound works on a human. We take sound in through our ears and then process it in our brains, with a great many factors influencing how we perceive that sound. It is not a very straightforward process, which makes assessment very difficult. So instead, what I was suggesting was that if a truly scientific experiment were to be done, you would have to exhaustively test every aspect of the process to eliminate extraneous variables, including using many, many different listeners. However, I don't believe that to be practical, and I feel that the naturalistic observation method in use really does a perfectly adequate job. We can, ourselves, adjust for listener bias and other variables just by being familiar with the listener. [snip]


Sorry, but I don't get any of this. First you seem to imply that because humans are subjective, it is very difficult to do an assessment of sound reproduction. You say that it is almost impossible to do a good scientific assessment. So, you say, let's stay with the current naturalistic (whatever that is; I take it it is subjective) method. But by your own words that subjective assessment is then worthless as an assessment of sound reproduction!! You find it next to impossible to isolate all external factors that lead to subjectivity, yet you think we can easily "adjust for listener bias and other variables just by being familiar with the listener". Surely you are joking??

Why are humans subjective? Well, I postulate that they are NOT really subjective. But, being bombarded with a plethora of impressions and opinions, sound, color, appearance, reputation, body language of peers etc., they make a weighted judgement, and that judgement therefore may differ from occasion to occasion, from person to person etc., even with the SAME test or equipment. The way to solve this would be to try to switch off all those external impressions and signals and try to limit the variables to what you want to assess: the sound reproduction. That is what DBTs are for, and so far I am not aware of any method that does better. Your proposal, which in essence says, let's just listen, have fun and state whatever we feel, is a giant step backwards. Unless you just want to listen, have fun and state whatever you feel, of course. But I thought the object here was trying to find out whether people can hear differences between equipment and, if so, what they are and/or which is preferred.

Jan Didden
 
Many of you have missed my point big time. I think a lot of that is a misunderstanding of the scientific method, which is exactly why I stated that I may have a bit of knowledge over many of you, given my background.

As I said on the warble effect, it was something I had simply been told by an instructor in a studio production class, and I took it with a grain of salt, as I had never heard it before.

As for the issue of subjective vs. objective, it's clear to me that you simply don't understand at least my definition, as it would be used in the kind of scientific testing I do. Humans are subjective, as far as measuring our senses is concerned. That is a fact; it is not arguable, at the moment. Subjective means our experience of something; objective means the actual "something." Again, that is the operational definition used in psychological studies of subjectivity vs. objectivity with human subjects. That means that if we are trying to measure the experience that a listener gets from an amp, what we are measuring is subjective. The goal of any good scientific study is to attempt to code the subjective experience into numbers that can be treated like consistent objective data, to objectify it as best as possible. That is all I proposed. However, I feel that fully doing that would be unreasonable.

I never said anything about just listening and whatever, nor do I think that cable prices or anything of that sort have anything to do with this discussion, so for the sake of a proper argument, let's leave all of that aside for the moment.

Occam's razor is a common mantra in psychology, and one I should learn better, as I am great at designing complex, all-encompassing studies that no one could ever get funded, and I often have to streamline my studies big time in the end. So here, I suppose what I am suggesting is to avoid my overly complex method of measure for studying the sound of an amp, and instead go with a simpler, less effective, but acceptable method, which would not be far off from what we have now. I would just call for more standardization, maybe some training in how to review a product, so everyone in a group of reviewers comes to the same sorts of conclusions for a given product, thus eliminating listener bias. Of course, that isn't so simple either, as sometimes what sounds accurate doesn't sound as pleasing, and then opinions and biases play a part too, etc.

Again, I will restate that I believe we cannot measure the sound of a piece of equipment, say an amp, directly. We can take some measurements and get a rough idea from them. However, I believe they give us an incomplete picture, and that the listening experience must also be measured, which is exactly what a product review is. A product review consists of the listening experience, or the listener's impression of the sound quality, plus a product overview of features and setup. If you believe that we can measure everything we can hear, that is fine; that's a difference of opinion. But that would nullify my whole theory, and DBT as well, as we wouldn't need it: we would simply measure our equipment on a computer and be done with it. However, if you buy the premise that some aspects of sound cannot be measured and instead must be experienced, then we must measure that human experience to get a full picture of the SQ. Again, keep in mind that we cannot measure the sound quality of an amp directly, we cannot measure what our ears are detecting, and we cannot look at the brain's processing and see the full picture, so we must rely on the subject's reflection on the experience. That, as far as I can see, is a fact. DBT can be used in that case but, as I wrote before, it still doesn't give a full picture of the SQ differences between products, because our senses simply don't work that well.

I have one more analogy that may help you understand why DBT cannot work to detect these differences. If I set up a study to show you different color cards and have you judge whether two are the same or different, and I place them side by side, you could probably detect very slight shade differences. However, if I show you one, then take it away and show you the other, you probably could not detect the difference. This is because our sense memory is not very strong, so we cannot hold the one color in our brains long enough to compare it with the other. The analogue in sound would be playing two different components for you and asking you to tell the slight differences between the two. If you had both available at the exact same time, you probably could, except that our ears can't deal with that well, and it's very difficult to do. So instead, I have to present only one sound, take it away, present the next, and ask you to detect the difference. The problem here again is that our memory is simply not strong enough to pick out all the subtle differences that exist. Instead, I would have to let you live with the two colors, or sounds, for a long period of time, and then you would begin to notice all the differences. Again, please understand that this is something I can defend with a great many studies; that is how our senses work, and that is always taken into account when attempting to measure the response of our senses.
 
pjpoes said:
[snip]As for the issue of subjective vs. objective, it's clear to me that you simply don't understand at least my definition, as it would be used in the kind of scientific testing I do. Humans are subjective, as far as measuring our senses is concerned. That is a fact; it is not arguable, at the moment. Subjective means our experience of something; objective means the actual "something." Again, that is the operational definition used in psychological studies of subjectivity vs. objectivity with human subjects. That means that if we are trying to measure the experience that a listener gets from an amp, what we are measuring is subjective. The goal of any good scientific study is to attempt to code the subjective experience into numbers that can be treated like consistent objective data, to objectify it as best as possible. That is all I proposed. However, I feel that fully doing that would be unreasonable.[snip]


We understand all that, done it, been there. You skipped the basics. With DBT we don't want to objectivise human subjectivity. We JUST want to find out if there is an audible difference. To do that, we try to set up the test such that the only variable is the sound. We try to keep the listener from knowing which component in the comparison is playing. We try to delete all clues that are not strictly the sound. I don't see how you can get more scientific and objective than that.

The listener hears a difference or not, and scores accordingly. Now it is entirely possible that he hears it sometimes, or just on some types of music, or only after a beer or two. That then would be an important indicator that the differences are relatively small. That then would also mean that in normal listening, where we try to enjoy the music instead of listening to the equipment, the equipment is pretty equal in performance. On the other hand, if many different listeners in any circumstance consistently hear a difference, that would be an indication that one piece of equipment is superior to the other.
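
One common way to judge whether a listener's score reflects a heard difference or just guessing is a one-sided binomial test. The Python sketch below assumes each trial is an independent two-way choice with a 50% chance of a lucky guess; the function name `p_value_at_least` and the example of 12 correct out of 16 are hypothetical, chosen only for illustration.

```python
from math import comb

def p_value_at_least(correct, trials, chance=0.5):
    """Probability of getting `correct` or more trials right purely by guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )

# Hypothetical session: 12 correct identifications in 16 trials.
p = p_value_at_least(12, 16)
print(f"chance of 12 or more correct out of 16 by guessing: {p:.4f}")
```

A small probability suggests the listener was hearing something; a score near half right is what guessing alone would be expected to produce.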

The problem is not the test. The problem is the unwillingness to accept results that contradict people's prejudices. Some people go so far as to suggest, without any back-up, that the stress of an organised test keeps people from hearing differences that they would be able to readily hear in their own home comparing two pieces of equipment. And that at home, they would be perfectly able to disregard equipment price, design, reputation etc. in their judgement, although it is documented that even experienced listeners cannot do that. So, I feel you are barking up the wrong tree.

Now, I agree, if we did a test asking listeners: "Please tell me which component gives the best rendition of the emotional content of this musical excerpt", then we would run into all sorts of trouble as discussed by you. In that case we would indeed be, as you say, "trying to measure the experience that a listener gets from an amp". But that's not what we do.

Jan Didden
 
[snip] avoid my overly complex method of measure for studying the sound of an amp, and instead go with a simpler, less effective, but acceptable method, which would not be far off from what we have now. I would just call for more standardization, maybe some training in how to review a product, so everyone in a group of reviewers comes to the same sorts of conclusions for a given product, thus eliminating listener bias. Of course, that isn't so simple either, as sometimes what sounds accurate doesn't sound as pleasing, and then opinions and biases play a part too, etc. [snip]



I hear what you are saying here, and I agree that it appears to be a sound, objective, scientific approach. But the problem I see is that what you call a standard is simply YOUR perception, which you want to impose on others as the Way It Should Be. This would be a great test to find out how the various pieces of equipment do according to YOUR standard, but what's that to me?

Jan Didden

PS Please don't take all this personally; I really try to address your posts' contents.
 
Emotional content is the responsibility of the musicians, not the amplifier. I hate emotional amplifiers; they're temperamental and tend to cry a lot.

BTW, we just did a double blind test of Mosel Rieslings. Despite test pressure and all those other horrible things, the panel had no problem distinguishing Wehlener Sonnenuhr from Uertziger Wuertzgarten.
 
pjpoes said:
[snip]I have one more analogy that may help you understand why DBT cannot work to detect these differences. If I set up a study to show you different color cards and have you judge whether two are the same or different, and I place them side by side, you could probably detect very slight shade differences. However, if I show you one, then take it away and show you the other, you probably could not detect the difference. This is because our sense memory is not very strong, so we cannot hold the one color in our brains long enough to compare it with the other. The analogue in sound would be playing two different components for you and asking you to tell the slight differences between the two. If you had both available at the exact same time, you probably could, except that our ears can't deal with that well, and it's very difficult to do. So instead, I have to present only one sound, take it away, present the next, and ask you to detect the difference. The problem here again is that our memory is simply not strong enough to pick out all the subtle differences that exist. Instead, I would have to let you live with the two colors, or sounds, for a long period of time, and then you would begin to notice all the differences. Again, please understand that this is something I can defend with a great many studies; that is how our senses work, and that is always taken into account when attempting to measure the response of our senses.

But I fully agree with this. Most DBTs are set up such that the listener can switch and listen for as long or as short a time, and as often, as he likes. Within practical limits of course; nobody ever said DBTs are perfect, but they beat anything else that has been proposed.

But wouldn't you agree that if the differences in color shades are so small and subtle, you could select ANY of them for your curtains and be happy with them? Same with sound equipment; if the differences are so small that they cannot reliably be identified, then any of them will do. Again, the problem is that whoever bought those $500/ft cables CANNOT ACCEPT that there is no audible difference.

Jan Didden
 