Questions about signal detection theory

Hi @pietro,

I just wanted to ask three questions about the Stall Catchers game and signal detection theory. I have read that one of the variables in signal detection theory is the strategy a player uses to decide whether or not the signal is present.

There are four possible outcomes.

A stall is present, and the player says it is there (hit).

A stall is absent, and the player says it is there (false alarm).

A stall is present, and the player says it’s not there (miss).

A stall is absent, and the player says it's not there (correct rejection).

Is there a strategy a player could use that would maximize hits and correct rejections and minimize false alarms and misses? I guess such a strategy is part of what would be outlined in any tutorial tips, but I was wondering whether it would make sense to present the options this way in a tutorial.

Is sensitivity affected by how intense the stimulus is? If so, are the point values adjusted when the movies are extremely blurry? Or are more sensitive people always able to detect stalls better than less sensitive people, no matter how blurry the movies are?

Finally, I read that according to signal detection theory, when a person is quite uncertain as to whether a stall was present, the individual will decide based on which kind of mistake in judgment is worse: saying that no stall was present when there actually was one, or saying there was a stall when in fact no stall was present. Sorry if I've gotten the conditions switched around, but I think I have them the right way. It occurred to me that the player does not know which mistake is worse in the context of the Stall Catchers game. Would it be a good idea to state in a tutorial which kind of mistake results in the loss of the most points? I know that technically points cannot be lost, but the different types of mistakes do affect how much the blue bar goes down, so it would be good to know which type of mistake makes it go down the most.

Hi Mike,

Thanks for delving into this - this touches on the crowdsourcing science part of the project (one aspect I particularly enjoy).

Please see inline responses…

The really cool thing about signal detection theory (SDT) is that if some basic assumptions are met, it can provide independent measurements of discrimination sensitivity and bias. The response strategy often boils down to bias. In other words, when I have uncertainty about what I saw, am I more inclined to answer “stalled” or “flowing”?

Yes! These four values are often collectively referred to as a confusion matrix (for 2-alternative forced choice tasks). These values are also, respectively, referred to as true positives, false positives, false negatives, and true negatives.
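For concreteness, here's a minimal sketch of how those four counts get tallied from a batch of answers; the "truth" and "response" lists below are invented for illustration:

```python
# Tally hits, false alarms, misses, and correct rejections (a 2x2 confusion matrix).
# True means "stalled"; the lists below are made-up annotations, not real game data.
def confusion_counts(truth, response):
    hits = sum(t and r for t, r in zip(truth, response))
    false_alarms = sum(r and not t for t, r in zip(truth, response))
    misses = sum(t and not r for t, r in zip(truth, response))
    correct_rejections = sum(not t and not r for t, r in zip(truth, response))
    return hits, false_alarms, misses, correct_rejections

truth = [True, False, False, True, False]
response = [True, True, False, False, False]
print(confusion_counts(truth, response))  # (1, 1, 1, 2)
```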

SDT does not offer specific guidance for choosing a response strategy. SDT's measure of sensitivity (d-prime) is based on the idea that regardless of whether you err on the side of misses or false alarms (regardless of your bias), your sensitivity will remain constant, which is what the "blue tube" reflects. Your best strategy in all circumstances is simply to give your best answer based on what you see and what you think about what you see.
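For anyone curious about the numbers behind that, here is a minimal sketch of the standard equal-variance SDT measures; the hit and false-alarm rates below are made up:

```python
# d-prime (sensitivity) and criterion (bias) from a hit rate and a false-alarm rate,
# under the usual equal-variance Gaussian assumptions. Rates here are invented.
from statistics import NormalDist

def dprime_and_criterion(hit_rate, fa_rate):
    z = NormalDist().inv_cdf                       # inverse of the standard normal CDF
    d_prime = z(hit_rate) - z(fa_rate)             # sensitivity
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))  # bias: < 0 leans "stalled", > 0 leans "flowing"
    return d_prime, criterion

print(dprime_and_criterion(0.85, 0.15))  # d' ~ 2.07, criterion ~ 0 (unbiased)
print(dprime_and_criterion(0.95, 0.35))  # d' ~ 2.03 (similar), criterion ~ -0.63 (leans "stalled")
```

The second catcher answers "stalled" far more often, yet the estimated sensitivity barely changes - which is the point.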

Really apt question. People would be expected, on average, to be more sensitive to a very clear stimulus than to a noisy one. And to the extent that our perceptual filters work similarly for all movies, and that our expertise about movies is unidimensional (a matter of more or less, rather than different kinds of expertise), more sensitive people would tend to detect stalls better than less sensitive people. I think that is generally true, though we haven't investigated this question. It is conceivable, for example, that you and Guy both have the same sensitivity, but that Guy is better at detecting stalls when vessels are straight and you are better when vessels are curved. We have not explored individual differences like that.

Exactly! In Stall Catchers, we have set up the scoring rubric to try to neutralize bias. In other words, we have tried to make both kinds of mistake equally costly, or equivalently, both kinds of success (hits and correct rejections) equally rewarding, taking into account the approximate frequency of stalled and flowing vessels that are shown. Despite this, each catcher has her/his own individual bias toward responding one way or the other, and there appears to be a (normal) distribution of such biases that is roughly centered around neutral responding (which is consistent with a neutral reward system).

There is really no strategy to optimize the blue tube behavior other than simply responding as best you can. Even though the cost of a false negative might be higher than that of a false positive, if you tried to adjust your bias accordingly and respond, on average, with more positives ("stalled"), the false positive rate would increase much faster because of the prevalence of flowing vessel movies, and the blue tube would drop just as quickly as if you had biased yourself in the other direction. In other words, neutral responding is optimal and there is really no way to game the system (which, of course, is by design).
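To make that concrete, here's a rough simulation under some loud assumptions: the usual equal-variance Gaussian SDT model, the roughly 20% in-game stall incidence mentioned further down, and a toy scoring scheme that charges a miss more than a false alarm in proportion to how rare stalls are. The real rubric surely differs in detail, but the shape of the result is the point:

```python
# Toy simulation: with a prevalence-weighted penalty, shifting your criterion
# away from neutral in either direction costs about the same. All numbers invented.
import random

random.seed(1)
P_STALL = 0.2                          # approximate in-game stall incidence
D_PRIME = 1.5                          # assumed sensitivity, identical for every strategy
MISS_COST = (1 - P_STALL) / P_STALL    # a miss "costs" 4x a false alarm, offsetting stall rarity
FA_COST = 1.0

def mean_cost(criterion, n=200_000):
    """Average penalty per movie for a catcher who answers 'stalled' whenever
    the internal evidence exceeds the given criterion."""
    total = 0.0
    for _ in range(n):
        stalled = random.random() < P_STALL
        evidence = random.gauss(D_PRIME if stalled else 0.0, 1.0)
        says_stalled = evidence > criterion
        if stalled and not says_stalled:      # miss (false negative)
            total += MISS_COST
        elif says_stalled and not stalled:    # false alarm (false positive)
            total += FA_COST
    return total / n

# Liberal ("stalled"-leaning), neutral, and conservative ("flowing"-leaning) criteria:
for c in (0.0, D_PRIME / 2, D_PRIME):
    print(f"criterion {c:+.2f}: mean penalty per movie {mean_cost(c):.3f}")
# The minimum sits at the neutral criterion; both biased strategies cost about the same.
```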

That said, for research purposes, we’d rather see a false positive than a false negative (we don’t want to miss any stalls if we can help it).

Thanks again for the probing questions - I hope this helps clarify!

All best,
Pietro

(p.s. Sorry for my late reply - I actually replied to most of this about a week ago, but somehow left it unfinished until now. Guy kindly brought this to my attention!)


Your question, @MikeLandau, prompted me to peek at the data!

Here is a distribution of response bias from catchers who have been annotating the High Fat Diet dataset:

The distribution is weighted slightly to the right, but most people respond neutrally.

Best,
Pietro

This whole discussion of different types of errors makes me think about type I and type II errors in hypothesis testing. I believe they are analogous to what we are discussing, but I'm not exactly sure. I always get the two confused, but I will try to keep them straight. A type I error is when we incorrectly reject the null hypothesis. In other words, we say that there is a difference between the control group and the treatment group when in fact there is none (false positive). A type II error is when we fail to reject the null hypothesis when in fact the null hypothesis is false, so we say there is no difference between the treatment group and the control group when in fact there is one (false negative). Please correct me if I'm wrong about this. Whenever I think about hypothesis testing I get hopelessly confused about which one is which.
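Maybe writing it down will help me keep them straight (this is just my own cheat sheet, so please correct it if it's off):

```python
# My mnemonic: hypothesis-testing errors lined up with detection outcomes.
error_map = {
    "type I":  "reject a true null -> claim a difference that isn't there -> false positive",
    "type II": "keep a false null  -> miss a difference that is there     -> false negative",
}
for name, meaning in error_map.items():
    print(f"{name}: {meaning}")
```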

Would it be correct to say that claiming there is a stall in the blood vessel when in fact there is none is like a type I error, and saying that the blood vessel is flowing when in fact there is a stall is like making a type II error? If so, I can understand why you would not want to miss a stall if there was one, but I'm not sure why that would be worse than saying there is a stall when in fact there is none. I guess my question is: in this case, which is worse, making the type I error or the type II error?

I guess which one is worse depends on what the research question happens to be. What is the current research question as far as the high-fat diet is concerned? Are you looking to see whether mice that eat the high-fat diet have more stalls compared to a control group that eats a normal diet? If that is the research question, which do you think would be worse, a type I error or a type II error? It's hard to decide, because both types of error would be bad.

Is the ultimate goal of this research to find a drug that will eliminate or reduce stalls in blood vessels? If so, I imagine that you will have a control group and a treatment group: the treatment group will receive the drug, and the control group will receive no drug, or a placebo. In such a case, I would imagine that a type I error would be bad because it could lead to putting a drug on the market that does not work. On the other hand, a type II error might be even worse, because the drug might actually work but you might miss the effect, so you would say that the drug does not work when in fact it does. I realize that I am kind of going off-topic here, but it is related to signal detection theory in a way: it has always looked to me like signal detection theory is a type of hypothesis testing, in that we have two curves and we are trying to distinguish between them. However, the fact that signal detection theory and hypothesis testing look similar may just be a coincidence. I'm not sure; I've just always noticed the similarity and wondered about it.

Hi @MikeLandau!

I also have trouble keeping them straight, but that sounds right to me!

Though you have correctly aligned Type I & II errors with false positives and false negatives (respectively), I think using that alignment to refer to finding stalls might conflate error types with hypothesis testing. Each time a catcher annotates a vessel, s/he is generating her/his own mini hypothesis, which is either that the vessel is flowing or that it is stalled. If the hypothesis is that it is flowing, then the null is that it is stalled, and vice versa. So whether the error ends up being Type I or Type II depends on the user's answer. In this context of an individual classifying a vessel, I'm not sure how much additional utility we get by thinking of classification in terms of hypothesis testing and Type I & II errors, and indeed, it may further confuse our analysis because what counts as Type I or Type II changes depending on the selection.

However, in the context of the study from which the datasets are derived, in which there are treatment and control groups and we hypothesize a treatment effect, it does make sense to distinguish between no treatment effect (null) and some treatment effect (alternate), because there is a built-in asymmetry between no effect and some effect. For example, the current dataset examines the potential effect of a high fat diet (a cardiovascular risk factor) on stall rates. So the null hypothesis is that a high fat diet has no effect, and the alternate is that it does. A Type I error would then be concluding that it has an effect when it doesn't.
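If it helps to make that concrete, here is a toy version of that study-level comparison as a two-proportion z-test; the vessel counts are invented, and the study's actual analysis may well look quite different:

```python
# Hypothetical two-proportion z-test of stall rates: high fat diet vs control.
# All counts are made up for illustration only.
from math import sqrt
from statistics import NormalDist

stalled_hfd, total_hfd = 30, 1000
stalled_ctl, total_ctl = 12, 1000

p_hfd = stalled_hfd / total_hfd
p_ctl = stalled_ctl / total_ctl
p_pool = (stalled_hfd + stalled_ctl) / (total_hfd + total_ctl)
se = sqrt(p_pool * (1 - p_pool) * (1 / total_hfd + 1 / total_ctl))
z = (p_hfd - p_ctl) / se
p_value = 1 - NormalDist().cdf(z)      # one-sided: alternate is "the diet increases stalls"

print(f"stall rate {p_hfd:.1%} vs {p_ctl:.1%}, z = {z:.2f}, one-sided p = {p_value:.4f}")
# Type I error: rejecting "no diet effect" when the diet truly has none.
# Type II error: failing to reject it when the diet really does increase stalls.
```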

But I think part of your question (please correct me if I'm wrong) pertains to our decision to think of stalled vessels as positives and flowing vessels as negatives. This decision is not so much about hypothesis testing (as explained above) and relates more to the relatively low incidence of stalls. Stalled brain capillaries tend to have an incidence of about 0.5% in healthy mice and about 2% in mice with Alzheimer's disease. So imagine if only 1 in 100 vessels is stalled and we somehow miss that stall - yikes! If there are only 1000 vessels in a dataset, that's 10 stalls, and missing 1 of them would be a 10% error rate. On the other hand, if we misclassify 1 of the 990 flowing vessels, that is an error rate of only about 0.1%.
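Spelling out that arithmetic (using 1% incidence as a round, illustrative figure between the two rates above):

```python
# Back-of-the-envelope class-conditional error rates at ~1% stall incidence.
total_vessels = 1000
stalled = 10                          # 1% of the dataset
flowing = total_vessels - stalled     # 990

print(f"missing 1 stall:          {1 / stalled:.1%} of all stalls")       # 10.0%
print(f"misclassifying 1 flowing: {1 / flowing:.1%} of flowing vessels")  # ~0.1%
```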

At this point, you might observe that in Stall Catchers you see stalls much more often than 1% of the time, and that is because we intentionally insert calibration vessels to maintain an in-game stall incidence of about 20%.

Exactly. So in this case it is not so much about which is worse, but about which outcome would be considered the treatment effect with respect to a baseline. Since we are hypothesizing that a high fat diet increases stall rates, we make that our alternate hypothesis. (The choice of the treatment effect as the alternate hypothesis is basically a scientific convention that helps us keep things straight when we do our statistical testing.)

I like the way you think, Michael, and relate ideas to each other. I do wonder if there could be utility in considering the application of SDT to results from multiple studies about the same treatment, as a way to combine those studies, treating each as a unique classification result.

In the context of comparing the relative costs associated with Type I and Type II errors in treatment studies, I think it all depends on the subjective utility function - in other words, what’s important to each individual and, in particular, how they assign value to various outcomes. If you are a pharmaceutical company, there might be one set of costs associated with type I and type II errors (the cost of mass producing an ineffective drug and dealing with any related lawsuits vs the cost of missing an opportunity to develop a lucrative drug). On the other hand, if you are a patient with Alzheimer’s disease, you might decide it’s a much greater risk to miss the possibility of an effective treatment than to try a drug that doesn’t work (unless there are very bad side effects, of course).
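A toy way to put numbers on that (the error probabilities and cost figures below are entirely invented):

```python
# Which error "dominates" depends on the costs a given stakeholder assigns.
def risk_breakdown(p_type1, p_type2, cost_type1, cost_type2):
    r1, r2 = p_type1 * cost_type1, p_type2 * cost_type2
    worse = "Type I" if r1 > r2 else "Type II"
    return f"Type I risk {r1:.2f} vs Type II risk {r2:.2f} -> {worse} matters more"

P1, P2 = 0.05, 0.20   # assumed chances of each error for some hypothetical study design

# A company weighing production/lawsuit costs against a missed product:
print("company:", risk_breakdown(P1, P2, cost_type1=100, cost_type2=20))
# A patient who mainly fears missing an effective treatment:
print("patient:", risk_breakdown(P1, P2, cost_type1=5, cost_type2=100))
```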

Thanks again for your ongoing and ever-stimulating ideas and inquiries!

Best wishes,
Pietro
