Rules for statistical thinking
I know we have been talking about statistics a lot in my last few articles, but recent reader comments have prompted me to think more about why doing statistics properly matters. Come with me, dear reader, on a journey to find out why you should embrace, and not run screaming from, your inner statistical geek. But to get there, we have some interesting terrain to cover… starting with the human brain.
All right, I want to build a convincing case, step by step, so let’s get some basics out of the way.
Reality: it’s… complicated
Reality is messy, especially that part of reality we call business. We often have to make very important decisions when we have inadequate information and in the presence of noise that further confounds our ability to perceive the truth (or its reasonable approximation).
If that is what we are stuck with, we have one of two choices: abandon data and go on “gut feel,” or try to extract information from that data.
Unaided brain as decision engine, or gut feel as heuristic
Traditionally, of course, people have chosen the first option, and I won’t deny that some people have had some success with it at times. But the vast majority of people are really terrible at making the correct decision consistently. Recent research on how the human brain works has really made me wonder how we managed to find our way out of the caves and into the urban jungle. Here are just a few freaky examples:
Example 1. The human brain is pretty bad at modeling a complex system. Heck, it is pretty bad at multiplying two three-digit numbers, let alone encompassing interactive effects. So with business processes, we generate “industrial mythology” about how processes run, and then get emotionally attached to that mythology.
Example 2. We are subject to confirmation bias—once we become convinced of something, we tend to notice evidence that supports our conviction, and disregard evidence that contradicts it.
Example 3. Every time we remember something, that memory is subject to change, which makes it easy to change what we think we remember. Researcher Elizabeth Loftus was easily able to implant totally fictitious memories in people who then ended up confabulating very detailed memories that they absolutely believed were true, even though they could not have been.
Example 4. In what is being called the “backfire effect,” the more convinced we are that something is true, the more contradictory evidence strengthens our conviction that it is true. No, that is not a typo. Reread it and despair for humanity. I myself have seen it time and time again in business.
Beyond the implications for epistemology and civilization, what are the consequences of this for decision making in business? We cannot rely on the human brain to correctly determine what happened, to remember it accurately, or to change its mind when presented with convincing data.
Here is my conclusion: Without a discipline to follow, we really can’t trust ourselves to make decisions that consistently have any relationship with reality.
Now that is actually not quite as hopeless as it first appears. I don’t totally give up on humanity as irreconcilably irrational; I gave us an out: “Without a discipline….” With a discipline, I think we can make good, or at least the best possible, decisions. In fact, I think almost all of the progress we have seen throughout the past 300 years or so is due in large part to using disciplines to keep us on topic and thinking rationally. What is disturbing to me is that rationality does not seem to be particularly natural to us.
Obi-Wan Kenobi would make a terrible business consultant, since he advises Luke to, “Let go! Trust your feelings!” This is about the worst advice one can give in business. Maybe elsewhere, too.
The discipline to which I am referring is, of course, the scientific method, and as I wrote about in “Will Google Earn a Black Belt?,” define, measure, analyze, improve, control (DMAIC), and more generally, plan-do-check-act (PDCA) are nothing more than different formalizations of the scientific method. So we do have these evidence-based decision-making disciplines present in many businesses.
Had you considered that your Six Sigma work is a way to foster rationality in the workplace? That is pretty cool if you think about it. And heck, businesses sure need it.
There are clearly a lot of places in DMAIC or PDCA where you can be mistaken, for whatever reason, and end up thinking that the wrong factors are important, but that is where the analyze or check step comes in. These disciplines (like the scientific method) are self-correcting because you don’t just come up with factors you think are important, figure out countermeasures, implement them, and go your merry way—you use data to validate that they are important and that your solutions are effective in solving the problem.
In a very real way, the efficacy of the whole process depends on this one act—you can go wrong at any step prior to this and still (eventually) recover from it—as long as you perform this verification step correctly.
What need does the science of statistics fill in business?
Reality is messy and subject to noisy variability. This means that to determine if factors are significant in the real world we have to contend with additional, omnipresent random variability other than the issue we are trying to understand. If my problem is “simply” to reduce scrap due to weak welds, I am going to have to make decisions in the face of variability from the filler-rod vendor, rod diameter, rod composition, rod batch, weld angle, current and voltage fluctuations, travel speed, electrode composition, electrode geometry, electrode length, atmospheric pressure, temperature, humidity, operator, phase of the moon, sunspot activity, and probably a bunch of other stuff. Some of these will have more or less (or no) effect on the process output “weld strength,” which itself is subject to measurement error variability.
Phrased that way, it is kind of remarkable that we can make any decision, isn’t it?
Anyway, the point is that even a “simple” process is pretty complex. We know that humans are terrible at making correct and consistent decisions even in simple situations. How can we handle one with so much variability from so many sources without falling prey to emotional or cognitive errors?
That is where applied statistics comes in. It is the tool for making decisions in the presence of many sources of variability. It is a way of accounting for what we don’t yet understand (unexplained variability) in order to see effects that stand out above the noise (significant factors). All those statistical tests are based on the idea that we can quantify or estimate the amount of “background” variability and look for things that rise above that background, and call them significant.
Rules for statistical thinking
Now let’s focus on the process of statistical decision making, which is really what I wanted to talk about.
Keeping in mind that we humans have a terrible record of letting our subjective reality affect our decisions about objective reality, we need to come up with some rules to follow to keep ourselves honest when doing our statistical analysis. Business decisions are binary—we either do or do not make a change, so the rules have to recognize this.
Rule one—respect the status quo
The first rule is that, to be conservative, we will stick with the status quo until we prove to ourselves that change is necessary. That seems reasonable as a “devil we know” kind of argument. Once we accumulate sufficient evidence to convince ourselves that there is a real difference, then we will change. (In hypothesis testing, the status quo is the null hypothesis, which we will stick with until we reject it based on our analysis.) Otherwise, we would change the process whenever anyone had a neat idea, and never test to see if it works. No one would do that, would they? (Add one point for your inner stats geek—the true stats geek loves to make decisions with data.)
Rule two—respect statistical error
Because we are using statistics to make decisions in the presence of variability, we also know that there is a probability of making the wrong decision even though we did our statistics perfectly and correctly. It had nothing to do with us. It is just the price we pay for living in an imperfect world and having to make a “go” or “no go” decision. This type of error, due to chance and chance alone, is called “statistical error” and comes in two mutually exclusive flavors. Our second rule is to understand these errors and their effect on the business. (Add another point for your inner stats geek: understanding these errors allows you to design an efficient and economical experiment, and to properly analyze it.)
If you have done everything right and you decide that something had an effect, you are either correct (yay!), or you are incorrect and have made a Type I error (the probability of which is called α). What is neat is that you get to (well, you “must,” actually) choose the probability of making this error before you even gather the data. If the p-value from your analysis comes in below the α you chose, you are going to conclude that there was a significant change.
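To make the meaning of α concrete, here is a minimal simulation sketch. It assumes a one-sided z-test with a known standard deviation, and every number in it (a target of 100, a sigma of 5, 25 parts per experiment, an α of 0.10) is hypothetical and chosen only for illustration. It runs many experiments in which nothing actually changed and counts how often the rule "reject when p < α" sounds a false alarm.

```python
import random
from statistics import NormalDist

ALPHA = 0.10  # hypothetical Type I error rate, chosen BEFORE gathering data


def one_sided_p_value(sample, mu0, sigma):
    """p-value for H0: mean >= mu0 versus H1: mean < mu0, with known sigma."""
    n = len(sample)
    mean = sum(sample) / n
    z = (mean - mu0) / (sigma / n ** 0.5)
    return NormalDist().cdf(z)


def false_alarm_rate(alpha, n=25, trials=20000, seed=1):
    """Fraction of experiments that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # The process has NOT changed: data come from the status quo.
        sample = [rng.gauss(100.0, 5.0) for _ in range(n)]
        if one_sided_p_value(sample, mu0=100.0, sigma=5.0) < alpha:
            rejections += 1
    return rejections / trials


rate = false_alarm_rate(ALPHA)
print(f"chosen alpha = {ALPHA:.2f}, observed false-alarm rate = {rate:.3f}")
```

When the assumptions hold, the observed false-alarm rate should land close to the α you picked, which is the whole point of picking it in advance.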
If you have done everything right and you decide that something did not have an effect of a big enough size, you are either correct (yay!), or you are incorrect and have made a Type II error (the probability of which is called β). The probability of making a Type II error relates to how hard it is to see the effect over the noise, so it is controlled by the Type I error you chose, sample size, process variability, and the size of the effect you want to see. The experimental design that you choose also affects the probability of a Type II error: some designs are more sensitive to differences than others, or you might have unknowingly confounded significant factors into the background noise, making it harder to hear what is there. (Add one point for your inner stats geek: knowing how to really design an experiment, rather than just plugging and chugging, can earn you an advantage in cost and/or sample size.)
So, as I described in “(Sample) Size Matters,” you and management decide what levels of Type I and Type II error you can tolerate based on a number of factors. Let’s say that you would like to see what would happen if you switch to a cheaper filler-rod vendor who promises to save you $450,000 a year, but you are concerned that doing so might affect your weld strength. If your weld strength is affected by more than a minimal amount, you will incur losses of $800,000 a year. Based on the risks and benefits, you decide that you can tolerate a 10-percent chance of saying the new vendor decreases strength when it really doesn’t (Type I error), and you select a sample size sufficient to give you only a 5-percent chance of missing that the vendor change actually did reduce the strength (Type II error). (Note that this doesn’t even consider threats to external validity where your findings don’t apply to the real process. Maybe it’s a topic for another article sometime.)
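The four levers named above (the Type I error you chose, sample size, process variability, and effect size) can be seen at work in a small sketch. It assumes a one-sided, two-sample z-test with a known, common standard deviation, which is a textbook simplification and not necessarily the test you would run on real weld data. The function name `type_ii_error` and all of the numbers are hypothetical.

```python
from statistics import NormalDist


def type_ii_error(alpha, n_per_group, sigma, effect):
    """Approximate beta for a one-sided, two-sample z-test with known sigma.

    effect is the true difference in means we want to detect,
    in the same units as sigma.
    """
    z = NormalDist()
    standard_error = sigma * (2.0 / n_per_group) ** 0.5
    # We reject when the z statistic exceeds z_crit; beta is the chance
    # that it fails to do so even though the effect is real.
    z_crit = z.inv_cdf(1.0 - alpha)
    return z.cdf(z_crit - effect / standard_error)


baseline = dict(alpha=0.10, n_per_group=10, sigma=4.0, effect=3.0)
scenarios = {
    "baseline": baseline,
    "stricter alpha": {**baseline, "alpha": 0.01},
    "larger sample": {**baseline, "n_per_group": 40},
    "less process noise": {**baseline, "sigma": 2.0},
    "bigger effect": {**baseline, "effect": 6.0},
}
betas = {name: type_ii_error(**kwargs) for name, kwargs in scenarios.items()}
for name, beta in betas.items():
    print(f"{name:>20}: beta = {beta:.3f}")
```

Tightening α makes β worse, while more samples, less noise, or a bigger effect all make β better. That trade-off is what you are negotiating when you design the experiment.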
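As a rough sketch of how those two tolerances turn into a sample size, the snippet below uses the standard normal-approximation formula for a one-sided comparison of two means with a known, common standard deviation. That formula is my simplifying assumption here; the article does not say which design or test was used. The 10-percent and 5-percent risks come from the example above, while `sigma` and `effect` (expressed in units of sigma) are hypothetical placeholders you would replace with your own process numbers.

```python
import math
from statistics import NormalDist


def n_per_group(alpha, beta, sigma, effect):
    """Welds needed from EACH vendor to detect a drop of size `effect`.

    Normal approximation, one-sided test, equal group sizes, known sigma.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha)  # guards against a false alarm
    z_beta = z.inv_cdf(1.0 - beta)    # guards against a missed drop
    return math.ceil(2.0 * ((z_alpha + z_beta) * sigma / effect) ** 2)


# Risks from the vendor example: alpha = 10 percent, beta = 5 percent.
n_full_sigma = n_per_group(alpha=0.10, beta=0.05, sigma=1.0, effect=1.0)
n_half_sigma = n_per_group(alpha=0.10, beta=0.05, sigma=1.0, effect=0.5)
print(f"detect a drop of 1.0 sigma: {n_full_sigma} welds per vendor")
print(f"detect a drop of 0.5 sigma: {n_half_sigma} welds per vendor")
```

Under these assumptions, a one-sigma drop takes 18 welds per vendor and a half-sigma drop takes 69, so halving the effect you want to see roughly quadruples the sample size. That is why the "minimal amount" of strength loss you care about deserves a serious conversation with management before any welds are made.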
Consider the alternative to making a decision based on Type I error: You do the experiment to determine if the weld strength changes and get a p-value of 4.8 percent. If we follow the rules (with the Type I error set at 5 percent for this illustration), we would reject the hypothesis of no change and conclude that the new vendor did change the strength. But if you said, “Yeah, but 4.8 percent is really close to 5 percent, and I want to make the change to save money, so I am going to advise that we buy from the new vendor anyway,” you might as well have not done the experiment, because you are now bringing in the highly subjective reasoning of the human brain. You might or might not be right this time (“Do you feel lucky, punk?”) but you sure are not using the data to make that decision, and it will catch up with you. And if you are wrong, you’ve got nothing but evidence that you screwed up.
Rule three—respect the statistical rules
What’s worse is that those probabilities of making a Type I or Type II error are correct only if you do everything right and your assumptions are sound. Once anything happens that is outside of your assumptions, then we have no idea what those probabilities are—we only know that they get worse by some amount. What you think is a 5-percent probability of making a Type I error might actually be 7 percent, or 50 percent, or even more. We probably have no way of knowing. And the same thing can happen on the Type II error side. This additional, often unknown, amount of error is part of systematic error. No matter how many times you repeat the same study, that systematic error is still there and still giving you a higher probability than you think of making the wrong call—either changing vendors when they actually increase your costs, or not changing vendors even though that would be all right.
So, rule three is this: To keep ourselves honest, we must understand and meet the assumptions and limitations of whatever statistical heuristic we have selected in order to stay as close as possible to the stated levels of Type I and Type II error. What do I mean? Every applied statistician worth their salt knows that there is no such thing in the real world as a normal distribution, t-distribution, F-distribution, etc. But we also know that these are useful models to help us make decisions that are less influenced by our human weaknesses. The reasoning goes something like this:
I know that no real distribution is truly normally distributed, but sometimes it is a reasonable approximation of what we see. For random samples from two truly normal distributions, the ratio of the variances will… (many boring calculations redacted)… and so the ratio of sample variances seen over many samples should follow the F-distribution, which I can therefore use to decide if two samples have equal variance. (Cool! But wait!) If those two distributions are not normal, using the F-statistic as an approximation will start giving us the wrong answer, adding an unknown amount of systematic error. And it doesn’t take much of a difference from perfect normality before I start running a pretty good chance of making the wrong decision. Therefore, I will test my real data for normality to keep my probability of error as close to the stated levels as possible. If it passes that test, by definition any real non-normality will be small enough so that any additional systematic error in using the F-distribution will be small. If it fails the normality test, by definition that added systematic error starts becoming too big to dismiss, and I will have to find a different way of testing for equality of variances. (See note 1.)
In the latter case, we are not rejecting the use of the F-test for equality of variances because the population is not normal (nothing probably is), but because it is not close enough to being normal to avoid pushing our actual chance of making a bad decision well above the level that we decided on earlier and that everyone bought into.
The alternative? An unsophisticated Black Belt might say, “Hey it looks normal to me, and since nothing is really normal, testing for normality is not important, so I’ll go ahead and use the F-test.” (By the way, I challenge you to find even one applied statistician who will say this. The F-test for equality of variance is so sensitive to departures from normality that I predict even the most liberal statistician will think it is cuckoo.) What that Black Belt doesn’t understand is that the risk of making a bad decision has now gone from known and agreed-upon, to unknown. Your response should be, “Hey, my job is on the line here. I want to know the risk I am taking and not have my career depend on drawing to an inside straight!”
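The sensitivity described in that reasoning can be checked with a small simulation sketch. To keep it dependency-free, I am swapping in Monte Carlo critical values for a lookup in the F-table: the cutoffs for the variance ratio are estimated from simulated normal data, which is the situation the F-distribution describes. The same cutoffs are then applied to heavy-tailed data (a Laplace distribution, my choice for illustration) whose two groups truly have equal variances. Sample size, trial count, and seed are all hypothetical.

```python
import random


def sample_variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_ratio(draw, n, rng):
    """Ratio of sample variances for two groups drawn from the SAME population."""
    a = [draw(rng) for _ in range(n)]
    b = [draw(rng) for _ in range(n)]
    return sample_variance(a) / sample_variance(b)


def normal(rng):
    return rng.gauss(0.0, 1.0)


def heavy_tailed(rng):
    # Laplace draw: symmetric like the normal, but with fatter tails.
    return rng.expovariate(1.0) * rng.choice((-1.0, 1.0))


def rejection_rate(draw, lower, upper, n, trials, rng):
    hits = 0
    for _ in range(trials):
        ratio = variance_ratio(draw, n, rng)
        if ratio < lower or ratio > upper:
            hits += 1
    return hits / trials


rng = random.Random(7)
N, TRIALS, ALPHA = 20, 10000, 0.05

# Two-sided cutoffs estimated from the null distribution under normality.
null_ratios = sorted(variance_ratio(normal, N, rng) for _ in range(TRIALS))
lower = null_ratios[int(TRIALS * ALPHA / 2)]
upper = null_ratios[int(TRIALS * (1 - ALPHA / 2)) - 1]

rate_normal = rejection_rate(normal, lower, upper, N, TRIALS, rng)
rate_heavy = rejection_rate(heavy_tailed, lower, upper, N, TRIALS, rng)
print(f"nominal Type I error:          {ALPHA:.3f}")
print(f"actual rate, normal data:      {rate_normal:.3f}")
print(f"actual rate, heavy-tailed data: {rate_heavy:.3f}")
```

In both runs the variances really are equal, so every rejection is a false alarm. With normal data the rate should sit near the nominal level; with the heavy-tailed data it should come out noticeably higher, which is the "unknown amount of systematic error" in action.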
And this is not limited to testing normality only! That was just an easy example. Every statistical test has at least some assumptions, and for every one of those assumptions you violate, your systematic error increases by some amount, which is to say, your chance of making the wrong decision increases, and the reason for using statistics to help you make decisions is diluted.
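Here is one more sketch along the same lines, this time breaking the independence assumption instead of normality. It assumes a simple large-sample z-test of "the mean has not shifted" and feeds it data where each measurement drifts with the previous one (an AR(1) series, my choice of illustration; think of readings taken back-to-back on the same machine). The mean truly is zero in every run, and the correlation value, sample size, and seed are hypothetical.

```python
import random
from statistics import NormalDist


def ar1_series(n, phi, rng):
    """n observations with true mean 0; phi=0 is independent, phi>0 drifts."""
    x = rng.gauss(0.0, 1.0)
    series = []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, 1.0)
        series.append(x)
    return series


def rejects_zero_mean(series, alpha):
    """Two-sided large-sample z-test of H0: mean = 0 that ASSUMES independence."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series) / (n - 1)
    z = mean / (var / n) ** 0.5
    return abs(z) > NormalDist().inv_cdf(1.0 - alpha / 2.0)


def type_i_rate(phi, alpha=0.05, n=50, trials=5000, seed=3):
    """Share of runs that declare a shift when there is none."""
    rng = random.Random(seed)
    hits = sum(rejects_zero_mean(ar1_series(n, phi, rng), alpha)
               for _ in range(trials))
    return hits / trials


rate_independent = type_i_rate(phi=0.0)
rate_correlated = type_i_rate(phi=0.5)
print(f"independent data: actual Type I rate = {rate_independent:.3f}")
print(f"correlated data:  actual Type I rate = {rate_correlated:.3f}")
```

The test and the stated 5-percent risk are identical in both runs; only the unmet assumption differs. With correlated data the false-alarm rate should come out several times higher than what everyone signed up for.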
And that is why I am so adamant about knowing and testing each assumption for every statistical test I (and my students) do. When making important business decisions I am by nature, and you should be too, quite conservative in demanding that I know the real level of risk we are taking. (Now give your inner stats geek a million points and go indulge it by learning more statistics.)
Conclusion—dem’s de rulz
When making decisions using statistics, as much as possible we must have a realistic set of rules that are strictly enforced. To do otherwise is to invite disaster, since as soon as we bring our judgment into the analysis, our own brain is going to try to trick us, and each rule we break increases our chance of making the wrong decision.
Also, in doing this, we provide a route for others to repeat the same analysis on the same data and come to the same conclusion, one of the core tenets of the scientific method. It also provides some objectivity that might be helpful if you are in a litigation-prone industry.
Finally, use this thinking to frame arguments among statisticians—it almost always boils down to a disagreement over the additional amount of systematic error a procedure or broken assumption adds to your decision. Some tests are robust to certain assumptions, others are really, really not. You need to rely on your inner stats geek to have the enthusiasm to learn these limits in order to make the right decisions.
For it is written, “The geeks shall inherit the Earth.”
1. Actually, the F-test is so sensitive to this departure from normality that even if a distribution passes the normality test, we don’t recommend using it anymore. Keep in mind that there is a test for every situation that is testable, so do not fear. Oh, and +1 inner geek point for reading a footnote.