Statistics from end to beginning
A colleague of mine made an interesting point about how we teach and learn experimental design techniques, and I thought I'd explore the subject further. He observed that the order in which we teach statistical tools is almost exactly the opposite of the order in which one would actually use them. So this month I will describe the different tools in the order you might actually use them.
Let’s say that we have defined the opportunity, figured out or validated a way to measure the current state, and come up with a list of potential sources of the problem in the “analyze the causes” step. At this point, you are potentially faced with a large number of variables that might influence the process and cause the problem.
Now, let’s say that the process is fairly mature, and the problem is a tough one and has been viewed as unsolvable. In this case, all the process experts usually know the answer, but none of them agree with each other. In my experience, this means that you are dealing with one or more interactions between the process variables. These are almost impossible for someone just running the process to catch; to them it appears that for no apparent reason things go OK for a while and then go bad. But the real question is, “Which of the gazillion variables are interacting to cause the problem?”
Figure 1 - Data from a modeled process affected by multiple interactions
We need some way to filter out those variables that have no, or relatively little, effect. The tool for this is a “screening experiment,” which is evaluated with an analysis of variance (ANOVA). Because there ain’t no such thing as a free lunch, if we want to test a lot of factors and interactions, we can either set up a gigantic experiment that would take the age of the universe to complete, or we can design an experiment that runs only a small fraction of all possible combinations of the factors and covers only those interactions that we think are possible. Management is generally not willing to wait 10^100 years for protons to decay and the heat death of the universe before getting their results (they only wait that long before giving a pay raise), so we usually choose a “fractional” experiment. But there’s a price to pay: we are taking the risk of “confounding” real effects that we assumed weren’t significant with the factors and interactions we decided to study. This can cause factors to appear significant when they really aren’t, to appear more significant than they really are, or even to appear insignificant when they do have an effect.
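To make confounding concrete, here is a minimal sketch in Python. It assumes a textbook half-fraction of a four-factor, two-level experiment; the factor names A through D and the generator D = ABC are illustrative assumptions, not a design taken from a real study. It shows that in the eight-run fraction the AB and CD interaction columns are identical, so the analysis cannot tell the two effects apart.

```python
from itertools import product

def half_fraction():
    """Build an 8-run half-fraction of a 2^4 design.

    Assumption for illustration: factor D is set by the generator D = A*B*C.
    Levels are coded -1 (low) and +1 (high).
    """
    runs = []
    for a, b, c in product((-1, 1), repeat=3):
        d = a * b * c  # D is not varied freely; it is derived from A, B, C
        runs.append((a, b, c, d))
    return runs

runs = half_fraction()

# Contrast columns for two different two-factor interactions.
ab_column = [a * b for a, b, c, d in runs]
cd_column = [c * d for a, b, c, d in runs]

print("runs in the fraction:", len(runs), "of", 2 ** 4, "possible")
print("AB column:", ab_column)
print("CD column:", cd_column)
print("AB confounded with CD:", ab_column == cd_column)
```

If a real CD interaction exists and we only planned to study AB, its effect shows up in our AB estimate. That is the risk we accept in exchange for running half the experiment.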
I teach and advocate using a customized fractional design based on an orthogonal array, because it is rare indeed that one of the classic fractional factorials actually answers the research questions you want to ask. I have little use for response surface methodology: it requires that all factors be continuous, which doesn’t happen often in industry, and it is vulnerable to variation that leads to a misleading model. The customized fractional factorial is a lot more useful, and most people find it easier to learn and use. Our stated objective at this point is only to reduce the number of factors that likely influence the process, and the fractional factorial is a very economical way to do that, so it’s the preferred method. Unlike some practitioners, I strongly advocate including possible interactions in the experiment, as one or more of them are probably the culprits you are looking for.
The output of the screening experiment is a reduced list of factors and their interactions. However, we know that we have only run a few settings out of the total possible combinations, and we would never consider using the results of a fractional experiment, with all its confounding, without performing a confirmation experiment to see whether our predictions hold true. A confirmation experiment validates that the effects we think we observed can be reproduced. To do so, we might run a full-factorial experiment (all combinations of the factors we found significant) on the reduced list. We might even run a one-way ANOVA at multiple levels for a factor if we thought that the response was nonlinear.
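Laying out a full factorial on the reduced list is just a matter of enumerating every combination of levels. The sketch below assumes three surviving factors; their names and levels are hypothetical placeholders, not results from any real screening experiment.

```python
from itertools import product

def full_factorial(levels):
    """Return every combination of factor levels as a list of dicts.

    `levels` maps a factor name to the list of settings to test.
    """
    names = list(levels)
    combos = product(*(levels[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

# Hypothetical reduced factor list (assumed for illustration only).
reduced_factors = {
    "temperature": ["low", "high"],
    "pressure": ["low", "high"],
    "supplier": ["A", "B", "C"],
}

confirmation_runs = full_factorial(reduced_factors)

print("runs needed:", len(confirmation_runs))
for run in confirmation_runs:
    print(run)
```

With only a handful of factors left, the run count (here 2 x 2 x 3 = 12) stays manageable, which is the whole point of screening first.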
As we run full-factorial experiments or one-way ANOVAs, we bump up against the inherent limitation of the test. As I described in my Heretic article “Homoscedasticity,” ANOVA detects only whether there is a difference somewhere among the settings. It doesn’t allow you to say specifically that one setting is different from another (except for the extremes). When an ANOVA detects a difference, we need to follow it up with post-hoc tests to determine just what those differences are.
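The limitation is easy to see in the arithmetic. Here is a minimal sketch of the one-way ANOVA F-ratio, written from the standard textbook definition; the three groups of measurements are made-up numbers for illustration. The result is a single number for the whole experiment, with nothing in it that says which settings differ.

```python
def one_way_f(groups):
    """Compute the one-way ANOVA F-ratio for a list of groups of measurements."""
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation between group means, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of each measurement around its own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Made-up data: one factor tested at three settings.
settings = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_ratio = one_way_f(settings)
print("F =", f_ratio)
```

Turning the F-ratio into a p-value requires the F distribution; in practice a library routine such as `scipy.stats.f_oneway` does both steps.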
And guess what, those post-hoc tests are just the good ol’ t-test, with some fancy ways to control for the alpha error inflation caused by making multiple tests. The basic standby is the Bonferroni-Dunn approach, where you divide alpha by the number of t-tests you need to do. This gives you a new alpha for each test, and if the p-value from a t-test falls below that new number, it’s a significant difference. (There are other techniques that are more useful when you have many more comparisons that you want to do.)
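The Bonferroni-Dunn adjustment can be sketched in a few lines of Python. The p-values below are hypothetical, standing in for the results of pairwise t-tests among three settings; in practice they would come from something like `scipy.stats.ttest_ind`.

```python
def bonferroni(p_values, alpha=0.05):
    """Apply the Bonferroni-Dunn adjustment to a dict of pairwise p-values.

    Returns the per-test alpha and a dict saying which comparisons
    are significant at that adjusted level.
    """
    per_test_alpha = alpha / len(p_values)
    decisions = {pair: p < per_test_alpha for pair, p in p_values.items()}
    return per_test_alpha, decisions

# Hypothetical p-values from three pairwise t-tests (illustration only).
pairwise_p = {
    ("A", "B"): 0.30,
    ("A", "C"): 0.004,
    ("B", "C"): 0.012,
}

per_test_alpha, decisions = bonferroni(pairwise_p)
print("per-test alpha:", round(per_test_alpha, 4))
for pair, significant in decisions.items():
    print(pair, "significant" if significant else "not significant")
```

With three comparisons, an overall alpha of 0.05 becomes roughly 0.0167 per test, so a p-value of 0.03 that would have passed a lone t-test no longer counts as a significant difference.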
This brings us full-circle back to where we start to teach inferential statistics—the basic hypothesis test.
Of course, we couldn’t just start off teaching fractional factorials if you had never seen a hypothesis test before, but I thought it was an interesting observation that we actually use the tools in the opposite order in which we teach them.
So next time someone tells you something that seems totally “bass-ackwards,” consider that they might just be teaching you something you will want to know down the road a ways.
Maybe even 10^100 years down the road.