How ANOVA helps ensure process data lead to accurate, defensible decisions.
The two-sample t-test determines whether two population means are equal. Typical applications involve testing whether a new process or treatment outperforms a current one. But what if we want to compare three or more means? The t-test is inappropriate for that analysis.
For example, a young engineer tests the mean brightener concentration in their four acid copper pulse plating tanks (A, B, C, D). There are six pairwise comparisons: AB, AC, AD, BC, BD, CD. Using the t-test, if the probability of correctly accepting the null hypothesis for each test is 1 – α = 0.95, then the probability of correctly accepting the null hypothesis for all six tests is (0.95)⁶ = 0.74, or 74%. In other words, there is a 1 – 0.74 = 0.26, or 26%, chance of committing a Type I error. Recall that a Type I error occurs when we reject a true null hypothesis (no statistical difference) and claim that there is a statistical difference. Multiple comparisons thus significantly inflate the Type I error rate. The appropriate procedure for testing the equality of several means is the analysis of variance.1
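For readers who want to verify the arithmetic, here is a minimal sketch in Python; the choice of language is ours, and it assumes, as the example above does, that the six comparisons are independent:

alpha = 0.05   # per-test Type I error rate
k = 6          # pairwise comparisons among four tanks: AB, AC, AD, BC, BD, CD

p_all_correct = (1 - alpha) ** k      # 0.95**6 = 0.735, the ~0.74 above
familywise_error = 1 - p_all_correct  # ~0.265, the ~26% quoted above

print(round(p_all_correct, 3), round(familywise_error, 3))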
R.A. Fisher, a British statistician, invented the analysis of variance (ANOVA) in 1918. While the t-test compares only two groups, Fisher designed the ANOVA to permit comparisons among multiple groups using a single test. The ANOVA gained popularity after being included in Fisher’s 1925 text Statistical Methods for Research Workers. Today, the ANOVA is one of the most useful techniques in the field of statistical inference. The ANOVA is a general linear statistical model technique used to test the hypothesis that the means of two or more groups are equal. “Linear” refers to the mathematical relationship between the model parameters and the dependent variable (y): the response variable is a linear function of the model, meaning the average outcome relates linearly to each term in the model.2
The ANOVA model carries two types of assumptions. The first concerns the form of the model: the chosen predictors are genuinely related to the response variable, and the average outcome is linearly related to each term in the model.2,3
The second assumption concerns the distribution of the errors (residuals). It is generally assumed that the sampled populations are approximately normally distributed, that the observations are independent, that the variances remain equal across groups (homogeneity) and that the observations come from random sampling. The ANOVA technique is robust to minor deviations from normality, independence and homogeneity. You can gather clues about whether most of these assumptions will be met before building the model, but we typically build the model first and then verify the assumptions. If you have done the foundational work in the early steps, testing assumptions becomes a matter of looking for minor deviations, not major transgressions.2,3
The ANOVA tests the null hypothesis (H0) that two or more population means are equal versus the alternative hypothesis (H1) that at least one mean is different. Using the formal notation of statistical hypotheses, for k means we write:
H0: μ1 = μ2 = … = μk
H1: At least one mean is not equal to the others
In statistics, the alternative hypothesis can be either one-tailed or two-tailed. One-tailed tests look for a difference in a specific direction (inferiority or superiority), while two-tailed tests look for a difference in either direction (not equal). The ANOVA is a bit more complex: we test whether “not all means are equal.” Suppose we are comparing three groups; the alternative hypothesis says that at least one of the following is true:
Mean 1 is not equal to mean 2.
Mean 1 is not equal to mean 3.
Mean 2 is not equal to mean 3.
As implied, the ANOVA analyzes variances to test means. But why analyze variances to draw conclusions about the means? Remember the alternative hypothesis: the means are different. The larger the differences between the means, the greater the variation between the groups. The ANOVA assesses the amount of variability between the group means in the context of the variation within groups to determine whether the mean differences are statistically significant. When the ANOVA signals statistically significant results (p-value < 0.05), indicating that not all means are equal, post-hoc tests are needed to complete pairwise comparisons.
Let’s look at how the ANOVA works by using an example. Table 1 shows a single factor with three levels (A, B, C), each with three measured responses, along with descriptive statistics. The data are fictitious and presented for explanatory purposes only.
Table 1. Three-Factor Data Set
Table 2 shows a raw ANOVA table, followed by the detailed ANOVA calculations. Finally, Table 3 shows the completed ANOVA.
Table 2. Raw ANOVA Table
Descriptive statistics. Descriptive statistics, such as the mean and standard deviation, summarize a set of data.4,5
Mean of A: (1 + 2 + 3) / 3 = 2
Mean of B: (4 + 5 + 6) / 3 = 5
Mean of C: (7 + 8 + 9) / 3 = 8
Grand Mean: (1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9) / 9 = 5
Degrees of freedom. Degrees of freedom (n – 1) represent the number of independent values that a statistical analysis can estimate. More specifically, they define how many values within a set you can select without constraints. Let’s say we have three numbers that add up to 12. There are two degrees of freedom (3 – 1 = 2). After picking the first two numbers, there is no freedom to choose the last number; it is “determined” by the other two. The first and second numbers can be any positive or negative numbers. For example, if the first number is 3 and the second number is 7, the third number must be 2.4,5
Factor: 3 – 1 = 2
Error: 8 – 2 = 6
Total: 9 – 1 = 8
Sum of squares. The sum of the squared deviations of scores from their mean. The total sum of squares helps express the total variation that can be attributed to various factors. The adjusted sum of squares represents the unique portion of the sum of squares explained by a factor, given all other factors in the model, regardless of the order in which they were entered.4,5
Factor (between the levels): 3 × [(2 – 5)² + (5 – 5)² + (8 – 5)²] = 54. (Note: The multiplier “3” is the number of observations within each level, not the number of levels, and “5” is the grand mean.)
Error (within the levels):
SS of A: (1 – 2)² + (2 – 2)² + (3 – 2)² = 2
SS of B: (4 – 5)² + (5 – 5)² + (6 – 5)² = 2
SS of C: (7 – 8)² + (8 – 8)² + (9 – 8)² = 2
Error: 2 + 2 + 2 = 6
Total: 54 + 6 = 60
Mean squares. A term used in the analysis of variance to refer to the variance in the data due to a particular source of variation. Dividing each sum of squares by its degrees of freedom converts it into a mean square. The ratio of the factor mean square to the error mean square determines whether there is a significant difference: the larger the ratio, the greater the factor’s impact on the outcome.4,5
Factor: 54 / 2 = 27
Error: 6 / 6 = 1
F-value. Calculated by dividing the factor mean square by the error mean square. Comparing the F-value with F-critical is an alternative to calculating the p-value. The F-critical value is found in the F-table, using the degrees of freedom for the factor and error, F(2, 6). An F-value greater than F-critical indicates statistical significance.4,5
F-value: 27 / 1 = 27
F-critical: 5.14
P-value. The p-value indicates the probability of observing the given F-value (or a more extreme value) under the assumption that the null hypothesis is true. It is calculated from the F-distribution, F(2, 6), using the F-value.4,5
F-value: 27
Probability (p-value) of X ≥ 27 under F(2, 6): 0.001
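The entire table can be reproduced in a few lines of code. Below is a minimal sketch using NumPy and SciPy (our choice of tools, not the article’s); scipy.stats.f_oneway returns the same F- and p-values in a single call:

import numpy as np
from scipy import stats

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
c = np.array([7, 8, 9])

grand_mean = np.concatenate([a, b, c]).mean()            # 5.0

# Between-level (factor) and within-level (error) sums of squares
ss_factor = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in (a, b, c))  # 54.0
ss_error = sum(((g - g.mean()) ** 2).sum() for g in (a, b, c))             # 6.0

df_factor, df_error = 3 - 1, 9 - 3                       # 2 and 6
ms_factor = ss_factor / df_factor                        # 27.0
ms_error = ss_error / df_error                           # 1.0
f_value = ms_factor / ms_error                           # 27.0
p_value = stats.f.sf(f_value, df_factor, df_error)       # 0.001

# One call reproduces the same F- and p-values
f_check, p_check = stats.f_oneway(a, b, c)
print(f_value, round(p_value, 3), f_check, round(p_check, 3))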
Table 3. Completed ANOVA Table
The ANOVA signals statistically significant results (p-value < 0.05), indicating that not all means are equal. Before taking action, however, the model needs to be validated by examining the residuals. If all looks good, conduct a post-hoc test for all pairwise comparisons. Finally, review the five requirements for data acceptance.
The ANOVA work does not stop when the model is fit. As previously discussed, the second assumption pertains to the distribution of the residuals. If the model is not adequate, it will misrepresent the data, producing, for example, incorrect F- and p-values. Models can be adversely affected by as few as one or two points.4
To validate the model, the assumptions about the distribution of the residuals must be met. These assumptions are that the residuals are normally distributed, independent of one another (no autocorrelation) and homogeneous in variance (equal variances across groups). Residuals are elements of variation that are not explained by the model. Since they are a form of error, the same general principles apply to the group of residuals as would apply to errors in general: one expects them to be normally and independently distributed (NID) with a mean of zero and constant variance, NID(0, σ²). Departures from these assumptions usually mean that the residuals contain unaccounted-for information. Validating the model helps ensure the conclusions drawn are correct, unambiguous and defensible.1,3
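Formal tests can supplement the graphical checks described next. As a sketch, assuming SciPy is available, the Shapiro-Wilk test probes normality and Levene’s test probes homogeneity of variances, applied here to the Table 1 toy data:

import numpy as np
from scipy import stats

groups = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]

# One-way ANOVA residuals: each observation minus its group mean
residuals = np.concatenate([g - g.mean() for g in groups])

w_stat, p_normality = stats.shapiro(residuals)   # H0: residuals are normal
lev_stat, p_equal_var = stats.levene(*groups)    # H0: group variances are equal

# p-values above 0.05 give no evidence against the assumptions
print(round(p_normality, 3), round(p_equal_var, 3))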
Normality. Virtually any graph suitable for displaying the distribution of a set of data can be used to judge the normality of a group of residuals. The two most common are the normal probability plot and the histogram.3,4
Interpretation: The normal probability plot of the residuals should approximately follow a straight line (see Figure 1). The histogram helps identify whether the data are skewed or contain outliers, as shown in Figure 2. For histograms, it’s best to have at least 50 data points (n ≥ 50) to ensure robust interpretation.4

Figure 1. A reasonable probability plot.

Figure 2. A reasonable histogram.
Independence. If the order of the observations in a data table represents the order in which the tests were executed, a plot of the residuals versus that time order will test for lack of independence. For example, drift in equipment will produce residuals with autocorrelation.3,4
Interpretation: Independent residuals exhibit no trends or patterns when displayed in time order. Patterns in the data points indicate that residuals near each other may be correlated and thus not independent. The residuals on the plot should fall randomly around the center line with a mean of zero and constant variance, NID(0, σ²), with no recognizable patterns or trends in the points (Figure 3).4

Figure 3. A reasonable residual versus time order plot.
Homogeneity. Plotting residuals versus the fitted response values should produce a distribution of points scattered randomly about zero, NID(0, σ²), regardless of the size of the fitted value. Quite commonly, however, residual values increase as the fitted values increase. When this happens, the residual cloud becomes “funnel-shaped” with the larger end facing larger fitted values; that is, the residuals exhibit increasing scatter as the value of the response increases.3,4
Interpretation: Ideally, the points should fall randomly around the center line with a mean of zero and constant variance, NID(0, σ²), with no recognizable patterns, trends or outliers in the points (Figure 4).4

Figure 4. A reasonable residual versus fits plot.
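A four-in-one residual display like the one in Figures 1–4 (and Figure 5 below) can be sketched with matplotlib; the plotting choices below are illustrative assumptions on the Table 1 data, not the article’s original graphics:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

groups = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]
fits = np.concatenate([np.full(len(g), g.mean()) for g in groups])
residuals = np.concatenate(groups) - fits
order = np.arange(1, len(residuals) + 1)   # assumes run order = table order

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
stats.probplot(residuals, plot=ax[0, 0])   # normal probability plot
ax[0, 0].set_title("Normal probability plot")
ax[0, 1].hist(residuals, bins="auto")      # histogram (prefer n >= 50 in practice)
ax[0, 1].set_title("Histogram")
ax[1, 0].plot(order, residuals, "o-")      # residuals vs. observation order
ax[1, 0].axhline(0)
ax[1, 0].set_title("Versus order")
ax[1, 1].scatter(fits, residuals)          # residuals vs. fitted values
ax[1, 1].axhline(0)
ax[1, 1].set_title("Versus fits")
plt.tight_layout()
plt.show()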
Post-hoc testing. If the ANOVA indicates a statistical difference (p-value < 0.05) and the model assumptions are validated, post-hoc tests are used to identify which specific groups differ from each other. Standard post-hoc tests include Tukey, Fisher, Dunnett and Hsu MCB. The Tukey and Fisher tests compare all pairs of groups. The Dunnett test compares the treatment groups to a control group. In contrast, the Hsu MCB test compares each group to the group with either the largest or the smallest mean (chosen by the process engineer). The process engineer must consider individual and family error rates in conjunction with post-hoc testing.4
The individual error rate is the maximum probability that a single comparison will incorrectly conclude that the observed difference is statistically significant when the null hypothesis is true. It is equivalent to the alpha level selected (typically 0.05) for the hypothesis test. The family error rate is the maximum probability that a procedure consisting of more than one comparison will incorrectly conclude that at least one of the observed differences is significantly different from the null hypothesis. The family error rate is based on both the individual error rate and the number of comparisons. It is essential to consider the family error rate when making multiple comparisons, as the chance of committing a Type I error over a series of comparisons is greater than the error rate for any one comparison alone.4
The Tukey test is a robust, widely used and popular post-hoc test. It compares all pairs of groups while controlling the simultaneous confidence level (SCL). The SCL is the percentage of times that a group of confidence intervals would all include the true population parameters or true differences between factor levels if the study were repeated multiple times. The SCL is based on both the individual confidence level and the number of confidence intervals. The Tukey family error rate is typically controlled at 0.05 (5%). The trade-off is that Tukey’s confidence intervals are less precise, and its hypothesis tests less powerful, than those of either Dunnett’s or Hsu’s MCB.4,6
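As an illustration, recent versions of SciPy include scipy.stats.tukey_hsd, which runs Tukey’s comparisons directly; applied to the Table 1 toy data it reports every pairwise difference as significant. This is a sketch only, and statsmodels’ pairwise_tukeyhsd is an equally valid alternative:

from scipy import stats

a, b, c = [1, 2, 3], [4, 5, 6], [7, 8, 9]
result = stats.tukey_hsd(a, b, c)   # all pairwise comparisons, family rate 0.05
print(result)                       # mean differences and adjusted p-values

ci = result.confidence_interval(confidence_level=0.95)  # simultaneous 95% CIs
print(ci.low, ci.high, sep="\n")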
There are five requirements if conclusions drawn from data analysis are to be correct, unambiguous and defensible. These five requirements are an equitable sample, stability, statistical significance, practical significance and truth. Each of these is discussed below.
Process characterization is an integral part of any continuous improvement program. There are many steps in that program for which process characterization is required. These include instances when we introduce a new process or tool for use, as well as when we bring a tool or process back online after scheduled/unscheduled maintenance, when we want to compare tools or processes, when we want to check the health of our process during the monitoring phase, when we are troubleshooting a bad process, or when we need to improve a process.3
A young process engineer is working on a process improvement project for their acid copper pulse plating tanks, aiming to improve throwing power. They conduct an experiment looking at three different pulse recipes. The first pulse recipe (P1) is the control (current wave), while recipes P2 and P3 are experimental. The test vehicle is an 18” x 24” panel with 20:1 aspect ratio holes. The engineer plates four panels with each of the three pulse recipes and measures the throwing power. (Note: The runs are randomized to protect against noise variables.) The throwing power percentages, along with descriptive statistics, are shown in Table 4.
Table 4. Throwing Power Percentages and Descriptive Statistics
The process engineer analyzes the throwing power data using an ANOVA. The recipe p-value is less than 0.05, indicating that not all means are equal (Table 5).
Table 5. Pulse Recipe ANOVA
Next, the engineer validates the model by examining the residuals. The probability plot of the residuals approximately follows a straight line. The histogram is ignored due to the presence of fewer than 50 data points, making interpretation difficult. The residuals-versus-order points fall randomly around the center line with a mean of zero and constant variance, NID(0, σ²), with no recognizable patterns or trends. The residuals-versus-fits points fall randomly around the center line with a mean of zero and constant variance, NID(0, σ²), exhibiting no recognizable patterns, trends or outliers. All four plots can be seen in Figure 5. The model has been validated. The process engineer now needs to use a post-hoc test to complete pairwise comparisons.

Figure 5. Four-in-one residual plot.
The process engineer decides to use the Tukey post hoc test. They use the grouping information table to quickly determine whether the mean difference between any pair of groups is statistically significant. Groups that do not share a letter are significantly different. According to the results in Table 6, group A contains recipe P3, group B contains recipe P2, and group C contains recipe P1.
Table 6. Tukey Post-Hoc Test
Discussion. The ANOVA model has been built, validated and a post-hoc test has been completed. The process engineer concludes that all three recipe means are statistically different; the results in the data are unlikely to be explained by chance alone. The data acceptance criteria have been met: equitable sample (18" x 24" panel, 20:1 aspect ratio, four test panels), stability (all parameters were in range during the testing), statistical significance (p-value < 0.05, and residuals are normally and independently distributed with a mean of zero and constant variance, NID(0, σ²)), practical significance (a 36-percentage-point improvement in throwing power) and truth (significant modifications to the pulse waves improve throwing power). Recipe P3 has been statistically shown to improve throwing power over recipe P1 by an average of 36 percentage points (86% – 50%). The process engineer concludes their improvement project’s data are correct, unambiguous and defensible. They can confidently implement the process change.
The analysis of variance (ANOVA) is over 100 years old. Today, it is one of the most useful techniques in the field of statistical inference. The ANOVA permits comparisons between multiple groups using a single test. The work does not stop when the model is fit; the model must be validated by verifying that the residuals are normally distributed, independent of one another and homogeneous in variance. When the ANOVA indicates a statistical difference and the model assumptions have been validated, a post-hoc test can identify which specific groups differ from each other. The Tukey test is a robust, widely used and popular post-hoc test. Finally, data acceptance is based on five requirements: equitable sample, stability, statistical significance, practical significance and truth. Drawing conclusions from an improvement project’s data that are correct, unambiguous and defensible is crucial for the process engineer.
3. NIST Engineering Statistics Handbook, 2012, www.itl.nist.gov/div898/handbook.
is the Lean Six Sigma Manager for Uyemura USA (uyemura.com). He holds a doctorate degree in quality systems management from Cambridge College, a Six Sigma Master Black Belt certification from Arizona State University, and ASQ certifications as a Six Sigma Black Belt and Reliability Engineer.