
Statistically Significant Doesn't Mean Meaningful

Mark Schneider, Director of IES | February 13, 2024

In the first section of this post, I share growing concerns about the potential for misinterpreting results when we focus solely on statistical significance. In the second section, Brian Gill (bgill@mathematica-mpr.com) joins me to discuss how Bayesian approaches are a promising solution to this challenge.

Starting with our very first statistics course, most of us were taught that random variation can lead us to misidentify a difference between groups or a change over time when there is no meaningful difference or change. All measurement includes some amount of random error, which means randomness can fool us into putting too much stock into apparent differences that do not reflect meaningful differences in true values. To minimize these mistakes, we're taught to calculate p-values to assess "statistical significance." Many of us were led to believe that a p-value < .05—which serves as the bright line for statistical significance in education and many other fields—indicates that there is under a 5 percent likelihood that the differences we see in our data are due to chance.

Unfortunately, that's not what p-values mean at all. As the American Statistical Association has been warning for years, a p-value doesn't directly translate into the probability that a finding is due to chance. (Readers should check out the ASA statement and its accompanying commentary.)
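
To see why the intuitive reading fails, consider a quick simulation. The sketch below (in Python, with assumed, purely illustrative base rates and statistical power, not figures from any real body of research) shows that among findings that clear the p<.05 bar, the share that are actually due to chance can be far higher than 5 percent.

```python
# A small simulation, under assumed base rates, showing why p < .05 does not
# mean "under a 5 percent chance the finding is due to chance": among results
# that clear the .05 bar, far more than 5 percent can come from true nulls.
import numpy as np

rng = np.random.default_rng(0)
n_studies = 100_000
true_effect_rate = 0.10   # assume only 10% of tested differences are real
power = 0.50              # assume 50% power to detect a real difference
alpha = 0.05              # the conventional significance threshold

has_effect = rng.random(n_studies) < true_effect_rate
significant = np.where(has_effect,
                       rng.random(n_studies) < power,    # real effects detected with prob = power
                       rng.random(n_studies) < alpha)    # true nulls flagged with prob = alpha

false_discovery_share = np.mean(~has_effect[significant])
print(f"Share of significant results that are due to chance: {false_discovery_share:.0%}")
# With these assumed rates, roughly 47 percent, nearly half, not 5 percent.
```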

Moreover, p-values and tests of statistical significance say nothing about the size of an effect or whether a difference is educationally meaningful. In a large sample, a difference that is statistically significant might be trivial; in a small sample, substantively important differences might not reach statistical significance.

If there is any doubt as to the frequency with which p-values and significance tests are misinterpreted, look no further than media coverage of the 2022 NAEP results. When NCES reports NAEP results for individual states and cities, or for different demographic groups, it reports whether any changes since the prior administration were significantly positive, significantly negative, or non-significant. This is where the trouble starts.

In the wake of the large pandemic-related learning losses registered on the 2022 NAEP assessments, when not a single state registered a statistically significant increase in reading or math, many commentators, echoing language that NCES itself put in its media materials, deemed places with non-significant declines to be "holding steady." Some pronounced large urban school districts to be "bright spots" because, for example, most of the participating urban districts did not have statistically significant declines in 4th-grade reading, even though statistically significant declines were registered for the country, for all four large geographic regions, and for more than half the states.

But this is not how p-values and significance tests work. A failure to find a significant change doesn't mean there wasn't a change. NAEP tests fewer students in cities than in states, and statistical significance is sensitive to sample size—which means identical declines for a city and state might be statistically significant for the state but not for the city.
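
To make that concrete, here is a simplified sketch of the arithmetic. It ignores NAEP's complex sampling and scaling design, and the sample sizes and score standard deviation (roughly 38 points) are hypothetical round numbers, but it shows how the very same 3-point decline can be statistically significant with a state-sized sample and not with a city-sized one.

```python
# Illustrative only: a plain two-sample z-test that ignores NAEP's complex
# sampling design. Sample sizes and the score SD are assumed values.
from math import sqrt
from scipy.stats import norm

def p_value_for_decline(decline, sd, n_before, n_after):
    """Two-sided p-value for a mean-score decline of `decline` points."""
    se = sd * sqrt(1 / n_before + 1 / n_after)   # standard error of the difference in means
    z = decline / se
    return 2 * norm.sf(abs(z))

sd = 38       # assumed SD of individual NAEP reading scores
decline = 3   # identical 3-point decline for the state and the city

print(p_value_for_decline(decline, sd, 2500, 2500))  # state-sized samples: p ~ .005 (significant)
print(p_value_for_decline(decline, sd, 1000, 1000))  # city-sized samples:  p ~ .08  (not significant)
```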

In fact, most urban districts were not holding steady in 4th-grade reading. More than half showed declines at least as large as the average decline nationwide (three NAEP points and statistically significant). The average decline for the 26 urban districts was the same as the average decline nationwide (and statistically significant).

How can we do better? Below, Brian Gill and I discuss how Bayesian statistics can help us better understand recent NAEP results while moving the field away from the p<.05 bright line that leads to misinterpretations of the data.

The purpose of this blog is not to cement Bayesian statistics into place as the only alternative to the p<.05 standard, but to help propel ongoing discussions about alternatives.

The Bayesian alternative

One way we can avoid these interpretive errors is to take advantage of Bayesian methods, which draw on related data from multiple states, cities, and student groups to identify meaningful differences and similarities across the country. Consider how using a range of related data might affect the interpretation of results for Denver, CO, where a 5-point decline in 4th-grade NAEP reading scores was not statistically significant. Meanwhile, 4th graders across Colorado showed a (not statistically significant) 2-point decline in reading scores and, as noted above, 4th graders across the U.S. showed a statistically significant 3-point decline. In short, other data are consistent with an actual 4th-grade reading decline in Denver.
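
As a rough illustration of how this "borrowing strength" works, the sketch below uses a simple conjugate normal-normal model: Denver's noisy estimate is shrunk toward the national decline in proportion to how precisely each quantity is measured. The prior spread across districts and Denver's standard error are assumed values chosen for illustration; the actual analysis Mathematica conducted (described below) is more elaborate.

```python
# A minimal normal-normal shrinkage sketch: combine Denver's noisy observed
# decline with the national decline, weighting each by its precision. The
# prior SD and Denver's standard error are hypothetical, illustrative values.

prior_mean, prior_sd = 3.0, 2.0   # national 4th-grade reading decline; assumed spread across districts
obs_decline, obs_se = 5.0, 2.0    # Denver's observed decline; assumed standard error

prior_prec = 1 / prior_sd**2
obs_prec = 1 / obs_se**2

post_var = 1 / (prior_prec + obs_prec)
post_mean = post_var * (prior_prec * prior_mean + obs_prec * obs_decline)

print(f"Posterior mean decline: {post_mean:.1f} points")   # pulled partway toward the national figure
print(f"Posterior SD:           {post_var**0.5:.1f} points")
```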

We can get a sense of the magnitude of Denver's 5-point decline by comparing it with educationally relevant benchmarks, such as the estimate from Stanford's Educational Opportunity Project that a year of learning is approximately equivalent to 10 points on the NAEP scale. By that metric, the decline observed in Denver amounts to about half a year of learning, which is certainly educationally meaningful and larger than the national decline, even though Denver's decline isn't statistically significant under the p<.05 standard.

Further, Bayesian analysis can—unlike p-values—estimate the probability that a decline occurred or exceeded an educationally meaningful threshold. The Stanford analysis suggests that 3 points on NAEP scales is a little more than a quarter of a year of learning—a plausible standard for "educationally meaningful." Using that standard, Mathematica estimated a 74 percent probability that Denver's 4th-grade reading scores declined by an educationally meaningful amount of at least 3 points (and a 99 percent probability that they declined at all). A similar analysis could be conducted using any threshold for "educationally meaningful" that might be of interest. (Here's a challenge: We all know how the p<.05 threshold leads to "p-hacking"; how can we spot and avoid Bayesian bouts of "threshold hacking," where different stakeholders argue for different thresholds that suit their interests?)
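
Once a posterior distribution for the decline is in hand (here assumed to be approximately normal), the probability that the decline exceeds any threshold of interest is a single tail-area calculation. The sketch below reuses the hypothetical posterior from the previous example, so its probabilities are illustrative only and are not the figures from Mathematica's analysis.

```python
# Tail-area probabilities under an assumed, approximately normal posterior for
# Denver's decline. The posterior mean and SD are the hypothetical values from
# the shrinkage sketch above, not Mathematica's published estimates.
from scipy.stats import norm

post_mean, post_sd = 4.0, 1.4   # assumed posterior for Denver's 4th-grade reading decline (in NAEP points)

p_any_decline = norm.sf(0, loc=post_mean, scale=post_sd)  # P(decline > 0 points)
p_meaningful  = norm.sf(3, loc=post_mean, scale=post_sd)  # P(decline >= 3 points, roughly a quarter year of learning)

print(f"P(any decline):                      {p_any_decline:.1%}")
print(f"P(educationally meaningful decline): {p_meaningful:.1%}")
```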

A Bayesian interpretive approach can be applied to many research contexts as an alternative to p-values. IES has supported the field's transition in a Bayesian direction by providing guidance to the Regional Educational Laboratories, by publishing research reports and evaluation studies that use Bayesian methods, and by developing tools like BASIE, a framework for the Bayesian Interpretation of Estimates. Moreover, other federal statistical agencies are using Bayesian statistics: for example, the Census Bureau uses Bayesian small-area estimation to produce more reliable statistics for counties and other small geographic units from nationwide data.

Clearly, change is in the wind, and the field is exploring methods that are not only statistically appropriate but can also provide more meaningful insights that the public can better understand.

This is a movement that IES needs to help lead. Doing so will hopefully end (to repurpose a Diane Ravitch book title) the reign of error perpetuated by the misuse of traditional significance testing and the p<.05 standard.

As always, please let me know what you think: mark.schneider@ed.gov