

Illuminating School Performance Measures: Building a Brighter Flashlight

Mid-Atlantic | June 14, 2023

Attention to historically underserved student groups has long been a focus of federal accountability. In 2022, for the first time in three years, states had to identify low-performing schools, including those where particular subgroups of students are not meeting standards. Secretary Cardona, in an address in January of this year, framed the requirement as an opportunity to rethink accountability. He said:

"We need to recognize once and for all that standardized tests work best when they serve as a flashlight on what works and what needs our attention—not as hammers to drive the outcomes we want in education from the top down, often pointing fingers to those with greater needs and less resources."


In other words, accountability shouldn't be punitive—it should be illuminating, helping us understand where and how we need to improve. But our flashlight doesn't shine a light on everything we need it to. As states across the country know from experience, measuring the performance of subgroups within schools is particularly challenging, especially when the subgroups have small numbers of students. Subgroup performance measures are essential—they're used to identify schools for targeted support and to inform efforts to improve equity. But performance measures for individual subgroups within schools can change due to random luck, bouncing wildly from year to year, making them unstable and unreliable. Fortunately, this challenge has a solution.

It's possible to stabilize subgroup performance measures to reconcile accuracy and equity

Here, we'll take a closer look at stabilizing performance measures using a statistical technique called Bayesian hierarchical modeling. You can also view our new infographic, designed with the broader education community in mind, for a high-level, visual introduction to Bayesian stabilization.

Figure 1. Simulated performance data for a small subgroup

Performance data for a small subgroup might look like what we see in figure 1 (which is based on simulated data). There's a lot of variation, including one large dip that would bring the subgroup under scrutiny for accountability. But it's hard to say whether the subgroup really needed support in that year or whether the dip reflects random bad luck rather than a true change in performance.
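
To get a feel for how much a measure like this can move by chance alone, here's a short simulation sketch (our own illustration, not the data behind figure 1): the subgroup's true proficiency rate never changes, yet the observed rate still jumps around from year to year simply because so few students are measured.

```python
# Simulated example: a small subgroup whose "true" proficiency rate never
# changes still produces very different observed rates each year.
import random

random.seed(1)
true_rate = 0.5    # the subgroup's underlying proficiency rate, held constant
n_students = 10    # a small subgroup, near typical minimum n-sizes

for year in range(2016, 2023):
    proficient = sum(random.random() < true_rate for _ in range(n_students))
    print(year, f"observed rate = {proficient / n_students:.0%}")
# With only 10 students, the observed rate can swing widely from year to
# year purely by chance -- the kind of dip shown in figure 1.
```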

States know that instability is a problem with small numbers, so they set minimum "n-sizes," thresholds for the number of students below which they don't include a subgroup in accountability. But minimum n-sizes cast long shadows; they make the students in small subgroups invisible to accountability, even if they truly need support. In that sense, minimum sample sizes force a tradeoff between accuracy (whether performance measurements are trustworthy) and equity (whether we're including as many students as possible in accountability so they can get the support they need).

Bayesian stabilization can reconcile these two goals, making our flashlight bigger and brighter. This statistical technique draws on all available information, including information from other schools and from past years, to improve our understanding of performance.
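
For readers who want to peek under the hood, here is a minimal sketch of the kind of Bayesian hierarchical model we're describing, written in Python with the PyMC library. It illustrates the general idea of borrowing information across schools; it is not the model used in our studies, and the schools and counts in it are invented.

```python
# A minimal sketch (illustrative only) of a Bayesian hierarchical model:
# each school's subgroup proficiency rate is treated as a draw from a
# statewide distribution, so estimates for small schools are pulled toward
# the statewide average. Requires the PyMC library; the data are made up.
import numpy as np
import pymc as pm

n_students = np.array([8, 12, 9, 250, 180])    # subgroup size in each school
n_proficient = np.array([2, 7, 3, 140, 95])    # students meeting the standard

with pm.Model():
    # Statewide distribution of school-level proficiency (on the logit scale)
    state_mean = pm.Normal("state_mean", 0.0, 1.5)
    school_sd = pm.HalfNormal("school_sd", 1.0)

    # Each school's true rate is a draw from the statewide distribution
    school_logit = pm.Normal("school_logit", state_mean, school_sd,
                             shape=len(n_students))
    school_rate = pm.Deterministic("school_rate", pm.math.invlogit(school_logit))

    # Observed counts of proficient students in each school's subgroup
    pm.Binomial("obs", n=n_students, p=school_rate, observed=n_proficient)

    idata = pm.sample(1000, tune=1000, progressbar=False)

# Posterior means for the small schools land between their noisy observed
# rates and the statewide average; the large schools barely move.
print(idata.posterior["school_rate"].mean(dim=("chain", "draw")).values)
```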

Figure 2. Stabilization increases the reliability of measurements for small subgroups

In figure 2, the purple line shows the statewide average performance for the same subgroup. It follows the same general trend as the other lines, but it's much more stable because it uses information from all students in the subgroup across the state. Bayesian stabilization lets us borrow some of this information to improve our understanding of the subgroup's performance in each school.

The stabilized performance measures, the blue line in figure 2, represent a compromise between the information about the subgroup in one small school and the information about the subgroup statewide. As we've said before, stabilized performance measures are much more reliable than unstabilized measures, especially for small subgroups. We've found this to be true in our own analyses of real-world accountability data: in a REL Mid-Atlantic study conducted for the Pennsylvania Department of Education, we found that stabilization could allow a state to reduce the minimum n-size from 20 students to 10 students while simultaneously improving the reliability of the results.
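
The "compromise" can also be sketched with simple arithmetic. The toy function below (our simplification for illustration, not the model from the Pennsylvania study) pulls a school subgroup's observed rate toward the statewide rate, and it pulls harder when the subgroup is small.

```python
# A back-of-the-envelope version of stabilization: treat the statewide rate
# as if it were worth a fixed number of "pseudo-students," then average it
# with the school's own data. (Illustrative only; all numbers are made up.)

def stabilize_rate(n_students, n_proficient, statewide_rate, prior_strength=20):
    """Shrink a subgroup's observed rate toward the statewide rate.

    The statewide rate counts as `prior_strength` pseudo-students, so the
    smaller the subgroup, the more its estimate moves toward the statewide
    value.
    """
    prior_proficient = statewide_rate * prior_strength
    return (n_proficient + prior_proficient) / (n_students + prior_strength)

# A subgroup of 8 students with 2 proficient, in a state where 55 percent of
# the subgroup is proficient overall:
print(stabilize_rate(8, 2, 0.55))    # about 0.46, between the noisy 0.25 and the stable 0.55

# The same observed rate in a subgroup of 80 students moves much less:
print(stabilize_rate(80, 20, 0.55))  # 0.31, because the school's own data dominate
```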

Figure 3. Reducing the minimum n-size from 20 to 10 students increases the number of subgroups included in accountability, with 61 percent of schools adding at least one subgroup

Bayesian stabilization turns Secretary Cardona's flashlight into a spotlight. It illuminates performance that has previously gone unmeasured, bringing a substantial number of subgroups and students—especially those who have historically been underserved—into accountability, helping us improve student learning for all.

Reducing the minimum n-size shines a light on students and subgroups that would otherwise be in the shadows. In Pennsylvania, we found that a substantial majority of schools—61 percent—would be able to assess at least one additional subgroup that would otherwise be excluded from accountability. Twenty-three percent of schools would be able to assess at least two more subgroups (figure 3).

The benefits are especially large for subgroups that are likely to fall short of the minimum n-size. Again using data from Pennsylvania, we found that reducing the minimum n-size from 20 to 10 students would increase the number of multiracial students included in accountability by 43 percent, and the number of Asian and English learner students included in accountability by 16 percent each.

REL Mid-Atlantic is making Bayesian stabilization more accessible

Pennsylvania and New Jersey, seeking to promote equitable student outcomes and to ensure they ground their decisions in reliable measures of subgroup performance, are working with REL Mid-Atlantic to explore the use of statistical stabilization techniques. You can learn more by visiting the related pages and products on our website.

Of course, not every state has the resources to implement stabilization as part of what are often already complex accountability calculations. To make stabilization available to state and local education agencies across the country, we're exploring the development of a web-based tool that would enable states and large districts to stabilize their own performance measures. We hope this tool will help all states, whatever their size or resource level, to use stabilization to improve their accountability systems.

Author(s)

Lauren Forrow

Brian Gill
Director for REL Mid-Atlantic

Jennifer Starling