Identifying and Implementing Educational Practices Supported By Rigorous Evidence: A User Friendly Guide
December 2003

III. How to evaluate whether an intervention is backed by "possible" evidence of effectiveness

Because well-designed and implemented randomized controlled trials are not very common in education, the evidence supporting an intervention frequently falls short of the above criteria for "strong" evidence of effectiveness in one or more respects. For example, the supporting evidence may consist of:

  • Only nonrandomized studies;
  • Only one well-designed randomized controlled trial showing the intervention's effectiveness at a single site;
  • Randomized controlled trials whose design and implementation contain one or more flaws noted above (e.g., high attrition);
  • Randomized controlled trials showing the intervention's effectiveness as implemented by researchers in a laboratory-like setting, rather than in a typical school or community setting; or
  • Randomized controlled trials showing the intervention's effectiveness for students with different academic skills and socioeconomic backgrounds than the students in your schools or classrooms.

Whether an intervention not supported by "strong" evidence is nevertheless supported by "possible" evidence of effectiveness (as opposed to no meaningful evidence of effectiveness) is a judgment call. It depends, for example, on the extent of the flaws in the randomized controlled trials of the intervention and on the quality of any nonrandomized studies that have been done. While this Guide cannot foresee and provide advice on every possible combination of evidence, this section offers a few factors to consider in making that judgment.

A. Circumstances in which a comparison-group study can constitute "possible" evidence of effectiveness:

1. The study's intervention and comparison groups should be very closely matched in academic achievement levels, demographics, and other characteristics prior to the intervention.

The investigations, discussed in section I, that compare comparison-group designs with randomized controlled trials generally support the value of comparison-group designs in which the comparison group is very closely matched with the intervention group. In the context of education studies, the two groups should be matched closely in characteristics including:

  • Prior test scores and other measures of academic achievement (preferably, the same measures that the study will use to evaluate outcomes for the two groups);
  • Demographic characteristics, such as age, sex, ethnicity, poverty level, parents' educational attainment, and single or two-parent family background;
  • Time period in which the two groups are studied (e.g., the two groups are children entering kindergarten in the same year as opposed to sequential years); and
  • Methods used to collect outcome data (e.g., the same test of reading skills administered in the same way to both groups).

These investigations have also found that when the intervention and comparison groups differ in such characteristics, the study is unlikely to generate accurate results even when statistical techniques are then used to adjust for these differences in estimating the intervention's effects.
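One way a reader might check how closely two groups are matched on a baseline characteristic is to compute a standardized mean difference: the gap between the group means divided by a pooled standard deviation. The sketch below is illustrative only. The function name and the score values are hypothetical, and the Guide itself does not prescribe this statistic or any cutoff for it.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_mean_difference(intervention, comparison):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = sqrt((stdev(intervention) ** 2 + stdev(comparison) ** 2) / 2)
    return (mean(intervention) - mean(comparison)) / pooled_sd


# Hypothetical pre-intervention reading scores (illustrative values only).
intervention_scores = [48, 52, 50, 47, 53, 51, 49, 50]
closely_matched = [49, 51, 50, 48, 52, 50, 49, 51]
poorly_matched = [58, 61, 60, 57, 63, 59, 62, 60]

smd_close = standardized_mean_difference(intervention_scores, closely_matched)
smd_poor = standardized_mean_difference(intervention_scores, poorly_matched)

print(f"Closely matched comparison group: {smd_close:.2f}")
print(f"Poorly matched comparison group:  {smd_poor:.2f}")
```

A value near zero suggests the groups started from a similar place on that characteristic. A large value in either direction signals the kind of pre-existing difference that statistical adjustment is unlikely to repair. The same check would need to be repeated for each characteristic listed above.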

2. The comparison group should not be composed of individuals who had the option to participate in the intervention but declined.

This is because individuals choosing not to participate in an intervention may differ systematically in their level of motivation and other important characteristics from the individuals who do choose to participate. The difference in motivation (or other characteristics) may itself lead to different outcomes for the two groups, and thus contaminate the study's estimates of the intervention's effects.

Therefore, the comparison group should be composed of individuals who did not have the option to participate in the intervention, rather than individuals who had the option but declined.
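The contamination described above can be illustrated with a small simulation. The sketch below rests on stated assumptions: the outcome model, the motivation scale, and the rule that the more motivated half of students opts in are all hypothetical, and the intervention is deliberately given no true effect.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

TRUE_EFFECT = 0.0  # assumption: the intervention does nothing at all


def simulated_score(motivation, participated):
    """Hypothetical outcome model: motivation raises scores; the
    intervention adds TRUE_EFFECT for those who participated."""
    effect = TRUE_EFFECT if participated else 0.0
    return 50 + 10 * motivation + effect + random.gauss(0, 1)


# Every student is offered the intervention; motivation ranges from 0 to 1.
motivation_levels = [random.random() for _ in range(1000)]

# Assumption: the more motivated half opts in and the rest decline.
participants = [simulated_score(m, True) for m in motivation_levels if m > 0.5]
decliners = [simulated_score(m, False) for m in motivation_levels if m <= 0.5]

apparent_effect = mean(participants) - mean(decliners)
print(f"True effect: {TRUE_EFFECT}, apparent effect: {apparent_effect:.1f}")
```

Under these assumptions the participants outscore the decliners by several points even though the intervention contributes nothing. The entire gap comes from the difference in motivation between those who chose to participate and those who declined.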

3. The study should preferably choose the intervention/comparison groups and outcome measures "prospectively" - that is, before the intervention is administered.

This is because if the groups and outcome measures are chosen by the researchers after the intervention is administered ("retrospectively"), the researchers may consciously or unconsciously select groups and outcome measures so as to generate their desired results. Furthermore, it is often difficult or impossible for the reader of the study to determine whether the researchers did so.

Prospective comparison-group studies are, like randomized controlled trials, much less susceptible to this problem. In the words of the director of drug evaluation for the Food and Drug Administration, "The great thing about a [randomized controlled trial or prospective comparison-group study] is that, within limits, you don't have to believe anybody or trust anybody. The planning for [the study] is prospective; they've written the protocol before they've done the study, and any deviation that you introduce later is completely visible." By contrast, in a retrospective study, "you always wonder how many ways they cut the data. It's very hard to be reassured, because there are no rules for doing it."20

4. The study should meet the guidelines set out in section II for a well-designed randomized controlled trial (other than guideline 2 concerning the random-assignment process).

That is, the study should use valid outcome measures, have low attrition, report tests for statistical significance, and so on.

B. Studies that do not meet the threshold for "possible" evidence of effectiveness:

1. Pre-post studies, which often produce erroneous results, as discussed in section I.

2. Comparison-group studies in which the intervention and comparison groups are not well-matched.

As discussed in section I, such studies also produce erroneous results in many cases, even when statistical techniques are used to adjust for differences between the two groups.

Example. As reported in Education Week, several comparison-group studies have been carried out to evaluate the effects of "high-stakes testing" - i.e., state-level policies in which student test scores are used to determine various consequences, such as whether the students graduate or are promoted to the next grade, whether their teachers are awarded bonuses, or whether their school is taken over by the state. These studies compare changes in test scores and dropout rates for students in states with high-stakes testing (the intervention group) to those for students in other states (the comparison groups). Because students in different states differ in many characteristics, such as demographics and initial levels of academic achievement, it is unlikely that these studies provide accurate measures of the effects of high-stakes testing. It is not surprising that these studies reach differing conclusions about the effects of such testing.21

3. "Meta-analyses" that combine the results of individual studies that do not themselves meet the threshold for "possible" evidence.

Meta-analysis is a quantitative technique for combining the results of individual studies, a full discussion of which is beyond the scope of this Guide. We merely note that when meta-analysis is used to combine studies that themselves may generate erroneous results - such as randomized controlled trials with significant flaws, poorly matched comparison-group studies, and pre-post studies - it will often produce erroneous results as well.

Example. A meta-analysis combining the results of many nonrandomized studies of hormone replacement therapy found that such therapy significantly lowered the risk of coronary heart disease.22 But, as noted earlier, when hormone therapy was subsequently evaluated in two large-scale randomized controlled trials, it was actually found to do the opposite - namely, it increased the risk of coronary disease. The meta-analysis merely reflected the inaccurate results of the individual studies, producing more precise, but still erroneous, estimates of the therapy's effect.
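The point that pooling yields "more precise, but still erroneous" estimates can be seen in a small worked sketch. It uses fixed-effect inverse-variance weighting, which is one common way to combine studies and is an assumption here, since this Guide does not say which method the cited meta-analysis used. The estimates and standard errors are hypothetical, with negative values standing for an apparent benefit.

```python
from math import sqrt


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance weighted average of study estimates and its standard error."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))
    return pooled, pooled_se


# Hypothetical results from four nonrandomized studies that share the same
# bias: each one reports an apparent benefit (a negative estimate).
estimates = [-0.30, -0.25, -0.35, -0.30]
standard_errors = [0.10, 0.10, 0.10, 0.10]

pooled, pooled_se = pool_fixed_effect(estimates, standard_errors)
print(f"Pooled estimate: {pooled:.2f}, standard error: {pooled_se:.2f}")
```

With these inputs the pooled standard error is half that of any single study, so the combined result looks more certain. The pooled estimate, however, simply averages the studies' shared bias; combining them does nothing to remove it.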

20 Robert J. Temple, Director of the Office of Medical Policy, Center for Drug Evaluation and Research, Food and Drug Administration, quoted in Gary Taubes, "Epidemiology Faces Its Limits," Science, vol. 269, issue 5221, p. 169.

21 Debra Viadero, "Researchers Debate Impact of Tests," Education Week, vol. 22, no. 21, February 5, 2003, p. 1.

22 E. Barrett-Connor and D. Grady, "Hormone Replacement Therapy, Heart Disease, and Other Considerations," Annual Review of Public Health, vol. 19, 1998, pp. 55-72.