Identifying and Implementing Educational Practices Supported By Rigorous Evidence: A User Friendly Guide
December 2003

I. The randomized controlled trial: What it is, and why it is a critical factor in establishing "strong" evidence of an intervention's effectiveness.

Well-designed and implemented randomized controlled trials are considered the "gold standard" for evaluating an intervention's effectiveness, in fields such as medicine, welfare and employment policy, and psychology.7 This section discusses what a randomized controlled trial is, and outlines evidence indicating that such trials should play a similar role in education.

A. Definition: Randomized controlled trials are studies that randomly assign individuals to an intervention group or to a control group, in order to measure the effects of the intervention.

For example, suppose you want to test, in a randomized controlled trial, whether a new math curriculum for third-graders is more effective than your school's existing math curriculum for third-graders. You would randomly assign a large number of third-grade students to either an intervention group, which uses the new curriculum, or to a control group, which uses the existing curriculum. You would then measure the math achievement of both groups over time. The difference in math achievement between the two groups would represent the effect of the new curriculum compared to the existing curriculum.
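The assignment and comparison steps in this example can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the student identifiers, the simulated test scores, the 3-point effect built into the simulated data, and the function names `randomly_assign` and `estimated_effect` are all hypothetical.

```python
import random
from statistics import mean

def randomly_assign(student_ids, seed=0):
    """Shuffle the students and split the shuffled list in half."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def estimated_effect(scores, intervention, control):
    """Mean outcome of the intervention group minus mean outcome of the control group."""
    return mean(scores[s] for s in intervention) - mean(scores[s] for s in control)

# Hypothetical roster of 200 third-graders.
students = [f"student_{i}" for i in range(200)]
intervention, control = randomly_assign(students, seed=42)

# Made-up follow-up math scores, for illustration only: every student draws a
# baseline score, and the new curriculum is assumed to add 3 points.
rng = random.Random(1)
in_intervention = set(intervention)
scores = {
    s: rng.gauss(50, 10) + (3.0 if s in in_intervention else 0.0)
    for s in students
}

print(round(estimated_effect(scores, intervention, control), 2))
```

Because the groups are formed by chance alone, the printed difference is an estimate of the assumed 3-point effect. It will not equal 3 exactly, since any single trial carries sampling noise.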

In a variation on this basic concept, sometimes individuals are randomly assigned to two or more intervention groups as well as to a control group, in order to measure the effects of different interventions in one trial. Also, in some trials, entire classrooms, schools, or school districts - rather than individual students - are randomly assigned to intervention and control groups.

B. The unique advantage of random assignment: It enables you to evaluate whether the intervention itself, as opposed to other factors, causes the observed outcomes.

Specifically, the process of randomly assigning a large number of individuals to either an intervention group or a control group ensures, to a high degree of confidence, that there are no systematic differences between the groups in any characteristics (observed and unobserved) except one - namely, the intervention group participates in the intervention, and the control group does not. Therefore - assuming the trial is properly carried out (per the guidelines below) - the resulting difference in outcomes between the intervention and control groups can confidently be attributed to the intervention and not to other factors.
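A small simulation can illustrate why the number of individuals matters. In the sketch below, `trait` stands in for a hypothetical unobserved characteristic (such as motivation), and `balance_gap` is an illustrative helper that reports how far apart the two randomly formed groups are on that characteristic.

```python
import random
from statistics import mean

def balance_gap(n, seed=0):
    """Absolute difference in the average unobserved trait between two random groups."""
    rng = random.Random(seed)
    # Hypothetical unobserved trait, one value per individual.
    trait = [rng.gauss(0, 1) for _ in range(n)]
    # Random assignment: shuffle, then split in half.
    rng.shuffle(trait)
    half = n // 2
    return abs(mean(trait[:half]) - mean(trait[half:]))

for n in (20, 200, 20000):
    print(n, round(balance_gap(n), 3))
```

With only a handful of individuals the two groups can differ noticeably by chance. As the number assigned grows, the gap tends to shrink toward zero, even though the trait was never measured or used in the assignment.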

C. There is persuasive evidence that the randomized controlled trial, when properly designed and implemented, is superior to other study designs in measuring an intervention's true effect.

1. "Pre-post" study designs often produce erroneous results.

Definition: A "pre-post" study examines whether participants in an intervention improve or regress during the course of the intervention, and then attributes any such improvement or regression to the intervention.

The problem with this type of study is that, without reference to a control group, it cannot answer whether the participants' improvement or decline would have occurred anyway, even without the intervention. This often leads to erroneous conclusions about the effectiveness of the intervention.

Example: A randomized controlled trial of Even Start - a federal program designed to improve the literacy of disadvantaged families - found that the program had no effect on improving the school readiness of participating children at the 18-month follow-up. Specifically, there were no significant differences between young children in the program and those in the control group on measures of school readiness including the Peabody Picture Vocabulary Test (PPVT) and PreSchool Inventory.8

If a pre-post design rather than a randomized design had been used in this study, the study would have concluded erroneously that the program was effective in increasing school readiness. This is because both the children in the program and those in the control group showed improvement in school readiness during the course of the program (e.g., both groups of children improved substantially in their national percentile ranking on the PPVT). A pre-post study would have attributed the participants' improvement to the program whereas in fact it was the result of other factors, as evidenced by the equal improvement for children in the control group.

Example: A randomized controlled trial of the Summer Training and Education Program - a Labor Department pilot program that provided summer remediation and work experience for disadvantaged teenagers - found that the program's short-term impact on participants' reading ability was positive. Specifically, while the reading ability of the control group members eroded by a full grade-level during the first summer of the program, the reading ability of participants in the program eroded by only a half grade-level.9

If a pre-post design rather than a randomized design had been used in this study, the study would have concluded erroneously that the program was harmful. That is, the study would have found a decline in participants' reading ability and attributed it to the program. In fact, however, the participants' decline in reading ability was the result of other factors - such as the natural erosion of reading ability during the summer vacation months - as evidenced by the even greater decline for members of the control group.
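The arithmetic behind this example can be written out directly. The sketch below uses the grade-level changes reported above; the function names `pre_post_estimate` and `randomized_estimate` are illustrative only.

```python
def pre_post_estimate(participant_change):
    """A pre-post design attributes the participants' entire change to the program."""
    return participant_change

def randomized_estimate(participant_change, control_change):
    """A randomized design subtracts the change that occurred anyway in the control group."""
    return participant_change - control_change

# Change in reading ability over the first summer, in grade levels.
participant_change = -0.5  # program participants lost half a grade level
control_change = -1.0      # control group lost a full grade level

print(pre_post_estimate(participant_change))                    # -0.5
print(randomized_estimate(participant_change, control_change))  # 0.5
```

The pre-post figure is negative, which would suggest harm. The randomized comparison is positive: participants ended the summer half a grade level ahead of where they would otherwise have been.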

2. The most common "comparison group" study designs (also known as "quasi-experimental" designs) also lead to erroneous conclusions in many cases.

a. Definition: A "comparison group" study compares outcomes for intervention participants with outcomes for a comparison group chosen through methods other than randomization.

The following example illustrates the basic concept of this design. Suppose you want to use a comparison-group study to test whether a new mathematics curriculum is effective. You would compare the math performance of students who participate in the new curriculum ("intervention group") with the performance of a "comparison group" of students, chosen through methods other than randomization, who do not participate in the curriculum. The comparison group might be students in neighboring classrooms or schools that don't use the curriculum, or students in the same grade and socioeconomic status selected from state or national survey data. The difference in math performance between the intervention and comparison groups following the intervention would represent the estimated effect of the curriculum.

Some comparison-group studies use statistical techniques to create a comparison group that is matched with the intervention group in socioeconomic and other characteristics, or to otherwise adjust for differences between the two groups that might lead to inaccurate estimates of the intervention's effect. The goal of such statistical techniques is to simulate a randomized controlled trial.

b. There is persuasive evidence that the most common comparison-group designs produce erroneous conclusions in a sizeable number of cases.

A number of careful investigations have been carried out - in the areas of school dropout prevention,10 K-3 class-size reduction,11 and welfare and employment policy12 - to examine whether and under what circumstances comparison-group designs can replicate the results of randomized controlled trials.13 These investigations first compare participants in a particular intervention with a control group, selected through randomization, in order to estimate the intervention's impact in a randomized controlled trial. Then the same intervention participants are compared with a comparison group selected through methods other than randomization, in order to estimate the intervention's impact in a comparison-group design. Any systematic difference between the two estimates represents the inaccuracy produced by the comparison-group design.

These investigations have shown that most comparison-group designs in education and other areas produce inaccurate estimates of an intervention's effect. This is because of unobservable differences between the members of the two groups that differentially affect their outcomes. For example, if intervention participants self-select into the intervention group, they may be more motivated to succeed than their comparison-group counterparts. Their motivation - rather than the intervention - may then lead to their superior outcomes. In a sizeable number of cases, the inaccuracy produced by the comparison-group designs is large enough to result in erroneous overall conclusions about whether the intervention is effective, ineffective, or harmful.
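The self-selection problem can be illustrated with a simplified simulation. It is a sketch under stated assumptions, not a reproduction of any of the investigations cited: the intervention is assumed to have no true effect, an unobserved `motivation` score is assumed to raise outcomes, and more motivated people are assumed to be more likely to opt in. All names and numeric settings are hypothetical.

```python
import random
from statistics import mean

TRUE_EFFECT = 0.0  # assumption: the intervention itself does nothing

def outcome(motivation, treated, rng):
    """Hypothetical outcome: driven by motivation and noise, plus any true effect."""
    return 50 + 5 * motivation + TRUE_EFFECT * treated + rng.gauss(0, 5)

def simulate(n=20000, seed=0):
    rng = random.Random(seed)
    motivation = [rng.gauss(0, 1) for _ in range(n)]

    # Randomized design: a coin flip decides who receives the intervention.
    randomized = [rng.random() < 0.5 for _ in range(n)]
    # Comparison-group design: more motivated people tend to opt in.
    self_selected = [m + rng.gauss(0, 1) > 0 for m in motivation]

    def difference_in_means(treated_flags):
        y = [outcome(m, t, rng) for m, t in zip(motivation, treated_flags)]
        treated_y = [v for v, t in zip(y, treated_flags) if t]
        untreated_y = [v for v, t in zip(y, treated_flags) if not t]
        return mean(treated_y) - mean(untreated_y)

    experimental = difference_in_means(randomized)
    nonexperimental = difference_in_means(self_selected)
    # The gap between the two estimates is the inaccuracy of the comparison-group design.
    return experimental, nonexperimental, nonexperimental - experimental

experimental, nonexperimental, inaccuracy = simulate()
print(round(experimental, 2), round(nonexperimental, 2), round(inaccuracy, 2))
```

Under these assumptions the randomized estimate stays close to the true effect of zero, while the self-selected comparison shows a clearly positive "effect" that comes entirely from motivation. Statistical matching can remove such a gap only to the extent that the characteristic driving it is actually measured.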

Example from medicine. Over the past 30 years, more than two dozen comparison-group studies have found hormone replacement therapy for postmenopausal women to be effective in reducing the women's risk of coronary heart disease, by about 35-50 percent. But when hormone therapy was finally evaluated in two large-scale randomized controlled trials - medicine's "gold standard" - it was actually found to do the opposite: it increased the risk of heart disease, as well as stroke and breast cancer.14

Medicine contains many other important examples of interventions whose effect as measured in comparison-group studies was subsequently contradicted by well-designed randomized controlled trials. If randomized controlled trials in these cases had never been carried out and the comparison-group results had been relied on instead, the result would have been needless death or serious illness for millions of people. This is why the Food and Drug Administration and National Institutes of Health generally use the randomized controlled trial as the final arbiter of which medical interventions are effective and which are not.

3. Well-matched comparison-group studies can be valuable in generating hypotheses about "what works," but their results need to be confirmed in randomized controlled trials.

The investigations, discussed above, that compare comparison-group designs with randomized controlled trials generally support the value of comparison-group designs in which the comparison group is very closely matched with the intervention group in prior test scores, demographics, time period in which they are studied, and methods used to collect outcome data. Such well-matched comparison-group designs seem to yield correct overall conclusions in most cases about whether an intervention is effective, ineffective, or harmful. However, their estimates of the size of the intervention's impact are still often inaccurate. As an illustrative example, a well-matched comparison-group study might find that a program to reduce class size raises test scores by 40 percentile points - or, alternatively, by 5 percentile points - when its true effect is 20 percentile points. Such inaccuracies are large enough to lead to incorrect overall judgments about the policy or practical significance of the intervention in a nontrivial number of cases.

As discussed in section III of this Guide, we believe that such well-matched studies can play a valuable role in education, as they have in medicine and other fields, in establishing "possible" evidence of an intervention's effectiveness, and thereby generating hypotheses that merit confirmation in randomized controlled trials. But the evidence cautions strongly against using even the most well-matched comparison-group studies as a final arbiter of what is effective and what is not, or as a reliable guide to the strength of the effect.

D. Thus, we believe there are compelling reasons why randomized controlled trials are a critical factor in establishing "strong" evidence of an intervention's effectiveness.

7 See, for example, the Food and Drug Administration's standard for assessing the effectiveness of pharmaceutical drugs and medical devices, at 21 C.F.R. §314.126. See also, "The Urgent Need to Improve Health Care Quality," Consensus statement of the Institute of Medicine National Roundtable on Health Care Quality, Journal of the American Medical Association, vol. 280, no. 11, September 16, 1998, p. 1003; and Gary Burtless, "The Case for Randomized Field Trials in Economic and Policy Research," Journal of Economic Perspectives, vol. 9, no. 2, spring 1995, pp. 63-84.

8 Robert G. St. Pierre et al., "Improving Family Literacy: Findings From the National Even Start Evaluation," Abt Associates, September 1996.

9 Jean Baldwin Grossman, "Evaluating Social Policies: Principles and U.S. Experience," The World Bank Research Observer, vol. 9, no. 2, July 1994, pp. 159-181.

10 Roberto Agodini and Mark Dynarski, "Are Experiments the Only Option? A Look at Dropout Prevention Programs," Mathematica Policy Research, Inc., August 2001.

11 Elizabeth Ty Wilde and Rob Hollister, "How Close Is Close Enough? Testing Nonexperimental Estimates of Impact against Experimental Estimates of Impact with Education Test Scores as Outcomes," Institute for Research on Poverty Discussion paper, no. 1242-02, 2002.

12 Howard S. Bloom et al., "Can Nonexperimental Comparison Group Methods Match the Findings from a Random Assignment Evaluation of Mandatory Welfare-to-Work Programs?" MDRC Working Paper on Research Methodology, June 2002; James J. Heckman, Hidehiko Ichimura, and Petra E. Todd, "Matching As An Econometric Evaluation Estimator: Evidence from Evaluating a Job Training Programme," Review of Economic Studies, vol. 64, no. 4, 1997, pp. 605-654; Daniel Friedlander and Philip K. Robins, "Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods," American Economic Review, vol. 85, no. 4, September 1995, pp. 923-937; Thomas Fraker and Rebecca Maynard, "The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs," Journal of Human Resources, vol. 22, no. 2, spring 1987, pp. 194-227; Robert J. LaLonde, "Evaluating the Econometric Evaluations of Training Programs With Experimental Data," American Economic Review, vol. 76, no. 4, September 1986, pp. 604-620.

13 This literature, including the studies listed in the three preceding endnotes, is systematically reviewed in Steve Glazerman, Dan M. Levy, and David Myers, "Nonexperimental Replications of Social Experiments: A Systematic Review," Mathematica Policy Research discussion paper, no. 8813-300, September 2002. The portion of this review addressing labor market interventions is published in "Nonexperimental versus Experimental Estimates of Earnings Impact," The Annals of the American Academy of Political and Social Science, vol. 589, September 2003.

14 J.E. Manson et al., "Estrogen Plus Progestin and the Risk of Coronary Heart Disease," New England Journal of Medicine, August 7, 2003, vol. 349, no. 6, pp. 519-522. International Position Paper on Women's Health and Menopause: A Comprehensive Approach, National Heart, Lung, and Blood Institute of the National Institutes of Health, and Giovanni Lorenzini Medical Science Foundation, NIH Publication No. 02-3284, July 2002, pp. 159-160. Stephen MacMahon and Rory Collins, "Reliable Assessment of the Effects of Treatment on Mortality and Major Morbidity, II: Observational Studies," The Lancet, vol. 357, February 10, 2001, p. 458. Sylvia Wassertheil-Smoller et al., "Effect of Estrogen Plus Progestin on Stroke in Postmenopausal Women - The Women's Health Initiative: A Randomized Controlled Trial," Journal of the American Medical Association, May 28, 2003, vol. 289, no. 20, pp. 2673-2684.