Policy Recommendation: That Congress revise the statutory definition of "scientifically based research" so that it includes studies likely to produce valid conclusions about a program's effectiveness, and excludes studies that often produce erroneous conclusions.

Resolution Adopted by the National Board for Education Sciences, October 31, 2007

The Problem: The current definition includes some study designs that can produce erroneous findings about program effectiveness, leading to practices that are ineffective or possibly harmful.

Many of the Department of Education programs authorized in the No Child Left Behind Act (NCLB) require program grantees to implement educational practices that are based on "scientifically based research" or "scientifically based reading research." Similarly, the Education Sciences Reform Act (ESRA) requires the Institute of Education Sciences' research activities to follow "scientifically based research standards." Currently the law defines these terms quite broadly. To elaborate—

"Scientifically based research" now includes studies that compare program participants to a "control group" of non-participants, without restrictions on how the controls are selected.

The current definitions of "scientifically based reading research" and "scientifically based research standards" are even broader, with no requirement for a control group.

"Scientifically based research" thus currently encompasses studies with very different levels of rigor, including the following:

  • Well-designed and implemented randomized controlled trials — which, when feasible, are widely recognized as the strongest design for evaluating a program's effectiveness.

    The unique advantage of such studies is that they enable one to assess whether the program itself, as opposed to other factors, causes the observed outcomes. This is because the process of randomly assigning a sufficiently large number of individuals to either a program group or a control group ensures, to a high degree of confidence, that there are no systematic differences between the two groups in any characteristics (observed and unobserved) except one — the program group participates in the program, and the control group does not. Thus the resulting difference in outcomes between the two groups can confidently be attributed to the program and not to other factors.1 (Such studies are sometimes called "experimental" studies.)

  • Well-matched comparison-group studies, which evidence suggests can be a second-best alternative when a randomized controlled trial is not feasible.

    Such studies compare program participants to a group of non-participants selected through means other than random assignment, but who are very closely matched with participants in key characteristics, such as prior educational achievement, demographics, and motivation (e.g., through matching methods such as "propensity scores," or selection of sample members just above and just below the threshold for program eligibility). Careful investigations have found that, among studies that use non-randomized control groups, these well-matched studies are the most likely to produce valid conclusions about a program's overall effectiveness, although they may still mis-estimate the size of the effect.2

  • Comparison-group studies without close matching, which often produce erroneous findings about which practices are effective (but can still be useful in generating hypotheses that merit testing in more rigorous studies).

    These are among the most common designs in educational research. There is strong evidence from education and other fields that such designs, although useful in hypothesis-generation, often produce erroneous findings, and therefore should not be relied upon to inform policy decisions.3 This is true even when statistical techniques, such as regression adjustment, are used to correct for observed differences between the program participants and non-participants. Attachment 1 provides a concrete example of how such designs can yield the wrong answer about a program's effectiveness. (Comparison-group studies are sometimes called "quasi-experimental" studies.)
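
The contrast among the three designs above can be illustrated with a small simulation. The following is a minimal sketch in Python, not an analysis of any real study: the function name `simulate_designs`, the score model, and the enrollment probabilities are all assumptions invented for illustration. It assumes a program with no true effect and that students with higher prior achievement are more likely to enroll. It also assumes, optimistically, that prior achievement is the only systematic difference between enrollees and non-enrollees and that it is measured; matching cannot correct for characteristics that are not observed.

```python
import math
import random
from statistics import mean


def simulate_designs(n=40000, true_effect=0.0, seed=0):
    """Estimate a hypothetical program's effect on a test score three ways.

    Every number here is made up for illustration. 'prior' stands in for
    prior achievement; the outcome is prior + effect (if treated) + noise.
    """
    rng = random.Random(seed)
    randomized = {True: [], False: []}  # outcomes grouped by coin-flip assignment
    unmatched = {True: [], False: []}   # outcomes grouped by self-chosen enrollment
    strata = {}                         # prior-score bin -> outcomes by enrollment

    for _ in range(n):
        prior = rng.gauss(0, 1)
        noise = rng.gauss(0, 1)

        # Design 1: random assignment, unrelated to any student characteristic.
        assigned = rng.random() < 0.5
        randomized[assigned].append(prior + true_effect * assigned + noise)

        # Designs 2 and 3: higher-achieving students enroll more often.
        enrolled = rng.random() < (0.7 if prior > 0 else 0.3)
        score = prior + true_effect * enrolled + noise
        unmatched[enrolled].append(score)

        # Design 3 matches on prior achievement using bins 0.1 wide.
        bin_id = math.floor(prior / 0.1)
        strata.setdefault(bin_id, {True: [], False: []})[enrolled].append(score)

    def diff(groups):
        return mean(groups[True]) - mean(groups[False])

    # Matched estimate: compare enrollees and non-enrollees only within the
    # same bin, then average the bin-level gaps weighted by enrollee count.
    usable = [g for g in strata.values() if g[True] and g[False]]
    total = sum(len(g[True]) for g in usable)
    matched = sum(diff(g) * len(g[True]) for g in usable) / total

    return {
        "randomized": diff(randomized),
        "unmatched": diff(unmatched),
        "matched": matched,
    }


if __name__ == "__main__":
    for design, estimate in simulate_designs().items():
        print(f"{design}: {estimate:+.3f}")
```

Under these assumptions, the randomized and matched estimates should land near the true effect of zero, while the unmatched comparison reports a sizeable positive "effect" that is really the gap in prior achievement between the two groups.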

Specific recommendation: That Congress revise the statutory definition of "scientifically based research" and "scientifically based reading research" to clarify that such research "makes claims about an activity's impact on educational outcomes only in well-designed and implemented random assignment experiments, when feasible, and other methods (such as well-matched comparison group studies) that allow for the strongest possible causal inferences when random assignment is not feasible."

Attachment 2 shows this revision, and the language it replaces, in the relevant sections of NCLB and ESRA. This revision is actually an adaptation of language that already appears in ESRA under the definition of "scientifically valid education evaluation" (also shown in attachment 2).

Precedent for the revised definition: It is broadly consistent with the standards of evidence used by authoritative organizations across a wide range of policy areas, such as:

  • National Academy of Sciences, Institute of Medicine4
  • American Psychological Association5
  • Society for Prevention Research6
  • Department of Education7
  • Academic Competitiveness Council (13 federal agencies funding math/science education)8
  • Department of Justice, Office of Justice Programs9
  • Food and Drug Administration10
  • Office of Management and Budget11

These various standards all recognize well-designed and implemented randomized controlled trials, where feasible, as the strongest design for evaluating a program or practice's effectiveness, and many recognize well-matched comparison-group studies as a second-best alternative when a randomized controlled trial is not feasible.

Conclusion: The definition of "scientifically based research" should be revised, as discussed above, so that it helps focus federal funds on activities that are truly effective.

Attachment 1: Example of How a Comparison-Group Study Without Close Matching Can Produce Erroneous Conclusions

The following example shows how a comparison-group study without careful matching can fail to replicate a central finding of a well-designed randomized controlled trial, producing an invalid result.

Randomized controlled trial results: From 1993 to 2004, the Departments of Education and Labor, and several private foundations, sponsored a large, well-designed randomized controlled trial of Career Academies. Career Academies are an educational program for high school students that provides academic and technical courses in small learning communities, with a career theme and partnership with local employers. One of the trial's main findings, at the 8-year follow-up, was that the program had no effect on participants' high school graduation rate, compared to the control group (as shown by the two left-hand bars in the chart below).12

Comparison-group study results: When the study team then used a comparison-group design comparing program participants to a non-randomized control group of similar students in similar schools — rather than a randomized control group — the study produced an erroneous finding that Career Academies had a large effect on the high school graduation rate, increasing it by over 30 percent (see the two right-hand bars in the chart below).13

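[Chart: high school graduation rates of the program group and the control group. Two left-hand bars: randomized controlled trial. Two right-hand bars: comparison-group design.]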

A likely reason the comparison-group design produced this erroneous finding: The program group had volunteered for the Career Academy — which is an indication that they were motivated to achieve — whereas the non-randomized control group members had not volunteered, and so presumably were less motivated on average. This difference in motivational level likely caused the program group to have a higher graduation rate (an effect sometimes called "self-selection bias"). In the randomized controlled trial, by contrast, both the program group and the control group had volunteered for Career Academies prior to random assignment, and so were well-matched in level of motivation, as well as other characteristics.
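
The self-selection mechanism described above can be sketched in a few lines of Python. This is a hypothetical illustration only and does not use the Career Academies data: the function name `graduation_gap`, the motivation scale, and every probability are assumptions chosen for the example. The simulated program has no effect on graduation at all; graduation depends only on an unobserved motivation level, which also drives the decision to volunteer.

```python
import random


def graduation_gap(n=40000, seed=1):
    """Simulate a program with NO true effect on graduation.

    Returns the apparent effect (difference in graduation rates) from
    (a) a randomized trial conducted among volunteers, and
    (b) a comparison of program participants with non-volunteers.
    All probabilities are invented for illustration.
    """
    rng = random.Random(seed)
    program, randomized_control = [], []  # volunteers, split by coin flip
    non_volunteers = []                   # non-randomized comparison group

    for _ in range(n):
        motivation = rng.random()                # unobserved, between 0 and 1
        volunteered = rng.random() < motivation  # motivated students volunteer more
        # Graduation depends on motivation only; the program adds nothing.
        graduated = rng.random() < 0.4 + 0.5 * motivation

        if volunteered:
            # Random assignment takes place after students volunteer.
            if rng.random() < 0.5:
                program.append(graduated)
            else:
                randomized_control.append(graduated)
        else:
            non_volunteers.append(graduated)

    def rate(group):
        return sum(group) / len(group)

    return {
        "randomized": rate(program) - rate(randomized_control),
        "comparison": rate(program) - rate(non_volunteers),
    }


if __name__ == "__main__":
    for design, gap in graduation_gap().items():
        print(f"{design}: {gap:+.3f}")
```

Because both arms of the simulated randomized trial are drawn from volunteers, their motivation levels are alike and the estimated gap should be close to zero. The comparison with non-volunteers should show a clearly positive gap even though the simulated program does nothing, which is the pattern attributed to self-selection bias.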

Attachment 2: Proposed Legislative Language

  1. Suggested revisions to the definition of "scientifically based research" and "scientifically based reading research" in the No Child Left Behind Act of 2001 (P.L. 107–110)
"(37) Scientifically based research.—The term 'scientifically based research'—
  1. "means research that involves the application of rigorous, systematic, and objective procedures to obtain reliable and valid knowledge relevant to education activities and programs; and
  2. "includes research that—
    1. "employs systematic, empirical methods that draw on observation or experiment;
    2. "involves rigorous data analyses that are adequate to test the stated hypotheses and justify the general conclusions drawn;
    3. "relies on measurements or observational methods that provide reliable and valid data across evaluators and observers, across multiple measurements and observations, and across studies by the same or different investigators;
    4. "makes claims about an activity's impact on educational outcomes only in well-designed and implemented random assignment experiments, when feasible, and other methods (such as well-matched comparison group studies) that allow for the strongest possible causal inferences when random assignment is not feasible; [replaces current language: "is evaluated using experimental or quasi-experimental designs in which individuals, entities, programs, or activities are assigned to different conditions and with appropriate controls to evaluate the effects of the condition of interest, with a preference for random-assignment experiments, or other designs to the extent that those designs contain within-condition or across-condition controls;"]
    5. "ensures that experimental studies are presented in sufficient detail and clarity to allow for replication or, at a minimum, offer the opportunity to build systematically on their findings; and
    6. "has been accepted by a peer-reviewed journal or approved by a panel of independent experts through a comparably rigorous, objective, and scientific review.

* * *

"(6) Scientifically based reading research.—The term 'scientifically based reading research' means research that—
  1. "applies rigorous, systematic, and objective procedures to obtain valid knowledge relevant to reading development, reading instruction, and reading difficulties; and
  2. "includes research that—
    1. "employs systematic, empirical methods that draw on observation or experiment;
    2. "involves rigorous data analyses that are adequate to test the stated hypotheses and justify the general conclusions drawn;
    3. "relies on measurements or observational methods that provide valid data across evaluators and observers and across multiple measurements and observations;
    4. "makes claims about an activity's impact on educational outcomes only in well-designed and implemented random assignment experiments, when feasible, and other methods (such as well-matched comparison group studies) that allow for the strongest possible causal inferences when random assignment is not feasible; and
    5. "has been accepted by a peer-reviewed journal or approved by a panel of independent experts through a comparably rigorous, objective, and scientific review.
  2. Suggested revision to the definition of "scientifically based research standards" in the Education Sciences Reform Act of 2002 (P.L. 107–279)
(18) Scientifically based research standards.—
  1. The term "scientifically based research standards" means research standards that—
    1. apply rigorous, systematic, and objective methodology to obtain reliable and valid knowledge relevant to education activities and programs; and
    2. present findings and make claims that are appropriate to and supported by the methods that have been employed.
  2. The term includes, appropriate to the research being conducted—
    1. employing systematic, empirical methods that draw on observation or experiment;
    2. involving data analyses that are adequate to support the general findings;
    3. relying on measurements or observational methods that provide reliable data;
    4. making claims about an activity's impact on educational outcomes only in well-designed and implemented random assignment experiments, when feasible, and other methods (such as well-matched comparison group studies) that allow for the strongest possible causal inferences when random assignment is not feasible; [replaces current language: "of causal relationships only in random assignment experiments or other designs (to the extent such designs substantially eliminate plausible competing explanations for the obtained results);"]
    5. ensuring that studies and methods are presented in sufficient detail and clarity to allow for replication or, at a minimum, to offer the opportunity to build systematically on the findings of the research;
    6. obtaining acceptance by a peer-reviewed journal or approval by a panel of independent experts through a comparably rigorous, objective, and scientific review; and
    7. using research designs and methods appropriate to the research question posed.
  3. The proposed revisions above are adaptations of language that already appears in the Education Sciences Reform Act, under the definition of "scientifically valid education evaluation":
(19) Scientifically valid education evaluation.—The term "scientifically valid education evaluation" means an evaluation that—
  1. adheres to the highest possible standards of quality with respect to research design and statistical analysis;
  2. provides an adequate description of the programs evaluated and, to the extent possible, examines the relationship between program implementation and program impacts;
  3. provides an analysis of the results achieved by the program with respect to its projected effects;
  4. employs experimental designs using random assignment, when feasible, and other research methodologies that allow for the strongest possible causal inferences when random assignment is not feasible; and
  5. may study program implementation through a combination of scientifically valid and reliable methods.

References

1 By contrast, nonrandomized studies by their nature can never be entirely confident that they are comparing program participants to non-participants who are equivalent in observed and unobserved characteristics (e.g., motivation). Thus, these studies cannot rule out the possibility that such characteristics, rather than the program itself, are causing an observed difference in outcomes between the two groups.

2 The following are citations to the relevant literature in education, welfare/employment, and other areas of social policy. Howard S. Bloom, Charles Michalopoulos, and Carolyn J. Hill, "Using Experiments to Assess Nonexperimental Comparison-Groups Methods for Measuring Program Effects," in Learning More From Social Experiments: Evolving Analytic Approaches, Russell Sage Foundation, 2005, pp. 173–235. James J. Heckman et al., "Characterizing Selection Bias Using Experimental Data," Econometrica, vol. 66, no. 5, September 1998, pp. 1017–1098. Daniel Friedlander and Philip K. Robins, "Evaluating Program Evaluations: New Evidence on Commonly Used Nonexperimental Methods," American Economic Review, vol. 85, no. 4, September 1995, pp. 923–937. Thomas Fraker and Rebecca Maynard, "The Adequacy of Comparison Group Designs for Evaluations of Employment-Related Programs," Journal of Human Resources, vol. 22, no. 2, spring 1987, pp. 194–227. Robert J. LaLonde, "Evaluating the Econometric Evaluations of Training Programs With Experimental Data," American Economic Review, vol. 76, no. 4, September 1986, pp. 604–620. Roberto Agodini and Mark Dynarski, "Are Experiments the Only Option? A Look at Dropout Prevention Programs," Review of Economics and Statistics, vol. 86, no. 1, 2004, pp. 180–194. Elizabeth Ty Wilde and Rob Hollister, "How Close Is Close Enough? Testing Nonexperimental Estimates of Impact against Experimental Estimates of Impact with Education Test Scores as Outcomes," Institute for Research on Poverty Discussion paper, no. 1242–02, 2002, at http://www.ssc.wisc.edu/irp/, and forthcoming in Journal of Policy Analysis and Management.

This literature is systematically reviewed in Steve Glazerman, Dan M. Levy, and David Myers, "Nonexperimental Replications of Social Experiments: A Systematic Review," Mathematica Policy Research discussion paper, no. 8813–300, September 2002. The portion of this review addressing labor market interventions is published in "Nonexperimental versus Experimental Estimates of Earnings Impacts," The Annals of the American Academy of Political and Social Science, vol. 589, September 2003, pp. 63–93.

3 Ibid (the literature cited in reference 2 addresses the general question of whether, and under what circumstances, comparison-group studies can replicate the results of well-designed randomized controlled trials).

4 "The Urgent Need to Improve Health Care Quality," Consensus statement of the Institute of Medicine National Roundtable on Health Care Quality, Journal of the American Medical Association, vol. 280, no. 11, September 16, 1998, p. 1003.

5 American Psychological Association, "Criteria for Evaluating Treatment Guidelines," American Psychologist, vol. 57, no. 12, December 2002, pp. 1052–1059.

6 Society for Prevention Research, Standards of Evidence: Criteria for Efficacy, Effectiveness and Dissemination, April 12, 2004, at http://www.preventionresearch.org/sofetext.php.

7 U.S. Department of Education, "Scientifically-Based Evaluation Methods: Notice of Final Priority," Federal Register, vol. 70, no. 15, January 25, 2005, pp. 3586–3589. U.S. Department of Education, Institute of Education Sciences, What Works Clearinghouse Study Review Standards, February 2006, at http://ies.ed.gov/ncee/wwc/DocumentSum.aspx?sid=19.

8 U.S. Department of Education, Report of the Academic Competitiveness Council, May 2007.

9 U.S. Department of Justice, Office of Juvenile Justice and Delinquency Prevention, Model Programs Guide, at http://www.dsgonline.com/mpg2.5/ratings.htm; U.S. Department of Justice, Office of Justice Programs, What Works Repository, December 2004.

10 The Food and Drug Administration's standard for assessing the effectiveness of pharmaceutical drugs and medical devices, at 21 C.F.R. 314.126.

11 Office of Management and Budget, What Constitutes Strong Evidence of Program Effectiveness, op. cit., no. 4.

12 James Kemple and Judith Scott-Clayton, "Career Academies: Impacts on Labor Market Outcomes and Educational Attainment," MDRC, February 2004, at http://www.mdrc.org/publications/366/full.pdf. Although the study found that Career Academies had no effect on high school graduation rates, it did find that the program produced sizeable increases in participants' job earnings, compared to the control group.

13 James Kemple and Kathleen Floyd, "Why Do Impact Evaluations? Notes from Career Academy Research and Practice," presentation at a conference of the Coalition for Evidence-Based Policy and the Council of Chief State School Officers, December 10, 2003, http://www.excelgov.org/usermedia/images/uploads/PDFs/MDRC-Conf-12-09-2003.ppt.