Technical Methods Report: Using State Tests in Education Experiments
A Discussion of the Issues

NCEE 2009-013
November 2009

Introduction

Securing data on students' academic achievement is often a central challenge faced by researchers conducting education experiments. These data are typically obtained by either (1) administering an assessment to students as part of the study or (2) collecting students' test scores on existing assessments administered by states or districts. Three trends have increased the relative appeal of the second strategy:

Growth in Statewide Proficiency Assessments. Requirements under the No Child Left Behind (NCLB) Act of 2001 and parallel standards-based state education reforms have led nearly all states to test students yearly in grades three through eight and in at least one grade in high school. Because of the adequate yearly progress (AYP) provisions in NCLB, these tests have significant stakes, leading school staff to encourage nearly all students to take the tests and to apply themselves, which increases the potential value of state tests as a comprehensive measure of students' academic achievement. Because educators and policymakers are mindful of student performance on these tests, the ability of a program to demonstrate impacts on state assessments also increases the likelihood that evaluation results may lead to changes in education policy and practice.
Declines in the Relative Costs of Using Test Data from States and Districts. As states and districts develop electronic databases with unique student identifiers, the costs of securing and using these data for research purposes have declined. Finding cost-effective ways to collect student achievement data is important because obtaining these outcome measures can be among the largest costs of an experiment.
Growing Demand to Minimize Testing Burden on Students and School Staff. With the growth of state proficiency tests and of formative assessments designed to prepare students for these tests or to help tailor instruction, some educators have become increasingly concerned about taking additional time out of the school day to administer a separate study test to students. Both the U.S. Department of Education (ED) and the Office of Management and Budget (OMB) have sought to minimize burden on students and school staff. Recruiting districts and schools for experimental studies is sometimes easier if the study team commits to relying on existing assessments rather than adding a new assessment.

Although these trends have made state proficiency tests a more common and increasingly appealing source of outcome measures, the use of these assessments in education experiments [...] provide specific advice for dealing with methodological caveats and point out instances in which decisions can influence study results. Part IV provides conclusions and recommendations.

In addition, two appendices provide important background information and additional context for the discussion paper. This information is provided especially for readers who may not be fully familiar with the current landscape and recent evolution of state assessment programs in the United States, or with how data from state proficiency tests are commonly used within education experiments. Appendix A highlights important characteristics of and recent trends in state assessment systems. Appendix B describes how state tests have been or were planned to be used in recent IES-funded studies; this information helped anchor our discussion of state test use in the issues to which various research teams have called attention.

The material in these appendices reveals a number of important themes about state proficiency assessment systems in the U.S. that researchers should bear in mind as they read Parts II and III of this discussion paper:

State assessment programs have become practically universal and more uniform in terms of grades and subjects tested. All 50 states test students yearly in English/Language Arts (ELA) and math in grades 3 through 8, and at least once in grades 10 through 12. Most states also test students yearly in science, but such assessments are administered only in selected grades. Testing in other subjects and other grades is less prevalent.
The design of state tests generally reflects their main purpose: to assess skills relative to state-specified proficiency standards. This objective is reflected in at least two important traits of state testing systems. First, there is notable diversity in the structure, content, and emphasis of tests across grades and states, which reflects the diversity in states' academic standards. Second, state tests consist primarily of multiple-choice items sampled broadly from states' many content and proficiency standards. Such broad sampling is consistent with a desire to assess proficiency relative to the entire set of standards. Furthermore, multiple-choice tests tend to produce highly reliable scores for the overall student population, which is desirable given the high stakes attached to proficiency determinations.
The diverse content of state assessments complicates the task of determining whether a particular test is suitable for research purposes. It also poses important challenges when deciding whether and how to combine evaluation results based on distinct assessments.
The multiple-choice format of many assessments raises other important challenges for evaluations. For instance, the reliability of such tests tends to be highest around the cut point of interest (in this case, the scores that define proficiency) and can be much lower for students at the tail ends of the test score distribution (that is, very high- or very low-performing students). Multiple-choice tests might be relatively more prone to ceiling and floor effects and therefore of potentially limited value for evaluations examining the effects of interventions targeting high- or low-performing students. Another concern is that multiple-choice tests do not measure higher-order skills well. Thus, test scores on state assessments might not be appropriate for evaluations of interventions focused on such outcomes.
A key advantage of using state assessments is that the cost of obtaining these data is typically much lower than the cost of administering new tests. Nevertheless, the process of gaining access to state test data is not necessarily simple. Researchers intending to use state test data should therefore have a clear understanding of the necessary steps and allow sufficient time for data collection from the appropriate state and/or local education agencies.
Many studies funded by the Institute of Education Sciences (IES) rely on state assessments as a source of outcome measures. Such studies tend to evaluate a diverse set of interventions generally focused on improving students' overall achievement (in one or more subject areas) and/or their ability to meet states' academic standards. Estimating program impacts using students' test scores seems appropriate in such contexts.
Many of these studies are nevertheless conducted across multiple states and/or grades, and it is not always clear if the necessary assumptions to aggregate results have been met. Study reports do not make clear whether or how the research teams established that the state tests were sufficiently well aligned with key outcome objectives for the intervention. When results based on tests for different grades and/or states are combined, reports do not typically discuss whether the rescaling is appropriate given characteristics of the study sample, the different tests administered across grades or states, and the intervention's overall target population.
Possible changes to relevant Federal and state legislation could prompt changes to state assessment policies. Such changes, in turn, would prompt changes in the types of state test data collected and potentially available to researchers for evaluation purposes. Researchers should therefore be mindful of the issues and assumptions in using state tests for education evaluations.
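
Two of the themes above, rescaling scores from different tests before combining them and checking for ceiling and floor effects, can be illustrated with a minimal sketch. It is an illustration under stated assumptions, not a method prescribed by this paper. The record keys (`state`, `grade`, `score`), the function names, and the choice to standardize against the study sample's own mean and standard deviation within each state-by-grade cell are all hypothetical; standardizing against statewide means and standard deviations is an alternative that may suit some studies better.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_cells(records):
    """Convert scale scores to z-scores within each (state, grade) cell.

    `records` is a list of dicts with the hypothetical keys
    'state', 'grade', and 'score'. The mean and SD come from the
    study sample itself, which is an assumption of this sketch.
    """
    # Group scores by the test that produced them.
    cells = defaultdict(list)
    for r in records:
        cells[(r["state"], r["grade"])].append(r["score"])

    # Mean and population SD for each state-by-grade cell.
    stats = {cell: (mean(s), pstdev(s)) for cell, s in cells.items()}

    out = []
    for r in records:
        mu, sigma = stats[(r["state"], r["grade"])]
        # Guard against a cell in which every score is identical.
        z = (r["score"] - mu) / sigma if sigma > 0 else 0.0
        out.append({**r, "z": z})
    return out


def share_at_bounds(scores, lowest, highest):
    """Return the fractions of scores at the lowest and highest
    obtainable scale scores (floor share, ceiling share)."""
    n = len(scores)
    floor_share = sum(s == lowest for s in scores) / n
    ceiling_share = sum(s == highest for s in scores) / n
    return floor_share, ceiling_share
```

Calling `standardize_within_cells` on pooled records places scores from each state and grade on a common standard-deviation metric. It does not by itself establish that the underlying tests measure comparable content. A large value from `share_at_bounds` for a sample of high- or low-performing students would suggest that the test may be of limited value for that target population.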
