Securing data on students' academic achievement is often a central challenge faced
by researchers conducting education experiments. These data are typically obtained
by either (1) administering an assessment to students as part of the study or (2)
collecting students' test scores on existing assessments administered by states
or districts. Three trends have increased the relative appeal of the second strategy:
- Growth in Statewide Proficiency Assessments. Requirements under the No
Child Left Behind (NCLB) Act of 2001 and parallel standards-based state education
reforms have led nearly all states to test students yearly in grades three through
eight and in at least one grade in high school. Because of the adequate yearly progress
(AYP) provisions in NCLB, these tests have significant stakes, leading school staff
to encourage nearly all students to take the tests and to apply themselves, which
increases the potential value of state tests as a comprehensive measure of students'
academic achievement. Because educators and policymakers are mindful of student
performance on these tests, the ability of a program to demonstrate impacts on state
assessments also increases the likelihood that evaluation results may lead to changes
in education policy and practice.
- Declines in the Relative Costs of Using Test Data from States and Districts.
As states and districts develop electronic databases with unique student identifiers,
the costs of securing and using these data for research purposes have declined.
Finding cost-effective ways to collect student achievement data is important because
obtaining these outcome measures can be among the largest costs of an experiment.
- Growing Demand to Minimize Testing Burden on Students and School Staff.
With the growth of state proficiency tests and of formative assessments designed
to prepare students for these tests or to help tailor instruction, some educators
have become increasingly concerned about taking additional time out of the school
day to administer a separate study test to students. Both the U.S. Department of
Education (ED) and the Office of Management and Budget (OMB) have sought to minimize
burden on students and school staff. Recruiting districts and schools for experimental
studies is sometimes easier if the study team commits to relying on existing assessments
rather than adding a new assessment.
Although these trends have made state proficiency tests a more common and increasingly
appealing source of outcome measures, the use of these assessments in education
experiments raises methodological caveats. The body of this discussion paper provides
specific advice for dealing with these caveats and points out instances in which
decisions can influence study results. Part IV provides conclusions and recommendations.
In addition, two appendices provide important background information and additional
context for the discussion paper. This information is provided especially for readers
who may not be fully familiar with the current landscape and recent evolution of
state assessment programs in the United States, or with how data from state proficiency
tests are commonly used within education experiments. Appendix A highlights important
characteristics of and recent trends in state assessment systems. Appendix B describes
how state tests have been used, or were planned to be used, in recent IES-funded
studies; this information helped anchor our discussion of issues in using state
tests to the issues that research teams themselves have called attention to.
The material in these appendices reveals a number of important themes about state
proficiency assessment systems in the U.S. that researchers should bear in mind
as they read Parts II and III of this discussion paper:
- State assessment programs have become practically universal and more uniform
in terms of grades and subjects tested. All 50 states test students yearly
in English/Language Arts (ELA) and math in grades 3 through 8, and at least once
in grades 10 through 12. Most states also test students in science each year, but
only in selected grades. Testing in other subjects
and other grades is less prevalent.
- The design of state tests generally reflects their main purpose, to assess skills
relative to state-specified proficiency standards. This objective is reflected
in at least two important traits of state testing systems. First, there is notable
diversity in the structure, content, and emphasis of tests across grades and states,
which reflects the diversity in states' academic standards. Second, state tests
consist primarily of multiple-choice items sampled broadly from states' many content
and proficiency standards. Such broad sampling is consistent with a desire to assess
proficiency relative to the entire set of standards. Furthermore, multiple-choice
tests tend to produce highly reliable scores for the overall student population,
which is desirable given the high stakes attached to proficiency determinations.
- The diverse content of state assessments complicates the task of determining
whether a particular test is suitable for research purposes. It also poses
important challenges when deciding whether and how to combine evaluation results
based on distinct assessments.
- The multiple-choice format of many assessments raises other important challenges
for evaluations. For instance, the reliability of such tests tends to be highest
around the cut point of interest (in this case, the scores that define proficiency)
and can be much lower for students in the tails of the test score distribution (that
is, very high- or very low-performing students). Multiple-choice tests might also be
relatively more prone to ceiling and floor effects and therefore of potentially
limited value for evaluations examining the effects of interventions targeting high-
or low-performing students. Another concern is that multiple-choice tests do not
measure higher-order skills well. Thus, test scores on state assessments might not
be appropriate for evaluations of interventions focused on such outcomes.
- A key advantage of using state assessments is that the cost of obtaining these
data is typically much lower than the cost of administering new tests. Nevertheless,
the process of gaining access to state test data is not necessarily simple. Researchers
intending to use state test data should therefore have a clear understanding of
the steps necessary and allow sufficient time for data collection from the appropriate
state and/or local education agencies.
- Many studies funded by the Institute of Education Sciences (IES) rely on state
assessments as a source of outcome measures. Such studies tend to evaluate
a diverse set of interventions generally focused on improving students' overall
achievement (in one or more subject areas) and/or their ability to meet states'
academic standards. Estimating program impacts using students' test scores seems
appropriate in such contexts.
- Many of these studies are nevertheless conducted across multiple states and/or
grades, and it is not always clear if the necessary assumptions to aggregate results
have been met. Study reports do not make clear whether or how the research
teams established that the state tests were sufficiently well aligned with key outcome
objectives for the intervention. When results based on tests for different grades
and/or states are combined, reports do not typically discuss whether the rescaling
used to place scores on a common metric is appropriate given characteristics of the
study sample, the different tests administered across grades or states, and the
intervention's overall target population.
- Possible changes to relevant Federal and state legislation could prompt changes
to state assessment policies. Such changes, in turn, would prompt changes in
the types of state test data collected and potentially available to researchers
for evaluation purposes. Researchers should therefore be mindful of the issues and
assumptions in using state tests for education evaluations.
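
Two of the themes above, rescaling scores from different tests before combining results and checking for ceiling and floor effects, can be illustrated with a minimal Python sketch. It is not drawn from the studies reviewed here. The record keys (state, grade, score) and both function names are hypothetical, and the sketch standardizes against the study sample's own mean and standard deviation within each state-by-grade test. Standardizing against statewide means and standard deviations published by the state is an alternative, and the choice can affect results.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_group(records):
    """Return copies of the records with a z-score added.

    Z-scores are computed separately for each (state, grade) test,
    using the mean and standard deviation of the study sample itself.
    `records` is a list of dicts with the hypothetical keys
    'state', 'grade', and 'score'.
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r["state"], r["grade"])].append(r["score"])

    # Mean and (population) standard deviation for each test.
    stats = {key: (mean(s), pstdev(s)) for key, s in groups.items()}

    out = []
    for r in records:
        m, sd = stats[(r["state"], r["grade"])]
        # If every score on a test is identical, the z-score is undefined;
        # this sketch sets it to 0.0 rather than dividing by zero.
        z = (r["score"] - m) / sd if sd > 0 else 0.0
        out.append({**r, "z": z})
    return out


def share_at_bounds(scores, lowest, highest):
    """Return the shares of scores at or below `lowest` and at or above `highest`.

    `lowest` and `highest` are the lowest and highest obtainable scale
    scores on the test, which the analyst must supply. Large shares
    suggest possible floor or ceiling effects in the sample.
    """
    n = len(scores)
    floor_share = sum(1 for s in scores if s <= lowest) / n
    ceiling_share = sum(1 for s in scores if s >= highest) / n
    return floor_share, ceiling_share
```

A z-score of this kind expresses each student's score relative to other sample members who took the same test. It does not by itself establish that the tests measure the same content, so the alignment questions raised above still apply.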