Technical Methods Report: Using State Tests in Education Experiments - Appendix A: State Testing Programs Under NCLB

Technical Methods Report: Using State Tests in Education Experiments
A Discussion of the Issues

NCEE 2009-013
November 2009

Introduction
Whether to use State Tests in Education Experiments
How to use State Test Data in Education Experiments
Conclusions and Recommendations
References
Appendix A: State Testing Programs Under NCLB
Appendix B: How NCEE-Funded Evaluations use State Test Data
List of Tables

Appendix A: State Testing Programs Under NCLB

This appendix provides context on state assessment programs for researchers who are unfamiliar with, or have limited knowledge of, the current landscape and recent evolution of such programs in the United States. Although testing programs change continuously as states work to meet federal requirements and improve performance, this appendix offers a snapshot intended to help researchers identify factors that could affect the design of their studies.

In this appendix, we describe key provisions of the No Child Left Behind Act of 2001 (NCLB) that have influenced state testing policies and the overall availability of student assessment data, trends in state testing since NCLB was introduced, and issues related to the alignment of state tests to academic content and performance standards. Our discussion of these topics is based on reviews of information on state assessment policies from the Council of Chief State School Officers (CCSSO) and key reports on student testing.

1. Key NCLB Provisions That Influence State Testing Policies
The No Child Left Behind Act of 2001 was signed into law by President George W. Bush on January 8, 2002.³¹ It is the reauthorization of the Elementary and Secondary Education Act, which governs the distribution and use of Title I funds, the federal government’s principal aid program for the education of disadvantaged students.

At the core of NCLB are a number of provisions requiring, as a condition for receipt of Title I funds, that states implement comprehensive student testing programs. By the 2005-2006 school year, states had to test students annually in mathematics and English Language Arts (ELA) in grades three through eight and once in high school. Starting in 2007-2008, states also had to test students in science at least once in each of the following grade periods: 3 through 5, 6 through 9, and 10 through 12. In addition, NCLB requires that assessments be aligned with states’ academic content standards, which states can accomplish either by developing assessments specifically designed to reflect those standards or by modifying commercially available “off-the-shelf” tests.

Other key provisions of NCLB that influence state testing policies and the overall availability of test scores for individual students include the following:

Adequate Yearly Progress Toward Proficiency. NCLB requires that all students reach proficiency in the state-defined standards by the spring of the 2013-2014 academic year, as measured by performance on state tests. Adequate yearly progress (AYP) is the measure by which schools, districts, and states are held accountable for student progress toward this 100 percent proficiency goal. Based on 2001-2002 test data, states set their baseline proficiency rates. States were then required to specify yearly benchmarks for how students would progress to meet the goal of 100 percent proficiency by 2014. To achieve AYP, 95 percent of students in a school as a whole must meet or exceed the “annual measurable objectives” set by the state for a given academic year. Schools or districts that fail to make AYP for two consecutive years are identified as “in need of improvement.”
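
To illustrate how yearly benchmarks might be laid out, the following minimal Python sketch computes targets that rise in equal annual steps from a baseline proficiency rate to 100 percent in spring 2014. The equal-step shape, the 40 percent baseline, and the function name linear_amo_schedule are assumptions made for illustration only; the text above does not specify how any state shaped its benchmarks.

```python
def linear_amo_schedule(baseline_pct, start_year=2002, end_year=2014):
    """Return {spring_year: target percent proficient}.

    Targets rise in equal annual steps from baseline_pct (spring of
    start_year) to 100 percent (spring of end_year). Equal steps are an
    illustrative assumption, not a rule stated in the source.
    """
    n_steps = end_year - start_year
    step = (100.0 - baseline_pct) / n_steps
    return {
        start_year + i: round(baseline_pct + step * i, 1)
        for i in range(n_steps + 1)
    }


# Hypothetical state with 40 percent of students proficient at baseline.
schedule = linear_amo_schedule(40.0)
print(schedule[2002], schedule[2008], schedule[2014])  # 40.0 70.0 100.0
```

A researcher could compare a school's observed proficiency rate in a given year against the target for that year to see how binding the benchmark would be under this assumed trajectory.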

Statewide Accountability Systems. NCLB requires states to develop a single accountability system to determine whether all students and key subgroups of students are meeting AYP. All students must be assessed using the same state assessment (with limited exceptions, described below) and AYP definitions must apply to all public schools and districts in the state, Title I and non-Title I.

Student Participation Requirements. An additional condition to achieve AYP is that at least 95 percent of the students enrolled in a school or local education agency (LEA) must take the state tests. The participation rate must also reach 95 percent for “numerically significant” student subgroups, which include various racial/ethnic subgroups, socioeconomically disadvantaged students, English language learners (ELLs), and students with disabilities.
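
The participation condition can be sketched as a simple check on counts of tested and enrolled students. In the Python sketch below, the function name participation_check, the example counts, and the min_subgroup_n cutoff are all hypothetical; in particular, the cutoff merely stands in for whatever definition of "numerically significant" a given state uses.

```python
# Participation threshold taken from the text: at least 95 percent tested.
THRESHOLD_PCT = 95


def participation_check(counts, min_subgroup_n=30):
    """Return {group: True/False} for each group large enough to count.

    counts maps a group name to (tested, enrolled); the key "all" is the
    whole school or LEA. min_subgroup_n is a hypothetical cutoff standing
    in for a state's own definition of "numerically significant".
    """
    results = {}
    for group, (tested, enrolled) in counts.items():
        if group != "all" and enrolled < min_subgroup_n:
            continue  # subgroup too small under the assumed cutoff
        # Integer comparison avoids floating-point rounding at exactly 95%.
        results[group] = tested * 100 >= THRESHOLD_PCT * enrolled
    return results


# Made-up counts for one school: all students, ELLs, students with disabilities.
school = {"all": (480, 500), "ell": (36, 40), "swd": (19, 20)}
result = participation_check(school)
print(result)  # {'all': True, 'ell': False}
```

In this made-up example the school as a whole clears the threshold (96 percent), the ELL subgroup does not (90 percent), and the students-with-disabilities subgroup is skipped because it falls below the assumed size cutoff.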

Testing Accommodations and Exemptions. Because the assessments play a major role in states’ accountability systems, NCLB provisions allow some modifications to the typical assessment scenario to improve fairness. For example, NCLB allows ELL students to be exempted from state testing in their first year in school. ELL students must participate in the state testing program thereafter, but may take the state test in their native language. Another common accommodation involves a testing proctor reading aloud portions of the math assessment to ELL students. Similarly, special education students may receive accommodations (for example, extended time), an alternate version of the test (for example, large-print or Braille versions), or be administered an entirely different assessment (for example, a portfolio assessment) that reflects academic standards and goals that apply specifically to them (that is, those specified in an Individualized Education Program or IEP) and are different from those that apply to the general student population (U.S. Department of Education 2006).

2. Characteristics of and Recent Trends in State Testing Programs
Statewide assessment programs were already prevalent before NCLB was enacted. A 2001 study by the Consortium for Policy Research in Education (CPRE) at the University of Pennsylvania found that 48 states already had statewide student assessment programs at that time (Goertz et al. 2001). (The two remaining states, Iowa and Nebraska, allowed districts to choose whether and how to assess students.) The same study nevertheless found wide cross-state variation in how often students were tested (for example, how many and which grades) and in the types of tests administered to students (for example, nationally normed versus state-developed criterion-referenced tests).

Statewide assessment programs have since become more uniform, closely reflecting NCLB requirements. Tables A.1, A.2, and A.3 (presented at the end of this appendix) reflect data from the CCSSO on the assessments used and grades tested in mathematics, ELA, and science, respectively, for the 50 states and the District of Columbia (CCSSO 2005, 2008). The tables cover two time periods: 2003-2004 and 2007-2008 for mathematics and ELA, and 2004-2005 and 2007-2008 for science. As the tables show, in 2003-2004, there was still notable variation in state assessment programs along the dimensions examined. By 2007-2008, however, all states complied with NCLB’s requirements to test students yearly in grades 3-8 and at least once in grades 10-12 in mathematics and ELA. Thirty-four states (67 percent) tested students solely in the grades required by NCLB.

States test high school students in one or more of grades 10 through 12. As of the 2007-2008 academic year, the majority of states tested students in grade 10 (for example, 53 percent for mathematics; see Table A.1), and a few states (for example, Iowa and South Dakota) tested students in grade 12. Some states (for example, Nevada and New Hampshire) tested students in multiple grades in high school. However, states rarely tested ninth-grade students, which NCLB does not require. Some states (for example, Maryland and North Carolina) administered end-of-course exams instead of testing high school students in specific grades.

Only a handful of states test very young students. Examples of states that, as of 2007-2008, tested students below grade three include California and Delaware. This pattern likely reflects both the lack of an NCLB testing mandate for these grades and the difficulty of assessing young children economically and reliably. As of 2007-2008, only seven states in mathematics and eight states in ELA tested students at least once in kindergarten through grade two (see Tables A.1 and A.2).

As NCLB science testing requirements have come into effect, many states have added science to their lists of subjects tested. The exact grades tested vary across states, however. In 2007-2008, 46 states tested students in science at least once in the required grade blocks (as compared to 35 states in 2004-2005). As of 2007-2008, Maine, Maryland, and Nevada did not test students in science in grades 10 through 12, while Arkansas and the District of Columbia were still developing their science assessment programs.

Use of nationally normed assessments is now rare. Goertz et al. (2001) note that 31 states used nationally normed tests in their state assessment programs in 1999-2000. Since the passage of NCLB, however, most states have opted to develop state-specific assessments to test students in all three NCLB-mandated subjects. By 2007-2008, the number of states using nationally normed assessments had decreased to seven for mathematics (Table A.1) and four for ELA (Table A.2). As of 2007-2008, only one state (Alabama) used a nationally normed assessment to test students in science (Table A.3). Notably, the states that still used nationally normed assessments in mathematics and ELA in 2007-2008 either administered them in addition to state-specific tests in the same grades or administered them only in high school.

However, some states contract with commercial testing companies to develop customized assessments. Although the use of “off-the-shelf” nationally normed assessments has become less common, many states have contracted with commercial test developers such as CTB/McGraw-Hill, Educational Testing Service (ETS), Pearson Assessment, and Riverside Publishing.³² We were unable to locate information on the degree to which, in such instances, state tests draw upon or are derived from the item banks for the nationally normed assessments sold by these same testing vendors. However, this contracting practice suggests that the format and content of some state tests may be closely related to the format and content of some nationally normed student assessments.

State tests rely primarily on multiple-choice items to measure student performance. Quality Counts 2008 indicates that, in 2007-2008, the assessment programs of 49 states (all except Nebraska) and the District of Columbia included multiple-choice test items (Education Week 2008). Most state tests rely on about 40 to 50 multiple-choice items per subject tested (Webb 2007), which translates to only one or two items per standards-based objective assessed. Thus, many academic objectives are typically left unassessed in any given year. Multiple-choice items produce very reliable test scores, but some educators and psychologists argue that they do a poor job of measuring higher-order skills (Darling-Hammond 2007; Bracey 2002; Kohn 2000). The typical design of state tests is not surprising given their intended use: determining a student’s level of proficiency relative to state standards. Such use requires highly reliable scores—justifying the use of multiple-choice items—that represent a student’s proficiency across the entire set of standards for a particular grade—justifying a broad sampling of items across many standards.

There are important differences in test content and performance standards across states. Studies that have examined content standards and proficiency levels across states (separately) conclude that both vary widely. For example, Porter, Polikoff, and Smithson (2008) examined the state assessment programs of 31 states for grades three through eight in mathematics, ELA, and science; they found more overlap in standards across grades in a given state than for a given grade across states. Similarly, an NCES (2007) study used traditional psychometric equating techniques to link assessments from all 50 states to the National Assessment of Educational Progress (NAEP). This study found that the NAEP test scores corresponding to states’ proficiency cutoffs for state tests in ELA and mathematics for grades four and eight ranged from a high of 12 points above the NAEP cut score for “proficient” performance to a low of 45 points below the NAEP cut score for “basic” performance. Petrilli (2008) examined proficiency cut scores in 26 states and concluded that these varied tremendously in the difficulty level represented.

Schools and LEAs also vary in the participation rates they are able to achieve. As noted, NCLB provisions set a national standard of 95 percent for the participation of students and subgroups in state assessment programs. According to a 2007 study commissioned by ED, among the 25 percent of U.S. schools that did not make AYP in 2003-2004, 6 percent failed solely because of their test participation rates (U.S. Department of Education 2007). In other words, roughly 1.5 percent of schools in the United States (6 percent of 25 percent) did not make AYP solely because of their test participation rates.
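
The share of all schools implied by these two figures can be worked out directly. The short Python sketch below uses only the rounded percentages quoted above, so the result is approximate; the variable names are ours.

```python
# Rounded figures quoted in the text (U.S. Department of Education 2007).
share_not_making_ayp = 0.25          # schools that did not make AYP, 2003-2004
share_participation_only = 0.06      # of those, failed solely on participation

# Share of ALL schools that missed AYP solely because of participation rates.
share_of_all_schools = share_not_making_ayp * share_participation_only
print(f"{share_of_all_schools:.1%}")  # 1.5%
```

Because both inputs are rounded, the underlying unrounded share could differ somewhat from this value.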

Special education students might have lower participation rates. A 2004 study sponsored by the National Center on Educational Outcomes found that the participation rates for students with an IEP could differ within states by as much as 40 to 50 percentage points (Thurlow 2004). However, only eight states had differences greater than 25 percentage points in the participation rates of IEP and non-IEP students. According to the Government Accountability Office (GAO 2005), in 2003-2004, eight states (Alabama, Arkansas, the District of Columbia, Georgia, New Mexico, New York, Pennsylvania, and Texas) had participation rates in ELA exams below 95 percent for students with disabilities, as compared to four states (Alabama, the District of Columbia, Georgia, and Texas) with participation rates below 95 percent for all students. The GAO nevertheless concluded that, for the United States as a whole, the participation rates for special education students were generally similar to those for all students.

State testing policies also influence the completeness of student test data. Under NCLB, states independently determine a testing window within which students must take or make up the state assessment. Longer testing windows allow time for more students to be tested.³³ Some states (for example, California, Colorado, and Washington) allow parents to opt out of testing for personal or religious reasons, exempting their children from the state assessment.

3. The Future of State Assessment Systems
The No Child Left Behind Act of 2001 expired in 2007 and various proposals for changes to the law have been offered as part of reauthorization efforts. Changes in regulations or priorities at the federal, state, or other levels are likely to prompt important changes in state testing policies, which in turn would prompt changes in the types of data potentially available for research purposes. The diversity and ever-changing nature of state assessment systems make it all the more important that researchers be mindful of the issues and assumptions involved when using state tests for education evaluations.


³¹ As of October 6, 2009, the complete text of the NCLB Act could be obtained from the U.S. Department of Education website (http://www.ed.gov/policy/elsec/leg/esea02/index.html), along with a variety of other useful summary and overview materials (see http://www.ed.gov/nclb/overview/intro/execsumm.pdf).
³² For example, CTB/McGraw-Hill offers “state specific” assessment products that reportedly are aligned with the content standards of 15 states, including California, New York, Florida, New Jersey, Pennsylvania, and Ohio (http://www.ctb.com/products/category_home.jsp?FOLDER%3C%3Efolder_id=2534374302134883&bmUID=1220106041853; accessed on October 6, 2009). Riverside Publishing claims to have “collaborated with over half of U.S. states to provide assessment programs designed to meet their state-specific, large scale testing needs” (http://www.riverpub.com/large-scaleprograms/; accessed on October 6, 2009).
³³ For example, in 2008-2009, New Jersey had a four-week testing window for grades three through eight, including the designated weeks for make-up testing (http://www.state.nj.us/education/assessment/schedule.shtml). In contrast, Texas requires that students take make-up exams within five days of the original testing date (http://ritter.tea.state.tx.us/student.assessment/admin/calendar/2007_2008_revised_01_17_08.pdf).