Regional need and study purpose
The use of standardized benchmarks to differentiate instruction for students is receiving renewed attention (Bennett 2002; Public Agenda 2008; Russo 2002). Effective differentiation based on readiness, interests, and learning profiles requires a valid descriptive data set at the classroom level (Decker 2003). While teachers may use their own student-level assessments (tests, quizzes, homework, problem sets) to monitor learning, it is challenging to use performance on classroom measures to assess likely performance on external measures such as statewide tests or nationally normed standardized tests. School practitioners view benchmark measures reflective of such external tests as potentially more valid in making differentiated instruction decisions that can lead to student learning gains, higher scores on state standardized tests, and improvements in schoolwide achievement (Baenen et al. 2006; Baker and Linn 2003).
One of the most widely used commercially available systems incorporating benchmark assessment and training in differentiated instruction is the Northwest Evaluation Association's Measures of Academic Progress (MAP) program. MAP tests and training are used in more than 10 percent of K–12 school districts nationwide and in more than a third of districts in the Midwest (http://www.nwea.org/about/members.asp). The developer has produced numerous technical reports demonstrating strong evidence of reliability and validity in its portfolio of MAP assessments and operates the largest repository of student growth data in the country (Cronin et al. 2007). These features have influenced research partnerships between the Northwest Evaluation Association and education researchers that used MAP assessments as a key data source in studies of educational initiatives.
While the technical merits and popularity of MAP assessments have been widely referenced in practitioner-oriented journals and teacher magazines (Ash 2008; Olson 2007; Clarke 2006; Woodfield 2003; Russo 2002), studies investigating the effects of MAP or other benchmark assessment programs on student outcomes are scarce. In contrast, ample research on the effects of formative assessment¹ suggests that it is associated with improvements in student learning (Kingston and Nash 2009; Black and Wiliam 1998; Nyquist 2003; Meisels et al. 2003), particularly among low achievers (Black and Wiliam 1998) and students with learning disabilities (Fuchs and Fuchs 1986). As a consequence, the formative assessment literature is frequently cited to support the effectiveness of benchmark assessments (Perie, Marion, and Gong 2007).
Using the evidence from formative assessment research to demonstrate the effectiveness of benchmark assessment has at least three shortcomings. First, although many studies used experimental and quasi-experimental designs, most had design constraints, such as confounded treatments, nonequivalent comparison groups, and nonrandom assignment of participants to treatment groups, that undermine their validity (Dunn and Mulvenon 2009; Fuchs and Fuchs 1986). Second, only recently has formative assessment been clearly defined in the literature, with commonly used models. Thus wide variations among reported effects of formative assessment on student outcomes across studies have at least partly reflected differing (and often complex) conceptions about the nature of formative assessment (Dunn and Mulvenon 2009; Hattie and Timperley 2007). Finally, most formative assessment practices investigated in these studies focused on classroom-based assessment practices, which are administered much more frequently than benchmark assessments and are used to guide classroom instruction on a day-to-day basis (Torgesen and Miller 2009).
More recently, the Regional Educational Laboratory Northeast and Islands published two studies on the impact of benchmark assessments on student outcomes. The studies found no significant differences in math achievement gains between schools that used quarterly benchmark exams and schools that did not (Henderson et al. 2007). While similar in focus, the two studies differ from the current investigation of MAP in at least two ways. First, the studies focused on the impact of benchmark testing, whereas the current study focuses on the impact of a program that relies on training to understand and use MAP assessment results to differentiate instruction for students. Second, both previous studies used a quasi-experimental design to create a set of comparison schools that were similar to treatment schools across several observable characteristics, leaving open the possibility that other known or unknown factors could have influenced the study's findings. This study will build on these two studies by using random assignment to control for both known and unknown factors that could influence findings.
No other strong experimental or quasi-experimental studies were found that investigated the effects of benchmark assessment or benchmark assessment training on student outcomes. Extensive use of MAP among districts and schools, its lack of an empirical research base, and a projected growth in the number of schools investing in MAP or similar programs call for further investigation to determine its effectiveness and potential return on investment.
1 The more recent research literature on formative assessment distinguishes it from benchmark assessment (Torgesen and Miller 2009; Perie, Marion, and Gong 2007). This summary uses the term formative assessment to denote “a process used by teachers and students during instruction that provides feedback to adjust ongoing teaching and learning” (Council of Chief State School Officers 2008, p. 3). Benchmark assessment is used much less frequently (three to four times a year) and is designed primarily for predicting a student's academic success, monitoring progress, and providing information about a student's performance on a specific set of standards or skills that teachers can use to differentiate instruction. Additionally, while formative assessment is conducted unobtrusively as part of normal classroom activity, benchmark assessment is administered as an interrupted event outside the context of normal instruction (Hunt and Pellegrino 2002).
This study uses an experimental design to assess the effectiveness of the MAP benchmark testing system and its associated teacher training on elementary students' reading performance. This study is designed to provide causal evidence on the following question:
Key outcomes will address the effect of the MAP program on student achievement. The final report is intended to inform education policymakers and practitioners about whether frequent access to standardized benchmark data, in combination with training to understand and use the data for differentiating classroom instruction, leads to improvements in student performance after two years. The study's findings will be limited to the sample of schools involved in the study.
The MAP program is a collection of computer-adaptive assessments in reading, language usage, mathematics, and science that places students on a continuum of learning from pre–grade 3 to grade 10. Each MAP assessment uses a continuous interval Rasch unit (RIT) scale to evaluate student growth and mastery of various subject-area, strand-defined skills. The developer has conducted scale alignment studies linking the MAP assessment's RIT scale to proficiency levels from assessments in all 50 states and the District of Columbia that provide evidence of the relationship between the MAP assessments and each state's assessment (Northwest Evaluation Association 2005; Brown and Coughlin 2007). In addition, studies have established evidence that MAP assessments sufficiently predict performance on assessments in at least five states (Cronin et al. 2007; Steering Committee of the Delaware Statewide Academic Growth Assessment Pilot 2007). Relying on this evidence, schools and teachers use MAP results to monitor student progress toward state proficiency standards.
The MAP developer recommends that schools administer each subject-area test three times a year (fall, winter, spring), with a fourth administration suggested during summer school. The test is computer adaptive, and students receive an overall score at the conclusion of the test. Typically, teachers can generate customized reports to review students' performance on key subject domains and goal strands within 24 hours of test completion.
MAP training involves four one-day sessions, along with conference calls and on-site visits from a MAP coach over the school year to support implementation. The training is intended to equip teachers with the knowledge and skills to administer the tests; generate and interpret individual, group, and classroom-level outcome reports; use report results to determine student readiness and differentiate instruction; and use MAP data to set student growth goals and evaluate instructional programs and practices. In each one-day session, a certified MAP trainer lectures and facilitates a structured set of activities on one of these four major areas.
Schools may also schedule three or four additional consultative sessions with a MAP trainer to provide further training on specific areas. For instance, teachers may request assistance generating reports or understanding how to use the results to group students for reading instruction or to target individual student skill needs. Visits are typically one to two hours and, depending on teacher preferences, may occur before, during, or after school. Teacher groups may also schedule quarterly conference calls to receive extra support on administering, accessing, or using MAP data in the classroom. Table 1 shows the typical MAP testing and training schedule for the study schools.
Table 1. Measures of Academic Progress testing and training timeline
|Testing schedule||Testing conducted three times a year in 2008/09 and 2009/10|
|X ->||X||X ->||X||X ->||X|
|One-day training sessions||2008/09 only|
|Step 1: MAP administration||X|
|Step 2: Use of MAP data||X|
|Step 3: Differentiated instruction||X|
|Step 4: Growth and goals||X|
|Ongoing school-based support||Schedule varies by school needs, 2008/09 and 2009/10|
|Consultative on-site school visits||X||X||X|
|Quarterly conference calls||X||X||X||X|
Source: Research team's analysis.
A key assumption underlying the training is that differentiated instruction relies on the availability of periodic assessment data and that effective use of the data relies on a clear and functional understanding of techniques in differentiation. During the school year, teachers have unrestricted access to student-level MAP results from multiple test administrations. They also have access to on-line resources to assist them in interpreting results, reconfiguring instructional strategies, and tailoring instruction to student needs.
This study uses a cluster-randomized design to randomly assign grade 4 and 5 classroom teachers in each school, as a grade-level group, to either receive the MAP program or to conduct business as usual, with no exposure to the MAP tests or training program. For instance, if grade 5 teachers in school A were assigned to the intervention condition, then grade 4 teachers in the same school were assigned to the control condition and asked not to participate in the MAP program. In this way, the control condition for grade 4 contains grade 4 classes from schools in which MAP was randomly assigned to grade 5, and the control condition for grade 5 contains grade 5 classes from schools in which grade 4 was randomly assigned to the MAP program.
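The within-school assignment described above can be sketched in a few lines of Python. This is only an illustration, not the study team's procedure: the source does not say how the randomization was carried out, so the sketch assumes an independent coin flip per school, and the school labels, the seed, and the function name `assign_grades` are all hypothetical.

```python
import random


def assign_grades(school_ids, seed=0):
    """Randomly pick one grade (4 or 5) per school to receive MAP.

    The other grade in the same school becomes that school's control group.
    Assumption: an independent, unblocked draw for each school.
    """
    rng = random.Random(seed)  # fixed seed so the assignment can be reproduced
    assignment = {}
    for school in school_ids:
        treated_grade = rng.choice([4, 5])
        control_grade = 5 if treated_grade == 4 else 4
        assignment[school] = {treated_grade: "MAP", control_grade: "control"}
    return assignment


# 32 hypothetical school labels (the study sample has 32 schools)
schools = [f"school_{i:02d}" for i in range(1, 33)]
assignment = assign_grades(schools, seed=2008)

# The grade 4 control group is made up of grade 4 classes from schools
# whose grade 5 was assigned to MAP, and vice versa.
grade4_controls = [s for s, a in assignment.items() if a[4] == "control"]
grade5_controls = [s for s, a in assignment.items() if a[5] == "control"]
```

Because every school contributes one intervention grade and one control grade, each school appears in exactly one of the two control lists. A simple coin flip does not guarantee an even split between the two lists; a blocked or balanced draw would, but the source does not indicate which approach was used.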
The study will collect data on teachers and students in grades 4 and 5 during the 2008/09 and 2009/10 school years. Because the MAP training will not be fully delivered until the end of 2008/09, the study is unlikely to detect significant effects at the end of the first year. At the start of the second year, grade 4 students who were previously assigned to teachers in the intervention condition will be placed with grade 5 teachers assigned to the control condition. This creates the possibility that prior exposure to teachers receiving MAP training could affect grade 5 control group students' performance during 2009/10 and possibly dilute any true differences in performance between the groups. For these reasons, confirmatory analyses will be restricted to grade 4 students' performance after the second year of implementation. Exploratory, descriptive analyses, presented in a separate report, will investigate grade 4 and 5 students' performance after one year of implementation, in addition to grade 5 students' performance after two years.
MAP logic model. The logic model underlying the MAP program is diagrammed in figure 1. It presumes that if benchmark data are used to differentiate classroom instruction and if training in using benchmark data is instrumental for effective delivery of tailored instruction, then teachers in the intervention group should be more effective than teachers in the control group in meeting student learning needs. Teacher effectiveness, in turn, should result in higher performance by students in intervention classrooms on the Illinois Standards Achievement Test (ISAT), which is administered every spring to all Illinois students in grades 3–8. MAP testing is spaced across the school year to enable teachers to alter their instructional approaches between MAP test administrations.
Figure 1. Measures of Academic Progress training and in-class practices
Study eligibility. A primary criterion for study participation is that schools and teachers did not previously use MAP or similar computer-adaptive benchmark testing programs. Schools in Illinois with at least one full-time grade 4 teacher and one grade 5 teacher who teach the general elementary curriculum to students in a self-contained classroom in the same building were eligible for the study.
These criteria are designed to enable the study team to make its best determination about what the outcomes would have been for the intervention group had it never been exposed to the MAP program and continued working in a business-as-usual fashion. Business as usual does not preclude control group teachers from testing their students or using the results of other assessments in making instructional decisions. It only prohibits teachers from administering MAP, attending MAP training, or using a similar computer-adaptive benchmark assessment and training program.
Study sample. The study team aimed to recruit at least 30 schools in order to detect an effect size of 0.2 standard deviation units on the ISAT and MAP assessments. The sample includes 174 regular education reading teachers at grades 4 and 5 in 32 elementary schools and five districts in Illinois (table 2). Of the 32 schools, 27 serve preK–5 or K–5, 2 serve grades K–6, and the remaining 3 serve grades K–8, 3–8, and 4–6. Thirty-three percent of students in the study schools are members of racial/ethnic minorities, and 44 percent qualify for free or reduced-price lunch. Ninety-nine percent (174 of 176) of eligible teachers in the study schools agreed to participate.
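To show how a target such as 0.2 standard deviations relates to the number of schools, the sketch below uses a common approximation for the minimum detectable effect size (MDES) in a two-level cluster-randomized design. The source does not report the study team's power calculation, so every input here (students per school, intraclass correlation, variance explained by covariates) is a made-up value chosen for illustration, and the multiplier of 2.8 is the usual rough figure for 80 percent power with a two-tailed test at the .05 level.

```python
import math


def mdes(n_clusters, n_per_cluster, icc,
         r2_cluster=0.0, r2_student=0.0,
         p_treated=0.5, multiplier=2.8):
    """Approximate minimum detectable effect size, in standard deviation units.

    n_clusters    : number of randomized clusters (schools, within one grade)
    n_per_cluster : students per cluster
    icc           : share of outcome variance lying between clusters
    r2_cluster    : share of between-cluster variance explained by covariates
    r2_student    : share of within-cluster variance explained by covariates
    p_treated     : proportion of clusters assigned to the intervention
    multiplier    : about 2.8 for 80% power, two-tailed alpha = .05 (approximation)
    """
    balance = p_treated * (1 - p_treated)
    between = icc * (1 - r2_cluster) / (balance * n_clusters)
    within = (1 - icc) * (1 - r2_student) / (balance * n_clusters * n_per_cluster)
    return multiplier * math.sqrt(between + within)


# Hypothetical inputs, NOT taken from the study
example = mdes(n_clusters=30, n_per_cluster=60, icc=0.15,
               r2_cluster=0.8, r2_student=0.5)
print(round(example, 3))
```

With these invented inputs the result is a little under 0.2. The between-school term dominates the calculation, which is why the number of schools, rather than the number of students, drives the detectable effect in designs of this kind.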
Table 2. Characteristics of study sample
|Total number of schools||32|
|Number of schools in a mid-size city||20|
|Number of schools in suburbs, towns, and rural areas||12|
|Number of eligible teachers||176|
|Number of participating teachers||174|
|Students eligible for free or reduced-price lunch (percent)||44|
|Students belonging to a racial/ethnic minority (percent)||33|
Source: Research team's analysis based on 2006/07 data from the National Center for Education Statistics Common Core of Data (U.S. Department of Education 2009).
Key outcomes and measures
The reading performance of students in both intervention and control classrooms will be assessed with the ISAT in spring 2009 and spring 2010. A composite measure of reading and language usage from the spring 2009 and 2010 administrations of the MAP, given to students in both groups, will provide the post-test measure for comparing student achievement across experimental conditions. Confirmatory analyses will be restricted to grade 4 students' performance after the second year of implementation. Exploratory, descriptive analyses, presented in a separate report, will investigate grade 4 and 5 students' performance after one year of implementation and grade 5 students' performance after two years. Data on teachers will also be collected to measure fidelity of implementation, using instructional logs, student engagement surveys for a sample of eight students in each classroom, observations of teachers' instruction, and principal and teacher surveys.
Data collection approach
Table 3 summarizes the study's two-year data collection plan for student outcomes for the 2008/09 and 2009/10 school years. Student outcome data include annual (spring 2009 and spring 2010) student assessment results on the ISAT in reading and on MAP tests in reading and language usage. (The fall and winter MAP reading and language usage assessments are administered only to students in intervention classrooms.) Data on teachers will also be collected to measure fidelity of implementation.
Table 3. Data collection schedule for the Measures of Academic Progress impact study, student reading performance, 2008/09 and 2009/10
|Data collection elements||Aug||Sep||Oct||Nov||Dec||Jan||Feb||Mar||Apr||May|
|Illinois Standards Achievement Test (ISAT)||X|
|Measures of Academic Progress (MAP) assessment||X||X||X|
|Student rating form||X|
|School leadership survey||X|
Source: Research team.
The effects of MAP on student achievement will be estimated from grade 4 data in year two, after teachers have used the MAP program elements for a full academic year. Analyses of year one grade 4 and 5 data and year two grade 5 data will be exploratory, and the findings will be released in a separate descriptive report. To account for the variance in the outcome measure at multiple levels, multilevel analysis (hierarchical linear modeling) will be used to determine impacts on student outcomes. Students will be nested in schools, allowing the study to account for variance between students within schools and variance across schools. Analyses will be conducted separately at grades 4 and 5.
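One way to write down a two-level model of the kind described above is shown here. The source does not give the model specification, so this is a generic sketch under stated assumptions: a single student-level covariate $X_{ij}$ (for example, a prior-year test score) stands in for whatever covariates are actually used, and $T_j$ is an indicator that the grade being analyzed was assigned to MAP in school $j$. All symbols are introduced for illustration only.

```latex
\begin{aligned}
\text{Level 1 (student } i \text{ in school } j\text{):}\quad
  & Y_{ij} = \beta_{0j} + \beta_{1} X_{ij} + e_{ij},
  \qquad e_{ij} \sim N(0, \sigma^{2}) \\
\text{Level 2 (school } j\text{):}\quad
  & \beta_{0j} = \gamma_{00} + \gamma_{01} T_{j} + u_{0j},
  \qquad u_{0j} \sim N(0, \tau^{2})
\end{aligned}
```

In this formulation $\gamma_{01}$ is the estimated effect of assignment to MAP on the outcome $Y_{ij}$, $\sigma^{2}$ captures variance between students within schools, and $\tau^{2}$ captures variance across schools. Because the model is fitted separately for each grade, the treatment indicator varies only at the school level within a given analysis.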
In year one, the ISAT scores of students in intervention classrooms will be compared with those of students in control classrooms, adjusting for pre-existing differences using the previous year's ISAT scores and student, teacher, and school characteristics. In year two, the same analysis will be conducted with a new cohort of grade 4 and 5 students. Additionally, the study will assess possible differences in achievement between the two experimental conditions by comparing results on the MAP reading and language usage tests administered in spring 2009 and spring 2010.
A concern with using the MAP test in assessing the MAP program is overalignment of the test with the content of the MAP intervention. Overalignment could occur for three reasons: more frequent administration of the MAP test to the intervention group than to the control group, MAP teachers' use of terminology or concepts from the MAP training program that are not ordinarily used in classrooms, and different testing conditions for intervention and control groups.
The MAP assessment includes several features to limit any advantage a student or teacher might gain by becoming familiar with the test over time. The test is not timed, teachers do not have access to test items, and individual items will not be re-administered to the same student for two consecutive years. In addition, MAP test items are aligned with state content standards, and developers strive to maintain the test's high reliability and validity for predicting state achievement test performance. The developer trains school-based MAP test proctors to achieve consistency across testing events. Finally, as an additional measure to mitigate cross-group contamination, the developer will turn off the scoring function on the MAP test for the control group to prevent control teachers and students from seeing their MAP scores and preclude control teachers from generating MAP reports.
David Cordray, Vanderbilt University; Georgine Pion, Vanderbilt University; Matt Dawson, Learning Point Associates; and W. Christopher Brandt, Learning Point Associates.
W. Christopher Brandt
Regional Educational Laboratory Midwest at Learning Point Associates
1120 E. Diehl Road, Suite 200
Naperville, IL 60563-1486
Phone: (630) 649-6649
Fax: (630) 649-6700
References
Ash, K. (2008). Adjusting to test takers. Education Week, 28(13), 1–4.
Baenen, N., Ives, S., Lynn, A., Warren, T., Gilewicz, E., and Yaman, K. (2006). Effective practices for at-risk elementary and middle school students (No. 06.03). Raleigh: Wake County Public School System.
Baker, E.L., and Linn, R.L. (2003). Validity issues for accountability systems. In Fuhrman, S. H., and Elmore, R. F. (Eds.). Redesigning accountability systems for education (pp. 47–72). New York: Teachers College Press.
Bennett, R. E. (2002). Using electronic assessment to measure student performance. Princeton, NJ: National Governors Association.
Black, P., and Wiliam, D. (1998). Inside the black box: Raising standards through classroom assessment. Phi Delta Kappan, 80(2), 139–48.
Brown, R.S., and Coughlin, E. (2007). The predictive validity of selected benchmark assessments used in the Mid-Atlantic Region (Issues and Answers Report, REL 2007-No. 017). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Mid-Atlantic. Retrieved on January 26, 2009, from http://ies.ed.gov/ncee/edlabs.
Clarke, B. (2006). Breaking through to reluctant readers. Educational Leadership, 63(5), 66–69.
Cronin, J., Kingsbury, G. G., Dahlin, M., Adkins, D., and Bowe, B. (2007). Alternate methodologies for estimating state standards on a widely-used computer adaptive test. Paper presented at the Annual Conference of the American Educational Research Association, Chicago, IL.
Council of Chief State School Officers. (2008). Attributes of effective formative assessment. Washington, DC: Council of Chief State School Officers. Accessible online at www.ccsso.org/publications/details.cfm?PublicationID=362.
Decker, G. (2003). Using data to drive student achievement in the classroom and on high-stakes tests. THE Journal. Retrieved October 22, 2008, from www.thejournal.com/articles/16259.
Dunn, K.E., and Mulvenon, S.W. (2009). A critical review of research on formative assessment: the limited scientific evidence of the impact of formative assessment in education. Practical Assessment, Research and Evaluation, 14(7), 1–11.
Fuchs, L. S., and Fuchs, D. (1986). Effects of systematic formative evaluation: a meta-analysis. Exceptional Children, 53(3), 199–208.
Hattie, J., and Timperley, H. (2007). The power of feedback. Review of Educational Research, 77(1), 81–112.
Henderson, S., Petrosino, A., Guckenburg, S., and Hamilton, S. (2007). Measuring how benchmark assessments affect student achievement (Issues and Answers Report, REL 2007-No. 39). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Northeast and Islands. Retrieved on January 26, 2009, from http://ies.ed.gov/ncee/edlabs.
Hunt, E. and Pellegrino, J.W. (2002). Issues, examples, and challenges in formative assessment. New Directions for Teaching and Learning, 89, 73–85.
Kingston, N., and Nash, B. (2009). The efficacy of formative assessment: a meta-analysis. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA.
Meisels, S.J., Atkins-Burnett, X., Xue, Y., Bickel, D., Son, S., and Nicholson, J. (2003). Creating a system of accountability: the impact of instructional assessment on elementary children's achievement test scores. Education Policy Analysis Archives, 11(9). Retrieved January 29, 2009, from http://epaa.asu.edu/epaa/v11n9/.
Northwest Evaluation Association. (2005). RIT scale norms for use with achievement level tests and Measures of Academic Progress. Lake Oswego, OR: Northwest Evaluation Association.
Nyquist, J. (2003). Reconceptualizing feedback as formative assessment: a meta-analysis. Unpublished Master's Thesis. Vanderbilt University, Department of Psychology and Human Development.
Olson, A. (January 2007). Growth measures for systemic change. The School Administrator, 1(64). Retrieved January 26, 2009, from www.aasa.org/SchoolAdministratorArticle.aspx?id=7350.
Perie, M., Marion, S., and Gong, B. (2007). The role of interim assessments in a comprehensive assessment system. Washington, DC: The Aspen Institute.
Public Agenda. (2008). REL Midwest task 1.1: Regional education needs analysis year two report. Unpublished report available on request from REL Midwest.
Russo, A. (2002). Mixing technology and testing. The School Administrator, 59(4), 6–12.
Steering Committee of the Delaware Statewide Academic Growth Assessment Pilot (2007). Toward a more powerful student assessment system: the evaluation and recommendations of the Delaware Statewide Academic Assessment. Retrieved November 2, 2009, from http://www.mapuser.k12.de.us/files/DE_Growth_Pilot_Eval_FINAL_REPORT.pdf.
Torgesen, J. K., and Miller, D. H. (2009). Assessments to guide adolescent literacy instruction. Portsmouth, NH: RMC Research Corporation, Center on Instruction.
Woodfield, K. (January 2003). Getting on board with online testing. THE Journal, 30(6), 32–37.
U.S. Department of Education, National Center for Education Statistics. (2009). Common Core of Data. Washington, DC. Retrieved on February 24, 2009, from http://nces.ed.gov/ccd/bat/.