
What Works Clearinghouse


Appendices


Appendix A1.1 Study characteristics: Johnson & Hall, 2003 (quasi-experimental design)

| Characteristic | Description |
|---|---|
| Study citation | Johnson, J., & Hall, M. (2003). Technical report: Houghton Mifflin California math performance evaluation. Raleigh, NC: EDSTAR, Inc. |
| Participants | The participants in this study were second through fifth graders from 16 districts in California. The intervention group included 160¹ schools from eight districts using Houghton Mifflin Mathematics. The comparison group included 137 schools in eight different districts. The intervention group was identified by Houghton Mifflin, which provided the names of eight districts in California that began using Houghton Mifflin Mathematics in 2002. Using data from the Quality Education Database, the California Department of Education, and the American Institutes for Research, comparison districts were matched based on prior math achievement scores, student demographic characteristics, and district sizes. |
| Setting | The participating school districts were located throughout California. |
| Intervention | The intervention group used the 2002 edition of Houghton Mifflin Mathematics and had completed their first year of implementing the curriculum during the 2001-2002 school year. |
| Comparison | There is no information in the study about the specific math programs used in the comparison school districts, except that the schools did not use Houghton Mifflin Mathematics. |
| Primary outcomes and measurement | The outcome measure was the total math score on the California statewide assessment, the Standardized Testing and Reporting (STAR) Stanford 9 test, used during the 2000-01 and 2001-02 school years. (See Appendix A2 for more detailed descriptions of outcome measures.) The study authors reported scores as national percentile ranks, but the WWC reports scaled scores sent by the author in response to a data request, because scaled scores are more direct indicators of performance and do not require extrapolation based on national norms. |
| Teacher training | No information is available on the training or professional development provided to the teachers in the intervention group. |

1 Some of the grade level analyses contained fewer than 160 intervention schools because not all schools had all grade levels.


Appendix A1.2 Study characteristics: EDSTAR, Inc., 2004 (quasi-experimental design)

| Characteristic | Description |
|---|---|
| Study citation | EDSTAR, Inc. (2004). Large-scale evaluation of student achievement in districts using Houghton Mifflin. Raleigh-Durham, NC: Author. |
| Participants | The participating 519 schools were selected from different regions of the country including the West (California), the Midwest (Illinois, Missouri, and Wisconsin), the Northeast (New Jersey and New York), and the Southeast (South Carolina). The grade levels evaluated varied by state: California, grades 2-5; South Carolina, grades 3-5; Missouri, New Jersey, New York, and Wisconsin, grade 4; Illinois, grades 3 and 5. The authors indicate that no attrition occurred in this study. Due to the confounding of the intervention effect with the effect of other district characteristics,¹ the analysis was limited to a sample of 16 districts (eight pairs) and 212 schools in the three states that had multiple districts in the intervention and comparison groups: California, New Jersey, and South Carolina. |
| Setting | Districts were selected in various states to represent ranges in size, demographic characteristics, and student achievement. Within districts, schools were matched based on size of schools, student achievement level, school socioeconomic level, and school minority level. |
| Intervention | The eight districts in the intervention group had begun using Houghton Mifflin Mathematics in 2002-03. |
| Comparison | The comparison group used one of three types of math programs: reform, traditional, or balanced. The reform programs included Everyday Math, Mathland, and Excel Math. The traditional programs included Saxon and SRA. Scott Foresman 2000, Harcourt-Brace Mathematics, and Silver Burdett comprised the balanced programs. This WWC report focuses on an analysis of a reduced sample of states and therefore includes only comparison groups with balanced (California and South Carolina) and reform (New Jersey) programs. |
| Primary outcomes and measurement | The outcome measures were the state achievement tests used by each state in the study. Due to differences in state tests and state standards, results for each state were analyzed and evaluated separately. (See Appendix A2 for more detailed descriptions of outcome measures.) The study authors reported scores as percent of students at or above proficiency. |
| Teacher training | No information is available on the training or professional development provided to the teachers in the intervention group. |

1 For more information see the WWC Technical Paper on Teacher-Intervention Confound.


Appendix A2 Outcome measures in the mathematics achievement domain

| Outcome measure | Description |
|---|---|
| Standardized Testing and Reporting (STAR) Stanford 9 test | Johnson and Hall (2003) used the 2001 and 2002 Stanford 9 scaled test scores to measure mathematics achievement. The test scores were obtained from the California Department of Education website. |
| State achievement tests | EDSTAR, Inc. (2004) used state achievement tests from California, New Jersey, and South Carolina to measure students' mathematics achievement.¹ For California, the authors used two tests from the Standardized Testing and Reporting (STAR) program of the California Assessment System: the California Standards Test and the Stanford 9 test. In 2003 the Stanford 9 test was replaced by another norm-referenced test, the California Achievement Test (as cited in EDSTAR, Inc., 2004). The California Standards Test was administered to grades 2-9 and the Stanford 9 test was administered to grades 2-11. In New Jersey, the state assessment was the Elementary School Proficiency Assessment (ESPA), which is administered to fourth-grade students. For South Carolina, the authors used results from the Palmetto Achievement Challenge Test, which was administered to students in grades 3-8. |

1 Additional outcome measures (state tests for Illinois, Missouri, and Wisconsin) were reported by the study authors but are not described here because these analyses were excluded from the WWC report due to a confound between the district and the intervention.


Appendix A3 Summary of study findings included in the rating for the mathematics achievement domain¹

Columns 4 and 5 are the author's findings from the study (mean outcome, with standard deviation² in parentheses). Columns 6 through 9 are WWC calculations.

| Outcome measure | Study sample | Sample size (schools/districts, except where indicated) | Houghton Mifflin Mathematics group³ | Comparison group³ | Mean difference⁴ (Houghton Mifflin Mathematics - comparison) | Effect size⁵ | Statistical significance⁶ (at α = 0.05) | Improvement index⁷ |
|---|---|---|---|---|---|---|---|---|
| Johnson & Hall, 2003 (quasi-experimental design)⁸ | | | | | | | | |
| CA STAR test: 2002 SAT9 mean scaled scores | 16 California school districts: grade 2 | 297/16 | 592.52 (nr) | 586.12 (nr) | 6.40 | na¹⁰ | ns | na¹⁰ |
| CA STAR test: 2002 SAT9 mean scaled scores | 16 California school districts: grade 3 | 296/16 | 618.04 (nr) | 615.11 (nr) | 2.93 | na¹⁰ | ns | na¹⁰ |
| CA STAR test: 2002 SAT9 mean scaled scores | 16 California school districts: grade 4 | 296/16 | 636.87 (nr) | 632.60 (nr) | 4.27 | na¹⁰ | ns | na¹⁰ |
| CA STAR test: 2002 SAT9 mean scaled scores | 16 California school districts: grade 5 | 293/16 | 657.34 (nr) | 654.13 (nr) | 3.21 | na¹⁰ | ns | na¹⁰ |
| Average⁹ for mathematics achievement (Johnson & Hall, 2003) | | | | | | na¹⁰ | ns | na¹⁰ |
| EDSTAR, Inc., 2004 (quasi-experimental design)⁸ | | | | | | | | |
| NJ ASK4 exam: percent at or above proficiency, 2002-03 | New Jersey: grade 4 | 16/4 | 40.50 (nr) | 37.70 (nr) | 2.80 | na¹⁰ | ns | na¹⁰ |
| SC PACT exam: percent at or above proficiency, 2002-03 | South Carolina: grades 3-5 | 128/8 | 34.30 (nr) | 32.10 (nr) | 2.20 | na¹⁰ | ns | na¹⁰ |
| CA CAT/6 exam: percent at or above proficiency, 2002-03 | California: grades 2-5 | 68/4 | 36.40 (nr) | 38.70 (nr) | -2.30 | na¹⁰ | ns | na¹⁰ |
| Average⁹ for mathematics achievement (EDSTAR, Inc., 2004) | | | | | | na¹⁰ | ns | na¹⁰ |
| Average⁹ for mathematics achievement across all studies | | | | | | na¹⁰ | na | na¹⁰ |

ns = not statistically significant
na = not applicable
nr = not reported
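
As a quick consistency check on the Appendix A3 figures, the mean-difference column can be recomputed as the Houghton Mifflin Mathematics group mean minus the comparison group mean. The sketch below is illustrative only: the row labels are shorthand introduced here, and the means and reported differences are copied from the Appendix A3 rows.

```python
# Each row: (shorthand label, intervention mean, comparison mean, reported difference).
# Values are copied from Appendix A3; the labels are abbreviations made up for this sketch.
ROWS = [
    ("CA SAT9 grade 2", 592.52, 586.12, 6.40),
    ("CA SAT9 grade 3", 618.04, 615.11, 2.93),
    ("CA SAT9 grade 4", 636.87, 632.60, 4.27),
    ("CA SAT9 grade 5", 657.34, 654.13, 3.21),
    ("NJ grade 4", 40.50, 37.70, 2.80),
    ("SC grades 3-5", 34.30, 32.10, 2.20),
    ("CA grades 2-5", 36.40, 38.70, -2.30),
]


def mean_difference(intervention: float, comparison: float) -> float:
    """Intervention mean minus comparison mean, rounded to two decimals."""
    return round(intervention - comparison, 2)


for label, hm_mean, comp_mean, reported in ROWS:
    diff = mean_difference(hm_mean, comp_mean)
    status = "matches" if abs(diff - reported) < 0.005 else "differs from"
    # A positive sign favors the intervention group; a negative sign favors the comparison group.
    print(f"{label}: {diff:+.2f} ({status} the reported value)")
```

A positive difference favors the intervention group and a negative difference favors the comparison group, in line with note 4.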

1 This appendix reports findings considered for the effectiveness rating and the average improvement indices.
2 The standard deviation across all students in each group shows how dispersed the participants' outcomes are: a smaller standard deviation on a given measure would indicate that participants had more similar outcomes.
3 The intervention and control group values are based on information provided by the authors for both the Johnson and Hall (2003) and EDSTAR, Inc. (2004) studies. These values may differ from what appeared in the original studies.
4 Positive differences and effect sizes favor the intervention group; negative differences and effect sizes favor the comparison group.
5 For an explanation of the effect size calculation, see Technical Details of WWC-Conducted Computations.
6 Statistical significance is the probability that the difference between groups is a result of chance rather than a real difference between the groups.
7 The improvement index represents the difference between the percentile rank of the average student in the intervention condition and that of the average student in the comparison condition. The improvement index can take on values between -50 and +50, with positive numbers denoting favorable results.
8 The level of statistical significance was reported by the study authors or, where necessary, calculated by the WWC to correct for clustering within classrooms or schools and for multiple comparisons. For an explanation about the clustering correction, see WWC Tutorial on Mismatch. See Technical Details of WWC-Conducted Computations for the formulas the WWC used to calculate statistical significance. In the case of Johnson and Hall (2003) and EDSTAR, Inc. (2004), a correction for clustering was needed, so the statistical significance reported by the WWC may differ from that reported by the study authors.
9 The WWC-computed average effect size for each study and for the domain across studies are simple averages rounded to two decimal places. The average improvement indices are calculated from the average effect sizes.
10 Student-level standard deviations were not available for this study. In Johnson & Hall (2003), school-level standard deviations for grades 2 through 5 were 21.56, 20.65, 20.21, and 20.66 for the intervention group and 20.72, 20.00, 19.16, and 19.29 for the comparison group. In EDSTAR, Inc. (2004), school-level standard deviations for the New Jersey, South Carolina, and California samples were 22.00, 15.20, and 18.30 for the intervention group and 21.90, 13.10, and 16.60 for the comparison group. Because the student-level effect size and improvement index could not be computed, the magnitude of the effect size was not considered for rating purposes. However, the statistical significance for this study is comparable to other studies and is included in the intervention rating. For further details, please see Technical Details of WWC-Conducted Computations.
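
Notes 7 and 9 describe how an improvement index follows from an effect size. The sketch below is a minimal illustration, assuming normally distributed outcomes so that the index is the standard normal percentile of the effect size minus 50. The function names and the effect sizes are hypothetical. No effect sizes could be computed for the two studies in this report, so none of the numbers below come from them.

```python
from statistics import NormalDist, mean


def improvement_index(effect_size: float) -> float:
    """Expected percentile-rank change for the average comparison student.

    Assumes normally distributed outcomes: the index is the standard normal
    CDF at the effect size, expressed in percentile points, minus 50.
    """
    return (NormalDist().cdf(effect_size) - 0.5) * 100


def domain_average(effect_sizes: list[float]) -> tuple[float, float]:
    """Simple average of effect sizes (two decimals) and the index derived from it."""
    avg = round(mean(effect_sizes), 2)
    return avg, improvement_index(avg)


# Hypothetical effect sizes, for illustration only.
hypothetical_effect_sizes = [0.10, 0.25, -0.05]
avg, index = domain_average(hypothetical_effect_sizes)
print(f"average effect size = {avg:.2f}, improvement index = {index:+.1f}")
```

Because the normal CDF lies between 0 and 1, the index stays between -50 and +50, and an effect size of zero gives an index of zero.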


Appendix A4 Houghton Mifflin Mathematics rating for the mathematics achievement domain

The WWC rates the effects of an intervention in a given outcome domain as positive, potentially positive, mixed, no discernible effects, potentially negative, or negative.1

For the outcome domain of mathematics achievement, the WWC rated Houghton Mifflin Mathematics as having no discernible effects. It did not meet the criteria for positive effects because no studies met WWC evidence standards for a strong design or showed significant, positive effects. Further, it did not meet the criteria for other ratings (potentially positive, mixed, potentially negative, and negative effects) because neither of the two studies showed statistically significant or substantively important effects, either positive or negative.

Rating received

No discernible effects: No affirmative evidence of effects.

  • Criterion 1: None of the studies shows a statistically significant or substantively important effect, either positive or negative.

    Met. The two studies of Houghton Mifflin Mathematics showed indeterminate effects.

Other ratings considered

Positive effects: Strong evidence of a positive effect with no overriding contrary evidence.

  • Criterion 1: Two or more studies showing statistically significant positive effects, at least one of which met WWC evidence standards for a strong design.

    Not met. The WWC analysis found no statistically significant positive effects in this domain.

  • Criterion 2: No studies showing statistically significant or substantively important negative effects.

    Met. The WWC analysis found no statistically significant or substantively important negative effects in this domain.

Potentially positive effects: Evidence of a positive effect with no overriding contrary evidence.

  • Criterion 1: At least one study showing a statistically significant or substantively important positive effect.

    Not met. The WWC analysis found no statistically significant or substantively important positive effects in this domain.

  • Criterion 2: No studies showing a statistically significant or substantively important negative effect. Fewer or the same number of studies showing indeterminate effects than showing statistically significant or substantively important positive effects.

    Not met. Two studies showed indeterminate effects, and no studies of Houghton Mifflin Mathematics showed statistically significant or substantively important effects, either positive or negative.

Mixed effects: Evidence of inconsistent effects as demonstrated through either of the following criteria.

  • Criterion 1: At least one study showing a statistically significant or substantively important positive effect. At least one study showing a statistically significant or substantively important negative effect, but no more such studies than the number showing a statistically significant or substantively important positive effect.

    Not met. The WWC analysis found no statistically significant or substantively important effects in this domain.

  • Criterion 2: At least one study showing a statistically significant or substantively important effect, and more studies showing an indeterminate effect than showing a statistically significant or substantively important effect.

    Not met. The WWC analysis found no statistically significant or substantively important effects in this domain.

Potentially negative effects: Evidence of a negative effect with no overriding contrary evidence.

  • Criterion 1: At least one study showing a statistically significant or substantively important negative effect.

    Not met. The WWC analysis found no statistically significant or substantively important negative effects in this domain.

  • Criterion 2: No studies showing a statistically significant or substantively important positive effect, or more studies showing statistically significant or substantively important negative effects than showing statistically significant or substantively important positive effects.

    Not met. The WWC analysis found no statistically significant or substantively important positive effects in this domain.

Negative effects: Strong evidence of a negative effect with no overriding contrary evidence.

  • Criterion 1: Two or more studies showing statistically significant negative effects, at least one of which met WWC evidence standards for a strong design.

    Not met. The WWC analysis found no statistically significant or substantively important negative effects in this domain.

  • Criterion 2: No studies showing statistically significant or substantively important positive effects.

    Met. The WWC analysis found no statistically significant or substantively important positive effects in this domain.

1 For rating purposes, the WWC considers the statistical significance of individual outcomes and the domain level effect. The WWC also considers the size of the domain level effect for ratings of potentially positive effects. See the WWC Intervention Rating Scheme for a complete description.
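
The criteria listed in this appendix can be read as a decision procedure. The sketch below is one possible encoding, not the WWC's implementation. The `Study` class, its field names, and the order in which the ratings are checked are assumptions made for illustration. The appendix does not say how to break ties when more than one rating's criteria are satisfied, and the sketch leaves out the domain-level effect mentioned in note 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    # "positive" or "negative" means a statistically significant or
    # substantively important effect in that direction; otherwise "indeterminate".
    direction: str
    significant: bool = False    # statistically significant, not just substantively important
    strong_design: bool = False  # met WWC evidence standards for a strong design


def rate_domain(studies: list[Study]) -> str:
    pos = [s for s in studies if s.direction == "positive"]
    neg = [s for s in studies if s.direction == "negative"]
    ind = [s for s in studies if s.direction == "indeterminate"]
    sig_pos = [s for s in pos if s.significant]
    sig_neg = [s for s in neg if s.significant]

    # The checking order below is an assumption of this sketch.
    if len(sig_pos) >= 2 and any(s.strong_design for s in sig_pos) and not neg:
        return "positive"
    if len(sig_neg) >= 2 and any(s.strong_design for s in sig_neg) and not pos:
        return "negative"
    if pos and not neg and len(ind) <= len(pos):
        return "potentially positive"
    if neg and (not pos or len(neg) > len(pos)):
        return "potentially negative"
    mixed_1 = bool(pos) and bool(neg) and len(neg) <= len(pos)
    mixed_2 = bool(pos or neg) and len(ind) > len(pos) + len(neg)
    if mixed_1 or mixed_2:
        return "mixed"
    return "no discernible effects"


# The two studies in this report both showed indeterminate effects.
print(rate_domain([Study("indeterminate"), Study("indeterminate")]))
```

With two indeterminate studies, every earlier check fails and the function falls through to "no discernible effects", which matches the rating received.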


Appendix A5 Extent of evidence by domain

| Outcome domain | Number of studies | Sample size: schools | Sample size: students | Extent of evidence¹ |
|---|---|---|---|---|
| Math achievement | 2 | Over 800 | nr | Medium to large |

nr = not reported

1 A rating of "medium to large" requires at least two studies and two schools across studies in one domain, and a total sample size across studies of at least 350 students or 14 classrooms. Otherwise, the rating is "small."
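
The threshold rule in the note above can be written as a small function. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and an unreported count ("nr") is represented as `None`. Treating each school as contributing at least one classroom is an assumption of the usage example, not something the report states.

```python
from typing import Optional


def extent_of_evidence(
    n_studies: int,
    n_schools: int,
    n_students: Optional[int] = None,    # None when the count is not reported
    n_classrooms: Optional[int] = None,  # None when the count is not reported
) -> str:
    """Return "medium to large" or "small" using the thresholds in the note."""
    enough_sample = (n_students is not None and n_students >= 350) or (
        n_classrooms is not None and n_classrooms >= 14
    )
    if n_studies >= 2 and n_schools >= 2 and enough_sample:
        return "medium to large"
    return "small"


# Illustrative call: two studies, 800 schools, student count not reported.
# Using the school count as a lower bound on classrooms is an assumption.
print(extent_of_evidence(n_studies=2, n_schools=800, n_classrooms=800))
```

A single study, or a sample below both thresholds, yields "small".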


