Third National Even Start Evaluation: Follow-Up Findings From the Experimental Design Study
NCEE 2005-3002
December 2004

Section Three: Description of the Evaluation

Implementation of the Evaluation

The original design for and resources allocated to the EDS called for an experiment to be conducted in 15 to 20 Even Start projects. In practice, the EDS was implemented in 18 projects which voluntarily agreed to randomly assign incoming families to be in Even Start or a control group, providing an experimental assessment of Even Start's impacts. A summary of the numbers of projects and participants at each stage of the evaluation is given in Figure 3.1.

EDS Sample and Evaluation Design. Projects were recruited during the 1999–2000 and 2000–2001 school years to participate in the EDS. During this time, all Even Start projects in the nation were screened for eligibility. To pass the eligibility screen, projects had to minimally meet Even Start's legislative requirements, be in operation for at least two years, plan to operate through the length of the study, plan to serve about 20 new families at the start of data collection, offer instructional services of moderate or high intensity relative to all Even Start projects, and be willing to participate in a random assignment study. Recruitment also sought projects from both urban and rural areas and projects that served varying proportions of ESL participants. Over the two recruitment years, 115 out of a universe of about 750 projects met the selection criteria. All 115 eligible projects were contacted: materials describing the study were sent, telephone calls were made to discuss the study, and site visits were made to many of the projects. In the end, 18 of these projects (about 15% of the eligible projects) were willing to participate in the study. The background characteristics of families in the two cohorts of projects were similar, so data were combined across all 18 projects for analytic purposes.

The fact that only 115 out of 750 projects met the selection criteria for the EDS should not call the validity of the study into question. The selection criteria outlined above were applied in order to obtain a sample of projects that would be operating during the time of the study, that were not brand-new projects, that offered a reasonable amount of instructional service, and that could recruit a sufficient number of new families. All of these are fair study requirements.

However, the fact that only 18 out of 115 eligible projects were willing to participate in the EDS does make us worry about the generalizability of the findings (see discussion below). Why was the rate of participation of projects in the study so low? The key reason is that participation in the evaluation was not mandated; it was not a condition for continued receipt of federal funding. Mandated participation in federal studies has been used in the recent past and has been shown to be very effective, e.g., for the Head Start Impact Study. In the absence of this sort of mandate, the EDS had to rely on incentives and the goodwill of project staff. Several incentives were offered, including a cash honorarium of $1,500 for each project, $20 for each family at each wave of data collection, and $15 for each teacher at each wave of data collection. Projects were offered the opportunity to meet with each other at national meetings, letters of commendation were written to local school boards, and discussions about the importance of the research were held with project staff. Of course, the main deterrent to participating in the EDS was the requirement that projects allow research staff to randomly assign incoming families to be in Even Start or a control group.

Randomization of Families. Each of the 18 EDS projects was asked to recruit families as they normally do and to provide listings of eligible families to Abt Associates staff who randomly assigned families either to participate in Even Start (two-thirds of the families) or to be in a control group (one-third of the families). Assignment to the control group meant that the family could not participate in Even Start for one year. A total of 463 families were randomly assigned in the EDS—309 to Even Start and 154 to the control group (Table 3.1 and Figure 3.1), maintaining the planned 2:1 ratio. This is an average of about 26 families per project.
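As a rough illustration of the 2:1 assignment described above, the following Python sketch shuffles a listing of eligible families and places two-thirds in the treatment group. It is a minimal sketch under stated assumptions, not the procedure Abt Associates used: the report does not describe the mechanics of randomization (for example, whether it was carried out separately within each project), and the function name `assign_families`, the seed, and the family identifiers are all hypothetical.

```python
import random

def assign_families(family_ids, treat_share=2 / 3, seed=0):
    """Randomly split family IDs into (treatment, control) lists.

    Hypothetical helper: the report does not specify how the
    assignment was actually carried out.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(family_ids)    # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_treat = round(len(shuffled) * treat_share)
    return shuffled[:n_treat], shuffled[n_treat:]

# Made-up identifiers standing in for the 463 randomized families.
families = [f"family_{i:03d}" for i in range(463)]
even_start, control = assign_families(families)
print(len(even_start), len(control))  # 309 154
```

Applied to a pooled list of 463 families, a two-thirds share rounds to the 309 and 154 families reported for the two groups.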

Instead of restricting children in the EDS to, say, preschoolers, children throughout the Even Start age range were included. Even though the EDS provided some data on all children in the study, the sample for analysis of literacy gains on direct assessments was limited to children who were at least 2.5 years old at the time of pretesting since most standardized literacy measures are not appropriate for children until they reach this age. About one-third of the children in the EDS were under 2.5 years of age at the time of pretest (Table 3.2). At the time of the follow-up, only about 10% of children in the EDS were under 2.5 years. Parent-report measures of child literacy skills were available for children of all ages.

Comparability of Even Start and Control Groups. Even Start and control families were statistically equivalent at the time of randomization and at the pretest (Table 3.3). Group equivalence at the time of randomization is guaranteed, within known statistical bounds, by proper implementation of random assignment and a sufficiently large sample size. However, 10% of the families were lost between the time of randomization and time of pretest. This attrition occurred equally in the Even Start and control groups. An analysis of pretest data showed that Even Start and control groups did not differ significantly on the percent of families where Spanish was spoken at home, families where English was spoken at home, Hispanic families, parents with a high school diploma or a GED, single parent households, employed parents, and households with annual income less than $9,000.

Generalizability of EDS Findings. The EDS used a random assignment design, the strongest approach for estimating program impacts. However, projects volunteered for this study instead of being randomly selected, so we cannot generalize to the Even Start population on a strict statistical basis. The plan was to select EDS projects to include urban and rural projects, projects that offer varying amounts of instruction, and projects that serve high and low percentages of ESL families. Due to the voluntary nature of the study, this plan could not be implemented perfectly, and while the EDS projects do represent major kinds of projects funded in Even Start, the data presented in Table 3.3 show that EDS families are more likely than the population of Even Start families to be Hispanic (75% vs. 46%). Further, 83% of EDS projects are in urban areas compared with 55% of all Even Start projects. These data suggest that findings from the EDS are most relevant to urban projects that serve large numbers of Hispanic/ESL families.

Data comparing the mean pretest scores of EDS families with the population of Even Start families on 18 parent-reported outcomes having to do with child literacy skills and home literacy activities are shown in St.Pierre, Ricciuti, Tao, et al. (2003, Exhibit 6.1.41). For most variables there is no difference between the two groups, and the data support the contention that there are no important differences between EDS families and the Even Start population in terms of parent-reported literacy skills and home literacy activities.

Data Collection. EDS data were collected at three time points. For the 11 projects that began the EDS in the 1999–2000 program year, pretest data were collected in fall 1999, posttest data in spring 2000, and follow-up data in spring 2001. For the seven projects that began the EDS in the 2000–2001 program year, pretest data were collected in fall 2000, posttest data in spring 2001, and follow-up data in spring 2002. In many projects, families entered Even Start on a rolling basis, so the pretest data collection was spread across several months (October through January) as new families entered the program. There was an average of 8.8 months between pretest and posttest, with a minimum of 5 months and a maximum of 12 months. There was an average of 19.6 months between pretest and follow-up, with a minimum of 16 months and a maximum of 24 months. Data collection from parents and children was done by field staff members who were recruited, trained, and employed by the research contractor. Field staff members had backgrounds in interviewing and in working with children, although experience assessing children and adults was not a prerequisite for employment.

Data Collection Response Rates. Response rates for the EDS data collection were high compared with those achieved by many educational studies: 90% at the pretest, 81% at the first posttest, and 76% at the follow-up assessment (Table 3.1 and Figure 3.1). Response rates are based on completed parent interviews, which generally correspond to the number of adults for whom we have direct assessment data. As mentioned above, the number of children for whom direct assessment data are available is smaller than the number of parents with such data, since child assessments could only be administered to children at least 2.5 years of age. Sample sizes for individual outcomes vary considerably due to (1) response rates, as noted above, (2) children who were too young to be tested, and (3) children/parents who were tested in Spanish.

We examined the comparability of the samples of families who were randomized (n=463), those who were assessed at pretest and posttest (n=364, reported on in St.Pierre, Ricciuti, Tao, et al., 2003), and those who were assessed at pretest, posttest, and follow-up (n=317, reported on in the current document). For the sample that was randomized, but never found at one of the assessment points, we have demographic information that was obtained as part of the consent and study enrollment process (Table 3.4). For the samples that were assessed at one or more time points, we also have further demographic data and PPVT pretest scores. It can be seen that the three samples are quite comparable with regard to demographics and pretest assessment scores. Since the data presented in Table 3.3 show that the Even Start and control groups were statistically equivalent at pretest, and the data presented in Table 3.4 show that families in the longitudinal analytic sample (Even Start and control group combined) have the same characteristics as families in the sample at pretest, sample bias does not appear to be a concern when interpreting the longitudinal results presented in this report.

Test Language. Many Even Start projects serve a high percentage of non-English speaking families, and deciding which language to use for literacy assessments posed difficult issues for this evaluation. We selected literacy measures that were available in both English and Spanish, e.g., the Peabody Picture Vocabulary Test and the Woodcock-Johnson. However, the English version of each measure was administered whenever possible. This approach served two purposes. First, assessing in English is consistent with Even Start's goal for adults and children to become literate in English. Second, assessing in English provides for the largest possible analytic sample of children and adults tested in a common language. We compared pretest data for adults and children tested in English with pretest data for the small number of adults and children tested in Spanish. In spite of the claims of publishers that English and Spanish test forms are "equivalent," we found very large differences in the pretest scores of English test-takers and Spanish test-takers, making us uneasy about combining the two sets of data. Just as difficult was the fact that some children and adults took the Spanish version of an assessment at one time (pretest, posttest, or follow-up) and the English version at another time. We were uneasy about trying to conduct any analysis of change when different test languages were used, and in the end we restricted analyses of child and parent literacy outcomes to children and parents who took the assessment in English at all three times (pretest, posttest, follow-up). This restriction led us to exclude 59 children and 86 parents from the analysis of child and parent literacy outcomes, representing 13% and 19%, respectively, of the total sample of 463 families. This limits the generalizability of findings to children and parents who were comfortable enough with English to be assessed in that language.

Analysis Sample and Methods. The bottom part of Figure 3.1 shows how the analysis sample was constructed for two key outcome measures: the PPVT, which was administered to children, and the WJ-R, which was administered to adults. Ninety-seven Even Start children and 44 control group children had valid PPVT scores at all three time points and thus formed the analysis sample for this outcome measure. Children were excluded from the analysis of PPVT data for several reasons:

  • Thirty-two Even Start and 14 control children were in families that could not be found for the pretest data collection.
  • Eighty-seven Even Start and 43 control children were too young (under age 2.5) to be tested at pretest.
  • Forty Even Start and 19 control children took the PPVT in Spanish (the TVIP) at one or more time points.
  • Fifty-three Even Start and 34 control children did not have a complete set of longitudinal data (pretest, posttest, follow-up).
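The exclusion counts listed above can be reconciled with the randomized totals by simple subtraction. The short Python sketch below does this bookkeeping; it assumes that the four exclusion categories are mutually exclusive and that each randomized family contributes one child to the count, which the report does not state outright but which the arithmetic is consistent with. The variable and function names are illustrative only.

```python
# Counts taken from the text: families randomized to each group.
randomized = {"even_start": 309, "control": 154}

# Children excluded from the PPVT analysis, by reason (counts from the list above).
excluded = {
    "not found at pretest":         {"even_start": 32, "control": 14},
    "under age 2.5 at pretest":     {"even_start": 87, "control": 43},
    "tested in Spanish (TVIP)":     {"even_start": 40, "control": 19},
    "incomplete longitudinal data": {"even_start": 53, "control": 34},
}

def analysis_sample(group):
    """Randomized count minus all exclusions (assumes categories do not overlap)."""
    return randomized[group] - sum(counts[group] for counts in excluded.values())

for group in randomized:
    print(group, analysis_sample(group))  # even_start 97, then control 44
```

The results match the 97 Even Start and 44 control children reported as the PPVT analysis sample, and the Spanish-language row (40 + 19) sums to the 59 children excluded on the basis of test language.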

Similar logic was followed to construct the analysis sample on the Woodcock-Johnson of 149 Even Start adults and 65 control group adults.

Separate analyses were conducted for each of 41 separate outcome variables. While a smaller set of composite variables could have been derived using factor analytic techniques, we chose to present each outcome separately so that readers have a clear understanding of the meaning of each outcome. The analysis consisted of a comparison of Even Start and control families pooled across all 18 of the projects participating in the study. In other words, each family in the evaluation was given equal weight in the analysis.

For continuous variables, differences in gains for treatment and control groups were tested by conducting a t-test on the simple pre-post gain score for each group. For dichotomous variables, a gain score was created that could take on the values of 0 (no change), -1 (negative change), or 1 (positive change). A McNemar test was then used to assess the differences in gain between treatment and control groups. For data collected through teacher ratings and school records, which were collected only at posttest, we tested for treatment/control differences with t-tests on the posttest scores.
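The gain-score comparison for continuous variables can be sketched in a few lines of Python. This is a minimal illustration, not the report's analysis code: it assumes a pooled-variance (Student) two-sample t-test, which the report does not specify, and the scores below are made-up values rather than EDS data. The helper names `gain_scores` and `pooled_t` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def gain_scores(pre, post):
    """Simple pre-post gain for each person (post minus pre)."""
    return [after - before for before, after in zip(pre, post)]

def pooled_t(x, y):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    se = sqrt(pooled_var * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / se, df

# Made-up standard scores for illustration only.
treat_pre, treat_post = [80, 85, 90, 78, 88], [86, 88, 97, 80, 95]
ctrl_pre, ctrl_post = [82, 84, 91, 79], [85, 85, 94, 83]

t_stat, df = pooled_t(gain_scores(treat_pre, treat_post),
                      gain_scores(ctrl_pre, ctrl_post))
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

A p-value would then come from the t distribution with the returned degrees of freedom, for example via `scipy.stats.ttest_ind` applied to the two lists of gain scores.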

Although the data for this study are nested within sites, analyses to account for such nesting were not possible due to small sample sizes within each site (see footnote 1). Using the Peabody Picture Vocabulary Test as an example, there was a total of 97 Even Start and 44 control children with usable longitudinal data. Within-site samples ranged from 1 to 13 Even Start children and from 1 to 6 control children. Using the Woodcock-Johnson Letter-Word Identification test as an example for adults, there was a total of 149 Even Start adults and 65 control adults with usable longitudinal data. Within-site samples ranged from 2 to 14 Even Start adults and from 1 to 10 control adults. Finally, note that the gain scores shown in the tables of results are simple pre-post differences.

Statistical Power. A total of 463 families were randomly assigned in the EDS: 309 to Even Start and 154 to the control group. For several reasons, the number of parents and children who enter into any given analysis of Even Start's effectiveness is smaller than these totals: some families could not be found at the time of pretesting, posttesting, or follow-up testing; some children accepted into the study were too young (under 2.5 years of age) to be pretested; and some parents and children were assessed but had missing data on selected items. The statistical power to detect effects in the EDS therefore varies across measures. To understand statistical power it is helpful to have a shared definition of an "effect" produced by a program such as Even Start. As an example, if Even Start had an effect of .50 standard deviations on the PPVT it would mean that the average child in Even Start gained a half standard deviation more than the average child in the control group. This is equivalent to 7.5 standard score points, because the PPVT standard deviation is 15.0 standard score points.

Table 3.5 shows statistical power for some of the key outcome measures. It can be seen that at follow-up, the EDS still had high statistical power to detect large and medium-sized effects, but poor power to detect small effects. Statistical power is greater than .85 for effects of .50 standard deviation (sd) or larger, greater than .70 for effects of .40 sd, and .85 or greater for effects of .30 sd for parents. But statistical power is less than .60 for effects of .30 sd for children, and less than .50 for effects of .20 sd or smaller.
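To make the link between effect size, sample size, and power concrete, the Python sketch below converts an effect in standard deviation units to PPVT standard score points and computes power with a textbook normal approximation for a two-group comparison. It is only a rough sketch: the report does not state the significance level, the sidedness of the test, or any adjustments behind Table 3.5, so the defaults here (alpha of .05, two-sided) are assumptions and the results should not be expected to reproduce the table. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def effect_in_points(effect_sd, scale_sd=15.0):
    """Convert an effect in sd units to standard score points (PPVT sd = 15)."""
    return effect_sd * scale_sd

def approx_power(effect_sd, n_treat, n_control, alpha=0.05, two_sided=True):
    """Normal-approximation power for a two-group mean difference.

    Assumed settings, not taken from the report. For a two-sided test the
    negligible chance of rejecting in the wrong direction is ignored.
    """
    z = NormalDist()
    se = sqrt(1 / n_treat + 1 / n_control)   # standard error in sd units
    crit = z.inv_cdf(1 - alpha / 2) if two_sided else z.inv_cdf(1 - alpha)
    return z.cdf(effect_sd / se - crit)

print(effect_in_points(0.50))  # 7.5

# PPVT analysis sample sizes from the report: 97 Even Start, 44 control children.
for d in (0.20, 0.30, 0.40, 0.50):
    print(f"effect {d:.2f} sd: approximate power {approx_power(d, 97, 44):.2f}")
```

Under these assumptions power rises steeply with effect size, which is the pattern the table describes: adequate for medium and large effects, weak for small ones.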

1 When possible, nesting should be taken into account in the analysis of nested designs. Here, we have a nested design but a nested analysis was not conducted due to small within-site sample sizes. When nesting is not taken into account, as in the present case, treatment effects may be overestimated.