Technical Methods Report: Estimation and Identification of the Complier Average Causal Effect Parameter in Education RCTs

NCEE 2009-4040
April 2009

Chapter 6: Empirical Analysis

RCTs in the education field often report the same significance levels for the ITT and CACE estimators considered above, which implicitly ignores the estimation error in the denominators of these estimators. This chapter uses data from ten RCTs to assess this practice.

Data
Data for our analysis come from ten large published RCTs conducted by Mathematica Policy Research, Inc. (MPR). We selected these RCTs due to their significance for policy and their coverage of a wide range of interventions found in the education and social policy fields. Most of these evaluations were advised by national panels of evaluation and subject-area experts. Appendix Table B.1 lists the RCTs and summarizes the basic features of each one, including the 20 key outcome variables selected for our analysis, the covariates used in regression adjustment, and the unit of random assignment (that is, the level of clustering).

The RCTs include six evaluations of K-12 educational interventions. The remaining four RCTs include evaluations of interventions in welfare, labor, and early childhood education, which are included to help gauge the robustness of our findings beyond the K-12 setting. Overall, the ten studies span a wide range of outcomes, geographic areas, and target populations, and there is a mix of clustered and nonclustered designs. All ten studies were used for the standardized ITT analysis.

The CACE analysis was conducted using data from seven RCTs in which noncompliers could be identified using service receipt data. Appendix Table B.2 provides the definitions of program "participation" used in our CACE analysis and shows unadjusted service receipt rates in the treatment and control groups. For the 21st Century, New York City Voucher, Power4Kids, Early Head Start, and Job Corps evaluations, we defined program participation using the same rules as the studies themselves. The Teacher Induction and Education Technologies evaluations did not conduct CACE analyses, so we developed illustrative rules for defining noncompliers using the available service receipt data. The CACE analysis was not conducted for the remaining three RCTs (the Teach for America, San Diego Food Stamp Cash-Out, and Teenage Parent Demonstration evaluations) because study subjects fully complied with their treatment assignments, so the CACE and ITT parameters coincide.

For the CACE analysis, individuals were coded as service recipients if they received at least a minimal amount of services. It is appropriate to set the bar low for defining service receipt to ensure that impacts on never-takers are likely to be zero (see assumptions U3 and S5 above).

Methods

Data from each RCT were used to obtain (1) uncorrected variance estimators where the denominator terms of the impact estimators were assumed to be known, and (2) corrected variance estimators that accounted for all sources of estimation error.

Variance estimators for α̂ITT_E were obtained using (28). To apply (28), we estimated between-unit ANCOVA models to obtain α̂ITT and then used (14) to obtain AsyV̂ar(α̂ITT). The ANCOVA models included covariates as similar as possible to those used in the published studies (see Table B.1).6 The estimation of σy and AsyVar(Sy) involved a straightforward application of (26) and (29). Equations (31), (22), and (23) were used to obtain AsyV̂ar(α̂CACE_E). Similar impact and variance results were found using simple differences-in-means procedures and the other estimation methods discussed above (not shown).
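
To make the structure of these calculations concrete, the sketch below implements a two-term delta-method variance for the standardized ITT estimator α̂ITT_E = α̂ITT/Sy, in the spirit of (28). All inputs are illustrative, and the normal-theory approximation used for AsyVar(Sy) is our assumption standing in for equations (26) and (29), not the report's exact formulas.

    import numpy as np

    def corrected_se_itt_e(alpha_itt, var_alpha_itt, s_y, n):
        # Term 1 of a (28)-style formula: variance treating S_y as known.
        term1 = var_alpha_itt / s_y**2
        # Term 2: delta-method correction for estimation error in S_y.
        # AsyVar(S_y) is approximated by the normal-theory value
        # S_y^2 / (2*(n - 1)); the report's equation (29) may differ.
        asyvar_s_y = s_y**2 / (2.0 * (n - 1))
        term2 = (alpha_itt**2 / s_y**4) * asyvar_s_y
        return np.sqrt(term1), np.sqrt(term1 + term2)

    # Hypothetical inputs: impact 0.10, SE 0.04 (on the S_y = 1 scale), n = 500.
    se_uncorr, se_corr = corrected_se_itt_e(0.10, 0.04**2, 1.0, 500)
    print(100 * (se_corr - se_uncorr) / se_uncorr)  # percent increase in the SE

With these hypothetical values the correction raises the standard error by about 0.3 percent, the same order of magnitude as the empirical results reported below.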

The CACE analysis required the estimation of the fraction p1 of individuals who were compliers. To do this, we defined a binary variable dij that was set to 1 if the individual received services and to 0 otherwise. We then estimated p1 as the coefficient on Ti from a between-unit regression of di on Ti and the same covariates that were used to estimate the ITT parameters. Similar results were found using simple differences-in-means procedures and logit models.
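
A minimal, nonclustered sketch of this first-stage regression follows; the simulated data, variable names, and the 75 percent compliance rate are all hypothetical stand-ins for an actual RCT.

    import numpy as np
    import statsmodels.api as sm

    # Simulated stand-in data (all names and rates are hypothetical).
    rng = np.random.default_rng(0)
    n = 1000
    T = rng.integers(0, 2, n)        # treatment assignment indicator
    X = rng.normal(size=(n, 3))      # baseline covariates
    # One-sided noncompliance: only treatment-group members can receive
    # services, and about 75 percent of them do.
    d = (T * (rng.random(n) < 0.75)).astype(float)

    design = sm.add_constant(np.column_stack([T, X]))
    fit = sm.OLS(d, design).fit()
    p1_hat = fit.params[1]           # coefficient on T estimates p1
    print(p1_hat)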

For each outcome, we quantified the importance of the variance corrections in two ways. First, we calculated the difference between the corrected standard error (the square root of the sum of the two terms in (28) or the four terms in (31)) and the uncorrected standard error (the square root of the first term in (28) or (31)) as a percentage of the uncorrected standard error. Second, we used t-statistics to assess the effect of the variance corrections on the statistical significance of α̂ITT_E and α̂CACE_E by calculating the absolute difference between the corrected and uncorrected p-values.7
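
In code, these two diagnostics amount to the following sketch; the normal reference distribution is an assumption for illustration, since the report's t-statistics may use study-specific degrees of freedom.

    from scipy import stats

    def correction_diagnostics(alpha_hat, se_uncorr, se_corr):
        # Diagnostic 1: SE increase as a percentage of the uncorrected SE.
        pct_se = 100.0 * (se_corr - se_uncorr) / se_uncorr
        # Diagnostic 2: absolute change in the two-sided p-value.
        p_uncorr = 2.0 * stats.norm.sf(abs(alpha_hat) / se_uncorr)
        p_corr = 2.0 * stats.norm.sf(abs(alpha_hat) / se_corr)
        return pct_se, abs(p_corr - p_uncorr)

    # Hypothetical inputs: impact 0.10, uncorrected SE 0.040, corrected SE 0.0401.
    print(correction_diagnostics(0.10, 0.040, 0.0401))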

The importance of the variance corrections will depend on the size of the impact estimates. Thus, to assess the sensitivity of our main findings to larger impacts than were typically found in the considered RCTs, we conducted simulations assuming that impacts were 0.25 standard deviations, a value that education RCTs are often powered to detect (Schochet 2007). For these simulations, the variances of the nominal ITT impact estimators were assumed to be the same as those observed in the data. Finally, for each outcome, we conducted a related analysis by identifying the smallest positive impact values for which the variance corrections would raise the standard errors of the impact estimators by 5 percent above the uncorrected values. An increment of this size would cause an impact estimate with an uncorrected p-value of 0.04 to become barely insignificant at the 0.05 level after the correction.
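
Both calculations in this paragraph are easy to reproduce. If the correction enters the variance as an additive term k·α², the smallest impact that inflates the uncorrected standard error by 5 percent solves sqrt(V0 + k·α²) = 1.05·sqrt(V0), and a 5 percent inflation does push a two-sided p-value of 0.04 just past 0.05. A sketch under these assumptions:

    import numpy as np
    from scipy import stats

    def min_impact_for_5pct_se_rise(var_uncorr, k):
        # Solve sqrt(V0 + k*alpha^2) = 1.05 * sqrt(V0) for alpha.
        return np.sqrt((1.05**2 - 1.0) * var_uncorr / k)

    # The p-value arithmetic from the text: p = 0.04 corresponds to t ~ 2.05;
    # inflating the SE by 5 percent shrinks t by a factor of 1/1.05.
    t = stats.norm.isf(0.04 / 2.0)            # ~2.054
    p_after = 2.0 * stats.norm.sf(t / 1.05)   # ~0.0504, barely above 0.05
    print(p_after)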

Because our variance formulas are based on asymptotic normality of the impact estimators and assume equal cluster sizes, they are only approximations. Thus, to evaluate whether our formulas apply well to sample and cluster sizes that arise in practice, we compared p-values based on our variance formulas with those based on a nonparametric bootstrap. The two methods yield very similar p-values (Appendix Table B.3).
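
The bootstrap comparison can be sketched as follows, resampling whole clusters within each treatment arm so that clustered designs are handled appropriately; this resampling scheme, the difference-in-means impact, and all inputs are our assumptions, not necessarily the procedure behind Appendix Table B.3.

    import numpy as np

    def cluster_bootstrap_pvalue(y, t, cluster, n_boot=2000, seed=0):
        rng = np.random.default_rng(seed)
        idx_by_cluster = {c: np.flatnonzero(cluster == c)
                          for c in np.unique(cluster)}
        t_ids = np.unique(cluster[t == 1])   # treatment-arm clusters
        c_ids = np.unique(cluster[t == 0])   # control-arm clusters

        def impact(idx):
            yy, tt = y[idx], t[idx]
            return yy[tt == 1].mean() - yy[tt == 0].mean()

        alpha_hat = impact(np.arange(len(y)))
        boots = np.empty(n_boot)
        for b in range(n_boot):
            # Resample clusters with replacement, separately by arm.
            draw = np.concatenate([rng.choice(t_ids, size=len(t_ids)),
                                   rng.choice(c_ids, size=len(c_ids))])
            idx = np.concatenate([idx_by_cluster[c] for c in draw])
            boots[b] = impact(idx)
        # Two-sided p-value: how often the recentered bootstrap impact is at
        # least as extreme as the observed impact.
        return np.mean(np.abs(boots - alpha_hat) >= np.abs(alpha_hat))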

Results

The nominal ITT estimates are statistically significant at the five percent level for half of the 20 outcomes included in the analysis (Table 6.1; Column 4). These significance levels also apply to the standardized ITT and CACE estimates (discussed later) when the uncorrected variance estimates are used. Estimates of ITT_E are less than 0.15 standard deviations for 16 of the 20 outcomes (Table 6.1; Column 5). The Power4Kids study had the largest intervention effects (0.38 and 0.22 standard deviations for its two reading outcomes).

Compliance rates varied across the seven studies included in the CACE analysis (Table 6.2; Column 2). The compliance rate was at least 88 percent in four RCTs and ranged from 72 to 77 percent in the other three. Because α̂CACE_E = α̂ITT_E/p̂1, α̂CACE_E moves closer to α̂ITT_E as the estimated compliance rate increases; for example, a compliance rate of 0.88 inflates the standardized ITT estimate by about 14 percent, whereas a rate of 0.72 inflates it by about 39 percent (Table 6.2).

Is Accounting for the Estimation Error in the Denominator of α̂ITT_E Important?
The answer to this question is "no." We find strong evidence that accounting for the estimation error in Sy has a negligible effect on the standard error of α̂ITT_E (Table 6.3). In our data, the correction term raises the standard error of α̂ITT_E by less than one-quarter of 1 percent for 18 out of 20 outcome variables, and the correction never increases the standard error by more than 2 percent (Table 6.3; Column 5). Similarly, the correction has a trivial effect on the statistical significance of α̂ITT_E. As shown in the final column of Table 6.3, the correction changes the p-value of α̂ITT_E only at the fourth or higher decimal place.

The correction for the estimation error in Sy would remain ignorable even if the ITT estimates were 0.25 standard deviations, which is larger than most of our observed α̂ITT_E values (Table 6.4; Column 2). As expected, when α̂ITT_E is set to 0.25, the correction becomes more important than before, but it still has a very small effect on the standard errors (less than a 2 percent increase in all but one instance). Similarly, the p-value of α̂ITT_E is hardly affected; the absolute increase in the p-value due to the correction never exceeds 0.001 (Table 6.4; Column 3). In fact, if α̂ITT_E were 0.25, the t-statistic of the estimate would typically be so far out in the right tail of the distribution that the slight decrease in the t-statistic from the correction would leave the p-value virtually unchanged.
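
A quick numeric check of this tail argument, using a hypothetical standard error of 0.04 and a normal reference distribution:

    from scipy import stats

    # With alpha_ITT_E = 0.25 and a hypothetical SE of 0.04, t ~ 6.25.
    t0 = 0.25 / 0.04
    t1 = t0 / 1.02                     # a 2 percent SE inflation shrinks t slightly
    p_before = 2 * stats.norm.sf(t0)   # ~4e-10
    p_after = 2 * stats.norm.sf(t1)    # ~9e-10
    print(abs(p_after - p_before))     # far below 0.001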

Similarly, we find that on average across the considered RCTs, α̂ITT_E would need to be about 0.8 standard deviations for the corrections to increase the standard error of α̂ITT_E by 5 percent (last column in Table 6.4). This is a large effect size in social policy evaluations, and is more than double the largest ITT impact found in our studies.

Is Accounting for the Estimation Error in the Denominator of α̂CACE_E Important?
The answer to this question is also "no." We find that the variance corrections exert somewhat more influence on the variance estimates for α̂CACE_E than for α̂ITT_E, but the influence is still generally very small; only in rare instances do these corrections change the variance estimates by more than 1 percent.

Our key finding is that the standard error of α̂CACE_E does not rise noticeably when the correction terms involving Sy and p̂1 are included in the variance calculations (Table 6.5). The corrections increase the uncorrected standard errors by less than 0.5 percent for all outcomes except the Word Attack Score in the Power4Kids study, where the increase is 1.6 percent (Table 6.5; Column 5). The effect of the corrections on the p-values is negligible; the corrections never raise or lower a p-value by more than 0.001 (Table 6.5; last column).

We also find that none of the individual correction terms in equation (31) is consistently important (Table 6.6). For 12 of the 14 outcome variables, every correction term is less than 0.5 percent of the uncorrected variance of α̂CACE_E. Interestingly, AsyCov(α̂ITT, p̂1) has no consistent sign. In some instances, the variance reduction due to a negative covariance term offsets the positive variance contributions of the other correction terms, which explains why the corrections reduce, rather than increase, the standard errors in some cases (as indicated by negative values in the fifth column of Table 6.6).
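
To see how a signed covariance term can offset the other corrections, consider a generic delta-method variance for a ratio estimator of the form α̂CACE_E = α̂ITT/(p̂1·Sy). The four terms below mirror the structure described for (31), but this is our reconstruction under the added simplification that covariances involving Sy are zero, not the report's exact formula.

    def var_cace_e(alpha, p1, s_y, v_alpha, v_p1, v_s_y, cov_alpha_p1):
        # Gradient of g(a, p, s) = a / (p * s) at the estimates.
        g_a = 1.0 / (p1 * s_y)          # d g / d alpha_ITT
        g_p = -alpha / (p1**2 * s_y)    # d g / d p1_hat
        g_s = -alpha / (p1 * s_y**2)    # d g / d S_y
        return (g_a**2 * v_alpha                   # term 1: uncorrected part
                + g_p**2 * v_p1                    # term 2: error in p1_hat
                + g_s**2 * v_s_y                   # term 3: error in S_y
                + 2.0 * g_a * g_p * cov_alpha_p1)  # term 4: signed covariance

Because g_a and g_p have opposite signs when alpha is positive, a positive Cov(α̂ITT, p̂1) makes the fourth term negative, which can offset the second and third terms and even push the corrected standard error below the uncorrected one, matching the sign behavior noted above.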

Simulations suggest that these conclusions hold if α̂CACE_E is set to 0.25 (Table 6.7; Columns 2 and 3). In this scenario, the correction terms raise the standard error of α̂CACE_E by less than 2 percent for all but one outcome, and the corresponding rise in the p-value never exceeds 0.001. Furthermore, on average across the considered RCTs, the standardized CACE impact would need to be 0.7 standard deviations for the corrections to raise the standard error of α̂CACE_E by 5 percent (Table 6.7; Column 4).


6 The impact estimates and uncorrected variance estimates that we report differ slightly from those in the published study reports because we standardized estimation methods across studies and because of small differences in covariate sets, the treatment of strata, and weighting schemes. However, the two sets of findings are very similar (see Appendix Table B.1).
7 We focus on absolute, rather than percentage, changes in the p-values because a large percentage change in a p-value may have only a trivial effect on statistical significance if the original, uncorrected p-value is already small.