## Equivalence

- Is establishing baseline equivalence between the intervention and comparison groups as important in randomized controlled trial (RCT) studies as in quasi-experimental designs (QEDs)?
- What effect size measures and thresholds does the WWC use to assess baseline equivalence based on non-continuous covariates?
- What types of statistical adjustments are needed to demonstrate baseline equivalence between groups? When are the adjustments needed?
- A gain score analysis that simply subtracts the baseline score from the outcome score is not sufficient for meeting WWC standards. However, some difference-in-differences analyses include pretests and, with a pooled regression including dummy and interaction terms, do statistically adjust for differences in the pretest. Why is the difference-in-differences approach not an acceptable adjustment for baseline differences?
- If the groups are equivalent at baseline (that is, the difference on key characteristics listed in the protocol is less than |0.05| standard deviations), should the study still adjust for baseline characteristics?
- If teacher-level analyses focus on class (or aggregate) achievement (e.g., class mean score), can the prior year’s class achievement be used as baseline, even though the class composition of students will likely vary?

### Is establishing baseline equivalence between the intervention and comparison groups as important in randomized controlled trial (RCT) studies as in quasi-experimental designs (QEDs)?

It depends. The importance of baseline equivalence in RCT studies depends on two factors: attrition and the integrity of randomization. The WWC assesses equivalence in an RCT if there is high attrition or concern about the integrity of the randomization process. In those cases, the study must demonstrate equivalence as discussed in the *WWC Procedures and Standards Handbook* (version 3.0). If an RCT is required to demonstrate equivalence, the highest rating it can receive is *meets WWC group design standards with reservations*. If there are no concerns about the randomization process and attrition is low, the study is eligible for the rating of *meets WWC group design standards without reservations*, even without a demonstration of baseline equivalence.

### What effect size measures and thresholds does the WWC use to assess baseline equivalence based on non-continuous covariates?

The WWC uses the same thresholds for the two effect size measures it uses: Hedges’ g for continuous variables and the Cox index for dichotomous variables. The *WWC Procedures and Standards Handbook* (version 3.0) provides details on the thresholds and on how to calculate both Hedges’ g and the Cox index (pages 15–16 and F.1–F.6). The WWC looks at the absolute magnitude of the difference between the intervention and comparison groups on baseline characteristics identified in the review protocol. For characteristics with a difference of |0.05| standard deviations or less, no adjustments are necessary to demonstrate equivalence. A study with differences between |0.05| and |0.25| standard deviations must statistically control for the characteristics in the analysis to demonstrate equivalence. If a difference is greater than |0.25| standard deviations, then the study does not demonstrate equivalence for that characteristic, even if it is included as a statistical control in the analysis.
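
To make the two measures and the threshold logic concrete, here is a minimal Python sketch (our illustration, not WWC tooling), using the standard formulas for Hedges’ g (pooled-SD standardized mean difference with a small-sample correction) and the Cox index (log odds ratio divided by 1.65); consult the handbook pages cited above for the authoritative definitions:

```python
import math

def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Hedges' g: pooled-SD standardized mean difference for a continuous
    variable, with the small-sample correction omega = 1 - 3/(4N - 9)."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    omega = 1 - 3 / (4 * (n_t + n_c) - 9)
    return omega * (mean_t - mean_c) / pooled_sd

def cox_index(p_t, p_c):
    """Cox index: log odds ratio divided by 1.65, which puts a dichotomous
    variable on a scale comparable to a standardized mean difference."""
    return (math.log(p_t / (1 - p_t)) - math.log(p_c / (1 - p_c))) / 1.65

def equivalence_status(effect_size):
    """Classify an absolute baseline difference against the thresholds above."""
    d = abs(effect_size)
    if d <= 0.05:
        return "satisfied"             # no adjustment needed
    if d <= 0.25:
        return "adjustment required"   # must control for the characteristic
    return "not equivalent"            # adjustment cannot salvage it
```

Either effect size, once computed, is compared against the same |0.05| and |0.25| boundaries.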

### What types of statistical adjustments are needed to demonstrate baseline equivalence between groups? When are the adjustments needed?

Some studies must include the baseline characteristics in the analytic model to demonstrate equivalence. The *WWC Procedures and Standards Handbook* (version 3.0) indicates (on pages 15–16) that for differences in baseline characteristics that are between |0.05| and |0.25| standard deviations, the analysis must include a statistical adjustment for the baseline characteristics to meet the baseline equivalence requirement.

The WWC review protocols provide guidance on what kinds of baseline variables must be examined in order to demonstrate equivalence and how the statistical adjustment should be made for multiple outcomes in the domain. A typical statistical adjustment involves using a regression analysis to estimate program impacts, where the key baseline variable requiring adjustment is included as a covariate in the analytical model.

### A gain score analysis that simply subtracts the baseline score from the outcome score is not sufficient for meeting WWC standards. However, some difference-in-differences analyses include pretests and, with a pooled regression including dummy and interaction terms, do statistically adjust for differences in the pretest. Why is the difference-in-differences approach not an acceptable adjustment for baseline differences?

It depends. Studies that must demonstrate equivalence (those with a baseline difference between |0.05| and |0.25| standard deviations) do not demonstrate it by analyzing gain or change scores: simply subtracting the baseline does not provide sufficient statistical control, because it does not account for the correlation between the outcome and the baseline measure. However, a difference-in-differences analysis that includes the baseline measure as a covariate would be considered by the WWC as demonstrating baseline equivalence. When a QED or high-attrition RCT study has baseline differences in the range of |0.05| to |0.25| standard deviations, the WWC requires analyses that include baseline measures as covariates, such as a regression or ANCOVA analysis.
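
The distinction can be seen in a small, hypothetical numerical sketch (ours, not a WWC example). With noiseless data generated from a true treatment effect of 2 and a pretest slope of 0.5, the gain-score contrast, which implicitly forces the pretest coefficient to equal 1, misses the effect entirely, while a regression that includes the pretest as a covariate recovers it:

```python
# Hypothetical toy data: Y = 10 + 2*T + 0.5*X exactly (true effect = 2),
# with the comparison group starting lower on the pretest X.
T = [0, 0, 0, 1, 1, 1]                      # treatment indicator
X = [0, 2, 4, 4, 6, 8]                      # pretest (baseline) scores
Y = [10 + 2*t + 0.5*x for t, x in zip(T, X)]

# Gain-score estimate: treatment-minus-comparison difference in mean (Y - X).
gains = [y - x for y, x in zip(Y, X)]
gain_t = sum(g for g, t in zip(gains, T) if t) / sum(T)
gain_c = sum(g for g, t in zip(gains, T) if not t) / (len(T) - sum(T))
gain_score_effect = gain_t - gain_c         # 0.0: the true effect is masked

# Covariate-adjusted (ANCOVA-style) estimate: OLS of Y on T and X,
# solved from the normal equations with Cramer's rule.
n = len(T)
tb, xb, yb = sum(T) / n, sum(X) / n, sum(Y) / n
Stt = sum((t - tb)**2 for t in T)
Sxx = sum((x - xb)**2 for x in X)
Stx = sum((t - tb) * (x - xb) for t, x in zip(T, X))
Sty = sum((t - tb) * (y - yb) for t, y in zip(T, Y))
Sxy = sum((x - xb) * (y - yb) for x, y in zip(X, Y))
ancova_effect = (Sty * Sxx - Stx * Sxy) / (Stt * Sxx - Stx**2)  # 2.0
```

In this constructed example, the baseline gap exactly cancels the true effect in the gain-score analysis. Including the pretest as a covariate estimates the pretest slope from the data instead of assuming it equals 1.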

### If the groups are equivalent at baseline (that is, the difference on key characteristics listed in the protocol is less than |0.05| standard deviations), should the study still adjust for baseline characteristics?

No, but such adjustments are welcome because they potentially improve the precision of estimates. The WWC does not formally require statistical adjustments for differences of less than or equal to |0.05| standard deviations.

### If teacher-level analyses focus on class (or aggregate) achievement (e.g., class mean score), can the prior year’s class achievement be used as baseline, even though the class composition of students will likely vary?

Yes, if the focus of the analysis is the classroom or teacher, then the prior year’s class achievement scores can be used as the baseline. The WWC reviews studies that use aggregate-level data to calculate effects, called cluster designs. In many cases, student composition in the teacher- or class-level data changes over time. If the composition of students changed over the course of the study, the WWC will note that any observed impact may be due in part to those changes in student composition within classrooms.

## Comparison Group

### Does the WWC have preferred matching methods for identifying a comparison group?

No, the WWC does not advocate any one matching method over another. Authors may use a variety of matching methods, including propensity score matching, Mahalanobis distance matching, or coarsened exact matching. In general, matching procedures are a way to create equivalent groups. For example, propensity score matching is a method by which a study can create intervention and comparison groups that are equivalent, on average, across a number of characteristics. If a study uses propensity score matching, the variables identified in the review protocol should be included in the matching process to make it more likely that equivalence will be achieved on the variables that the WWC will examine. However, regardless of the matching method used, the WWC still requires that the study demonstrate the equivalence of the intervention and comparison groups on the key variables identified in the appropriate WWC review protocol. Including these variables in the construction of the propensity scores used for matching or weighting does not replace or alter this requirement. The WWC review protocols are available on the WWC website.
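
As a hypothetical sketch of the intuition behind one common approach (not a WWC recommendation), greedy one-to-one nearest-neighbor matching on already-estimated propensity scores might look like this; real applications typically add calipers, balance diagnostics, and a principled estimation of the scores:

```python
def nearest_neighbor_match(treated, comparison):
    """Greedy 1:1 nearest-neighbor matching without replacement on
    pre-estimated propensity scores. `treated` and `comparison` are lists of
    (unit_id, score) pairs; assumes len(comparison) >= len(treated).
    Returns (treated_id, comparison_id) pairs."""
    pool = list(comparison)
    pairs = []
    # Match the hardest-to-match (highest-score) treated units first.
    for t_id, t_score in sorted(treated, key=lambda u: u[1], reverse=True):
        best = min(pool, key=lambda u: abs(u[1] - t_score))
        pool.remove(best)
        pairs.append((t_id, best[0]))
    return pairs
```

For example, matching treated units [("T1", 0.8), ("T2", 0.4)] against comparison units [("C1", 0.75), ("C2", 0.5), ("C3", 0.1)] pairs T1 with C1 and T2 with C2. Even with a good match on the scores, the WWC still assesses equivalence directly on the protocol’s key variables.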

## Missing Data

### Why are studies that impute missing data not eligible to meet WWC group design standards?

The WWC procedures for missing data are designed to avoid the bias that can arise from data that are not missing at random. First, if an RCT is determined to have low attrition, analyses with imputed outcome and/or baseline data do not affect the rating, provided that the imputation procedure is one described in the *WWC Procedures and Standards Handbook* (version 3.0). In RCTs, missing outcomes are considered attrition, so when attrition is low, the results are unlikely to be substantively affected by whether a study imputed or excluded the missing data.

Any study that needs to demonstrate equivalence must do so on an analytic sample with no missing or imputed data. This is because using imputed data to assess baseline equivalence could bias a study towards finding equivalence. For example, this could occur when imputed results are based on very little information, resulting in similar values in the intervention and comparison groups. Similarly, the WWC does not allow these studies to use imputed data for outcomes because no imputation strategy can fully address sources of potential bias from missing data.

## Eligibility

- How does the WWC review interrupted time series studies with a single unit? These studies are quasi-experimental, but compare the group to itself before and after the intervention.
- Does the WWC have a minimum study sample size requirement?
- Does the WWC consider the intensity of the intervention when assessing quality of studies using quasi-experimental designs?
- Does the WWC only review studies conducted by an independent third party researcher? Or will the WWC review developer-conducted studies?

### How does the WWC review interrupted time series studies with a single unit? These studies are quasi-experimental, but compare the group to itself before and after the intervention.

A study without a comparison group is not eligible for review under the WWC group design standards. Even with multiple pre- and post-intervention periods, such a study cannot establish a counterfactual for what would have happened in the absence of the intervention. This is a particular problem for a study with a single unit. In general, for a study to meet WWC group design standards, there should be at least two units in each condition. For eligible WWC designs, please consult the *WWC Procedures and Standards Handbook* (version 3.0).

### Does the WWC have a minimum study sample size requirement?

No, the WWC does not have standards regarding sample size or statistical power. The WWC does require two or more units per condition in a group design study to avoid a confounding factor. However, studies with small sample sizes will have limited statistical power to produce statistically significant impacts.
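
To give a rough sense of why small samples limit power, a standard normal-approximation formula for a two-sided, two-sample comparison is n per group ≈ 2(z₍α/2₎ + z_β)² / d², where d is the smallest standardized effect the study should detect. A sketch (ours, assuming individual-level assignment and equal group sizes; the WWC itself imposes no such requirement):

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group needed to detect a standardized
    effect of `effect_size` in a two-sided, two-sample test (normal approx.)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta)**2 / effect_size**2)
```

Under these assumptions, detecting an effect of 0.25 standard deviations at 80% power requires roughly 250 students per group, so a study with 30 per group could reliably detect only much larger effects.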

### Does the WWC consider the intensity of the intervention when assessing quality of studies using quasi-experimental designs?

It depends. While the WWC design standards do not require interventions to be implemented with a certain intensity or fidelity, these factors can sometimes affect whether a study is eligible for review. For example, a content expert may determine that the intervention was not implemented in a manner that allows the study to be considered a test of the intervention. Information about the intensity of an intervention provides important context for understanding the implications of a study and is therefore included in the descriptive summary of a study across a variety of WWC products.

### Does the WWC only review studies conducted by an independent third party researcher? Or will the WWC review developer-conducted studies?

The WWC does not require that evaluations be conducted by independent organizations. An evaluation conducted by the organization that created or designed the intervention can meet WWC design standards if the evaluation is well implemented. However, having an independent third party conduct an evaluation can be useful in convincing a skeptical reader that the results are free from conflicts of interest. That is, when an organization evaluates its own intervention, some critical readers may assume that the organization has implemented its evaluation in a manner that will produce a favorable finding for the intervention.

## Outcome Measure

- If a study uses extant data, should the data be collected for the intervention and comparison groups at the same time?
- What is overalignment of outcome measures with an intervention?

### If a study uses extant data, should the data be collected for the intervention and comparison groups at the same time?

Yes, the data should be collected for the intervention and comparison groups at the same time. If a study uses extant data, and the data were not collected at the same point in time, then the study has a confounding factor and will be rated *does not meet WWC group design standards*. If there is no confounding factor (i.e., the data were collected at the same time), then the groups will need to demonstrate equivalence on the characteristics identified in the relevant WWC review protocols, which are available on the WWC website.

### What is overalignment of outcome measures with an intervention?

The *WWC Procedures and Standards Handbook* (version 3.0) indicates that when outcome measures are closely aligned with or tailored to the intervention, the study findings may not be an accurate indication of the effect of the intervention. For example, an outcome measure based on an assessment that relied on materials used in the intervention condition but not in the comparison condition (e.g., specific reading passages) likely would be judged to be overaligned. That is, the intervention group is exposed to reading passages from the outcome measure during the intervention, but the comparison group is not. It is therefore difficult to separate the effects of the intervention from the effects of the additional exposure to the outcome measure.

## Confounding Factor

- In a retrospective design, outcomes for the intervention and comparison groups may not be measured at the same time. Is that acceptable to the WWC? Or is there a confound with history?
- What is the “single unit” confound? Why do we need at least two classes in each condition (intervention and comparison)?
- Is it acceptable to compare groups within one district with groups in a different district?
- Do confounds need to be related to the outcomes of interest? For example, could eye color be a confound?

### In a retrospective design, outcomes for the intervention and comparison groups may not be measured at the same time. Is that acceptable to the WWC? Or is there a confound with history?

No, this design would not meet WWC design standards because of a potential confound with history. If a retrospective comparison group is not measured at the same time as the intervention group, the WWC rates the study as not meeting WWC group design standards. The *WWC Procedures and Standards Handbook* (version 3.0) indicates that if information on the intervention group comes from one school year, whereas information on the comparison group comes from a different school year, then time (history) is a confounding factor.

### What is the “single unit” confound? Why do we need at least two classes in each condition (intervention and comparison)?

A single unit (classroom, teacher, or school) assigned to either the intervention or the comparison condition constitutes the “single unit” confound. The *WWC Procedures and Standards Handbook* (version 3.0) defines a confounding factor as a component of the study design (or the circumstances under which the intervention was implemented) that is perfectly aligned with either the intervention or comparison group. That is, some factor is present for members of only one group and absent for all members of the other group. In these cases, it is not possible to tell whether the intervention or the confounding factor is responsible for the difference in outcomes (intervention effects).

In some studies reviewed by the WWC, only one classroom is assigned to each condition. For example, within a school, classroom A taught by teacher A may be assigned to receive the intervention, while classroom B taught by teacher B may be assigned to serve as a comparison condition.

The estimate of the intervention’s effect is problematic because the composition of the classrooms and the effects of the teachers will contribute to the observed impact estimate. For example, if teacher A is especially effective, any observed impact may be due in part to teacher A’s effectiveness rather than to the actual effect of the intervention. Therefore, two or more units must be observed in each condition in a group design study to avoid a confounding factor.

### Is it acceptable to compare groups within one district with groups in a different district?

No, this design would not meet WWC design standards. The analytic sample should include at least two districts in each condition; if there is a single district in either the intervention or the comparison condition, then the study will not meet WWC group design standards due to a confounding factor. The *WWC Procedures and Standards Handbook* (version 3.0) indicates, for example, that if all of the intervention group schools are from a single school district, the district would constitute a confounding factor.

### Do confounds need to be related to the outcomes of interest? For example, could eye color be a confound?

Yes, confounds should relate to the outcomes of interest, and, therefore, eye color would not be considered a confound in an education study. The *WWC Procedures and Standards Handbook* (version 3.0) defines a confounding factor as a component of the study design (or the circumstances under which the intervention was implemented) that is perfectly aligned with either the intervention or comparison group. That is, some factor is present for members of only one group and absent for all members in the other group. The most frequent confounding factor seen by WWC reviewers is a single unit in a condition—for example, a single intervention school or single comparison classroom. In these cases, it is not possible to tell whether the intervention or the confounding factor (school, classroom, or teacher) is responsible for the difference in outcomes (intervention effects). The confounding factor must present a reasonable alternative cause for the effects seen, and as such, something like eye color in an education study would likely not be considered a confound.

## Attrition

### What level of attrition would cause an RCT to receive a rating of *meets WWC group design standards with reservations* rather than the highest rating (*meets WWC group design standards without reservations*)?

It depends on the attrition threshold specified in a particular review protocol. The *WWC Procedures and Standards Handbook* (version 3.0) provides detail on the liberal and conservative thresholds that are used to determine whether an RCT has a high level of attrition. The WWC review protocols indicate whether a given review uses the liberal or conservative attrition threshold for its reviews.
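
The threshold boundaries themselves are tabulated in the handbook, but the two quantities they are applied to, overall and differential attrition, are straightforward to compute. A minimal sketch (hypothetical helper, not WWC software):

```python
def attrition(randomized_t, analyzed_t, randomized_c, analyzed_c):
    """Overall and differential attrition for a two-group RCT.
    Overall: share of all randomized units missing from the analytic sample.
    Differential: absolute gap between the two groups' attrition rates."""
    overall = 1 - (analyzed_t + analyzed_c) / (randomized_t + randomized_c)
    rate_t = 1 - analyzed_t / randomized_t
    rate_c = 1 - analyzed_c / randomized_c
    return overall, abs(rate_t - rate_c)
```

For example, analyzing 90 of 100 randomized treatment units and 80 of 100 comparison units gives 15 percent overall attrition and a 10 percentage point differential; whether that combination counts as low depends on the boundary (liberal or conservative) specified in the review protocol.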

## Reporting

### Is there a “to-do list” to help write a report that can be easily reviewed by the WWC?

No, there is not currently a “to-do list.” However, we suggest using the *WWC Procedures and Standards Handbook* and the *Reporting Guide for Study Authors* as guidance for reporting study results. The *WWC Procedures and Standards Handbook* (version 3.0) describes the standards against which all studies are reviewed. The *Reporting Guide for Study Authors* provides guidance on how to describe studies and report their findings in a way that is clear, complete, and transparent. Following this guidance will facilitate the WWC’s review of your study.