Title: Approaches for Weighting and Estimation of Public-release Education Data using Two-level Covariance Structure Models
Principal Investigator: Stapleton, Laura
Awardee: University of Maryland, College Park
Program: Statistical and Research Methodology in Education
Award Period: 2 years
Award Amount: $159,620
Goal: Methodological Innovation
Award Number: R305D110050
Original Grant: R305D110046 University of Maryland, Baltimore County
This project will identify best methods for estimating parameters and their sampling variances when using multilevel analyses with data collected via complex sampling designs typically used in education research.
Traditional estimation of multilevel models assumes that schools are selected at random and that students are selected at random within schools. Typical national survey sampling designs violate these assumptions, so parameter estimates and their sampling variances may be biased under traditional estimation. Most national education-related datasets use sampling procedures that are considerably more complicated. With a three-stage sample, geographic areas are first selected as primary sampling units (PSUs), then schools within those PSUs are selected as secondary sampling units (SSUs), and finally teachers or students within those SSUs are selected as the ultimate sampling units (USUs). With a two-stage sample, the schools are typically selected directly as PSUs. Additionally, at each stage of selection, the population elements are stratified before the sample is drawn. This stratum information may or may not be included in a researcher's statistical model.
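The link between selection stages and sampling weights can be sketched as follows. This is a minimal illustration, assuming a two-stage design in which each weight is the inverse of a selection probability; the function name and the example probabilities are hypothetical and do not come from any particular dataset.

```python
def two_stage_weights(p_school, p_student_given_school):
    """Return (level-2 weight, conditional level-1 weight, unconditional weight).

    Assumes a two-stage design: a school is selected with probability
    p_school, and a student is then selected within that school with
    probability p_student_given_school.
    """
    for p in (p_school, p_student_given_school):
        if not 0 < p <= 1:
            raise ValueError("selection probabilities must lie in (0, 1]")
    w_school = 1.0 / p_school                  # level-2 (school) weight
    w_student = 1.0 / p_student_given_school   # level-1 weight, conditional on the school
    # The overall (unconditional) student weight is the product of the two.
    return w_school, w_student, w_school * w_student


# Hypothetical example: 1 in 10 schools sampled, then 1 in 4 students per school.
w2, w1, w_overall = two_stage_weights(0.10, 0.25)
```

Public-release files often carry only the last of these three values, which is why the separate level-1 and level-2 weights have to be approximated.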
Appropriate methods for modeling data from multi-stage stratified sampling designs have been proposed (e.g., multilevel pseudo-maximum likelihood [MPML]), but they have not been tested under conditions similar to those found with national education-related datasets. These methods require sampling weights at both the student and school levels, and these level-1 and level-2 weights are often not included in public-release datasets.
The project has four specific aims to address multilevel analysis with complex sample data. First, the project will use a Monte Carlo simulation to quantify how ignoring the sampling design in a multilevel covariance structure model affects estimates of parameters and their sampling variances. Bias will be examined across a range of sampling designs and population characteristics typical of education-related datasets. The first step in developing the simulation study will be an extensive review of education-related datasets to define the values used within the simulation conditions, as explained in the research plan.
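How bias might be summarized across simulation replications can be sketched as below. This is only an illustration under stated assumptions: the abstract does not specify the bias measures, so relative bias is used here as one common choice, and the function names and example numbers are hypothetical.

```python
import math


def relative_bias(estimates, true_value):
    """Relative bias of a parameter estimate across replications."""
    mean_est = sum(estimates) / len(estimates)
    return (mean_est - true_value) / true_value


def se_relative_bias(estimates, standard_errors):
    """Relative bias of the estimated standard errors.

    The benchmark is the empirical standard deviation of the parameter
    estimates across replications.
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    empirical_sd = math.sqrt(sum((e - mean_est) ** 2 for e in estimates) / (n - 1))
    mean_se = sum(standard_errors) / len(standard_errors)
    return (mean_se - empirical_sd) / empirical_sd


# Hypothetical replication results for a parameter whose true value is 1.0.
param_bias = relative_bias([1.0, 1.2, 0.8], true_value=1.0)
se_bias = se_relative_bias([1.0, 1.2, 0.8], [0.1, 0.1, 0.1])
```

A negative value from `se_relative_bias` indicates that the reported standard errors understate the actual sampling variability.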
Second, the project will determine the best method of approximating level-1 and level-2 sampling weights from the overall (unconditional) sampling weights available on public-release datasets. This will be accomplished by comparing the approximated values with the known values from simulated data. Using the simulated data introduced in Aim 1, unconditional USU sampling weights will be used to approximate conditional weights for the USU and for the SSU (in a three-stage design) or the PSU (in a two-stage design). Bias in these estimates will be determined by correlating the known weights with the approximations.
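The comparison of known and approximated weights could look something like the sketch below. It assumes the known conditional weights are available from the simulation, as described above; the mean relative bias is added as a second summary that the abstract does not mention, and all names and values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_relative_bias(known, approximated):
    """Average of (approximation - known) / known across units."""
    return sum((a - k) / k for k, a in zip(known, approximated)) / len(known)


# Hypothetical school-level weights: every approximation is 10 percent too large.
known_w2 = [10.0, 20.0, 40.0]
approx_w2 = [11.0, 22.0, 44.0]
r = pearson_r(known_w2, approx_w2)
bias = mean_relative_bias(known_w2, approx_w2)
```

In this example the correlation is essentially perfect even though every approximation is too large, which is one reason a bias summary can be useful alongside the correlation.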
Third, the project will determine the most robust method of sampling variance estimation by comparing the performance of a sandwich estimator with that of replication methods. Bias found with each technique is expected to vary with the data and sampling conditions. Conditions typical of education-related datasets will be examined using Monte Carlo simulations. The simulated data (introduced in Aim 1) will be analyzed with the MPML method and with three different approaches to sampling variance estimation (linearized, jackknife replication, and bootstrap replication) to determine which method yields the least bias in sampling variances and provides adequate 95 percent confidence interval coverage rates for the parameters of interest.
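The idea behind jackknife replication can be illustrated with a much simpler statistic than an MPML parameter. The sketch below is only an illustration, assuming an unstratified design and a delete-one-cluster jackknife applied to a weighted mean; the function names and data are hypothetical.

```python
def weighted_mean(records):
    """Weighted mean of (value, weight) pairs."""
    records = list(records)
    total_weight = sum(w for _, w in records)
    return sum(y * w for y, w in records) / total_weight


def jackknife_variance(clusters):
    """Delete-one-cluster jackknife variance of the weighted mean.

    clusters is a list of clusters, each a list of (value, weight) pairs.
    Rescaling the remaining weights by G / (G - 1) is omitted because the
    factor cancels in a ratio estimator such as the weighted mean.
    """
    g = len(clusters)
    if g < 2:
        raise ValueError("at least two clusters are required")
    full = weighted_mean(rec for cluster in clusters for rec in cluster)
    replicates = []
    for drop in range(g):
        kept = [rec for i, cluster in enumerate(clusters) if i != drop for rec in cluster]
        replicates.append(weighted_mean(kept))
    return (g - 1) / g * sum((rep - full) ** 2 for rep in replicates)


# Hypothetical data: three single-student clusters with equal weights.
variance = jackknife_variance([[(1.0, 1.0)], [(2.0, 1.0)], [(3.0, 1.0)]])
```

With equal weights and one observation per cluster, this reproduces the textbook variance of a sample mean, which is a convenient sanity check before applying the same logic to model parameters.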
Fourth, the project will examine the performance of the scaled chi-square difference test statistic in model selection, both when the sampling design is taken into account and when it is not. The fit of the models run for Aims 1 and 3 will be compared with the fit of six misspecified models: three over-specified and three under-specified.
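The abstract does not say which scaled difference statistic is meant. As a sketch, assuming the Satorra-Bentler scaled chi-square difference computed from each model's scaled chi-square, degrees of freedom, and scaling correction factor, the calculation might look like this (the function name and the example values are hypothetical):

```python
def scaled_chisq_difference(t0, df0, c0, t1, df1, c1):
    """Scaled chi-square difference for two nested models.

    Model 0 is the more restricted (nested) model and model 1 is the
    comparison model. t0 and t1 are scaled chi-square values, df0 and df1
    are degrees of freedom, and c0 and c1 are scaling correction factors.
    Returns (scaled difference statistic, difference in degrees of freedom).
    """
    if df0 <= df1:
        raise ValueError("the nested model must have more degrees of freedom")
    # Scaling correction for the difference test.
    cd = (df0 * c0 - df1 * c1) / (df0 - df1)
    if cd <= 0:
        raise ValueError("non-positive difference scaling correction")
    # t * c recovers each model's unscaled chi-square before differencing.
    trd = (t0 * c0 - t1 * c1) / cd
    return trd, df0 - df1


# Hypothetical fit results for a nested model and a comparison model.
statistic, df_diff = scaled_chisq_difference(25.0, 10, 1.2, 15.0, 8, 1.1)
```

When both scaling correction factors equal 1, the result reduces to the ordinary chi-square difference.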