|Title:||Practical Solutions for Missing Data and Imputation|
|Principal Investigator:||Gelman, Andrew||Awardee:||Columbia University|
|Program:||Statistical and Research Methodology in Education|
|Award Period:||3 years||Award Amount:||$904,972|
|Type:||Methodological Innovation||Award Number:||R305D090006|
Co-Principal Investigator: Hill, Jennifer
Purpose: Missing data are ubiquitous in education research studies. The literature documents the shortcomings of simple missing data approaches such as complete case analysis and the inclusion of indicators for missing data; however, these practices remain widespread. Multiple imputation is an increasingly common approach to handling missing data, but open research questions remain about the most reliable methods for implementing it and about when the technique is worth the investment. In addition, researchers may have a legitimate reluctance to use an algorithm whose steps and outcomes they do not completely understand.
The project developed, extended, and tested strategies for multiple imputation of missing data. The project's goals were: (1) investigating the properties of imputation models and algorithms; (2) developing diagnostics to reveal problems with imputations in real time; (3) developing models and algorithms that are more likely to create appropriate imputations; (4) creating software in both R and Stata that is reliable and usable by non-statisticians yet can accommodate the needs of more sophisticated modelers; and (5) testing the diagnostics, models, and algorithms in applied research. An important part of testing the developed software is a comparison of the performance of multiple imputation with that of simpler missing data strategies. These tests are designed to help identify when multiple imputation is worth using.
Project Activities: The research related to goals 1 and 3 was intended to lead to better missing data models and algorithms. This work focused on four objectives. The first was to explore the relative efficacy of imputation algorithms as compared to simpler strategies, particularly in the context of randomized experiments. The second was to identify the conditions under which chained imputation algorithms can fail and to identify modeling choices that avoid those conditions. The third was to examine the properties of competing models that accommodate a wider variety of data structures (e.g., time series and multilevel data) and non-ignorable missing data mechanisms, and then to implement the most useful of them. The final objective was to increase computational efficiency when implementing the most useful models.
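For readers unfamiliar with chained imputation, the idea can be illustrated with a minimal sketch. It is not the project's software. It assumes continuous variables, a linear regression for each conditional model, and at least one observed value in every column; the function name `chained_imputation` is hypothetical. For brevity it draws imputations around fixed least-squares coefficients rather than also drawing the coefficients from their posterior, which a proper multiple imputation procedure would do.

```python
import numpy as np

def chained_imputation(X, n_iter=10, rng=None):
    """Return one completed copy of X (2-D float array, NaN marks missing)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    # Initialize every missing cell with the observed mean of its column.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])
    n, p = X.shape
    for _ in range(n_iter):
        for j in range(p):
            miss_j = missing[:, j]
            if not miss_j.any():
                continue
            # Regress column j on the other columns (current completed values),
            # fitting only on rows where column j was actually observed.
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(n), others])
            beta, *_ = np.linalg.lstsq(A[~miss_j], X[~miss_j, j], rcond=None)
            resid = X[~miss_j, j] - A[~miss_j] @ beta
            dof = max(int((~miss_j).sum()) - A.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # Replace the missing cells with draws from the fitted conditional.
            noise = rng.normal(0.0, sigma, int(miss_j.sum()))
            X[miss_j, j] = A[miss_j] @ beta + noise
    return X
```

Calling the function several times with different random generators yields several completed datasets, which is the starting point for a multiple imputation analysis.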
The research regarding goal 2 will improve or develop graphical and numerical diagnostics that can be used to identify problems with parametric assumptions, flag likely violations of structural assumptions, monitor convergence of the fitting algorithm, and determine situations in which the implicit model from the chained regressions is not close to any joint distribution.
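As one illustration of a numerical convergence check, the sketch below computes the classic potential scale reduction factor for a scalar summary (for example, the mean of the imputed values of one variable) tracked over iterations in several parallel runs of an imputation algorithm. The choice of this particular statistic is an assumption made for illustration; the abstract does not say which diagnostics the project uses, and the function name is hypothetical.

```python
import numpy as np

def potential_scale_reduction(chains):
    """chains: array of shape (m, n) holding m parallel runs of n iterations."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    between = n * chain_means.var(ddof=1)        # between-run variance
    within = chains.var(axis=1, ddof=1).mean()   # average within-run variance
    # Pooled estimate of the variance of the tracked summary.
    var_hat = (n - 1) / n * within + between / n
    return np.sqrt(var_hat / within)
```

Values close to 1 suggest the parallel runs have mixed; values well above 1 suggest the algorithm should be run longer or that the runs disagree.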
The software development work for goal 4 will incorporate the results of the work done for goals 1-3 and will include an accessible user interface. This interface will help researchers identify potential problems at the outset (e.g., perfect correlation among predictors), choose the right model, and accommodate complications such as interactions and transformations. The software will also provide the ability to implement other missing data strategies (including a range of multiple imputation models and algorithms) to allow for comparisons, and it will be made available both in the open-source statistical environment R and as a stand-alone, platform-independent package.
The work done for goal 5 will engage education researchers to test and apply the software to multiple datasets with varied study designs and missing data patterns. Additionally, efforts will be made to establish a catalog of scenarios and examples where current missing data imputation algorithms fail.
Publications and Products
Journal article, monograph, or newsletter
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., ... and Riddell, A. (2017). Stan: A Probabilistic Programming Language. Journal of Statistical Software, 76(1).
Chen, Q., Gelman, A., Tracy, M., Norris, F. H., and Galea, S. (2015). Incorporating the Sampling Design in Weighting Adjustments for Panel Attrition. Statistics in Medicine, 34(28): 3637–3647.
Gelman, A. (2010). Bayesian Statistics Then and Now. Statistical Science, 25(2): 162–165.
Gelman, A. (2011). Induction and Deduction in Bayesian Data Analysis. Rationality, Markets and Morals, 2: 67–78.
Gelman, A., and Shalizi, C. (2013). Philosophy and the Practice of Bayesian Statistics. British Journal of Mathematical and Statistical Psychology, 66(1): 8–38.
Gelman, A., and Unwin, A. (2013). Infovis and Statistical Graphics: Different Goals, Different Looks. Journal of Computational and Graphical Statistics, 22(1): 2–28.
Gelman, A., and Unwin, A. (2013). Tradeoffs in Information Graphics. Journal of Computational and Graphical Statistics, 22(1): 45–49.
Kropko, J., Goodrich, B., Gelman, A., and Hill, J. (2014). Multiple Imputation for Continuous and Categorical Data: Comparing Joint Multivariate Normal and Conditional Approaches. Political Analysis, 22(4): 497–519.
Lock, K., and Gelman, A. (2010). Bayesian Combination of State Polls and Election Forecasts. Political Analysis, 18(3): 337–348.