IES is in its 20th anniversary year. One of our themes throughout this year has been identifying what is needed to build a modern education sciences infrastructure. This blog focuses on one of those building blocks: identifying, creating, and supporting large datasets to facilitate research at scale.
Large datasets have an obvious benefit to the scientific endeavor: they allow us to more accurately identify data that are critical to informing policy and practice. One of the challenges facing researchers is finding or collecting datasets that are large enough, and carry enough statistical power, to answer their research questions. Even more germane, large datasets enable us to better incorporate modern techniques into the education sciences. A key to meaningfully incorporating AI and machine learning into work designed to improve education outcomes is the availability of large datasets for training and evaluation.
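To make the point about statistical power concrete, here is a minimal sketch of how quickly the required sample grows as the effect we hope to detect shrinks. It uses the standard normal-approximation formula for comparing the means of two equal-sized groups. The function name `n_per_group` and the effect sizes, significance level, and power shown are illustrative assumptions, not figures from IES.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-group
    comparison of means (normal approximation, equal group sizes).

    effect_size is the standardized mean difference (Cohen's d).
    All defaults are illustrative assumptions.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


if __name__ == "__main__":
    for d in (0.20, 0.10, 0.05):
        print(f"effect size {d:.2f}: about {n_per_group(d):,} students per group")
```

Under these assumptions, detecting an effect of 0.20 standard deviations takes a few hundred students per group, while an effect of 0.05 takes several thousand. Studies that sample whole classrooms or schools generally need more, because students within a cluster tend to resemble one another.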
The current landscape of large datasets for use in education research
To help us chart the future of IES' support of large datasets, we released a request for information (RFI) seeking public input. We asked respondents to help us identify existing large datasets that could be useful for education research and to help us understand the challenges and limitations that may affect access to those datasets and their value for research. We received 38 unique responses in the 30-day window. Respondents were mostly from nonprofit education organizations (13), academic institutions (12), and ed tech companies (7). You can access more information including the responses here.
Respondents universally expressed support for investments in large datasets to improve and modernize education research. However, the responses also highlighted the dearth of such datasets: only 20 datasets were referenced and only 11 of these contained 100,000 or more observations. Respondents called for more than just increasing the number of available large datasets; they recommended expanding and contextualizing the data collected, including more demographic information. Respondents also called for better coordination around data governance and data use agreements.
How IES can lead the way forward
There are at least two areas in which IES can lead the way in executing the technical and policy work required to create or facilitate the development of new large datasets. These include supporting the modernization of statewide longitudinal education data systems and developing a data library of student essays to spur AI-based research and development efforts to improve student writing.
Modernized State Longitudinal Data Systems
Several responses to the RFI noted the value of the data systems created by the close to $1 billion (in today's dollars) spent on creating State Longitudinal Data Systems (SLDS). These responses also bemoaned the difficulties in accessing these invaluable data. Within the next few years, we might see a sizable flow of new money come from Congress to create SLDS 2.0. Based upon the actions of current SLDS grantees, I think we know the outlines of what that modern system looks like, including
While it will take substantial resources, I believe that a new SLDS 2.0 built around these principles could provide stakeholders in government agencies and non-government organizations at all levels—ranging from school districts through the federal government—with opportunities to better understand what works for whom under what conditions. With appropriate protections, SLDS 2.0 could provide a gold mine of opportunities for collaborative partnerships between researchers and states.
Data Library of Student Essays
More immediate is IES' plan to make NAEP's library of student essays available to the research community, with the goal of identifying approaches and products that can improve student writing.
From the 2011 and 2017 writing assessments, NAEP has collected over 200,000 student essays. We have been exploring the logistics and costs of assembling 40,000 to 45,000 8th grade student essays from this large data collection. These student essays have been scored based on the kind of writing task involved (convey, explain, persuade). NAEP also collects demographic information about the student who took the assessment and the school that student attended, as well as the kinds of writing activities students do in and out of school.
To create this library, automated techniques and human reviewers will scrub each student essay of all personally identifiable information (PII). We will also tag the writing features behind the holistic scores to provide meaningful feedback for learning and teaching. We are working through the most cost-effective way of meeting these challenges and turning these essays into a data library that researchers and developers can use.
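As a rough illustration of what the automated half of that scrubbing could look like, the sketch below redacts a few common kinds of identifying text and keeps a log of what it removed so a human reviewer can check the result. This is not IES' or NAEP's actual pipeline. The function `scrub_essay`, the patterns, the placeholder labels, and the idea of supplying a list of known names are all assumptions made for the example, and simple patterns like these will miss plenty that a human reviewer would catch.

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
PHONE_RE = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")


def scrub_essay(text, known_names=()):
    """Replace emails, US-style phone numbers, and any supplied names
    with placeholders. Returns (scrubbed_text, hits), where hits lists
    each (label, original_text) pair for later human review.

    known_names is a hypothetical input, e.g. names from a roster.
    """
    patterns = [("EMAIL", EMAIL_RE), ("PHONE", PHONE_RE)]
    if known_names:
        alternatives = "|".join(re.escape(name) for name in known_names)
        names_re = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
        patterns.append(("NAME", names_re))

    hits = []
    for label, pattern in patterns:
        # Bind label as a default so each callback keeps its own value.
        def redact(match, label=label):
            hits.append((label, match.group(0)))
            return f"[{label}]"

        text = pattern.sub(redact, text)
    return text, hits


if __name__ == "__main__":
    sample = ("My name is Jordan Smith. Email me at jordan@example.org "
              "or call 555-123-4567.")
    scrubbed, hits = scrub_essay(sample, known_names=["Jordan Smith"])
    print(scrubbed)
    print(hits)
```

Keeping the log of redactions separate from the scrubbed text is one way to let reviewers audit the automated pass without the public version ever containing the original details.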
If we can accomplish the scrubbing and tagging tasks, we will then have a large data library of student writing samples that can be made public. After that public release, IES is considering what mix of grants, contracts, and competitions might best advance the field. We have identified three different tasks:
IES is carefully reviewing the RFI comments as we forge a path forward. There are many challenges ahead; however, the opportunities and potential payoff are great for education research and, more importantly, for students across the nation.
I invite readers to weigh in with additional comments or suggestions for IES as we explore the application of large datasets to education research. Please contact me at: email@example.com.