
Big Dreams for Big Data: A Look Ahead

Mark Schneider, Director of IES | March 1, 2023

Over the past 20 years, IES has built a strong foundation for the education sciences. But time does not stand still, and much of the way we have worked needs a rethink. The question we are faced with is how best to build a more modern, quicker, and less expensive infrastructure for education research and development atop the existing foundation. This is particularly so now that Congress has put IES on a path toward what we hope will be ARPA-ED. There are two avenues into the future that we are particularly excited about.

A View of the Future

First, I believe we now have an exemplar of how to think creatively about the future path of the education sciences: the jointly funded IES/NSF AI institute designed to help students who need speech and language services. There is a large gap between the number of such students and the availability of specialists who can deliver needed services. The AI Institute for Exceptional Education aims to close this gap by developing advanced AI technologies to scale the availability of speech and language services so every child in need can be helped. The AI Institute has multiple strands, among them an AI-supported screener to identify the specific speech and language delays students are facing and the use of AI in developing individualized treatments. The research team is also using generative AI, such as the now-famous ChatGPT, to alleviate the paperwork burden on school-based speech-language pathologists (by some estimates, as much as half of these specialists' time is spent on paperwork). This work also supports speech-language pathologists' ability to create individualized interventions.

Note that this institute uses technology not to replace teachers or service providers, but to free them to do what only they can do—work closely with students to advance what they know and what they can do. Harnessing technology to free up teacher time moves us to what we all know is key to a good education: individualized instruction tailored to student needs.

We will be working with the education community to identify other domains in which persistent problems may now be amenable to solution using new technologies.

Using Large Data Sets

One take-away from the myriad articles and "experiments" that have accompanied ChatGPT is that its answers are only as good as the information that it ingests. When it comes to highly specialized information, often there is not enough good data and information on the internet to inform the algorithms underlying these bots. This is particularly true in education, where large data sets needed for informing AI applications are hard to come by. As a result, crucial insights about how to improve student performance on basic education functions such as reading, writing, and doing math remain untapped.

Fortunately, we have a large treasure trove of data about student performance that can be made available for modern research methods: the data contained in NAEP assessments.

There are, literally, hundreds of thousands of student essays, reading assessments, and open-ended responses to math problems that NAEP has collected, especially since converting to digitally based assessments in 2017, that are not yet readily available to researchers. Many of these items have already been released or retired and will not be used for future NAEP administrations. NAEP has extensive processes for vetting items, reviewing scores, and ensuring that the information collected is of high quality. Adding to the value of these large data sets, NAEP also collects extensive information about student demographics and school/family characteristics, giving rise to unparalleled opportunities to study how these characteristics affect student outcomes.

Clearly, NAEP produces important indicators of performance (fulfilling its mission as "the Nation's Report Card"), but NAEP could have a much greater impact on the learning sciences if the underlying data from its many assessments were made available, something IES is committed to making happen. The path forward is clear: using both AI and human readers, make sure that student "artifacts" are scrubbed of personally identifiable information and then make them available to researchers who follow existing rules governing access to restricted data.

By making the NAEP data available under a restricted-use data license, we will enable competitions for the development of new ideas and algorithms using education data. Our plan follows standard practices for the release of large data sets: we will release a subset of the data so researchers can train models, which will then be tested for accuracy on a withheld test dataset. We believe researchers are best positioned to identify the appropriate schemas and elements to add to datasets.
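To make the train-and-withhold pattern concrete, here is a minimal sketch in Python. It is only an illustration of the general practice, not IES's actual release procedure: the function names, the 20 percent test fraction, and the toy records are all assumptions made for the example.

```python
import random


def split_holdout(records, test_fraction=0.2, seed=0):
    """Shuffle records reproducibly and return (train, withheld_test).

    The 20% default is an illustrative assumption, not an IES figure.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, test_set):
    """Share of withheld (response, human_score) pairs the model scores correctly."""
    correct = sum(1 for response, score in test_set if predict(response) == score)
    return correct / len(test_set)


# Hypothetical toy data standing in for scored student responses.
responses = [(f"response {i}", i % 3) for i in range(100)]
train, test = split_holdout(responses, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

In a competition setting, entrants would see only the training portion; the organizers would keep the withheld portion and run `accuracy` (or a richer metric) on submitted models.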

Some of you may have followed our earlier work with NAEP reading data, in which we were trying to assess how well autoscoring methods could be used to save millions of dollars in NAEP grading. (The answer was quite well.) Soon, we plan to launch another challenge to use autoscoring for open-ended math items, which are more difficult to score since they mix specific solutions with general explanations. As in our previous challenge, we will require transparency and fairness analysis in the submissions so that we can understand them and ensure their validity for all students.
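One simple form a fairness analysis could take is comparing how often an autoscorer agrees with human scorers across student groups. The sketch below is an assumption about what such a check might look like, not the method required in the challenge; the group labels, the scores, and the use of exact agreement as the metric are all hypothetical.

```python
from collections import defaultdict


def agreement_by_group(rows):
    """Return {group: exact agreement rate} for (group, human_score, model_score) rows."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for group, human, model in rows:
        totals[group] += 1
        if human == model:
            matches[group] += 1
    return {group: matches[group] / totals[group] for group in totals}


def max_agreement_gap(rates):
    """Largest difference in agreement rate between any two groups."""
    return max(rates.values()) - min(rates.values())


# Hypothetical scored responses: (student group, human score, autoscore).
scored = [
    ("A", 2, 2), ("A", 1, 1), ("A", 0, 0), ("A", 2, 1),
    ("B", 2, 2), ("B", 1, 0), ("B", 0, 0), ("B", 2, 1),
]
rates = agreement_by_group(scored)
gap = max_agreement_gap(rates)
print(rates, gap)
```

A large gap would suggest the autoscorer is less reliable for some students than for others and would merit a closer look before the model is trusted for operational scoring.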

In addition to improving existing scoring, we will be sponsoring competitions to see how far we can push the research field in better understanding and using autoscoring and natural language methods for math. More substantively, we will be sponsoring a competition to see if we can identify the math concepts (and misconceptions) underlying student responses; while there has been extensive work to define these misconceptions in principle, there has been much less work that tests these misconceptions on responses from large numbers of students. Using NAEP data, we will work to identify where such misconceptions are most common and for which students, so that we can devise targeted interventions that help resolve those problems.

We will also be releasing tens of thousands of student essays. We originally thought that we would use the data in a competition to create an AI-assisted writing tutor. Well, a funny thing happened to that idea—you guessed it: ChatGPT. We think that these "dialogical applications" will become (or are already) new tools that students are using in writing, and we hope to use the NAEP writing data to work with research scientists to tap these new tools to improve student writing.

IES is a small agency. And while our staff are smart and hardworking, they represent a very small slice of the research community, so we need to gather input from external researchers too. In the next few months, we plan to release more NAEP data sets. In the short run, IES is focusing on how these data can be used in AI-informed research, but the beauty of releasing the data is that the research community will be able to explore them guided by their own interests and expertise.

I have been in education research long enough to know that there have been other times when innovations, technologies, and political interest have come together to produce a "moment" in which new horizons are visible; all too often those moments have dissipated without the breakthroughs we thought possible. We are again at a moment in which many advances in technology have been recognized by researchers, policymakers, foundations, and ed tech entrepreneurs as potential game changers. And, as with the IES/NSF AI Institute described above, there is a growing recognition that these advances must be leveraged to help students, teachers, families, and communities work together to create better educational outcomes.

I am so happy that IES is in the middle of this mix. In a recent blog, I joked that IES' initials should now be read as Innovation in the Education Sciences. This is an exciting time, one in which IES investment in basic and applied research can help ensure that the field of education and America's learners of all ages benefit from the advances taking place all around us.

As always: