AI is everywhere—from the websites you search to the stores where you shop to the schools where you send your kids. Companies, governments, and nonprofits alike are spending fortunes on making AI algorithms and interfaces better. Much of this money is pursuing broad generative AI that can answer lots of questions across lots of subjects. But there is a complement to pursuing AI as an all-purpose machine: building focused AI tools designed to achieve specific objectives. One such target is to increase the speed with which we score math assessments to put more timely information in the hands of students, teachers, and policymakers. To that end, IES recently completed a prize competition for AI-assisted autoscoring of a set of open-ended NAEP math items.
Autoscoring of open-ended math items has lagged behind autoscoring for reading. The Automated Scoring Assessment Prize, conducted over a decade ago, demonstrated that AI could accurately score student essays but also showed that it was not yet ready to score shorter responses accurately. That gap has since closed, and AI-assisted autoscoring of reading and writing is now ubiquitous. Similar progress has not been made for math, despite repeated attempts. The results of the IES math autoscoring challenge show that, given recent advances in AI, autoscoring students' open-ended math responses is now possible.
While there is still lots of work in front of us, here are four lessons we learned from this challenge.
AI can now address many problems that had hindered autoscoring of open-ended math questions. Autoscoring math responses has been a bigger challenge than autoscoring reading and writing, in part because natural language processing has difficulty with the precise answers that math problems often require (as opposed to the broader ideas assessed in reading). Added to this is the complexity of how students mix symbols, equations, and words in their answers to math items. The winning teams in the IES math autoscoring challenge solved these problems and demonstrated that AI could replicate human-assigned math scores.
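What does it mean for AI to "replicate" human scores? One standard agreement statistic for ordinal scores is quadratic weighted kappa, which penalizes large score disagreements more than small ones. The sketch below, using made-up scores (not challenge data, and not necessarily the metric the challenge used), illustrates the calculation:

```python
from collections import Counter

def quadratic_weighted_kappa(human, ai, num_categories):
    """Agreement between two raters on ordinal scores 0..num_categories-1.

    1.0 means perfect agreement; 0.0 means chance-level agreement.
    """
    n = len(human)
    observed = Counter(zip(human, ai))   # observed score-pair counts
    hist_h = Counter(human)              # human score distribution
    hist_a = Counter(ai)                 # AI score distribution
    num = den = 0.0
    for i in range(num_categories):
        for j in range(num_categories):
            # Quadratic penalty grows with the distance between scores
            w = (i - j) ** 2 / (num_categories - 1) ** 2
            num += w * observed.get((i, j), 0)
            # Expected count for (i, j) if raters were independent
            den += w * hist_h.get(i, 0) * hist_a.get(j, 0) / n
    return 1.0 - num / den

# Fabricated example: 8 responses scored 0-2, one disagreement
human = [0, 1, 2, 2, 1, 0, 2, 1]
ai    = [0, 1, 2, 1, 1, 0, 2, 1]
print(round(quadratic_weighted_kappa(human, ai, 3), 3))  # -> 0.889
```

In practice, an autoscorer is typically judged against the agreement rates that two trained human raters achieve with each other on the same items.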
All challenge participants used large language models (LLMs). Data security guidelines governing the use of NAEP data, however, led teams to use locally hosted LLMs rather than externally hosted services such as OpenAI. Their choice reflects a growing trend toward smaller, cheaper LLMs, one likely to become more widespread in education research. Locally hosted LLMs allow greater control over access to and ownership of data, which may make AI tools more appealing to education researchers and practitioners.
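As a rough illustration of what locally hosted scoring can look like, the sketch below builds a scoring request for a local, OpenAI-compatible chat endpoint (such as one served by llama.cpp or Ollama). The endpoint, model name, item, and rubric are all hypothetical placeholders, not details from the challenge:

```python
import json

def build_scoring_request(item_stem, rubric, student_response):
    """Build a chat-completion payload for a locally hosted LLM server.

    The payload would be POSTed to a local OpenAI-compatible endpoint,
    e.g. http://localhost:8080/v1/chat/completions, so student data
    never leaves the researcher's machine. All names are placeholders.
    """
    system = (
        "You are a math scorer. Apply the rubric exactly and reply "
        "with a single integer score."
    )
    user = (
        f"Item: {item_stem}\n"
        f"Rubric: {rubric}\n"
        f"Response: {student_response}"
    )
    return {
        "model": "local-math-scorer",  # placeholder local model name
        "temperature": 0,              # deterministic scoring
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

payload = build_scoring_request(
    "What is 3/4 + 1/8?",
    "1 point for 7/8 or equivalent; 0 otherwise.",
    "7/8",
)
print(payload["messages"][1]["content"].splitlines()[0])
```

Setting the temperature to zero is one simple way to make repeated scoring runs reproducible, which matters when scores feed into reporting.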
Increasing transparency is a challenge. A simple flow chart describing how it feels to interact with generative AI programs looks something like this: Prompt -> Magic -> Answer. There is growing concern about the magic stage, which speaks to a need to increase the "explainability" of how these models work. In our challenge, we offered a separate prize for participants who provided additional analyses to explain the "magic" of their LLM predictions. None of the winning teams satisfactorily addressed the issue. Here's the problem: we will not be able to advance education-oriented AI tools without making them more transparent and explainable. Without that, we will continue relying on the "magic" of these applications. While improving transparency sounds simple in theory, in practice it is difficult. Imagine asking humans to describe how they reach their decisions: could they be as transparent as we are asking LLMs to be?
We need to make more progress in checking for bias. While all respondents conducted (and passed) basic tests for fairness, none went further to assess bias more fully. In education, fairness and a demonstrated lack of bias in AI-generated "artifacts" must be addressed if we are to build trust in AI and increase its use.
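One simple screen of this kind compares human-AI agreement rates across student subgroups; a fuller bias audit would go further, for example by testing score differences conditional on actual proficiency. A minimal sketch with fabricated records (the group labels and scores are invented for illustration):

```python
from collections import defaultdict

def agreement_by_group(records):
    """Exact human-AI agreement rate per student subgroup.

    records: iterable of (group, human_score, ai_score) tuples.
    A large gap between groups' rates would warrant closer review.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, human, ai in records:
        totals[group] += 1
        hits[group] += int(human == ai)
    return {g: hits[g] / totals[g] for g in totals}

# Fabricated data: two subgroups, scores 0-2
records = [
    ("A", 2, 2), ("A", 1, 1), ("A", 0, 1), ("A", 2, 2),
    ("B", 1, 1), ("B", 0, 0), ("B", 2, 1), ("B", 1, 1),
]
rates = agreement_by_group(records)
print(rates["A"], rates["B"])  # -> 0.75 0.75
```

Equal agreement rates alone do not rule out bias; they are a first check, not a full audit.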
AI is poised to be—indeed already is—a boon to both researchers and educators. And, as our recent challenge shows, using AI-assisted methods to autoscore responses to math problems is now doable, which should make grading math assessments faster and cheaper. More than that, these same methods have the potential to improve instruction by helping personalize learning experiences. If an autoscoring algorithm can evaluate what a student knows and can do, then it can also help teachers identify what students don't know and can't do. Take it from me, a former middle school math teacher: information doesn't get much more valuable than that.
As always, feel free to contact me at firstname.lastname@example.org.