Goodhart's Law and the Optimized Learner

A student taking a test out on the educational savanna, courtesy of Stable Diffusion.

These days when I talk with teachers working at the college level, I often hear complaints like this:

After I tried, with little success, to get a discussion going in class, a student approached me afterwards and said, "I know you are trying to make things interesting for us, but I think you could save a lot of time if you just told us what we have to do to get an A."

A variation of this is associated with assignments that are even slightly open-ended - meaning that the students are given some freedom to choose which topic to explore.

I gave an assignment to write an essay about a topic of their choice. During office hours several approached me to ask what they should write about and what their opinion and argument should be if they want to get an "A." I had to explain, again, that it was up to them to decide.

Why do we hear this so often these days, and what does it mean for education in the age of rapid climate change?

The Ecological Landscape of Education Today

Over the past century we've come to understand a little about how ecosystems are in dialogue with and shape the species that live within them. A classic example is the mixed population of black and white peppered moths in England. During the industrial revolution, when coal soot darkened most surfaces in industrial areas, the ratio of black to white moths shifted from predominantly white to predominantly black. The black moths had an easier time blending in with dark, soot-covered objects, making them harder for predators to find. After measures were taken to control air pollution and soot levels decreased, the moth population again became mostly white, as white moths could more easily blend in with their now lighter surroundings.

Let's imagine the students of today as a species living in an ecosystem that favors certain characteristics over others. Of course there's a good deal of variation across schools, but there's much that can be said which generally applies. 

The first is that we live in the information age, a time in which the total amount of information available has been increasing exponentially over the past century. To give one example, in the 1950s children may have had geology on their curriculum, but it didn't include anything about plate tectonics, a fringe theory at the time. Now it has to. And all the other subjects now contain loads of information discovered or invented in the past century, some of which will get added to the canon of “fundamentals” that make up a curriculum. Of course, some information gets edited out sometimes - last I checked, phrenology didn’t make the cut. But even so, there's simply a lot more to know today to qualify as “educated” than there has been for most of human history. 

How do we know if students are learning all this stuff? The field of academic testing is past its heyday in the 2000s, when massively funded initiatives like No Child Left Behind in the US made the case for multiple-choice tests as the best “objective” tool for evaluation of learning. But today many, if not most, schools still depend on multiple-choice testing for evaluation. As long as the phrase "show me the numbers" is still used to assert the rhetorical dominance of quantitative measurement in education, we know that psychometric tests of learning are still playing a big role in the public discourse. 

To be successful in the educational ecosystem, students today have to not only handle a lot of information, but also learn how to get good evaluations. If reproductive fitness is the criterion for success on the savanna, evaluation and scores tend to be the primary measures of success in schools. But this is not the first generation to face these problems, so we should expect that the current specimens of the species "student" will have already evolved to maximize fitness in their environment. One interesting trait we can point to in the students of today is that they have learned to focus their intellectual energy on what’s being measured. 

If you want to be competitive in school and you have a huge amount of information to learn, then memorizing the information that is likely to be evaluated should rank high on your list of priorities. The reason for this is simple: the educational ecosystem does not value information you have learned that it cannot measure. So today's species of optimized student focuses on what they need to learn to get the score and the grade, and de-prioritizes everything else. Students asking teachers what they have to do to get the grade are employing an efficient, proven success strategy. This is not something we should blame them for - they're just trying to survive in today’s educational ecosystem. To opt out of these kinds of optimizations would make it much harder for them to compete and succeed.

Unfortunately these optimizations, and the quantitative measurement-based educational ecosystem that inspired them, may be working against the goal of giving students a good education. If you believe that a learner’s choice and agency play a significant (or even essential) role in the learning process, as I do, then the news isn’t good. As things get more competitive, the need to optimize for evaluation begins to crowd out the space for agency, curiosity, and meaning-making - as well as the rest of the squishy, subjective, and idiosyncratic aspects of learning. Eventually we may reach a state where the ecology of education requires so much optimization that it effectively punishes initiative and independent inquiry. Perhaps this is what's behind curious phenomena like papers and patents becoming less disruptive over time.

Goodhart’s Law and the Optimized Learner

In a blog post titled "Too much efficiency makes everything worse," the AI researcher Jascha Sohl-Dickstein offers us a simple explanation of a phenomenon he calls the strong version of Goodhart's Law.

When a measure becomes a target, if it is effectively optimized, then the thing it is designed to measure will grow worse.

Sohl-Dickstein argues that Goodhart's Law applies in many different fields, but uses education as his first, perhaps most generally recognizable, example. If we apply Goodhart's Law to education, the story goes more or less like this.

  1. Because "learning" is a fuzzy, complex, hard-to-measure concept, well-intentioned policymakers choose a proxy for learning that's easier to objectively measure: scores on standardized, multiple-choice tests. Everyone involved - policymakers, teachers, parents, and students - sets out to improve these scores, the assumption being that if they get better, learning has improved.
  2. At first the scores get better and learning seems to be getting better too. But soon the proxy and learning start to diverge: The scores continue to get a little better, but learning starts getting much, much worse.

In the field of artificial intelligence, this is called "overfitting." It happens when the AI system one is trying to improve starts to fit the measurable proxy better than the more abstract, difficult to measure quality the proxy was chosen to represent.

Images by Jascha Sohl-Dickstein, borrowed from https://sohl-dickstein.github.io/2022/11/06/strong-Goodhart.html.

Assuming this model is a good fit to reality, what might we see happening in schools as a result of “overfitting” and the strong version of Goodhart’s Law?  We might see a lot of disengagement and an increase in anxiety associated with learning. Teachers may start to complain that the focus on tests "undermine teacher-student relationships, lead to a narrowing of the curriculum, demoralize teachers, and bore students." Students might look at school and learning in general as an unpleasant, tedious chore that’s mostly irrelevant to what they are actually interested in.

In my own experience as an educator and educational design researcher over the past two decades, I’ve noticed that young people seem to be becoming less capable of finding and pursuing things to be curious about on their own. The reason for this seems obvious. The need to optimize scores absorbs more and more of their time, time that might otherwise be spent being curious and exploring things without worrying about measurable outcomes. Sometimes when I invite learners to be creative they are initially suspicious, and want to know exactly how they’ll be assessed. It's usually still possible to engage them in a state of playful, creative exploration, but it takes more time and energy to create the right conditions and allay their anxieties.

I believe that the focus on measurement and high stakes testing has been part of what's driving the decline in the quality of education I’ve witnessed over the past few decades. Sohl-Dickstein's suggestion that this is one of many manifestations of the strong version of Goodhart's Law provides a useful lens through which to understand the problem - and to recognize that it applies across many different domains.

There’s an important social power dynamic here that thwarts our collective ability to respond to the effects of Goodhart's Law in education. If policymakers value objective quantitative measurements over what they see as 'squishy' subjective qualitative evaluations, then the fact that learning is getting worse while scores are steady or improving may not be acknowledged. Many policymakers have backgrounds in statistics and experimental psychology, and so have been trained to take an "evidence-based" approach to focusing on what can be measured. This makes them skeptical of subjective opinion and anecdotal data, as well as anything else that is difficult to directly quantify. It’s also worth keeping in mind that Upton Sinclair’s famous dictum may apply here, at least in some cases: “It is difficult to get a man to understand something, when his salary depends on his not understanding it.”

It's the practicing educators who will see the negative effects of Goodhart's Law long before the policymakers or the experimental psychologists do, because it's playing out in front of their eyes every day. Teachers, as practitioners, generally aren’t given the same level of intellectual respect that quantitatively trained researchers and policymakers are. So even after they communicate that these evaluation strategies are making things worse, sometimes by resigning in protest, the Goodhart’s Law dynamic can continue unabated for years or, as in the case of public education in the USA, decades. 

What specific mechanisms are behind the negative effects of Goodhart’s Law in education? There could be many different factors in play. One is surely that as soon as the criteria for evaluation become clear, everyone starts optimizing toward them at the expense of other vital but difficult-to-quantify aspects of learning. After all, learning is still a process that is complex, idiosyncratic, difficult to define and not particularly well understood. Perhaps Alfie Kohn’s description of the destructive effects of extrinsic rewards might make up part of the explanation as well. Whatever the mechanism, it’s important to remember that observable reality, even when based on subjective consensus, must take precedence over theories of psychometric evaluation. We don’t need to understand exactly why the strong version of Goodhart’s Law happens in education in order to acknowledge that it is happening, and take steps to mitigate it.

The Optimized Learner in the Age of Rapid Climate Change

Having become optimized to fit our educational ecosystems, where does this leave students today? The problem with a species becoming highly optimized for a particular environmental niche is that it makes them brittle in the face of ecological change. Formula One race cars are so optimized for perfectly paved race tracks that they can reach top speeds of well over 200 miles per hour. But if that race car hits a pothole - perhaps because the track maintenance crew missed it while distracted by flooding, heat waves, and other effects of climate change - that race car will probably crash. If you aren’t sure what sort of terrain you will encounter in the future, you don’t want a race car. What you want is an all-terrain vehicle. It can’t go nearly as fast, but it can handle lots of different ecological conditions. 

Optimizing for one or a few measurable qualities in education, like the ability to excel at test taking, may lead to greatness in those highly specialized domains. But the cost is that the species becomes less resilient to broad changes in the ecosystem. And resilience is exactly what's needed when one's ecological niche starts to change. Instead of maintaining educational ecosystems that reward high-scoring “race car” students, we should be retooling to shift the selection pressure in the direction of all-terrain vehicles. The criterion for success isn’t how well they can answer questions we already know the answers to, but how well they can improvise in the face of uncertainty and get things done in this messy, complex, and very real world. 

This doesn’t mean we can’t use any sort of measurement, quantitative or otherwise, for evaluation. Sohl-Dickstein describes several mitigation strategies against Goodhart’s Law. The most intuitive is to better align the proxy being measured with the learning goals. If you want learners to be able to propose and defend an argument about a topical issue, then create real conditions for them to do that and evaluate them. This may require a subjective evaluation strategy, but subjectivity will likely inject some noise into the system in a way that may also be helpful for the prevention of overfitting.

In addition to inspiration from Sohl-Dickstein and other fields, we can look to the evaluation strategies of educational traditions that don't suffer from the effects of Goodhart's Law. The Reggio Emilia approach uses reflective documentation of learning, a process that supports the development of both students and teachers. You can see beautiful examples of this in Ben's documentation of his work at Newtowne School.

My own view is that we should be creating learning communities in which educators support young people in doing hands-on work with real world problems, ideally those that are related to climate change. This requires an evaluation strategy that can be customized for each student and their project, thereby avoiding many of the risks of overfitting and Goodhart’s Law. In doing this, we’d have to sacrifice the ability to rank and compare learners across identical, objective criteria. But, for my own part, I’m not sure what the obsessive focus on ranking and comparison of the past few decades has gotten us.

There are probably an infinite number of strategies for building resilience and creativity in learners, but they all require time, energy, and attention from both learners and educators to discover, develop, and share. As Graeber and Wengrow point out, the archaeological record is full of examples of humans collectively exploring different ways to live in changing ecosystems. But for those creative processes to get going, we need to acknowledge that we are no longer working towards a predefined future in which academic “optimization” is a ticket to success. If students and teachers have to spend their time focusing on test scores, then there will be no time left for them to figure out, together, how to build the resilience and creativity we all need to address the oncoming ecological crises.