University of Delaware

Creating Humanity's Last Exam

Article by Hilary Douwes | Photo courtesy of Manuel Schottdorf | Photo illustration by Jeffrey C. Chase | February 09, 2026

UD professor provides questions for test to benchmark AI learning

With the explosion of artificial intelligence and the rapid rate at which various programs seem to be "learning," how do we measure how fast AI's capabilities are advancing?

To get the answer, the nonprofit organization Center for AI Safety turned to experts around the world, asking for help in creating a test to benchmark AI programs' knowledge, accuracy and ability to reason.

The result is Humanity's Last Exam (HLE). The dramatically titled test consists of 2,500 questions, crowdsourced from more than 1,000 professors, experts, researchers and graduate students at nearly 500 institutions in 50 countries. Their work was published in the journal Nature on Jan. 28.

HLE isn't just any test.

"Many questions are at the very edge of what we humans currently know," said Manuel Schottdorf, a neuroscientist in the University of Delaware's Department of Psychological and Brain Sciences in the College of Arts and Sciences and HLE contributor. "It's a better test of figuring out whether the machines can come up with solutions independently."

Schottdorf studies the nervous system and the way our brains process perception, reasoning and planning. He submitted several questions on various topics to HLE, one of which made it onto the final version of the exam. All of the submissions were peer-reviewed and ranked. Any question whose answer was available online was disqualified, as it would be available to AI programs, defeating HLE's purpose.

​​"They were particularly interested in things like math puzzles, obscure knowledge and niche domains, which are things that are unlikely to appear in the body of training data that these machines have absorbed. That way you get a better sense of how good they actually are in reasoning," Schottdorf said.

For example, Schottdorf submitted an image of a poem along with a question about when the author was born. We can't tell you much more without giving away the answer. Schottdorf said that to solve it, one would have to recognize the font and the language in the image, and then identify the author.

"It's not a hard question if you know how to read the font and speak the language, but if you don't, it's a pretty difficult question," he said.

UDaily asked Schottdorf about HLE, whether we can trust AI and whether the programs will ever be as smart as humans.

Q: Where does AI struggle with these questions?

Schottdorf: I think one of the fundamental limitations is that AI doesn't really experience the real world. There's nothing empirical about it. Some people submitted questions like: "Picture yourself standing on a beach and the sun is 12 degrees above the horizon and it's 10 o'clock in the morning. Where on the planet are you?" To solve it, you have to do a little trigonometry to figure out how the planets behave, and how we move around the sun, figure out the time zones and all that, and that requires a mental representation of the physical process itself. The machines typically don't form such representations. AI may know the words, but that isn't enough. This is one of the reasons why my colleagues and I here at UD study abstract representations in animal and human brains.
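
To make the trigonometry concrete, here is a minimal sketch, in Python, of the kind of calculation that beach question demands. It is ours, not the exam's or the article's: it assumes an equinox (solar declination of zero) and reads "10 o'clock" as local solar time, which determines your latitude but not where you are along it.

    import math

    # Solar elevation satisfies:
    #   sin(h) = sin(lat) * sin(decl) + cos(lat) * cos(decl) * cos(hour_angle)
    # Assumptions (ours, not the article's): declination = 0 (an equinox), and
    # "10 o'clock" means local solar time, so the hour angle is
    # (10 - 12) * 15 = -30 degrees (the sun moves 15 degrees per hour).

    elevation = math.radians(12.0)              # sun 12 degrees above the horizon
    hour_angle = math.radians((10 - 12) * 15)   # 10:00 solar time -> -30 degrees
    declination = 0.0                           # equinox assumption

    # With declination = 0, the formula reduces to
    #   sin(h) = cos(lat) * cos(hour_angle),
    # which we can invert for latitude.
    cos_lat = math.sin(elevation) / (math.cos(declination) * math.cos(hour_angle))
    latitude = math.degrees(math.acos(cos_lat))
    print(f"latitude is about {latitude:.1f} degrees north or south")  # ~76.1

Pinning down longitude as well would require the date and the time zone, precisely the kind of physical bookkeeping Schottdorf says the machines don't represent.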

Q: Can AI really learn reasoning skills?

Schottdorf: What is different about our reasoning abilities is that, from a much smaller sample, we infer structure in the world. And we use language as a method to convey what we have perceived physically. So the underlying mechanism is quite different. A successful learning machine would be able to form these same thoughts from a much smaller body of training data by using this ability to build mental representations. I think that is the key missing step here. Once that is solved, it should be possible to build machines that can reason using a much, much smaller training data set.

Q: How trustworthy do you find AI right now, and what advice would you give to the public?

Schottdorf: Based on how AI programs are trained and how they work, hallucinations and a lack of trustworthiness are fundamentally baked into them at the moment. These programs can expedite your search process, but you really need to double-check everything they spit out. Sometimes you get answers that are just nonsensical, despite all of the claimed progress that they have made. Even these little AI summaries should be simple enough to get right, yet some of the mistakes are really obvious.

For some things it's great, like if you want to look up a pancake recipe on a Saturday morning. But if it's more critical for your own wellbeing, then don't take the output at face value.

Q: If an AI program aces your HLE question, what does that tell academia and the rest of the world?

Schottdorf: It says that the program is very good at answering very hard questions. Doing well on HLE is a necessary but not a sufficient criterion for saying that machines have reached true intelligence. They will have to be good enough to solve these questions, but that fact alone doesn't allow us to conclude that machines are truly intelligent. It's a good starting point, but too many other things are still missing. Plus, we have to consider the possibility that some of the solutions to these questions have simply leaked onto the internet.
