The University of Manchester

02/09/2026 | Press release | Distributed by Public on 02/09/2026 05:06

University of Manchester academics contribute to the toughest AI benchmark


9 February 2026 | 11:04 (Europe/London)


Researchers from The University of Manchester have contributed to a new global benchmark designed to measure the limits of today's most advanced artificial intelligence (AI) systems.

As large language models such as ChatGPT and Gemini have rapidly improved in recent years, many widely used benchmarks have become less informative. In 2023, leading models were found to pass the Turing test and, separately, in 2025 they achieved gold-medal-level performance on International Mathematical Olympiad questions, scoring over 80% accuracy.

Now, two Manchester mathematicians, Dr Cesare Giulio Ardito and Dr Igor Chernyavsky, have joined nearly 1,000 expert contributors worldwide to create a multidisciplinary academic test called "Humanity's Last Exam" (HLE), which sets AI systems a fresh challenge.

The test consists of 2,500 rigorously reviewed questions spanning dozens of disciplines, from mathematics and the natural sciences to humanities. Questions are deliberately precise, closed-ended and resistant to simple internet search or memorisation, with some using both textual and image data.

Every question in HLE was tested against leading AI models before inclusion. If an AI system could answer a question correctly at the time the benchmark was designed, it was rejected.

The study, now published in Nature, found that leading models answered fewer than 10% of the HLE questions correctly when the dataset was first released in early 2025, despite scoring above 80% on more conventional benchmarks.

HLE dataset creation pipeline

Although the rapid pace of AI development has enabled some systems to improve their scores significantly in less than a year, the top-ranked models still score just below 40%. The results also show that many AI systems frequently express high confidence in incorrect answers to HLE questions, although their ability to self-assess their own knowledge gaps has gradually improved.

Dr Cesare Giulio Ardito said: "I'm happy that the University of Manchester is represented among contributors from all over the world. This was a human team effort and, so far, we appear to still have an edge."

Although this new AI benchmark only measures performance on closed-ended, expert-level questions at the frontier of current knowledge, the authors hope it will help identify remaining limitations and potentially capture emerging generalist research capabilities.

This research was published in the journal Nature.

Full title: A benchmark of expert-level academic questions to assess AI capabilities

DOI: https://doi.org/10.1038/s41586-025-09962-4

The University of Manchester published this content on February 09, 2026, and is solely responsible for the information contained herein. Distributed via Public Technologies (PUBT), unedited and unaltered, on February 09, 2026 at 11:06 UTC. If you believe the information included in the content is inaccurate or outdated and requires editing or removal, please contact us at [email protected]