MIT - Massachusetts Institute of Technology

November 18, 2025 | Press release

Bigger datasets aren’t always better

Determining the least expensive path for a new subway line underneath a metropolis like New York City is a colossal planning challenge - involving thousands of potential routes through hundreds of city blocks, each with uncertain construction costs. Conventional wisdom suggests extensive field studies across many locations would be needed to determine the costs associated with digging below certain city blocks.

Because these studies are costly to conduct, a city planner would want to perform as few as possible while still gathering the most useful data for making an optimal decision.

With almost countless possibilities, how would they know where to start?

A new algorithmic method developed by MIT researchers could help. Their mathematical framework provably identifies the smallest dataset that guarantees finding the optimal solution to a problem, often requiring fewer measurements than traditional approaches suggest.

In the case of the subway route, this method considers the structure of the problem (the network of city blocks, construction constraints, and budget limits) and the uncertainty surrounding costs. The algorithm then identifies the minimum set of locations where field studies would guarantee finding the least expensive route. The method also shows how to use these strategically collected data to find the optimal decision.

This framework applies to a broad class of structured decision-making problems under uncertainty, such as supply chain management or electricity network optimization.

"Data are one of the most important aspects of the AI economy. Models are trained on more and more data, consuming enormous computational resources. But most real-world problems have structure that can be exploited. We've shown that with careful selection, you can guarantee optimal solutions with a small dataset, and we provide a method to identify exactly which data you need," says Asu Ozdaglar, Mathworks Professor and head of the MIT Department of Electrical Engineering and Computer Science (EECS), deputy dean of the MIT Schwarzman College of Computing, and a principal investigator in the Laboratory for Information and Decision Systems (LIDS).

Ozdaglar, co-senior author of a paper on this research, is joined by co-lead authors Omar Bennouna, an EECS graduate student, and his brother Amine Bennouna, a former MIT postdoc who is now an assistant professor at Northwestern University; and co-senior author Saurabh Amin, co-director of the Operations Research Center, a professor in the MIT Department of Civil and Environmental Engineering, and a principal investigator in LIDS. The research will be presented at the Conference on Neural Information Processing Systems.

An optimality guarantee

Much of the recent work in operations research focuses on how to best use data to make decisions, but this assumes these data already exist.

The MIT researchers started by asking a different question - what are the minimum data needed to optimally solve a problem? With this knowledge, one could collect far fewer data to find the best solution, spending less time, money, and energy conducting experiments and training AI models.

The researchers first developed a precise geometric and mathematical characterization of what it means for a dataset to be sufficient. Every possible set of costs (travel times, construction expenses, energy prices) makes some particular decision optimal. These "optimality regions" partition the space of possible costs. A dataset is sufficient if it can determine which region contains the true cost.
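To make the idea concrete, here is a minimal illustrative sketch, not the researchers' code: a single uncertain block cost and two candidate routes, with all names and numbers invented for illustration. The range of possible costs splits into two optimality regions, and a dataset is sufficient as soon as it pins down which region contains the true cost.

```python
# Toy example (hypothetical numbers): Route A costs a flat 7 units;
# Route B costs 3 + c, where c is an uncertain block cost in [1, 10].
# The cost space [1, 10] splits into two optimality regions:
#   c < 4  -> Route B is cheaper;  c > 4 -> Route A is cheaper.

def optimal_route(c: float) -> str:
    """Return the cheaper route for a given value of the uncertain cost c."""
    cost_a = 7.0          # fixed cost of Route A
    cost_b = 3.0 + c      # Route B depends on the uncertain block cost
    return "A" if cost_a <= cost_b else "B"

def interval_is_sufficient(lower: float, upper: float) -> bool:
    """A field study that narrows c to [lower, upper] is sufficient when every
    value in that interval selects the same route. Because the cost difference
    is monotone in c here, checking the two endpoints is enough."""
    return optimal_route(lower) == optimal_route(upper)

print(interval_is_sufficient(5, 9))   # True:  Route A is provably optimal
print(interval_is_sufficient(2, 6))   # False: the interval straddles c = 4
```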

This characterization provides the foundation for the practical algorithm they developed, which identifies datasets that guarantee finding the optimal solution.

Their theoretical exploration revealed that a small, carefully selected dataset is often all one needs.

"When we say a dataset is sufficient, we mean that it contains exactly the information needed to solve the problem. You don't need to estimate all the parameters accurately; you just need data that can discriminate between competing optimal solutions," says Amine Bennouna.

Building on these mathematical foundations, the researchers developed an algorithm that finds the smallest sufficient dataset.

Capturing the right data

To use this tool, a user inputs the structure of the task, such as the objective and constraints, along with the information they already know about the problem.

For instance, in supply chain management, the task might be to reduce operational costs across a network of dozens of potential routes. The company may already know that some shipment routes are especially costly, but lack complete information on others.

The researchers' iterative algorithm works by repeatedly asking, "Is there any scenario that would change the optimal decision in a way my current data can't detect?" If yes, it adds a measurement that captures that difference. If no, the dataset is provably sufficient.
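A hedged sketch of that loop is below, under the simplifying assumptions that uncertainty is represented by a finite set of candidate cost vectors, that `solve` returns the optimal decision for any cost vector, and that a measurement reveals one uncertain parameter exactly; these names are illustrative, not the authors' implementation.

```python
from itertools import combinations
from typing import Callable, Sequence

def select_measurements(
    scenarios: Sequence[Sequence[float]],
    solve: Callable[[Sequence[float]], int],
) -> set[int]:
    """Greedily add measurements until no two scenarios that agree on every
    measured parameter lead to different optimal decisions."""
    measured: set[int] = set()
    while True:
        conflict = None
        for s, t in combinations(scenarios, 2):
            agree = all(s[i] == t[i] for i in measured)
            if agree and solve(s) != solve(t):
                conflict = (s, t)   # current data cannot tell these apart
                break
        if conflict is None:
            return measured          # dataset is provably sufficient
        s, t = conflict
        # add one measurement that separates the two conflicting scenarios
        measured.add(next(i for i in range(len(s)) if s[i] != t[i]))
```

In the supply chain example above, `solve` would stand in for the route-selection optimizer, and each returned index would correspond to auditing the cost of one shipping lane.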

This algorithm pinpoints the subset of locations that need to be explored to guarantee finding the minimum-cost solution.

Then, after collecting those data, the user can feed them to another algorithm the researchers developed, which finds the optimal solution. In this case, that would be the set of shipment routes to include in a cost-optimal supply chain.
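Continuing the same illustrative sketch (the researchers' actual second algorithm may work differently): once the selected measurements have been collected, every scenario consistent with the observed values yields the same optimal decision, so the decision can be read off directly.

```python
def decide(scenarios, measured, observed, solve):
    """Return the optimal decision implied by the collected measurements.
    'observed' maps each measured index to the value found in the field."""
    consistent = [s for s in scenarios
                  if all(s[i] == observed[i] for i in measured)]
    # Sufficiency guarantees all consistent scenarios agree on the decision.
    return solve(consistent[0])
```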

"The algorithm guarantees that, for whatever scenario could occur within your uncertainty, you'll identify the best decision," Omar Bennouna says.

The researchers' evaluations revealed that, using this method, it is possible to guarantee an optimal decision with a much smaller dataset than would typically be collected.

"We challenge this misconception that small data means approximate solutions. These are exact sufficiency results with mathematical proofs. We've identified when you're guaranteed to get the optimal solution with very little data - not probably, but with certainty," Amin says.

In the future, the researchers want to extend their framework to other types of problems and more complex situations. They also want to study how noisy observations could affect dataset optimality.

"I was impressed by the work's originality, clarity, and elegant geometric characterization. Their framework offers a fresh optimization perspective on data efficiency in decision-making," says Yao Xie, the Coca-Cola Foundation Chair and Professor at Georgia Tech, who was not involved with this work.
