02/05/2026 | Press release
Historical data are central to the economic research that informs policymaking. But economists can't always access usable historical data, and even when the raw data are available, collecting and processing them can be costly and time-consuming.
Since the mid-2010s, deep learning tools have made it easier to extract complex historical data, but these tools are typically highly specialized and require significant technical expertise. Multimodal large language models (LLMs), however, have recently created new research opportunities for economists.
Multimodal LLMs are particularly well suited for compiling and synthesizing large amounts of panel data, which are collected over time and cross-sectionally across individuals or groups. But just how reliable are these LLMs when processing panel data? The stakes are high because decisions are made and policies are implemented based on the findings of economic research, and economic research is only as good as the data that back it up.
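As a minimal illustration of what "panel" structure means in practice, the sketch below builds a toy panel in Python with pandas: the same cross-sectional units observed repeatedly over time. The counties and values are hypothetical and do not come from the study.

```python
import pandas as pd

# Toy panel data: the same cross-sectional units (counties)
# observed in multiple years. Values are made up for illustration.
panel = pd.DataFrame(
    {"county": ["Adams", "Adams", "Berks", "Berks"],
     "year": [1925, 1926, 1925, 1926],
     "registrations": [1200, 1350, 980, 1010]},
).set_index(["county", "year"])
print(panel)
```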
For their paper, "Can LLMs Credibly Transform the Creation of Panel Data from Diverse Historical Tables?" Vitaly Meursault and Christopher Severen of the Federal Reserve Bank of Philadelphia and Verónica Bäcker-Peral of the Massachusetts Institute of Technology introduce and evaluate a novel "digitization architecture" to show the potential of LLMs. The authors use 20th-century U.S. vehicle registration data as their test case.
Historical data tables, including vehicle registration data, typically feature nonstandard layouts and formats that are often produced by varied, decentralized sources. If LLMs can reliably extract these data, they should provide economists with greater opportunities to pursue research, although "rigorous validation" of these data, the authors write, "becomes paramount to ensure data reliability."
Lengthy data series encompassing health, demographic, environmental, and other historical indicators enhance economists' understanding of long-term patterns and relationships, and this knowledge can guide policy. For example, these data can help policymakers see how social connectedness affects crime1 or how trade agreements can lead to political realignment.2 Despite the importance of these data, they write, "much crucial historical information remains locked in archival documents due to the extensive resources required to convert [the information] into machine-readable formats suitable for analysis."
The authors used their digitization architecture to transform the raw historical data on vehicle registrations into cohesive, analysis-ready data panels. They note that the raw data tables, "produced independently by various state-level agencies, exhibit precisely the kind of heterogeneity that makes traditional digitization difficult, thus serving as an ideal benchmark." To transform these heterogeneous data tables, they designed an LLM-based pipeline to process the data's complex and varied layouts. This processing is possible thanks to the LLM's ability to integrate visual and textual inputs. Nonetheless, a researcher must guide the process: writing natural-language prompts for the LLM, applying domain expertise about the data (for example, its historical reporting conventions), and refining the prompts based on observed errors.
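As an illustration of that researcher-guided workflow, the sketch below sends one scanned table image to a multimodal model with a domain-informed prompt. It assumes an OpenAI-style chat API; the model name, prompt wording, and output schema are placeholders, not the authors' actual pipeline.

```python
import base64
import json

from openai import OpenAI  # assumes the OpenAI Python client; the paper's stack is not specified

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt encoding domain expertise (e.g., historical reporting
# conventions); in practice it would be refined as errors are observed.
EXTRACTION_PROMPT = (
    "This image shows a 20th-century U.S. vehicle-registration table. "
    "Return a JSON object with a 'rows' list; each row has fields "
    "state, county, year, and registrations. Counts may use comma "
    "thousands separators; drop footnote marks."
)

def extract_table(image_path: str) -> dict:
    """Send one scanned table image to a multimodal LLM and parse the JSON reply."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        response_format={"type": "json_object"},  # ask for machine-readable output
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": EXTRACTION_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return json.loads(response.choices[0].message.content)
```

Iterating on the prompt as extraction errors surface is the kind of refinement loop the authors describe.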
Next, the authors evaluated their LLM-produced data set by comparing it with a gold standard data set. The gold standard data set they created includes nearly 700 diverse tables and more than 49,000 data points, which they manually corrected for numeric and formatting errors. They then used their LLM and gold standard data sets to run two econometric analyses of vehicle adoption. These two analyses allowed them to test whether the two data sets yield similar results.3
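A rough sketch of that validation logic: fit the same regression to both data sets and compare the estimates. The file names and regression specification below are hypothetical; the release does not detail the authors' two analyses.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def adoption_model(df: pd.DataFrame):
    """Illustrative spec: log registrations on log income with county and year fixed effects."""
    return smf.ols(
        "np.log(registrations) ~ np.log(income) + C(county) + C(year)", data=df
    ).fit()

# Hypothetical file names for the two panels being compared.
llm_fit = adoption_model(pd.read_csv("llm_panel.csv"))
gold_fit = adoption_model(pd.read_csv("gold_panel.csv"))

# "Statistically indistinguishable" means the coefficient of interest differs
# across data sets by less than sampling noise.
for name, fit in [("LLM", llm_fit), ("gold standard", gold_fit)]:
    print(f"{name}: elasticity {fit.params['np.log(income)']:.3f} "
          f"(s.e. {fit.bse['np.log(income)']:.3f})")
```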
They find that, for their analyses of historical vehicle registrations, it makes no difference whether they use the LLM data set or the gold standard data set: the results of both are "statistically indistinguishable."4 This is good news for economists, who can use LLMs to create data indistinguishable from carefully hand-corrected data. And for their test case, the LLM approach unlocks more granular county-level data rather than only state-level data.
What's more, for their vehicle adoption test case, the LLM can process historical data at one-hundredth the cost per table compared with outsourcing options.5 The LLM approach, they find, makes "complex panel dataset creation significantly more accessible to domain experts, regardless of technical background or budget." And when evaluating their econometric analyses of vehicle adoptions, they find that the conclusions drawn from their regressions using the LLM-produced data set are consistent with those derived from the gold standard data set.
Meursault, Severen, and Bäcker-Peral conclude that LLMs "offer a watershed change for the digitization of historical tables." They encourage other researchers to adopt LLMs to unlock historical information and expand their research. They recommend, however, that economists still use the gold standard approach to evaluate LLM-generated historical data. Doing so will maintain the integrity of their research findings.