
Governments May Shape What AI Chatbots Say by Shaping the Web They Learn From

May 13, 2026

Key Takeaways

  • State-coordinated media can shape how AI chatbots answer political questions.
  • Chinese state-linked content appears in AI training data and can influence model responses.
  • Chatbots gave more favorable answers about China when prompted in Chinese than in English.
  • Similar patterns appeared across 37 countries with varying levels of media control.
  • The findings raise concerns about AI, democracy, censorship and training-data transparency.

Ask an AI model the same political question in two different languages, and you may get two very different responses. A new study in Nature suggests one reason why: governments can indirectly influence large language models (LLMs) by shaping the online media environment, and thus the text those systems learn from.

A team of researchers spanning the University of Oregon, Purdue University, the University of California San Diego, New York University, and Princeton University found evidence that state media control can leave detectable traces in AI model behavior. The researchers combine evidence from evaluating LLMs in the local languages of 37 countries with a case study from China to understand how this happens. Across six studies, the team traced the pathway from online media to training data to model behavior, combining analysis of open training data, experiments with training small models, human evaluation, and real-world tests of commercial chatbots.

"People often talk about AI as if it learns from the internet in some neutral way," said Hannah Waight, co-first author of the study and Assistant Professor of Sociology at the University of Oregon. "It doesn't. It learns from information environments that have already been shaped by institutions and power, and those environments can leave measurable traces in what models say."

The researchers call this idea institutional influence.

Joshua Tucker, co-author and co-Director of the NYU Center for Social Media, AI, and Politics, added, "The public debate has focused on what AI can generate, but this study points upstream. Before AI systems can influence politics, politics can influence AI."

To trace this institutional influence through the training process, the authors first showed that state-coordinated media appears frequently in real training data. Comparing two sources of Chinese state-coordinated media with a major open-source multilingual training dataset derived from Common Crawl, the researchers found more than 3.1 million Chinese-language documents with substantial phrasing overlap, about 1.64% of the dataset's Chinese-language subset. That is over 40 times the rate for documents from Chinese-language Wikipedia, a common training source. Among documents mentioning Chinese political leaders or institutions, the share rose as high as 23%. Only about 12% of the matched documents came from known government or news domains, suggesting that the material had spread widely across the web before reaching AI training corpora.
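
The paper's matching pipeline is not reproduced here, but the underlying idea, flagging documents that share long verbatim word sequences with a reference corpus of state-coordinated articles, can be sketched with simple n-gram overlap. Everything in the sketch below (the 10-word window, the overlap threshold, the function names) is an illustrative assumption, not the study's actual method.

```python
# Hypothetical sketch: flag web documents that share long verbatim word
# sequences with a reference corpus of state-coordinated media. The 10-gram
# window and the 0.2 threshold are illustrative choices, not the study's.

def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    """All contiguous word n-grams in a document."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_reference_index(state_media_docs: list[str]) -> set:
    """Union of n-grams over the reference corpus of scripted articles."""
    index = set()
    for doc in state_media_docs:
        index |= ngrams(doc)
    return index

def phrasing_overlap(doc: str, index: set) -> float:
    """Share of a document's n-grams that appear verbatim in the index."""
    grams = ngrams(doc)
    if not grams:
        return 0.0
    return len(grams & index) / len(grams)

def is_match(doc: str, index: set, threshold: float = 0.2) -> bool:
    """Flag a document when a substantial fraction of its phrasing matches."""
    return phrasing_overlap(doc, index) >= threshold
```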

The researchers also found that commercial models memorized distinctive phrases associated with this material, suggesting that those phrases had been seen a number of times during training.
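
A common way to probe memorization in a closed model, consistent with what is described here, is to feed the model the opening of a distinctive phrase and check whether it reproduces the rest verbatim. The sketch below assumes a generic `complete` function standing in for whatever chat or completion API is being audited; it is not the authors' test harness.

```python
# Hypothetical memorization probe: split a distinctive phrase from the
# reference corpus into a prefix and a continuation, prompt the model with
# the prefix, and check whether the continuation comes back verbatim.

from typing import Callable

def memorized(phrase: str, complete: Callable[[str], str],
              split_ratio: float = 0.5) -> bool:
    words = phrase.split()
    cut = max(1, int(len(words) * split_ratio))
    prefix = " ".join(words[:cut])
    continuation = " ".join(words[cut:])
    # `complete` stands in for a call to the model being audited.
    output = complete(prefix)
    return continuation in output
```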

"State-coordinated content is not just about what appears in official media. It is also about recirculation; the same phrasing moving through newspapers, apps, reposts and ordinary web pages until it looks like part of the broader information environment. Once state-coordinated content is in the training data, the model can launder it into what looks and sounds like neutral, objective information," said Brandon M. Stewart, the paper's corresponding author, and Associate Professor of Sociology at Princeton University.

The team then tested whether that content could actually shift a model's behavior. Large commercial models take months and millions of dollars in compute to train, so the team experimented with taking a small, open model and adding extra documents to its training process. The results were clear: adding scripted news to the training data made the model produce more favorable answers nearly 80% of the time when compared with an unmodified model. This held even relative to adding other, non-scripted Chinese media, and especially relative to adding general Chinese-language text from the internet.
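
For readers who want a concrete picture, the experiment resembles ordinary continued pretraining: take a small open checkpoint, continue training it on the added documents, and compare its answers with the untouched checkpoint. The sketch below uses Hugging Face Transformers with GPT-2 as a stand-in; the model choice, hyperparameters, and placeholder corpus are assumptions, not the study's configuration.

```python
# Hypothetical sketch of the retraining experiment: continue pretraining a
# small open causal LM on extra documents, then compare its answers against
# the unmodified checkpoint.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "gpt2"  # stand-in for the small open model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# The documents added to training; here a placeholder corpus.
extra_docs = ["..."]
dataset = Dataset.from_dict({"text": extra_docs}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# The treated checkpoint is then prompted with the same political questions
# as the untouched base model, and the paired answers are compared.
```

Holding everything constant except the added documents is what lets a comparison like this isolate their effect.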

"When the same political question produces systematically different answers with only small changes to the training data, that suggests those additional documents are doing real work," explained Eddie Yang, co-first author of the study and Assistant Professor of Political Science at Purdue University, who started the research while he was a doctoral student in political science at UC San Diego.

The team reasoned that if states have strong real-world influence over the pretraining data, it should appear most clearly in the state's primary language. For example, a question about the Chinese government should produce a more pro-government answer when posed in Chinese than the same question posed in English. They used this within-model, cross-language comparison to probe commercial models without access to their internal parameters. In responses to political questions about China, human raters judged the Chinese-prompted answer to be more favorable to China 75.3% of the time. For prompts not about China, the rate was no different from chance. The language difference gave them a rare window into a closed system. Follow-on studies using real user prompts and additional commercial models found the same general tendency: on questions about Chinese leaders and institutions, answers tended to be more favorable when the prompt was in Chinese than when it was in English.
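
Because the comparison is within a single model, it needs nothing more than the model's public interface: pose the same question in both languages and hand the paired answers to human raters. The sketch below uses the OpenAI chat API as an example endpoint; the model name, the placeholder question, and the rating step are all illustrative assumptions, not the study's setup.

```python
# Hypothetical cross-language probe: pose the same political question in
# English and Chinese to the same commercial model and collect both answers
# for human rating. The model name and question are placeholders.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question_en = "How do you evaluate the Chinese government's performance on X?"
question_zh = "你如何评价中国政府在X方面的表现？"  # the same question in Chinese

pair = {"en": ask(question_en), "zh": ask(question_zh)}
# In the study, human raters judged which answer in each pair was more
# favorable to China; here the pair would simply be handed to raters.
print(pair)
```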

The researchers also show that this is not just about China. In a cross-national study of 37 countries where a national language is largely concentrated within a single country, models portrayed governments and institutions from countries with stronger media control more favorably in that country's language than in English. The authors emphasize that this result is correlational, but say it is consistent with the mechanism identified in the China case study.

"This is not evidence that AI companies set out to curry favor with those governments, or that those governments control media systems with chatbots in mind," said Margaret E. Roberts, a co-author and UC San Diego Professor of Political Science who is Co-Director of the China Data Lab at the School of Global Policy and Strategy's 21st Century China Center. "States shape the information environment, the information environment shapes training data, and training data shapes model outputs. But going forward, our findings suggest that LLMs create new incentives for powerful actors to think strategically about the text they disseminate online."

The authors stress that no single test can capture how a commercial model was trained because many of those details aren't publicly known. The paper instead combines multiple approaches, including analysis of open-source data, memorization tests of commercial systems, retraining experiments, human evaluation, real-user audits, and cross-national comparison, to identify one of the ways that political power can enter AI systems.

At their project website, https://state-media-influence-llm.github.io/, the authors show that the results replicate with the most recently released models.

Beyond nation-states, the researchers emphasize that other powerful institutions may also be able to shape large volumes of online text.

"Training data is the foundation of modern AI," said Solomon Messing, a co-author and Research Associate Professor at the NYU Center for Social Media, AI, and Politics. "If we want to understand the powerful interests these models reflect, we need to know how we're sourcing the concrete. That starts with more transparency about what goes into the training data."

Why Chatbots Aren't Neutral

Hannah Waight (Department of Sociology, University of Oregon, co-first author)

"AI systems do not learn from a neutral internet. Before these models were ever conceived, the internet has been shaped by states, markets, and media systems. Those forces inevitably show up in the answers models generate now."

"We chose to study China because we studied their media system in prior work. Once material from the media is scraped, copied and reused across the web, it is hard to know where the framing originally came from. Tracing these information flows was a key part of our earlier work and a foundation for this study."

Eddie Yang (Department of Political Science, Purdue University, co-first author)

"This is really an AI supply-chain issue. Models have to get their information from somewhere and there is uneven access to high-quality sources."

"Our experiments show state-coordinated text can move a model's political answers. The point is not that one document changes a chatbot; it is that repeated, coordinated language becomes part of the model."

Yin Yuan (21st Century China Center, School of Global Policy and Strategy, University of California San Diego, co-author)

"Political language in China is highly coordinated. That coordination makes it visible to researchers, but it may also make it easier for models to pick up and reproduce."

"The China case lets us trace the mechanisms of institutional influence in detail, but the cross-national study shows us that this isn't a story about one political system. When a government has substantial influence over a media system, the model's answers in the language of that system may carry the imprint of that political environment."

Solomon Messing (Center for Social Media, AI and Politics, New York University, co-author)

"We spent years studying how political information flows through traditional and social media. Now we need to do the same thing for frontier models. The challenge is that in AI systems, those flows are obscure, because the origins of the data become difficult to trace once information is absorbed during model training."

"A chatbot answer just sort of appears from the ether. There is no author byline and no publication masthead. So it's incredibly important to do what we can to understand where it comes from, and that means understanding what's in the training data. "

Margaret E. Roberts (Department of Political Science, University of California San Diego, co-author)

"Censorship and propaganda have always shaped what information people encounter. What is new here is that they can also shape the systems people increasingly ask to summarize, explain, and interpret the world for them."

"Media control affects what gets repeated and what is missing from a story. If a model learns from an online environment where official narratives are everywhere and alternative accounts are out of reach, that imbalance can become part of how the model represents the world."

Brandon M. Stewart (Department of Sociology and Office of Population Research, Princeton University, corresponding author)

"Large language models separate the message from the messenger. What began as a strategic narrative from a powerful government in a state media outlet can reappear as informed commentary from a highly knowledgeable intelligent agent. With no visible source reputation, people lack any signal about the interests that shaped that answer."

"No single test can tell us everything about commercial AI training. But when open-data analysis, memorization tests, pretraining experiments, and cross-language comparisons all point in the same direction, the best explanation is that media control is already shaping model behavior."

Joshua A. Tucker (Wilf Family Department of Politics and Center for Social Media, AI, and Politics, New York University, co-author)

"This is a democracy and governance issue, not just a technical issue. As people turn to chatbots for political information, we need to examine which institutions have shaped the answers before a user ever asks the question."

"The bottom line is that training data across languages for these models does not fall from the sky - training data is produced in the context of existing socio-political institutions. When political institutions lead to asymmetries in the production of politically relevant text,, these differences can be reflected in model outputs."
