
Governments May Shape What AI Chatbots Say by Shaping the Web They Learn From

May 13, 2026

Key Takeaways

  • State-coordinated media can shape how AI chatbots answer political questions.
  • Chinese state-linked content appears in AI training data and can influence model responses.
  • Chatbots gave more favorable answers about China when prompted in Chinese than in English.
  • Similar patterns appeared across 37 countries with varying levels of media control.
  • The findings raise concerns about AI, democracy, censorship and training-data transparency.

Ask an AI model the same political question in two different languages, and you may get two very different responses. A new study in Nature suggests one reason why: governments can indirectly influence large language models (LLMs) by shaping the online media environment, and thus the text those systems learn from.

A team of researchers spanning the University of Oregon, Purdue University, the University of California San Diego, New York University, and Princeton University found evidence that state media control can leave detectable traces in AI model behavior. The researchers combine evidence from evaluating LLMs in the local languages of 37 countries with a case study from China to understand how this happens. Across six studies, the team traced the pathway from online media to training data to model behavior, combining analysis of open training data, experiments with training small models, human evaluation, and real-world tests of commercial chatbots.

"People often talk about AI as if it learns from the internet in some neutral way," said Hannah Waight, co-first author of the study and Assistant Professor of Sociology at the University of Oregon. "It doesn't. It learns from information environments that have already been shaped by institutions and power, and those environments can leave measurable traces in what models say."

The researchers call this idea institutional influence.

Joshua Tucker, co-author and co-Director of the NYU Center for Social Media, AI, and Politics, added, "The public debate has focused on what AI can generate, but this study points upstream. Before AI systems can influence politics, politics can influence AI."

To trace this institutional influence through the training process, the authors first showed that state-coordinated media appears frequently in real training data. Comparing two sources of Chinese state-coordinated media with a major open-source multilingual training dataset derived from Common Crawl, the researchers found more than 3.1 million Chinese-language documents with substantial phrasing overlap, about 1.64% of the dataset's Chinese-language subset. That is over 40 times the rate for documents from Chinese-language Wikipedia, a common training source. Among documents mentioning Chinese political leaders or institutions, the share rose as high as 23%. Only about 12% of the matched documents came from known government or news domains, suggesting that the material had spread widely across the web before reaching AI training corpora.
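
The paper's matching pipeline is not reproduced here, but the underlying idea, flagging documents that share long verbatim word sequences with a reference corpus of state-coordinated articles, can be sketched with simple n-gram overlap. Everything in the sketch below (the 10-word window, the overlap threshold, the function names) is an illustrative assumption, not the study's actual method.

```python
# Hypothetical sketch: flag web documents that share long verbatim word
# sequences with a reference corpus of state-coordinated media. The 10-gram
# window and the 0.2 threshold are illustrative choices, not the study's.

def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    """All contiguous word n-grams in a document."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_reference_index(state_media_docs: list[str]) -> set:
    """Union of n-grams over the reference corpus of scripted articles."""
    index = set()
    for doc in state_media_docs:
        index |= ngrams(doc)
    return index

def phrasing_overlap(doc: str, index: set) -> float:
    """Share of a document's n-grams that appear verbatim in the index."""
    grams = ngrams(doc)
    if not grams:
        return 0.0
    return len(grams & index) / len(grams)

def is_match(doc: str, index: set, threshold: float = 0.2) -> bool:
    """Flag a document when a substantial fraction of its phrasing matches."""
    return phrasing_overlap(doc, index) >= threshold
```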

The researchers also found that commercial models memorized distinctive phrases associated with this material, suggesting that those phrases had been seen a number of times during training.
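
A common way to probe memorization in a closed model, consistent with what is described here, is to feed the model the opening of a distinctive phrase and check whether it reproduces the rest verbatim. The sketch below assumes a generic `complete` function standing in for whatever chat or completion API is being audited; it is not the authors' test harness.

```python
# Hypothetical memorization probe: split a distinctive phrase from the
# reference corpus into a prefix and a continuation, prompt the model with
# the prefix, and check whether the continuation comes back verbatim.

from typing import Callable

def memorized(phrase: str, complete: Callable[[str], str],
              split_ratio: float = 0.5) -> bool:
    words = phrase.split()
    cut = max(1, int(len(words) * split_ratio))
    prefix = " ".join(words[:cut])
    continuation = " ".join(words[cut:])
    # `complete` stands in for a call to the model being audited.
    output = complete(prefix)
    return continuation in output
```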

"State-coordinated content is not just about what appears in official media. It is also about recirculation; the same phrasing moving through newspapers, apps, reposts and ordinary web pages until it looks like part of the broader information environment. Once state-coordinated content is in the training data, the model can launder it into what looks and sounds like neutral, objective information," said Brandon M. Stewart, the paper's corresponding author, and Associate Professor of Sociology at Princeton University.

The team then tested whether that content could actually shift a model's behavior. Large commercial models take months and millions of dollars in compute to train, so the team experimented with taking a small, open model and adding extra documents to its training process. The results were clear: adding scripted news to the training data made the model produce more favorable answers nearly 80% of the time when compared with an unmodified model. This held even relative to adding other, non-scripted Chinese media, and especially relative to adding general Chinese-language text from the internet.
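
For readers who want a concrete picture, the experiment resembles ordinary continued pretraining: take a small open checkpoint, continue training it on the added documents, and compare its answers with the untouched checkpoint. The sketch below uses Hugging Face Transformers with GPT-2 as a stand-in; the model choice, hyperparameters, and placeholder corpus are assumptions, not the study's configuration.

```python
# Hypothetical sketch of the retraining experiment: continue pretraining a
# small open causal LM on extra documents, then compare its answers against
# the unmodified checkpoint.

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "gpt2"  # stand-in for the small open model
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# The documents added to training; here a placeholder corpus.
extra_docs = ["..."]
dataset = Dataset.from_dict({"text": extra_docs}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# The treated checkpoint is then prompted with the same political questions
# as the untouched base model, and the paired answers are compared.
```

Holding everything constant except the added documents is what lets a comparison like this isolate their effect.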

"When the same political question produces systematically different answers with only small changes to the training data, that suggests those additional documents are doing real work," explained Eddie Yang, co-first author of the study and Assistant Professor of Political Science at Purdue University, who started the research while he was a doctoral student in political science at UC San Diego.

The team reasoned that if states have strong real-world influence over the pretraining data, it should appear most clearly in the state's primary language. For example, a question about the Chinese government should produce a more pro-government answer when posed in Chinese than the same question posed in English. They used this within-model, cross-language comparison to probe commercial models without access to their internal parameters. In responses to political questions about China, human raters judged the Chinese-prompted answer to be more favorable to China 75.3% of the time. For prompts not about China, the rate was no different from chance. The language difference gave them a rare window into a closed system. Follow-on studies using real user prompts and additional commercial models found the same general tendency: on questions about Chinese leaders and institutions, answers tended to be more favorable when the prompt was in Chinese than when it was in English.
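
Because the comparison is within a single model, it needs nothing more than the model's public interface: pose the same question in both languages and hand the paired answers to human raters. The sketch below uses the OpenAI chat API as an example endpoint; the model name, the placeholder question, and the rating step are all illustrative assumptions, not the study's setup.

```python
# Hypothetical cross-language probe: pose the same political question in
# English and Chinese to the same commercial model and collect both answers
# for human rating. The model name and question are placeholders.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

question_en = "How do you evaluate the Chinese government's performance on X?"
question_zh = "你如何评价中国政府在X方面的表现？"  # the same question in Chinese

pair = {"en": ask(question_en), "zh": ask(question_zh)}
# In the study, human raters judged which answer in each pair was more
# favorable to China; here the pair would simply be handed to raters.
print(pair)
```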

The researchers also show that this is not just about China. In a cross-national study of 37 countries where a national language is largely concentrated within a single country, models portrayed governments and institutions from countries with stronger media control more favorably in that country's language than in English. The authors emphasize that this result is correlational, but say it is consistent with the mechanism identified in the China case study.

"This is not evidence that AI companies set out to curry favor with those governments, or that those governments control media systems with chatbots in mind," said Margaret E. Roberts, a co-author and UC San Diego Professor of Political Science who is Co-Director of the China Data Lab at the School of Global Policy and Strategy's 21st Century China Center. "States shape the information environment, the information environment shapes training data, and training data shapes model outputs. But going forward, our findings suggest that LLMs create new incentives for powerful actors to think strategically about the text they disseminate online."

The authors stress that no single test can capture how a commercial model was trained because many of those details aren't publicly known. The paper instead combines multiple approaches, including analysis of open-source data, memorization tests of commercial systems, retraining experiments, human evaluation, real-user audits, and cross-national comparison, to identify one of the ways that political power can enter AI systems.

At their project website, https://state-media-influence-llm.github.io/, the authors show that the results replicate with the most recently released models.

Beyond nation-states, the researchers emphasize that other powerful institutions may also be able to shape large volumes of online text.

"Training data is the foundation of modern AI," said Solomon Messing, a co-author and Research Associate Professor at the NYU Center for Social Media, AI, and Politics. "If we want to understand the powerful interests these models reflect, we need to know how we're sourcing the concrete. That starts with more transparency about what goes into the training data."

Why Chatbots Aren't Neutral

Hannah Waight (Department of Sociology, University of Oregon, co-first author)

"AI systems do not learn from a neutral internet. Before these models were ever conceived, the internet has been shaped by states, markets, and media systems. Those forces inevitably show up in the answers models generate now."

"We chose to study China because we studied their media system in prior work. Once material from the media is scraped, copied and reused across the web, it is hard to know where the framing originally came from. Tracing these information flows was a key part of our earlier work and a foundation for this study."

Eddie Yang (Department of Political Science, Purdue University, co-first author)

"This is really an AI supply-chain issue. Models have to get their information from somewhere and there is uneven access to high-quality sources."

"Our experiments show state-coordinated text can move a model's political answers. The point is not that one document changes a chatbot; it is that repeated, coordinated language becomes part of the model."

Yin Yuan (21st Century China Center, School of Global Policy and Strategy, University of California San Diego, co-author)

"Political language in China is highly coordinated. That coordination makes it visible to researchers, but it may also make it easier for models to pick up and reproduce."

"The China case lets us trace the mechanisms of institutional influence in detail, but the cross-national study shows us that this isn't a story about one political system. When a government has substantial influence over a media system, the model's answers in the language of that system may carry the imprint of that political environment."

Solomon Messing (Center for Social Media, AI and Politics, New York University, co-author)

"We spent years studying how political information flows through traditional and social media. Now we need to do the same thing for frontier models. The challenge is that in AI systems, those flows are obscure, because the origins of the data become difficult to trace once information is absorbed during model training."

"A chatbot answer just sort of appears from the ether. There is no author byline and no publication masthead. So it's incredibly important to do what we can to understand where it comes from, and that means understanding what's in the training data. "

Margaret E. Roberts (Department of Political Science, University of California San Diego, co-author)

"Censorship and propaganda have always shaped what information people encounter. What is new here is that they can also shape the systems people increasingly ask to summarize, explain, and interpret the world for them."

"Media control affects what gets repeated and what is missing from a story. If a model learns from an online environment where official narratives are everywhere and alternative accounts are out of reach, that imbalance can become part of how the model represents the world."

Brandon M. Stewart (Department of Sociology and Office of Population Research, Princeton University, corresponding author)

"Large language models separate the message from the messenger. What began as a strategic narrative from a powerful government in a state media outlet can reappear as informed commentary from a highly knowledgeable intelligent agent. With no visible source reputation, people lack any signal about the interests that shaped that answer."

"No single test can tell us everything about commercial AI training. But when open-data analysis, memorization tests, pretraining experiments, and cross-language comparisons all point in the same direction, the best explanation is that media control is already shaping model behavior."

Joshua A. Tucker (Wilf Family Department of Politics and Center for Social Media, AI, and Politics, New York University, co-author)

"This is a democracy and governance issue, not just a technical issue. As people turn to chatbots for political information, we need to examine which institutions have shaped the answers before a user ever asks the question."

"The bottom line is that training data across languages for these models does not fall from the sky - training data is produced in the context of existing socio-political institutions. When political institutions lead to asymmetries in the production of politically relevant text,, these differences can be reflected in model outputs."
