Cornell University

05/04/2026 | Press release | Distributed by Public on 05/04/2026 08:49

What does it mean to train an AI to speak like you?

Ultra-personalized artificial intelligence for assisted communication risks muting aspects of the user's identity and occasionally breaches privacy, according to a new study from a Cornell Tech doctoral student who trained the technology on himself.

Doctoral student Tobias Weinberg, who uses augmentative and alternative communication (AAC), conceived the research when he realized he could train a model on his own speech data.

"Since I'm typing it anyway, I might as well see what I can do with it," he said.

Rather than relying on hypothetical users or lab-based simulations, or asking others to take on privacy and identity risks, Weinberg used his own speech to ask questions like: "What does it mean to train a machine to be you?"

That question became the foundation of "I, Robot?," presented in April at the 2026 CHI Conference on Human Factors in Computing Systems, which explores the promises and risks of ultra-personalized AI in AAC. The work emerged from Cornell Tech's Matter of Tech Lab and was co-authored by Weinberg; Thijs Roumen, assistant professor at Cornell Tech; Ricardo Gonzalez Penuela, a doctoral student in information science based at Cornell Tech; and Stephanie Valencia, an assistant professor at the University of Maryland.

Over seven months, Weinberg logged his real-world AAC communication and trained a language model on the data. For three months, he lived with the resulting personalized system in daily conversations. One of Weinberg's most striking findings was that the act of logging speech changed his behavior before any AI entered the picture.

"I didn't expect the discomfort," Weinberg said. "Just knowing I was logging changed how I spoke. I was no longer just speaking in the moment. I was curating a future dataset, and that changed my sense of freedom in conversation."

"There is an interesting tension between how Tobi wanted his speech to be recorded and perceived by others, and how he communicates," said Roumen, Weinberg's adviser, who is also affiliated with the Cornell Bowers College of Computing and Information Science. "This played a role in the recording, leading to self-censorship, and in living with the model where context sensitivity played a crucial role."

The first things to disappear, Weinberg said, were informal, emotionally charged expressions - dark jokes, gossip, venting - which were filtered out to prevent inappropriate text from resurfacing in professional settings. As a result, the AI learned what he described as a "cleaned-up" version of himself.

"It raised a question I still don't have a clean answer to," he said. "Can you build a truly personal AAC without also building a surveillance system for your own speech?"

Remarks that work in one setting can be inappropriate in another, yet once speech data is collected and aggregated, it often loses the social cues that gave it meaning. Roumen said this is a fundamental design challenge for these AI systems.

"Our findings suggest that ultra-personal AAC requires a high granularity of context," Roumen said. "Who you speak with, what the intention of the conversation is, and in what environment, all contribute to the type of suggestions that may or may not make sense."

While the personalized model performed well in structured settings - helping elaborate ideas smoothly and efficiently - Weinberg found that it struggled in fast-moving social situations, sometimes nudging conversations in directions he didn't intend.

"At bars, during quick topic shifts, or in mixed conversations jumping between work and personal life, the model often pushed toward familiar patterns instead of what I actually wanted to say in that moment," he said.

The implications, the researchers said, extend beyond AAC.

"Everything we found - the self-censorship, the privacy violations, the identity reshaping - happened with a system I built for myself, that I fully controlled and could turn off at any moment. Most users won't have that," Weinberg said. "Right now, we're moving very fast toward deploying these systems at scale without having figured out the basics: how to capture contextual information without erasing privacy; how the system knows when, what, and in front of whom to surface a suggestion; and how to keep users in control of the technology that mediates their speech."

Roumen said that while the technology itself is advancing rapidly, the social and ethical groundwork has not caught up.

"Our work highlights both the potential and the risks of personalized AI," he said. "Before these systems are deployed at scale, we need to think much harder about when recording should stop, how context is preserved, and how users remain in control of what becomes their voice."

This research was supported by a Google Research Scholar Award.

Grace Stanley is the staff writer-editor for Cornell Tech.
