05/22/2025 | News release | Distributed by Public on 05/22/2025 13:52
As GPT-4o showed us, conversational AI is turning the voice AI we imagined in movies like Her into reality. AI voice interactions are becoming richer, faster, and easier to use, making them a key part of building multimodal AI agents.
But there's still a gap: voice agents still don't behave quite like real humans.
In natural conversations, things like interruptions, pauses, and overlapping speech happen all the time. The user experience feels off when AI responses come too early, too late, or not at all.
In real-world conversations, how you pause or interrupt carries meaning, whether it signals politeness, hesitation, confidence, or something else. It's not just about what is said but how it is said. For voice agents to feel truly human, they need to do more than just "hear" and "reply" correctly: they must listen, understand, and respond naturally, with full awareness of context.
To help make voice interaction with AI more human, we built two new state-of-the-art models: TEN Voice Activity Detection (VAD) and TEN Turn Detection. Both are designed to make voice agents feel much more natural, drawing on Agora's more than 10 years of deep research in real-time voice communication and ultra-low latency streaming. Both models are supported by Agora and the community, are free for anyone to use, and are key parts of the open-source TEN conversational AI ecosystem.
Developers can either use TEN VAD and TEN Turn Detection separately or combine both models to build a voice agent with human-like conversational experience.
TEN VAD is a lightweight, low-latency, deep-learning-based VAD model. It runs in front of the Speech-to-Text (STT) system, before voice input is fed into large language models, detecting which audio frames contain human speech and filtering out everything else. What it does is simple but powerful: it not only makes downstream STT results more accurate but also cuts STT costs significantly, because voiceless audio never enters the expensive processing pipeline.
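The gating idea can be sketched in a few lines. This is a minimal illustration, not the TEN VAD API: it assumes the VAD has already produced a per-frame speech probability, and simply drops frames below a threshold before they reach STT.

```python
# Sketch of frame-level VAD gating in front of an STT pipeline.
# The per-frame probabilities stand in for a real VAD model's output.

def gate_frames(frames, vad_probs, threshold=0.5):
    """Keep only frames whose speech probability meets the threshold."""
    return [f for f, p in zip(frames, vad_probs) if p >= threshold]

# Example: 6 audio frames with speech probabilities from the VAD.
frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
probs = [0.02, 0.10, 0.85, 0.92, 0.88, 0.05]

speech_frames = gate_frames(frames, probs)
print(speech_frames)  # only f2, f3, f4 would reach the STT stage
```

In this toy run, half the frames are silence and never leave the client, which is where the STT cost and traffic savings come from.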
VAD isn't optional if you care about turn-taking. Accurate turn detection relies heavily on reliable VAD as a foundation.
Performance Comparison:
Compared against popular VADs such as WebRTC Pitch VAD and Silero VAD, TEN VAD outperforms both on the TEN VAD Test Sample, an open dataset collected from diverse scenarios with frame-by-frame manual annotations as ground truth.
TEN VAD also leads on latency: it rapidly detects speech-to-non-speech transitions, while Silero VAD suffers from a delay of several hundred milliseconds, increasing end-to-end latency in human-agent interaction systems.
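One common source of that offset delay is the "hangover" used by many VADs: the detector waits for several consecutive sub-threshold frames before declaring speech over. The sketch below is a generic illustration of that mechanism, not the internals of either model, showing how a longer hangover directly adds latency to the reported speech end.

```python
def speech_end_frame(probs, threshold=0.5, hangover=3):
    """Frame index at which speech is declared over: the detector waits
    for `hangover` consecutive sub-threshold frames before switching."""
    below = 0
    for i, p in enumerate(probs):
        below = below + 1 if p < threshold else 0
        if below >= hangover:
            return i
    return None  # speech never ended within this window

# Speech actually ends after frame 2; frames 3+ are silence.
probs = [0.9, 0.9, 0.8, 0.1, 0.1, 0.1, 0.1]
print(speech_end_frame(probs, hangover=1))  # 3: reported immediately
print(speech_end_frame(probs, hangover=3))  # 5: two extra frames of delay
```

At a typical frame size of 10-30 ms, a hangover of tens of frames is enough to account for delays of several hundred milliseconds before the agent even begins to respond.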
The TEN VAD Test Sample, with manually labeled frame-level VAD annotations, is available for integration and testing with just one click, so community developers can build and benchmark VAD models easily.
Real-world problem solving:
In one real-world deployment, measurements showed that using TEN VAD reduced audio traffic by 62%.
Try out TEN VAD and start building on Hugging Face and GitHub
TEN Turn Detection tackles one of the trickiest parts of human-AI conversation: figuring out when someone is done speaking. Built specifically for dynamic, real-time conversations between humans and AI agents, it allows the AI to distinguish between a mid-sentence pause and the end of a question or statement. If an agent jumps in too early or waits too long, the conversation feels unnatural.
TEN Turn Detection enables full-duplex interaction with AI agents, to make conversations more natural and human by detecting natural turn-taking signals in real-time.
How it works:
The goal is to enable voice agents to understand when to listen and when to speak, so conversation can flow more naturally.
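That listen-or-speak decision can be sketched as a small policy. The three labels below mirror the kind of output a turn-detection model might produce; the classifier itself and the label names are illustrative stand-ins, not the documented TEN Turn Detection interface.

```python
# Minimal turn-taking policy sketch combining VAD state with a
# hypothetical turn-detection label on the current transcript.

def agent_action(vad_is_silent: bool, turn_label: str) -> str:
    """Decide what the agent should do at the current moment."""
    if not vad_is_silent:
        return "listen"       # user is still speaking
    if turn_label == "finished":
        return "respond"      # end of turn: the agent may speak
    return "wait"             # mid-sentence pause: keep listening

print(agent_action(False, "unfinished"))  # listen
print(agent_action(True, "unfinished"))   # wait
print(agent_action(True, "finished"))     # respond
```

The key point is the middle row: silence alone is not enough to respond. Only when the turn detector judges the utterance semantically complete does the agent take the floor.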
The TEN Turn Detection model is open source and available to all voice agent builders in the community, with support for both English and Chinese.
Performance Comparison:
We benchmarked TEN Turn Detection with other models on a multi-scenario dataset. Here are the results:
Multi-scenario dataset results

Try out TEN Turn Detection and start building on Hugging Face and GitHub
When developers combine TEN VAD and TEN Turn Detection, they unlock a better way to build voice agents:
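The combined flow can be sketched end to end: the VAD gates audio before transcription, and the turn detector decides when the transcript is a completed turn worth answering. Every component interface here is an illustrative placeholder, assuming a VAD predicate, an incremental STT function, a turn classifier, and an LLM callable.

```python
# Hedged end-to-end sketch: VAD -> STT -> turn detection -> LLM.
# All four components are toy stand-ins for the real models.

def run_pipeline(frames, vad, stt, turn_detector, llm):
    transcript = ""
    for frame in frames:
        if vad(frame):                       # 1. keep only speech frames
            transcript += stt(frame)         # 2. incremental transcription
    if turn_detector(transcript) == "finished":
        return llm(transcript)               # 3. reply only at a real turn end
    return None                              # 4. otherwise keep listening

# Toy stand-ins: each frame is (text, is_speech).
frames = [("hi", True), ("", False), ("there", True)]
vad = lambda f: f[1]
stt = lambda f: f[0] + " "
turn = lambda t: "finished"
llm = lambda t: f"Reply to: {t.strip()}"

print(run_pipeline(frames, vad, stt, turn, llm))  # Reply to: hi there
```

In a production agent the loop would run continuously on streaming audio, but the division of labor is the same: the VAD keeps noise out of the pipeline, and the turn detector keeps the agent from interrupting.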
Both TEN VAD and TEN Turn Detection are designed to integrate seamlessly with the TEN Framework. Check out the demo video below to see the before-and-after differences of using TEN Turn Detection in TEN Agent (a conversational voice AI agent powered by TEN Framework).
You can run TEN VAD and TEN Turn Detection with the TEN Agent either on Hugging Face Spaces or locally on your own GPU.
Running on Hugging Face (Recommended for quick start)
Running Locally with Your Own GPU
Now is the time to make your voice agent truly human-like!
Stay tuned for any future TEN family changes or releases on