Inside a crowded bar, even the best noise-canceling earbuds struggle. They can either shut the whole world out or let everything in, but they can’t do what humans do naturally: focus on the voices that matter while ignoring everything else. A new study from researchers at the University of Washington proposes a third way—a “proactive hearing assistant” that automatically figures out who you’re talking to using AI and enhances only their voices in real time, without taps or gestures.

“We were asking a very simple question,” says Shyam Gollakota, head of the Mobile Intelligence Lab at the University of Washington and coauthor of the study. “If you’re in a bar with a hundred people, how does the AI know who you are talking to?”

The team’s answer blends audio engineering with conversational science. Building on previous research from Gollakota’s lab, the system uses an AI model trained to detect the subtle turn-taking patterns humans instinctively follow, trading speaking turns with minimal overlap. That conversational rhythm becomes the cue for identifying who is part of the exchange; voices that don’t follow the pattern are filtered out.


The prototype uses microphones in both ears and a directional audio filter aimed at the wearer’s mouth to extract the user’s own speech, which acts as an anchor for detecting turn-taking. With that anchor, the system isolates and enhances conversation partners while suppressing everyone else, operating at latencies less than ten milliseconds—fast enough to keep the amplified audio aligned with lip movements.

“The key insight is intuitive,” Gollakota says. “If I’m having a conversation with you, we aren’t talking over each other as much as people who are not part of the conversation.” The AI identifies voices that alternate naturally with the wearer’s own and ignores those that overlap too often to fit the conversation. The method does not rely on proximity, loudness, direction, or pitch. “We don’t use any sensors beyond audio,” he says. “You could be looking away, or someone farther away could be speaking louder—it still works.”
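To make that insight concrete, the decision can be framed as an overlap test between the wearer’s voice activity and each candidate voice’s. The sketch below is illustrative only, not the team’s actual model; the frame counts, the 20 percent threshold, and the helper names are assumptions.

```python
# Minimal sketch of the turn-taking idea (illustrative only, not the paper's
# model). Assumes we already have per-frame binary voice-activity sequences:
# one for the wearer (from the self-speech anchor) and one per candidate voice.
import numpy as np

def overlap_ratio(wearer_vad: np.ndarray, candidate_vad: np.ndarray) -> float:
    """Fraction of the candidate's speech frames that overlap the wearer's speech."""
    candidate_frames = candidate_vad.sum()
    if candidate_frames == 0:
        return 1.0  # silent candidate: no evidence of turn-taking
    return float((wearer_vad & candidate_vad).sum() / candidate_frames)

def is_conversation_partner(wearer_vad, candidate_vad, max_overlap=0.2):
    """Candidates who rarely talk over the wearer look like conversation partners."""
    return overlap_ratio(wearer_vad, candidate_vad) < max_overlap

# Toy example: 10 frames of voice activity (1 = speaking).
wearer    = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
partner   = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1])  # alternates with the wearer
bystander = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])  # frequently talks over the wearer

print(is_conversation_partner(wearer, partner))    # True
print(is_conversation_partner(wearer, bystander))  # False
```

In practice a learned model does far more than count overlapping frames, but the example captures the cue the quotation describes: partners alternate; bystanders collide.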

The technology could be especially useful for people with hearing challenges, since traditional hearing aids amplify speech and background noise alike. “It could be extremely powerful for quality of life,” says Gollakota. Proactive hearing assistants built on this approach could also help older users who may struggle to manually select which speakers to amplify.

The prototype earbuds, shown on a table next to a smartphone, deal with latency by using a two-part model that mimics how our brains process conversation. Shyam Gollakota

A Brain-Inspired Dual Model

To feel natural, conversational audio must be processed in under ten milliseconds, but detecting turn-taking patterns requires one to two seconds of context. Reconciling those timescales led to a split architecture: a slower model that updates once per second and a faster model that runs every 10 to 12 milliseconds.

The slower model infers conversational dynamics and generates a “conversational embedding.” The fast model uses that embedding to extract only the identified partner voices, suppressing all others quickly enough for seamless dialogue. Gollakota compares the process to how the brain separates slower deliberation from quick speech production. “There’s a slower process making sense of the conversation, and a much faster process that responds almost instantaneously,” he says.
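In rough code terms, the architecture resembles a dual-rate loop: a slow path that periodically summarizes recent audio into an embedding, and a fast path that applies it frame by frame. The sketch below is a hypothetical structure with placeholder functions standing in for the researchers’ neural networks; the frame size, update interval, and names are assumptions, not details from the study.

```python
# Illustrative dual-rate processing loop (hypothetical structure, not the
# authors' code). A slow model refreshes a "conversational embedding" about
# once per second; a fast model applies it to every ~10 ms audio frame.
import numpy as np

FRAME_SAMPLES = 160    # 10 ms of audio at 16 kHz
SLOW_EVERY_N = 100     # slow path refreshes once per second (100 x 10 ms)

def slow_conversation_model(context_frames):
    """Stand-in for the slow model: summarize 1-2 s of audio into an embedding."""
    return np.mean(context_frames, axis=0)  # placeholder embedding

def fast_extraction_model(frame, embedding):
    """Stand-in for the fast model: keep partner voices, suppress everything else."""
    if embedding is None:
        return frame                        # no conversational context yet
    gain = 1.0 / (1.0 + np.linalg.norm(frame - embedding))
    return gain * frame                     # placeholder enhancement

context, embedding = [], None
for i, frame in enumerate(np.random.randn(500, FRAME_SAMPLES)):  # 5 s of fake audio
    context.append(frame)
    if i % SLOW_EVERY_N == 0 and len(context) >= SLOW_EVERY_N:
        # Slow path: update the conversational embedding from recent context.
        embedding = slow_conversation_model(np.stack(context[-2 * SLOW_EVERY_N:]))
    # Fast path: enhance the current frame within the latency budget.
    enhanced = fast_extraction_model(frame, embedding)
```

The point of the split is that the expensive reasoning about who is in the conversation never sits on the audio path; only the lightweight per-frame filter does.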

Conversational rhythm varies across cultures, so the team trained the system on both English and Mandarin. It generalized to Japanese conversations despite never being trained on Japanese—evidence, they say, that the model is capturing universal timing cues.

In controlled tests, the system identified conversation partners with 80 to 92 percent accuracy and had a confusion rate of 1.5 to 2.2 percent (cases where it mistakenly treated an outside speaker as part of the conversation). It improved speech clarity by up to 14.6 dB.


Promise and Boundaries

“What they describe is an interesting and novel direction. But when it comes to real-world applications, many challenges remain,” says Te-Won Lee, CEO of AI glasses company SoftEye, who has recently developed a similar technology for commercial use. Lee’s tech was based on blind source separation, a signal processing technique that tries to sift individual sound sources from a mixture of sounds without knowing what the sources are in advance.
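For readers unfamiliar with the technique, blind source separation in its generic form can be illustrated with independent component analysis. The sketch below uses scikit-learn’s FastICA on synthetic mixtures; it is a textbook example, not a depiction of SoftEye’s system, and the signals and mixing matrix are made up for illustration.

```python
# Generic blind-source-separation sketch (not SoftEye's system): recover
# individual sources from microphone mixtures without knowing the sources
# in advance, using independent component analysis.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)

# Two "speakers": a tone and a sawtooth-like signal standing in for voices.
s1 = np.sin(2 * np.pi * 220 * t)
s2 = 2 * (t * 5 - np.floor(t * 5 + 0.5))
sources = np.c_[s1, s2]

# Two microphones each hear a different mixture of the two sources, plus noise.
mixing = np.array([[1.0, 0.6],
                   [0.4, 1.0]])
mics = sources @ mixing.T + 0.01 * rng.standard_normal(sources.shape)

# ICA estimates the unmixing from the mixtures alone, with no labels or anchors.
ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(mics)   # shape (samples, 2): the separated sources
```

The contrast with the UW approach is the cue being exploited: classic separation relies on statistical independence of the sources, while the proactive assistant adds the wearer’s own speech and conversational timing as an anchor for which source to keep.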

“In most environments, you don’t get four people neatly taking turns,” Lee says. “You get music, unpredictable noise, people interrupting each other. The scenarios described in the research are not the scenarios you encounter in most real-world environments.” As soundscapes become more chaotic, performance may degrade.

Still, he sees a major strength in the prototype’s very low latency. “When it comes to deployment in millions of devices, latency has to be extremely low,” he says. “Even 100 milliseconds is unacceptable. You need something close to ten milliseconds.”

Lee also notes that decades of blind source separation and speech-enhancement work have yielded algorithms that work across many noise conditions to isolate one desired speaker, usually the device user, from all other sources. “Real-world speech enhancement is about separating the desired speech from all other noise,” Lee says. “Those techniques are more geared toward unpredictable environments.” But in earbuds or AR glasses, where the system knows whom the wearer intends to talk to, he says the UW approach “can be very effective if the scenario matches their assumptions.”

Risks, Limitations, and Next Steps

The system relies heavily on the wearer’s own speech as its anchor, so long silences can confuse it. Overlapping speech and simultaneous turn changes remain challenging. The method is not suited for passive listening, since it assumes active participation. And because conversational norms vary across cultures, additional fine-tuning may be needed.

Incorrect detection can also amplify the wrong person—a real risk in fast-moving exchanges. Lee adds that unpredictable noise, from music to chaotic soundscapes, remains a major hurdle. “The real world is messy,” he says.

Next, the team plans to incorporate semantic understanding using large language models so that future versions can infer not only who is speaking but who is contributing meaningfully, making hearing assistants more flexible and more humanlike in how they follow conversations.