Tristan Jehan | March 18, 2025
If you're building a voice AI agent, beware: nothing works quite as advertised. In this post, we review lessons from experimenting with the latest cutting-edge voice solutions from OpenAI, ElevenLabs, LiveKit, and more.
We all want a voice AI experience that feels like magic: fluent, responsive, and lifelike. But the reality? Voice AI experiences like Siri or Alexa have always missed the mark. With new technology from companies like OpenAI and ElevenLabs, it might seem like the industry is finally poised to take the giant leap toward the magical voice experiences we've been expecting for the last decade. Unfortunately, we're not there yet. Not even close.
As promising as the technology looks in demos, it isn't quite ready for prime time. We learned this the hard way by spending countless hours trying (and failing) to use solutions from OpenAI, Google, ElevenLabs, and others to build a voice agent into our application. Significant progress has been made, but even the best systems stumble, cut off mid-sentence, or fail when you need them most.
In this post, we'll walk through our approach to building voice, the pros and cons of using different technology providers, and the pitfalls to avoid if you're considering building a voice application yourself.
HEARD streamlines how companies gather customer insights by using AI to moderate user interviews. Instead of spending weeks scheduling and moderating a handful of in-person customer interviews, companies use our AI moderator to interview hundreds of customers in a few short hours. Product, research and marketing teams use HEARD anytime they need to collect deep feedback from their customers, like after a new product launch or to test a prototype of a new feature before building it.
For us, achieving a smooth conversational interaction between the AI and the research participant is paramount. HEARD interviews can range from simple conversations (e.g., “Why did you cancel your subscription?”) to complex task moderation (e.g., “Please complete this flow on a live website”). Our goal was to build a real-time voice agent that would feel natural and lifelike to research participants while also being capable of managing many different scenarios within an interview.
Even for a relatively straightforward use case, several key components must align: speech recognition, the model driving the conversation, speech synthesis, turn detection, and low-latency streaming between them.
With HEARD, we needed precise control over the questions the AI moderator asks in each interview, which are generated by our proprietary text-based AI model. We explored different approaches, starting back in 2023 by using an early version of OpenAI's Whisper to build a working demo in just a day. From there, we tried Google's voice synthesis, which at the time offered decent quality but was too slow. ElevenLabs improved on quality but also suffered from latency. We then tried a browser-based synthesis approach, which was much faster but wasn't realistic enough.
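For reference, the browser-based approach amounts to something like the sketch below: a minimal example using the standard Web Speech API, with voice selection and tuning omitted. It responds almost instantly because nothing leaves the client, but the available voices simply aren't lifelike.

```typescript
// Minimal sketch of browser-based synthesis via the Web Speech API.
// Fast (no network round trip), but the stock voices sound robotic.
function speak(text: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.rate = 1.0; // default speaking rate
    utterance.onend = () => resolve();
    utterance.onerror = (event) => reject(event.error);
    window.speechSynthesis.speak(utterance);
  });
}
```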
Finally, when OpenAI released its realtime API in the fall, we saw new potential. Unfortunately, the experience proved unreliable: audio cut off arbitrarily, and the turn detection was overly sensitive and hard to control. A few months later, ElevenLabs introduced a conversational AI model with a 70ms Flash TTS model and a low-latency streaming solution, along with a convenient web-based setup interface. It isn't a true voice-to-voice model, but with its optimized end-to-end backend we felt it could work. Again, we ran into numerous issues: audio cut-offs, lack of control over turn detection, and several persistent bugs, like WebSocket disconnections we could not prevent. We also attempted to work with Gemini, only to find it unreliable, frequently responding with, “Sorry, I didn’t get that.”
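For context, the control surface for the realtime API's turn detection is small. The sketch below shows roughly what tuning it looks like over the WebSocket interface (Node, using the ws package); the parameter values are illustrative, not a recommendation, and in our experience no combination fully tamed the over-eager interruptions.

```typescript
import WebSocket from "ws";

// Hedged sketch: adjusting server-side voice activity detection (VAD)
// on OpenAI's realtime API. The values below are illustrative only.
const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => {
  ws.send(
    JSON.stringify({
      type: "session.update",
      session: {
        turn_detection: {
          type: "server_vad",
          threshold: 0.5, // speech-probability cutoff; higher = less sensitive
          prefix_padding_ms: 300, // audio kept from before speech was detected
          silence_duration_ms: 500, // how long a pause ends the user's turn
        },
      },
    })
  );
});
```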
Managing latency, quality, flexibility, and reliability proved more difficult than expected. No solution got everything right. LiveKit's semantically informed turn detection is very promising, letting users pause mid-sentence without being interrupted, but it's still a text-based system that requires separate STT and TTS stages, which introduce delays. OpenAI's realtime API is highly interactive but doesn't allow fine control over conversation flow. ElevenLabs offers the best voice quality, but its standard APIs are slow and can't carry natural conversational intonation across turns. Its newest conversational AI solution comes close, offering low latency, high-quality output, and some flexibility, but reliability has remained an issue so far.
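To make the latency point concrete, here is an illustrative sketch of a cascaded pipeline. The functions transcribe, complete, and synthesize are hypothetical stand-ins for STT, LLM, and TTS provider calls, and the timing comments are rough assumptions rather than measurements; the point is that the stages run sequentially, so their delays add up before the participant hears anything.

```typescript
// Hypothetical stand-ins for the three stages of a cascaded voice pipeline.
declare function transcribe(audio: Blob): Promise<string>; // STT: often hundreds of ms
declare function complete(text: string): Promise<string>; // LLM first token: hundreds of ms
declare function synthesize(text: string): Promise<ArrayBuffer>; // TTS: ~70 ms+ for Flash-class models

async function respondTo(userAudio: Blob): Promise<ArrayBuffer> {
  const start = performance.now();
  const transcript = await transcribe(userAudio); // stage 1
  const reply = await complete(transcript); // stage 2
  const speech = await synthesize(reply); // stage 3
  // Sequential hops: total delay is the sum, not the max, of the stages.
  console.log(`end-to-end: ${Math.round(performance.now() - start)} ms`);
  return speech;
}
```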
Throughout our tests, we encountered the same persistent issues:
- Audio that cut off arbitrarily, sometimes mid-sentence
- Turn detection that was overly sensitive and hard to control
- WebSocket disconnections we couldn't prevent or recover from (a generic mitigation sketch follows below)
- Outright comprehension failures, like Gemini's frequent “Sorry, I didn’t get that”
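None of these had a clean fix on our side. For the disconnections specifically, the standard client-side mitigation looks something like the sketch below; to be clear, this is generic defensive code, not something we shipped or something from a provider's SDK, and it cannot recover audio that was lost mid-stream.

```typescript
// Generic mitigation sketch: reconnect a dropped WebSocket with
// exponential backoff. It papers over disconnections but cannot
// restore audio that was in flight when the socket died.
function connectWithRetry(
  url: string,
  onMessage: (data: unknown) => void,
  attempt = 0
): void {
  const ws = new WebSocket(url);
  ws.onopen = () => {
    attempt = 0; // connection is healthy again; reset the backoff
  };
  ws.onmessage = (event) => onMessage(event.data);
  ws.onclose = (event) => {
    if (event.wasClean) return; // intentional close; do not retry
    const delay = Math.min(30_000, 500 * 2 ** attempt); // cap at 30 s
    setTimeout(() => connectWithRetry(url, onMessage, attempt + 1), delay);
  };
}
```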
Some companies warn you about these limitations, but most present their solutions as deployment-ready when they're not. We ended up talking with most of the teams behind these products, and they are working hard on fixes. Still, testing was time-consuming and expensive. Over the course of 6 weeks, we logged over 50 hours of real audio testing—roughly 3,000 conversations averaging one minute each. Despite this effort, no solution proved reliable enough for real-world conditions.
After weeks of testing, we reverted to a solution close to what we built a year ago. It's not as seamless—users click to talk rather than the system detecting turns automatically—but it works reliably, ensures data integrity, and avoids frustrating customer experiences.
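For the curious, click-to-talk is about as simple as voice capture gets, which is a large part of why it is dependable. Below is a minimal sketch using the browser's MediaRecorder API; sendToModerator is a hypothetical stand-in for the upload to our backend.

```typescript
// Minimal click-to-talk sketch: the participant explicitly starts and
// stops each turn, so there is no turn detection to get wrong.
declare function sendToModerator(audio: Blob): Promise<void>; // hypothetical upload

let recorder: MediaRecorder | null = null;

async function startTurn(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const chunks: Blob[] = [];
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (event) => chunks.push(event.data);
  recorder.onstop = () => {
    void sendToModerator(new Blob(chunks, { type: recorder?.mimeType }));
    stream.getTracks().forEach((track) => track.stop()); // release the mic
  };
  recorder.start();
}

function endTurn(): void {
  recorder?.stop(); // fires onstop, which uploads the recorded turn
}
```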
It was frustrating to get a solution 80% of the way there, see the promise, and still struggle through the last mile to meet our standard for a high-quality experience. These solutions might work for a flashy demo or a prototype, but they will likely lead to a frustrating end-user experience if shipped to production. Innovation is important, but so is stability. And when building AI for real-world use, reliability beats novelty every time.
Despite the challenges, we remain optimistic. OpenAI, Google, ElevenLabs, and others are pushing rapid improvements, and the pieces are coming together; just last week, ElevenLabs reached out to update us on fixes they recently released. We're back to testing and evaluating whether it's finally ready for prime time, and we'll be sure to keep the community of voice builders updated on what we find.
Speaking of community – if you’re a fellow voice builder we’d love to hear more about your experience and exchange notes. You can connect with me here on LinkedIn or email me at tjehan@elis.ai.