Tristan Jehan | March 18, 2025

The most important lessons we learned from building a conversational voice AI agent

If you're building a voice AI agent, beware: nothing works quite as advertised. In this post, we review lessons from experimenting with the latest cutting-edge voice solutions from OpenAI, ElevenLabs, LiveKit, and more.

The Harsh Reality of Building a Conversational Voice AI That Actually Works

We all want a voice AI experience that feels like magic — fluent, responsive, and lifelike. But the reality? Voice AI experiences like Siri or Alexa have always fallen short. With the introduction of new technology from companies like OpenAI and ElevenLabs, it might seem like the industry is poised to take the giant leap toward the magical voice experiences we've been expecting for the last decade. Unfortunately, we're not there yet, not even close.

As promising as the technology looks in demos, it isn't quite ready for prime time. We learned this the hard way by spending countless hours trying (and failing) to use solutions from OpenAI, Google, ElevenLabs, and others to build a voice agent into our application. While significant progress has been made, even the best systems stumble, cut off mid-sentence, or fail when you need them most.

In this post, we'll walk through our approach to building voice, the pros and cons of using different technology providers, and the pitfalls to avoid if you're considering building a voice application yourself.

Why we decided to build a voice agent

HEARD streamlines how companies gather customer insights by using AI to moderate user interviews. Instead of spending weeks scheduling and moderating a handful of in-person customer interviews, companies use our AI moderator to interview hundreds of customers in a few short hours. Product, research and marketing teams use HEARD anytime they need to collect deep feedback from their customers, like after a new product launch or to test a prototype of a new feature before building it.

For us, achieving a smooth conversational interaction between the AI and the end research participant is paramount. HEARD interviews can range from simple conversations (e.g. why did you cancel your subscription?) to complex task moderation (e.g. please complete this flow on a live website). Our goal was to build a real-time voice agent that would feel natural and lifelike to research participants while also being capable of managing many different scenarios within an interview.

What needs to go right to build a winning voice AI experience

Even for a relatively straightforward use case, several key components must align.

  • Turn detection is how the AI knows when you're done talking so that it can begin to respond. Quality turn detection should allow for natural interruptions, but most systems rely on simplistic silence detection (a naive version is sketched after this list).
  • AI models must decide when to call functions and how to respond—smarter models are slow, while faster models lack precision.
  • AI voices should sound human, avoiding robotic intonations.
  • Speech-to-text (STT) accuracy must be near-perfect, especially when capturing every word matters, as it does in our product.
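
To make the turn-detection point concrete, here is a minimal sketch of the kind of energy-threshold silence detection most systems fall back on. The threshold and timing values are illustrative assumptions, not anything a particular vendor uses; the point is that a fixed silence window cannot tell a thinking pause from the end of an answer.

```python
import numpy as np

def turn_is_over(frames, rms_threshold=0.01, silence_ms=700, frame_ms=20):
    """Naive silence-based turn detection over a list of audio frames
    (float32 arrays in [-1, 1]). Declares the speaker done once the most
    recent `silence_ms` of audio stays below an energy threshold."""
    needed = silence_ms // frame_ms
    if len(frames) < needed:
        return False
    recent = frames[-needed:]
    # Root-mean-square energy per frame; "quiet" frames fall below the threshold.
    energies = [float(np.sqrt(np.mean(np.square(f)))) for f in recent]
    return all(e < rms_threshold for e in energies)
```

A participant who pauses for 700 ms to think gets cut off, which is exactly the failure mode we kept running into.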

The trials and errors of testing different solutions

With HEARD, we needed precise control over the questions the AI moderator asks in each interview, which are generated by our proprietary text-based AI model. We explored different approaches, starting back in 2023 by using an early version of OpenAI's Whisper to build a working demo in just a day. From there, we tried Google's voice synthesis, which at the time offered decent quality but was too slow. ElevenLabs was an improvement in quality but also suffered from latency. We tried a browser-based synthesis approach, which was much faster but wasn't realistic enough.
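
For reference, that first 2023 prototype was not much more involved than the sketch below: batch transcription with the open-source Whisper package, with the transcript handed to our text-based interview model. The file name and model size here are placeholders.

```python
import whisper  # open-source openai-whisper package

# Transcribe a recorded participant answer in one shot (no streaming).
model = whisper.load_model("base")               # small model, good enough for a demo
result = model.transcribe("participant_answer.wav")
print(result["text"])                            # hand this to the interview model
```

Batch transcription like this is simple and accurate, but it only helps once you already know the participant has finished speaking, which is why turn detection and latency dominate everything that follows.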

Finally, when OpenAI released its Realtime API in the fall, we saw new potential. Unfortunately, the experience proved unreliable: audio cut off arbitrarily, and the turn detection was overly sensitive and hard to control. A few months later, ElevenLabs introduced a conversational AI product with a 70ms Flash TTS model and a low-latency streaming solution, along with a convenient web-based setup interface. It wasn't a true voice-to-voice model, but with its optimized end-to-end backend we felt it could work. Again, we ran into numerous issues—audio cut-offs, lack of control over turn detection, and several persistent bugs, like WebSocket disconnections we could not recover from. We also attempted to work with Gemini, only to find it unreliable, frequently responding with, “Sorry, I didn’t get that.”
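
To give a sense of how little room there is to tune this, the Realtime API's server-side voice activity detection exposes only a handful of knobs. The sketch below shows the kind of session.update event sent over the Realtime WebSocket to relax them; the field names reflect the server_vad options as we understood them at the time of our testing, and the values are our own guesses at a less trigger-happy configuration, not recommended settings.

```python
import json

# Payload for the Realtime API's session.update event, sent over an
# already-open WebSocket connection. server_vad is the built-in
# silence-based turn detection; these were the main levers available.
session_update = {
    "type": "session.update",
    "session": {
        "turn_detection": {
            "type": "server_vad",
            "threshold": 0.7,            # higher = less sensitive to background noise
            "prefix_padding_ms": 300,    # audio retained from before speech starts
            "silence_duration_ms": 900,  # wait longer before treating a pause as end of turn
        }
    },
}
payload = json.dumps(session_update)  # e.g. await ws.send(payload)
```

Even with a higher threshold and a longer silence window, this is still the same energy-based logic as the naive sketch above, so participants who paused to think still got interrupted.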

Managing latency, quality, flexibility, and reliability proved more difficult than expected. No solution got everything right. LiveKit's semantically informed turn detection is very promising, allowing users to pause mid-sentence without being interrupted, but it's still a text-based system that requires additional STT and TTS, which introduces delays. OpenAI's Realtime API is highly interactive but doesn't allow fine control over conversation flow. ElevenLabs offers the best voice quality, but its standard APIs are slow and can't carry natural conversational intonation across turns. Its newest conversational AI solution comes close, offering low latency, high-quality output, and some flexibility, but reliability has remained an issue so far.
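
The delay problem with cascaded pipelines is easiest to see as a budget. The numbers below are hypothetical placeholders rather than measured figures for any provider; the point is that time-to-first-audio is the sum of every stage, so each hop a text-based system adds pushes the response further from feeling conversational.

```python
# Hypothetical time-to-first-audio budget for one cascaded STT -> LLM -> TTS turn.
# All values are illustrative placeholders, not benchmarks of any vendor.
stages_ms = {
    "speech_to_text": 300,         # finalize the transcript of the user's turn
    "llm_first_token": 450,        # model decides what to say (or which function to call)
    "tts_first_audio_chunk": 150,  # synthesize the first audible chunk
    "network_round_trips": 100,    # per-stage API overhead
}

total = sum(stages_ms.values())
print(f"~{total} ms before the participant hears anything")
```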

Our Experiments & Results

Throughout our tests, we encountered these persistent issues:

  • Voices cutting off mid-sentence or remaining silent.
  • STT transcriptions missing sections or misinterpreting speech.
  • Latency spikes disrupting natural conversation flow.
  • Poor handling of function calls leading to incoherent conversations.
  • WebSocket disconnections with no automatic recovery (a reconnect sketch follows this list).
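
Since none of the client SDKs we tried recovered from dropped sockets on their own, we ended up thinking in terms of a wrapper like the one below. It is a generic sketch using the Python websockets package, not code from any vendor's SDK; in a real agent you would also need to restore or restate the conversation state after reconnecting.

```python
import asyncio
import websockets

async def run_with_reconnect(url, handle_stream, max_backoff_s=30):
    """Keep a streaming session alive across disconnections, with exponential backoff."""
    backoff = 1
    while True:
        try:
            async with websockets.connect(url) as ws:
                backoff = 1              # reset backoff after a successful connection
                await handle_stream(ws)  # read/write audio events until the socket drops
        except (websockets.ConnectionClosed, OSError):
            # The provider hung up or the network blipped mid-conversation.
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, max_backoff_s)
```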

Some companies warn you up front, but most present their solutions as deployment-ready when they're not. We ended up talking to most of the teams behind these solutions, and they are working hard to fix the issues, but testing was still time-consuming and expensive. Over the course of six weeks, we logged over 50 hours of real audio testing—roughly 3,000 conversations averaging one minute each. Despite this effort, no solution proved reliable enough for real-world conditions.

Where we landed (for now)

After weeks of testing, we reverted to a solution close to what we built a year ago. It's not as seamless—participants click to talk rather than the agent detecting turns automatically—but it works reliably, ensures data integrity, and avoids frustrating customer experiences.
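
In rough terms, one turn of that click-to-talk loop has the shape sketched below. The callables are stand-ins for our own internal components, not any vendor's API; the point is that explicit start/stop clicks take turn detection off the critical path entirely.

```python
from typing import Callable

# Placeholder sketch of one click-to-talk interview turn.
def run_interview_turn(
    question: str,
    speak: Callable[[str], None],                # synthesize and play the moderator's question
    record_between_clicks: Callable[[], bytes],  # participant clicks to start and stop talking
    transcribe: Callable[[bytes], str],          # batch STT on the finished clip
    next_question: Callable[[str], str],         # our text-based interview model
) -> str:
    speak(question)
    clip = record_between_clicks()  # no turn detection: the clicks define the turn
    transcript = transcribe(clip)
    return next_question(transcript)
```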

It was frustrating to get a solution 80% of the way there, see the promise, and still struggle through the last mile to meet our standard for a high-quality experience. These solutions might work for a flashy demo or a prototype, but they will likely lead to a frustrating end-user experience if shipped to production. Innovation is important, but so is stability. And when building AI for real-world use, reliability beats novelty every time.

Despite the challenges, we remain optimistic. OpenAI, Google, ElevenLabs, and others are pushing rapid improvements. The pieces are coming together; just last week, ElevenLabs reached out to update us on new fixes they recently released. We're back to testing and evaluating whether it's finally ready for prime time, and we'll be sure to keep the community of voice builders updated with what we find.

Speaking of community – if you're a fellow voice builder, we'd love to hear more about your experience and exchange notes. You can connect with me here on LinkedIn or email me at tjehan@elis.ai.
