Article | Read in 1 minute

Voice-first future: AI voice technology for startup founders and tech leaders

 

Listen to a fireside chat in which Carlos Alzate (AI Fund CTO) talks with David Norris (CEO & Co-founder of Affineon Health) and Ankur Jain (CEO & Co-founder of Jivi AI) about how their companies are leveraging AI voice technology.

Takeaways include:

Voice Technology Impact

  • Voice is becoming critical in healthcare as it provides a higher-bandwidth medium for patients to explain symptoms
  • At Jivi AI, implementing voice-first AI doctors resulted in improved Net Promoter Scores as patients could convey more health information in 10-20 seconds
  • Voice AI can handle logistics challenges like “phone tag” by instantiating multiple agents simultaneously, ensuring someone always answers when patients call back
  • Affineon Health found voice AI especially valuable for lab-result communications, where practices have traditionally relied on staff phoning patients, an inefficient process

Technical Challenges & Solutions

  • For local Indian languages, Ankur’s team at Jivi AI:
    • Fine-tuned Whisper models specifically on medical voice data in regional languages (a fine-tuning sketch follows this list)
    • Created 5,000+ hours of medical voice training data for these languages
    • Reduced word error rates from 25-30% to 10-11%, outperforming commercially available systems
  • Similar-sounding body parts in different languages posed particular challenges, as misidentification could lead to completely different diagnoses
  • Latency improved from 4-5 seconds per turn (unusable) to 600-700ms through infrastructure optimization and smart caching techniques
  • For background noise, Jivi AI developed approaches based on synthetic noise generation in the training data (a noise-mixing sketch follows this list), while Affineon Health found success with lapel microphones that don’t create a barrier between providers and patients
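
The panel stays high-level on how the regional-language models were built. As a rough illustration, fine-tuning a Whisper checkpoint on domain-specific speech with Hugging Face `transformers` might look like the sketch below; the dataset directory, the `transcription` column name, the choice of Hindi, and every hyperparameter are illustrative assumptions, not Jivi AI’s actual pipeline.

```python
from datasets import Audio, load_dataset
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

# Assumed layout: an "audiofolder" dataset whose metadata maps each clip
# to a free-text "transcription" column.
ds = load_dataset("audiofolder", data_dir="data/medical_voice_hi")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="hindi", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.generation_config.language = "hindi"
model.generation_config.task = "transcribe"

def prepare(example):
    audio = example["audio"]
    # Log-mel features for the encoder, token ids for the decoder labels.
    example["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    example["labels"] = processor.tokenizer(example["transcription"]).input_ids
    return example

ds = ds.map(prepare, remove_columns=ds["train"].column_names)

def collate(features):
    # Pad audio features and label sequences separately.
    batch = processor.feature_extractor.pad(
        [{"input_features": f["input_features"]} for f in features], return_tensors="pt")
    labels = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features], return_tensors="pt")
    # Mask padding positions so they are ignored by the loss.
    batch["labels"] = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
    return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-medical-hi",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4_000,
)
Seq2SeqTrainer(model=model, args=args, train_dataset=ds["train"],
               data_collator=collate, tokenizer=processor.feature_extractor).train()
```

Word error rate would then be tracked on a held-out medical test set per language, which is presumably how the 25-30% to 10-11% improvement was measured.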
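
On the background-noise point, the augmentation recipe isn’t detailed in the talk. A generic version of the idea, mixing recorded background noise into clean speech at a randomly chosen signal-to-noise ratio before training, looks like this; the noise bank and the 5-20 dB SNR range are assumptions.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay background noise onto clean speech at a target SNR (both mono float arrays)."""
    # Loop the noise so it covers the whole utterance, then trim it to length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise

    # Renormalize only if the mixture clips.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

rng = np.random.default_rng(0)

def augment(clip: np.ndarray, noise_bank: list[np.ndarray]) -> np.ndarray:
    """Pick a random noise clip (clinic chatter, traffic, ...) and a random SNR."""
    noise = noise_bank[rng.integers(len(noise_bank))]
    return mix_at_snr(clip, noise, snr_db=float(rng.uniform(5.0, 20.0)))
```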

Implementation Strategies

  • Both speakers recommended starting with off-the-shelf solutions (Whisper, Google, Amazon, ElevenLabs) to learn user needs before deep customization
  • Ankur emphasized: “Over-engineering from day one is not recommended. Buy, learn, then build your own.”
  • Voice activity detection was enhanced with syllable detection models to determine when users have stopped speaking (a simplified end-of-turn sketch follows this list)
  • David highlighted the importance of de-identification beyond just changing names, noting that unique combinations of age and conditions can still identify patients (see the sketch after this list)
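
The syllable-detection models aren’t described further. As a much simpler stand-in, the sketch below illustrates the underlying end-pointing problem with an energy-threshold detector that declares the turn finished once the tail of the buffer has been silent long enough; the frame size, threshold, and 0.8 s silence window are arbitrary assumptions, not the speakers’ approach.

```python
import numpy as np

def end_of_turn(buffer: np.ndarray, sample_rate: int = 16_000,
                frame_ms: int = 30, energy_threshold: float = 1e-4,
                trailing_silence_s: float = 0.8) -> bool:
    """Return True once the last `trailing_silence_s` seconds of `buffer` carry no speech energy.

    `buffer` is the mono float32 audio captured so far in the current user turn.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    frames_needed = int(trailing_silence_s * 1000 / frame_ms)

    tail = buffer[-frames_needed * frame_len:]
    if len(tail) < frames_needed * frame_len:
        return False  # not enough audio yet to decide

    # Mean energy per fixed-size frame over the tail of the buffer.
    energies = np.mean(tail.reshape(frames_needed, frame_len) ** 2, axis=1)
    return bool(np.all(energies < energy_threshold))
```

A fixed silence timeout like this struggles with natural mid-sentence pauses, which is the kind of limitation a learned syllable or turn-taking model aims to address.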
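
To make the de-identification point concrete, the toy sketch below removes the name and also coarsens the age into a ten-year band, since an exact age combined with an unusual condition can still single a patient out. It is illustrative only (real pipelines also cover dates, locations, record numbers, and free-text mentions of other people), not Affineon Health’s implementation.

```python
import re
from dataclasses import dataclass

@dataclass
class Record:
    name: str
    age: int
    note: str

def deidentify(rec: Record) -> dict:
    """Generalize quasi-identifiers rather than only swapping the name."""
    decade = (rec.age // 10) * 10
    age_band = f"{decade}-{decade + 9}"
    # Also scrub the patient's name wherever it appears inside the free-text note.
    note = re.sub(re.escape(rec.name), "[PATIENT]", rec.note, flags=re.IGNORECASE)
    return {"age_band": age_band, "note": note}

print(deidentify(Record(name="Jane Roe", age=87, note="Jane Roe reports chest pain.")))
# -> {'age_band': '80-89', 'note': '[PATIENT] reports chest pain.'}
```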

Context Management & Memory

  • Both companies maintain “very long memory” of patient interactions
  • Jivi AI implements “smart context” that distinguishes relevant recent symptoms (a few days ago) from unrelated historical issues (five years ago); a minimal sketch follows this list
  • Affineon Health incorporates patient history, conditions, medications, and previous conversations as context for interpreting lab results correctly
  • Context is crucial for understanding normal values (e.g., blood analytes that would be abnormal for regular patients may be normal during pregnancy)
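
The “smart context” mechanics aren’t specified beyond that recency distinction. A minimal version of the idea might look like the sketch below, which keeps symptom reports from the last two weeks verbatim in the prompt and only counts older history; the window length and item cap are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class HistoryItem:
    timestamp: datetime
    text: str

def build_context(history: list[HistoryItem], now: datetime,
                  recent_window: timedelta = timedelta(days=14),
                  max_items: int = 20) -> str:
    """Assemble prompt context that favors recent symptoms over old, unrelated issues.

    Assumes `history` is already in chronological order.
    """
    recent = [h for h in history if now - h.timestamp <= recent_window]
    older = [h for h in history if now - h.timestamp > recent_window]

    lines = ["Recent patient reports:"]
    lines += [f"- {h.timestamp:%Y-%m-%d}: {h.text}" for h in recent[-max_items:]]
    if older:
        # Older items are summarized by count here; a real system might retrieve
        # only the ones relevant to the current complaint.
        lines.append(f"(Also {len(older)} older history entries not shown.)")
    return "\n".join(lines)
```

A fuller version would also pull in structured history, conditions, and medications the way Affineon Health describes, so the model can judge whether a given lab value is normal for this particular patient.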

Future Directions

  • Ankur predicted voice technology will reach “GPT-4 quality” within a year, after which “voice will explode” as a medium over text
  • David envisions voice agents talking to other voice agents (e.g., provider systems calling payer systems for prior authorization)
  • Both are exploring multimodal approaches combining voice with other data types
  • Emotion detection from voice is described as the “holy grail” by Ankur, particularly for mental health applications
  • David noted patients may eventually prefer AI voice agents because they “aren’t in a hurry to get off the phone” and can provide more empathetic, thorough explanations

Adoption Challenges

  • Safety and accuracy remain primary concerns for healthcare organizations
  • Shifting user behavior requires incentivizing people to speak for the first 5-10 seconds; once past that initial hurdle, they become comfortable with voice interfaces
  • Grammatical errors in speech pose challenges, since LLMs struggle with incorrect syntax (roughly 60% comprehension after one mistake, dropping sharply after a second)
  • Current models don’t truly understand language nuances like emphasis and word rearrangement, creating comprehension challenges