Voice Interactions

Enable voice-based conversations with your agentic apps.

Overview

The platform supports voice interactions through integration with AI for Service, enabling users to speak naturally with your agents. Two modes are available: real-time voice for natural, fully spoken conversations, and ASR/TTS, which wraps voice input and output around text-based processing.

Voice Modes

Real-Time Voice

Natural voice conversations using multimodal language models.
User speaks → Audio processed directly by LLM → Audio response generated
Characteristics:
  • Simultaneous voice input/output
  • Natural conversational flow
  • Lower latency for back-and-forth exchanges
  • Requires compatible multimodal model

ASR/TTS (Speech-to-Text/Text-to-Speech)

Hybrid approach where speech is converted to text, processed, and converted back to speech.
User speaks → ASR → Text → Agent → Text response → TTS → Audio
Characteristics:
  • Works with any text-based model
  • More flexible model choices
  • TTS streaming reduces perceived latency
  • Better for complex responses
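
As a minimal sketch of this pipeline (transcribe, run_agent, and synthesize are hypothetical placeholders standing in for your ASR, agent, and TTS providers):

# Hypothetical hooks -- replace with your ASR, agent, and TTS clients.
def transcribe(audio: bytes) -> str: ...
def run_agent(text: str) -> str: ...
def synthesize(text: str) -> bytes: ...

def handle_voice_turn(audio: bytes) -> bytes:
    """One ASR/TTS round trip: speech in, speech out."""
    text_in = transcribe(audio)   # ASR: speech -> text
    reply = run_agent(text_in)    # any text-based model works here
    return synthesize(reply)      # TTS: text -> speech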

Configuration

Enable Real-Time Voice

  1. Configure in AI for Service Automation Node
  2. Enable in Platform’s Agentic App settings
  3. Select a model that supports real-time voice
voice:
  mode: realtime
  model: gpt-4o-realtime
  settings:
    voice: alloy  # Voice persona
    language: en-US

Enable ASR/TTS

  1. Disable real-time voice in AI for Service
  2. Enable TTS Streaming for progressive delivery
  3. Configure voice settings
voice:
  mode: asr_tts
  asr:
    provider: default
    language: en-US
  tts:
    provider: default
    voice: neural-female-1
    streaming: true  # Progressive delivery

TTS Streaming

Reduce perceived latency by streaming generated text to the TTS engine progressively:
Without streaming:
├── Generate full response (3s)
├── Convert to speech (1s)
└── Play audio (2s)
Total: 6s before user hears anything

With streaming:
├── Generate first sentence (0.5s)
├── Stream to TTS → Play (0.5s)
├── Continue generating while playing
└── Seamless audio delivery
Total: 1s to first audio
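
To make the timing above concrete, here is a minimal Python sketch of sentence-level streaming; synthesize_and_play is a hypothetical hook standing in for your TTS provider:

import re
from typing import Callable, Iterator

def sentences(tokens: Iterator[str]) -> Iterator[str]:
    """Buffer streamed tokens and yield each sentence as soon as it completes."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        for complete in parts[:-1]:  # all but the last part are full sentences
            yield complete
        buffer = parts[-1]
    if buffer.strip():
        yield buffer  # flush whatever remains

def stream_to_tts(tokens: Iterator[str],
                  synthesize_and_play: Callable[[str], None]) -> None:
    """Send each sentence to TTS the moment it completes, so playback of
    the first sentence overlaps generation of the rest."""
    for sentence in sentences(tokens):
        synthesize_and_play(sentence)

With this shape, the first sentence reaches TTS as soon as it completes, which is what brings time-to-first-audio down from roughly 6s to 1s in the example above.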

Voice-Specific Considerations

Response Design

Optimize responses for voice:
instructions: |
  When responding via voice:
  - Keep responses concise (1-2 sentences when possible)
  - Avoid complex lists or tables
  - Use natural, conversational language
  - Spell out abbreviations and numbers
  - Pause naturally between topics
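
If you prefer to enforce some of these rules in code rather than in the prompt, a small post-processing pass can expand abbreviations before text reaches TTS. This is an illustration only, not a platform feature:

import re

# Illustrative mapping only -- extend with the abbreviations your domain uses.
SPOKEN_FORMS = {
    "e.g.": "for example",
    "etc.": "and so on",
    "approx.": "approximately",
}

def expand_for_speech(text: str) -> str:
    """Expand abbreviations that TTS engines often mispronounce."""
    for abbr, spoken in SPOKEN_FORMS.items():
        text = re.sub(rf"\b{re.escape(abbr)}", spoken, text)
    return text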

Handling Interruptions

Configure interruption behavior:
voice:
  interruption:
    enabled: true
    sensitivity: medium  # low, medium, high
    action: pause_and_listen
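
The pause_and_listen action amounts to a small state machine. A rough sketch, assuming a hypothetical pause_playback hook and an external voice-activity detector that calls on_user_speech:

from enum import Enum, auto
from typing import Callable

class VoiceState(Enum):
    SPEAKING = auto()   # agent audio is playing
    LISTENING = auto()  # waiting for the user

def on_user_speech(state: VoiceState,
                   pause_playback: Callable[[], None]) -> VoiceState:
    """Barge-in: if user speech arrives while the agent is speaking,
    pause playback and hand the turn back to the user."""
    if state is VoiceState.SPEAKING:
        pause_playback()
        return VoiceState.LISTENING
    return state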

Multi-Turn Conversations

Maintain context across voice turns:
context:
  # Remember more context for voice (users can't scroll back)
  context_window: 75
  summarization: enabled
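
Conceptually, windowing plus summarization keeps recent turns verbatim and compresses older ones. The platform manages this for you; the sketch below (with a hypothetical summarize helper) only illustrates the idea:

from typing import Callable

def build_context(turns: list[str], window: int,
                  summarize: Callable[[list[str]], str]) -> str:
    """Keep the last `window` turns verbatim and compress older turns
    into a summary, so long voice sessions stay within the model's budget."""
    recent = turns[-window:]
    older = turns[:-window]
    parts = []
    if older:
        parts.append("Earlier conversation summary: " + summarize(older))
    parts.extend(recent)
    return "\n".join(parts)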

Limitations

Real-Time Voice

  • Requires specific multimodal models
  • Higher compute costs
  • Wait-time experience features don’t apply

ASR/TTS

  • Transcription errors possible
  • Additional latency from conversion
  • Tone and emotion in the user's voice may be lost in transcription

Best Practices

Design for Ears, Not Eyes

  • Shorter responses work better
  • Avoid visual formatting (tables, code blocks)
  • Use conversational markers (“First…”, “Next…”)

Handle Voice Errors Gracefully

error_handling:
  transcription_failure:
    message: "I didn't catch that. Could you please repeat?"
    retry_count: 2

  unclear_speech:
    message: "I want to make sure I understand. Did you say...?"

Test with Real Speech

  • Test with various accents
  • Try background noise scenarios
  • Validate with actual users