Voice Interactions

Enable voice-based conversations with your agentic apps.

Overview

The platform supports voice interactions through integration with AI for Service, enabling users to speak naturally with your agents. Two modes are available: real-time voice for natural, fully spoken conversations, and ASR/TTS, which wraps voice input and output around text-based processing.

Voice Modes

Real-Time Voice

Natural voice conversations using multimodal language models.
User speaks → Audio processed directly by LLM → Audio response generated
Characteristics:
  • Simultaneous voice input/output
  • Natural conversational flow
  • Lower latency for back-and-forth exchanges
  • Requires compatible multimodal model

ASR/TTS (Speech-to-Text/Text-to-Speech)

Hybrid approach where speech is converted to text, processed, and converted back to speech.
User speaks → ASR → Text → Agent → Text response → TTS → Audio
Characteristics:
  • Works with any text-based model
  • More flexible model choices
  • TTS streaming reduces perceived latency
  • Better for complex responses
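
As a minimal sketch of this pipeline (transcribe, run_agent, and synthesize are hypothetical placeholders standing in for your ASR, agent, and TTS providers):

# Hypothetical hooks -- replace with your ASR, agent, and TTS clients.
def transcribe(audio: bytes) -> str: ...
def run_agent(text: str) -> str: ...
def synthesize(text: str) -> bytes: ...

def handle_voice_turn(audio: bytes) -> bytes:
    """One ASR/TTS round trip: speech in, speech out."""
    text_in = transcribe(audio)   # ASR: speech -> text
    reply = run_agent(text_in)    # any text-based model works here
    return synthesize(reply)      # TTS: text -> speech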

Configuration

Enable Real-Time Voice

  1. Configure in AI for Service Automation Node
  2. Enable in Platform’s Agentic App settings
  3. Select a model that supports real-time voice
voice:
  mode: realtime
  model: gpt-4o-realtime
  settings:
    voice: alloy  # Voice persona
    language: en-US

Enable ASR/TTS

  1. Disable real-time voice in AI for Service
  2. Enable TTS Streaming for progressive delivery
  3. Configure voice settings
voice:
  mode: asr_tts
  asr:
    provider: default
    language: en-US
  tts:
    provider: default
    voice: neural-female-1
    streaming: true  # Progressive delivery

TTS Streaming

Reduce perceived latency by streaming generated text to the TTS engine progressively:
Without streaming:
├── Generate full response (3s)
├── Convert to speech (1s)
└── Play audio (2s)
Total: 6s before user hears anything

With streaming:
├── Generate first sentence (0.5s)
├── Stream to TTS → Play (0.5s)
├── Continue generating while playing
└── Seamless audio delivery
Total: 1s to first audio
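
To make the timing above concrete, here is a minimal Python sketch of sentence-level streaming; synthesize_and_play is a hypothetical hook standing in for your TTS provider:

import re
from typing import Callable, Iterator

def sentences(tokens: Iterator[str]) -> Iterator[str]:
    """Buffer streamed tokens and yield each sentence as soon as it completes."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = re.split(r"(?<=[.!?])\s+", buffer)
        for complete in parts[:-1]:  # all but the last part are full sentences
            yield complete
        buffer = parts[-1]
    if buffer.strip():
        yield buffer  # flush whatever remains

def stream_to_tts(tokens: Iterator[str],
                  synthesize_and_play: Callable[[str], None]) -> None:
    """Send each sentence to TTS the moment it completes, so playback of
    the first sentence overlaps generation of the rest."""
    for sentence in sentences(tokens):
        synthesize_and_play(sentence)

With this shape, the first sentence reaches TTS as soon as it completes, which is what brings time-to-first-audio down from roughly 6s to 1s in the example above.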

Voice-Specific Considerations

Response Design

Optimize responses for voice:
instructions: |
  When responding via voice:
  - Keep responses concise (1-2 sentences when possible)
  - Avoid complex lists or tables
  - Use natural, conversational language
  - Spell out abbreviations and numbers
  - Pause naturally between topics
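
If you prefer to enforce some of these rules in code rather than in the prompt, a small post-processing pass can expand abbreviations before text reaches TTS. This is an illustration only, not a platform feature:

import re

# Illustrative mapping only -- extend with the abbreviations your domain uses.
SPOKEN_FORMS = {
    "e.g.": "for example",
    "etc.": "and so on",
    "approx.": "approximately",
}

def expand_for_speech(text: str) -> str:
    """Expand abbreviations that TTS engines often mispronounce."""
    for abbr, spoken in SPOKEN_FORMS.items():
        text = re.sub(rf"\b{re.escape(abbr)}", spoken, text)
    return text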

Handling Interruptions

Configure interruption behavior:
voice:
  interruption:
    enabled: true
    sensitivity: medium  # low, medium, high
    action: pause_and_listen
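
The pause_and_listen action amounts to a small state machine. A rough sketch, assuming a hypothetical pause_playback hook and an external voice-activity detector that calls on_user_speech:

from enum import Enum, auto
from typing import Callable

class VoiceState(Enum):
    SPEAKING = auto()   # agent audio is playing
    LISTENING = auto()  # waiting for the user

def on_user_speech(state: VoiceState,
                   pause_playback: Callable[[], None]) -> VoiceState:
    """Barge-in: if user speech arrives while the agent is speaking,
    pause playback and hand the turn back to the user."""
    if state is VoiceState.SPEAKING:
        pause_playback()
        return VoiceState.LISTENING
    return state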

Multi-Turn Conversations

Maintain context across voice turns:
context:
  # Remember more context for voice (users can't scroll back)
  context_window: 75
  summarization: enabled
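
Conceptually, windowing plus summarization keeps recent turns verbatim and compresses older ones. The platform manages this for you; the sketch below (with a hypothetical summarize helper) only illustrates the idea:

from typing import Callable

def build_context(turns: list[str], window: int,
                  summarize: Callable[[list[str]], str]) -> str:
    """Keep the last `window` turns verbatim and compress older turns
    into a summary, so long voice sessions stay within the model's budget."""
    recent = turns[-window:]
    older = turns[:-window]
    parts = []
    if older:
        parts.append("Earlier conversation summary: " + summarize(older))
    parts.extend(recent)
    return "\n".join(parts)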

Limitations

Real-Time Voice

  • Requires specific multimodal models
  • Higher compute costs
  • Wait-time experience features don’t apply

ASR/TTS

  • Transcription errors possible
  • Additional latency from conversion
  • Tone and emotion in the user's voice may be lost in transcription

Best Practices

Design for Ears, Not Eyes

  • Shorter responses work better
  • Avoid visual formatting (tables, code blocks)
  • Use conversational markers (“First…”, “Next…”)

Handle Voice Errors Gracefully

error_handling:
  transcription_failure:
    message: "I didn't catch that. Could you please repeat?"
    retry_count: 2

  unclear_speech:
    message: "I want to make sure I understand. Did you say...?"

Test with Real Speech

  • Test with various accents
  • Try background noise scenarios
  • Validate with actual users