# Voice Interactions

Enable voice-based conversations with your agentic apps.

***

## Overview

The Platform supports voice interactions through integration with AI for Service, enabling users to speak naturally with your agents. Two modes are available: real-time voice, in which a multimodal model processes audio directly, and ASR/TTS, in which speech is transcribed to text, processed by the agent, and synthesized back to speech.

***

## Voice Modes

### Real-Time Voice

Natural voice conversations using multimodal language models.

```
User speaks → Audio processed directly by LLM → Audio response generated
```

**Characteristics**:

* Simultaneous voice input/output
* Natural conversational flow
* Lower latency for back-and-forth exchanges
* Requires compatible multimodal model

### ASR/TTS (Speech-to-Text/Text-to-Speech)

A hybrid approach: speech is converted to text, the agent processes the text, and the text response is converted back to speech.

```
User speaks → ASR → Text → Agent → Text response → TTS → Audio
```

**Characteristics**:

* Works with any text-based model
* More flexible model choices
* TTS streaming reduces perceived latency
* Better for complex responses
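
The ASR/TTS flow above can be sketched as three pluggable stages. This is a minimal illustration, not the Platform's implementation: `handle_voice_turn` and the `fake_*` stand-ins are hypothetical names, and a real deployment would call actual ASR and TTS providers.

```python
from typing import Callable

# Hypothetical stage signatures; real ASR/TTS providers expose their own APIs.
Transcriber = Callable[[bytes], str]   # ASR: audio in, text out
Agent = Callable[[str], str]           # any text-based agent or model
Synthesizer = Callable[[str], bytes]   # TTS: text in, audio out


def handle_voice_turn(audio: bytes, asr: Transcriber, agent: Agent, tts: Synthesizer) -> bytes:
    """Run one user turn through ASR -> agent -> TTS."""
    user_text = asr(audio)
    reply_text = agent(user_text)
    return tts(reply_text)


# Stand-in stages so the sketch runs without any speech service.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = handle_voice_turn(b"check my balance", fake_asr, fake_agent, fake_tts)
```

Because the agent only ever sees text, any text-based model can fill the middle stage, which is where the flexibility of this mode comes from.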

***

## Configuration

### Enable Real-Time Voice

1. Configure real-time voice in the **AI for Service** Automation Node.
2. Enable it in the Platform's **Agentic App** settings.
3. Select a model that supports real-time voice.

```yaml
voice:
  mode: realtime
  model: gpt-4o-realtime
  settings:
    voice: alloy  # Voice persona
    language: en-US
```

### Enable ASR/TTS

1. Disable real-time voice in AI for Service
2. Enable TTS Streaming for progressive delivery
3. Configure voice settings

```yaml
voice:
  mode: asr_tts
  asr:
    provider: default
    language: en-US
  tts:
    provider: default
    voice: neural-female-1
    streaming: true  # Progressive delivery
```

***

## TTS Streaming

Reduce perceived latency by streaming response text to TTS as it is generated, so playback starts before the full response is ready:

```
Without streaming:
├── Generate full response (3s)
├── Convert to speech (1s)
└── Play audio (2s)
Total: 4s to first audio (6s until playback ends)

With streaming:
├── Generate first sentence (0.5s)
├── Stream to TTS → Play (0.5s)
├── Continue generating while playing
└── Seamless audio delivery
Total: 1s to first audio
```
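
One way to picture streaming is a buffer that releases each sentence to TTS as soon as it is complete, instead of waiting for the whole response. The sketch below is an assumption about how such chunking could work, not the Platform's internals; `sentences_from_stream` is a hypothetical helper, and the boundary rule is deliberately naive (abbreviations such as "Dr." would be split too early).

```python
import re
from typing import Iterable, Iterator

# Sentence boundary: '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a stream of text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains when the model stops generating.
    if buffer.strip():
        yield buffer.strip()


model_output = ["Your order ", "shipped today. It ", "arrives Friday."]
spoken = list(sentences_from_stream(model_output))
```

Each yielded sentence can be handed to TTS immediately, so the first audio plays while later sentences are still being generated.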

***

## Voice-Specific Considerations

### Response Design

Optimize responses for voice by adding guidance to the agent's instructions:

```yaml
instructions: |
  When responding via voice:
  - Keep responses concise (1-2 sentences when possible)
  - Avoid complex lists or tables
  - Use natural, conversational language
  - Spell out abbreviations and numbers
  - Pause naturally between topics
```
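
Instructions alone may not remove every visual artifact, so some teams also clean the text just before it reaches TTS. The following is a minimal sketch of that idea; `prepare_for_voice` and the `ABBREVIATIONS` table are hypothetical and not part of the Platform.

```python
import re

# Hypothetical abbreviation table; extend it for your own domain.
ABBREVIATIONS = {"e.g.": "for example", "approx.": "approximately", "appt": "appointment"}


def prepare_for_voice(text: str) -> str:
    """Remove visual markup and expand abbreviations before text reaches TTS."""
    # Drop Markdown emphasis, inline-code, and heading markers.
    text = re.sub(r"[*_`#]+", "", text)
    # Remove leading bullet markers so list items read as plain sentences.
    text = re.sub(r"^\s*[-•]\s+", "", text, flags=re.MULTILINE)
    # Plain substring replacement: simple, but it can also match inside longer words.
    for short, spoken in ABBREVIATIONS.items():
        text = text.replace(short, spoken)
    # Collapse whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


cleaned = prepare_for_voice("**Your appt** is at approx. 3 PM.")
```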

### Handling Interruptions

Configure interruption behavior:

```yaml
voice:
  interruption:
    enabled: true
    sensitivity: medium  # low, medium, high
    action: pause_and_listen
```
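
To make the settings concrete, the sketch below models `pause_and_listen` as a two-state controller. It assumes that `sensitivity` maps to a speech-energy threshold, which the configuration above does not confirm; the threshold values and the `Playback` class are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical thresholds: minimum speech energy (0.0-1.0) that counts as an interruption.
SENSITIVITY_THRESHOLDS = {"low": 0.8, "medium": 0.5, "high": 0.2}


@dataclass
class Playback:
    sensitivity: str = "medium"
    state: str = "speaking"  # either "speaking" or "listening"

    def on_user_audio(self, energy: float) -> str:
        """Pause playback and start listening when the user speaks loudly enough."""
        threshold = SENSITIVITY_THRESHOLDS[self.sensitivity]
        if self.state == "speaking" and energy >= threshold:
            self.state = "listening"  # the pause_and_listen action
        return self.state


# The same quiet utterance interrupts only at high sensitivity.
medium_result = Playback("medium").on_user_audio(0.3)
high_result = Playback("high").on_user_audio(0.3)
```

Under this assumption, higher sensitivity makes the agent easier to interrupt but also more likely to stop for background noise.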

### Multi-Turn Conversations

Maintain context across voice turns:

```yaml
context:
  # Remember more context for voice (users can't scroll back)
  context_window: 75
  summarization: enabled
```
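
A windowed context with summarization can be sketched as follows. This assumes `context_window` counts messages, which the configuration above does not specify; `build_context` and the `summarize` callback are hypothetical, and a real summarizer would typically call a model.

```python
from typing import Callable


def build_context(
    history: list[str],
    window: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Keep the latest `window` messages verbatim and fold older ones into a summary."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(history) <= window:
        return list(history)
    older, recent = history[:-window], history[-window:]
    return [f"Summary of earlier conversation: {summarize(older)}"] + recent


history = ["msg 1", "msg 2", "msg 3", "msg 4", "msg 5"]
context = build_context(history, window=2, summarize=lambda msgs: f"{len(msgs)} earlier messages")
```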

***

## Limitations

### Real-Time Voice

* Requires specific multimodal models
* Higher compute costs
* Wait-time experience features don't apply

### ASR/TTS

* Transcription errors are possible
* Speech-to-text and text-to-speech conversion adds latency
* Tone and emotion in the user's voice may be lost in transcription

***

## Best Practices

### Design for Ears, Not Eyes

* Shorter responses work better
* Avoid visual formatting (tables, code blocks)
* Use conversational markers ("First...", "Next...")

### Handle Voice Errors Gracefully

```yaml
error_handling:
  transcription_failure:
    message: "I didn't catch that. Could you please repeat?"
    retry_count: 2

  unclear_speech:
    message: "I want to make sure I understand. Did you say...?"
```
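
The retry behavior implied by `retry_count` can be sketched as a loop that re-prompts after each failed transcription. This is an illustration under assumptions: `listen_with_retries`, `transcribe`, and `say` are hypothetical names, and the Platform may count or handle retries differently.

```python
from typing import Callable, Optional

RETRY_MESSAGE = "I didn't catch that. Could you please repeat?"


def listen_with_retries(
    transcribe: Callable[[], Optional[str]],
    say: Callable[[str], None],
    retry_count: int = 2,
) -> Optional[str]:
    """Return the first non-empty transcript, re-prompting up to `retry_count` times."""
    for attempt in range(retry_count + 1):
        transcript = transcribe()
        if transcript:
            return transcript
        if attempt < retry_count:
            say(RETRY_MESSAGE)
    return None  # Caller decides on a fallback, such as handing off to a human.


# Simulate two failed transcriptions followed by a successful one.
results = iter([None, "", "pay my bill"])
prompts: list[str] = []
transcript = listen_with_retries(lambda: next(results), prompts.append)
```

Returning `None` after the final attempt keeps the fallback decision with the caller rather than looping indefinitely.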

### Test with Real Speech

* Test with various accents
* Try background noise scenarios
* Validate with actual users
