# Realtime Multimodal Orchestration

Realtime multimodal orchestration coordinates AI agents and models so they can process and respond to several data types (text, audio, images, and video) within a single live session. It enables context-aware interactions across multi-agent workflows and removes the single-modality constraint of traditional AI systems.

```
Audio / Text / Image  →  Multimodal Model  →  Orchestrator  →  Agents  →  Output
```

***

## Key Capabilities

* **Richer context and understanding**: Integrating multiple data types gives the system a deeper, more accurate picture of user needs and intent.

* **Improved accuracy and user experience**: Cross-referencing modalities and maintaining conversation history produces more relevant responses and a seamless experience.

* **Scalability and flexibility**: Orchestration frameworks distribute work across many agents and servers, so capacity can grow to thousands of concurrent interactions without changes to agent code.

* **Any input, any output**: Multimodal AI accepts text, images, audio, and other input types, and can respond in whichever output formats the selected model supports.

***

## Multimodal Architecture

```mermaid actions={false} theme={null}
%%{init: {'theme': 'base', 'themeVariables': {'background': 'transparent', 'primaryColor': '#e8f0fe', 'primaryTextColor': '#1a1a1a', 'primaryBorderColor': '#4a6fa5', 'lineColor': '#4a6fa5', 'secondaryColor': 'transparent', 'tertiaryColor': 'transparent', 'clusterBkg': 'transparent', 'clusterBorder': '#4a6fa5', 'titleColor': '#1a1a1a', 'clusterLabelBackground': 'transparent'}}}%%
flowchart TD
    AU[Audio] & TX[Text] & IM[Image] --> IL

    subgraph IL[INPUT LAYER]
        SM[Session Manager<br/>WebSocket · Modality Detection]
    end

    IL --> MM

    subgraph MM[NATIVE MULTIMODAL]
        M1[GPT-4o Realtime]
        M2[Gemini Live]
        M3[Azure OpenAI Realtime]
    end

    MM --> OL

    subgraph OL[ORCHESTRATION LAYER]
        OR[Plans · Reasons<br/>Delegates · Coordinates]
        OR --> AA[Agent A<br/>+ Tools]
        OR --> AB[Agent B<br/>+ Tools]
    end

    OL --> OUT

    subgraph OUT[OUTPUT LAYER]
        OP[Streaming Responses<br/>Guardrails]
    end
```

### Input Layer

The input layer captures data from multiple sources: spoken queries (audio), written text, and uploaded images. A session manager handles WebSocket connections and detects the modality of each input on arrival.
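
As a rough illustration of modality detection, the sketch below classifies one incoming WebSocket message. It assumes text arrives as text frames (`str`) and media arrives as binary frames (`bytes`). The `Frame` type and `detect_modality` function are hypothetical names, not part of the platform API, and the fallback to audio for unrecognized binary data is an assumption.

```python
from dataclasses import dataclass
from typing import Union

# File signatures (magic bytes) for PNG, JPEG, and GIF images.
_IMAGE_SIGNATURES = (b"\x89PNG\r\n\x1a\n", b"\xff\xd8\xff", b"GIF87a", b"GIF89a")


@dataclass
class Frame:
    """Hypothetical normalized frame handed on to the model layer."""
    modality: str  # "text", "audio", or "image"
    payload: Union[str, bytes]


def detect_modality(message: Union[str, bytes]) -> Frame:
    """Classify one incoming message by its type and leading bytes."""
    if isinstance(message, str):
        return Frame("text", message)
    if message.startswith(_IMAGE_SIGNATURES):
        return Frame("image", message)
    # Assumption: any other binary payload is audio (raw PCM or a container).
    return Frame("audio", message)
```

A production session manager would more likely rely on an explicit content type declared by the client than on byte sniffing alone.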

### Native Multimodal Model

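Modern realtime models such as OpenAI GPT-4o Realtime, Google Gemini Live, and the Azure OpenAI Realtime API process audio and text natively, which removes the need for a separate speech recognition (ASR) → LLM → text-to-speech (TTS) pipeline. Because speech is not flattened to text between stages, vocal nuances are preserved, and fewer processing hops reduce latency.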

### Orchestration Layer

The app orchestrator plans, reasons, and delegates tasks to the right agents based on current context and user intent. It coordinates multi-agent workflows and maintains session state throughout the interaction.
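
The delegation step can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's implementation: the `Orchestrator` and `Session` classes and both agents are hypothetical, and a keyword match stands in for the model-driven reasoning a real orchestrator would use.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Session:
    """Per-conversation state kept across turns."""
    history: List[str] = field(default_factory=list)


# An agent takes the utterance and the session and returns a reply.
Agent = Callable[[str, Session], str]


class Orchestrator:
    def __init__(self, agents: Dict[str, Agent], fallback: str) -> None:
        self.agents = agents
        self.fallback = fallback

    def classify(self, utterance: str) -> str:
        """Toy intent detection: the first agent name found in the text wins."""
        text = utterance.lower()
        for intent in self.agents:
            if intent in text:
                return intent
        return self.fallback

    def handle(self, utterance: str, session: Session) -> str:
        """Delegate to the chosen agent, then record the turn in the session."""
        intent = self.classify(utterance)
        reply = self.agents[intent](utterance, session)
        session.history.append(f"user: {utterance}")
        session.history.append(f"{intent}: {reply}")
        return reply


def billing_agent(utterance: str, session: Session) -> str:
    return "Here is your latest invoice."


def support_agent(utterance: str, session: Session) -> str:
    return f"I have {len(session.history)} earlier messages for context."
```

Passing the same `Session` object to every agent is what lets a later agent see what earlier agents said.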

### Task Execution and Coordination

Each agent runs its own task procedures, tools, and sub-agents. The orchestrator sequences these tasks and handles dynamic interactions, including mid-stream function calls, without interrupting the conversation flow.
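
One way to picture a mid-stream function call is an event loop that keeps forwarding text while a tool runs in the background. The sketch below uses `asyncio` with a made-up event shape of `(kind, data)` tuples. `model_events`, `order_status`, and `run_turn` are hypothetical stand-ins, and real realtime APIs define their own event schemas.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Dict, List, Tuple

Event = Tuple[str, str]  # (kind, data); kind is "text" or "function_call"


async def model_events() -> AsyncIterator[Event]:
    """Stand-in for a realtime model stream (hypothetical event shapes)."""
    for event in [("text", "Checking your order. "),
                  ("function_call", "order_status"),
                  ("text", "Anything else?")]:
        await asyncio.sleep(0)  # yield control, as a network read would
        yield event


async def order_status() -> str:
    """Hypothetical tool; the sleep simulates a slow backend lookup."""
    await asyncio.sleep(0.01)
    return "Order status: shipped."


async def run_turn(events: AsyncIterator[Event],
                   tools: Dict[str, Callable[[], Awaitable[str]]]) -> List[str]:
    """Collect text chunks as they arrive; run tool calls concurrently."""
    out: List[str] = []
    pending: List["asyncio.Task[str]"] = []
    async for kind, data in events:
        if kind == "text":
            out.append(data)
        elif kind == "function_call":
            # Schedule the tool without blocking the stream loop.
            pending.append(asyncio.create_task(tools[data]()))
    # Tool results are appended after the stream has drained.
    out.extend(await asyncio.gather(*pending))
    return out
```

In this sketch the tool result arrives after the remaining text. A fuller implementation would typically send the result back to the model so it can continue the response with that data.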

### Output Layer

The output layer streams responses as they are generated and adapts to new inputs or context changes in real time. Guardrails validate outputs before delivery, throughout the session.
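
Applying guardrails to a stream usually means holding back a little text so that a check never sees half a phrase. The sketch below buffers chunks until a sentence boundary, then validates each full sentence against a blocked-term list. The `guarded_stream` and `check` functions, the placeholder text, and the blocked-term approach itself are illustrative assumptions, not the product's guardrail mechanism.

```python
import re
from typing import Iterable, Iterator, Set

# A sentence ends at ., !, or ? followed by one whitespace character.
_SENTENCE_END = re.compile(r"[.!?]\s")


def check(sentence: str, blocked: Set[str]) -> str:
    """Return the sentence, or a placeholder if it contains a blocked term."""
    lowered = sentence.lower()
    if any(term in lowered for term in blocked):
        return "[withheld by guardrail] "
    return sentence


def guarded_stream(chunks: Iterable[str], blocked: Set[str]) -> Iterator[str]:
    """Yield validated sentences from a stream of arbitrary text chunks."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Release every complete sentence currently in the buffer.
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            yield check(sentence, blocked)
    if buffer:
        # Flush whatever remains when the stream ends.
        yield check(buffer, blocked)
```

The trade-off is added latency of up to one sentence in exchange for checks that are not defeated by a term split across two chunks.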

***
