- Groups chunks into batches.
- Sends each batch to the embedding model.
- Stores the generated vectors for later search or retrieval.
- Improve throughput and performance
- Reduce API overhead and cost
- Handle large-scale ingestion efficiently
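The pipeline described above can be sketched as a small helper. This is a minimal illustration, not the platform's implementation; `embed_batch` stands in for whatever client call your embedding provider exposes, and fixed-size grouping is used here for simplicity (the actual system groups by token count, as described later).

```python
from typing import Callable, List

def embed_in_batches(
    chunks: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int,
) -> List[List[float]]:
    """Group chunks into batches, embed each batch with one API call,
    and collect the vectors for later storage."""
    vectors: List[List[float]] = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]  # group chunks into a batch
        vectors.extend(embed_batch(batch))        # one API call per batch
    return vectors
```

With a batch size of 50, embedding 100 chunks costs 2 API calls instead of 100, which is where the throughput and cost savings come from.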
Example: Impact of Batching
Without Batching
- Chunk 1 → API Call 1
- Chunk 2 → API Call 2
- Chunk 3 → API Call 3
- …
- Chunk 100 → API Call 100
With Batching
Assume a token limit per request of 10,000 tokens and chunks of 1,000 tokens each.
- Ten chunks are packed into each batch (10 × 1,000 = 10,000 tokens).
- Batches are sent until all chunks are processed, so 100 chunks require only 10 API calls instead of 100.
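The packing logic can be sketched as a greedy loop over per-chunk token counts. This is an illustrative sketch under the assumptions above, not the platform's exact algorithm:

```python
def pack_batches(chunk_tokens: list[int], token_limit: int) -> list[list[int]]:
    """Greedily pack chunks (given as token counts) into batches so that
    no batch exceeds token_limit. Chunks larger than the limit are skipped,
    since they cannot fit in any batch."""
    batches: list[list[int]] = []
    current: list[int] = []
    used = 0
    for tokens in chunk_tokens:
        if tokens > token_limit:
            continue  # oversized chunk: cannot fit in any batch
        if used + tokens > token_limit:
            batches.append(current)  # current batch is full; start a new one
            current, used = [], 0
        current.append(tokens)
        used += tokens
    if current:
        batches.append(current)
    return batches

# 100 chunks of 1,000 tokens each, with a 10,000-token limit per request
batches = pack_batches([1_000] * 100, 10_000)
print(len(batches))  # 10 API calls instead of 100
```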
Enabling Batch Processing
- Navigate to GenAI Features → Prompt Library → Add Prompt / Edit Prompt.
- Provide the following rate limit configurations. These three fields work together to control the size and speed of batches sent to the model.
| Field | Description | Mandatory |
|---|---|---|
| Token limit per request | Maximum tokens the system can pack into a single API call. Set this to your embedding model’s maximum input token limit. For example, for OpenAI’s text-embedding-3-small, this is 8,191 tokens. If left empty, batching is disabled and chunks are sent individually. | No |
| Token limit per minute | Maximum tokens sent to the API per minute. Check your provider’s rate limits page. Set to 0 for unlimited — the system will not enforce any TPM limit. | No |
| Rate limit | Maximum API calls allowed per minute. Check your provider’s rate limits page. Set to 0 for unlimited — the system will not enforce any RPM limit. | No |
- When both Token limit per minute and Rate limit are set to 0, the system still performs dynamic batching (grouping chunks by Token limit per request) but sends batches as fast as possible without throttling.
- If a single chunk exceeds the Token limit per request, it is skipped. Ensure your chunk size is always smaller than the token limit per request to avoid skipped chunks.
- Only successfully indexed chunks appear in the Chunk Browser. If the token limit per request is lower than the chunk size, no chunks will be indexed or displayed.
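The interaction of the two per-minute limits can be sketched as a simple fixed-window throttle. This is an illustration of the behavior described above, not the platform's implementation; batches are represented as lists of token counts, `send` stands in for the actual API call, and a limit of 0 means unthrottled:

```python
import time
from typing import Callable, List

def send_batches(
    batches: List[List[int]],
    send: Callable[[List[int]], None],
    tokens_per_minute: int = 0,
    requests_per_minute: int = 0,
) -> None:
    """Send batches, pausing when either the per-minute token budget (TPM)
    or per-minute request budget (RPM) would be exceeded. 0 disables a limit."""
    window_start = time.monotonic()
    tokens_used = 0
    requests_used = 0
    for batch in batches:
        batch_tokens = sum(batch)
        over_tpm = tokens_per_minute and tokens_used + batch_tokens > tokens_per_minute
        over_rpm = requests_per_minute and requests_used + 1 > requests_per_minute
        if over_tpm or over_rpm:
            elapsed = time.monotonic() - window_start
            if elapsed < 60:
                time.sleep(60 - elapsed)  # wait out the rest of the minute
            window_start = time.monotonic()
            tokens_used = requests_used = 0
        send(batch)  # one API call per batch
        tokens_used += batch_tokens
        requests_used += 1
```

With both limits at 0, the `if` branch never fires and batches are sent back to back, matching the unthrottled behavior described above.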