- Groups chunks into batches.
- Sends each batch to the embedding model.
- Stores the generated vectors for later search or retrieval.
- Improve throughput and performance
- Reduce API overhead and cost
- Handle large-scale ingestion efficiently
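The pipeline described above can be sketched as a small helper. This is a minimal illustration, not the platform's implementation; `embed_batch` stands in for whatever client call your embedding provider exposes, and fixed-size grouping is used here for simplicity (the actual system groups by token count, as described later).

```python
from typing import Callable, List

def embed_in_batches(
    chunks: List[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int,
) -> List[List[float]]:
    """Group chunks into batches, embed each batch with one API call,
    and collect the vectors for later storage."""
    vectors: List[List[float]] = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]  # group chunks into a batch
        vectors.extend(embed_batch(batch))        # one API call per batch
    return vectors
```

With a batch size of 50, embedding 100 chunks costs 2 API calls instead of 100, which is where the throughput and cost savings come from.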
Example: Impact of Batching
Without Batching
- Chunk 1 → API Call 1
- Chunk 2 → API Call 2
- Chunk 3 → API Call 3
- …
- Chunk 100 → API Call 100
With Batching
Assume a token limit per request of 10,000 tokens and chunks of 1,000 tokens each.
- Ten chunks are packed into each batch (10 × 1,000 = 10,000 tokens).
- Batches are sent until all chunks are processed, so 100 chunks require only 10 API calls instead of 100.
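The packing logic can be sketched as a greedy loop over per-chunk token counts. This is an illustrative sketch under the assumptions above, not the platform's exact algorithm:

```python
def pack_batches(chunk_tokens: list[int], token_limit: int) -> list[list[int]]:
    """Greedily pack chunks (given as token counts) into batches so that
    no batch exceeds token_limit. Chunks larger than the limit are skipped,
    since they cannot fit in any batch."""
    batches: list[list[int]] = []
    current: list[int] = []
    used = 0
    for tokens in chunk_tokens:
        if tokens > token_limit:
            continue  # oversized chunk: cannot fit in any batch
        if used + tokens > token_limit:
            batches.append(current)  # current batch is full; start a new one
            current, used = [], 0
        current.append(tokens)
        used += tokens
    if current:
        batches.append(current)
    return batches

# 100 chunks of 1,000 tokens each, with a 10,000-token limit per request
batches = pack_batches([1_000] * 100, 10_000)
print(len(batches))  # 10 API calls instead of 100
```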
Enabling Batch Processing
- Navigate to GenAI Features → Prompt Library → Add Prompt / Edit Prompt.
- Provide the following rate limit configurations. These three fields work together to control the size and speed of batches sent to the model.
| Field | Description | Mandatory |
|---|---|---|
| Token limit per request | Maximum tokens the system can pack into a single API call. Set this to your embedding model’s maximum input token limit. For example, for OpenAI’s text-embedding-3-small, this is 8,191 tokens. If left empty, batching is disabled and chunks are sent individually. | No |
| Token limit per minute | Maximum tokens sent to the API per minute. Check your provider’s rate limits page. Set to 0 for unlimited — the system will not enforce any TPM limit. | No |
| Rate limit | Maximum API calls allowed per minute. Check your provider’s rate limits page. Set to 0 for unlimited — the system will not enforce any RPM limit. | No |
- When both Token limit per minute and Rate limit are set to 0, the system still performs dynamic batching (grouping chunks by Token limit per request) but sends batches as fast as possible without throttling.
- If a single chunk exceeds the Token limit per request, it is skipped. Ensure your chunk size is always smaller than the token limit per request to avoid skipped chunks.
- Only successfully indexed chunks appear in the Chunk Browser. If the token limit per request is lower than the chunk size, no chunks will be indexed or displayed.
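The interaction of the two per-minute limits can be sketched as a simple fixed-window throttle. This is an illustration of the behavior described above, not the platform's implementation; batches are represented as lists of token counts, `send` stands in for the actual API call, and a limit of 0 means unthrottled:

```python
import time
from typing import Callable, List

def send_batches(
    batches: List[List[int]],
    send: Callable[[List[int]], None],
    tokens_per_minute: int = 0,
    requests_per_minute: int = 0,
) -> None:
    """Send batches, pausing when either the per-minute token budget (TPM)
    or per-minute request budget (RPM) would be exceeded. 0 disables a limit."""
    window_start = time.monotonic()
    tokens_used = 0
    requests_used = 0
    for batch in batches:
        batch_tokens = sum(batch)
        over_tpm = tokens_per_minute and tokens_used + batch_tokens > tokens_per_minute
        over_rpm = requests_per_minute and requests_used + 1 > requests_per_minute
        if over_tpm or over_rpm:
            elapsed = time.monotonic() - window_start
            if elapsed < 60:
                time.sleep(60 - elapsed)  # wait out the rest of the minute
            window_start = time.monotonic()
            tokens_used = requests_used = 0
        send(batch)  # one API call per batch
        tokens_used += batch_tokens
        requests_used += 1
```

With both limits at 0, the `if` branch never fires and batches are sent back to back, matching the unthrottled behavior described above.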