Overview
The Index Configuration in Search AI encompasses the complete data processing pipeline that transforms raw content into searchable, high-quality chunks optimized for answer generation. The pipeline consists of three main phases:
- Extraction – Breaking down source content into manageable chunks using various extraction strategies
- Transformation – Enriching and refining extracted content through configurable processing stages
- Indexing – Storing processed chunks with vector embeddings for efficient retrieval
Content Extraction
Content extraction segments ingested data into smaller chunks to organize and efficiently retrieve relevant information for answer generation. The application supports various extraction strategies that can be customized based on content format, structure, and answer requirements.
Extraction Strategies
Text Extraction Model
Combines NLP and machine learning techniques based on tokenization, where text is segmented into smaller units.
Configuration Options:
- Chunk Size (Pages): Treats every page as a single chunk
- Chunk Size (Tokens): Prepares chunks using the following settings (see the sketch after this list):
  - Tokens: Maximum tokens per chunk (up to 5000)
  - Chunk Overlap: Number of tokens overlapping between consecutive chunks
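A minimal sketch of token-based chunking with overlap, assuming naive whitespace tokenization; the tokenizer Search AI actually uses is not specified in this document, and the function name is illustrative.

```javascript
// Minimal sketch of token-based chunking with overlap. Whitespace
// splitting stands in for a real tokenizer, which this document
// does not specify.
function chunkByTokens(text, maxTokens = 5000, overlap = 200) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = Math.max(1, maxTokens - overlap); // advance leaves `overlap` tokens shared
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxTokens).join(" "));
    if (start + maxTokens >= tokens.length) break; // last window reached
  }
  return chunks;
}
```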
Layout Aware Extraction
Extracts data by considering content layout and structure, improving precision for documents with tables, graphs, and charts.
Configuration:
- General Template: Extracts content from complex PDFs and DOCX files, including tables and images
  - Uses OCR technology, layout detection models, and layout awareness rules
Advanced HTML Extraction
Designed specifically for extracting data from HTML files, including tables, images, and textual content. Videos are included in chunks, but transcripts are not extracted.
Configuration Templates:
- General: Identifies different components and classes within HTML documents
- Token-Based: Generates chunks based on token size (100-1000 tokens)
Note: Documents with content shorter than 60 characters are skipped when using the General template.
Markdown Extraction
Transforms each page of a source document into structured Markdown format before processing. Effective for preserving semantic structure.
Supported formats: PDF files (uploaded directly or via connectors)
Image-Based Document Extraction
Handles complex PDFs with non-textual layouts such as forms, tables, or visually rich content. Each page is converted to an image and processed using VDR embedding models.
Supported formats: PDF files
Note: Requires an image-based embedding model selection in Vector Configuration. Answer generation is not supported by Kore XO GPT with this strategy.
Custom Extraction
Enables integration with third-party services for custom content processing.
Configuration (see the sketch after this list):
- Endpoint: URL for sending content (POST endpoint)
- Concurrency: Maximum API calls per second
- Headers: Additional request headers
- Request Body: Content fields to send to the service
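As a hedged sketch of the shape this integration could take, here is the kind of POST call Search AI might issue to a custom extraction endpoint. The URL, header names, and body fields are placeholders, not a documented contract.

```javascript
// Hypothetical call to a custom extraction service; all names are
// illustrative. The service is expected to return processed chunks.
async function callCustomExtractor(doc) {
  const response = await fetch("https://example.com/extract", { // Endpoint (POST)
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "Bearer <api-key>",                        // Headers
    },
    body: JSON.stringify({                                      // Request Body
      title: doc.title,
      content: doc.content,
    }),
  });
  return response.json(); // chunks produced by the custom service
}
```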
Default Extraction Strategies
| Content Source | Content Type | Extraction Strategy |
|---|---|---|
| WebPages | HTML | Advanced HTML Extraction |
| Documents | pdf, doc, docx | Markdown Extraction |
| Documents | pptx, txt | Text Extraction |
| Connectors | pdf, doc, docx, html, aspx | Markdown Extraction |
| Connectors | pptx, txt | Text Extraction |
| Connectors | json | JSON Extraction |
Managing Extraction Strategies
Adding a Strategy:
- Navigate to Index > Extract
- Click +Add Strategy
- Configure:
  - Strategy Name: Unique identifier
  - Define Source: Select data sources and content types using filters
  - Extraction Model: Choose the extraction method
  - Model-specific settings for the chosen extraction model
- Enable/Disable: Toggle from the strategy page
- Delete: Use the Delete button (does not affect existing chunks)
- Reorder: Drag and drop to change sequence
Content Transformation
Content transformation refines extracted text into high-quality data after the extraction phase. This enrichment process addresses incomplete metadata, formatting inconsistencies, and missing context to improve search and retrieval effectiveness.
Benefits
- Improved Data Quality: Fix errors, standardize fields, add contextual information
- Enhanced Control: Adapt enrichment to specific business needs
- Bulk Transformation: Apply rules to all applicable content simultaneously
Transformation Stages
Field Mapping Stage
Adds, updates, or deletes specific fields from input content. Useful for ensuring uniformity across pages.
Configuration:
- Name: Unique stage identifier
- Condition: Rules for selecting content (Basic or Script mode)
  - Field Name, Operator, Value
- Outcome: Actions to perform (see the sketch after this list):
  - Set: Sets a value for the target field
  - Delete: Removes the target field
  - Copy: Copies a value between fields
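A minimal sketch of what a Field Mapping rule could express, written as a JSON-style object for illustration; the shape and field names are assumptions, not the product's actual schema.

```javascript
// Illustrative Field Mapping rule (shape and field names are hypothetical):
// if sourceType equals "web", set a category field and copy the page
// title into the chunk title.
const fieldMappingStage = {
  name: "normalize-web-pages",
  condition: { field: "sourceType", operator: "equals", value: "web" },
  outcomes: [
    { action: "set",  field: "category", value: "website" },  // Set
    { action: "copy", from: "pageTitle", to: "chunkTitle" },  // Copy
  ],
};
```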
Custom Script Stage
Implements custom JavaScript transformations for specific business needs.
Configuration:
- Condition: Painless script for content selection
- Outcome: Painless script defining transformations (see the illustrative sketch after this list)
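A hedged sketch of the kind of condition/outcome pair this stage might express, written as plain JavaScript for illustration; the variables actually exposed to stage scripts (such as a `doc` object and its fields) are assumptions.

```javascript
// Hypothetical Custom Script stage logic; the runtime API (the `doc`
// object and its fields) is assumed for illustration.
// Condition: select only FAQ pages.
function condition(doc) {
  return doc.sourceUrl && doc.sourceUrl.includes("/faq/");
}
// Outcome: strip trailing boilerplate and tag the chunk.
function outcome(doc) {
  doc.chunkText = doc.chunkText.replace(/Copyright.*$/m, "").trim();
  doc.contentType = "faq";
  return doc;
}
```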
Exclude Documents Stage
Filters out unnecessary or irrelevant content before ingestion.
Configuration:
- Field: Document field for the condition (e.g., creation date, file type)
- Operator: Comparison operator (greater than, less than, equals)
- Value: Comparison value (see the example rules below)
Benefits:
- Reduces unnecessary chunk generation
- Improves search accuracy
- Enhances indexing efficiency
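For illustration, two exclude rules of the kind this stage expresses; the field names, operator spellings, and date format are assumptions, not the product's schema.

```javascript
// Hypothetical Exclude Documents rules: skip log files and anything
// created before 2020. Field names and operators are illustrative.
const excludeRules = [
  { field: "fileType",     operator: "equals",    value: "log" },
  { field: "creationDate", operator: "less than", value: "2020-01-01" },
];
```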
API Stage
Invokes external APIs to modify, enrich, or analyze content during transformation.
Configuration:
- Endpoint: POST endpoint URL (supports dynamic fields with {{field_name}})
- Headers: Key-value pairs for authentication
- Request Body: Content to send (use {{field_name}} for dynamic values; see the example after the note below)
Note: Only POST APIs and Sync APIs are currently supported.
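A hedged sketch of an API stage configuration, assuming a JSON request body; the endpoint, header names, and document fields are placeholders, not a documented schema.

```javascript
// Hypothetical API stage configuration; {{field_name}} placeholders
// would be resolved from the document being transformed at runtime.
const apiStage = {
  endpoint: "https://example.com/enrich/{{sourceType}}", // POST, sync only
  headers: { Authorization: "Bearer <api-key>" },
  requestBody: {
    title: "{{chunkTitle}}",
    text: "{{chunkText}}",
  },
};
```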
LLM Stage
Leverages external LLMs to refine, update, or enrich content (summarization, readability improvements, context addition).
Prerequisites:
- Set up the required LLM in the Models Library
- Create a custom prompt for 'Transform Documents with LLM'
- Enable the feature on the Gen AI features page
Configuration:
- LLM: Select the language model
- Prompt: Select the custom prompt (an illustrative prompt follows this list)
- Target Field: Field for storing enriched content
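For illustration only, a custom prompt of the kind that might be created for 'Transform Documents with LLM'; the {{chunkText}} placeholder and the output expectations are assumptions, not a documented prompt format.

```javascript
// Hypothetical custom prompt for the LLM stage; the placeholder syntax
// and output contract are illustrative assumptions.
const transformPrompt = `
Rewrite the following passage for readability, preserving all facts,
numbers, and product names. Return only the rewritten text.

Passage:
{{chunkText}}
`;
```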
Stage Availability by Extraction Strategy
| Strategy | Field Mapping | Custom Script | Exclude Documents | API Stage | LLM Stage |
|---|---|---|---|---|---|
| Text Extraction | Yes | Yes | Yes | Yes | Yes |
| Advanced HTML Extraction | Yes | Yes | Yes | Yes | Yes |
| Layout Aware Extraction | NA | NA | NA | NA | NA |
| Markdown Extraction | NA | NA | NA | NA | NA |
| Image-based Document Extraction | NA | NA | NA | NA | NA |
Managing Transformation Stages
Adding a Stage:
- Navigate to Index > Enrich (or the Transform page)
- Click +New Stage
- Select stage type and configure
- Click Save
- Enable/Disable: Use ellipsis menu or status toggle
- Delete: Permanently removes the stage
- Reorder: Drag and drop to change sequence
Workbench
The Workbench is a tool for processing and enhancing ingested content through a series of configurable stages. Each stage performs specific data transformations before passing content to the next stage (a conceptual sketch follows the feature list below).
Key Features
- Pipeline Processing: Sequential stage execution
- Custom Transformation: Customizable stages per business needs
- Simulation Capabilities: Built-in testing before deployment
- Flexible Sequencing: Design efficient processing workflows
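A conceptual sketch of that sequential execution model, assuming each stage exposes a `run` function and an `enabled` flag; this illustrates the idea rather than the product's internals.

```javascript
// Conceptual sketch of Workbench pipeline processing: enabled stages
// run in order, each receiving the previous stage's output.
function runPipeline(chunks, stages) {
  return stages
    .filter((stage) => stage.enabled)             // disabled stages are skipped
    .reduce((data, stage) => stage.run(data), chunks);
}
```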
Supported Stages
| Stage Type | Purpose |
|---|---|
| Field Mapping | Map document fields to target fields based on conditions |
| Custom Script | Run custom scripts on input data |
| Exclude Document | Exclude documents from indexing based on conditions |
| API Stage | Connect with third-party services for dynamic updates |
| LLM Stage | Leverage external LLMs for content enrichment |
Adding Workbench Stages
- Navigate to Index > Enrich
- Click +New Stage
- Select stage type
- Configure:
  - Unique stage name
  - Condition(s) for content selection
  - Outcome(s) defining transformations
- Click Save
Stage Management
Ordering: Data processes through stages sequentially. Use drag-and-drop to reorder.
Enable/Disable: Toggle stages on or off without deleting them. Disabled stages are skipped during processing.
Deleting: Permanently removes the stage from the Workbench.
Workbench Simulator
The Workbench includes a built-in simulator for testing stage behavior before deployment.
Features
- Test and verify individual stage outputs
- Test cumulative effects of multiple stages
- Works with any data source type
Running a Simulation
- Click Simulate button on the Workbench page
- Select stage(s) to test
- Choose number of documents for testing
- View results in JSON format in the Chunk Viewer
Notes:
- The simulator shows changes from all stages in sequence order
- Testing from a specific stage shows cumulative transformations up to that point
- Temporarily disable other stages to test individual stage behavior
- Errors during transformation appear in the simulate_errors object (a hypothetical example follows this list)
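As an assumption about shape only, a simulated chunk in the Chunk Viewer might look roughly like this; apart from the simulate_errors object named above, the JSON schema is illustrative.

```javascript
// Hypothetical simulation output; only the simulate_errors object is
// named in this documentation, the rest of the shape is illustrative.
const simulatedChunk = {
  chunkTitle: "Refund policy",
  chunkText: "Refunds are processed within 5 business days...",
  simulate_errors: {
    "LLM Stage": "Request timed out", // per-stage error message (assumed shape)
  },
};
```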
Tip: Use the Workbench Simulator frequently during configuration to catch errors early.
Content Browser
The Content Browser provides tools to observe, verify, and edit extracted chunks from source data.
Key Capabilities
- Observation and Verification: Inspect and verify extracted chunks for accuracy
- Editing of Chunks: Modify chunk information directly within the browser
Viewing Chunks
Navigate to Index > Browse to view all chunks. Each chunk displays:
- Preview (including images and tables)
- Summary information:
  - Document Title: Source document/page title
  - Chunk Title: Title assigned to the chunk
  - Chunk Text: Content of the chunk
  - Chunk ID: Unique identifier
  - Page Number: Source page (if applicable)
  - Source Title: Name of the source in the application
  - Source: Content type (web pages, files, connectors)
  - Extraction Strategy: Strategy used for extraction
  - Edited on: Last update date (if applicable)
  - Source URL: Source URL (if applicable)
Editing Chunks
- Click the Details icon on a chunk
- Modify title and/or text
- Click Save
Note: When documents are updated, all related chunks are deleted and recreated, losing manual edits.
Search and Filter
Search: Use the search bar to find chunks by properties (chunkTitle, chunkText, source, etc.)
Filter: Advanced search using multiple chunk properties:
- Source types
- Extraction strategy
- Content keywords
- And more
Example filter:
- Set Source Name = "Kore blogs"
- AND chunkText contains "virtual assistant"
Vector Configuration
Vector Configuration manages how extracted and transformed chunks are converted into vector embeddings for efficient semantic search and retrieval (a conceptual retrieval sketch appears after the note below). This includes:
- Embedding Model Selection: Choose appropriate embedding models
- Image-based Embedding: Enable for image-based document extraction
- Indexing Parameters: Configure vector storage settings
Note: When using Image-Based Document Extraction, select an image-based embedding model in Vector Configuration for proper indexing.
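To ground the terminology, here is a conceptual sketch of semantic retrieval over stored vectors. Cosine similarity is a common choice for ranking, but the actual similarity metric, embedding models, and index parameters are whatever is set in Vector Configuration; nothing here reflects the product's implementation.

```javascript
// Conceptual sketch of semantic retrieval: rank chunks by cosine
// similarity between the query vector and each chunk's stored vector.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function retrieve(queryVector, indexedChunks, topK = 5) {
  return indexedChunks
    .map((chunk) => ({ chunk, score: cosine(queryVector, chunk.vector) }))
    .sort((x, y) => y.score - x.score) // highest similarity first
    .slice(0, topK);
}
```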
Best Practices
Extraction Strategy Selection
- Match strategy to content type and structure
- Consider expected query complexity when setting chunk sizes
- Use Layout Aware for documents with tables and charts
Transformation Pipeline Design
- Order stages logically (e.g., exclude before enrichment)
- Use simulation frequently during development
- Keep transformations focused and modular
Content Quality
- Review chunks in Content Browser after processing
- Edit chunks only when necessary (edits lost on content updates)
- Use filters to identify problematic chunks
Testing and Validation
- Simulate all stages before training
- Test with representative sample documents
- Verify cumulative effects of multiple stages