Extraction Strategies Supported
Search AI supports two strategies for processing visual content. The right choice depends on the document type, layout, and the LLMs you have configured.

| Strategy | Description |
|---|---|
| Image-Based Document Extraction (Recommended) | Converts each page into an image and uses visual document retrieval embeddings to capture both textual and visual semantics. |
| Layout-Aware Extraction | Uses OCR and layout-detection models to identify and extract structured elements while respecting document hierarchy. |
Image-Based Document Extraction (Recommended)
This strategy is designed for complex PDFs that contain non-text layouts, such as forms, tables, and visually rich structures, for which standard text extraction is insufficient. It converts each page into an image and uses vision embeddings to capture both textual and visual semantics. The trade-off is that it requires LLM models that support image-to-answer generation (such as GPT-4o) and processes documents as images rather than as granular text chunks. This strategy is supported for PDF files only.
How It Works
- Image Conversion — Each page in the PDF is converted into an image.
- Visual Embedding Generation — A vision embedding model generates embeddings that capture both textual and visual semantics. The default model is VDR; custom vision models can also be configured.
- Query Embedding — When a user submits a query, it is converted into two embeddings:
- A text embedding to search for information from text-based chunks.
- An image embedding to search for information from image-based chunks.
- Retrieval — The system retrieves the top 5 image chunks and top 20 text chunks, then sends them to the LLM for answer generation. For image chunks, the image URLs are also passed to the model, allowing it to access and interpret the images directly when forming the answer.
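The dual-embedding retrieval flow above can be sketched in Python. This is a minimal, illustrative model only: `embed_text`, `embed_image_query`, and the similarity scoring are toy stand-ins, not the actual embedding models or APIs used by the product. Only the retrieval shape (top-5 image chunks plus top-20 text chunks) comes from the document.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the real embedding models: embed_text()
# for the text embedding model, embed_image_query() for the vision
# (VDR-style) embedding model. Both return toy 1-D vectors.
def embed_text(query: str) -> list[float]:
    return [sum(ord(c) for c in query) % 97 / 97.0]

def embed_image_query(query: str) -> list[float]:
    return [sum(ord(c) for c in query) % 89 / 89.0]

@dataclass
class Chunk:
    kind: str        # "text" or "image"
    content: str
    score: float = 0.0

def retrieve(query: str, index: list[Chunk],
             top_images: int = 5, top_texts: int = 20) -> list[Chunk]:
    """Score each chunk against the matching query embedding, then keep
    the top-5 image chunks and top-20 text chunks, as described above."""
    q_text = embed_text(query)
    q_image = embed_image_query(query)
    for c in index:
        q = q_image if c.kind == "image" else q_text
        # Toy similarity: negative absolute distance in 1-D.
        c.score = -abs(q[0] - (len(c.content) % 97) / 97.0)
    images = sorted((c for c in index if c.kind == "image"),
                    key=lambda c: c.score, reverse=True)[:top_images]
    texts = sorted((c for c in index if c.kind == "text"),
                   key=lambda c: c.score, reverse=True)[:top_texts]
    return images + texts  # both sets are sent to the LLM together
```

In the real system, the image chunks' URLs would also be passed to the LLM so it can inspect the images directly when generating the answer.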
Configuration
- Upload Content. This strategy supports PDF content uploaded directly or via connectors.
- Set up an extraction strategy for image-based document extraction.
- Train the application.
- Verify the images extracted as chunks in the Browse section.
- Go to Index Configuration. Under the Image tab, ensure that XO GPT - VDR Embeddings Model is selected. You can use the default model or create a custom one.
- Configure retrieval and answer generation settings. Use one of the following LLMs for answer generation:
- OpenAI: gpt-4o, gpt-4o-mini
- Azure OpenAI: GPT-4o, GPT-4o-mini
Custom LLMs and Amazon Bedrock are not supported for this strategy.
- Test the answers. Go to the Answer Generation page and use the Test Answer widget to verify results. When queries are made against indexed images, the text extracted from the image is included in the generated answer. The complete image can be viewed by clicking the Preview icon next to the references.
Layout-Aware Extraction
Layout-Aware Extraction uses OCR and layout-detection models to identify and extract structured elements while respecting document hierarchy. This method offers flexibility for custom extraction rules and works with any LLM, making it ideal for standardized documents with predictable layouts. However, it may struggle with highly visual or non-standard formatting where context depends on spatial arrangement.
How It Works
- Object Identification — The system combines OCR, layout-detection models, and layout-aware rules to identify document elements.
- Structured Extraction — The model identifies different types of chunks on a page:
- Text chunks — extracted using text extraction.
- Table chunks — converted to HTML structures and stored.
- Image chunks — OCR is used to extract all text from the images.
- Retrieval & Generation — When a query is made, relevant chunks are retrieved and sent to the LLM. If the LLM uses text from image-based chunks in its answer, the source image is also presented to the user.
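The chunk typing described above can be sketched as a simple routing function. This is an illustrative sketch only: `run_ocr` and `table_to_html` are hypothetical stand-ins for the OCR and table-conversion steps, not real product APIs.

```python
from dataclasses import dataclass

@dataclass
class Element:
    kind: str   # "text", "table", or "image", as detected by layout models
    raw: str

def table_to_html(raw: str) -> str:
    # Toy stand-in: one cell per pipe-separated field, one cell per row.
    rows = "".join(f"<tr><td>{cell}</td></tr>" for cell in raw.split("|"))
    return f"<table>{rows}</table>"

def run_ocr(raw: str) -> str:
    # Toy OCR stand-in; a real system would run an OCR model here.
    return raw.upper()

def to_chunk(el: Element) -> dict:
    """Route a detected element into a stored chunk by type."""
    if el.kind == "table":
        # Table chunks are converted to HTML structures and stored.
        return {"type": "table", "content": table_to_html(el.raw)}
    if el.kind == "image":
        # OCR extracts all text from the image; the source image is
        # kept so it can be shown alongside answers that cite it.
        return {"type": "image", "content": run_ocr(el.raw), "source": el.raw}
    # Text chunks are extracted directly.
    return {"type": "text", "content": el.raw}
```

Keeping the source alongside image chunks is what lets the system present the original image to the user when its text contributes to an answer.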
Configuration
- Upload Content.
- Set Extraction Strategy. Select Layout-Aware Extraction.
- Train Application.
- Verify Chunks in the Browse section.
- Configure Retrieval & Generation. All LLM models are supported with this strategy.
- Test Answers. Use the Test Answer widget on the Answer Generation page to verify results.
End User Experience
When an answer is displayed, users can see both the text extracted from the image and references to the image that generated the answer. Click the info icon to preview the source image.
Summary
| Criteria | Image-Based Extraction | Layout-Aware Extraction |
|---|---|---|
| File Support | PDF only | PDF and DOCX |
| Extraction Basis | Visual + Textual semantics | Text + Layout structure |
| Retrieval Type | Image + Text | Text |
| LLM Compatibility | Limited to specified models | Broader support |
| Best Suited For | Visually complex documents including forms, tables, and infographics | Structured documents |