Extraction Strategies Supported
Search AI supports two strategies for processing visual content. The right choice depends on the document type, layout, and the LLMs you have configured.

| Strategy | Description |
|---|---|
| Image-Based Document Extraction (Recommended) | Converts each page into an image and uses visual document retrieval embeddings to capture both textual and visual semantics. |
| Layout-Aware Extraction | Uses OCR and layout-detection models to identify and extract structured elements while respecting document hierarchy. |
Image-Based Document Extraction (Recommended)
This strategy is designed for complex PDFs that contain non-text layouts, such as forms, tables, and visually rich structures, for which standard text extraction is insufficient. It converts each page into an image and uses vision embeddings to capture both textual and visual semantics. The trade-off is that it requires LLM models that support image-to-answer generation (such as GPT-4o) and processes documents as images rather than as granular text chunks. This strategy is supported for PDF files only.
How It Works
- Image Conversion — Each page in the PDF is converted into an image.
- Visual Embedding Generation — A vision embedding model generates embeddings that capture both textual and visual semantics. The default model is VDR; custom vision models can also be configured.
- Query Embedding — When a user submits a query, it is converted into two embeddings:
- A text embedding to search for information from text-based chunks.
- An image embedding to search for information from image-based chunks.
- Retrieval — The system retrieves the top 5 image chunks and top 20 text chunks, then sends them to the LLM for answer generation. For image chunks, the image URLs are also passed to the model, allowing it to access and interpret the images directly when forming the answer.
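The dual-embedding retrieval flow above can be sketched in Python. This is a minimal, illustrative model only: `embed_text`, `embed_image_query`, and the similarity scoring are toy stand-ins, not the actual embedding models or APIs used by the product. Only the retrieval shape (top-5 image chunks plus top-20 text chunks) comes from the document.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the real embedding models: embed_text()
# for the text embedding model, embed_image_query() for the vision
# (VDR-style) embedding model. Both return toy 1-D vectors.
def embed_text(query: str) -> list[float]:
    return [sum(ord(c) for c in query) % 97 / 97.0]

def embed_image_query(query: str) -> list[float]:
    return [sum(ord(c) for c in query) % 89 / 89.0]

@dataclass
class Chunk:
    kind: str        # "text" or "image"
    content: str
    score: float = 0.0

def retrieve(query: str, index: list[Chunk],
             top_images: int = 5, top_texts: int = 20) -> list[Chunk]:
    """Score each chunk against the matching query embedding, then keep
    the top-5 image chunks and top-20 text chunks, as described above."""
    q_text = embed_text(query)
    q_image = embed_image_query(query)
    for c in index:
        q = q_image if c.kind == "image" else q_text
        # Toy similarity: negative absolute distance in 1-D.
        c.score = -abs(q[0] - (len(c.content) % 97) / 97.0)
    images = sorted((c for c in index if c.kind == "image"),
                    key=lambda c: c.score, reverse=True)[:top_images]
    texts = sorted((c for c in index if c.kind == "text"),
                   key=lambda c: c.score, reverse=True)[:top_texts]
    return images + texts  # both sets are sent to the LLM together
```

In the real system, the image chunks' URLs would also be passed to the LLM so it can inspect the images directly when generating the answer.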
Configuration
- Upload Content. This strategy supports PDF content uploaded directly or via connectors.
- Set up an extraction strategy for image-based document extraction.
- Train the application.
- Verify the images extracted as chunks in the Browse section.
- Go to Index Configuration. Under the Image tab, ensure that XO GPT - VDR Embeddings Model is selected. You can use the default model or create a custom one.
- Configure retrieval and answer generation settings. Use one of the following LLMs for answer generation:
- OpenAI: gpt-4o, gpt-4o-mini
- Azure OpenAI: GPT-4o, GPT-4o-mini
Custom LLMs and Amazon Bedrock are not supported for this strategy.
- Test the answers. Go to the Answer Generation page and use the Test Answer widget to verify results. When queries are made against indexed images, the text extracted from the image is included in the generated answer. The complete image can be viewed by clicking the Preview icon next to the references.
Layout-Aware Extraction
Layout-Aware Extraction uses OCR and layout-detection models to identify and extract structured elements while respecting document hierarchy. This method offers flexibility for custom extraction rules and works with any LLM, making it ideal for standardized documents with predictable layouts. However, it may struggle with highly visual or non-standard formatting where context depends on spatial arrangement.
How It Works
- Object Identification — The system combines OCR, layout-detection models, and layout-aware rules to identify document elements.
- Structured Extraction — The model identifies different types of chunks on a page:
- Text chunks — extracted using text extraction.
- Table chunks — converted to HTML structures and stored.
- Image chunks — OCR is used to extract all text from the images.
- Retrieval & Generation — When a query is made, relevant chunks are retrieved and sent to the LLM. If the LLM uses text from image-based chunks in its answer, the source image is also presented to the user.
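The chunk typing described above can be sketched as a simple routing function. This is an illustrative sketch only: `run_ocr` and `table_to_html` are hypothetical stand-ins for the OCR and table-conversion steps, not real product APIs.

```python
from dataclasses import dataclass

@dataclass
class Element:
    kind: str   # "text", "table", or "image", as detected by layout models
    raw: str

def table_to_html(raw: str) -> str:
    # Toy stand-in: one cell per pipe-separated field, one cell per row.
    rows = "".join(f"<tr><td>{cell}</td></tr>" for cell in raw.split("|"))
    return f"<table>{rows}</table>"

def run_ocr(raw: str) -> str:
    # Toy OCR stand-in; a real system would run an OCR model here.
    return raw.upper()

def to_chunk(el: Element) -> dict:
    """Route a detected element into a stored chunk by type."""
    if el.kind == "table":
        # Table chunks are converted to HTML structures and stored.
        return {"type": "table", "content": table_to_html(el.raw)}
    if el.kind == "image":
        # OCR extracts all text from the image; the source image is
        # kept so it can be shown alongside answers that cite it.
        return {"type": "image", "content": run_ocr(el.raw), "source": el.raw}
    # Text chunks are extracted directly.
    return {"type": "text", "content": el.raw}
```

Keeping the source alongside image chunks is what lets the system present the original image to the user when its text contributes to an answer.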
Configuration
- Upload Content.
- Set Extraction Strategy. Select Layout-Aware Extraction.
- Train Application.
- Verify Chunks in the Browse section.
- Configure Retrieval & Generation. All LLM models are supported with this strategy.
- Test Answers. Use the Test Answer widget on the Answer Generation page to verify results.
End User Experience
When an answer is displayed, users can see both the text extracted from the image and references to the image that generated the answer. Click the info icon to preview the source image.
Summary
| Criteria | Image-Based Extraction | Layout-Aware Extraction |
|---|---|---|
| File Support | PDF only | PDF and DOCX |
| Extraction Basis | Visual + Textual semantics | Text + Layout structure |
| Retrieval Type | Image + Text | Text |
| LLM Compatibility | Limited to specified models | Broader support |
| Best Suited For | Visually complex documents including forms, tables, and infographics | Structured documents |