Content Sources enable Search AI to ingest data from websites, documents, and third-party applications, creating a unified knowledge base for generating accurate answers. When content is ingested, Search AI automatically initiates training to create the answer index using configured extraction strategies.
Websites
Crawl and index web pages from any public or authenticated website.
Navigate to Content > Websites to manage web sources.
Adding a Web Crawler
Click +Web Crawl and configure the following:
| Field | Description |
|---|---|
| Source Title | Unique name for the web source |
| Description | Information about the source and its purpose |
| Crawl Source | One of three methods. URL: provide a base URL; the crawler automatically discovers linked pages within the domain. Upload Sitemap (CSV): provide a CSV of sitemaps and a base URL; only URLs matching the base URL are crawled and all others are skipped. Crawl options apply for further filtering. Upload URL (CSV): provide a CSV of specific URLs and a base URL; only matching URLs are crawled. The Crawl Depth and Max URL Limit settings are not required for this method. |
| Crawl Depth | Levels to traverse (1–10, default: 5) |
| Max URL Limit | Maximum URLs to crawl (1–10,000, default: 10) |
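The documented ranges and the base-URL matching used by the CSV methods can be illustrated with a minimal Python sketch. The names `CrawlSettings` and `filter_by_base_url` are hypothetical and not part of Search AI, and the host-plus-path-prefix comparison is only an assumption about what "matching the base URL" means.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class CrawlSettings:
    """Hypothetical container mirroring the fields in the table above."""
    base_url: str
    crawl_depth: int = 5   # documented range 1 to 10, default 5
    max_urls: int = 10     # documented range 1 to 10,000, default 10

    def validate(self) -> None:
        if not 1 <= self.crawl_depth <= 10:
            raise ValueError("crawl_depth must be between 1 and 10")
        if not 1 <= self.max_urls <= 10_000:
            raise ValueError("max_urls must be between 1 and 10,000")


def filter_by_base_url(urls, base_url):
    """Split URLs into (crawled, skipped).

    Assumption: a URL matches when it shares the base URL's host and
    its path starts with the base URL's path.
    """
    base = urlparse(base_url)
    crawled, skipped = [], []
    for url in urls:
        parsed = urlparse(url)
        same_host = parsed.netloc.lower() == base.netloc.lower()
        if same_host and parsed.path.startswith(base.path):
            crawled.append(url)
        else:
            skipped.append(url)
    return crawled, skipped
```

Running a URL CSV through a check like this before uploading can show in advance which rows are likely to end up in the Skipped list.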
Advanced Crawl Options
| Option | Purpose |
|---|---|
| Use Cookies | Crawl pages requiring cookie acceptance |
| JavaScript Rendered | Capture content rendered dynamically via JavaScript. When enabled, set the Crawl Delay — the time in seconds the crawler waits before treating the page as fully rendered and beginning indexing. Recommended for JS-heavy pages. |
| Crawl Beyond Sitemap | Include URLs not listed in the sitemap |
| Respect robots.txt | Honor crawler directives in robots.txt |
| Automatic Cleaning | Remove headers, footers, and head tags before ingestion |
| Retain Original | Ingest complete HTML content for custom transformation via Workbench |
URL Filtering
Control which pages are crawled using rules:
- Crawl everything — All discovered URLs
- Crawl everything except — Block URLs matching specified conditions
- Crawl only specific URLs — Allow only URLs matching specified conditions
Conditions include: equals, not equal, contains, does not contain, begins with, ends with.
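The three rule types can be pictured as an allow list and a block list built from the six conditions. The sketch below is illustrative only: the mode names, the `(condition, value)` rule format, and the choice that a URL matches when any rule matches are assumptions, not documented Search AI behavior.

```python
# Condition names follow the list above; the rule format is hypothetical.
CONDITIONS = {
    "equals": lambda url, value: url == value,
    "not equal": lambda url, value: url != value,
    "contains": lambda url, value: value in url,
    "does not contain": lambda url, value: value not in url,
    "begins with": lambda url, value: url.startswith(value),
    "ends with": lambda url, value: url.endswith(value),
}


def should_crawl(url, mode, rules=()):
    """Decide whether a discovered URL is crawled.

    mode:  "everything", "everything_except" (block list),
           or "only_specific" (allow list).
    rules: iterable of (condition, value) pairs.
    Assumption: a URL matches when ANY rule matches.
    """
    matched = any(CONDITIONS[cond](url, value) for cond, value in rules)
    if mode == "everything":
        return True
    if mode == "everything_except":
        return not matched
    if mode == "only_specific":
        return matched
    raise ValueError(f"unknown mode: {mode}")
```

For example, `should_crawl("https://example.com/blog/post", "everything_except", [("contains", "/blog/")])` returns `False`, which corresponds to blocking every blog page.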
Authentication
Search AI supports two authentication methods for protected websites:
Basic HTTP Authentication
- Username/email and password
- Optional authorization fields (headers, payload, query string, path parameters)
- Authentication URL (may differ from source URL)
- Test type: Text Presence, Redirection, or Status Code
Form-based Authentication
- Form fields with key, type, and value
- Session maintained after initial authentication
- Same test options as Basic Auth
The crawl source type cannot be changed after the crawler is created. All other configuration fields can be updated at any time.
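The three test types (Text Presence, Redirection, Status Code) each check a different property of the response to the authentication request. The following sketch shows one plausible reading of them; the `AuthResponse` type, the test-type strings, and the exact comparison semantics are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AuthResponse:
    """Hypothetical summary of the response to the authentication request."""
    status_code: int
    final_url: str
    body: str


def auth_succeeded(response, test_type, expected):
    """Evaluate one of the three documented test types.

    Assumed semantics:
      "text_presence": the expected text appears in the response body
      "redirection":   the final URL equals the expected URL
      "status_code":   the HTTP status equals the expected code
    """
    if test_type == "text_presence":
        return expected in response.body
    if test_type == "redirection":
        return response.final_url == expected
    if test_type == "status_code":
        return response.status_code == int(expected)
    raise ValueError(f"unknown test type: {test_type}")
```

Text Presence tends to be the most robust choice when a site returns status 200 for both successful and failed logins.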
Managing Web Sources
| Action | How To |
|---|---|
| View crawled pages | Go to source > Pages tab (shows Successful, Failed, Skipped) |
| View execution logs | Go to source > Executions tab |
| View page JSON | Pages tab > click page > View JSON — shows the raw ingested data for the page, useful for debugging what content was indexed |
| Recrawl entire source | Click the Recrawl action or use the Re-Crawl button on the configuration page |
| Recrawl specific page | Pages tab > Actions > Recrawl |
| Schedule recrawl | Set date, time, and frequency in crawl configuration |
| Delete source | Click Delete action (removes all indexed content) |
If a page is manually deleted and the source is recrawled, the page reappears unless the crawl options are updated to exclude it — for example, by adding it to the block list under URL Filtering.
Troubleshooting
| Issue | Potential Causes | Resolution |
|---|---|---|
| Crawl failure | Permissions, credentials, invalid URL, domain not whitelisted | Verify access, credentials, and URL format |
| Crawl succeeds but no indexing | JS-rendered pages, incorrect include/exclude rules, undiscoverable pages | Enable JavaScript Rendered, check rules, verify sitemap |
| Crawl takes too long | Large content volume, JS pages returning no content | Adjust max URLs, enable JavaScript Rendered |
Documents
Upload and index files directly to Search AI. Navigate to Content > Documents to manage uploaded files.
Supported formats: PDF, DOCX, PPT, TXT. Scanned PDFs and password-protected files are not supported.
Upload Options
| Method | Description |
|---|---|
| File | Upload one or more files to an existing directory |
| URL | Upload a file from a remote URL with title and description |
| Directory | Upload a complete folder from local device |
Upload Limits
| Limit | Default Value |
|---|---|
| File size | 15 MB maximum per file |
| Batch upload | 40 files at once |
| Directory upload | 20 files maximum |
Contact support to increase these limits.
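A pre-upload check against the default limits can catch problems before a batch is submitted. This is a minimal sketch: `check_batch` is a hypothetical helper, the limits are the defaults listed above, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
from pathlib import PurePath

MB = 1024 * 1024           # assumption: "MB" is treated as 1024 * 1024 bytes
MAX_FILE_BYTES = 15 * MB   # default per-file limit
MAX_BATCH_FILES = 40       # default batch limit
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".ppt", ".txt"}


def check_batch(files):
    """Return a list of problems for a batch of (filename, size_in_bytes) pairs."""
    problems = []
    if len(files) > MAX_BATCH_FILES:
        problems.append(f"batch has {len(files)} files; limit is {MAX_BATCH_FILES}")
    for name, size in files:
        if PurePath(name).suffix.lower() not in SUPPORTED_SUFFIXES:
            problems.append(f"{name}: unsupported file type")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: exceeds 15 MB")
    return problems
```

An empty list means the batch fits the defaults. The sketch cannot detect scanned or password-protected PDFs, which are also unsupported.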
Managing Documents
- View files: Click any directory to see its files with type and page count
- View file details: Click a file for preview, metadata, and JSON view
- Delete file: Actions > Delete (from directory or file details page)
- Delete directory: Actions > Delete (removes directory and all files)
- Search: Use search bar on Directory page or within directory details
Connectors
Pre-built integrations for 60+ third-party applications. Navigate to Content > Connectors to configure integrations.
Connector Setup Workflow
- Authentication — Provide OAuth credentials, API keys, or tokens
- Ingestion — Select content types and apply filters
- Field Mapping — Align source fields with Search AI schema (optional)
- Permissions — Configure RACL settings
Configuration Options
Content Types: Most connectors support multiple content types (pages, articles, tasks, tickets, documents). Select which types to ingest under the Ingestion section.
Filters: Some connectors support selective ingestion by timeframe, category, user assignment, or other criteria.
Field Mapping: Customize how source fields map to Search AI’s schema using custom scripts for data transformation.
Sync Operations
| Type | Description |
|---|---|
| Manual sync | Click Sync Now on Configuration tab |
| Scheduled sync | Set automatic intervals (files >15MB are skipped) |
| Permission sync | Separate scheduler for RACL updates |
Managing Connectors
- Enable/Disable: Toggle connector without deleting data (disables sync temporarily)
- View content: Content tab lists ingested items with URLs and timestamps
- Remove connector: Connection Settings > Remove this content source (deletes all indexed data)
Custom Connector
For applications without a pre-built connector, the Custom Connector uses REST APIs via a middleware service.
Setup Steps:
- Download the reference implementation from Search AI
- Configure config.json with application details and content fields
- Set authentication in the .env file
- Host the service and configure endpoint in Search AI
- Add headers (Authorization required for default implementation).
- The service processes content in batches of 30 documents.
- RACL support is available via the sys_racl field in the config file. Use the Permission Entity APIs to add the list of permitted users for the content.
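The batch size of 30 mentioned above matters mainly when sizing the middleware's responses. As a rough illustration, assuming the middleware holds documents in an ordinary Python iterable, batching could look like the sketch below; `batched` is a hypothetical helper, not a function from the reference implementation.

```python
from itertools import islice

BATCH_SIZE = 30  # batch size stated in the setup notes above


def batched(documents, size=BATCH_SIZE):
    """Yield successive lists of at most `size` documents."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

With 65 documents this yields batches of 30, 30, and 5.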
JSON Connector
Upload structured data in JSON format for indexing.
- Upload up to 10 files at once
- Maximum 15 MB per file
- Add to existing source or create new
Supported Connectors
Search AI provides out-of-the-box support for ingesting data from a range of third-party repositories. If you want to use a repository not listed below, contact us.
RACL: Auto — permissions synced automatically from source; Manual — RACL available, manual entity management required; No — not supported.
| Connector | Repository | Supported Content | Filtering | RACL |
|---|---|---|---|---|
| Aha! | Cloud | Ideas, Features | No | Auto |
| Airtable | Cloud | Spreadsheets, Databases | No | Manual |
| Amazon S3 | Cloud | .pdf, .ppt, .txt, .docx | No | No |
| Asana | Cloud | Projects, Tasks | No | Auto |
| Axero | Cloud | Pages, Wiki, Discussions, Documents, Articles, Announcements, Blogs | Yes | Auto |
| Azure Storage | Cloud | .txt, .pdf, .rtf, .doc, .docx, .ppt, .pptx | No | Manual |
| Bitbucket | Cloud | Pull Requests | No | Auto |
| Box | Cloud | .pdf, .docx, .txt | No | Auto |
| ClickUp | Cloud | Tasks, Subtasks, Checklists | No | Manual |
| Coda Docs | Cloud | Pages | No | Manual |
| Confluence Cloud | Cloud | Knowledge Articles | Yes | Auto |
| Confluence Server | On-prem | Knowledge Articles | Yes | Auto |
| Custom Connector | Cloud | — | No | Manual |
| Datadog | Cloud | Metrics, Dashboards, Monitors | Yes | Manual |
| DotCMS | Cloud | — | Yes | Manual |
| Dropbox | Cloud | .doc, .docx, .ppt, .pptx, .pdf, .txt, .html | No | Manual |
| Egnyte | Cloud | .doc, .docx, .ppt, .pptx, .pdf, .txt | No | Manual |
| Figma | Cloud | Figma Files | No | No |
| Freshdesk | Cloud | Tickets | No | Auto |
| Freshservice | Cloud | Solution Articles | No | Manual |
| Front | Cloud | Knowledge Base Articles | No | Manual |
| GitHub | Cloud | Issues, README, Pull Requests | Yes | Auto |
| GitHub On-Prem | On-prem | Issues, Pull Requests, Pages, Files, Commits | Yes | Manual |
| GitLab | Cloud | Issues | No | Manual |
| Google Drive | Cloud | .doc, .docx, .ppt, .pptx, .pdf, .txt, .html | Yes | Auto |
| Guru | Cloud | Cards | No | Auto |
| HelpScout | Cloud | Articles | No | Auto |
| Hive | Cloud | Actions, Sub-actions | No | Auto |
| HubSpot | Cloud | Tickets, Deals, Companies, Contacts | No | Auto |
| Jenkins | Cloud | Dashboard, Jobs, Builds, Plugins | No | No |
| JFrog Artifactory | Cloud | Artifacts | No | Auto |
| Jira | Cloud | Issues, Filters, Dashboards | No | Auto |
| Jira On-Prem | On-prem | Work Items | Yes | Auto |
| JSON Connector | — | Structured JSON Data | No | No |
| LumApps ¹ | Cloud | Pages, News, Custom Objects, Community Posts | Yes | Auto |
| Miro | Cloud | Boards | No | Auto |
| Monday | Cloud | Board Items | No | Manual |
| MS Teams | Cloud | Channel Conversations | No | Auto |
| Notion | Cloud | Pages | No | Manual |
| OneDrive | Cloud | .aspx, .doc, .docx, .ppt, .pptx, .html, .txt, .pdf | No | Auto |
| Opsgenie | Cloud | Alerts, Incidents | No | Manual |
| Oracle Knowledge | Cloud | Knowledge Articles | No | No |
| PagerDuty | Cloud | Schedules, Escalation Policies | No | Auto |
| Re:amaze | Cloud | Articles | No | Auto |
| Salesforce | Cloud | Knowledge Articles | Yes | No |
| ServiceNow | Cloud | Incidents, Service Catalog, Knowledge Articles | Yes | Auto |
| SharePoint | Cloud | .aspx, .doc, .docx, .ppt, .pptx, .html, .txt, .pdf | Yes | Auto |
| Shopify | Cloud | Articles, Product Catalog | No | Manual |
| Shortcut | Cloud | Stories | No | Auto |
| Slab | Cloud | Posts | No | Manual |
| Slack | Cloud | Messages | No | Auto |
| Teamwork | Cloud | Tasks | No | Manual |
| TestRail | Cloud | Test Cases | No | Manual |
| Trello | Cloud | Boards, Cards | No | Auto |
| WordPress | Cloud | Pages, Posts | No | Manual |
| Workday | Cloud | HR Org Charts, Employee Details | No | Auto |
| Wrike | Cloud | Tasks | No | Manual |
| xMatters | Cloud | Incidents, Events, On-Calls | No | Auto |
| YouTrack | Cloud | Projects, Issues, Knowledge Articles | No | Auto |
| Zendesk | Cloud | Knowledge Articles, Tickets | No | Auto |
| Zeplin | Cloud | Screens | No | Auto |
| Zoho | Cloud | Leads, Accounts, Contacts, Deals | No | Manual |
| Zoom | Cloud | Meeting Summaries | Yes | Auto |
| Zulip | Cloud | Messages | No | Auto |
¹ LumApps supports attachment ingestion.
For any connectors not listed above, use the Custom Connector or contact Kore.ai.
Role-based Access Control (RACL)
RACL ensures users only see content they’re authorized to access by synchronizing permissions from source applications.
For example, if User A has access to policy documents and User B can only view FAQs, and both query a policy description — User A receives an answer sourced from the policy document, while User B receives an answer sourced from the FAQs. The response content differs based on each user’s access permissions.
RACL in Search AI
Ingest Content → Store Permissions in sys_racl → User Submits Query → Match User Identity to sys_racl → Return Answers from Accessible Content Only
RACL is supported for connector-sourced content only. It does not apply to websites or uploaded documents.
How RACL Works
Search AI enforces access control in three steps:
Step 1 — Retrieve and Store Permissions
During ingestion, the connector fetches permission data from the source application along with the content. This access information is stored in the sys_racl field of every indexed chunk.
Search AI handles three types of permissions:
| Type | Description | sys_racl Value |
|---|---|---|
| Individual | The content specifies user identities directly (e.g. user@example.com) | User email addresses or IDs |
| Group / User Criteria | The content specifies a group or user criterion (e.g. devteam@example.com) | A Permission Entity ID representing the group |
| Public | No specific permissions; accessible to all users | * |
When group permissions are involved, Search AI creates a Permission Entity for each group and stores it in sys_racl. Individual users must then be associated with that Permission Entity — either automatically by the connector or manually via API.
Example — Group Permission
Consider a Google Drive file shared with two individuals and one group:
john.doe@example.com (individual)
smitha.joseph@example.com (individual)
testteam@example.com (group)
Search AI stores the two user identities directly and creates a Permission Entity for the group. The sys_racl field for the indexed chunks would look like this:
"sys_racl": [
"john.doe@example.com",
"smitha.joseph@example.com",
"testteam@example.com"
]
At query time, john.doe and smitha.joseph are matched directly. For testteam@example.com, Search AI must resolve which individual users belong to that group — either automatically (for supported connectors) or via the Permission Entity APIs.
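The way the share list above turns into a sys_racl value can be sketched in Python. `build_sys_racl` is a hypothetical function, and it assumes, as the example suggests, that the Permission Entity ID for a group is the group's address and that content with no specific permissions is stored as `*`.

```python
def build_sys_racl(shares):
    """Build a sys_racl list from (identity, kind) pairs.

    kind is "user" or "group". Returns (sys_racl, entities), where
    entities is the set of groups that need a Permission Entity.
    Assumption: an empty share list means the content is public.
    """
    if not shares:
        return ["*"], set()
    sys_racl, entities = [], set()
    for identity, kind in shares:
        sys_racl.append(identity)
        if kind == "group":
            entities.add(identity)  # members must be resolved later
    return sys_racl, entities


shares = [
    ("john.doe@example.com", "user"),
    ("smitha.joseph@example.com", "user"),
    ("testteam@example.com", "group"),
]
sys_racl, entities = build_sys_racl(shares)
```

Here `sys_racl` holds the same three values as the stored field shown above, and `entities` contains only `testteam@example.com`, the one identity whose members still have to be resolved.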
To verify stored permissions:
- Go to Content and open the JSON view for any file.
- Check the sys_racl field to confirm what identities or entities are stored.
- The same information is available per chunk in the Content Browser.
Step 2 — Verify User Identity at Query Time
When a user submits a query, Search AI matches their identity against the sys_racl field of indexed chunks. Only chunks where the user’s identity appears — directly or through a Permission Entity — are eligible to generate answers.
Access control rule: Only users whose identities appear in the sys_racl field — directly or through a resolved Permission Entity — can view answers generated from that content.
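The access control rule amounts to a set-membership test per chunk. The sketch below is a simplified model, not the Search AI implementation: chunks are plain dictionaries, `visible_chunks` is a hypothetical name, and the caller is assumed to supply the user's identity together with any Permission Entity IDs already resolved for that user.

```python
def visible_chunks(chunks, user_identities):
    """Return the chunks a user may see.

    A chunk is visible when its sys_racl contains "*" (public) or at
    least one of the user's identities (own ID or resolved entity IDs).
    """
    identities = set(user_identities)
    return [
        chunk for chunk in chunks
        if "*" in chunk["sys_racl"] or identities & set(chunk["sys_racl"])
    ]


chunks = [
    {"id": "policy", "sys_racl": ["john.doe@example.com"]},
    {"id": "faq", "sys_racl": ["*"]},
    {"id": "team-notes", "sys_racl": ["testteam@example.com"]},
]
```

With these chunks, a user who presents only `john.doe@example.com` sees `policy` and `faq`, while a user with no matching identity sees only `faq`.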
Passing user identity in API requests
When using the Search API directly, user identity must be included in each request. If it is not passed, restricted content will not be returned.
User identity can be passed via the request header or request body, depending on how the identity field is configured by your system admin. Contact your system admin to confirm the field name and format.
Example:
curl --request POST \
'https://<domain>/api/public/stream/<streamId>/advancedSearch?useremail=user@example.com' \
--header 'Auth: <token>' \
--data-raw '{"query": "leave policy"}'
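The same request can be assembled in Python with the standard library. This sketch mirrors the curl example: the `useremail` query parameter and the `Auth` header are taken from it, while the `Content-Type` header and the function name `build_search_request` are assumptions. The actual identity field depends on how your system admin configured it.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_search_request(domain, stream_id, token, user_email, query):
    """Build (but do not send) a POST request like the curl example."""
    url = (
        f"https://{domain}/api/public/stream/"
        f"{quote(stream_id, safe='')}/advancedSearch?"
        + urlencode({"useremail": user_email})
    )
    body = json.dumps({"query": query}).encode("utf-8")
    headers = {
        "Auth": token,
        # Assumption: the API expects a JSON body to be declared as such.
        "Content-Type": "application/json",
    }
    return Request(url, data=body, headers=headers, method="POST")


# Placeholder values; replace them with your own domain, stream ID, and token.
request = build_search_request(
    "example.invalid", "st-123", "<token>", "user@example.com", "leave policy"
)
```

Passing `request` to `urllib.request.urlopen` would send it. Building the request separately makes it easy to confirm that the user identity is present before any call is made.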
Passing user identity via SDK
When using the Search AI SDK, embed the user identity in the SDK configuration so access checks are applied automatically on every search request.
Step 3 — Resolve Group Identities
Group-level permissions store a Permission Entity in sys_racl rather than individual user identities. To enforce access at query time, Search AI must resolve which individual users belong to each Permission Entity.
This resolution works in two ways:
Automatic Resolution — For supported connectors, Search AI automatically fetches group membership data from the source and maintains up-to-date user-to-group mappings. No manual action is required.
See the RACL column of the Supported Connectors table for the connectors that support automatic resolution of permission entities (marked Auto).
Manual Resolution — For connectors without automatic resolution, use the Permission Entity APIs to manually associate users with each Permission Entity. Refer to individual connector documentation to confirm which method applies.
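Whichever method populates the mapping, resolution comes down to expanding a user's identity into the set of Permission Entities that user belongs to. The sketch below models the mapping as an in-memory dictionary; it does not call the Permission Entity APIs, and the function names are hypothetical.

```python
def effective_identities(user, memberships):
    """Return the user's own identity plus every entity they belong to.

    memberships maps a Permission Entity ID to the set of user
    identities associated with it (fetched automatically or added
    manually).
    """
    entities = {entity for entity, members in memberships.items() if user in members}
    return {user} | entities


def can_access(user, sys_racl, memberships):
    """True when the content is public or shares an identity with the user."""
    allowed = set(sys_racl)
    return "*" in allowed or bool(effective_identities(user, memberships) & allowed)


memberships = {"testteam@example.com": {"asha@example.com", "lee@example.com"}}
```

With this mapping, `asha@example.com` can access content whose sys_racl lists only `testteam@example.com`, while a user outside the group cannot.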
Enabling RACL
- Go to the Permissions page of the connector.
- Select Restricted Access.
If the connector supports automatic permission entity resolution, an additional checkbox is available to enable it.
To disable RACL and make all content publicly accessible, select Public Access. This disables RACL automatically — no sync is required.
Updating Permissions After Configuration Changes
| Change | What To Do |
|---|---|
| Public → Restricted Access | Manually trigger a sync after enabling RACL to fetch and apply the latest permission data. Scheduled syncs will apply permissions at the next run. |
| Restricted → Public Access | RACL is disabled automatically. No sync is required. All the content is publicly available. |
RACL Scheduler
Access permissions often change more frequently than content. Configure a separate permission sync schedule to keep RACL data up to date without triggering a full content re-sync.
- Open the Permissions section of the connector.
- Enable Permissions Sync Scheduled.
- Set the desired time and frequency.
The Sync Scope column shows Permission Sync for RACL-specific operations, keeping it distinct from content syncs for easier traceability.
If the RACL scheduler is disabled, permissions are updated as part of the regular content sync.
RACL Limitations
- Supported for connector content only — not for websites or uploaded documents.
- Switching from Restricted to Public Access automatically disables RACL.
- For connectors without automatic resolution, manual Permission Entity management via API is required.
For complex RACL requirements or custom access control implementations, contact us for expert guidance.
Quick Reference
| Task | Location |
|---|---|
| Add website | Content > Websites > +Web Crawl |
| Upload files | Content > Documents > Upload |
| Add connector | Content > Connectors > Select connector |
| View indexed content | Index > Content Browser |
| Configure extraction | Index > Extraction |
| Test answers | Configuration > Testing |