Ingest Data from Content Sources

Content Sources enable Search AI to ingest data from websites, documents, and third-party applications, creating a unified knowledge base for generating accurate answers. When content is ingested, Search AI automatically initiates training to create the answer index using configured extraction strategies.

Websites

Crawl and index web pages from any public or authenticated website. Navigate to Content > Websites to manage web sources.

Adding a Web Crawler

Click +Web Crawl and configure the following:

Field	Description
Source Title	Unique name for the web source
Description	Information about the source and its purpose
Crawl Source	URL — provide a base URL; the crawler automatically discovers linked pages within the domain. Upload Sitemap (CSV) — provide a CSV of sitemaps and a base URL; only URLs matching the base URL are crawled, others are skipped. Crawl options apply for further filtering. Upload URL (CSV) — provide a CSV of specific URLs and a base URL; only matching URLs are crawled. Crawl depth and max URL settings are not required for this method.
Crawl Depth	Levels to traverse (1-10, default: 5)
Max URL Limit	Maximum URLs to crawl (1-10,000, default: 10)
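How Crawl Depth and Max URL Limit bound a crawl can be illustrated with a small sketch. It is not Search AI's crawler: the `crawl` function, the in-memory `site` link graph, counting the base URL as depth level 1, and treating "matching the base URL" as a simple prefix check are all assumptions made for illustration.

```python
from collections import deque


def crawl(base_url, links, max_depth=5, max_urls=10):
    """Breadth-first walk over an in-memory link graph.

    `links` maps a URL to the URLs it links to (a stand-in for fetching pages).
    Returns the crawled URLs in visit order.
    """
    crawled = []
    seen = {base_url}
    queue = deque([(base_url, 1)])  # assumption: the base URL is depth level 1
    while queue and len(crawled) < max_urls:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in links.get(url, []):
            # Assumption: a URL "matches" the base URL when it starts with it.
            if target.startswith(base_url) and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return crawled


# Hypothetical site used only for this example.
site = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/x",
    ],
    "https://example.com/a": ["https://example.com/a/1"],
    "https://example.com/a/1": ["https://example.com/a/1/deep"],
}
```

With `max_depth=2` the walk stops at the pages linked directly from the base URL, and with `max_urls=2` it stops after two pages regardless of depth. The link to `other.org` is skipped in both cases because it falls outside the base URL.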

Advanced Crawl Options

Option	Purpose
Use Cookies	Crawl pages requiring cookie acceptance
JavaScript Rendered	Capture content rendered dynamically via JavaScript. When enabled, set the Crawl Delay — the time in seconds the crawler waits before treating the page as fully rendered and beginning indexing. Recommended for JS-heavy pages.
Crawl Beyond Sitemap	Include URLs not listed in the sitemap
Respect robots.txt	Honor crawler directives in robots.txt
Automatic Cleaning	Remove headers, footers, and head tags before ingestion
Retain Original	Ingest complete HTML content for custom transformation via Workbench

URL Filtering

Control which pages are crawled using rules:

Crawl everything — All discovered URLs
Crawl everything except — Block URLs matching specified conditions
Crawl only specific URLs — Allow only URLs matching specified conditions

Conditions include: equals, not equal, contains, does not contain, begins with, ends with.
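The three rule modes and six conditions can be read as a simple predicate over each discovered URL. The sketch below is one possible interpretation, not the product's implementation: the mode names (`everything`, `everything_except`, `only_specific`), the `should_crawl` function, and the choice to combine multiple rules with OR are assumptions.

```python
# Each condition name from the list above, mapped to a test on (url, value).
CONDITIONS = {
    "equals": lambda url, value: url == value,
    "not equal": lambda url, value: url != value,
    "contains": lambda url, value: value in url,
    "does not contain": lambda url, value: value not in url,
    "begins with": lambda url, value: url.startswith(value),
    "ends with": lambda url, value: url.endswith(value),
}


def matches_any(url, rules):
    """True if the URL satisfies at least one (condition, value) rule."""
    return any(CONDITIONS[condition](url, value) for condition, value in rules)


def should_crawl(url, mode, rules=()):
    """Decide whether a discovered URL is crawled under the given mode."""
    if mode == "everything":
        return True
    if mode == "everything_except":  # block list
        return not matches_any(url, rules)
    if mode == "only_specific":  # allow list
        return matches_any(url, rules)
    raise ValueError(f"unknown mode: {mode}")
```

For example, `should_crawl("https://example.com/blog/post", "everything_except", [("contains", "/blog/")])` evaluates to `False`, which is the behavior a block-list rule is meant to produce.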

Authentication

Search AI supports two authentication methods for protected websites:

Basic HTTP Authentication

Username/email and password
Optional authorization fields (headers, payload, query string, path parameters)
Authentication URL (may differ from source URL)
Test type: Text Presence, Redirection, or Status Code

Form-based Authentication

Form fields with key, type, and value
Session maintained after initial authentication
Same test options as Basic Auth
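This section does not define the three test types further. The sketch below shows one plausible reading of how an authentication attempt could be judged successful: Text Presence looks for an expected string in the response body, Redirection compares the final URL, and Status Code compares the HTTP status. The `AuthResponse` class, the `auth_succeeded` function, and the test-type identifiers are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class AuthResponse:
    """Minimal stand-in for the response to an authentication request."""
    status_code: int
    final_url: str
    body: str


def auth_succeeded(response, test_type, expected):
    """Apply one of the three test types to an authentication response."""
    if test_type == "text_presence":
        # Assumption: success means the expected text appears in the page.
        return expected in response.body
    if test_type == "redirection":
        # Assumption: success means the login redirected to the expected URL.
        return response.final_url == expected
    if test_type == "status_code":
        return response.status_code == int(expected)
    raise ValueError(f"unknown test type: {test_type}")
```

A response with status 200 whose body contains "Welcome back" would pass a Text Presence test for that string and a Status Code test for 200, and fail a Status Code test for 401.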

The crawl source type cannot be changed after the crawler is created. All other configuration fields can be updated at any time.

Managing Web Sources

Action	How To
View crawled pages	Go to source > Pages tab (shows Successful, Failed, Skipped)
View execution logs	Go to source > Executions tab
View page JSON	Pages tab > click page > View JSON — shows the raw ingested data for the page, useful for debugging what content was indexed
Recrawl entire source	Click Recrawl action or use Re-Crawl button on configuration page
Recrawl specific page	Pages tab > Actions > Recrawl
Schedule recrawl	Set date, time, and frequency in crawl configuration
Delete source	Click Delete action (removes all indexed content)

If a page is manually deleted and the source is recrawled, the page reappears unless the crawl options are updated to exclude it — for example, by adding it to the block list under URL Filtering.

Troubleshooting

Issue	Potential Causes	Resolution
Crawl failure	Permissions, credentials, invalid URL, domain not whitelisted	Verify access, credentials, and URL format
Crawl succeeds but no indexing	JS-rendered pages, incorrect include/exclude rules, undiscoverable pages	Enable JavaScript Rendered, check rules, verify sitemap
Crawl takes too long	Large content volume, JS pages returning no content	Adjust max URLs, enable JavaScript Rendered

Documents

Upload and index files directly to Search AI. Navigate to Content > Documents to manage uploaded files.

Supported Formats

PDF, DOCX, PPT, TXT, XLSX. Scanned PDFs and password-protected files are not supported.

Upload Options

Method	Description
File	Upload one or more files to an existing directory
URL	Upload a file from a remote URL with title and description
Directory	Upload a complete folder from local device

Upload Limits

Limit	Default Value
File size	15 MB maximum per file
Batch upload	40 files at once
Directory upload	20 files maximum

Contact support to increase these limits.
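A pre-flight check against the default limits can catch rejected uploads before they are attempted. The sketch below is a local helper, not part of Search AI: `check_upload` is a hypothetical name, 15 MB is assumed to mean 15 × 1024 × 1024 bytes, and the extension list is taken from Supported Formats.

```python
import os

MAX_FILE_BYTES = 15 * 1024 * 1024  # assumption: 15 MB counted in binary megabytes
MAX_BATCH_FILES = 40
MAX_DIRECTORY_FILES = 20
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".ppt", ".txt", ".xlsx"}


def check_upload(files, directory_upload=False):
    """Return a list of problems for `files`, given as (name, size_in_bytes) pairs."""
    problems = []
    limit = MAX_DIRECTORY_FILES if directory_upload else MAX_BATCH_FILES
    if len(files) > limit:
        problems.append(f"too many files: {len(files)} > {limit}")
    for name, size in files:
        extension = os.path.splitext(name)[1].lower()
        if extension not in SUPPORTED_EXTENSIONS:
            problems.append(f"{name}: unsupported format")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: larger than 15 MB")
    return problems
```

An empty list means the batch fits the default limits. If support has raised the limits for your account, adjust the constants accordingly.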

Managing Documents

View files: Click any directory to see its files with type and page count
View file details: Click a file for preview, metadata, and JSON view
Delete file: Actions > Delete (from directory or file details page)
Delete directory: Actions > Delete (removes directory and all files)
Search: Use search bar on Directory page or within directory details

For efficient ingestion of content from spreadsheets, see Ingestion from Spreadsheets.

Connectors

Pre-built integrations for 60+ third-party applications. Navigate to Content > Connectors to configure integrations.

Connector Setup Workflow

Authentication — Provide OAuth credentials, API keys, or tokens
Ingestion — Select content types and apply filters
Field Mapping — Align source fields with Search AI schema (optional)
Permissions — Configure RACL settings

Configuration Options

Content Types: Most connectors support multiple content types (pages, articles, tasks, tickets, documents). Select which types to ingest under the Ingestion section.
Filters: Some connectors support selective ingestion by timeframe, category, user assignment, or other criteria.
Field Mapping: Customize how source fields map to Search AI’s schema using custom scripts for data transformation.

Sync Operations

Type	Description
Manual sync	Click Sync Now on Configuration tab
Scheduled sync	Set automatic intervals (files >15MB are skipped)
Permission sync	Separate scheduler for RACL updates

Managing Connectors

Enable/Disable: Toggle connector without deleting data (disables sync temporarily)
View content: Content tab lists ingested items with URLs and timestamps
Remove connector: Connection Settings > Remove this content source (deletes all indexed data)

Custom Connector

For applications without a pre-built connector, the Custom Connector uses REST APIs via a middleware service.

Setup Steps:

Download the reference implementation from Search AI
Configure config.json with application details and content fields
Set authentication in .env file
Host the service and configure endpoint in Search AI
Add headers (Authorization required for default implementation).

The service processes content in batches of 30 documents.
RACL support is available via the sys_racl field in the config file. Use the Permission Entity APIs to add the list of permitted users for the content.
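The batch size of 30 determines how many requests a full ingestion takes. A minimal sketch of that batching follows; the `batches` helper and the `documents` list are illustrative and are not taken from the reference implementation.

```python
def batches(items, size=30):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 65 placeholder documents are split into batches of 30, 30, and 5.
documents = [{"id": i} for i in range(65)]
batch_sizes = [len(batch) for batch in batches(documents)]
```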

JSON Connector

Upload structured data in JSON format for indexing.

Upload up to 10 files at once
Maximum 15 MB per file
Add to existing source or create new

Supported Connectors

Search AI provides out-of-the-box support for ingesting data from a range of third-party repositories. If you want to use a repository not listed below, contact us.

RACL column values: Auto (permissions synced automatically from source); Manual (RACL available, manual entity management required); No (not supported).

Connector	Repository	Supported Content	Filtering	RACL
Aha!	Cloud	Ideas, Features	No	Auto
Airtable	Cloud	Spreadsheets, Databases	No	Manual
Amazon S3	Cloud	.pdf, .ppt, .txt, .docx	No	No
Asana	Cloud	Projects, Tasks	No	Auto
Axero	Cloud	Pages, Wiki, Discussions, Documents, Articles, Announcements, Blogs	Yes	Auto
Azure Storage	Cloud	.txt, .pdf, .rtf, .doc, .docx, .ppt, .pptx	No	Manual
BitBucket	Cloud	Pull Requests	No	Auto
Bigtincan	Cloud	Training courses	Yes	Yes
Box	Cloud	.pdf, .docx, .txt	No	Auto
Clickup	Cloud	Tasks, Subtasks, Checklists	No	Manual
Coda Docs	Cloud	Pages	No	Manual
Confluence Cloud	Cloud	Knowledge Articles	Yes	Auto
Confluence Server	On-prem	Knowledge Articles	Yes	Auto
Custom Connector	Cloud	—	No	Manual
Datadog	Cloud	Metrics, Dashboards, Monitors	Yes	Manual
DotCMS	Cloud	—	Yes	Manual
Dropbox	Cloud	.doc, .docx, .ppt, .pptx, .pdf, .txt, .html	No	Manual
Egnyte	Cloud	.doc, .docx, .ppt, .pptx, .pdf, .txt	No	Manual
Figma	Cloud	Figma Files	No	No
FreshDesk	Cloud	Tickets	No	Auto
Freshservice	Cloud	Solution Articles	No	Manual
Front	Cloud	Knowledge Base Articles	No	Manual
GitHub	Cloud	Issues, pull requests, and files (.md, .mdx, .mdoc, .mdc, .rst, .adoc, .asciidoc, .txt, .xlsx, .xls, .html)	Yes	Auto
GitHub On-Prem	On-prem	Issues, pull requests, and files (.md, .mdx, .mdoc, .mdc, .rst, .adoc, .asciidoc, .txt, .xlsx, .xls, .html)	Yes	Manual
GitLab	Cloud	Issues	No	Manual
Google Drive	Cloud	.doc, .docx, .ppt, .pptx, .pdf, .txt, .html	Yes	Auto
Guru	Cloud	Cards	No	Auto
HelpScout	Cloud	Articles	No	Auto
Hive	Cloud	Actions, Sub-actions	No	Auto
HubSpot	Cloud	Tickets, Deals, Companies, Contacts	No	Auto
Invision Community	Cloud/On-prem	Forum topics (threads) and posts	Yes	Yes
Jenkins	Cloud	Dashboard, Jobs, Builds, Plugins	No	No
JFrog Artifactory	Cloud	Artifacts	No	Auto
Jira	Cloud	Issues, Filters, Dashboards	No	Auto
Jira On-Prem	On-prem	Work Items	Yes	Auto
JSON Connector	—	Structured JSON Data	No	No
LumApps ¹	Cloud	Pages, News, Custom Objects, Community Posts	Yes	Auto
Miro	Cloud	Boards	No	Auto
Monday	Cloud	Board Items	No	Manual
MS Teams	Cloud	Channel Conversations	No	Auto
Notion	Cloud	Pages	No	Manual
OneDrive	Cloud	.aspx, .doc, .docx, .ppt, .pptx, .html, .txt, .pdf	No	Auto
Opsgenie	Cloud	Alerts, Incidents	No	Manual
Oracle Knowledge	Cloud	Knowledge Articles	No	No
PagerDuty	Cloud	Schedules, Escalation Policies	No	Auto
Re:amaze	Cloud	Articles	No	Auto
Salesforce	Cloud	Knowledge Articles	Yes	No
ServiceNow	Cloud	Incidents, Service Catalog, Knowledge Articles	Yes	Auto
SharePoint	Cloud	.aspx, .doc, .docx, .ppt, .pptx, .html, .txt, .pdf	Yes	Auto
Shopify	Cloud	Articles, Product Catalog	No	Manual
Shortcut	Cloud	Stories	No	Auto
Slab	Cloud	Posts	No	Manual
Slack	Cloud	Messages	No	Auto
Teamwork	Cloud	Tasks	No	Manual
TestRail	Cloud	Test Cases	No	Manual
Trello	Cloud	Boards, Cards	No	Auto
WordPress	Cloud	Pages, Posts	No	Manual
Workday	Cloud	HR Org Charts, Employee Details	No	Auto
Wolken Service Desk	Cloud	Knowledge Base Articles and Cases/Tickets	Yes	Yes
Wrike	Cloud	Tasks	No	Manual
xMatters	Cloud	Incidents, Events, On-Calls	No	Auto
YouTrack	Cloud	Projects, Issues, Knowledge Articles	No	Auto
YouTube	Cloud	Transcript	Yes	No
Zendesk	Cloud	Knowledge Articles, Tickets	No	Auto
Zeplin	Cloud	Screens	No	Auto
Zoho	Cloud	Leads, Accounts, Contacts, Deals	No	Manual
Zoom	Cloud	Meeting Summaries	Yes	Auto
Zulip	Cloud	Messages	No	Auto

¹ LumApps supports attachment ingestion. For any connectors not listed above, use the Custom Connector or contact Kore.ai.

Role-based Access Control (RACL)

RACL ensures users only see content they’re authorized to access by synchronizing permissions from source applications. For example, if User A has access to policy documents and User B can only view FAQs, and both query a policy description — User A receives an answer sourced from the policy document, while User B receives an answer sourced from the FAQs. The response content differs based on each user’s access permissions.

RACL in Search AI

Ingest Content → Store Permissions in sys_racl → User Submits Query → Match User Identity to sys_racl → Return Answers from Accessible Content Only

RACL is supported for connector-sourced content only. It does not apply to websites or uploaded documents.

How RACL Works

Search AI enforces access control in three steps:

Step 1: Retrieve and Store Permissions

During ingestion, the connector fetches permission data from the source application along with the content. This access information is stored in the sys_racl field of every indexed chunk. Search AI handles the following permission types:

Type	Description	sys_racl Value
Individual	The content specifies user identities directly (e.g. `john.doe@example.com`)	User email addresses or IDs
Group / User Criteria	The content specifies a group or user criterion (e.g. `devteam@example.com`)	A Permission Entity ID representing the group
Public	No specific permissions; accessible to all users	`*`

When group permissions are involved, Search AI creates a Permission Entity for each group and stores it in sys_racl. Individual users must then be associated with that Permission Entity, either automatically by the connector or manually via API.

Example: Group Permission

Consider a Google Drive file shared with two individuals and one group:

john.doe@example.com (individual)
jane.doe@example.com (individual)
testteam@example.com (group)

Search AI stores the two user identities directly and creates a Permission Entity for the group. The sys_racl field for the indexed chunks would look like this:

"sys_racl": [
  "john.doe@example.com",
  "jane.doe@example.com",
  "testteam@example.com"
]

At query time, john.doe and jane.doe are matched directly. For testteam@example.com, Search AI must resolve which individual users belong to that group, either automatically (for supported connectors) or via the Permission Entity APIs.

To verify stored permissions:

Go to Content and open the JSON view for any file.
Check the sys_racl field to confirm what identities or entities are stored.
The same information is available per chunk in the Content Browser.

Step 2: Verify User Identity at Query Time

When a user submits a query, Search AI matches their identity against the sys_racl field of indexed chunks. Only chunks where the user’s identity appears, directly or through a Permission Entity, are eligible to generate answers.

Access control rule: Only users whose identities appear in the sys_racl field — directly or through a resolved Permission Entity — can view answers generated from that content.
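The access control rule can be expressed as a small filter over indexed chunks. This sketch is a simplified model, not Search AI's internal logic: `can_access`, `eligible_chunks`, and the `entity_members` mapping (Permission Entity ID to member identities) are hypothetical names, and the sample data reuses the Google Drive example above with one invented group member.

```python
PUBLIC = "*"  # sys_racl value for content accessible to all users


def can_access(user, sys_racl, entity_members):
    """True if `user` appears in sys_racl directly, via an entity, or content is public."""
    for entry in sys_racl:
        if entry == PUBLIC or entry == user:
            return True
        # Treat the entry as a Permission Entity ID and check its members.
        if user in entity_members.get(entry, set()):
            return True
    return False


def eligible_chunks(user, chunks, entity_members):
    """Keep only the chunks the user may receive answers from."""
    return [c for c in chunks if can_access(user, c["sys_racl"], entity_members)]


chunks = [
    {
        "id": "policy",
        "sys_racl": [
            "john.doe@example.com",
            "jane.doe@example.com",
            "testteam@example.com",
        ],
    },
    {"id": "faq", "sys_racl": ["*"]},
]
# Hypothetical membership data: sam belongs to the testteam group.
entity_members = {"testteam@example.com": {"sam@example.com"}}
```

Here john.doe is matched directly, sam is matched through the group's Permission Entity, and any other user is limited to the public `faq` chunk.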

Passing user identity in API requests

When using the Search API directly, user identity must be included in each request. If it is not passed, restricted content will not be returned. User identity can be passed via the request header or request body, depending on how the identity field is configured by your system admin. Contact your system admin to confirm the field name and format.

Example:

curl --request POST \
  'https://<domain>/api/public/stream/<streamId>/advancedSearch?useremail=user@example.com' \
  --header 'Auth: <token>' \
  --data-raw '{"query": "leave policy"}'
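The same request can be assembled in Python with the standard library. This is a sketch mirroring the curl command above, not an official client: `build_search_request` is a hypothetical helper, the `useremail` parameter name is copied from the example and may differ in your configuration, and the `Content-Type` header is an assumption that the endpoint expects a JSON body.

```python
import json
import urllib.parse
import urllib.request


def build_search_request(domain, stream_id, token, user_email, query):
    """Build (but do not send) a search request that carries the user identity."""
    params = urllib.parse.urlencode({"useremail": user_email})
    url = f"https://{domain}/api/public/stream/{stream_id}/advancedSearch?{params}"
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        # "Auth" comes from the curl example; Content-Type is an assumption.
        headers={"Auth": token, "Content-Type": "application/json"},
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Building the URL with `urlencode` ensures characters such as `@` in the email address are percent-encoded.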

Passing user identity via SDK

When using the Search AI SDK, embed the user identity in the SDK configuration so access checks are applied automatically on every search request.

Step 3: Resolve Group Identities

Group-level permissions store a Permission Entity in sys_racl rather than individual user identities. To enforce access at query time, Search AI must resolve which individual users belong to each Permission Entity. This resolution works in two ways:

Automatic Resolution: For supported connectors, Search AI automatically fetches group membership data from the source and maintains up-to-date user-to-group mappings. No manual action is required. See the RACL column of the Supported Connectors table for connectors whose permissions are synced automatically.
Manual Resolution: For connectors without automatic resolution, use the Permission Entity APIs to manually associate users with each Permission Entity.

Refer to individual connector documentation to confirm which method applies.

Enabling RACL

Go to the Permissions page of the connector.
Select Restricted Access.

If the connector supports automatic permission entity resolution, an additional checkbox is available to enable it.

To disable RACL and make all content publicly accessible, select Public Access. This disables RACL automatically; no sync is required.

Updating Permissions After Configuration Changes

Change	What To Do
Public → Restricted Access	Manually trigger a sync after enabling RACL to fetch and apply the latest permission data. Scheduled syncs will apply permissions at the next run.
Restricted → Public Access	RACL is disabled automatically. No sync is required. All the content is publicly available.

RACL Scheduler

Access permissions change more frequently than content. Configure a separate permission sync schedule to keep RACL data up to date without triggering a full content re-sync.

Open the Permissions section of the connector.
Enable Permissions Sync Scheduled.
Set the desired time and frequency.

The Sync Scope column shows Permission Sync for RACL-specific operations, keeping it distinct from content syncs for easier traceability. If the RACL scheduler is disabled, permissions are updated as part of the regular content sync.

RACL Limitations

Supported for connector content only — not for websites or uploaded documents.
Switching from Restricted to Public Access automatically disables RACL.
For connectors without automatic resolution, manual Permission Entity management via API is required.

For complex RACL requirements or custom access control implementations, contact us for expert guidance.

Email Notifications

Search AI can send email alerts when long-running tasks — web crawls, connector syncs, and training jobs — start, complete, or fail. This lets you stay informed without having to manually check job status.

Notifications are disabled by default.

When enabled, you can:

Choose which event types trigger notifications
Add one or more team members as email recipients
Receive alerts for both manually triggered and scheduled jobs

Enable Email Notifications

On the Websites or Connectors page, click the email icon to open the Notification Settings in Search AI.
Toggle the Enable switch on.
Select one or more event types you want to be notified about.
In the Send to users field, type an email address and press Enter. Repeat for each recipient.
Click Save.

Notification Event Types

Event	Triggered by	Notifies you when
Web Crawl Status	Manual crawl, scheduled crawl	Crawl started, completed, or failed
Connectors Status	Manual sync, scheduled sync	Sync started, completed, or failed
Training Status	Manual training, auto-triggered training	Training started, completed, or failed

Quick Reference

Task	Location
Add website	Content > Websites > +Web Crawl
Upload files	Content > Documents > Upload
Add connector	Content > Connectors > Select connector
View indexed content	Index > Content Browser
Configure extraction	Index > Extraction
Test answers	Configuration > Testing
Configure Notifications	Content > Websites or Connectors > Email icon
