Guardrails
Maintain safety, stability, and compliance during agent execution.Overview
Guardrails are pre-deployed scanners that evaluate inputs and outputs to protect your application from harmful content, ensure compliance, and maintain quality standards.Scanner Types
Input Scanners
Monitor data agents receive from users:- Detect harmful or inappropriate language
- Identify jailbreak attempts
- Block unsafe instructions
- Scan for sensitive data patterns
Output Scanners
Evaluate responses before delivery:- Filter inappropriate content
- Enforce compliance rules
- Mask sensitive information
- Validate response quality
Available Scanners
| Scanner | Purpose | Applied To |
|---|---|---|
| Toxicity | Detect harmful, offensive, or inappropriate language | Input, Output |
| PII Detection | Identify personal information (SSN, credit cards, etc.) | Input, Output |
| Jailbreak Detection | Identify attempts to bypass agent instructions | Input |
| Prompt Injection | Detect malicious prompt manipulation | Input |
| Regex Patterns | Custom pattern matching | Input, Output |
| Content Moderation | Block specific topics or content types | Output |
How Guardrails Work
Processing Flow
Configuration
Account-Level Guardrails
Guardrails deployed at the account level are available to all apps:App-Level Configuration
Override or extend account settings:Tool-Level Guardrails
Apply to specific tools:Scanner Configuration
Toxicity Scanner
PII Detection
Jailbreak Detection
Custom Regex Patterns
Actions
| Action | Behavior |
|---|---|
| block | Reject the request/response entirely |
| mask | Replace sensitive content with masked characters |
| warn | Allow but flag for review |
| log | Record without intervention |
Testing Guardrails
Validate scanner effectiveness:- Navigate to Settings → Guardrails
- Select a scanner
- Click Test
- Enter sample input
- Review detection results
Test Cases
PII Protection Pipeline
Complete PII handling across the system:Monitoring
Track guardrail effectiveness:Metrics
- Block rate: Percentage of blocked requests
- Detection accuracy: False positive/negative rates
- Categories: Distribution of detected issues
- Trends: Changes over time
Alerts
Configure alerts for unusual patterns:Best Practices
Start Conservative
Begin with stricter settings and loosen as needed based on false positives.Layer Protection
Use multiple scanners for defense in depth:Test Regularly
- Review blocked content for false positives
- Test with adversarial inputs
- Update patterns as threats evolve