# Evaluations

Evaluations provide a structured framework for testing, scoring, and analyzing agent behavior before production deployment.

You can simulate conversations using different user personas and test scenarios, evaluate agent responses using automated evaluators, and track evaluation results over time.

Evaluations help you:

* Test agents across different user behaviors and conversation flows.
* Validate the quality, safety, and tool usage of agent responses.
* Benchmark results against expected outcomes.
* Identify gaps before production deployment.
* Improve overall agent reliability and trustworthiness.

The evaluation workflow consists of the following components:

| Component  | Description                                                                                             |
| ---------- | ------------------------------------------------------------------------------------------------------- |
| Personas   | Simulated user profiles with configurable communication styles, goals, behaviors, and constraints.      |
| Scenarios  | Conversation flows and test cases used to evaluate agent behavior.                                      |
| Evaluators | Scoring mechanisms that assess conversation quality, safety, efficiency, tool usage, and other metrics. |
| Eval Sets  | Reusable evaluation configurations that combine personas, scenarios, and evaluators.                    |
| Runs       | Executed evaluation sessions and results.                                                               |

**Navigation**: Go to your project and select **Evaluate** > **Evals**.

<img src="https://mintcdn.com/koreai/mOz_ZNDanBGKqV7f/agent-platform/images/evaluate.png?fit=max&auto=format&n=mOz_ZNDanBGKqV7f&q=85&s=a4fba7023ccbc04b33cfd0cd43e53a5c" alt="Evaluate" width="967" height="582" data-path="agent-platform/images/evaluate.png" />

<Note>Personas, scenarios, evaluators, and eval sets are reusable across multiple evaluations within the same project.</Note>

## Evaluation Workflow

You can evaluate agents in two ways:

| Evaluation Type | Description                                                                                                                 |
| --------------- | --------------------------------------------------------------------------------------------------------------------------- |
| Quick Eval      | Automatically generates personas, scenarios, evaluators, and runs using AI for rapid testing and iteration.                 |
| Manual Eval     | Lets you manually create personas, scenarios, evaluators, and eval sets for controlled, reusable evaluation workflows.      |

### Quick Eval

Use Quick Eval for:

* Rapid testing
* Early-stage validation
* Smoke testing
* Fast iteration during development

The platform automatically generates the required personas, scenarios, evaluators, and evaluation runs.

### Manual Evaluation Workflow

<Steps>
  <Step title="Create Personas">
    Define simulated user profiles with specific communication styles, goals, behaviors, and constraints.
  </Step>

  <Step title="Create Scenarios">
    Define conversation flows, expected outcomes, milestones, and user intents used to test agent behavior.
  </Step>

  <Step title="Create Evaluators">
    Configure evaluators to measure response quality, safety, efficiency, tool usage, and other evaluation criteria.
  </Step>

  <Step title="Create Eval Sets">
    Combine personas, scenarios, and evaluators into reusable evaluation configurations.
  </Step>

  <Step title="Run Evaluations">
    Execute evaluation runs to simulate conversations and generate evaluator scores and transcripts.
  </Step>

  <Step title="Analyze Results">
    Review scores, evaluator reasoning, regressions, transcripts, traces, and execution metrics to improve agent performance.
  </Step>
</Steps>

## Personas

Personas represent different types of users who interact with your agent.

Each persona simulates unique communication styles, domain expertise, goals, behaviors, and constraints to help test how the agent performs across varied user interactions.

### Create a Persona

1. Go to **Evaluate > Evals > Personas**.
2. Click **Create Persona**.
3. Specify persona details such as:

* Communication style
* Domain knowledge
* Behavioral traits
* Goals and constraints
* Optional session variables

| Field               | Description                                        | Options                                                         |
| ------------------- | -------------------------------------------------- | --------------------------------------------------------------- |
| Name                | Unique name within the project                     | For example, `Impatient Business Traveler`                      |
| Communication Style | Defines how the persona phrases messages           | Casual, formal, technical, terse, verbose                       |
| Domain Knowledge    | Defines how much the persona knows about the topic | Beginner, intermediate, expert                                  |
| Behavior Traits     | Specific behaviors the persona exhibits            | Free-text tags, for example, `"asks follow-ups"`, `"impatient"` |
| Goals               | Defines what the persona is trying to accomplish   | Free text                                                       |
| Constraints         | Rules the persona follows during conversation      | Free text                                                       |

4. Select an adversarial behavior type if you want to simulate edge cases or malicious interactions.
5. Click **Create**.

#### Example Persona

```yaml theme={null}
PERSONA:
  name: "Impatient Business Traveler"
  communication_style: terse
  domain_knowledge: expert
  behavior_traits:
    - impatient
    - asks_follow_up_questions
  goal: "Rebook a cancelled flight quickly"
  constraint: "Avoid unnecessary conversation"
```

### Adversarial Persona Types

You can simulate adversarial or edge-case user behaviors using the **Adversarial Type** field.

To test agent safety and robustness:

1. Enable **Adversarial** while creating a persona.
2. Select the adversarial type.

| Type               | Purpose                                                        |
| ------------------ | -------------------------------------------------------------- |
| Prompt Injection   | Attempts to override agent instructions                        |
| Social Engineering | Attempts to extract sensitive information                      |
| Off-topic Derailer | Redirects conversations away from the intended agent goal      |
| Abusive User       | Uses hostile or inappropriate language                         |
| Edge Case Explorer | Sends unusual or unexpected inputs, such as empty or very long messages |

### Additional Options

* Edit, duplicate, or delete personas from the Personas page.
* Create multiple personas for different user behaviors and communication patterns.
* Reuse personas across multiple eval sets within the same project.

### AI-Generated Personas

Instead of defining personas manually, use the **Generate with AI** option to automatically create personas based on your agent’s domain and objectives.

Generated personas typically include a mix of:

* Communication styles
* Knowledge levels
* Behavioral patterns
* User goals

You can review and edit the generated personas after they're created.

### Troubleshooting Personas

| Issue                                     | Recommendation                                               |
| ----------------------------------------- | ------------------------------------------------------------ |
| The persona's behavior is inconsistent    | Refine the goals and constraints fields                      |
| The persona's responses are unrealistic   | Add more specific behavioral traits and communication styles |
| The persona is too passive or aggressive  | Adjust goals, constraints, and adversarial settings          |

## Scenarios

Scenarios define the conversation flow, user intent, and expected outcomes used during evaluations.

Each scenario tests how the agent handles a specific task, behavior, or outcome.

### Create a Test Scenario

1. Go to **Evaluate > Evals > Scenarios**.
2. Click **Create Scenario**.
3. Specify the following scenario details.
4. Click **Create**.

| Field               | Description                                                     |
| ------------------- | --------------------------------------------------------------- |
| Name                | Unique name within the project                                  |
| Category            | Grouping label, for example, booking, returns, or auth.         |
| Difficulty          | Defines the scenario's complexity level: easy, medium, or hard. |
| Entry Agent         | The agent that starts the conversation (optional).              |
| Initial Message     | The first user message that starts the scenario.                |
| Expected Outcome    | Description of what a successful conversation should achieve.   |
| Max Turns           | Maximum number of conversation turns before timeout.            |
| Expected Milestones | Key checkpoints the conversation is expected to reach.          |
| Agent Path          | Expected sequence of agents for multi-agent projects.           |

#### Example Scenario

```yaml theme={null}
SCENARIO:
  name: "Flight Rebooking After Cancellation"
  category: booking
  difficulty: medium

  initial_message: >
    My flight was cancelled and I need to rebook for tomorrow.

  expected_outcome: >
    Agent identifies the cancelled booking, offers alternatives,
    and confirms a new flight.

  max_turns: 15

  expected_milestones:
    - "Identify cancelled flight"
    - "Present rebooking options"
    - "Confirm new booking"

  agent_path:
    - "Supervisor"
    - "Booking_Manager"
```

### Bulk Import Personas and Scenarios

Use the API to programmatically create multiple personas or scenarios.

```bash theme={null}
# Prefix the path with your platform host and replace :projectId with your project ID.
curl -X POST /api/projects/:projectId/eval-personas \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Confused First-Time User",
    "communicationStyle": "verbose",
    "domainKnowledge": "beginner",
    "behaviorTraits": [
      "asks for clarification",
      "repeats questions"
    ],
    "goals": "Complete a simple booking",
    "constraints": "Never provides information upfront"
  }'
```
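
To create several personas in one pass, you can loop over a list of payloads and issue one request per persona. The sketch below only builds the requests and does not send them. `BASE_URL`, `PROJECT_ID`, and `TOKEN` are placeholders you must supply, and the endpoint path and field names are copied from the curl example above. Whether the platform also offers a dedicated bulk endpoint is not covered here.

```python
import json
import urllib.request

# Placeholders (assumptions): replace with your own host, project, and token.
BASE_URL = "https://your-platform-host.example"
PROJECT_ID = "your-project-id"
TOKEN = "your-api-token"

# Payload fields mirror the curl example; values are illustrative.
personas = [
    {
        "name": "Confused First-Time User",
        "communicationStyle": "verbose",
        "domainKnowledge": "beginner",
        "behaviorTraits": ["asks for clarification", "repeats questions"],
        "goals": "Complete a simple booking",
        "constraints": "Never provides information upfront",
    },
    {
        "name": "Impatient Business Traveler",
        "communicationStyle": "terse",
        "domainKnowledge": "expert",
        "behaviorTraits": ["impatient", "asks follow-ups"],
        "goals": "Rebook a cancelled flight quickly",
        "constraints": "Avoid unnecessary conversation",
    },
]


def build_persona_request(persona):
    """Build (but do not send) a POST request that creates one persona."""
    url = f"{BASE_URL}/api/projects/{PROJECT_ID}/eval-personas"
    return urllib.request.Request(
        url,
        data=json.dumps(persona).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


requests_to_send = [build_persona_request(p) for p in personas]

# To actually create the personas, send each request, for example:
# for request in requests_to_send:
#     urllib.request.urlopen(request)
```

Because persona names must be unique within a project, a real import script would also need to handle duplicate-name errors for personas that already exist.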

### Additional Options

* Edit, duplicate, or delete scenarios from the Scenarios page.
* Reuse scenarios across multiple eval sets.

### Troubleshooting Scenarios

| Issue                                     | Recommendation                                                                                                            |
| ----------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| Duplicate name error                      | Persona and scenario names must be unique within a project. Use a more specific name or delete the existing one.          |
| Persona not behaving as expected in evals | Refine the **Goals** and **Constraints** fields. These are used as system prompt instructions for the simulated user LLM. |
| Scenario timing out                       | Increase the **Max Turns** value or simplify the expected conversation path.                                              |

## Evaluators

Evaluators define how agent conversations are scored and analyzed during the evaluation process.

You can use evaluators to assess:

* Response quality
* Safety
* Efficiency
* Empathy
* Tool correctness
* Custom evaluation criteria

### Create an Evaluator

1. Go to **Evaluate > Evals > Evaluators**.
2. Click **Create Evaluator**.
3. Configure the evaluator:
   1. Select the evaluator type.
   2. Choose the evaluation category.
   3. Define the scoring scale and criteria.
4. Click **Create**.

<Note>Lower evaluator temperatures generally produce more consistent scoring results.</Note>

### Evaluator Types

Supported evaluator types include:

| Type         | Description                                 |
| ------------ | ------------------------------------------- |
| LLM Judge    | Uses an LLM to evaluate conversations.      |
| Code Scorer  | Uses deterministic programmatic scoring.    |
| Trajectory   | Evaluates conversation flow and milestones. |
| Human Review | Flags conversations for manual review.      |

### LLM Judge Evaluator

An LLM Judge evaluator uses a separate LLM to assess the quality of agent responses based on a scoring rubric you define.

| Field            | Description                                                                                    |
| ---------------- | ---------------------------------------------------------------------------------------------- |
| Judge Model      | Defines which LLM is used as the evaluator judge.                                              |
| Judge Prompt     | Instructions that define what the judge should evaluate.                                       |
| Temperature      | LLM temperature for the judge. Lower values generally produce more consistent scoring results. |
| Chain of Thought | Defines whether the judge explains its reasoning before scoring.                               |

#### Write Effective Judge Prompts

The judge prompt is one of the most important evaluator configurations.

Effective judge prompts:

* Clearly define evaluation criteria
* Focus on observable behavior
* Avoid ambiguous language
* Include examples when possible

#### Example Judge Prompt

```yaml theme={null}
JUDGE_PROMPT:
  You are evaluating an AI agent's response quality in a customer support context.

  Evaluate the conversation on these criteria:
    1. Did the agent correctly identify the customer's intent?
    2. Did the agent provide accurate information?
    3. Did the agent follow the expected conversation flow?
    4. Was the agent's tone appropriate and professional?

  Score each conversation using the provided rubric.

  Focus on the agent's responses, not the simulated user's messages.
```

#### Configure Bias Mitigation

LLM judges can exhibit scoring biases. Use bias mitigation settings to improve evaluation consistency and reliability.

| Setting                 | Description                                                                               | Default |
| ----------------------- | ----------------------------------------------------------------------------------------- | ------- |
| Position Swap           | Evaluates the conversation in both original and reversed order to reduce positional bias. | On      |
| Blind Evaluation        | Removes agent or persona identity information before judging.                             | On      |
| Cross-Model Judge       | Uses a different model family than the agent being evaluated.                             | Off     |
| Evidence-First (RULERS) | Requires the judge to cite evidence before assigning scores.                              | On      |

### Trajectory Evaluators

Trajectory evaluators assess the agent's execution behavior rather than response quality.

Use them to validate:

* Milestone completion -- did the conversation hit expected checkpoints?
* Handoff correctness -- did the supervisor route to the right agent?
* Path efficiency -- how many unnecessary steps did the agent take?
* Tool sequence -- did the agent call tools in the right order?
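
The checks above can be illustrated with a minimal sketch. It assumes you already have the observed milestones and agent path as plain lists; the function names and the efficiency formula are hypothetical and are not the platform's actual scoring logic.

```python
def completed_in_order(expected, observed):
    """Split expected milestones into (completed, missed), enforcing order."""
    completed, missed = [], []
    position = 0
    for milestone in expected:
        try:
            # Search only after the previously matched milestone.
            position = observed.index(milestone, position) + 1
            completed.append(milestone)
        except ValueError:
            missed.append(milestone)
    return completed, missed


def path_efficiency(expected_path, actual_path):
    """Ratio of expected steps to actual steps; 1.0 means no extra steps."""
    if not actual_path:
        return 0.0
    return min(1.0, len(expected_path) / len(actual_path))


expected_milestones = [
    "Identify cancelled flight",
    "Present rebooking options",
    "Confirm new booking",
]
observed_milestones = ["Identify cancelled flight", "Confirm new booking"]

completed, missed = completed_in_order(expected_milestones, observed_milestones)
print(missed)  # ['Present rebooking options']

efficiency = path_efficiency(
    ["Supervisor", "Booking_Manager"],
    ["Supervisor", "Billing", "Supervisor", "Booking_Manager"],
)
print(efficiency)  # 0.5
```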

### Code Scorer Evaluators

Use Code Scorer evaluators for deterministic validations that do not require an LLM.

Typical use cases include:

* Regex matching
* Keyword validation
* Latency or response-time thresholds
* Structured output validation

Code Scorer evaluators execute custom scoring logic to validate agent responses and runtime behavior using deterministic rules.
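
As an illustration, a deterministic scorer might combine a regex check with a latency threshold. The function signature, the flight-number pattern, and the 3000 ms limit below are assumptions made for this sketch; the platform's actual Code Scorer interface may differ.

```python
import re

# Hypothetical rule: the reply must mention a flight number such as "KA1234".
FLIGHT_NUMBER_PATTERN = re.compile(r"\b[A-Z]{2}\d{3,4}\b")

# Hypothetical latency budget in milliseconds.
MAX_LATENCY_MS = 3000


def score_response(response_text: str, latency_ms: float) -> int:
    """Return 1 (pass) or 0 (fail) using deterministic rules only."""
    has_flight_number = FLIGHT_NUMBER_PATTERN.search(response_text) is not None
    fast_enough = latency_ms <= MAX_LATENCY_MS
    return 1 if has_flight_number and fast_enough else 0


print(score_response("You are rebooked on KA1234 for tomorrow.", 1200))  # 1
print(score_response("Sorry, I could not find your booking.", 1200))     # 0
print(score_response("You are rebooked on KA1234 for tomorrow.", 5000))  # 0
```

The same input always produces the same score, which makes this kind of evaluator a good fit for regression checks where LLM non-determinism would add noise.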

### Human Review Evaluators

Use Human Review evaluators for subjective or manual quality assessments.

Human Review evaluators flag conversations for manual inspection when evaluation scores fall below configured thresholds, allowing reviewers to validate agent behavior, response quality, or policy compliance before approval or release.

### Scoring Scale Types

The scoring rubric defines how the evaluator assigns scores to conversations.

Supported scale types include:

* 1 to 5 scale
* Pass or Fail

#### 1 to 5 Scale

Use a 1 to 5 scale to define detailed evaluation criteria for each score level.

| Score         | Description                                                          |
| ------------- | -------------------------------------------------------------------- |
| 5 - Excellent | Addresses the user's request with accurate and complete information. |
| 4 - Good      | Addresses the request with minor omissions.                          |
| 3 - Adequate  | Partially addresses the request but misses important details.        |
| 2 - Poor      | Mostly misses the request or provides inaccurate information.        |
| 1 - Failing   | Fails to address the request or provides harmful information.        |

#### Pass and Fail Scale

Use pass or fail scoring for binary evaluation criteria.

| Score    | Description                                                              |
| -------- | ------------------------------------------------------------------------ |
| 1 - Pass | The agent completes the task within the expected flow.                   |
| 0 - Fail | The agent fails to complete the task or deviates from expected behavior. |

### Troubleshoot Evaluators

| Issue                            | Recommendation                                                                                                                              |
| -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
| Inconsistent scores across runs  | Lower evaluator temperature (try `0.1`) and enable evidence-first mode. Run multiple variants per evaluation to get statistical confidence. |
| Judge ignores rubric criteria    | Make rubric instructions more specific using examples.                                                                                      |
| Judge model is too expensive     | Use a smaller model for initial screening and reserve larger models for detailed analysis. Set appropriate `maxTokens` limits.              |
| Evaluation cost is too high      | Use smaller judge models during development.                                                                                                |
| Scores appear random             | Increase statistical sample size using variants.                                                                                            |

## Eval Sets

Eval Sets combine personas, scenarios, and evaluators into reusable evaluation configurations that you run as batches to systematically test agents across every combination.

You can use eval sets to:

* Reuse evaluation pipelines.
* Standardize testing across environments.
* Execute multiple evaluations consistently.
* Detect regressions over time.

### Execution Model

During execution:

* Every selected Persona interacts with every selected Scenario
* Each conversation is independently executed
* All configured Evaluators score the resulting conversations

This creates a full evaluation matrix across personas, scenarios, and evaluators.

**Example Evaluation Matrix**

```yaml theme={null}
EVAL_SET:
  personas: 3
  scenarios: 4
  evaluators: 2
  variants: 2

TOTAL_CONVERSATIONS:
  formula: "3 Personas × 4 Scenarios × 2 Variants"
  result: 24

TOTAL_EVALUATIONS:
  formula: "24 Conversations × 2 Evaluators"
  result: 48
```

Each conversation is executed as an independent multi-turn session where the persona LLM simulates the user according to the scenario definition.
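
The matrix arithmetic above can be reproduced with a short sketch. The persona, scenario, and evaluator names are placeholders rather than platform identifiers; the code only enumerates combinations and counts them.

```python
from itertools import product

# Placeholder names for illustration; only the counts matter here.
personas = ["persona_a", "persona_b", "persona_c"]
scenarios = ["scenario_1", "scenario_2", "scenario_3", "scenario_4"]
evaluators = ["quality_judge", "safety_judge"]
variants = 2

# One conversation per (persona, scenario, variant) combination.
conversations = list(product(personas, scenarios, range(1, variants + 1)))

# Every evaluator scores every conversation.
evaluations = list(product(conversations, evaluators))

print(len(conversations))  # 24
print(len(evaluations))    # 48
```

Because the totals grow multiplicatively, adding one persona to this set adds 8 conversations and 16 evaluations, which is worth keeping in mind when estimating cost.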

### Create an Eval Set

1. Go to **Evaluate** > **Evals** > **Eval Sets**.
2. Click **Create Eval Set**.
3. Specify the following details.
4. Click **Create**.

| Field           | Description                                                                                                                                                                                                                            |
| --------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Name            | Unique eval set name, for example, Booking Flow Regression Suite.                                                                                                                                                                      |
| Personas        | Select one or more personas to simulate users.                                                                                                                                                                                         |
| Scenarios       | Select one or more scenarios to test.                                                                                                                                                                                                  |
| Evaluators      | Select one or more evaluators to score conversations.                                                                                                                                                                                  |
| Variants        | Number of times to repeat each evaluation combination for statistical confidence.<br /><br />Higher variant counts help reduce:<ul><li>Random scoring fluctuations</li><li>LLM non-determinism</li><li>Statistical anomalies</li></ul> |
| Max Concurrency | Defines how many conversations run in parallel.<br /><br />Higher concurrency:<ul><li>Reduces overall execution time</li><li>Increases resource usage and cost</li></ul>                                                               |
| Persona Model   | LLM used to simulate the persona (optional override).                                                                                                                                                                                  |

### Enable CI/CD Integration

Use evaluation runs in CI/CD pipelines to automatically block deployments when regressions are detected.

To enable CI integration:

1. Open the Eval Set.
2. Enable CI/CD integration.
3. Trigger evaluation runs using the Eval Run API from your CI pipeline.
4. Check the run result for `regressionDetected: true` and fail the deployment pipeline accordingly.

```bash theme={null}
# Trigger evaluation run
RUN_ID=$(curl -s -X POST .../eval-runs -d '...' | jq -r '.id')

# Check run result (after the run has completed)
RESULT=$(curl -s -H "Authorization: Bearer $TOKEN" "/api/projects/:projectId/eval-runs/$RUN_ID")
REGRESSION=$(echo "$RESULT" | jq -r '.regressionDetected')

# Fail the pipeline step when a regression is reported
if [ "$REGRESSION" = "true" ]; then
  echo "Regression detected -- blocking deployment"
  exit 1
fi
```

### Regression Detection

Eval sets support regression detection by comparing new runs against baseline runs.

To configure regression detection:

1. Open the Eval Set.
2. Under Regression Settings, select a baseline run.
   Typically, this is the last known-good evaluation run.
3. Specify the regression threshold.
   For example, 0.1 means that a 10% drop in score triggers a regression alert.

When a new run completes, the platform compares scores per evaluator and flags regressions with the evaluator name, persona/scenario combination, baseline score, current score, and score delta.
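
The comparison can be sketched as follows. This is not the platform's implementation: it assumes the threshold is a relative drop against the baseline score (so `0.1` means 10%), and the `Regression` class and `find_regressions` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Regression:
    evaluator: str
    persona: str
    scenario: str
    baseline: float
    current: float

    @property
    def delta(self) -> float:
        """Score change; negative values mean the score dropped."""
        return self.current - self.baseline


def find_regressions(baseline, current, threshold=0.1):
    """Flag combinations whose score dropped by more than `threshold`.

    Assumption: `threshold` is a relative drop against the baseline score.
    `baseline` and `current` map (evaluator, persona, scenario) -> score.
    """
    flagged = []
    for key, base_score in baseline.items():
        if key not in current or base_score <= 0:
            continue
        new_score = current[key]
        relative_drop = (base_score - new_score) / base_score
        if relative_drop > threshold:
            flagged.append(Regression(*key, baseline=base_score, current=new_score))
    # Largest negative delta first, so the worst regressions are reviewed first.
    return sorted(flagged, key=lambda r: r.delta)


baseline_scores = {
    ("quality", "Impatient Business Traveler", "Flight Rebooking"): 4.5,
    ("safety", "Impatient Business Traveler", "Flight Rebooking"): 5.0,
}
current_scores = {
    ("quality", "Impatient Business Traveler", "Flight Rebooking"): 3.6,
    ("safety", "Impatient Business Traveler", "Flight Rebooking"): 4.8,
}

regressions = find_regressions(baseline_scores, current_scores, threshold=0.1)
for r in regressions:
    print(r.evaluator, r.baseline, r.current, round(r.delta, 2))
```

In this example only the quality evaluator is flagged: its score fell by 20% of the baseline, while the safety score fell by 4%.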

### Run a Subset of Scenarios

Instead of running the full evaluation set, use scenario tags to create smaller targeted evaluation batches.

1. Tag your scenarios (for example, `smoke-test`, `regression`, or `edge-case`).
2. Create separate eval sets for different test scopes - for example, a lightweight smoke-test set for every commit and a full regression set for release candidates.

## Runs

Runs represent executed evaluations and their results. Each run is executed and tracked independently.

Each run generates:

* Conversation transcripts
* Scores
* Evaluator outputs
* Execution metadata
* Analysis results

Runs also help you track costs:

* Estimated execution cost
* Actual execution cost
* Model usage
* Token usage

This helps optimize the evaluation scale and model selection.

<img src="https://mintcdn.com/koreai/mOz_ZNDanBGKqV7f/agent-platform/images/eval-runs.png?fit=max&auto=format&n=mOz_ZNDanBGKqV7f&q=85&s=75ad4170fba73d79c4c35054710900e5" alt="Runs" width="916" height="589" data-path="agent-platform/images/eval-runs.png" />

### Run Evaluations

1. Select an Eval Set.
2. Click **Start Run**.
3. Monitor evaluation progress from the Runs page.

The system automatically:

* Executes conversations using the selected personas and scenarios.
* Applies evaluators to generated conversations.
* Stores scoring and transcript results.

### Run Statuses

| Status    | Description                             |
| --------- | --------------------------------------- |
| Pending   | Run is queued and waiting to start.     |
| Running   | Evaluations are in progress.            |
| Completed | All evaluations finished successfully.  |
| Failed    | Run encountered an unrecoverable error. |
| Cancelled | Run was manually stopped.               |
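
When you automate evaluations, you typically wait for a run to leave the Pending and Running states before reading its result. The sketch below polls until a terminal status is reached. `fetch_status` is a caller-supplied, hypothetical function that returns the current status string, and the exact status values and casing returned by the API are assumptions based on the table above.

```python
import time

# Treating these three statuses as terminal is an assumption about the lifecycle.
TERMINAL_STATUSES = {"Completed", "Failed", "Cancelled"}


def wait_for_run(fetch_status, poll_seconds=10.0, timeout_seconds=1800.0, sleep=time.sleep):
    """Poll `fetch_status()` until the run reaches a terminal status.

    Raises TimeoutError if the run is still active after `timeout_seconds`.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f"Run still {status!r} after {timeout_seconds} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated status sequence standing in for real API calls.
simulated = iter(["Pending", "Running", "Running", "Completed"])
final_status = wait_for_run(lambda: next(simulated), sleep=lambda seconds: None)
print(final_status)  # Completed
```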

### Run via API

Trigger evaluation runs programmatically for CI/CD integration and automated testing workflows.

```bash theme={null}
# Prefix the path with your platform host and replace :projectId with your project ID.
curl -X POST /api/projects/:projectId/eval-runs \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "evalSetId": "your-eval-set-id",
    "name": "CI Run #42",
    "triggerSource": "ci"
  }'
```

The `triggerSource` field tracks how the evaluation run was initiated:

* `manual` - Triggered from Studio.
* `ci` - Triggered through API or CI/CD pipelines.
* `scheduled` - Triggered by a scheduled job or automation.

### Troubleshoot Runs

| Issue                                   | Recommendation                                                                                                                         |
| --------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- |
| Run is stuck in pending status          | Check that the project has valid LLM credentials configured. The evaluation pipeline requires both an agent model and a persona model. |
| High execution costs on large eval sets | Reduce variants to `1` during development, use smaller persona models, and limit max concurrency to control spend.                     |
| Inconsistent results between runs       | Increase the variant count to `3+` for better statistical confidence, and lower the persona and judge temperatures.                    |
| Run fails immediately                   | Ensure all referenced personas, scenarios, and evaluators still exist. Deleted components can cause run failures.                      |

## Analyze Evaluation Results

After a run completes, you can:

* Review conversation transcripts.
* Analyze evaluator scores and reasoning.
* Identify success and failure patterns.
* Analyze agent behavior across scenarios.
* Inspect execution traces and tool usage.

### View Run Summary

After an eval run completes, open it from **Evals** > **Runs** to view the summary dashboard.

| Metric              | Description                                                       |
| ------------------- | ----------------------------------------------------------------- |
| Avg Score           | Overall average across all evaluators and conversations.          |
| Scores by Evaluator | Breakdown of average score per evaluator.                         |
| Total Conversations | Total number of persona-scenario conversations executed.          |
| Total Evaluations   | Total evaluator judgments generated (conversations × evaluators). |
| Duration            | Total time taken for the run.                                     |
| Estimated Cost      | Projected LLM cost before execution.                              |
| Actual Cost         | Actual LLM cost tracked during execution.                         |

### Understand Statistical Metrics

| Metric              | Description                                                                                                              |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| Standard Deviation  | Measures how much individual scores vary from the average. Lower values indicate more consistent results.                |
| Confidence Interval | Reliability range of the average score. Narrow intervals indicate more reliable evaluation results.                      |
| Pass\@K             | The probability that at least one of K attempts passes the evaluation criteria. Useful for creative or open-ended tasks. |
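
To make these metrics concrete, the following sketch computes them from a list of scores. It is an approximation for illustration only: the confidence interval uses a normal approximation with `z = 1.96`, and Pass@K uses the common combinatorial estimator. The platform may compute these values differently.

```python
import math
import statistics


def summarize(scores, z=1.96):
    """Return the mean, sample standard deviation, and an approximate 95% CI."""
    mean = statistics.mean(scores)
    stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    margin = z * stdev / math.sqrt(len(scores))
    return mean, stdev, (mean - margin, mean + margin)


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k attempts passes,
    given c passes observed in n attempts (requires k <= n)."""
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


scores = [4, 5, 3, 4, 4]
mean, stdev, interval = summarize(scores)
print(mean, round(stdev, 3), tuple(round(x, 2) for x in interval))

# 2 passing attempts out of 5, evaluated for K = 1 and K = 3.
print(round(pass_at_k(5, 2, 1), 2))  # 0.4
print(round(pass_at_k(5, 2, 3), 2))  # 0.9
```

With only a few variants the normal approximation understates uncertainty, which is one reason higher variant counts give more trustworthy intervals.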

### Understand Score Distributions

| Pattern                      | Meaning                                             | Recommended Action                            |
| ---------------------------- | --------------------------------------------------- | --------------------------------------------- |
| High average, low deviation  | The agent performs consistently well.               | Ready for deployment.                         |
| High average, high deviation | The agent performs well overall but inconsistently. | Investigate low-scoring outliers.             |
| Low average, low deviation   | The agent consistently underperforms.               | Review agent instructions and flow design.    |
| Low average, high deviation  | The agent's behavior is unstable or unpredictable.  | Investigate failing scenarios and edge cases. |

### Review Regression Details

If a run detects regressions, the regression panel shows:

* The evaluator that flagged the regression
* Persona/scenario combination
* Baseline score
* Current score
* Score delta

Focus first on regressions with the largest negative score delta.

Open individual conversations to inspect traces and understand the causes of failures.

### Conversation Analysis

Select a conversation to drill into:

* **Full Transcript** - Every message exchanged between the persona and the agent
* **Evaluator Scores** - Per-evaluator scores with judge reasoning (when Chain of Thought is enabled)
* **Trace Timeline** - Execution trace showing tool calls, handoffs, and decisions
* **Milestone Tracking** - Expected milestones that were completed or missed
* **Tool Usage and Failures** - Tool execution details and runtime failures

### Read Evaluator Reasoning

When Chain of Thought is enabled, evaluator scores include the judge's reasoning, which explains how the score was determined.

Use evaluator reasoning to identify improvements needed in:

* Agent instructions
* Flow design
* Tool configuration

#### Example Evaluator Reasoning

```yaml theme={null}
EVALUATION_RESULT:
  score: 3/5

  reasoning: >
    The agent correctly identified the customer's intent
    but failed to provide complete rebooking options
    before requesting additional information.
```

### Compare Runs Over Time

Use run history to:

* Compare evaluator score changes
* Track regressions
* Measure improvement trends
* Validate prompt or workflow updates

To compare runs:

1. Go to **Evals > Runs**.
2. Sort runs by date to view chronological progression.
3. Compare runs to analyze score deltas across evaluators and scenarios.

Run evaluations after significant changes to agents, prompts, tools, or workflows to identify regressions early.

### Acting on Results

| Issue                      | Recommended Actions                                                                                                                                                                                                                                         |
| -------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Low quality scores         | <ul><li>Refine the agent's **goal** and **persona** to provide clearer behavioral guidance.</li><li>Add or improve **limitations** to prevent off-topic responses.</li><li>For flow-based agents, review step transitions and conditional logic.</li></ul>  |
| Low safety scores          | <ul><li>Add or tighten guardrails rules for input and output filtering.</li><li>Create adversarial personas to stress-test edge cases.</li></ul>                                                                                                            |
| Low efficiency scores      | <ul><li>Reduce unnecessary tool calls by improving agent instructions.</li><li>Optimize flow step sequences to minimize the number of conversation turns.</li><li>Check whether the agent is requesting information already available in context.</li></ul> |
| Handoff correctness issues | <ul><li>Review the handoff conditions for the supervisor agent.</li><li>Verify that `when` clauses match the intended routing patterns.</li><li>Validate expected agent paths configured in scenarios.</li></ul>                                            |

### Troubleshooting

| Issue                                       | Recommendation                                                                                                                                                                     |
| ------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Scores seem random                          | Increase the number of variants in the eval set. Using `3-5` variants typically provides better statistical significance. Lower the judge temperature for more consistent scoring. |
| All scores are perfect (5/5)                | The scoring rubric may be too lenient. Add more specific failure conditions and use adversarial personas to test edge cases.                                                       |
| Regression detected, but the agent improved | Review the baseline run. It may contain an anomalous high score. Set a more recent and stable run as the new baseline.                                                             |
| Cost higher than expected                   | Review the selected persona and judge models. Using smaller persona models can significantly reduce evaluation cost.                                                               |
