Evaluation Datasets

Evaluation Datasets are collections of test data used to measure the performance of your AI workflows.

Overview

The Datasets page provides a centralized view of all your evaluation datasets.

Managing Datasets

In the Datasets list, you can:

  • View all available evaluation datasets
  • See when each dataset was created
  • Filter datasets by their active status
  • Access detailed information for any dataset

Dataset Details

When you select a dataset, you'll see its detailed information.

The Dataset Details page includes:

  • Dataset name and description
  • Last updated timestamp
  • Dataset contents overview
  • List of dataset entries with their inputs
  • Reference outputs (expected results)
  • History of evaluation runs using this dataset

Dataset Entries

Each dataset contains entries that will be used to evaluate your workflows (see the sketch after this list):

  • ID: A unique identifier for each entry
  • Inputs: The data that will be fed into your workflow (e.g., product descriptions, queries)
  • Reference Outputs: The expected outputs to compare against your workflow's results
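
As a rough illustration, the sketch below models an entry with these three fields in Python. The class name, field names, and example values are illustrative assumptions; the exact schema used by your datasets may differ.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class DatasetEntry:
    """Minimal sketch of a dataset entry (assumed schema, not the platform's exact format)."""
    id: str                            # unique identifier for the entry
    inputs: dict[str, Any]             # data fed into the workflow
    reference_outputs: dict[str, Any]  # expected results to compare against

# Hypothetical entry for a workflow that categorizes product descriptions.
entry = DatasetEntry(
    id="entry-001",
    inputs={"product_description": "Wireless noise-cancelling headphones"},
    reference_outputs={"category": "Electronics > Audio"},
)
```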

Table Settings

You can customize how dataset entries are displayed using the Table Settings.

Available display options include:

  • Show/Hide Inputs: Toggle visibility of input data
  • Show/Hide State: Toggle visibility of intermediate state information
  • Show/Hide Reference Outputs: Toggle visibility of expected outputs

Latest Evaluation Runs

The Dataset Details page also shows a history of recent evaluation runs performed with this dataset:

  • Timestamp of each evaluation run
  • Name of the workflow that was evaluated
  • Brief description of the workflow
  • Options to open the run details or the workflow itself

Running Evaluations

To run an evaluation against a dataset (see the conceptual sketch after these steps):

  1. Navigate to the Dataset Details page
  2. Select the workflow you want to evaluate
  3. Configure any evaluation parameters
  4. Click "Run Evaluation"
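
Conceptually, a run feeds each entry's inputs through the selected workflow and compares the result with that entry's reference outputs. The sketch below shows that loop with a hypothetical `run_workflow` function and a simple exact-match check; the platform handles execution and typically supports richer scoring than exact matching.

```python
from typing import Any, Callable

def evaluate(
    entries: list[dict[str, Any]],
    run_workflow: Callable[[dict[str, Any]], dict[str, Any]],
) -> float:
    """Return the fraction of entries whose output exactly matches the reference output.

    `run_workflow` stands in for the workflow selected in step 2; it is a
    hypothetical callable, not a platform API.
    """
    matches = 0
    for entry in entries:
        outputs = run_workflow(entry["inputs"])    # feed the entry's inputs to the workflow
        if outputs == entry["reference_outputs"]:  # compare against the expected results
            matches += 1
    return matches / len(entries) if entries else 0.0
```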

Best Practices

For optimal evaluation dataset management:

  • Include diverse test cases that cover edge cases
  • Maintain reference outputs that reflect desired behavior
  • Version datasets when significant changes are made
  • Use consistent formatting for inputs and reference outputs (see the sketch below)
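
As a small illustration of the last point, the helper below checks that every entry uses the same input and reference-output keys before a dataset is updated. The function and the dict-based entry shape are assumptions for this sketch, not a platform requirement.

```python
from typing import Any

def check_consistent_keys(entries: list[dict[str, Any]]) -> None:
    """Raise if entries do not share the same input / reference-output keys."""
    if not entries:
        return
    input_keys = set(entries[0]["inputs"])
    reference_keys = set(entries[0]["reference_outputs"])
    for i, entry in enumerate(entries):
        if set(entry["inputs"]) != input_keys:
            raise ValueError(f"entry {i}: input keys differ from the rest of the dataset")
        if set(entry["reference_outputs"]) != reference_keys:
            raise ValueError(f"entry {i}: reference output keys differ from the rest of the dataset")
```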

Next Steps