Configure Evaluators

In Evaluation Studio, evaluators are tools used to assess how well a model performs on specific tasks. They function like custom prompts or instructions designed to check particular aspects of a model’s output.

For example, an evaluator can be set up to assess content completeness. It takes a model’s inputs and outputs and compares them against predefined criteria to check whether the content is complete.
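
As a rough illustration only (not the platform’s actual system prompt), a completeness-style evaluator prompt might look like the sketch below; the {{input}} and {{output}} placeholders are later filled from dataset columns.

```python
# Hypothetical completeness evaluator prompt, for illustration only.
# {{input}} and {{output}} are placeholders replaced with dataset values
# before the prompt is sent to the evaluating LLM.
COMPLETENESS_PROMPT = """\
You are an evaluator. Rate, on a scale of 1 to 5, how completely the OUTPUT
preserves the information contained in the INPUT. Respond with the numeric
score only.

INPUT:
{{input}}

OUTPUT:
{{output}}
"""
```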

Types of Evaluators

AI Evaluators are predefined sets of instructions given to a large language model (LLM), whether open-source or commercial, that acts as a judge of another model’s outputs. The evaluator assesses the evaluated model’s performance by comparing its outputs against predefined criteria or instructions, using a dataset of input and output data.

System AI Evaluators are pre-built evaluators offered by the platform to assess common aspects of model performance, such as quality, correctness, and safety. These evaluators are ready-to-use and cannot be modified, providing a quick and efficient way for users to evaluate models.

Note

Users can also access all the available system evaluators through the global Evaluators page located at the project level. Simply click the Evaluators tab, located next to the Projects tab. This page provides an overview of available evaluators that can be applied to datasets for evaluation.

System Evaluators are grouped into two categories: Quality Metrics and Safety Metrics.

Quality Metrics

Quality metrics assess the overall effectiveness and usefulness of the model's outputs. These metrics focus on whether the content generated by the model is clear, accurate, and complete.

Below are the key quality metrics and the components required in the dataset to use these evaluators:

| Metric | Description | Required Dataset Components |
|---|---|---|
| Groundedness | Evaluates whether the output accurately reflects the information provided in the input without introducing additional details from the model’s knowledge base. | Input, Output |
| Query Relevance | Assesses the relevance of the output to a user query and ensures the output is related to the given input. | Input, Output, User Query |
| Ground Truth Relevance | Compares the output to a provided ground truth to assess the relevance between input and output. | Input, Output, Ground Truth |
| Coherence | Evaluates how logically consistent and well-structured the generated output is, assessing its natural flow and readability. | Output |
| Fluency | Assesses the quality of individual sentences in the output, checking if they are grammatically correct and well-written. | Output |
| GPT Similarity Score | Compares the model’s response with a superior model’s response (e.g., GPT) for a given input. | Input, Your model’s response, Superior model’s response |
| Paraphrasing | Assesses whether the output conveys the same meaning as the input using different phrasing and sentence structures. | Input, Output |
| Completeness | Evaluates whether the output conveys the full context from the input, checking if any information was lost or omitted. | Input, Output |
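
For reference, a single dataset row that could support most of the quality metrics above carries one column per required component. The sketch below is illustrative only; the column names (input, output, user_query, ground_truth) and the file name are assumptions, not names the platform requires.

```python
import csv

# Illustrative dataset row; column names are assumptions, not platform requirements.
rows = [
    {
        "input": "The report covers Q3 revenue, customer churn, and hiring plans.",
        "output": "Summary: Q3 revenue and churn figures, plus hiring plans.",
        "user_query": "Summarize the quarterly report.",
        "ground_truth": "A summary of Q3 revenue, churn, and hiring plans.",
    },
]

# Write the rows to a CSV file that can be imported as an evaluation dataset.
with open("evaluation_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```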

Safety Metrics

Safety metrics focus on evaluating whether the model’s outputs are free from harmful or unethical content. These metrics are crucial for ensuring that AI models do not produce dangerous or biased results.

Below are the key safety metrics and the components required in the dataset to use these evaluators:

| Metric | Description | Required Dataset Components |
|---|---|---|
| Bias Detection | Analyzes the output for potential biases related to specified topics, ensuring that the model doesn’t exhibit unfair or discriminatory tendencies. | Output |
| Banned Topics | Scans the output for prohibited content related to specific topics, such as sensitive political issues or illegal activities. | Output |
| Toxicity | Screens the output for toxic content, such as violent, sexual, or otherwise inappropriate material. | Output |

Adding an Evaluator

When adding a system evaluator, users map the variables in the evaluator’s prompt to the corresponding column names in the dataset. The evaluator itself cannot be modified and must be used as-is.

Steps to add an evaluator:

  1. On the Evaluations page, click the + button, and select the Add evaluator option.

    Configure evaluator

  2. From the list of Quality and Safety evaluators, select the desired evaluator.

  3. In the Evaluators dialog, fill in these details:

    1. Model: Choose the model you want to use as an evaluator. This model assesses the input and/or output and generates a score. Only the models deployed in GALE appear in the search dropdown; both open-source and external models are listed.
    2. Model Configuration: Set the model hyperparameters, such as Temperature, Output token limit, and Top P.
    3. Prompt: Click to view the system prompt. For system evaluators, the prompt is view-only and cannot be edited.
    4. Map variables: Map the variables in the prompt to the corresponding columns in your imported dataset. This ensures the evaluator uses the right data for its analysis.
    5. Pass threshold: Set the minimum score required for an output to pass the evaluation. Choose either the ‘Greater than’ or ‘Less than’ option and then enter a threshold value (from 1 to 5).

      • For Positive Evaluators (or evaluators where a higher score is better, such as Completeness), the output is considered "good" if the score exceeds the threshold. For example, if the Completeness evaluator returns a score greater than 2.5, the result will be marked green, indicating that it meets the expected quality level.
      • For Negative Evaluators (such as Toxicity, where a lower score is better), a score above the threshold indicates a problem. For example, if the Toxicity evaluator returns a score greater than 2.5, it will be marked red, signaling that the output contains undesirable levels of toxic content.

      The ‘Greater than’ and ‘Less than’ options distinguish between positive and negative evaluators, letting you adjust the evaluation to the desired outcome; a small sketch of this pass/fail logic follows these steps.

  4. Click Save to save the evaluator configuration.

    Configure evaluator
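
The following is a minimal sketch of the pass-threshold behavior described above; it is illustrative only and not the platform’s implementation. The function name and the direction values are assumptions.

```python
def passes(score: float, threshold: float, direction: str) -> bool:
    """Illustrative pass/fail logic; not platform code.

    direction: "greater_than" for positive evaluators (e.g., Completeness),
               "less_than" for negative evaluators (e.g., Toxicity).
    """
    if direction == "greater_than":
        return score > threshold  # higher scores are better
    return score < threshold      # lower scores are better

# A Completeness score of 3.8 against a 2.5 threshold passes (marked green).
print(passes(3.8, 2.5, "greater_than"))  # True
# A Toxicity score of 3.8 against a 2.5 threshold fails (marked red).
print(passes(3.8, 2.5, "less_than"))     # False
```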

When setting up an AI evaluator, variable mapping is a crucial step. This is where the user connects the variables in the evaluator's prompt to the corresponding columns in the dataset.

  1. Variables in the Prompt: The evaluator’s prompt contains variables, indicated in double curly braces. For example, {{input}}, {{output}}, {{query}}. These variables are placeholders for your dataset columns and will appear on the left side of the Variable column. For example, in a Query Relevance evaluator, the prompt might include variables like {{query}} for the user query, {{input}} for the input text, and {{output}} for the model's response.
  2. Left Side - Prompt Variables: The left side of the mapping section shows the variables from the evaluator's prompt. This section is auto-populated by the system.
  3. Right Side - Dataset Columns: The right side displays the columns from your imported dataset. You must select the correct columns from the dataset to match each variable in the prompt. For example:
    • Map {{input}} to the corresponding input column in your dataset.
    • Map {{output}} to the output column.
  4. Safety Evaluators: For safety evaluators like Bias Detection or Toxicity, you may need to configure additional key-value pairs. These evaluators often provide binary (pass/fail) results, so map the relevant columns accordingly.

By correctly mapping the variables, you ensure the evaluator receives the right data and produces accurate results.
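
Conceptually, the mapping is a lookup from prompt variable to dataset column, as in the sketch below. It is illustrative only; the dataset column names on the right-hand side are assumptions about an imported dataset, not names the platform requires.

```python
# Illustrative variable mapping; the dataset column names are assumptions.
variable_mapping = {
    "query": "user_question",    # {{query}}  -> column holding the user query
    "input": "source_document",  # {{input}}  -> column holding the input text
    "output": "model_answer",    # {{output}} -> column holding the model's response
}

def render_prompt(prompt_template: str, dataset_row: dict, mapping: dict) -> str:
    """Fill each {{variable}} in the prompt with the value from the mapped column."""
    for variable, column in mapping.items():
        prompt_template = prompt_template.replace("{{" + variable + "}}", str(dataset_row[column]))
    return prompt_template
```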

Key Points:

  • Evaluators are used to assess model performance by comparing its outputs against predefined criteria.

  • System evaluators are pre-built and cannot be modified, offering ready-to-use options for evaluating common aspects of model performance, such as quality and safety metrics.

  • Variable mapping is crucial when adding evaluators, as users must link the variables in the evaluator's prompt to the appropriate dataset columns to ensure accurate evaluation results.