
Overview

A scoring function is every bit as consequential as the test it evaluates. Runloop gives you full control over a Scenario's scoring functions without requiring code changes.

Why Use Custom Scorers?

The scoring function defines the reward signal for the agent: by changing it, you change what success looks like or teach the agent to avoid undesirable behaviors. Custom scorers offer:
  1. Reusability: Take an existing Scenario and repurpose it to test a different behavior. This is particularly useful for detecting model regressions along common dimensions, such as security or privacy.
  2. Composability: Extend an existing scoring function to include additional criteria. For example, you can take a SWE-Bench scenario and add an additional score component that rewards the agent for keeping costs low (see the sketch after this list).
  3. Flexibility: Incorporate powerful evaluation techniques like LLM-based scoring or grading using external tools.
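
For example, a composed scoring contract can weight a primary correctness scorer against a secondary cost scorer. The snippet below is an illustrative sketch only: the scorer types my_correctness_scorer and my_cost_scorer are hypothetical and assumed to have already been registered, and the 0.8/0.2 split is arbitrary:
scoring_contract = {
    "scoring_function_parameters": [
        {
            "name": "correctness",
            "scorer": {
                "type": "custom_scorer",
                # Hypothetical scorer type, assumed registered via scorers.create()
                "custom_scorer_type": "my_correctness_scorer",
            },
            "weight": 0.8,
        },
        {
            "name": "cost",
            "scorer": {
                "type": "custom_scorer",
                # Hypothetical scorer type that rewards keeping costs low
                "custom_scorer_type": "my_cost_scorer",
            },
            "weight": 0.2,
        },
    ]
}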

Creating a Custom Scorer

Here’s an example of creating a custom scorer that evaluates the length of an agent’s response written to a file:
import os
from runloop_api_client import Runloop

client = Runloop(
    bearer_token=os.environ.get("RUNLOOP_API_KEY"),  # This is the default and can be omitted
)
scorer = client.scenarios.scorers.create(
    bash_script="""
    #!/bin/bash

    # Parse the test context to get expected length and file path
    expected_length=$(echo "$RL_SCORER_CONTEXT" | jq -r '.expected_length')
    file_path=$(echo "$RL_SCORER_CONTEXT" | jq -r '.file_path')

    # Read the file contents
    file_contents=$(cat "$file_path")

    # Get the actual length by counting characters in file contents
    actual_length=$(echo -n "$file_contents" | wc -m)

    # Calculate the absolute difference between actual and expected length
    diff=$(( actual_length > expected_length ? actual_length - expected_length : expected_length - actual_length ))
    
    # Calculate score based on difference (1.0 when equal, decreasing linearly as difference increases)
    # Use bc for floating point math
    score=$(echo "scale=2; 1.0 - ($diff / $expected_length)" | bc)
    
    # Ensure score doesn't go below 0
    if (( $(echo "$score < 0" | bc -l) )); then
        echo "0.0"
    else
        echo "$score"
    fi
    """,
    type="my_custom_scorer_type",
)
print(scorer.id)
Note the use of the RL_SCORER_CONTEXT environment variable, which passes the test context to the scorer as a JSON string. It is available to every custom scorer and is useful whenever a scorer needs the input context; for example, an LLM-as-judge scorer needs enough input context to produce a meaningful verdict.
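
For example, with the Scenario defined in the next section, the file-length scorer above would receive a context along the lines of {"expected_length": 10, "file_path": "/home/user/file.txt"} (assuming scorer_params are passed through to the scorer unchanged), which the script parses with jq.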

Using Custom Scorers in Scenarios

You can reuse a custom scorer in multiple Scenarios, or inline the scorer into each Scenario for consistency. Here’s an example that uses the scorer above to check whether the agent writes a file with exactly 10 characters:
import os
from runloop_api_client import Runloop

client = Runloop(
    bearer_token=os.environ.get("RUNLOOP_API_KEY"),  # This is the default and can be omitted
)
scenario_view = client.scenarios.create(
    input_context={
        "problem_statement": "How many characters are in the file provided in /home/user/file.txt?"
    },
    name="name",
    scoring_contract={
        "scoring_function_parameters": [{
            "name": "my scorer",
            "scorer": {
              "type": "custom_scorer",
              "custom_scorer_type": "my_custom_scorer_type",
              "scorer_params": {
                "expected_length": 10,
                "file_path": "/home/user/file.txt"
              }
            },
            "weight": 1.0,
        }]
    },
)
print(scenario_view.id)
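
Because the scorer script is registered once under my_custom_scorer_type, the same scorer can back other Scenarios with different parameters; only scorer_params changes per Scenario. As an illustrative sketch (the Scenario name, file path, and length below are hypothetical):
another_scenario = client.scenarios.create(
    input_context={
        "problem_statement": "Write a 25-character summary to /home/user/summary.txt."
    },
    name="summary-length-check",  # hypothetical Scenario name
    scoring_contract={
        "scoring_function_parameters": [{
            "name": "my scorer",
            "scorer": {
              "type": "custom_scorer",
              "custom_scorer_type": "my_custom_scorer_type",  # same scorer as above
              "scorer_params": {
                "expected_length": 25,
                "file_path": "/home/user/summary.txt"
              }
            },
            "weight": 1.0,
        }]
    },
)
print(another_scenario.id)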

Best Practices

When using custom scorers to train a model or agent, follow these best practices:
  1. Output Score: The scorer must emit a score between 0.0 and 1.0 as the last line of its output.
  2. Start Simple: The more complex the scorer logic, the more likely the agent is to discover an unintended way to maximize the score (i.e., to reward hack the scorer). Start simple and add complexity to tune results.
  3. Clone Scenarios: Clone a Scenario and replace the scoring function to test a different behavior.
  4. Evaluate Early and Often: Evaluate the agent’s performance early and often to identify problems and improve the agent faster. Don’t start a training run until you’re happy with the scoring function; a quick local check like the sketch after this list can help validate the scorer itself.
  5. Establish a Baseline: It’s not where you start, it’s where you end. Establish a baseline score for the agent’s performance before training to track progress.
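
Before wiring a custom scorer into a training run, it can help to smoke-test the script locally, outside of Runloop. The sketch below is hypothetical and not part of the Runloop SDK: it assumes the bash script above has been saved to scorer.sh, fakes the RL_SCORER_CONTEXT that the platform would provide, and checks that the last line of output is a score between 0.0 and 1.0:
import json
import os
import subprocess

# Fake the context the platform would pass via RL_SCORER_CONTEXT
context = {"expected_length": 10, "file_path": "/tmp/sample.txt"}

# Create a file with exactly the expected length, so the score should be 1.0
with open(context["file_path"], "w") as f:
    f.write("a" * context["expected_length"])

# scorer.sh is assumed to contain the same script passed as bash_script above
result = subprocess.run(
    ["bash", "scorer.sh"],
    env={**os.environ, "RL_SCORER_CONTEXT": json.dumps(context)},
    capture_output=True,
    text=True,
)

# The score must be the last line of output, between 0.0 and 1.0
score = float(result.stdout.strip().splitlines()[-1])
assert 0.0 <= score <= 1.0, f"score out of range: {score}"
print(score)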