Overview
Scoring functions are every bit as consequential as the test under consideration. Runloop enables full control over scoring functions for a Scenario without needing code changes.

Why Use Custom Scorers?
Changing the scoring function changes the reward signal for the agent: it changes what success looks like and can teach the agent to avoid undesirable behaviors.
- Reusability: Take an existing Scenario and repurpose it to test a different behavior. This is particularly useful for detecting model regressions along common dimensions, like security or privacy.
- Composability: Extend an existing scoring function to include additional criteria. For example, you can take a SWE-Bench scenario and add an additional score component that rewards the agent for keeping costs low (see the sketch after this list).
- Flexibility: Incorporate powerful evaluation techniques like LLM-based scoring or grading using external tools.
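As an illustration of composing score components, the sketch below blends a pass/fail correctness score with a cost-efficiency term. The weights, the cost ceiling, and the function names are assumptions for illustration, not Runloop defaults.

```python
# Illustrative only: composing a base correctness score with a cost penalty.
# MAX_COST_USD and the 0.8 / 0.2 weights are arbitrary assumptions.

MAX_COST_USD = 2.00  # hypothetical budget for a single rollout


def composite_score(tests_passed: bool, cost_usd: float) -> float:
    """Blend task correctness with a reward for keeping costs low."""
    correctness = 1.0 if tests_passed else 0.0
    cost_efficiency = max(0.0, 1.0 - cost_usd / MAX_COST_USD)
    return 0.8 * correctness + 0.2 * cost_efficiency


# The final score must be printed as the last line, between 0.0 and 1.0.
print(composite_score(tests_passed=True, cost_usd=0.50))
```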
Creating a Custom Scorer
The RL_SCORER_CONTEXT environment variable provides the test context to the scorer as a JSON string. This is useful when writing a scorer that needs the input context; for example, using an LLM as a judge to evaluate a model response requires enough input context for the LLM to give a meaningful answer. The variable puts this context in an easy-to-use format and is available to any custom scorer.

Here’s an example of creating a custom scorer that evaluates the length of an agent’s response written to a file:
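Below is a minimal sketch of such a scorer. The file path, the target length, and the standalone-Python-script form are assumptions for illustration; the contract is simply that the final line of output is a score between 0.0 and 1.0.

```python
#!/usr/bin/env python3
"""Illustrative scorer: rewards shorter responses written to a (hypothetical) file."""
import json
import os

# RL_SCORER_CONTEXT carries the test context as a JSON string (may be absent when run locally).
context = json.loads(os.environ.get("RL_SCORER_CONTEXT", "{}"))

RESPONSE_FILE = "/home/user/response.txt"  # hypothetical path the agent writes to
TARGET_LENGTH = 200                        # assumed "ideal" maximum length in characters

try:
    with open(RESPONSE_FILE) as f:
        length = len(f.read())
    # Full credit at or below the target, decaying linearly to 0.0 at twice the target.
    score = max(0.0, min(1.0, 2.0 - length / TARGET_LENGTH)) if length > 0 else 0.0
except FileNotFoundError:
    score = 0.0

# The score must be the last line of output, between 0.0 and 1.0.
print(score)
```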
Using Custom Scorers in Scenarios
You can reuse a custom scorer across multiple Scenarios, or inline the scorer into each Scenario for consistency. Here’s an example that uses the scorer to evaluate whether an agent writes a file with exactly 10 characters:
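A minimal sketch of the scorer logic for this check appears below. The output file path is a hypothetical placeholder; the Scenario would reference this script as its scoring function.

```python
#!/usr/bin/env python3
"""Illustrative scorer: full credit only when the file contains exactly 10 characters."""

OUTPUT_FILE = "/home/user/output.txt"  # hypothetical path the Scenario asks the agent to write

try:
    with open(OUTPUT_FILE) as f:
        contents = f.read()
    score = 1.0 if len(contents) == 10 else 0.0
except FileNotFoundError:
    score = 0.0

# The score must be the last line of output.
print(score)
```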
Best Practices
When using custom scorers to train a model or agent, follow these best practices:
- Output Score: The scorer must output a score between 0.0 and 1.0 as the last line of its output.
- Start Simple: The more complex the scorer logic, the more likely it is that the agent will discover an unintended way to maximize the score (i.e., the model will learn to reward hack the scorer). Start simple and add complexity as you tune results.
- Clone Scenarios: Clone a Scenario and replace the scoring function to test a different behavior.
- Evaluate Early and Often: Evaluate the agent’s performance early and often to identify problems and improve the agent faster. Don’t start a training run until you’re happy with the scoring function.
- Establish a Baseline: It’s not where you start, it’s where you end. Establish a baseline score for the agent’s performance before training to track progress.
