A comprehensive framework for evaluating AI agents based on Anthropic's "Demystifying Evals for AI Agents" design principles.
This framework provides a complete evaluation infrastructure for AI agents with:
- Tasks: Individual test cases with defined inputs and success criteria
- Trials: Multiple attempts per task to handle model variability
- Graders: Deterministic, LLM-based, and human evaluation methods
- Transcripts: Complete records of agent interactions
- Metrics: pass@k and pass^k scoring
- Evaluation Harness: Infrastructure to run evals end-to-end
Install the framework:

```bash
# Clone or copy to your project
cd anthropic_agent_eval

# Install in development mode
pip install -e .

# Install with dev dependencies
pip install -e ".[dev]"
```

To define a task, create a YAML file (e.g., `tasks/my_task.yaml`):
```yaml
task:
  id: "greeting-test_1"
  description: "Test that the agent responds to greetings appropriately"
  input:
    prompt: "Hello! How are you?"
  graders:
    - type: string_match
      params:
        pattern: "hello|hi|hey"
        mode: regex
        case_sensitive: false
    - type: llm_assertion
      params:
        assertions:
          - "The response is polite and friendly"
          - "The response acknowledges the greeting"
  environment:
    timeout: 60
```

Then run the evaluation from Python:
```python
import asyncio

from eval_framework import (
    load_task_suite,
    AgentHarness,
    EvalHarness,
    Tool,
)

# Define tools for your agent
tools = [
    Tool(
        name="read_file",
        description="Read a file",
        function=lambda path: open(path).read()
    ),
]

# Load tasks
tasks = load_task_suite("tasks/")
# Create agent harness
agent = AgentHarness(tools=tools)
# Run evaluation
harness = EvalHarness(tasks, agent)
results = asyncio.run(harness.run_evaluation(trials_per_task=3))
# Display results
print(results.summary())
results.save("results.json")

# Key metrics
print(f"pass@1: {results.metrics['mean_pass@1']:.2%}")
print(f"pass@3: {results.metrics['mean_pass@3']:.2%}")
print(f"pass^3: {results.metrics['mean_pass^3']:.2%}")
```

- pass@k: Probability of at least one success in k attempts. Higher k = higher scores.
- pass^k: Probability that all k attempts succeed. Higher k = lower scores (stricter).
Example with 75% per-trial success rate:
- pass@3 ≈ 98% (likely to get at least one success)
- pass^3 ≈ 42% (all three attempts must succeed, so it is harder)
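These figures follow directly from the per-trial success rate, assuming independent trials. Below is a minimal sketch of the arithmetic (standalone illustration only; the framework's own implementation lives in `core/metrics.py`):

```python
# Standalone illustration of the pass@k / pass^k arithmetic;
# not the framework's metrics API.

def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1 - (1 - p) ** k


def pass_power_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k


p = 0.75  # per-trial success rate
print(f"pass@3 = {pass_at_k(p, 3):.1%}")     # ~98.4%
print(f"pass^3 = {pass_power_k(p, 3):.1%}")  # ~42.2%
```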
Deterministic graders:

| Grader | Purpose | Config Keys |
|---|---|---|
| `string_match` | Match patterns in output | `pattern`, `mode`, `case_sensitive` |
| `deterministic_tests` | Run test suites | `test_files`, `test_command`, `working_dir` |
| `static_analysis` | Run linters | `commands`, `max_issues` |
| `state_check` | Verify final state | `expect`, `checks` |
| `tool_calls` | Verify tool usage | `required`, `forbidden`, `order_matters` |
LLM-based graders:

| Grader | Purpose | Config Keys |
|---|---|---|
| `llm_rubric` | Score against rubric | `rubric`, `dimensions`, `pass_threshold` |
| `llm_assertion` | Check assertions | `assertions`, `require_all` |
| `llm_pairwise` | Compare outputs | `reference_transcript`, `criteria` |
| `llm_reference` | Match reference | `reference`, `similarity_threshold` |
Human graders:

| Grader | Purpose | Config Keys |
|---|---|---|
| `human` | Human review | `mode`, `rubric`, `dimensions` |
Full task schema:

```yaml
task:
  id: "unique-task-id"
  description: "What this task tests"

  input:
    prompt: "Task instruction for the agent"
    # Any other context...

  graders:
    - type: grader_type
      weight: 1.0        # Relative weight in scoring
      required: true     # Must pass for task to pass
      params:
        # Grader-specific parameters

  metadata:
    category: "category"
    difficulty: "easy|medium|hard"
    tags: ["tag1", "tag2"]

  environment:
    isolation: true      # Run in isolated environment
    timeout: 300         # Max seconds
    working_dir: "/path"
    setup_commands: []
    teardown_commands: []
```
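The `weight` and `required` fields control how individual grader results roll up into a task-level verdict. As a rough mental model only, a weighted average gated by required graders might look like the sketch below; `combine_graders` is a hypothetical helper, and the framework's actual aggregation lives under `graders/` and `harness/`.

```python
# Illustrative only: one plausible way grader results could be combined.
# This is NOT the framework's API; see src/eval_framework/harness/ for the real logic.

def combine_graders(results, pass_threshold=0.7):
    """results: list of dicts with 'score', 'passed', 'weight', 'required'."""
    # Any failing required grader sinks the whole task.
    if any(r["required"] and not r["passed"] for r in results):
        return 0.0, False

    # Otherwise, compute a weight-averaged score and compare to a threshold.
    total_weight = sum(r["weight"] for r in results)
    score = sum(r["score"] * r["weight"] for r in results) / total_weight
    return score, score >= pass_threshold


score, passed = combine_graders([
    {"score": 1.0, "passed": True,  "weight": 1.0, "required": True},   # e.g. string_match
    {"score": 0.6, "passed": False, "weight": 2.0, "required": False},  # e.g. llm_rubric
])
print(f"task score={score:.2f}, passed={passed}")  # task score=0.73, passed=True
```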
Project layout:

```
anthropic_agent_eval/
├── src/eval_framework/
│   ├── core/                  # Data models
│   │   ├── task.py            # Task definitions
│   │   ├── trial.py           # Trial management
│   │   ├── transcript.py
│   │   ├── outcome.py
│   │   └── metrics.py         # pass@k, pass^k
│   ├── graders/               # Evaluation graders
│   │   ├── base.py
│   │   ├── deterministic.py
│   │   ├── llm_based.py
│   │   └── human.py
│   ├── harness/               # Execution harnesses
│   │   ├── agent_harness.py
│   │   └── eval_harness.py
│   └── config/                # Configuration loading
│       └── loader.py
├── examples/
│   ├── tasks/                 # Example task YAMLs
│   ├── rubrics/               # Example rubrics
│   └── run_eval.py            # Example script
└── tests/
```
A custom agent is an async function passed to `AgentHarness`:

```python
async def my_agent(input_data, execute_tool, transcript):
    """Custom agent implementation."""
    # Read task
    prompt = input_data["prompt"]

    # Use tools
    content = await execute_tool("read_file", path="src/main.py")

    # Add reasoning to transcript
    transcript.add_reasoning("Analyzing the code...")

    # Return result
    return {"result": "Task completed", "files_modified": ["src/main.py"]}


agent = AgentHarness(agent_fn=my_agent, tools=tools)
```

To run tasks in an isolated environment, use `IsolatedEvalHarness`:

```python
from eval_framework.harness.eval_harness import IsolatedEvalHarness

harness = IsolatedEvalHarness(
    tasks,
    agent,
    isolation_mode="tempdir"  # or "docker"
)
```

Custom graders subclass `BaseGrader` and register themselves with `register_grader`:

```python
from eval_framework.graders.base import BaseGrader, GraderResult, register_grader

@register_grader("my_grader")
class MyGrader(BaseGrader):
    def grade(self, transcript, outcome, task=None):
        # Custom grading logic
        score = 0.8
        return GraderResult(
            score=score,
            passed=score >= 0.7,
            feedback="Custom feedback"
        )
```

Development:

```bash
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=eval_framework
```

Run the example evaluation:

```bash
# Demo with mock agent
python examples/run_eval.py
# Full evaluation with Anthropic API (requires ANTHROPIC_API_KEY)
python examples/run_eval.py --full --trials 3
```

Requirements:

- Python 3.10+
- pyyaml
- pydantic
- anthropic (for LLM graders and API agent)
- pytest (for testing)
License: MIT

Reference: Demystifying Evals for AI Agents, the design principles this framework implements.