A comprehensive framework for evaluating AI agents based on Anthropic's "Demystifying Evals for AI Agents" design principles.
This framework provides a complete evaluation infrastructure for AI agents with:
- Tasks: Individual test cases with defined inputs and success criteria
- Trials: Multiple attempts per task to handle model variability
- Graders: Deterministic, LLM-based, and human evaluation methods
- Transcripts: Complete records of agent interactions
- Metrics: pass@k and pass^k scoring
- Evaluation Harness: Infrastructure to run evals end-to-end
Install the framework:

```bash
# Clone or copy to your project
cd anthropic_agent_eval

# Install in development mode
pip install -e .

# Install with dev dependencies
pip install -e ".[dev]"
```

To define a task, create a YAML file (e.g., `tasks/my_task.yaml`):
```yaml
task:
  id: "greeting-test_1"
  description: "Test that the agent responds to greetings appropriately"
  input:
    prompt: "Hello! How are you?"
  graders:
    - type: string_match
      params:
        pattern: "hello|hi|hey"
        mode: regex
        case_sensitive: false
    - type: llm_assertion
      params:
        assertions:
          - "The response is polite and friendly"
          - "The response acknowledges the greeting"
  environment:
    timeout: 60
```

Then run the evaluation from Python:
```python
import asyncio

from eval_framework import (
    load_task_suite,
    AgentHarness,
    EvalHarness,
    Tool,
)

# Define tools for your agent
tools = [
    Tool(
        name="read_file",
        description="Read a file",
        function=lambda path: open(path).read()
    ),
]

# Load tasks
tasks = load_task_suite("tasks/")
# Create agent harness
agent = AgentHarness(tools=tools)
# Run evaluation
harness = EvalHarness(tasks, agent)
results = asyncio.run(harness.run_evaluation(trials_per_task=3))
# Display results
print(results.summary())
results.save("results.json")

# Key metrics
print(f"pass@1: {results.metrics['mean_pass@1']:.2%}")
print(f"pass@3: {results.metrics['mean_pass@3']:.2%}")
print(f"pass^3: {results.metrics['mean_pass^3']:.2%}")
```

- pass@k: Probability of at least one success in k attempts. Higher k = higher scores.
- pass^k: Probability that all k attempts succeed. Higher k = lower scores (stricter).
Example with 75% per-trial success rate:
- pass@3 ≈ 98% (likely to get at least one success)
- pass^3 ≈ 42% (all three attempts must succeed, so it is harder)
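These figures follow directly from the per-trial success rate, assuming independent trials. Below is a minimal sketch of the arithmetic (standalone illustration only; the framework's own implementation lives in `core/metrics.py`):

```python
# Standalone illustration of the pass@k / pass^k arithmetic;
# not the framework's metrics API.

def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts."""
    return 1 - (1 - p) ** k


def pass_power_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed."""
    return p ** k


p = 0.75  # per-trial success rate
print(f"pass@3 = {pass_at_k(p, 3):.1%}")     # ~98.4%
print(f"pass^3 = {pass_power_k(p, 3):.1%}")  # ~42.2%
```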
Deterministic graders:

| Grader | Purpose | Config Keys |
|---|---|---|
| `string_match` | Match patterns in output | `pattern`, `mode`, `case_sensitive` |
| `deterministic_tests` | Run test suites | `test_files`, `test_command`, `working_dir` |
| `static_analysis` | Run linters | `commands`, `max_issues` |
| `state_check` | Verify final state | `expect`, `checks` |
| `tool_calls` | Verify tool usage | `required`, `forbidden`, `order_matters` |
LLM-based graders:

| Grader | Purpose | Config Keys |
|---|---|---|
| `llm_rubric` | Score against rubric | `rubric`, `dimensions`, `pass_threshold` |
| `llm_assertion` | Check assertions | `assertions`, `require_all` |
| `llm_pairwise` | Compare outputs | `reference_transcript`, `criteria` |
| `llm_reference` | Match reference | `reference`, `similarity_threshold` |
Human graders:

| Grader | Purpose | Config Keys |
|---|---|---|
| `human` | Human review | `mode`, `rubric`, `dimensions` |
Full task schema:

```yaml
task:
  id: "unique-task-id"
  description: "What this task tests"

  input:
    prompt: "Task instruction for the agent"
    # Any other context...

  graders:
    - type: grader_type
      weight: 1.0        # Relative weight in scoring
      required: true     # Must pass for task to pass
      params:
        # Grader-specific parameters

  metadata:
    category: "category"
    difficulty: "easy|medium|hard"
    tags: ["tag1", "tag2"]

  environment:
    isolation: true      # Run in isolated environment
    timeout: 300         # Max seconds
    working_dir: "/path"
    setup_commands: []
    teardown_commands: []
```
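The `weight` and `required` fields control how individual grader results roll up into a task-level verdict. As a rough mental model only, a weighted average gated by required graders might look like the sketch below; `combine_graders` is a hypothetical helper, and the framework's actual aggregation lives under `graders/` and `harness/`.

```python
# Illustrative only: one plausible way grader results could be combined.
# This is NOT the framework's API; see src/eval_framework/harness/ for the real logic.

def combine_graders(results, pass_threshold=0.7):
    """results: list of dicts with 'score', 'passed', 'weight', 'required'."""
    # Any failing required grader sinks the whole task.
    if any(r["required"] and not r["passed"] for r in results):
        return 0.0, False

    # Otherwise, compute a weight-averaged score and compare to a threshold.
    total_weight = sum(r["weight"] for r in results)
    score = sum(r["score"] * r["weight"] for r in results) / total_weight
    return score, score >= pass_threshold


score, passed = combine_graders([
    {"score": 1.0, "passed": True,  "weight": 1.0, "required": True},   # e.g. string_match
    {"score": 0.6, "passed": False, "weight": 2.0, "required": False},  # e.g. llm_rubric
])
print(f"task score={score:.2f}, passed={passed}")  # task score=0.73, passed=True
```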
Project layout:

```
anthropic_agent_eval/
├── src/eval_framework/
│   ├── core/                  # Data models
│   │   ├── task.py            # Task definitions
│   │   ├── trial.py           # Trial management
│   │   ├── transcript.py
│   │   ├── outcome.py
│   │   └── metrics.py         # pass@k, pass^k
│   ├── graders/               # Evaluation graders
│   │   ├── base.py
│   │   ├── deterministic.py
│   │   ├── llm_based.py
│   │   └── human.py
│   ├── harness/               # Execution harnesses
│   │   ├── agent_harness.py
│   │   └── eval_harness.py
│   └── config/                # Configuration loading
│       └── loader.py
├── examples/
│   ├── tasks/                 # Example task YAMLs
│   ├── rubrics/               # Example rubrics
│   └── run_eval.py            # Example script
└── tests/
```
A custom agent is an async function passed to `AgentHarness`:

```python
async def my_agent(input_data, execute_tool, transcript):
    """Custom agent implementation."""
    # Read task
    prompt = input_data["prompt"]

    # Use tools
    content = await execute_tool("read_file", path="src/main.py")

    # Add reasoning to transcript
    transcript.add_reasoning("Analyzing the code...")

    # Return result
    return {"result": "Task completed", "files_modified": ["src/main.py"]}


agent = AgentHarness(agent_fn=my_agent, tools=tools)
```

To run tasks in an isolated environment, use `IsolatedEvalHarness`:

```python
from eval_framework.harness.eval_harness import IsolatedEvalHarness

harness = IsolatedEvalHarness(
    tasks,
    agent,
    isolation_mode="tempdir"  # or "docker"
)
```

Custom graders subclass `BaseGrader` and register themselves with `register_grader`:

```python
from eval_framework.graders.base import BaseGrader, GraderResult, register_grader

@register_grader("my_grader")
class MyGrader(BaseGrader):
    def grade(self, transcript, outcome, task=None):
        # Custom grading logic
        score = 0.8
        return GraderResult(
            score=score,
            passed=score >= 0.7,
            feedback="Custom feedback"
        )
```

Development:

```bash
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run with coverage
pytest tests/ --cov=eval_framework
```

Run the example evaluation:

```bash
# Demo with mock agent
python examples/run_eval.py
# Full evaluation with Anthropic API (requires ANTHROPIC_API_KEY)
python examples/run_eval.py --full --trials 3
```

Requirements:

- Python 3.10+
- pyyaml
- pydantic
- anthropic (for LLM graders and API agent)
- pytest (for testing)
License: MIT

Reference: Demystifying Evals for AI Agents, the design principles this framework implements.