AI Agent Evaluation Framework

A comprehensive framework for evaluating AI agents, built around the design principles in Anthropic's "Demystifying Evals for AI Agents".

Overview

This framework provides a complete evaluation infrastructure for AI agents with:

  • Tasks: Individual test cases with defined inputs and success criteria
  • Trials: Multiple attempts per task to handle model variability
  • Graders: Deterministic, LLM-based, and human evaluation methods
  • Transcripts: Complete records of agent interactions
  • Metrics: pass@k and pass^k scoring
  • Evaluation Harness: Infrastructure to run evals end-to-end

Installation

# Clone or copy to your project
cd anthropic_agent_eval

# Install in development mode
pip install -e .

# Install with dev dependencies
pip install -e ".[dev]"

Quick Start

1. Define a Task

Create a YAML file (e.g., tasks/my_task.yaml):

task:
  id: "greeting-test_1"
  description: "Test that the agent responds to greetings appropriately"

  input:
    prompt: "Hello! How are you?"

  graders:
    - type: string_match
      params:
        pattern: "hello|hi|hey"
        mode: regex
        case_sensitive: false

    - type: llm_assertion
      params:
        assertions:
          - "The response is polite and friendly"
          - "The response acknowledges the greeting"

  environment:
    timeout: 60

2. Run an Evaluation

import asyncio
from eval_framework import (
    load_task_suite,
    AgentHarness,
    EvalHarness,
    Tool,
)

# Define tools for your agent
def read_file(path: str) -> str:
    """Return the contents of a text file."""
    with open(path) as f:
        return f.read()

tools = [
    Tool(
        name="read_file",
        description="Read a file",
        function=read_file,
    ),
]

# Load tasks
tasks = load_task_suite("tasks/")

# Create agent harness
agent = AgentHarness(tools=tools)

# Run evaluation
harness = EvalHarness(tasks, agent)
results = asyncio.run(harness.run_evaluation(trials_per_task=3))

# Display results
print(results.summary())
results.save("results.json")

3. View Metrics

# Key metrics
print(f"pass@1: {results.metrics['mean_pass@1']:.2%}")
print(f"pass@3: {results.metrics['mean_pass@3']:.2%}")
print(f"pass^3: {results.metrics['mean_pass^3']:.2%}")

Key Concepts

Metrics

  • pass@k: Probability of at least one success in k attempts; increases with k. For independent trials with per-trial success rate p, pass@k = 1 − (1 − p)^k.
  • pass^k: Probability that all k attempts succeed; decreases with k, making it the stricter reliability measure. Under the same assumption, pass^k = p^k.

Example with a 75% per-trial success rate:

  • pass@3 = 1 − 0.25^3 ≈ 98% (very likely to get at least one success)
  • pass^3 = 0.75^3 ≈ 42% (much harder to succeed on all three attempts)
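
These numbers follow directly from the formulas above. A standalone sketch (not framework code, and assuming independent trials with a known per-trial success rate p):

# Standalone sketch: pass@k and pass^k from an assumed per-trial
# success rate p, treating trials as independent. Not framework code.
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k attempts."""
    return 1 - (1 - p) ** k

def pass_all_k(p: float, k: int) -> float:
    """Probability that all k attempts succeed (pass^k)."""
    return p ** k

print(f"pass@3 = {pass_at_k(0.75, 3):.1%}")   # 98.4%
print(f"pass^3 = {pass_all_k(0.75, 3):.1%}")  # 42.2%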

Grader Types

Deterministic Graders (Code-based)

Grader                Purpose                    Config Keys
string_match          Match patterns in output   pattern, mode, case_sensitive
deterministic_tests   Run test suites            test_files, test_command, working_dir
static_analysis       Run linters                commands, max_issues
state_check           Verify final state         expect, checks
tool_calls            Verify tool usage          required, forbidden, order_matters

LLM-based Graders

Grader          Purpose                Config Keys
llm_rubric      Score against rubric   rubric, dimensions, pass_threshold
llm_assertion   Check assertions       assertions, require_all
llm_pairwise    Compare outputs        reference_transcript, criteria
llm_reference   Match reference        reference, similarity_threshold

Human Graders

Grader   Purpose        Config Keys
human    Human review   mode, rubric, dimensions
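
For instance, a deterministic tool_calls check and an LLM assertion check can be combined in a task's graders block. This is a sketch, assuming each grader's config keys from the tables above are passed through params exactly as in the Quick Start task (the tool name and assertion text are illustrative):

graders:
  - type: tool_calls
    params:
      required: ["read_file"]     # tools the agent must have called
      order_matters: false

  - type: llm_assertion
    params:
      assertions:
        - "The response explains what the file contains"
      require_all: true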

Task Configuration

task:
  id: "unique-task-id"
  description: "What this task tests"

  input:
    prompt: "Task instruction for the agent"
    # Any other context...

  graders:
    - type: grader_type
      weight: 1.0        # Relative weight in scoring
      required: true     # Must pass for task to pass
      params:
        # Grader-specific parameters

  metadata:
    category: "category"
    difficulty: "easy|medium|hard"
    tags: ["tag1", "tag2"]

  environment:
    isolation: true      # Run in isolated environment
    timeout: 300         # Max seconds
    working_dir: "/path"
    setup_commands: []
    teardown_commands: []
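
How weight and required interact is easiest to see with a small standalone sketch. This is one plausible aggregation rule (the GraderOutcome type and threshold are hypothetical, for illustration only); the framework's actual scoring logic may differ:

# Hypothetical illustration of combining weighted grader results for one trial.
from dataclasses import dataclass

@dataclass
class GraderOutcome:          # illustrative stand-in, not a framework class
    score: float              # 0.0 to 1.0
    passed: bool
    weight: float = 1.0
    required: bool = False

def aggregate(results: list[GraderOutcome], pass_threshold: float = 0.7):
    """Weighted-average score; required graders act as hard gates."""
    total = sum(r.weight for r in results)
    score = sum(r.score * r.weight for r in results) / total
    passed = score >= pass_threshold and all(r.passed for r in results if r.required)
    return score, passed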

Project Structure

anthropic_agent_eval/
├── src/eval_framework/
│   ├── core/           # Data models
│   │   ├── task.py     # Task definitions
│   │   ├── trial.py    # Trial management
│   │   ├── transcript.py
│   │   ├── outcome.py
│   │   └── metrics.py  # pass@k, pass^k
│   ├── graders/        # Evaluation graders
│   │   ├── base.py
│   │   ├── deterministic.py
│   │   ├── llm_based.py
│   │   └── human.py
│   ├── harness/        # Execution harnesses
│   │   ├── agent_harness.py
│   │   └── eval_harness.py
│   └── config/         # Configuration loading
│       └── loader.py
├── examples/
│   ├── tasks/          # Example task YAMLs
│   ├── rubrics/        # Example rubrics
│   └── run_eval.py     # Example script
└── tests/

Advanced Usage

Custom Agent Function

async def my_agent(input_data, execute_tool, transcript):
    """Custom agent implementation."""
    # Read task
    prompt = input_data["prompt"]

    # Use tools
    content = await execute_tool("read_file", path="src/main.py")

    # Add reasoning to transcript
    transcript.add_reasoning("Analyzing the code...")

    # Return result
    return {"result": "Task completed", "files_modified": ["src/main.py"]}

agent = AgentHarness(agent_fn=my_agent, tools=tools)
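
As the example suggests, the agent function receives the task's input block as input_data, a callable for invoking the registered tools, and the live transcript; whatever it returns is recorded as the trial's outcome and passed to the graders.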

Isolated Evaluation

from eval_framework.harness.eval_harness import IsolatedEvalHarness

harness = IsolatedEvalHarness(
    tasks,
    agent,
    isolation_mode="tempdir"  # or "docker"
)

Custom Graders

from eval_framework.graders.base import BaseGrader, GraderResult, register_grader

@register_grader("my_grader")
class MyGrader(BaseGrader):
    def grade(self, transcript, outcome, task=None):
        # Custom grading logic
        score = 0.8
        return GraderResult(
            score=score,
            passed=score >= 0.7,
            feedback="Custom feedback"
        )
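
Once registered, the grader should be selectable from task YAML files like any built-in grader, via type: my_grader.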

Running Tests

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest tests/ --cov=eval_framework

Demo

Run the example evaluation:

# Demo with mock agent
python examples/run_eval.py

# Full evaluation with Anthropic API (requires ANTHROPIC_API_KEY)
python examples/run_eval.py --full --trials 3

Requirements

  • Python 3.10+
  • pyyaml
  • pydantic
  • anthropic (for LLM graders and API agent)
  • pytest (for testing)

License

MIT
