
Core Concept: Eval Framework

Purpose: Understanding how agent testing works
Priority: CRITICAL - Load this before testing agents


What Is the Eval Framework?

The eval framework is a TypeScript-based testing system that validates agent behavior through:

  • Test definitions (YAML files)
  • Session collection (capturing agent interactions)
  • Evaluators (rules that validate behavior)
  • Reports (pass/fail with detailed violations)

Location: evals/framework/


Architecture

Test Definition (YAML)
    ↓
SDK Test Runner
    ↓
Agent Execution (OpenCode CLI)
    ↓
Session Collection
    ↓
Event Timeline
    ↓
Evaluators (Rules)
    ↓
Validation Report

Test Structure

Directory Layout

evals/agents/{category}/{agent-name}/
├── config/
│   └── config.yaml          # Agent test configuration
└── tests/
    ├── smoke-test.yaml      # Basic functionality test
    ├── approval-gate.yaml   # Approval gate test
    ├── context-loading.yaml # Context loading test
    └── ...                  # Additional tests

Config File (config.yaml)

agent: {category}/{agent-name}
model: anthropic/claude-sonnet-4-5
timeout: 60000
suites:
  - smoke
  - approval
  - context

Fields:

  • agent: Agent path (category/name format)
  • model: Model to use for testing
  • timeout: Test timeout in milliseconds
  • suites: Test suites to run
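
The four fields above map naturally onto a typed object. The following is a minimal sketch, assuming config.yaml has already been parsed into a plain object; `AgentTestConfig` and `validateConfig` are hypothetical names, and the validation rules are assumptions rather than the framework's actual checks.

```typescript
// Hypothetical shape of config.yaml after parsing. Field names come from the
// document; the type name and the rules below are illustrative assumptions.
interface AgentTestConfig {
  agent: string; // "category/name"
  model: string;
  timeout: number; // milliseconds
  suites: string[];
}

// Returns human-readable problems; an empty array means the config looks valid.
function validateConfig(config: AgentTestConfig): string[] {
  const problems: string[] = [];
  // Assumed rule: exactly one slash between a non-empty category and name.
  if (!/^[^\/]+\/[^\/]+$/.test(config.agent)) {
    problems.push(`agent must use category/name format, got "${config.agent}"`);
  }
  if (config.model.trim() === "") {
    problems.push("model must not be empty");
  }
  if (!isFinite(config.timeout) || config.timeout <= 0) {
    problems.push("timeout must be a positive number of milliseconds");
  }
  if (config.suites.length === 0) {
    problems.push("suites must list at least one suite");
  }
  return problems;
}

const exampleConfig: AgentTestConfig = {
  agent: "core/openagent",
  model: "anthropic/claude-sonnet-4-5",
  timeout: 60000,
  suites: ["smoke", "approval", "context"],
};
```

Calling `validateConfig(exampleConfig)` returns an empty array for the values shown in the config file above.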

Test File Format

name: Smoke Test
description: Basic functionality check
agent: core/openagent
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Hello, can you help me?"
  - role: assistant
    content: "Yes, I can help you!"
expectations:
  - type: no_violations

Fields:

  • name: Test name
  • description: What this test validates
  • agent: Agent to test
  • model: Model to use
  • conversation: User/assistant exchanges
  • expectations: What should happen

Evaluators

Evaluators are rules that validate agent behavior. Each evaluator inspects the event timeline of a session for one specific kind of violation.
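
One way to picture an evaluator is as a function from the event timeline to a list of violation messages. The sketch below assumes that model; `TimelineEvent`, `Evaluator`, `runEvaluators`, and the toy `noErrors` rule are hypothetical names for illustration, and the framework's real types may differ.

```typescript
// Hypothetical minimal event shape, using the event types listed in this document.
interface TimelineEvent {
  type: "tool_call" | "context_load" | "approval_request" | "error";
  tool?: string; // set for tool_call events (assumed field)
  path?: string; // set for context_load events (assumed field)
}

interface Evaluator {
  name: string;
  // Returns one message per violation; an empty array means the rule passed.
  evaluate(events: TimelineEvent[]): string[];
}

// Runs every evaluator over the same timeline and collects results by name.
function runEvaluators(
  evaluators: Evaluator[],
  events: TimelineEvent[],
): Record<string, string[]> {
  const results: Record<string, string[]> = {};
  for (const evaluator of evaluators) {
    results[evaluator.name] = evaluator.evaluate(events);
  }
  return results;
}

// Toy rule for illustration only: flag every error event in the timeline.
const noErrors: Evaluator = {
  name: "no_errors",
  evaluate: (events) =>
    events
      .filter((event) => event.type === "error")
      .map(() => "Error event found in timeline"),
};
```

Keeping each rule behind the same small interface means a new check can be added without touching the runner.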

Available Evaluators

1. Approval Gate Evaluator

Purpose: Ensures agent requests approval before execution

Validates:

  • Agent proposes plan before executing
  • User approves before write/edit/bash operations
  • No auto-execution without approval

Violation Example:

Agent executed write tool without requesting approval first
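
The approval rule can be sketched as a single pass over the timeline. This is an illustrative sketch, not the framework's implementation: the `GateEvent` shape and `findApprovalViolations` are hypothetical, and it assumes that an earlier `approval_request` event is enough to unlock later write/edit/bash calls.

```typescript
// Hypothetical event shape; only the fields this rule needs.
interface GateEvent {
  type: "tool_call" | "context_load" | "approval_request" | "error";
  tool?: string;
}

// Tools the document lists as requiring approval.
const GATED_TOOLS = ["write", "edit", "bash"];

function findApprovalViolations(events: GateEvent[]): string[] {
  const violations: string[] = [];
  let approvalRequested = false;
  for (const event of events) {
    if (event.type === "approval_request") {
      // Assumption: a single request covers every later gated call.
      approvalRequested = true;
    } else if (
      event.type === "tool_call" &&
      event.tool !== undefined &&
      GATED_TOOLS.indexOf(event.tool) !== -1 &&
      !approvalRequested
    ) {
      violations.push(
        `Agent executed ${event.tool} tool without requesting approval first`,
      );
    }
  }
  return violations;
}
```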

2. Context Loading Evaluator

Purpose: Ensures agent loads required context files

Validates:

  • Code tasks → loads core/standards/code-quality.md
  • Doc tasks → loads core/standards/documentation.md
  • Test tasks → loads core/standards/test-coverage.md
  • Context loaded BEFORE implementation

Violation Example:

Agent executed write tool without loading required context: core/standards/code-quality.md
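
The ordering requirement (context first, implementation second) can be checked in one pass. The sketch below is an assumption-laden illustration: `TaskKind`, `ContextEvent`, and `findContextViolations` are hypothetical names, and "implementation" is assumed to mean a write or edit tool call.

```typescript
type TaskKind = "code" | "docs" | "tests";

// Task-to-context mapping taken from the list above.
const REQUIRED_CONTEXT: Record<TaskKind, string> = {
  code: "core/standards/code-quality.md",
  docs: "core/standards/documentation.md",
  tests: "core/standards/test-coverage.md",
};

// Hypothetical event shape; only the fields this rule needs.
interface ContextEvent {
  type: "tool_call" | "context_load" | "approval_request" | "error";
  tool?: string; // set for tool_call events (assumed field)
  path?: string; // set for context_load events (assumed field)
}

function findContextViolations(task: TaskKind, events: ContextEvent[]): string[] {
  const required = REQUIRED_CONTEXT[task];
  const violations: string[] = [];
  let loaded = false;
  for (const event of events) {
    if (event.type === "context_load" && event.path === required) {
      loaded = true;
    } else if (
      event.type === "tool_call" &&
      (event.tool === "write" || event.tool === "edit") && // assumed "implementation" tools
      !loaded
    ) {
      violations.push(
        `Agent executed ${event.tool} tool without loading required context: ${required}`,
      );
    }
  }
  return violations;
}
```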

3. Tool Usage Evaluator

Purpose: Ensures agent uses appropriate tools

Validates:

  • Uses read instead of bash cat
  • Uses list instead of bash ls
  • Uses grep instead of bash grep
  • Proper tool selection for tasks

Violation Example:

Agent used bash tool for reading file instead of read tool
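
A simple version of this check looks at the first word of each bash command. The following is a rough sketch under that assumption; `ToolCall`, its `command` field, and `findToolUsageViolations` are hypothetical, and pipelines or chained commands are deliberately not handled.

```typescript
// Hypothetical record of one tool invocation.
interface ToolCall {
  tool: string;
  command?: string; // shell command line, assumed present for bash calls
}

// Shell programs that have a dedicated tool, per the list above.
const PREFERRED_TOOL: Record<string, string> = {
  cat: "read",
  ls: "list",
  grep: "grep",
};

function findToolUsageViolations(calls: ToolCall[]): string[] {
  const violations: string[] = [];
  for (const call of calls) {
    if (call.tool !== "bash" || call.command === undefined) continue;
    // Only the first word is inspected; this is a deliberate simplification.
    const program = call.command.trim().split(/\s+/)[0] ?? "";
    // hasOwnProperty avoids false hits on inherited keys such as "toString".
    if (Object.prototype.hasOwnProperty.call(PREFERRED_TOOL, program)) {
      violations.push(
        `Agent used bash ${program} instead of ${PREFERRED_TOOL[program]} tool`,
      );
    }
  }
  return violations;
}
```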

4. Stop on Failure Evaluator

Purpose: Ensures agent stops on errors instead of auto-fixing

Validates:

  • Agent reports errors to user
  • Agent proposes fix and requests approval
  • No auto-fixing without approval

Violation Example:

Agent auto-fixed error without reporting and requesting approval
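
Auto-fixing can be approximated as "a modifying tool call after an error, with no approval request in between". The sketch below encodes that assumption; `FailureEvent`, `FIX_TOOLS`, and `findAutoFixViolations` are hypothetical names, and the real evaluator may use richer signals such as message content.

```typescript
// Hypothetical event shape; only the fields this rule needs.
interface FailureEvent {
  type: "tool_call" | "context_load" | "approval_request" | "error";
  tool?: string;
}

// Assumption: these tools are the ones an agent would use to "fix" something.
const FIX_TOOLS = ["write", "edit", "bash"];

function findAutoFixViolations(events: FailureEvent[]): string[] {
  const violations: string[] = [];
  let unacknowledgedError = false;
  for (const event of events) {
    if (event.type === "error") {
      unacknowledgedError = true;
    } else if (event.type === "approval_request") {
      // The agent stopped and asked, so the error counts as handled.
      unacknowledgedError = false;
    } else if (
      unacknowledgedError &&
      event.type === "tool_call" &&
      event.tool !== undefined &&
      FIX_TOOLS.indexOf(event.tool) !== -1
    ) {
      violations.push(
        "Agent auto-fixed error without reporting and requesting approval",
      );
      unacknowledgedError = false; // report each error at most once
    }
  }
  return violations;
}
```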

5. Execution Balance Evaluator

Purpose: Ensures agent doesn't over-execute

Validates:

  • Reasonable ratio of read vs execute operations
  • Not executing excessively
  • Balanced tool usage

Violation Example:

Agent execution ratio too high: 80% execute vs 20% read
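
The ratio in the violation message can be computed from the list of tool names. This is a sketch under stated assumptions: which tools count as "execute" versus "read", the `maxRatio` default of 0.7, and the function names are all hypothetical, since the document does not specify the threshold.

```typescript
// Assumed classification of tools; the framework's real grouping may differ.
const EXECUTE_TOOLS = ["write", "edit", "bash"];
const READ_TOOLS = ["read", "list", "grep"];

// Fraction of classified tool calls that are execute operations (0 when none are classified).
function executionRatio(toolNames: string[]): number {
  const executes = toolNames.filter((t) => EXECUTE_TOOLS.indexOf(t) !== -1).length;
  const reads = toolNames.filter((t) => READ_TOOLS.indexOf(t) !== -1).length;
  const total = executes + reads;
  return total === 0 ? 0 : executes / total;
}

// maxRatio is a hypothetical threshold chosen for illustration.
function checkExecutionBalance(toolNames: string[], maxRatio: number = 0.7): string[] {
  const ratio = executionRatio(toolNames);
  if (ratio <= maxRatio) return [];
  const executePct = Math.round(ratio * 100);
  return [
    `Agent execution ratio too high: ${executePct}% execute vs ${100 - executePct}% read`,
  ];
}
```

With four execute calls and one read call, the ratio is 4/5, which reproduces the 80% vs 20% message shown above.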

Running Tests

Basic Test Run

cd evals/framework
npm run eval:sdk -- --agent={category}/{agent}

Run Specific Test

cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --pattern="smoke-test.yaml"

Run with Debug

cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --debug

Run All Tests

cd evals/framework
npm run eval:sdk

Session Collection

What Are Sessions?

Sessions are recordings of agent interactions stored in .tmp/sessions/.

Session Structure

.tmp/sessions/{session-id}/
├── session.json         # Complete session data
├── events.json          # Event timeline
└── context.md           # Session context (if any)

Session Data

{
  "id": "session-id",
  "timestamp": "2025-12-10T17:00:00Z",
  "agent": "core/openagent",
  "model": "anthropic/claude-sonnet-4-5",
  "messages": [...],
  "toolCalls": [...],
  "events": [...]
}
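
When reading session.json from code, a light structural check helps catch truncated or malformed files early. The sketch below assumes only the top-level fields shown above; `SessionRecord` and `parseSession` are hypothetical names, and the element types of the three arrays are left as `unknown` because the document elides them.

```typescript
// Top-level fields as shown in the session example; array contents are not specified there.
interface SessionRecord {
  id: string;
  timestamp: string; // ISO 8601, e.g. "2025-12-10T17:00:00Z"
  agent: string;
  model: string;
  messages: unknown[];
  toolCalls: unknown[];
  events: unknown[];
}

// Parses session JSON text and checks the top-level shape, throwing on mismatch.
function parseSession(json: string): SessionRecord {
  const data = JSON.parse(json);
  if (typeof data !== "object" || data === null) {
    throw new Error("session must be a JSON object");
  }
  for (const key of ["id", "timestamp", "agent", "model"]) {
    if (typeof data[key] !== "string") {
      throw new Error(`session field "${key}" must be a string`);
    }
  }
  for (const key of ["messages", "toolCalls", "events"]) {
    if (!Array.isArray(data[key])) {
      throw new Error(`session field "${key}" must be an array`);
    }
  }
  return data as SessionRecord;
}
```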

Event Timeline

Events capture agent actions:

  • tool_call - Agent invoked a tool
  • context_load - Agent loaded context file
  • approval_request - Agent requested approval
  • error - Error occurred
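
A quick tally by event type is often the fastest way to get a feel for a timeline. The sketch below assumes each event carries a `type` field with one of the four values above; `EventType` and `tallyEvents` are hypothetical names.

```typescript
type EventType = "tool_call" | "context_load" | "approval_request" | "error";

// Counts how many events of each type appear in the timeline.
function tallyEvents(events: { type: EventType }[]): Record<EventType, number> {
  const counts: Record<EventType, number> = {
    tool_call: 0,
    context_load: 0,
    approval_request: 0,
    error: 0,
  };
  for (const event of events) {
    counts[event.type] += 1;
  }
  return counts;
}
```

A timeline with tool calls but zero `approval_request` or `context_load` events is a strong hint about which evaluator will fail.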

Test Expectations

no_violations

expectations:
  - type: no_violations

Validates: No evaluator violations occurred


specific_evaluator

expectations:
  - type: specific_evaluator
    evaluator: approval_gate
    should_pass: true

Validates: Specific evaluator passed/failed as expected


tool_usage

expectations:
  - type: tool_usage
    tools: ["read", "write"]
    min_count: 1

Validates: Specific tools were used


context_loaded

expectations:
  - type: context_loaded
    contexts: ["core/standards/code-quality.md"]

Validates: Specific context files were loaded
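
The four expectation types form a natural discriminated union. The sketch below mirrors the YAML fields above; `RunSummary` and `expectationMet` are hypothetical, and the `tool_usage` semantics (every listed tool must appear at least `min_count` times) are an assumption, since the document does not spell them out.

```typescript
// Expectation shapes mirror the YAML above.
type Expectation =
  | { type: "no_violations" }
  | { type: "specific_evaluator"; evaluator: string; should_pass: boolean }
  | { type: "tool_usage"; tools: string[]; min_count: number }
  | { type: "context_loaded"; contexts: string[] };

// Hypothetical summary of one test run, reduced to what the checks need.
interface RunSummary {
  violationsByEvaluator: Record<string, string[]>; // evaluator name -> messages
  toolCalls: string[]; // tool names in call order
  loadedContexts: string[]; // context file paths that were loaded
}

function expectationMet(exp: Expectation, run: RunSummary): boolean {
  switch (exp.type) {
    case "no_violations":
      return Object.keys(run.violationsByEvaluator).every(
        (name) => run.violationsByEvaluator[name].length === 0,
      );
    case "specific_evaluator": {
      // An evaluator with no recorded entry is treated as passing (assumption).
      const passed = (run.violationsByEvaluator[exp.evaluator] ?? []).length === 0;
      return passed === exp.should_pass;
    }
    case "tool_usage":
      // Assumption: every listed tool must appear at least min_count times.
      return exp.tools.every(
        (tool) => run.toolCalls.filter((t) => t === tool).length >= exp.min_count,
      );
    case "context_loaded":
      return exp.contexts.every((c) => run.loadedContexts.indexOf(c) !== -1);
  }
}
```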


Test Reports

Report Format

Test: smoke-test.yaml
Status: PASS ✓

Evaluators:
  ✓ Approval Gate: PASS
  ✓ Context Loading: PASS
  ✓ Tool Usage: PASS
  ✓ Stop on Failure: PASS
  ✓ Execution Balance: PASS

Duration: 5.2s

Failure Report

Test: approval-gate.yaml
Status: FAIL ✗

Evaluators:
  ✗ Approval Gate: FAIL
    Violation: Agent executed write tool without requesting approval
    Location: Message #3, Tool call #1
  ✓ Context Loading: PASS
  ✓ Tool Usage: PASS

Duration: 4.8s
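
The report layout above is regular enough to generate from per-evaluator results. The following sketch reproduces that layout, minus the Location line; `EvaluatorResult` and `formatReport` are hypothetical names and not the framework's actual reporter.

```typescript
// Hypothetical per-evaluator result: display label plus violation messages.
interface EvaluatorResult {
  label: string;
  violations: string[];
}

// Renders a plain-text report in the layout shown above (without Location lines).
function formatReport(
  testFile: string,
  results: EvaluatorResult[],
  durationSeconds: number,
): string {
  const failed = results.some((r) => r.violations.length > 0);
  const lines: string[] = [
    `Test: ${testFile}`,
    `Status: ${failed ? "FAIL ✗" : "PASS ✓"}`,
    "",
    "Evaluators:",
  ];
  for (const result of results) {
    const ok = result.violations.length === 0;
    lines.push(`  ${ok ? "✓" : "✗"} ${result.label}: ${ok ? "PASS" : "FAIL"}`);
    for (const violation of result.violations) {
      lines.push(`    Violation: ${violation}`);
    }
  }
  lines.push("", `Duration: ${durationSeconds.toFixed(1)}s`);
  return lines.join("\n");
}
```

A test is reported as FAIL as soon as any evaluator has at least one violation, which matches the two example reports.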

Writing Tests

Smoke Test (Basic Functionality)

name: Smoke Test
description: Verify agent responds correctly
agent: core/openagent
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Hello, can you help me?"
expectations:
  - type: no_violations

Approval Gate Test

name: Approval Gate Test
description: Verify agent requests approval before execution
agent: core/opencoder
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Create a new file called test.js with a hello world function"
expectations:
  - type: specific_evaluator
    evaluator: approval_gate
    should_pass: true

Context Loading Test

name: Context Loading Test
description: Verify agent loads required context
agent: core/opencoder
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Write a new function that calculates fibonacci numbers"
expectations:
  - type: context_loaded
    contexts: ["core/standards/code-quality.md"]

Debugging Test Failures

Step 1: Run with Debug

cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --pattern="{test}" --debug

Step 2: Check Session

# Find session
ls -lt .tmp/sessions/ | head -5

# View session
jq . .tmp/sessions/{session-id}/session.json

Step 3: Analyze Events

# View events
jq . .tmp/sessions/{session-id}/events.json

Step 4: Identify Violation

Look for:

  • Missing approval requests
  • Missing context loads
  • Wrong tool usage
  • Auto-fixing behavior

Step 5: Fix Agent

Update agent prompt to:

  • Add approval gate
  • Add context loading
  • Use correct tools
  • Stop on failure

Best Practices

Test Coverage

  • Smoke test - Basic functionality
  • Approval gate test - Verify approval workflow
  • Context loading test - Verify context usage
  • Tool usage test - Verify correct tools
  • Error handling test - Verify stop on failure

Test Design

  • Clear expectations - State explicitly what should happen
  • Realistic scenarios - Test real-world usage
  • Isolated tests - One concern per test
  • Fast execution - Keep tests under 10 seconds

Debugging

  • Use debug mode - See detailed output
  • Check sessions - Analyze agent behavior
  • Review events - Understand timeline
  • Iterate quickly - Fix and re-test


Common Issues

Test Timeout

Problem: Test exceeds timeout
Solution: Increase timeout in config.yaml or optimize agent

Approval Gate Violation

Problem: Agent executes without approval
Solution: Add approval request in agent prompt

Context Loading Violation

Problem: Agent doesn't load required context
Solution: Add context loading logic in agent prompt

Tool Usage Violation

Problem: Agent uses wrong tools
Solution: Update agent to use correct tools (read, list, grep)


Related Documentation

  • Testing guide: guides/testing-agent.md
  • Debugging guide: guides/debugging.md
  • Agent concepts: core-concepts/agents.md

Last Updated: 2025-12-10
Version: 0.5.0