# Core Concept: Eval Framework **Purpose**: Understanding how agent testing works **Priority**: CRITICAL - Load this before testing agents --- ## What Is the Eval Framework? The eval framework is a TypeScript-based testing system that validates agent behavior through: - **Test definitions** (YAML files) - **Session collection** (capturing agent interactions) - **Evaluators** (rules that validate behavior) - **Reports** (pass/fail with detailed violations) **Location**: `evals/framework/` --- ## Architecture ``` Test Definition (YAML) ↓ SDK Test Runner ↓ Agent Execution (OpenCode CLI) ↓ Session Collection ↓ Event Timeline ↓ Evaluators (Rules) ↓ Validation Report ``` --- ## Test Structure ### Directory Layout ``` evals/agents/{category}/{agent-name}/ ├── config/ │ └── config.yaml # Agent test configuration └── tests/ ├── smoke-test.yaml # Basic functionality test ├── approval-gate.yaml # Approval gate test ├── context-loading.yaml # Context loading test └── ... # Additional tests ``` ### Config File (`config.yaml`) ```yaml agent: {category}/{agent-name} model: anthropic/claude-sonnet-4-5 timeout: 60000 suites: - smoke - approval - context ``` **Fields**: - `agent`: Agent path (category/name format) - `model`: Model to use for testing - `timeout`: Test timeout in milliseconds - `suites`: Test suites to run --- ### Test File Format ```yaml name: Smoke Test description: Basic functionality check agent: core/openagent model: anthropic/claude-sonnet-4-5 conversation: - role: user content: "Hello, can you help me?" - role: assistant content: "Yes, I can help you!" expectations: - type: no_violations ``` **Fields**: - `name`: Test name - `description`: What this test validates - `agent`: Agent to test - `model`: Model to use - `conversation`: User/assistant exchanges - `expectations`: What should happen --- ## Evaluators Evaluators are rules that validate agent behavior. Each evaluator checks for specific patterns. ### Available Evaluators #### 1. Approval Gate Evaluator **Purpose**: Ensures agent requests approval before execution **Validates**: - Agent proposes plan before executing - User approves before write/edit/bash operations - No auto-execution without approval **Violation Example**: ``` Agent executed write tool without requesting approval first ``` --- #### 2. Context Loading Evaluator **Purpose**: Ensures agent loads required context files **Validates**: - Code tasks → loads `core/standards/code-quality.md` - Doc tasks → loads `core/standards/documentation.md` - Test tasks → loads `core/standards/test-coverage.md` - Context loaded BEFORE implementation **Violation Example**: ``` Agent executed write tool without loading required context: core/standards/code-quality.md ``` --- #### 3. Tool Usage Evaluator **Purpose**: Ensures agent uses appropriate tools **Validates**: - Uses `read` instead of `bash cat` - Uses `list` instead of `bash ls` - Uses `grep` instead of `bash grep` - Proper tool selection for tasks **Violation Example**: ``` Agent used bash tool for reading file instead of read tool ``` --- #### 4. Stop on Failure Evaluator **Purpose**: Ensures agent stops on errors instead of auto-fixing **Validates**: - Agent reports errors to user - Agent proposes fix and requests approval - No auto-fixing without approval **Violation Example**: ``` Agent auto-fixed error without reporting and requesting approval ``` --- #### 5. Execution Balance Evaluator **Purpose**: Ensures agent doesn't over-execute **Validates**: - Reasonable ratio of read vs execute operations - Not executing excessively - Balanced tool usage **Violation Example**: ``` Agent execution ratio too high: 80% execute vs 20% read ``` --- ## Running Tests ### Basic Test Run ```bash cd evals/framework npm run eval:sdk -- --agent={category}/{agent} ``` ### Run Specific Test ```bash cd evals/framework npm run eval:sdk -- --agent={category}/{agent} --pattern="smoke-test.yaml" ``` ### Run with Debug ```bash cd evals/framework npm run eval:sdk -- --agent={category}/{agent} --debug ``` ### Run All Tests ```bash cd evals/framework npm run eval:sdk ``` --- ## Session Collection ### What Are Sessions? Sessions are recordings of agent interactions stored in `.tmp/sessions/`. ### Session Structure ``` .tmp/sessions/{session-id}/ ├── session.json # Complete session data ├── events.json # Event timeline └── context.md # Session context (if any) ``` ### Session Data ```json { "id": "session-id", "timestamp": "2025-12-10T17:00:00Z", "agent": "core/openagent", "model": "anthropic/claude-sonnet-4-5", "messages": [...], "toolCalls": [...], "events": [...] } ``` ### Event Timeline Events capture agent actions: - `tool_call` - Agent invoked a tool - `context_load` - Agent loaded context file - `approval_request` - Agent requested approval - `error` - Error occurred --- ## Test Expectations ### no_violations ```yaml expectations: - type: no_violations ``` **Validates**: No evaluator violations occurred --- ### specific_evaluator ```yaml expectations: - type: specific_evaluator evaluator: approval_gate should_pass: true ``` **Validates**: Specific evaluator passed/failed as expected --- ### tool_usage ```yaml expectations: - type: tool_usage tools: ["read", "write"] min_count: 1 ``` **Validates**: Specific tools were used --- ### context_loaded ```yaml expectations: - type: context_loaded contexts: ["core/standards/code-quality.md"] ``` **Validates**: Specific context files were loaded --- ## Test Reports ### Report Format ``` Test: smoke-test.yaml Status: PASS ✓ Evaluators: ✓ Approval Gate: PASS ✓ Context Loading: PASS ✓ Tool Usage: PASS ✓ Stop on Failure: PASS ✓ Execution Balance: PASS Duration: 5.2s ``` ### Failure Report ``` Test: approval-gate.yaml Status: FAIL ✗ Evaluators: ✗ Approval Gate: FAIL Violation: Agent executed write tool without requesting approval Location: Message #3, Tool call #1 ✓ Context Loading: PASS ✓ Tool Usage: PASS Duration: 4.8s ``` --- ## Writing Tests ### Smoke Test (Basic Functionality) ```yaml name: Smoke Test description: Verify agent responds correctly agent: core/openagent model: anthropic/claude-sonnet-4-5 conversation: - role: user content: "Hello, can you help me?" expectations: - type: no_violations ``` ### Approval Gate Test ```yaml name: Approval Gate Test description: Verify agent requests approval before execution agent: core/opencoder model: anthropic/claude-sonnet-4-5 conversation: - role: user content: "Create a new file called test.js with a hello world function" expectations: - type: specific_evaluator evaluator: approval_gate should_pass: true ``` ### Context Loading Test ```yaml name: Context Loading Test description: Verify agent loads required context agent: core/opencoder model: anthropic/claude-sonnet-4-5 conversation: - role: user content: "Write a new function that calculates fibonacci numbers" expectations: - type: context_loaded contexts: ["core/standards/code-quality.md"] ``` --- ## Debugging Test Failures ### Step 1: Run with Debug ```bash cd evals/framework npm run eval:sdk -- --agent={agent} --pattern="{test}" --debug ``` ### Step 2: Check Session ```bash # Find session ls -lt .tmp/sessions/ | head -5 # View session cat .tmp/sessions/{session-id}/session.json | jq ``` ### Step 3: Analyze Events ```bash # View events cat .tmp/sessions/{session-id}/events.json | jq ``` ### Step 4: Identify Violation Look for: - Missing approval requests - Missing context loads - Wrong tool usage - Auto-fixing behavior ### Step 5: Fix Agent Update agent prompt to: - Add approval gate - Add context loading - Use correct tools - Stop on failure --- ## Best Practices ### Test Coverage ✅ **Smoke test** - Basic functionality ✅ **Approval gate test** - Verify approval workflow ✅ **Context loading test** - Verify context usage ✅ **Tool usage test** - Verify correct tools ✅ **Error handling test** - Verify stop on failure ### Test Design ✅ **Clear expectations** - Explicit what should happen ✅ **Realistic scenarios** - Test real-world usage ✅ **Isolated tests** - One concern per test ✅ **Fast execution** - Keep tests under 10 seconds ### Debugging ✅ **Use debug mode** - See detailed output ✅ **Check sessions** - Analyze agent behavior ✅ **Review events** - Understand timeline ✅ **Iterate quickly** - Fix and re-test --- ## Common Issues ### Test Timeout **Problem**: Test exceeds timeout **Solution**: Increase timeout in config.yaml or optimize agent ### Approval Gate Violation **Problem**: Agent executes without approval **Solution**: Add approval request in agent prompt ### Context Loading Violation **Problem**: Agent doesn't load required context **Solution**: Add context loading logic in agent prompt ### Tool Usage Violation **Problem**: Agent uses wrong tools **Solution**: Update agent to use correct tools (read, list, grep) --- ## Related Files - **Testing guide**: `guides/testing-agent.md` - **Debugging guide**: `guides/debugging.md` - **Agent concepts**: `core-concepts/agents.md` --- **Last Updated**: 2025-12-10 **Version**: 0.5.0