Guide: Testing an Agent
Prerequisites: Load core-concepts/evals.md first
Purpose: Step-by-step workflow for testing agents
Quick Start
# Run smoke test
cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --pattern="smoke-test.yaml"
# Run all tests for agent
npm run eval:sdk -- --agent={category}/{agent}
# Run with debug
npm run eval:sdk -- --agent={category}/{agent} --debug
Test Types
1. Smoke Test
Purpose: Basic functionality check
name: Smoke Test
description: Verify agent responds correctly
agent: {category}/{agent}
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Hello, can you help me?"
expectations:
  - type: no_violations
Run:
npm run eval:sdk -- --agent={category}/{agent} --pattern="smoke-test.yaml"
2. Approval Gate Test
Purpose: Verify agent requests approval
name: Approval Gate Test
description: Verify agent requests approval before execution
agent: {category}/{agent}
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Create a new file called test.js"
expectations:
  - type: specific_evaluator
    evaluator: approval_gate
    should_pass: true
3. Context Loading Test
Purpose: Verify agent loads required context
name: Context Loading Test
description: Verify agent loads required context
agent: {category}/{agent}
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Write a new function"
expectations:
  - type: context_loaded
    contexts: ["core/standards/code-quality.md"]
4. Tool Usage Test
Purpose: Verify agent uses correct tools
name: Tool Usage Test
description: Verify agent uses appropriate tools
agent: {category}/{agent}
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "Read the package.json file"
expectations:
  - type: tool_usage
    tools: ["read"]
    min_count: 1
Running Tests
Single Test
cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --pattern="{test-name}.yaml"
All Tests for Agent
cd evals/framework
npm run eval:sdk -- --agent={category}/{agent}
All Tests (All Agents)
cd evals/framework
npm run eval:sdk
With Debug Output
cd evals/framework
npm run eval:sdk -- --agent={category}/{agent} --pattern="{test-name}.yaml" --debug
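To smoke-test several agents in one pass, the single-agent command can be wrapped in a loop. The sketch below is illustrative only: `run_smoke_tests` is a hypothetical helper, not part of the eval framework, and it assumes that `npm run eval:sdk` exits with a non-zero status when a test fails, which this guide does not state. Run it from `evals/framework`.

```shell
# Hypothetical helper: run smoke-test.yaml for every agent passed as an argument.
# Assumption: `npm run eval:sdk` exits non-zero when a test fails.
run_smoke_tests() {
  failures=0
  for agent in "$@"; do
    echo "== smoke test: ${agent}"
    if ! npm run eval:sdk -- --agent="${agent}" --pattern="smoke-test.yaml"; then
      failures=$((failures + 1))
    fi
  done
  echo "agents with failing smoke tests: ${failures}"
  # Succeed only if every agent passed.
  [ "$failures" -eq 0 ]
}
```

Call it with one `{category}/{agent}` value per argument, for example `run_smoke_tests {category}/{agent-a} {category}/{agent-b}`.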
Interpreting Results
Pass Example
✓ Test: smoke-test.yaml
  Status: PASS
  Duration: 5.2s
  Evaluators:
    ✓ Approval Gate: PASS
    ✓ Context Loading: PASS
    ✓ Tool Usage: PASS
    ✓ Stop on Failure: PASS
    ✓ Execution Balance: PASS
Fail Example
✗ Test: approval-gate.yaml
  Status: FAIL
  Duration: 4.8s
  Evaluators:
    ✗ Approval Gate: FAIL
      Violation: Agent executed write tool without requesting approval
      Location: Message #3, Tool call #1
    ✓ Context Loading: PASS
    ✓ Tool Usage: PASS
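When a run covers many tests, it can help to pull out just the evaluators that failed. The following is a minimal sketch that assumes the report lines look exactly like the examples above (`✗ <Evaluator>: FAIL`, with optional leading whitespace). `failed_evaluators` is a hypothetical helper, and the real output format may differ.

```shell
# Hypothetical helper: read eval output on stdin and print the name of each
# failed evaluator, one per line.
# Assumption: failures are reported as "✗ <Evaluator>: FAIL", as in the examples.
failed_evaluators() {
  sed -n -E 's/^[[:space:]]*✗ (.*): FAIL$/\1/p'
}
```

For example, `npm run eval:sdk -- --agent={category}/{agent} | failed_evaluators` would print only the failing evaluator names, if the output matches the format shown here.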
Debugging Failures
Step 1: Run with Debug
npm run eval:sdk -- --agent={category}/{agent} --pattern="{test-name}.yaml" --debug
Step 2: Check Session
# Find recent session
ls -lt .tmp/sessions/ | head -5
# View session
jq . .tmp/sessions/{session-id}/session.json
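To avoid copying the session ID by hand, the lookup can be scripted. This sketch assumes that the most recently modified entry under `.tmp/sessions/` belongs to the run you just executed. `latest_session` is a hypothetical helper name, not part of the eval framework.

```shell
# Hypothetical helper: print the name of the most recently modified entry in
# the sessions directory (defaults to .tmp/sessions, as used in this guide).
# Assumption: the newest entry corresponds to the latest test run.
latest_session() {
  sessions_dir=${1:-.tmp/sessions}
  ls -t "$sessions_dir" | head -n 1
}
```

The result can be substituted into the commands in this section, for example `jq . ".tmp/sessions/$(latest_session)/session.json"`.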
Step 3: Analyze Events
# View event timeline
jq . .tmp/sessions/{session-id}/events.json
Step 4: Identify Issue
Common issues:
- Approval Gate Violation: Agent executed a tool without first requesting approval
- Context Loading Violation: Agent did not load the required context
- Tool Usage Violation: Agent used the wrong tool (for example, bash instead of read)
- Stop on Failure Violation: Agent auto-fixed the problem instead of stopping
Step 5: Fix Agent
Update the agent prompt to address the issue, then re-run the failing test.
Writing New Tests
Test Template
name: Test Name
description: What this test validates
agent: {category}/{agent}
model: anthropic/claude-sonnet-4-5
conversation:
  - role: user
    content: "User message"
  - role: assistant
    content: "Expected response (optional)"
expectations:
  - type: no_violations
Best Practices
✅ Clear name - Descriptive test name
✅ Good description - Explain what's being tested
✅ Realistic scenario - Test real-world usage
✅ Specific expectations - Clear pass/fail criteria
✅ Fast execution - Keep under 10 seconds
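A quick pre-run check can catch test files that are missing a field from the template. This is a minimal sketch that assumes the six top-level keys shown in the template (`name`, `description`, `agent`, `model`, `conversation`, `expectations`) are all required. The guide does not say which keys the framework actually enforces, and `check_test_file` is a hypothetical helper.

```shell
# Hypothetical helper: report template keys missing from a test YAML file.
# Assumption: all six keys from the test template are required.
check_test_file() {
  file=$1
  missing=0
  for key in name description agent model conversation expectations; do
    # Top-level keys start at column 0, so a simple anchored grep is enough.
    if ! grep -q "^${key}:" "$file"; then
      echo "missing top-level key: ${key}"
      missing=1
    fi
  done
  return "$missing"
}
```

The check is purely textual: it does not parse YAML or validate values, so it complements rather than replaces running the test itself.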
Common Test Patterns
Test Approval Workflow
conversation:
  - role: user
    content: "Create a new file"
expectations:
  - type: specific_evaluator
    evaluator: approval_gate
    should_pass: true
Test Context Loading
conversation:
  - role: user
    content: "Write new code"
expectations:
  - type: context_loaded
    contexts: ["core/standards/code-quality.md"]
Test Tool Selection
conversation:
  - role: user
    content: "Read the README file"
expectations:
  - type: tool_usage
    tools: ["read"]
    min_count: 1
Continuous Testing
Pre-Commit Hook
# Setup pre-commit hook
./scripts/validation/setup-pre-commit-hook.sh
CI/CD Integration
Tests run automatically on:
- Pull requests
- Merges to main
- Release tags
Related Files
- Eval concepts: core-concepts/evals.md
- Debugging guide: guides/debugging.md
- Adding agents: guides/adding-agent.md
Last Updated: 2025-12-10
Version: 0.5.0