
Testing Overview

Agent Kernel provides a comprehensive testing framework for CLI-based agents, with both interactive and automated test capabilities.

Testing Approaches

CLI Testing

Interactive testing of CLI agents using the Test class:

from agentkernel.test import Test

# Create a test instance
test = Test("demo.py")
await test.start()

# Send messages and verify responses
await test.send("Who won the 1996 cricket world cup?")
await test.expect("Sri Lanka won the 1996 cricket world cup.")

await test.stop()
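Because these calls are awaited, run them inside an async context, for example via asyncio.run() or a REPL that supports top-level await.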

Best for:

  • Development and debugging
  • Interactive exploration
  • Quick validation of agent responses


Automated Testing

pytest-based testing with async support:

import pytest
import pytest_asyncio
from agentkernel.test import Test

@pytest_asyncio.fixture(scope="session")
async def test_client():
    test = Test("demo.py")
    await test.start()
    try:
        yield test
    finally:
        await test.stop()

@pytest.mark.asyncio
async def test_basic_question(test_client):
    await test_client.send("Hello!")
    await test_client.expect("Hello! How can I help you?")
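To run the suite, install pytest and pytest-asyncio and invoke pytest as usual (the file name test_demo.py here is just an example):

pip install pytest pytest-asyncio
pytest test_demo.py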

Best for:

  • Regression testing
  • CI/CD pipelines
  • Validation before deployment


Testing Framework Features

Test Comparison Modes

Agent Kernel supports three comparison modes for validating agent responses:

Fuzzy Mode

Uses fuzzy string matching (via RapidFuzz) with configurable thresholds:

from agentkernel.test import Test, Mode

# Use fuzzy mode only
test = Test("demo.py", match_threshold=80)
await test.start()
await test.send("Who won the 1996 cricket world cup?")

# expected is a list - the test passes if ANY match exceeds the threshold
Test.compare(
    actual=test.last_agent_response,
    expected=["Sri Lanka won", "Sri Lanka won the 1996 cricket world cup"],
    threshold=80,
    mode=Mode.FUZZY,
)

Note: The expected parameter is a list. The test passes if the actual response matches any of the expected values above the threshold.

Judge Mode

Uses LLM-based evaluation (via Ragas) for semantic similarity:

# Use judge mode only - expected is a list
# Use judge mode only - expected is a list
Test.compare(
    actual=test.last_agent_response,
    expected=[
        "Sri Lanka won the 1996 cricket world cup",
        "Sri Lanka was the winner of the 1996 world cup",
        "The 1996 cricket world cup was won by Sri Lanka",
    ],
    user_input="Who won the 1996 cricket world cup?",
    threshold=50,  # Percentage; converted internally to a 0.0-1.0 scale
    mode=Mode.JUDGE,
)

Note: The expected parameter is a list. The test passes if the actual response has semantic similarity above the threshold with any of the expected answers.

When expected answers are provided, judge mode uses the answer_similarity metric. When no expected answers are provided, it uses the answer_relevancy metric to check whether the answer is relevant to the question.
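As a sketch of the relevance-only path, this example assumes that passing an empty expected list is how you opt into the answer_relevancy metric:

# Sketch: relevance-only judge evaluation (assumes an empty expected
# list selects the answer_relevancy metric, per the note above)
Test.compare(
    actual=test.last_agent_response,
    expected=[],
    user_input="Who won the 1996 cricket world cup?",
    threshold=50,
    mode=Mode.JUDGE,
)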

Fallback Mode (Default)

Tries fuzzy matching first, falls back to judge evaluation if fuzzy fails:

# Default fallback mode - multiple expected answers
# Default fallback mode - multiple expected answers
Test.compare(
    actual=test.last_agent_response,
    expected=[
        "Sri Lanka",
        "Sri Lanka won the 1996 cricket world cup",
        "The winner was Sri Lanka",
    ],
    user_input="Who won the 1996 cricket world cup?",
    threshold=50,
    mode=Mode.FALLBACK,  # or None to use the config default
)

Note: The expected parameter is a list of acceptable responses. The test passes if any expected value matches (fuzzy or judge evaluation).

Configuring Test Mode

Set the default test mode via configuration:

# config.yaml
test:
  mode: fallback  # Options: fuzzy, judge, fallback
  judge:
    model: gpt-4o-mini
    provider: openai
    embedding_model: text-embedding-3-small

Or via environment variables:

export AK_TEST__MODE=judge
export AK_TEST__JUDGE__MODEL=gpt-4o-mini
export AK_TEST__JUDGE__PROVIDER=openai
export AK_TEST__JUDGE__EMBEDDING_MODEL=text-embedding-3-small
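The double underscore in these variable names separates nesting levels, mirroring the structure of the YAML configuration above.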

Session Management

Tests maintain persistent CLI sessions with proper prompt handling and ANSI escape sequence cleanup.
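The framework handles this cleanup internally. Purely as an illustration of what stripping ANSI escape sequences involves (not Agent Kernel's actual implementation), a raw terminal response can be cleaned with a regex:

import re

# Matches CSI escape sequences such as color codes and cursor movement
# (illustrative only - not Agent Kernel's internal implementation)
ANSI_ESCAPE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")

def strip_ansi(text: str) -> str:
    return ANSI_ESCAPE.sub("", text)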

Multi-Agent Support

Test different agent types within the same CLI application:

await test.send("!select general")  # Switch to general agent
await test.send("Who won the 1996 cricket world cup?")

Best Practices

  • Use pytest fixtures for test setup and teardown
  • Implement ordered tests for conversation flows (see the sketch after this list)
  • Configure appropriate fuzzy matching thresholds
  • Test agent selection commands when using multi-agent setups
  • Include both positive and negative test cases
  • Test session persistence and state management
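
One way to order conversation-flow tests is the pytest-order plugin. This sketch assumes that plugin is installed and reuses the session-scoped fixture from above; the questions and expected answers are illustrative:

import pytest

@pytest.mark.asyncio
@pytest.mark.order(1)
async def test_ask_question(test_client):
    await test_client.send("Who won the 1996 cricket world cup?")
    await test_client.expect("Sri Lanka won the 1996 cricket world cup.")

@pytest.mark.asyncio
@pytest.mark.order(2)
async def test_follow_up(test_client):
    # Relies on session state carried over from the previous test
    await test_client.send("Who was the captain of that team?")
    await test_client.expect("Arjuna Ranatunga")

Ordering matters here because the follow-up question only makes sense if the session still holds the context established by the first test.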