CLI Testing
Interactive testing of CLI agents using the Agent Kernel Test framework.
Test Class Overview
The Test class provides programmatic interaction with CLI agents:
from agentkernel.test import Test, Mode
# Initialize test with CLI script path
test = Test("demo.py", match_threshold=50, mode=Mode.FALLBACK)
Parameters
- path: Path to the Python CLI script (relative to the current working directory)
- match_threshold: Fuzzy matching threshold percentage (default: 50)
- mode: Test comparison mode: Mode.FUZZY, Mode.JUDGE, or Mode.FALLBACK. If None, uses the config value (default: None)
Basic Usage
Starting a Test Session
import asyncio
from agentkernel.test import Test
async def run_test():
    test = Test("demo.py")
    await test.start()
    # Your test interactions here
    await test.stop()
# Run the test
asyncio.run(run_test())
Sending Messages and Expecting Responses
# Send a message to the CLI
response = await test.send("Who won the 1996 cricket world cup?")
# Verify the response using fuzzy matching
await test.expect(["Sri Lanka won the 1996 cricket world cup."])
Test Comparison Modes
Agent Kernel supports three comparison modes for validating responses:
Fuzzy Mode
Uses fuzzy string matching with configurable thresholds:
from agentkernel.test import Test, Mode
# Initialize with fuzzy mode
test = Test("demo.py", match_threshold=80, mode=Mode.FUZZY)
# Or use static comparison with multiple expected answers
await test.send("Who won the 1996 cricket world cup?")
Test.compare(
    actual=test.last_agent_response,
    expected=[
        "Sri Lanka won the 1996 cricket world cup",
        "Sri Lanka won the 1996 world cup",
        "The 1996 cricket world cup was won by Sri Lanka"
    ],
    threshold=80,
    mode=Mode.FUZZY
)
Note: The expected parameter is a list. The test passes if the actual response fuzzy-matches any of the expected values above the threshold.
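To make the pass/fail rule concrete, here is an illustrative stand-in for the match-any-of-list logic using Python's stdlib difflib. This is a sketch of the semantics only; Agent Kernel's actual fuzzy matcher and scoring algorithm may differ.

```python
from difflib import SequenceMatcher


def fuzzy_match_any(actual: str, expected: list[str], threshold: int) -> bool:
    """Illustrative stand-in: pass if `actual` matches any expected
    string with a similarity score at or above `threshold` percent."""
    for candidate in expected:
        score = SequenceMatcher(None, actual.lower(), candidate.lower()).ratio() * 100
        if score >= threshold:
            return True
    return False


# An exact match scores 100 and clears any threshold.
print(fuzzy_match_any(
    "Sri Lanka won the 1996 cricket world cup",
    ["Sri Lanka won the 1996 cricket world cup"],
    threshold=80,
))  # True
```

Raising the threshold makes the comparison stricter; at 100 only exact (case-insensitive, in this sketch) matches pass.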
Judge Mode
Uses LLM-based evaluation (Ragas) for semantic similarity:
# Initialize with judge mode
test = Test("demo.py", mode=Mode.JUDGE)
# Use judge evaluation with multiple expected answers
await test.send("Who won the 1996 cricket world cup?")
Test.compare(
    actual=test.last_agent_response,
    expected=[
        "Sri Lanka won the 1996 cricket world cup",
        "Sri Lanka was the winner of the 1996 world cup",
        "The 1996 cricket world cup was won by Sri Lanka"
    ],
    user_input="Who won the 1996 cricket world cup?",
    threshold=50,  # Converted to 0.5 on 0.0-1.0 scale
    mode=Mode.JUDGE
)
Judge Mode Behavior:
- With expected answers: Uses the answer_similarity metric to compare against each expected answer (ground truth). The test passes if any similarity score exceeds the threshold.
- Without expected answers: Uses the answer_relevancy metric to check whether the answer is relevant to the question.
Note: When multiple expected answers are provided, the test evaluates semantic similarity against each one and passes if any meets the threshold.
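The threshold scaling and any-match rule described above can be sketched in plain Python. This is only an illustration of the decision logic; the real implementation obtains the per-answer similarity scores from Ragas.

```python
def judge_passes(similarity_scores: list[float], threshold: int) -> bool:
    """Pass if any per-answer similarity score (on a 0.0-1.0 scale)
    meets the threshold, which is given as a percentage and scaled down."""
    scaled = threshold / 100.0  # e.g. 50 -> 0.5
    return any(score >= scaled for score in similarity_scores)


# Scores against three expected answers; the second clears 0.5.
print(judge_passes([0.41, 0.73, 0.48], threshold=50))  # True
```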
Fallback Mode (Default)
Tries fuzzy matching first, falls back to judge evaluation if fuzzy fails:
# Default fallback mode with multiple expected answers
test = Test("demo.py", mode=Mode.FALLBACK)
await test.send("Who won the 1996 cricket world cup?")
Test.compare(
    actual=test.last_agent_response,
    expected=[
        "Sri Lanka",
        "Sri Lanka won the 1996 cricket world cup",
        "The winner was Sri Lanka"
    ],
    user_input="Who won the 1996 cricket world cup?",
    threshold=50
)
Note: The expected parameter is a list of acceptable responses. Fuzzy matching is tried against each expected value first. If all fail, judge evaluation is attempted against each expected answer.
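The fallback control flow can be sketched with stub matchers. The function names and stubs below are hypothetical illustrations of the ordering described above, not Agent Kernel internals.

```python
def fallback_compare(actual, expected, threshold, fuzzy, judge):
    """Illustrative fallback flow: try fuzzy matching against each
    expected answer first; only if all fail, fall back to the judge."""
    if any(fuzzy(actual, e, threshold) for e in expected):
        return True
    return any(judge(actual, e, threshold) for e in expected)


# Stub matchers: fuzzy always fails here, so the judge decides.
fuzzy_stub = lambda a, e, t: False
judge_stub = lambda a, e, t: a.startswith("Sri Lanka")
print(fallback_compare("Sri Lanka", ["Sri Lanka"], 50, fuzzy_stub, judge_stub))  # True
```

Because the fuzzy pass is cheap and local, the (slower, LLM-backed) judge is only invoked when every fuzzy comparison fails.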
Configuration-Based Mode
Set default mode via configuration instead of constructor:
# config.yaml
test:
  mode: judge  # Options: fuzzy, judge, fallback
  judge:
    model: gpt-4o-mini
    provider: openai
    embedding_model: text-embedding-3-small
# Uses mode from config
test = Test("demo.py")
await test.send("Hello")
await test.expect(["Hello! How can I help?"]) # Uses configured mode
Advanced Features
Custom Matching Configuration
# Set threshold and mode during initialization
test = Test("demo.py", match_threshold=80, mode=Mode.FUZZY)
# Or use static comparison with custom parameters
Test.compare(
    actual=response,
    expected=["Expected response"],
    user_input="User question",
    threshold=70,
    mode=Mode.JUDGE
)
Accessing Latest Response
await test.send("Hello!")
latest_response = test.latest # Contains the cleaned response without ANSI codes
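The cleaning step mentioned above, removing ANSI escape codes from terminal output, can be sketched with a regular expression. This is an illustration of the idea, not Agent Kernel's actual implementation.

```python
import re

# Matches CSI sequences such as colour codes (e.g. "\x1b[32m").
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def strip_ansi(text: str) -> str:
    """Remove ANSI escape sequences, leaving plain text."""
    return ANSI_ESCAPE.sub("", text)


print(strip_ansi("\x1b[32mHello!\x1b[0m"))  # Hello!
```

Stripping escape codes means assertions compare against the text a user would read, rather than raw terminal bytes.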