CLI Testing

Interactive testing of CLI agents using the Agent Kernel Test framework.

Test Class Overview

The Test class provides programmatic interaction with CLI agents:

from agentkernel.test import Test

# Initialize test with CLI script path
test = Test("demo.py", match_threshold=50)

Parameters

  • path: Path to the Python CLI script (relative to the current working directory)
  • match_threshold: Fuzzy matching threshold percentage (default: 50)

Basic Usage

Starting a Test Session

import asyncio
from agentkernel.test import Test

async def run_test():
    test = Test("demo.py")
    await test.start()

    # Your test interactions here

    await test.stop()

# Run the test
asyncio.run(run_test())

Sending Messages and Expecting Responses

# Send a message to the CLI
response = await test.send("Who won the 1996 cricket world cup?")

# Verify the response using fuzzy matching
await test.expect("Sri Lanka won the 1996 cricket world cup.")

Advanced Features

Custom Matching Thresholds

# Set threshold during initialization
test = Test("demo.py", match_threshold=80)

# Or use static comparison with custom threshold
Test.compare(actual_response, expected_response, threshold=70)

Accessing Latest Response

await test.send("Hello!")
latest_response = test.latest # Contains the cleaned response without ANSI codes
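
Because expect() uses fuzzy matching, latest is also handy for exact or structural checks of your own. A minimal sketch using a plain assert (the prompt text is illustrative):

await test.send("List three primary colors")
assert "red" in test.latest.lower(), f"Unexpected response: {test.latest}"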

Prompt Detection

The Test class automatically detects CLI prompts using regex patterns (illustrated below):

  • Captures prompts in format: (agent_name) >>
  • Handles prompt changes during agent switching
  • Strips ANSI escape sequences from responses
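
The exact patterns are internal to the framework; the sketch below shows equivalent regexes for the prompt format and ANSI stripping described above (the patterns themselves are assumptions):

import re

PROMPT_RE = re.compile(r"\((?P<agent>[\w-]+)\)\s*>>")   # matches "(agent_name) >>"
ANSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")          # matches ANSI escape sequences

raw = "\x1b[32m(general) >> \x1b[0m"
clean = ANSI_RE.sub("", raw)                   # "(general) >> "
print(PROMPT_RE.search(clean).group("agent"))  # "general"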

Multi-Agent CLI Testing

For CLI applications with multiple agents:

# Switch to a specific agent
await test.send("!select general")
await test.send("Who won the 1996 cricket world cup?")
await test.expect("Sri Lanka won the 1996 Cricket World Cup.")

# Switch to another agent
await test.send("!select math")
await test.send("What is 2 + 2?")
await test.expect("4")

Error Handling

Assertion Errors

try:
    await test.expect("Expected response")
except AssertionError as e:
    # The error includes both the expected and actual responses
    print(f"Test failed: {e}")

Process Management

# Ensure proper cleanup even if tests fail
test = Test("demo.py")
try:
    await test.start()
    # Your test code here
finally:
    await test.stop()  # Always stop the process

Best Practices

Development Testing

  • Use interactive mode during development for quick validation
  • Test edge cases and error conditions
  • Verify agent switching functionality

Response Validation

  • Use appropriate fuzzy matching thresholds (see the sketch after this list)
  • Test with variations in expected responses
  • Account for slight differences in AI model outputs
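
For example, the static compare() helper shown earlier can be used to experiment with thresholds against real outputs (the threshold values here are illustrative):

from agentkernel.test import Test

actual = "Sri Lanka won the 1996 Cricket World Cup."
expected = "Sri Lanka won the 1996 cricket world cup."

# A looser threshold tolerates casing and phrasing drift between model runs;
# a stricter one catches real regressions
Test.compare(actual, expected, threshold=60)
Test.compare(actual, expected, threshold=90)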

Session Management

  • Always call start() before sending messages
  • Always call stop() to clean up processes
  • Use try-finally blocks (or a fixture, as sketched below) for proper cleanup
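
If you run these tests under pytest with the pytest-asyncio plugin (an assumption; any async test runner works), a fixture can centralize the start/stop handling:

import pytest_asyncio
from agentkernel.test import Test

# Assumes pytest-asyncio with asyncio_mode = "auto"; otherwise mark the
# test with @pytest.mark.asyncio. The fixture guarantees stop() runs
# even when a test body raises.
@pytest_asyncio.fixture
async def cli():
    test = Test("demo.py")
    await test.start()
    try:
        yield test
    finally:
        await test.stop()

async def test_greeting(cli):
    await cli.send("Hello!")
    await cli.expect("Hi there!")  # expected reply is illustrative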

Example Test Session

import asyncio
from agentkernel.test import Test

async def test_cricket_knowledge():
    test = Test("demo.py", match_threshold=60)

    try:
        await test.start()

        # Test basic question
        await test.send("Who won the 1996 cricket world cup?")
        await test.expect("Sri Lanka won the 1996 cricket world cup.")

        # Test follow-up question
        await test.send("Which country hosted the tournament?")
        await test.expect("Co-hosted by India, Pakistan and Sri Lanka.")

        print("All tests passed!")

    finally:
        await test.stop()

if __name__ == "__main__":
    asyncio.run(test_cricket_knowledge())

Session Persistence

Each CLI session maintains conversation history:

> My name is Alice
[general] Nice to meet you, Alice!

> What's my name?
[general] Your name is Alice.
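
The same behavior can be asserted programmatically with the Test class (a minimal sketch inside a started session):

await test.send("My name is Alice")
await test.send("What's my name?")
await test.expect("Your name is Alice.")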

Debug Mode

Enable verbose logging:

export AK_LOG_LEVEL=DEBUG
python my_agent.py
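
When the CLI is launched through the Test class, you can set the variable from Python instead, before starting the session (this assumes the spawned CLI process inherits the parent environment, the usual subprocess behavior):

import os

os.environ["AK_LOG_LEVEL"] = "DEBUG"  # inherited by the spawned CLI process

test = Test("demo.py")
await test.start()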

Multi-turn Conversations

Test complex interactions:

> I need help with a project
[general] I'd be happy to help! What's your project about?

> It's about machine learning
[general] Great! What specific aspect of machine learning?

> Image classification
[general] Image classification is a common ML task...

Commands

  • quit or exit: Exit CLI
  • clear: Clear screen
  • agents: List available agents
  • @agent_name message: Direct message to specific agent
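
These commands can be scripted through the Test class as well (a short sketch; the agent name follows the earlier examples):

await test.send("agents")   # list available agents
print(test.latest)

await test.send("@math What is 2 + 2?")
await test.expect("4")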

Tips

  • Test edge cases interactively
  • Verify agent handoffs work correctly
  • Check conversation context is maintained
  • Test error scenarios
  • Validate tool integrations

Example Session

$ python my_agent.py

Agent Kernel CLI
Available agents:
- research
- write
- review

> @research Find information about Python
[research] Here's what I found about Python...
[research transferred to write]

[write] I'll help you create a summary...

> Great, can you review it?
[write transferred to review]

[review] Here's my review of the content...

> quit
Goodbye!