Fault Tolerance
Agent Kernel provides comprehensive fault tolerance capabilities to ensure your AI agents remain available and resilient in production environments. The platform implements multiple layers of fault tolerance across different deployment modes.
Overview
Fault tolerance in Agent Kernel encompasses:
- Infrastructure-level resilience - Automatic recovery from hardware and system failures
- Application-level recovery - Graceful handling of application errors and crashes
- State persistence - Preservation of conversation state across failures
- Health monitoring - Continuous health checks and automatic remediation
- Retry mechanisms - Intelligent retry logic for transient failures (available soon)
Deployment-Specific Fault Tolerance
AWS Containerized (ECS/Fargate)
ECS deployments provide the most comprehensive fault tolerance features with extensive configurability:
Multi-AZ Deployment
- Tasks automatically distributed across multiple Availability Zones
- Survives entire AZ failures without service interruption
- Application Load Balancer routes traffic only to healthy tasks
Task-Level Fault Tolerance
- Automatic task replacement - Failed tasks automatically restarted
- Desired count enforcement - ECS maintains configured task count
- Rolling deployments - Zero-downtime updates with gradual task replacement
- Health checks - Container-level and application-level health monitoring
Configurable Redundancy ahnd health Settings
# Example Terraform configuration
ecs_desired_count = 3 # Number of tasks to maintain
ecs_health_check_path = "/health"
Service Auto-Scaling (Available soon)
- Scale out during high load, scale in during low load
- Target tracking based on CPU, memory, or request count
- Scheduled scaling for predictable traffic patterns
- Maintains service capacity during failures
Network Resilience
- Connection draining ensures graceful shutdown
- Load balancer health checks every 5-30 seconds
- Unhealthy targets automatically removed from rotation
- Circuit breaker patterns prevent cascade failures
Best Practices:
- Run minimum of 2 tasks across different AZs
- Configure appropriate health check intervals
- Set reasonable grace periods for startup
- Use target tracking auto-scaling for dynamic loads
AWS Serverless (Lambda)
Lambda deployments are inherently fault-tolerant with AWS-managed infrastructure:
Built-In Resilience
- Multi-AZ by default - Lambda functions automatically run across availability zones
- Automatic retry - Failed invocations retried automatically (available soon)
- No server management - AWS handles all infrastructure failures
- Infinite scaling - Automatically scales to handle any load
State Persistence
- Use DynamoDB for serverless-native state management
- DynamoDB provides:
- Multi-AZ replication
- Point-in-time recovery
- Automatic backups
- 99.999% availability SLA
Serverless Advantages:
- Zero infrastructure maintenance
- No cold start concerns with provisioned concurrency
- Pay only for actual usage
- Automatic failover and recovery
Local/Development
Limited fault tolerance suitable for development:
- Manual restart on failures
- In-memory state (lost on restart)
- Single-instance execution
Session State Resilience
Conversation state is critical for agent continuity. Agent Kernel supports multiple backends:
Redis (High Availability)
export AK_SESSION__TYPE=redis
export AK_SESSION__REDIS__URL=redis://cluster-endpoint:6379
Features:
- Redis Cluster for automatic failover
- Replication across multiple nodes
- Sentinel for monitoring and automatic failover
- Sub-millisecond recovery time
Configuration:
- Use Redis Cluster or Sentinel mode
- Enable persistence (AOF + RDB)
- Configure appropriate replica count
DynamoDB (Serverless)
export AK_SESSION__TYPE=dynamodb
export AK_SESSION__DYNAMODB__TABLE_NAME=sessions
Features:
- Multi-AZ replication by default
- Point-in-time recovery (PITR)
- Continuous backups
- 99.999% availability SLA
- No infrastructure management
Configuration:
- Enable point-in-time recovery
- Configure appropriate RCU/WCU or use on-demand
- Set TTL for automatic cleanup
In-Memory (Development Only)
Not recommended for production - state lost on restart.
Health Checks
Agent Kernel provides built-in health endpoints for monitoring:
Health Endpoint
GET /health
Response:
{
"status": "healthy",
"timestamp": "2024-01-15T10:30:00Z",
"runtime": "operational",
"memory_backend": "connected"
}
Load Balancer Health Checks
Configure your load balancer to use the health endpoint:
- Interval: 5-30 seconds
- Timeout: 2-5 seconds
- Healthy threshold: 2 consecutive successes
- Unhealthy threshold: 2 consecutive failures
Retry Logic (Available soon)
Agent Kernel implements intelligent retry mechanisms:
Framework-Level Retries
Each runner implementation includes:
- Exponential backoff for LLM API calls
- Configurable retry attempts
- Circuit breaker patterns
- Graceful degradation
Monitoring and Alerting
CloudWatch Metrics (AWS)
Available soon!
Testing Fault Tolerance
Chaos Engineering
Test your agent's resilience by simulating failures:
# Simulate ECS task failures
aws ecs stop-task --cluster my-cluster --task task-id
# Simulate Lambda failures
# Use error injection in code or AWS Fault Injection Simulator
Load Testing
Validate behavior under stress:
- Concurrent request handling
- Auto-scaling triggers
- Recovery time objectives (RTO)
- Recovery point objectives (RPO)
Disaster Recovery
ECS Recovery
- RTO: < 5 minutes (automatic task restart)
- RPO: Depends on state backend (Redis: seconds, DynamoDB: continuous)
Serverless Recovery
- RTO: < 1 minute (automatic Lambda retry)
- RPO: Continuous with DynamoDB
Backup Strategy
Not provided with the terraform module. User has to implement the backup feature.
Compliance and SLAs
ECS Deployments
- Availability: 99.9%+ with proper configuration
- Multi-AZ: 99.99%+ availability
- RTO: < 5 minutes
- RPO: < 1 minute (with persistent state)
Serverless Deployments
- Availability: 99.95%+ (AWS Lambda SLA)
- Multi-AZ: Built-in across all AZs
- RTO: < 1 minute
- RPO: Continuous (with DynamoDB)
Summary
Agent Kernel's fault tolerance capabilities ensure your AI agents remain available and resilient:
- ECS/Fargate: Highly configurable, multi-AZ, auto-scaling, minimal cold starts
- Lambda: Fully managed, auto-scaling, built-in retry, serverless
- State Management: Multiple backends with different resilience profiles
- Health Monitoring: Built-in endpoints and CloudWatch integration
- Error Recovery: Retry logic, circuit breakers, graceful degradation
Choose the deployment mode that best matches your availability requirements, budget, and operational preferences.
Learn More
- Deployment Overview - Compare deployment modes
- AWS Containerized - ECS fault tolerance details
- AWS Serverless - Lambda resilience features
- Session Management - State persistence options
- Architecture - System design patterns