# Provider Fallback & Dynamic Switching Guide **Production-grade LLM provider fallback for long-running agents** ## Overview The `ProviderManager` and `LongRunningAgent` classes provide enterprise-grade provider fallback, health monitoring, cost optimization, and automatic recovery for long-running AI agents. ### Key Features - ✅ **Automatic Fallback** - Seamless switching between providers on failure - ✅ **Circuit Breaker** - Prevents cascading failures with automatic recovery - ✅ **Health Monitoring** - Real-time provider health tracking - ✅ **Cost Optimization** - Intelligent provider selection based on cost/performance - ✅ **Retry Logic** - Exponential/linear backoff for transient errors - ✅ **Checkpointing** - Save/restore agent state for crash recovery - ✅ **Budget Control** - Hard limits on spending and runtime - ✅ **Performance Tracking** - Latency, success rate, and token usage metrics ## Quick Start ### Basic Provider Fallback ```typescript import { ProviderManager, ProviderConfig } from 'agentic-flow/core/provider-manager'; // Configure providers const providers: ProviderConfig[] = [ { name: 'gemini', apiKey: process.env.GOOGLE_GEMINI_API_KEY, priority: 1, // Try first maxRetries: 3, timeout: 30000, costPerToken: 0.00015, enabled: true }, { name: 'anthropic', apiKey: process.env.ANTHROPIC_API_KEY, priority: 2, // Fallback maxRetries: 3, timeout: 60000, costPerToken: 0.003, enabled: true }, { name: 'onnx', priority: 3, // Last resort (free, local) maxRetries: 2, timeout: 120000, costPerToken: 0, enabled: true } ]; // Initialize manager const manager = new ProviderManager(providers, { type: 'priority', // or 'cost-optimized', 'performance-optimized', 'round-robin' maxFailures: 3, recoveryTime: 60000, retryBackoff: 'exponential' }); // Execute with automatic fallback const { result, provider, attempts } = await manager.executeWithFallback( async (providerName) => { // Your LLM API call here return await callLLM(providerName, prompt); } ); console.log(`Success with ${provider} after ${attempts} attempts`); ``` ### Long-Running Agent ```typescript import { LongRunningAgent } from 'agentic-flow/core/long-running-agent'; // Create agent const agent = new LongRunningAgent({ agentName: 'research-agent', providers, fallbackStrategy: { type: 'cost-optimized', maxFailures: 3, recoveryTime: 60000, retryBackoff: 'exponential', costThreshold: 0.50, // Max $0.50 per request latencyThreshold: 30000 // Max 30s per request }, checkpointInterval: 30000, // Save state every 30s maxRuntime: 3600000, // Max 1 hour costBudget: 5.00 // Max $5 total }); await agent.start(); // Execute tasks with automatic provider selection const result = await agent.executeTask({ name: 'analyze-code', complexity: 'complex', // 'simple' | 'medium' | 'complex' estimatedTokens: 5000, execute: async (provider) => { return await analyzeCode(provider, code); } }); // Get status const status = agent.getStatus(); console.log(`Completed: ${status.completedTasks}, Cost: $${status.totalCost}`); await agent.stop(); ``` ## Fallback Strategies ### 1. Priority-Based (Default) Tries providers in priority order (1 = highest). ```typescript { type: 'priority', maxFailures: 3, recoveryTime: 60000, retryBackoff: 'exponential' } ``` **Use Case:** Prefer specific provider (e.g., Claude for quality) ### 2. Cost-Optimized Selects cheapest provider for estimated token count. ```typescript { type: 'cost-optimized', maxFailures: 3, recoveryTime: 60000, retryBackoff: 'exponential', costThreshold: 0.50 // Max $0.50 per request } ``` **Use Case:** High-volume applications, budget constraints ### 3. Performance-Optimized Selects provider with best latency and success rate. ```typescript { type: 'performance-optimized', maxFailures: 3, recoveryTime: 60000, retryBackoff: 'exponential', latencyThreshold: 30000 // Max 30s } ``` **Use Case:** Real-time applications, user-facing services ### 4. Round-Robin Distributes load evenly across providers. ```typescript { type: 'round-robin', maxFailures: 3, recoveryTime: 60000, retryBackoff: 'exponential' } ``` **Use Case:** Load balancing, testing multiple providers ## Task Complexity Heuristics The system applies intelligent heuristics based on task complexity: ### Simple Tasks → Prefer Gemini/ONNX ```typescript await agent.executeTask({ name: 'format-code', complexity: 'simple', // Fast, cheap providers preferred estimatedTokens: 200, execute: async (provider) => formatCode(code) }); ``` **Rationale:** Simple tasks don't need Claude's reasoning power ### Medium Tasks → Auto-Optimized ```typescript await agent.executeTask({ name: 'refactor-function', complexity: 'medium', // Balance cost/quality estimatedTokens: 1500, execute: async (provider) => refactorFunction(code) }); ``` **Rationale:** Uses fallback strategy (cost/performance) ### Complex Tasks → Prefer Claude ```typescript await agent.executeTask({ name: 'design-architecture', complexity: 'complex', // Quality matters most estimatedTokens: 5000, execute: async (provider) => designArchitecture(requirements) }); ``` **Rationale:** Complex reasoning benefits from Claude's capabilities ## Circuit Breaker Prevents cascading failures by temporarily disabling failing providers. ### How It Works 1. **Failure Tracking:** Count consecutive failures per provider 2. **Threshold:** Open circuit after N failures (configurable) 3. **Recovery:** Automatically recover after timeout 4. **Fallback:** Use next available provider ### Configuration ```typescript { maxFailures: 3, // Open circuit after 3 consecutive failures recoveryTime: 60000, // Try recovery after 60 seconds retryBackoff: 'exponential' // 1s, 2s, 4s, 8s, 16s... } ``` ### Monitoring ```typescript const health = manager.getHealth(); health.forEach(h => { console.log(`${h.provider}:`); console.log(` Circuit Breaker: ${h.circuitBreakerOpen ? 'OPEN' : 'CLOSED'}`); console.log(` Consecutive Failures: ${h.consecutiveFailures}`); console.log(` Success Rate: ${(h.successRate * 100).toFixed(1)}%`); }); ``` ## Cost Tracking & Optimization ### Real-Time Cost Monitoring ```typescript const costs = manager.getCostSummary(); console.log(`Total: $${costs.total.toFixed(4)}`); console.log(`Tokens: ${costs.totalTokens.toLocaleString()}`); for (const [provider, cost] of Object.entries(costs.byProvider)) { console.log(` ${provider}: $${cost.toFixed(4)}`); } ``` ### Budget Constraints ```typescript const agent = new LongRunningAgent({ agentName: 'budget-agent', providers, costBudget: 10.00, // Hard limit: $10 // ... other config }); // Agent automatically stops when budget exceeded ``` ### Cost-Per-Provider Configuration ```typescript const providers: ProviderConfig[] = [ { name: 'gemini', costPerToken: 0.00015, // $0.15 per 1M tokens // ... }, { name: 'anthropic', costPerToken: 0.003, // $3 per 1M tokens (Sonnet) // ... }, { name: 'onnx', costPerToken: 0, // FREE (local) // ... } ]; ``` ## Health Monitoring ### Automatic Health Checks ```typescript const providers: ProviderConfig[] = [ { name: 'gemini', healthCheckInterval: 60000, // Check every minute // ... } ]; ``` ### Manual Health Check ```typescript const health = manager.getHealth(); health.forEach(h => { console.log(`${h.provider}:`); console.log(` Healthy: ${h.isHealthy}`); console.log(` Success Rate: ${(h.successRate * 100).toFixed(1)}%`); console.log(` Avg Latency: ${h.averageLatency.toFixed(0)}ms`); console.log(` Error Rate: ${(h.errorRate * 100).toFixed(1)}%`); }); ``` ### Metrics Collection ```typescript const metrics = manager.getMetrics(); metrics.forEach(m => { console.log(`${m.provider}:`); console.log(` Total Requests: ${m.totalRequests}`); console.log(` Successful: ${m.successfulRequests}`); console.log(` Failed: ${m.failedRequests}`); console.log(` Avg Latency: ${m.averageLatency.toFixed(0)}ms`); console.log(` Total Cost: $${m.totalCost.toFixed(4)}`); }); ``` ## Checkpointing & Recovery ### Automatic Checkpoints ```typescript const agent = new LongRunningAgent({ agentName: 'checkpoint-agent', providers, checkpointInterval: 30000, // Save every 30 seconds // ... }); await agent.start(); // Agent automatically saves checkpoints every 30s // On crash, restore from last checkpoint ``` ### Manual Checkpoint Management ```typescript // Get all checkpoints const metrics = agent.getMetrics(); const checkpoints = metrics.checkpoints; // Restore from specific checkpoint const lastCheckpoint = checkpoints[checkpoints.length - 1]; agent.restoreFromCheckpoint(lastCheckpoint); ``` ### Checkpoint Data ```typescript interface AgentCheckpoint { timestamp: Date; taskProgress: number; // 0-1 currentProvider: string; totalCost: number; totalTokens: number; completedTasks: number; failedTasks: number; state: Record; // Custom state } ``` ## Retry Logic ### Exponential Backoff (Recommended) ```typescript { retryBackoff: 'exponential' } ``` **Delays:** 1s, 2s, 4s, 8s, 16s, 30s (max) **Use Case:** Rate limits, transient errors ### Linear Backoff ```typescript { retryBackoff: 'linear' } ``` **Delays:** 1s, 2s, 3s, 4s, 5s, 10s (max) **Use Case:** Predictable retry patterns ### Retryable Errors Automatically retried: - `rate limit` - `timeout` - `connection` - `network` - HTTP 503, 502, 429 Non-retryable errors fail immediately: - Authentication errors - Invalid requests - HTTP 4xx (except 429) ## Production Best Practices ### 1. Multi-Provider Strategy ```typescript const providers: ProviderConfig[] = [ // Primary: Fast & cheap for simple tasks { name: 'gemini', priority: 1, costPerToken: 0.00015 }, // Fallback: High quality for complex tasks { name: 'anthropic', priority: 2, costPerToken: 0.003 }, // Emergency: Free local inference { name: 'onnx', priority: 3, costPerToken: 0 } ]; ``` ### 2. Cost Optimization ```typescript // Use cost-optimized strategy for high-volume const agent = new LongRunningAgent({ agentName: 'production-agent', providers, fallbackStrategy: { type: 'cost-optimized', costThreshold: 0.50 }, costBudget: 100.00 // Daily budget }); ``` ### 3. Health Monitoring ```typescript // Monitor provider health every minute const providers: ProviderConfig[] = [ { name: 'gemini', healthCheckInterval: 60000, enabled: true } ]; // Check health before critical operations const health = manager.getHealth(); const unhealthy = health.filter(h => !h.isHealthy); if (unhealthy.length > 0) { console.warn('Unhealthy providers:', unhealthy.map(h => h.provider)); } ``` ### 4. Graceful Degradation ```typescript // Prefer quality, fallback to cost const providers: ProviderConfig[] = [ { name: 'anthropic', priority: 1 }, // Best quality { name: 'gemini', priority: 2 }, // Cheaper fallback { name: 'onnx', priority: 3 } // Always available ]; ``` ### 5. Circuit Breaker Tuning ```typescript { maxFailures: 5, // More tolerant in production recoveryTime: 300000, // 5 minutes before retry retryBackoff: 'exponential' } ``` ## Docker Validation ### Build Image ```bash docker build -f Dockerfile.provider-fallback -t agentic-flow-provider-fallback . ``` ### Run Tests ```bash # With Gemini API key docker run --rm \ -e GOOGLE_GEMINI_API_KEY=your_key_here \ agentic-flow-provider-fallback # ONNX only (no API key needed) docker run --rm agentic-flow-provider-fallback ``` ### Expected Output ``` ✅ Provider Fallback Validation Test ==================================== 📋 Testing Provider Manager... 1️⃣ Building TypeScript... ✅ Build complete 2️⃣ Running provider fallback example... Using Gemini API key: AIza... 🚀 Starting Long-Running Agent with Provider Fallback 📋 Task 1: Simple Code Generation (Gemini optimal) Using provider: gemini ✅ Result: { code: 'console.log("Hello World");', provider: 'gemini' } 📋 Task 2: Complex Architecture Design (Claude optimal) Using provider: anthropic ✅ Result: { architecture: 'Event-driven microservices', provider: 'anthropic' } 📈 Provider Health: gemini: Healthy: true Success Rate: 100.0% Circuit Breaker: CLOSED ✅ All provider fallback tests passed! ``` ## API Reference ### ProviderManager ```typescript class ProviderManager { constructor(providers: ProviderConfig[], strategy: FallbackStrategy); selectProvider( taskComplexity?: 'simple' | 'medium' | 'complex', estimatedTokens?: number ): Promise; executeWithFallback( requestFn: (provider: ProviderType) => Promise, taskComplexity?: 'simple' | 'medium' | 'complex', estimatedTokens?: number ): Promise<{ result: T; provider: ProviderType; attempts: number }>; getMetrics(): ProviderMetrics[]; getHealth(): ProviderHealth[]; getCostSummary(): { total: number; byProvider: Record; totalTokens: number }; destroy(): void; } ``` ### LongRunningAgent ```typescript class LongRunningAgent { constructor(config: LongRunningAgentConfig); start(): Promise; stop(): Promise; executeTask(task: { name: string; complexity: 'simple' | 'medium' | 'complex'; estimatedTokens?: number; execute: (provider: string) => Promise; }): Promise; getStatus(): AgentStatus; getMetrics(): AgentMetrics; restoreFromCheckpoint(checkpoint: AgentCheckpoint): void; } ``` ## Examples See `src/examples/use-provider-fallback.ts` for complete working examples. ## Support - **GitHub Issues:** https://github.com/ruvnet/agentic-flow/issues - **Documentation:** https://github.com/ruvnet/agentic-flow#readme - **Discord:** Coming soon ## License MIT - See LICENSE file for details