ONNX Runtime Integration Plan for Agentic-Flow
Executive Summary
Integrate ONNX Runtime to enable high-performance local model inference on both CPU and GPU, providing 2-100x speedup over standard inference and enabling privacy-focused, cost-free local execution of AI models including Microsoft Phi-3.
🎯 Objectives
- Performance: Achieve 2-100x inference speedup using ONNX Runtime optimizations
- Hardware Flexibility: Support both CPU and GPU execution with automatic provider selection
- Cost Reduction: Enable 100% cost-free inference for local model execution
- Privacy: Provide fully local inference option for sensitive data
- Model Support: Support Microsoft Phi-3 and other ONNX-compatible models
📊 Expected Performance Gains
CPU Optimization
- WebAssembly + SIMD: 3.4x performance improvement
- ONNX Runtime CPU: 2x average performance gain vs PyTorch/TensorFlow
- Graph Optimizations: 47% → 0.5% CPU usage (~99% reduction)
- Inference Speed: ~20 tokens/second (Phi-3-medium on Intel i9-10920X)
GPU Acceleration
- CUDA Execution Provider: 10-100x speedup on NVIDIA GPUs
- TensorRT: Additional 2-5x optimization on top of CUDA
- DirectML (Windows): Native GPU acceleration on Windows
- WebGPU: Browser/Electron GPU acceleration
🏗️ Architecture
Component Overview
┌─────────────────────────────────────────────────────┐
│ Agentic-Flow Multi-Model Router │
│ │
│ ┌────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Anthropic │ │ OpenRouter │ │ ONNX Runtime│ │
│ │ Provider │ │ Provider │ │ Provider │ │
│ └────────────┘ └──────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐│
│ │ Execution Provider ││
│ │ Selector ││
│ └────────────────────────┘│
│ │ │
│ ┌─────────────────────────────┼──────┐ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌────────┐ ┌─────────┐ ┌───────┐ ┌─────────┐ │
│ │  CPU   │ │  CUDA   │ │WebGPU │ │DirectML │ │
│ │ (WASM) │ │(NVIDIA) │ │       │ │(Windows)│ │
│ └────────┘ └─────────┘ └───────┘ └─────────┘ │
└─────────────────────────────────────────────────────┘
│
▼
┌───────────────────────┐
│ ONNX Model Store │
│ │
│ • Phi-3-mini (4K) │
│ • Phi-3-medium (128K) │
│ • Llama-3 ONNX │
│ • Custom models │
└───────────────────────┘
Data Flow
User Request
│
▼
Router (model selection)
│
▼
ONNX Provider
│
├─→ Load ONNX Model (if not cached)
│
├─→ Select Execution Provider
│ │
│ ├─→ Probe GPU availability
│ ├─→ Check CPU capabilities (SIMD, threads)
│ └─→ Prioritize: CUDA > DirectML > WebGPU > CPU
│
├─→ Create Inference Session
│ │
│ ├─→ Apply graph optimizations
│ ├─→ Configure threading
│ └─→ Enable SIMD if available
│
├─→ Run Inference
│ │
│ ├─→ Tokenize input
│ ├─→ Execute model
│ └─→ Decode output
│
└─→ Return Response
│
├─→ Update metrics (latency, tokens)
└─→ Cache model for reuse
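At the code level, this entire flow is hidden behind one provider call. A minimal usage sketch follows (the ONNXProvider class and its config shape are defined in Phase 1 below; the exact field names here are illustrative, not final):
// Hypothetical end-to-end call into the ONNX provider (Phase 1 defines the class).
import { ONNXProvider } from './router/providers/onnx';

const provider = new ONNXProvider({ models: { default: 'phi-3-mini-4k-cpu' } });

// The first call downloads/caches the model and creates the inference session;
// subsequent calls reuse both.
const response = await provider.chat({
  model: 'phi-3-mini-4k-cpu',
  messages: [{ role: 'user', content: 'Summarize ONNX Runtime in one sentence.' }],
  maxTokens: 128
});
console.log(response);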
📦 NPM Packages Required
Core Dependencies
{
  "dependencies": {
    "onnxruntime-node": "^1.22.0",
    "@xenova/transformers": "^2.6.0",
    "sharp": "^0.32.0"
  },
  "devDependencies": {
    "@types/node": "^20.0.0"
  },
  "optionalDependencies": {
    "onnxruntime-node-gpu": "^1.22.0"
  }
}
Package Descriptions
- onnxruntime-node (Required)
  - Core ONNX Runtime for Node.js
  - CPU execution provider
  - Supports Node.js v16.x+ (recommend v20.x+)
  - Size: ~20MB
- onnxruntime-node-gpu (Optional)
  - GPU acceleration via CUDA
  - Requires NVIDIA GPU + CUDA 11.8 or 12.x
  - Size: ~500MB (includes CUDA libraries)
- @xenova/transformers (Helper)
  - Transformers.js for tokenization
  - Pre/post-processing utilities
  - Model download management
- sharp (Optional)
  - Image processing for vision models
  - Only needed for multimodal support
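Once onnxruntime-node is installed, a quick smoke test is to load any .onnx file and print the graph's input and output names. A minimal sketch (the model path is a placeholder):
// Sanity-check that onnxruntime-node loads and can open a model.
import * as ort from 'onnxruntime-node';

async function inspectModel(modelPath: string): Promise<void> {
  // Create a CPU-only session and list the graph's inputs and outputs.
  const session = await ort.InferenceSession.create(modelPath, {
    executionProviders: ['cpu']
  });
  console.log('inputs: ', session.inputNames);   // e.g. input_ids, attention_mask
  console.log('outputs:', session.outputNames);  // e.g. logits
}

inspectModel('/path/to/model.onnx').catch(console.error);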
🔧 Implementation Phases
Phase 1: Core ONNX Provider (Week 1)
Objective: Basic ONNX Runtime integration with CPU inference
Tasks:
- Create ONNX provider class (src/router/providers/onnx.ts)
- Implement model loading and caching
- Add CPU execution provider support
- Integrate tokenization with Transformers.js
- Add basic error handling
Deliverables:
- ONNXProvider class implementing the LLMProvider interface
- Model download and caching system
- CPU inference working with Phi-3-mini
Code Structure:
// src/router/providers/onnx.ts
import * as ort from 'onnxruntime-node';
import { AutoTokenizer } from '@xenova/transformers';

export class ONNXProvider implements LLMProvider {
  name = 'onnx';
  type = 'onnx' as const;
  supportsStreaming = false; // Phase 2
  supportsTools = false;     // Phase 3
  supportsMCP = false;

  private session: ort.InferenceSession | null = null;
  private tokenizer: any = null;
  private modelPath: string;

  constructor(config: ProviderConfig) {
    this.modelPath = config.models?.default || 'microsoft/Phi-3-mini-4k-instruct-onnx-cpu';
  }

  async chat(params: ChatParams): Promise<ChatResponse> {
    // Initialize session on first use
    if (!this.session) {
      await this.initializeSession();
    }

    // Tokenize input
    const inputs = await this.tokenize(params.messages);

    // Run inference
    const outputs = await this.session!.run(inputs);

    // Decode output
    const response = await this.decode(outputs);

    return this.formatResponse(response, params.model);
  }

  private async initializeSession(): Promise<void> {
    // Download model if needed
    const modelPath = await this.downloadModel();

    // Create inference session with optimizations
    this.session = await ort.InferenceSession.create(modelPath, {
      executionProviders: ['cpu'],
      graphOptimizationLevel: 'all',
      enableCpuMemArena: true,
      enableMemPattern: true,
      executionMode: 'parallel',
      intraOpNumThreads: 4,
      interOpNumThreads: 2
    });

    // Initialize tokenizer
    this.tokenizer = await AutoTokenizer.from_pretrained(this.modelPath);
  }
}
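The tokenize and decode helpers used by chat() are not shown above; a minimal sketch using Transformers.js follows. Note that chat() compresses the generation loop for brevity (in practice decode() runs after a token-by-token loop like the Phase 3 stream() method), and the tensor wiring is an assumption to validate against the actual Phi-3 ONNX export, which also expects attention_mask and position_ids inputs (check session.inputNames):
// src/router/providers/onnx.ts (helper sketches, not a final implementation)
private async tokenize(messages: any[]): Promise<Record<string, ort.Tensor>> {
  // Render the chat history into a single prompt string, then encode it.
  // apply_chat_template requires a recent @xenova/transformers release.
  const prompt = this.tokenizer.apply_chat_template(messages, {
    tokenize: false,
    add_generation_prompt: true
  });
  const { input_ids } = await this.tokenizer(prompt);
  const ids = input_ids.data as BigInt64Array;
  return { input_ids: new ort.Tensor('int64', ids, [1, ids.length]) };
}

private async decode(tokenIds: number[]): Promise<string> {
  // Detokenize generated ids back into text, dropping EOS and padding.
  return this.tokenizer.decode(tokenIds, { skip_special_tokens: true });
}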
Testing:
- Load Phi-3-mini ONNX model
- Run simple inference test
- Measure baseline CPU performance
- Verify memory usage < 4GB (Phi-3-mini INT4; consistent with the success metrics below)
Phase 2: GPU Acceleration (Week 2)
Objective: Add GPU support with automatic provider selection
Tasks:
- Implement execution provider detection
- Add CUDA execution provider
- Add DirectML execution provider (Windows)
- Add WebGPU support (Electron/browser)
- Implement automatic provider fallback
Deliverables:
- GPU detection and capability probing
- Multi-provider support with prioritization
- Automatic fallback chain
Code Structure:
// src/router/providers/onnx.ts (additions)
export class ONNXProvider implements LLMProvider {
  private async detectExecutionProviders(): Promise<string[]> {
    const providers: string[] = [];

    // Try CUDA first (NVIDIA GPU)
    if (await this.isCUDAAvailable()) {
      providers.push('cuda');
      console.log('✅ CUDA execution provider available');
    }

    // Try DirectML (Windows GPU)
    if (process.platform === 'win32' && await this.isDirectMLAvailable()) {
      providers.push('dml');
      console.log('✅ DirectML execution provider available');
    }

    // Try WebGPU (browser/Electron)
    if (await this.isWebGPUAvailable()) {
      providers.push('webgpu');
      console.log('✅ WebGPU execution provider available');
    }

    // Always fall back to CPU
    providers.push('cpu');
    console.log('✅ CPU execution provider available');

    return providers;
  }

  private async initializeSession(): Promise<void> {
    const modelPath = await this.downloadModel();
    const providers = await this.detectExecutionProviders();

    this.session = await ort.InferenceSession.create(modelPath, {
      executionProviders: providers,
      graphOptimizationLevel: 'all',
      enableCpuMemArena: true,
      enableMemPattern: true
    });

    // ONNX Runtime falls down this list at session creation;
    // log the priority order we requested.
    console.log(`🚀 Execution provider priority: ${providers.join(' > ')}`);
  }

  private async isCUDAAvailable(): Promise<boolean> {
    try {
      // Probe by creating a throwaway session with the CUDA provider;
      // creation fails if the CUDA libraries are not present.
      await ort.InferenceSession.create(
        'path/to/test.onnx', // placeholder probe model
        { executionProviders: ['cuda'] }
      );
      return true;
    } catch {
      return false;
    }
  }
}
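The isDirectMLAvailable and isWebGPUAvailable probes are referenced but not defined. A conservative sketch follows; these checks are assumptions rather than official ONNX Runtime detection APIs, and the most reliable probe remains attempting session creation with the provider, as isCUDAAvailable does (probeModelPath is a hypothetical tiny bundled model used only for probing):
// src/router/providers/onnx.ts (probe sketches)
private async isDirectMLAvailable(): Promise<boolean> {
  // DirectML only exists on Windows; beyond that, try creating a session
  // with the 'dml' provider and treat failure as "not available".
  if (process.platform !== 'win32') return false;
  return this.probeProvider('dml');
}

private async isWebGPUAvailable(): Promise<boolean> {
  // WebGPU surfaces as navigator.gpu in browser/Electron renderer contexts;
  // a plain Node.js process will not have it.
  return typeof navigator !== 'undefined' && 'gpu' in navigator;
}

private async probeProvider(name: string): Promise<boolean> {
  try {
    // this.probeModelPath: hypothetical small .onnx shipped for probing.
    await ort.InferenceSession.create(this.probeModelPath, {
      executionProviders: [name]
    });
    return true;
  } catch {
    return false;
  }
}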
Testing:
- Test on CPU-only machine
- Test on NVIDIA GPU machine
- Test on Windows with DirectML
- Verify automatic fallback works
- Benchmark performance improvements
Expected Results:
- CUDA: 10-100x faster than CPU
- DirectML: 5-20x faster than CPU
- Automatic selection working
Phase 3: Optimization & Streaming (Week 3)
Objective: Performance optimizations and streaming support
Tasks:
- Implement streaming inference
- Add WebAssembly SIMD optimization
- Implement model quantization support
- Add KV cache for faster generation
- Optimize memory usage
Deliverables:
- Streaming token generation
- SIMD-optimized CPU inference
- INT8/INT4 quantized model support
- Reduced memory footprint
Code Structure:
// src/router/providers/onnx.ts (streaming)
export class ONNXProvider implements LLMProvider {
  supportsStreaming = true;

  async *stream(params: ChatParams): AsyncGenerator<StreamChunk> {
    if (!this.session) {
      await this.initializeSession();
    }

    const inputs = await this.tokenize(params.messages);
    const maxTokens = params.maxTokens || 512;
    const generatedTokens: number[] = [];

    for (let i = 0; i < maxTokens; i++) {
      // Run inference for the next token
      const outputs = await this.session!.run({
        ...inputs,
        past_key_values: this.kvCache // reuse cached attention state for speed
      });

      // Extract next token
      const nextToken = this.sampleToken(outputs.logits);
      generatedTokens.push(nextToken);

      // Update KV cache
      this.updateKVCache(outputs.present_key_values);

      // Decode and yield
      const text = await this.tokenizer.decode([nextToken]);
      yield {
        type: 'content_block_delta',
        delta: {
          type: 'text_delta',
          text
        }
      };

      // Stop on EOS token
      if (nextToken === this.tokenizer.eos_token_id) {
        yield { type: 'message_stop' };
        break;
      }
    }
  }
}
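sampleToken above is left undefined. A minimal greedy-decoding sketch follows, assuming the logits tensor has shape [1, seq_len, vocab_size] (verify against the actual model output; temperature/top-p sampling would replace the argmax):
// src/router/providers/onnx.ts (greedy sampling sketch)
private sampleToken(logits: ort.Tensor): number {
  const [, seqLen, vocabSize] = logits.dims;   // assumes batch size 1
  const data = logits.data as Float32Array;
  const offset = (seqLen - 1) * vocabSize;     // logits at the last position

  // Argmax over the vocabulary = greedy decoding.
  let best = 0;
  let bestScore = -Infinity;
  for (let v = 0; v < vocabSize; v++) {
    if (data[offset + v] > bestScore) {
      bestScore = data[offset + v];
      best = v;
    }
  }
  return best;
}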
Optimizations:
// WASM + SIMD configuration (onnxruntime-web, i.e. browser/Electron targets;
// onnxruntime-node uses native CPU kernels and ignores these flags)
import * as ort from 'onnxruntime-web';

// SIMD and threading are configured via global env flags, not per-session options
ort.env.wasm.simd = true;
ort.env.wasm.numThreads = navigator.hardwareConcurrency || 4;

const sessionOptions: ort.InferenceSession.SessionOptions = {
  executionProviders: ['wasm'],
  graphOptimizationLevel: 'all',
  enableCpuMemArena: true,
  enableMemPattern: true,
  executionMode: 'parallel'
};
Testing:
- Measure streaming latency
- Verify SIMD activation
- Test quantized models (INT8, INT4)
- Benchmark KV cache improvements
- Memory profiling
Expected Results:
- Streaming: <100ms time to first token
- SIMD: 3.4x CPU performance improvement
- Quantization: 2-4x faster inference, 50% less memory
- KV cache: 2-3x faster multi-turn conversations
Phase 4: Model Management (Week 4)
Objective: Model download, caching, and selection
Tasks:
- Implement HuggingFace model downloader
- Add local model caching
- Create model registry
- Add model version management
- Implement automatic model selection based on hardware
Deliverables:
- Automatic model download from HuggingFace
- Local model cache (~/.agentic-flow/onnx-models/)
- Model registry with hardware requirements
- Smart model selection
Code Structure:
// src/router/onnx/model-manager.ts
import { join } from 'path';
import { homedir } from 'os';
import { existsSync } from 'fs';

export class ONNXModelManager {
  private cacheDir = join(homedir(), '.agentic-flow', 'onnx-models');

  private models = {
    'phi-3-mini-4k-cpu': {
      huggingface: 'microsoft/Phi-3-mini-4k-instruct-onnx-cpu',
      files: ['cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4'],
      size: '2.4GB',
      requirements: { ram: '4GB', gpu: false }
    },
    'phi-3-mini-4k-gpu': {
      huggingface: 'microsoft/Phi-3-mini-4k-instruct-onnx-cuda',
      files: ['cuda/cuda-fp16'],
      size: '4.8GB',
      requirements: { ram: '8GB', gpu: 'CUDA' }
    },
    'phi-3-medium-128k-cpu': {
      huggingface: 'microsoft/Phi-3-medium-128k-instruct-onnx-cpu',
      files: ['cpu_and_mobile/cpu-int4-rtn-block-32'],
      size: '8.2GB',
      requirements: { ram: '16GB', gpu: false }
    }
  };

  async downloadModel(modelId: string): Promise<string> {
    const modelInfo = this.models[modelId];
    if (!modelInfo) {
      throw new Error(`Unknown model: ${modelId}`);
    }

    const modelPath = join(this.cacheDir, modelId);

    // Check if already downloaded
    if (existsSync(join(modelPath, 'model.onnx'))) {
      console.log(`✅ Model ${modelId} already cached`);
      return modelPath;
    }

    console.log(`📥 Downloading ${modelId} (${modelInfo.size})...`);

    // Download from HuggingFace
    await this.downloadFromHuggingFace(
      modelInfo.huggingface,
      modelInfo.files,
      modelPath
    );

    console.log(`✅ Model ${modelId} downloaded to ${modelPath}`);
    return modelPath;
  }

  selectModelForHardware(): string {
    // Detect hardware capabilities
    const hasGPU = this.detectGPU();
    const ram = this.getAvailableRAM();

    if (hasGPU && ram >= 8) {
      return 'phi-3-mini-4k-gpu';
    } else if (ram >= 16) {
      return 'phi-3-medium-128k-cpu';
    } else {
      return 'phi-3-mini-4k-cpu';
    }
  }
}
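getAvailableRAM, detectGPU, and downloadFromHuggingFace are stubs above. The RAM check maps directly onto Node's os module, detectGPU can reuse the Phase 2 CUDA probe, and the download sketch assumes HuggingFace's standard resolve URL layout (https://huggingface.co/{repo}/resolve/main/{file}); checksums should be verified per the Security section below:
// src/router/onnx/model-manager.ts (helper sketches; assumes Node 18+, where fetch is global)
import { totalmem } from 'os';
import { mkdir, writeFile } from 'fs/promises';
import { join, dirname } from 'path';

private getAvailableRAM(): number {
  // Total system RAM in GiB; a coarse proxy, since free memory fluctuates.
  return totalmem() / 1024 ** 3;
}

private async downloadFromHuggingFace(repo: string, files: string[], dest: string): Promise<void> {
  for (const file of files) {
    const url = `https://huggingface.co/${repo}/resolve/main/${file}`;
    const res = await fetch(url);
    if (!res.ok) throw new Error(`Download failed (${res.status}): ${url}`);

    const target = join(dest, file);
    await mkdir(dirname(target), { recursive: true });
    // NOTE: buffering multi-GB files in memory is impractical; a real
    // implementation should stream res.body to disk instead.
    await writeFile(target, Buffer.from(await res.arrayBuffer()));
  }
}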
Testing:
- Test model download from HuggingFace
- Verify caching works correctly
- Test automatic model selection
- Test with multiple models
- Verify disk space management
Phase 5: Integration & CLI (Week 5)
Objective: Integrate ONNX provider into router and add CLI commands
Tasks:
- Add ONNX provider to router initialization
- Add CLI commands for ONNX management
- Implement cost tracking (always $0 for local)
- Add performance benchmarking
- Update routing rules for ONNX
Deliverables:
- ONNX provider in multi-model router
- CLI commands for model management
- Benchmark utilities
- Updated documentation
CLI Commands:
# List available ONNX models
npx agentic-flow onnx models
# Download a model
npx agentic-flow onnx download phi-3-mini-4k-cpu
# List downloaded models
npx agentic-flow onnx list
# Test ONNX inference
npx agentic-flow onnx test --model phi-3-mini-4k-cpu
# Benchmark performance
npx agentic-flow onnx benchmark --model phi-3-mini-4k-cpu
# Check hardware capabilities
npx agentic-flow onnx info
# Use ONNX provider for inference
npx agentic-flow --provider onnx --model phi-3-mini-4k-cpu --task "Hello world"
# Use ONNX with GPU
npx agentic-flow --provider onnx --model phi-3-mini-4k-gpu --execution-provider cuda --task "Complex task"
Router Integration:
// src/router/router.ts
private initializeProviders(): void {
  // ... existing providers ...

  // Initialize ONNX
  if (this.config.providers.onnx) {
    try {
      const provider = new ONNXProvider(this.config.providers.onnx);
      this.providers.set('onnx', provider);
      console.log('✅ ONNX provider initialized');
    } catch (error) {
      console.error('❌ Failed to initialize ONNX:', error);
    }
  }
}
Configuration:
{
  "providers": {
    "onnx": {
      "models": {
        "default": "phi-3-mini-4k-cpu",
        "fast": "phi-3-mini-4k-cpu",
        "advanced": "phi-3-medium-128k-cpu",
        "gpu": "phi-3-mini-4k-gpu"
      },
      "executionProviders": ["cuda", "dml", "cpu"],
      "graphOptimizationLevel": "all",
      "enableMemoryOptimization": true,
      "threads": 4,
      "timeout": 60000
    }
  },
  "routing": {
    "rules": [
      {
        "condition": {
          "privacy": "high",
          "localOnly": true
        },
        "action": {
          "provider": "onnx",
          "model": "phi-3-mini-4k-cpu"
        },
        "reason": "Privacy-sensitive tasks use local ONNX models"
      },
      {
        "condition": {
          "agentType": ["researcher"],
          "complexity": "low"
        },
        "action": {
          "provider": "onnx",
          "model": "phi-3-mini-4k-cpu"
        },
        "reason": "Simple tasks use free local inference"
      }
    ]
  }
}
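Rules of this shape resolve to a provider with a simple first-match scan. A sketch of how the router might evaluate them (the types here are illustrative, not the router's actual interfaces):
// Hypothetical rule evaluation: first rule whose condition keys all match wins.
interface RoutingRule {
  condition: Record<string, unknown>;
  action: { provider: string; model?: string };
  reason: string;
}

function selectRoute(
  rules: RoutingRule[],
  ctx: Record<string, unknown>
): RoutingRule['action'] | null {
  for (const rule of rules) {
    // Array-valued conditions (e.g. "agentType": ["researcher"]) match by membership.
    const matches = Object.entries(rule.condition).every(([key, value]) =>
      Array.isArray(value) ? value.includes(ctx[key]) : ctx[key] === value
    );
    if (matches) return rule.action;
  }
  return null; // caller falls back to the configured default provider
}

// Example: a privacy-sensitive request routes to the local ONNX model.
// selectRoute(rules, { privacy: 'high', localOnly: true })
//   → { provider: 'onnx', model: 'phi-3-mini-4k-cpu' }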
Phase 6: Advanced Features (Week 6)
Objective: Vision support, tool calling, and production optimizations
Tasks:
- Add multimodal support (vision)
- Implement tool calling for ONNX models
- Add model fine-tuning support
- Implement distributed inference
- Production hardening
Features:
- Phi-3-vision for image understanding
- Custom tool calling layer
- Model adaptation for specific tasks
- Multi-GPU support
- Load balancing across models
📋 Configuration Examples
Basic CPU Configuration
{
  "version": "1.0",
  "defaultProvider": "onnx",
  "providers": {
    "onnx": {
      "models": {
        "default": "phi-3-mini-4k-cpu"
      },
      "executionProviders": ["cpu"],
      "threads": 4
    }
  }
}
GPU Optimized Configuration
{
  "version": "1.0",
  "defaultProvider": "onnx",
  "providers": {
    "onnx": {
      "models": {
        "default": "phi-3-mini-4k-gpu"
      },
      "executionProviders": ["cuda", "cpu"],
      "cudaOptions": {
        "deviceId": 0,
        "gpuMemLimit": 4294967296,
        "arenaExtendStrategy": "kSameAsRequested"
      }
    }
  }
}
Hybrid Cloud + Local Configuration
{
  "version": "1.0",
  "defaultProvider": "anthropic",
  "fallbackChain": ["anthropic", "openrouter", "onnx"],
  "providers": {
    "anthropic": { ... },
    "openrouter": { ... },
    "onnx": {
      "models": {
        "default": "phi-3-mini-4k-cpu"
      }
    }
  },
  "routing": {
    "mode": "rule-based",
    "rules": [
      { "condition": { "privacy": "high" }, "action": { "provider": "onnx" } },
      { "condition": { "complexity": "low" }, "action": { "provider": "onnx" } },
      { "condition": { "complexity": "high" }, "action": { "provider": "anthropic" } }
    ]
  }
}
🎯 Success Metrics
Performance Targets
| Metric | Target | Measurement |
|---|---|---|
| CPU Inference Speed | 15-20 tokens/sec | Phi-3-mini on i9-10920X |
| GPU Inference Speed | 100+ tokens/sec | Phi-3-mini on RTX 3090 |
| Time to First Token | <500ms | Streaming mode |
| Memory Usage (CPU) | <4GB | Phi-3-mini INT4 |
| Model Load Time | <10s | First request only |
Cost Savings
| Scenario | Cloud Cost | ONNX Cost | Savings |
|---|---|---|---|
| 1M tokens (research) | $3.00 | $0.00 | 100% |
| 1M tokens (coding) | $15.00 | $0.00 | 100% |
| Monthly development | $100-500 | $0.00 | 100% |
Quality Targets
| Metric | Target |
|---|---|
| Accuracy vs Cloud | >95% for simple tasks |
| Success Rate | >99% (no network failures) |
| Latency Variance | <10% (consistent) |
🔒 Security & Privacy
Benefits
- ✅ No data sent to external services
- ✅ No API keys required for local models
- ✅ Fully offline operation possible
- ✅ Supports HIPAA/GDPR-compliant deployments (no data leaves the machine)
- ✅ No usage tracking or telemetry
Considerations
- Models downloaded from HuggingFace (verify checksums)
- Model license compliance (MIT for Phi-3)
- Disk space for model storage (2-10GB per model)
🐛 Known Limitations
- Model Size: ONNX models are 2-10GB, requiring significant disk space
- Initial Download: First-time model download takes 5-30 minutes
- Hardware Requirements: GPU models require NVIDIA GPU with CUDA
- Tool Calling: Limited compared to Claude/GPT (requires custom implementation)
- Streaming: Initial implementation may have higher latency than cloud
- Context Length: the Phi-3-mini 4K variant is limited to 4K tokens (vs 200K for Claude)
📊 Benchmarking Plan
Test Suite
# CPU Benchmark
npx agentic-flow onnx benchmark \
  --model phi-3-mini-4k-cpu \
  --provider cpu \
  --iterations 100

# GPU Benchmark
npx agentic-flow onnx benchmark \
  --model phi-3-mini-4k-gpu \
  --provider cuda \
  --iterations 100

# Comparison Benchmark
npx agentic-flow router benchmark \
  --providers "onnx,anthropic,openrouter" \
  --task "Write a hello world function" \
  --iterations 50
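Internally, the benchmark command can be a thin harness over the provider's streaming interface from Phase 3. A sketch of the core measurement loop (the ChatParams fields are abbreviated):
// Hypothetical measurement loop behind `onnx benchmark`.
async function benchmarkProvider(provider: ONNXProvider, prompt: string, iterations: number) {
  let totalTokens = 0;
  let totalMs = 0;
  const firstTokenMs: number[] = [];

  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    let tokens = 0;
    for await (const chunk of provider.stream({
      messages: [{ role: 'user', content: prompt }]
    } as any)) {
      if (chunk.type === 'content_block_delta' && ++tokens === 1) {
        firstTokenMs.push(performance.now() - start); // time to first token
      }
    }
    totalTokens += tokens;
    totalMs += performance.now() - start;
  }

  console.log(`tokens/sec: ${(totalTokens / (totalMs / 1000)).toFixed(1)}`);
  console.log(`avg TTFT:   ${(firstTokenMs.reduce((a, b) => a + b, 0) / firstTokenMs.length).toFixed(0)}ms`);
}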
Benchmark Metrics
- Latency
  - Time to first token
  - Tokens per second
  - End-to-end request time
- Throughput
  - Concurrent requests
  - Batch processing
- Resource Usage
  - CPU utilization
  - Memory consumption
  - GPU memory
  - Disk I/O
- Quality
  - Response accuracy
  - Instruction following
  - Consistency
🚀 Deployment Strategy
Development Phase
- Use ONNX CPU for testing
- Small models (Phi-3-mini)
- Local development only
Staging Phase
- Test GPU acceleration
- Larger models (Phi-3-medium)
- Performance benchmarking
Production Phase
- Hybrid cloud + local routing
- GPU for high-throughput
- Fallback to cloud for complex tasks
📚 Resources
Documentation
- ONNX Runtime: https://onnxruntime.ai/
- Phi-3 Models: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx-cpu
- Transformers.js: https://huggingface.co/docs/transformers.js
Examples
- Phi-3 Chat: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/js/chat
- WebGPU RAG: https://github.com/microsoft/Phi-3CookBook/tree/main/code/08.RAG/rag_webgpu_chat
Performance Guides
- CPU Optimization: https://onnxruntime.ai/docs/performance/model-optimizations/graph-optimizations.html
- GPU Providers: https://onnxruntime.ai/docs/execution-providers/
✅ Next Steps
- Immediate: Review and approve this implementation plan
- Week 1: Begin Phase 1 (Core ONNX Provider)
- Week 2: Implement Phase 2 (GPU Acceleration)
- Week 3: Complete Phase 3 (Optimization)
- Week 4: Execute Phase 4 (Model Management)
- Week 5: Finish Phase 5 (Integration)
- Week 6: Deploy Phase 6 (Advanced Features)
Status: Ready for Implementation
Estimated Timeline: 6 weeks
Estimated Effort: 120-150 hours
Risk Level: Low (proven technology, clear path)
ROI: High (100% cost savings for local inference, 2-100x performance improvement)