ONNX Runtime Integration Plan for Agentic-Flow
Executive Summary
Integrate ONNX Runtime to enable high-performance local model inference on both CPU and GPU, providing 2-100x speedup over standard inference and enabling privacy-focused, cost-free local execution of AI models including Microsoft Phi-3.
🎯 Objectives
- Performance: Achieve 2-100x inference speedup using ONNX Runtime optimizations
- Hardware Flexibility: Support both CPU and GPU execution with automatic provider selection
- Cost Reduction: Enable 100% cost-free inference for local model execution
- Privacy: Provide fully local inference option for sensitive data
- Model Support: Support Microsoft Phi-3 and other ONNX-compatible models
📊 Expected Performance Gains
CPU Optimization
- WebAssembly + SIMD: 3.4x performance improvement
- ONNX Runtime CPU: 2x average performance gain vs PyTorch/TensorFlow
- Graph Optimizations: 47% → 0.5% CPU usage (~99% reduction)
- Inference Speed: ~20 tokens/second (Phi-3-medium on Intel i9-10920X)
GPU Acceleration
- CUDA Execution Provider: 10-100x speedup on NVIDIA GPUs
- TensorRT: Additional 2-5x optimization on top of CUDA
- DirectML (Windows): Native GPU acceleration on Windows
- WebGPU: Browser/Electron GPU acceleration
🏗️ Architecture
Component Overview
┌─────────────────────────────────────────────────────┐
│ Agentic-Flow Multi-Model Router │
│ │
│ ┌────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Anthropic │ │ OpenRouter │ │ ONNX Runtime│ │
│ │ Provider │ │ Provider │ │ Provider │ │
│ └────────────┘ └──────────────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────┐│
│ │ Execution Provider ││
│ │ Selector ││
│ └────────────────────────┘│
│ │ │
│ ┌─────────────────────────────┼──────┐ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ ┌────────┐ ┌─────────┐ ┌───────┐ ┌─────────┐ │
│ │  CPU   │ │  CUDA   │ │WebGPU │ │DirectML │ │
│ │ (WASM) │ │(NVIDIA) │ │       │ │(Windows)│ │
│ └────────┘ └─────────┘ └───────┘ └─────────┘ │
└─────────────────────────────────────────────────────┘
│
▼
┌───────────────────────┐
│ ONNX Model Store │
│ │
│ • Phi-3-mini (4K) │
│ • Phi-3-medium (128K) │
│ • Llama-3 ONNX │
│ • Custom models │
└───────────────────────┘
Data Flow
User Request
│
▼
Router (model selection)
│
▼
ONNX Provider
│
├─→ Load ONNX Model (if not cached)
│
├─→ Select Execution Provider
│ │
│ ├─→ Probe GPU availability
│ ├─→ Check CPU capabilities (SIMD, threads)
│ └─→ Prioritize: CUDA > DirectML > WebGPU > CPU
│
├─→ Create Inference Session
│ │
│ ├─→ Apply graph optimizations
│ ├─→ Configure threading
│ └─→ Enable SIMD if available
│
├─→ Run Inference
│ │
│ ├─→ Tokenize input
│ ├─→ Execute model
│ └─→ Decode output
│
└─→ Return Response
│
├─→ Update metrics (latency, tokens)
└─→ Cache model for reuse
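At the code level, this entire flow is hidden behind one provider call. A minimal usage sketch follows (the ONNXProvider class and its config shape are defined in Phase 1 below; the exact field names here are illustrative, not final):
// Hypothetical end-to-end call into the ONNX provider (Phase 1 defines the class).
import { ONNXProvider } from './router/providers/onnx';

const provider = new ONNXProvider({ models: { default: 'phi-3-mini-4k-cpu' } });

// The first call downloads/caches the model and creates the inference session;
// subsequent calls reuse both.
const response = await provider.chat({
  model: 'phi-3-mini-4k-cpu',
  messages: [{ role: 'user', content: 'Summarize ONNX Runtime in one sentence.' }],
  maxTokens: 128
});
console.log(response);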
📦 NPM Packages Required
Core Dependencies
{
  "dependencies": {
    "onnxruntime-node": "^1.22.0",
    "@xenova/transformers": "^2.6.0",
    "sharp": "^0.32.0"
  },
  "devDependencies": {
    "@types/node": "^20.0.0"
  },
  "optionalDependencies": {
    "onnxruntime-node-gpu": "^1.22.0"
  }
}
Package Descriptions
- onnxruntime-node (Required)
  - Core ONNX Runtime for Node.js
  - CPU execution provider
  - Supports Node.js v16.x+ (recommend v20.x+)
  - Size: ~20MB
- onnxruntime-node-gpu (Optional)
  - GPU acceleration via CUDA
  - Requires NVIDIA GPU + CUDA 11.8 or 12.x
  - Size: ~500MB (includes CUDA libraries)
- @xenova/transformers (Helper)
  - Transformers.js for tokenization
  - Pre/post-processing utilities
  - Model download management
- sharp (Optional)
  - Image processing for vision models
  - Only needed for multimodal support
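Once onnxruntime-node is installed, a quick smoke test is to load any .onnx file and print the graph's input and output names. A minimal sketch (the model path is a placeholder):
// Sanity-check that onnxruntime-node loads and can open a model.
import * as ort from 'onnxruntime-node';

async function inspectModel(modelPath: string): Promise<void> {
  // Create a CPU-only session and list the graph's inputs and outputs.
  const session = await ort.InferenceSession.create(modelPath, {
    executionProviders: ['cpu']
  });
  console.log('inputs: ', session.inputNames);   // e.g. input_ids, attention_mask
  console.log('outputs:', session.outputNames);  // e.g. logits
}

inspectModel('/path/to/model.onnx').catch(console.error);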
🔧 Implementation Phases
Phase 1: Core ONNX Provider (Week 1)
Objective: Basic ONNX Runtime integration with CPU inference
Tasks:
- Create ONNX provider class (src/router/providers/onnx.ts)
- Implement model loading and caching
- Add CPU execution provider support
- Integrate tokenization with Transformers.js
- Add basic error handling
Deliverables:
- ONNXProvider class implementing the LLMProvider interface
- Model download and caching system
- CPU inference working with Phi-3-mini
Code Structure:
// src/router/providers/onnx.ts
import * as ort from 'onnxruntime-node';
import { AutoTokenizer } from '@xenova/transformers';

export class ONNXProvider implements LLMProvider {
  name = 'onnx';
  type = 'onnx' as const;
  supportsStreaming = false; // Phase 2
  supportsTools = false;     // Phase 3
  supportsMCP = false;

  private session: ort.InferenceSession | null = null;
  private tokenizer: any = null;
  private modelPath: string;

  constructor(config: ProviderConfig) {
    this.modelPath = config.models?.default || 'microsoft/Phi-3-mini-4k-instruct-onnx-cpu';
  }

  async chat(params: ChatParams): Promise<ChatResponse> {
    // Initialize session on first use
    if (!this.session) {
      await this.initializeSession();
    }

    // Tokenize input
    const inputs = await this.tokenize(params.messages);

    // Run inference
    const outputs = await this.session!.run(inputs);

    // Decode output
    const response = await this.decode(outputs);

    return this.formatResponse(response, params.model);
  }

  private async initializeSession(): Promise<void> {
    // Download model if needed
    const modelPath = await this.downloadModel();

    // Create inference session with optimizations
    this.session = await ort.InferenceSession.create(modelPath, {
      executionProviders: ['cpu'],
      graphOptimizationLevel: 'all',
      enableCpuMemArena: true,
      enableMemPattern: true,
      executionMode: 'parallel',
      intraOpNumThreads: 4,
      interOpNumThreads: 2
    });

    // Initialize tokenizer
    this.tokenizer = await AutoTokenizer.from_pretrained(this.modelPath);
  }
}
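The tokenize and decode helpers used by chat() are not shown above; a minimal sketch using Transformers.js follows. Note that chat() compresses the generation loop for brevity (in practice decode() runs after a token-by-token loop like the Phase 3 stream() method), and the tensor wiring is an assumption to validate against the actual Phi-3 ONNX export, which also expects attention_mask and position_ids inputs (check session.inputNames):
// src/router/providers/onnx.ts (helper sketches, not a final implementation)
private async tokenize(messages: any[]): Promise<Record<string, ort.Tensor>> {
  // Render the chat history into a single prompt string, then encode it.
  // apply_chat_template requires a recent @xenova/transformers release.
  const prompt = this.tokenizer.apply_chat_template(messages, {
    tokenize: false,
    add_generation_prompt: true
  });
  const { input_ids } = await this.tokenizer(prompt);
  const ids = input_ids.data as BigInt64Array;
  return { input_ids: new ort.Tensor('int64', ids, [1, ids.length]) };
}

private async decode(tokenIds: number[]): Promise<string> {
  // Detokenize generated ids back into text, dropping EOS and padding.
  return this.tokenizer.decode(tokenIds, { skip_special_tokens: true });
}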
Testing:
- Load Phi-3-mini ONNX model
- Run simple inference test
- Measure baseline CPU performance
- Verify memory usage < 4GB (Phi-3-mini INT4; consistent with the success metrics below)
Phase 2: GPU Acceleration (Week 2)
Objective: Add GPU support with automatic provider selection
Tasks:
- Implement execution provider detection
- Add CUDA execution provider
- Add DirectML execution provider (Windows)
- Add WebGPU support (Electron/browser)
- Implement automatic provider fallback
Deliverables:
- GPU detection and capability probing
- Multi-provider support with prioritization
- Automatic fallback chain
Code Structure:
// src/router/providers/onnx.ts (additions)
export class ONNXProvider implements LLMProvider {
  private async detectExecutionProviders(): Promise<string[]> {
    const providers: string[] = [];

    // Try CUDA first (NVIDIA GPU)
    if (await this.isCUDAAvailable()) {
      providers.push('cuda');
      console.log('✅ CUDA execution provider available');
    }

    // Try DirectML (Windows GPU)
    if (process.platform === 'win32' && await this.isDirectMLAvailable()) {
      providers.push('dml');
      console.log('✅ DirectML execution provider available');
    }

    // Try WebGPU (browser/Electron)
    if (await this.isWebGPUAvailable()) {
      providers.push('webgpu');
      console.log('✅ WebGPU execution provider available');
    }

    // Always fall back to CPU
    providers.push('cpu');
    console.log('✅ CPU execution provider available');

    return providers;
  }

  private async initializeSession(): Promise<void> {
    const modelPath = await this.downloadModel();
    const providers = await this.detectExecutionProviders();

    this.session = await ort.InferenceSession.create(modelPath, {
      executionProviders: providers,
      graphOptimizationLevel: 'all',
      enableCpuMemArena: true,
      enableMemPattern: true
    });

    // ONNX Runtime falls down this list at session creation;
    // log the priority order we requested.
    console.log(`🚀 Execution provider priority: ${providers.join(' > ')}`);
  }

  private async isCUDAAvailable(): Promise<boolean> {
    try {
      // Probe by creating a throwaway session with the CUDA provider;
      // creation fails if the CUDA libraries are not present.
      await ort.InferenceSession.create(
        'path/to/test.onnx', // placeholder probe model
        { executionProviders: ['cuda'] }
      );
      return true;
    } catch {
      return false;
    }
  }
}
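The isDirectMLAvailable and isWebGPUAvailable probes are referenced but not defined. A conservative sketch follows; these checks are assumptions rather than official ONNX Runtime detection APIs, and the most reliable probe remains attempting session creation with the provider, as isCUDAAvailable does (probeModelPath is a hypothetical tiny bundled model used only for probing):
// src/router/providers/onnx.ts (probe sketches)
private async isDirectMLAvailable(): Promise<boolean> {
  // DirectML only exists on Windows; beyond that, try creating a session
  // with the 'dml' provider and treat failure as "not available".
  if (process.platform !== 'win32') return false;
  return this.probeProvider('dml');
}

private async isWebGPUAvailable(): Promise<boolean> {
  // WebGPU surfaces as navigator.gpu in browser/Electron renderer contexts;
  // a plain Node.js process will not have it.
  return typeof navigator !== 'undefined' && 'gpu' in navigator;
}

private async probeProvider(name: string): Promise<boolean> {
  try {
    // this.probeModelPath: hypothetical small .onnx shipped for probing.
    await ort.InferenceSession.create(this.probeModelPath, {
      executionProviders: [name]
    });
    return true;
  } catch {
    return false;
  }
}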
Testing:
- Test on CPU-only machine
- Test on NVIDIA GPU machine
- Test on Windows with DirectML
- Verify automatic fallback works
- Benchmark performance improvements
Expected Results:
- CUDA: 10-100x faster than CPU
- DirectML: 5-20x faster than CPU
- Automatic selection working
Phase 3: Optimization & Streaming (Week 3)
Objective: Performance optimizations and streaming support
Tasks:
- Implement streaming inference
- Add WebAssembly SIMD optimization
- Implement model quantization support
- Add KV cache for faster generation
- Optimize memory usage
Deliverables:
- Streaming token generation
- SIMD-optimized CPU inference
- INT8/INT4 quantized model support
- Reduced memory footprint
Code Structure:
// src/router/providers/onnx.ts (streaming)
export class ONNXProvider implements LLMProvider {
  supportsStreaming = true;

  async *stream(params: ChatParams): AsyncGenerator<StreamChunk> {
    if (!this.session) {
      await this.initializeSession();
    }

    const inputs = await this.tokenize(params.messages);
    const maxTokens = params.maxTokens || 512;
    const generatedTokens: number[] = [];

    for (let i = 0; i < maxTokens; i++) {
      // Run inference for the next token
      const outputs = await this.session!.run({
        ...inputs,
        past_key_values: this.kvCache // reuse cached attention state for speed
      });

      // Extract next token
      const nextToken = this.sampleToken(outputs.logits);
      generatedTokens.push(nextToken);

      // Update KV cache
      this.updateKVCache(outputs.present_key_values);

      // Decode and yield
      const text = await this.tokenizer.decode([nextToken]);
      yield {
        type: 'content_block_delta',
        delta: {
          type: 'text_delta',
          text
        }
      };

      // Stop on EOS token
      if (nextToken === this.tokenizer.eos_token_id) {
        yield { type: 'message_stop' };
        break;
      }
    }
  }
}
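sampleToken above is left undefined. A minimal greedy-decoding sketch follows, assuming the logits tensor has shape [1, seq_len, vocab_size] (verify against the actual model output; temperature/top-p sampling would replace the argmax):
// src/router/providers/onnx.ts (greedy sampling sketch)
private sampleToken(logits: ort.Tensor): number {
  const [, seqLen, vocabSize] = logits.dims;   // assumes batch size 1
  const data = logits.data as Float32Array;
  const offset = (seqLen - 1) * vocabSize;     // logits at the last position

  // Argmax over the vocabulary = greedy decoding.
  let best = 0;
  let bestScore = -Infinity;
  for (let v = 0; v < vocabSize; v++) {
    if (data[offset + v] > bestScore) {
      bestScore = data[offset + v];
      best = v;
    }
  }
  return best;
}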
Optimizations:
// WASM + SIMD configuration (onnxruntime-web, i.e. browser/Electron targets;
// onnxruntime-node uses native CPU kernels and ignores these flags)
import * as ort from 'onnxruntime-web';

// SIMD and threading are configured via global env flags, not per-session options
ort.env.wasm.simd = true;
ort.env.wasm.numThreads = navigator.hardwareConcurrency || 4;

const sessionOptions: ort.InferenceSession.SessionOptions = {
  executionProviders: ['wasm'],
  graphOptimizationLevel: 'all',
  enableCpuMemArena: true,
  enableMemPattern: true,
  executionMode: 'parallel'
};
Testing:
- Measure streaming latency
- Verify SIMD activation
- Test quantized models (INT8, INT4)
- Benchmark KV cache improvements
- Memory profiling
Expected Results:
- Streaming: <100ms time to first token
- SIMD: 3.4x CPU performance improvement
- Quantization: 2-4x faster inference, 50% less memory
- KV cache: 2-3x faster multi-turn conversations
Phase 4: Model Management (Week 4)
Objective: Model download, caching, and selection
Tasks:
- Implement HuggingFace model downloader
- Add local model caching
- Create model registry
- Add model version management
- Implement automatic model selection based on hardware
Deliverables:
- Automatic model download from HuggingFace
- Local model cache (~/.agentic-flow/onnx-models/)
- Model registry with hardware requirements
- Smart model selection
Code Structure:
// src/router/onnx/model-manager.ts
import { join } from 'path';
import { homedir } from 'os';
import { existsSync } from 'fs';

export class ONNXModelManager {
  private cacheDir = join(homedir(), '.agentic-flow', 'onnx-models');

  private models = {
    'phi-3-mini-4k-cpu': {
      huggingface: 'microsoft/Phi-3-mini-4k-instruct-onnx-cpu',
      files: ['cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4'],
      size: '2.4GB',
      requirements: { ram: '4GB', gpu: false }
    },
    'phi-3-mini-4k-gpu': {
      huggingface: 'microsoft/Phi-3-mini-4k-instruct-onnx-cuda',
      files: ['cuda/cuda-fp16'],
      size: '4.8GB',
      requirements: { ram: '8GB', gpu: 'CUDA' }
    },
    'phi-3-medium-128k-cpu': {
      huggingface: 'microsoft/Phi-3-medium-128k-instruct-onnx-cpu',
      files: ['cpu_and_mobile/cpu-int4-rtn-block-32'],
      size: '8.2GB',
      requirements: { ram: '16GB', gpu: false }
    }
  };

  async downloadModel(modelId: string): Promise<string> {
    const modelInfo = this.models[modelId];
    if (!modelInfo) {
      throw new Error(`Unknown model: ${modelId}`);
    }

    const modelPath = join(this.cacheDir, modelId);

    // Check if already downloaded
    if (existsSync(join(modelPath, 'model.onnx'))) {
      console.log(`✅ Model ${modelId} already cached`);
      return modelPath;
    }

    console.log(`📥 Downloading ${modelId} (${modelInfo.size})...`);

    // Download from HuggingFace
    await this.downloadFromHuggingFace(
      modelInfo.huggingface,
      modelInfo.files,
      modelPath
    );

    console.log(`✅ Model ${modelId} downloaded to ${modelPath}`);
    return modelPath;
  }

  selectModelForHardware(): string {
    // Detect hardware capabilities
    const hasGPU = this.detectGPU();
    const ram = this.getAvailableRAM();

    if (hasGPU && ram >= 8) {
      return 'phi-3-mini-4k-gpu';
    } else if (ram >= 16) {
      return 'phi-3-medium-128k-cpu';
    } else {
      return 'phi-3-mini-4k-cpu';
    }
  }
}
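getAvailableRAM, detectGPU, and downloadFromHuggingFace are stubs above. The RAM check maps directly onto Node's os module, detectGPU can reuse the Phase 2 CUDA probe, and the download sketch assumes HuggingFace's standard resolve URL layout (https://huggingface.co/{repo}/resolve/main/{file}); checksums should be verified per the Security section below:
// src/router/onnx/model-manager.ts (helper sketches; assumes Node 18+, where fetch is global)
import { totalmem } from 'os';
import { mkdir, writeFile } from 'fs/promises';
import { join, dirname } from 'path';

private getAvailableRAM(): number {
  // Total system RAM in GiB; a coarse proxy, since free memory fluctuates.
  return totalmem() / 1024 ** 3;
}

private async downloadFromHuggingFace(repo: string, files: string[], dest: string): Promise<void> {
  for (const file of files) {
    const url = `https://huggingface.co/${repo}/resolve/main/${file}`;
    const res = await fetch(url);
    if (!res.ok) throw new Error(`Download failed (${res.status}): ${url}`);

    const target = join(dest, file);
    await mkdir(dirname(target), { recursive: true });
    // NOTE: buffering multi-GB files in memory is impractical; a real
    // implementation should stream res.body to disk instead.
    await writeFile(target, Buffer.from(await res.arrayBuffer()));
  }
}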
Testing:
- Test model download from HuggingFace
- Verify caching works correctly
- Test automatic model selection
- Test with multiple models
- Verify disk space management
Phase 5: Integration & CLI (Week 5)
Objective: Integrate ONNX provider into router and add CLI commands
Tasks:
- Add ONNX provider to router initialization
- Add CLI commands for ONNX management
- Implement cost tracking (always $0 for local)
- Add performance benchmarking
- Update routing rules for ONNX
Deliverables:
- ONNX provider in multi-model router
- CLI commands for model management
- Benchmark utilities
- Updated documentation
CLI Commands:
# List available ONNX models
npx agentic-flow onnx models
# Download a model
npx agentic-flow onnx download phi-3-mini-4k-cpu
# List downloaded models
npx agentic-flow onnx list
# Test ONNX inference
npx agentic-flow onnx test --model phi-3-mini-4k-cpu
# Benchmark performance
npx agentic-flow onnx benchmark --model phi-3-mini-4k-cpu
# Check hardware capabilities
npx agentic-flow onnx info
# Use ONNX provider for inference
npx agentic-flow --provider onnx --model phi-3-mini-4k-cpu --task "Hello world"
# Use ONNX with GPU
npx agentic-flow --provider onnx --model phi-3-mini-4k-gpu --execution-provider cuda --task "Complex task"
Router Integration:
// src/router/router.ts
private initializeProviders(): void {
  // ... existing providers ...

  // Initialize ONNX
  if (this.config.providers.onnx) {
    try {
      const provider = new ONNXProvider(this.config.providers.onnx);
      this.providers.set('onnx', provider);
      console.log('✅ ONNX provider initialized');
    } catch (error) {
      console.error('❌ Failed to initialize ONNX:', error);
    }
  }
}
Configuration:
{
  "providers": {
    "onnx": {
      "models": {
        "default": "phi-3-mini-4k-cpu",
        "fast": "phi-3-mini-4k-cpu",
        "advanced": "phi-3-medium-128k-cpu",
        "gpu": "phi-3-mini-4k-gpu"
      },
      "executionProviders": ["cuda", "dml", "cpu"],
      "graphOptimizationLevel": "all",
      "enableMemoryOptimization": true,
      "threads": 4,
      "timeout": 60000
    }
  },
  "routing": {
    "rules": [
      {
        "condition": {
          "privacy": "high",
          "localOnly": true
        },
        "action": {
          "provider": "onnx",
          "model": "phi-3-mini-4k-cpu"
        },
        "reason": "Privacy-sensitive tasks use local ONNX models"
      },
      {
        "condition": {
          "agentType": ["researcher"],
          "complexity": "low"
        },
        "action": {
          "provider": "onnx",
          "model": "phi-3-mini-4k-cpu"
        },
        "reason": "Simple tasks use free local inference"
      }
    ]
  }
}
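Rules of this shape resolve to a provider with a simple first-match scan. A sketch of how the router might evaluate them (the types here are illustrative, not the router's actual interfaces):
// Hypothetical rule evaluation: first rule whose condition keys all match wins.
interface RoutingRule {
  condition: Record<string, unknown>;
  action: { provider: string; model?: string };
  reason: string;
}

function selectRoute(
  rules: RoutingRule[],
  ctx: Record<string, unknown>
): RoutingRule['action'] | null {
  for (const rule of rules) {
    // Array-valued conditions (e.g. "agentType": ["researcher"]) match by membership.
    const matches = Object.entries(rule.condition).every(([key, value]) =>
      Array.isArray(value) ? value.includes(ctx[key]) : ctx[key] === value
    );
    if (matches) return rule.action;
  }
  return null; // caller falls back to the configured default provider
}

// Example: a privacy-sensitive request routes to the local ONNX model.
// selectRoute(rules, { privacy: 'high', localOnly: true })
//   → { provider: 'onnx', model: 'phi-3-mini-4k-cpu' }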
Phase 6: Advanced Features (Week 6)
Objective: Vision support, tool calling, and production optimizations
Tasks:
- Add multimodal support (vision)
- Implement tool calling for ONNX models
- Add model fine-tuning support
- Implement distributed inference
- Production hardening
Features:
- Phi-3-vision for image understanding
- Custom tool calling layer
- Model adaptation for specific tasks
- Multi-GPU support
- Load balancing across models
📋 Configuration Examples
Basic CPU Configuration
{
  "version": "1.0",
  "defaultProvider": "onnx",
  "providers": {
    "onnx": {
      "models": {
        "default": "phi-3-mini-4k-cpu"
      },
      "executionProviders": ["cpu"],
      "threads": 4
    }
  }
}
GPU Optimized Configuration
{
  "version": "1.0",
  "defaultProvider": "onnx",
  "providers": {
    "onnx": {
      "models": {
        "default": "phi-3-mini-4k-gpu"
      },
      "executionProviders": ["cuda", "cpu"],
      "cudaOptions": {
        "deviceId": 0,
        "gpuMemLimit": 4294967296,
        "arenaExtendStrategy": "kSameAsRequested"
      }
    }
  }
}
Hybrid Cloud + Local Configuration
{
  "version": "1.0",
  "defaultProvider": "anthropic",
  "fallbackChain": ["anthropic", "openrouter", "onnx"],
  "providers": {
    "anthropic": { ... },
    "openrouter": { ... },
    "onnx": {
      "models": {
        "default": "phi-3-mini-4k-cpu"
      }
    }
  },
  "routing": {
    "mode": "rule-based",
    "rules": [
      { "condition": { "privacy": "high" }, "action": { "provider": "onnx" } },
      { "condition": { "complexity": "low" }, "action": { "provider": "onnx" } },
      { "condition": { "complexity": "high" }, "action": { "provider": "anthropic" } }
    ]
  }
}
🎯 Success Metrics
Performance Targets
| Metric | Target | Measurement |
|---|---|---|
| CPU Inference Speed | 15-20 tokens/sec | Phi-3-mini on i9-10920X |
| GPU Inference Speed | 100+ tokens/sec | Phi-3-mini on RTX 3090 |
| Time to First Token | <500ms | Streaming mode |
| Memory Usage (CPU) | <4GB | Phi-3-mini INT4 |
| Model Load Time | <10s | First request only |
Cost Savings
| Scenario | Cloud Cost | ONNX Cost | Savings |
|---|---|---|---|
| 1M tokens (research) | $3.00 | $0.00 | 100% |
| 1M tokens (coding) | $15.00 | $0.00 | 100% |
| Monthly development | $100-500 | $0.00 | 100% |
Quality Targets
| Metric | Target |
|---|---|
| Accuracy vs Cloud | >95% for simple tasks |
| Success Rate | >99% (no network failures) |
| Latency Variance | <10% (consistent) |
🔒 Security & Privacy
Benefits
- ✅ No data sent to external services
- ✅ No API keys required for local models
- ✅ Fully offline operation possible
- ✅ Supports HIPAA/GDPR-compliant deployments (no data leaves the machine)
- ✅ No usage tracking or telemetry
Considerations
- Models downloaded from HuggingFace (verify checksums)
- Model license compliance (MIT for Phi-3)
- Disk space for model storage (2-10GB per model)
🐛 Known Limitations
- Model Size: ONNX models are 2-10GB, requiring significant disk space
- Initial Download: First-time model download takes 5-30 minutes
- Hardware Requirements: GPU models require NVIDIA GPU with CUDA
- Tool Calling: Limited compared to Claude/GPT (requires custom implementation)
- Streaming: Initial implementation may have higher latency than cloud
- Context Length: the Phi-3-mini 4K variant is limited to 4K tokens (vs 200K for Claude)
📊 Benchmarking Plan
Test Suite
# CPU Benchmark
npx agentic-flow onnx benchmark \
  --model phi-3-mini-4k-cpu \
  --provider cpu \
  --iterations 100

# GPU Benchmark
npx agentic-flow onnx benchmark \
  --model phi-3-mini-4k-gpu \
  --provider cuda \
  --iterations 100

# Comparison Benchmark
npx agentic-flow router benchmark \
  --providers "onnx,anthropic,openrouter" \
  --task "Write a hello world function" \
  --iterations 50
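Internally, the benchmark command can be a thin harness over the provider's streaming interface from Phase 3. A sketch of the core measurement loop (the ChatParams fields are abbreviated):
// Hypothetical measurement loop behind `onnx benchmark`.
async function benchmarkProvider(provider: ONNXProvider, prompt: string, iterations: number) {
  let totalTokens = 0;
  let totalMs = 0;
  const firstTokenMs: number[] = [];

  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    let tokens = 0;
    for await (const chunk of provider.stream({
      messages: [{ role: 'user', content: prompt }]
    } as any)) {
      if (chunk.type === 'content_block_delta' && ++tokens === 1) {
        firstTokenMs.push(performance.now() - start); // time to first token
      }
    }
    totalTokens += tokens;
    totalMs += performance.now() - start;
  }

  console.log(`tokens/sec: ${(totalTokens / (totalMs / 1000)).toFixed(1)}`);
  console.log(`avg TTFT:   ${(firstTokenMs.reduce((a, b) => a + b, 0) / firstTokenMs.length).toFixed(0)}ms`);
}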
Benchmark Metrics
- Latency
  - Time to first token
  - Tokens per second
  - End-to-end request time
- Throughput
  - Concurrent requests
  - Batch processing
- Resource Usage
  - CPU utilization
  - Memory consumption
  - GPU memory
  - Disk I/O
- Quality
  - Response accuracy
  - Instruction following
  - Consistency
🚀 Deployment Strategy
Development Phase
- Use ONNX CPU for testing
- Small models (Phi-3-mini)
- Local development only
Staging Phase
- Test GPU acceleration
- Larger models (Phi-3-medium)
- Performance benchmarking
Production Phase
- Hybrid cloud + local routing
- GPU for high-throughput
- Fallback to cloud for complex tasks
📚 Resources
Documentation
- ONNX Runtime: https://onnxruntime.ai/
- Phi-3 Models: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx-cpu
- Transformers.js: https://huggingface.co/docs/transformers.js
Examples
- Phi-3 Chat: https://github.com/microsoft/onnxruntime-inference-examples/tree/main/js/chat
- WebGPU RAG: https://github.com/microsoft/Phi-3CookBook/tree/main/code/08.RAG/rag_webgpu_chat
Performance Guides
- CPU Optimization: https://onnxruntime.ai/docs/performance/model-optimizations/graph-optimizations.html
- GPU Providers: https://onnxruntime.ai/docs/execution-providers/
✅ Next Steps
- Immediate: Review and approve this implementation plan
- Week 1: Begin Phase 1 (Core ONNX Provider)
- Week 2: Implement Phase 2 (GPU Acceleration)
- Week 3: Complete Phase 3 (Optimization)
- Week 4: Execute Phase 4 (Model Management)
- Week 5: Finish Phase 5 (Integration)
- Week 6: Deploy Phase 6 (Advanced Features)
Status: Ready for Implementation
Estimated Timeline: 6 weeks
Estimated Effort: 120-150 hours
Risk Level: Low (proven technology, clear path)
ROI: High (100% cost savings for local inference, 2-100x performance improvement)