tasq/node_modules/agentic-flow/docs/guides/ONNX_CLI_USAGE.md

8.9 KiB

ONNX Local Inference - CLI Usage Guide

Quick Start

Run AI agents with 100% free local inference using Microsoft's Phi-4 model:

# Auto-downloads Phi-4 (~4.9GB one-time download)
npx agentic-flow --agent coder --task "Create hello world" --provider onnx

Installation & Setup

Prerequisites

  • Node.js 18+
  • ~5GB disk space for Phi-4 model
  • Internet connection for first-time download

Automatic Model Download

The Phi-4-mini ONNX model downloads automatically on first use:

npx agentic-flow --agent coder --task "test" --provider onnx

# Output:
# 🔍 Phi-4-mini ONNX model not found locally
# 📥 Starting automatic download...
#    This is a one-time download (~4.9GB total)
#    Model: microsoft/Phi-4-mini-instruct-onnx (INT4 quantized)
#    Files: model.onnx (~52MB) + model.onnx.data (~4.86GB)
#
# 📦 Downloading model.onnx...
# ✅ Model downloaded successfully
#
# 📦 Downloading model.onnx.data (this is the large 4.86GB file)...
#    📥 Downloading: 10.0% (463.16/4631.59 MB)
#    📥 Downloading: 20.0% (926.32/4631.59 MB)
#    ...
# ✅ Model downloaded successfully

CLI Usage

Basic Commands

# Use ONNX provider with --provider flag
npx agentic-flow --agent <agent> --task "<task>" --provider onnx

# Examples
npx agentic-flow --agent coder --task "Write Python hello world" --provider onnx
npx agentic-flow --agent researcher --task "Analyze AI trends" --provider onnx
npx agentic-flow --agent reviewer --task "Review code quality" --provider onnx

Environment Variables

# Force ONNX for all commands
export USE_ONNX=true
npx agentic-flow --agent coder --task "Build API"

# Or set provider explicitly
export PROVIDER=onnx
npx agentic-flow --agent coder --task "Build API"

# Enable optimizations for better quality and speed
export ONNX_OPTIMIZED=true

# Custom model path (if downloaded manually)
export ONNX_MODEL_PATH=./path/to/model.onnx

# GPU acceleration (10-50x faster!)
export ONNX_EXECUTION_PROVIDERS=cuda,cpu  # NVIDIA
# export ONNX_EXECUTION_PROVIDERS=dml,cpu   # Windows DirectML
# export ONNX_EXECUTION_PROVIDERS=coreml,cpu # macOS Metal

See full environment variable reference: ONNX_ENV_VARS.md

Available Agents

All 75+ agents work with ONNX provider:

Core Development:

  • coder - Code generation
  • reviewer - Code review
  • tester - Test creation
  • researcher - Research & analysis

Specialized:

  • backend-dev - Backend APIs
  • mobile-dev - Mobile apps
  • ml-developer - ML models
  • cicd-engineer - CI/CD pipelines
  • api-docs - API documentation

See full list: npx agentic-flow --list

Performance

CPU Performance (Intel i7)

  • Speed: ~6 tokens/second (base), ~12 tokens/sec (optimized)
  • Latency: ~3s for 20 tokens, ~16s for 100 tokens
  • Cost: $0.00 (free)

GPU Performance (with CUDA/DirectML/Metal)

  • Speed: 60-300 tokens/second
  • Latency: ~0.08s for 20 tokens, ~0.42s for 100 tokens
  • Cost: $0.00 (free)

Optimized Performance (ONNX_OPTIMIZED=true)

  • Quality: 6.5/10 → 8.5/10 (31% improvement)
  • Speed: 2-4x faster with context pruning
  • CPU: ~12 tokens/sec (2x faster than base)
  • GPU: ~180 tokens/sec (30x faster than base CPU)

See GPU setup in ONNX_INTEGRATION.md See optimization guide in ONNX_OPTIMIZATION_GUIDE.md

Use Cases

Perfect For

  1. Offline Development

    # Work without internet (after initial download)
    export PROVIDER=onnx
    export ONNX_OPTIMIZED=true
    npx agentic-flow --agent coder --task "Build feature"
    
  2. Privacy-Sensitive Data

    # Process PII/HIPAA data locally
    export PROVIDER=onnx
    export ONNX_OPTIMIZED=true
    npx agentic-flow --agent coder --task "Process medical records"
    
  3. Cost Optimization

    # Free inference for simple tasks with better quality
    export PROVIDER=onnx
    export ONNX_OPTIMIZED=true
    export ONNX_TEMPERATURE=0.3  # Lower for code tasks
    
    for task in task1 task2 task3; do
      npx agentic-flow --agent coder --task "$task"
    done
    
  4. High-Volume Simple Tasks

    # Thousands of generations daily at $0 cost
    export PROVIDER=onnx
    export ONNX_OPTIMIZED=true
    export ONNX_MAX_CONTEXT_TOKENS=1000  # Faster
    
    cat tasks.txt | while read task; do
      npx agentic-flow --agent coder --task "$task"
    done
    
  5. GPU-Accelerated Development

    # 30x faster with GPU (180 tokens/sec)
    export PROVIDER=onnx
    export ONNX_OPTIMIZED=true
    export ONNX_EXECUTION_PROVIDERS=cuda,cpu  # or dml, coreml
    
    npx agentic-flow --agent coder --task "Complex feature"
    

Not Ideal For

  • Complex Reasoning - Use Claude or DeepSeek via OpenRouter
  • Tool Calling - ONNX doesn't support MCP tools (use Anthropic/OpenRouter)
  • Long Context - Limited to 4K tokens (use cloud models for >4K)
  • Streaming - Not implemented yet (use OpenRouter/Anthropic)

Hybrid Deployments

Mix ONNX with OpenRouter/Anthropic based on task complexity:

Scenario 1: Simple Local, Complex Cloud

# Simple tasks - free ONNX
npx agentic-flow --agent coder --task "Hello world" --provider onnx

# Complex tasks - cheap OpenRouter
npx agentic-flow --agent coder --task "Design distributed system" \
  --model "deepseek/deepseek-chat-v3.1"

Scenario 2: Privacy-First with Fallback

# Privacy-sensitive - ONNX
export USE_ONNX=true
npx agentic-flow --agent coder --task "Process PII"

# Non-sensitive - OpenRouter (cheaper)
unset USE_ONNX
export OPENROUTER_API_KEY=sk-or-v1-...
npx agentic-flow --agent coder --task "Public API"

Troubleshooting

Model Download Failed

# Check internet connection
curl -I https://huggingface.co

# Retry download
rm -rf ./models/phi-4-mini
npx agentic-flow --agent coder --task "test" --provider onnx

Slow Inference (6 tokens/sec)

Enable GPU acceleration - see GPU Setup Guide

Out of Memory

# Reduce max tokens
export ONNX_MAX_TOKENS=50
npx agentic-flow --agent coder --task "small task" --provider onnx

Model Not Found Error

# Ensure model downloaded completely
ls -lh ./models/phi-4-mini/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/
# Should show:
# model.onnx (~52MB)
# model.onnx.data (~4.86GB)

Cost Comparison

1,000 Code Generation Tasks

Provider Model Cost
ONNX Local Phi-4-mini $0.00
OpenRouter Llama 3.1 8B $0.30
OpenRouter DeepSeek V3.1 $1.40
Anthropic Claude 3.5 Sonnet $81.00

Monthly Savings: $81/month vs Claude, $1.40/month vs DeepSeek

Electricity Cost

Assuming 100W CPU, 1hr/day, $0.12/kWh:

  • Daily: $0.012
  • Monthly: $0.36
  • Annual: $4.32

Still cheaper than 5 OpenRouter requests!

Model Details

Phi-4-mini-instruct-onnx

  • Source: microsoft/Phi-4-mini-instruct-onnx (HuggingFace)
  • Architecture: Phi-4 (14B parameters)
  • Quantization: INT4 (4-bit integers)
  • Size: 4.9GB (52MB model + 4.86GB weights)
  • Optimization: CPU and mobile optimized
  • Context: 4K tokens
  • License: Microsoft Research License

Advanced Configuration

Custom Model Path

export ONNX_MODEL_PATH=/custom/path/to/model.onnx
npx agentic-flow --agent coder --task "test" --provider onnx

Execution Providers

# CPU only (default)
export ONNX_EXECUTION_PROVIDERS=cpu

# GPU acceleration (NVIDIA)
export ONNX_EXECUTION_PROVIDERS=cuda,cpu

# GPU acceleration (Windows DirectML)
export ONNX_EXECUTION_PROVIDERS=dml,cpu

# GPU acceleration (macOS Metal)
export ONNX_EXECUTION_PROVIDERS=coreml,cpu

Generation Parameters

# Max output tokens
export ONNX_MAX_TOKENS=100

# Temperature (0.0 = deterministic, 1.0 = creative)
export ONNX_TEMPERATURE=0.7

Security & Privacy

Data Privacy

  • 100% Local Processing - No data leaves your machine
  • No API Calls - Zero external requests
  • No Telemetry - No usage tracking
  • GDPR Compliant - No data transmission
  • HIPAA Suitable - Process sensitive health data locally

Model Security

  • Official Source - Downloaded from Microsoft HuggingFace
  • SHA256 Verification - Optional integrity checks
  • Read-Only - Model not modified after download

Next Steps

  1. Enable GPU Acceleration: GPU Setup Guide
  2. Explore All Agents: npx agentic-flow --list
  3. Hybrid Deployments: Router Configuration
  4. Advanced Features: Full ONNX Guide

Support


Run AI agents for free. Zero API costs. Complete privacy. Works offline.