# ONNX Local Inference - CLI Usage Guide

## Quick Start

Run AI agents with 100% free local inference using Microsoft's Phi-4-mini model:

```bash
# Auto-downloads Phi-4-mini (~4.9GB one-time download)
npx agentic-flow --agent coder --task "Create hello world" --provider onnx
```
## Installation & Setup

### Prerequisites

- Node.js 18+
- ~5GB disk space for the Phi-4-mini model
- Internet connection for the first-time download
### Automatic Model Download

The Phi-4-mini ONNX model downloads automatically on first use:

```bash
npx agentic-flow --agent coder --task "test" --provider onnx

# Output:
# 🔍 Phi-4-mini ONNX model not found locally
# 📥 Starting automatic download...
# This is a one-time download (~4.9GB total)
# Model: microsoft/Phi-4-mini-instruct-onnx (INT4 quantized)
# Files: model.onnx (~52MB) + model.onnx.data (~4.86GB)
#
# 📦 Downloading model.onnx...
# ✅ Model downloaded successfully
#
# 📦 Downloading model.onnx.data (this is the large 4.86GB file)...
# 📥 Downloading: 10.0% (463.16/4631.59 MB)
# 📥 Downloading: 20.0% (926.32/4631.59 MB)
# ...
# ✅ Model downloaded successfully
```
## CLI Usage

### Basic Commands

```bash
# Use the ONNX provider with the --provider flag
npx agentic-flow --agent <agent> --task "<task>" --provider onnx

# Examples
npx agentic-flow --agent coder --task "Write Python hello world" --provider onnx
npx agentic-flow --agent researcher --task "Analyze AI trends" --provider onnx
npx agentic-flow --agent reviewer --task "Review code quality" --provider onnx
```
### Environment Variables

```bash
# Force ONNX for all commands
export USE_ONNX=true
npx agentic-flow --agent coder --task "Build API"

# Or set the provider explicitly
export PROVIDER=onnx
npx agentic-flow --agent coder --task "Build API"

# Enable optimizations for better quality and speed
export ONNX_OPTIMIZED=true

# Custom model path (if downloaded manually)
export ONNX_MODEL_PATH=./path/to/model.onnx

# GPU acceleration (10-50x faster)
export ONNX_EXECUTION_PROVIDERS=cuda,cpu     # NVIDIA CUDA
# export ONNX_EXECUTION_PROVIDERS=dml,cpu    # Windows DirectML
# export ONNX_EXECUTION_PROVIDERS=coreml,cpu # macOS CoreML (Metal)
```

See the full environment variable reference: ONNX_ENV_VARS.md
## Available Agents

All 75+ agents work with the ONNX provider.

Core Development:

- `coder` - Code generation
- `reviewer` - Code review
- `tester` - Test creation
- `researcher` - Research & analysis

Specialized:

- `backend-dev` - Backend APIs
- `mobile-dev` - Mobile apps
- `ml-developer` - ML models
- `cicd-engineer` - CI/CD pipelines
- `api-docs` - API documentation

See the full list: `npx agentic-flow --list`
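If you plan to script against a specific agent, you can check it is available first. A minimal sketch, assuming `--list` prints one agent name per line (the output format is not specified in this guide):

```bash
# Confirm an agent exists before batch-scripting it
# (assumption: `--list` emits agent names one per line)
npx agentic-flow --list | grep -i "backend-dev"
```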
## Performance

### CPU Performance (Intel i7)

- Speed: ~6 tokens/sec (base), ~12 tokens/sec (optimized)
- Latency: ~3s for 20 tokens, ~16s for 100 tokens
- Cost: $0.00 (free)

### GPU Performance (CUDA/DirectML/Metal)

- Speed: 60-300 tokens/sec
- Latency: ~0.08s for 20 tokens, ~0.42s for 100 tokens
- Cost: $0.00 (free)

### Optimized Performance (ONNX_OPTIMIZED=true)

- Quality: 6.5/10 → 8.5/10 (31% improvement)
- Speed: 2-4x faster with context pruning
- CPU: ~12 tokens/sec (2x faster than base)
- GPU: ~180 tokens/sec (30x faster than base CPU)
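Your numbers will vary with hardware, so it is worth measuring locally. A rough sketch using only the variables documented above; wall-clock time includes model load, so treat the result as a lower bound:

```bash
# Rough throughput check: time a fixed-size generation.
# tokens/sec ≈ ONNX_MAX_TOKENS / elapsed seconds (lower bound, since
# the elapsed time also includes model load and prompt processing).
export PROVIDER=onnx
export ONNX_MAX_TOKENS=100
time npx agentic-flow --agent coder --task "Write a Python function that reverses a string"
```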
See GPU setup in ONNX_INTEGRATION.md and the optimization guide in ONNX_OPTIMIZATION_GUIDE.md.
## Use Cases

### ✅ Perfect For

**Offline Development**

```bash
# Work without internet (after the initial download)
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
npx agentic-flow --agent coder --task "Build feature"
```

**Privacy-Sensitive Data**

```bash
# Process PII/HIPAA data locally
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
npx agentic-flow --agent coder --task "Process medical records"
```

**Cost Optimization**

```bash
# Free inference for simple tasks with better quality
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3  # Lower temperature for code tasks

for task in task1 task2 task3; do
  npx agentic-flow --agent coder --task "$task"
done
```

**High-Volume Simple Tasks**

```bash
# Thousands of generations daily at $0 cost
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_MAX_CONTEXT_TOKENS=1000  # Smaller context = faster

cat tasks.txt | while read -r task; do
  npx agentic-flow --agent coder --task "$task"
done
```

**GPU-Accelerated Development**

```bash
# 30x faster with GPU (~180 tokens/sec)
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_EXECUTION_PROVIDERS=cuda,cpu  # or dml,cpu / coreml,cpu
npx agentic-flow --agent coder --task "Complex feature"
```
### ❌ Not Ideal For

- **Complex Reasoning** - Use Claude or DeepSeek via OpenRouter
- **Tool Calling** - The ONNX provider doesn't support MCP tools (use Anthropic/OpenRouter)
- **Long Context** - Limited to 4K tokens (use cloud models for more)
- **Streaming** - Not implemented yet (use OpenRouter/Anthropic)
## Hybrid Deployments

Mix ONNX with OpenRouter/Anthropic based on task complexity:

### Scenario 1: Simple Local, Complex Cloud

```bash
# Simple tasks - free ONNX
npx agentic-flow --agent coder --task "Hello world" --provider onnx

# Complex tasks - cheap OpenRouter
npx agentic-flow --agent coder --task "Design distributed system" \
  --model "deepseek/deepseek-chat-v3.1"
```
### Scenario 2: Privacy-First with Fallback

```bash
# Privacy-sensitive - ONNX
export USE_ONNX=true
npx agentic-flow --agent coder --task "Process PII"

# Non-sensitive - OpenRouter (cheaper)
unset USE_ONNX
export OPENROUTER_API_KEY=sk-or-v1-...
npx agentic-flow --agent coder --task "Public API"
```
## Troubleshooting

### Model Download Failed

```bash
# Check internet connection
curl -I https://huggingface.co

# Retry download
rm -rf ./models/phi-4-mini
npx agentic-flow --agent coder --task "test" --provider onnx
```
### Slow Inference (~6 tokens/sec)

Enable GPU acceleration - see the GPU setup guide in ONNX_INTEGRATION.md.
### Out of Memory

```bash
# Reduce max output tokens
export ONNX_MAX_TOKENS=50
npx agentic-flow --agent coder --task "small task" --provider onnx
```
### Model Not Found Error

```bash
# Ensure the model downloaded completely
ls -lh ./models/phi-4-mini/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/

# Should show:
# model.onnx (~52MB)
# model.onnx.data (~4.86GB)
```
## Cost Comparison

### 1,000 Code Generation Tasks
| Provider | Model | Cost |
|---|---|---|
| ONNX Local | Phi-4-mini | $0.00 |
| OpenRouter | Llama 3.1 8B | $0.30 |
| OpenRouter | DeepSeek V3.1 | $1.40 |
| Anthropic | Claude 3.5 Sonnet | $81.00 |
Monthly savings (at 1,000 tasks/month): $81 vs Claude, $1.40 vs DeepSeek.
### Electricity Cost

Assuming a 100W CPU running 1 hour/day at $0.12/kWh:

- Daily: $0.012
- Monthly: $0.36
- Annual: $4.32

By the table above, a full year of electricity costs about as much as ~50 Claude requests (at $81 per 1,000 tasks).
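The arithmetic, for plugging in your own wattage and rates (a sketch; the 360-day year matches the monthly figure above):

```bash
# Electricity cost calculator (inputs: watts, hours/day, $/kWh)
watts=100 hours_per_day=1 rate=0.12
awk -v w="$watts" -v h="$hours_per_day" -v r="$rate" 'BEGIN {
  daily = (w / 1000) * h * r   # kWh per day times $/kWh
  printf "daily=$%.3f monthly=$%.2f annual=$%.2f\n", daily, daily * 30, daily * 360
}'
# daily=$0.012 monthly=$0.36 annual=$4.32
```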
## Model Details

### Phi-4-mini-instruct-onnx

- Source: microsoft/Phi-4-mini-instruct-onnx (HuggingFace)
- Architecture: Phi-4-mini (3.8B parameters)
- Quantization: INT4 (4-bit weights)
- Size: 4.9GB (52MB graph + 4.86GB weights)
- Optimization: CPU and mobile optimized
- Context: 4K tokens
- License: MIT (see the HuggingFace model card)
## Advanced Configuration

### Custom Model Path

```bash
export ONNX_MODEL_PATH=/custom/path/to/model.onnx
npx agentic-flow --agent coder --task "test" --provider onnx
```

### Execution Providers

```bash
# CPU only (default)
export ONNX_EXECUTION_PROVIDERS=cpu

# GPU acceleration (NVIDIA CUDA)
export ONNX_EXECUTION_PROVIDERS=cuda,cpu

# GPU acceleration (Windows DirectML)
export ONNX_EXECUTION_PROVIDERS=dml,cpu

# GPU acceleration (macOS CoreML/Metal)
export ONNX_EXECUTION_PROVIDERS=coreml,cpu
```
### Generation Parameters

```bash
# Max output tokens
export ONNX_MAX_TOKENS=100

# Temperature (0.0 = deterministic, 1.0 = creative)
export ONNX_TEMPERATURE=0.7
```
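For example, a deterministic, bounded-output configuration suits repeatable review passes. A sketch using only the variables documented above (the task text is illustrative):

```bash
# Deterministic, bounded output (temperature 0.0 = greedy decoding,
# so the same prompt yields the same completion every run)
export PROVIDER=onnx
export ONNX_TEMPERATURE=0.0
export ONNX_MAX_TOKENS=200
npx agentic-flow --agent reviewer \
  --task "Review this function for bugs: def add(a, b): return a - b"
```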
## Security & Privacy

### Data Privacy
- ✅ 100% Local Processing - No data leaves your machine
- ✅ No API Calls - Zero external requests
- ✅ No Telemetry - No usage tracking
- ✅ GDPR Compliant - No data transmission
- ✅ HIPAA Suitable - Process sensitive health data locally
### Model Security

- ✅ Official Source - Downloaded from Microsoft's HuggingFace repository
- ✅ SHA256 Verification - Optional integrity checks (a manual check is sketched below)
- ✅ Read-Only - Model not modified after download
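To verify the download by hand, compare local hashes against the checksums published on the model's HuggingFace file pages (this guide does not reproduce them), using the download path shown under Troubleshooting:

```bash
# Manual integrity check for the downloaded model files.
# Compare the printed hashes with the per-file values shown at
# https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx
cd ./models/phi-4-mini/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/
sha256sum model.onnx model.onnx.data
```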
## Next Steps

- Enable GPU Acceleration: GPU Setup Guide
- Explore All Agents: `npx agentic-flow --list`
- Hybrid Deployments: Router Configuration
- Advanced Features: Full ONNX Guide
## Support

- Documentation: ONNX_INTEGRATION.md
- Issues: https://github.com/ruvnet/agentic-flow/issues
- Model: https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx
- ONNX Runtime: https://onnxruntime.ai
Run AI agents for free. Zero API costs. Complete privacy. Works offline.