# ONNX Environment Variables Reference

Complete guide to configuring ONNX local inference via environment variables.
## Quick Start

```bash
# Enable ONNX with all optimizations
export PROVIDER=onnx
export ONNX_OPTIMIZED=true

# Run your agent
npx agentic-flow --agent coder --task "Build feature"
```
## Provider Selection

### PROVIDER

- **Values:** `anthropic` | `openrouter` | `onnx`
- **Default:** `anthropic`
- **Description:** Sets the AI provider for all CLI commands.

```bash
# Use ONNX for all commands
export PROVIDER=onnx
npx agentic-flow --agent coder --task "test"

# Use OpenRouter
export PROVIDER=openrouter
npx agentic-flow --agent coder --task "test"
```

### USE_ONNX

- **Values:** `true` | `false`
- **Default:** `false`
- **Description:** Forces the ONNX provider (legacy; prefer `PROVIDER=onnx`).

```bash
export USE_ONNX=true
npx agentic-flow --agent coder --task "test"
```
## Model Configuration

### ONNX_MODEL_PATH

- **Values:** File path
- **Default:** `./models/phi-4-mini/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/model.onnx`
- **Description:** Custom path to the ONNX model file.

```bash
# Use a custom model location
export ONNX_MODEL_PATH=/mnt/models/custom-model.onnx
```
### ONNX_EXECUTION_PROVIDERS

- **Values:** Comma-separated list: `cpu`, `cuda`, `dml`, `coreml`
- **Default:** `cpu`
- **Description:** Execution providers for inference (affects speed dramatically).

```bash
# CPU only (default, slowest)
export ONNX_EXECUTION_PROVIDERS=cpu

# NVIDIA GPU acceleration (10-50x faster)
export ONNX_EXECUTION_PROVIDERS=cuda,cpu

# Windows DirectML GPU (5-15x faster)
export ONNX_EXECUTION_PROVIDERS=dml,cpu

# macOS Apple Silicon (7-20x faster)
export ONNX_EXECUTION_PROVIDERS=coreml,cpu
```

**Performance impact:**

- `cpu`: ~6 tokens/sec
- `cuda`: ~60-300 tokens/sec (10-50x faster)
- `dml`: ~30-100 tokens/sec (5-15x faster)
- `coreml`: ~40-120 tokens/sec (7-20x faster)
## Generation Parameters

### ONNX_MAX_TOKENS

- **Values:** Integer (1-4096)
- **Default:** `200`
- **Description:** Maximum tokens to generate in a response.

```bash
# Short responses (faster)
export ONNX_MAX_TOKENS=100

# Long responses
export ONNX_MAX_TOKENS=500
```

**Tip:** Keep this under 300 for best speed. Context + output must stay under 4K tokens total.
### ONNX_TEMPERATURE

- **Values:** Float (0.0-2.0)
- **Default:** `0.7` (base), `0.3` (optimized)
- **Description:** Controls output randomness/creativity.

```bash
# Deterministic code (recommended for code generation)
export ONNX_TEMPERATURE=0.2

# Balanced
export ONNX_TEMPERATURE=0.7

# Creative writing
export ONNX_TEMPERATURE=0.9
```

**Recommended settings:**

| Task Type | Temperature | Why |
|---|---|---|
| Code generation | 0.2-0.4 | Consistent syntax |
| Refactoring | 0.3-0.5 | Some creativity, but safe |
| Documentation | 0.5-0.7 | Clear but varied |
| Brainstorming | 0.7-0.9 | Diverse ideas |
| Math/Logic | 0.1-0.2 | Precise |
### ONNX_TOP_K

- **Values:** Integer (1-100)
- **Default:** `50`
- **Description:** Consider only the top K most likely tokens when sampling.

```bash
# More focused (deterministic)
export ONNX_TOP_K=20

# More diverse
export ONNX_TOP_K=80
```
### ONNX_TOP_P

- **Values:** Float (0.0-1.0)
- **Default:** `0.9`
- **Description:** Nucleus sampling threshold (probability mass).

```bash
# Very focused
export ONNX_TOP_P=0.7

# Balanced
export ONNX_TOP_P=0.9

# Diverse
export ONNX_TOP_P=0.95
```
### ONNX_REPETITION_PENALTY

- **Values:** Float (1.0-2.0)
- **Default:** `1.1`
- **Description:** Penalty for token repetition.

```bash
# No penalty (may repeat)
export ONNX_REPETITION_PENALTY=1.0

# Mild penalty (recommended)
export ONNX_REPETITION_PENALTY=1.1

# Strong penalty (more diverse but may lose coherence)
export ONNX_REPETITION_PENALTY=1.5
```
## Optimization Features

### ONNX_OPTIMIZED

- **Values:** `true` | `false`
- **Default:** `false`
- **Description:** Enables the optimized ONNX provider with context pruning and prompt enhancement.

```bash
# Enable all optimizations (recommended)
export ONNX_OPTIMIZED=true

# Use the base provider
export ONNX_OPTIMIZED=false
```

**Benefits when enabled:**

- 30-50% quality improvement via prompt optimization
- 2-4x speed improvement via context pruning
- Automatic sliding-window context management
### ONNX_MAX_CONTEXT_TOKENS

- **Values:** Integer (500-4000)
- **Default:** `2048`
- **Description:** Maximum context tokens (used when `ONNX_OPTIMIZED=true`).

```bash
# Smaller context (faster, less history)
export ONNX_MAX_CONTEXT_TOKENS=1000

# Larger context (slower, more history)
export ONNX_MAX_CONTEXT_TOKENS=3000
```

**Warning:** Total (context + output) must stay under 4096 tokens (the Phi-4 limit).
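That budget can be checked up front with a small shell guard before launching the agent. This is a sketch, not something agentic-flow runs for you: the variable names match the settings above, and the check itself is plain arithmetic.

```bash
# Guard: context + output must fit in Phi-4's 4096-token window.
# Fall back to the documented defaults when the variables are unset.
ONNX_MAX_CONTEXT_TOKENS=${ONNX_MAX_CONTEXT_TOKENS:-2048}
ONNX_MAX_TOKENS=${ONNX_MAX_TOKENS:-200}

total=$((ONNX_MAX_CONTEXT_TOKENS + ONNX_MAX_TOKENS))
if [ "$total" -gt 4096 ]; then
  echo "warning: context + output = $total exceeds the 4096-token limit" >&2
else
  echo "token budget OK: $total / 4096"
fi
```

Dropping this into a launch script catches misconfigurations (e.g. `3000 + 1500`) before the model silently truncates.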
### ONNX_SLIDING_WINDOW

- **Values:** `true` | `false`
- **Default:** `true` (when `ONNX_OPTIMIZED=true`)
- **Description:** Enables sliding-window context pruning.

```bash
# Enable context pruning (recommended for speed)
export ONNX_SLIDING_WINDOW=true

# Disable (keep all context)
export ONNX_SLIDING_WINDOW=false
```

**Performance:** 2-4x faster inference by keeping only recent messages.
### ONNX_PROMPT_OPTIMIZATION

- **Values:** `true` | `false`
- **Default:** `true` (when `ONNX_OPTIMIZED=true`)
- **Description:** Auto-enhances prompts for better quality.

```bash
# Enable prompt optimization (recommended for quality)
export ONNX_PROMPT_OPTIMIZATION=true

# Disable
export ONNX_PROMPT_OPTIMIZATION=false
```

**Quality:** 30-50% improvement by adding quality guidelines to code tasks.
### ONNX_CACHE_SYSTEM_PROMPTS

- **Values:** `true` | `false`
- **Default:** `true` (when `ONNX_OPTIMIZED=true`)
- **Description:** Caches processed system prompts for reuse.

```bash
# Enable caching (faster repeated tasks)
export ONNX_CACHE_SYSTEM_PROMPTS=true

# Disable
export ONNX_CACHE_SYSTEM_PROMPTS=false
```

**Speed:** 30-40% faster on repeated prompts.
## Preset Configurations

### Maximum Speed

```bash
export PROVIDER=onnx
export ONNX_EXECUTION_PROVIDERS=cuda,cpu  # or dml/coreml
export ONNX_OPTIMIZED=true
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_MAX_TOKENS=100
export ONNX_SLIDING_WINDOW=true
export ONNX_CACHE_SYSTEM_PROMPTS=true
```

**Result:** 180+ tokens/sec (with GPU), minimal latency.
### Maximum Quality

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3
export ONNX_TOP_P=0.9
export ONNX_TOP_K=50
export ONNX_REPETITION_PENALTY=1.1
export ONNX_PROMPT_OPTIMIZATION=true
export ONNX_MAX_TOKENS=300
```

**Result:** 8.5/10 quality for code tasks.
### Balanced

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_EXECUTION_PROVIDERS=cpu  # or a GPU provider, e.g. cuda,cpu
export ONNX_TEMPERATURE=0.3
export ONNX_MAX_TOKENS=200
export ONNX_MAX_CONTEXT_TOKENS=1500
```

**Result:** Good quality/speed tradeoff.
### CPU Only (No GPU)

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_EXECUTION_PROVIDERS=cpu
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_MAX_TOKENS=150
export ONNX_TEMPERATURE=0.3
export ONNX_SLIDING_WINDOW=true
```

**Result:** Best CPU performance (still ~12 tokens/sec).
## Use Case Configurations

### Code Generation

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3  # Deterministic
export ONNX_TOP_P=0.9
export ONNX_PROMPT_OPTIMIZATION=true
export ONNX_MAX_TOKENS=250
```

### Code Review

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.4
export ONNX_MAX_TOKENS=300
export ONNX_MAX_CONTEXT_TOKENS=2000  # Needs more context
```

### Documentation

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.6  # More creative
export ONNX_TOP_P=0.95
export ONNX_MAX_TOKENS=400
```

### Refactoring

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.35
export ONNX_MAX_TOKENS=200
export ONNX_SLIDING_WINDOW=true
```
## Performance Tuning

### Scenario 1: Too Slow (6 tokens/sec)

**Problem:** CPU-only inference is slow.

**Solutions:**

- Enable GPU acceleration (biggest impact)
- Reduce context size
- Enable sliding window
- Reduce max tokens

```bash
# Quick wins (no hardware change)
export ONNX_MAX_CONTEXT_TOKENS=1000  # 2x faster
export ONNX_SLIDING_WINDOW=true
export ONNX_MAX_TOKENS=100

# Best solution (requires GPU)
export ONNX_EXECUTION_PROVIDERS=cuda,cpu  # 30x faster!
```
### Scenario 2: Low Quality Output

**Problem:** Generated code has bugs or missing features.

**Solutions:**

- Enable optimizations
- Lower the temperature
- Use specific prompts
- Enable prompt optimization

```bash
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3
export ONNX_PROMPT_OPTIMIZATION=true
export ONNX_TOP_K=50
export ONNX_TOP_P=0.9
```
### Scenario 3: Out of Memory

**Problem:** The system runs out of RAM.

**Solutions:**

- Reduce context size
- Reduce max tokens
- Close other applications

```bash
export ONNX_MAX_CONTEXT_TOKENS=800
export ONNX_MAX_TOKENS=100
export ONNX_SLIDING_WINDOW=true
```
### Scenario 4: Repetitive Output

**Problem:** The model repeats the same phrases.

**Solutions:**

- Increase the repetition penalty
- Adjust the temperature
- Change `top_p`/`top_k`

```bash
export ONNX_REPETITION_PENALTY=1.2
export ONNX_TEMPERATURE=0.4
export ONNX_TOP_P=0.85
```
## Debug and Logging

### DEBUG

- **Values:** `true` | `false`
- **Default:** `false`
- **Description:** Enables detailed logging.

```bash
export DEBUG=true
npx agentic-flow --agent coder --task "test"
```

### ONNX_LOG_PERFORMANCE

- **Values:** `true` | `false`
- **Default:** `false`
- **Description:** Logs performance metrics.

```bash
export ONNX_LOG_PERFORMANCE=true
# Outputs: tokens/sec, latency, context size, etc.
```
## Example Workflows

### Daily Development (Local, Free)

```bash
# .env file
PROVIDER=onnx
ONNX_OPTIMIZED=true
ONNX_TEMPERATURE=0.3
ONNX_MAX_TOKENS=200
ONNX_EXECUTION_PROVIDERS=cpu

# Usage
npx agentic-flow --agent coder --task "Build feature"
```

### CI/CD Pipeline (Fast, Local)

```bash
# CI environment variables
PROVIDER=onnx
ONNX_OPTIMIZED=true
ONNX_MAX_CONTEXT_TOKENS=800
ONNX_MAX_TOKENS=100
ONNX_TEMPERATURE=0.2
```
### Hybrid: ONNX + Cloud Fallback

```bash
# Try ONNX first (80% of tasks)
export PROVIDER=onnx
export ONNX_OPTIMIZED=true

# For complex tasks, switch to cloud
unset PROVIDER
export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENROUTER_API_KEY=sk-or-v1-...
```
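The manual switch above can be wrapped in a small shell helper. The `pick_provider` function below is a hypothetical sketch (not part of agentic-flow): it just prints the environment it would use, so you can prefix the real `npx agentic-flow` command with its output or adapt it into your own wrapper.

```bash
# Hypothetical helper: choose the provider per task, defaulting to
# local ONNX and escalating to the cloud only on request.
pick_provider() {
  case "$1" in
    --cloud) echo "PROVIDER=anthropic" ;;
    *)       echo "PROVIDER=onnx ONNX_OPTIMIZED=true" ;;
  esac
}

# pick_provider          # -> PROVIDER=onnx ONNX_OPTIMIZED=true
# pick_provider --cloud  # -> PROVIDER=anthropic
```

Keeping the routing decision in one function makes it easy to change the escalation rule later (for example, keying off task length instead of a flag).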
## Best Practices

1. **Always enable optimizations**

   ```bash
   export ONNX_OPTIMIZED=true
   ```

2. **Lower the temperature for code**

   ```bash
   export ONNX_TEMPERATURE=0.3
   ```

3. **Enable GPU if available** (30x faster!)

   ```bash
   export ONNX_EXECUTION_PROVIDERS=cuda,cpu
   ```

4. **Keep context under 2K tokens** (2-4x faster)

   ```bash
   export ONNX_MAX_CONTEXT_TOKENS=1500
   ```

5. **Use a `.env` file for consistency**

   ```bash
   # Create a .env file in the project root
   echo "PROVIDER=onnx" >> .env
   echo "ONNX_OPTIMIZED=true" >> .env
   echo "ONNX_TEMPERATURE=0.3" >> .env
   ```
## Troubleshooting

### Error: Model not found

```bash
# Check the model path
ls -lh ./models/phi-4-mini/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/

# Re-download if missing
rm -rf ./models/phi-4-mini
npx agentic-flow --agent coder --task "test" --provider onnx
```
### Error: CUDA not available

```bash
# Check the CUDA installation
nvidia-smi

# Fall back to CPU
export ONNX_EXECUTION_PROVIDERS=cpu
```
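The fallback can also be automated in a launch script: probe for `nvidia-smi` and only request CUDA when it actually works. A sketch, assuming a working `nvidia-smi` implies a usable CUDA setup (agentic-flow does not do this probing for you):

```bash
# Pick the execution provider based on what the machine actually has.
if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
  export ONNX_EXECUTION_PROVIDERS=cuda,cpu
else
  export ONNX_EXECUTION_PROVIDERS=cpu
fi
echo "using execution providers: $ONNX_EXECUTION_PROVIDERS"
```

Because `cpu` is always listed last, the same script works on GPU and CPU-only machines without edits.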
### Slow inference (< 10 tok/s)

```bash
# Enable optimizations
export ONNX_OPTIMIZED=true
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_SLIDING_WINDOW=true

# Best: enable GPU
export ONNX_EXECUTION_PROVIDERS=cuda,cpu
```
**Remember:** ONNX is free and runs locally. Optimize first, then decide if you need cloud providers for complex tasks.