# ONNX Environment Variables Reference

Complete guide to configuring ONNX local inference via environment variables.

## Quick Start

```bash
# Enable ONNX with all optimizations
export PROVIDER=onnx
export ONNX_OPTIMIZED=true

# Run your agent
npx agentic-flow --agent coder --task "Build feature"
```

---

## Provider Selection

### `PROVIDER`

**Values:** `anthropic` | `openrouter` | `onnx`
**Default:** `anthropic`
**Description:** Set the AI provider for all CLI commands

```bash
# Use ONNX for all commands
export PROVIDER=onnx
npx agentic-flow --agent coder --task "test"

# Use OpenRouter
export PROVIDER=openrouter
npx agentic-flow --agent coder --task "test"
```

### `USE_ONNX`

**Values:** `true` | `false`
**Default:** `false`
**Description:** Force ONNX provider (legacy; use `PROVIDER=onnx` instead)

```bash
export USE_ONNX=true
npx agentic-flow --agent coder --task "test"
```

---

## Model Configuration

### `ONNX_MODEL_PATH`

**Values:** File path
**Default:** `./models/phi-4-mini/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/model.onnx`
**Description:** Custom path to the ONNX model file

```bash
# Use custom model location
export ONNX_MODEL_PATH=/mnt/models/custom-model.onnx
```

### `ONNX_EXECUTION_PROVIDERS`

**Values:** Comma-separated list: `cpu`, `cuda`, `dml`, `coreml`
**Default:** `cpu`
**Description:** Execution providers for inference (affects speed dramatically)

```bash
# CPU only (default, slowest)
export ONNX_EXECUTION_PROVIDERS=cpu

# NVIDIA GPU acceleration (10-50x faster)
export ONNX_EXECUTION_PROVIDERS=cuda,cpu

# Windows DirectML GPU (5-15x faster)
export ONNX_EXECUTION_PROVIDERS=dml,cpu

# macOS Apple Silicon (7-20x faster)
export ONNX_EXECUTION_PROVIDERS=coreml,cpu
```

**Performance Impact:**
- `cpu`: ~6 tokens/sec
- `cuda`: ~60-300 tokens/sec (10-50x faster)
- `dml`: ~30-100 tokens/sec (5-15x faster)
- `coreml`: ~40-120 tokens/sec (7-20x faster)

---

## Generation Parameters

### `ONNX_MAX_TOKENS`

**Values:** Integer (1-4096)
**Default:** `200`
**Description:** Maximum tokens to generate in the response

```bash
# Short responses (faster)
export ONNX_MAX_TOKENS=100

# Long responses
export ONNX_MAX_TOKENS=500
```

**Tip:** Keep under 300 for best speed. Context + output must stay under 4K tokens total.
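
To make the 4K budget concrete, here is a small TypeScript sketch of how a provider could clamp the output budget to fit the window. `clampMaxTokens` is a hypothetical helper for illustration, not part of agentic-flow:

```typescript
// Hypothetical helper: clamp the generation budget so that
// prompt + output stays inside Phi-4-mini's 4096-token window.
const PHI4_CONTEXT_LIMIT = 4096;

function clampMaxTokens(contextTokens: number, requestedMaxTokens: number): number {
  const available = PHI4_CONTEXT_LIMIT - contextTokens;
  return Math.max(0, Math.min(requestedMaxTokens, available));
}

clampMaxTokens(2048, 300); // => 300 (2048 + 300 fits under 4096)
clampMaxTokens(3900, 300); // => 196 (output budget shrinks to fit)
```
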
### `ONNX_TEMPERATURE`

**Values:** Float (0.0-2.0)
**Default:** `0.7` (base), `0.3` (optimized)
**Description:** Controls output randomness/creativity

```bash
# Deterministic code (recommended for code generation)
export ONNX_TEMPERATURE=0.2

# Balanced
export ONNX_TEMPERATURE=0.7

# Creative writing
export ONNX_TEMPERATURE=0.9
```

**Recommended Settings:**

| Task Type | Temperature | Why |
|-----------|-------------|-----|
| Code generation | 0.2-0.4 | Consistent syntax |
| Refactoring | 0.3-0.5 | Some creativity, but safe |
| Documentation | 0.5-0.7 | Clear but varied |
| Brainstorming | 0.7-0.9 | Diverse ideas |
| Math/Logic | 0.1-0.2 | Precise |

### `ONNX_TOP_K`

**Values:** Integer (1-100)
**Default:** `50`
**Description:** Restricts sampling to the K most probable tokens

```bash
# More focused (deterministic)
export ONNX_TOP_K=20

# More diverse
export ONNX_TOP_K=80
```

### `ONNX_TOP_P`

**Values:** Float (0.0-1.0)
**Default:** `0.9`
**Description:** Nucleus sampling threshold (cumulative probability mass to keep)

```bash
# Very focused
export ONNX_TOP_P=0.7

# Balanced
export ONNX_TOP_P=0.9

# Diverse
export ONNX_TOP_P=0.95
```

### `ONNX_REPETITION_PENALTY`

**Values:** Float (1.0-2.0)
**Default:** `1.1`
**Description:** Penalty applied to tokens that have already been generated

```bash
# No penalty (may repeat)
export ONNX_REPETITION_PENALTY=1.0

# Mild penalty (recommended)
export ONNX_REPETITION_PENALTY=1.1

# Strong penalty (more diverse but may lose coherence)
export ONNX_REPETITION_PENALTY=1.5
```

---

## Optimization Features

### `ONNX_OPTIMIZED`

**Values:** `true` | `false`
**Default:** `false`
**Description:** Enable the optimized ONNX provider with context pruning and prompt enhancement

```bash
# Enable all optimizations (recommended)
export ONNX_OPTIMIZED=true

# Use base provider
export ONNX_OPTIMIZED=false
```

**Benefits when enabled:**
- 30-50% quality improvement via prompt optimization
- 2-4x speed improvement via context pruning
- Automatic sliding window context management

### `ONNX_MAX_CONTEXT_TOKENS`

**Values:** Integer (500-4000)
**Default:** `2048`
**Description:** Maximum context tokens (used when `ONNX_OPTIMIZED=true`)

```bash
# Smaller context (faster, less history)
export ONNX_MAX_CONTEXT_TOKENS=1000

# Larger context (slower, more history)
export ONNX_MAX_CONTEXT_TOKENS=3000
```

**Warning:** Total (context + output) must stay under 4096 tokens (Phi-4 limit).

### `ONNX_SLIDING_WINDOW`

**Values:** `true` | `false`
**Default:** `true` (when `ONNX_OPTIMIZED=true`)
**Description:** Enable sliding window context pruning

```bash
# Enable context pruning (recommended for speed)
export ONNX_SLIDING_WINDOW=true

# Disable (keep all context)
export ONNX_SLIDING_WINDOW=false
```

**Performance:** 2-4x faster inference by keeping only recent messages

### `ONNX_PROMPT_OPTIMIZATION`

**Values:** `true` | `false`
**Default:** `true` (when `ONNX_OPTIMIZED=true`)
**Description:** Auto-enhance prompts for better quality

```bash
# Enable prompt optimization (recommended for quality)
export ONNX_PROMPT_OPTIMIZATION=true

# Disable
export ONNX_PROMPT_OPTIMIZATION=false
```

**Quality:** 30-50% improvement by adding quality guidelines to code tasks

### `ONNX_CACHE_SYSTEM_PROMPTS`

**Values:** `true` | `false`
**Default:** `true` (when `ONNX_OPTIMIZED=true`)
**Description:** Cache processed system prompts for reuse

```bash
# Enable caching (faster repeated tasks)
export ONNX_CACHE_SYSTEM_PROMPTS=true

# Disable
export ONNX_CACHE_SYSTEM_PROMPTS=false
```

**Speed:** 30-40% faster on repeated prompts

---
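
For intuition about what `ONNX_SLIDING_WINDOW` does, here is a minimal TypeScript sketch of sliding-window pruning. The 4-characters-per-token estimate and the keep-newest-first policy are assumptions for illustration, not the actual agentic-flow implementation:

```typescript
interface Message { role: 'system' | 'user' | 'assistant'; content: string; }

// Crude token estimate: roughly 4 characters per token for English text (assumption).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep the system prompt, then add conversation turns newest-first until
// the ONNX_MAX_CONTEXT_TOKENS budget is exhausted.
function pruneContext(messages: Message[], maxContextTokens: number): Message[] {
  const system = messages.filter(m => m.role === 'system');
  const turns = messages.filter(m => m.role !== 'system');
  let budget = maxContextTokens -
    system.reduce((total, m) => total + estimateTokens(m.content), 0);

  const kept: Message[] = [];
  for (let i = turns.length - 1; i >= 0; i--) {  // walk from newest to oldest
    const cost = estimateTokens(turns[i].content);
    if (cost > budget) break;                    // window is full
    budget -= cost;
    kept.unshift(turns[i]);                      // restore chronological order
  }
  return [...system, ...kept];
}
```

Dropping old turns is why pruning speeds things up: attention cost grows with context length, so a 1000-token window is processed far faster than a full conversation history.
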
## Preset Configurations

### Maximum Speed

```bash
export PROVIDER=onnx
export ONNX_EXECUTION_PROVIDERS=cuda,cpu  # or dml/coreml
export ONNX_OPTIMIZED=true
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_MAX_TOKENS=100
export ONNX_SLIDING_WINDOW=true
export ONNX_CACHE_SYSTEM_PROMPTS=true
```

**Result:** 180+ tokens/sec (with GPU), minimal latency

### Maximum Quality

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3
export ONNX_TOP_P=0.9
export ONNX_TOP_K=50
export ONNX_REPETITION_PENALTY=1.1
export ONNX_PROMPT_OPTIMIZATION=true
export ONNX_MAX_TOKENS=300
```

**Result:** 8.5/10 quality for code tasks

### Balanced

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_EXECUTION_PROVIDERS=cpu  # or cuda/dml/coreml if available
export ONNX_TEMPERATURE=0.3
export ONNX_MAX_TOKENS=200
export ONNX_MAX_CONTEXT_TOKENS=1500
```

**Result:** Good quality/speed tradeoff

### CPU Only (No GPU)

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_EXECUTION_PROVIDERS=cpu
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_MAX_TOKENS=150
export ONNX_TEMPERATURE=0.3
export ONNX_SLIDING_WINDOW=true
```

**Result:** Best CPU performance (~12 tokens/sec, vs ~6 unoptimized)

---

## Use Case Configurations

### Code Generation

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3  # Deterministic
export ONNX_TOP_P=0.9
export ONNX_PROMPT_OPTIMIZATION=true
export ONNX_MAX_TOKENS=250
```

### Code Review

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.4
export ONNX_MAX_TOKENS=300
export ONNX_MAX_CONTEXT_TOKENS=2000  # Need more context
```

### Documentation

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.6  # More creative
export ONNX_TOP_P=0.95
export ONNX_MAX_TOKENS=400
```

### Refactoring

```bash
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.35
export ONNX_MAX_TOKENS=200
export ONNX_SLIDING_WINDOW=true
```

---

## Performance Tuning

### Scenario 1: Too Slow (6 tokens/sec)

**Problem:** CPU-only inference is slow

**Solutions:**
1. Enable GPU acceleration (biggest impact)
2. Reduce context size
3. Enable sliding window
4. Reduce max tokens

```bash
# Quick wins (no hardware change)
export ONNX_MAX_CONTEXT_TOKENS=1000  # 2x faster
export ONNX_SLIDING_WINDOW=true
export ONNX_MAX_TOKENS=100

# Best solution (requires GPU)
export ONNX_EXECUTION_PROVIDERS=cuda,cpu  # 30x faster!
```

### Scenario 2: Low Quality Output

**Problem:** Generated code has bugs/missing features

**Solutions:**
1. Enable optimizations
2. Lower temperature
3. Use specific prompts
4. Enable prompt optimization

```bash
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3
export ONNX_PROMPT_OPTIMIZATION=true
export ONNX_TOP_K=50
export ONNX_TOP_P=0.9
```

### Scenario 3: Out of Memory

**Problem:** System runs out of RAM

**Solutions:**
1. Reduce context size
2. Reduce max tokens
3. Close other applications

```bash
export ONNX_MAX_CONTEXT_TOKENS=800
export ONNX_MAX_TOKENS=100
export ONNX_SLIDING_WINDOW=true
```
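
Scenario 4 below tunes the sampling knobs from the Generation Parameters section. To show how they interact, here is a minimal TypeScript sketch of a single decoding step, assuming the conventional order (repetition penalty, then temperature, then top-k, then top-p); the provider's real pipeline may differ:

```typescript
// One decoding step combining all four knobs:
// repetition penalty -> temperature -> softmax -> top-k -> top-p -> sample.
function sampleNextToken(
  logits: number[],          // raw model scores, one per vocabulary token
  generated: Set<number>,    // token ids emitted so far
  temperature: number,       // ONNX_TEMPERATURE
  topK: number,              // ONNX_TOP_K
  topP: number,              // ONNX_TOP_P
  repetitionPenalty: number, // ONNX_REPETITION_PENALTY
): number {
  // 1. Penalize tokens that were already generated (CTRL-style convention).
  const penalized = logits.map((l, id) =>
    generated.has(id) ? (l > 0 ? l / repetitionPenalty : l * repetitionPenalty) : l);

  // 2. Temperature scaling, then a numerically stable softmax.
  const scaled = penalized.map(l => l / Math.max(temperature, 1e-5));
  const maxLogit = scaled.reduce((a, b) => Math.max(a, b), -Infinity);
  const exps = scaled.map(l => Math.exp(l - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);

  // 3. Top-k: keep only the K most probable tokens.
  const candidates = exps
    .map((e, id) => ({ id, p: e / sum }))
    .sort((a, b) => b.p - a.p)
    .slice(0, topK);

  // 4. Top-p: keep the smallest prefix whose cumulative mass reaches topP.
  const nucleus: typeof candidates = [];
  let mass = 0;
  for (const c of candidates) {
    nucleus.push(c);
    mass += c.p;
    if (mass >= topP) break;
  }

  // 5. Sample proportionally from the surviving nucleus.
  let r = Math.random() * mass;
  for (const c of nucleus) {
    r -= c.p;
    if (r <= 0) return c.id;
  }
  return nucleus[nucleus.length - 1].id; // numerical-edge fallback
}
```

Lower temperature sharpens the distribution before top-k/top-p even apply, which is why it is the first knob to reach for on code tasks.
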
### Scenario 4: Repetitive Output

**Problem:** Model repeats the same phrases

**Solutions:**
1. Increase repetition penalty
2. Adjust temperature
3. Change top_p/top_k

```bash
export ONNX_REPETITION_PENALTY=1.2
export ONNX_TEMPERATURE=0.4
export ONNX_TOP_P=0.85
```

---

## Debug and Logging

### `DEBUG`

**Values:** `true` | `false`
**Default:** `false`
**Description:** Enable detailed logging

```bash
export DEBUG=true
npx agentic-flow --agent coder --task "test"
```

### `ONNX_LOG_PERFORMANCE`

**Values:** `true` | `false`
**Default:** `false`
**Description:** Log performance metrics

```bash
export ONNX_LOG_PERFORMANCE=true
# Outputs: tokens/sec, latency, context size, etc.
```

---

## Example Workflows

### Daily Development (Local, Free)

```bash
# .env file
PROVIDER=onnx
ONNX_OPTIMIZED=true
ONNX_TEMPERATURE=0.3
ONNX_MAX_TOKENS=200
ONNX_EXECUTION_PROVIDERS=cpu

# Usage
npx agentic-flow --agent coder --task "Build feature"
```

### CI/CD Pipeline (Fast, Local)

```bash
# CI environment variables
PROVIDER=onnx
ONNX_OPTIMIZED=true
ONNX_MAX_CONTEXT_TOKENS=800
ONNX_MAX_TOKENS=100
ONNX_TEMPERATURE=0.2
```

### Hybrid: ONNX + Cloud Fallback

```bash
# Try ONNX first (80% of tasks)
export PROVIDER=onnx
export ONNX_OPTIMIZED=true

# For complex tasks, switch to cloud
unset PROVIDER
export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENROUTER_API_KEY=sk-or-v1-...
```

---

## Best Practices

1. **Always enable optimizations**
   ```bash
   export ONNX_OPTIMIZED=true
   ```

2. **Lower temperature for code**
   ```bash
   export ONNX_TEMPERATURE=0.3
   ```

3. **Enable GPU if available** (30x faster!)
   ```bash
   export ONNX_EXECUTION_PROVIDERS=cuda,cpu
   ```

4. **Keep context under 2K tokens** (2-4x faster)
   ```bash
   export ONNX_MAX_CONTEXT_TOKENS=1500
   ```

5. **Use a `.env` file** for consistency
   ```bash
   # Create .env file in project root
   echo "PROVIDER=onnx" >> .env
   echo "ONNX_OPTIMIZED=true" >> .env
   echo "ONNX_TEMPERATURE=0.3" >> .env
   ```

---

## Troubleshooting

### Error: Model not found

```bash
# Check model path
ls -lh ./models/phi-4-mini/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/

# Re-download if missing
rm -rf ./models/phi-4-mini
npx agentic-flow --agent coder --task "test" --provider onnx
```

### Error: CUDA not available

```bash
# Check CUDA installation
nvidia-smi

# Fall back to CPU
export ONNX_EXECUTION_PROVIDERS=cpu
```

### Slow inference (< 10 tok/s)

```bash
# Enable optimizations
export ONNX_OPTIMIZED=true
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_SLIDING_WINDOW=true

# Best: Enable GPU
export ONNX_EXECUTION_PROVIDERS=cuda,cpu
```

---

## See Also

- [ONNX CLI Usage Guide](./ONNX_CLI_USAGE.md)
- [ONNX Optimization Guide](./ONNX_OPTIMIZATION_GUIDE.md)
- [ONNX vs Claude Quality](./ONNX_VS_CLAUDE_QUALITY.md)
- [Full ONNX Integration](./ONNX_INTEGRATION.md)

---

**Remember:** ONNX is free and runs locally. Optimize first, then decide if you need cloud providers for complex tasks.
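
As a closing illustration of that advice, here is a minimal TypeScript sketch of the hybrid pattern from the Example Workflows section: run the task locally with ONNX first, and retry against Anthropic if the local run exits nonzero. The fallback trigger and the env-var switching are assumptions for illustration; only the CLI invocation itself comes from this guide:

```typescript
// Sketch of "try local, fall back to cloud" around the agentic-flow CLI.
import { execFileSync } from 'node:child_process';

function runTask(task: string): void {
  const args = ['agentic-flow', '--agent', 'coder', '--task', task];
  try {
    // First attempt: free local ONNX inference.
    execFileSync('npx', args, {
      env: { ...process.env, PROVIDER: 'onnx', ONNX_OPTIMIZED: 'true' },
      stdio: 'inherit',
    });
  } catch {
    // Local run failed: retry via the cloud (assumes ANTHROPIC_API_KEY is set).
    execFileSync('npx', args, {
      env: { ...process.env, PROVIDER: 'anthropic' },
      stdio: 'inherit',
    });
  }
}

runTask('Build feature');
```
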