
ONNX Environment Variables Reference

Complete guide to configuring ONNX local inference via environment variables.

Quick Start

# Enable ONNX with all optimizations
export PROVIDER=onnx
export ONNX_OPTIMIZED=true

# Run your agent
npx agentic-flow --agent coder --task "Build feature"

Provider Selection

PROVIDER

Values: anthropic | openrouter | onnx
Default: anthropic
Description: Set the AI provider for all CLI commands

# Use ONNX for all commands
export PROVIDER=onnx
npx agentic-flow --agent coder --task "test"

# Use OpenRouter
export PROVIDER=openrouter
npx agentic-flow --agent coder --task "test"

USE_ONNX

Values: true | false
Default: false
Description: Force the ONNX provider (legacy; use PROVIDER=onnx instead)

export USE_ONNX=true
npx agentic-flow --agent coder --task "test"

Model Configuration

ONNX_MODEL_PATH

Values: File path
Default: ./models/phi-4-mini/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/model.onnx
Description: Custom path to the ONNX model file

# Use custom model location
export ONNX_MODEL_PATH=/mnt/models/custom-model.onnx

ONNX_EXECUTION_PROVIDERS

Values: Comma-separated list: cpu, cuda, dml, coreml
Default: cpu
Description: Execution providers for inference (affects speed dramatically)

# CPU only (default, slowest)
export ONNX_EXECUTION_PROVIDERS=cpu

# NVIDIA GPU acceleration (10-50x faster)
export ONNX_EXECUTION_PROVIDERS=cuda,cpu

# Windows DirectML GPU (5-15x faster)
export ONNX_EXECUTION_PROVIDERS=dml,cpu

# macOS Apple Silicon (7-20x faster)
export ONNX_EXECUTION_PROVIDERS=coreml,cpu

Performance Impact:

  • cpu: ~6 tokens/sec
  • cuda: ~60-300 tokens/sec (10-50x faster)
  • dml: ~30-100 tokens/sec (5-15x faster)
  • coreml: ~40-120 tokens/sec (7-20x faster)
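
If you are unsure which provider to request, a small shell check can pick one automatically. This is a minimal sketch using standard system tools (nvidia-smi, uname); it is not part of agentic-flow, and you may want different detection logic on your machines:

# Sketch: choose execution providers based on detected hardware
if command -v nvidia-smi >/dev/null 2>&1; then
  export ONNX_EXECUTION_PROVIDERS=cuda,cpu     # NVIDIA GPU detected
elif [ "$(uname)" = "Darwin" ]; then
  export ONNX_EXECUTION_PROVIDERS=coreml,cpu   # macOS / Apple Silicon
else
  export ONNX_EXECUTION_PROVIDERS=cpu          # safe fallback
fi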

Generation Parameters

ONNX_MAX_TOKENS

Values: Integer (1-4096)
Default: 200
Description: Maximum tokens to generate in response

# Short responses (faster)
export ONNX_MAX_TOKENS=100

# Long responses
export ONNX_MAX_TOKENS=500

Tip: Keep under 300 for best speed. Context + output must stay under 4K tokens total.
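
To guard against exceeding the 4K budget, a quick arithmetic check works. This is only a sketch of the constraint described above, using the variables documented on this page; the 4096 figure is the Phi-4 limit quoted later in this document:

# Sketch: sanity-check that context + output fits the 4K token budget
export ONNX_MAX_CONTEXT_TOKENS=2048
export ONNX_MAX_TOKENS=250
if [ $(( ONNX_MAX_CONTEXT_TOKENS + ONNX_MAX_TOKENS )) -gt 4096 ]; then
  echo "warning: context + output exceeds the 4096-token limit" >&2
fi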

ONNX_TEMPERATURE

Values: Float (0.0-2.0)
Default: 0.7 (base), 0.3 (optimized)
Description: Controls output randomness/creativity

# Deterministic code (recommended for code generation)
export ONNX_TEMPERATURE=0.2

# Balanced
export ONNX_TEMPERATURE=0.7

# Creative writing
export ONNX_TEMPERATURE=0.9

Recommended Settings:

  • Code generation: 0.2-0.4 (consistent syntax)
  • Refactoring: 0.3-0.5 (some creativity, but safe)
  • Documentation: 0.5-0.7 (clear but varied)
  • Brainstorming: 0.7-0.9 (diverse ideas)
  • Math/Logic: 0.1-0.2 (precise)
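
One way to apply these settings consistently is a small shell helper that maps a task type to a temperature. The function name and the exact values are illustrative, not part of agentic-flow:

# Sketch: pick a temperature by task type before running the agent
onnx_temp_for() {
  case "$1" in
    code)       export ONNX_TEMPERATURE=0.3 ;;
    refactor)   export ONNX_TEMPERATURE=0.4 ;;
    docs)       export ONNX_TEMPERATURE=0.6 ;;
    brainstorm) export ONNX_TEMPERATURE=0.8 ;;
    math)       export ONNX_TEMPERATURE=0.15 ;;
    *)          export ONNX_TEMPERATURE=0.7 ;;  # documented default
  esac
}

onnx_temp_for code
npx agentic-flow --agent coder --task "Implement the parser"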

ONNX_TOP_K

Values: Integer (1-100)
Default: 50
Description: Sample only from the top K most likely tokens

# More focused (deterministic)
export ONNX_TOP_K=20

# More diverse
export ONNX_TOP_K=80

ONNX_TOP_P

Values: Float (0.0-1.0)
Default: 0.9
Description: Nucleus sampling threshold (cumulative probability mass)

# Very focused
export ONNX_TOP_P=0.7

# Balanced
export ONNX_TOP_P=0.9

# Diverse
export ONNX_TOP_P=0.95

ONNX_REPETITION_PENALTY

Values: Float (1.0-2.0)
Default: 1.1
Description: Penalty for token repetition

# No penalty (may repeat)
export ONNX_REPETITION_PENALTY=1.0

# Mild penalty (recommended)
export ONNX_REPETITION_PENALTY=1.1

# Strong penalty (more diverse but may lose coherence)
export ONNX_REPETITION_PENALTY=1.5

Optimization Features

ONNX_OPTIMIZED

Values: true | false
Default: false
Description: Enable the optimized ONNX provider with context pruning and prompt enhancement

# Enable all optimizations (recommended)
export ONNX_OPTIMIZED=true

# Use base provider
export ONNX_OPTIMIZED=false

Benefits when enabled:

  • 30-50% quality improvement via prompt optimization
  • 2-4x speed improvement via context pruning
  • Automatic sliding window context management

ONNX_MAX_CONTEXT_TOKENS

Values: Integer (500-4000)
Default: 2048
Description: Maximum context tokens (used when ONNX_OPTIMIZED=true)

# Smaller context (faster, less history)
export ONNX_MAX_CONTEXT_TOKENS=1000

# Larger context (slower, more history)
export ONNX_MAX_CONTEXT_TOKENS=3000

Warning: Total (context + output) must stay under 4096 tokens (Phi-4 limit)

ONNX_SLIDING_WINDOW

Values: true | false
Default: true (when ONNX_OPTIMIZED=true)
Description: Enable sliding window context pruning

# Enable context pruning (recommended for speed)
export ONNX_SLIDING_WINDOW=true

# Disable (keep all context)
export ONNX_SLIDING_WINDOW=false

Performance: 2-4x faster inference by keeping only recent messages

ONNX_PROMPT_OPTIMIZATION

Values: true | false
Default: true (when ONNX_OPTIMIZED=true)
Description: Auto-enhance prompts for better quality

# Enable prompt optimization (recommended for quality)
export ONNX_PROMPT_OPTIMIZATION=true

# Disable
export ONNX_PROMPT_OPTIMIZATION=false

Quality: 30-50% improvement by adding quality guidelines to code tasks

ONNX_CACHE_SYSTEM_PROMPTS

Values: true | false
Default: true (when ONNX_OPTIMIZED=true)
Description: Cache processed system prompts for reuse

# Enable caching (faster repeated tasks)
export ONNX_CACHE_SYSTEM_PROMPTS=true

# Disable
export ONNX_CACHE_SYSTEM_PROMPTS=false

Speed: 30-40% faster on repeated prompts


Preset Configurations

Maximum Speed

export PROVIDER=onnx
export ONNX_EXECUTION_PROVIDERS=cuda,cpu  # or dml/coreml
export ONNX_OPTIMIZED=true
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_MAX_TOKENS=100
export ONNX_SLIDING_WINDOW=true
export ONNX_CACHE_SYSTEM_PROMPTS=true

Result: 180+ tokens/sec (with GPU), minimal latency

Maximum Quality

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3
export ONNX_TOP_P=0.9
export ONNX_TOP_K=50
export ONNX_REPETITION_PENALTY=1.1
export ONNX_PROMPT_OPTIMIZATION=true
export ONNX_MAX_TOKENS=300

Result: 8.5/10 quality for code tasks

Balanced

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_EXECUTION_PROVIDERS=cpu  # or cuda/dml/coreml if a GPU is available
export ONNX_TEMPERATURE=0.3
export ONNX_MAX_TOKENS=200
export ONNX_MAX_CONTEXT_TOKENS=1500

Result: Good quality + speed tradeoff

CPU Only (No GPU)

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_EXECUTION_PROVIDERS=cpu
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_MAX_TOKENS=150
export ONNX_TEMPERATURE=0.3
export ONNX_SLIDING_WINDOW=true

Result: Best CPU performance (still ~12 tokens/sec)
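
If you switch between these presets often, keeping each one in its own env file and sourcing the one you need avoids retyping exports. The file name below is illustrative, not something agentic-flow looks for:

# Sketch: store the Maximum Speed preset in its own file, then source it
cat > onnx-speed.env <<'EOF'
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_MAX_TOKENS=100
export ONNX_SLIDING_WINDOW=true
EOF

source onnx-speed.env
npx agentic-flow --agent coder --task "Build feature"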


Use Case Configurations

Code Generation

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3  # Deterministic
export ONNX_TOP_P=0.9
export ONNX_PROMPT_OPTIMIZATION=true
export ONNX_MAX_TOKENS=250

Code Review

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.4
export ONNX_MAX_TOKENS=300
export ONNX_MAX_CONTEXT_TOKENS=2000  # Need more context

Documentation

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.6  # More creative
export ONNX_TOP_P=0.95
export ONNX_MAX_TOKENS=400

Refactoring

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.35
export ONNX_MAX_TOKENS=200
export ONNX_SLIDING_WINDOW=true

Performance Tuning

Scenario 1: Too Slow (6 tokens/sec)

Problem: CPU-only inference is slow
Solutions:

  1. Enable GPU acceleration (biggest impact)
  2. Reduce context size
  3. Enable sliding window
  4. Reduce max tokens

# Quick wins (no hardware change)
export ONNX_MAX_CONTEXT_TOKENS=1000  # 2x faster
export ONNX_SLIDING_WINDOW=true
export ONNX_MAX_TOKENS=100

# Best solution (requires GPU)
export ONNX_EXECUTION_PROVIDERS=cuda,cpu  # 30x faster!

Scenario 2: Low Quality Output

Problem: Generated code has bugs/missing features
Solutions:

  1. Enable optimizations
  2. Lower temperature
  3. Use specific prompts
  4. Enable prompt optimization

export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3
export ONNX_PROMPT_OPTIMIZATION=true
export ONNX_TOP_K=50
export ONNX_TOP_P=0.9

Scenario 3: Out of Memory

Problem: System runs out of RAM
Solutions:

  1. Reduce context size
  2. Reduce max tokens
  3. Close other applications

export ONNX_MAX_CONTEXT_TOKENS=800
export ONNX_MAX_TOKENS=100
export ONNX_SLIDING_WINDOW=true

Scenario 4: Repetitive Output

Problem: Model repeats the same phrases
Solutions:

  1. Increase repetition penalty
  2. Adjust temperature
  3. Change top_p/top_k

export ONNX_REPETITION_PENALTY=1.2
export ONNX_TEMPERATURE=0.4
export ONNX_TOP_P=0.85

Debug and Logging

DEBUG

Values: true | false
Default: false
Description: Enable detailed logging

export DEBUG=true
npx agentic-flow --agent coder --task "test"

ONNX_LOG_PERFORMANCE

Values: true | false
Default: false
Description: Log performance metrics

export ONNX_LOG_PERFORMANCE=true
# Outputs: tokens/sec, latency, context size, etc.

Example Workflows

Daily Development (Local, Free)

# .env file
PROVIDER=onnx
ONNX_OPTIMIZED=true
ONNX_TEMPERATURE=0.3
ONNX_MAX_TOKENS=200
ONNX_EXECUTION_PROVIDERS=cpu

# Usage
npx agentic-flow --agent coder --task "Build feature"

CI/CD Pipeline (Fast, Local)

# CI environment variables
PROVIDER=onnx
ONNX_OPTIMIZED=true
ONNX_MAX_CONTEXT_TOKENS=800
ONNX_MAX_TOKENS=100
ONNX_TEMPERATURE=0.2

Hybrid: ONNX + Cloud Fallback

# Try ONNX first (80% of tasks)
export PROVIDER=onnx
export ONNX_OPTIMIZED=true

# For complex tasks, switch to cloud
unset PROVIDER
export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENROUTER_API_KEY=sk-or-v1-...
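
The switch can also be scripted. The sketch below tries ONNX first and falls back to OpenRouter if the local run fails; it assumes the CLI exits non-zero on failure and that OPENROUTER_API_KEY is already exported, both of which you should verify for your setup:

# Sketch: ONNX first, cloud fallback on failure (function name is illustrative)
run_task() {
  PROVIDER=onnx ONNX_OPTIMIZED=true npx agentic-flow --agent coder --task "$1" \
    || PROVIDER=openrouter npx agentic-flow --agent coder --task "$1"
}

run_task "Refactor the auth module"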

Best Practices

  1. Always enable optimizations

    export ONNX_OPTIMIZED=true
    
  2. Lower temperature for code

    export ONNX_TEMPERATURE=0.3
    
  3. Enable GPU if available (30x faster!)

    export ONNX_EXECUTION_PROVIDERS=cuda,cpu
    
  4. Keep context under 2K tokens (2-4x faster)

    export ONNX_MAX_CONTEXT_TOKENS=1500
    
  5. Use .env file for consistency

    # Create .env file in project root
    echo "PROVIDER=onnx" >> .env
    echo "ONNX_OPTIMIZED=true" >> .env
    echo "ONNX_TEMPERATURE=0.3" >> .env
    

Troubleshooting

Error: Model not found

# Check model path
ls -lh ./models/phi-4-mini/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/

# Re-download if missing
rm -rf ./models/phi-4-mini
npx agentic-flow --agent coder --task "test" --provider onnx

Error: CUDA not available

# Check CUDA installation
nvidia-smi

# Fall back to CPU
export ONNX_EXECUTION_PROVIDERS=cpu

Slow inference (< 10 tok/s)

# Enable optimizations
export ONNX_OPTIMIZED=true
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_SLIDING_WINDOW=true

# Best: Enable GPU
export ONNX_EXECUTION_PROVIDERS=cuda,cpu


Remember: ONNX is free and runs locally. Optimize first, then decide if you need cloud providers for complex tasks.