
ONNX Environment Variables Reference

Complete guide to configuring ONNX local inference via environment variables.

Quick Start

# Enable ONNX with all optimizations
export PROVIDER=onnx
export ONNX_OPTIMIZED=true

# Run your agent
npx agentic-flow --agent coder --task "Build feature"

Provider Selection

PROVIDER

Values: anthropic | openrouter | onnx
Default: anthropic
Description: Set the AI provider for all CLI commands

# Use ONNX for all commands
export PROVIDER=onnx
npx agentic-flow --agent coder --task "test"

# Use OpenRouter
export PROVIDER=openrouter
npx agentic-flow --agent coder --task "test"

USE_ONNX

Values: true | false
Default: false
Description: Force the ONNX provider (legacy; use PROVIDER=onnx instead)

export USE_ONNX=true
npx agentic-flow --agent coder --task "test"

Model Configuration

ONNX_MODEL_PATH

Values: File path
Default: ./models/phi-4-mini/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/model.onnx
Description: Custom path to the ONNX model file

# Use custom model location
export ONNX_MODEL_PATH=/mnt/models/custom-model.onnx

ONNX_EXECUTION_PROVIDERS

Values: Comma-separated list: cpu, cuda, dml, coreml
Default: cpu
Description: Execution providers for inference (affects speed dramatically)

# CPU only (default, slowest)
export ONNX_EXECUTION_PROVIDERS=cpu

# NVIDIA GPU acceleration (10-50x faster)
export ONNX_EXECUTION_PROVIDERS=cuda,cpu

# Windows DirectML GPU (5-15x faster)
export ONNX_EXECUTION_PROVIDERS=dml,cpu

# macOS Apple Silicon (7-20x faster)
export ONNX_EXECUTION_PROVIDERS=coreml,cpu

Performance Impact:

  • cpu: ~6 tokens/sec
  • cuda: ~60-300 tokens/sec (10-50x faster)
  • dml: ~30-100 tokens/sec (5-15x faster)
  • coreml: ~40-120 tokens/sec (7-20x faster)
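
If you are unsure which provider to request, a small shell check can pick one automatically. This is a minimal sketch using standard system tools (nvidia-smi, uname); it is not part of agentic-flow, and you may want different detection logic on your machines:

# Sketch: choose execution providers based on detected hardware
if command -v nvidia-smi >/dev/null 2>&1; then
  export ONNX_EXECUTION_PROVIDERS=cuda,cpu     # NVIDIA GPU detected
elif [ "$(uname)" = "Darwin" ]; then
  export ONNX_EXECUTION_PROVIDERS=coreml,cpu   # macOS / Apple Silicon
else
  export ONNX_EXECUTION_PROVIDERS=cpu          # safe fallback
fi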

Generation Parameters

ONNX_MAX_TOKENS

Values: Integer (1-4096)
Default: 200
Description: Maximum tokens to generate in response

# Short responses (faster)
export ONNX_MAX_TOKENS=100

# Long responses
export ONNX_MAX_TOKENS=500

Tip: Keep under 300 for best speed. Context + output must stay under 4K tokens total.
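
To guard against exceeding the 4K budget, a quick arithmetic check works. This is only a sketch of the constraint described above, using the variables documented on this page; the 4096 figure is the Phi-4 limit quoted later in this document:

# Sketch: sanity-check that context + output fits the 4K token budget
export ONNX_MAX_CONTEXT_TOKENS=2048
export ONNX_MAX_TOKENS=250
if [ $(( ONNX_MAX_CONTEXT_TOKENS + ONNX_MAX_TOKENS )) -gt 4096 ]; then
  echo "warning: context + output exceeds the 4096-token limit" >&2
fi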

ONNX_TEMPERATURE

Values: Float (0.0-2.0)
Default: 0.7 (base), 0.3 (optimized)
Description: Controls output randomness/creativity

# Deterministic code (recommended for code generation)
export ONNX_TEMPERATURE=0.2

# Balanced
export ONNX_TEMPERATURE=0.7

# Creative writing
export ONNX_TEMPERATURE=0.9

Recommended Settings:

  • Code generation: 0.2-0.4 (consistent syntax)
  • Refactoring: 0.3-0.5 (some creativity, but safe)
  • Documentation: 0.5-0.7 (clear but varied)
  • Brainstorming: 0.7-0.9 (diverse ideas)
  • Math/Logic: 0.1-0.2 (precise)
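
One way to apply these settings consistently is a small shell helper that maps a task type to a temperature. The function name and the exact values are illustrative, not part of agentic-flow:

# Sketch: pick a temperature by task type before running the agent
onnx_temp_for() {
  case "$1" in
    code)       export ONNX_TEMPERATURE=0.3 ;;
    refactor)   export ONNX_TEMPERATURE=0.4 ;;
    docs)       export ONNX_TEMPERATURE=0.6 ;;
    brainstorm) export ONNX_TEMPERATURE=0.8 ;;
    math)       export ONNX_TEMPERATURE=0.15 ;;
    *)          export ONNX_TEMPERATURE=0.7 ;;  # documented default
  esac
}

onnx_temp_for code
npx agentic-flow --agent coder --task "Implement the parser"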

ONNX_TOP_K

Values: Integer (1-100)
Default: 50
Description: Sample only from the top K most likely tokens

# More focused (deterministic)
export ONNX_TOP_K=20

# More diverse
export ONNX_TOP_K=80

ONNX_TOP_P

Values: Float (0.0-1.0)
Default: 0.9
Description: Nucleus sampling threshold (cumulative probability mass)

# Very focused
export ONNX_TOP_P=0.7

# Balanced
export ONNX_TOP_P=0.9

# Diverse
export ONNX_TOP_P=0.95

ONNX_REPETITION_PENALTY

Values: Float (1.0-2.0)
Default: 1.1
Description: Penalty for token repetition

# No penalty (may repeat)
export ONNX_REPETITION_PENALTY=1.0

# Mild penalty (recommended)
export ONNX_REPETITION_PENALTY=1.1

# Strong penalty (more diverse but may lose coherence)
export ONNX_REPETITION_PENALTY=1.5

Optimization Features

ONNX_OPTIMIZED

Values: true | false
Default: false
Description: Enable the optimized ONNX provider with context pruning and prompt enhancement

# Enable all optimizations (recommended)
export ONNX_OPTIMIZED=true

# Use base provider
export ONNX_OPTIMIZED=false

Benefits when enabled:

  • 30-50% quality improvement via prompt optimization
  • 2-4x speed improvement via context pruning
  • Automatic sliding window context management

ONNX_MAX_CONTEXT_TOKENS

Values: Integer (500-4000)
Default: 2048
Description: Maximum context tokens (used when ONNX_OPTIMIZED=true)

# Smaller context (faster, less history)
export ONNX_MAX_CONTEXT_TOKENS=1000

# Larger context (slower, more history)
export ONNX_MAX_CONTEXT_TOKENS=3000

Warning: Total (context + output) must stay under 4096 tokens (Phi-4 limit)

ONNX_SLIDING_WINDOW

Values: true | false
Default: true (when ONNX_OPTIMIZED=true)
Description: Enable sliding window context pruning

# Enable context pruning (recommended for speed)
export ONNX_SLIDING_WINDOW=true

# Disable (keep all context)
export ONNX_SLIDING_WINDOW=false

Performance: 2-4x faster inference by keeping only recent messages

ONNX_PROMPT_OPTIMIZATION

Values: true | false
Default: true (when ONNX_OPTIMIZED=true)
Description: Auto-enhance prompts for better quality

# Enable prompt optimization (recommended for quality)
export ONNX_PROMPT_OPTIMIZATION=true

# Disable
export ONNX_PROMPT_OPTIMIZATION=false

Quality: 30-50% improvement by adding quality guidelines to code tasks

ONNX_CACHE_SYSTEM_PROMPTS

Values: true | false
Default: true (when ONNX_OPTIMIZED=true)
Description: Cache processed system prompts for reuse

# Enable caching (faster repeated tasks)
export ONNX_CACHE_SYSTEM_PROMPTS=true

# Disable
export ONNX_CACHE_SYSTEM_PROMPTS=false

Speed: 30-40% faster on repeated prompts


Preset Configurations

Maximum Speed

export PROVIDER=onnx
export ONNX_EXECUTION_PROVIDERS=cuda,cpu  # or dml/coreml
export ONNX_OPTIMIZED=true
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_MAX_TOKENS=100
export ONNX_SLIDING_WINDOW=true
export ONNX_CACHE_SYSTEM_PROMPTS=true

Result: 180+ tokens/sec (with GPU), minimal latency

Maximum Quality

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3
export ONNX_TOP_P=0.9
export ONNX_TOP_K=50
export ONNX_REPETITION_PENALTY=1.1
export ONNX_PROMPT_OPTIMIZATION=true
export ONNX_MAX_TOKENS=300

Result: 8.5/10 quality for code tasks

Balanced

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_EXECUTION_PROVIDERS=cpu  # or cuda/dml/coreml if a GPU is available
export ONNX_TEMPERATURE=0.3
export ONNX_MAX_TOKENS=200
export ONNX_MAX_CONTEXT_TOKENS=1500

Result: Good quality + speed tradeoff

CPU Only (No GPU)

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_EXECUTION_PROVIDERS=cpu
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_MAX_TOKENS=150
export ONNX_TEMPERATURE=0.3
export ONNX_SLIDING_WINDOW=true

Result: Best CPU performance (still ~12 tokens/sec)
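
If you switch between these presets often, keeping each one in its own env file and sourcing the one you need avoids retyping exports. The file name below is illustrative, not something agentic-flow looks for:

# Sketch: store the Maximum Speed preset in its own file, then source it
cat > onnx-speed.env <<'EOF'
export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_MAX_TOKENS=100
export ONNX_SLIDING_WINDOW=true
EOF

source onnx-speed.env
npx agentic-flow --agent coder --task "Build feature"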


Use Case Configurations

Code Generation

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3  # Deterministic
export ONNX_TOP_P=0.9
export ONNX_PROMPT_OPTIMIZATION=true
export ONNX_MAX_TOKENS=250

Code Review

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.4
export ONNX_MAX_TOKENS=300
export ONNX_MAX_CONTEXT_TOKENS=2000  # Need more context

Documentation

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.6  # More creative
export ONNX_TOP_P=0.95
export ONNX_MAX_TOKENS=400

Refactoring

export PROVIDER=onnx
export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.35
export ONNX_MAX_TOKENS=200
export ONNX_SLIDING_WINDOW=true

Performance Tuning

Scenario 1: Too Slow (6 tokens/sec)

Problem: CPU-only inference is slow
Solutions:

  1. Enable GPU acceleration (biggest impact)
  2. Reduce context size
  3. Enable sliding window
  4. Reduce max tokens

# Quick wins (no hardware change)
export ONNX_MAX_CONTEXT_TOKENS=1000  # 2x faster
export ONNX_SLIDING_WINDOW=true
export ONNX_MAX_TOKENS=100

# Best solution (requires GPU)
export ONNX_EXECUTION_PROVIDERS=cuda,cpu  # 30x faster!

Scenario 2: Low Quality Output

Problem: Generated code has bugs/missing features
Solutions:

  1. Enable optimizations
  2. Lower temperature
  3. Use specific prompts
  4. Enable prompt optimization

export ONNX_OPTIMIZED=true
export ONNX_TEMPERATURE=0.3
export ONNX_PROMPT_OPTIMIZATION=true
export ONNX_TOP_K=50
export ONNX_TOP_P=0.9

Scenario 3: Out of Memory

Problem: System runs out of RAM
Solutions:

  1. Reduce context size
  2. Reduce max tokens
  3. Close other applications

export ONNX_MAX_CONTEXT_TOKENS=800
export ONNX_MAX_TOKENS=100
export ONNX_SLIDING_WINDOW=true

Scenario 4: Repetitive Output

Problem: Model repeats the same phrases
Solutions:

  1. Increase repetition penalty
  2. Adjust temperature
  3. Change top_p/top_k

export ONNX_REPETITION_PENALTY=1.2
export ONNX_TEMPERATURE=0.4
export ONNX_TOP_P=0.85

Debug and Logging

DEBUG

Values: true | false
Default: false
Description: Enable detailed logging

export DEBUG=true
npx agentic-flow --agent coder --task "test"

ONNX_LOG_PERFORMANCE

Values: true | false
Default: false
Description: Log performance metrics

export ONNX_LOG_PERFORMANCE=true
# Outputs: tokens/sec, latency, context size, etc.

Example Workflows

Daily Development (Local, Free)

# .env file
PROVIDER=onnx
ONNX_OPTIMIZED=true
ONNX_TEMPERATURE=0.3
ONNX_MAX_TOKENS=200
ONNX_EXECUTION_PROVIDERS=cpu

# Usage
npx agentic-flow --agent coder --task "Build feature"

CI/CD Pipeline (Fast, Local)

# CI environment variables
PROVIDER=onnx
ONNX_OPTIMIZED=true
ONNX_MAX_CONTEXT_TOKENS=800
ONNX_MAX_TOKENS=100
ONNX_TEMPERATURE=0.2

Hybrid: ONNX + Cloud Fallback

# Try ONNX first (80% of tasks)
export PROVIDER=onnx
export ONNX_OPTIMIZED=true

# For complex tasks, switch to cloud
unset PROVIDER
export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENROUTER_API_KEY=sk-or-v1-...
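
The switch can also be scripted. The sketch below tries ONNX first and falls back to OpenRouter if the local run fails; it assumes the CLI exits non-zero on failure and that OPENROUTER_API_KEY is already exported, both of which you should verify for your setup:

# Sketch: ONNX first, cloud fallback on failure (function name is illustrative)
run_task() {
  PROVIDER=onnx ONNX_OPTIMIZED=true npx agentic-flow --agent coder --task "$1" \
    || PROVIDER=openrouter npx agentic-flow --agent coder --task "$1"
}

run_task "Refactor the auth module"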

Best Practices

  1. Always enable optimizations

    export ONNX_OPTIMIZED=true
    
  2. Lower temperature for code

    export ONNX_TEMPERATURE=0.3
    
  3. Enable GPU if available (30x faster!)

    export ONNX_EXECUTION_PROVIDERS=cuda,cpu
    
  4. Keep context under 2K tokens (2-4x faster)

    export ONNX_MAX_CONTEXT_TOKENS=1500
    
  5. Use .env file for consistency

    # Create .env file in project root
    echo "PROVIDER=onnx" >> .env
    echo "ONNX_OPTIMIZED=true" >> .env
    echo "ONNX_TEMPERATURE=0.3" >> .env
    

Troubleshooting

Error: Model not found

# Check model path
ls -lh ./models/phi-4-mini/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/

# Re-download if missing
rm -rf ./models/phi-4-mini
npx agentic-flow --agent coder --task "test" --provider onnx

Error: CUDA not available

# Check CUDA installation
nvidia-smi

# Fall back to CPU
export ONNX_EXECUTION_PROVIDERS=cpu

Slow inference (< 10 tok/s)

# Enable optimizations
export ONNX_OPTIMIZED=true
export ONNX_MAX_CONTEXT_TOKENS=1000
export ONNX_SLIDING_WINDOW=true

# Best: Enable GPU
export ONNX_EXECUTION_PROVIDERS=cuda,cpu


Remember: ONNX is free and runs locally. Optimize first, then decide if you need cloud providers for complex tasks.