
ONNX Runtime Integration Research - Phi-4 Implementation

Date: 2025-10-03
Status: Research Complete - Implementation Ready

Executive Summary

Research findings for integrating Microsoft's Phi-4-mini-instruct-onnx model with agentic-flow using onnxruntime-node for CPU inference.

Key Findings

1. Library Comparison

| Library | Use Case | Performance | Node.js Support | Status |
|---|---|---|---|---|
| onnxruntime-node | Server-side inference | Fastest | Native | Recommended |
| onnxruntime-web | Browser/frontend | Good | WebAssembly | Not suitable for CLI |
| @xenova/transformers | Simplified API | Moderate | Yes | Limited model support |
| onnxruntime-genai | Generative AI | Excellent | Python only | Not available for Node.js |

Conclusion: Use onnxruntime-node v1.22.0 - it's the official Microsoft library with best performance for server-side Node.js applications.

2. Phi-4-mini-instruct-onnx Model Details

HuggingFace: https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx

Model Specifications

  • Context Length: 128K tokens
  • License: MIT
  • Quantization: INT4 (RTN - Round To Nearest)
  • Variants:
    • cpu-int4-rtn-block-32-acc-level-4 - CPU optimized
    • gpu-int4-rtn-block-32 - CUDA optimized

Performance Characteristics

  • Speedup: 12.4x faster than PyTorch on CPU
  • Memory: Reduced via INT4 quantization
  • Platform: Cross-platform (Windows, Linux, macOS)

3. Key Challenges Identified

Challenge 1: onnxruntime-genai Not Available for Node.js

  • Issue: The onnxruntime-genai library (used in Python examples) has no npm package
  • Impact: Cannot use simplified GenAI API in Node.js
  • Solution: Use onnxruntime-node directly with manual tokenization

Challenge 2: Transformers.js Incompatibility

  • Issue: @xenova/transformers doesn't support Phi-4 models (only GPT-2, Llama, etc.)
  • Error: "Unsupported model type: phi3"
  • Solution: Bypass transformers.js, use onnxruntime-node + custom tokenizer

Challenge 3: Manual Tokenization Required

  • Issue: Need to implement Phi-4 chat template and tokenization
  • Required:
    • Tokenizer model (tokenizer.json)
    • Chat template formatting
    • Pre/post processing
  • Solution: Use HuggingFace tokenizers library or implement manually
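
Post-processing in particular must be hand-rolled: onnxruntime-node returns raw logits, so next-token selection (here, greedy argmax) has to be implemented in application code. A minimal sketch (the helper name is illustrative):

```typescript
// Greedy next-token selection over raw logits as returned by
// onnxruntime-node. A single-step logits tensor has shape
// [batch, sequenceLength, vocabSize]; we take the argmax of the
// final position. The helper name is illustrative, not part of any API.
function argmaxLastToken(
  logits: Float32Array,
  seqLen: number,
  vocabSize: number,
): number {
  const offset = (seqLen - 1) * vocabSize; // start of the last position's scores
  let best = 0;
  let bestScore = -Infinity;
  for (let i = 0; i < vocabSize; i++) {
    const score = logits[offset + i];
    if (score > bestScore) {
      bestScore = score;
      best = i;
    }
  }
  return best;
}
```

Sampling strategies (top-k, temperature) would replace the plain argmax, but the shape handling stays the same.
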
4. Proposed Architecture

┌─────────────────────────────────────────────┐
│         ONNXProvider (Updated)              │
├─────────────────────────────────────────────┤
│                                             │
│  ┌───────────────────────────────────┐     │
│  │   onnxruntime-node v1.22.0        │     │
│  │   (InferenceSession)              │     │
│  └───────────────────────────────────┘     │
│              ↓                              │
│  ┌───────────────────────────────────┐     │
│  │   Phi-4 ONNX Model                │     │
│  │   cpu-int4-rtn-block-32           │     │
│  └───────────────────────────────────┘     │
│              ↓                              │
│  ┌───────────────────────────────────┐     │
│  │   Custom Tokenizer                │     │
│  │   (Phi-4 chat template)           │     │
│  └───────────────────────────────────┘     │
│                                             │
└─────────────────────────────────────────────┘

5. Implementation Plan

Phase 1: Download Phi-4 Model

huggingface-cli download microsoft/Phi-4-mini-instruct-onnx \
  --include cpu-int4-rtn-block-32-acc-level-4/* \
  --local-dir ./models/phi-4

Phase 2: Install Dependencies

npm install onnxruntime-node@^1.22.0
npm install @huggingface/tokenizers  # For tokenization

Phase 3: Implement ONNXProvider

  • Use onnxruntime-node InferenceSession API
  • Load Phi-4 ONNX model from disk
  • Implement Phi-4 chat template
  • Handle tokenization/detokenization
  • Support CPU execution provider (upgradeable to CUDA)
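
Under these constraints, a possible shape for the provider looks like the sketch below. The onnxruntime-node session is created through an injected factory so the class can be unit-tested without model files; the interface and names are illustrative, not an existing agentic-flow API:

```typescript
// Sketch of an ONNXProvider shape with an injected session factory.
// All names here are illustrative assumptions, not an existing API.
interface InferenceSessionLike {
  run(feeds: Record<string, unknown>): Promise<Record<string, unknown>>;
}

type SessionFactory = (modelPath: string) => Promise<InferenceSessionLike>;

class ONNXProvider {
  private session: InferenceSessionLike | null = null;

  constructor(
    private readonly modelPath: string,
    private readonly createSession: SessionFactory,
  ) {}

  // Lazily create the InferenceSession on first use.
  async load(): Promise<void> {
    if (!this.session) {
      this.session = await this.createSession(this.modelPath);
    }
  }

  get loaded(): boolean {
    return this.session !== null;
  }
}

// In production the factory would wrap onnxruntime-node, e.g.
//   (p) => ort.InferenceSession.create(p, { executionProviders: ['cpu'] })
```

Injecting the factory keeps the provider testable and makes the later CPU-to-CUDA upgrade a one-line change at the call site.
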

Phase 4: Chat Template Format

// Phi-4 uses the following chat format:
// <|system|>
// {system_message}<|end|>
// <|user|>
// {user_message}<|end|>
// <|assistant|>
// {assistant_response}<|end|>
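
This template can be applied with a small formatter (a sketch; the function name is illustrative, and the trailing <|assistant|> tag cues the model to produce the assistant turn):

```typescript
// Apply the Phi-4 chat template shown above. The prompt ends with the
// <|assistant|> tag so the model generates the assistant response next.
function formatPhi4Prompt(systemMessage: string, userMessage: string): string {
  return (
    `<|system|>\n${systemMessage}<|end|>\n` +
    `<|user|>\n${userMessage}<|end|>\n` +
    `<|assistant|>\n`
  );
}
```
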

6. API Comparison

Python (onnxruntime-genai) - NOT AVAILABLE FOR NODE

import onnxruntime_genai as og
model = og.Model("cpu-int4-rtn-block-32-acc-level-4")
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)
Node.js (onnxruntime-node)

import * as ort from 'onnxruntime-node';

// Load model
const session = await ort.InferenceSession.create(
  './models/phi-4/model.onnx',
  { executionProviders: ['cpu'] }
);

// Manual tokenization (tokenize/detokenize must be supplied separately)
const inputIds = tokenize(prompt);
// Note: int64 tensors require BigInt64Array data in onnxruntime-node
const feeds = {
  input_ids: new ort.Tensor(
    'int64',
    BigInt64Array.from(inputIds.map(BigInt)),
    [1, inputIds.length]
  ),
};

// Run inference (a full generation loop would also feed attention_mask
// and past_key_values, then decode token by token)
const results = await session.run(feeds);
const outputIds = results.output.data; // output name depends on the exported graph
const text = detokenize(outputIds);
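
A single session.run call only scores the current input; text generation is autoregressive. The loop below sketches that, with the per-step model call injected as a function (in practice it would wrap session.run plus an argmax over the logits); all names are illustrative:

```typescript
// Autoregressive greedy generation loop. The per-step model call is
// injected as `nextToken`; it receives the running token sequence and
// returns the chosen next token id. Names are illustrative assumptions.
async function generate(
  promptIds: number[],
  nextToken: (ids: number[]) => Promise<number>,
  eosTokenId: number,
  maxNewTokens: number,
): Promise<number[]> {
  const ids = [...promptIds];
  const generated: number[] = [];
  for (let step = 0; step < maxNewTokens; step++) {
    const token = await nextToken(ids);
    if (token === eosTokenId) break; // stop at end-of-sequence
    generated.push(token);
    ids.push(token); // feed the new token back for the next step
  }
  return generated;
}
```

With KV caching (past_key_values), each step would pass only the newest token instead of the full sequence, which is what makes CPU inference tractable at 128K context.
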

7. Performance Targets

| Metric | Target | Expected |
|---|---|---|
| First Token Latency | <2000ms | ~1500ms |
| Tokens/Second | >15 | 15-25 |
| Memory Usage | <4GB | ~2-3GB |
| Cost | $0 | FREE |
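
These targets can be checked during benchmarking with a small timing helper (a sketch; the function is illustrative):

```typescript
// Compute first-token latency and throughput from timestamps collected
// around a generation run. The helper name is illustrative.
function throughputStats(
  startMs: number,
  firstTokenMs: number,
  endMs: number,
  tokensGenerated: number,
): { firstTokenLatencyMs: number; tokensPerSecond: number } {
  const firstTokenLatencyMs = firstTokenMs - startMs;
  const elapsedSec = (endMs - startMs) / 1000;
  return {
    firstTokenLatencyMs,
    tokensPerSecond: elapsedSec > 0 ? tokensGenerated / elapsedSec : 0,
  };
}
```
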

8. Execution Providers

| Provider | Platform | Support | Acceleration |
|---|---|---|---|
| CPU | All | Default | AVX2, AVX512 |
| CUDA | Linux + NVIDIA | Available | 10-100x |
| DirectML | Windows | Available | 5-20x |
| CoreML | macOS | ⚠️ Experimental | 5-10x |
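
Choosing a provider list at runtime could look like the sketch below. The provider identifiers ('cpu', 'cuda', 'dml', 'coreml') follow onnxruntime-node conventions; the selection logic itself is an illustrative assumption:

```typescript
// Pick an execution-provider preference list from the platform and a
// GPU flag, mirroring the table above; always falls back to 'cpu'.
// The selection logic is a sketch, not part of onnxruntime-node.
function pickExecutionProviders(platform: string, useGpu: boolean): string[] {
  if (useGpu) {
    if (platform === "linux") return ["cuda", "cpu"];    // CUDA on Linux + NVIDIA
    if (platform === "win32") return ["dml", "cpu"];     // DirectML on Windows
    if (platform === "darwin") return ["coreml", "cpu"]; // CoreML (experimental)
  }
  return ["cpu"];
}

// Usage: pass the result to InferenceSession.create as
//   { executionProviders: pickExecutionProviders(process.platform, useGpu) }
```
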

9. Model Download Strategy

Option 1: Manual Download (Recommended)

  • Use huggingface-cli to pre-download model
  • Store in ./models/phi-4/ directory
  • Faster initialization, no runtime downloads

Option 2: Automatic Download

  • Use @huggingface/hub library
  • Download on first run
  • Slower first initialization

Recommendation: Pre-download for Docker deployments, auto-download for development.
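
A small helper can implement this split by checking whether the model is already on disk (a sketch; the directory layout and file name are assumptions):

```typescript
import { existsSync } from "node:fs";
import { join } from "node:path";

// Decide between the two strategies above: use a pre-downloaded model
// if present on disk, otherwise signal that a download is needed.
// The "model.onnx" file name is an illustrative assumption.
function resolveModel(modelDir: string): { modelPath: string; needsDownload: boolean } {
  const modelPath = join(modelDir, "model.onnx");
  return { modelPath, needsDownload: !existsSync(modelPath) };
}
```

In a Docker deployment the check should always hit the pre-downloaded path; in development, a needsDownload result would trigger the @huggingface/hub fetch.
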

10. Docker Considerations

# In Dockerfile
RUN pip install huggingface-hub
RUN huggingface-cli download microsoft/Phi-4-mini-instruct-onnx \
    --include cpu-int4-rtn-block-32-acc-level-4/* \
    --local-dir /app/models/phi-4

# Or, in docker-compose.yml, mount the models as a volume
volumes:
  - ./models:/app/models

Next Steps

  1. Research complete - onnxruntime-node confirmed as best option
  2. 🔄 Download Phi-4 model files
  3. Implement ONNXProvider with onnxruntime-node
  4. Create tokenizer integration
  5. Test in Docker CPU environment
  6. Benchmark performance
  7. Add GPU support (CUDA/DirectML)


Conclusion

onnxruntime-node is the correct choice for implementing Phi-4 inference in agentic-flow:

  • Official Microsoft library
  • Best performance for Node.js
  • CPU and GPU support
  • Production-ready

Note: We'll need to implement manual tokenization since onnxruntime-genai (with built-in tokenization) is Python-only.