ONNX Runtime Integration Research - Phi-4 Implementation
Date: 2025-10-03
Status: Research Complete - Implementation Ready
Executive Summary
Research findings for integrating Microsoft's Phi-4-mini-instruct-onnx model with agentic-flow using onnxruntime-node for CPU inference.
Key Findings
1. Library Comparison
| Library | Use Case | Performance | Node.js Support | Status |
|---|---|---|---|---|
| onnxruntime-node | Server-side inference | Fastest | ✅ Native | Recommended |
| onnxruntime-web | Browser/frontend | Good | ✅ WebAssembly | Not suitable for CLI |
| @xenova/transformers | Simplified API | Moderate | ✅ Yes | Limited model support |
| onnxruntime-genai | Generative AI | Excellent | ❌ Python only | Not available for Node.js |
Conclusion: Use onnxruntime-node v1.22.0. It is the official Microsoft library and offers the best performance of the options above for server-side Node.js applications.
2. Phi-4-mini-instruct-onnx Model Details
HuggingFace: https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx
Model Specifications
- Context Length: 128K tokens
- License: MIT
- Quantization: INT4 (RTN - Round To Nearest)
- Variants:
  - cpu-int4-rtn-block-32-acc-level-4: CPU optimized
  - gpu-int4-rtn-block-32: CUDA optimized
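To make the "INT4 RTN, block 32" naming concrete, here is a minimal sketch of symmetric round-to-nearest quantization for a single weight block. It assumes one shared scale per block and a signed range of [-8, 7]; the function names are hypothetical, and the scheme actually used to export the model (for example zero points or a different range) may differ.

```typescript
// Symmetric round-to-nearest (RTN) quantization of one weight block to INT4.
// Assumption: one shared scale per block and a signed range of [-8, 7].
interface QuantizedBlock {
  scale: number;    // multiply a quantized value by this to approximate the weight
  values: number[]; // integers in [-8, 7]
}

function quantizeBlockRtn(weights: number[]): QuantizedBlock {
  const maxAbs = weights.reduce((m, w) => Math.max(m, Math.abs(w)), 0);
  // Map the largest magnitude to 7; guard against an all-zero block.
  const scale = maxAbs === 0 ? 1 : maxAbs / 7;
  const values = weights.map((w) => {
    const q = Math.round(w / scale);
    return Math.min(7, Math.max(-8, q));
  });
  return { scale, values };
}

function dequantizeBlock(block: QuantizedBlock): number[] {
  return block.values.map((q) => q * block.scale);
}

// A real block would hold 32 weights; four are used here for brevity.
const exampleBlock = quantizeBlockRtn([7, -3.2, 0.4, 1.6]);
console.log(exampleBlock.values, dequantizeBlock(exampleBlock));
```

Each weight is stored as a 4-bit integer plus a share of the block's scale, which is where the memory reduction comes from; the rounding error per weight is at most half a scale step.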
Performance Characteristics
- Speedup: 12.4x faster than PyTorch on CPU
- Memory: Reduced via INT4 quantization
- Platform: Cross-platform (Windows, Linux, macOS)
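As a rough sanity check on the memory claim, the weight footprint can be estimated from the parameter count and bits per weight. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes about 3.8 billion parameters, one 16-bit scale per 32-weight block, and ignores the KV cache, activations, and any layers left unquantized. The helper name is hypothetical.

```typescript
// Rough weight-memory estimate for block-wise quantization.
// Assumptions: every parameter is quantized, one scale per block,
// and no runtime overhead (KV cache, activations) is counted.
function estimateWeightBytes(
  paramCount: number,
  bitsPerWeight: number,
  blockSize: number,
  scaleBits: number,
): number {
  const weightBits = paramCount * bitsPerWeight;
  const scaleOverheadBits = (paramCount / blockSize) * scaleBits;
  return (weightBits + scaleOverheadBits) / 8;
}

const assumedParams = 3.8e9; // assumption, not taken from this document
const int4Bytes = estimateWeightBytes(assumedParams, 4, 32, 16);
const fp16Bytes = estimateWeightBytes(assumedParams, 16, 32, 0);
console.log(
  `INT4: ${(int4Bytes / 1e9).toFixed(2)} GB, FP16: ${(fp16Bytes / 1e9).toFixed(2)} GB`,
);
```

Under these assumptions the INT4 weights come to roughly 2.1 GB versus about 7.6 GB at FP16, which is consistent with the ~2-3GB memory expectation listed under Performance Targets.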
3. Key Challenges Identified
Challenge 1: onnxruntime-genai Not Available for Node.js
- Issue: The onnxruntime-genai library (used in Python examples) has no npm package
- Impact: Cannot use simplified GenAI API in Node.js
- Solution: Use onnxruntime-node directly with manual tokenization
Challenge 2: Transformers.js Incompatibility
- Issue: @xenova/transformers doesn't support Phi-4 models (only GPT-2, Llama, etc.)
- Error: "Unsupported model type: phi3"
- Solution: Bypass transformers.js, use onnxruntime-node + custom tokenizer
Challenge 3: Manual Tokenization Required
- Issue: Need to implement Phi-4 chat template and tokenization
- Required:
- Tokenizer model (tokenizer.json)
- Chat template formatting
- Pre/post processing
- Solution: Use HuggingFace tokenizers library or implement manually
4. Recommended Architecture
┌─────────────────────────────────────────────┐
│ ONNXProvider (Updated) │
├─────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────┐ │
│ │ onnxruntime-node v1.22.0 │ │
│ │ (InferenceSession) │ │
│ └───────────────────────────────────┘ │
│ ↓ │
│ ┌───────────────────────────────────┐ │
│ │ Phi-4 ONNX Model │ │
│ │ cpu-int4-rtn-block-32 │ │
│ └───────────────────────────────────┘ │
│ ↓ │
│ ┌───────────────────────────────────┐ │
│ │ Custom Tokenizer │ │
│ │ (Phi-4 chat template) │ │
│ └───────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────┘
5. Implementation Plan
Phase 1: Download Phi-4 Model ✅
huggingface-cli download microsoft/Phi-4-mini-instruct-onnx \
--include "cpu-int4-rtn-block-32-acc-level-4/*" \
--local-dir ./models/phi-4
Phase 2: Install Dependencies
npm install onnxruntime-node@^1.22.0
npm install @huggingface/tokenizers # For tokenization
Phase 3: Implement ONNXProvider
- Use onnxruntime-node InferenceSession API
- Load Phi-4 ONNX model from disk
- Implement Phi-4 chat template
- Handle tokenization/detokenization
- Support CPU execution provider (upgradeable to CUDA)
Phase 4: Chat Template Format
// Phi-4 uses the following chat format:
// <|system|>
// {system_message}<|end|>
// <|user|>
// {user_message}<|end|>
// <|assistant|>
// {assistant_response}<|end|>
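The format above could be rendered with a small helper like the following. This is a minimal sketch: the `ChatMessage` type and `formatPhi4Prompt` name are hypothetical, and the exact whitespace around the special tokens is copied from the layout shown here, so it should be checked against the chat template in the model's tokenizer configuration.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Render each message as "<|role|>\ncontent<|end|>\n", then open an
// assistant turn so the model continues from that point.
function formatPhi4Prompt(messages: ChatMessage[]): string {
  const turns = messages.map((m) => `<|${m.role}|>\n${m.content}<|end|>\n`);
  return turns.join("") + "<|assistant|>\n";
}

const prompt = formatPhi4Prompt([
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "What is ONNX?" },
]);
console.log(prompt);
```

The resulting string is what gets passed to the tokenizer; generation stops when the model emits `<|end|>`.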
6. API Comparison
Python (onnxruntime-genai) - NOT AVAILABLE FOR NODE
import onnxruntime_genai as og
model = og.Model("cpu-int4-rtn-block-32-acc-level-4")
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)
Node.js (onnxruntime-node) - RECOMMENDED
import * as ort from 'onnxruntime-node';
// Load model
const session = await ort.InferenceSession.create(
'./models/phi-4/model.onnx',
{ executionProviders: ['cpu'] }
);
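// Manual tokenization: tokenize() and detokenize() stand in for the custom tokenizer
const inputIds = tokenize(prompt);
// int64 tensors take BigInt64Array data, so convert the numeric token IDs
const idData = BigInt64Array.from(inputIds, (id) => BigInt(id));
const feeds = { input_ids: new ort.Tensor('int64', idData, [1, idData.length]) };
// Run inference; the graph may require further inputs (check session.inputNames)
const results = await session.run(feeds);
// Output names depend on the export (check session.outputNames); 'output' is a placeholder
const outputIds = results.output.data;
const text = detokenize(outputIds);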
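Decoder-only ONNX exports typically return next-token logits rather than token IDs, so a sampling step is needed between inference and detokenization. The sketch below shows the simplest option, greedy selection at the last sequence position. It assumes the logits arrive as a flat `Float32Array` laid out as [1, seqLen, vocabSize]; the function name is hypothetical and the actual output layout should be confirmed on the loaded session.

```typescript
// Pick the highest-scoring token ID at the last sequence position.
// Assumption: logits are flattened from shape [1, seqLen, vocabSize].
function greedyNextToken(
  logits: Float32Array,
  seqLen: number,
  vocabSize: number,
): number {
  if (logits.length !== seqLen * vocabSize) {
    throw new Error("logits length does not match seqLen * vocabSize");
  }
  const offset = (seqLen - 1) * vocabSize; // start of the last position's row
  let best = 0;
  for (let i = 1; i < vocabSize; i++) {
    if (logits[offset + i] > logits[offset + best]) {
      best = i;
    }
  }
  return best;
}

// Two positions, vocabulary of three tokens: the last row is [0.2, 0.1, 0.7].
const demoLogits = Float32Array.from([0.1, 0.9, 0.0, 0.2, 0.1, 0.7]);
console.log(greedyNextToken(demoLogits, 2, 3));
```

In a generation loop, the selected ID would be appended to the input sequence and inference repeated until the end-of-turn token or a length limit is reached.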
7. Performance Targets
| Metric | Target | Expected |
|---|---|---|
| First Token Latency | <2000ms | ~1500ms |
| Tokens/Second | >15 | 15-25 |
| Memory Usage | <4GB | ~2-3GB |
| Cost | $0 | FREE |
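To check a run against the latency and throughput targets above, the two metrics can be derived from per-token timestamps. This is a sketch with hypothetical names; it defines tokens/second as the decode rate after the first token, which is one common convention but not the only one.

```typescript
interface GenerationTiming {
  startMs: number;        // when the request was issued
  tokenTimesMs: number[]; // arrival time of each generated token
}

interface GenerationMetrics {
  firstTokenLatencyMs: number;
  tokensPerSecond: number; // decode rate, excluding the first token
}

function summarizeTiming(t: GenerationTiming): GenerationMetrics {
  const n = t.tokenTimesMs.length;
  if (n === 0) {
    throw new Error("no tokens were generated");
  }
  const first = t.tokenTimesMs[0];
  const last = t.tokenTimesMs[n - 1];
  const decodeMs = last - first;
  return {
    firstTokenLatencyMs: first - t.startMs,
    tokensPerSecond: n > 1 && decodeMs > 0 ? ((n - 1) * 1000) / decodeMs : 0,
  };
}

// Targets from the table: first token under 2000 ms, more than 15 tokens/second.
function meetsTargets(m: GenerationMetrics): boolean {
  return m.firstTokenLatencyMs < 2000 && m.tokensPerSecond > 15;
}

const metrics = summarizeTiming({
  startMs: 0,
  tokenTimesMs: [1500, 1550, 1600, 1650, 1700],
});
console.log(metrics, meetsTargets(metrics));
```

Timestamps could come from `performance.now()` recorded inside the generation loop.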
8. Execution Providers
| Provider | Platform | Support | Acceleration |
|---|---|---|---|
| CPU | All | ✅ Default | AVX2, AVX512 |
| CUDA | Linux + NVIDIA | ✅ Available | 10-100x |
| DirectML | Windows | ✅ Available | 5-20x |
| CoreML | macOS | ⚠️ Experimental | 5-10x |
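One way to act on this table is to build an ordered provider list per platform and always keep CPU as the final fallback. The sketch below is hypothetical: the platform strings match Node's `process.platform` values, and the provider names ('cuda', 'dml', 'coreml', 'cpu') are the identifiers onnxruntime-node is expected to accept, which should be verified against the installed version.

```typescript
// Build an ordered list of execution providers for the given platform.
// CPU is always last so session creation can fall back to it.
function pickExecutionProviders(platform: string, preferGpu: boolean): string[] {
  if (!preferGpu) {
    return ["cpu"];
  }
  switch (platform) {
    case "linux":
      return ["cuda", "cpu"];   // requires an NVIDIA GPU and CUDA libraries
    case "win32":
      return ["dml", "cpu"];    // DirectML
    case "darwin":
      return ["coreml", "cpu"]; // listed as experimental in the table above
    default:
      return ["cpu"];
  }
}

console.log(pickExecutionProviders("linux", true));
console.log(pickExecutionProviders("darwin", false));
```

The returned array would be passed as the `executionProviders` option when creating the session, with `process.platform` as the first argument.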
9. Model Download Strategy
Option 1: Manual Download (Recommended)
- Use huggingface-cli to pre-download model
- Store in ./models/phi-4/ directory
- Faster initialization, no runtime downloads
Option 2: Automatic Download
- Use @huggingface/hub library
- Download on first run
- Slower first initialization
Recommendation: Pre-download for Docker deployments, auto-download for development.
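The two options can share one code path that prefers a local copy and only falls back to downloading when that is allowed. The following is a sketch under assumptions: the type and function names are hypothetical, the `model.onnx` file name follows the Node.js example in this document, and the existence check is injected so the logic stays independent of the file system.

```typescript
type ModelSource =
  | { kind: "local"; modelPath: string }
  | { kind: "download"; repo: string; targetDir: string };

// Prefer a pre-downloaded model; fall back to a download request only
// when the caller permits it (e.g. in development, not in Docker).
function resolveModelSource(
  modelDir: string,
  fileExists: (path: string) => boolean,
  allowDownload: boolean,
): ModelSource {
  const modelPath = `${modelDir}/model.onnx`;
  if (fileExists(modelPath)) {
    return { kind: "local", modelPath };
  }
  if (allowDownload) {
    return {
      kind: "download",
      repo: "microsoft/Phi-4-mini-instruct-onnx",
      targetDir: modelDir,
    };
  }
  throw new Error(`Model not found at ${modelPath} and downloads are disabled`);
}

// Stubbed existence check for illustration; fs.existsSync would be used in Node.
const source = resolveModelSource("./models/phi-4", () => false, true);
console.log(source.kind);
```

Disabling downloads in production builds turns a missing model into an immediate startup error instead of a slow first request.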
10. Docker Considerations
# In Dockerfile
RUN pip install huggingface-hub
RUN huggingface-cli download microsoft/Phi-4-mini-instruct-onnx \
  --include "cpu-int4-rtn-block-32-acc-level-4/*" \
  --local-dir /app/models/phi-4
# Or mount as a volume (in docker-compose.yml)
volumes:
  - ./models:/app/models
Next Steps
- ✅ Research complete - onnxruntime-node confirmed as best option
- 🔄 Download Phi-4 model files
- ⏳ Implement ONNXProvider with onnxruntime-node
- ⏳ Create tokenizer integration
- ⏳ Test in Docker CPU environment
- ⏳ Benchmark performance
- ⏳ Add GPU support (CUDA/DirectML)
Resources
- ONNX Runtime Node.js: https://onnxruntime.ai/docs/api/js/
- Phi-4 Model: https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx
- ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai (Python reference)
- HuggingFace Tokenizers: https://www.npmjs.com/package/@huggingface/tokenizers
Conclusion
onnxruntime-node is the correct choice for implementing Phi-4 inference in agentic-flow:
- Official Microsoft library
- Best performance for Node.js
- CPU and GPU support
- Production-ready
Note: We'll need to implement manual tokenization since onnxruntime-genai (with built-in tokenization) is Python-only.