Marc Rejohn Castillano 5cb6561924 added ruflo

2026-04-09 19:01:53 +08:00

7.9 KiB

ONNX Runtime Integration Research - Phi-4 Implementation

Date: 2025-10-03
Status: Research Complete - Implementation Ready

Executive Summary

This document summarizes research on integrating Microsoft's Phi-4-mini-instruct-onnx model with agentic-flow, using onnxruntime-node for CPU inference.

Key Findings

1. Library Comparison

Library	Use Case	Performance	Node.js Support	Status
onnxruntime-node	Server-side inference	Fastest	✅ Native	Recommended
onnxruntime-web	Browser/frontend	Good	✅ WebAssembly	Not suitable for CLI
@xenova/transformers	Simplified API	Moderate	✅ Yes	Limited model support
onnxruntime-genai	Generative AI	Excellent	❌ No npm package	Not available for Node.js

Conclusion: Use onnxruntime-node v1.22.0 - it's the official Microsoft library with best performance for server-side Node.js applications.

2. Phi-4-mini-instruct-onnx Model Details

HuggingFace: https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx

Model Specifications

Context Length: 128K tokens
License: MIT
Quantization: INT4 (RTN - Round To Nearest)
Variants:
- cpu-int4-rtn-block-32-acc-level-4 - CPU optimized
- gpu-int4-rtn-block-32 - CUDA optimized

Performance Characteristics

Speedup: 12.4x faster than PyTorch on CPU
Memory: Reduced via INT4 quantization
Platform: Cross-platform (Windows, Linux, macOS)

3. Key Challenges Identified

Challenge 1: onnxruntime-genai Not Available for Node.js

Issue: The onnxruntime-genai library (used in Python examples) has no npm package
Impact: Cannot use simplified GenAI API in Node.js
Solution: Use onnxruntime-node directly with manual tokenization

Challenge 2: Transformers.js Incompatibility

Issue: @xenova/transformers doesn't support Phi-4 models (only GPT-2, Llama, etc.)
Error: "Unsupported model type: phi3"
Solution: Bypass transformers.js, use onnxruntime-node + custom tokenizer

Challenge 3: Manual Tokenization Required

Issue: Need to implement Phi-4 chat template and tokenization
Required:
- Tokenizer model (tokenizer.json)
- Chat template formatting
- Pre/post processing
Solution: Use HuggingFace tokenizers library or implement manually
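The post-processing item in the list above can be quite small. The following is a minimal sketch, assuming the decoded text may still contain the special markers from the Phi-4 chat format; the function name `cleanCompletion` and the default marker list are illustrative, not part of any library.

```javascript
// Hypothetical helper: trims raw decoded text down to the assistant's reply.
// The default stop markers are assumptions based on the chat format used in
// this document; adjust them to whatever the tokenizer actually emits.
function cleanCompletion(rawText, stopMarkers = ['<|end|>', '<|user|>', '<|system|>']) {
  let cut = rawText.length;
  for (const marker of stopMarkers) {
    const idx = rawText.indexOf(marker);
    // Keep the earliest marker position seen so far
    if (idx !== -1 && idx < cut) cut = idx;
  }
  return rawText.slice(0, cut).trim();
}
```

If the tokenizer is configured to skip special tokens while decoding, this step may be unnecessary.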

4. Recommended Architecture

┌─────────────────────────────────────────────┐
│         ONNXProvider (Updated)              │
├─────────────────────────────────────────────┤
│                                             │
│  ┌───────────────────────────────────┐     │
│  │   onnxruntime-node v1.22.0        │     │
│  │   (InferenceSession)              │     │
│  └───────────────────────────────────┘     │
│              ↓                              │
│  ┌───────────────────────────────────┐     │
│  │   Phi-4 ONNX Model                │     │
│  │   cpu-int4-rtn-block-32           │     │
│  └───────────────────────────────────┘     │
│              ↓                              │
│  ┌───────────────────────────────────┐     │
│  │   Custom Tokenizer                │     │
│  │   (Phi-4 chat template)           │     │
│  └───────────────────────────────────┘     │
│                                             │
└─────────────────────────────────────────────┘

5. Implementation Plan

Phase 1: Download Phi-4 Model ✅

huggingface-cli download microsoft/Phi-4-mini-instruct-onnx \
  --include "cpu-int4-rtn-block-32-acc-level-4/*" \
  --local-dir ./models/phi-4

Phase 2: Install Dependencies

npm install onnxruntime-node@^1.22.0
npm install @huggingface/tokenizers  # For tokenization

Phase 3: Implement ONNXProvider

Use onnxruntime-node InferenceSession API
Load Phi-4 ONNX model from disk
Implement Phi-4 chat template
Handle tokenization/detokenization
Support CPU execution provider (upgradeable to CUDA)

Phase 4: Chat Template Format

// Phi-4 uses the following chat format:
// <|system|>
// {system_message}<|end|>
// <|user|>
// {user_message}<|end|>
// <|assistant|>
// {assistant_response}<|end|>
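The format above could be produced by a small formatter such as the one below. This is a sketch, not the model's official template: the `messages` shape and the function name `formatChat` are assumptions, and the exact whitespace between markers should be checked against the chat template shipped with the model's tokenizer configuration.

```javascript
// Builds a prompt string in the layout shown above. `messages` is an assumed
// shape: [{ role: 'system' | 'user' | 'assistant', content: string }, ...].
function formatChat(messages) {
  const allowed = new Set(['system', 'user', 'assistant']);
  let prompt = '';
  for (const { role, content } of messages) {
    if (!allowed.has(role)) throw new Error(`Unknown role: ${role}`);
    prompt += `<|${role}|>\n${content}<|end|>\n`;
  }
  // Leave an open assistant turn so the model generates the reply.
  return prompt + '<|assistant|>\n';
}
```

Ending the prompt with an open `<|assistant|>` turn is what asks the model for a reply rather than a continuation of the user's message.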

6. API Comparison

Python (onnxruntime-genai) - NOT AVAILABLE FOR NODE

import onnxruntime_genai as og
model = og.Model("cpu-int4-rtn-block-32-acc-level-4")
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)

Node.js (onnxruntime-node) - RECOMMENDED

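import * as ort from 'onnxruntime-node';

// Load model
const session = await ort.InferenceSession.create(
  './models/phi-4/model.onnx',
  { executionProviders: ['cpu'] }
);

// Manual tokenization (tokenize/detokenize stand in for the custom tokenizer)
const inputIds = tokenize(prompt);
// int64 tensors must be backed by a BigInt64Array
const ids = BigInt64Array.from(inputIds, (id) => BigInt(id));
const feeds = { input_ids: new ort.Tensor('int64', ids, [1, ids.length]) };
// Decoder exports usually need more inputs (attention mask, past key/values);
// check session.inputNames and add them to feeds.

// Run inference (one forward pass; generation repeats this once per token)
const results = await session.run(feeds);
// Assumes the export names its output 'logits'; check session.outputNames.
// pickNextToken is a placeholder for sampling or argmax over the last position.
const nextId = pickNextToken(results.logits.data);
const text = detokenize([nextId]);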
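Two details of the Node.js path are easy to get wrong: int64 tensors need `BigInt64Array` data, and a decoder model typically returns logits rather than token IDs. The sketch below shows both pieces in isolation. It assumes the logits arrive as a flat `Float32Array` with shape `[1, seqLen, vocabSize]`; the helper names are hypothetical.

```javascript
// int64 tensors in onnxruntime-node are backed by BigInt64Array, so plain
// number token IDs need converting before they go into an ort.Tensor.
function toInt64Data(tokenIds) {
  return BigInt64Array.from(tokenIds, (id) => BigInt(id));
}

// Greedy decoding step: index of the largest logit at the last sequence position.
// Assumes `logits` is a flat Float32Array with shape [1, seqLen, vocabSize].
function argmaxLastPosition(logits, seqLen, vocabSize) {
  const offset = (seqLen - 1) * vocabSize;
  let best = 0;
  for (let i = 1; i < vocabSize; i++) {
    if (logits[offset + i] > logits[offset + best]) best = i;
  }
  return best;
}
```

A generation loop would append the returned ID to the input, run the session again, and stop at an end-of-sequence token or a length limit. Greedy argmax is the simplest choice; temperature or top-p sampling would replace `argmaxLastPosition`.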

7. Performance Targets

Metric	Target	Expected
First Token Latency	<2000ms	~1500ms
Tokens/Second	>15	15-25
Memory Usage	<4GB	~2-3GB
Cost	$0	FREE
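To check a run against the latency and throughput targets above, the two metrics can be derived from per-token timestamps. This is a sketch under assumptions: `startMs` and `tokenTimesMs` are illustrative names for values the caller records (for example with `performance.now()`), and throughput is measured over the decode phase only, excluding the first token.

```javascript
// Summarizes one generation run against the targets in the table above.
// `startMs` is when the request began; `tokenTimesMs` holds one timestamp per
// generated token. Both are supplied by the caller.
function summarizeRun(startMs, tokenTimesMs) {
  if (tokenTimesMs.length === 0) throw new Error('no tokens generated');
  const firstTokenLatencyMs = tokenTimesMs[0] - startMs;
  const last = tokenTimesMs[tokenTimesMs.length - 1];
  const decodeSeconds = (last - tokenTimesMs[0]) / 1000;
  // Throughput over the decode phase, excluding the first token
  const tokensPerSecond =
    decodeSeconds > 0 ? (tokenTimesMs.length - 1) / decodeSeconds : 0;
  return {
    firstTokenLatencyMs,
    tokensPerSecond,
    meetsTargets: firstTokenLatencyMs < 2000 && tokensPerSecond > 15,
  };
}
```

Separating first-token latency from decode throughput matters because prompt processing and token-by-token decoding scale differently with prompt length.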

8. Execution Providers

Provider	Platform	Support	Acceleration
CPU	All	✅ Default	AVX2, AVX512
CUDA	Linux + NVIDIA	✅ Available	10-100x
DirectML	Windows	✅ Available	5-20x
CoreML	macOS	⚠️ Experimental	5-10x
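The table above maps naturally to a provider list chosen at startup. The sketch below is one way to do that, assuming the provider identifiers `'cuda'`, `'dml'`, `'coreml'`, and `'cpu'` are accepted by the installed onnxruntime-node build; the function name and the `preferGpu` flag are hypothetical.

```javascript
// Picks an ordered execution-provider list for InferenceSession.create().
// The provider names are assumed to match the identifiers used by
// onnxruntime's JavaScript API; verify them against the installed build.
function pickExecutionProviders(platform, preferGpu = false) {
  const providers = [];
  if (preferGpu) {
    if (platform === 'linux') providers.push('cuda');
    else if (platform === 'win32') providers.push('dml');
    else if (platform === 'darwin') providers.push('coreml');
  }
  providers.push('cpu'); // CPU last, as a fallback
  return providers;
}
```

The result would be passed as `executionProviders`, with `process.platform` as the first argument. Whether an unavailable provider falls back to the next entry or raises an error depends on the build, so wrapping session creation in a try/catch is a reasonable precaution.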

9. Model Download Strategy

Option 1: Manual Download (Recommended)

Use huggingface-cli to pre-download model
Store in ./models/phi-4/ directory
Faster initialization, no runtime downloads

Option 2: Automatic Download

Use @huggingface/hub library
Download on first run
Slower first initialization

Recommendation: Pre-download for Docker deployments, auto-download for development.
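The recommendation above can be expressed as a small decision function. This is a sketch only: treating `NODE_ENV === 'production'` as the signal for a Docker deployment is an assumption, and the caller is expected to supply `modelDirExists` (for example from `fs.existsSync`) so that the decision itself does no I/O.

```javascript
// Decides how to obtain the model, following the recommendation above.
// `nodeEnv` and `modelDirExists` are passed in by the caller, e.g.
// process.env.NODE_ENV and fs.existsSync('./models/phi-4').
function chooseModelSource(nodeEnv, modelDirExists) {
  if (modelDirExists) return 'local';
  if (nodeEnv === 'production') {
    // Deployments should ship with the model already in place
    throw new Error('Model not found; pre-download it into the image or volume.');
  }
  return 'download'; // development: fetch on first run
}
```

Failing fast in production avoids a large, slow download at container start.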

10. Docker Considerations

# In Dockerfile (requires Python and pip in the image)
RUN pip install huggingface-hub
RUN huggingface-cli download microsoft/Phi-4-mini-instruct-onnx \
    --include "cpu-int4-rtn-block-32-acc-level-4/*" \
    --local-dir /app/models/phi-4

# Or mount as a volume (docker-compose.yml)
volumes:
  - ./models:/app/models

Next Steps

✅ Research complete - onnxruntime-node confirmed as best option
🔄 Download Phi-4 model files
⏳ Implement ONNXProvider with onnxruntime-node
⏳ Create tokenizer integration
⏳ Test in Docker CPU environment
⏳ Benchmark performance
⏳ Add GPU support (CUDA/DirectML)

Resources

ONNX Runtime Node.js: https://onnxruntime.ai/docs/api/js/
Phi-4 Model: https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx
ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai (Python reference)
HuggingFace Tokenizers: https://www.npmjs.com/package/@huggingface/tokenizers

Conclusion

onnxruntime-node is the correct choice for implementing Phi-4 inference in agentic-flow:

Official Microsoft library
Best performance for Node.js
CPU and GPU support
Production-ready

Note: Tokenization must be implemented manually, because onnxruntime-genai (which has built-in tokenization) has no Node.js package.
