# ONNX Runtime Integration Research - Phi-4 Implementation

**Date**: 2025-10-03
**Status**: Research Complete - Implementation Ready

## Executive Summary

Research findings for integrating Microsoft's Phi-4-mini-instruct-onnx model with agentic-flow, using onnxruntime-node for CPU inference.

## Key Findings

### 1. Library Comparison

| Library | Use Case | Performance | Node.js Support | Status |
|---------|----------|-------------|-----------------|--------|
| **onnxruntime-node** | Server-side inference | **Fastest** | ✅ Native | **Recommended** |
| onnxruntime-web | Browser/frontend | Good | ✅ WebAssembly | Not suitable for CLI |
| @xenova/transformers | Simplified API | Moderate | ✅ Yes | Limited model support |
| onnxruntime-genai | Generative AI | Excellent | ❌ No Node.js bindings | Not available for Node.js |

**Conclusion**: Use **onnxruntime-node** v1.22.0 - it is the official Microsoft library with the best performance for server-side Node.js applications.

### 2. Phi-4-mini-instruct-onnx Model Details

**HuggingFace**: https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx

#### Model Specifications

- **Context Length**: 128K tokens
- **License**: MIT
- **Quantization**: INT4 (RTN - Round To Nearest)
- **Variants**:
  - `cpu-int4-rtn-block-32-acc-level-4` - CPU optimized
  - `gpu-int4-rtn-block-32` - CUDA optimized

#### Performance Characteristics

- **Speedup**: 12.4x faster than PyTorch on CPU
- **Memory**: Reduced via INT4 quantization
- **Platform**: Cross-platform (Windows, Linux, macOS)

### 3. Key Challenges Identified

#### Challenge 1: onnxruntime-genai Not Available for Node.js

- **Issue**: The `onnxruntime-genai` library (used in the Python examples) has no npm package
- **Impact**: Cannot use the simplified GenAI API in Node.js
- **Solution**: Use onnxruntime-node directly with manual tokenization

#### Challenge 2: Transformers.js Incompatibility

- **Issue**: @xenova/transformers does not support Phi-4 models (only GPT-2, Llama, etc.)
- **Error**: "Unsupported model type: phi3"
- **Solution**: Bypass transformers.js; use onnxruntime-node plus a custom tokenizer

#### Challenge 3: Manual Tokenization Required

- **Issue**: Need to implement the Phi-4 chat template and tokenization
- **Required**:
  - Tokenizer model (tokenizer.json)
  - Chat template formatting
  - Pre/post processing
- **Solution**: Use the HuggingFace tokenizers library or implement manually

### 4. Recommended Architecture

```
┌─────────────────────────────────────────────┐
│           ONNXProvider (Updated)            │
├─────────────────────────────────────────────┤
│                                             │
│  ┌───────────────────────────────────┐      │
│  │  onnxruntime-node v1.22.0         │      │
│  │  (InferenceSession)               │      │
│  └───────────────────────────────────┘      │
│                      ↓                      │
│  ┌───────────────────────────────────┐      │
│  │  Phi-4 ONNX Model                 │      │
│  │  cpu-int4-rtn-block-32            │      │
│  └───────────────────────────────────┘      │
│                      ↓                      │
│  ┌───────────────────────────────────┐      │
│  │  Custom Tokenizer                 │      │
│  │  (Phi-4 chat template)            │      │
│  └───────────────────────────────────┘      │
│                                             │
└─────────────────────────────────────────────┘
```
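To make the diagram concrete, here is a minimal TypeScript sketch of how the three layers could fit together. Only the onnxruntime-node calls (`InferenceSession.create`, `Tensor`, `session.run`) are real APIs; the `ONNXProvider` shape, the `Phi4Tokenizer` interface, and the file path are placeholders for the Phase 3 implementation, and the actual Phi-4 graph may require additional inputs beyond `input_ids`.

```typescript
import * as ort from 'onnxruntime-node';

// The tokenizer from Challenge 3 only needs to expose these two operations here.
interface Phi4Tokenizer {
  encode(text: string): number[];
  decode(ids: number[]): string;
}

export class ONNXProvider {
  private session?: ort.InferenceSession;

  constructor(
    // e.g. './models/phi-4/cpu-int4-rtn-block-32-acc-level-4/model.onnx' (assumed layout)
    private modelPath: string,
    private tokenizer: Phi4Tokenizer,
  ) {}

  // Layer 1: onnxruntime-node InferenceSession (CPU now, CUDA/DirectML later).
  async load(): Promise<void> {
    this.session = await ort.InferenceSession.create(this.modelPath, {
      executionProviders: ['cpu'],
    });
  }

  // Layers 2-3: tokenize the prompt and run the Phi-4 graph once.
  // The real export may also require attention_mask / KV-cache inputs;
  // check session.inputNames before wiring up generation.
  async runOnce(prompt: string) {
    if (!this.session) throw new Error('call load() first');
    const ids = this.tokenizer.encode(prompt);
    const inputIds = new ort.Tensor(
      'int64',
      BigInt64Array.from(ids.map((i) => BigInt(i))),
      [1, ids.length],
    );
    return this.session.run({ input_ids: inputIds });
  }
}
```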
### 5. Implementation Plan

#### Phase 1: Download Phi-4 Model ✅

```bash
huggingface-cli download microsoft/Phi-4-mini-instruct-onnx \
  --include cpu-int4-rtn-block-32-acc-level-4/* \
  --local-dir ./models/phi-4
```

#### Phase 2: Install Dependencies

```bash
npm install onnxruntime-node@^1.22.0
npm install @huggingface/tokenizers  # For tokenization
```

#### Phase 3: Implement ONNXProvider

- Use the onnxruntime-node InferenceSession API
- Load the Phi-4 ONNX model from disk
- Implement the Phi-4 chat template
- Handle tokenization/detokenization
- Support the CPU execution provider (upgradeable to CUDA)

#### Phase 4: Chat Template Format

```typescript
// Phi-4 uses the following chat format:
// <|system|>
// {system_message}<|end|>
// <|user|>
// {user_message}<|end|>
// <|assistant|>
// {assistant_response}<|end|>

// Sketch of a template helper built from the format above
// (formatPhi4Prompt is a placeholder name, not an existing API):
function formatPhi4Prompt(system: string, user: string): string {
  return `<|system|>\n${system}<|end|>\n<|user|>\n${user}<|end|>\n<|assistant|>\n`;
}
```

### 6. API Comparison

#### Python (onnxruntime-genai) - NOT AVAILABLE FOR NODE

```python
import onnxruntime_genai as og

model = og.Model("cpu-int4-rtn-block-32-acc-level-4")
tokenizer = og.Tokenizer(model)
params = og.GeneratorParams(model)
generator = og.Generator(model, params)
```

#### Node.js (onnxruntime-node) - RECOMMENDED

```typescript
import * as ort from 'onnxruntime-node';

// Load model
const session = await ort.InferenceSession.create(
  './models/phi-4/model.onnx',
  { executionProviders: ['cpu'] }
);

// Manual tokenization (tokenize/detokenize are helpers we still need to implement)
const inputIds = tokenize(prompt);
const feeds = {
  input_ids: new ort.Tensor('int64', inputIds, [1, inputIds.length])
};

// Run inference
// NOTE: simplified - the graph actually outputs logits, so generation needs an
// iterative decoding loop rather than a single run (see the sketch at the end).
const results = await session.run(feeds);
const outputIds = results.output.data;
const text = detokenize(outputIds);
```

### 7. Performance Targets

| Metric | Target | Expected |
|--------|--------|----------|
| **First Token Latency** | <2000ms | ~1500ms |
| **Tokens/Second** | >15 | 15-25 |
| **Memory Usage** | <4GB | ~2-3GB |
| **Cost** | $0 | FREE |

### 8. Execution Providers

| Provider | Platform | Support | Acceleration |
|----------|----------|---------|--------------|
| CPU | All | ✅ Default | AVX2, AVX512 |
| CUDA | Linux + NVIDIA | ✅ Available | 10-100x |
| DirectML | Windows | ✅ Available | 5-20x |
| CoreML | macOS | ⚠️ Experimental | 5-10x |

### 9. Model Download Strategy

**Option 1: Manual Download (Recommended)**

- Use huggingface-cli to pre-download the model
- Store it in the `./models/phi-4/` directory
- Faster initialization, no runtime downloads

**Option 2: Automatic Download**

- Use the @huggingface/hub library
- Download on first run
- Slower first initialization

**Recommendation**: Pre-download for Docker deployments, auto-download for development (a sketch of this fallback follows Section 10).

### 10. Docker Considerations

```dockerfile
# In the Dockerfile:
RUN pip install huggingface-hub
RUN huggingface-cli download microsoft/Phi-4-mini-instruct-onnx \
    --include cpu-int4-rtn-block-32-acc-level-4/* \
    --local-dir /app/models/phi-4

# Or mount the pre-downloaded files as a volume (docker-compose syntax):
#   volumes:
#     - ./models:/app/models
```
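Combining the two options from Section 9, the sketch below prefers a pre-downloaded model and, in development, falls back to shelling out to the same `huggingface-cli` command shown in Phase 1 (rather than using @huggingface/hub). `ensureLocalModel`, the directory layout, and the `model.onnx` filename are assumptions, and the fallback requires `huggingface-cli` to be installed.

```typescript
import { existsSync } from 'node:fs';
import { execFileSync } from 'node:child_process';
import { join } from 'node:path';

const MODEL_DIR = './models/phi-4';
const VARIANT = 'cpu-int4-rtn-block-32-acc-level-4';

// Return the path to a local model file, downloading it first if necessary.
export function ensureLocalModel(): string {
  const modelFile = join(MODEL_DIR, VARIANT, 'model.onnx');
  if (!existsSync(modelFile)) {
    // Development fallback (Option 2): download on first run.
    // Docker images should bake the files in instead (Option 1 / Section 10).
    execFileSync(
      'huggingface-cli',
      [
        'download', 'microsoft/Phi-4-mini-instruct-onnx',
        '--include', `${VARIANT}/*`,
        '--local-dir', MODEL_DIR,
      ],
      { stdio: 'inherit' },
    );
  }
  return modelFile;
}
```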
## Next Steps

1. ✅ Research complete - onnxruntime-node confirmed as the best option
2. 🔄 Download the Phi-4 model files
3. ⏳ Implement ONNXProvider with onnxruntime-node
4. ⏳ Create the tokenizer integration
5. ⏳ Test in the Docker CPU environment
6. ⏳ Benchmark performance
7. ⏳ Add GPU support (CUDA/DirectML)

## Resources

- ONNX Runtime Node.js: https://onnxruntime.ai/docs/api/js/
- Phi-4 Model: https://huggingface.co/microsoft/Phi-4-mini-instruct-onnx
- ONNX Runtime GenAI: https://github.com/microsoft/onnxruntime-genai (Python reference)
- HuggingFace Tokenizers: https://www.npmjs.com/package/@huggingface/tokenizers

## Conclusion

**onnxruntime-node is the correct choice** for implementing Phi-4 inference in agentic-flow:

- Official Microsoft library
- Best performance for Node.js
- CPU and GPU support
- Production-ready

**Note**: We will need to implement tokenization and the generation loop ourselves, since onnxruntime-genai (which provides both out of the box) has no Node.js bindings.
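For reference, the generation loop that onnxruntime-genai's `Generator` would otherwise provide looks roughly like the greedy-decoding sketch below. The input/output names (`input_ids`, `attention_mask`, `logits`), the EOS token id, and the lack of KV-cache reuse are simplifying assumptions to verify against the actual Phi-4 ONNX graph; a production loop should feed the model's past-key-value inputs instead of re-running the full sequence each step.

```typescript
import * as ort from 'onnxruntime-node';

async function greedyGenerate(
  session: ort.InferenceSession,
  promptIds: number[],
  eosTokenId: number,
  maxNewTokens = 128,
): Promise<number[]> {
  const ids = [...promptIds];
  for (let step = 0; step < maxNewTokens; step++) {
    const seqLen = ids.length;
    const feeds = {
      input_ids: new ort.Tensor(
        'int64',
        BigInt64Array.from(ids.map((i) => BigInt(i))),
        [1, seqLen],
      ),
      attention_mask: new ort.Tensor(
        'int64',
        new BigInt64Array(seqLen).fill(BigInt(1)),
        [1, seqLen],
      ),
    };
    const out = await session.run(feeds);
    // Assumed logits shape: [1, seqLen, vocabSize]; take argmax of the last position.
    const logits = out.logits.data as Float32Array;
    const vocabSize = out.logits.dims[2];
    const last = logits.subarray((seqLen - 1) * vocabSize);
    let next = 0;
    for (let v = 1; v < vocabSize; v++) {
      if (last[v] > last[next]) next = v;
    }
    if (next === eosTokenId) break;
    ids.push(next);
  }
  // Return only the newly generated token ids.
  return ids.slice(promptIds.length);
}
```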