tasq/node_modules/agentic-flow/docs/archive/TOOL-EMULATION-INTEGRATION-ISSUE.md

19 KiB
Raw Blame History

🔧 Tool Emulation for Non-Tool Models - Phase 2 Integration

Issue Type: Feature Enhancement Priority: Medium Effort: ~8-12 hours Version: 1.3.0 (proposed) Status: Ready for Implementation


📋 Summary

Enable Claude Code and agentic-flow to work with ANY model (even those without native function calling support) by implementing automatic tool emulation. This will achieve 99%+ cost savings while maintaining 70-85% functionality.

Current Status: Phase 1 Complete

  • Architecture designed and validated
  • Tool emulation code implemented (src/proxy/tool-emulation.ts, src/utils/modelCapabilities.ts)
  • All regression tests pass (15/15)
  • Zero breaking changes confirmed

Next Step: Phase 2 Integration

  • Connect emulation layer to OpenRouter proxy
  • Add capability detection to CLI
  • Test with real non-tool models
  • Deploy to production

🎯 Problem Statement

Current Limitation

Claude Code and agentic-flow currently require models with native tool/function calling support:

Works: DeepSeek Chat, Claude 3.5 Sonnet, GPT-4o, Llama 3.3 70B Fails: Mistral 7B, Llama 2 13B, GLM-4-9B (free), older models

When using non-tool models:

  • Tools are ignored
  • Model responds with plain text
  • No file operations, bash commands, or MCP tool usage possible

Impact

Users are forced to use expensive models:

  • Claude 3.5 Sonnet: $3-15/M tokens
  • GPT-4o: $2.50/M tokens

Even though cheaper/free alternatives exist:

  • Mistral 7B: $0.07/M tokens (97.7% cheaper)
  • GLM-4-9B: FREE (100% savings)

Solution: Automatic Tool Emulation

Implement transparent tool emulation that:

  1. Detects when a model lacks native tool support
  2. Converts tool definitions into structured prompts
  3. Parses model responses for tool calls
  4. Executes tools and continues conversation
  5. Returns results in standard Anthropic format

Two Strategies

ReAct Pattern (70-85% reliability):

  • Best for: Complex tasks, 32k+ context
  • Structured reasoning: Thought → Action → Observation → Final Answer
  • Used by: Mistral 7B, GLM-4-9B, newer models

Prompt-Based (50-70% reliability):

  • Best for: Simple tasks, <8k context
  • Direct JSON tool invocation
  • Used by: Llama 2 13B, older models

📦 Phase 1 Complete (Validation)

Files Implemented

Core Implementation (~22KB):

  • src/utils/modelCapabilities.ts - Capability detection for 15+ models
  • src/proxy/tool-emulation.ts - ReAct and Prompt emulation logic

Testing & Documentation (~51KB):

  • examples/tool-emulation-demo.ts - Offline demonstration
  • examples/tool-emulation-test.ts - Real API testing script
  • examples/regression-test.ts - 15-test regression suite
  • examples/test-claude-code-emulation.ts - Claude Code simulation
  • examples/TOOL-EMULATION-ARCHITECTURE.md - Technical documentation
  • examples/REGRESSION-TEST-RESULTS.md - Test results
  • examples/VALIDATION-SUMMARY.md - High-level overview
  • examples/PHASE-2-INTEGRATION-GUIDE.md - Integration instructions

Validation Results

Regression Tests: 15/15 passed (100%)

Category Status
Code Isolation Not imported in main codebase
TypeScript Compilation Clean build with zero errors
Model Detection Correctly identifies native vs emulation
Proxy Integrity Tool names/schemas unchanged
Backward Compatibility All 67 agents work

Key Validation: Confirmed that proxy does NOT rewrite tool names or schemas - they pass through unchanged. Tool emulation is completely isolated.


🚀 Phase 2 Tasks (Integration)

Task 1: Add Capability Detection to CLI (1-2 hours)

File: src/cli-proxy.ts

Changes:

  1. Import capability detection at top of file
  2. Detect capabilities when initializing OpenRouter proxy
  3. Log emulation status to console
  4. Pass capabilities to proxy constructor

Code Location: Around line 307-347 (OpenRouter proxy initialization)

Implementation:

import { detectModelCapabilities } from './utils/modelCapabilities.js';

// In startOpenRouterProxy function:
const model = options.model || process.env.COMPLETION_MODEL || 'mistralai/mistral-small-3.1-24b-instruct';
const capabilities = detectModelCapabilities(model);

if (capabilities.requiresEmulation) {
  console.log(`\n⚙  Detected: Model lacks native tool support`);
  console.log(`🔧 Using ${capabilities.emulationStrategy.toUpperCase()} emulation pattern`);
  console.log(`📊 Expected reliability: ${capabilities.emulationStrategy === 'react' ? '70-85%' : '50-70%'}\n`);
}

// Pass to proxy constructor
const proxy = new AnthropicToOpenRouterProxy({
  apiKey: openRouterKey,
  defaultModel: model,
  capabilities: capabilities  // NEW
});

Test After:

# Should show native tools message
npx agentic-flow --agent coder --task "test" --provider openrouter --model "deepseek/deepseek-chat"

# Should show emulation message
npx agentic-flow --agent coder --task "test" --provider openrouter --model "mistralai/mistral-7b-instruct"

Task 2: Update OpenRouter Proxy Constructor (1 hour)

File: src/proxy/anthropic-to-openrouter.ts

Changes:

  1. Add imports for tool emulation
  2. Add capabilities field to class
  3. Update constructor to accept capabilities parameter
  4. Initialize (but don't use yet) emulation flag

Code Location: Around line 58-120 (class definition and constructor)

Implementation:

import { ModelCapabilities } from '../utils/modelCapabilities.js';

export class AnthropicToOpenRouterProxy {
  private capabilities?: ModelCapabilities;

  constructor(config: {
    apiKey: string;
    defaultModel?: string;
    baseURL?: string;
    siteName?: string;
    siteURL?: string;
    capabilities?: ModelCapabilities;  // NEW
  }) {
    // ... existing code ...
    this.capabilities = config.capabilities;
  }
}

Test After:

npm run build
# Should compile with no errors

# Test existing functionality
npx agentic-flow --agent coder --task "What is 2+2?" --provider openrouter --model "deepseek/deepseek-chat"
# Should work exactly as before

Task 3: Regression Test After Constructor Change (30 min)

Run:

npm run build
npx tsx examples/regression-test.ts

Expected: All 15 tests pass

If any test fails: Revert changes and debug before continuing


Task 4: Add Emulation Request Handler (3-4 hours)

File: src/proxy/anthropic-to-openrouter.ts

Changes:

  1. Import tool emulation utilities
  2. Split existing request handler into two methods
  3. Add emulation-specific request handler
  4. Add tool execution stub (returns error for now)

Code Location: Request handling logic (around line 200-400)

Implementation:

import { ToolEmulator, executeEmulation, ToolCall } from './tool-emulation.js';
import { detectModelCapabilities } from '../utils/modelCapabilities.js';

// In request handler (around line 250):
private async handleAnthropicRequest(anthropicReq: AnthropicRequest): Promise<any> {
  const model = anthropicReq.model || this.defaultModel;
  const capabilities = this.capabilities || detectModelCapabilities(model);

  // Check if emulation is needed
  if (capabilities.requiresEmulation && anthropicReq.tools && anthropicReq.tools.length > 0) {
    logger.info(`Using tool emulation for model: ${model}`);
    return this.handleEmulatedRequest(anthropicReq, capabilities);
  }

  // Existing path (native tool support)
  return this.handleNativeRequest(anthropicReq);
}

private async handleNativeRequest(anthropicReq: AnthropicRequest): Promise<any> {
  // Move existing request handling code here
  // This is the current logic - no changes needed
}

private async handleEmulatedRequest(
  anthropicReq: AnthropicRequest,
  capabilities: ModelCapabilities
): Promise<any> {
  const emulator = new ToolEmulator(
    anthropicReq.tools || [],
    capabilities.emulationStrategy as 'react' | 'prompt'
  );

  // Extract user message
  const lastMessage = anthropicReq.messages[anthropicReq.messages.length - 1];
  const userMessage = this.extractMessageText(lastMessage);

  // Execute emulation
  const result = await executeEmulation(
    emulator,
    userMessage,
    async (prompt) => {
      // Call model with prompt
      const openaiReq = this.buildOpenAIRequest(anthropicReq, prompt);
      const response = await this.callOpenRouterAPI(openaiReq);
      return response.choices[0].message.content;
    },
    async (toolCall) => {
      // Tool execution - stub for now
      logger.warn(`Tool execution not yet implemented: ${toolCall.name}`);
      return { error: 'Tool execution not implemented' };
    },
    {
      maxIterations: 5,
      verbose: process.env.VERBOSE === 'true'
    }
  );

  // Convert to Anthropic format
  return this.formatEmulationResult(result, anthropicReq);
}

private extractMessageText(message: AnthropicMessage): string {
  if (typeof message.content === 'string') {
    return message.content;
  }
  return message.content.find(c => c.type === 'text')?.text || '';
}

private formatEmulationResult(result: any, originalReq: AnthropicRequest): any {
  return {
    id: `emulated_${Date.now()}`,
    type: 'message',
    role: 'assistant',
    content: [{
      type: 'text',
      text: result.finalAnswer || 'No response generated'
    }],
    model: originalReq.model || this.defaultModel,
    stop_reason: 'end_turn',
    usage: {
      input_tokens: 0,
      output_tokens: 0
    }
  };
}

Test After:

npm run build

# Test native tools still work
npx agentic-flow --agent coder --task "What is 2+2?" \
  --provider openrouter --model "deepseek/deepseek-chat"

# Test emulation path (will have limited functionality)
npx agentic-flow --agent coder --task "What is 5*5?" \
  --provider openrouter --model "mistralai/mistral-7b-instruct"

Task 5: Test Non-Tool Model Emulation (1-2 hours)

Requirements:

  • OpenRouter API key set: export OPENROUTER_API_KEY="sk-or-..."

Test Cases:

# Test 1: Simple math (should work even without tools)
npx agentic-flow --agent coder \
  --task "Calculate 15 * 23" \
  --provider openrouter \
  --model "mistralai/mistral-7b-instruct"

# Expected: Emulation message shown, model responds with answer

# Test 2: Verify native tools unaffected
npx agentic-flow --agent coder \
  --task "Calculate 100 / 4" \
  --provider openrouter \
  --model "deepseek/deepseek-chat"

# Expected: No emulation message, standard tool use

# Test 3: Free model (GLM-4-9B)
npx agentic-flow --agent researcher \
  --task "What is machine learning?" \
  --provider openrouter \
  --model "thudm/glm-4-9b:free"

# Expected: Emulation message, response generated

Validation Checklist:

  • Emulation message appears for non-tool models
  • Native tool models work unchanged
  • No errors during request processing
  • Responses are coherent
  • Build succeeds with no warnings

Task 6: Run Full Regression Suite (30 min)

npm run build
npx tsx examples/regression-test.ts

Expected: All 15 tests still pass

If tests fail:

  1. Check TypeScript compilation errors
  2. Verify imports are correct
  3. Ensure backward compatibility maintained
  4. Review changes and revert if needed

Task 7: Update Documentation (1 hour)

Files to Update:

  1. README.md: Add section on tool emulation
  2. CHANGELOG.md: Document v1.3.0 changes
  3. examples/TOOL-EMULATION-ARCHITECTURE.md: Update status from "Phase 1" to "Phase 2 Complete"

Changelog Entry:

## [1.3.0] - 2025-10-07

### Added
- 🔧 **Tool Emulation for Non-Tool Models**: Automatically enables tool use for models without native function calling
  - ReAct pattern for complex tasks (70-85% reliability)
  - Prompt-based pattern for simple tasks (50-70% reliability)
  - Automatic capability detection for 15+ models
  - Supports Mistral 7B, Llama 2, GLM-4-9B (FREE), and more
  - Achieves 99%+ cost savings vs Claude 3.5 Sonnet

### Technical
- Added `src/utils/modelCapabilities.ts` - Model capability detection
- Added `src/proxy/tool-emulation.ts` - ReAct and Prompt emulation
- Modified `src/cli-proxy.ts` - Capability detection integration
- Modified `src/proxy/anthropic-to-openrouter.ts` - Emulation request handler
- Added comprehensive test suite (15 regression tests)

### Backward Compatibility
- ✅ Zero breaking changes
- ✅ Native tool models work unchanged
- ✅ All 67 agents functional
- ✅ Claude Code integration unaffected

🧪 Testing Strategy

Automated Tests

  1. Regression Tests (15 tests):

    npx tsx examples/regression-test.ts
    
    • Must pass 15/15 before and after each change
  2. Emulation Demo (offline):

    npx tsx examples/tool-emulation-demo.ts
    
    • Validates architecture without API calls
  3. Build Verification:

    npm run build
    
    • Must succeed with zero errors

Manual Tests

  1. Native Tool Model (baseline):

    npx agentic-flow --agent coder --task "What is 2+2?" \
      --provider openrouter --model "deepseek/deepseek-chat"
    
  2. Non-Tool Model (emulation):

    npx agentic-flow --agent coder --task "Calculate 5*5" \
      --provider openrouter --model "mistralai/mistral-7b-instruct"
    
  3. Free Model:

    npx agentic-flow --agent researcher --task "Explain AI" \
      --provider openrouter --model "thudm/glm-4-9b:free"
    
  4. Claude Code Integration:

    npx agentic-flow claude-code --provider openrouter \
      --model "mistralai/mistral-7b-instruct" \
      "Write a hello world function"
    

Validation Criteria

Must Pass:

  • All 15 regression tests pass
  • TypeScript builds without errors
  • Native tool models work unchanged
  • Emulation message appears for non-tool models
  • No runtime errors or crashes

⚠️ Expected Limitations:

  • Tool execution not yet implemented (Phase 3)
  • Emulation reliability 70-85% (lower than native 95%+)
  • No streaming support for emulated requests

📊 Success Metrics

Technical Metrics

  • Zero regressions (15/15 tests pass)
  • Clean TypeScript build
  • Emulation detection working
  • Tool execution integrated (Phase 3)

User Metrics

  • Users can select Mistral 7B and see emulation message
  • Cost savings: 97-99% vs Claude 3.5 Sonnet
  • Model options increase from ~10 to 100+

Performance Metrics

  • Native tools: 95-99% reliability (unchanged)
  • ReAct emulation: 70-85% reliability
  • Prompt emulation: 50-70% reliability

🚧 Known Limitations (Phase 2)

  1. No Tool Execution Yet: Emulation detects tool calls but can't execute them

    • Impact: Models will attempt to use tools but get error responses
    • Fix: Phase 3 - Integrate with MCP tool execution system
  2. No Streaming: Emulation uses multi-iteration loop, can't stream

    • Impact: Responses come all at once, no progressive updates
    • Fix: Phase 3 - Implement partial streaming
  3. Context Window Constraints: Small models can't handle 218 tools

    • Impact: Models with <32k context may fail with full tool catalog
    • Fix: Phase 3 - Tool filtering based on task relevance
  4. Lower Reliability: 70-85% vs 95%+ for native tools

    • Impact: Some tool calls may be missed or malformed
    • Fix: Inherent limitation - use native tool models for critical tasks

🔮 Future Enhancements (Phase 3+)

Phase 3: Tool Execution Integration (4-6 hours)

  • Connect emulation loop to MCP tool execution
  • Implement tool result handling
  • Add error recovery mechanisms

Phase 4: Optimization (3-4 hours)

  • Tool filtering based on task relevance (embeddings)
  • Prompt caching to reduce token usage
  • Parallel tool execution where possible

Phase 5: Advanced Features (6-8 hours)

  • Streaming support for emulated requests
  • Hybrid routing (tool model for decisions, cheap model for text)
  • Fine-tuning adapters for specific emulation patterns
  • Auto-switching strategies based on failure detection

📁 Files Modified/Created

Created (Phase 1 - Complete)

  • src/utils/modelCapabilities.ts (~8KB)
  • src/proxy/tool-emulation.ts (~14KB)
  • examples/tool-emulation-demo.ts (~6KB)
  • examples/tool-emulation-test.ts (~8KB)
  • examples/regression-test.ts (~7KB)
  • examples/test-claude-code-emulation.ts (~8KB)
  • examples/TOOL-EMULATION-ARCHITECTURE.md (~18KB)
  • examples/REGRESSION-TEST-RESULTS.md (~12KB)
  • examples/VALIDATION-SUMMARY.md (~10KB)
  • examples/PHASE-2-INTEGRATION-GUIDE.md (~12KB)

To Modify (Phase 2)

  • src/cli-proxy.ts - Add capability detection
  • src/proxy/anthropic-to-openrouter.ts - Add emulation handler
  • README.md - Document tool emulation
  • CHANGELOG.md - Add v1.3.0 entry
  • package.json - Bump version to 1.3.0

  • Related to: Cost optimization efforts
  • Related to: OpenRouter integration
  • Addresses: User requests for cheaper model options
  • Enables: Free tier usage (GLM-4-9B, Gemini Flash)

👥 Assignee Notes

Prerequisites

  • Phase 1 complete and validated
  • All regression tests passing
  • Architecture documented
  • OpenRouter API key for testing

Implementation Order

  1. Task 1: CLI capability detection (safest, easy to test)
  2. Task 2: Proxy constructor update (no behavior change yet)
  3. Test checkpoint: Run regression tests
  4. Task 4: Emulation handler (main integration)
  5. Test checkpoint: Verify native tools still work
  6. Task 5: Manual testing with non-tool models
  7. Task 6: Full regression suite
  8. Task 7: Documentation updates

Testing Strategy

  • Test after EVERY change
  • Run regression suite at checkpoints
  • Keep changes small and incremental
  • Commit working state before risky changes

Rollback Plan

If issues arise:

  1. Revert last commit
  2. Run regression tests to confirm stability
  3. Debug in isolation before re-attempting
  4. All changes are non-breaking by design

📝 Acceptance Criteria

Phase 2 Complete When:

  • Capability detection integrated into CLI
  • OpenRouter proxy accepts capabilities parameter
  • Emulation request handler implemented
  • All 15 regression tests pass
  • Native tool models work unchanged
  • Emulation message appears for non-tool models
  • TypeScript builds with zero errors
  • Documentation updated (README, CHANGELOG)
  • Manual testing completed successfully
  • Code reviewed and approved
  • Merged to main branch
  • Version bumped to 1.3.0

Success Indicators:

# This should work and show emulation
$ npx agentic-flow --agent coder --task "Calculate 15*23" \
    --provider openrouter --model "mistralai/mistral-7b-instruct"

⚙️  Detected: Model lacks native tool support
🔧 Using REACT emulation pattern
📊 Expected reliability: 70-85%
⏳ Running...

[Response generated using emulation]

🏁 Summary

Phase 1: Complete (Architecture + Validation) Phase 2: Ready to Implement (Integration) Phase 3: 📋 Planned (Tool Execution)

Estimated Total Effort: 8-12 hours for Phase 2 Risk Level: Low (all changes are non-breaking and incrementally testable) Benefits: 99%+ cost savings, access to 100+ models, FREE tier support

Ready to Start: All prerequisites met, architecture validated, regression suite in place.


Created: 2025-10-07 Last Updated: 2025-10-07 Status: Ready for Implementation Assignee: TBD Reviewer: TBD