LLM Router: High-Performance Semantic Orchestration
The Challenge
Most LLM routing systems are just if-else wrappers around OpenAI calls. They work for demos but fail under production constraints where milliseconds matter and cost scales with every query. The problem: how do you intelligently route queries to appropriate models while keeping overhead under 50ms?
The Solution
I built LLM-Router, a production-grade microservice implementing a Hybrid Routing Architecture that balances latency and semantic accuracy.
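A minimal sketch of the hybrid two-level idea: try a cheap deterministic strategy first, and fall through to the next strategy (or a default) on a miss. The class names mirror the project's, but the interfaces, patterns, and model names here are illustrative assumptions, not the repo's actual code.

```python
import re
from typing import Optional, Protocol

class RoutingStrategy(Protocol):
    """Sketch of a routing-strategy protocol: return a model name, or None on a miss."""
    def route(self, query: str) -> Optional[str]: ...

class KeywordRoutingStrategy:
    """L1 fast path: deterministic regex matching (patterns here are made up)."""
    def __init__(self, patterns: dict):
        # Map compiled regex -> target model; dict order sets match priority.
        self._compiled = {re.compile(p, re.I): model for p, model in patterns.items()}

    def route(self, query: str):
        for pattern, model in self._compiled.items():
            if pattern.search(query):
                return model
        return None  # miss -> caller falls through to the next strategy

class HybridRouter:
    """Run strategies in order; the first non-None answer wins."""
    def __init__(self, strategies):
        self.strategies = strategies

    def route(self, query: str, default: str = "llama-3.2-3b"):
        for strategy in self.strategies:
            model = strategy.route(query)
            if model is not None:
                return model
        return default

l1 = KeywordRoutingStrategy({
    r"\b(summar|tl;?dr)\w*": "tinyllama",          # cheap tasks -> small model
    r"\b(prove|derive|explain)\b": "llama-3.2-3b",  # reasoning -> larger model
})
router = HybridRouter([l1])
print(router.route("tl;dr this article"))  # -> tinyllama
```

In the real system the L2 semantic strategy would sit after the keyword strategy in the same list, so the fast path always gets first refusal.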
Core Components
- L1: Keyword Router (Fast Path): Deterministic regex matching in `KeywordRoutingStrategy` with <1ms latency. Handles 90% of queries using pattern dictionaries loaded from Hydra configs.
- L2: Semantic Router (Fallback): When L1 fails, `SemanticRoutingStrategy` uses `all-MiniLM-L6-v2` embeddings with cosine similarity (~30ms latency). Provides intelligent routing for complex queries that don't match keyword patterns.
- Protocol-Oriented Design: `RoutingStrategy` and `LLMProvider` protocols in `src/core/protocols.py` enable swappable implementations. Currently supports Ollama with asymmetric model deployment (Llama-3.2-3B for reasoning, TinyLlama for speed).
- Forensic Benchmarking: `src/forensics/benchmark_suite.py` systematically tests routing decisions across model configs and query types, visualizing latency-accuracy trade-offs as heatmaps.
- DPO Flywheel: `src/forensics/dpo_pipeline.py` logs routing decisions as `{prompt, chosen, rejected}` triplets, enabling Direct Preference Optimization from production traffic without a separate reward model.
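The flywheel's data format can be sketched as a simple JSONL appender. The `{prompt, chosen, rejected}` schema comes from the description above; the extra timestamp field, the function name, and the JSONL layout are assumptions for illustration.

```python
import json
import time

def log_dpo_triplet(path, prompt, chosen, rejected):
    """Append one preference triplet as a JSONL record (illustrative sketch)."""
    record = {
        "prompt": prompt,      # the query that was routed
        "chosen": chosen,      # the route/response preferred in production
        "rejected": rejected,  # the alternative that lost
        "ts": time.time(),     # assumption: timestamp for later filtering
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_dpo_triplet("dpo_log.jsonl",
                prompt="Summarize this changelog",
                chosen="tinyllama",
                rejected="llama-3.2-3b")
```

Because each line is independent, the log can be tailed straight into a DPO training job without a conversion step.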
The Impact
This project demonstrates that routing doesn’t need complex ML. Start with rules (keywords), add semantic fallback for edge cases, and collect data for continuous improvement. The forensic approach (measure everything, optimize bottlenecks) applies beyond LLM routing to any latency-constrained system.
Key Insight: Hybrid routing keeps P50 latency under 50ms while maintaining routing accuracy. The DPO flywheel creates a continuous improvement loop where better routing generates better training data.
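A P50 claim like this is cheap to verify. A minimal latency harness, assuming nothing beyond a callable router (this is a sketch, not the project's `benchmark_suite.py`):

```python
import statistics
import time

def p50_latency_ms(route_fn, queries, repeats=200):
    """Median (P50) routing latency in milliseconds over repeated calls."""
    samples = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            route_fn(query)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Trivial stand-in router: a dict lookup with a default model.
routes = {"summarize": "tinyllama"}
p50 = p50_latency_ms(lambda q: routes.get(q.split()[0], "llama-3.2-3b"),
                     ["summarize this", "prove that"])
print(f"P50 routing latency: {p50:.4f} ms")
```

Reporting the median rather than the mean keeps occasional slow calls (e.g. a cold semantic-embedding path) from masking typical fast-path behavior.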