LLM Gateway MCP Server
A Model Context Protocol (MCP) server enabling intelligent delegation from high-capability AI agents to cost-effective LLMs
Getting Started • Key Features • Usage Examples • Architecture
What is LLM Gateway?
LLM Gateway is an MCP-native server that enables intelligent task delegation from advanced AI agents like Claude 3.7 Sonnet to more cost-effective models like Gemini 2.0 Flash Lite. It provides a unified interface to multiple Large Language Model (LLM) providers while optimizing for cost, performance, and quality.
The Vision: AI-Driven Resource Optimization
At its core, LLM Gateway represents a fundamental shift in how we interact with AI systems. Rather than using a single expensive model for all tasks, it enables an intelligent hierarchy where:
Advanced models like Claude 3.7 focus on high-level reasoning, orchestration, and complex tasks
Cost-effective models handle routine processing, extraction, and mechanical tasks
The overall system achieves near-top-tier performance at a fraction of the cost
This approach mirrors how human organizations work: specialists handle complex decisions while routine tasks are delegated to those with the right skills for them.
MCP-Native Architecture
The server is built on the Model Context Protocol (MCP), making it specifically designed to work with AI agents like Claude. All functionality is exposed through MCP tools that can be directly called by these agents, creating a seamless workflow for AI-to-AI delegation.
Primary Use Case: AI Agent Task Delegation
The primary design goal of LLM Gateway is to allow sophisticated AI agents like Claude 3.7 Sonnet to intelligently delegate tasks to less expensive models:
Example workflow:
Claude identifies that a document needs to be summarized (an expensive operation with Claude)
Claude delegates this task to LLM Gateway via MCP tools
LLM Gateway routes the summarization task to Gemini Flash (10-20x cheaper than Claude)
The summary is returned to Claude for higher-level reasoning and decision-making
Claude can then focus its capabilities on tasks that truly require its intelligence
This delegation pattern can save 70-90% on API costs while maintaining output quality.
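As a rough illustration, the MCP call behind steps 2-3 might look like the sketch below. It assumes a connected MCP client object named client (see Tool Usage Examples later in this document) and an illustrative summarize_document tool; the actual tool names and parameters may differ.

```python
# Hypothetical sketch of a delegation call; the tool name, arguments, and
# result shape are assumptions, not the project's confirmed API.
async def delegate_summarization(client, document_text: str):
    result = await client.call_tool(
        "summarize_document",              # assumed tool name
        {
            "document": document_text,
            "provider": "gemini",          # route to a cheaper provider
            "model": "gemini-2.0-flash-lite",
            "max_length": 300,             # assumed parameter: summary length hint
        },
    )
    return result  # the summary goes back to Claude for higher-level reasoning
```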
Why Use LLM Gateway?
🔄 AI-to-AI Task Delegation
The most powerful use case is enabling advanced AI agents to delegate routine tasks to cheaper models:
Have Claude 3.7 use GPT-4o-mini for initial document summarization
Let Claude use Gemini 2.0 Flash Lite for data extraction and transformation
Allow Claude to orchestrate a multi-stage workflow across different providers
Enable Claude to choose the right model for each specific sub-task
💰 Cost Optimization
API costs for advanced models can be substantial. LLM Gateway helps reduce costs by:
Routing appropriate tasks to cheaper models (e.g., $0.01/1K tokens vs $0.15/1K tokens)
Implementing advanced caching to avoid redundant API calls
Tracking and optimizing costs across providers
Enabling cost-aware task routing decisions
🔄 Provider Abstraction
Avoid provider lock-in with a unified interface:
Standard API for OpenAI, Anthropic (Claude), Google (Gemini), and DeepSeek
Consistent parameter handling and response formatting
Ability to swap providers without changing application code
Protection against provider-specific outages and limitations
📄 Document Processing at Scale
Process large documents efficiently:
Break documents into semantically meaningful chunks
Process chunks in parallel across multiple models
Extract structured data from unstructured text
Generate summaries and insights from large texts
Key Features
MCP Protocol Integration
Native MCP Server: Built on the Model Context Protocol for AI agent integration
MCP Tool Framework: All functionality exposed through standardized MCP tools
Tool Composition: Tools can be combined for complex workflows
Tool Discovery: Support for tool listing and capability discovery
Intelligent Task Delegation
Task Routing: Analyze tasks and route to appropriate models
Provider Selection: Choose provider based on task requirements
Cost-Performance Balancing: Optimize for cost, quality, or speed
Delegation Tracking: Monitor delegation patterns and outcomes
Advanced Caching
Multi-level Caching: Multiple caching strategies:
Exact match caching
Semantic similarity caching
Task-aware caching
Persistent Cache: Disk-based persistence with fast in-memory access
Cache Analytics: Track savings and hit rates
Document Tools
Smart Chunking: Multiple chunking strategies:
Token-based chunking
Semantic boundary detection
Structural analysis
Document Operations:
Summarization
Entity extraction
Question generation
Batch processing
Structured Data Extraction
JSON Extraction: Extract structured JSON with schema validation
Table Extraction: Extract tables in multiple formats
Key-Value Extraction: Extract key-value pairs from text
Semantic Schema Inference: Generate schemas from text
Tournament Mode
Code and Text Competitions: Support for running tournament-style competitions
Multiple Models: Compare outputs from different models simultaneously
Performance Metrics: Evaluate and track model performance
Results Storage: Persist tournament results for further analysis
Advanced Vector Operations
Semantic Search: Find semantically similar content across documents
Vector Storage: Efficient storage and retrieval of vector embeddings
Hybrid Search: Combine keyword and semantic search capabilities
Batched Processing: Efficiently process large datasets
Usage Examples
Claude Using LLM Gateway for Document Analysis
This example shows how Claude can use the LLM Gateway to process a document by delegating tasks to cheaper models:
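A sketch of what this workflow might look like, assuming a connected MCP client named client and illustrative chunk_document and summarize_document tool names (the real API may differ):

```python
import asyncio

# Hypothetical sketch: tool names and arguments are assumptions based on the
# document tools described elsewhere in this README.
async def analyze_document(client, document_text: str):
    # 1. Split the document into manageable chunks
    chunks = await client.call_tool(
        "chunk_document",
        {"document": document_text, "chunk_size": 1000, "method": "semantic"},
    )  # assumes the tool returns a list of chunk strings

    # 2. Delegate per-chunk summarization to a cheaper model, in parallel
    summaries = await asyncio.gather(*[
        client.call_tool(
            "summarize_document",
            {"document": chunk, "provider": "gemini", "model": "gemini-2.0-flash-lite"},
        )
        for chunk in chunks
    ])

    # 3. Claude combines the cheap summaries and does the higher-level reasoning
    return summaries
```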
Multi-Provider Comparison for Decision Making
Cost-Optimized Workflow
Document Chunking
To break a large document into smaller, manageable chunks:
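A sketch of the call, assuming a connected client and an illustrative chunk_document tool name; the parameters are assumptions:

```python
# Hypothetical sketch; tool name and parameters are illustrative.
async def chunk_large_document(client, document_text: str):
    return await client.call_tool(
        "chunk_document",
        {
            "document": document_text,
            "chunk_size": 1000,      # assumed: target tokens per chunk
            "overlap": 100,          # assumed: token overlap for context preservation
            "method": "semantic",    # assumed options: token | semantic | structural
        },
    )
```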
Multi-Provider Completion
To get completions for the same prompt from multiple providers/models simultaneously for comparison:
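A sketch, assuming a connected client and an illustrative multi_completion tool name:

```python
# Hypothetical sketch; tool name and argument shape are assumptions.
async def compare_providers(client, prompt: str):
    return await client.call_tool(
        "multi_completion",
        {
            "prompt": prompt,
            "providers": [
                {"provider": "openai", "model": "gpt-4o-mini"},
                {"provider": "gemini", "model": "gemini-2.0-flash-lite"},
                {"provider": "deepseek", "model": "deepseek-chat"},
            ],
            "temperature": 0.7,
        },
    )
```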
Structured Data Extraction (JSON)
To extract information from text into a specific JSON schema:
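A sketch, assuming a connected client and an illustrative extract_json tool; the schema argument name is an assumption:

```python
# Hypothetical sketch; tool name and arguments are illustrative.
async def extract_contact_info(client, text: str):
    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "email": {"type": "string"},
            "company": {"type": "string"},
        },
        "required": ["name"],
    }
    return await client.call_tool(
        "extract_json",
        {
            "text": text,
            "json_schema": schema,       # assumed argument name
            "provider": "openai",
            "model": "gpt-4o-mini",
        },
    )
```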
Retrieval-Augmented Generation (RAG) Query
To ask a question using RAG, where the system retrieves relevant context before generating an answer (assuming relevant documents have been indexed):
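A sketch, assuming a connected client and an illustrative rag_query tool:

```python
# Hypothetical sketch; tool name and parameters are assumptions.
async def ask_with_rag(client, question: str):
    return await client.call_tool(
        "rag_query",
        {
            "query": question,
            "top_k": 5,                      # assumed: number of retrieved passages
            "provider": "gemini",            # model used to generate the final answer
            "model": "gemini-2.0-flash-lite",
        },
    )
```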
Fused Search (Keyword + Semantic)
To perform a hybrid search combining keyword relevance and semantic similarity using Marqo:
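A sketch, assuming a connected client, an illustrative fused_search tool, and an existing Marqo index:

```python
# Hypothetical sketch; tool name, index name, and weighting parameters are
# assumptions.
async def hybrid_search(client, query: str):
    return await client.call_tool(
        "fused_search",
        {
            "query": query,
            "index": "my_documents",     # assumed: name of an existing Marqo index
            "keyword_weight": 0.3,       # assumed: balance between keyword and
            "semantic_weight": 0.7,      #          semantic relevance
            "limit": 10,
        },
    )
```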
Local Text Processing
To perform local, offline text operations without calling an LLM API:
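A sketch, assuming a connected client and an illustrative process_text_local tool; the operation names are assumptions:

```python
# Hypothetical sketch; no LLM API is called for this tool.
async def clean_text_locally(client, raw_text: str):
    return await client.call_tool(
        "process_text_local",
        {
            "text": raw_text,
            "operations": [              # assumed operation names
                "strip_html",
                "normalize_whitespace",
                "lowercase",
            ],
        },
    )
```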
Running a Model Tournament
To compare the outputs of multiple models on a specific task (e.g., code generation):
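A sketch, assuming a connected client and an illustrative run_tournament tool:

```python
# Hypothetical sketch; tool name and argument shape are assumptions.
async def run_code_tournament(client, task_prompt: str):
    return await client.call_tool(
        "run_tournament",
        {
            "prompt": task_prompt,
            "tournament_type": "code",    # assumed: "code" or "text"
            "models": [
                {"provider": "openai", "model": "gpt-4o-mini"},
                {"provider": "deepseek", "model": "deepseek-chat"},
                {"provider": "gemini", "model": "gemini-2.0-flash-lite"},
            ],
            "rounds": 1,                  # assumed parameter
        },
    )
```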
(More tool examples can be added here...)
Getting Started
Installation
Environment Setup
Create a .env file with your API keys:
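A minimal sketch of such a file; the API key variable names are assumptions and should be checked against the project's configuration reference, while SERVER_HOST, SERVER_PORT, and LOG_LEVEL are described under Advanced Configuration below:

```
# Provider API keys (variable names are illustrative)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
DEEPSEEK_API_KEY=...

# Optional server settings (see Advanced Configuration)
SERVER_HOST=127.0.0.1
SERVER_PORT=8013
LOG_LEVEL=INFO
```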
Running the Server
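The exact entry point depends on how you installed the gateway; a sketch of the two options mentioned under Deployment Considerations (the Python module name is an assumption):

```bash
# Run directly (module name is illustrative)
python -m llm_gateway

# Or via Docker Compose
docker compose up
```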
Once running, the server will be available at http://localhost:8013.
Advanced Configuration
While the .env file is convenient for basic setup, the LLM Gateway offers more detailed configuration options, primarily managed through environment variables.
Server Configuration
SERVER_HOST: (Default: 127.0.0.1) The network interface the server listens on. Use 0.0.0.0 to listen on all interfaces (necessary for Docker or external access).
SERVER_PORT: (Default: 8013) The port the server listens on.
API_PREFIX: (Default: /) The URL prefix for the API endpoints.
Logging Configuration
LOG_LEVEL: (Default: INFO) Controls the verbosity of logs. Options: DEBUG, INFO, WARNING, ERROR, CRITICAL.
USE_RICH_LOGGING: (Default: true) Use the Rich library for colorful, formatted console logs. Set to false for plain text logs (better for file redirection or some log aggregation systems).
LOG_FORMAT: (Optional) Specify a custom log format string.
LOG_TO_FILE: (Optional, e.g., gateway.log) Path to a file where logs should also be written.
Cache Configuration
CACHE_ENABLED: (Default: true) Enable or disable caching globally.
CACHE_TTL: (Default: 86400 seconds, i.e., 24 hours) Default Time-To-Live for cached items. Specific tools might override this.
CACHE_TYPE: (Default: memory) The type of cache backend. Options might include memory, redis, diskcache. (Note: Check the current implementation for supported types.)
CACHE_MAX_SIZE: (Optional) Maximum number of items or memory size for the cache.
REDIS_URL: (Required if CACHE_TYPE=redis) Connection URL for the Redis cache server (e.g., redis://localhost:6379/0).
Provider Timeouts & Retries
PROVIDER_TIMEOUT: (Default: 120 seconds) Default timeout for requests to LLM provider APIs.
PROVIDER_MAX_RETRIES: (Default: 3) Default number of retries for failed provider requests (e.g., due to temporary network issues or rate limits).
Specific provider timeouts/retries might be configurable via dedicated variables like OPENAI_TIMEOUT, ANTHROPIC_MAX_RETRIES, etc. (Note: Check the current implementation.)
Tool-Specific Configuration
Some tools might have their own specific environment variables for configuration (e.g., MARQO_URL for fused search, or default chunking parameters). Refer to the documentation or source code of individual tools.
Always ensure your environment variables are set correctly before starting the server. Changes often require a server restart.
Deployment Considerations
While running the server directly with python or docker compose up is suitable for development and testing, consider the following for more robust or production deployments:
1. Running as a Background Service
To ensure the gateway runs continuously and restarts automatically on failure or server reboot, use a process manager:
systemd: Create a service unit file (e.g., /etc/systemd/system/llm-gateway.service) to manage the process. This allows commands like sudo systemctl start|stop|restart|status llm-gateway.
supervisor: A popular process control system written in Python. Configure supervisord to monitor and control the gateway process.
Docker Restart Policies: If using Docker (standalone or Compose), configure appropriate restart policies (e.g., unless-stopped or always) in your docker run command or docker-compose.yml file.
2. Using a Reverse Proxy (Nginx/Caddy/Apache)
Placing a reverse proxy in front of the LLM Gateway is highly recommended:
HTTPS/SSL Termination: The proxy can handle SSL certificates (e.g., using Let's Encrypt with Caddy or Certbot with Nginx/Apache), encrypting traffic between clients and the proxy.
Load Balancing: If you need to run multiple instances of the gateway for high availability or performance, the proxy can distribute traffic among them.
Path Routing: Map external paths (e.g., https://api.yourdomain.com/llm-gateway/) to the internal gateway server (http://localhost:8013).
Security Headers: Add important security headers (like CSP, HSTS).
Buffering/Caching: Some proxies offer additional request/response buffering or caching capabilities.
Example Nginx Configuration
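A minimal reverse-proxy sketch for the setup described above, assuming the gateway listens on localhost:8013 and that TLS certificates are provisioned separately (e.g., via Certbot); server names and paths are placeholders:

```nginx
server {
    listen 443 ssl;
    server_name api.yourdomain.com;

    # Certificate paths are placeholders
    ssl_certificate     /etc/letsencrypt/live/api.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/api.yourdomain.com/privkey.pem;

    location /llm-gateway/ {
        proxy_pass http://127.0.0.1:8013/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Helpful if the gateway streams responses
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
```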
3. Container Orchestration (Kubernetes/Swarm)
If deploying in a containerized environment:
Health Checks: Implement and configure health check endpoints (e.g., the /healthz endpoint mentioned earlier) in your deployment manifests so the orchestrator can monitor the service's health.
Configuration: Use ConfigMaps and Secrets (Kubernetes) or equivalent mechanisms to manage environment variables and API keys securely, rather than hardcoding them in images or relying solely on .env files.
Resource Limits: Define appropriate CPU and memory requests/limits for the gateway container to ensure stable performance and prevent resource starvation.
Service Discovery: Utilize the orchestrator's service discovery mechanisms instead of hardcoding IP addresses or hostnames.
4. Resource Allocation
Ensure the host machine or container has sufficient RAM, especially if using in-memory caching or processing large documents/requests.
Monitor CPU usage, particularly under heavy load or when multiple complex operations run concurrently.
Cost Savings With Delegation
Using LLM Gateway for delegation can yield significant cost savings:
| Task | Claude 3.7 Direct | Delegated to Cheaper LLM | Savings |
|------|-------------------|--------------------------|---------|
| Summarizing 100-page document | $4.50 | $0.45 (Gemini Flash) | 90% |
| Extracting data from 50 records | $2.25 | $0.35 (GPT-4o-mini) | 84% |
| Generating 20 content ideas | $0.90 | $0.12 (DeepSeek) | 87% |
| Processing 1,000 customer queries | $45.00 | $7.50 (Mixed delegation) | 83% |
These savings are achieved while maintaining high-quality outputs: Claude focuses on high-level reasoning and orchestration, and mechanical tasks are delegated to cost-effective models.
Why AI-to-AI Delegation Matters
The strategic importance of AI-to-AI delegation extends beyond simple cost savings:
Democratizing Advanced AI Capabilities
By enabling powerful models like Claude 3.7, GPT-4o, and others to delegate effectively, we:
Make advanced AI capabilities accessible at a fraction of the cost
Allow organizations with budget constraints to leverage top-tier AI capabilities
Enable more efficient use of AI resources across the industry
Economic Resource Optimization
AI-to-AI delegation represents a fundamental economic optimization:
Complex reasoning, creativity, and understanding are reserved for top-tier models
Routine data processing, extraction, and simpler tasks go to cost-effective models
The overall system achieves near-top-tier performance at a fraction of the cost
API costs become a controlled expenditure rather than an unpredictable liability
Sustainable AI Architecture
This approach promotes more sustainable AI usage:
Reduces unnecessary consumption of high-end computational resources
Creates a tiered approach to AI that matches capabilities to requirements
Allows experimental work that would be cost-prohibitive with top-tier models only
Creates a scalable approach to AI integration that can grow with business needs
Technical Evolution Path
LLM Gateway represents an important evolution in AI application architecture:
Moving from monolithic AI calls to distributed, multi-model workflows
Enabling AI-driven orchestration of complex processing pipelines
Creating a foundation for AI systems that can reason about their own resource usage
Building toward self-optimizing AI systems that make intelligent delegation decisions
The Future of AI Efficiency
LLM Gateway points toward a future where:
AI systems actively manage and optimize their own resource usage
Higher-capability models serve as intelligent orchestrators for entire AI ecosystems
AI workflows become increasingly sophisticated and self-organizing
Organizations can leverage the full spectrum of AI capabilities in cost-effective ways
This vision of efficient, self-organizing AI systems represents the next frontier in practical AI deployment, moving beyond the current pattern of using single models for every task.
Architecture
How MCP Integration Works
The LLM Gateway is built natively on the Model Context Protocol:
MCP Server Core: The gateway implements a full MCP server
Tool Registration: All capabilities are exposed as MCP tools
Tool Invocation: Claude and other AI agents can directly invoke these tools
Context Passing: Results are returned in MCP's standard format
This ensures seamless integration with Claude and other MCP-compatible agents.
Component Diagram
Request Flow for Delegation
When Claude delegates a task to LLM Gateway:
Claude sends an MCP tool invocation request
The Gateway receives the request via MCP protocol
The appropriate tool processes the request
The caching service checks if the result is already cached
If not cached, the optimization service selects the appropriate provider/model
The provider layer sends the request to the selected LLM API
The response is standardized, cached, and metrics are recorded
The MCP server returns the result to Claude
Detailed Feature Documentation
Provider Integration
Multi-Provider Support: First-class support for:
OpenAI (GPT-4o, GPT-4o mini)
Anthropic (Claude 3.7 series)
Google (Gemini Pro, Gemini Flash, Gemini Flash Lite)
DeepSeek (DeepSeek-Chat, DeepSeek-Reasoner)
Extensible architecture for adding new providers
Model Management:
Automatic model selection based on task requirements
Model performance tracking
Fallback mechanisms for provider outages
Cost Optimization
Intelligent Routing: Automatically selects models based on:
Task complexity requirements
Budget constraints
Performance priorities
Historical performance data
Advanced Caching System:
Multiple caching strategies (exact, semantic, task-based)
Configurable TTL per task type
Persistent cache with fast in-memory lookup
Cache statistics and cost savings tracking
Document Processing
Smart Document Chunking:
Multiple chunking strategies (token-based, semantic, structural)
Overlap configuration for context preservation
Handles very large documents efficiently
Document Operations:
Summarization (with configurable formats)
Entity extraction
Question-answer pair generation
Batch processing with concurrency control
Data Extraction
Structured Data Extraction:
JSON extraction with schema validation
Table extraction (JSON, CSV, Markdown formats)
Key-value pair extraction
Semantic schema inference
Tournament and Benchmarking
Model Competitions:
Run competitions between different models and configurations
Compare code generation capabilities across providers
Generate statistical performance reports
Store competition results for historical analysis
Code Extraction:
Extract clean code from model responses
Analyze and validate extracted code
Support for multiple programming languages
Vector Operations
Embedding Service:
Efficient text embedding generation
Embedding caching to reduce API costs
Batched processing for performance
Semantic Search:
Find semantically similar content
Configurable similarity thresholds
Fast vector operations
Advanced Fused Search (Marqo):
Leverages Marqo for combined keyword and semantic search
Tunable weighting between keyword and vector relevance
Supports complex filtering and faceting
Retrieval-Augmented Generation (RAG)
Contextual Generation:
Augments LLM prompts with relevant retrieved information
Improves factual accuracy and reduces hallucinations
Integrates with vector search and document stores
Workflow Integration:
Seamlessly combine document retrieval with generation tasks
Customizable retrieval and generation strategies
Local Text Processing
Offline Operations:
Provides tools for text manipulation that run locally, without API calls
Includes functions for cleaning, formatting, and basic analysis
Useful for pre-processing text before sending to LLMs or post-processing results
Meta Operations
Introspection and Management:
Tools for querying server capabilities and status
May include functions for managing configurations or tool settings dynamically
Facilitates more complex agent interactions and self-management
System Features
Rich Logging:
Beautiful console output with Rich
Emoji indicators for different operations
Detailed context information
Performance metrics in log entries
Streaming Support:
Consistent streaming interface across all providers
Token-by-token delivery
Cost tracking during stream
Health Monitoring:
Endpoint health checks (/healthz)
Resource usage monitoring
Provider availability tracking
Error rate statistics
Command-Line Interface:
Rich interactive CLI for server management
Direct tool invocation from command line
Configuration management
Cache and server status inspection
Tool Usage Examples
This section provides examples of how an MCP client (like Claude 3.7) would invoke specific tools provided by the LLM Gateway. These examples assume you have an initialized mcp.client.Client instance named client connected to the gateway.
Basic Completion
To get a simple text completion from a chosen provider:
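A sketch, assuming the client object described above and an illustrative generate_completion tool name:

```python
# Hypothetical sketch; tool name and arguments are assumptions.
async def basic_completion(client):
    return await client.call_tool(
        "generate_completion",
        {
            "prompt": "Explain the difference between a list and a tuple in Python.",
            "provider": "openai",
            "model": "gpt-4o-mini",
            "max_tokens": 300,
            "temperature": 0.3,
        },
    )
```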
Document Summarization
To summarize a piece of text, potentially delegating to a cost-effective model:
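A sketch, assuming an illustrative summarize_document tool name:

```python
# Hypothetical sketch; tool name and arguments are assumptions.
async def summarize(client, text: str):
    return await client.call_tool(
        "summarize_document",
        {
            "document": text,
            "provider": "gemini",            # delegate to a cost-effective model
            "model": "gemini-2.0-flash-lite",
            "format": "bullet_points",       # assumed: one of the configurable formats
        },
    )
```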
Entity Extraction
To extract specific types of entities from text:
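A sketch, assuming an illustrative extract_entities tool name:

```python
# Hypothetical sketch; tool name and arguments are assumptions.
async def extract_entities(client, text: str):
    return await client.call_tool(
        "extract_entities",
        {
            "text": text,
            "entity_types": ["person", "organization", "location", "date"],  # assumed
            "provider": "openai",
            "model": "gpt-4o-mini",
        },
    )
```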
Executing an Optimized Workflow
To run a multi-step workflow where the gateway optimizes model selection for each step:
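A sketch, assuming an illustrative execute_optimized_workflow tool and a hypothetical step format:

```python
# Hypothetical sketch; tool name, step format, and parameters are assumptions.
async def run_optimized_workflow(client, document_text: str):
    steps = [
        {"tool": "chunk_document", "params": {"chunk_size": 1000}},
        {"tool": "summarize_document", "params": {"format": "paragraph"}},
        {"tool": "extract_entities", "params": {"entity_types": ["person", "organization"]}},
    ]
    return await client.call_tool(
        "execute_optimized_workflow",
        {
            "input": document_text,
            "steps": steps,
            "optimize_for": "cost",   # assumed options: cost | quality | speed
        },
    )
```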
Listing Available Tools (Meta Tool)
To dynamically discover the tools currently registered and available on the gateway:
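A sketch; most MCP clients expose a tool-listing call, but the exact method name and result shape on this client object are assumptions:

```python
# Hypothetical sketch; method name and result attributes are assumptions.
async def discover_tools(client):
    result = await client.list_tools()
    for tool in result.tools:
        print(f"{tool.name}: {tool.description}")
    return result
```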
Real-World Use Cases
AI Agent Orchestration
Claude or other advanced AI agents can use LLM Gateway to:
Delegate routine tasks to cheaper models
Process large documents in parallel
Extract structured data from unstructured text
Generate drafts for review and enhancement
Enterprise Document Processing
Process large document collections efficiently:
Break documents into meaningful chunks
Distribute processing across optimal models
Extract structured data at scale
Implement semantic search across documents
Research and Analysis
Research teams can use LLM Gateway to:
Compare outputs from different models
Process research papers efficiently
Extract structured information from studies
Track token usage and optimize research budgets
Model Benchmarking and Selection
Organizations can use the tournament features to:
Run controlled competitions between different models
Generate quantitative performance metrics
Make data-driven decisions on model selection
Build custom model evaluation frameworks
Security Considerations
When deploying and operating the LLM Gateway, consider the following security aspects:
API Key Management:
Never hardcode API keys in your source code.
Use environment variables (a .env file for local development; system environment variables or secrets management tools like HashiCorp Vault, AWS Secrets Manager, or GCP Secret Manager for production).
Ensure the .env file (if used) has strict file permissions (readable only by the user running the gateway).
Rotate keys periodically and revoke any suspected compromised keys immediately.
Network Exposure & Access Control:
By default, the server binds to 127.0.0.1, only allowing local connections. Only change SERVER_HOST to 0.0.0.0 if you intend to expose it externally, and ensure proper controls are in place.
Use a reverse proxy (Nginx, Caddy, etc.) to handle incoming connections. This allows you to manage TLS/SSL encryption, apply access controls (e.g., IP allow-listing), and potentially add gateway-level authentication.
Employ firewall rules on the host machine or network to restrict access to the SERVER_PORT only from trusted sources (like the reverse proxy or specific internal clients).
Authentication & Authorization:
The gateway itself may not have built-in user authentication. Access control typically relies on network security (firewalls, VPNs) and potentially authentication handled by a reverse proxy (e.g., Basic Auth, OAuth2 proxy).
Ensure that only authorized clients (like your trusted AI agents or applications) can reach the gateway endpoint.
Rate Limiting & Abuse Prevention:
Implement rate limiting at the reverse proxy level or using dedicated middleware to prevent denial-of-service attacks or excessive API usage (which can incur high costs).
Input Validation:
While LLM inputs are generally text, be mindful if any tools interpret inputs in ways that could lead to vulnerabilities (e.g., if a tool were to execute code based on input). Sanitize or validate inputs where appropriate for the specific tool's function.
Dependency Security:
Regularly update dependencies (uv pip install --upgrade ... or similar) to patch known vulnerabilities in third-party libraries.
Consider using security scanning tools (like pip-audit or GitHub Dependabot alerts) to identify vulnerable dependencies.
Logging:
Be aware that DEBUG level logging might log full prompts and responses, potentially including sensitive information. Configure LOG_LEVEL appropriately for your environment and ensure log files have proper permissions.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgements
Model Context Protocol for the foundation of the API
Rich for beautiful terminal output
Pydantic for data validation
uv for fast and reliable Python package management
All the LLM providers making their models available via API