Transform multiple AI perspectives into superior answers through intelligent synthesis
MoM Service is an OpenAI-compatible API that revolutionizes LLM usage by orchestrating multiple AI models simultaneously. Instead of relying on a single model's perspective, it queries several LLMs in parallel and synthesizes their responses into a single, superior answer using a dedicated "concluding" model.
Think of it as assembling an expert panel: you get the creativity of GPT-5, the reasoning of Claude Sonnet 4.5, and the versatility of Gemini 2.5 Pro—all combined into one comprehensive response that's more reliable and nuanced than any individual model could produce.
In today's AI landscape with hundreds of specialized LLMs, relying on a single model is limiting. A Mixture of Models (MoM) approach delivers compelling advantages:
Each AI model brings its own unique perspective and reasoning style. MoM synthesizes these diverse viewpoints into a more comprehensive answer.
| Benefit | Description |
|---------|-------------|
| 🎯 Superior Quality | Synthesize multiple perspectives to mitigate individual model weaknesses (hallucinations, biases, knowledge gaps) |
| 🛡️ Enhanced Reliability | If one LLM fails or underperforms, others compensate to maintain high-quality output |
| 💰 Cost Optimization | Route queries strategically—use cost-effective models where appropriate, premium ones when needed |
| 🔄 Maximum Flexibility | Hot-swap models via configuration without code changes. Create specialized "meta-models" for different tasks |
- 📝 Content Creation: Combine creative and factual models for balanced, engaging content
- 💻 Code Generation: Merge multiple coding assistants for more robust solutions
- 🔍 Research & Analysis: Get comprehensive answers by consulting multiple AI "experts"
- 🎓 Educational Applications: Provide students with well-rounded explanations from diverse perspectives
MoM Service uses an elegant fan-out, fan-in architecture for parallel processing and intelligent synthesis (a minimal sketch follows the steps below):
- 📥 Request In: Client makes request to OpenAI-compatible endpoint (/v1/chat/completions)
- 🎯 Fan-Out: Service identifies the MoM configuration and forwards request to all configured LLMs
- ⚡ Concurrent Processing: All LLMs process the request simultaneously (non-blocking)
- 🧠 Synthesize: Responses collected and passed to the "Concluding LLM"
- 📤 Stream Response: Final synthesized answer streamed back to client in real-time
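Conceptually, the flow looks like the following sketch. It uses LiteLLM (which the service is built on), but the function names, prompts, and model handling here are illustrative only, not the service's actual implementation:

```python
# Minimal fan-out / fan-in sketch; illustrative, not the service's real code.
import asyncio
import litellm


async def ask(model, messages):
    """Query one LLM and return its answer text."""
    response = await litellm.acompletion(model=model, messages=messages)
    return response.choices[0].message.content


async def mixture_of_models(messages, expert_models, concluding_model):
    # Fan-out: query every configured expert LLM concurrently.
    results = await asyncio.gather(
        *(ask(m, messages) for m in expert_models),
        return_exceptions=True,
    )
    # Drop failed experts so one bad provider does not sink the whole request.
    answers = [r for r in results if isinstance(r, str)]

    # Fan-in: the concluding LLM synthesizes the collected answers.
    synthesis_messages = [
        {"role": "system", "content": "Combine the candidate answers into one response."},
        {"role": "user", "content": "\n\n---\n\n".join(answers)},
    ]
    return await ask(concluding_model, synthesis_messages)
```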
- 🔌 OpenAI-Compatible API: Drop-in replacement with /v1/chat/completions and /v1/models endpoints
- 🎭 Multi-Model Orchestration: Query multiple LLMs in parallel with intelligent synthesis
- 🖼️ Multimodal Vision Support: Send images alongside text using OpenAI Vision API format
- ⚡ Real-Time Streaming: Stream synthesized responses back to clients with low latency
- ⚙️ Configuration-Driven: Define everything in a single config.yaml file—no code changes needed
- 💰 Advanced Pricing & Cost Tracking:
  - Custom pricing configurations for reasoning tokens
  - Automatic model filtering based on multimodal capabilities
  - Detailed cost breakdowns with normalized token reporting
  - Per-request cost calculation and logging
- 📊 Advanced Observability:
  - Built-in Langfuse integration for distributed tracing
  - Comprehensive metrics API with cost tracking and usage analytics
  - Detailed health check endpoints for monitoring system components
- 🔒 Enterprise Security:
  - Centralized Bearer token authentication with structured error responses
  - Clear distinction between service misconfiguration (503) and auth failures (401)
  - Flexible CORS policies for cross-origin requests
- 🐳 Production Ready:
  - Multi-stage Docker builds with non-root users
  - Docker Compose for local development
  - Advanced health checks for orchestration
- 💾 Response Caching: Automatic LLM response caching to reduce costs and latency
- 🧪 Comprehensive Testing: Full test suite with pytest for reliability
- Python 3.9 or higher
- Docker (optional, for containerized deployment)
- API keys for your chosen LLM providers (OpenAI, Google Gemini, Anthropic, etc.)
Clone the repository
```bash
git clone https://github.com/arashbehmand/mom-llm.git
cd mom-llm
```
Set up environment variables
Create a .env file in the project root:
```bash
# Service Configuration
API_TOKEN="your-secret-bearer-token"
ALLOWED_CORS_ORIGINS=""        # Comma-separated origins, or empty for no CORS
LITELLM_VERBOSE="false"

# LLM API Keys (add the ones you need)
OPENAI_API_KEY="sk-..."
GOOGLE_API_KEY="..."
ANTHROPIC_API_KEY="..."

# Optional: Langfuse for observability
LANGFUSE_PUBLIC_KEY=""
LANGFUSE_SECRET_KEY=""
LANGFUSE_HOST="https://cloud.langfuse.com"
```
Configure your models
Copy the template and customize:
- macOS/Linux:

  ```bash
  cp config.yaml_template config.yaml
  # Edit config.yaml to define your LLMs and MoM configurations
  ```

- Windows (PowerShell):

  ```powershell
  Copy-Item config.yaml_template config.yaml
  # Then edit config.yaml to define your LLMs and MoM configurations
  ```
Install dependencies
```bash
pip install -r requirements.txt
```
Run the service
```bash
uvicorn mom_service.main:app --reload --host 0.0.0.0 --port 8000
```
Using Docker Compose (Recommended):
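Assuming the repository's docker-compose.yml picks up your .env file, something along these lines should bring the stack up:

```bash
# Build and start the service in the background, then follow its logs
docker compose up --build -d
docker compose logs -f
```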
Using Docker directly:
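A rough equivalent without Compose; the image name and the in-container config path below are assumptions, so adjust them to match the repository's Dockerfile:

```bash
# Build the image (name "mom-service" is illustrative)
docker build -t mom-service .

# Run it with your .env and config.yaml; the /app/config.yaml mount path is an assumption
docker run --rm -p 8000:8000 --env-file .env \
  -v "$(pwd)/config.yaml:/app/config.yaml" mom-service
```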
Test the service:
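For instance, a quick smoke test from Python, assuming the service runs locally on port 8000 and the token matches API_TOKEN in your .env (whether /health itself requires the token may depend on your configuration):

```python
import requests

BASE = "http://localhost:8000"
HEADERS = {"Authorization": "Bearer your-secret-bearer-token"}  # your API_TOKEN

# Basic liveness check
print(requests.get(f"{BASE}/health").json())

# Authenticated check: list the MoM models defined in config.yaml
print(requests.get(f"{BASE}/v1/models", headers=HEADERS).json())
```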
Make a chat completion request:
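For example, a streaming request from Python; the model name "my-mom-model" is a placeholder for whichever MoM model you defined in config.yaml:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer your-secret-bearer-token"},
    json={
        "model": "my-mom-model",  # placeholder; use a MoM model from your config.yaml
        "messages": [{"role": "user", "content": "Explain mixture-of-models in two sentences."}],
        "stream": True,
    },
    stream=True,
)
# Print the server-sent event stream line by line as it arrives
for line in resp.iter_lines():
    if line:
        print(line.decode())
```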
Note: Set "stream": false to get a single JSON response instead of an SSE stream.
Send an image (multimodal vision request):
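A sketch of a vision request using the OpenAI Vision message format; the model name is a placeholder and the image is embedded as a base64 data URL:

```python
import base64
import requests

image_b64 = base64.b64encode(open("photo.jpg", "rb").read()).decode()

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer your-secret-bearer-token"},
    json={
        "model": "my-mom-model",  # placeholder; use a vision-capable MoM model from config.yaml
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        "stream": False,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```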
Note: Vision requests automatically filter to multimodal-capable models. Non-capable models are skipped, and messages are sanitized for each provider to ensure compatibility.
The service is configured through config.yaml and environment variables (.env file).
1. Environment Variables - API keys and service settings, supplied via the .env file shown in the Quick Start above.
2. Configuration File - Define your LLMs and MoM models:
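The exact schema lives in config.yaml_template and the Configuration Guide; the key names below are only an illustration of the general shape (expert LLMs plus a concluding LLM per MoM model):

```yaml
# Illustrative structure only - see config.yaml_template and the Configuration Guide
# for the real schema and key names.
llms:
  gpt5:
    model: openai/gpt-5
  sonnet:
    model: anthropic/claude-sonnet-4-5

mom_models:
  my-mom-model:
    llms: [gpt5, sonnet]          # expert models queried in parallel
    concluding_llm: gpt5          # model that synthesizes the final answer
    include_thinking_context: false
```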
For detailed configuration options, custom pricing, advanced features, and complete examples, see the Configuration Guide.
The MoM Service provides OpenAI-compatible endpoints plus additional metrics and health check endpoints.
Core Endpoints:
- GET /v1/models - List available MoM models
- POST /v1/chat/completions - Chat completions (streaming and non-streaming)
- GET /v1/metrics/usage - Usage metrics and cost tracking
- GET /health - Health check
Example Request:
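A minimal non-streaming request from Python; as above, "my-mom-model" is a placeholder for a MoM model defined in your config.yaml:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer your-secret-bearer-token"},
    json={
        "model": "my-mom-model",
        "messages": [{"role": "user", "content": "What are the trade-offs of a mixture of models?"}],
        "stream": False,
    },
)
data = resp.json()
print(data["choices"][0]["message"]["content"])
```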
For complete API documentation including all endpoints, parameters, response formats, and code examples in multiple languages, see the API Reference.
The service is fully compatible with the OpenAI Python SDK:
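For example, a streaming call where the SDK's base_url points at the service and the API key is your bearer token (the model name is again a placeholder):

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-bearer-token",  # the API_TOKEN from your .env
)

stream = client.chat.completions.create(
    model="my-mom-model",  # placeholder; use a MoM model from your config.yaml
    messages=[{"role": "user", "content": "Summarize the benefits of a mixture of models."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```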
See the API Reference for more examples including non-streaming and multimodal requests.
Set include_thinking_context: true in your model configuration to see intermediate responses from all LLMs before synthesis:
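For example (the surrounding keys are illustrative; see the Configuration Guide for the exact schema):

```yaml
mom_models:
  my-mom-model:
    include_thinking_context: true   # expose each expert's answer alongside the synthesis
```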
Useful for understanding synthesis logic, debugging, and transparency.
The service automatically sanitizes messages for provider compatibility, removing empty fields and preserving multimodal content appropriately. This ensures reliable operation across all LLM providers without manual adjustments.
- Automatic cost calculation for every request with detailed breakdowns
- Langfuse integration for distributed tracing: Add credentials to .env and view detailed traces at Langfuse
- Metrics API at /v1/metrics/usage for usage analytics
The --reload-include flag watches config.yaml for changes and automatically reloads the service.
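For local development, a command along these lines works (host, port, and module path as in the Quick Start):

```bash
uvicorn mom_service.main:app --reload --reload-include config.yaml --host 0.0.0.0 --port 8000
```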
The test suite includes unit tests, integration tests, API tests, and health check validation.
For more detailed information, check out these guides:
- Configuration Guide - Comprehensive guide to configuring LLMs, MoM models, and service settings
- API Reference - Complete API documentation with examples in multiple languages
- Contributing Guide - Guidelines for contributors
Contributions are welcome! Whether you're fixing bugs, improving documentation, or proposing new features, your help is appreciated.
Please see CONTRIBUTING.md for detailed guidelines on:
- Setting up your development environment
- Code style and standards
- Running tests and quality checks
- Submitting pull requests
- Reporting issues
Quick start for contributors:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Make your changes with tests
- Run the test suite (pytest)
- Commit your changes
- Push to your branch
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- This project was developed with the assistance of multiple AI tools, including Anthropic's Claude, GitHub Copilot, and Kilo Code.
- Built with FastAPI and LiteLLM
- Inspired by ensemble learning and multi-agent AI systems
- Observability powered by Langfuse
Arash Behmand
- GitHub: @arashbehmand
- LinkedIn: linkedin.com/in/arashbehmand
⭐ If you find this project useful, please consider giving it a star on GitHub!