Lightweight CLI, API and ChatGPT-like alternative to Open WebUI for accessing multiple LLMs, entirely offline, with all data kept private in browser storage.
Configure additional providers and models in llms.json
- Mix and match local models with models from different API providers
- Requests automatically routed to available providers that support the requested model (in defined order)
- Define free/cheapest/local providers first to save on costs
- Any failures are automatically retried on the next available provider
- Lightweight: Single llms.py Python file with a single aiohttp dependency (Pillow optional)
- Multi-Provider Support: OpenRouter, Ollama, Anthropic, Google, OpenAI, Grok, Groq, Qwen, Z.ai, Mistral
- OpenAI-Compatible API: Works with any client that supports OpenAI's chat completion API
- Built-in Analytics: Analytics UI to visualize costs, requests, and token usage
- Configuration Management: Easily enable/disable providers and manage their configuration
- CLI Interface: Simple command-line interface for quick interactions
- Server Mode: Run an OpenAI-compatible HTTP server at http://localhost:{PORT}/v1/chat/completions
- Image Support: Process images through vision-capable models
- Auto-resizes and converts images to WebP if they exceed configured limits
- Audio Support: Process audio through audio-capable models
- Custom Chat Templates: Configurable chat completion request templates for different modalities
- Auto-Discovery: Automatically discover available Ollama models
- Unified Models: Define custom model names that map to different provider-specific names
- Multi-Model Support: Support for 160+ different LLMs
Access all your local and remote LLMs with a single ChatGPT-like UI:
More Features and Screenshots.
Check the status of configured providers to verify they're configured correctly and reachable, and to see their response times for the simplest 1+1= request:
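For example (the `--check` flag here is an assumption based on the check templates in llms.json; see `llms --help` for the exact usage):

```bash
# Test each enabled provider with the simple "1+1=" check request
llms --check
```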
As these checks are a good indicator of the reliability and speed you can expect from different providers, we've created a test-providers.yml GitHub Action that tests the response times of all configured providers and models; the results are frequently published to /checks/latest.txt
- Improved Responsive Layout with collapsible Sidebar
- Return focus to textarea after request completes
- Support VERBOSE=1 for enabling --verbose mode (useful in Docker)
- Auto-reload providers and UI config when changes to config files are detected
- Add cancel button to cancel pending request
- Clicking outside model or system prompt selector will collapse it
- Clicking on selected item no longer deselects it
- Dark Mode
- Drag n' Drop files in Message prompt
- Copy & Paste files in Message prompt
- Support for GitHub OAuth with optional restriction of access to specified users
- Support for Docker and Docker Compose
Set environment variables for the providers you want to use:
| Provider | Environment Variable | Description | Example |
|----------|----------------------|-------------|---------|
| openrouter_free | OPENROUTER_API_KEY | OpenRouter FREE models API key | sk-or-... |
| groq | GROQ_API_KEY | Groq API key | gsk_... |
| google_free | GOOGLE_FREE_API_KEY | Google FREE API key | AIza... |
| codestral | CODESTRAL_API_KEY | Codestral API key | ... |
| ollama | N/A | No API key required | |
| openrouter | OPENROUTER_API_KEY | OpenRouter API key | sk-or-... |
| google | GOOGLE_API_KEY | Google API key | AIza... |
| anthropic | ANTHROPIC_API_KEY | Anthropic API key | sk-ant-... |
| openai | OPENAI_API_KEY | OpenAI API key | sk-... |
| grok | GROK_API_KEY | Grok (X.AI) API key | xai-... |
| qwen | DASHSCOPE_API_KEY | Qwen (Alibaba) API key | sk-... |
| z.ai | ZAI_API_KEY | Z.ai API key | sk-... |
| mistral | MISTRAL_API_KEY | Mistral API key | ... |
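For example, to make the free OpenRouter, Groq and Google providers available (values are placeholders):

```bash
export OPENROUTER_API_KEY="sk-or-..."
export GROQ_API_KEY="gsk_..."
export GOOGLE_FREE_API_KEY="AIza..."
```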
Start the UI and an OpenAI-compatible API on port 8000:
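For example (assuming the `--serve PORT` option):

```bash
llms --serve 8000
```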
Launches the UI at http://localhost:8000 and the OpenAI-compatible endpoint at http://localhost:8000/v1/chat/completions.
To see detailed request/response logging, add --verbose:
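For example:

```bash
llms --serve 8000 --verbose
```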
Any providers that have their API Keys set and enabled in llms.json are automatically made available.
Providers can be enabled or disabled in the UI at runtime next to the model selector, or on the command line:
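For example (provider names match the keys in llms.json; the `--enable`/`--disable` flags are assumed):

```bash
llms --enable groq
llms --disable openai
```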
Run the server on port 8000:
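A minimal sketch using the pre-built image from the Docker section below:

```bash
docker run -p 8000:8000 -e OPENROUTER_API_KEY ghcr.io/servicestack/llms:latest
```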
Get the latest version:
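If you're running the pre-built image:

```bash
docker pull ghcr.io/servicestack/llms:latest
```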
Use custom llms.json and ui.json config files outside of the container (auto created if they don't exist):
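A sketch, assuming the container looks for its config in the default ~/.llms location (see DOCKER.md for the exact in-container path):

```bash
# Mount a host directory holding llms.json and ui.json into the container
docker run -p 8000:8000 \
  -v "$PWD/llms-config:/root/.llms" \
  ghcr.io/servicestack/llms:latest
```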
Download and use docker-compose.yml:
Update API Keys in docker-compose.yml then start the server:
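For example:

```bash
docker compose up -d
```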
After the container starts, you can access the UI and API at http://localhost:8000.
See DOCKER.md for detailed instructions on customizing configuration files.
llms.py supports optional GitHub OAuth authentication to secure your web UI and API endpoints. When enabled, users must sign in with their GitHub account before accessing the application.
GITHUB_USERS is optional, but if set, only the specified users will be allowed access.
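A sketch of the relevant environment variables; GITHUB_CLIENT_ID and GITHUB_CLIENT_SECRET are assumed names, see GITHUB_OAUTH_SETUP.md for the exact variables:

```bash
export GITHUB_CLIENT_ID="..."        # assumed variable name
export GITHUB_CLIENT_SECRET="..."    # assumed variable name
export GITHUB_USERS="alice,bob"      # optional: only allow these GitHub users (format assumed)
```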
See GITHUB_OAUTH_SETUP.md for detailed setup instructions.
The configuration file llms.json is saved to ~/.llms/llms.json and defines available providers, models, and default settings. Key sections:
- headers: Common HTTP headers for all requests
- text: Default chat completion request template for text prompts
- image: Default chat completion request template for image prompts
- audio: Default chat completion request template for audio prompts
- file: Default chat completion request template for file prompts
- check: Check request template for testing provider connectivity
- limits: Override request size limits
- convert: Maximum image size and length limits and auto-conversion settings
Each provider configuration includes:
- enabled: Whether the provider is active
- type: Provider class (OpenAiProvider, GoogleProvider, etc.)
- api_key: API key (supports environment variables with $VAR_NAME)
- base_url: API endpoint URL
- models: Model name mappings (local name → provider name)
- pricing: Pricing per token (input/output) for each model
- default_pricing: Default pricing if not specified in pricing
- check: Check request template for testing provider connectivity
By default, llms uses the defaults.text chat completion request template defined in llms.json.
You can instead use a custom chat completion request with --chat, e.g.:
Example request.json:
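A minimal sketch using standard OpenAI chat completion fields (the exact fields honoured by your template may differ):

```bash
cat > request.json <<'EOF'
{
  "model": "kimi-k2",
  "temperature": 0.7,
  "messages": [
    { "role": "system", "content": "You are a concise assistant." },
    { "role": "user", "content": "Explain CORS in one paragraph." }
  ]
}
EOF

# use the custom request (invocation assumed; see llms --help)
llms --chat request.json
```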
Send images to vision-capable models using the --image option:
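For example (image path or URL followed by the prompt):

```bash
llms --image ./screenshot.png "Describe this image"
```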
Example of image-request.json:
Supported image formats: PNG, WEBP, JPG, JPEG, GIF, BMP, TIFF, ICO
Image sources:
- Local files: Absolute paths (/path/to/image.jpg) or relative paths (./image.png, ../image.jpg)
- Remote URLs: HTTP/HTTPS URLs are automatically downloaded
- Data URIs: Base64-encoded images (data:image/png;base64,...)
Images are automatically processed and converted to base64 data URIs before being sent to the model.
Popular models that support image analysis:
- OpenAI: GPT-4o, GPT-4o-mini, GPT-4.1
- Anthropic: Claude Sonnet 4.0, Claude Opus 4.1
- Google: Gemini 2.5 Pro, Gemini Flash
- Qwen: Qwen2.5-VL, Qwen3-VL, QVQ-max
- Ollama: qwen2.5vl, llava
Images are automatically downloaded and converted to base64 data URIs.
Send audio files to audio-capable models using the --audio option:
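For example:

```bash
llms --audio ./meeting.mp3 "Transcribe this recording"
```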
Example of audio-request.json:
Supported audio formats: MP3, WAV
Audio sources:
- Local files: Absolute paths (/path/to/audio.mp3) or relative paths (./audio.wav, ../recording.m4a)
- Remote URLs: HTTP/HTTPS URLs are automatically downloaded
- Base64 Data: Base64-encoded audio
Audio files are automatically processed and converted to base64 data before being sent to the model.
Popular models that support audio processing:
- OpenAI: gpt-4o-audio-preview
- Google: gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-flash-lite
Audio files are automatically downloaded and converted to base64 data URIs with appropriate format detection.
Send documents (e.g. PDFs) to file-capable models using the --file option:
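For example:

```bash
llms --file ./report.pdf "Summarize the key findings"
```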
Example of file-request.json:
Supported file formats: PDF
Other document types may work depending on the model/provider.
File sources:
- Local files: Absolute paths (/path/to/file.pdf) or relative paths (./file.pdf, ../file.pdf)
- Remote URLs: HTTP/HTTPS URLs are automatically downloaded
- Base64/Data URIs: Inline data:application/pdf;base64,... is supported
Files are automatically downloaded (for URLs) and converted to base64 data URIs before being sent to the model.
Popular multi-modal models that support file (PDF) inputs:
- OpenAI: gpt-5, gpt-5-mini, gpt-4o, gpt-4o-mini
- Google: gemini-flash-latest, gemini-2.5-flash-lite
- Grok: grok-4-fast (OpenRouter)
- Qwen: qwen2.5vl, qwen3-max, qwen3-vl:235b, qwen3-coder, qwen3-coder-flash (OpenRouter)
- Others: kimi-k2, glm-4.5-air, deepseek-v3.1:671b, llama4:400b, llama3.3:70b, mai-ds-r1, nemotron-nano:9b
Run as an OpenAI-compatible HTTP server:
The server exposes a single endpoint:
- POST /v1/chat/completions - OpenAI-compatible chat completions
Example client usage:
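For example, using curl against the local server (any OpenAI-compatible client pointed at http://localhost:8000/v1 works the same way):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2",
    "messages": [{ "role": "user", "content": "What is 1+1?" }]
  }'
```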
The --args option allows you to pass URL-encoded parameters to customize the chat request sent to LLM providers:
Parameter Types:
- Floats: temperature=0.7, frequency_penalty=0.2
- Integers: max_completion_tokens=100
- Booleans: store=true, verbose=false, logprobs=true
- Strings: stop=one
- Lists: stop=two,words
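For example, combining several of the parameter types above in one request (a sketch; parameters are joined like a URL query string):

```bash
llms --args "temperature=0.7&max_completion_tokens=100&stop=two,words" "Write a haiku about autumn"
```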
Common Parameters:
- temperature: Controls randomness (0.0 to 2.0)
- max_completion_tokens: Maximum tokens in response
- seed: For reproducible outputs
- top_p: Nucleus sampling parameter
- stop: Stop sequences (URL-encode special chars)
- store: Whether or not to store the output
- frequency_penalty: Penalize new tokens based on frequency
- presence_penalty: Penalize new tokens based on presence
- logprobs: Include log probabilities in response
- parallel_tool_calls: Enable parallel tool calls
- prompt_cache_key: Cache key for prompt
- reasoning_effort: Reasoning effort (low, medium, high, *minimal, *none, *default)
- safety_identifier: A string that uniquely identifies each user
- service_tier: Service tier (free, standard, premium, *default)
- top_logprobs: Number of top logprobs to return
- verbosity: Verbosity level (0, 1, 2, 3, *default)
- enable_thinking: Enable thinking mode (Qwen)
- stream: Enable streaming responses
The --default MODEL option allows you to set the default model used for all chat completions. This updates the defaults.text.model field in your configuration file:
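For example:

```bash
# make kimi-k2 the default model (updates defaults.text.model in ~/.llms/llms.json)
llms --default kimi-k2
```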
When you set a default model:
- The configuration file (~/.llms/llms.json) is automatically updated
- The specified model becomes the default for all future chat requests
- The model must exist in your currently enabled providers
- You can still override the default using -m MODEL for individual requests
Pipe Markdown output to glow to beautifully render it in the terminal:
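For example:

```bash
llms "Compare TCP and UDP" | glow -
```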
Any OpenAI-compatible providers and their models can be added by configuring them in llms.json. By default, only AI providers with free tiers are enabled, and they're only "available" if their API key is set.
You can list the available providers, their models and which are enabled or disabled with:
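For example (the exact flag is an assumption; check `llms --help`):

```bash
llms --list
```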
They can be enabled/disabled in your llms.json file or with:
For a provider to be available, its API key must also be configured, either in your environment variables or directly in your llms.json.
| Provider | Environment Variable | Description | Example |
|----------|----------------------|-------------|---------|
| openrouter_free | OPENROUTER_API_KEY | OpenRouter FREE models API key | sk-or-... |
| groq | GROQ_API_KEY | Groq API key | gsk_... |
| google_free | GOOGLE_FREE_API_KEY | Google FREE API key | AIza... |
| codestral | CODESTRAL_API_KEY | Codestral API key | ... |
| ollama | N/A | No API key required | |
| openrouter | OPENROUTER_API_KEY | OpenRouter API key | sk-or-... |
| google | GOOGLE_API_KEY | Google API key | AIza... |
| anthropic | ANTHROPIC_API_KEY | Anthropic API key | sk-ant-... |
| openai | OPENAI_API_KEY | OpenAI API key | sk-... |
| grok | GROK_API_KEY | Grok (X.AI) API key | xai-... |
| qwen | DASHSCOPE_API_KEY | Qwen (Alibaba) API key | sk-... |
| z.ai | ZAI_API_KEY | Z.ai API key | sk-... |
| mistral | MISTRAL_API_KEY | Mistral API key | ... |
OpenAI:
- Type: OpenAiProvider
- Models: GPT-5, GPT-5 Codex, GPT-4o, GPT-4o-mini, o3, etc.
- Features: Text, images, function calling
Anthropic:
- Type: OpenAiProvider
- Models: Claude Opus 4.1, Sonnet 4.0, Haiku 3.5, etc.
- Features: Text, images, large context windows
Google:
- Type: GoogleProvider
- Models: Gemini 2.5 Pro, Flash, Flash-Lite
- Features: Text, images, safety settings
OpenRouter:
- Type: OpenAiProvider
- Models: 100+ models from various providers
- Features: Access to latest models, free tier available
Grok (X.AI):
- Type: OpenAiProvider
- Models: Grok-4, Grok-3, Grok-3-mini, Grok-code-fast-1, etc.
- Features: Real-time information, humor, uncensored responses
Groq:
- Type: OpenAiProvider
- Models: Llama 3.3, Gemma 2, Kimi K2, etc.
- Features: Fast inference, competitive pricing
Ollama:
- Type: OllamaProvider
- Models: Auto-discovered from local Ollama installation
- Features: Local inference, privacy, no API costs
Qwen (Alibaba):
- Type: OpenAiProvider
- Models: Qwen3-max, Qwen-max, Qwen-plus, Qwen2.5-VL, QwQ-plus, etc.
- Features: Multilingual, vision models, coding, reasoning, audio processing
Z.ai:
- Type: OpenAiProvider
- Models: GLM-4.6, GLM-4.5, GLM-4.5-air, GLM-4.5-x, GLM-4.5-airx, GLM-4.5-flash, GLM-4:32b
- Features: Advanced language models with strong reasoning capabilities
Mistral:
- Type: OpenAiProvider
- Models: Mistral Large, Codestral, Pixtral, etc.
- Features: Code generation, multilingual
Codestral:
- Type: OpenAiProvider
- Models: Codestral
- Features: Code generation
The tool automatically routes requests to the first available provider that supports the requested model. If a provider fails, it tries the next available provider with that model.
Example: if kimi-k2 is available from multiple providers, a request will first try OpenRouter (free), then fall back to Groq, then OpenRouter (paid) if earlier requests fail.
The easiest way to run llms-py is using Docker:
Pre-built Docker images are automatically published to GitHub Container Registry:
- Latest stable: ghcr.io/servicestack/llms:latest
- Specific version: ghcr.io/servicestack/llms:v2.0.24
- Main branch: ghcr.io/servicestack/llms:main
Pass API keys as environment variables:
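For example:

```bash
docker run -p 8000:8000 \
  -e OPENROUTER_API_KEY="sk-or-..." \
  -e GROQ_API_KEY="gsk_..." \
  ghcr.io/servicestack/llms:latest
```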
Create a docker-compose.yml file (or use the one in the repository):
Create a .env file with your API keys:
Start the service:
Build the Docker image from source:
To persist configuration and analytics data between container restarts:
Customize llms-py behavior by providing your own llms.json and ui.json files:
Option 1: Mount a directory with custom configs
Option 2: Mount individual config files
With docker-compose:
The container will auto-create default config files on first run if they don't exist. You can customize these to:
- Enable/disable specific providers
- Add or remove models
- Configure API endpoints
- Set custom pricing
- Customize chat templates
- Configure UI settings
See DOCKER.md for detailed configuration examples.
Change the port mapping to run on a different port:
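For example, to expose the server on port 8080 instead (the app listens on port 8000 inside the container):

```bash
docker run -p 8080:8000 ghcr.io/servicestack/llms:latest
```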
You can also use the Docker container for CLI commands:
The Docker image includes a health check that verifies the server is responding:
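You can inspect the reported health status of a running container with standard Docker tooling, e.g.:

```bash
docker inspect --format '{{.State.Health.Status}}' <container-name>
```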
The Docker images support multiple architectures:
- linux/amd64 (x86_64)
- linux/arm64 (ARM64/Apple Silicon)
Docker will automatically pull the correct image for your platform.
Config file not found
No providers enabled
API key issues
Model not found
Enable verbose logging to see detailed request/response information:
This shows:
- Enabled providers
- Model routing decisions
- HTTP request details
- Error messages with stack traces
- llms/main.py - Main script with CLI and server functionality
- llms/llms.json - Default configuration file
- llms/ui.json - UI configuration file
- requirements.txt - Python dependencies, required: aiohttp, optional: Pillow
- OpenAiProvider - Generic OpenAI-compatible provider
- OllamaProvider - Ollama-specific provider with model auto-discovery
- GoogleProvider - Google Gemini with native API format
- GoogleOpenAiProvider - Google Gemini via OpenAI-compatible endpoint
- Create a provider class inheriting from OpenAiProvider
- Implement provider-specific authentication and formatting
- Add provider configuration to llms.json
- Update initialization logic in init_llms()
Contributions are welcome! Please submit a PR to add support for any missing OpenAI-compatible providers.







