This VSCode extension automates the deployment and management of the FastAPI-BitNet backend. It runs the backend's REST API inside a Docker container and exposes its capabilities directly to GitHub Copilot Chat as a set of tools.
With this extension, you can initialize, manage, and communicate with multiple BitNet model instances (both llama-server and persistent llama-cli sessions) directly from your editor's chat panel.
Core Features
This extension empowers GitHub Copilot with a rich set of tools to manage the entire lifecycle of BitNet instances.
Automated Backend Management:
- Automatically pulls the required Docker image from Docker Hub.
- Starts the Docker container with the FastAPI-BitNet service.
- Gracefully shuts down the container when you close VSCode (see the sketch below).
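The container lifecycle the extension manages looks roughly like the following, shown here with the Docker SDK for Python. The image and container names are placeholders, not the extension's actual values; in practice the extension handles all of this for you.

```python
# Rough sketch of the automated container lifecycle (Docker SDK for Python).
import docker

IMAGE = "example/fastapi-bitnet:latest"  # hypothetical image name
CONTAINER_NAME = "fastapi-bitnet"        # hypothetical container name

client = docker.from_env()

# 1. Pull the image from Docker Hub.
client.images.pull(IMAGE)

# 2. Start the container and map the API port to the host
#    (host port assumed from the MCP URL in the Setup section).
container = client.containers.run(
    IMAGE,
    name=CONTAINER_NAME,
    detach=True,
    ports={"8080/tcp": 8080},
)

# 3. On editor shutdown, stop and remove the container.
container.stop()
container.remove()
```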
Seamless Copilot Integration:
- Exposes all API endpoints as tools for GitHub Copilot via the Model Context Protocol (MCP).
Resource & Server Management (llama-server):
- Estimate Capacity: Check how many servers can be run based on available RAM and CPU.
- Initialize/Shutdown: Start and stop single or multiple llama-server instances in batches.
- Check Status: Query the status, PID, and configuration of any running server (see the HTTP sketch below).
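For illustration, here is a minimal sketch of driving these operations over plain HTTP against the backend. The endpoint paths and field names are assumptions, and the base address is taken from the MCP URL in the Setup section; in normal use Copilot invokes the corresponding tools for you via MCP.

```python
# Hedged sketch of server management over HTTP; paths and fields are hypothetical.
import requests

BASE = "http://127.0.0.1:8080"  # assumed backend address

# How many servers can this machine handle?
print(requests.get(f"{BASE}/estimate-capacity").json())

# Start a llama-server instance.
requests.post(f"{BASE}/servers/init",
              json={"port": 9001, "threads": 2,
                    "system_prompt": "You're a helpful assistant."})

# Query its status, then shut it down.
print(requests.get(f"{BASE}/servers/status", params={"port": 9001}).json())
requests.post(f"{BASE}/servers/shutdown", json={"port": 9001})
```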
Persistent CLI Session Management (llama-cli):
- Start/Stop Sessions: Create and terminate persistent, conversational llama-cli processes, each identified by a unique alias.
- Batch Operations: Manage multiple CLI sessions with a single command.
- Check Status: Get the status and configuration of any active CLI session (see the sketch below).
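A similarly hedged sketch of the CLI-session lifecycle, again with hypothetical endpoint paths and field names:

```python
# Hedged sketch of the persistent llama-cli session lifecycle over HTTP.
import requests

BASE = "http://127.0.0.1:8080"  # assumed backend address

# Start a persistent llama-cli session identified by an alias.
requests.post(f"{BASE}/cli/start",
              json={"alias": "research-chat", "threads": 1,
                    "system_prompt": "You're a helpful assistant."})

# Check its status, then stop it.
print(requests.get(f"{BASE}/cli/status", params={"alias": "research-chat"}).json())
requests.post(f"{BASE}/cli/stop", json={"alias": "research-chat"})
```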
Advanced Interaction:
- Chat: Send prompts to specific llama-server instances.
- Multi-Chat: Broadcast a prompt to multiple servers concurrently and get all responses.
- Conversational Chat: Maintain an ongoing conversation with a persistent llama-cli session (multi-chat is sketched below).
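Multi-chat amounts to fanning the same prompt out to several llama-server instances concurrently and collecting every response. A rough Python sketch, with a hypothetical chat endpoint and response shape:

```python
# Hedged sketch of multi-chat: the same prompt sent to several servers in parallel.
from concurrent.futures import ThreadPoolExecutor
import requests

BASE = "http://127.0.0.1:8080"  # assumed backend address
PORTS = [9000, 9001]            # servers started earlier
PROMPT = "Explain the concept of perplexity."

def ask(port: int) -> str:
    # Hypothetical per-server chat endpoint and response field.
    r = requests.post(f"{BASE}/chat", json={"port": port, "prompt": PROMPT})
    return r.json().get("response", "")

with ThreadPoolExecutor(max_workers=len(PORTS)) as pool:
    for port, answer in zip(PORTS, pool.map(ask, PORTS)):
        print(f"server {port}: {answer[:80]}...")
```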
Model Utilities:
- Benchmark: Programmatically run llama-bench to evaluate model performance.
- Perplexity: Calculate model perplexity for a given text using llama-perplexity.
- Model Info: Get the file sizes of all available BitNet models (benchmark and perplexity calls are sketched below).
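The utilities can be pictured as two more backend calls. The paths and parameter names below are assumptions for illustration only; Copilot normally triggers these through the MCP tools.

```python
# Hedged sketch of the benchmark and perplexity utilities over HTTP.
import requests

BASE = "http://127.0.0.1:8080"  # assumed backend address

# Run llama-bench against the default model.
print(requests.post(f"{BASE}/benchmark",
                    json={"n_tokens": 128, "threads": 4}).json())

# Compute perplexity for a short text with llama-perplexity.
print(requests.post(f"{BASE}/perplexity",
                    json={"text": "The quick brown fox jumps over the lazy dog",
                          "ctx_size": 10}).json())
```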
Requirements
- Docker Desktop: You must have Docker installed and running on your computer.
- Disk Space: At least 8GB of free disk space is required for the uncompressed Docker image.
- RAM: Each BitNet instance consumes roughly 1.5 GB of RAM. A few instances are manageable, but memory use adds up quickly; for example, eight instances need about 12 GB. Ensure your system has enough memory for the number of instances you plan to run.
Setup Instructions
- Install this extension from the VSCode Marketplace.
- Ensure Docker Desktop is running.
- Open the VSCode Command Palette using Ctrl+Shift+P (Cmd+Shift+P on macOS).
- Type BitNet: Initialize MCP Server and press Enter. This will download the large Docker image and may take some time.
- Once initialization is complete, run BitNet: Start Server from the Command Palette.
- Open the GitHub Copilot Chat panel (it must be in "Chat" view, not "Inline Chat").
- Click the wrench icon (Configure Tools...) at the top of the chat panel.
- Scroll to the bottom and select + Add MCP Server, then choose HTTP.
- Enter the URL: http://127.0.0.1:8080/mcp
- Copilot will now detect the available tools. You may need to click the refresh icon next to the wrench if they don't appear immediately.
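If the tools never appear, it can help to confirm that something is actually listening at the MCP address before debugging Copilot itself. A minimal reachability check (this only tests the TCP port, not the MCP handshake):

```python
# Quick sanity check that the backend's MCP port is open.
import socket

with socket.socket() as s:
    s.settimeout(2)
    try:
        s.connect(("127.0.0.1", 8080))
        print("MCP server is reachable at http://127.0.0.1:8080/mcp")
    except OSError:
        print("Nothing is listening on port 8080 - re-run 'BitNet: Start Server'.")
```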
AI Chat Instructions & Examples
Once set up, you can instruct GitHub Copilot to use the new tools. You can always ask "What tools can you use?" to get a fresh list of available actions.
1. Estimate Server Capacity
Before starting, see what your system can handle.
Your Prompt: "Estimate how many BitNet servers I can run."
2. Manage llama-server Instances
You can launch one or many llama-server processes; however, conversational (cnv)/instruction mode is currently unsupported.
Start a single server: "Initialize a BitNet server with 1 thread and a system prompt of you're a helpful assistant."
Start multiple servers: "Initialize a batch of two BitNet servers: the first on port 9001 with 2 threads and the second on port 9002 with 2 threads, each with a system prompt of you're a helpful assistant."
Check status: "What is the status of the server on port 9001?"
Shut down a server: "Shut down the server on port 9002."
3. Manage Persistent llama-cli Sessions
You can launch one or many llama-cli processes, which fully support conversational (cnv)/instruction mode by default!
Start a session: "Start a persistent llama-cli session with the alias 'research-chat' using the default model, with 1 thread and a system prompt of you're a helpful assistant."
Start many sessions: "Start 5 llama-cli sessions, with 1 thread each, and a system prompt of you're a helpful assistant."
Check status: "What is the status of the llama-cli session 'research-chat'?"
Shut down a session: "Stop the llama-cli session named 'research-chat'."
4. Chat with Models
You can talk to servers or conversational CLI sessions.
Chat with a server: "Using the server on port 9000, tell me about the BitNet architecture."
Multi-chat with servers: "Ask both the server on port 9000 and the server on port 9001 to explain the concept of perplexity. Compare their answers."
Chat with a CLI session: "Using the llama-cli session 'research-chat', what is the capital of France?"
5. Run Utilities
Analyze model performance directly from the chat.
Run a benchmark: "Run a benchmark on the default BitNet model with 128 tokens and 4 threads."
Calculate perplexity: "Calculate the perplexity of the text 'The quick brown fox jumps over the lazy dog' using the default model and a context size of 10."