Set Up a Private OpenAI-Compatible LLM on Google Cloud Run

For those passionate about privacy and control, the trajectory of improvement in open-weight LLMs and the ecosystem around them has been extremely encouraging:

  • The gap in raw reasoning capability between the best OSS models (DeepSeek, Qwen) and the best models from frontier labs has continued to shrink.

  • OSS models now support key usability features such as function calling and structured output.

  • Thanks to tools such as Ollama and the proliferation of GPUs in consumer laptops, more people than ever can discover and run models on their own hardware.

This continued trend means that even distilled versions of the best OSS models will become “good enough” for an increasing percentage of tasks. However, when it comes time to do something like serve an app to actual users, there’s a missing piece in this AI-hacker fairytale - how do I easily deploy a model on infrastructure I control?

When I saw that Google recently brought serverless GPUs into GA, it piqued my interest, and I was able to adapt one of their examples to support API key auth and make it fully OpenAI and LangChain-compatible. I wrote up a small guide here + a GitHub repo you can use to deploy your own!

Once live, your endpoint can be used as a drop-in substitute for clients and code that use these interfaces. It also requires no infrastructure management and scales down to zero instances when not in use.

In theory, you can serve any open-source model from Ollama's registry, including DeepSeek, Gemma, and Qwen, though in practice Cloud Run's resource caps will limit effective model size. For more on this, see the model customization section below.

Let’s dive in!

Quickstart

Setting up Google Cloud resources

The initial setup for this project is the same as the official Cloud Run guide here.

If you don't already have a Google Cloud account, you will first need to sign up.

Navigate to the Google Cloud project selector and select or create a Google Cloud project. You will need to enable billing for the project, since GPUs are currently not part of Google Cloud's free tier.

Next, you must enable access to Artifact Registry, Cloud Build, Cloud Run, and Cloud Storage APIs for your project. Click here, select your newly created project, then follow the instructions to do so.

GPUs are not part of the default project quota, so you will need to submit a quota increase request. From this page, select your project, then filter by Total Nvidia L4 GPU allocation without zonal redundancy, per project per region in the search bar. Find your desired region (Google currently recommends europe-west1, note that pricing may vary depending on region), then click the side menu and press Edit quota:

Enter a value (e.g. 5), and submit a request. Google claims that increase requests may take a few days to process, but you may receive an approval email almost immediately in practice.

Finally, you will need to set up proper IAM permissions for your project. Navigate to this page and select your project, then press Grant Access. In the resulting modal, paste the following roles into the filter window and add them one by one to a principal on your project:

  • roles/artifactregistry.admin

  • roles/cloudbuild.builds.editor

  • roles/run.admin

  • roles/resourcemanager.projectIamAdmin

  • roles/iam.serviceAccountUser

  • roles/serviceusage.serviceUsageConsumer

  • roles/storage.admin

By the end, your screen should look something like this:

Deploying your endpoint

Now, clone this repo and switch your working directory to be the cloned folder:

git clone https://github.com/jacoblee93/personallm.git
cd personallm

The repo extends Google’s official guide with a lightweight proxy server, which runs in the Cloud Run instance. This proxy handles auth and forwards requests to a concurrently running Ollama instance.
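For illustration only, here is a minimal sketch of what such a proxy could look like. This is not the repo's actual implementation; the FastAPI/httpx stack, the default Ollama port, and the API_KEYS parsing are all assumptions:

# Illustrative sketch only, not the repo's actual proxy. Assumes Ollama listens
# on its default port and that API_KEYS holds comma-separated keys (see below).
import os

import httpx
from fastapi import FastAPI, HTTPException, Request, Response

app = FastAPI()
ALLOWED_KEYS = set(os.environ.get("API_KEYS", "").split(","))
OLLAMA_URL = "http://localhost:11434"

@app.api_route("/{path:path}", methods=["GET", "POST"])
async def proxy(path: str, request: Request) -> Response:
    # Reject requests whose Bearer token is not one of the configured API keys.
    token = request.headers.get("authorization", "").removeprefix("Bearer ")
    if token not in ALLOWED_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")
    # Forward everything else to the local Ollama server and relay its response.
    async with httpx.AsyncClient(timeout=600) as client:
        upstream = await client.request(
            request.method,
            f"{OLLAMA_URL}/{path}",
            content=await request.body(),
            headers={"content-type": request.headers.get("content-type", "application/json")},
        )
    return Response(content=upstream.content, media_type=upstream.headers.get("content-type"))

A real proxy would also need to stream responses back to the client rather than buffering them, which this sketch omits for brevity.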

Rename the .env.example file to .env. Run something similar to the following command to randomly generate an API key:

openssl rand -base64 32

Paste this value into the API_KEYS field. You can provide multiple API keys by comma separating them here, so make sure that none of your key values contain commas.
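For example, using the same placeholder style as the rest of this guide, a .env with two keys would look like this:

API_KEYS=YOUR_GENERATED_KEY,ANOTHER_OPTIONAL_KEY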

Install and initialize the gcloud CLI if you haven't already by following these instructions. If you already have the CLI installed, you may need to run gcloud components update to make sure you are on the latest CLI version.

Next, set your gcloud CLI project to be your project name:

gcloud config set project YOUR_PROJECT_NAME

And set the region to be the same one as where you requested GPU quota:

gcloud config set run/region YOUR_REGION

Finally, run the following command to deploy your new inference endpoint!

gcloud run deploy personallm \
  --source . \
  --concurrency 4 \
  --cpu 8 \
  --set-env-vars OLLAMA_NUM_PARALLEL=4 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --max-instances 1 \
  --memory 32Gi \
  --no-cpu-throttling \
  --no-gpu-zonal-redundancy \
  --timeout=600

When prompted with something like Allow unauthenticated invocations to [personallm] (y/N)?, you should respond with y. The internal proxy will handle authentication, and we want our endpoint to be reachable from anywhere for ease of use.

Note that deployments are quite slow since model weights are baked directly into the container image at build time - expect this step to take upwards of 20 minutes. Once it finishes, your terminal should print a Service URL, and that's it! You now have a personal, private LLM inference endpoint!

Trying it out

You can call your endpoint much the same way you'd call an OpenAI model, just with your generated API key and your provisioned endpoint URL instead. Here are some examples:

OpenAI Python SDK

uv add openai

from openai import OpenAI

client = OpenAI(
    base_url="https://YOUR_SERVICE_URL/v1",
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="qwen3:14b",
    messages=[
        {"role": "user", "content": "What is 2 + 2?"}
    ]
)

See OpenAI's SDK docs for examples of advanced features such as function/tool calling.

LangChain

uv add langchain-ollama

from langchain_ollama import ChatOllama

model = ChatOllama(
    model="qwen3:14b",
    base_url="https://YOUR_SERVICE_URL",
    client_kwargs={
        "headers": {
            "Authorization": "Bearer YOUR_API_KEY"
        }
    }
)

response = model.invoke("What is 2 + 2?")

See LangChain's docs for examples of advanced features such as function/tool calling.

OpenAI JS SDK

npm install openai

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://YOUR_SERVICE_URL/v1",
  apiKey: "YOUR_API_KEY",
});

const result = await client.chat.completions.create({
  model: "qwen3:14b",
  messages: [{ role: "user", content: "What is 2 + 2?" }],
});

See OpenAI's SDK docs for examples of advanced features such as function/tool calling.

LangChain.js

npm install @langchain/ollama @langchain/core

import { ChatOllama } from "@langchain/ollama";

const model = new ChatOllama({
  model: "qwen3:14b",
  baseUrl: "https://YOUR_SERVICE_URL",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
  },
});

const result = await model.invoke("What is 2 + 2?");

See LangChain's docs for examples of advanced features such as function/tool calling.

Latency

Keep in mind that there will be additional cold start latency if the endpoint has not been used in some time.
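If this matters for your use case, one mitigation is to send a cheap warm-up request when your application starts, so the instance and model are already loaded before real traffic arrives. A sketch using the OpenAI Python SDK example from above (placeholder URL and key are yours to fill in):

from openai import OpenAI

client = OpenAI(base_url="https://YOUR_SERVICE_URL/v1", api_key="YOUR_API_KEY")

# A tiny request that forces Cloud Run to spin up an instance and load the
# model weights, so later user-facing requests avoid most of the cold start.
client.chat.completions.create(
    model="qwen3:14b",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=1,
)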

Model customization

The base configuration in this repo serves a 14 billion parameter model (Qwen 3) clocked at ~20-25 output tokens per second. This model is quite capable and also supports function/tool calling, which makes it more useful when building agentic flows, but if speed becomes a concern you might try smaller models such as Google's 4 billion parameter Gemma 3. You can also run the popular DeepSeek-R1 if you do not need tool calling.

To customize the served model, open your Dockerfile and modify the ENV MODEL qwen3:14b line to be a different model from Ollama's registry:

ENV MODEL qwen3:14b

Note that you will also have to change your client-side code to specify the new model as a parameter.
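For example, if you switched the Dockerfile to Gemma 3's 4 billion parameter variant (assuming gemma3:4b is the Ollama tag you chose), the earlier Python call would become:

response = client.chat.completions.create(
    model="gemma3:4b",  # must match the MODEL value baked into your image
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
)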

🙏 Thank you!

This GitHub repo contains the source code for this guide.

If you have any questions or comments, please open an issue there. You can also follow me @Hacubu on X (formerly Twitter).
