Injecting Spotify API Data into the Gemini AI Context Window


I recently added a voice assistant to my website that lets visitors have a conversation with an AI that knows a bit about me. What makes it interesting is that the AI has access to my current Spotify listening data, so it can tell people what I’m listening to right now, my top artists, and recent tracks. Here’s how I built it.

You can see a video of this in action right here.

The Architecture

The system has three main parts working together.

First, there’s a WebSocket relay server running on Node.js. This server sits between the browser and Google’s Gemini API. When someone clicks the voice button on my site, their browser opens a WebSocket connection to my server, which then opens another WebSocket connection to Gemini. Audio flows in both directions: speech from the visitor goes to Gemini, and Gemini’s voice responses come back to the visitor.
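Here is a minimal sketch of that relay, assuming the `ws` package on Node.js. The Live API endpoint and key come from environment variables (GEMINI_WS_URL and GEMINI_API_KEY are placeholder names), and the handler simply forwards frames in both directions:

```typescript
import { WebSocketServer, WebSocket } from "ws";
import { randomUUID } from "crypto";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (browser) => {
  const sessionId = randomUUID();
  console.log(`session ${sessionId} started`);

  // Open the upstream connection to Gemini for this session.
  const gemini = new WebSocket(
    `${process.env.GEMINI_WS_URL}?key=${process.env.GEMINI_API_KEY}`
  );

  // Relay frames in both directions.
  browser.on("message", (data) => {
    if (gemini.readyState === WebSocket.OPEN) gemini.send(data);
  });
  gemini.on("message", (data) => {
    if (browser.readyState === WebSocket.OPEN) browser.send(data);
  });

  // If either side goes away, tear down the other.
  browser.on("close", () => gemini.close());
  gemini.on("close", () => browser.close());
});
```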

Second, there’s the Spotify integration. Every time someone starts a new conversation, my server makes API calls to Spotify to fetch my currently playing track, recently played songs, top artists from the last month, and top tracks. This data gets formatted into a natural language description.
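A sketch of those fetches, assuming a valid access token is already in hand (the helper name is mine; the endpoint paths are Spotify's Web API):

```typescript
async function fetchSpotifyContext(accessToken: string) {
  const get = (path: string) =>
    fetch(`https://api.spotify.com/v1${path}`, {
      headers: { Authorization: `Bearer ${accessToken}` },
      // currently-playing returns 204 with no body when nothing is playing
    }).then((res) => (res.status === 204 ? null : res.json()));

  // All four requests run in parallel to keep session startup fast.
  const [nowPlaying, recentlyPlayed, topArtists, topTracks] = await Promise.all([
    get("/me/player/currently-playing"),
    get("/me/player/recently-played?limit=10"),
    get("/me/top/artists?time_range=short_term&limit=10"),
    get("/me/top/tracks?time_range=short_term&limit=10"),
  ]);

  return { nowPlaying, recentlyPlayed, topArtists, topTracks };
}
```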

Third, there’s the context injection. Before the conversation starts, all that Spotify data gets injected into Gemini’s system instruction. This is the part that tells Gemini who it is and what it knows. So instead of just saying “You are Jesse’s AI assistant,” the prompt now includes “Jesse is currently listening to ‘Song Name’ by Artist Name. His top artists lately are X, Y, Z.”
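A sketch of that formatting step, turning the fetched JSON into the natural-language lines that get folded into the system instruction. The field accesses follow Spotify's response shapes; the persona wording is illustrative:

```typescript
function buildSystemInstruction(ctx: {
  nowPlaying: any;
  topArtists: any;
  topTracks: any;
}): string {
  const lines = ["You are Jesse's AI assistant on his personal website."];

  if (ctx.nowPlaying?.item) {
    const track = ctx.nowPlaying.item;
    const artists = track.artists.map((a: any) => a.name).join(", ");
    lines.push(`Jesse is currently listening to "${track.name}" by ${artists}.`);
  }

  if (ctx.topArtists?.items?.length) {
    const names = ctx.topArtists.items.map((a: any) => a.name).join(", ");
    lines.push(`His top artists over the last month are: ${names}.`);
  }

  if (ctx.topTracks?.items?.length) {
    const names = ctx.topTracks.items
      .map((t: any) => `"${t.name}" by ${t.artists[0].name}`)
      .join(", ");
    lines.push(`His top tracks lately: ${names}.`);
  }

  return lines.join("\n");
}
```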

The key insight is that AI models have a context window, which is essentially their working memory for a conversation. By injecting fresh data into that context at the start of each session, the AI has access to information it wouldn’t normally know. It’s not magic, it’s just data placement.

When someone asks “What music does Jesse like?” the AI can respond with actual current information rather than generic statements. The data is right there in its context, so it answers naturally.

The Technical Flow

Here’s what happens when someone starts a voice conversation:

The server receives the WebSocket connection and generates a session ID. Before connecting to Gemini, it calls Spotify’s API using a refresh token to get an access token. With that access token, it makes parallel requests for currently playing track, recently played tracks, top artists, and top tracks. All of this data gets combined into a text description.

That description gets inserted into the system instruction along with other information about me like my location, expertise, and interests. The complete system instruction gets sent to Gemini as the setup message over WebSocket.
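For reference, the setup frame looks roughly like the sketch below. The field names reflect my understanding of the Live API's setup message and the model name is a placeholder; treat the exact shape as an assumption rather than the article's literal payload:

```typescript
function sendSetup(gemini: WebSocket, systemInstruction: string) {
  // First frame on the Gemini socket: model choice, audio output, and the
  // system instruction carrying the injected Spotify context.
  gemini.send(
    JSON.stringify({
      setup: {
        model: "models/gemini-2.0-flash-exp",
        generationConfig: { responseModalities: ["AUDIO"] },
        systemInstruction: { parts: [{ text: systemInstruction }] },
      },
    })
  );
}
```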

Now when the visitor speaks, their audio gets converted to PCM format in the browser, base64 encoded, wrapped in JSON, and sent over WebSocket to my server. My server relays it directly to Gemini. Gemini processes the audio, understands the speech, generates a response, converts it to audio, and sends it back. My server relays that audio to the browser where it gets decoded and played through the speakers.
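A sketch of the browser-side capture path: Float32 samples from the Web Audio API are converted to 16-bit PCM, base64 encoded, and wrapped in JSON before being sent to the relay. The realtimeInput/mediaChunks frame shape is my assumption about what the relay forwards unchanged:

```typescript
async function startCapture(socket: WebSocket) {
  // Ask the context for 16 kHz directly; browsers that can't honor this
  // would need manual resampling before sending.
  const audioCtx = new AudioContext({ sampleRate: 16000 });
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const source = audioCtx.createMediaStreamSource(stream);

  // ScriptProcessorNode is deprecated, but it hands us raw sample buffers
  // without setting up an AudioWorklet.
  const processor = audioCtx.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    const float32 = event.inputBuffer.getChannelData(0);

    // Float32 [-1, 1] -> little-endian Int16 PCM
    const pcm = new Int16Array(float32.length);
    for (let i = 0; i < float32.length; i++) {
      const s = Math.max(-1, Math.min(1, float32[i]));
      pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
    }

    // Raw bytes -> base64 -> JSON frame for the relay to forward.
    let binary = "";
    for (const b of new Uint8Array(pcm.buffer)) binary += String.fromCharCode(b);
    socket.send(
      JSON.stringify({
        realtimeInput: {
          mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: btoa(binary) }],
        },
      })
    );
  };

  source.connect(processor);
  processor.connect(audioCtx.destination);
  return { audioCtx, stream, source, processor };
}
```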

The whole conversation happens in real-time with Gemini’s multimodal live API, which can handle both audio input and audio output natively without needing separate speech-to-text and text-to-speech steps.

Privacy and Security Considerations

There are a few important details here about security. The Gemini API key never touches the browser. It lives only on my server in environment variables. The server acts as a proxy, so the frontend never needs credentials.

For Spotify, I’m using OAuth refresh tokens. The initial setup requires a one-time authorization where I approve my own app to access my Spotify data. That exchange gives me a refresh token that I store on the server. The refresh token gets exchanged for short-lived access tokens as needed.
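The exchange itself is a standard OAuth refresh against Spotify's token endpoint. A sketch, with placeholder environment variable names:

```typescript
async function getSpotifyAccessToken(): Promise<string> {
  const basic = Buffer.from(
    `${process.env.SPOTIFY_CLIENT_ID}:${process.env.SPOTIFY_CLIENT_SECRET}`
  ).toString("base64");

  const res = await fetch("https://accounts.spotify.com/api/token", {
    method: "POST",
    headers: {
      Authorization: `Basic ${basic}`,
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: new URLSearchParams({
      grant_type: "refresh_token",
      refresh_token: process.env.SPOTIFY_REFRESH_TOKEN!,
    }),
  });

  // The short-lived access token is all the data-fetching code needs.
  const { access_token } = await res.json();
  return access_token;
}
```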

Scaling the Idea

The Spotify integration is just one example. The same pattern works for any API. I could inject my current location from a GPS tracker, upcoming calendar events from Google Calendar, or the current weather in Boston. The main limitation is the context window size, so you need to be selective about what information is actually useful for conversations.

Another consideration is API rate limits and costs. Every conversation makes Spotify API calls. Currently that’s fine because I have a low-traffic personal site, but if thousands of people were starting conversations simultaneously, I’d need to implement caching or rate limiting.
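One way to handle that would be a small in-memory cache with a short TTL, so concurrent sessions share a single Spotify fetch. This is a sketch reusing the fetchSpotifyContext and getSpotifyAccessToken helpers from above; the 30-second TTL is arbitrary and not part of the original implementation:

```typescript
type SpotifyContext = Awaited<ReturnType<typeof fetchSpotifyContext>>;

let cached: { data: SpotifyContext; fetchedAt: number } | null = null;
const TTL_MS = 30_000;

async function getSpotifyContextCached(): Promise<SpotifyContext> {
  // Serve from cache while it's fresh; otherwise refetch and remember it.
  if (cached && Date.now() - cached.fetchedAt < TTL_MS) return cached.data;
  const data = await fetchSpotifyContext(await getSpotifyAccessToken());
  cached = { data, fetchedAt: Date.now() };
  return data;
}
```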

Real-Time Conversations

One of the more interesting aspects of this implementation is that it’s truly real-time. There’s no “press to talk” button. Once you start the conversation, the AI is listening continuously, and it will cut itself off mid-response if you start talking over it. It feels more like a phone call than a chatbot.

This is possible because Gemini 2.0’s multimodal live API supports bidirectional streaming. Both sides can send and receive at the same time. The browser is constantly capturing audio from the microphone and streaming it to Gemini through my relay server.

I added a five-minute timeout to prevent runaway API costs. If someone leaves the connection open, it automatically disconnects after five minutes. This protects against cases where someone walks away from their computer with the microphone still active.
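The timeout itself is a few lines inside the connection handler from the relay sketch above, assuming the same browser and gemini socket variables:

```typescript
const SESSION_LIMIT_MS = 5 * 60 * 1000;

// After five minutes, close both sockets so an abandoned tab can't keep
// streaming audio to the API.
const timer = setTimeout(() => {
  browser.close(1000, "session time limit reached");
  gemini.close();
}, SESSION_LIMIT_MS);

// Clear the timer if the session ends normally first.
browser.on("close", () => clearTimeout(timer));
```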

What I Learned

Building this taught me a few things about working with voice AI.

First, audio format matters a lot. Gemini expects 16kHz PCM audio for input and outputs 24kHz PCM audio. Getting the Web Audio API to capture, convert, and play these formats correctly took some trial and error. The ScriptProcessorNode API is deprecated but still necessary for this kind of real-time audio processing in browsers.
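The playback side is the mirror image of the capture sketch earlier: Gemini's 24kHz PCM chunks arrive base64 encoded, get converted back to Float32, and are handed to an AudioContext. This is a simplified sketch that plays each chunk as it arrives rather than scheduling them precisely:

```typescript
const playbackCtx = new AudioContext({ sampleRate: 24000 });

function playPcmChunk(base64: string) {
  // base64 -> raw bytes -> Int16 samples
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  const pcm = new Int16Array(bytes.buffer);

  // Int16 -> Float32 in [-1, 1] for the Web Audio API
  const float32 = new Float32Array(pcm.length);
  for (let i = 0; i < pcm.length; i++) float32[i] = pcm[i] / 0x8000;

  const buffer = playbackCtx.createBuffer(1, float32.length, 24000);
  buffer.copyToChannel(float32, 0);

  const node = playbackCtx.createBufferSource();
  node.buffer = buffer;
  node.connect(playbackCtx.destination);
  node.start();
}
```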

Second, WebSocket reliability is crucial. If the connection drops mid-conversation, everything breaks. I added error handling to gracefully close both connections when one side disconnects, and to clean up all audio resources properly.
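On the browser side, that cleanup looks roughly like the sketch below, reusing the handles returned by startCapture and assuming socket and handles are in scope:

```typescript
function cleanup(handles: {
  audioCtx: AudioContext;
  stream: MediaStream;
  source: MediaStreamAudioSourceNode;
  processor: ScriptProcessorNode;
}) {
  // Disconnect the audio graph, release the microphone, and close the
  // context so nothing keeps running in the background.
  handles.processor.disconnect();
  handles.source.disconnect();
  handles.stream.getTracks().forEach((track) => track.stop());
  handles.audioCtx.close();
}

socket.addEventListener("close", () => cleanup(handles));
socket.addEventListener("error", () => {
  socket.close();
  cleanup(handles);
});
```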

Third, context injection needs to be fast. If fetching Spotify data takes too long, the user sits there waiting for the conversation to start. I made all the Spotify API calls in parallel to minimize latency. The whole context fetch and injection happens in under a second.

The Result

What I ended up with is a voice assistant on my website that actually knows things about me in real-time. Visitors can ask what I’m listening to, what my music taste is like, or just have a conversation about my work. The AI responds with current, accurate information because that data is sitting right there in its context window.

It’s a small feature, but it demonstrates how enriching AI context with live API data can create more interesting and useful interactions. The pattern is reusable for all sorts of applications where you want an AI to know current state rather than just static facts.
