Goose Perception agents watching, listening, learning and acting


Note

This is experimental and very much a work in progress. It is a showcase of async goose and recipes; use at your own risk.

Ideally agents wouldn't have an interface that you have to learn or use, and shouldn't be chat or make you work any specific way. This project aims to address that.

Goose Perception runs on your desktop and learns from how you work, without any interaction on your part; you just go about your day. This is the most personal way to build up a background assistant that can take actions on your behalf, without your having to use another tool. Just keep going, and goose will help you out.

Run this with the command just run and let it observe how you work (watch and listen). In the background it will learn, then start doing work for you (carefully!), suggesting things, reminding you, and so on. It can check your messages, emails, and schedule while you are busy, and it can pick up problems and try to solve them before you notice the messages. You can leave notes for things to work on while you are solving other things. It is always watching, listening, and perceiving, and then acting. If this is all too much, the command just simple will run only the recipes (no voice or other enhancements).

Goose perception has a fun or creepy avatar that will lurk on your screen - See AVATAR_SYSTEM.md for details about the floating avatar system that provides visual feedback and suggestions.

The avatar provides different types of suggestions based on your activity:

🔍 Work Pattern Alerts: "I've been watching your workflow... Want me to automate it?"
📅 Meeting Notices: "Your next meeting starts in 10 minutes. Should I prepare the agenda?"
🎯 Focus Suggestions: "I notice you're switching between tasks frequently. Want help prioritizing?"
⚡ Optimization Tips: "I could automate that repetitive task you just did..."

The avatar may also offer to do things for you.

🎭 Avatar Personality System

The avatar brings life to Goose Perception with 8 distinct personalities - not just different work styles, but genuinely unique characters with their own traits, quirks, and ways of seeing the world. Each personality fundamentally changes how the avatar interacts with you.

🌧️ Melancholic: Brooding and introspective, finds beauty in struggle and poetic meaning in mundane tasks
🃏 Joker: Chaotic and unpredictable, loves pranks and unconventional approaches
😂 Comedian (Default): Always cracking jokes and finding humor in everything
👁️ Creepy: Unsettling and mysterious, makes eerily accurate observations about your patterns
🧘 Zen Master: Wise and peaceful, speaks in riddles and philosophical insights
💬 Gossip: Chatty and dramatic, loves to "spill tea" about your code and projects
🙄 Sarcastic: Dry wit and cutting sarcasm delivered with perfect timing
🐕 Excited Puppy: Boundlessly enthusiastic about absolutely everything

How to Change Personalities

Right-click the avatar to open the personality selection menu
Select a personality from the dropdown (current selection marked with ✓)
Watch the costume change - entertaining transition messages like "🎭 Switching to Sarcastic mode..."
Experience complete consistency - all suggestions, responses, and interactions match the chosen personality

Each personality provides dramatically different experiences:

🌧️ Melancholic: "Ah, another click in this endless digital void... how beautifully tragic your code becomes in these moments of despair."
🃏 Joker: "CHAOS CLICK! What havoc shall we wreak today? Plot twist: your function might actually work!"
🙄 Sarcastic: "Oh wow, another click. How revolutionary. Let me guess, you want me to do something 'helpful'?"
🐕 Excited Puppy: "OH MY GOSH YOU CLICKED ME! THIS IS SO EXCITING! WOW WOW WOW!"

🔄 Message Queue System: Prevents competing messages, ensures smooth sequential display
💾 Persistent Settings: Your chosen personality is saved and restored across restarts
🧵 Thread-Safe Operations: Background personality changes don't freeze the UI
🎯 Complete Consistency: Every interaction maintains the same personality voice
⚡ Priority Handling: Important actionable messages get proper queue priority

Pro Tip: The personality system uses Goose's recipe parameter system, so each personality gets uniquely generated content rather than just different templates!

🎯 Interactive Action Menu

The avatar now features a smart action menu that appears when you click it! Instead of just showing messages, the avatar presents you with immediate access to its most useful functions through a beautiful, personality-aware interface.

🚀 Quick Actions at Your Fingertips

When you click the avatar, you'll see a personality-specific greeting followed by 6 action buttons:

📊 Run Report (Blue): Generate optimization analysis (same as Cmd+Shift+R hotkey)
🎤 Listen to Me (Red): Activate voice listening mode with helpful instructions
💬 Enter Prompt (Green): Open text input dialog for direct requests
📋 Show Status (Purple): Display real-time system information and diagnostics
🎭 Change Personality (Orange): Quick access to personality switching (redirects to right-click)
📝 Recent Work (Gray): View your latest work activity and summaries

The Show Status action provides comprehensive system information at a glance:

💻 Real-time Performance: CPU usage, memory consumption
🔄 Observer Status: Whether background recipes are running
📝 Work Logs: Status of work activity tracking files
🎭 Current Personality: Active personality mode
📊 Message Queue: Number of queued messages
🏠 System Health: Perception directory status and timestamp

🎭 Personality-Aware Greetings: Each personality has unique, character-appropriate messages
🎨 Color-Coded Actions: Visual organization with intuitive color scheme
⏱️ Auto-Dismiss: Menu automatically closes after 60 seconds if unused
🔄 Thread-Safe Execution: All actions run in background without blocking the UI
📱 Responsive Design: Fixed-width layout that grows upward from the avatar
🛡️ Error Handling: Graceful fallbacks if system information isn't available

The greeting changes based on your avatar's personality:

🙄 Sarcastic: "Let me guess, you want me to do something helpful?"
🃏 Joker: "PLOT TWIST! You want me to actually DO something?!"
🌧️ Melancholic: "In this digital void, what task calls to you?"
🐕 Excited Puppy: "OH MY GOSH YES! HOW CAN I HELP YOU TODAY?!"

Single-click the avatar to open the action menu
Choose an action by clicking any of the colorful buttons
Follow the prompts for text input or voice activation
View results through notifications and status updates
Double-click the avatar if a message gets stuck (emergency dismiss)

Note: The action menu replaces the old avatar state cycling behavior, providing much more useful functionality while maintaining the avatar's personality and charm!

This makes use of many goose-powered agents (which use "recipes") to observe you as you go about your day and learn about your work, projects, and more, so it can suggest improvements and take actions. It works by watching how you work (screens, interactions, collaboration) and literally listening, and it runs many MCPs and recipes asynchronously.

It also includes an audio agent activation tool that uses local transcription models, with a custom wake word detection model and conversation capture. No audio leaves your machine. Voice input is for dictating instructions to goose, not for holding a conversation.

Some of the recipes, for example, will prompt you when appropriate with suggestions on how to order your tasks or day. Goose will communicate with you via notifications (if it must), voice, or by presenting ideas and reports from time to time.

Once running - just go about your day, you don't need to do anything, it will do something.

Click the avatar to open an interactive action menu with 6 helpful functions like running reports, voice activation, text prompts, and system status.

You can ask goose to do something by talking: just mention "goose" and it will detect that you are talking to it (rather than about it).

Cmd+Shift+G: Capture your screen and give Goose an instruction about what to do with it
Cmd+Shift+R: Run optimization analysis and generate HTML5 reports

All methods work asynchronously and Goose will process your requests in the background.

Note

This is a real screenshot (blurred with AI) of a meeting where it was early morning, and goose was prepping me for the day, noting any important actions in slack and email I needed to take (knowing that I am terrible at reading email until much later in the day).

Let it run for a while (i.e. run it all the time in the background). Over time goose will start showing things to you (windows with little web apps or reports), quietly putting reminders on your calendar, and preparing documents for upcoming things. It may even patch code and open pull requests if something important and urgent happens while your attention is elsewhere.

For example, it will do a weekly "hype doc" summary celebrating all your hard work! There are many other things it may do, and they all combine together as it learns what you do. It knows when you are in a meeting and can use the dialog to help you not miss follow-up items and take action on them.

You will see things appear in your calendar (if you have the gcal integration enabled), notes in your notes app etc - it will work out where you manage things and interact there.

It may even tell you (congratulate?) it is time to take a break based on recent activity (like this!)

Tip

You can leave notes for goose to work on in your notes app; it will act on them and report back there.

You can also use verbal commands to ask it to do something (with the benefit of that learned context). A normal goose session can benefit from this context as well.

Prerequisites:

just command runner
macOS (for hotkey functionality)

The application will:

Create a virtual environment with Python 3.12
Install all required dependencies
Download models when first needed
Train the classifiers on first run

For the hotkey functionality (Cmd+Shift+G) to work, you may need to grant permissions:

Accessibility: System Preferences → Security & Privacy → Privacy → Accessibility
- Add Terminal or your Python executable to allow hotkey detection
Screen Recording: System Preferences → Security & Privacy → Privacy → Screen Recording
- Add Terminal or your Python executable to allow screenshot capture

Note: The application sets TOKENIZERS_PARALLELISM=false to avoid warnings from the Hugging Face tokenizers library. If you run into any issues with tokenizers, you can manually set this environment variable: export TOKENIZERS_PARALLELISM=false
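If you start the Python code yourself rather than through just, the same variable can be set from Python. This is a minimal sketch, not the project's own startup code; the only requirement is that it runs before the tokenizers library is first used.

```python
import os

# Set the variable only if the user has not already chosen a value.
# This must run before anything imports or uses the tokenizers library.
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")

print(os.environ["TOKENIZERS_PARALLELISM"])
```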

just run                # Run the application
just train-classifier   # run the classifier (won't usually need to)
just status             # to check what is running
just logs               # to follow along with background things going on
just kill               # kill switch

The voice models (currently Whisper) and the locally trained models (DistilBERT-based) all run locally when listening to the environment and for commands; no audio leaves your machine. Screenshots are taken periodically and then cleaned out; goose is used to summarise them into a rolling log.

Local models and personal data

Goose will use whatever default models are configured for it, so they can be local models. If they are remote models, screenshots will be sent to them, as will information that other recipes gather. Running everything locally is possible via goose config, and screen analysis may also be possible with ollama models that are fine-tuned for it. (I wasn't able to run any multimodal models on my local machine, so the goose agent loop won't work there with local models when images are needed, but with a more powerful machine a totally local flow would be possible.) There are models such as llava:3b (ideally 13b or up) that can read screen content for summarization, but they need to be used in concert with other models (think of it as more like fancy OCR!). This is an emerging space, so keep an eye out.

The observers directory has many recipes for learning about you - these will run from time to time, some continuously.

The work recipe looks at your screen content and will summarise, learn, and note what you are working on, who you interact with, and what has changed. It will even work out when you are asleep (well, not at your computer). It keeps a rolling log of activity and an up-to-date record of the latest activity.

Interactions and contributions

Other recipes will examine your communications history and contributions, and form a picture and some metrics of what you have been doing recently and with whom. They will then proactively go to work (check out the observers directory).

Voice input is useful as it is always listening and knows the context in which you ask for something to be done. It is best for async tasks, not conversation.

The system now includes a hotkey-activated screen capture feature that allows you to quickly capture your screen and have Goose analyze it.

Hotkey: Cmd+Shift+G (press and hold all three keys)
Process: Captures screen → Shows input dialog → Processes with Goose
Use cases: Code review, document analysis, UI interaction, task automation

See HOTKEY_FEATURE.md for detailed documentation.

The application uses a sequential processing approach with continuous audio capture:

Audio Capture (continuous background thread)
- Captures audio from the microphone in real-time
- Buffers audio in a queue for processing
- Runs in a separate thread that never blocks
Audio Processing (main thread)
- Collects 5-second chunks of audio from the queue
- Saves each chunk to a temporary file
- Submits the chunk for transcription in a background thread
Transcription (background thread)
- Transcribes the audio chunk using Whisper
- Runs in a background thread to avoid blocking audio capture
- Returns the transcribed text to the main thread
Wake Word Detection (main thread)
- Checks if the transcribed text contains "goose"
- If found, uses the classifier to determine if it's addressed to Goose
- The classifier check is fast and doesn't block audio capture
Mode Switching
- If addressed to Goose: switches to active listening mode
- If not: stays in passive listening mode

The system maintains continuous audio capture throughout all these steps, ensuring no audio is missed during processing or classification.
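The capture and chunking steps above can be sketched with a queue and a background thread. This is an illustration under stated assumptions, not the project's code: the sample rate, the frame size, and the function names (`capture_audio`, `collect_chunks`, `run_pipeline`) are hypothetical, the microphone is replaced by a list of simulated frames, and the Whisper transcription thread is left out.

```python
import queue
import threading

SAMPLE_RATE = 16_000   # assumed sample rate in Hz
CHUNK_SECONDS = 5      # chunk length taken from the description above
FRAME_SAMPLES = 1_600  # assumed frame size (0.1 s of audio per frame)


def capture_audio(frames, audio_queue, stop_event):
    """Capture thread: push frames onto the queue for the main thread."""
    for frame in frames:
        if stop_event.is_set():
            break
        audio_queue.put(frame)
    audio_queue.put(None)  # sentinel: no more audio


def collect_chunks(audio_queue):
    """Main thread: group queued frames into 5-second chunks."""
    samples_per_chunk = SAMPLE_RATE * CHUNK_SECONDS
    buffer = []
    while True:
        frame = audio_queue.get()
        if frame is None:
            break
        buffer.extend(frame)
        while len(buffer) >= samples_per_chunk:
            yield buffer[:samples_per_chunk]
            buffer = buffer[samples_per_chunk:]


def run_pipeline(frames):
    """Start the capture thread and collect complete chunks from its queue."""
    audio_queue = queue.Queue()
    stop_event = threading.Event()
    worker = threading.Thread(
        target=capture_audio, args=(frames, audio_queue, stop_event), daemon=True
    )
    worker.start()
    chunks = list(collect_chunks(audio_queue))
    worker.join()
    return chunks


# Simulated input: 12 seconds of silence delivered as 0.1-second frames.
fake_frames = [[0] * FRAME_SAMPLES for _ in range(120)]
chunks = run_pipeline(fake_frames)
print(len(chunks), len(chunks[0]))
```

In this sketch the trailing 2 seconds of audio never fill a chunk and are dropped when the input ends; a continuously running capture thread would simply keep filling the buffer.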

The system uses an enhanced ML-based classifier to determine if speech is addressed to Goose:

Two-Model Approach: Uses a lightweight model (tiny) for wake word detection and a higher-quality model for full transcription
Fuzzy Text Matching: Can detect variations of "goose" using fuzzy string matching
Confidence Thresholds: Configurable confidence threshold for wake word classification
ML-Based Classification: Uses a fine-tuned DistilBERT model to determine if speech is addressed to Goose
- More accurate and context-aware than simple text matching
- Can distinguish between mentions of "goose" and actual commands to Goose
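The two-stage check (fuzzy match first, classifier second) might look roughly like the sketch below. It is only an illustration: fuzzy matching is done here with the standard library's `difflib`, which may not be what the project uses, and `stub_classifier` is a hypothetical stand-in for the fine-tuned DistilBERT model. The thresholds are the defaults listed for `--fuzzy-threshold` and `--classifier-threshold`.

```python
from difflib import SequenceMatcher

WAKE_WORD = "goose"
FUZZY_THRESHOLD = 80        # --fuzzy-threshold default (0-100)
CLASSIFIER_THRESHOLD = 0.6  # --classifier-threshold default (0-1)


def fuzzy_contains_wake_word(text, threshold=FUZZY_THRESHOLD):
    """Return True if any word in the text is close enough to the wake word."""
    for word in text.lower().split():
        word = word.strip(".,!?;:'\"")
        score = SequenceMatcher(None, word, WAKE_WORD).ratio() * 100
        if score >= threshold:
            return True
    return False


def stub_classifier(text):
    """Hypothetical stand-in for the DistilBERT classifier: returns a confidence."""
    addressed = text.lower().lstrip().startswith(("goose", "hey goose"))
    return 0.9 if addressed else 0.2


def is_addressed_to_goose(text, classifier=stub_classifier):
    """Stage 1: cheap fuzzy match. Stage 2: classifier confidence check."""
    if not fuzzy_contains_wake_word(text):
        return False  # stay in passive listening
    return classifier(text) >= CLASSIFIER_THRESHOLD


print(is_addressed_to_goose("Hey goose, open my calendar"))
print(is_addressed_to_goose("I saw a goose at the park"))
```

The fuzzy stage is deliberately permissive (a word like "loose" would pass it), which is why the classifier stage is needed to separate mentions from commands.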

┌────────────────────┐     ┌────────────────────┐     ┌────────────────────┐
│                    │     │                    │     │                    │
│ Audio Capture      │────▶│ 5-second Chunks    │────▶│ Quick Transcription│
│ (Background)       │     │ (Main Thread)      │     │ (Lightweight Model)│
│                    │     │                    │     │                    │
└────────────────────┘     └────────────────────┘     └──────────┬─────────┘
                                                                 │
                                                                 ▼
┌────────────────────┐                             ┌─────────────────────────┐
│                    │                             │                         │
│ Passive Listening  │◀─────────── No ─────────────┤ Contains "goose"?       │
│                    │                             │ (Fuzzy Match)           │
└────────────────────┘                             └─────────────┬───────────┘
                                                                 │ Yes
                                                                 │
                                                                 ▼
┌────────────────────┐                             ┌─────────────────────────┐
│                    │                             │                         │
│ Passive Listening  │◀─────────── No ─────────────┤ Addressed to Goose?     │
│                    │                             │ (Classifier Check)      │
└────────────────────┘                             └─────────────┬───────────┘
                                                                 │ Yes
                                                                 │
                                                                 ▼
┌────────────────────┐                             ┌─────────────────────────┐
│                    │                             │                         │
│ Switch to Active   │─────────────────────────────▶ Active Listening        │
│ Mode               │                             │ (Main Model)            │
└────────────────────┘                             └─────────────┬───────────┘
                                                                 │
                                                                 ▼
                                                   ┌─────────────────────────┐
                                                   │                         │
                                                   │ Monitor Until Silence   │
                                                   │                         │
                                                   └─────────────┬───────────┘
                                                                 │
                                                                 ▼
                                                   ┌─────────────────────────┐
                                                   │                         │
                                                   │ Full Transcription      │
                                                   │ (Using Main Model)      │
                                                   └─────────────┬───────────┘
                                                                 │
                                                                 ▼
                                                   ┌─────────────────────────┐
                                                   │                         │
                                                   │ Save Conversation       │
                                                   │                         │
                                                   └─────────────┬───────────┘
                                                                 │
                                                                 ▼
                                                   ┌─────────────────────────┐
                                                   │                         │
                                                   │ Invoke Goose Agent      │
                                                   │ (via agent.py)          │
                                                   └─────────────┬───────────┘
                                                                 │
                                                                 ▼
                                                   ┌─────────────────────────┐
                                                   │                         │
                                                   │ Goose Process           │
                                                   │ (Background Thread)     │
                                                   └─────────────────────────┘

Multi-Scale Transcription System

Short Chunks (5 seconds)
- Used for real-time monitoring and wake word detection
- Provides immediate feedback on what's being heard
- Serves as building blocks for longer transcriptions
Context Buffer (30 seconds)
- Maintains a rolling window of recent speech
- Preserves what was said before the wake word
- Composed of multiple 5-second chunks
Active Listening (Variable Length)
- Triggered when wake word is detected
- Continues until silence is detected (default: 3 seconds of silence)
- Captures the complete interaction after the wake word
Full Conversations
- Combines context buffer + active listening period
- Captures speech before, during, and after wake word
- Re-transcribes the entire audio using the main model
- Saved as complete audio and transcript files
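The 30-second context buffer built from 5-second chunks maps naturally onto a bounded deque. The sketch below assumes the buffer holds transcribed text per chunk; the class name `ContextBuffer` and its methods are hypothetical, and the real system may buffer raw audio instead (it re-transcribes the full audio with the main model).

```python
from collections import deque

CHUNK_SECONDS = 5
CONTEXT_SECONDS = 30  # --context-seconds default


class ContextBuffer:
    """Rolling window of the most recent transcribed chunks."""

    def __init__(self, context_seconds=CONTEXT_SECONDS, chunk_seconds=CHUNK_SECONDS):
        # 30 s of context at 5 s per chunk means the last 6 chunks are kept.
        max_chunks = max(1, context_seconds // chunk_seconds)
        self._chunks = deque(maxlen=max_chunks)

    def add(self, text):
        """Append a chunk; the oldest chunk is discarded once the window is full."""
        self._chunks.append(text)

    def context(self):
        """Return everything said within the context window, oldest first."""
        return " ".join(self._chunks)


buffer = ContextBuffer()
for i in range(8):
    buffer.add(f"chunk{i}")
print(buffer.context())  # chunk2 chunk3 chunk4 chunk5 chunk6 chunk7
```

When the wake word is verified, the contents of this window are what "speech before the wake word" refers to.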

Passive listening mode:

Captures audio from your microphone in real-time (continuous)
Processes the audio in 5-second chunks (sequential)
Maintains a rolling buffer of recent speech (default: 30 seconds)
Continuously monitors for the wake word "goose"
Shows minimal output to indicate it's working

Active listening mode:

Triggered when the wake word is detected and verified by the classifier
Preserves the context from before the wake word was spoken
Continues actively transcribing all speech
Monitors for a period of silence (default: 3 seconds)
When silence is detected, saves the entire conversation (context + active speech)
Returns to passive listening mode

During active listening, the system prioritizes capturing the complete conversation. It continues to buffer audio in the background, ensuring no speech is missed even during transcription.

Complete conversations are saved as both audio (.wav) and text (.txt) files
Files are stored in the recordings directory with timestamps
Each conversation includes speech from before the wake word was detected

Activation Logging for Training

The system logs all wake word activations for analysis and model improvement
Successful activations are saved as activation_triggered_[timestamp].txt
Bypassed activations (when "goose" is detected but not addressed to Goose) are saved as activation_bypassed_[timestamp].txt
Each log includes the transcript, confidence score, and timestamp
These logs can be used to retrain the wake word classifier to improve accuracy
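A log writer following the file naming above could be as small as the sketch below. The timestamp format, the layout of the file contents, and the `log_dir` argument are assumptions; the source only states the file name pattern and that each log holds the transcript, confidence score, and timestamp.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def log_activation(log_dir, transcript, confidence, triggered, now=None):
    """Write one activation log file and return its path."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")  # assumed timestamp format
    kind = "triggered" if triggered else "bypassed"
    path = Path(log_dir) / f"activation_{kind}_{stamp}.txt"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(
        f"timestamp: {now.isoformat()}\n"
        f"confidence: {confidence:.2f}\n"
        f"transcript: {transcript}\n",
        encoding="utf-8",
    )
    return path


# Example: "goose" was heard, but the classifier decided it was not a command.
with tempfile.TemporaryDirectory() as tmp:
    path = log_activation(
        tmp,
        "I saw a goose at the park",
        0.21,
        triggered=False,
        now=datetime(2025, 1, 2, 3, 4, 5),
    )
    logged = path.read_text(encoding="utf-8")
    print(path.name)
```

Keeping the bypassed cases alongside the triggered ones is what makes the logs usable as labelled training data for the classifier.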

The system directly integrates with Goose through the agent.py module:

When a conversation is complete, perception.py directly calls agent.process_conversation()
The agent reads the transcript and prepares it for Goose with appropriate instructions
Goose is invoked with the command: goose run --name voice -t "The user has spoken the following..."
The Goose process runs in a separate thread to avoid blocking the main application
All Goose interactions happen in the ~/Documents/voice directory

The system uses a multi-threaded approach to handle Goose interactions:

Main Thread (perception.py)
- Detects wake words, processes conversations
- Calls agent.process_conversation() when a conversation is complete
- Continues listening for new wake words immediately
Agent Thread (agent.py)
- Created by agent.process_conversation()
- Runs run_goose_in_background() in a daemon thread
- Daemon threads don't block program exit
Goose Process
- Started by the agent thread using subprocess.call()
- Runs the Goose CLI with the transcript
- Operates independently from the main application

This design ensures that:

The voice recognition system continues to function while Goose processes requests
Multiple conversations can be handled sequentially
The application remains responsive during Goose processing
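The threading arrangement above can be sketched as follows. The function names `process_conversation` and `run_goose_in_background` come from the description, but their signatures, the `runner` parameter, and the helper `build_goose_command` are assumptions made for illustration. The full instruction text passed to `-t` is elided in the source, so the sketch keeps the elided prefix as a placeholder and appends the transcript.

```python
import subprocess
import threading
from pathlib import Path

VOICE_DIR = Path.home() / "Documents" / "voice"


def build_goose_command(transcript):
    """Build the CLI call; the real instruction wording is elided in the source."""
    prompt = f"The user has spoken the following...\n{transcript}"  # placeholder
    return ["goose", "run", "--name", "voice", "-t", prompt]


def run_goose_in_background(transcript, runner=subprocess.call, cwd=VOICE_DIR):
    """Agent thread body: run the Goose CLI and wait for it to finish."""
    return runner(build_goose_command(transcript), cwd=str(cwd))


def process_conversation(transcript, runner=subprocess.call, cwd=VOICE_DIR):
    """Start a daemon thread so the caller can keep listening right away."""
    thread = threading.Thread(
        target=run_goose_in_background,
        args=(transcript, runner, cwd),
        daemon=True,  # daemon threads don't block program exit
    )
    thread.start()
    return thread


# Demo with a fake runner so the example does not require Goose to be installed.
calls = []
thread = process_conversation(
    "remind me to review the pull request",
    runner=lambda cmd, cwd: calls.append((cmd, cwd)) or 0,
    cwd="/tmp",
)
thread.join()
print(calls[0][0][:5])
```

Passing the command as a list avoids shell quoting problems with whatever the user happened to say.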

Continuous Conversation Support

The system supports continuous conversations without requiring silence between commands:

During active listening, it continues to monitor for additional wake words
If a wake word is detected during active listening, the silence counter is reset
This allows for chained commands without waiting for silence
Example: "Hey Goose, what's the weather? Hey Goose, set a timer for 5 minutes."
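The silence counter behaviour can be illustrated with a small loop. This is a simplified sketch, not the project's logic: it assumes speech arrives as (duration, text) segments with empty text meaning silence, and it treats any speech, including a repeated wake word, as resetting the counter. The function name `capture_active_speech` is hypothetical.

```python
SILENCE_SECONDS = 3  # --silence-seconds default


def capture_active_speech(segments, silence_seconds=SILENCE_SECONDS):
    """Collect speech until `silence_seconds` of continuous silence.

    `segments` is an iterable of (duration_seconds, text) pairs;
    empty text means the segment was silent.
    """
    captured = []
    silent_for = 0.0
    for duration, text in segments:
        if text.strip():
            captured.append(text)
            silent_for = 0.0  # new speech resets the silence counter
        else:
            silent_for += duration
            if silent_for >= silence_seconds:
                break  # enough silence: return to passive listening
    return " ".join(captured)


segments = [
    (1, "Hey Goose, what's the weather?"),
    (1, ""),
    (1, ""),
    (1, "Hey Goose, set a timer for 5 minutes."),
    (1, ""),
    (1, ""),
    (1, ""),
    (1, "this arrives after the conversation ended"),
]
conversation = capture_active_speech(segments)
print(conversation)
```

The two-second pause between the commands is shorter than the three-second limit, so both commands end up in the same conversation.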

Parameter	Description	Default

--context-seconds	Seconds of speech to keep before wake word	30
--silence-seconds	Seconds of silence to end active listening	3
--recordings-dir	Directory to save audio and transcripts	"recordings"
--model	Whisper model size	"base"
--language	Language code (optional)	None (auto-detect)
--device	Audio input device number	None (default)
--use-lightweight-model	Use lightweight model for wake word detection	True
--no-lightweight-model	Don't use lightweight model for wake word detection	False
--fuzzy-threshold	Fuzzy matching threshold for wake word (0-100)	80
--classifier-threshold	Confidence threshold for classifier (0-1)	0.6
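For readers who want to see how these options fit together, here is a sketch of an equivalent `argparse` definition. The flag names and defaults are taken from the table; the value types (for example, integer seconds) and the shared `use_lightweight_model` destination are assumptions, and the project's actual parser may differ.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options in the table above."""
    parser = argparse.ArgumentParser(description="Voice perception options (sketch)")
    parser.add_argument("--context-seconds", type=int, default=30)
    parser.add_argument("--silence-seconds", type=int, default=3)
    parser.add_argument("--recordings-dir", default="recordings")
    parser.add_argument("--model", default="base")
    parser.add_argument("--language", default=None)   # None means auto-detect
    parser.add_argument("--device", type=int, default=None)
    # Both flags write to the same destination; the lightweight model is on by default.
    parser.add_argument("--use-lightweight-model", dest="use_lightweight_model",
                        action="store_true", default=True)
    parser.add_argument("--no-lightweight-model", dest="use_lightweight_model",
                        action="store_false")
    parser.add_argument("--fuzzy-threshold", type=int, default=80)
    parser.add_argument("--classifier-threshold", type=float, default=0.6)
    return parser


defaults = build_parser().parse_args([])
custom = build_parser().parse_args(["--no-lightweight-model", "--model", "small"])
print(defaults.use_lightweight_model, custom.use_lightweight_model, custom.model)
```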

The system processes audio in 5-second chunks, which represents a balance between:

Responsiveness: Short enough to detect wake words quickly
Transcription quality: Long enough for Whisper to have sufficient context
Natural speech: Aligns with typical spoken phrase length
Processing efficiency: Optimizes CPU and memory usage

Shorter chunks would improve responsiveness but reduce transcription quality, while longer chunks would improve transcription but increase latency.

The system uses a background thread for transcription:

Non-blocking Design: Audio capture continues even during transcription
Sequential Processing: Each audio chunk is processed in order
Reliable Wake Word Detection: The system processes each chunk fully before moving to the next
Focused Attention: Once activated, the system captures the entire conversation without interruption

This design ensures that the system properly captures complete conversations while maintaining a simple and reliable architecture.

Press Ctrl+C to stop the application.
