What if your phone camera could help you “hear” your surroundings? That’s the idea behind a little experiment I hacked together: a11y-deepsee — a live depth-to-audio system built with a regular webcam, an AI model for depth perception, and spatial 3D sound via OpenAL. No LiDAR, no special gear. Just standard hardware and some open tools.
The concept of translating visual information into sound to help blind or visually impaired people isn’t new. Projects like vOICe, Sound of Vision, and EyeMusic have explored converting depth or color to audio. Others like EchoSee (2024) used iPhone LiDAR plus AirPods Pro to create a spatial soundscape. So why hasn’t this taken off?
A few reasons:
- Hardware was bulky or expensive (depth cameras, vests, goggles, etc.).
- Training was intense — users had to learn to decipher pitch/frequency encodings or dense soundscapes.
- Mental load was high — a wall of beeps and buzzes isn’t always helpful.
- And most importantly, nothing could beat the trusty white cane for simplicity and reliability.
Also, most of these systems relied on classic depth sensors and stereo audio, not the latest AI vision models. That matters.
This tiny proof of concept does a couple of things differently:
- It uses Depth Anything V2, a cutting-edge AI model that estimates depth from a single RGB frame. It runs locally on Apple Silicon GPUs (a minimal loading sketch follows this list).
- It renders true 3D spatial audio using OpenAL with real-time object positioning, giving you the illusion of direction and distance.
- It runs on standard hardware: your MacBook or iPhone camera, your AirPods or speakers.
- And it’s built in Python, open-source, and designed to be hackable.
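For the depth part, here's a minimal sketch of what local inference can look like, using the Hugging Face transformers depth-estimation pipeline. The checkpoint name and the "mps" device string are my assumptions; the actual repo may wire this up differently.

```python
import numpy as np
from PIL import Image
from transformers import pipeline

# Load Depth Anything V2 once; "mps" targets the Apple Silicon GPU
# (swap in "cpu" or "cuda" on other machines).
depth_pipe = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",  # assumed checkpoint
    device="mps",
)

frame = Image.open("frame.jpg")        # one RGB frame grabbed from the webcam
result = depth_pipe(frame)             # returns {"predicted_depth": tensor, "depth": PIL image}
depth_map = np.array(result["depth"])  # relative depth map (not metric distances)
print(depth_map.shape, depth_map.min(), depth_map.max())
```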
It’s not a product, not a startup, and definitely not a replacement for a cane. Just a fun experiment to explore the potential.
The system captures live video from your built-in camera. The Depth Anything model processes each frame and spits out a depth map. The app then samples points across the image in a grid, maps each point's position and distance, and turns it into an audio source in 3D space: closer points sound louder, and direction is panned spatially through OpenAL's engine (roughly as sketched below).
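As a rough illustration of that mapping, and not the repo's actual code: the grid size and the larger-value-means-closer convention are assumptions, and the OpenAL step is stubbed out in a comment.

```python
import numpy as np

def depth_grid_to_sources(depth_map: np.ndarray, rows: int = 4, cols: int = 6):
    """Sample a coarse grid from the depth map and return (x, y, gain) per point.

    x and y are in [-1, 1] (left/right, bottom/top); gain grows as a point gets closer.
    """
    h, w = depth_map.shape
    sources = []
    for r in range(rows):
        for c in range(cols):
            py = int((r + 0.5) * h / rows)       # pixel at the center of this grid cell
            px = int((c + 0.5) * w / cols)
            d = float(depth_map[py, px])
            gain = d / (float(depth_map.max()) + 1e-6)  # 0..1, assuming larger values = closer
            x = (px / w) * 2.0 - 1.0             # image column -> left/right in the sound field
            y = 1.0 - (py / h) * 2.0             # image row -> up/down
            sources.append((x, y, gain))
    return sources

# In the real app each (x, y, gain) would drive an OpenAL source's position and
# volume; here that step is left as a hypothetical update_source(x, y, gain).
```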
You get a minimalist interface that shows:
- Live camera feed
- Depth map
- Green dots indicating audio sources
It all runs locally — no cloud needed. Just plug and play.
So why build this now? Because with modern AI and hardware, this kind of assistive tech can be:
- Cheaper — no fancy sensors
- More portable — runs on a laptop or phone
- More intuitive — spatial audio is easier to learn than beep-codes or pitch scales
The hope is to show that we’re closer than ever to useful, affordable sensory tools for people with visual impairments — not as replacements, but as complements to what already works.
This is still experimental. The limitations are real:
- There’s some latency — camera capture, depth inference, and audio synthesis all add up.
- Distance accuracy isn’t perfect: Depth Anything is impressive, but it estimates relative depth, not calibrated metric distances.
- The audio landscape can get busy — especially in cluttered scenes.
- It’s not always obvious what you’re hearing.
A real usable system would need:
- Calmer, more orchestrated sound design (think gentle soundscapes, not sonar pings)
- Smart filtering to avoid overwhelming the listener (a naive version is sketched after this list)
- Dynamic audio cues that evolve over time
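To make the filtering idea concrete, one naive approach is to keep only the few closest points per frame and smooth their gains over time so sources fade in and out instead of flickering. The parameter names and values here are illustrative, not from the project.

```python
def filter_sources(sources, prev_gains, keep=3, smoothing=0.8):
    """Keep only the `keep` loudest (closest) sources and smooth gains over time.

    sources: list of (x, y, gain); prev_gains: dict of smoothed gains keyed by grid index.
    """
    ranked = sorted(enumerate(sources), key=lambda item: item[1][2], reverse=True)[:keep]
    kept = []
    for idx, (x, y, gain) in ranked:
        # Exponential smoothing so a briefly-detected obstacle fades in and out
        # gently instead of popping on and off every frame.
        smoothed = smoothing * prev_gains.get(idx, 0.0) + (1.0 - smoothing) * gain
        prev_gains[idx] = smoothed
        kept.append((x, y, smoothed))
    return kept
```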
Still, even in its current form, it offers something unique: the sense of directional awareness through nothing but a webcam and headphones.
The code is open source. The architecture is modular. You can plug in other audio renderers, play with sound design, try other depth models, or do whatever you want.
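The repo's actual interfaces may look different, but as a hypothetical illustration, the renderer seam could be as small as a protocol like this:

```python
from typing import Protocol, Sequence, Tuple

class AudioRenderer(Protocol):
    """Anything that can turn one frame's (x, y, gain) sources into sound."""
    def render(self, sources: Sequence[Tuple[float, float, float]]) -> None: ...

class PrintRenderer:
    """Debug 'renderer' that logs sources instead of playing audio."""
    def render(self, sources: Sequence[Tuple[float, float, float]]) -> None:
        for x, y, gain in sources:
            print(f"source x={x:+.2f} y={y:+.2f} gain={gain:.2f}")
```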
This is not a solution. It’s a provocation: what if we could build smart, ambient, low-cost spatial aids with tools we already have?
If you’re curious, here’s the repo:
GitHub: a11y-deepsee
PRs, feedback, and fun experiments welcome.
Built on a MacBook, vibe-coded with AI, tested in a living room, and debugged with headphones on.
