by Luke Wroblewski October 2, 2025
In his AI Speaker Series presentation at Sutter Hill Ventures, UC Berkeley's Alexei Efros argued that data, not algorithms, drives AI progress in visual computing. Here are my notes from his talk: We're (Still) Not Giving Data Enough Credit.
Large data is necessary but not sufficient. We need to learn to be humble and give data the credit it deserves. The visual computing field's algorithmic bias has obscured data's fundamental role, and recognizing this reality is crucial for evaluating where AI breakthroughs will emerge.
The Role of Data
- Data got little respect in academia until recently: researchers spent years on algorithms, then scrambled for datasets at the last minute
- This mentality hurt the field and stifled progress for a long time
- Scientific Narcissism in AI: we prefer giving credit to human cleverness over data's role
- Human understanding relies heavily on stored experience, not just incoming sensory data.
- People see detailed steam engines in Monet's blurry brushstrokes, but the steam engine is in their heads; each person sees a different version based on childhood experience
- People easily interpret heavily pixelated footage, with their brains filling in the missing pieces
- "Mind is largely an emergent property of data" -Lance Williams
- Three landmark face detection papers achieved similar performance with completely different algorithms: neural networks, naive Bayes, and boosted cascades
- The real breakthrough wasn't algorithmic sophistication. It was realizing we needed negative data (images without faces). But 25 years later, we still credit the fancy algorithm.
- Efros's team demonstrated hole-filling in images using 2 million Flickr images with basic nearest-neighbor lookup (see the sketch after this list). "The stupidest thing and it works."
- Comparing approaches with identical datasets revealed that fancy neural networks performed similarly to simple nearest neighbors.
- The solution was all in the data: sophisticated algorithms often just perform fast lookup, because the dataset already contains the problem's solution
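To make the data-driven hole-filling idea concrete, here is a minimal sketch in Python. It assumes a database of same-size grayscale images and uses a crude average-pooled thumbnail as a stand-in for the scene descriptors used in the actual work; the names and parameters here are illustrative, not Efros's implementation.

```python
import numpy as np

def thumb(img, size=16):
    """Average-pool a grayscale image down to a size x size thumbnail."""
    h, w = img.shape
    ys = np.linspace(0, h, size + 1, dtype=int)
    xs = np.linspace(0, w, size + 1, dtype=int)
    return np.array([[img[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
                      for j in range(size)] for i in range(size)])

def fill_hole(query, hole_mask, database):
    """Fill the True region of hole_mask by copying pixels from whichever
    database image best matches the query's visible context."""
    q = thumb(np.where(hole_mask, 0.0, query))         # descriptor of the query with the hole zeroed
    valid = thumb((~hole_mask).astype(float)) > 0.5    # thumbnail cells that still have real context
    best, best_dist = None, np.inf
    for img in database:                               # brute-force nearest-neighbor scan
        dist = np.sum((thumb(img) - q)[valid] ** 2)
        if dist < best_dist:
            best, best_dist = img, dist
    out = query.copy()
    out[hole_mask] = best[hole_mask]                   # paste the matched content into the hole
    return out
```

The algorithmic part is a plain nearest-neighbor scan; everything that makes the result look plausible comes from having millions of photos to scan over.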
Interpolation vs. Intelligence
- Experiments by MIT's Aude Oliva reveal an extraordinary human capacity for remembering natural images.
- But memory works selectively: high recognition rates for natural, meaningful images vs. near-chance performance on random textures.
- We don't have photographic memory. We remember things that are somehow on the manifold of natural experience.
- This suggests human intelligence is profoundly data-driven, but focused on meaningful experiences.
- Psychologist Alison Gopnik reframes AI systems as cultural technologies: like printing presses, they collect human knowledge and make it easier to interface with it
- They're not creating truly new things; they're sophisticated interpolation systems.
- "Interpolation in sufficiently high dimensional space is indistinguishable from magic" but the magic sits in the data, not the algorithms
- Perhaps visual and textual spaces are smaller than we imagine, explaining data's effectiveness.
- A PCA basis built from roughly 200 faces can model the whole space of human faces; this extends to linear subspaces not just of pixels, but of model weights themselves (see the sketch after this list).
- Startup algorithm: "Is there enough data for this problem?" Text: lots of data, excellent performance. Images: less data, getting there. Video/Robotics: harder data, slower progress
- Current systems are "distillation machines" compressing human data into models.
- True intelligence may require starting from scratch: remove the artifacts of human civilization and bootstrap from primitive urges like hunger, jealousy, and happiness
- "AI is not when a computer can write poetry. AI is when the computer will want to write poetry"


