Data Scale is not all you need


The dogma among AI companies is that more data leads to better performance, but data scale is not all you need. A smaller, high-quality dataset yields better performance than a larger, low-quality one. Producing high-quality data requires filtering out noise, understanding unlabeled data, and knowing what to label. Massive data labeling through annotation platforms is also problematic: their incentives are often misaligned with yours, and their platforms become a bottleneck that is time-consuming, error-prone, and costly. The best way to improve AI systems is to understand the data feeding your models by intelligently representing datasets in an interactable form, using self-supervised representation learning, foundation modeling, and filtering. These practices reduce both the risk of poor performance and the risk of generating harmful outputs.

Data scale is not all you need. Blindly increasing the size of a dataset while pretraining a model puts AI-first companies at risk of serious errors. Training models on large datasets with an unknown distribution leads to unexpected behaviors: in robotics this can mean erroneous and dangerous trajectories, for a healthcare company inaccurate risk assessments, and for LLMs harmful speech generation {9}. On X, Grok made this mistake, generating harmful speech in the now-deleted post shown in Figure 0a. Even the xAI CEO admitted they need to be more “selective about training data, rather than just training on the entire internet”. But how do you properly select data to train and evaluate these models? What tools are out there?

The solution is to intelligently represent data in a form that’s interactable and semantically diverse. This approach helps: 1. create training and evaluation datasets for both pretraining and post-training, 2. identify holes in the data, and 3. make recommendations on how to fill those gaps (either by buying or collecting more data).
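
To make this concrete, here is a minimal sketch (not Interpret AI’s actual pipeline; the encoder, cluster count, and threshold are illustrative assumptions) of one way to represent a text corpus with an off-the-shelf self-supervised encoder and flag sparsely populated regions of the embedding space as candidate data gaps:

```python
# Illustrative sketch: embed a corpus, cluster the embeddings, and flag
# under-populated clusters as candidate "holes" worth collecting or buying
# data for. The model name and thresholds are placeholder choices.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def find_data_gaps(texts, n_clusters=50, min_cluster_frac=0.005):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")   # any pretrained encoder works
    embeddings = encoder.encode(texts, normalize_embeddings=True)

    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    counts = np.bincount(km.labels_, minlength=n_clusters)

    # Clusters holding less than min_cluster_frac of the corpus are
    # under-represented regions of the data distribution.
    sparse_clusters = np.where(counts < min_cluster_frac * len(texts))[0]
    return sparse_clusters, km.labels_, embeddings
```

The same pattern applies to images, video, or EHRs by swapping in an appropriate encoder.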



Figure 0a: Examples of an LLM generating harmful speech likely due to existence of similar text in the training data the xAI team used to train Grok.


Figure 0b: Reaction from the xAI CEO after Grok generated harmful speech. The interesting piece is the team’s focus on being selective about the training data. Original post from the xAI CEO: https://x.com/elonmusk/status/1944132781745090819

In industry, most CEOs of AI companies, AI researchers, and engineers are dissatisfied with the modern annotation companies that integrate themselves into their data flywheels.

The current go-to solution for AI companies is to amass a large unlabeled dataset for pretraining (or use an open-source pretrained model), then label another large dataset specific to the intended task, and finally hand-curate a training set and an eval set. The labeling is typically outsourced to annotation companies (Scale AI, SuperAnnotate, Labelbox, etc.) who integrate themselves into the data engine. But labeling everything in a large dataset doesn’t work well: scaling data labeling to millions or billions of examples is error-prone, unsustainably costly, and time-consuming, leaving AI companies unhappy. More importantly, the labeling loop is a never-ending process. Data flywheels continuously adapt to evolving models and newly collected data, so labeling requirements are fluid and change over time; annotation companies can’t keep up with the speed of changes, since model updates can happen in weeks while labeling can take months.

The modern labeling loop in a data engine is:

  1. Collect some data.
  2. Design or update some labeling specification.
  3. Send the data and the spec to some labeling company (Scale, SuperAnnotate, etc.). Pay for the labeling.
  4. Iterate with the labeling company and train the model.
  5. Observe the results and then repeat steps 2-5 indefinitely.

For instance, an autonomous driving company might want to label stop signs, but after labeling 1 million stop signs and seeing the results they realize they also want to label the “visibility” of each stop sign, and then that they want to label trees surrounding stop signs, adding an “obfuscated” label. Now all the data (which has also grown in the meantime, since data collection is continuous) needs to be relabeled! The cycle never ends while a company is improving its model!

Meta spending $14.3B for a 49% stake to hire the CEO of Scale AI [11] might be one of the riskiest moves the company has ever made, given these difficulties with labeling companies.

So, if blindly training on enormous datasets is problematic, and labeling everything is difficult, what should we do instead? After working on this issue for the past four years, we found the best solution is to represent data well enough that it’s easy to select, understand what’s in our data, and see how that data impacts our models. We should be able to chat with our data in a way that lets us quickly search for examples and quickly build evaluation sets to test models.
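
As a rough sketch of what “chatting with your data” to build an eval set could look like (the encoder and helper below are hypothetical stand-ins, not our product), a free-text query can be embedded and matched against a pre-embedded corpus:

```python
# Hypothetical sketch: retrieve the examples most similar to a natural-language
# query from a pre-embedded corpus, e.g. to seed an evaluation set.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")       # placeholder encoder choice

def build_eval_set(query, corpus, corpus_emb, k=100):
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = corpus_emb @ q                  # cosine similarity (embeddings normalized)
    top = np.argsort(-scores)[:k]
    return [(corpus[i], float(scores[i])) for i in top]

# corpus_emb = encoder.encode(corpus, normalize_embeddings=True)
# candidates = build_eval_set("harmful or hateful statements", corpus, corpus_emb)
```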

That’s what we are building over at Interpret AI. We’re building a data introspection platform, data curation platform, and intelligent data marketplace that allows companies building AI systems to interact with and understand their datasets. We envision a world where you can chat with your data using natural language, audio, image, and video to search for similar instances, so that companies can trust and know the data (or the gaps in the data) that’s powering their models. (If any of this resonates with you, please feel free to reach out to [email protected])

Traditional data flywheels


Figure 1a: The traditional data engine powering AI solutions in companies.
  1. A company has some infrastructure that’s constantly collecting data into a dataset (1b). A team then creates heuristic data subsets that, once labeled, will hopefully improve their model (1a).
  2. The data is sent to the labeling (annotation) company. The labeling company produces labels (annotations) that are then reviewed by the team, which can take months of back and forth to converge.
  3. The AI model is pretrained on the collected data.
  4. The pretrained model is then fine-tuned using the labels from the labeling company.
  5. The final model is evaluated using the company’s evaluation system, generating metrics.
  6. The company then uses this feedback to possibly select other data subsets, update the labeling requirements, and/or make model changes. Note that by this point the data subset is already growing stale.

Note: Metrics may be skewed by poor annotations requiring constant iteration from the team that’s both costly and time inefficient (6).


Figure 1b: A breakdown of the approximate time requirements for the different processes in a traditional company’s AI system, with each piece iterated independently. Notice that the major bottleneck is getting labels from a labeling company: with a labeling company in the loop, it takes months of iteration to generate labels that properly improve an AI model. See Figure 1a for how these pieces interact.


Interpret AI’s data flywheel:

Start Knowing with deep data insights


Figure 2a: Interpret AI’s data flywheel & how we provide immediate data insights.

  1. Immediate data subset recommendations and enhanced data suggestions for pretraining & training (1a & 1b respectively).
  2. The team now reviews significantly smaller subsets of data suggested by Interpret before sending them to a labeling company. These data subsets are fluid and are continuously updated as the data changes. (Optionally, if a company integrates their baseline model, Interpret AI can provide more insights on how the data impacts model performance.)
  3. The back and forth with the labeling company is accelerated from months to weeks and is significantly cheaper, since the annotation specs and dataset selection are clear.
  4. Feedback is focused on the model (6).
  5. Lastly, Interpret AI analyzes your data space to provide insights on what data to collect or buy to accelerate model improvement.

Figure 2b: A breakdown of the time requirements for the different processes when using Interpret’s platform. On the left-hand side, feedback iteration speed (in green) is accelerated; notice there is no longer a labeling bottleneck. The figure shows how Interpret AI integrates directly with our customers to accelerate model training, data triaging & understanding, and evaluation. Interpret AI provides solutions for

  • Understanding the existing data distribution.
  • Identifying model gaps that are correlated with data gaps.
  • Buying and curating data to fill data gaps.

We collaborate with several businesses across robotics, healthcare, and agentic LLM industries. If any of these resonate with you please feel free to reach out to [email protected]


Healthcare

HealthCo is trying to predict the risk of cardiovascular diseases for their patients.

For training

  • Interpret AI analyzes cardiovascular data using our Interpret foundation models, processing EHRs, images, and potentially ECG data [12] if available.
  • Interpret AI notices anomalies or “holes” in HealthCo’s data and describes the demographic of the affected records (e.g., female, middle-aged, no children, historically prescribed trimetazidine).
  • These detected records are further analyzed by experts. The selected data can then be updated, ignored, used to help purchase more data of people historically prescribed trimetazidine, or sent to a labeling company to annotate this specific group.
  • The selected data is then used to train the cardiovascular disease model. If HealthCo integrates their cardiovascular model into the Interpret platform, we further analyze where the model is performing poorly in real time, allowing for immediate introspection.
  • This process reduces the model training timeline from the order of months to weeks, rapidly improving AI systems and saving costs!

For safety

Suppose HealthCo has examples of patients who’ve suffered heart attacks and wants to find other EHRs of similar patients who might also be at risk.

  • Using Interpret AI, HealthCo can select these example patients and search for a related pool of people, sorting by confidence (a minimal sketch of this kind of search follows the list).
  • These people can be flagged as at risk, quickly identifying a few hundred at-risk patients from millions of records!
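
The sketch below assumes patient records have already been embedded by some EHR encoder; the function and variable names are hypothetical:

```python
# Hedged sketch: given embeddings for known heart-attack cases, retrieve the
# most similar records from the wider population and rank them by similarity.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_similar_patients(case_emb, population_emb, patient_ids, k=200):
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(population_emb)
    dists, idxs = nn.kneighbors(case_emb)        # one row of neighbors per known case
    best = {}
    for row_d, row_i in zip(dists, idxs):
        for d, i in zip(row_d, row_i):
            pid = patient_ids[i]
            best[pid] = min(d, best.get(pid, float("inf")))
    # Smallest cosine distance first, i.e. the most similar patients.
    return sorted(best.items(), key=lambda kv: kv[1])
```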

Robotics

DriveCo is building autonomous racecars as a toy for kids to play with outside.

For training

  • Interpret AI analyzes the collected runs of racecar video data and produces a data report.
  • Interpret AI notices that the majority of replays from the videos are not geographically diverse and that there are few examples of racecars driving outdoors in backyards.
  • Interpret AI recommends the DriveCo team collect more examples of outdoor videos. We also try to balance the dataset in a learned way using our Interpret AI foundation model to alleviate this imbalance (see the sketch after this list).
    • Without Interpret AI, DriveCo might have sent over 1000 hours of racecar data off for object labeling that wasn’t needed! Now they only need to label 10 hours!
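
One simple way to do this kind of rebalancing (a sketch under assumed clip-level embeddings, not our actual method) is to cluster the clips and sample evenly across clusters so rare scenarios such as backyard runs are not drowned out:

```python
# Illustrative sketch: cluster video-clip embeddings and sample evenly across
# clusters so that rare scenarios get proportionally more labeling budget.
import numpy as np
from sklearn.cluster import KMeans

def balanced_label_subset(clip_embeddings, n_clusters=20, per_cluster=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(clip_embeddings)
    keep = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        take = min(per_cluster, len(members))
        keep.extend(rng.choice(members, size=take, replace=False))
    return np.array(keep)    # indices of clips to send for labeling
```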

For safety

Suppose that these autonomous racecars face scrutiny for infant safety.

  • DriveCo can search their database for videos containing “baby” to see if they have this data (a retrieval sketch follows this list).
  • If DriveCo doesn’t have the data, this informs the team to collect it (using fake babies, I hope), or it allows DriveCo to show consumers and investors that the product is in fact safe around babies!
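
A text-to-video search like this can be sketched with an open CLIP checkpoint over sampled frames (an illustrative stand-in, not DriveCo’s or Interpret’s actual stack):

```python
# Hedged sketch: embed sampled video frames once with an open CLIP model,
# then answer free-text queries such as "baby" by ranking frames.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def search_frames(query, frame_images, frame_ids, k=20):
    img_inputs = processor(images=frame_images, return_tensors="pt")
    img_emb = model.get_image_features(**img_inputs)
    txt_inputs = processor(text=[query], return_tensors="pt", padding=True)
    txt_emb = model.get_text_features(**txt_inputs)

    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ txt_emb.T).squeeze(-1)           # one score per frame
    top = scores.topk(min(k, len(frame_ids)))
    return [(frame_ids[i], float(scores[i])) for i in top.indices]
```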

A brief history on labels and pretraining

In 2015, pre-Transformers, most models were trained to solve a very particular subset of problems: classification, segmentation, object detection (i.e., foundation problems), and others [1]. Benchmarks were “largish” labeled datasets on the order of 10k to 1M examples. {1}

Modern pretraining entered the chat around 2017 and changed the game. Borrowing from representation learning, pretraining was a fundamental paradigm shift: suddenly, unlabeled datasets unlocked huge gains in model performance. The unlabeled datasets used for pretraining were massive compared to their labeled brethren [5]. This, combined with other techniques & advancements {2}, led to modern foundation models like CLIP [13], DALL-E [14], DINOv2 [15], & BERT [16], to name a few.

Then OpenAI, building on transformers, pretraining, and reinforcement learning progress, changed the game when they released GPT (the generative pre-trained transformer) [6]. Sora [7], DeepSeek [8], and Anthropic’s models [9] all use pretraining on large datasets as the backbone of their performance. But hidden in there is an acute observation that most people aren’t talking about.

While pretraining is a good first step, most of these models need further training on top of the pretrained base. Whether through RL or supervised finetuning, the most performant models are somehow aligned {3} to the original problem. But even finetuning only scales up to a point, meaning improving pretraining is essential to future model performance {4}.

One of the most compelling examples of how to properly integrate pretraining and build a data flywheel in the literature is the labeled data flywheel built by Meta in Segment Anything Model (SAM) and SAM v2 [10]. But even in this example, data labeling is incredibly difficult to scale.

TL;DR: What SAM shows us is that quality assurance and understanding what’s in our data are hard but important problems to address. Adding more data is not necessarily the answer.

SAM built a data flywheel that curated a large labeled dataset using a partially trained SAM at various stages of training with human label feedback. Their approach illustrates the proper way to integrate labeling into a pipeline, but it also highlights that even the right data-labeling flywheel is costly & challenging to scale. At some point the dataset grows so large that humans cannot annotate everything, which requires some other method of introspection (i.e., what Interpret is building).

Roughly, SAM’s approach was [10] (a schematic sketch follows the list):

  1. Start with an MAE pretrained hierarchical ViT.
  2. Train SAM on publicly available segmentation datasets.
  3. Use the partially trained SAM to generate segmentation masks on a data subset.
  4. Have humans refine the segmentation predictions. Then also use the masks to train an object detector to find more objects and have humans label that.
  5. Repeat steps 3-4, gradually increasing the size of the dataset.
  6. Finish by running on 1 billion images to get SA-1B. Use a QA team to flag potentially bad examples. Notice that providing human labels for all 1 billion images is incredibly difficult.
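
Schematically, the loop looks like the sketch below. The train / propose / refine callables are placeholders for the actual training, mask-proposal, and human-annotation steps; this is the shape of the flywheel, not Meta’s code:

```python
# Schematic of a model-in-the-loop labeling flywheel (SAM-style).
# train, propose, and refine are caller-supplied functions standing in for
# model training, model-generated annotations, and human correction.
def model_in_the_loop(train, propose, refine, unlabeled, seed_labels,
                      rounds=3, batch_size=1000):
    labeled = list(seed_labels)
    model = None
    for _ in range(rounds):
        model = train(labeled)                              # retrain on all labels so far
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        proposals = [propose(model, x) for x in batch]      # model pre-annotates
        labeled += [refine(x, p) for x, p in zip(batch, proposals)]  # humans correct
    return model, labeled
```

Each round the dataset grows, and each round the human-correction step gets more expensive, which is exactly the scaling ceiling discussed below.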

The idea is the same for SAM 2, a video segmentation model, which generated the SA-V dataset with 35.5M masks across 50.9K videos, 53x more masks than any existing video segmentation dataset [10].

Notice, the best segmentation model was trained with data directly relating to its task, where the label feedback was all nicely coupled in a speedy, efficient data flywheel. Pretraining and then training with a collection of open-source segmentation datasets were only the first and second steps.

Also notice that human labeling eventually hit a ceiling; when the data flywheel got to labeling 1B images, Meta still needed to run a QA filter to flag bad examples. Based on the paper, annotating all 1.1B masks would’ve taken ~51,000 days of annotation time! {5}

This is Meta we’re talking about; for most companies, hiring at that scale would be egregiously expensive & infeasible! {6} Labeling at this scale is just hard!

Reiterating the TL;DR, what SAM shows us is that quality assurance and understanding what’s in our data are hard but important problems to address. This is fundamentally the gap we see in industry today: more data for pretraining or finetuning is not necessarily the answer. The right approach identifies where a model suffers, understands why it suffers there, and then highlights data (or data gaps) relevant to the problem, which is what we’re doing over at Interpret AI.

We have industry experience at MAANG companies, and our team has experience working with annotation companies like Scale, SuperAnnotate, etc. For most labeling (annotation) companies, the business model is:

  • Let companies generate their own labeling (annotation) spec with perhaps some back & forth depending on the complexity of the labels.
  • Most annotation companies have different tiers of annotators, the largest pool being non-experts who label everything and the smallest being experts in the field (e.g., doctors). An annotation company then marshals a pool of human labelers, typically starting with the cheapest ones to do a low-quality first pass.
  • The annotators then label according to the company’s complex annotation spec as best they can, charging per annotation.
  • Provide feedback and updates to the annotations, possibly updating the annotation spec.

There are four main problems with this process:

  1. annotations are not consistent and are usually not assigned to the right labelers,
  2. the labeling is time-consuming & expensive,
  3. the feedback loop for correcting annotations is erroneous, and
  4. annotation specs change over time as model performance changes.

Addressing 1., labelers are not guaranteed to be suited to their assigned labeling task and often label differently than their peers. For instance, for a healthcare company, if the task is “Pick the clinical response that best diagnoses the patient”, these labelers may not even be doctors suited to the task! Additionally, for an autonomous driving company, if the task is to “Draw bounding boxes for stop signs”, does this include the pole or not? What if it’s the back side of a stop sign? Different annotators will label differently without consulting each other.

Addressing 2., charging per annotation sounds great in theory, since the conventional dogma is that more labels help, but only if the company can afford a sufficient number of labels to boost model performance, a number that is typically unknown. These annotations will also typically contain errors, requiring AI companies to build internal systems to review them, which takes both time (on the order of months) and more money.

Addressing 3., the feedback loop is not consistent either. Typically the responsibility of annotation verification is pushed onto the AI company, which needs to set up its own internal monitoring system (already time-consuming and costly). When an AI company notices an annotation issue, corrections are not guaranteed to come from the same annotator who created the problematic label, and sometimes annotation companies will relabel the entire problematic example instead of correcting it, which costs more. For instance, an autonomous driving company might want to label instance masks of traffic lights and people. In this dummy example, the first annotator makes a mistake and forgets to label traffic lights not facing the camera. The AI company flags it and sends it off to be re-reviewed, but the annotation company fixes this by sending the image to a new annotator who relabels everything from scratch! The second annotator fixes the original issue but doesn’t label policemen as “people”, and now a new issue emerges! See Figure 3a and Figure 3b. This loop has an incredibly low probability of annotating every object correctly, roughly 61% for 50 labels {7}.



Figure 3a: First pass by the first annotator who missed the traffic lights that are not facing the camera. (Image from Waymo Open Dataset [17])


Figure 3b: Second pass from the second annotator who got all the traffic lights but didn’t realize that the “people” class included police officers! (Image from Waymo Open Dataset [17])

Essentially, with this feedback system the labels an annotation company creates are not guaranteed to converge to the right labels!
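
The estimate from footnote {7} is easy to reproduce: with a 1% per-object error rate and 50 objects per frame, the chance that a labeling pass comes back fully correct is only about 61%.

```python
# Worked version of the estimate in footnote {7}.
p_error, n_objects = 0.01, 50
p_all_correct = (1 - p_error) ** n_objects   # assumes independent errors
print(f"{p_all_correct:.0%}")                # ~61%
```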

The incentives of AI companies are not well aligned with those of labeling companies. AI companies want to improve their AI model and their product, while annotation companies want to label as much company data as possible so they can charge for it. You want to make your model performant, and annotation companies should want that too.

Addressing 4., in industry (and research), when trying to solve a problem there are many possible solutions. Perhaps pretraining on the entire internet will improve your LLM, or perhaps grounding an LLM by training on labeled text-image pairs will help with reasoning, or perhaps adding chain of thought will help. In other words, when designing AI systems we need to try many different things in parallel, since it’s often unclear which approach will be best. Labeling is one solution, which means that as we better understand our problem, the label definition is subject to change.

For instance, take labeling stop signs in autonomous driving: suppose we first label stop signs. We notice that performance improves when we know whether a stop sign is partially obstructed, so we update the annotation spec to add a metadata tag called “obstructed” for signs that are partially visible or not visible at all. We then go back to the annotation company and ask them to relabel all our stop signs! This “annotation platform in the loop” means that every model experiment that updates the labeled dataset is extremely expensive (a small sketch of how a spec change invalidates earlier labels follows below).
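
A toy illustration of why a spec change forces relabeling (the class and field names are hypothetical): once the spec gains a required field, every label produced under the old spec is incomplete.

```python
# Illustrative sketch: a versioned label spec. Adding the "obstructed" field in
# v2 makes every v1 stop-sign label incomplete, forcing a relabeling pass.
from dataclasses import dataclass

@dataclass
class StopSignLabelV1:
    bbox: tuple                 # (x, y, w, h)

@dataclass
class StopSignLabelV2:
    bbox: tuple                 # (x, y, w, h)
    obstructed: bool            # new requirement: is the sign partially hidden?

def needs_relabel(label) -> bool:
    return not isinstance(label, StopSignLabelV2)
```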

So, one may wonder, why are labeling providers used at all? For two reasons. First, high-quality labels do help, as discussed earlier; in fact, less data with higher-quality labels can outperform some of these large pretrained models, SAM being an excellent example. Second, the alternative to using an annotation company is to build an internal annotation platform, which is even more expensive and time-consuming, since producing the same volume of labels as the established players can take years!

The optimal data flywheel represents data in a form that’s inherently insightful and interactable: we should be able to detect anomalies and chat with our data to uncover interesting patterns and insights. This flywheel should enhance annotation platforms by focusing on what should be labeled instead of labeling everything {8}. And finally, this data flywheel should align with model performance, tying directly to whatever problem your AI company is solving.
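
For the anomaly-detection piece, a minimal sketch (assuming precomputed embeddings, with illustrative parameter choices) could be as simple as an isolation forest over the representation space:

```python
# Minimal sketch: flag anomalous examples in embedding space so a human can
# review them before they quietly shape the training set.
import numpy as np
from sklearn.ensemble import IsolationForest

def flag_anomalies(embeddings, contamination=0.01):
    iso = IsolationForest(contamination=contamination, random_state=0)
    flags = iso.fit_predict(embeddings)       # -1 = anomaly, 1 = inlier
    return np.where(flags == -1)[0]           # indices worth a human look
```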

The traditional dogma is that more data “just works”, and sometimes deep learning feels like alchemy. Perhaps more data will work for you in the short run, but when things “just don’t work”, the proper approach is to assess failures in both the data & the model and work from there.

Over at Interpret we hope to change the paradigm. If you are interested, reach out to us at [email protected]

  1. Back when AlexNet was still a thing, circa 2015, most models for computer vision were trained on a subset of very particular problem types: classification, segmentation, object detection (i.e., foundation problems) and others like image captioning, scene recognition, and pose estimation (see appendix for more details) [1]. Note this was pre-“Attention Is All You Need”, when bigrams were à la mode. The focus then was model development while benchmarks remained fixed. These benchmarks were “largish” labeled datasets (on the order of 10k to 1M examples) used to evaluate model performance. Some popular CV benchmarks you’re probably familiar with are MNIST, ImageNet, MS COCO, KITTI, and Caltech-101 [2]. If you look at the largest labeled datasets around this time, they were around 1M labels, which was considered large at the time.
  2. Modern pretraining entered the chat around 2017 and changed the game. Borrowing from representation learning, pretraining was a fundamental paradigm shift from learning features for a single labeled dataset to learning general features on unlabeled data that transfer well to other problems like classification, segmentation, and object detection. These datasets, compared to their labeled brethren, were massive [5]. At the same time, advancements in model training (CUDA optimization, which is why NVIDIA hit a $4T market cap), deep learning libraries (TensorFlow, PyTorch), and new or improved model architectures like the Transformer from “Attention Is All You Need” opened up a brand new world. Researchers also noticed that increasing model size typically correlated with improved performance on unseen data (from the same data distribution). All of this interfaced with modern pretraining algorithms like pretext tasks, contrastive learning, masked language modeling, masked autoencoding (MAE), and multimodal modeling [4], unlocking the era of training big models on massive unlabeled datasets. Ergo, models like CLIP [13], DALL-E [14], DINOv2 [15], and BERT [16].
  3. “Alignment” is an overused term; I mean alignment in both the “we want our LLM to be helpful, not harmful” sense and the “data distribution alignment” sense.
  4. When training / fine-tuning a model, scaling model size correlates with improved performance, roughly following a power law. In industry, we’re already hitting the limits of model-size scaling laws, and fine-tuning is giving less and less of an advantage. The next frontier is improving pretraining methods to better utilize existing unlabeled datasets.
  5. In the SAM paper, annotations could take 30 seconds each (but suppose it took 4 seconds, based on the improvements from SAM 2 [10]); reviewing 1.1B masks would’ve required 1,100,000,000 * 4 seconds ≈ 51,000 days of annotation time!
  6. This is also assuming that the data distribution is stationary (unchanging). If we wanted to extend the labels to a different data distribution (say, deep-sea diving videos, where the semantics & dynamics of objects are different), then finetuning SAM would still require the same data-flywheel training process, which means more time and more money.
  7. Suppose each object has a probability of being mislabeled p = 0.01 (i.e., an annotator labels incorrectly or misses a label once every 100 labels). Assuming 50 objects in a video and independent errors, the probability of a fully correct pass is (1 - p)^50 ≈ 61%! And that’s conservative.
  8. Fundamentally, when AI companies have better clarity on what to label their incentives align with annotation companies.
  9. It is increasingly clear that very few samples (e.g., thousands) of very high-quality data are far better than millions of low-quality examples. This is particularly true in post-training of LLMs in industry, but it is starting to be the focus of pretraining as well.
  10. A data flywheel is the loop used to collect data, improve the model, which makes a better product, which then modifies what data to collect and the cycle repeats (for example this image from dataloop.ai https://dataloop.ai/book/the-data-flywheel-effect/). A data engine is the infra for collecting/labeling/evaluating data (for example Scale’s product https://scale.com/data-engine).
  • Cameron Tukerman-Lee (also credit for the title)
  • Gabriele Sorrento
  • Francesco Pongetti
  • Lotfi Herzi