Model Collapse and the Need for Human-Generated Training Data


(All opinions herein are solely our own and do not express the views or opinions of our employer.)

Generative AI is poisoning its own well: online content is increasingly generated by AI; this data is used to train new models; those models then generate more online content, which in turn becomes training data. This creates a cycle that risks contaminating the sources AI relies on, potentially leading to diminished originality, amplified biases, and a disconnect from real-world information.

Last year, researchers raised a critical warning about this trend: “The development of LLMs is very involved and requires large quantities of training data. Yet, although current LLMs . . . were trained on predominantly human-generated text, this may change. If the training data of most future models are also scraped from the web, then they will inevitably train on data produced by their predecessors.” (Shumailov, I., Shumaylov, Z., Zhao, Y. et al., AI models collapse when trained on recursively generated data. Nature 631, 755–759 (2024), p. 755). The authors describe this as “model collapse”: “Model collapse is a degenerative process affecting generations of learned generative models, in which the data they generate end up polluting the training set of the next generation. Being trained on polluted data, they then mis-perceive reality.” (Ibid.)
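The degenerative loop the authors describe can be illustrated with a minimal toy simulation (my own sketch, not the paper's experiment): a trivial “generative model” — a fitted Gaussian — is trained on samples produced by its predecessor, generation after generation. Because each fit is estimated from a finite sample, the estimated spread drifts downward over generations and the model gradually loses the tails of the original human-generated distribution:

```python
import random
import statistics

def fit(data):
    # "Train" a trivial generative model: estimate mean and stddev from data.
    return statistics.mean(data), statistics.pstdev(data)

def sample(mu, sigma, n, rng):
    # "Generate content" with the fitted model: draw n synthetic points.
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(0)
data = sample(0.0, 1.0, 20, rng)  # generation 0: "human-generated" data
sigmas = []
for generation in range(500):
    mu, sigma = fit(data)
    sigmas.append(sigma)
    # The next generation trains ONLY on its predecessor's output.
    data = sample(mu, sigma, 20, rng)

print(f"stddev: generation 0 = {sigmas[0]:.3f}, generation 499 = {sigmas[-1]:.3g}")
```

The estimated standard deviation collapses toward zero: each generation forgets a little of the variability of the original data, which is exactly the “mis-perceiving reality” failure mode the quote describes, reduced to its simplest statistical skeleton.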

I believe that we may soon need “certified” human-generated data to train models. In other words, model creators, to achieve the best model performance, may benefit from a service that provides training data guaranteed to come from human minds. This is because human-generated data possesses qualities currently difficult for AI to replicate: nuance, creativity, common-sense reasoning, and robust factual accuracy. Verifying the provenance of training data will become increasingly important for model creators.

This idea raises multiple questions.

How can we ensure that data is generated by humans? I see one robust approach: a “human-in-a-room” experiment. To guarantee human origin, participants should have no access to AI during data generation. The ideal setup would involve having people work in a controlled environment, such as a library. While establishing such an environment presents logistical challenges (cost, participant comfort), positive incentives (financial compensation or academic credit) could encourage participation.

Who should generate the training data? Experts in their respective fields: PhDs, post-docs, doctors, mathematicians, historians, and others with deep subject-matter knowledge. They would be asked to produce specific types of content (factual statements, creative writing prompts designed to test reasoning, or complex problem sets) drawing on their knowledge, expertise, and available offline resources. This project resembles the work of encyclopedists, but instead of creating a first-order knowledge repository (an encyclopedia), they would create a higher-order one: high-quality training data for AI models.

How should this data be made available? Two approaches are possible: open access, driven by academic contributions, or a commercial market where data is sold to the highest bidder and not publicly released. Given its potential value (significantly higher quality than readily available online datasets), I anticipate the emergence of a robust market for certified human-generated training data. However, a market-driven approach raises concerns about equitable access and could favor larger companies with greater financial resources. To ensure consistent quality, a rigorous peer review process involving multiple experts would be essential.

Could AI models be trained purely on human-generated data? While humans cannot generate the sheer volume required for initial model training, a two-step approach is feasible. Models could first be pre-trained on existing large datasets and then fine-tuned using certified human data. During this second stage, we could instruct the model to prioritize learning from the human dataset by weighting its loss function more heavily. This high-quality data would also serve as an invaluable test set for evaluating model performance and identifying biases.
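One way to realize that second stage is per-sample loss weighting. The sketch below is a hypothetical toy (a linear regression trained by gradient descent, with synthetic stand-ins for “web-scraped” and “certified human” samples — all names are illustrative), but the mechanism is the same one a fine-tuning pipeline would use: samples from the certified set simply count more in the training loss.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])  # ground truth for the toy problem

# Synthetic stand-ins: a large noisy "web-scraped" set and a small,
# cleaner "certified human" set (hypothetical, for illustration only).
X_web = rng.normal(size=(200, 3))
y_web = X_web @ true_theta + rng.normal(scale=0.5, size=200)
X_human = rng.normal(size=(20, 3))
y_human = X_human @ true_theta + rng.normal(scale=0.1, size=20)

X = np.vstack([X_web, X_human])
y = np.concatenate([y_web, y_human])
# Per-sample weights: certified human data counts 5x in the loss.
# (The 5x factor is an arbitrary choice for the sketch.)
weights = np.concatenate([np.ones(200), 5.0 * np.ones(20)])

theta = np.zeros(3)
lr = 0.05
for _ in range(500):
    residual = X @ theta - y
    # Gradient of the weighted mean-squared-error loss.
    grad = X.T @ (weights * residual) / weights.sum()
    theta -= lr * grad
```

In a neural-network setting the idea transfers directly: compute the per-example loss without reduction, multiply by a weight vector that upweights the certified subset, and average — the optimizer then spends proportionally more of its gradient signal on the human-generated data.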

Model collapse is not a distant threat; it is a challenge we must confront now. By prioritizing certified human-generated data, we can unlock new levels of creativity and accuracy when creating new models. The future of AI depends not just on computational power, but on the quality of the knowledge that fuels it, a future where human ingenuity remains at the heart of artificial intelligence.
