In recent work, Silver and Sutton (2025) wrote about what they call “The Era of Experience”: a pivot away from reliance on mimicking human data and towards “experience”, defined as trajectories of actions, observations, and rewards. In a sense, what we’ve seen with DeepSeek R1, the OpenAI o-series, Claude 3.7 Sonnet, and the latest Gemini models is a first step in this direction: adopting reinforcement learning with verifiable rewards (RLVR) to develop reasoning abilities in constrained environments like math and code (DeepSeek-AI, 2025; El-Kishky et al., 2024; Anthropic, 2025; Shao et al., 2024).
There have also been preliminary forays into applying similar ideas in domains that are less obviously verifiable but that do have gold-standard completions. Using the completion of the next chapter of a story as the task, it has been demonstrated that the improvement in per-token perplexity of the gold completion when a reasoning trace is included, termed verifiable rewards via completion likelihood improvement (VR-CLI), can be used as a reward signal to improve reasoning abilities as well (Gurung & Lapata, 2025). Over the coming months we will likely see this extended to other domains with gold-standard completions, such as multi-turn chat and pretraining on certain document types.
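To make the VR-CLI idea concrete, here is a minimal sketch of the reward computation. The `score_fn` hook is a hypothetical stand-in for whatever model-scoring API returns per-token log-probabilities, and the relative normalization and clipping at zero are illustrative assumptions rather than the exact formulation in Gurung & Lapata (2025).

```python
import math
from typing import Callable, List

# Hypothetical scoring hook: given a context and a completion, return the
# log-probability of each completion token under the policy model.
ScoreFn = Callable[[str, str], List[float]]


def per_token_perplexity(score_fn: ScoreFn, context: str, completion: str) -> float:
    """Perplexity of `completion` given `context`, averaged per token."""
    logprobs = score_fn(context, completion)
    return math.exp(-sum(logprobs) / len(logprobs))


def vr_cli_reward(score_fn: ScoreFn, prompt: str, reasoning: str, gold: str) -> float:
    """Reward a reasoning trace by how much it reduces the perplexity of the
    gold completion relative to conditioning on the prompt alone."""
    ppl_without = per_token_perplexity(score_fn, prompt, gold)
    ppl_with = per_token_perplexity(score_fn, prompt + "\n" + reasoning, gold)
    # Relative improvement, clipped so unhelpful reasoning earns zero reward.
    return max(0.0, (ppl_without - ppl_with) / ppl_without)
```

The key property is that the reward requires only a gold-standard completion and a likelihood model, not a domain-specific verifier.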
The results with both RLVR and VR-CLI are excellent. RLVR in particular changed the model frontier overnight, and both demonstrate the wisdom in what Silver and Sutton say about the “Era of Experience”. However, as things are playing out today, even these approaches present a handful of problems.

The first is that the reward formulation and measurement are hopelessly intertwined with the construction of the environments themselves and do not generalize. While VR-CLI feels like a general reward construction, its requirement of gold-standard completions rules out many of the real-world use cases that would make the “Era of Experience” so powerful. All things considered, the overwhelming majority of valuable tasks have neither easily verifiable completions nor gold-standard completions.

The second is that overdoing RLVR can create very odd and jagged intelligences, as has been seen with OpenAI o3 hallucinating that it ran code on a MacBook, Claude 3.7 Sonnet attempting to hardcode test response values, and Qwen3 pairing strong reasoning with weak knowledge and a distaste for following system prompts over user messages. These issues likely stem from the same root causes: RLVR is hyperfocused optimization on whatever the verifiable domain happens to be, it never tells the model how an attempt was wrong, and it therefore creates an incentive to hallucinate reasoning for problems the model cannot figure out.

The third is that everything is reduced to a scalar value, which destroys a great deal of valuable information. Silver and Sutton (2025) discuss ideas like a series of agents optimized for varying, directly measurable rewards, but acknowledge that this defeats the purpose of a unified general intelligence. To alleviate that, they suggest a separate neural network that takes the agent’s actions, the environment information, and information from the user and outputs a scalar reward to guide the agent’s behavior. This seems similarly antithetical to the concept of a unified general intelligence. Why must there be a separation between the general intelligence and the evaluation of the quality of its actions? Why must all signal be reduced to a scalar? To see why this reduction is antithetical to high-quality signal extraction from the environment, think about writing code with a mistake in it and attempting to run it. You don’t just see a binary success/failure. The compiler or runtime provides a detailed error and traceback, an extremely clear and high-fidelity signal about what went wrong in the original attempt. That is clearly the superior signal.
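As a purely illustrative contrast between the two kinds of signal, consider the toy snippet below: the scalar view of the failure is a single bit, while the exception the runtime raises already names the missing key and, in a real traceback, the exact statement that failed. The record and field names here are invented for the example.

```python
# A buggy attempt: the field name used below does not exist in the record.
record = {"input_text": "2 + 2", "label": "4"}

try:
    prediction = record["inputs"]  # KeyError: the actual key is "input_text"
    reward = 1.0 if prediction == record["label"] else 0.0
except KeyError as exc:
    # Scalar view: all that survives of this failure is a single bit.
    reward = 0.0
    # Token-space view: the runtime already names the missing key, and a full
    # traceback would point at the exact line that raised it.
    print(f"KeyError: {exc}")  # -> KeyError: 'inputs'
```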
What we want, then, is a generalized framework to distill this sort of high-fidelity, high-dimensional signal (that lives in token-space) into weight updates that make directed and relevant changes to how the model thinks.