In early 2025, researchers ran a strange experiment with Alibaba’s flagship model Qwen, one of the leading open-source LLMs. You can read the full research summary here.
The experiment was simple. What happens if you train a model with random rewards? Not carefully crafted feedback. Not human-labeled data. Just random coin flips.
In theory, this should fail. If you reward good and bad outputs equally, the model should learn nothing or get worse.
Instead, the opposite happened.
Qwen improved by 15 to 20 percentage points on math problem solving.
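In code terms, "random rewards" just means replacing the answer-checking verifier with a coin flip. A minimal sketch of the idea in Python (the function and loop names are illustrative, not taken from the paper's code):

```python
import random

def random_reward(prompt: str, response: str) -> float:
    """Coin-flip reward that ignores the response entirely.

    In the actual experiment, this stands in for a verifier that
    checks whether the model's final math answer is correct.
    """
    return 1.0 if random.random() < 0.5 else 0.0

# Hypothetical training loop (names are illustrative only):
# for prompt in math_problems:
#     responses = sample_from_model(prompt, n=8)
#     rewards = [random_reward(prompt, r) for r in responses]
#     rl_update(model, prompt, responses, rewards)
```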
The research team dug into why the random rewards helped and found two main factors:
1. Hidden capabilities already existed.
Qwen’s math model had already developed strong internal reasoning patterns during pretraining, including a habit of working through problems with code-like steps. The random rewards accidentally reinforced behaviors that were already effective.
2. The RL algorithm created strange effects.
The model was trained with reinforcement learning using an algorithm that applies clipping, a mechanism that caps how far any single update can push the model. The random rewards interacted with this clipping in unexpected ways: instead of locking the model into bad answers, the randomness kept it exploring, which sometimes pushed it toward better behaviors.
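The clipping in question is the standard PPO/GRPO-style clipped objective. A rough sketch of that mechanism in PyTorch, simplified for illustration rather than reproducing the paper's training code:

```python
import torch

def clipped_policy_loss(logprob_new, logprob_old, advantage, eps=0.2):
    """PPO/GRPO-style clipped surrogate loss for a single sample.

    The probability ratio between the new and old policy is clipped to
    [1 - eps, 1 + eps], so no single update can move the policy very far,
    even when the advantage comes from a meaningless (random) reward.
    """
    ratio = torch.exp(logprob_new - logprob_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Take the more pessimistic of the two surrogate objectives.
    return -torch.min(unclipped, clipped)
```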
This was not a clever new training trick. It was mostly a fluke. It worked because of Qwen’s unique pretraining, the structure of its problem solving, and quirks in the reinforcement learning process.
When researchers tried this approach on other models, it did not work.
When they removed the clipping mechanism, it also failed.
AI models today are complex systems that researchers are still trying to fully understand. Sometimes, models show surprising behavior not because of intentional design, but because so much latent knowledge exists inside them that even random nudges can activate useful patterns.
The AI world often presents a clean narrative. Models scale with compute. More data makes things better. Benchmarks rise predictably.
Experiments like this are a reminder that:
We do not fully understand how these models generalize.
Some gains may happen because of side effects, not design.
Trial-and-error still plays a huge role in frontier AI research.
As systems get closer to AGI-level reasoning, alignment and interpretability will matter just as much as raw performance. If a model can improve from random rewards, we need to ask: what exactly is it learning?