Grok 4

4 months ago 6

Scaling Up Reinforcement Learning

With Grok 3, we scaled next-token prediction pretraining to unprecedented levels, resulting in a model with unparalleled world knowledge and performance. We also introduced Grok 3 Reasoning, which was trained using reinforcement learning to think longer about problems and solve them with increased accuracy. During our work on Grok 3 Reasoning, we noticed scaling trends that suggested it would be possible to scale up our reinforcement learning training significantly.

For Grok 4, we utilized Colossus, our 200k GPU cluster, to run reinforcement learning to refine Grok’s reasoning abilities at a pretraining scale. This was made possible with innovations throughout the stack, from new infrastructure and algorithmic work, which increased the compute efficiency of our training by 6x, to a massive data collection effort, where we significantly expanded our verifiable training data from primarily math and coding data to many more domains. The resulting training run saw smooth performance gains while training on over an order of magnitude more compute than previously.

Humanity's Last Exam

Deep expert-level benchmark at the frontier of human knowledge

State of the art

Full set (April 3, 2025) with Python and Internet tools

Performance over training

Text-only subset with Python and Internet tools

Compute

Test time computeTTC

Grok 4 was trained with reinforcement learning to use tools. This allows Grok to augment its thinking with tools like a code interpreter and web browsing in situations which are usually challenging for large language models. When searching for real time information or asking difficult research questions, Grok 4 chooses its own search queries, finding knowledge from across the web and diving as deeply as it needs to craft a high quality response.

We also trained Grok to use powerful tools to find information from deep within X. Grok can use advanced keyword and semantic search tools and even view media to improve the quality of its answers.

I remember this popular post from a few days ago about this crazy word puzzle which had something to do with legs. Can you help me find it?

Grok 4 Heavy

We have made further progress on parallel test-time compute, which allows Grok to consider multiple hypotheses at once. We call this model Grok 4 Heavy, and it sets a new standard for performance and reliability. Grok 4 Heavy saturates most academic benchmarks and is the first model to score 50% on Humanity’s Last Exam, a benchmark “designed to be the final closed-ended academic benchmark of its kind”.

Frontier Intelligence

Grok 4 represents a leap in frontier intelligence, setting new state-of-the-art for closed models on ARC-AGI V2 with 15.9% (nearly double Opus's ~8.6%, +8pp over previous high). On the agentic Vending-Bench, it dominates with $4694.15 net worth and 4569 units sold (averages across 5 runs), vastly outpacing Claude Opus 4 ($2077.41, 1412 units), humans ($844.05, 344 units), and others. Grok 4 Heavy leads USAMO'25 with 61.9%, and is the first to score 50.7% on Humanity’s Last Exam (text-only subset), demonstrating unparalleled capabilities in complex reasoning through scaled reinforcement learning and native tool use.

LiveCodeBench (Jan - May)

Competitive Coding

USAMO 2025

Olympiad Math Proofs

HMMT 2025

Competitive Math

ARC-AGI v2 Semi Private

Pattern Recognition

Grok 4 API

Grok 4 API empowers developers with frontier multimodal understanding, a 256k context window, and advanced reasoning capabilities to tackle complex tasks across text and vision. It integrates realtime data search across X, the web, and various news sources via our newly launched live search API, enabling up-to-date, accurate responses powered by native tool use. With enterprise-grade security and compliance—including SOC 2 Type 2, GDPR, and CCPA certifications—the API ensures robust protection for sensitive applications. Grok 4 is coming soon to our hyperscaler partners, making it easier for enterprises to deploy at scale for innovative AI solutions.

Grok 4 Voice Mode

Speak with Grok in our upgraded Voice Mode, featuring enhanced realism, responsiveness, and intelligence. We introduce a serene and brand-new voice, redesigning the conversations to make them even more natural.

And now, Grok can see what you see! Point your camera, speak right away, and Grok pulls live insights, analyzing your scene and responding to you in realtime from within the voice chat experience. We are proud to present this model trained in-house, with our state-of-the-art reinforcement learning framework and speech compression techniques.

Voice mode in the Grok app explaining what is seen in the camera

Enable video during your voice chat and Grok will look at what it sees when talking to you.

What’s Next

xAI will continue scaling reinforcement learning to unprecedented levels, building on Grok 4's advancements to push the boundaries of AI intelligence. We plan to expand the scope from verifiable rewards in controlled domains to tackling complex real-world problems, where models can learn and adapt in dynamic environments. Multimodal capabilities will see ongoing improvements, integrating vision, audio, and beyond for more intuitive interactions. Overall, our focus remains on making models smarter, faster, and more efficient, driving toward systems that truly understand and assist humanity in profound ways.

Read Entire Article