The AI Productivity Index (APEX)


[Submitted on 30 Sep 2025 (v1), last revised 2 Oct 2025 (this version, v2)]


Abstract: We introduce the first version of the AI Productivity Index (APEX), a benchmark for assessing whether frontier AI models can perform knowledge work with high economic value. APEX addresses one of the largest inefficiencies in AI research: outside of coding, benchmarks often fail to test economically relevant capabilities. APEX-v1.0 contains 200 test cases and covers four domains: investment banking, management consulting, law, and primary medical care. It was built in three steps. First, we sourced experts with top-tier experience, e.g., investment bankers from Goldman Sachs. Second, experts created prompts that reflect high-value tasks in their day-to-day work. Third, experts created rubrics for evaluating model responses. We evaluate 23 frontier models on APEX-v1.0 using an LM judge. GPT 5 (Thinking = High) achieves the highest mean score (64.2%), followed by Grok 4 (61.3%) and Gemini 2.5 Flash (Thinking = On) (60.4%). Qwen 3 235B is the best-performing open-source model and seventh best overall. There is a large gap between the performance of even the best models and human experts, highlighting the need for better measurement of models' ability to produce economically valuable work.
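
To make the idea of a rubric-based mean score concrete, here is a minimal sketch in Python. The abstract does not specify the scoring formula, so everything here is an assumption for illustration: each rubric is treated as a list of pass/fail criteria as judged by the LM judge, a response scores the percentage of criteria met, and a model's score is the unweighted mean over test cases. The names `response_score` and `model_mean_score` and the sample judgments are hypothetical, not taken from the paper.

```python
from statistics import mean


def response_score(criteria_met: list[bool]) -> float:
    """Percentage of rubric criteria judged as met.

    Assumed scoring rule: every criterion carries equal weight.
    """
    if not criteria_met:
        raise ValueError("a rubric must contain at least one criterion")
    return 100.0 * sum(criteria_met) / len(criteria_met)


def model_mean_score(judgments: list[list[bool]]) -> float:
    """Unweighted mean of per-response scores across all test cases."""
    return mean(response_score(case) for case in judgments)


# Hypothetical judge outputs for three test cases (not real APEX data).
judgments = [
    [True, True, False, True],        # 3 of 4 criteria met
    [True, False],                    # 1 of 2 criteria met
    [True, True, True, True, True],   # 5 of 5 criteria met
]

print(f"mean score: {model_mean_score(judgments):.1f}%")
```

Under this assumed rule, a reported figure such as 64.2% would mean that, averaged over test cases, roughly two thirds of the rubric criteria were judged as met; the paper may weight criteria or aggregate differently.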

Submission history

From: Dr Bertie Vidgen
[v1] Tue, 30 Sep 2025 03:26:17 UTC (11,775 KB)
[v2] Thu, 2 Oct 2025 05:47:47 UTC (11,775 KB)
