Evaluating creative reasoning with Sudoku variants
This leaderboard shows performance on the Sudoku-Bench reasoning evaluation dataset.
Sudoku variants are unique, creative puzzles that test whether a reasoning model can think like a human -- using meta-reasoning and creativity to find logical break-ins rather than relying on brute-force search. As such, Sudoku-Bench is designed to evaluate models without tool use or code execution. Consequently, we omit models such as OpenAI's o3 and o4-mini and Claude Opus 4 from the present leaderboard. But you are welcome to try them yourself! Please see Example Prompts for Each Puzzle at the bottom of this page.
Please see our technical report for an introduction to Sudoku variants and their utility in AI reasoning research.
Sudoku-Bench Leaderboard:
Models are evaluated using one of two configurations:
- Single-Shot: The LLM attempts to solve the entire puzzle grid in one response.
- Multi-Step: The LLM is prompted to provide one or more cell placements in each turn, and the user then returns the updated board. The interaction continues until the LLM solves the puzzle or makes an incorrect move (see the sketch below).
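For concreteness, here is a minimal sketch of the multi-step loop in Python. The helper functions (query_model, parse_placements, render_board) and the puzzle fields are hypothetical stand-ins for illustration, not the actual Sudoku-Bench harness; see the GitHub repository for the real code.

```python
# Minimal sketch of the multi-step loop. query_model, parse_placements,
# render_board, and the puzzle fields are hypothetical stand-ins.
def run_multi_step(puzzle, max_steps=100):
    board = [row[:] for row in puzzle.initial_board]
    messages = [{"role": "user", "content": puzzle.prompt}]
    correct = 0
    for _ in range(max_steps):
        reply = query_model(messages)             # model proposes one or more placements
        for r, c, v in parse_placements(reply):   # e.g. [(row, col, value), ...]
            if puzzle.solution[r][c] != v:        # an incorrect placement ends the run
                return {"solved": False, "acp": correct}
            board[r][c] = v
            correct += 1
        if board == puzzle.solution:              # full grid matches: solved
            return {"solved": True, "acp": correct}
        messages += [                             # show the updated board, ask for the next move(s)
            {"role": "assistant", "content": reply},
            {"role": "user", "content": render_board(board)},
        ]
    return {"solved": False, "acp": correct}      # step limit reached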
The evaluation measures performance based on two primary metrics:
- Average Solve Rate (ASR): The percentage of puzzles for which the model produced the complete and correct final solution grid. This is the primary metric for overall success.
- Average Correct Placements (ACP) for Multi-Step mode: The average number of correct cell values placed before the puzzle is solved, an incorrect placement is made, or another termination condition (such as an API error or reaching the maximum number of steps) occurs. A sketch of both metrics follows this list.
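As a concrete illustration, the sketch below computes both metrics from per-puzzle results shaped like the dictionaries returned by the loop sketch above; this schema is an assumption for illustration, not the benchmark's actual output format.

```python
# Both metrics from per-puzzle results of the form {"solved": bool, "acp": int}
# (an assumed schema, matching the loop sketch above).
def average_solve_rate(results):
    """ASR: percentage of puzzles with a complete, correct final grid."""
    return 100.0 * sum(r["solved"] for r in results) / len(results)

def average_correct_placements(results):
    """ACP: mean count of correct placements before termination (multi-step only)."""
    return sum(r["acp"] for r in results) / len(results)
```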
The benchmark includes 100 puzzles of different grid sizes (15 4x4, 15 6x6, 70 9x9).
(Note: A '-' indicates that, due to cost limitations, we did not collect enough runs to meet the reporting threshold.)
Results by Puzzle
For a more granular view, the following table details the performance of selected top models on each puzzle. Each cell shows the outcome for that model and puzzle.
Click the emoji to see the model's response.
Legend:
- ✅: Solved
- ❌: Incorrect Placement / Solution
- 🌐: Timeout Error
Example Prompts for Each Puzzle
Explore individual puzzles from the challenge_100 subset of the Sudoku-Bench dataset.
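If you would like to run the puzzles yourself, the subset can be loaded with the Hugging Face datasets library. The dataset path and subset name below are our best-guess identifiers, so check the GitHub repository if they differ.

```python
# Load the challenge_100 subset locally. The Hugging Face path is an
# assumption; see the GitHub repository for the exact identifier.
from datasets import load_dataset

ds = load_dataset("SakanaAI/Sudoku-Bench", "challenge_100")
print(ds)  # inspect available splits and per-puzzle fields (e.g. the prompt)
```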
References
For details on the evaluation methodology, data, and code, please refer to the Sudoku-Bench GitHub repository and our technical report.
Citation
For attribution in academic contexts, please cite the technical report:
Seely, J., Imajuku, Y., Zhao, T., Cetin, E., & Jones, L. (2025). Sudoku-Bench: Evaluating creative reasoning with Sudoku variants. arXiv preprint arXiv:2505.16135.

BibTeX citation:

@misc{seely2025sudoku,
  title={Sudoku-Bench: Evaluating creative reasoning with Sudoku variants},
  author={Jeffrey Seely and Yuki Imajuku and Tianyu Zhao and Edoardo Cetin and Llion Jones},
  year={2025},
  eprint={2505.16135},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2505.16135},
}

Open Source Code
We release our code for this project here.