Greetings, celestial scribes! When we last left off, we were looking at how TiānshūBench (天书Bench) could test reasoning, fluid intelligence, and coding ability in LLMs by dynamically generating programming languages and challenging the LLM system to solve simple problems using those new programming languages.
Since that time, I've added:
- New tests
- The ability to test against LLM providers Chutes and Sambanova
- The ability to run the tests in parallel
- Retries in case of network errors
- Enhanced reporting and logging
- Multi-shot testing
Much of the code for the latest version of TiānshūBench was created with the help of aider, backed by Google Gemini and Anthropic Claude.
Trials and Tribulations
One big shift in this benchmark release is the change to using Chutes as the model inference provider instead of a local ollama instance as before. The big challenge with ollama, as I found out, is that when you demand a model's full context length, it becomes EXTREMELY SLOW, to the point of complete unusability. Long context is critical for the multi-shot tests, because the context has to include previous failures in addition to the somewhat lengthy language description.
In the meantime, I've switched to Chutes as an inference provider because:
- It provides a nice selection of models, including some of the top open-weight models.
- It's free to use once you sign up.
Being free to use is extremely important because, as we'll see, a lot of the tests didn't work right the first time through (my fault). I would hate to spend hundreds or thousands of dollars in inference costs, only to find, for example, that I had mislabeled one of the tests, causing it to fail every time because the poor LLM had no chance to get it right from the beginning.
For a free service, it's shockingly reliable, and they only gripe about my usage when I'm hitting their API really hard, at which point their servers respond with a 429 (Too Many Requests). TiānshūBench will now back off for a random length of time and try again in case of this or other network errors.
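The retry logic amounts to a loop with a randomized, growing sleep. Here's a minimal sketch, assuming the API client raises an exception on a 429 or a dropped connection (the names here are illustrative, not TiānshūBench's actual code):

```python
import random
import time

class TransientAPIError(Exception):
    """Stand-in for a 429 (Too Many Requests) or other network error."""

def complete_with_retries(call_llm, max_retries=5):
    """Run call_llm(), backing off for a random length of time and
    retrying when the provider rate-limits us or the network hiccups."""
    for attempt in range(max_retries):
        try:
            return call_llm()
        except TransientAPIError:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the failure
            # Random backoff that grows with each failed attempt.
            time.sleep(random.uniform(1.0, 2.0 ** (attempt + 2)))
```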
Like all stuff that's good and free, I don't expect it to last forever.
If you've got some free LLM usage credits to throw at an independent researcher, perhaps to prove that YOUR model is the world leader in TiānshūBench, hit me up.
I'm also contemplating switching back to local models with vLLM or a similar system, which I understand handles long context better.
Model Madness
In addition to the DeepSeek-R1 and -V3 models, Chutes also offers free API access to Qwen, GLM, Llama, and the distilled DeepSeek-R1-0528-Qwen3-8B model. Unfortunately, these models are not included in this report, because they either:
- Came back with blank responses once the instructions got too long.
- Started taking way too long to respond on multi-shot tests.
I've got some idea of what needs to happen in some of these cases. For one thing, we need to break scoring out by number of shots. I also need to experiment with the temperature and other request parameters to find a combination that each model likes.
Problematic Parameters
While testing, I noticed that certain test cases were failing more than 99% of the time. Normally this would be good news, because it would mean we'd discovered a task that the LLMs aren't really good at yet.
However, upon further inspection, I discovered that my test data just wasn't set up correctly. For example:
- One set of input and output values was marked with the wrong problem definition ID.
- Another set of expected output values had a trailing space, causing all output string comparisons to fail.
The latter case led to 99% failures for that test. There was a bit of cleverness from DeepSeek-R1-0528 on one run, though: a couple of shots into the run, it figured out that the expected output itself was wrong, then went on to produce a program that passed the test anyway. This is definitely one case where the LLM was smarter than the experimenter.
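For the record, here's a minimal sketch of why a single trailing space sinks an exact string comparison, and one way to normalize it (an assumption on my part, not necessarily the fix TiānshūBench uses):

```python
# Expected output accidentally saved with a trailing space.
expected = "42 \n"
actual = "42\n"  # what a correct program actually prints

print(expected == actual)                    # False: one stray space
print(expected.rstrip() == actual.rstrip())  # True after normalizing
```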
Precious Parallelism
Rather than coming up with my own testing framework, I decided to build TiānshūBench around pytest, a unit testing package for Python code. This has paid off in spades, as pytest is quite mature and features a number of great plugins for working with tests.
For instance: for this benchmark report, we're running 3 LLMs × 10 synthetic test languages × 16 test cases × 4 multi-shot levels, for a total of 1920 tests. If I had to run these tests one at a time, it would take over two and a half days to complete the whole suite. This is where the pytest-xdist plugin comes to the rescue, as it allows you to easily run any number of tests in parallel. The command I used to run the test suite looks like this (the test path is a placeholder for the real one):
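```bash
pytest -n 200 tests/
```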
Run this way, the tests run in 200 simultaneous processes, which means that the suite completes in about 2 hours.
Mystery Tests
Some tests still don't complete when run this way. I suspect it has something to do with pytest-xdist. Anyway, I was able to whip up a script that figures out which tests didn't run and outputs the missing tests' identifiers.
The missing tests can then be re-run with the command above, but specifying the individual tests.
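pytest accepts individual test node IDs on the command line, so the re-run looks something like this (these particular node IDs are made up for illustration):

```bash
pytest -n 8 \
  "tests/test_bench.py::test_case_10[seed3-shots2]" \
  "tests/test_bench.py::test_case_07[seed1-shots0]"
```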
Alluring Activity
Another great plugin for pytest is allure-pytest, which allows pytest to work with the Allure reporting system. This lets a TiānshūBench user search, sort, and filter individual tests and see their results.
Allure allows you to attach files to an individual test, and I use this functionality to attach a log of the complete conversation with the LLM during a test. This was how I was able to quickly and mostly painlessly diagnose problems with the test data, including the issues mentioned above.
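The attachment itself is a one-liner using allure-pytest's attach API; a minimal sketch, with the helper and attachment names being my own illustration:

```python
import allure

def attach_conversation(conversation_log: str) -> None:
    """Attach the full LLM conversation to the current test's Allure
    report so failed runs can be diagnosed after the fact."""
    allure.attach(
        conversation_log,
        name="llm_conversation",
        attachment_type=allure.attachment_type.TEXT,
    )
```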
Rolling It Up
To get the benchmark stats, we run the analysis script thusly (the script name and flag below are placeholders for the real invocation):
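```bash
python analyze_results.py --results-dir ./allure-results
```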
Statistics by Number of Shots
Unsurprisingly, the more chances an LLM gets to come up with an answer, plus some guidance as to what went wrong, the more likely it is to arrive at a correct solution. The trick going forward will be to find the LLM that gives the most correct answers for the amount of money or time you're spending on inference.
Statistics by Language Seed
These represent the different programming languages generated by TiānshūBench for testing the LLMs. Every model gets every test case in each of these languages. As expected, there's not much variation in the results here, as the languages differ only by their nonsense keywords.
Statistics by Test Case
Test_case_10, the most-failed test case, is a fairly simple task that requires a loop and a conditional. It's a test involving problem ID 005:
And here's a solution from DeepSeek-R1-0528:
But most LLMs seem to get tripped up on the syntax for this problem. There need to be many more of these test cases and they need to be much harder if we are to avoid benchmark saturation. Perhaps following the example of test_case_10 can give us some guidance.
The Big Reveal: The Current TiānshūBench Champion!
Of the 3 tested models, DeepSeek-R1 comes out on top, with 47.34% of test cases passed. The newer DeepSeek-R1-0528 is very close behind at 46.56%. Finally, DeepSeek-V3-0324 falls behind these two at 30.16%.
Wrapping Up
Of course, this research raises as many questions as it answers:
- How do other models compare with DeepSeek on TiānshūBench?
- Which models are the most efficient with regard to time and cost in generating an answer?
- What do AI systems most commonly get confused about?
- How can we improve artificial intelligence planning and reasoning?
Solving some of these problems will require new additions to the TiānshūBench code in the short term:
- Error code tags in Allure to distinguish different kinds of errors, e.g. syntax errors and runtime errors.
- Shot and token count logging
- Calculating the number of correct answers per dollar spent. I especially like this as a metric, because it can keep growing as inference gets cheaper, rather than being capped the way a percentage of questions correct is.
Questions and suggestions are welcome!