Large Language Models Are Automatic Code Benchmark Generators


AutoCodeGen. We propose an automated workflow based on LLM-Sandbox interaction, in which LLMs generate test inputs and the corresponding test outputs are obtained by executing solutions in the sandbox, yielding high-quality code generation datasets.
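The core idea is that the LLM never guesses expected outputs; only sandbox execution produces them. The sketch below illustrates this loop under stated assumptions: `call_llm` and `sandbox_run` are hypothetical callables standing in for the actual model and sandbox interfaces, and the prompt wording and thresholds are illustrative, not the paper's implementation.

```python
# Minimal sketch of the LLM-sandbox interaction, assuming hypothetical
# `call_llm` and `sandbox_run` callables passed in by the caller.
from typing import Callable, Optional

def generate_problem(
    solution_code: str,
    language: str,
    call_llm: Callable[[str], list],     # prompt -> list of proposed test inputs
    sandbox_run: Callable[..., dict],    # runs code, returns {"status", "stdout", ...}
    min_tests: int = 3,
) -> Optional[dict]:
    """Turn a verified solution into a problem with sandbox-checked test cases."""
    # 1. The LLM proposes diverse test inputs for the given solution.
    test_inputs = call_llm(
        f"Propose diverse test inputs for this {language} program:\n{solution_code}"
    )

    # 2. The sandbox, not the LLM, produces the expected outputs by running
    #    the solution on each proposed input.
    test_cases = []
    for test_input in test_inputs:
        result = sandbox_run(language=language, code=solution_code, stdin=test_input)
        if result.get("status") == "success":
            test_cases.append({"input": test_input, "expected_output": result["stdout"]})

    # 3. Discard problems that did not yield enough executable test cases.
    if len(test_cases) < min_tests:
        return None
    return {"solution": solution_code, "tests": test_cases}
```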

AutoCodeBench. We introduce AutoCodeBench, a large-scale code generation benchmark with 3,920 problems, evenly distributed across 20 programming languages. It features high difficulty, practicality, and diversity, and is designed to measure the absolute multilingual performance of models.

AutoCodeBench-Lite. Based on the evaluation results of over 30 open-source and closed-source models on AutoCodeBench, we select 1,586 problems that were successfully solved by at least two models. This subset, AutoCodeBench-Lite, is used to measure performance differences between models.
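The selection rule is simple to state: keep a problem if at least two of the evaluated models solved it. A minimal sketch, assuming per-model results are available as a mapping from model name to the set of solved problem IDs (an assumed layout, not the release format):

```python
# Sketch of the AutoCodeBench-Lite filtering rule: keep problems solved by
# at least two of the evaluated models.

def select_lite_subset(results: dict, min_solvers: int = 2) -> set:
    """results: model name -> set of solved problem IDs."""
    solve_counts = {}
    for solved in results.values():
        for problem_id in solved:
            solve_counts[problem_id] = solve_counts.get(problem_id, 0) + 1
    return {pid for pid, count in solve_counts.items() if count >= min_solvers}

# Toy usage:
# select_lite_subset({"model_a": {"p1", "p2"}, "model_b": {"p2"}, "model_c": {"p2", "p3"}})
# -> {"p2"}
```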

AutoCodeBench-Complete. We select 1,000 problems from AutoCodeBench-Lite and use 3-shot prompting to construct AutoCodeBench-Complete, a completion-style code generation benchmark designed to assess the performance of base models.
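For completion-style evaluation, each prompt concatenates a few solved examples followed by the unfinished target, and the base model continues from the prefix. The sketch below shows one way to assemble such a 3-shot prompt; the field names and delimiter are assumptions, not the released benchmark's exact format.

```python
# Sketch of assembling a 3-shot completion-style prompt.

def build_completion_prompt(few_shot_examples: list, target_prefix: str) -> str:
    """Concatenate three solved examples, then the unfinished target prefix."""
    assert len(few_shot_examples) == 3, "AutoCodeBench-Complete uses 3-shot prompting"
    parts = []
    for example in few_shot_examples:
        # Each example pairs an incomplete code prefix with its reference completion.
        parts.append(example["prefix"] + example["completion"])
    parts.append(target_prefix)  # the base model completes from here
    return "\n\n".join(parts)
```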

MultiLanguageSandbox. We provide a robust, secure, and high-performance multi-language sandbox service that supports compilation and execution for more than 30 programming languages.
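In typical use, an evaluation harness submits code to such a service and reads back the execution result. The sketch below shows this pattern over HTTP; the endpoint URL, request fields, and response schema are hypothetical placeholders, not the service's documented API.

```python
# Sketch of submitting code to a sandbox service over HTTP (hypothetical API).
import requests

def run_in_sandbox(code: str, language: str, stdin: str = "", timeout_s: int = 10) -> dict:
    response = requests.post(
        "http://localhost:8080/run",  # placeholder address for the sandbox service
        json={"language": language, "code": code, "stdin": stdin, "timeout": timeout_s},
        timeout=timeout_s + 5,
    )
    response.raise_for_status()
    # Assumed response shape: {"status": "success", "stdout": "...", "stderr": ""}
    return response.json()

# Example: run_in_sandbox('print("hello")', "python")
```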
