BEST
Benchmarking Efficiency in Space and Time for LLM-Generated Code
Aocheng Shen*1,2,3 Boyu Zhang*1 Jiaze Li*1 Ruixuan Ma*1 Qiankun Zhang1,2,3 Jing Wang4 Bin Yuan1,3,5,6,7 Shenghao Liu1 Xianjun Deng1,3
- 1 School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
- 2 Key Laboratory of Cyberspace Security, Ministry of Education, Zhengzhou, China
- 3 Hubei Key Laboratory of Distributed System Security, Wuhan, China
- 4 School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China
- 5 Songshan Laboratory, Zhengzhou, China
- 6 Jinyinhu Laboratory, Wuhan, China
- 7 Visiting researcher with the Lion Rock Labs of Cyberspace Security, CTIHE, Hong Kong, China
*Equal contribution. Correspondence to: Qiankun Zhang <qiankun@hust.edu.cn>.
TL;DR: Correct ≠ Efficient: LLMs struggle with time-space efficient code.
Abstract
Large language models (LLMs) have revolutionized research in software engineering, and among various tasks, LLM-based code synthesis is especially promising. A recent line of benchmarks aims to evaluate the time efficiency of LLM-generated code, beyond its correctness. However, space, another vital aspect of code efficiency, is rarely evaluated in prior benchmarks. To fill this gap, this paper introduces BEST, the first benchmark for evaluating the efficiency of LLM-generated code in both time and space. It comprises 440 coding tasks that are rigorously constructed by experts. In addition, we propose a fine-grained subtask-based evaluation scheme that divides each task into multiple subtasks with different input scales and difficulties. Each subtask is accompanied by an expert-crafted standard implementation as the efficiency baseline, which achieves the Pareto optimum. Building on BEST, we introduce a unified and novel dual-indicator (time and space) metric, named dual@k, which generalizes the standard pass@k metric and builds on a careful and novel construction of a weight matrix of subtasks. Through extensive experiments with dual@k across 50 LLMs on BEST, our evaluation demonstrates that while LLMs exhibit weak capabilities in generating time-efficient code, their capabilities in space-efficient code generation are even worse. The benchmark is provided at https://github.com/kmsgk0/BEST.
Why BEST?
Existing code evaluation is incomplete
- Most code generation benchmarks focus on functional correctness.
- Recent efficiency benchmarks mainly evaluate running time.
- However, real-world efficient code must optimize both: Time + Space.
Time-space trade-off matters
- Different algorithms may solve the same task with different resource profiles, i.e., time and space efficiency.
- No single implementation is universally optimal.
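As a small illustration of such a trade-off, the sketch below solves one task (detecting a duplicate in a list) in two ways. The task and both function names are illustrative assumptions and are not taken from BEST.

```python
def has_duplicate_fast(values):
    """O(n) expected time, O(n) extra space: remember every value seen so far."""
    seen = set()
    for v in values:
        if v in seen:
            return True
        seen.add(v)
    return False


def has_duplicate_lean(values):
    """O(n^2) time, O(1) extra space: compare every pair directly."""
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False
```

Both functions return the same answers, but the first would be favored under a tight time limit and the second under a tight memory limit.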
Key challenges
Challenge 1
Not every coding task is suitable for efficiency evaluation. A useful task should admit multiple algorithms with different time and space complexities.
Challenge 2
A single baseline is insufficient. Due to time-space trade-offs, there may be multiple Pareto-optimal reference implementations for the same task.
Challenge 3
Directly comparing runtime and memory usage can be hardware-dependent and may fail to capture asymptotic efficiency.
How does BEST work?
BEST benchmark
A benchmark for time-space dual efficiency evaluation.
- 440 expert-curated coding tasks, some collected from multiple sources and others newly constructed by experts
- Labeled by difficulty: easy, medium, and hard
- Labeled by algorithm category: e.g., sorting, greedy, graph, data structure
- Expert-crafted subtasks, test cases, and Pareto-optimal baselines
Subtask-based evaluation
Each task is divided into a matrix of subtasks.
- Rows: increasingly strict time requirements (larger inputs + tighter time limits)
- Columns: increasingly strict space requirements (smaller memory limits)
- Each cell is a subtask with specific time and memory constraints.
- Pass = correct + within time + within memory
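The pass rule above can be sketched as a check applied to each cell of the subtask matrix. This is a minimal sketch under stated assumptions: the `Subtask` and `RunResult` types, their field names, and the units are hypothetical, and the actual BEST harness may measure and store results differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    time_limit_s: float      # hypothetical unit: seconds
    memory_limit_mb: float   # hypothetical unit: megabytes


@dataclass(frozen=True)
class RunResult:
    correct: bool
    time_s: float
    peak_memory_mb: float


def passes(run, subtask):
    """Pass = correct + within time + within memory."""
    return (run.correct
            and run.time_s <= subtask.time_limit_s
            and run.peak_memory_mb <= subtask.memory_limit_mb)


def pass_matrix(results, subtasks):
    """results[i][j] is the measured run for subtask (i, j).

    Rows tighten the time requirement; columns tighten the memory limit.
    """
    return [[passes(r, s) for r, s in zip(result_row, subtask_row)]
            for result_row, subtask_row in zip(results, subtasks)]


# A 2x2 example: the solution is correct and fast but uses 100 MB everywhere.
subtasks = [[Subtask(2.0, 256), Subtask(2.0, 64)],
            [Subtask(1.0, 256), Subtask(1.0, 64)]]
results = [[RunResult(True, 0.5, 100), RunResult(True, 0.5, 100)],
           [RunResult(True, 0.9, 100), RunResult(True, 0.9, 100)]]
grid = pass_matrix(results, subtasks)
```

In this example the solution passes only the left column, because 100 MB exceeds the 64 MB limit of the stricter column.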
Pareto-optimal baselines
A single baseline is insufficient under time-space trade-offs.
- BEST uses expert-crafted Pareto-optimal implementations.
- A solution is Pareto-optimal if no other solution is at least as good in both time and space and strictly better in one; its time efficiency can only be improved by worsening its space efficiency, and vice versa.
- This enables fair evaluation under different resource constraints.
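The notion of Pareto optimality used here can be sketched as a dominance filter over (time, space) cost pairs, where lower is better on both axes. The function names and the example costs are illustrative assumptions; in BEST the Pareto-optimal baselines are crafted by experts rather than computed this way.

```python
def dominates(a, b):
    """True if cost pair a = (time, space) is no worse than b on both axes
    and strictly better on at least one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical (seconds, megabytes) costs of four implementations of one task.
costs = [(1.0, 512), (2.0, 64), (2.5, 128), (1.0, 600)]
front = pareto_front(costs)
```

Here `(2.5, 128)` is dominated by `(2.0, 64)` and `(1.0, 600)` by `(1.0, 512)`, so two incomparable implementations remain: one faster, one leaner.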
dual@k metric
dual@k generalizes pass@k to time-space dual efficiency.
- Form a score matrix S by computing subtask-level pass@k
- Apply a weight matrix W that weights harder subtasks more heavily, controlled by two parameters τ and σ
- The final dual@k score is the Frobenius inner product of S and W, normalized by the score of Pareto-optimal baselines; it represents the weighted performance of LLM-generated code
- By tuning time weight τ and space weight σ, dual@k supports: time-only evaluation, space-only evaluation, joint time-space evaluation
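The steps above can be sketched as follows. The pass@k estimator is the standard unbiased form, 1 - C(n-c, k) / C(n, k), for n samples of which c pass. Everything else is an assumption made for illustration: the weight formula `(i+1)**tau * (j+1)**sigma` is a hypothetical stand-in for the paper's weight matrix construction, and `baseline` is assumed to be a 0/1 matrix marking the subtasks that the Pareto-optimal baselines pass.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which pass."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def weight_matrix(rows, cols, tau, sigma):
    """HYPOTHETICAL weights: later (harder) rows and columns weigh more.
    tau controls emphasis along the time axis, sigma along the space axis."""
    return [[(i + 1) ** tau * (j + 1) ** sigma for j in range(cols)]
            for i in range(rows)]


def frobenius(a, b):
    """Frobenius inner product: sum of element-wise products."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def dual_at_k(pass_counts, n, k, baseline, tau, sigma):
    """pass_counts[i][j]: how many of the n samples pass subtask (i, j)."""
    rows, cols = len(pass_counts), len(pass_counts[0])
    S = [[pass_at_k(n, c, k) for c in row] for row in pass_counts]
    W = weight_matrix(rows, cols, tau, sigma)
    return frobenius(S, W) / frobenius(baseline, W)


# 10 samples on a 2x2 subtask grid; the baselines pass every cell.
counts = [[10, 5], [5, 0]]
ones = [[1.0, 1.0], [1.0, 1.0]]
uniform = dual_at_k(counts, n=10, k=1, baseline=ones, tau=0, sigma=0)
weighted = dual_at_k(counts, n=10, k=1, baseline=ones, tau=1, sigma=1)
```

Under this hypothetical weighting, raising `tau` and `sigma` lowers the score of a model that fails the hardest subtasks, since those cells carry more of the normalized weight.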
What do we find?
Correct code is often inefficient. Across evaluated LLMs: dual@10 < pass@10.
Space efficiency is harder. LLMs are generally weaker at generating space-efficient code than time-efficient code.
Reasoning helps efficiency. Large reasoning models achieve stronger time-space efficiency.
Algorithm category matters. Models perform better on more structured categories and struggle more on reasoning-heavy ones.
Qiu et al., 2024: Ruizhong Qiu, Weiliang Will Zeng, James Ezick, et al. How efficient is LLM-generated code? A rigorous & high-standard benchmark.
Chen et al., 2021: Mark Chen, Jerry Tworek, Heewoo Jun, et al. Evaluating large language models trained on code.
Takeaway
Efficient code generation requires evaluating correctness, time, and space together. BEST provides a fine-grained benchmark for diagnosing whether LLMs generate code that is not only correct, but also resource-efficient.
Cite this work
@inproceedings{shen2026best,
title = {BEST: Benchmarking Efficiency in Space and Time for LLM-Generated Code},
author = {Shen, Aocheng and Zhang, Boyu and Li, Jiaze and Ma, Ruixuan and Zhang, Qiankun and Wang, Jing and Yuan, Bin and Liu, Shenghao and Deng, Xianjun},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
year = {2026}
}