Huazhong University of Science and Technology ICML — International Conference on Machine Learning

BEST

Benchmarking Efficiency in Space and Time for LLM-Generated Code

Aocheng Shen*1,2,3 Boyu Zhang*1 Jiaze Li*1 Ruixuan Ma*1 Qiankun Zhang1,2,3 Jing Wang4 Bin Yuan1,3,5,6,7 Shenghao Liu1 Xianjun Deng1,3

  1. School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan, China
  2. Key Laboratory of Cyberspace Security, Ministry of Education, Zhengzhou, China
  3. Hubei Key Laboratory of Distributed System Security, Wuhan, China
  4. School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China
  5. Songshan Laboratory, Zhengzhou, China
  6. Jinyinhu Laboratory, Wuhan, China
  7. Visiting researcher with the Lion Rock Labs of Cyberspace Security, CTlHE, Hong Kong, China

*Equal contribution. Correspondence to: Qiankun Zhang <qiankun@hust.edu.cn>.

TL;DR: Correct ≠ Efficient: LLMs struggle with time-space efficient code.

Abstract

Large language models (LLMs) have revolutionized research in software engineering, and among various tasks, LLM-based code synthesis is promising. A recent line of benchmarks aims to evaluate the time efficiency of LLM-generated code, beyond its correctness. However, space, another vital aspect of code efficiency, is rarely evaluated in prior benchmarks. To fill this gap, this paper introduces BEST, the first benchmark for evaluating the efficiency of LLM-generated code in both time and space. It comprises 440 coding tasks that are rigorously constructed by experts. In addition, we propose a fine-grained subtask-based evaluation scheme that divides each task into multiple subtasks with different input scales and difficulties. Each subtask is accompanied by an expert-crafted standard implementation as the efficiency baseline, which achieves the Pareto optimum. Building on BEST, we introduce a unified and novel dual-indicator (time and space) metric, named dual@k, which generalizes the standard pass@k metric and builds on a careful and novel construction of a weight matrix of subtasks. Through extensive experiments with dual@k across 50 LLMs on BEST, our evaluation demonstrates that while LLMs exhibit weak capabilities in generating time-efficient code, their capabilities in space-efficient code generation are even worse. The benchmark is provided at https://github.com/kmsgk0/BEST.

Motivation

Why BEST?

Existing code evaluation is incomplete

  • Most code generation benchmarks focus on functional correctness.
  • Recent efficiency benchmarks mainly evaluate running time.
  • However, real-world efficient code must optimize both: Time + Space.

Time-space trade-off matters

  • Different algorithms may solve the same task with different resource profiles, i.e., different time and space efficiency.

Figure 1. An overview of the BEST benchmark and the subtask-based evaluation. An example task (range sum query problem) in BEST contains 9 subtasks and 3 Pareto-optimal baselines. Two LLMs, GPT and Claude, generate code for each subtask, where subtasks are ranked by input scale and difficulty, and obtain a score matrix. The proposed dual@1 metric is computed from the score matrix and a carefully designed weight matrix of subtasks.

No single implementation is universally optimal.
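
To make the trade-off concrete, here is a minimal sketch of two ways to answer range sum queries. It assumes a static integer array with inclusive query bounds; the exact task statement in BEST is not reproduced here, and the class names `PrefixSumRangeQuery` and `DirectRangeQuery` are hypothetical, not taken from the benchmark.

```python
from itertools import accumulate
from typing import List, Sequence


class PrefixSumRangeQuery:
    """Faster queries, more memory: O(n) extra space, O(1) time per query."""

    def __init__(self, values: Sequence[int]) -> None:
        # prefix[i] holds the sum of values[0:i], so prefix[0] == 0.
        self.prefix: List[int] = [0, *accumulate(values)]

    def query(self, left: int, right: int) -> int:
        """Sum of values[left..right], both ends inclusive."""
        return self.prefix[right + 1] - self.prefix[left]


class DirectRangeQuery:
    """Less memory, slower queries: O(1) extra space, O(right - left + 1) time."""

    def __init__(self, values: Sequence[int]) -> None:
        # Only a reference to the input is kept; no auxiliary array is built.
        self.values = values

    def query(self, left: int, right: int) -> int:
        """Sum of values[left..right], both ends inclusive."""
        total = 0
        for i in range(left, right + 1):
            total += self.values[i]
        return total
```

Both classes return the same answers; which one is preferable depends on whether the time limit or the memory limit is the binding constraint.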

Key challenges

Challenge 1

Not every coding task is suitable for efficiency evaluation. A useful task should admit multiple algorithms with different time and space complexities.

Challenge 2

A single baseline is insufficient. Due to time-space trade-offs, there may be multiple Pareto-optimal reference implementations for the same task.

Challenge 3

Directly comparing runtime and memory usage can be hardware-dependent and may fail to capture asymptotic efficiency.

Method

How does BEST work?

01

BEST benchmark

A benchmark for time-space dual efficiency evaluation.

  • 440 expert-curated coding tasks collected from multiple sources and newly constructed by experts
  • Labeled by difficulty: easy, medium, and hard
  • Labeled by algorithm category: e.g., sorting, greedy, graph, data structure
  • Expert-crafted subtasks, test cases, and Pareto-optimal baselines
02

Subtask-based evaluation

Each task is divided into a matrix of subtasks.

  • Rows: increasingly strict time requirements (larger inputs + tighter time limits)
  • Columns: increasingly strict space requirements (smaller memory limits)
  • Each cell is a subtask with specific time and memory constraints.
  • Pass = correct + within time + within memory
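
The pass rule for a subtask grid can be sketched as follows. This is an illustration only: the field names, the units (seconds, megabytes), and the helper names `passes` and `pass_matrix` are assumptions, not the benchmark's actual harness.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Subtask:
    """One cell of the grid (hypothetical fields and units)."""
    time_limit_s: float
    memory_limit_mb: float


@dataclass(frozen=True)
class RunResult:
    """Outcome of running one generated solution on one subtask's tests."""
    correct: bool
    elapsed_s: float
    peak_memory_mb: float


def passes(result: RunResult, subtask: Subtask) -> bool:
    """Pass = correct + within time + within memory."""
    return (
        result.correct
        and result.elapsed_s <= subtask.time_limit_s
        and result.peak_memory_mb <= subtask.memory_limit_mb
    )


def pass_matrix(
    results: List[List[RunResult]], subtasks: List[List[Subtask]]
) -> List[List[bool]]:
    """Apply the pass rule cell by cell; rows index time, columns index space."""
    return [
        [passes(r, s) for r, s in zip(result_row, subtask_row)]
        for result_row, subtask_row in zip(results, subtasks)
    ]
```

A solution that is correct but exceeds the memory limit of a cell fails that cell, even if it passes a neighbouring cell with the same time limit and a looser memory limit.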
03

Pareto-optimal baselines

A single baseline is insufficient under time-space trade-offs.

  • BEST uses expert-crafted Pareto-optimal implementations.
  • A solution is Pareto-optimal if no alternative improves one of time or space efficiency without worsening the other.
  • This enables fair evaluation under different resource constraints.
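
The standard notion of Pareto optimality over (time, space) costs can be sketched as below. The cost pairs and the function names `dominates` and `pareto_front` are hypothetical; how BEST's experts actually select baselines is not described here.

```python
from typing import List, Tuple

# (time cost, space cost); lower is better in both coordinates.
Cost = Tuple[float, float]


def dominates(a: Cost, b: Cost) -> bool:
    """True if a is no worse than b in both costs and strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(costs: List[Cost]) -> List[Cost]:
    """Keep every cost pair that no other pair dominates (input order preserved)."""
    return [c for c in costs if not any(dominates(other, c) for other in costs)]
```

For example, among the made-up costs `(1.0, 64.0)`, `(2.0, 16.0)`, `(3.0, 32.0)`, and `(5.0, 8.0)`, the pair `(3.0, 32.0)` is dropped because `(2.0, 16.0)` is better in both time and space; the other three remain as incomparable trade-offs.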
04

dual@k metric

dual@k generalizes pass@k to time-space dual efficiency.

  • Form a score matrix S by computing subtask-level pass@k
  • Apply a weight matrix W that weights harder subtasks more heavily, controlled by two parameters τ and σ
  • The final dual@k score is the Frobenius inner product of S and W, normalized by the score of Pareto-optimal baselines; it represents the weighted performance of LLM-generated code
  • By tuning the time weight τ and the space weight σ, dual@k supports time-only evaluation, space-only evaluation, and joint time-space evaluation
dual@k = (weighted model score) / (weighted baselines’ score)
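
The ratio above can be sketched in code. The `pass_at_k` function is the unbiased estimator from Chen et al., 2021. Everything about the weights is an assumption: this page does not give the paper's weight matrix, so `geometric_weights` uses a hypothetical form W[i][j] = τ^i · σ^j purely for illustration. The baseline score matrix is taken as an input, since how it is derived from the Pareto-optimal baselines is also not specified here.

```python
from math import comb
from typing import List

Matrix = List[List[float]]


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k (Chen et al., 2021): 1 - C(n - c, k) / C(n, k).

    n = samples generated, c = samples that pass, k = budget (k <= n).
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def geometric_weights(rows: int, cols: int, tau: float, sigma: float) -> Matrix:
    """ASSUMPTION, not the paper's construction: W[i][j] = tau**i * sigma**j.

    Rows (index i) get stricter in time, columns (index j) stricter in space.
    """
    return [[tau**i * sigma**j for j in range(cols)] for i in range(rows)]


def frobenius_inner(a: Matrix, b: Matrix) -> float:
    """Sum of elementwise products of two equally shaped matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def dual_at_k(score: Matrix, baseline: Matrix, weights: Matrix) -> float:
    """Weighted model score divided by weighted baseline score."""
    return frobenius_inner(score, weights) / frobenius_inner(baseline, weights)
```

With this hypothetical weighting, a model that passes every subtask the baselines pass scores 1, and failing only the strictest cells costs more than failing the easiest ones whenever τ and σ exceed 1.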
Results

What do we find?

Correct code is often inefficient. Across evaluated LLMs: dual@10 < pass@10.

Space efficiency is harder. LLMs are generally weaker at generating space-efficient code than time-efficient code.

Reasoning helps efficiency. Large reasoning models achieve stronger time-space efficiency.

Algorithm category matters. Models perform better on more structured categories and struggle more on reasoning-heavy ones.

Table 2. Code efficiency evaluation on 50 widely studied LLMs. Evaluations are conducted under three choices of τ and σ, covering time-only, space-only, and time-space dual efficiency. The dual@k metric is compared with eff@k (Qiu et al., 2024) and pass@k (Chen et al., 2021). The most efficient result for each metric is bolded and underlined.

Table 3. The dual@10 (σ = τ = 1.2) across 10 algorithm categories for three representative LLMs. The most efficient result under each difficulty label is bolded and underlined, while the least efficient is highlighted in grey.

Qiu et al., 2024: Ruizhong Qiu, Weiliang Will Zeng, James Ezick, et al. How Efficient is LLM-Generated Code? A Rigorous & High-Standard Benchmark.

Chen et al., 2021: Mark Chen, Jerry Tworek, Heewoo Jun, et al. Evaluating Large Language Models Trained on Code.

Conclusion

Takeaway

Efficient code generation requires evaluating correctness, time, and space together. BEST provides a fine-grained benchmark for diagnosing whether LLMs generate code that is not only correct, but also resource-efficient.

BibTeX

Cite this work

@inproceedings{shen2026best,
  title = {BEST: Benchmarking Efficiency in Space and Time for LLM-Generated Code},
  author = {Shen, Aocheng and Zhang, Boyu and Li, Jiaze and Ma, Ruixuan and Zhang, Qiankun and Wang, Jing and Yuan, Bin and Liu, Shenghao and Deng, Xianjun},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year = {2026}
}