DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models

Yiming Huang*1, Jianwen Luo*1,2, Yan Yu1, Yitong Zhang1, Fangyu Lei1,2,
Yifan Wei1,2, Shizhu He1,2, Lifu Huang3, Xiao Liu4, Jun Zhao1,2, Kang Liu1,2
1Institute of Automation, Chinese Academy of Sciences, 2University of Chinese Academy of Sciences,
3University of California, Davis, 4Microsoft Research Asia

Abstract

We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. The benchmark has three core features. First, the tasks in DA-Code are inherently challenging, setting them apart from traditional code generation tasks and demanding advanced skills in grounding and planning. Second, all examples are built on real, diverse data and cover a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, models must use complex data science programming languages to perform intricate data processing and derive the answers. We set up the benchmark in a controllable, executable environment that aligns with real-world data analysis scenarios and is scalable. Annotators meticulously design the evaluation suite to ensure accurate and robust evaluation. We also develop a baseline, DA-Agent. Experiments show that although DA-Agent outperforms other existing frameworks, even with the current best LLMs it achieves only 30.5% accuracy, leaving ample room for improvement.
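To make the agent setting concrete, below is a minimal conceptual sketch of the generate-execute-observe loop such an agent runs in an executable environment. This is not the DA-Agent implementation; the function names (`query_llm`, `run_in_sandbox`) and the stop condition are hypothetical placeholders.

```python
# A minimal sketch of an agent loop for DA-Code-style tasks, assuming a
# caller supplies `query_llm`. Names and the stop condition are invented
# for illustration; this is not the authors' DA-Agent implementation.
import subprocess
import tempfile

def run_in_sandbox(code: str, timeout: int = 60) -> str:
    """Execute generated Python code in a subprocess and capture its output."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        ["python", path], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

def solve(task_prompt: str, query_llm, max_steps: int = 10) -> str:
    """Iteratively generate code, execute it, and feed the observation back."""
    history = task_prompt
    for _ in range(max_steps):
        code = query_llm(history)            # LLM proposes the next action
        observation = run_in_sandbox(code)   # environment executes it
        history += f"\n# Code:\n{code}\n# Observation:\n{observation}"
        if "FINAL ANSWER" in observation:    # hypothetical stop condition
            break
    return history
```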

News

  • Oct. 3, 2024: We release DA-Code, an agent data science code generation benchmark for large language models.
  • Sep. 24, 2024: Our paper is accepted by EMNLP 2024.

Data Examples

[Figure: Example tasks from DA-Code]
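To illustrate the flavor of these tasks, here is a hypothetical data-wrangling example in the style of DA-Code (not an actual benchmark instance); the file and column names are invented.

```python
# A hypothetical DA-Code-style data-wrangling task: clean a raw CSV of
# sales records and compute per-region revenue. File and column names
# are invented for illustration; this is not a real benchmark example.
import pandas as pd

df = pd.read_csv("sales_raw.csv")

# Typical wrangling steps: deduplicate, normalize types, handle missing values.
df = df.drop_duplicates()
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["quantity"] = df["quantity"].fillna(0).astype(int)
df = df.dropna(subset=["price", "region"])

# Derive the answer the task asks for: total revenue per region.
df["revenue"] = df["price"] * df["quantity"]
answer = df.groupby("region")["revenue"].sum().sort_values(ascending=False)
answer.to_csv("result.csv")
```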

Have Questions?

Ask us questions on our GitHub issues page, or contact Yiming Huang, Jianwen Luo, or Fangyu Lei for more information.

Data Statistics

We present an overview of DA-Code’s data statistics, showcasing its structure and variety of tasks. DA-Code contains 500 tasks in total, categorized into Data Wrangling (DW), Machine Learning (ML), and Exploratory Data Analysis (EDA).

ML comprises sub-tasks such as Classification, Regression, and Clustering; EDA includes Visualization, Statistical Analysis, Data Insights, and Data Manipulation; and DW encompasses tasks such as Data Loading, Cleaning, and Transformation.
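The sketch below illustrates what code from each of the three categories might look like. It assumes a pandas DataFrame with invented columns (`value`, `score`); it marks category boundaries for illustration and does not reproduce actual benchmark tasks.

```python
# Minimal sketches of the three DA-Code task categories, using invented
# column names; these illustrate the category boundaries only.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv("data.csv")

# Data Wrangling (DW): loading, cleaning, transformation.
df = df.dropna().rename(columns=str.lower)

# Exploratory Data Analysis (EDA): statistics and visualization.
print(df.describe())
df["value"].hist(bins=30)
plt.savefig("hist.png")

# Machine Learning (ML): e.g., clustering numeric features.
labels = KMeans(n_clusters=3, n_init=10).fit_predict(df[["value", "score"]])
df["cluster"] = labels
```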

[Table: Data Statistics of Examples in DA-Code]

[Figure: DA-Code Task Types Proportion]

[Figure: DA-Code File Types Proportion]

Leaderboard of DA-Agent Baseline Experiments

All task-category (DW, ML, EDA) and difficulty (Easy, Medium, Hard) columns report accuracy (%).

Rank  Model                DW    ML    EDA   Easy  Medium  Hard  Total  Completion Rate (%)  Avg Steps  Executable Code (%)
1     GPT-4                30.4  48.4  24.6  45.4  27.8    23.4  30.5   99.4                 7.3        76.8
2     GPT-4o               33.3  48.0  21.3  46.2  25.6    21.7  29.1   97.4                 6.8        77.7
3     Claude-3-Opus        29.3  46.8  20.7  44.7  23.8    19.0  27.6   97.7                 8.9        75.7
4     Qwen2.5-72B          24.9  41.8  15.4  31.9  19.4    22.3  22.6   93.8                 8.6        72.2
5     Deepseek-Coder-V2.5  25.1  34.1  14.7  32.8  18.7    14.1  20.7   89.8                 7.1        59.0
6     Mixtral-8x22B        14.8  31.6  10.2  17.6  16.8    8.6   15.4   67.2                 11.1       55.1
7     Deepseek-Coder-33B   9.1   22.1  7.6   12.4  11.3    7.9   10.8   31.9                 11.6       49.7