AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agents
(2024)

Paper · Code · Data

About AgentBoard

AgentBoard is a benchmark designed for multi-turn LLM agents, complemented by an analytical evaluation board for detailed model assessment beyond final success rates.
The main performance of different LLMs across various environments is shown below; please check our Results page for more details.

Illustrative Overview

AgentBoard consists of 9 diverse tasks and 1013 exemplary environments, spanning embodied AI, game, web, and tool agents. Our environments provide well-annotated subgoals and fine-grained interactions, supporting detailed analyses of agent behavior beyond final success rates, as shown below. You may explore our dataset examples at Explore, or check our paper for more details.

[Figure: overview.png — illustrative overview of AgentBoard's tasks and analytical evaluation.]
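To give a concrete sense of subgoal-based evaluation, here is a minimal Python sketch of how a per-task progress rate could be computed from annotated subgoals. The trajectory format and the subgoal_achieved checker are hypothetical stand-ins for illustration, not AgentBoard's actual API.

# A minimal sketch of subgoal-based progress evaluation. The trajectory
# format and subgoal checker are hypothetical, not AgentBoard's actual API.

def progress_rate(subgoals, trajectory, subgoal_achieved):
    """Return the fraction of annotated subgoals satisfied at any point.

    subgoals:         list of annotated subgoal specifications
    trajectory:       sequence of environment states, one per agent turn
    subgoal_achieved: callable(subgoal, state) -> bool (hypothetical checker)
    """
    achieved = set()
    for state in trajectory:
        for i, goal in enumerate(subgoals):
            if i not in achieved and subgoal_achieved(goal, state):
                achieved.add(i)
    return len(achieved) / len(subgoals) if subgoals else 0.0

# Unlike a binary success rate, this credits partial completion, e.g.:
# rate = progress_rate(
#     ["pick up the key", "open the door"],
#     states,
#     lambda goal, state: goal in state.get("achieved_subgoals", []),
# )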

Data

Our data can be downloaded directly from Hugging Face Datasets. Please refer to our GitHub instructions for how to read and use the data.
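As a minimal sketch, the files could be fetched with the huggingface_hub client; note that the repository ID below is an assumption, so please follow the GitHub instructions for the authoritative dataset location and loading code.

# Minimal sketch of downloading the data with the huggingface_hub client.
# The repo_id is an assumption; see the GitHub instructions for the exact
# dataset location and loading code.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="hkust-nlp/agentboard",  # assumed Hugging Face dataset ID
    repo_type="dataset",
)
print(f"Dataset files downloaded to {local_dir}")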

Citation

@misc{ma2024agentboard,
      title={AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents},
      author={Chang Ma and Junlei Zhang and Zhihao Zhu and Cheng Yang and Yujiu Yang and Yaohui Jin and Zhenzhong Lan and Lingpeng Kong and Junxian He},
      year={2024},
      eprint={2401.13178},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contact Us

Have any questions about AgentBoard? Please contact us at llmagentboard@gmail.com or create an issue on GitHub. For potential collaboration, please contact junxianh2@gmail.com.