About AgentBoard
AgentBoard is a benchmark designed for multi-turn LLM agents, complemented by an analytical
evaluation board for detailed model assessment beyond final success rates.
The main performance of different LLMs across various environments is shown below; please check our Result page for more details.
Illustrative Overview
AgentBoard consists of 9 diverse tasks and 1013 exemplary environments, ranging from
embodied AI and game agents to web and tool agents.
Our environments provide well-annotated subgoals and fine-grained interactions. Furthermore,
AgentBoard offers detailed analyses for agent evaluation, as shown below.
You may explore our dataset examples on the Explore page, or check
our paper for more details.
Citation
@misc{ma2024agentboard,
title={AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents},
author={Chang Ma and Junlei Zhang and Zhihao Zhu and Cheng Yang and Yujiu Yang and Yaohui Jin and Zhenzhong Lan and Lingpeng Kong and Junxian He},
year={2024},
eprint={2401.13178},
archivePrefix={arXiv},
primaryClass={cs.CL}
}