We conduct a comprehensive multi-turn evaluation of popular large language models as agents, covering both API-based proprietary models and open-weight models. The evaluation analyzes performance along four axes: main results, dimension scores, long-range interaction, and performance on easy versus hard examples.
Dataset Statistics: We categorize the nine tasks into four distinct categories: Embodied AI, Games, Web, and Tools. Under Embodied AI, we include Alfworld, ScienceWorld, and BabyAI; the Games category contains Jericho and PDDL; the Web category encompasses WebShop and WebArena; for Tools, we include Tool-Query and Tool-Operation.
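For reference, this task-to-category grouping can be written down as a simple mapping; the sketch below is purely illustrative, and the variable name is not part of any released codebase.

```python
# Illustrative grouping of the nine evaluated tasks into four categories.
# The names follow the paper; the data structure itself is only for reference.
TASK_CATEGORIES = {
    "Embodied AI": ["Alfworld", "ScienceWorld", "BabyAI"],
    "Games": ["Jericho", "PDDL"],
    "Web": ["WebShop", "WebArena"],
    "Tools": ["Tool-Query", "Tool-Operation"],
}
```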
Evaluation metrics: We introduce two primary metrics for evaluation: success rate and progress rate. The
success rate measures the proportion of instances in which the goal of an
environment is achieved. The progress rate reflects the proportion of
completed sub-goals. In addition, we use grounding accuracy as a fundamental metric of agent performance; it quantifies the percentage of valid actions taken in each task.
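As a concrete illustration, the three metrics can be aggregated from per-episode records as sketched below; this is a minimal sketch, and the record fields (goal_achieved, subgoals_completed, subgoals_total, valid_actions, total_actions) are hypothetical names introduced only for this example.

```python
def compute_metrics(episodes):
    """Aggregate success rate, progress rate, and grounding accuracy.

    Each episode is assumed to be a dict with hypothetical fields:
      goal_achieved (bool), subgoals_completed (int), subgoals_total (int),
      valid_actions (int), total_actions (int).
    """
    n = len(episodes)
    # Success rate: fraction of episodes in which the environment goal is achieved.
    success_rate = sum(ep["goal_achieved"] for ep in episodes) / n
    # Progress rate: average fraction of completed sub-goals per episode.
    progress_rate = sum(
        ep["subgoals_completed"] / ep["subgoals_total"] for ep in episodes
    ) / n
    # Grounding accuracy: share of actions that were valid in the environment.
    grounding_acc = sum(ep["valid_actions"] for ep in episodes) / sum(
        ep["total_actions"] for ep in episodes
    )
    return success_rate, progress_rate, grounding_acc
```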
Our dimension scoring framework evaluates a model's capability
across six dimensions: memory, planning, world modeling,
retrospection,
grounding, and spatial navigation.
Memory measures the ability to incorporate long-range information from the context, planning assesses the decomposition of complex goals into manageable sub-goals, world modeling tests the knowledge necessary for task completion, retrospection captures the ability to use environmental feedback, grounding focuses on competency in generating valid actions, and spatial navigation represents efficiency in moving to a target location.
One important characteristic of LLM agents is their ability to engage
in multi-round interactions, allowing them to continuously gather information and make progress.
Here we show how the models progress across long-range interactions.
Specifically, we calculate the progress rate relative to the number of interaction steps.
We observe that proprietary models (i.e., GPT-4, Claude2, and GPT-3.5-Turbo) continue to gain rewards across 30 steps on Alfworld and PDDL tasks. In contrast, on WebArena and Tool tasks, they rapidly reach a peak reward value and then cease to gain further rewards. This trend may be due to the fact that tasks in Embodied AI and Games generally require more steps to complete.
Open-weight models, by contrast, quickly reach their peak progress rate; as the number of steps increases, they fail to continue gaining rewards across all tasks, and most of them stop making progress after around 6 steps.
This phenomenon suggests that these models are limited in handling long-range interactions: longer interactions increase reasoning complexity and require an extended context length, which in turn constrains their ability to solve complex tasks that involve long-range interactions.
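The progress-rate-versus-step curves underlying this analysis can be computed as in the following sketch, assuming each episode logs its cumulative progress rate after every interaction step; the field name step_progress is an assumption made for illustration.

```python
def progress_curve(episodes, max_steps=30):
    """Average cumulative progress rate at each interaction step.

    Each episode is assumed to carry a hypothetical list `step_progress`,
    where step_progress[t] is the fraction of sub-goals completed after
    step t+1; episodes that terminate early keep their final value.
    """
    curve = []
    for t in range(max_steps):
        vals = []
        for ep in episodes:
            prog = ep["step_progress"]
            # Carry the last observed progress forward once an episode has ended.
            vals.append(prog[t] if t < len(prog) else prog[-1])
        curve.append(sum(vals) / len(vals))
    return curve
```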
For each task, we divide the environments into two categories, "easy" and "hard", according to the task type or the number of sub-goals.
Notably, even with a minor difference in task complexity (for some tasks, easy samples consist of fewer than 3 sub-goals, while hard samples predominantly comprise 4-6 sub-goals), all models suffer a significant average performance drop on hard examples, with the GPT-4 success rate dropping by 31.2% on average. This indicates that even the most capable large language models (LLMs), such as GPT-4, are limited in terms of task compositionality.
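For illustration, a split based purely on sub-goal count could look like the sketch below; the threshold and the num_subgoals field are assumptions for this example, and the actual split additionally considers task type.

```python
def split_by_difficulty(environments, easy_max_subgoals=3):
    """Partition environments into easy/hard by the number of sub-goals.

    Assumes each environment dict exposes a hypothetical `num_subgoals`
    field; in practice the split may also depend on the task type.
    """
    easy = [env for env in environments if env["num_subgoals"] < easy_max_subgoals]
    hard = [env for env in environments if env["num_subgoals"] >= easy_max_subgoals]
    return easy, hard
```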