We conduct a comprehensive multi-turn evaluation of popular large language models as agents, covering both API-based proprietary models and open-weight models. The evaluation analyzes performance along four axes: main results, dimension scores, long-range interaction, and performance on easy versus hard examples.
Dataset Statistics: We categorize the nine tasks into four distinct categories: Embodied AI, Games, Web, and Tools. Under Embodied AI, we include Alfworld, ScienceWorld, and BabyAI; the Games category contains Jericho and PDDL; the Web category encompasses WebShop and WebArena; for Tools, we include Tool-Query and Tool-Operation.
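For reference, this task-to-category grouping can be written down as a simple mapping; the sketch below is purely illustrative, and the variable name is not part of any released codebase.

```python
# Illustrative grouping of the nine evaluated tasks into four categories.
# The names follow the paper; the data structure itself is only for reference.
TASK_CATEGORIES = {
    "Embodied AI": ["Alfworld", "ScienceWorld", "BabyAI"],
    "Games": ["Jericho", "PDDL"],
    "Web": ["WebShop", "WebArena"],
    "Tools": ["Tool-Query", "Tool-Operation"],
}
```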
Evaluation metrics: We introduce two primary metrics for evaluation: success rate and progress rate. The
success rate measures the proportion of instances in which the goal of an
environment is achieved. The progress rate reflects the proportion of
completed sub-goals. In addition, we use grounding accuracy as a fundamental metric of agent performance; it quantifies the percentage of valid actions taken in each task.
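As a concrete illustration, the three metrics can be aggregated from per-episode records as sketched below; this is a minimal sketch, and the record fields (goal_achieved, subgoals_completed, subgoals_total, valid_actions, total_actions) are hypothetical names introduced only for this example.

```python
def compute_metrics(episodes):
    """Aggregate success rate, progress rate, and grounding accuracy.

    Each episode is assumed to be a dict with hypothetical fields:
      goal_achieved (bool), subgoals_completed (int), subgoals_total (int),
      valid_actions (int), total_actions (int).
    """
    n = len(episodes)
    # Success rate: fraction of episodes in which the environment goal is achieved.
    success_rate = sum(ep["goal_achieved"] for ep in episodes) / n
    # Progress rate: average fraction of completed sub-goals per episode.
    progress_rate = sum(
        ep["subgoals_completed"] / ep["subgoals_total"] for ep in episodes
    ) / n
    # Grounding accuracy: share of actions that were valid in the environment.
    grounding_acc = sum(ep["valid_actions"] for ep in episodes) / sum(
        ep["total_actions"] for ep in episodes
    )
    return success_rate, progress_rate, grounding_acc
```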
Our dimension scoring framework evaluates a model's capability
across six dimensions: memory, planning, world modeling,
retrospection,
grounding, and spatial navigation.
Memory measures the ability to incorporate long-range information from the context, planning assesses the decomposition of complex goals into manageable sub-goals, world modeling tests the knowledge necessary for task completion, retrospection captures the ability to use environmental feedback, grounding focuses on competency in generating valid actions, and spatial navigation represents efficiency in moving to a target location.
One important characteristic of LLM agents is their ability to engage
in multi-round interactions, allowing them to continuously gather information and make progress.
Here we show how the models progress across long-range interactions.
Specifically, we calculate the progress rate relative to the number of interaction steps.
We observe that proprietary models (i.e., GPT-4, Claude2, and GPT-3.5-Turbo) continue to gain rewards across 30 steps on Alfworld and PDDL tasks. In contrast, on WebArena and Tool tasks, they rapidly reach a peak reward value and then cease to gain further rewards. This trend may be due to the fact that tasks in Embodied AI and Games generally require more steps to complete.
Open-weight models, by contrast, quickly reach their peak progress rate; as the number of steps increases, they fail to continue gaining rewards across all tasks, and most of them stop making progress after around 6 steps.
This phenomenon suggests that these models are limited in handling long-range interactions: longer interactions increase reasoning complexity and require an extended context length, which in turn constrains their ability to solve complex tasks that involve long-range interactions.
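The progress-rate-versus-step curves underlying this analysis can be computed as in the following sketch, assuming each episode logs its cumulative progress rate after every interaction step; the field name step_progress is an assumption made for illustration.

```python
def progress_curve(episodes, max_steps=30):
    """Average cumulative progress rate at each interaction step.

    Each episode is assumed to carry a hypothetical list `step_progress`,
    where step_progress[t] is the fraction of sub-goals completed after
    step t+1; episodes that terminate early keep their final value.
    """
    curve = []
    for t in range(max_steps):
        vals = []
        for ep in episodes:
            prog = ep["step_progress"]
            # Carry the last observed progress forward once an episode has ended.
            vals.append(prog[t] if t < len(prog) else prog[-1])
        curve.append(sum(vals) / len(vals))
    return curve
```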
For each task, we divide the environments into two categories, "easy" and "hard", according to the task type or the number of sub-goals.
Notably, even with a minor difference in task complexity (for some tasks, easy samples consist of fewer than 3 sub-goals, while hard samples predominantly comprise 4-6 sub-goals), all models suffer a significant average performance drop on hard examples, with the GPT-4 success rate dropping by 31.2% on average. This indicates that even the most capable large language models (LLMs), such as GPT-4, are limited in terms of task compositionality.
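For illustration, a split based purely on sub-goal count could look like the sketch below; the threshold and the num_subgoals field are assumptions for this example, and the actual split additionally considers task type.

```python
def split_by_difficulty(environments, easy_max_subgoals=3):
    """Partition environments into easy/hard by the number of sub-goals.

    Assumes each environment dict exposes a hypothetical `num_subgoals`
    field; in practice the split may also depend on the task type.
    """
    easy = [env for env in environments if env["num_subgoals"] < easy_max_subgoals]
    hard = [env for env in environments if env["num_subgoals"] >= easy_max_subgoals]
    return easy, hard
```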