The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This survey provides an in-depth overview of the emerging field of LLM agent evaluation, introducing a two-dimensional taxonomy that organizes existing work along (1) evaluation objectives—what to evaluate, such as agent behavior, capabilities, reliability, and safety—and (2) evaluation process—how to evaluate, including interaction modes, datasets and benchmarks, metric computation methods, and tooling.
Beyond the taxonomy, the paper highlights enterprise-specific challenges (role-based access, reliability guarantees, dynamic long-horizon interactions, compliance) and identifies future research directions, including holistic, more realistic, scalable, and efficient evaluation.
Why evaluating LLM agents is fundamentally harder than evaluating LLMs
Agents based on LLMs are autonomous or semi-autonomous systems that use LLMs to reason, plan, and act, and represent a rapidly growing frontier in artificial intelligence. From customer service bots and coding copilots to digital assistants, LLM agents are redefining how we build intelligent systems.
As these agents move from research prototypes to real-world applications, the question of how to rigorously evaluate them becomes both pressing and complex. Evaluating LLM agents is more complex than evaluating LLMs in isolation. Unlike LLMs—primarily assessed for text generation or question answering—LLM agents operate in dynamic, interactive environments. They reason and make plans, execute tools, leverage memory, and even collaborate with humans or other agents.
Key Analogy: Evaluating an LLM is like testing an engine on its own. Evaluating an agent is like assessing the whole car: its overall performance, and its behavior under various driving conditions.
LLM agent evaluation also differs from traditional software evaluation. While software testing focuses on deterministic and static behavior, LLM agents are inherently probabilistic and behave dynamically. The evaluation of LLM agents is at the intersection of NLP, HCI, and software engineering, demanding additional perspectives.
| Paradigm | Nature of the evaluation target |
|---|---|
| LLM evaluation | Static text generation, QA, fixed benchmarks |
| Traditional software testing | Deterministic, static, predefined behavior |
| LLM agent evaluation | Dynamic, interactive, probabilistic, multi-modal |
This survey's contributions are twofold:
A two-dimensional framework organizing the field along two axes:
- "What to evaluate": evaluation objectives (behavior, capabilities, reliability, safety)
- "How to evaluate": evaluation process (interaction modes, datasets and benchmarks, metric computation, tooling)
This taxonomy serves both as a conceptual framework and a practical guide, enabling systematic comparison and analysis of LLM agents across a wide range of goals, methodologies, and deployment conditions. As LLM agents are deployed in increasingly diverse settings, factors such as single-turn vs. multi-turn interactions, multilingualism, and multimodality all become more important.
The four pillars of what to evaluate in LLM agents
Agent behavior refers to the overall performance of the agent as perceived by a user, treating the agent as a black box. It represents the highest-level view in evaluation and offers the most direct insight into the user experience.
Task completion is a fundamental objective, assessing whether an agent successfully achieves the predefined goals of a given task. It involves determining whether a desired state is reached or if specific criteria defined for task success are met. Although sometimes noted for providing limited fine-grained insight into failures, task completion remains a predominant and essential measure of overall agent performance.
Key metrics:
Metrics such as \(\text{pass@}k\) and \(\text{pass}^k\) extend binary task success by considering success over multiple trials. A binary reward function that returns 0 or 1 for goal achievement is also common.
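As an illustration, both estimators can be computed in a few lines. The sketch below assumes each task was run \(n\) times with \(c\) successes: \(\text{pass@}k\) uses the standard unbiased combinatorial form, while \(\text{pass}^k\) is a simple plug-in estimate from the empirical success rate.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: probability that at least one of
    k samples drawn from n recorded trials (c successful) succeeds."""
    if n - c < k:
        return 1.0  # too few failures to draw k all-failing samples
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Plug-in estimate of pass^k: probability that all k independent
    attempts succeed, given an empirical success rate of c/n."""
    return (c / n) ** k
```

With 6 successes out of 10 trials, `pass_at_k(10, 6, 2)` rewards succeeding at least once across two attempts, while `pass_hat_k(10, 6, 2)` demands that both attempts succeed, which is why the latter is the stricter consistency measure.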
Coding & Software Engineering: SWE-bench (GitHub issues), ScienceAgentBench (scientific data analysis), CORE-Bench (research reproduction), PaperBench (research replication), AppWorld (interactive coding)
Web Environments: BrowserGym, WebArena, WebCanvas (general web navigation), VisualWebArena, MMInA (multimodal web), AssistantBench (realistic, time-consuming web tasks)
Output quality refers to the characteristics of responses by an LLM agent—an umbrella term encompassing accuracy, relevance, clarity, coherence, and adherence to agent specifications or task requirements. An agent may complete a task yet still deliver a subpar user experience if the interaction lacks these qualities.
Many metrics overlap with LLM evaluation: fluency measures adherence to natural-language conventions, while logical coherence captures the rigor of arguments. Standard RAG metrics also apply, including Response Relevance and Factual Correctness.
Latency is critical in synchronous agent interactions. Long wait times degrade user experience and erode trust.
- Time-to-first-token (TTFT): delay before the user sees the first token of the LLM's response
- Total response time: time to receive the complete response
Cost measures monetary efficiency, typically estimated from input and output token counts, which map directly to usage-based pricing in most LLM deployments.
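Both latency and cost can be instrumented with plain Python; in the sketch below, the token stream and the per-1K-token prices are illustrative stand-ins, not any vendor's real API or rates.

```python
import time

def measure_latency(token_stream):
    """Measure time-to-first-token (TTFT) and total response time for
    any iterable that yields response tokens."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        tokens.append(tok)
    total = time.perf_counter() - start
    return ttft, total, "".join(tokens)

def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate a request's monetary cost from token counts.
    Prices are illustrative parameters, not real vendor rates."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Hypothetical rates: $0.01 / 1K input tokens, $0.03 / 1K output tokens.
cost = estimate_cost(2000, 500, 0.01, 0.03)  # 0.02 + 0.015 = 0.035
```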
Beyond external behavior, evaluations often target specific capabilities that enable agent performance. Key aspects include tool use, planning and reasoning, memory and context retention, and multi-agent collaboration.
Tool use is a core capability, enabling agents to retrieve grounding information, perform actions, and interact with external environments. In this survey, tool use involves invocation of a single tool (interchangeable with function calling); more complex cases of determining tool sequences are covered under Planning & Reasoning.
The evaluation of tool use involves answering several key questions:
Metrics:
| Metric | What it measures |
|---|---|
| Invocation Accuracy | Whether the agent correctly decides to call a tool at all |
| Tool Selection Accuracy | Whether the proper tool is chosen from options |
| Retrieval Accuracy | Correct tool retrieval from a larger toolset (rank accuracy@k) |
| Mean Reciprocal Rank (MRR) | Position of correct tool in ranked list |
| NDCG | How well the system ranks all relevant tools |
| Parameter Name F1 | Ability to identify correct parameter names |
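The retrieval-oriented metrics in the table above can be sketched directly; the code assumes each query produces a ranked list of tool names and has a single gold tool, which is a simplification of multi-tool settings.

```python
def retrieval_accuracy_at_k(ranked: list[str], gold: str, k: int) -> float:
    """1.0 if the gold tool appears among the top-k retrieved tools."""
    return 1.0 if gold in ranked[:k] else 0.0

def mean_reciprocal_rank(rankings: list[list[str]],
                         golds: list[str]) -> float:
    """Average of 1/rank of the gold tool across queries
    (contributes 0 when the gold tool is absent from the ranking)."""
    total = 0.0
    for ranked, gold in zip(rankings, golds):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(rankings)
```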
Planning involves selecting the correct set of tools in an appropriate order. Reasoning enables agents to make context-aware decisions, either ahead of time or dynamically during task execution.
T-Eval formulates planning evaluation as comparing the set of predicted tools against a reference. Since tool order and dependencies matter, some benchmarks instead adopt graph-based representations of the plan and score predictions by how closely their nodes and edges match the reference graph.
In dynamic environments, agents often need to interleave planning and execution—the ReAct paradigm where agents alternate between reasoning steps and tool usage. AgentBoard proposes Progress Rate, comparing the agent's actual trajectory against the expected one.
When agents generate complete multi-step programs, evaluation methods from code generation become relevant. ScienceAgentBench uses program similarity metrics. Step Success Rate measures the percentage of steps successfully executed.
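As a rough sketch, Step Success Rate and a simplified progress check can be computed from an execution trace. The trace format below is an assumption for illustration, and the in-order prefix matching is a simplified reading of AgentBoard-style progress, not its actual implementation.

```python
def step_success_rate(executed: list[tuple[str, bool]]) -> float:
    """Fraction of trajectory steps that executed successfully.
    `executed` is a list of (step_name, succeeded) pairs."""
    if not executed:
        return 0.0
    return sum(ok for _, ok in executed) / len(executed)

def progress_rate(predicted: list[str], reference: list[str]) -> float:
    """Fraction of reference steps matched, in order, by the predicted
    trajectory (a simplified progress-rate stand-in)."""
    i = 0
    for step in predicted:
        if i < len(reference) and step == reference[i]:
            i += 1
    return i / len(reference) if reference else 0.0
```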
A critical capability for long-running agents is retaining information throughout many interactions and applying previous context to current requests.
Guan et al. categorize memory evaluation along two axes:
- Duration: how long information is stored
- Representation: how information is represented
Benchmarks: LongEval and SocialBench test context retention in long dialogues (40+ turns). Maharana et al. demonstrate evaluation with 600+ turn dialogues. Li et al. introduce memory-enhanced evaluation tracking consistency in long-horizon tasks.
Metrics: Factual Recall Accuracy; Consistency Score (no contradictions between turns)
Memory evaluation may also consider working memory for tool-using agents and forgetting strategies (whether the agent appropriately forgets irrelevant details to avoid confusion).
Evaluating multi-agent collaboration in LLM-based systems requires different methodologies compared to traditional reinforcement learning–driven coordination. Unlike conventional agents that rely on predefined reward structures, LLM agents coordinate through natural language, strategic reasoning, and decentralized problem-solving.
These capabilities are crucial in real-world applications such as financial decision-making and structured data analysis, where autonomous agents must exchange information, negotiate, and synchronize.
Key metric: Collaborative Efficiency — assessing how well multiple agents share responsibilities and distribute tasks dynamically.
Reliability is crucial for enterprise and safety-critical applications. It encompasses consistency, robustness to variations, and trustworthiness of outputs. Unlike task performance (which might measure best-case capabilities), reliability evaluation probes worst-case and average-case scenarios.
Stability of output when the same task is repeated multiple times. Because LLMs are inherently non-deterministic, agents built on them exhibit run-to-run variability.
Metrics:
\(\text{pass@}k\) — probability of success at least once over \(k\) attempts
\(\text{pass}^k\) — whether the agent succeeds in all \(k\) attempts (stricter, from \(\tau\)-bench)
The \(\text{pass}^k\) metric better captures the consistency requirements of mission-critical deployments.
Stability of output under input variations or environmental changes. Stress-testing with perturbed inputs: paraphrased instructions, irrelevant context, typos, dialects.
HELM benchmark tracks performance degradation under input variation.
Adaptive resilience: WebLinX examines behavior when web page structure changes during execution.
Error handling: ToolEmu evaluates whether agents respond to tool failures (API errors, null responses) gracefully—retry, switch tools, or explain.
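Perturbation-based stress testing can be sketched in a few lines: inject typos into task instructions and compare success rates before and after. The `agent` callable (mapping an instruction to True/False success) and the character-swap perturbation are illustrative assumptions, not any benchmark's actual harness.

```python
import random

def typo_perturb(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Inject character-swap typos into an instruction at a given rate.
    One of many perturbation styles (paraphrase, dialect, noise)."""
    rng = random.Random(seed)  # seeded for reproducible perturbations
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() \
                and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_gap(agent, tasks, perturb) -> float:
    """Drop in success rate when each task instruction is perturbed."""
    clean = sum(agent(t) for t in tasks) / len(tasks)
    noisy = sum(agent(perturb(t)) for t in tasks) / len(tasks)
    return clean - noisy
```

A large gap signals brittleness: the agent's competence depends on surface wording rather than task understanding.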
Safety covers adherence to ethical guidelines, avoidance of harmful behavior, and compliance with legal or policy constraints. As LLM agents become more powerful and autonomous, the risk of unintended adverse outcomes grows—disinformation, hate speech, unsafe instructions.
The lack of fairness and transparency can result in biased outcomes, decreased trust, and unintended societal consequences. In financial applications, biased decision-making in loan approvals or investment strategies can reinforce systemic inequalities.
Explainability is crucial for enhancing user trust. Methods include guideline-driven decision-making (AutoGuide) and structured transparency mechanisms (MATSA, FinCon). R-Judge analyzes how agents perceive risk when making autonomous decisions.
Evaluation uses specialized test sets: RealToxicityPrompts (prompts likely to elicit toxic content), checked with automated toxicity detectors. Metrics include percentage of responses containing toxic language and average toxicity score.
HELM includes toxicity and bias metrics as part of holistic evaluation. For interactive agents, red-teaming measures failure rate (how often the agent responds unsafely).
CoSafe evaluates conversational agents on adversarial prompts designed to trick them into breaking safety rules—even advanced agents had vulnerabilities, such as falling for coreference-based attacks (ambiguous references to bypass filters).
Many deployments require agents to comply with specific regulatory or policy constraints—a finance chatbot must not disclose confidential information; a medical assistant must not deviate from established guidelines.
The HELM benchmark for enterprises includes domain-specific prompts and metrics for fields like finance and law. TheAgentCompany evaluates enterprise AI agents under structured correctness constraints, requiring them to follow predefined organizational policies.
Comprehensive mapping of objectives → categories → metrics → relevant papers.
| Objective ⇅ | Category ⇅ | Metrics | Relevant Papers |
|---|---|---|---|
| Agent Behavior | Task Completion | Success Rate (SR), F1-score, Pass@k, Progress Rate, Execution Accuracy, Transfer Learning Success, Zero-Shot Generalization Accuracy | AgentBoard, WebShop, AgentBench, SWE-bench, AppWorld, TheAgentCompany, MAGIC, Mobile-Env, Re-ReST, XMC-AGENT, SQuAD, ResearchArena, InformativeBench |
| Agent Behavior | Output Quality | Coherence, User Satisfaction, Usability, Likability, Overall Quality | PredictingIQ, EnDex, PsychoGAT |
| Agent Behavior | Latency & Cost | Latency, Token Usage, Cost | Cluster diagnosis, MobileBench, MobileAgentBench, LangSuitE, WebArena, Mobile-env, GUI Agents, GPTDroid, Spa-bench |
| Agent Capability | Tool Use | Task Completion Rate, Tool Selection Accuracy | ToolEmu, MetaTool, AutoCodeRover |
| Agent Capability | Planning & Reasoning | Reasoning Quality, Accuracy, Fine-Grained Progress Rate, Self Consistency, Plan Quality | AgentBoard, MMLU, LLM-Aug. Agents, SimuCourt, Magis |
| Agent Capability | Memory & Context | Factual Accuracy Recall, Consistency Scores | LongEval, SocialBench, LoCoMo, Optimus-1 |
| Agent Capability | Multi-Agent Collaboration | Info Sharing Effectiveness, Adaptive Role Switching, Reasoning Rating | AgentSims, WebArena, MATSA, GAMEBENCH, BALROG, TheAgentCompany |
| Reliability | Consistency | pass^k | τ-Bench |
| Reliability | Robustness | Accuracy, Task Success Rate Under Perturbation | HELM, WebLinX |
| Safety | Fairness | Awareness Coverage, Violation Rate, Transparency, Ethics, Morality | CASA, R-Judge, SimuCourt, MATSA, FinCon, AutoGuide |
| Safety | Harm | Adversarial Robustness, Prompt Injection Resistance, Harmfulness, Bias Detection | ASB, AgentPoison, AgentDojo, Backdoor Attacks, SafeAgentBench, Agent-Safety Bench, AgentHarm, Adaptive Attacks, RealToxicityPrompts |
| Safety | Compliance & Privacy | Risk Awareness, Task Completion Under Constraints | R-Judge, Cybench, TheAgentCompany |
The methodological dimension: how agents are assessed
Often performed as a baseline, offline evaluations rely on datasets and static test cases: collections of tasks, prompts, or conversations. Simulated conversations can help build these datasets, but the data remain fixed between runs.
Pros: Cheaper, simpler to run and maintain.
Cons: Cannot capture the full range of possible responses; more prone to error propagation; less representative of real-world system performance.
Online evaluation occurs after deployment. Instead of synthetic data, it leverages live simulations or real user interactions. This continuously updated data is crucial for identifying pain points not discovered during static testing.
Examples: Web simulators (MiniWoB, WebShop, WebArena) in which agent behavior (clicking links, filling forms) can be verified programmatically against correct action sequences.
EDD: Evaluation-driven Development makes evaluation integral to the agent development cycle—continuous offline and online evaluation to detect regressions.
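A minimal EDD-style regression gate might compare per-task scores against the last accepted evaluation run and flag meaningful drops; the score-dictionary format and tolerance below are assumptions for illustration.

```python
def detect_regressions(baseline: dict[str, float],
                       current: dict[str, float],
                       tolerance: float = 0.02) -> list[str]:
    """Flag tasks whose score dropped by more than `tolerance`
    versus the last accepted evaluation run."""
    return [task for task, score in current.items()
            if task in baseline and baseline[task] - score > tolerance]
```

Wired into CI, such a gate turns every code or prompt change into a checkpoint: a non-empty result blocks the change until the regression is understood.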
The growing interest has led to diverse datasets, benchmarks, and leaderboards specifically targeting agent capabilities:
AAAR-1.0, ScienceAgentBench, TaskBench — expert-labeled benchmarks for research reasoning, scientific workflows, multi-tool planning
FlowBench, ToolBench, API-Bank — tool use and function-calling across large API repositories with gold tool sequences and parameter structures
AssistantBench, AppWorld, WebArena — dynamic decision-making, long-horizon planning, user-agent interactions
AgentHarm (harmful behaviors), AgentDojo (prompt injection resilience)
Berkeley Function-Calling Leaderboard (BFCL), Holistic Agent Leaderboard (HAL) — standardized test cases, automated metrics, ranking
Human-annotated, synthetic, and interaction-generated data used in combination
Relies on explicit rules, test cases, or assertions. Effective for tasks with well-defined outputs (numerical calculations, structured queries, syntactic correctness).
✅ Consistent, reproducible
❌ Inflexible for open-ended responses
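The rule-based style amounts to a set of named assertions over a structured output. The "return valid JSON with a numeric `total`" task in this sketch is hypothetical, chosen only to show the pattern.

```python
import json

def rule_based_check(raw_output: str) -> dict:
    """Assertion-style checks for a task with a well-defined output:
    each rule yields a named pass/fail result."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        # Unparseable output fails every downstream rule.
        return {"valid_json": False, "has_total": False,
                "total_numeric": False}
    return {
        "valid_json": True,
        "has_total": "total" in data,
        "total_numeric": isinstance(data.get("total"), (int, float)),
    }
```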
Leverages LLM reasoning to evaluate responses on qualitative criteria. Extension: Agent-as-a-Judge where multiple AI agents interact to refine assessment.
✅ Scalable, handles nuance
❌ May inherit LLM biases
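A hedged sketch of the LLM-as-a-judge pattern: build a rubric prompt, delegate the model call to any `call_llm(prompt) -> str` function (the client is deliberately left abstract rather than tied to a real API), and parse a 1-5 score from the reply.

```python
def llm_judge_score(question: str, answer: str, criterion: str,
                    call_llm) -> int:
    """Score an answer on one criterion via an injected LLM callable.
    Returns 0 when the judge's reply contains no parseable score."""
    prompt = (
        f"Rate the answer on {criterion} from 1 (poor) to 5 (excellent).\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with a single integer."
    )
    reply = call_llm(prompt)
    digits = [ch for ch in reply if ch.isdigit()]
    if not digits:
        return 0  # unparseable judgment
    return min(max(int(digits[0]), 1), 5)  # clamp into the rubric range
```

In practice, the known failure modes (position bias, verbosity bias, self-preference) motivate averaging over several judge calls or cross-checking with a second judge model.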
User studies, expert audits, crowdworker annotations. Rated along dimensions: relevance, correctness, tone.
✅ Highest reliability
❌ Expensive, slow, hard to scale
Software frameworks and platforms that support automated, scalable, and continuous agent evaluation workflows, reflecting a movement toward Evaluation-driven Development (EDD).
Xia et al. propose an AgentOps architecture to continuously monitor deployed agents, closing the loop between development and deployment through real-time feedback and quality control.
The evaluation context pertains to the environment in which evaluation is performed. A tradeoff exists between more realistic (costly, potentially less secure) and controlled (less representative) environments.
As development continues, the evaluation context often evolves from smaller, mocked API environments to live deployment as agent performance and trustworthiness are established.
Requirements often overlooked in current research
As LLM-based agents transition from research demos to enterprise deployment, new challenges emerge. Enterprises demand high performance in conjunction with predictable reliability, compliance with regulations, data security, and maintainability.
A key challenge is accounting for Role-Based Access Control (RBAC), which governs users' permissions to access data and services. Users operate with varying levels of access depending on roles, and agents acting on their behalf must adhere to the same constraints. This means an agent's ability to retrieve or act on information is contextually bound to the user's permissions.
IntellAgent includes evaluation tasks requiring authentication of user identity and enforcing policies that deny access to other users' information. By embedding role-specific restrictions into task generation, these approaches more accurately model permission-sensitive enterprise contexts.
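One way to sketch an RBAC-style check is to scan the agent's tool-call trace for actions outside the acting user's role permissions; the trace and permission schemas below are illustrative, not IntellAgent's actual format.

```python
def rbac_violations(trace: list[dict],
                    permissions: dict[str, set]) -> list[dict]:
    """Return every tool call whose tool is not permitted for the
    role the agent was acting under."""
    violations = []
    for call in trace:
        allowed = permissions.get(call["role"], set())
        if call["tool"] not in allowed:
            violations.append(call)
    return violations

# A support role may read and reply to tickets, but not export data:
perms = {"support": {"read_ticket", "reply_ticket"}}
trace = [{"role": "support", "tool": "read_ticket"},
         {"role": "support", "tool": "export_customers"}]
# rbac_violations(trace, perms) flags only the export call.
```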
Especially important where agents must operate within compliance and auditing frameworks requiring deterministic or repeatable behavior that is explainable. Occasional success is insufficient—agents must perform reliably across time and usage scenarios.
Evaluating reliability is nontrivial: running multiple trials per input is computationally expensive, especially for complex tasks involving tools, memory, or multi-agent coordination. \(\tau\)-bench explicitly incorporates \(\text{pass}^k\) to evaluate consistency in retail and airline booking domains, showing that current agents struggle with consistency.
Unlike most benchmarks focusing on short episodes, real-world enterprise agents operate continuously over extended periods while interacting with users, systems, and data. Standard short-term evaluations cannot capture phenomena such as performance drift, context retention, or cumulative effect of decisions on business outcomes.
Park et al. observed generative agents in a continuously running simulated town environment to study emergent behaviors across multi-day interactions. Maharana et al. evaluated long-term conversational memory through 600-turn dialogues.
Enterprises enforce strict operational rules: approval workflows, data retention policies, usage quotas, and legal regulations like GDPR or HIPAA. Evaluating agents in such contexts requires more than measuring task success—it demands verification that behaviors align with formal policy constraints.
Without explicit modeling of these constraints during evaluation, agents deemed "correct" in traditional benchmarks may still fail in production due to policy violations or compliance risks.
Four key directions to advance the field
Current efforts focus on isolated dimensions. Future work should develop frameworks that assess agent performance across multiple, interdependent dimensions simultaneously.
Bridge the gap between lab and production. Create environments incorporating dynamic multi-user interactions, role-based access controls, and domain-specific knowledge—via real-world deployment trials or simulated enterprise workflows.
Manual evaluation is costly and hard to scale. Explore synthetic data generation for controllable test cases, simulated environments, and advancing LLM-as-a-judge / Agent-as-a-judge techniques.
Evaluation must be efficient to support iterative development. Develop protocols that strike a balance between depth and efficiency, especially for repeated trials and human-in-the-loop assessments.
Bottom Line: Future research should focus on developing evaluation methods that are holistic, realistic, scalable, and efficient. These directions are essential for building reliable and trustworthy LLM-based agents at scale.