Evaluation and Benchmarking of
LLM Agents: A Survey

Mahmoud Mohammadi · Yipeng Li · Jane Lo · Wendy Yip  |  SAP Labs

KDD '25 · Toronto, Canada · August 3–7, 2025

The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This survey provides an in-depth overview of the emerging field of LLM agent evaluation, introducing a two-dimensional taxonomy that organizes existing work along (1) evaluation objectives—what to evaluate, such as agent behavior, capabilities, reliability, and safety—and (2) evaluation process—how to evaluate, including interaction modes, datasets and benchmarks, metric computation methods, and tooling.

In addition to taxonomy, the paper highlights enterprise-specific challenges (role-based access, reliability guarantees, dynamic long-horizon interactions, compliance) and identifies future research directions including holistic, more realistic, and scalable evaluation.

1. Introduction

Why evaluating LLM agents is fundamentally harder than evaluating LLMs

LLM-based agents are autonomous or semi-autonomous systems that use LLMs to reason, plan, and act; they represent a rapidly growing frontier in artificial intelligence. From customer service bots and coding copilots to digital assistants, LLM agents are redefining how we build intelligent systems.

As these agents move from research prototypes to real-world applications, the question of how to rigorously evaluate them becomes both pressing and complex. Evaluating LLM agents is more complex than evaluating LLMs in isolation. Unlike LLMs—primarily assessed for text generation or question answering—LLM agents operate in dynamic, interactive environments. They reason and make plans, execute tools, leverage memory, and even collaborate with humans or other agents.

Key Analogy: LLM evaluation is like examining the performance of an engine in isolation; agent evaluation assesses the whole car, including how it performs under various driving conditions.

LLM agent evaluation also differs from traditional software evaluation. While software testing focuses on deterministic and static behavior, LLM agents are inherently probabilistic and behave dynamically. The evaluation of LLM agents is at the intersection of NLP, HCI, and software engineering, demanding additional perspectives.

• LLM Evaluation (the engine): static text generation, QA, fixed benchmarks
• Software Testing (the parts): deterministic, static, predefined behavior
• Agent Evaluation (the full car): dynamic, interactive, probabilistic, multi-modal

This survey's contributions are twofold: (1) a two-dimensional taxonomy that organizes agent evaluation by objectives (what to evaluate) and process (how to evaluate), and (2) an analysis of enterprise-specific challenges and future research directions.

2. Taxonomy for LLM-based Agent Evaluation

A two-dimensional framework: what to evaluate × how to evaluate

Evaluation Objectives

"What to evaluate"

Agent Behavior
Outcome oriented. Did the agent produce the right result, efficiently and affordably?
Task Completion · Output Quality · Latency & Cost
Agent Capabilities
Process oriented. Does the agent produce results in the right way, as designed?
Planning & Reasoning · Memory & Context · Tool Use · Multi-Agent
Reliability
Can the agent perform reliably across inputs and over time?
Robustness · Hallucinations · Error Handling
Safety and Alignment
Can the agent be trusted not to produce harmful or non-compliant results?
Fairness · Harm · Compliance & Privacy

Evaluation Process

"How to evaluate"

Interaction Mode
Methods of interacting with LLM agent systems.
Static & Offline · Dynamic & Online
Evaluation Data
Datasets, benchmarks, and synthetic data generation for evaluation.
Datasets · Benchmarks · Domain-Specific
Metrics Computation Methods
Methods to compute performance metrics.
Code-Based · Human-as-a-Judge · LLM-as-a-Judge
Evaluation Tooling
Frameworks and platforms to evaluate with.
Frameworks · Platforms · Leaderboards
Evaluation Contexts
What environments to evaluate in.
Sandboxes · Simulators · Live

This taxonomy serves both as a conceptual framework and a practical guide, enabling systematic comparison and analysis of LLM agents across a wide range of goals, methodologies, and deployment conditions. As LLM agents are deployed in increasingly diverse settings, factors such as single-turn vs. multi-turn interactions, multilingualism, and multimodality all become more important.

3. Evaluation Objectives

The four pillars of what to evaluate in LLM agents

3.1 Agent Behavior

Agent behavior refers to the overall performance of the agent as perceived by a user, treating the agent as a black box. It represents the highest-level view in evaluation and offers the most direct insight into the user experience.

3.1.1 Task Completion

Task completion is a fundamental objective, assessing whether an agent successfully achieves the predefined goals of a given task. It involves determining whether a desired state is reached or if specific criteria defined for task success are met. Although sometimes noted for providing limited fine-grained insight into failures, task completion remains a predominant and essential measure of overall agent performance.

Key metrics:

Success Rate (SR) · Task Success Rate · Overall Success Rate · Task Goal Completion (TGC) · Pass Rate · pass@k · pass^k

Metrics such as \(\text{pass@}k\) and \(\text{pass}^k\) extend binary task success by considering success over multiple trials. A binary reward function that returns 0 or 1 for goal achievement is also common.
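Given n recorded trials per task with c successes, both multi-trial metrics can be estimated as sketched below. These are the standard combinatorial estimators from the code-generation and agent-evaluation literature; the survey itself does not prescribe an implementation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: probability that at least one of k
    trials sampled (without replacement) from n recorded trials succeeds,
    given c of the n trials succeeded."""
    if n - c < k:  # fewer failures than draws: a success is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_power_k(n: int, c: int, k: int) -> float:
    """Estimate of pass^k: probability that all k sampled trials succeed
    (the stricter consistency metric)."""
    if c < k:  # not enough successes to fill all k draws
        return 0.0
    return comb(c, k) / comb(n, k)
```

Note that pass^k is never larger than pass@k: an agent that succeeds on 1 of 2 trials gets pass@1 = pass^1 = 0.5, but pass@2 = 1.0 while pass^2 = 0.0.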

Coding & Software Engineering: SWE-bench (GitHub issues), ScienceAgentBench (scientific data analysis), CORE-Bench (research reproduction), PaperBench (research replication), AppWorld (interactive coding)

Web Environments: BrowserGym, WebArena, WebCanvas (general web nav), VisualWebArena, MMInA (multimodal web), ASSISTANTBENCH (realistic time-consuming web tasks)

3.1.2 Output Quality

Output quality refers to the characteristics of responses by an LLM agent—an umbrella term encompassing accuracy, relevance, clarity, coherence, and adherence to agent specifications or task requirements. An agent may complete a task yet still deliver a subpar user experience if the interaction lacks these qualities.

Many metrics overlap with LLM evaluation: fluency measures natural language conventions, logical coherence focuses on rigor in arguments. Standard RAG metrics also apply, including Response Relevance and Factual Correctness.

3.1.3 Latency & Cost

Latency is critical in synchronous agent interactions. Long wait times degrade user experience and erode trust.

• Streaming: Time To First Token (TTFT), the delay before the user sees the first token of the LLM's response
• Async: End-to-End Request Latency, the time to receive the complete response

Cost measures monetary efficiency, typically estimated from the number of input and output tokens, which correlates with usage-based pricing in most LLM deployments.
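A back-of-the-envelope cost estimate under usage-based pricing might look like the sketch below; the per-1,000-token prices are placeholders, not real vendor rates.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate the dollar cost of one request under usage-based pricing,
    with separate rates for input (prompt) and output (completion) tokens."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Example with placeholder rates of $0.01 / $0.03 per 1k tokens:
cost = estimate_cost(1200, 300, 0.01, 0.03)
```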

3.2 Agent Capabilities

Beyond external behavior, evaluations often target specific capabilities that enable agent performance. Key aspects include tool use, planning and reasoning, memory and context retention, and multi-agent collaboration.

3.2.1 Tool Use

Tool use is a core capability, enabling agents to retrieve grounding information, perform actions, and interact with external environments. In this survey, tool use involves invocation of a single tool (interchangeable with function calling); more complex cases of determining tool sequences are covered under Planning & Reasoning.

The evaluation of tool use involves answering several key questions:

1. Is tool invocation necessary at all?
2. Can the agent select the right tool?
3. Can it identify the correct parameters?
4. Can it generate proper parameter values?
5. Can it retrieve the correct tool from a large toolset?

Metrics:

Metric | What it measures
Invocation Accuracy | Whether the agent correctly decides to call a tool at all
Tool Selection Accuracy | Whether the proper tool is chosen from the available options
Retrieval Accuracy | Correct tool retrieval from a larger toolset (accuracy at rank k)
Mean Reciprocal Rank (MRR) | Position of the correct tool in the ranked list
NDCG | How well the system ranks all relevant tools
Parameter Name F1 | Ability to identify correct parameter names
While some evaluations rely on abstract syntax tree (AST) correctness to verify syntactic validity, this approach may miss semantic errors—such as incorrect or hallucinated parameter values, especially for enumerated types. The Gorilla paper proposed execution-based evaluation, which runs tool calls and assesses outcomes, offering more comprehensive assessment.
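The ranking-oriented metrics above (Retrieval Accuracy at k, MRR) take only a few lines to compute. This is a generic sketch, not tied to any particular benchmark's implementation:

```python
def mrr(ranked_tools: list[list[str]], gold_tools: list[str]) -> float:
    """Mean Reciprocal Rank: average of 1/rank of the gold tool in each
    ranked candidate list, contributing 0 when the gold tool is absent."""
    total = 0.0
    for ranking, gold in zip(ranked_tools, gold_tools):
        if gold in ranking:
            total += 1.0 / (ranking.index(gold) + 1)
    return total / len(gold_tools)

def retrieval_accuracy_at_k(ranked_tools: list[list[str]],
                            gold_tools: list[str], k: int) -> float:
    """Fraction of queries whose gold tool appears in the top-k results."""
    hits = sum(gold in ranking[:k]
               for ranking, gold in zip(ranked_tools, gold_tools))
    return hits / len(gold_tools)
```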

3.2.2 Planning and Reasoning

Planning involves selecting the correct set of tools in an appropriate order. Reasoning enables agents to make context-aware decisions, either ahead of time or dynamically during task execution.

T-eval formulated planning evaluation as comparing the set of predicted tools against a reference. Since tool order and dependency matter, some benchmarks adopt graph-based representations and introduce metrics such as:

• Node F1 for tool selection
• Edge F1 for invocation sequences
• Normalized Edit Distance for structural accuracy
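A minimal sketch of Node F1 and Edge F1, assuming a plan is represented as a set of tool names plus (caller, callee) dependency edges — a simplification of the benchmarks' actual graph formats:

```python
def f1(predicted: set, reference: set) -> float:
    """Set-overlap F1 between predicted and reference elements."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

def plan_scores(pred_tools, pred_edges, ref_tools, ref_edges) -> dict:
    """Node F1 over tool sets; Edge F1 over dependency-edge sets."""
    return {"node_f1": f1(set(pred_tools), set(ref_tools)),
            "edge_f1": f1(set(pred_edges), set(ref_edges))}
```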

In dynamic environments, agents often need to interleave planning and execution—the ReAct paradigm where agents alternate between reasoning steps and tool usage. AgentBoard proposes Progress Rate, comparing the agent's actual trajectory against the expected one.

When agents generate complete multi-step programs, evaluation methods from code generation become relevant. ScienceAgentBench uses program similarity metrics. Step Success Rate measures the percentage of steps successfully executed.

3.2.3 Memory and Context Retention

A critical capability for long-running agents is retaining information throughout many interactions and applying previous context to current requests.

Guan et al. categorize memory evaluation by:

• Temporal (Memory Span): how long information is stored
• Representational (Memory Forms): how information is represented

Benchmarks: LongEval and SocialBench test context retention in long dialogues (40+ turns). Maharana et al. demonstrate evaluation with 600+ turn dialogues. Li et al. introduce memory-enhanced evaluation tracking consistency in long-horizon tasks.

Metrics: Factual Recall Accuracy · Consistency Score (no contradictions between turns)

Memory evaluation may also consider working memory for tool-using agents and forgetting strategies (whether the agent appropriately forgets irrelevant details to avoid confusion).
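Factual recall over a long dialogue can be sketched as below. The substring match is a deliberately naive stand-in; real evaluations typically use an LLM judge or entailment model to decide whether a fact was actually surfaced.

```python
def factual_recall(gold_facts: list[str], responses: list[str]) -> float:
    """Naive factual-recall accuracy: fraction of gold facts that appear
    verbatim (case-insensitively) in the agent's later responses."""
    transcript = " ".join(responses).lower()
    hits = sum(fact.lower() in transcript for fact in gold_facts)
    return hits / len(gold_facts) if gold_facts else 0.0
```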

3.2.4 Multi-Agent Collaboration

Evaluating multi-agent collaboration in LLM-based systems requires different methodologies compared to traditional reinforcement learning–driven coordination. Unlike conventional agents that rely on predefined reward structures, LLM agents coordinate through natural language, strategic reasoning, and decentralized problem-solving.

These capabilities are crucial in real-world applications such as financial decision-making and structured data analysis, where autonomous agents must exchange information, negotiate, and synchronize.

Key metric: Collaborative Efficiency — assessing how well multiple agents share responsibilities and distribute tasks dynamically.

3.3 Reliability

Reliability is crucial for enterprise and safety-critical applications. It encompasses consistency, robustness to variations, and trustworthiness of outputs. Unlike task performance (which might measure best-case capabilities), reliability evaluation probes worst-case and average-case scenarios.

3.3.1 Consistency

Stability of output when the same task is repeated multiple times. Since LLMs are inherently non-deterministic, LLM-based agents exhibit variability.

Metrics:

\(\text{pass@}k\) — probability of success at least once over \(k\) attempts

\(\text{pass}^k\) — whether the agent succeeds in all \(k\) attempts (stricter, from \(\tau\)-benchmark)

The \(\text{pass}^k\) metric better captures the consistency requirements of mission-critical deployments.
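Assuming \(n\) i.i.d. trials per task of which \(c\) succeed, the standard estimators (the unbiased pass@k form popularized in the code-generation literature, and the pass^k form used by the \(\tau\)-benchmark) are:

\[
\widehat{\text{pass@}k} \;=\; 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}},
\qquad
\widehat{\text{pass}^k} \;=\; \frac{\binom{c}{k}}{\binom{n}{k}},
\]

averaged over tasks. For a constant per-trial success probability \(p\), \(\text{pass}^k = p^k\), so consistency degrades geometrically in \(k\) unless \(p\) is near 1.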

3.3.2 Robustness

Stability of output under input variations or environmental changes, commonly stress-tested with perturbed inputs: paraphrased instructions, irrelevant context, typos, and dialects.

HELM benchmark tracks performance degradation under input variation.

Adaptive resilience: WebLinX examines behavior when web page structure changes during execution.

Error handling: ToolEmu evaluates whether agents respond to tool failures (API errors, null responses) gracefully—retry, switch tools, or explain.

3.4 Safety and Alignment

Safety covers adherence to ethical guidelines, avoidance of harmful behavior, and compliance with legal or policy constraints. As LLM agents become more powerful and autonomous, the risk of unintended adverse outcomes grows—disinformation, hate speech, unsafe instructions.

3.4.1 Fairness

The lack of fairness and transparency can result in biased outcomes, decreased trust, and unintended societal consequences. In financial applications, biased decision-making in loan approvals or investment strategies can reinforce systemic inequalities.

Explainability is crucial for enhancing user trust. Methods include guideline-driven decision-making (AutoGuide) and structured transparency mechanisms (MATSA, FinCon). R-Judge analyzes how agents perceive risk when making autonomous decisions.

3.4.2 Harm, Toxicity, and Bias

Evaluation uses specialized test sets: RealToxicityPrompts (prompts likely to elicit toxic content), checked with automated toxicity detectors. Metrics include percentage of responses containing toxic language and average toxicity score.
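Both metrics are straightforward to compute from detector scores; in the sketch below the scores and the 0.5 threshold are placeholders, since detectors and cutoffs vary by study.

```python
def toxicity_metrics(scores: list[float], threshold: float = 0.5) -> dict:
    """Given per-response toxicity scores in [0, 1] from an automated
    detector, compute the fraction of responses deemed toxic
    (score >= threshold) and the average toxicity score."""
    toxic_rate = sum(s >= threshold for s in scores) / len(scores)
    avg = sum(scores) / len(scores)
    return {"toxic_rate": toxic_rate, "avg_toxicity": avg}
```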

HELM includes toxicity and bias metrics as part of holistic evaluation. For interactive agents, red-teaming measures failure rate (how often the agent responds unsafely).

CoSafe evaluates conversational agents on adversarial prompts designed to trick them into breaking safety rules—even advanced agents had vulnerabilities, such as falling for coreference-based attacks (ambiguous references to bypass filters).

3.4.3 Compliance and Privacy

Many deployments require agents to comply with specific regulatory or policy constraints—a finance chatbot must not disclose confidential information; a medical assistant must not deviate from established guidelines.

The HELM benchmark for enterprises includes domain-specific prompts and metrics for fields like finance and law. TheAgentCompany evaluates enterprise AI agents under structured correctness constraints, requiring them to follow predefined organizational policies.

Table 1: Evaluation Objectives and Their Metrics

Comprehensive mapping of objectives → categories → metrics → relevant papers.

Objective | Category | Metrics | Relevant Papers
Agent Behavior | Task Completion | Success Rate (SR), F1-score, Pass@k, Progress Rate, Execution Accuracy, Transfer Learning Success, Zero-Shot Generalization Accuracy | AgentBoard, WebShop, AgentBench, SWE-bench, AppWorld, TheAgentCompany, MAGIC, Mobile-Env, Re-ReST, XMC-AGENT, SQuAD, ResearchArena, InformativeBench
Agent Behavior | Output Quality | Coherence, User Satisfaction, Usability, Likability, Overall Quality | PredictingIQ, EnDex, PsychoGAT
Agent Behavior | Latency & Cost | Latency, Token Usage, Cost | Cluster diagnosis, MobileBench, MobileAgentBench, LangSuitE, WebArena, Mobile-env, GUI Agents, GPTDroid, Spa-bench
Agent Capability | Tool Use | Task Completion Rate, Tool Selection Accuracy | ToolEmu, MetaTool, AutoCodeRover
Agent Capability | Planning & Reasoning | Reasoning Quality, Accuracy, Fine-Grained Progress Rate, Self Consistency, Plan Quality | AgentBoard, MMLU, LLM-Aug. Agents, SimuCourt, Magis
Agent Capability | Memory & Context | Factual Accuracy Recall, Consistency Scores | LongEval, SocialBench, LoCoMo, Optimus-1
Agent Capability | Multi-Agent Collaboration | Info Sharing Effectiveness, Adaptive Role Switching, Reasoning Rating | AgentSims, WebArena, MATSA, GAMEBENCH, BALROG, TheAgentCompany
Reliability | Consistency | pass^k | τ-Bench
Reliability | Robustness | Accuracy, Task Success Rate Under Perturbation | HELM, WebLinX
Safety | Fairness | Awareness Coverage, Violation Rate, Transparency, Ethics, Morality | CASA, R-Judge, SimuCourt, MATSA, FinCon, AutoGuide
Safety | Harm | Adversarial Robustness, Prompt Injection Resistance, Harmfulness, Bias Detection | ASB, AgentPoison, AgentDojo, Backdoor Attacks, SafeAgentBench, Agent-Safety Bench, AgentHarm, Adaptive Attacks, RealToxicityPrompts
Safety | Compliance & Privacy | Risk Awareness, Task Completion Under Constraints | R-Judge, Cybench, TheAgentCompany

4. Evaluation Process

The methodological dimension: how agents are assessed

4.1 Interaction Mode

4.1.1 Static & Offline

Often performed as a baseline, offline evaluations rely on datasets and static test cases: collections of tasks, prompts, or conversations. Simulated conversations may help build such data, but the data remain inert between runs.

Pros: Cheaper, simpler to run and maintain.

Cons: Lack nuance for the wide range of responses; more prone to error propagation; less accurate representations of system performance.

4.1.2 Dynamic & Online

Online evaluation occurs after deployment. Instead of static test sets, it leverages simulations or real user interactions. This live data is crucial for identifying pain points not discovered during static testing.

Examples: Web simulators (MiniWoB, WebShop, WebArena) where agent behavior (clicking links, filling forms) can be programmed to verify correct sequences.
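Verifying that an agent issued the right sequence of environment actions can be as simple as the sketch below; the action strings and the prefix-credit variant of step success are illustrative, not any specific simulator's definition.

```python
def sequence_correct(predicted: list[str], expected: list[str]) -> bool:
    """Strict check: the agent's action trajectory exactly matches gold."""
    return predicted == expected

def step_success_rate(predicted: list[str], expected: list[str]) -> float:
    """Partial credit: fraction of gold steps matched in order before the
    first divergence (looser than exact sequence equality)."""
    matched = 0
    for p, e in zip(predicted, expected):
        if p != e:
            break
        matched += 1
    return matched / len(expected) if expected else 1.0
```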

EDD: Evaluation-driven Development makes evaluation integral to the agent development cycle—continuous offline and online evaluation to detect regressions.

4.2 Evaluation Data

The growing interest has led to diverse datasets, benchmarks, and leaderboards specifically targeting agent capabilities:

Structured Benchmarks

AAAR-1.0, ScienceAgentBench, TaskBench — expert-labeled benchmarks for research reasoning, scientific workflows, multi-tool planning

Tool Use & Function Calling

FlowBench, ToolBench, API-Bank — tool use and function-calling across large API repositories with gold tool sequences and parameter structures

Interactive & Open-ended

AssistantBench, AppWorld, WebArena — dynamic decision-making, long-horizon planning, user-agent interactions

Safety & Robustness

AgentHarm (harmful behaviors), AgentDojo (prompt injection resilience)

Leaderboards

Berkeley Function-Calling Leaderboard (BFCL), Holistic Agent Leaderboard (HAL) — standardized test cases, automated metrics, ranking

Data Types

Human-annotated, synthetic, and interaction-generated data used in combination

4.3 Metrics Computation Methods

Deterministic

Code-Based

Relies on explicit rules, test cases, or assertions. Effective for tasks with well-defined outputs (numerical calculations, structured queries, syntactic correctness).

✅ Consistent, reproducible
❌ Inflexible for open-ended responses
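As a concrete illustration of a code-based check, the following hypothetical evaluator validates an agent's structured output with plain assertions; the task (return JSON with a numeric "total") and the gold value 42 are invented for the example.

```python
import json

def evaluate_json_output(agent_output: str) -> dict:
    """Code-based evaluation of a task with a well-defined output:
    the agent must return valid JSON whose numeric 'total' equals the
    gold answer for this (hypothetical) test case."""
    try:
        payload = json.loads(agent_output)
        checks = {
            "valid_json": True,
            "has_total": isinstance(payload.get("total"), (int, float)),
            "correct_value": payload.get("total") == 42,
        }
    except json.JSONDecodeError:
        checks = {"valid_json": False, "has_total": False,
                  "correct_value": False}
    checks["passed"] = all(checks.values())
    return checks
```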

AI-Powered

LLM-as-a-Judge

Leverages LLM reasoning to evaluate responses on qualitative criteria. Extension: Agent-as-a-Judge where multiple AI agents interact to refine assessment.

✅ Scalable, handles nuance
❌ May inherit LLM biases

Gold Standard

Human-in-the-Loop

User studies, expert audits, crowdworker annotations. Rated along dimensions: relevance, correctness, tone.

✅ Highest reliability
❌ Expensive, slow, hard to scale

4.4 Evaluation Tooling

Software frameworks and platforms that support automated, scalable, and continuous agent evaluation workflows, reflecting a movement toward Evaluation-driven Development (EDD).

Open-Source Frameworks
  • OpenAI Evals — specify evaluation tasks and metrics
  • DeepEval — rich analytics and debugging
  • InspectAI — UK AI Safety Institute framework
  • Phoenix (Arize AI) — evaluation orchestration
  • GALILEO — agentic evaluations
Development Platforms
  • Azure AI Foundry — evaluation features built-in
  • Google Vertex AI — monitor and detect regressions
  • LangGraph — agent development with eval
  • Amazon Bedrock — adapt agents to evolving needs

Xia et al. propose an AgentOps architecture to continuously monitor deployed agents, closing the loop between development and deployment through real-time feedback and quality control.

4.5 Evaluation Contexts

The evaluation context pertains to the environment in which evaluation is performed. A tradeoff exists between more realistic (costly, potentially less secure) and controlled (less representative) environments.

1. Mocked APIs: simple, controlled
2. Sandbox: isolated environment
3. Web Simulators: MiniWoB, WebArena
4. Live Deployment: production environment

As development continues, the evaluation context often evolves from smaller, mocked API environments to live deployment as agent performance and trustworthiness are established.

5. Enterprise-Specific Challenges

Requirements often overlooked in current research

As LLM-based agents transition from research demos to enterprise deployment, new challenges emerge. Enterprises demand high performance in conjunction with predictable reliability, compliance with regulations, data security, and maintainability.

5.1 Complexity from Role-Based Access

A key challenge is accounting for Role-Based Access Control (RBAC), which governs users' permissions to access data and services. Users operate with varying levels of access depending on roles, and agents acting on their behalf must adhere to the same constraints. This means an agent's ability to retrieve or act on information is contextually bound to the user's permissions.

IntellAgent includes evaluation tasks requiring authentication of user identity and enforcing policies that deny access to other users' information. By embedding role-specific restrictions into task generation, these approaches more accurately model permission-sensitive enterprise contexts.
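One way an evaluation harness can embed RBAC is to gate every tool call on the invoking user's role before execution; the roles, tools, and permission mapping below are hypothetical.

```python
# Assumed role → allowed-tool mapping; a real deployment would load this
# from the enterprise's access-control system.
ROLE_PERMISSIONS = {
    "analyst": {"read_report"},
    "admin": {"read_report", "delete_report"},
}

def call_tool(role: str, tool: str, tools: dict):
    """Execute a tool only if the user's role grants access; otherwise
    return a structured denial the evaluator can check for."""
    if tool not in ROLE_PERMISSIONS.get(role, set()):
        return {"status": "denied",
                "reason": f"role '{role}' may not call '{tool}'"}
    return {"status": "ok", "result": tools[tool]()}
```

An evaluation task can then assert both that permitted calls succeed and that out-of-role calls are denied rather than silently executed.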

5.2 Reliability Guarantees

Especially important where agents must operate within compliance and auditing frameworks requiring deterministic or repeatable behavior that is explainable. Occasional success is insufficient—agents must perform reliably across time and usage scenarios.

Evaluating reliability is nontrivial: running multiple trials per input is computationally expensive, especially for complex tasks involving tools, memory, or multi-agent coordination. The \(\tau\)-benchmark explicitly incorporates \(\text{pass}^k\) to evaluate consistency in retail and airline booking domains—showing that current agents struggle with consistency.

5.3 Dynamic and Long-Horizon Interactions

Unlike most benchmarks focusing on short episodes, real-world enterprise agents operate continuously over extended periods while interacting with users, systems, and data. Standard short-term evaluations cannot capture phenomena such as performance drift, context retention, or cumulative effect of decisions on business outcomes.

Park et al. observed generative agents in a continuously running simulated town environment to study emergent behaviors across multi-day interactions. Maharana et al. evaluated long-term conversational memory through 600-turn dialogues.

5.4 Adherence to Domain-Specific Policies and Compliance Requirements

Enterprises enforce strict operational rules: approval workflows, data retention policies, usage quotas, and legal regulations like GDPR or HIPAA. Evaluating agents in such contexts requires more than measuring task success—it demands verification that behaviors align with formal policy constraints.

Without explicit modeling of these constraints during evaluation, agents deemed "correct" in traditional benchmarks may still fail in production due to policy violations or compliance risks.

6. Future Research Directions

Four key directions to advance the field

🎯 Holistic Evaluation Frameworks

Current efforts focus on isolated dimensions. Future work should develop frameworks that assess agent performance across multiple, interdependent dimensions simultaneously.

🌍 More Realistic Evaluation Settings

Bridge the gap between lab and production. Create environments incorporating dynamic multi-user interactions, role-based access controls, and domain-specific knowledge—via real-world deployment trials or simulated enterprise workflows.

⚡ Automated & Scalable Evaluation

Manual evaluation is costly and hard to scale. Explore synthetic data generation for controllable test cases, simulated environments, and advancing LLM-as-a-judge / Agent-as-a-judge techniques.

⏱️ Time- and Cost-Bounded Protocols

Evaluation must be efficient to support iterative development. Develop protocols that strike a balance between depth and efficiency, especially for repeated trials and human-in-the-loop assessments.

Bottom Line: Future research should focus on developing evaluation methods that are holistic, realistic, scalable, and efficient. These directions are essential for building reliable and trustworthy LLM-based agents at scale.
