The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This survey provides an in-depth overview of the emerging field of LLM agent evaluation, introducing a two-dimensional taxonomy that organizes existing work along (1) evaluation objectives—what to evaluate, such as agent behavior, capabilities, reliability, and safety—and (2) evaluation process—how to evaluate, including interaction modes, datasets and benchmarks, metric computation methods, and tooling.
Beyond the taxonomy, the paper highlights enterprise-specific challenges (role-based access, reliability guarantees, dynamic long-horizon interactions, compliance) and identifies future research directions, including holistic, more realistic, scalable, and efficient evaluation.
Why evaluating LLM agents is fundamentally harder than evaluating LLMs
Agents based on LLMs are autonomous or semi-autonomous systems that use LLMs to reason, plan, and act, and represent a rapidly growing frontier in artificial intelligence. From customer service bots and coding copilots to digital assistants, LLM agents are redefining how we build intelligent systems.
As these agents move from research prototypes to real-world applications, the question of how to rigorously evaluate them becomes both pressing and complex. Evaluating LLM agents is more complex than evaluating LLMs in isolation. Unlike LLMs—primarily assessed for text generation or question answering—LLM agents operate in dynamic, interactive environments. They reason and make plans, execute tools, leverage memory, and even collaborate with humans or other agents.
Key Analogy: Evaluating an LLM is like testing an engine on its own. Evaluating an agent is like assessing the whole car: its overall performance, and its behavior under various driving conditions.
LLM agent evaluation also differs from traditional software evaluation. While software testing focuses on deterministic and static behavior, LLM agents are inherently probabilistic and behave dynamically. The evaluation of LLM agents is at the intersection of NLP, HCI, and software engineering, demanding additional perspectives.
| Paradigm | Nature of the evaluation target |
|---|---|
| LLM evaluation | Static text generation, QA, fixed benchmarks |
| Traditional software testing | Deterministic, static, predefined behavior |
| LLM agent evaluation | Dynamic, interactive, probabilistic, multi-modal |
This survey's contributions are twofold:
A two-dimensional framework organizing the field along two axes:
- "What to evaluate": evaluation objectives (behavior, capabilities, reliability, safety)
- "How to evaluate": evaluation process (interaction modes, datasets and benchmarks, metric computation, tooling)
This taxonomy serves both as a conceptual framework and a practical guide, enabling systematic comparison and analysis of LLM agents across a wide range of goals, methodologies, and deployment conditions. As LLM agents are deployed in increasingly diverse settings, factors such as single-turn vs. multi-turn interactions, multilingualism, and multimodality all become more important.
The four pillars of what to evaluate in LLM agents
Agent behavior refers to the overall performance of the agent as perceived by a user, treating the agent as a black box. It represents the highest-level view in evaluation and offers the most direct insight into the user experience.
Task completion is a fundamental objective, assessing whether an agent successfully achieves the predefined goals of a given task. It involves determining whether a desired state is reached or if specific criteria defined for task success are met. Although sometimes noted for providing limited fine-grained insight into failures, task completion remains a predominant and essential measure of overall agent performance.
Key metrics:
Metrics such as \(\text{pass@}k\) and \(\text{pass}^k\) extend binary task success by considering success over multiple trials. A binary reward function that returns 0 or 1 for goal achievement is also common.
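As an illustration, both estimators can be computed in a few lines. The sketch below assumes each task was run \(n\) times with \(c\) successes: \(\text{pass@}k\) uses the standard unbiased combinatorial form, while \(\text{pass}^k\) is a simple plug-in estimate from the empirical success rate.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: probability that at least one of
    k samples drawn from n recorded trials (c successful) succeeds."""
    if n - c < k:
        return 1.0  # too few failures to draw k all-failing samples
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Plug-in estimate of pass^k: probability that all k independent
    attempts succeed, given an empirical success rate of c/n."""
    return (c / n) ** k
```

With 6 successes out of 10 trials, `pass_at_k(10, 6, 2)` rewards succeeding at least once across two attempts, while `pass_hat_k(10, 6, 2)` demands that both attempts succeed, which is why the latter is the stricter consistency measure.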
Coding & Software Engineering: SWE-bench (GitHub issues), ScienceAgentBench (scientific data analysis), CORE-Bench (research reproduction), PaperBench (research replication), AppWorld (interactive coding)
Web Environments: BrowserGym, WebArena, WebCanvas (general web navigation), VisualWebArena, MMInA (multimodal web), AssistantBench (realistic, time-consuming web tasks)
Output quality refers to the characteristics of responses by an LLM agent—an umbrella term encompassing accuracy, relevance, clarity, coherence, and adherence to agent specifications or task requirements. An agent may complete a task yet still deliver a subpar user experience if the interaction lacks these qualities.
Many metrics overlap with LLM evaluation: fluency measures adherence to natural-language conventions, while logical coherence captures the rigor of arguments. Standard RAG metrics also apply, including Response Relevance and Factual Correctness.
Latency is critical in synchronous agent interactions. Long wait times degrade user experience and erode trust.
- Time-to-first-token (TTFT): delay before the user sees the first token of the LLM's response
- Total response time: time to receive the complete response
Cost measures monetary efficiency, typically estimated from input and output token counts, which map directly to usage-based pricing in most LLM deployments.
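Both latency and cost can be instrumented with plain Python; in the sketch below, the token stream and the per-1K-token prices are illustrative stand-ins, not any vendor's real API or rates.

```python
import time

def measure_latency(token_stream):
    """Measure time-to-first-token (TTFT) and total response time for
    any iterable that yields response tokens."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        tokens.append(tok)
    total = time.perf_counter() - start
    return ttft, total, "".join(tokens)

def estimate_cost(input_tokens: int, output_tokens: int,
                  price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Estimate a request's monetary cost from token counts.
    Prices are illustrative parameters, not real vendor rates."""
    return (input_tokens / 1000) * price_in_per_1k \
         + (output_tokens / 1000) * price_out_per_1k

# Hypothetical rates: $0.01 / 1K input tokens, $0.03 / 1K output tokens.
cost = estimate_cost(2000, 500, 0.01, 0.03)  # 0.02 + 0.015 = 0.035
```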
Beyond external behavior, evaluations often target specific capabilities that enable agent performance. Key aspects include tool use, planning and reasoning, memory and context retention, and multi-agent collaboration.
Tool use is a core capability, enabling agents to retrieve grounding information, perform actions, and interact with external environments. In this survey, tool use involves invocation of a single tool (interchangeable with function calling); more complex cases of determining tool sequences are covered under Planning & Reasoning.
The evaluation of tool use involves answering several key questions:
Metrics:
| Metric | What it measures |
|---|---|
| Invocation Accuracy | Whether the agent correctly decides to call a tool at all |
| Tool Selection Accuracy | Whether the proper tool is chosen from options |
| Retrieval Accuracy | Correct tool retrieval from a larger toolset (rank accuracy@k) |
| Mean Reciprocal Rank (MRR) | Position of correct tool in ranked list |
| NDCG | How well the system ranks all relevant tools |
| Parameter Name F1 | Ability to identify correct parameter names |
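The retrieval-oriented metrics in the table above can be sketched directly; the code assumes each query produces a ranked list of tool names and has a single gold tool, which is a simplification of multi-tool settings.

```python
def retrieval_accuracy_at_k(ranked: list[str], gold: str, k: int) -> float:
    """1.0 if the gold tool appears among the top-k retrieved tools."""
    return 1.0 if gold in ranked[:k] else 0.0

def mean_reciprocal_rank(rankings: list[list[str]],
                         golds: list[str]) -> float:
    """Average of 1/rank of the gold tool across queries
    (contributes 0 when the gold tool is absent from the ranking)."""
    total = 0.0
    for ranked, gold in zip(rankings, golds):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(rankings)
```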
Planning involves selecting the correct set of tools in an appropriate order. Reasoning enables agents to make context-aware decisions, either ahead of time or dynamically during task execution.
T-Eval formulates planning evaluation as comparing the set of predicted tools against a reference. Since tool order and dependencies matter, some benchmarks instead adopt graph-based representations of the plan and score predictions by how closely their nodes and edges match the reference graph.
In dynamic environments, agents often need to interleave planning and execution—the ReAct paradigm where agents alternate between reasoning steps and tool usage. AgentBoard proposes Progress Rate, comparing the agent's actual trajectory against the expected one.
When agents generate complete multi-step programs, evaluation methods from code generation become relevant. ScienceAgentBench uses program similarity metrics. Step Success Rate measures the percentage of steps successfully executed.
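As a rough sketch, Step Success Rate and a simplified progress check can be computed from an execution trace. The trace format below is an assumption for illustration, and the in-order prefix matching is a simplified reading of AgentBoard-style progress, not its actual implementation.

```python
def step_success_rate(executed: list[tuple[str, bool]]) -> float:
    """Fraction of trajectory steps that executed successfully.
    `executed` is a list of (step_name, succeeded) pairs."""
    if not executed:
        return 0.0
    return sum(ok for _, ok in executed) / len(executed)

def progress_rate(predicted: list[str], reference: list[str]) -> float:
    """Fraction of reference steps matched, in order, by the predicted
    trajectory (a simplified progress-rate stand-in)."""
    i = 0
    for step in predicted:
        if i < len(reference) and step == reference[i]:
            i += 1
    return i / len(reference) if reference else 0.0
```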
A critical capability for long-running agents is retaining information throughout many interactions and applying previous context to current requests.
Guan et al. categorize memory evaluation along two axes:
- Duration: how long information is stored
- Representation: how information is represented
Benchmarks: LongEval and SocialBench test context retention in long dialogues (40+ turns). Maharana et al. demonstrate evaluation with 600+ turn dialogues. Li et al. introduce memory-enhanced evaluation tracking consistency in long-horizon tasks.
Metrics: Factual Recall Accuracy; Consistency Score (no contradictions between turns)
Memory evaluation may also consider working memory for tool-using agents and forgetting strategies (whether the agent appropriately forgets irrelevant details to avoid confusion).
Evaluating multi-agent collaboration in LLM-based systems requires different methodologies compared to traditional reinforcement learning–driven coordination. Unlike conventional agents that rely on predefined reward structures, LLM agents coordinate through natural language, strategic reasoning, and decentralized problem-solving.
These capabilities are crucial in real-world applications such as financial decision-making and structured data analysis, where autonomous agents must exchange information, negotiate, and synchronize.
Key metric: Collaborative Efficiency — assessing how well multiple agents share responsibilities and distribute tasks dynamically.
Reliability is crucial for enterprise and safety-critical applications. It encompasses consistency, robustness to variations, and trustworthiness of outputs. Unlike task performance (which might measure best-case capabilities), reliability evaluation probes worst-case and average-case scenarios.
Stability of output when the same task is repeated multiple times. Because LLMs are inherently non-deterministic, agents built on them exhibit run-to-run variability.
Metrics:
\(\text{pass@}k\) — probability of success at least once over \(k\) attempts
\(\text{pass}^k\) — whether the agent succeeds in all \(k\) attempts (stricter, from \(\tau\)-bench)
The \(\text{pass}^k\) metric better captures the consistency requirements of mission-critical deployments.
Stability of output under input variations or environmental changes. Stress-testing with perturbed inputs: paraphrased instructions, irrelevant context, typos, dialects.
HELM benchmark tracks performance degradation under input variation.
Adaptive resilience: WebLinX examines behavior when web page structure changes during execution.
Error handling: ToolEmu evaluates whether agents respond to tool failures (API errors, null responses) gracefully—retry, switch tools, or explain.
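Perturbation-based stress testing can be sketched in a few lines: inject typos into task instructions and compare success rates before and after. The `agent` callable (mapping an instruction to True/False success) and the character-swap perturbation are illustrative assumptions, not any benchmark's actual harness.

```python
import random

def typo_perturb(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Inject character-swap typos into an instruction at a given rate.
    One of many perturbation styles (paraphrase, dialect, noise)."""
    rng = random.Random(seed)  # seeded for reproducible perturbations
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() \
                and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_gap(agent, tasks, perturb) -> float:
    """Drop in success rate when each task instruction is perturbed."""
    clean = sum(agent(t) for t in tasks) / len(tasks)
    noisy = sum(agent(perturb(t)) for t in tasks) / len(tasks)
    return clean - noisy
```

A large gap signals brittleness: the agent's competence depends on surface wording rather than task understanding.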
Safety covers adherence to ethical guidelines, avoidance of harmful behavior, and compliance with legal or policy constraints. As LLM agents become more powerful and autonomous, the risk of unintended adverse outcomes grows—disinformation, hate speech, unsafe instructions.
The lack of fairness and transparency can result in biased outcomes, decreased trust, and unintended societal consequences. In financial applications, biased decision-making in loan approvals or investment strategies can reinforce systemic inequalities.
Explainability is crucial for enhancing user trust. Methods include guideline-driven decision-making (AutoGuide) and structured transparency mechanisms (MATSA, FinCon). R-Judge analyzes how agents perceive risk when making autonomous decisions.
Evaluation uses specialized test sets: RealToxicityPrompts (prompts likely to elicit toxic content), checked with automated toxicity detectors. Metrics include percentage of responses containing toxic language and average toxicity score.
HELM includes toxicity and bias metrics as part of holistic evaluation. For interactive agents, red-teaming measures failure rate (how often the agent responds unsafely).
CoSafe evaluates conversational agents on adversarial prompts designed to trick them into breaking safety rules—even advanced agents had vulnerabilities, such as falling for coreference-based attacks (ambiguous references to bypass filters).
Many deployments require agents to comply with specific regulatory or policy constraints—a finance chatbot must not disclose confidential information; a medical assistant must not deviate from established guidelines.
The HELM benchmark for enterprises includes domain-specific prompts and metrics for fields like finance and law. TheAgentCompany evaluates enterprise AI agents under structured correctness constraints, requiring them to follow predefined organizational policies.
Comprehensive mapping of objectives → categories → metrics → relevant papers.
| Objective ⇅ | Category ⇅ | Metrics | Relevant Papers |
|---|---|---|---|
| Agent Behavior | Task Completion | Success Rate (SR), F1-score, Pass@k, Progress Rate, Execution Accuracy, Transfer Learning Success, Zero-Shot Generalization Accuracy | AgentBoard, WebShop, AgentBench, SWE-bench, AppWorld, TheAgentCompany, MAGIC, Mobile-Env, Re-ReST, XMC-AGENT, SQuAD, ResearchArena, InformativeBench |
| Agent Behavior | Output Quality | Coherence, User Satisfaction, Usability, Likability, Overall Quality | PredictingIQ, EnDex, PsychoGAT |
| Agent Behavior | Latency & Cost | Latency, Token Usage, Cost | Cluster diagnosis, MobileBench, MobileAgentBench, LangSuitE, WebArena, Mobile-env, GUI Agents, GPTDroid, Spa-bench |
| Agent Capability | Tool Use | Task Completion Rate, Tool Selection Accuracy | ToolEmu, MetaTool, AutoCodeRover |
| Agent Capability | Planning & Reasoning | Reasoning Quality, Accuracy, Fine-Grained Progress Rate, Self Consistency, Plan Quality | AgentBoard, MMLU, LLM-Aug. Agents, SimuCourt, Magis |
| Agent Capability | Memory & Context | Factual Accuracy Recall, Consistency Scores | LongEval, SocialBench, LoCoMo, Optimus-1 |
| Agent Capability | Multi-Agent Collaboration | Info Sharing Effectiveness, Adaptive Role Switching, Reasoning Rating | AgentSims, WebArena, MATSA, GAMEBENCH, BALROG, TheAgentCompany |
| Reliability | Consistency | pass^k | τ-Bench |
| Reliability | Robustness | Accuracy, Task Success Rate Under Perturbation | HELM, WebLinX |
| Safety | Fairness | Awareness Coverage, Violation Rate, Transparency, Ethics, Morality | CASA, R-Judge, SimuCourt, MATSA, FinCon, AutoGuide |
| Safety | Harm | Adversarial Robustness, Prompt Injection Resistance, Harmfulness, Bias Detection | ASB, AgentPoison, AgentDojo, Backdoor Attacks, SafeAgentBench, Agent-Safety Bench, AgentHarm, Adaptive Attacks, RealToxicityPrompts |
| Safety | Compliance & Privacy | Risk Awareness, Task Completion Under Constraints | R-Judge, Cybench, TheAgentCompany |
The methodological dimension: how agents are assessed
Often performed as a baseline, offline evaluations rely on datasets and static test cases: collections of tasks, prompts, or conversations. Simulated conversations can help build these datasets, but the data remain fixed between runs.
Pros: Cheaper, simpler to run and maintain.
Cons: Cannot capture the full range of possible responses; more prone to error propagation; less representative of real-world system performance.
Online evaluation occurs after deployment. Instead of synthetic data, it leverages live simulations or real user interactions. This continuously updated data is crucial for identifying pain points not discovered during static testing.
Examples: Web simulators (MiniWoB, WebShop, WebArena) in which agent behavior (clicking links, filling forms) can be verified programmatically against correct action sequences.
EDD: Evaluation-driven Development makes evaluation integral to the agent development cycle—continuous offline and online evaluation to detect regressions.
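A minimal EDD-style regression gate might compare per-task scores against the last accepted evaluation run and flag meaningful drops; the score-dictionary format and tolerance below are assumptions for illustration.

```python
def detect_regressions(baseline: dict[str, float],
                       current: dict[str, float],
                       tolerance: float = 0.02) -> list[str]:
    """Flag tasks whose score dropped by more than `tolerance`
    versus the last accepted evaluation run."""
    return [task for task, score in current.items()
            if task in baseline and baseline[task] - score > tolerance]
```

Wired into CI, such a gate turns every code or prompt change into a checkpoint: a non-empty result blocks the change until the regression is understood.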
The growing interest has led to diverse datasets, benchmarks, and leaderboards specifically targeting agent capabilities:
AAAR-1.0, ScienceAgentBench, TaskBench — expert-labeled benchmarks for research reasoning, scientific workflows, multi-tool planning
FlowBench, ToolBench, API-Bank — tool use and function-calling across large API repositories with gold tool sequences and parameter structures
AssistantBench, AppWorld, WebArena — dynamic decision-making, long-horizon planning, user-agent interactions
AgentHarm (harmful behaviors), AgentDojo (prompt injection resilience)
Berkeley Function-Calling Leaderboard (BFCL), Holistic Agent Leaderboard (HAL) — standardized test cases, automated metrics, ranking
Human-annotated, synthetic, and interaction-generated data used in combination
Relies on explicit rules, test cases, or assertions. Effective for tasks with well-defined outputs (numerical calculations, structured queries, syntactic correctness).
✅ Consistent, reproducible
❌ Inflexible for open-ended responses
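The rule-based style amounts to a set of named assertions over a structured output. The "return valid JSON with a numeric `total`" task in this sketch is hypothetical, chosen only to show the pattern.

```python
import json

def rule_based_check(raw_output: str) -> dict:
    """Assertion-style checks for a task with a well-defined output:
    each rule yields a named pass/fail result."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        # Unparseable output fails every downstream rule.
        return {"valid_json": False, "has_total": False,
                "total_numeric": False}
    return {
        "valid_json": True,
        "has_total": "total" in data,
        "total_numeric": isinstance(data.get("total"), (int, float)),
    }
```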
Leverages LLM reasoning to evaluate responses on qualitative criteria. Extension: Agent-as-a-Judge where multiple AI agents interact to refine assessment.
✅ Scalable, handles nuance
❌ May inherit LLM biases
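A hedged sketch of the LLM-as-a-judge pattern: build a rubric prompt, delegate the model call to any `call_llm(prompt) -> str` function (the client is deliberately left abstract rather than tied to a real API), and parse a 1-5 score from the reply.

```python
def llm_judge_score(question: str, answer: str, criterion: str,
                    call_llm) -> int:
    """Score an answer on one criterion via an injected LLM callable.
    Returns 0 when the judge's reply contains no parseable score."""
    prompt = (
        f"Rate the answer on {criterion} from 1 (poor) to 5 (excellent).\n"
        f"Question: {question}\nAnswer: {answer}\n"
        "Reply with a single integer."
    )
    reply = call_llm(prompt)
    digits = [ch for ch in reply if ch.isdigit()]
    if not digits:
        return 0  # unparseable judgment
    return min(max(int(digits[0]), 1), 5)  # clamp into the rubric range
```

In practice, the known failure modes (position bias, verbosity bias, self-preference) motivate averaging over several judge calls or cross-checking with a second judge model.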
User studies, expert audits, crowdworker annotations. Rated along dimensions: relevance, correctness, tone.
✅ Highest reliability
❌ Expensive, slow, hard to scale
Software frameworks and platforms that support automated, scalable, and continuous agent evaluation workflows, reflecting a movement toward Evaluation-driven Development (EDD).
Xia et al. propose an AgentOps architecture to continuously monitor deployed agents, closing the loop between development and deployment through real-time feedback and quality control.
The evaluation context pertains to the environment in which evaluation is performed. A tradeoff exists between more realistic (costly, potentially less secure) and controlled (less representative) environments.
As development continues, the evaluation context often evolves from smaller, mocked API environments to live deployment as agent performance and trustworthiness are established.
Requirements often overlooked in current research
As LLM-based agents transition from research demos to enterprise deployment, new challenges emerge. Enterprises demand high performance in conjunction with predictable reliability, compliance with regulations, data security, and maintainability.
A key challenge is accounting for Role-Based Access Control (RBAC), which governs users' permissions to access data and services. Users operate with varying levels of access depending on roles, and agents acting on their behalf must adhere to the same constraints. This means an agent's ability to retrieve or act on information is contextually bound to the user's permissions.
IntellAgent includes evaluation tasks requiring authentication of user identity and enforcing policies that deny access to other users' information. By embedding role-specific restrictions into task generation, these approaches more accurately model permission-sensitive enterprise contexts.
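One way to sketch an RBAC-style check is to scan the agent's tool-call trace for actions outside the acting user's role permissions; the trace and permission schemas below are illustrative, not IntellAgent's actual format.

```python
def rbac_violations(trace: list[dict],
                    permissions: dict[str, set]) -> list[dict]:
    """Return every tool call whose tool is not permitted for the
    role the agent was acting under."""
    violations = []
    for call in trace:
        allowed = permissions.get(call["role"], set())
        if call["tool"] not in allowed:
            violations.append(call)
    return violations

# A support role may read and reply to tickets, but not export data:
perms = {"support": {"read_ticket", "reply_ticket"}}
trace = [{"role": "support", "tool": "read_ticket"},
         {"role": "support", "tool": "export_customers"}]
# rbac_violations(trace, perms) flags only the export call.
```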
Especially important where agents must operate within compliance and auditing frameworks requiring deterministic or repeatable behavior that is explainable. Occasional success is insufficient—agents must perform reliably across time and usage scenarios.
Evaluating reliability is nontrivial: running multiple trials per input is computationally expensive, especially for complex tasks involving tools, memory, or multi-agent coordination. \(\tau\)-bench explicitly incorporates \(\text{pass}^k\) to evaluate consistency in retail and airline booking domains, showing that current agents struggle with consistency.
Unlike most benchmarks focusing on short episodes, real-world enterprise agents operate continuously over extended periods while interacting with users, systems, and data. Standard short-term evaluations cannot capture phenomena such as performance drift, context retention, or cumulative effect of decisions on business outcomes.
Park et al. observed generative agents in a continuously running simulated town environment to study emergent behaviors across multi-day interactions. Maharana et al. evaluated long-term conversational memory through 600-turn dialogues.
Enterprises enforce strict operational rules: approval workflows, data retention policies, usage quotas, and legal regulations like GDPR or HIPAA. Evaluating agents in such contexts requires more than measuring task success—it demands verification that behaviors align with formal policy constraints.
Without explicit modeling of these constraints during evaluation, agents deemed "correct" in traditional benchmarks may still fail in production due to policy violations or compliance risks.
Four key directions to advance the field
Current efforts focus on isolated dimensions. Future work should develop frameworks that assess agent performance across multiple, interdependent dimensions simultaneously.
Bridge the gap between lab and production. Create environments incorporating dynamic multi-user interactions, role-based access controls, and domain-specific knowledge—via real-world deployment trials or simulated enterprise workflows.
Manual evaluation is costly and hard to scale. Explore synthetic data generation for controllable test cases, simulated environments, and advancing LLM-as-a-judge / Agent-as-a-judge techniques.
Evaluation must be efficient to support iterative development. Develop protocols that strike a balance between depth and efficiency, especially for repeated trials and human-in-the-loop assessments.
Bottom Line: Future research should focus on developing evaluation methods that are holistic, realistic, scalable, and efficient. These directions are essential for building reliable and trustworthy LLM-based agents at scale.