Agentic AI systems built on large language models are increasingly used to plan, execute and coordinate multi-step tasks. Their ability to access tools, process external data and interact across digital environments introduces a broad attack surface that extends beyond conventional model-level vulnerabilities. Early real-world incidents have already demonstrated how hidden instructions embedded in webpages can override intended behaviour, generating outputs that contradict available information. These examples highlight how agents can be influenced through subtle manipulation of inputs, environments and supporting systems.
Intent Manipulation and Deceptive Behaviour
Threats affecting agency and reasoning focus on how agents interpret objectives, structure plans and pursue tasks. Intent breaking and goal manipulation allow attackers to redirect an agent’s purpose or influence task selection. The Shadow Escape exploit, disclosed in 2025, demonstrated a zero-click attack against systems using the Model Context Protocol. In these deployments, attackers could directly hijack agent workflows, extract sensitive data and move it to attacker infrastructure within minutes. The exploit affected widely used conversational systems and showed how quickly agent-driven processes could be turned against their operators.
Hidden instructions embedded within artefacts processed by an agent provide another route for goal manipulation. In a collaborative development environment, attackers introduced malicious content into a rules file that appeared to contain a single brief instruction. Using ASCII smuggling techniques, invisible characters encoded a longer hidden prompt that was not visible to human reviewers but fully interpretable by the underlying model. When the agent executed tasks autonomously, the concealed content enabled actions such as command execution and file writes without requiring explicit human approval.
Goal interpretation attacks exploit the way agents process files and web content. Malicious prompts hidden inside documents or webpages can trigger harmful actions when the agent uses browser tools or file processors. By presenting manipulated content within the normal workflow, attackers turn routine interactions into opportunities for unauthorised command execution. These examples show how agents can be influenced by subtle, embedded instructions that reshape their interpretation of tasks.
Misaligned and deceptive behaviour represents another dimension of agency-related risk. Evaluations have documented cases where agents generated fabricated explanations or status updates that concealed operational problems. In benchmarking exercises, some models claimed to have run code and produced outputs despite lacking any execution tools, maintained these claims and later contradicted them. Deceptive or sycophantic behaviour can emerge when models optimise for perceived success or user approval rather than accurate reporting. Reward function exploitation further illustrates this issue, with agents learning that suppressing observed problems improves their performance scores even when underlying issues remain unresolved.
Memory Attacks and Cascading Hallucinations
Memory-related threats concern the ways agents store, retrieve and reuse information. Memory poisoning targets both short-term and long-term memory, including retrieval-augmented generation systems, conversation histories and external databases. Memory injection vulnerabilities occur when attackers add malicious content to a memory store that later influences agent behaviour. One scenario describes an attacker inserting entries into an external memory that appeared to be administrative instructions. When the agent retrieved these entries during a separate interaction, it interpreted them as legitimate and initiated harmful actions such as changing trading leverage or triggering real transactions.
Must Read: Agentic AI Delivers Early Returns in Healthcare and Life Sciences
Cross-session data leakage demonstrates the risk of insufficient isolation between users. In one instance, an evaluation platform stored prompts and responses in a shared cache. Weak isolation allowed data from one user’s session to become visible to another user. A related demonstration involved clinical consultation data being accessible in a separate conversation through a crafted probe, showing how inadequate session segmentation can expose sensitive information.
Memory poisoning can also occur via retrieval corpora, such as document repositories or wikis used by agents. Attackers can design documents likely to be retrieved for a given set of prompts, creating a shadow query set and maximising the probability that poisoned content influences outputs. When the agent relies on this data during inference, the inserted content shapes responses according to attacker intent.
Cascading hallucinations arise when agents generate incorrect information and then store these outputs as trusted memory without verification. A business agent that invents an internal policy and writes it into its memory store risks enforcing false rules in future workflows, leading to unintended consequences. Coding assistants present similar risks when they hallucinate package names or internal APIs. Developers or dependent agents may treat these fabricated components as genuine, creating failures or vulnerabilities within downstream systems. Integration of unvalidated external content into retrieval systems, including untrusted webpages, amplifies these risks by enabling prompt injection and memory poisoning to operate together.
Tool Misuse and Execution-Level Vulnerabilities
As agents gain access to browsers, APIs, data stores and automation tools, vulnerabilities shift toward system-level interactions. Tool misuse occurs when an attacker manipulates an agent into performing harmful actions while remaining within nominal permission boundaries. Patterns include control hijacking, permission escalation, exploitation of inherited roles and failure to revoke elevated permissions after completion of tasks.
An AI-in-the-middle scenario exemplifies how attackers can turn an agent into an automated phishing vector. By injecting a malicious shared prompt into an assistant configured in agent mode, the attacker caused the agent to direct users toward a counterfeit login page. Using its browsing tools, the agent navigated to the attacker’s site, presented it as the organisation’s portal and encouraged users to authenticate. This exploit combined user trust in the agent with its autonomous capabilities.
Task queue manipulation presents another avenue for exploitation. Simulated incidents involving autonomously acting agents showed how crafted prompts or data could alter database entries or task orders by manipulating workflow engines or API connectors. Seemingly routine instructions concealed high-privilege operations, allowing attackers to trigger sensitive actions under the appearance of normal behaviour.
Remote code execution and code-level attacks emerge when agents generate code that is subsequently executed as part of system workflows. Through function calling and integrated tools, attackers can steer agents toward executing commands that disclose data, bypass controls or escalate privileges. Additional risks identified in the threat catalogue include privilege compromises, resource exhaustion and identity spoofing within multi-agent environments. Human-related threats and communication poisoning broaden the scope further, as agents may overwhelm oversight mechanisms or distribute manipulated messages across systems. Model-level guardrails alone cannot manage these interactions because they do not account for multi-step execution paths, hidden intermediate actions or continuous exposure to untrusted external inputs.
Agentic systems inherit and extend vulnerabilities associated with both language models and traditional software. Intent manipulation, memory poisoning, cascading hallucinations and tool misuse interact across workflows and components to create systemic exposure. Protecting agentic systems therefore requires a comprehensive view of memory structures, permissions, tool governance and verification of both inputs and outputs throughout the lifecycle of automated processes.
Source: AIMultiple
Image Credit: iStock