For safety reasons, large language models (LLMs) are trained to refuse harmful user instructions, such as assisting dangerous activities. In this work, we study an open question: can the desired safety refusal, typically enforced in chat contexts, generalize to non-chat and agentic use cases? Unlike chatbots, LLM agents equipped with general-purpose tools, such as web browsers and mobile devices, can directly influence the real world, making it even more crucial that they refuse harmful instructions. We primarily focus on red-teaming browser agents, i.e., LLMs that manipulate information via web browsers. To this end, we introduce the Browser Agent Red-teaming Toolkit (BrowserART), a comprehensive test suite designed specifically for red-teaming browser agents. BrowserART consists of 100 diverse browser-related harmful behaviors (including original behaviors and ones sourced from HarmBench [Mazeika et al., 2024] and AirBench 2024 [Zeng et al., 2024b]) across both synthetic and real websites. Our empirical study of state-of-the-art browser agents reveals that, while a backbone LLM refuses harmful instructions as a chatbot, the corresponding agent does not. Moreover, attack methods designed to jailbreak refusal-trained LLMs in chat settings transfer effectively to browser agents. With human rewrites, the GPT-4o- and o1-preview-based browser agents attempted 98 and 63 harmful behaviors (out of 100), respectively. We publicly release BrowserART and call on LLM developers, policymakers, and agent developers to collaborate on improving agent safety.
BrowserART consists of 100 harmful browser-related behaviors (including original behaviors and ones sourced from HarmBench [Mazeika et al., 2024] and AirBench 2024 [Zeng et al., 2024b]) that an agent should not assist with. We divided the behaviors into two main categories, harmful content and harmful interaction, and created sub-categories under each to capture the harm semantics. We also created 40 synthetic websites under 19 domains for red-teaming behaviors that target specific websites (e.g., Twitter/X). These synthetic pages are hosted locally so that red-teaming experiments run in a sandbox without polluting the real world, especially social media and government sites.
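To make the sandboxing idea concrete, the following is a minimal sketch (not BrowserART's actual harness; directory and page names are hypothetical) of serving synthetic pages from a local directory so that an agent under test only ever browses local copies, never the real sites:

```python
# Sketch: host synthetic websites on localhost so red-teaming runs stay sandboxed.
# All names here (synthetic_sites directory, index.html content) are illustrative.
import functools
import http.server
import os
import tempfile
import threading
import urllib.request


def serve_sandbox(root: str, port: int = 0):
    """Serve the directory `root` on localhost; returns (server, base_url).

    port=0 lets the OS pick a free port, so parallel runs do not collide.
    """
    handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=root)
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_address[1]}"


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    # A stand-in synthetic page, e.g. a mock of a social-media site.
    with open(os.path.join(root, "index.html"), "w") as f:
        f.write("<html><body>synthetic social-media clone</body></html>")
    server, base = serve_sandbox(root)
    # A browser agent would be pointed at `base` instead of the real site.
    page = urllib.request.urlopen(base + "/index.html").read().decode()
    print("synthetic" in page)
    server.shutdown()
```

Pointing the agent's browser at the returned `base` URL keeps every action local, which is what allows harmful behaviors against, say, a mock social-media or government page to be tested safely.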
One of the major findings in our work is a clear gap between the Attack Success Rates (ASRs) of a backbone LLM and its agent: while the LLM refuses to follow the harmful instruction in a chat setting, the agent complies. The gaps for the GPT-4o and GPT-4-turbo models are the most pronounced among the backbone LLMs we tested. In contrast, Opus-3 and Llama-3.1-405B show the smallest drop in refusal capability.
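For clarity, the ASR gap can be expressed as a simple difference of success fractions. The sketch below (with toy, made-up outcome data, not our experimental results) shows the computation:

```python
# Illustrative sketch: ASR = fraction of behaviors the model attempted.
# The gap is agent ASR minus chat ASR over the same behavior set.
def attack_success_rate(outcomes):
    """`outcomes`: one boolean per behavior, True if the model attempted it."""
    return sum(outcomes) / len(outcomes)


# Toy data for the same 5 behaviors (hypothetical, not measured values).
chat_outcomes = [False, False, True, False, False]   # chatbot mostly refuses
agent_outcomes = [True, True, True, False, True]     # agent complies far more

gap = attack_success_rate(agent_outcomes) - attack_success_rate(chat_outcomes)
print(f"ASR gap: {gap:.0%}")  # → ASR gap: 60%
```

A large positive gap means refusal training that holds up in chat is not carrying over to the agentic setting.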