Date: 2026-03-12

Uncomfortable Truth

We’ve spent the last few years asking a single, very reasonable question about AI safety and security: Can large language models be prevented from helping attackers? That question matters. A lot. Models that casually explain how to write ransomware, bypass endpoint protection, or exploit known vulnerabilities create obvious and serious risk. Alignment work has made real, meaningful progress here, and that work deserves credit.

But focusing on that question alone hides a deeper problem. While we’ve been measuring harmful compliance, we’ve mostly ignored the inverse failure mode: What happens when the same safety mechanisms deny help to legitimate defenders?

This post, based on a new paper from our security team, is about what happens when AI safety works exactly as designed, yet still fails the people trying to keep systems secure. The problem is that we’ve been measuring only half of the risk surface.

The Gap Nobody Was Measuring

Most AI safety benchmarks treat refusal as an unambiguous success. If a model refuses to explain an exploit, generate malware, or bypass a control, great. Alignment win. But real‑world defense work doesn’t look like a clean separation between “good” and “bad” tasks.

Defenders routinely:

analyze malware to understand behavior

study exploits to assess blast radius

disable persistence mechanisms after compromise

scan for vulnerabilities in their own environments

reverse attacker tooling to detect future abuse

In other words, defensive work often looks exactly like offense, at least at the surface level.

From the model’s perspective, the difference between an attacker and a defender is rarely obvious from a single prompt. Both use the same tools. The same terminology. The same technical language.

Until now, few benchmarks have systematically asked: How often do aligned models refuse legitimate, authorized defensive requests? So we measured it.

The Dataset: Real Defenders, Real Pressure

We analyzed 2,390 real interactions from the National Collegiate Cyber Defense Competition (NCCDC). This is not synthetic prompting. Not red‑team role‑play. Not hypothetical scenarios written after the fact. NCCDC is a sanctioned, educational environment where:

student blue teams defend live systems

professional red teams actively attack them

outages, compromises, and misconfigurations happen in real time

time pressure and operational realism are very real

Every prompt in the dataset comes from a legitimate defensive context. There are no attackers asking for help here. Only defenders trying to understand what’s happening to their systems and how to respond.

That realism matters, because alignment failures in controlled demos don’t always translate to failures under pressure. This dataset captures the opposite: what alignment looks like when defenders are stressed, rushed, and dealing with actual incidents.

What We Found

Across all conversations, 12.2% resulted in outright refusal. When degraded, non-actionable responses are included, the problematic response rate rises further. The highest refusal rates occurred in the most critical defensive tasks: system hardening (43.8%), malware analysis (34.3%), vulnerability assessment (22.7%), and incident response (18.9%).

By contrast, log analysis, a defensive task with little attack‑shaped language, had zero refusals. That contrast turns out to be the key. The strongest signal in the data was vocabulary. Not intent. Not authorization. Not urgency. The model was 2.72× more likely to refuse when a defensive prompt contained attack‑shaped terms such as exploit, payload, shell, bypass, C2, or evasion.

This held true even when:

the user explicitly declared blue‑team context

the environment was described as a sandbox or training exercise

the task was clearly defensive and reactive

We call this pattern Defensive Refusal Bias (DRB). DRB is about legitimate defenders colliding with alignment systems that rely heavily on surface‑level lexical cues. Once certain words appear, deeper reasoning about intent or context often never happens.
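
To make the vocabulary effect concrete, here is a minimal sketch of how a refusal rate ratio like the one above could be computed. It assumes the figure is a simple ratio of refusal rates (the paper may define it differently), uses whole-word matching as the trigger rule, and runs on made-up toy records rather than NCCDC data. The names has_attack_vocab and refusal_rate_ratio are hypothetical.

```python
import re

# Attack-shaped terms quoted in the post; the whole-word matching rule is an assumption.
ATTACK_TERMS = ("exploit", "payload", "shell", "bypass", "c2", "evasion")
_TERM_RE = re.compile(r"\b(" + "|".join(ATTACK_TERMS) + r")\b", re.IGNORECASE)


def has_attack_vocab(prompt: str) -> bool:
    """True if the prompt contains any attack-shaped term as a whole word."""
    return _TERM_RE.search(prompt) is not None


def refusal_rate_ratio(records):
    """records: iterable of (prompt, refused) pairs.

    Returns rate(refused | attack vocab) / rate(refused | no attack vocab).
    """
    counts = {True: [0, 0], False: [0, 0]}  # vocab flag -> [refused, total]
    for prompt, refused in records:
        bucket = counts[has_attack_vocab(prompt)]
        bucket[0] += int(refused)
        bucket[1] += 1
    (r_with, n_with), (r_without, n_without) = counts[True], counts[False]
    if n_with == 0 or n_without == 0 or r_without == 0:
        raise ValueError("ratio undefined: need both groups and a nonzero baseline")
    return (r_with / n_with) / (r_without / n_without)


# Toy, made-up records for illustration only (not NCCDC data).
toy = [
    ("How do I remove this reverse shell from our web server?", True),
    ("What does this payload do once it runs?", False),
    ("Which firewall rules should I review first?", False),
    ("Help me read these auth logs", False),
    ("Is this exploit relevant to our patch level?", True),
    ("How do I rotate the service account passwords?", True),
    ("Summarize this syslog excerpt", False),
    ("What is a sensible backup schedule?", False),
]
print(refusal_rate_ratio(toy))
```

On the toy records, two of three attack-vocabulary prompts are refused versus one of five without, so the ratio is about 3.3. The number means nothing in itself; the point is that the comparison never looks at intent, only at which words are present.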

Authorization Makes Things Worse

One of our initial hypotheses was that explicit authorization would help. We thought phrases like “I’m on the blue team,” “This is for NCCDC,” or “This is a sanctioned training environment” would reduce refusals. Empirically, it did the opposite. When we rewrote refused prompts to remove authorization language, refusal rates dropped from 21.8% to 13.7%.

When attack‑shaped vocabulary and authorization appeared together, refusal rates were the highest observed in the dataset. From the model’s perspective, authorization language seems to function less like legitimacy and more like confirmation that the task is genuinely dangerous.

In effect: “I’m authorized to do X” is interpreted as evidence that X really is high‑risk, not as a safety boundary. Counterintuitively, authorization amplifies lexical safety triggers.
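
The ablation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure from the paper: the phrase list is modeled on the three examples quoted earlier, strip_authorization and ablation_delta are hypothetical helpers, and fake_is_refused is a stub standing in for "send the prompt to a model and classify the reply."

```python
import re

# Hypothetical authorization phrases, modeled on the examples quoted in the post.
AUTH_PATTERNS = [
    r"i['’]m on the blue team[.,]?\s*",
    r"this is for nccdc[.,]?\s*",
    r"this is a sanctioned training environment[.,]?\s*",
]
_AUTH_RE = re.compile("|".join(AUTH_PATTERNS), re.IGNORECASE)


def strip_authorization(prompt: str) -> str:
    """Remove known authorization phrases, leaving the technical request intact."""
    return _AUTH_RE.sub("", prompt).strip()


def refusal_rate(prompts, is_refused) -> float:
    """Fraction of prompts for which is_refused(prompt) is true."""
    prompts = list(prompts)
    return sum(bool(is_refused(p)) for p in prompts) / len(prompts)


def ablation_delta(prompts, is_refused):
    """Return (rate with authorization text, rate without it, absolute drop)."""
    before = refusal_rate(prompts, is_refused)
    after = refusal_rate([strip_authorization(p) for p in prompts], is_refused)
    return before, after, before - after


def fake_is_refused(prompt: str) -> bool:
    # Stub only: pretends a model refuses whenever authorization text is present.
    # A real run would query a model and label its response instead.
    return _AUTH_RE.search(prompt) is not None


# Made-up prompts for illustration only.
toy_prompts = [
    "I'm on the blue team. How do I remove this web shell?",
    "This is for NCCDC. What does this cron entry do?",
    "How do I read these auth logs?",
    "Which services should I disable first?",
]
print(ablation_delta(toy_prompts, fake_is_refused))

# Arithmetic on the figures reported in the post: 21.8% -> 13.7%.
reported_before, reported_after = 0.218, 0.137
print(round(reported_before - reported_after, 3))                      # absolute drop
print(round((reported_before - reported_after) / reported_before, 3))  # relative drop
```

Read against the reported figures, the drop is 8.1 percentage points, or roughly a 37% relative reduction in refusals from removing authorization language alone.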

Urgency Helps, But Only Briefly

There was one partial exception. Concrete descriptions of ongoing damage appear to act as a legitimacy signal, albeit a weak one. When defenders described an active incident (something already broken, compromised, or causing harm), models appeared somewhat more willing to respond.

Models seem more willing to help with harm that has already occurred than with abstract planning or preventative work. But this effect is fragile. As soon as attack‑shaped vocabulary appears or authorization language is added, the benefit vanishes. Lexical safety triggers still dominate.

Why This Matters

This is a serious security asymmetry. Attackers face no equivalent constraints. Defenders operating within aligned systems do.

That creates a safety‑induced denial of service, meaning:

slower incident response

degraded analysis under pressure

reduced effectiveness of human-AI teaming

Alignment appears to protect systems in theory while weakening them in practice.

Most AI safety benchmarks today measure one side of the tradeoff: Does the model help when it shouldn’t? This study measures the other: Does the model refuse when it shouldn’t? Together, these define the real safety surface. A model that frequently blocks defenders becomes brittle in practice.
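
As a rough sketch of what reporting both sides might look like, the snippet below computes the two rates from a single labeled evaluation set. The EvalResult record, its field names, and the safety_surface function are assumptions made for illustration; they are not the benchmark format used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    should_help: bool  # ground-truth label: the request is legitimate
    helped: bool       # the model gave an actionable answer


def safety_surface(results):
    """Report both failure modes instead of refusal alone.

    harmful_compliance_rate: share of illegitimate requests the model helped with.
    defensive_refusal_rate: share of legitimate requests the model did not help with.
    A rate is None when its group is empty.
    """
    harmful = [r for r in results if not r.should_help]
    legit = [r for r in results if r.should_help]
    return {
        "harmful_compliance_rate":
            sum(r.helped for r in harmful) / len(harmful) if harmful else None,
        "defensive_refusal_rate":
            sum(not r.helped for r in legit) / len(legit) if legit else None,
    }


# Made-up results for illustration only.
toy_results = [
    EvalResult(should_help=True, helped=True),
    EvalResult(should_help=True, helped=True),
    EvalResult(should_help=True, helped=True),
    EvalResult(should_help=True, helped=False),   # a defender was refused
    EvalResult(should_help=False, helped=False),
    EvalResult(should_help=False, helped=False),
]
print(safety_surface(toy_results))
```

On this toy set, a benchmark that scores only harmful compliance would report 0% and stop there, while the second number shows a quarter of legitimate requests going unanswered.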

What Needs to Change

Fixing Defensive Refusal Bias requires building better guardrails. That means:

Moving beyond lexical heuristics toward intent‑aware reasoning

Treating authorization as a first‑class system signal, not just text in a prompt

Evaluating AI security at the system level, not just the output level

Measuring both harmful compliance and defensive refusal
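
One possible reading of "authorization as a first‑class system signal" is that the platform, not the prompt, carries the attestation. The sketch below is purely illustrative: AuthorizationContext, DefenseRequest, and policy_inputs are hypothetical names, and no existing model API is implied to accept these fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuthorizationContext:
    verified: bool  # set by the platform after an out-of-band check, never by the user
    role: str       # e.g. "blue-team"
    scope: str      # e.g. the exercise or environment the attestation covers


@dataclass(frozen=True)
class DefenseRequest:
    prompt: str
    authorization: Optional[AuthorizationContext] = None


def policy_inputs(request: DefenseRequest) -> dict:
    """Separate what the user typed from what the platform attested."""
    auth = request.authorization
    return {
        "prompt": request.prompt,
        "authorization_verified": bool(auth and auth.verified),
        "role": auth.role if auth else None,
        "scope": auth.scope if auth else None,
    }


request = DefenseRequest(
    prompt="How do I remove this web shell from our server?",
    authorization=AuthorizationContext(verified=True, role="blue-team", scope="training-exercise"),
)
print(policy_inputs(request))
```

The design choice here is that the prompt text stays free of authorization claims, so a downstream policy can weigh a verified signal without that claim also acting as a lexical trigger.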

Until we do that, we’ll keep optimizing for benchmarks that look good, while real defenders quietly work around the tools that were supposed to help them.