Fortress
Frontier Risk Evaluation for National Security and Public Safety
Introduction
The rapid advancement of large language models (LLMs) introduces powerful dual-use capabilities that could both threaten and bolster national security and public safety (NSPS). Developers often implement model safeguards to protect against misuse that could lead to harm. However, these mitigations can also inadvertently prevent models from providing useful information. To assess the relevant trade-offs for research, development, and policy, we need to understand the extent to which model safeguards both prevent harmful responses and enable helpful ones. Existing benchmarks often fail to test model robustness to NSPS-related risks in a scalable, objective manner that accounts for the dual-use nature of NSPS information.

