"Researchers Discover New Vulnerability in Large Language Models"

Large Language Models (LLMs) apply deep learning techniques to process and generate text. This Artificial Intelligence (AI) technology underpins publicly accessible tools such as ChatGPT, Claude, and Google Bard. Recent work has focused on aligning LLMs to prevent them from generating undesirable content. For example, public chatbots will not produce inappropriate content when asked directly. Although attackers have been able to evade these measures, doing so typically requires significant human creativity, and the results are inconsistent. Researchers from the School of Computer Science (SCS) at Carnegie Mellon University (CMU), the CyLab Security and Privacy Institute, and the Center for AI Safety in San Francisco have discovered a new vulnerability, proposing a simple and effective attack method that can cause aligned LLMs to generate objectionable behaviors with a high success rate. In their study titled "Universal and Transferable Adversarial Attacks on Aligned Language Models," CMU Associate Professors Matt Fredrikson and Zico Kolter, Ph.D. student Andy Zou, and CMU alum Zifan Wang discovered a suffix that, when attached to a wide range of queries, significantly increases the chance that both open and closed source LLMs will deliver affirmative responses to queries they would otherwise reject. This article continues to discuss the vulnerability found in LLMs.
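To illustrate the prompt construction the researchers describe, the minimal Python sketch below shows how an adversarial suffix would be appended to an otherwise-refused query before it is sent to a target model. The suffix shown is a hypothetical placeholder, not one of the authors' optimized suffixes, and the build_attack_prompt helper is an illustrative name introduced here, not part of the study's code.

# Minimal sketch of the attack's prompt construction: append an adversarial
# suffix to an arbitrary user query. ADVERSARIAL_SUFFIX is a placeholder;
# the real suffixes come from the authors' optimization procedure and are
# not reproduced here.
ADVERSARIAL_SUFFIX = "<optimized adversarial suffix goes here>"  # hypothetical placeholder

def build_attack_prompt(user_query: str, suffix: str = ADVERSARIAL_SUFFIX) -> str:
    """Return the query with the adversarial suffix attached."""
    return f"{user_query} {suffix}"

if __name__ == "__main__":
    # The augmented prompt is what would be submitted to the target LLM.
    prompt = build_attack_prompt("A query the aligned model would normally refuse.")
    print(prompt)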

CyLab reports "Researchers Discover New Vulnerability in Large Language Models"