Advances in Large Language Models and Adversarial Attacks

The field of large language models (LLMs) is evolving rapidly, with growing attention to safety and security. Recent studies highlight the vulnerability of LLMs and related generative models to adversarial attacks, including jailbreak and backdoor attacks. Researchers are exploring new detection and mitigation methods, such as learning natural language constraints for safe reinforcement learning and using visual retrieval-augmented generation, alongside efforts to build more robust and generalizable models that adapt to novel safety requirements and domain shifts. Noteworthy papers include a domain-based taxonomy of jailbreak vulnerabilities; Parasite, a backdoor attack framework for diffusion models that hides triggers via steganography; and a training-free adversarial detection method based on visual retrieval-augmented generation, which reports state-of-the-art performance in detecting adversarial patches.
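As a rough illustration of the retrieval-based detection idea mentioned above (and not the cited paper's actual method), the sketch below embeds a query image, compares it against a small index of known adversarial-patch exemplars, and flags inputs whose nearest match exceeds a similarity threshold. The encoder, exemplar set, image size, and threshold `TAU` are all placeholder assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

FEATURE_DIM = 128
IMAGE_SHAPE = (32, 32, 3)          # hypothetical input size for this sketch
TAU = 0.9                          # hypothetical similarity threshold

# Stand-in for a frozen pretrained vision encoder: a fixed random projection
# followed by L2 normalization. A real system would use an actual encoder;
# the fixed matrix just keeps queries and the index in one feature space.
_PROJ = rng.standard_normal((int(np.prod(IMAGE_SHAPE)), FEATURE_DIM))

def embed(image: np.ndarray) -> np.ndarray:
    """Map an image to a unit-norm feature vector."""
    v = image.reshape(-1) @ _PROJ
    return v / (np.linalg.norm(v) + 1e-8)

# Reference index: features of known adversarial-patch exemplars
# (random arrays here, purely as placeholders).
patch_exemplars = [rng.random(IMAGE_SHAPE) for _ in range(16)]
index = np.stack([embed(x) for x in patch_exemplars])

def looks_adversarial(image: np.ndarray, tau: float = TAU) -> bool:
    """Flag the input if its nearest reference exemplar is too similar."""
    q = embed(image)
    sims = index @ q               # cosine similarity (all vectors unit-norm)
    return float(sims.max()) >= tau

if __name__ == "__main__":
    query = rng.random(IMAGE_SHAPE)
    print("flagged:", looks_adversarial(query))
```

In practice the placeholder projection would be replaced by a frozen pretrained encoder and the random exemplars by real patch examples; the "training-free" property comes from reusing such an encoder and a retrieval index rather than fitting a dedicated detector.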

Sources

Automated Survey Collection with LLM-based Conversational Agents

Learning Natural Language Constraints for Safe Reinforcement Learning of Language Agents

Multi-lingual Multi-turn Automated Red Teaming for LLMs

Malware Detection in Docker Containers: An Image is Worth a Thousand Logs

JailDAM: Jailbreak Detection with Adaptive Memory for Vision-Language Model

Don't Lag, RAG: Training-Free Adversarial Detection Using RAG

A Domain-Based Taxonomy of Jailbreak Vulnerabilities in Large Language Models

Sugar-Coated Poison: Benign Generation Unlocks LLM Jailbreaking

Parasite: A Steganography-based Backdoor Attack Framework for Diffusion Models

Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking

Defending Deep Neural Networks against Backdoor Attacks via Module Switching

Bypassing Safety Guardrails in LLMs Using Humor
