Enhancing LLM Security and Robustness Against Adversarial Attacks

Recent work on large language models (LLMs) centers on their robustness to adversarial attacks, particularly jailbreaking and privacy leakage. Researchers are adapting LLMs to specialized tasks such as time series forecasting while probing how secure they remain against malicious prompts and model manipulation. Key developments include nearest neighbor contrastive learning for time series forecasting, jailbreak techniques that exploit alignment via best-of-N sampling or targeted bit-flips, and agentic red-teaming frameworks for privacy leakage. These approaches aim to balance performance and security, exposing critical vulnerabilities while preserving or improving functionality. Several of the papers below report attacks that sharply reduce computational requirements and raise attack success rates, underscoring how quickly the threat landscape is shifting. As the field progresses, robust defense mechanisms, including moving target defense strategies, are becoming increasingly important for safeguarding LLMs against a wide array of threats.
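To make the moving target defense idea concrete, the minimal sketch below randomizes the serving configuration on every request so that a prompt tuned against one fixed setup is less likely to transfer. It is an illustrative toy under assumed names (ModelConfig, CONFIG_POOL, query_llm), not the interface or method of FlexLLM or any other paper listed here.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-ins: ModelConfig and query_llm() are illustrative only and
# do not correspond to any cited paper's actual interface or implementation.

@dataclass(frozen=True)
class ModelConfig:
    system_prompt: str
    temperature: float
    top_p: float

# A pool of serving configurations; in a real deployment these could differ in
# system prompt, decoding parameters, or even the underlying checkpoint.
CONFIG_POOL = [
    ModelConfig("You are a careful assistant. Refuse unsafe requests.", 0.3, 0.90),
    ModelConfig("You are a helpful assistant. Follow the safety policy strictly.", 0.7, 0.95),
    ModelConfig("You are a concise assistant. Decline policy-violating prompts.", 0.5, 0.80),
]

def query_llm(prompt: str, config: ModelConfig) -> str:
    """Stub for a black-box LLM call; a real system would hit a model API here."""
    return f"[response generated under T={config.temperature}, top_p={config.top_p}]"

def serve_with_moving_target(prompt: str) -> str:
    """Route each request to a randomly chosen configuration.

    The intent: an adversarial prompt optimized against one fixed configuration
    becomes less reliable when the serving configuration keeps shifting.
    """
    config = random.choice(CONFIG_POOL)
    return query_llm(prompt, config)

if __name__ == "__main__":
    print(serve_with_moving_target("Summarize today's security news."))
```

In practice the pool would be larger and rotated over time, and the randomization could extend to the underlying model itself; the sketch only shows the per-request selection step.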

Sources

Rethinking Time Series Forecasting with LLMs via Nearest Neighbor Contrastive Learning

LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds

BadGPT-4o: stripping safety finetuning from GPT models

PrivAgent: Agentic-based Red-teaming for LLM Privacy Leakage

Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models

Trust No AI: Prompt Injection Along The CIA Security Triad

Poison Attacks and Adversarial Prompts Against an Informed University Virtual Assistant

PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips

FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks

Adversarial Vulnerabilities in Large Language Models for Time Series Forecasting

Model-Editing-Based Jailbreak against Safety-aligned Large Language Models

AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models

Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models
