Efficient and Multi-Dimensional Alignment of Large Language Models

Recent work on aligning Large Language Models (LLMs) with human preferences shows a shift toward more efficient and nuanced methods. Researchers are increasingly focusing on inference-time alignment techniques that adjust model behavior dynamically without full retraining, reducing computational cost. Methods such as Alignment Vectors (AVs) let users tailor LLM outputs to specific domains and preference levels, offering a more flexible and cost-effective alternative to traditional training-time alignment (see the sketch below).

There is also growing emphasis on multi-dimensional preference optimization, which addresses the complex and varied nature of human preferences by extending optimization across multiple aspects and segments of a model's response. This approach, exemplified by 2D-DPO, demonstrates superior performance in aligning models with human preferences across several benchmarks. In parallel, safety-focused alignment is advancing: methods such as Rectified Policy Optimization (RePO) improve safety without compromising performance, particularly in scenarios where safety constraints are stringent.

Finally, uncertainty-aware optimization and ensembles of reward models are being used to mitigate the risks posed by reward-model variability, yielding more reliable and robust alignment outcomes. Overall, the field is moving toward adaptable, efficient, and safe alignment techniques that better capture and respond to the nuanced and diverse preferences of human users.
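To make the inference-time alignment idea concrete, the sketch below treats an alignment vector as the parameter-wise difference between a preference-aligned checkpoint and its base model, applied at load time with a user-chosen strength. This is a minimal sketch of the general steering recipe under that assumption, not the exact procedure from the cited AV paper; the model identifiers and the `apply_alignment` helper are illustrative.

```python
# Minimal sketch, assuming an Alignment Vector (AV) is the parameter-wise difference
# between a preference-aligned checkpoint and its base model (task-arithmetic style).
# Model identifiers and the helper below are illustrative, not from the cited paper.
from transformers import AutoModelForCausalLM

BASE_ID = "org/base-model"          # hypothetical base checkpoint
ALIGNED_ID = "org/aligned-model"    # hypothetical domain-aligned checkpoint

base = AutoModelForCausalLM.from_pretrained(BASE_ID)
aligned = AutoModelForCausalLM.from_pretrained(ALIGNED_ID)

# Alignment vector: per-parameter difference between the aligned and base weights.
aligned_sd = aligned.state_dict()
av = {name: aligned_sd[name] - param for name, param in base.state_dict().items()}

def apply_alignment(model, av, strength: float):
    """Shift the model along the alignment vector; `strength` sets the preference level
    (0.0 = base behavior, 1.0 = fully aligned, intermediate values interpolate)."""
    sd = model.state_dict()
    for name, delta in av.items():
        sd[name] = sd[name] + strength * delta
    model.load_state_dict(sd)
    return model

# Steer the base model halfway toward the aligned behavior at inference time,
# with no additional training.
steered = apply_alignment(base, av, strength=0.5)
```

Because the steering happens on the weights at load time, the same base model can be re-steered to different domains or preference levels without retraining, which is what makes this family of methods attractive on cost grounds.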

Sources

Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback

Inference time LLM alignment in single and multidomain preference spectrum

2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision

Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

Annotation Efficiency: Identifying Hard Samples via Blocked Sparse Linear Bandits

Uncertainty-Penalized Direct Preference Optimization

Fast Best-of-N Decoding via Speculative Rejection

Learning from Response not Preference: A Stackelberg Approach for LLM Detoxification using Non-parallel Data

Accelerating Direct Preference Optimization with Prefix Sharing

Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment

L3Ms -- Lagrange Large Language Models

$f$-PO: Generalizing Preference Optimization with $f$-divergence Minimization

Choice between Partial Trajectories

VPO: Leveraging the Number of Votes in Preference Optimization

Carrot and Stick: Eliciting Comparison Data and Beyond

COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences

Dynamic Information Sub-Selection for Decision Support

Towards Reliable Alignment: Uncertainty-aware RLHF

Joint Training for Selective Prediction

Progressive Safeguards for Safe and Model-Agnostic Reinforcement Learning
