Enhancing Language Model Alignment: Multi-Objective Fine-Tuning and Robust Preference Modeling

Recent work on language model alignment has shifted towards more sophisticated and robust optimization techniques, as researchers increasingly address the limitations of established methods such as Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF). One key area of innovation is multi-objective fine-tuning: the HyperDPO framework, for example, leverages hypernetworks to handle listwise ranking datasets efficiently while providing greater flexibility in post-training control over trade-offs between objectives.
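To make the idea concrete, here is a minimal sketch of how a hypernetwork could condition a policy on a preference weight vector over objectives, emitting a low-rank weight offset for a frozen base layer. The class name, network sizes, and the low-rank parameterization are illustrative assumptions, not the HyperDPO implementation.

```python
import torch
import torch.nn as nn

class PreferenceHyperNetwork(nn.Module):
    """Illustrative hypernetwork: maps a preference weight vector over K
    objectives to a low-rank weight offset for one target linear layer."""
    def __init__(self, num_objectives: int, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.rank, self.in_features, self.out_features = rank, in_features, out_features
        hidden = 128
        self.net = nn.Sequential(
            nn.Linear(num_objectives, hidden),
            nn.ReLU(),
            nn.Linear(hidden, rank * (in_features + out_features)),
        )

    def forward(self, pref_weights: torch.Tensor) -> torch.Tensor:
        # pref_weights: (K,) simplex weights over objectives, e.g. helpfulness vs. safety.
        params = self.net(pref_weights)
        a, b = params.split([self.rank * self.in_features,
                             self.rank * self.out_features])
        A = a.view(self.rank, self.in_features)
        B = b.view(self.out_features, self.rank)
        # Additive offset to be applied to the frozen base weight matrix.
        return B @ A

# Post-training control: sweep the trade-off without re-training the policy.
hyper = PreferenceHyperNetwork(num_objectives=2, in_features=4096, out_features=4096)
delta_w = hyper(torch.tensor([0.7, 0.3]))  # 70% objective A, 30% objective B
```

Because the trade-off weights are an input rather than a training-time constant, a single trained hypernetwork can serve an entire family of objective mixtures at inference time.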

Another significant trend is modeling rewards and preferences simultaneously to mitigate issues such as model drift and reward overfitting. Methods like DRDO take a supervised knowledge-distillation approach that directly mimics rewards while learning from human preferences, demonstrating improved robustness to noisy preference signals and out-of-distribution settings.
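As a rough illustration of the idea, the sketch below combines a reward-distillation term (matching a teacher reward margin) with a pairwise preference term. The function name, the specific MSE and log-sigmoid forms, and the weighting `alpha` are assumptions for illustration, not the exact DRDO objective.

```python
import torch
import torch.nn.functional as F

def reward_distillation_preference_loss(policy_logps_w, policy_logps_l,
                                        pred_reward_w, pred_reward_l,
                                        teacher_reward_w, teacher_reward_l,
                                        alpha: float = 1.0):
    """Illustrative combination of reward distillation and preference learning.

    policy_logps_*    : summed log-probs of the chosen (w) / rejected (l) response
    pred_reward_*     : rewards predicted by a head on the policy model
    teacher_reward_*  : rewards from a teacher/oracle reward model
    """
    # Distillation term: match the teacher's reward margin directly.
    distill = F.mse_loss(pred_reward_w - pred_reward_l,
                         teacher_reward_w - teacher_reward_l)
    # Preference term: push the policy to prefer the chosen response.
    preference = -F.logsigmoid(policy_logps_w - policy_logps_l).mean()
    return distill + alpha * preference
```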

The field is also grappling with the phenomenon of likelihood displacement, where the likelihood of preferred responses decreases during training. Studies have shown that this can lead to unintentional unalignment, particularly in safety-critical contexts. Researchers are now developing metrics like the Centered Hidden Embedding Similarity (CHES) score to identify and mitigate these issues.
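The sketch below shows one plausible way to compute a CHES-style score from the hidden states of a preference pair; the centering step and the exact similarity form are interpretations for illustration, and the paper should be consulted for the precise definition.

```python
import torch

def ches_style_score(hidden_w: torch.Tensor, hidden_l: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of a CHES-style similarity between a preferred (w) and
    dispreferred (l) response, given their token hidden states from the model.

    hidden_w: (T_w, d) final-layer hidden states for the preferred response tokens
    hidden_l: (T_l, d) final-layer hidden states for the dispreferred response tokens
    Higher scores would flag preference pairs prone to likelihood displacement.
    """
    # Center each set of token embeddings before summing (one reading of "centered").
    hw = hidden_w - hidden_w.mean(dim=0, keepdim=True)
    hl = hidden_l - hidden_l.mean(dim=0, keepdim=True)
    sum_w, sum_l = hw.sum(dim=0), hl.sum(dim=0)
    # Similarity of the dispreferred sum to the preferred sum, relative to the
    # preferred sum's own squared norm.
    return sum_l @ sum_w - sum_w @ sum_w
```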

Moreover, there is growing recognition of the importance of output diversity in alignment algorithms. Findings indicate that while a higher likelihood of preferred completions can improve memorization, it may reduce diversity and harm generalization. Indicators such as decreasing entropy over the top-k tokens and diminishing top-k probability mass are being used to detect and prevent over-optimization and to maintain alignment with human preferences.
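The following sketch computes two such indicators from next-token logits during training: the entropy of the renormalized top-k distribution and the total top-k probability mass. The choice of k, the aggregation, and the function name are illustrative assumptions.

```python
import torch

def topk_diversity_indicators(logits: torch.Tensor, k: int = 50):
    """Compute mean top-k entropy and top-k probability mass over a batch.

    logits: (batch, seq_len, vocab) next-token logits from the policy.
    Decreasing top-k entropy and shrinking top-k mass over the course of
    training can signal over-optimization toward a narrow set of completions.
    """
    probs = logits.softmax(dim=-1)
    topk_probs, _ = probs.topk(k, dim=-1)              # (batch, seq_len, k)
    topk_mass = topk_probs.sum(dim=-1)                 # total mass in the top-k tokens
    renorm = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    topk_entropy = -(renorm * renorm.clamp_min(1e-12).log()).sum(dim=-1)
    return topk_entropy.mean(), topk_mass.mean()
```

Logged alongside the training loss, these two scalars give an early-warning signal that can trigger early stopping or a stronger KL constraint before diversity collapses.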

Lastly, the introduction of multi-sample comparison methods such as mDPO and mIPO highlights the need for more comprehensive evaluation of generative models. By comparing groups of samples rather than single responses, these methods aim to capture set-level characteristics such as diversity and bias more accurately, offering a more robust optimization framework, especially in the presence of label noise.
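A minimal sketch of a groupwise, DPO-style objective that compares sets of samples rather than single responses is shown below; the averaging over groups, the parameter names, and the use of a reference model are assumptions for illustration, not the exact mDPO or mIPO formulation.

```python
import torch
import torch.nn.functional as F

def multi_sample_dpo_style_loss(policy_logps_w, policy_logps_l,
                                ref_logps_w, ref_logps_l,
                                beta: float = 0.1):
    """Groupwise DPO-style loss over sets of samples (illustrative).

    Each *_w / *_l tensor has shape (group_size,), holding summed log-probs of
    responses drawn from the preferred / dispreferred side for one prompt.
    Aggregating over groups lets the comparison reflect set-level properties
    such as diversity, rather than a single pair of responses.
    """
    # Average the (policy - reference) log-ratios over each group of samples.
    margin_w = (policy_logps_w - ref_logps_w).mean()
    margin_l = (policy_logps_l - ref_logps_l).mean()
    return -F.logsigmoid(beta * (margin_w - margin_l))
```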

Noteworthy papers include the HyperDPO framework, which significantly advances multi-objective fine-tuning, and DRDO, which jointly models rewards and preferences and demonstrates superior robustness.

Sources

HyperDPO: Hypernetwork-based Multi-Objective Fine-Tuning Framework

Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

Understanding Likelihood Over-optimisation in Direct Alignment Algorithms

Preference Optimization with Multi-Sample Comparisons

A Common Pitfall of Margin-based Language Model Alignment: Gradient Entanglement
