Large Language Model Alignment with Human Preferences

Report on Recent Developments in Large Language Model Alignment with Human Preferences

Overview

The field of aligning Large Language Models (LLMs) with human preferences is advancing rapidly, driven by the need for alignment methods that are both more efficient and more effective. Recent developments focus on refining offline Reinforcement Learning from Human Feedback (RLHF) techniques, making preference optimization more robust and accurate, and designing reward models that better capture human intent.

General Direction

The current research trend is towards simplifying and enhancing the alignment process without compromising the quality of LLM outputs. Researchers are exploring methods that reduce the resource intensity of RLHF by introducing offline alternatives that optimize LLMs using ranking losses on fixed datasets. A notable shift is the introduction of techniques that not only capture the ordinal relationship between responses but also quantify the degree of preference, thereby providing more nuanced supervision signals.
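
To make this concrete, the sketch below shows one way a preference-magnitude-aware ranking loss could look in PyTorch: a DPO-style pairwise objective in which each pair is reweighted by the reward gap reported by an external reward model. The function and variable names are illustrative assumptions, not code from the cited papers.

```python
import torch
import torch.nn.functional as F

def weighted_pairwise_loss(logp_chosen, logp_rejected,
                           ref_logp_chosen, ref_logp_rejected,
                           reward_chosen, reward_rejected, beta=0.1):
    """DPO-style ranking loss in which each preference pair is reweighted by
    how strongly an external reward model prefers the chosen response.

    All arguments are 1-D tensors holding per-sequence summed log-probabilities
    (policy and frozen reference model) or scalar rewards for a batch of pairs.
    """
    # Implicit reward margin of the policy relative to the reference model.
    policy_margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)

    # Degree of preference supplied by a separate reward model; a larger gap
    # means the pair carries a stronger supervision signal.
    reward_diff = (reward_chosen - reward_rejected).detach()
    weights = torch.sigmoid(reward_diff)  # map the gap into (0, 1); one simple choice

    per_pair_loss = -F.logsigmoid(beta * policy_margin)
    return (weights * per_pair_loss).mean()
```

Setting every weight to 1 recovers a purely ordinal objective; the weighting term is what injects the degree of preference into the supervision signal.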

Additionally, there is a growing emphasis on improving the stability and performance of LLM fine-tuning through modified loss functions and training metrics, such as a minor penalty on rejected responses in DPO and a small auxiliary SFT loss. These modifications aim to minimize the deviation of the optimized model from the original, so that the fine-tuned LLM stays aligned with human preferences while retaining its original capabilities.
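
As an illustration of this idea, the sketch below adds a small SFT-style term on the chosen responses to a DPO-style preference loss, one simple way to penalize deviation from behavior the original model already produces. The coefficient and exact formulation are assumptions for illustration, not the objectives of the cited "Minor DPO reject penalty" or "Minor SFT loss" papers.

```python
import torch.nn.functional as F

def regularized_preference_loss(logp_chosen, logp_rejected,
                                ref_logp_chosen, ref_logp_rejected,
                                beta=0.1, sft_coeff=0.05):
    """Pairwise preference loss plus a small SFT-style term on the chosen
    responses, which discourages the policy from drifting far from responses
    that humans already prefer.
    """
    policy_margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    preference_loss = -F.logsigmoid(beta * policy_margin).mean()

    # Negative log-likelihood of the chosen responses under the current policy.
    sft_loss = -logp_chosen.mean()

    return preference_loss + sft_coeff * sft_loss
```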

Noteworthy Developments

  • Reward Difference Optimization (RDO): This method introduces reward difference coefficients to reweight sample pairs, quantifying how much one response is preferred over another and thereby improving the accuracy of offline RLHF.
  • Critique-out-Loud (CLoud) Reward Models: These reward models first generate a natural language critique of a response and then predict a scalar reward conditioned on that critique, improving preference classification accuracy and yielding Pareto improvements in win rates (a minimal sketch of this critique-then-score pattern follows this list).
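
The sketch below illustrates the critique-then-score pattern behind critique-generating reward models, assuming a Hugging Face-style causal language model and matching tokenizer; the class name, prompt template, and reward head are placeholders rather than the released CLoud implementation.

```python
import torch
import torch.nn as nn

class CritiqueThenScoreRM(nn.Module):
    """Two-stage reward model: generate a natural-language critique of a
    (prompt, response) pair, then predict a scalar reward conditioned on it.

    `lm` and `tokenizer` are assumed to be a Hugging Face-style causal LM and
    its tokenizer; `hidden_size` must match the LM's hidden dimension.
    """

    def __init__(self, lm, tokenizer, hidden_size):
        super().__init__()
        self.lm = lm
        self.tokenizer = tokenizer
        self.reward_head = nn.Linear(hidden_size, 1)  # scalar reward head

    @torch.no_grad()
    def critique(self, prompt, response, max_new_tokens=256):
        # Ask the language model to critique the response before scoring it.
        text = f"Prompt: {prompt}\nResponse: {response}\nCritique:"
        inputs = self.tokenizer(text, return_tensors="pt")
        out = self.lm.generate(**inputs, max_new_tokens=max_new_tokens)
        # The decoded text still contains the prompt and response, so the
        # reward below is conditioned on all three pieces.
        return self.tokenizer.decode(out[0], skip_special_tokens=True)

    def forward(self, prompt, response):
        full_text = self.critique(prompt, response)
        inputs = self.tokenizer(full_text, return_tensors="pt")
        hidden = self.lm(**inputs, output_hidden_states=True).hidden_states[-1]
        return self.reward_head(hidden[:, -1, :])  # reward read off the final token
```

Training would then fit the reward head (and optionally the language model) so that chosen responses score above rejected ones, with critiques generated on the fly; these details are assumptions for illustration rather than the published CLoud recipe.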

These developments highlight the potential for more sophisticated and effective alignment techniques, paving the way for LLMs that more accurately reflect human values and intents.

Sources

Offline RLHF Methods Need More Accurate Supervision Signals

Minor DPO reject penalty to increase training robustness

Minor SFT loss for LLM fine-tune to increase performance and reduce model deviation

Critique-out-Loud Reward Models