Preference Modeling for Aligning Language Models

Report on Current Developments in Preference Modeling for Aligning Language Models

General Direction of the Field

The field of preference modeling for aligning language models with human values is rapidly evolving, with a strong focus on enhancing the expressiveness, efficiency, and robustness of preference representations. Recent advancements are moving beyond traditional reward modeling methods, which often struggle with intransitive preferences and high computational costs, towards more sophisticated and scalable approaches. The general direction of the field can be summarized in three key areas:

  1. Preference Representation Learning: There is a growing emphasis on embedding responses into latent spaces to capture intricate preference structures, including intransitive ones, more efficiently. This approach achieves linear query complexity, making it feasible to model complex preferences without the quadratic number of comparisons that pairwise preference models require (a minimal sketch follows this list).

  2. General Preference Optimization: Innovations in preference optimization are shifting from reward-based reinforcement learning to more generalized frameworks that can handle a wider range of preference data. These methods are designed to improve the alignment of language models with nuanced human values, particularly in scenarios where traditional reward models fall short.

  3. Enhanced Model Alignment with Granular Feedback: The incorporation of relative quality margins into preference optimization is gaining traction. This approach allows for a more granular understanding of human preferences, leading to improved model policies and reward models that are better calibrated and less prone to overfitting.
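
To make the representation idea in item 1 concrete, here is a minimal sketch. It is a hedged illustration of the general recipe, not the cited paper's exact model: each response is embedded once into a latent vector, and pairwise preference probabilities come from a skew-symmetric bilinear score, so K candidate responses need K embedding calls (linear) instead of scoring all K^2 ordered pairs, while still being able to represent cyclic preferences that a single scalar reward cannot.

```python
# Illustrative only: names, dimensions, and the toy embeddings below are
# assumptions, not the cited paper's implementation.
import torch

d = 8  # latent dimension (even, so we can build 2x2 skew-symmetric blocks)

def skew_operator(dim: int) -> torch.Tensor:
    """Block-diagonal skew-symmetric operator R with R^T = -R."""
    block = torch.tensor([[0.0, 1.0], [-1.0, 0.0]])
    return torch.block_diag(*[block for _ in range(dim // 2)])

R = skew_operator(d)

def preference_prob(v1: torch.Tensor, v2: torch.Tensor) -> torch.Tensor:
    """P(response 1 preferred over response 2) from latent embeddings v1, v2.

    Skew-symmetry gives P(y1 > y2) = 1 - P(y2 > y1) by construction,
    while still allowing intransitive (cyclic) preference patterns."""
    score = v1 @ R @ v2
    return torch.sigmoid(score)

# Toy check: three embeddings that form a preference cycle a > b > c > a,
# something a single scalar reward model cannot represent.
a = torch.zeros(d); a[0] = 1.0
b = torch.zeros(d); b[1] = 1.0
c = torch.zeros(d); c[0], c[1] = -0.5, -0.5
for name, (x, y) in {"a>b": (a, b), "b>c": (b, c), "c>a": (c, a)}.items():
    print(name, round(float(preference_prob(x, y)), 3))  # all above 0.5
```

In practice the embeddings would come from a trained encoder over (prompt, response) pairs; the fixed unit vectors here only demonstrate that a skew-symmetric score can encode an intransitive cycle while keeping query cost linear in the number of responses.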

Noteworthy Innovations

  • Preference Representation Learning: This approach introduces a novel way to embed responses into a latent space, achieving linear query complexity and outperforming traditional reward models on benchmarks like RewardBench.

  • Margin Matching Preference Optimization (MMPO): MMPO incorporates relative quality margins into the preference objective, yielding state-of-the-art results on preference benchmarks and improved robustness against overfitting (a hedged sketch of the objective follows this list).

  • Self-Rationalization for LLM-as-a-Judge: This method iteratively improves the rationales of judge models, leading to better alignment and evaluation accuracy, outperforming larger models on scoring benchmarks.
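
As referenced above, the margin-matching idea can be sketched with a small, hedged example. The function names, the `gamma` margin-to-probability scaling, and the toy numbers below are illustrative assumptions rather than the paper's exact objective: each pair's quality margin is converted into a soft target probability, and a DPO-style implicit reward gap is trained to match it, so near-tied pairs contribute weaker gradients than clear wins.

```python
# Illustrative sketch only; not the MMPO authors' implementation.
import torch
import torch.nn.functional as F

def margin_matching_loss(
    policy_chosen_logps: torch.Tensor,    # log pi(y_w | x) under the policy
    policy_rejected_logps: torch.Tensor,  # log pi(y_l | x) under the policy
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    quality_margins: torch.Tensor,        # per-pair quality gap (e.g. score diff)
    beta: float = 0.1,                    # DPO-style temperature
    gamma: float = 1.0,                   # assumed margin-to-probability scale
) -> torch.Tensor:
    # Soft target: a large margin pushes the target toward 1; a near-tie
    # stays close to 0.5 (vanilla DPO would implicitly use a hard 1.0).
    target = torch.sigmoid(gamma * quality_margins)

    # Implicit reward gap between chosen and rejected responses.
    logits = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )

    # Cross-entropy between the soft target and sigmoid(logits).
    return F.binary_cross_entropy_with_logits(logits, target)

# Toy usage with made-up log-probabilities for a batch of two pairs:
loss = margin_matching_loss(
    policy_chosen_logps=torch.tensor([-12.0, -20.0]),
    policy_rejected_logps=torch.tensor([-14.0, -20.5]),
    ref_chosen_logps=torch.tensor([-13.0, -20.2]),
    ref_rejected_logps=torch.tensor([-13.5, -20.1]),
    quality_margins=torch.tensor([3.0, 0.5]),  # clear win vs. near tie
)
print(float(loss))
```

The point of the soft target is calibration: pairs whose quality gap is small are not forced toward a confident preference, which is consistent with the reduced overfitting described above.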

These innovations are pushing the boundaries of preference modeling, making it possible to align language models more effectively with human values in increasingly complex and nuanced ways.

Sources

General Preference Modeling with Preference Representations for Aligning Language Models

An Exploration of Self-Supervised Mutual Information Alignment for Multi-Task Settings

Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback

Beyond Scalar Reward Model: Learning Generative Judge from Preference Data

Preference Optimization as Probabilistic Inference

LRHP: Learning Representations for Human Preferences via Preference Pairs

Reward Learning From Preference With Ties

Self-rationalization improves LLM as a fine-grained judge
