Research in reinforcement learning from human feedback (RLHF) is increasingly focused on key challenges such as reward hacking, sparse rewards, and inefficient exploration. Researchers are proposing methods to mitigate these issues, including uncertain reward models, behavior-supported regularization, and entropy-guided sequence weighting. These approaches aim to improve the efficiency and robustness of RLHF algorithms, enabling more effective training of large language models. Noteworthy papers include:
- Likelihood Reward Redistribution, which proposes a framework for modeling per-step rewards with parametric probability distributions to address sparse rewards.
- Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization, which introduces a method to regularize the value function and prevent overestimation caused by extrapolation errors.
- Trajectory Balance with Asynchrony, which presents a scalable LLM RL system that decouples training and search to speed up training wall-clock time.
- One Framework to Rule Them All, which unifies RL-based and RL-free methods in RLHF through a neural structured bandit prediction perspective.
- Optimizing Language Models for Inference Time Objectives using Reinforcement Learning, which investigates the merits of explicitly optimizing for inference time algorithmic performance during model training.
- RL-finetuning LLMs from on- and off-policy data with a single algorithm, which introduces a novel reinforcement learning algorithm for fine-tuning large language models.
- Is Best-of-N the Best of Them?, which analyzes the performance of inference-time alignment algorithms and introduces a new algorithm that mitigates reward hacking.
- Sharpe Ratio-Guided Active Learning for Preference Optimization in RLHF, which proposes an active learning approach to efficiently select prompt and preference pairs using a risk assessment strategy based on the Sharpe Ratio.
- Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback, which explores data-driven bottlenecks in RLHF performance scaling and introduces a hybrid reward system and a novel prompt-selection method.
- Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning, which enhances the exploration-exploitation tradeoff by dynamically assigning weights to generated outputs based on their advantage and entropy.
- Probabilistic Uncertain Reward Model, which proposes a natural generalization of the classical Bradley-Terry reward model to address reward hacking.

These papers demonstrate significant advancements in the field of RLHF, offering new perspectives and solutions to long-standing challenges.
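For reference, the classical Bradley-Terry model that the Probabilistic Uncertain Reward Model paper generalizes scores a preference pair by passing the difference of scalar rewards through a sigmoid; reward models are typically trained by minimizing the negative log-likelihood of observed preferences under this model. A minimal sketch of that standard formulation:

```python
import math

def bradley_terry_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the 'chosen' response is preferred,
    given scalar reward-model scores for each response."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of the observed preference; this is the
    standard training objective for Bradley-Terry reward models."""
    return -math.log(bradley_terry_prob(r_chosen, r_rejected))
```

Equal rewards yield probability 0.5, and the loss shrinks as the chosen response's reward margin grows.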
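The Best-of-N baseline analyzed in Is Best-of-N the Best of Them is simple to state: sample N candidate responses and return the one the reward model scores highest. A minimal sketch, where the `generate` and `reward` callables stand in for an LLM sampler and a reward model (both are placeholders, not any particular library's API):

```python
def best_of_n(prompt, generate, reward, n=8):
    """Best-of-N sampling: draw n candidate responses for the prompt
    and return the candidate with the highest reward-model score."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=reward)
```

Because selection optimizes the reward model directly, larger N sharpens quality but also amplifies reward hacking when the reward model is imperfect, which is the failure mode the paper's proposed algorithm targets.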
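The Sharpe Ratio borrowed by the active-learning paper is, in its classical form, a mean divided by a standard deviation, i.e. a risk-adjusted score. One plausible instantiation of a Sharpe-based selection rule is sketched below; the ranking direction and helper names are illustrative assumptions, not the paper's exact method:

```python
import statistics

def sharpe_ratio(rewards, eps=1e-8):
    """Classical Sharpe ratio: mean reward over its standard deviation.
    `eps` guards against zero variance."""
    return statistics.mean(rewards) / (statistics.pstdev(rewards) + eps)

def select_prompts(prompt_rewards, k):
    """Illustrative selection rule (assumption, not the paper's exact
    criterion): rank prompts by the Sharpe ratio of their sampled
    rewards and keep the top k. `prompt_rewards` maps each prompt to a
    list of reward samples."""
    ranked = sorted(prompt_rewards,
                    key=lambda p: sharpe_ratio(prompt_rewards[p]),
                    reverse=True)
    return ranked[:k]
```

The appeal of a risk-adjusted score is that it distinguishes a prompt whose rewards are reliably moderate from one whose rewards are high on average but highly variable.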