Advances in Zero Reinforcement Learning for Large Language Models

The field of large language models is moving toward stronger reasoning capabilities, with a focus on integrating external search into the reasoning process and improving chain-of-thought generation. Recent work shows that reinforcement learning applied directly to base models, without supervised fine-tuning ("zero" reinforcement learning), can substantially enhance reasoning. Researchers are exploring new frameworks and training objectives, such as the ReSearch framework and chain-of-thought optimization via Jensen's evidence lower bound. Notably, some studies find that even small base models can develop advanced cognitive behaviors, such as self-verification, without extensive training data. Noteworthy papers include SimpleRL-Zoo, which investigates zero reinforcement learning across diverse base models and reports substantial gains in reasoning accuracy and response length, and ReSearch, which trains large language models to interleave reasoning with search via reinforcement learning and demonstrates strong generalization across benchmarks.
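
The Jensen's-bound approach mentioned above can be made concrete by treating the chain of thought as a latent variable. The sketch below is the standard derivation in generic notation (prompt x, latent chain of thought z, answer y, variational posterior q_phi), not the paper's own formulation: Jensen's inequality turns the intractable log-likelihood of the answer into a tractable evidence lower bound that can be optimized during chain-of-thought training.

```latex
% Latent-variable view of chain-of-thought:
%   x = prompt, z = chain of thought (latent), y = final answer.
% Applying Jensen's inequality to the marginal likelihood yields the ELBO:
\begin{aligned}
\log p_\theta(y \mid x)
  &= \log \mathbb{E}_{z \sim q_\phi(z \mid x, y)}
     \!\left[ \frac{p_\theta(y, z \mid x)}{q_\phi(z \mid x, y)} \right] \\
  &\ge \underbrace{\mathbb{E}_{z \sim q_\phi(z \mid x, y)}
     \big[ \log p_\theta(y \mid x, z) \big]}_{\text{answer likelihood under sampled reasoning}}
   \;-\; \underbrace{\mathrm{KL}\!\big( q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x) \big)}_{\text{keep posterior reasoning close to the prior}}.
\end{aligned}
```

Maximizing this bound jointly over the model parameters and the posterior improves both the answer likelihood and the quality of the sampled chains of thought; how the expectation and KL term are estimated in practice is specific to the cited paper and not reproduced here.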

Sources

SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning

Learning to chain-of-thought with Jensen's evidence lower bound

Understanding R1-Zero-Like Training: A Critical Perspective
