Evaluation of Large Language Models

Report on Current Developments in the Evaluation of Large Language Models

General Direction of the Field

Recent advances in the evaluation of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shifted the focus towards more rigorous and realistic assessments of model capabilities. The field is moving towards benchmarks that test not only performance on traditional tasks but also reasoning, planning, and lateral thinking in more complex and dynamic scenarios. This shift is driven by the need to ensure that models are not merely proficient at known tasks but can also generalize to out-of-distribution and novel situations.

One key trend is the development of benchmarks that minimize the influence of domain-specific knowledge, thereby isolating pure reasoning and problem-solving skills. This approach aims to evaluate model capabilities more accurately in scenarios where prior knowledge confers little advantage. In addition, there is growing emphasis on interactive and game-based benchmarks that simulate real-world interactions, allowing a more comprehensive assessment of how models behave in dynamic environments.
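
To make this concrete, a knowledge-orthogonal test item can be built around a rule invented on the spot, so that memorized domain knowledge contributes nothing and only the ability to apply a newly stated rule is measured. The sketch below is illustrative only and does not reproduce KOR-Bench's actual format; query_model is a hypothetical wrapper around whichever model is under test.

```python
# Minimal sketch of a knowledge-orthogonal evaluation item. The rule is
# invented on the spot, so prior domain knowledge cannot help; only the
# model's ability to apply a newly stated rule is measured.
# query_model(prompt) -> str is a hypothetical wrapper around the LLM
# under test, not part of any specific benchmark's code.

def build_item():
    rule = (
        "Define the operator '#': a # b is computed by adding a and b, "
        "then reversing the digits of the sum."
    )
    question = "Compute 47 # 38 and reply with the number only."
    answer = "58"  # 47 + 38 = 85; reversing the digits gives 58
    return rule, question, answer

def score_item(query_model):
    rule, question, answer = build_item()
    prediction = query_model(f"{rule}\n{question}").strip()
    return prediction == answer  # exact-match scoring on the final number
```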

Another notable trend is the use of real-world data and user interactions to build evaluation datasets. This approach mitigates the risk of benchmark contamination, where a model appears to succeed by reproducing test items memorized during training, and aligns assessments more closely with genuine user needs and expectations. Multi-turn interactions and simulated interactive games are becoming increasingly common, reflecting the need for models to handle complex, multi-step reasoning tasks.
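
As a hedged illustration of such a multi-turn setup, the sketch below shows a generic player-judge loop in the spirit of situation-puzzle or yes/no-puzzle evaluation: the evaluated model asks questions about a hidden story, a judge model that knows the story answers them, and the exchange ends when the player commits to a solution. The player and judge callables are hypothetical stand-ins and do not reproduce any specific paper's protocol.

```python
# Hypothetical multi-turn player-judge loop for puzzle-style evaluation.
# `player` and `judge` are assumed to be callables that map a prompt
# string to a text reply; they are stand-ins, not a specific paper's API.

def play_puzzle(player, judge, surface_story, hidden_story, max_turns=10):
    transcript = []
    for _ in range(max_turns):
        # The player either asks a yes/no question or commits to an answer.
        move = player(
            f"Puzzle: {surface_story}\n"
            f"Dialogue so far: {transcript}\n"
            "Ask one yes/no question, or write 'ANSWER: <your solution>'."
        ).strip()
        if move.upper().startswith("ANSWER:"):
            verdict = judge(
                f"Hidden story: {hidden_story}\n"
                f"Proposed solution: {move}\n"
                "Reply with 'correct' or 'incorrect'."
            )
            return verdict.strip().lower() == "correct", transcript
        # The judge, who knows the hidden story, answers the question.
        reply = judge(
            f"Hidden story: {hidden_story}\n"
            f"Question: {move}\n"
            "Reply with 'yes', 'no', or 'irrelevant'."
        )
        transcript.append((move, reply.strip()))
    return False, transcript  # ran out of turns without a correct answer
```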

Overall, the field is progressing towards more robust, dynamic, and realistic benchmarks that challenge models in ways that are more representative of real-world applications. This shift is expected to drive further advancements in the development of LLMs and MLLMs, pushing the boundaries of what these models can achieve in terms of reasoning, planning, and creative thinking.

Noteworthy Papers

  • KOR-Bench: Introduces Knowledge-Orthogonal Reasoning to minimize domain-specific knowledge impact, focusing on pure reasoning abilities in out-of-distribution scenarios.
  • TurtleBench: Utilizes real-world user interactions to create a dynamic evaluation dataset, enhancing the reliability of LLM assessments.
  • GameTraversalBenchmark (GTB): Proposes a benchmark for evaluating LLMs' planning abilities through traversing 2D game maps, highlighting model performance on complex spatial reasoning tasks (a minimal traversal-check sketch follows this list).
  • SPLAT: Leverages Situation Puzzles to evaluate and elicit lateral thinking in LLMs, using a multi-turn player-judge framework to simulate interactive games.
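
For the map-traversal style of planning evaluation referenced above, the core scoring step can be pictured as validating a model-proposed move sequence against a grid. The routine below is a minimal, hypothetical illustration of that check, not GTB's actual scorer; the grid encoding and move alphabet are assumptions made for this example.

```python
# Hypothetical traversal check for planning evaluation on a 2D grid map:
# '.' marks walkable cells and '#' marks walls. A model-proposed move
# sequence passes only if it stays on walkable cells and ends at the goal.
# This is an illustration, not the GameTraversalBenchmark scorer.

MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def follows_plan(grid, start, goal, moves):
    rows, cols = len(grid), len(grid[0])
    r, c = start
    for m in moves:
        dr, dc = MOVES[m]
        r, c = r + dr, c + dc
        if not (0 <= r < rows and 0 <= c < cols) or grid[r][c] == "#":
            return False  # stepped off the map or into a wall
    return (r, c) == goal  # must finish exactly on the objective

grid = ["....",
        ".##.",
        "...."]
print(follows_plan(grid, (0, 0), (2, 3), "DDRRR"))  # True: legal path to goal
```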

Sources

How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics

TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

ING-VP: MLLMs cannot Play Easy Vision-based Games Yet

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles

GameTraversalBenchmark: Evaluating Planning Abilities Of Large Language Models Through Traversing 2D Game Maps
