Semantic-Driven Evaluation and Comprehensive Benchmarking in Software Engineering

Recent software engineering research increasingly applies large language models (LLMs) to code quality assessment, testing, and evaluation. One clear trend is evaluation frameworks that fold in semantic understanding and human-like judging criteria: LLMs are used for recursive semantic comprehension of code and for role-playing reviewer personas, so that automated scores for code summaries and code quality align more closely with human judgment.

A second thread is comprehensive benchmarking. New benchmarks assess LLMs on full-stack programming and on generating complex unit tests across diverse programming languages, mirroring real-world usage; they measure not only whether a model can produce code, but also how well it adapts to multi-stage feedback and handles complex dependencies. In parallel, automated unit test generation is being guided by attention mechanisms, helping the generated tests trigger errors and expose defects more reliably.

Finally, combining knowledge units (KUs) of programming languages with traditional code metrics is emerging as a promising way to strengthen post-release defect prediction models, offering a more nuanced view of a software system. Taken together, the current landscape blends semantic-driven evaluation, comprehensive benchmarking, and improved defect detection techniques, all aimed at advancing the state of the art in software engineering.
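
To make the LLM-as-judge idea above concrete, the following minimal sketch scores a code summary against a small rubric using reviewer personas. It is an illustration of the general pattern, not the method of the cited papers; complete(), the rubric, and the prompt wording are all assumptions standing in for whatever LLM API and criteria a given study uses.

```python
"""Illustrative sketch: rubric-based, persona-driven LLM scoring of a code summary."""

import json
import re
from statistics import mean


def complete(prompt: str) -> str:
    """Placeholder: call your LLM of choice and return its text response."""
    raise NotImplementedError("Wire this up to an actual LLM API.")


RUBRIC = ["accuracy", "conciseness", "fluency", "usefulness to a maintainer"]

PROMPT_TEMPLATE = """You are a senior {role} reviewing documentation.
Rate the summary of the code below on each criterion from 1 (poor) to 5 (excellent).
Respond with a JSON object mapping each criterion to an integer score.

Criteria: {criteria}

Code:
{code}

Candidate summary:
{summary}
"""


def judge_summary(code: str, summary: str,
                  roles=("software engineer", "technical writer")) -> float:
    """Average rubric scores over several reviewer personas."""
    scores = []
    for role in roles:
        prompt = PROMPT_TEMPLATE.format(
            role=role, criteria=", ".join(RUBRIC), code=code, summary=summary
        )
        reply = complete(prompt)
        # Tolerate extra prose around the JSON object the model returns.
        match = re.search(r"\{.*\}", reply, re.DOTALL)
        if not match:
            continue
        ratings = json.loads(match.group(0))
        scores.extend(int(ratings[c]) for c in RUBRIC if c in ratings)
    return mean(scores) if scores else float("nan")
```

Averaging over personas and criteria is one simple way to smooth out the variance of individual LLM judgments; the cited work explores more elaborate schemes.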
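Similarly, the KU-plus-metrics direction boils down to treating KU usage counts as additional features next to traditional size, complexity, and churn metrics. The sketch below shows that feature-combination step with scikit-learn; the CSV layout, column names, and choice of classifier are assumptions for illustration, not the cited study's pipeline.

```python
"""Illustrative sketch: KU counts + traditional metrics as features for defect prediction."""

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumed input: one row per file/module with KU usage counts, traditional
# metrics, and a binary post-release defect label.
data = pd.read_csv("modules.csv")

ku_features = [c for c in data.columns if c.startswith("ku_")]   # e.g. ku_generics, ku_concurrency
metric_features = ["loc", "cyclomatic_complexity", "churn", "num_authors"]

X = data[ku_features + metric_features]
y = data["post_release_defect"]

model = RandomForestClassifier(n_estimators=300, random_state=0)

# Out-of-sample AUC with KU features added alongside the traditional metrics;
# comparing against a metrics-only baseline shows what the KUs contribute.
auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
print(f"Mean cross-validated AUC: {auc:.3f}")
```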

Sources

SoK: Detection and Repair of Accessibility Issues

WDD: Weighted Delta Debugging

Human-Like Code Quality Evaluation through LLM-based Recursive Semantic Comprehension

A Feedback Toolkit and Procedural Guidance for Teaching Thorough Testing

FullStack Bench: Evaluating LLMs as Full Stack Coders

What You See Is What You Get: Attention-based Self-guided Automatic Unit Test Generation

Can Large Language Models Serve as Evaluators for Code Summarization?

Commit0: Library Generation from Scratch

CPP-UT-Bench: Can LLMs Write Complex Unit Tests in C++?

TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved?

Predicting post-release defects with knowledge units (KUs) of programming languages: an empirical study

System Test Case Design from Requirements Specifications: Insights and Challenges of Using ChatGPT
