Advancements in Large Language Models for Code Generation and Evaluation
The field of large language models (LLMs) is evolving rapidly, with significant developments in code generation, evaluation, and debugging. Recent work focuses on improving the ability of LLMs to generate high-quality code, detect errors, and provide actionable feedback to developers. Benchmarks such as DSDBench, which evaluates LLMs as data science code debuggers on multi-hop, multi-bug errors, and CodeARC, which targets inductive program synthesis, enable systematic measurement of these capabilities. Applying LLMs to automated testing, as in Copilot for Testing, shows promise for improving the efficiency and accuracy of software testing, while frameworks such as MaintainCoder and SRLCG aim to improve the maintainability and scalability of generated code. Together, these advances point toward more efficient, reliable, and maintainable code generation and evaluation across software development.
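To make the multi-hop, multi-bug setting concrete: such benchmarks present programs in which several defects coexist and error symptoms surface far from their root causes. The sketch below is purely illustrative; the toy program, its bugs, and the pandas columns are assumptions for exposition and are not taken from DSDBench.

```python
# Illustrative only: a toy data science snippet in the style of multi-hop,
# multi-bug debugging benchmarks (not an example from DSDBench itself).
import pandas as pd

def load_scores() -> pd.DataFrame:
    # Bug 1 (root cause): scores are loaded as strings rather than numbers.
    return pd.DataFrame({"group": ["a", "a", "b"], "score": ["1", "2", "3"]})

def mean_per_group(df: pd.DataFrame) -> pd.Series:
    # The symptom of Bug 1 surfaces here, away from its cause: aggregating
    # an object-dtype column typically raises a TypeError in recent pandas.
    return df.groupby("group")["score"].mean()

def report(df: pd.DataFrame) -> float:
    means = mean_per_group(df)
    # Bug 2 (independent): wrong key casing; the groups are lowercase, so
    # this KeyError only appears after Bug 1 has been fixed.
    return means["B"]

if __name__ == "__main__":
    # Running this surfaces only the first error; a multi-bug benchmark
    # expects a debugger to localize every defect, not just the first.
    print(report(load_scores()))
```

Evaluating on programs like this requires a model to trace a failure back across function calls and to report all defects, which is what distinguishes multi-hop, multi-bug debugging from single-error localization.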
Sources
Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors
Generative Reliability-Based Design Optimization Using In-Context Learning Capabilities of Large Language Models
CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation
Towards an intelligent assessment system for evaluating the development of algorithmic thinking skills: An exploratory study in Swiss compulsory schools
RobuNFR: Evaluating the Robustness of Large Language Models on Non-Functional Requirements Aware Code Generation
A Multi-agent Onboarding Assistant based on Large Language Models, Retrieval Augmented Generation, and Chain-of-Thought
Beyond Detection: Designing AI-Resilient Assessments with Automated Feedback Tool to Foster Critical Thinking
SRLCG: Self-Rectified Large-Scale Code Generation with Multidimensional Chain-of-Thought and Dynamic Backtracking