Advancements in Large Language Models for Code Generation and Evaluation

Research on large language models (LLMs) for software engineering is advancing quickly, with notable progress in code generation, evaluation, and debugging. Recent work concentrates on generating higher-quality code, detecting and localizing errors, and giving developers actionable feedback. New benchmarks enable more systematic evaluation: DSDBench assesses LLMs as data science code debuggers on multi-hop, multi-bug errors, while CodeARC measures the reasoning capabilities of LLM agents on inductive program synthesis. On the tooling side, Copilot for Testing applies context-based retrieval-augmented generation to extend LLM assistance from code generation into software testing, and frameworks such as MaintainCoder and SRLCG target maintainable code generation under dynamic requirements and self-rectified large-scale generation via multidimensional chain-of-thought with dynamic backtracking, respectively. Together, these advances point toward more efficient, reliable, and maintainable workflows for generating and evaluating code.
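
To make the inductive program synthesis setting concrete, the sketch below shows one way such an evaluation can be scored: a candidate program produced from visible input-output examples is executed against held-out tests. This is a minimal illustration in the spirit of CodeARC-style evaluation, not the benchmark's actual harness; the entry-point name `solve`, the example data, and `evaluate_candidate` are all illustrative assumptions.

```python
# Minimal sketch of scoring an inductively synthesized program.
# All names (solve, evaluate_candidate, the example data) are assumptions
# for illustration, not the CodeARC API.

def evaluate_candidate(candidate_src: str, hidden_tests: list[tuple]) -> bool:
    """Execute an LLM-generated function and verify it on held-out I/O pairs."""
    namespace: dict = {}
    exec(candidate_src, namespace)   # run the generated code in a fresh namespace
    fn = namespace["solve"]          # assumed entry-point name for the candidate
    return all(fn(*inp) == out for inp, out in hidden_tests)

# Visible examples the model reasons over (the inductive specification).
visible_examples = [((2,), 4), ((3,), 9), ((5,), 25)]

# A candidate an LLM might return for "square the input".
candidate = "def solve(x):\n    return x * x"

# Held-out tests the model never sees, used only for scoring.
hidden = [((7,), 49), ((10,), 100)]

print(evaluate_candidate(candidate, hidden))  # True if the induced program generalizes
```

The point of the split between visible examples and hidden tests is that the model must generalize the underlying function rather than memorize the given pairs, which is what distinguishes inductive synthesis benchmarks from ordinary unit-test-based code generation.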

Sources

Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors

Generative Reliability-Based Design Optimization Using In-Context Learning Capabilities of Large Language Models

CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation

Towards an intelligent assessment system for evaluating the development of algorithmic thinking skills: An exploratory study in Swiss compulsory schools

RobuNFR: Evaluating the Robustness of Large Language Models on Non-Functional Requirements Aware Code Generation

AI Delivers Creative Output but Struggles with Thinking Processes

CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

A Multi-agent Onboarding Assistant based on Large Language Models, Retrieval Augmented Generation, and Chain-of-Thought

Beyond Detection: Designing AI-Resilient Assessments with Automated Feedback Tool to Foster Critical Thinking

Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics

MaintainCoder: Maintainable Code Generation Under Dynamic Requirements

SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

Form-Substance Discrimination: Concept, Cognition, and Pedagogy

Curriculum Design of Competitive Programming: a Contest-based Approach

SRLCG: Self-Rectified Large-Scale Code Generation with Multidimensional Chain-of-Thought and Dynamic Backtracking

Open, Small, Rigmarole -- Evaluating Llama 3.2 3B's Feedback for Programming Exercises

Automated Factual Benchmarking for In-Car Conversational Systems using Large Language Models

Facilitating Instructors-LLM Collaboration for Problem Design in Introductory Programming Classrooms

Grade Guard: A Smart System for Short Answer Automated Grading

From Code Generation to Software Testing: AI Copilot with Context-Based RAG

On Simulation-Guided LLM-based Code Generation for Safe Autonomous Driving Software

BOOST: Bootstrapping Strategy-Driven Reasoning Programs for Program-Guided Fact-Checking

"I Feel Like I'm Teaching in a Gladiator Ring": Barriers and Benefits of Live Coding in Classroom Settings
