Audio-driven human motion generation is advancing rapidly, with a focus on producing more realistic and expressive movements. Researchers are exploring new architectures and techniques to improve the quality and diversity of generated motions, including diffusion models, transformers, and recurrent embedded transformers. These innovations have yielded markedly more coherent and natural-looking movements, such as gestures and body language synchronized with speech. Noteworthy papers in this area include DIDiffGes, which achieves real-time, high-quality, and expressive gesture generation from speech, and ReCoM, a framework for generating high-fidelity, generalizable body motions synchronized with speech that reports state-of-the-art performance across multiple metrics.
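To make the diffusion-based approach concrete, the following is a minimal sketch of reverse-diffusion sampling of a pose vector conditioned on audio features. The denoiser, feature dimensions, and noise schedule here are hypothetical stand-ins: a real system such as DIDiffGes would use a learned network (e.g. a transformer) in place of the toy linear `denoise_step`.

```python
import numpy as np

def denoise_step(x_t, t, audio_feat, alphas, alpha_bars, rng):
    # Placeholder "denoiser": predicts the noise from the current pose and
    # audio features. A trained model would replace this linear stand-in.
    eps_hat = 0.1 * x_t + 0.05 * audio_feat.mean()
    alpha_t = alphas[t]
    ab_t = alpha_bars[t]
    # Standard DDPM posterior mean for one reverse step.
    mean = (x_t - (1.0 - alpha_t) / np.sqrt(1.0 - ab_t) * eps_hat) / np.sqrt(alpha_t)
    if t > 0:
        # Inject noise on all but the final step.
        mean = mean + np.sqrt(1.0 - alpha_t) * rng.standard_normal(x_t.shape)
    return mean

def generate_motion(audio_feat, num_steps=50, pose_dim=63, seed=0):
    """Sample a pose vector by reverse diffusion, conditioned on audio."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, num_steps)   # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(pose_dim)            # start from pure noise
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t, audio_feat, alphas, alpha_bars, rng)
    return x

audio_feat = np.zeros(128)   # stand-in for e.g. mel-spectrogram features
pose = generate_motion(audio_feat)
print(pose.shape)
```

The loop runs the familiar DDPM recipe: start from Gaussian noise, repeatedly subtract the predicted noise scaled by the schedule, and add stochasticity at every step except the last. Conditioning on speech enters only through the denoiser's inputs, which is how audio-to-motion diffusion models keep the generated gestures synchronized with the audio.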