Recent work in music generation and synthesis has shifted markedly toward deep learning and diffusion models as a way to improve the quality and diversity of generated content. Researchers increasingly condition generation on multi-modal inputs, such as text and images, which improves the coherence and emotional alignment of the output. Reference-based diffusion networks and cascaded flow matching techniques, in particular, are proving effective for high-fidelity music generation, especially in singing voice synthesis. At the same time, there is a growing emphasis on proactive protection technologies that mitigate the risks of unauthorized speech synthesis and safeguard the privacy and security of voice data. Together, these developments push the boundaries of music generation, offering new tools for creative applications while addressing critical issues in voice protection.
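To make the flow matching idea concrete, the sketch below shows a single conditional flow-matching training step in the standard (rectified-flow) formulation: interpolate between a noise sample and a clean feature frame, and regress a network onto the constant velocity along that path, given a conditioning embedding (e.g. a text or reference encoding). This is a minimal illustration under those standard assumptions, not the architecture of any system surveyed here; names such as `VelocityNet` and `flow_matching_loss` are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity predictor: maps (noisy sample, time, condition) -> velocity."""
    def __init__(self, feat_dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + 1 + cond_dim, 256),
            nn.SiLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, x_t, t, cond):
        # Concatenate noisy features, scalar time, and conditioning embedding.
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(model, x1, cond):
    """One conditional flow-matching training step (rectified-flow style).

    x1:   clean acoustic features, shape (batch, feat_dim)
    cond: conditioning embedding (e.g. text or reference encoding), shape (batch, cond_dim)
    """
    x0 = torch.randn_like(x1)          # noise endpoint of the path
    t = torch.rand(x1.size(0), 1)      # uniform time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1      # linear interpolation between noise and data
    target_v = x1 - x0                 # constant target velocity along the path
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()

# Usage with random tensors standing in for mel-spectrogram frames and a condition vector.
model = VelocityNet(feat_dim=80, cond_dim=32)
x1 = torch.randn(4, 80)
cond = torch.randn(4, 32)
loss = flow_matching_loss(model, x1, cond)
loss.backward()
```

At inference time, samples are generated by integrating the learned velocity field from noise to data (e.g. with a few Euler steps); cascaded variants chain such stages at increasing resolutions or from coarse to fine acoustic representations.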