The field of video generation is advancing rapidly, with a focus on improving both controllability and output quality. Recent work enables more precise control over video attributes such as motion and appearance, producing more realistic and coherent videos, with diffusion models and transformer architectures driving much of this progress. A significant trend is the use of frameworks and architectures that integrate multiple conditions and modalities, allowing for more flexible and controllable video generation. There is also growing interest in self-supervised and unsupervised methods that learn motion concepts and abstract object movements from video without requiring extensive labeled datasets. Noteworthy papers include Enabling Versatile Controls for Video Diffusion Models, which introduces a framework for fine-grained control over pre-trained video diffusion models; Mask$^2$DiT, which proposes a dual mask-based diffusion transformer for multi-scene long video generation; and VideoMage, which presents a unified framework for video customization over both multiple subjects and their interactive motions.
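To make the multi-condition trend concrete, the sketch below shows one generic way such conditioning can be wired up: heterogeneous control signals (e.g., text, depth, and motion features) are projected into a shared space, gated, and concatenated into a single conditioning sequence that a video diffusion backbone could cross-attend to. This is a minimal illustrative sketch, not the architecture of any of the papers cited above; all module names, dimensions, and the gating scheme are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn


class MultiConditionFusion(nn.Module):
    """Illustrative fusion of multiple control signals for conditional video
    generation: project each modality to a shared width, gate per modality,
    and concatenate along the sequence axis (hypothetical design)."""

    def __init__(self, text_dim=768, depth_dim=256, motion_dim=128, hidden_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.depth_proj = nn.Linear(depth_dim, hidden_dim)
        self.motion_proj = nn.Linear(motion_dim, hidden_dim)
        # Learnable gate weights one modality against the others per sample.
        self.gate = nn.Linear(hidden_dim * 3, 3)

    def forward(self, text_emb, depth_emb, motion_emb):
        # text_emb: (B, T_text, text_dim); depth/motion: (B, T_frames, dim)
        t = self.text_proj(text_emb)
        d = self.depth_proj(depth_emb)
        m = self.motion_proj(motion_emb)
        pooled = torch.cat([t.mean(1), d.mean(1), m.mean(1)], dim=-1)
        w = torch.softmax(self.gate(pooled), dim=-1)  # (B, 3) modality weights
        # Scale each stream by its gate, then concatenate so a downstream
        # cross-attention layer sees all conditions as one token sequence.
        fused = torch.cat([t * w[:, 0, None, None],
                           d * w[:, 1, None, None],
                           m * w[:, 2, None, None]], dim=1)
        return fused  # (B, T_text + 2 * T_frames, hidden_dim)


if __name__ == "__main__":
    fuser = MultiConditionFusion()
    text = torch.randn(2, 77, 768)    # e.g., CLIP-style text tokens (assumed)
    depth = torch.randn(2, 16, 256)   # per-frame depth features (assumed)
    motion = torch.randn(2, 16, 128)  # per-frame motion features (assumed)
    cond = fuser(text, depth, motion)
    print(cond.shape)                 # torch.Size([2, 109, 512])
```

The gating step is one of several reasonable choices here; published systems differ in how they weight or inject each condition (e.g., via adapters or separate cross-attention blocks), but the projection-then-fusion pattern captures the shared idea of exposing multiple modalities to a single pre-trained video diffusion backbone.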