Recent advances in 3D scene generation and perception are increasingly shaped by diffusion models, which are being used to bridge generation and perception tasks. A notable trend is the development of unified frameworks that generate high-quality 3D scenes while also improving perception models through mutual learning: joint-training diffusion models conditioned on semantic occupancy synthesize realistic scenes from text prompts and, in turn, strengthen perception tasks such as semantic occupancy prediction (see the sketch at the end of this section).

Multi-object novel view synthesis is another active direction: models are being extended to handle scenes containing multiple objects while keeping placement and appearance consistent across views. In driving simulation, recent work targets controllability and efficiency, with models designed to initialize and roll out scenes realistically while preserving inference efficiency and closed-loop realism. Photorealistic street-view synthesis from vehicle sensor data is likewise advancing through controllable video diffusion models that offer precise camera control and real-time rendering.

Object insertion is evolving with affordance-aware models that integrate objects seamlessly into scenes by modeling the interplay between foreground and background, while view synthesis via 3D lifting is being refined through progressive techniques that improve both the quality of the 3D representation and its rendering. Taken together, these developments point toward more integrated, controllable, and efficient solutions for 3D scene generation and perception.
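
To make the mutual-learning idea concrete, the following is a minimal, illustrative PyTorch sketch of a joint training step in which a text-conditioned diffusion model over semantic occupancy grids and a toy occupancy predictor are optimized together, with the perception model also training on grids sampled from the generator. All names (`OccupancyDenoiser`, `OccupancyPredictor`), shapes, the linear noise schedule, and the one-step sampling stand-in are assumptions for illustration, not the architecture of any particular method surveyed above.

```python
# Minimal, illustrative sketch of a mutual-learning step between a text-conditioned
# diffusion model over semantic occupancy grids and a toy occupancy predictor.
# Everything here (names, shapes, schedule, one-step sampling) is an assumption
# for illustration only, not the method of any specific paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 16   # semantic classes per voxel (assumed)
GRID = 16          # occupancy grid resolution, GRID^3 voxels (assumed)
TEXT_DIM = 64      # text-prompt embedding size (assumed)


class OccupancyDenoiser(nn.Module):
    """Predicts the noise added to a semantic occupancy grid, conditioned on a
    text embedding and a diffusion timestep."""
    def __init__(self):
        super().__init__()
        self.cond = nn.Linear(TEXT_DIM + 1, NUM_CLASSES)   # text + timestep -> per-class shift
        self.net = nn.Sequential(
            nn.Conv3d(NUM_CLASSES, 64, 3, padding=1), nn.SiLU(),
            nn.Conv3d(64, NUM_CLASSES, 3, padding=1),
        )

    def forward(self, xt, text_emb, t):
        shift = self.cond(torch.cat([text_emb, t[:, None]], dim=1))
        return self.net(xt + shift[:, :, None, None, None])


class OccupancyPredictor(nn.Module):
    """Toy perception head: per-voxel semantic logits from a sensor-like observation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(NUM_CLASSES, 64, 3, padding=1), nn.SiLU(),
            nn.Conv3d(64, NUM_CLASSES, 3, padding=1),
        )

    def forward(self, obs):
        return self.net(obs)


def diffusion_loss(denoiser, x0, text_emb):
    """Epsilon-prediction objective with a simple linear noise schedule (assumed)."""
    t = torch.rand(x0.size(0))
    eps = torch.randn_like(x0)
    alpha = (1.0 - t)[:, None, None, None, None]
    xt = alpha * x0 + (1.0 - alpha) * eps
    return F.mse_loss(denoiser(xt, text_emb, t), eps)


def mutual_step(denoiser, perceiver, real_occ, real_obs, text_emb, opt):
    """Joint step: fit the generator on real occupancy, then train the perception
    model on both real observations and grids sampled from the generator."""
    gen_loss = diffusion_loss(denoiser, real_occ, text_emb)

    # Crude one-step sample at mid-level noise (a real sampler iterates many steps).
    with torch.no_grad():
        t = torch.full((real_occ.size(0),), 0.5)
        xt = torch.randn_like(real_occ)
        eps = denoiser(xt, text_emb, t)
        fake_occ = (xt - 0.5 * eps) / 0.5          # invert the assumed linear schedule

    perc_loss = (
        F.cross_entropy(perceiver(real_obs), real_occ.argmax(dim=1))
        + F.cross_entropy(perceiver(fake_occ), fake_occ.argmax(dim=1))
    )

    loss = gen_loss + perc_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return gen_loss.item(), perc_loss.item()


if __name__ == "__main__":
    denoiser, perceiver = OccupancyDenoiser(), OccupancyPredictor()
    opt = torch.optim.Adam(
        list(denoiser.parameters()) + list(perceiver.parameters()), lr=1e-4
    )
    # Synthetic stand-ins for a dataset: one-hot grids, noisy observations, text embeddings.
    labels = torch.randint(0, NUM_CLASSES, (2, GRID, GRID, GRID))
    occ = F.one_hot(labels, NUM_CLASSES).permute(0, 4, 1, 2, 3).float()
    obs = occ + 0.1 * torch.randn_like(occ)
    txt = torch.randn(2, TEXT_DIM)
    print(mutual_step(denoiser, perceiver, occ, obs, txt, opt))
```

In a fuller setup, the synthetic grids would come from a multi-step sampler and the perception loss on generated data would typically be filtered or down-weighted; the sketch only shows the shape of the shared objective that lets generation and perception improve each other.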