Research on video anomaly understanding and long-term video comprehension is advancing rapidly, driven by the need for models that can handle complex, real-world scenarios. A notable trend is the development of benchmarks that move beyond simple anomaly detection toward a deeper understanding of causation, temporal relationships, and multimodal reasoning. By evaluating models on tasks such as abductive reasoning, hierarchical anomaly understanding, and long-term video comprehension, these benchmarks expose the limitations of current vision-language models and underscore the need for improved architectures and training strategies. To make anomaly detection in long videos more efficient and accurate, researchers are introducing techniques such as semi-automated annotation engines and anomaly-focused temporal samplers. New evaluation metrics are also being proposed to align more closely with human judgment criteria, enabling a more comprehensive assessment of model performance. Together, these developments push the boundaries of video understanding and pave the way for more robust and versatile models.
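To make the idea of an anomaly-focused temporal sampler concrete, the sketch below shows one plausible (hypothetical) formulation: given per-frame anomaly scores for a long video and a fixed frame budget, it reserves a small uniform floor for timeline coverage and spends the rest of the budget on the highest-scoring frames. The function name, parameters, and sampling policy are illustrative assumptions, not a specific method from the surveyed work.

```python
def anomaly_focused_sample(scores, budget, uniform_frac=0.25):
    """Select `budget` frame indices from per-frame anomaly scores,
    biasing toward high-scoring (likely anomalous) frames while
    reserving a uniform floor so normal context is still covered.

    Hypothetical sketch: real samplers may use learned scorers,
    segment-level pooling, or stochastic sampling instead.
    """
    n = len(scores)
    budget = min(budget, n)
    # Uniform floor: evenly spaced indices across the whole timeline.
    n_uniform = max(1, int(budget * uniform_frac))
    step = (n - 1) / max(1, n_uniform - 1)
    picked = {round(i * step) for i in range(n_uniform)}
    # Spend the remaining budget on the highest-scoring frames.
    by_score = sorted(range(n), key=lambda i: scores[i], reverse=True)
    for i in by_score:
        if len(picked) >= budget:
            break
        picked.add(i)
    return sorted(picked)
```

A denser variant could allocate the budget proportionally to segment-level anomaly mass rather than picking individual top-scoring frames; the key design choice is trading anomaly coverage against the uniform context frames that downstream reasoning models still need.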