Recent work in video understanding and multi-object tracking makes increasing use of temporal information to improve performance. In video object detection, using instance masks to guide feature aggregation has set new benchmarks for the speed-accuracy trade-off. In multi-object tracking of dynamic scenes such as team sports, pose-based virtual markers address occlusion and ID switches while demonstrating high accuracy and efficiency. In video semantic segmentation, class-level perceptual consistency frameworks provide personalized inter-class features and diversified intra-class features, leading to improved segmentation results. Time-symmetric tracking methodologies, together with studies of how TGOSPA metric parameters can be chosen to tailor performance evaluation to a given application, reflect ongoing efforts to refine tracking algorithms and the way they are assessed. Collectively, these developments indicate a shift towards more robust, efficient, and context-aware video analysis, with potential applications ranging from surveillance to autonomous driving.