Recent developments in this research area show a strong focus on integrating multimodal data and applying advanced machine learning techniques to human-centered applications. A notable trend is the adoption of contrastive learning and transformer-based models for tasks such as rehabilitation exercise interpretation and procedural mistake detection, where the goal is more accurate and interpretable feedback for domains like healthcare and task automation (a minimal sketch of the contrastive-alignment idea appears below). There is also growing interest in unifying image and video segmentation under multi-modal large language models, which promises both simpler pipelines and stronger performance than task-specific segmentation models. The integration of text, video, and image data is likewise being explored to produce more cohesive and informative procedural plans that address the limitations of unimodal approaches, and new fusion and bridging techniques are improving how the different modalities interact, yielding more coherent outputs. Overall, the field is moving toward integrated, interpretable, and versatile solutions that combine the strengths of multiple data types with advanced machine learning models.
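
The works summarized above do not commit to a single formulation, but the contrastive-alignment trend can be illustrated with a generic CLIP-style objective that pulls paired embeddings from two modalities together and pushes mismatched pairs apart. The sketch below is illustrative only: the function name, the choice of video/text modalities, and the temperature value are assumptions, not details taken from any specific paper discussed here.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning paired video and text embeddings.

    video_emb, text_emb: (batch, dim) tensors where row i of each tensor
    describes the same sample, e.g. an exercise clip and its textual
    description or feedback. (Illustrative sketch, not a specific
    published method.)
    """
    # Normalize so the dot product is cosine similarity.
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarities; diagonal entries are the matching pairs.
    logits = video_emb @ text_emb.t() / temperature

    # Each video should match the text at the same batch index, and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2t + loss_t2v)
```

In a rehabilitation-feedback or mistake-detection setting, the two inputs would typically come from a video encoder (e.g. a transformer over pose or frame features) and a text encoder, and the shared embedding space learned by such an objective is what lets a system retrieve or generate interpretable textual feedback for an observed action.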