Video-based Human Activity Recognition

Report on Current Developments in Video-based Human Activity Recognition

General Direction of the Field

The field of video-based Human Activity Recognition (HAR) is witnessing a significant shift towards more sophisticated and context-aware models, driven by the need for enhanced surveillance systems, educational tools, and real-time anomaly detection. Recent advancements are characterized by a focus on fine-grained action recognition, multimodal data integration, and semi-supervised learning approaches. These developments aim to address the complexities of human behavior in diverse and dynamic environments, such as classrooms, sports arenas, and public spaces.

One key trend is the integration of deep learning techniques with traditional machine learning methods to improve the accuracy and robustness of HAR systems. Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and transformers are being combined with compositional query machines and consistency loss functions to better model the spatio-temporal interactions of human activities. These multimodal approaches leverage the complementary strengths of different data streams, such as RGB frames and skeletal data, to improve overall recognition performance.
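To make the idea of a cross-modal consistency objective concrete, the sketch below shows one common way such a term can be formed: class predictions from an RGB branch and a skeleton branch are pushed to agree via a symmetric KL divergence added to the usual supervised loss. This is a minimal illustration in PyTorch, not the specific formulation of any cited paper; `rgb_model`, `skeleton_model`, `temperature`, and `lam` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def cross_modal_consistency_loss(rgb_logits, skel_logits, temperature=2.0):
    """Symmetric KL divergence encouraging the RGB and skeleton branches to agree.

    Both inputs are unnormalized class logits of shape (batch, num_classes).
    """
    log_p_rgb = F.log_softmax(rgb_logits / temperature, dim=1)
    log_p_skel = F.log_softmax(skel_logits / temperature, dim=1)
    kl_rgb_to_skel = F.kl_div(log_p_rgb, log_p_skel.exp(), reduction="batchmean")
    kl_skel_to_rgb = F.kl_div(log_p_skel, log_p_rgb.exp(), reduction="batchmean")
    return 0.5 * (kl_rgb_to_skel + kl_skel_to_rgb)

def training_loss(rgb_model, skeleton_model, rgb_clip, skeleton_seq, labels, lam=0.5):
    """Supervised loss on each branch plus a weighted consistency term (illustrative)."""
    rgb_logits = rgb_model(rgb_clip)            # e.g. a 3D CNN or video transformer
    skel_logits = skeleton_model(skeleton_seq)  # e.g. a graph or recurrent model over joints
    supervised = F.cross_entropy(rgb_logits, labels) + F.cross_entropy(skel_logits, labels)
    consistency = cross_modal_consistency_loss(rgb_logits, skel_logits)
    return supervised + lam * consistency
```

In practice the two branches are trained jointly, and the weight on the consistency term is tuned so that the agreement constraint does not dominate the supervised signal.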

Another notable trend is the increasing emphasis on semi-supervised and few-shot learning methods. These approaches are particularly valuable in scenarios where labeled data is scarce or expensive to obtain. By leveraging unlabeled data and novel alignment techniques, researchers are developing models that can generalize better to new and unseen classes, thereby reducing the dependency on extensive manual annotations.
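As a minimal sketch of the pseudo-labelling idea underlying such semi-supervised pipelines, the snippet below keeps only high-confidence predictions on unlabeled clips and mixes them into the supervised loss. The threshold, weighting, and model/optimizer names are illustrative assumptions; methods such as FinePseudo go further by verifying candidate pseudo-labels (e.g. through temporal alignability) rather than relying on confidence alone.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_pseudo_labels(model, unlabeled_clips, threshold=0.95):
    """Return only the clips whose predicted class confidence exceeds the threshold."""
    model.eval()
    probs = F.softmax(model(unlabeled_clips), dim=1)
    confidence, pseudo_labels = probs.max(dim=1)
    mask = confidence >= threshold
    return unlabeled_clips[mask], pseudo_labels[mask]

def semi_supervised_step(model, optimizer, labeled_clips, labels, unlabeled_clips, mu=1.0):
    """One training step mixing labeled data with confident pseudo-labelled data."""
    pl_clips, pl_labels = generate_pseudo_labels(model, unlabeled_clips)
    model.train()
    loss = F.cross_entropy(model(labeled_clips), labels)
    if pl_clips.shape[0] > 0:  # add the unsupervised term only when some clips pass the threshold
        loss = loss + mu * F.cross_entropy(model(pl_clips), pl_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```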

The field is also seeing a rise in the development of specialized datasets and benchmarks tailored to specific applications, such as classroom surveillance and repetitive action counting in sports. These datasets not only provide a rich source of data for training and evaluation but also highlight the unique challenges associated with different real-world scenarios, such as occlusion, varied shooting angles, and dense object engagement.

Noteworthy Innovations

  1. FinePseudo: Introduces a novel alignment-based metric learning technique for semi-supervised fine-grained action recognition, significantly outperforming prior methods on multiple datasets.

  2. COMPUTER: Proposes a compositional query machine that effectively integrates multimodal data for robust human activity recognition, demonstrating superior performance in action localization and group activity recognition tasks.

  3. MultiCounter: Develops an end-to-end framework for simultaneous detection, tracking, and counting of repetitive actions in untrimmed videos, setting a new benchmark in multi-instance repetitive action counting.

These innovations represent significant strides in advancing the field of video-based human activity recognition, offering new methodologies and frameworks that address the complexities and challenges inherent in real-world applications.

Sources

A Critical Analysis on Machine Learning Techniques for Video-based Human Activity Recognition of Surveillance Systems: A Review

FinePseudo: Improving Pseudo-Labelling through Temporal-Alignability for Semi-Supervised Fine-Grained Action Recognition

Towards Student Actions in Classroom Scenes: New Dataset and Baseline

Unified Framework with Consistency across Modalities for Human Activity Recognition

SITAR: Semi-supervised Image Transformer for Action Recognition

Unveiling Context-Related Anomalies: Knowledge Graph Empowered Decoupling of Scene and Action for Human-Related Video Anomaly Detection

Few-Shot Continual Learning for Activity Recognition in Classroom Surveillance Images

MultiCounter: Multiple Action Agnostic Repetition Counting in Untrimmed Videos

Introducing Gating and Context into Temporal Action Detection