Efficiency and Scalability in Computer Vision Models

The field of computer vision is witnessing a significant shift towards more efficient and scalable models, particularly for handling high-resolution images and complex visual tasks. State-Space Models (SSMs) and Recurrent Neural Networks (RNNs) are emerging as powerful alternatives to traditional Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), offering linear complexity and reduced computational costs. Innovations in model architecture, such as natively multidimensional SSMs and simplified RNN units, are enabling more effective modeling of spatial dependencies and long-range interactions in visual data.

Additionally, there is a growing emphasis on models that can efficiently process gigapixel images, such as whole slide images in medical diagnostics, by combining local inductive biases with global information. The field is also seeing advances in specific applications, including 3D lane detection, road network extraction, and intelligent road inspection, where novel architectures and training-free compression methods are improving both accuracy and computational efficiency.
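The linear complexity mentioned above comes from the recurrent form of an SSM: the sequence is processed in a single pass, updating a fixed-size hidden state at each step, so cost grows as O(L) in sequence length rather than the O(L²) of self-attention. A minimal sketch of a diagonal 1D state-space scan (all names and parameter choices here are illustrative, not any specific paper's implementation):

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Diagonal linear state-space recurrence over a 1D sequence.

        x_t = A * x_{t-1} + B * u_t   (elementwise; A, B diagonal)
        y_t = C . x_t                 (linear readout)

    One pass over the sequence: O(L * d) in length L and state size d.
    """
    d = A.shape[0]
    x = np.zeros(d)
    ys = []
    for u_t in u:              # single scan direction, linear in L
        x = A * x + B * u_t    # fixed-size state update
        ys.append(C @ x)       # per-step output
    return np.array(ys)

# Toy example: 8-step scalar input sequence, 4-dim hidden state.
rng = np.random.default_rng(0)
L, d = 8, 4
u = rng.standard_normal(L)
A = np.full(d, 0.9)            # stable decay on each state channel
B = np.ones(d)
C = np.ones(d) / d
y = ssm_scan(u, A, B, C)
print(y.shape)  # (8,)
```

Vision models built on this idea (e.g. Mamba2D) must additionally choose how to scan a 2D grid of patches; a natively 2D formulation avoids flattening the image into an arbitrary 1D order.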

Noteworthy Papers

  • Mamba2D: Introduces a natively 2D state-space model for vision tasks, effectively modeling spatial dependencies with a single 2D scan direction.
  • VMeanba: Proposes a training-free compression method for SSMs, optimizing computation by averaging activation maps across channels.
  • Pixel-Mamba: A novel architecture for gigapixel whole slide image analysis, leveraging SSMs for efficient end-to-end processing.
  • CCFormer: A hierarchical transformer model for analyzing cell spatial distributions in histopathology images, achieving state-of-the-art performance in survival prediction and cancer staging.
  • Anchor3DLane++: A BEV-free method for 3D lane detection, introducing sample-adaptive sparse 3D anchors and achieving superior performance on benchmarks.
  • VisionGRU: An RNN-based architecture for efficient image classification, demonstrating significant reductions in memory usage and computational costs.
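The VMeanba entry above describes a training-free compression that averages activation maps across channels, so downstream SSM blocks process one map instead of many. A rough sketch of that idea (function name and shapes are assumptions for illustration, not the paper's actual API):

```python
import numpy as np

def mean_compress(activations):
    """Collapse per-channel activation maps into a single averaged map.

    Input shape (C, H, W) -> output shape (1, H, W). Subsequent blocks
    then run on one map instead of C, cutting their compute by roughly
    a factor of C, with no retraining required.
    """
    return activations.mean(axis=0, keepdims=True)

# Toy example: 64-channel 14x14 feature map.
feat = np.random.default_rng(1).standard_normal((64, 14, 14))
compressed = mean_compress(feat)
print(compressed.shape)  # (1, 14, 14)
```

The trade-off is that averaging discards per-channel detail, so such compression is typically applied only to layers where the activation maps are sufficiently redundant.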

Sources

Mamba2D: A Natively Multi-Dimensional State-Space Model for Vision Tasks

V"Mean"ba: Visual State Space Models only need 1 hidden dimension

From Pixels to Gigapixels: Bridging Local Inductive Bias and Long-Range Dependencies with Pixel-Mamba

From Histopathology Images to Cell Clouds: Learning Slide Representations with Hierarchical Cell Transformer

ViM-Disparity: Bridging the Gap of Speed, Accuracy and Memory for Disparity Map Generation

Anchor3DLane++: 3D Lane Detection via Sample-Adaptive Sparse 3D Anchor Regression

ImagineMap: Enhanced HD Map Construction with SD Maps

URoadNet: Dual Sparse Attentive U-Net for Multiscale Road Network Extraction

Establishing Reality-Virtuality Interconnections in Urban Digital Twins for Superior Intelligent Road Inspection

VisionGRU: A Linear-Complexity RNN Model for Efficient Image Analysis

UNet--: Memory-Efficient and Feature-Enhanced Network Architecture based on U-Net with Reduced Skip-Connections
