The field of multimodal sarcasm detection and comprehension is advancing rapidly, driven by approaches that improve the integration of data modalities and the understanding of nuanced context. Recent work emphasizes multimodal data augmentation strategies, which have been shown to boost performance by generating diverse, contextually rich samples. Attention mechanisms and graph-based models are being refined to better capture relational context and the dynamic interplay between text and images, yielding more accurate sarcasm detection. The incorporation of commonsense reasoning and adversarial learning is likewise improving the robustness and generalization of models in complex scenarios. Benchmarking efforts, including new datasets and evaluation frameworks, are providing more comprehensive assessments of model performance and highlighting areas for further improvement. Notably, synthetic native samples and multi-task learning strategies are proving effective in code-mixed settings, while novel cross-modal fusion networks are advancing the state of the art in multimodal sarcasm detection. Overall, the field is moving toward more sophisticated, context-aware models that can interpret the intricacies of human communication.
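To make the fusion idea concrete, the sketch below shows one common shape such a network can take: a bidirectional cross-attention block in which text tokens attend to image regions and vice versa before a joint classification head. This is a minimal illustrative sketch, not the architecture of any specific paper surveyed here; all names (`CrossModalFusion`, `d_model`, the mean-pooling choice, the two-class head) are assumptions introduced for the example.

```python
# Minimal sketch of cross-modal attention fusion for sarcasm detection.
# Illustrative only: names, dimensions, and pooling are assumptions,
# not drawn from any specific model discussed above.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Fuses text and image features with bidirectional cross-attention."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Text tokens query image regions, and image regions query text tokens.
        self.txt2img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, 2),  # sarcastic vs. non-sarcastic
        )

    def forward(self, text_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, n_tokens, d_model); image_feats: (batch, n_regions, d_model)
        txt_ctx, _ = self.txt2img(text_feats, image_feats, image_feats)
        img_ctx, _ = self.img2txt(image_feats, text_feats, text_feats)
        # Pool each attended sequence and classify the concatenated representation.
        fused = torch.cat([txt_ctx.mean(dim=1), img_ctx.mean(dim=1)], dim=-1)
        return self.classifier(fused)


# Example usage with random tensors standing in for encoder outputs
# (e.g., projected BERT token embeddings and ViT patch features).
model = CrossModalFusion()
text = torch.randn(8, 32, 256)
image = torch.randn(8, 49, 256)
logits = model(text, image)  # shape: (8, 2)
```

In practice the text and image features would come from pretrained unimodal encoders projected to a shared dimension, and published systems typically add the graph-based, commonsense, or adversarial components mentioned above on top of a fusion core like this one.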