The field of emotion recognition is advancing rapidly, with a growing focus on multimodal approaches that integrate complementary sources of information such as text, speech, and facial expressions. This trend is driven by the need to understand human emotions more accurately and to build more effective affective computing systems. Recent research has applied deep learning techniques, such as convolutional and recurrent neural networks, to improve recognition accuracy, and there is growing interest in using large language models and contrastive learning to refine speech emotion recognition and to enable zero-shot emotion recognition across languages. Noteworthy papers in this area include GatedxLSTM, a speech-text multimodal emotion recognition model that achieves state-of-the-art performance on the IEMOCAP dataset, and OmniVox, a systematic evaluation of omni-LLMs for zero-shot emotion recognition that shows they are competitive with fine-tuned audio models.
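The gated multimodal fusion idea underlying models like GatedxLSTM can be sketched in a few lines. The snippet below is purely illustrative and is not the paper's actual architecture: it shows a learned element-wise gate deciding, per dimension, how much to trust a speech embedding versus a text embedding before downstream emotion classification. All function names and weights here are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(speech_emb, text_emb, w_speech, w_text, bias):
    """Fuse a speech embedding and a text embedding via an element-wise gate.

    For each dimension i, a gate g_i in (0, 1) is computed from both
    modalities, and the fused value is the convex combination
    g_i * speech_i + (1 - g_i) * text_i. In a real model, w_speech,
    w_text, and bias would be trained jointly with the classifier;
    here they are plain lists standing in for learned parameters.
    """
    fused = []
    for s, t, ws, wt, b in zip(speech_emb, text_emb, w_speech, w_text, bias):
        g = sigmoid(ws * s + wt * t + b)   # per-dimension gate in (0, 1)
        fused.append(g * s + (1.0 - g) * t)
    return fused

# A strongly positive bias drives the gate toward 1 (trust speech);
# a strongly negative bias drives it toward 0 (trust text).
speech = [1.0, 1.0]
text = [0.0, 0.0]
mostly_speech = gated_fusion(speech, text, [0.0, 0.0], [0.0, 0.0], [100.0, 100.0])
mostly_text = gated_fusion(speech, text, [0.0, 0.0], [0.0, 0.0], [-100.0, -100.0])
```

The gate lets the model downweight a noisy modality (e.g. distorted audio) on a per-utterance, per-dimension basis, which is one motivation for gated fusion over simple concatenation of the two embeddings.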