The field of artificial intelligence is moving toward more advanced Theory of Mind (ToM) capabilities, enabling machines to better understand human intentions, beliefs, and mental states. Recent research has focused on evaluating and improving vision-language models on ToM tasks, with particular emphasis on egocentric domains and human-image aesthetic assessment. Reinforcement learning has proven effective at unlocking ToM capabilities in small language models, and multimodal models are being developed to provide more explainable, visually grounded social intelligence. Noteworthy papers include EgoToM, which introduces a video question-answering benchmark for evaluating ToM in egocentric domains, and ToM-RL, which demonstrates that reinforcement learning can unlock ToM reasoning in small language models.