Advances in Multimodal Human-Robot Interaction

The field of human-robot interaction (HRI) is advancing rapidly, with a focus on developing more natural and intuitive interfaces between humans and robots. Recent research emphasizes multimodal interaction, which combines multiple forms of input and output, such as speech, vision, and gesture, to enable more effective and efficient communication, and has produced new architectures and frameworks that integrate and process multiple modalities in real time.

A key challenge in HRI is enabling robots to perceive and reason over multimodal inputs and to use that information to guide their behavior and decision-making. To address this challenge, researchers have proposed techniques ranging from large language models (LLMs) that integrate and process multimodal data to frameworks for adaptive scaffolding and explanation generation.

Notable papers in this area include Multimodal Transformer Models for Turn-taking Prediction, which introduces a transformer-based model for predicting turn-taking events in human-robot conversations, and FAM-HRI, an efficient multimodal framework for human-robot interaction that integrates language and gaze inputs via foundation models. In addition, papers such as ACE and SemanticScanpath propose new approaches to explaining and interpreting multimodal data and demonstrate their effectiveness in improving human-robot interaction.
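To make the multimodal-fusion idea concrete, the sketch below shows one common pattern for turn-taking prediction: per-frame audio and visual features are projected into a shared space, fused by a transformer encoder, and classified frame-by-frame as "hold" or "shift". This is a minimal illustration under assumed feature dimensions and a simple additive fusion, not the architecture of the cited paper; all class and variable names here are hypothetical.

```python
import torch
import torch.nn as nn

class TurnTakingTransformer(nn.Module):
    """Hypothetical sketch of a multimodal transformer for turn-taking
    prediction: audio and visual feature streams are projected into a
    shared embedding space, fused, encoded with self-attention, and
    classified per frame as 'hold' vs. 'turn shift'."""

    def __init__(self, audio_dim=128, visual_dim=256, d_model=256,
                 nhead=4, num_layers=2):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        self.head = nn.Linear(d_model, 2)  # logits: hold vs. shift

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, time, audio_dim)
        # visual_feats: (batch, time, visual_dim)
        fused = self.audio_proj(audio_feats) + self.visual_proj(visual_feats)
        encoded = self.encoder(fused)
        return self.head(encoded)  # (batch, time, 2) per-frame logits

model = TurnTakingTransformer()
audio = torch.randn(1, 50, 128)   # e.g. 50 frames of prosodic features
video = torch.randn(1, 50, 256)   # e.g. 50 frames of facial/gaze features
logits = model(audio, video)
print(logits.shape)               # torch.Size([1, 50, 2])
```

Additive fusion after projection is only one design choice; cross-modal attention between the streams is a common alternative when the modalities are not frame-aligned.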
Sources
Multimodal Transformer Models for Turn-taking Prediction: Effects on Conversational Dynamics of Human-Agent Interaction during Cooperative Gameplay
SHIFT: An Interdisciplinary Framework for Scaffolding Human Attention and Understanding in Explanatory Tasks
ACE, Action and Control via Explanations: A Proposal for LLMs to Provide Human-Centered Explainability for Multimodal AI Assistants
Inclusive STEAM Education: A Framework for Teaching Coding and Robotics to Students with Visually Impairment Using Advanced Computer Vision
Design of Seamless Multi-modal Interaction Framework for Intelligent Virtual Agents in Wearable Mixed Reality Environment
FedMM-X: A Trustworthy and Interpretable Framework for Federated Multi-Modal Learning in Dynamic Environments