Advances in Multimodal Human-Robot Interaction

The field of human-robot interaction (HRI) is advancing rapidly, with a focus on developing more natural and intuitive interfaces between humans and robots. Recent research emphasizes multimodal interaction, which combines multiple forms of input and output, such as speech, vision, and gesture, to support more effective communication. This has driven the development of new architectures and frameworks that integrate and process multiple modalities in real time.

A key challenge in HRI is enabling robots to perceive and reason over multimodal inputs and to use that information to guide their behavior and decision-making. To address this, researchers have proposed techniques ranging from large language models (LLMs) that integrate and process multimodal data to new frameworks for adaptive scaffolding and explanation generation.

Notable papers in this area include Multimodal Transformer Models for Turn-taking Prediction, which introduces a transformer-based model for predicting turn-taking events in human-robot conversations, and FAM-HRI, which presents an efficient multimodal framework that integrates language and gaze inputs via foundation models. Papers such as ACE and SemanticScanpath propose new approaches to explaining and interpreting multimodal data and demonstrate their effectiveness in improving human-robot interaction. Illustrative sketches of the two recurring patterns, transformer-based multimodal fusion and LLM-based gaze-plus-speech grounding, follow below.
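
As a minimal sketch of the first pattern, the snippet below shows how a transformer can fuse per-frame audio and vision features to score an upcoming turn shift. The module names, feature dimensions, and fusion strategy are illustrative assumptions, not the architecture of the cited paper.

```python
# Minimal sketch of a multimodal transformer for turn-taking prediction.
# All module names, feature dimensions, and the concatenation-based fusion
# are illustrative assumptions, not the cited paper's actual architecture.
import torch
import torch.nn as nn


class MultimodalTurnTakingModel(nn.Module):
    """Fuses per-frame audio and vision features and predicts whether
    the next conversational turn is about to change."""

    def __init__(self, audio_dim=128, vision_dim=256, d_model=256,
                 n_heads=4, n_layers=2):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.vision_proj = nn.Linear(vision_dim, d_model)
        # Learned embeddings distinguish the two modality streams.
        self.modality_emb = nn.Embedding(2, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Binary head: probability that a turn shift occurs next.
        self.head = nn.Linear(d_model, 1)

    def forward(self, audio_feats, vision_feats):
        # audio_feats: (batch, T, audio_dim); vision_feats: (batch, T, vision_dim)
        a = self.audio_proj(audio_feats) + self.modality_emb.weight[0]
        v = self.vision_proj(vision_feats) + self.modality_emb.weight[1]
        # Concatenate the streams along the time axis and let self-attention
        # mix information across modalities and time steps.
        fused = self.encoder(torch.cat([a, v], dim=1))
        # Pool over the sequence and score the turn-shift probability.
        return torch.sigmoid(self.head(fused.mean(dim=1)))


model = MultimodalTurnTakingModel()
p_shift = model(torch.randn(1, 50, 128), torch.randn(1, 50, 256))
print(f"predicted turn-shift probability: {p_shift.item():.3f}")
```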

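The second pattern, grounding a spoken command with gaze in the spirit of FAM-HRI and SemanticScanpath, reduces to serializing the gaze scanpath alongside the utterance so an LLM can resolve ambiguous referents. The data structures and prompt format below are illustrative assumptions rather than either paper's actual pipeline.

```python
# Sketch of gaze-plus-speech grounding via an LLM prompt. The GazeFixation
# record and the prompt format are hypothetical, for illustration only.
from dataclasses import dataclass


@dataclass
class GazeFixation:
    object_id: str     # label of the object the user fixated
    duration_ms: int   # how long the fixation lasted


def build_grounded_prompt(utterance: str, scanpath: list[GazeFixation]) -> str:
    """Serialize the gaze scanpath alongside the utterance so an LLM can
    resolve ambiguous referents like 'that one' against what was looked at."""
    gaze_trace = ", ".join(
        f"{f.object_id} ({f.duration_ms} ms)" for f in scanpath)
    return (
        "You are a robot assistant. Resolve the user's referent.\n"
        f"User said: \"{utterance}\"\n"
        f"Gaze fixations (in order): {gaze_trace}\n"
        "Answer with the object the user most likely means.")


prompt = build_grounded_prompt(
    "Hand me that one, please",
    [GazeFixation("red_mug", 850), GazeFixation("notebook", 120)])
print(prompt)  # send to any LLM chat endpoint to obtain the grounded referent
```
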
Sources

Multimodal Transformer Models for Turn-taking Prediction: Effects on Conversational Dynamics of Human-Agent Interaction during Cooperative Gameplay

SHIFT: An Interdisciplinary Framework for Scaffolding Human Attention and Understanding in Explanatory Tasks

ACE, Action and Control via Explanations: A Proposal for LLMs to Provide Human-Centered Explainability for Multimodal AI Assistants

Enhancing Explainability with Multimodal Context Representations for Smarter Robots

Inclusive STEAM Education: A Framework for Teaching Coding and Robotics to Students with Visually Impairment Using Advanced Computer Vision

FAM-HRI: Foundation-Model Assisted Multi-Modal Human-Robot Interaction Combining Gaze and Speech

SemanticScanpath: Combining Gaze and Speech for Situated Human-Robot Interaction Using LLMs

Design of Seamless Multi-modal Interaction Framework for Intelligent Virtual Agents in Wearable Mixed Reality Environment

FedMM-X: A Trustworthy and Interpretable Framework for Federated Multi-Modal Learning in Dynamic Environments

Leveraging Cognitive States for Adaptive Scaffolding of Understanding in Explanatory Tasks in HRI

Towards Online Multi-Modal Social Interaction Understanding

Leveraging LLMs with Iterative Loop Structure for Enhanced Social Intelligence in Video Question Answering

StreetScape: Gamified Tactile Interactions for Collaborative Learning and Play