Advancements in Multimodal and Multisensory Human-Computer Interaction

Recent developments in this research area highlight a significant shift toward enhancing human-computer interaction through multimodal and multisensory approaches. A notable trend is the integration of advanced machine learning techniques with traditional signal processing to create more intuitive and accessible interfaces. Examples include touchscreens that adapt to vehicular movements, thereby reducing the risk of accidents, and sound synthesis models that offer fine-grained control over audio timbre through text-based interfaces. There is also a growing emphasis on improving conversational speech synthesis by modeling intra- and inter-modal interactions in dialogue history. Another key area of advancement is the generation of synthetic data for tasks such as spoken named entity recognition and video dubbing, which traditionally require extensive manual annotation. The field is likewise seeing new datasets and benchmarks that support the development of more robust and versatile models for audio and video processing. Finally, haptic feedback for people who are blind or have low vision and immersive virtual reality systems for robotic teleoperation are gaining traction, indicating a broader move toward more inclusive and efficient interaction technologies.

Sources

FITS: Ensuring Safe and Effective Touchscreen Use in Moving Vehicles

Simi-SFX: A similarity-based conditioning method for controllable sound effect synthesis

Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis

Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction

"I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities

How Can Haptic Feedback Assist People with Blind and Low Vision (BLV): A Systematic Literature Review

ETTA: Elucidating the Design Space of Text-to-Audio Models

Tri-Ergon: Fine-grained Video-to-Audio Generation with Multi-modal Conditions and LUFS Control

TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

TSPE: Task-Specific Prompt Ensemble for Improved Zero-Shot Audio Classification

VinT-6D: A Large-Scale Object-in-hand Dataset from Vision, Touch and Proprioception

SoundBrush: Sound as a Brush for Visual Scene Editing

An Immersive Virtual Reality Bimanual Telerobotic System With Haptic Feedback

OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios

StereoMath: An Accessible and Musical Equation Editor
