Multimodal Generation and Understanding

Research in multimodal generation and understanding is moving toward unified models that handle both visual understanding and image generation. Recent work improves these models through novel training strategies, expanded training corpora, and stronger image generation capabilities. Integrating multimodal features with stance guidance has been shown to improve semantic consistency and stance control in generated content. Collaborative multi-agent frameworks and unified agentic evaluation frameworks have also shown promising results in creative content generation and its evaluation.

Some noteworthy papers include:

VARGPT-v1.1 achieves state-of-the-art performance in multimodal understanding and text-to-image instruction-following tasks through iterative instruction tuning and reinforcement learning.

UniToken introduces a unified visual encoding framework that captures both high-level semantics and low-level details, enabling seamless integration of visual understanding and image generation tasks.

CREA proposes a multi-agent collaborative framework for creative content generation with diffusion models, demonstrating significant improvements in diversity, semantic alignment, and creative transformation.

CIGEval introduces a unified agentic framework for comprehensive evaluation of conditional image generation tasks, achieving a high correlation with human assessments and surpassing previous state-of-the-art methods.

Sources

VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning

Stance-Driven Multimodal Controlled Statement Generation: New Dataset and Task

UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding

CREA: A Collaborative Multi-Agent Framework for Creative Content Generation with Diffusion Models

An Empirical Study of GPT-4o Image Generation Capabilities

Towards Holistic Prompt Craft

A Unified Agentic Framework for Evaluating Conditional Image Generation
