Advances in Text-Guided Image Editing

The field of text-guided image editing is evolving rapidly, with a focus on improving the accuracy and controllability of editing operations. Recent work has centered on enhancing the cross-attention mechanisms that align textual instructions with visual features, enabling more precise, fine-grained edits. This has led to significant progress in preserving background integrity and maintaining semantic consistency between the edited result and the source image. Noteworthy papers include DCEdit, which introduces a Dual-Level Control mechanism that incorporates regional cues at both the feature and latent levels, and FireEdit, which proposes a Time-Aware Target Injection module and a Hybrid Visual Cross Attention module to enhance fine-grained visual perception. EditCLIP takes a representation-learning approach, learning a unified representation of an edit by jointly encoding an input image and its edited counterpart, while LOCATEdit optimizes cross-attention maps via graph Laplacian regularization and FDS selectively optimizes specific frequency bands.
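The cross-attention mechanism these methods build on can be illustrated with a minimal sketch: image-patch features act as queries while text-token embeddings supply keys and values, producing per-patch attention maps over the instruction tokens. This is a generic, hypothetical illustration (names like `cross_attention` and the tensor shapes are assumptions), not the implementation of any specific paper above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_embeds):
    """Scaled dot-product cross-attention: image patches attend to text tokens.

    image_feats: (num_patches, d) queries from visual features
    text_embeds: (num_tokens, d) keys/values from the text instruction
    Returns the attended output and the (num_patches, num_tokens) attention map
    that localization-focused methods (e.g. LOCATEdit-style approaches) refine.
    """
    d = text_embeds.shape[-1]
    scores = image_feats @ text_embeds.T / np.sqrt(d)  # (patches, tokens)
    attn = softmax(scores, axis=-1)                    # rows sum to 1
    return attn @ text_embeds, attn

rng = np.random.default_rng(0)
img = rng.normal(size=(16, 64))  # 16 image patches, feature dim 64
txt = rng.normal(size=(5, 64))   # 5 instruction tokens, same dim (assumed pre-projected)
out, attn_map = cross_attention(img, txt)
print(out.shape, attn_map.shape)  # (16, 64) (16, 5)
```

In real diffusion editors the queries, keys, and values pass through learned projection layers first; the sketch omits them for brevity.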

Sources

DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics

FDS: Frequency-Aware Denoising Score for Text-Guided Latent Diffusion Image Editing

FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model

EditCLIP: Representation Learning for Image Editing

LOCATEdit: Graph Laplacian Optimized Cross Attention for Localized Text-Guided Image Editing
