The field of multimodal document understanding is moving toward more efficient and effective methods for training and evaluating large language models. Researchers are exploring new techniques to improve the comprehension of interleaved image-text content in documents and to accelerate the training of multimodal large language models. A key challenge in this area is modality composition incoherence, where the mix of image and text data varies widely across samples in a mini-batch, leading to uneven GPU utilization and degraded training efficiency (a simplified illustration follows below). Another important direction is the development of robust inference mechanisms that can handle noise-like information, such as watermarks, in documents. Noteworthy papers in this area include M-DocSum, which introduces a benchmark for multimodal document summarization and achieves state-of-the-art performance; OrchMLLM, which proposes a framework to mitigate these training inefficiencies in multimodal large language models; and Open-Qwen2VL, which demonstrates efficient pre-training of fully-open multimodal LLMs on academic resources.
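
To make the GPU-utilization issue concrete, here is a minimal, self-contained sketch. It is not OrchMLLM's method: the toy cost model, the sample sizes, and the greedy balancing heuristic are all assumptions chosen for illustration. It contrasts a naive per-GPU split of a mini-batch whose samples have very different image/text compositions with a simple greedy balancing pass, and prints the resulting load imbalance.

```python
# Illustrative sketch of modality composition incoherence (not OrchMLLM's
# actual algorithm). Each training sample is modeled by the number of image
# and text tokens it contributes; we compare a naive sequential split across
# GPUs with a simple greedy load-balancing heuristic.
import random

random.seed(0)
NUM_GPUS = 4

# Hypothetical mini-batch: samples vary widely in their image/text mix.
samples = [
    {"image_tokens": random.choice([0, 256, 576, 1024]),
     "text_tokens": random.randint(32, 512)}
    for _ in range(32)
]

def workload(sample):
    # Toy cost model: assume image tokens are costlier to process than text tokens.
    return 2.0 * sample["image_tokens"] + 1.0 * sample["text_tokens"]

def per_gpu_load(assignment):
    loads = [0.0] * NUM_GPUS
    for gpu, sample in assignment:
        loads[gpu] += workload(sample)
    return loads

# Naive split: consecutive chunks of the mini-batch go to each GPU.
naive = [(i * NUM_GPUS // len(samples), s) for i, s in enumerate(samples)]

# Greedy rebalancing: assign the heaviest remaining sample to the least-loaded GPU.
balanced, loads = [], [0.0] * NUM_GPUS
for s in sorted(samples, key=workload, reverse=True):
    gpu = loads.index(min(loads))
    balanced.append((gpu, s))
    loads[gpu] += workload(s)

for name, assignment in [("naive", naive), ("greedy", balanced)]:
    loads = per_gpu_load(assignment)
    print(f"{name:>6}: loads={['%.0f' % l for l in loads]}, "
          f"imbalance={max(loads) / min(loads):.2f}x")
```

In practice, frameworks tackling this problem operate at the level of distributed batch scheduling rather than a per-batch Python loop; the sketch only conveys why heterogeneous modality composition skews per-device work and why rebalancing helps.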