Recent work in document processing and scene text recognition has shifted toward more integrated and efficient models. Researchers are increasingly targeting scene text with arbitrary reading orders and shapes, and improving document classification through multi-modal approaches. A notable trend is the use of local semantic information alongside visual features to strengthen recognition and classification, often without large datasets or heavy computational resources. Work on disentanglement and compositionality in neural text-processing models has underscored the need for benchmarks that evaluate generalization. Document layout analysis is likewise moving toward explicit relational reasoning and alignment techniques that improve performance on visual tasks without relying on OCR at inference time. Together, these developments advance automated document processing and scene text recognition, offering more efficient and accurate solutions for real-world applications.