The field is shifting toward more efficient and versatile tokenization techniques for generative models, particularly for video and image generation. Current work focuses on improving compression ratios, enhancing reconstruction fidelity, and accelerating inference. Key advances include continuous and discrete tokenizers that exploit semantic information and adaptive strategies to handle spatial-temporal dimensions more effectively; these methods not only improve the quality of generated content but also enable faster, more efficient training. Notably, there is a strong emphasis on multilingual applications, particularly talking-avatar generation, where novel quantization frameworks are enhancing cross-lingual capabilities. Additionally, the integration of probabilistic frameworks and curriculum learning strategies is proving critical for the stable and effective training of discrete visual representation models. Overall, the field is progressing toward efficient, high-fidelity, and semantically rich generative models, with a particular focus on scalability and cross-lingual applications.
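As a concrete illustration of the discrete-tokenizer idea discussed above, the sketch below shows the core mechanism shared by most discrete visual tokenizers: vector quantization, i.e. mapping each continuous latent vector to its nearest entry in a learned codebook, whose index serves as the discrete token. This is a minimal NumPy sketch; the function name, shapes, and codebook size are illustrative assumptions, not drawn from any specific paper in the summary.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    latents:  (N, D) array of encoder outputs (illustrative shapes).
    codebook: (K, D) array of learned code vectors.
    Returns discrete token ids (N,) and the quantized vectors (N, D).
    """
    # Squared Euclidean distance between every latent and every code.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = dists.argmin(axis=1)   # discrete token id per latent
    return ids, codebook[ids]    # quantized latents fed to the decoder

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # K=8 codes of dimension D=4
latents = rng.normal(size=(5, 4))   # 5 latent vectors from an encoder
ids, quantized = quantize(latents, codebook)
```

In a full tokenizer, the codebook is learned jointly with the encoder and decoder, and a straight-through gradient estimator is typically used so the non-differentiable `argmin` does not block backpropagation.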