Refining Tokenization and Semantics in Large Language Models

Recent research on large language models (LLMs) has focused on refining tokenization and exploring alternative foundational semantics. Much of this work examines how tokenization shapes model cognition and semantic understanding: researchers advocate linguistically informed tokenization interventions to yield better semantic primitives and distributional patterns, and to curb the bias and unwanted content that current tokenization practice can introduce. In parallel, training-free techniques such as Token Prepending improve sentence embeddings by letting earlier tokens attend to complete sentence information, with gains on semantic textual similarity tasks (a minimal sketch of the idea appears below).

The philosophical foundations of LLMs are also under scrutiny, with proposals to ground them in inferentialist semantics, argued to better fit the post-anthropocentric capabilities of these models. On the training side, distinguishing reasoning tokens from boilerplate tokens during fine-tuning has proved important, with new methods concentrating the learning signal on reasoning tokens (also sketched below). Together, these developments signal a shift toward more nuanced and effective approaches to both the capabilities and the ethical considerations of LLMs.
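To make the Token Prepending idea concrete, here is a minimal, hypothetical PyTorch sketch, not the paper's reference implementation: a sentence-level pseudo-token taken from the previous layer's output is prepended to the next layer's input so that, under causal attention, every position can attend to whole-sentence information, and the pseudo-token is dropped again before the following layer. The toy layer and the use of the last position as the sentence summary are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Toy stand-in for one decoder layer (illustrative only)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        n = x.size(1)
        # True above the diagonal = future positions are masked out.
        mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

def token_prepend(hidden, layer):
    """Run one layer with a sentence pseudo-token prepended.

    hidden: (batch, seq, dim) output of the previous layer.
    """
    # Assumption: use the last position as a cheap whole-sentence summary
    # (the paper derives a sentence embedding from the layer output).
    sent = hidden[:, -1:, :]
    augmented = torch.cat([sent, hidden], dim=1)  # pseudo-token at position 0
    out = layer(augmented)
    return out[:, 1:, :]  # drop the pseudo-token before the next layer

layer = CausalSelfAttention(dim=32)
hidden = torch.randn(2, 10, 32)  # toy hidden states
print(token_prepend(hidden, layer).shape)  # -> torch.Size([2, 10, 32])
```

Because causal attention lets every position see position 0, the prepended summary gives even the earliest tokens access to information from the whole sentence, which is otherwise unavailable to them in a decoder-only model.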
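The reasoning-versus-boilerplate distinction can likewise be sketched as a per-token loss weighting that concentrates learning on reasoning tokens. This is a hedged illustration under stated assumptions: the `reasoning_mask` and `boiler_weight` are hypothetical placeholders, and the cited paper's actual method for identifying reasoning tokens differs.

```python
import torch
import torch.nn.functional as F

def weighted_token_loss(logits, labels, reasoning_mask, boiler_weight=0.1):
    """Per-token cross-entropy with boilerplate tokens down-weighted.

    logits: (batch, seq, vocab); labels: (batch, seq);
    reasoning_mask: (batch, seq) bool, True where a token carries reasoning.
    """
    # Unreduced per-token loss, shape (batch, seq).
    per_token = F.cross_entropy(logits.transpose(1, 2), labels, reduction="none")
    # Full weight on reasoning tokens, reduced weight elsewhere.
    weights = torch.full(labels.shape, boiler_weight)
    weights[reasoning_mask] = 1.0
    return (per_token * weights).sum() / weights.sum()

# Toy usage with random tensors.
logits = torch.randn(2, 8, 100)
labels = torch.randint(0, 100, (2, 8))
mask = torch.rand(2, 8) > 0.5
print(weighted_token_loss(logits, labels, mask))
```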

Noteworthy papers include one asking whether the Distributional Hypothesis suffices for human-like language performance, and another proposing inferentialist semantics as a foundation for LLMs, challenging traditional views in the philosophy of language.

Sources

Still "Talking About Large Language Models": Some Clarifications

Tokens, the oft-overlooked appetizer: Large language models, the distributional hypothesis, and meaning

Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs

Is it the end of (generative) linguistics as we know it?

Do Large Language Models Defend Inferentialist Semantics?: On the Logical Expressivism and Anti-Representationalism of LLMs

Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning

Tokenisation is NP-Complete
