Tokenization and Language Model Developments in Linguistics and Psycholinguistics

Report on Current Developments in the Research Area

General Direction of the Field

Recent advances in this research area are reshaping language modeling, psycholinguistics, and computational studies of language acquisition and processing. A notable trend is the growing emphasis on tokenization and its implications for language models: researchers are increasingly questioning the assumptions underlying current tokenization practices and exploring alternatives that align more closely with linguistic principles. This shift is driven by the need for linguistically plausible models that accurately capture and represent the complexities of human language.

One key direction is the exploration of tokenization-free models, particularly those operating at the grapheme and phoneme level. These models aim to bypass the limitations of subword tokenization algorithms such as Byte Pair Encoding (BPE), which have been criticized for not fully capturing linguistic representations. Small, character-level language models based on the Llama architecture are showing promising results on both syntactic and lexical benchmarks, and are seen as a step toward more linguistically grounded systems better suited to computational studies of language acquisition and processing.
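To make the contrast concrete, the minimal Python sketch below builds a grapheme-level vocabulary directly from the characters of a corpus, with no merge rules of the kind BPE learns. The corpus, names, and vocabulary here are illustrative assumptions, not code or data from the cited papers.

```python
# Minimal sketch of grapheme-level "tokenization": the vocabulary is simply the
# set of characters observed in a corpus, so no merge rules (as in BPE) are needed.
# The corpus and names below are illustrative, not taken from any cited paper.

corpus = ["the cat sat on the mat", "a dog barked"]

# Build the character vocabulary: typically a few dozen to a few hundred symbols,
# versus tens of thousands of subwords for a typical BPE tokenizer.
vocab = {ch: idx for idx, ch in enumerate(sorted({ch for text in corpus for ch in text}))}

def encode(text: str) -> list[int]:
    # A real system would also reserve an <unk> index for unseen characters.
    return [vocab[ch] for ch in text]

def decode(ids: list[int]) -> str:
    inverse = {idx: ch for ch, idx in vocab.items()}
    return "".join(inverse[i] for i in ids)

ids = encode("the dog sat")
assert decode(ids) == "the dog sat"
print(f"vocabulary size: {len(vocab)}, sequence length: {len(ids)}")
```

A grapheme- or phoneme-level model in practice would pair such an encoder with an unknown-character symbol and, for phonemes, a grapheme-to-phoneme conversion step.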

Another significant development is the integration of large language models (LLMs) into applications such as spoken grammar assessment and Chinese spelling correction. These applications highlight the versatility of LLMs on specific language-related tasks, where they often outperform traditional methods. Using LLMs in these contexts is not only a matter of leveraging their scale; it also motivates new approaches that are more effective and efficient.
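One common pattern behind such correction tasks is sketched below: candidate outputs are ranked by their likelihood under a causal language model. The model name and example sentences are placeholder assumptions, and this generic pattern should not be read as the specific method of the papers covered here.

```python
# Generic illustration (not the method of any cited paper): rank candidate
# corrections of a sentence by their log-likelihood under a causal LM.
# The model name is an assumption; any Hugging Face causal LM could be used.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder choice for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def log_likelihood(text: str) -> float:
    """Total log-probability of the text's tokens (higher = more fluent)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(input_ids=ids, labels=ids)
    # out.loss is the mean negative log-likelihood over predicted positions.
    return -out.loss.item() * (ids.shape[1] - 1)

candidates = ["She went to the libary.", "She went to the library."]
print(max(candidates, key=log_likelihood))
```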

In psycholinguistics, there is renewed attention to the proper treatment of tokenization when language models are applied to cognitive studies. Researchers advocate marginalizing token-level models into character-level models so that model predictions align with the cognitive processes involved in reading and comprehension. This addresses the mismatch between the regions of interest measured in psycholinguistic studies and the token strings produced by current language models.
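The sketch below gives a toy illustration of this marginalization, assuming a tiny hand-written subword vocabulary and a unigram token model with no end-of-sequence handling. It is meant only to convey the idea of summing over token sequences whose character yield covers a region of interest, not to reproduce the cited work's algorithm.

```python
# Toy sketch (not the cited work's implementation): marginalize a token-level
# model into character-level prefix probabilities by summing over the token
# sequences whose spelled-out form covers a given character prefix.
import itertools
import math

vocab = ["a", "ab", "b", "ba"]             # toy subword vocabulary
token_prob = {t: 0.25 for t in vocab}      # toy unigram "LM", no end-of-sequence

def prefix_prob(prefix: str, max_tokens: int = 4) -> float:
    """P(character string begins with `prefix`), summed over token sequences."""
    total = 0.0
    for n in range(1, max_tokens + 1):
        for seq in itertools.product(vocab, repeat=n):
            covers = "".join(seq).startswith(prefix)
            # Count only minimal covering sequences (the prefix is completed by
            # the final token) so that longer extensions are not double-counted.
            already = "".join(seq[:-1]).startswith(prefix)
            if covers and not already:
                total += math.prod(token_prob[t] for t in seq)
    return total

# Character-level surprisal of "b" given the preceding character "a":
surprisal = -math.log(prefix_prob("ab") / prefix_prob("a"))
print(f"surprisal of 'b' after 'a': {surprisal:.3f} nats")
```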

Eye-tracking techniques are also gaining traction as a means of assessing reading comprehension and attention in various contexts. The use of machine-learning-based eye-tracking to study how background noise affects attention and performance in timed stress tasks is an emerging area of interest, providing new insight into how environmental factors influence cognitive processes and academic performance.

Overall, the field is moving towards a more nuanced understanding of language modeling and its applications, with a strong emphasis on linguistic validity, cognitive alignment, and practical utility. The integration of diverse methodologies, from computational modeling to empirical studies, is paving the way for more robust and effective language-related technologies and theories.

Noteworthy Papers

  • Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas: Demonstrates that small, tokenization-free models can achieve strong linguistic performance, suggesting a promising direction for more linguistically plausible language models.

  • A Simple yet Effective Training-free Prompt-free Approach to Chinese Spelling Correction Based on Large Language Models: Proposes a training-free, prompt-free approach to Chinese spelling correction with LLMs that significantly improves performance and rivals state-of-the-art models.

  • Fine-Grained Prediction of Reading Comprehension from Eye Movements: Addresses the challenging task of predicting reading comprehension from eye movements, suggesting that eye movements contain useful signals for fine-grained prediction.

Sources

Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus

On the Proper Treatment of Tokenization in Psycholinguistics

Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas

Spoken Grammar Assessment Using LLM

A Simple yet Effective Training-free Prompt-free Approach to Chinese Spelling Correction Based on Large Language Models

Assessing the Impact of Disorganized Background Noise on Timed Stress Task Performance Through Attention Using Machine-Learning Based Eye-Tracking Techniques

Fine-Grained Prediction of Reading Comprehension from Eye Movements

Investigating large language models for their competence in extracting grammatically sound sentences from transcribed noisy utterances

A Two-Step Approach for Data-Efficient French Pronunciation Learning

The Effect of Surprisal on Reading Times in Information Seeking and Repeated Reading
