Large Language Models (LLMs) with Biomedical Data

Report on Current Developments in the Integration of Large Language Models (LLMs) with Biomedical Data

General Direction of the Field

Work on integrating Large Language Models (LLMs) with biomedical data is advancing quickly, with a focus on predictive modeling, feature selection, and knowledge retrieval in complex, high-dimensional datasets. Recent work draws on the knowledge encoded in LLMs to address genotype-phenotype prediction, latent feature mining, clinical decision support, and natural language generation for medical explanations. The field is moving toward frameworks that improve not only model performance but also interpretability and trustworthiness in high-stakes healthcare applications.

One of the key trends is the development of knowledge-driven frameworks that utilize LLMs to select and engineer features in genotype data, overcoming the limitations of traditional data-driven approaches. These frameworks are designed to operate effectively in low-shot regimes, where data availability is limited, and are showing promising results in predicting complex phenotypes.
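
As a rough illustration of knowledge-driven feature selection, the sketch below ranks candidate genotype features by a relevance score and keeps the top k. It is not the method of any paper in this report: `select_features`, `stub_scorer`, the feature names, and the scores are all hypothetical, and the stub stands in for what would be an LLM call in practice.

```python
from typing import Callable, List

# Hypothetical scorer signature: (feature_name, phenotype) -> relevance score.
Scorer = Callable[[str, str], float]


def select_features(features: List[str], phenotype: str,
                    scorer: Scorer, k: int) -> List[str]:
    """Keep the k features the scorer rates most relevant to the phenotype."""
    scores = {f: scorer(f, phenotype) for f in features}
    # sorted() is stable, so tied features keep their original order.
    return sorted(features, key=lambda f: scores[f], reverse=True)[:k]


def stub_scorer(feature: str, phenotype: str) -> float:
    """Stand-in for an LLM call; the scores are made up for illustration."""
    made_up = {"variant_a": 0.9, "variant_b": 0.2, "variant_c": 0.7}
    return made_up.get(feature, 0.0)


selected = select_features(
    ["variant_a", "variant_b", "variant_c"], "example_trait", stub_scorer, k=2
)
print(selected)  # ['variant_a', 'variant_c']
```

Because the ranking comes from prior knowledge rather than from fitting the training set, a selection step of this shape does not depend on how many labeled samples are available, which is the property that matters in low-shot regimes.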

Another significant direction is latent feature mining, in which LLMs infer unobserved but critical factors that traditional machine learning models struggle to incorporate. This approach is particularly valuable in domains such as criminal justice and healthcare, where data are scarce and their collection raises ethical concerns; augmenting observed features with latent ones improves the predictive power of models.
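
The augmentation step can be pictured with a minimal sketch, assuming a record is a flat dictionary of observed features. The names `augment_with_latent` and `stub_infer_latent`, the field names, and the inference rule are hypothetical; the stub replaces what would be an LLM inference in a real system.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]


def augment_with_latent(record: Record,
                        infer_latent: Callable[[Record], Record]) -> Record:
    """Return a copy of the record with inferred latent features appended."""
    augmented = dict(record)  # leave the observed record untouched
    for name, value in infer_latent(record).items():
        # Prefix latent features so they stay distinguishable downstream.
        augmented[f"latent_{name}"] = value
    return augmented


def stub_infer_latent(record: Record) -> Record:
    """Stand-in for an LLM; the rule below is made up for illustration."""
    return {"access_barrier": record.get("missed_appointments", 0) >= 3}


observed = {"age": 40, "missed_appointments": 4}
print(augment_with_latent(observed, stub_infer_latent))
```

Keeping the latent features under a separate prefix makes it easy to train a downstream model with and without them and compare the two.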

In the realm of clinical decision support, there is a growing emphasis on integrating knowledge graph (KG) retrieval with LLM reasoning to improve the accuracy and interpretability of healthcare predictions. These frameworks are designed to address the limitations of LLMs in high-stakes healthcare applications, such as clinical diagnosis, by providing fine-grained, contextually relevant information.
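
To make the retrieval-then-reason pattern concrete, here is a deliberately simplified sketch. It uses plain entity matching over a toy list of triples rather than the community-level retrieval described for KARE, and it stops at building the prompt instead of calling a model. `retrieve_triples`, `build_prompt`, `TOY_KG`, and every entity name are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

# Toy knowledge graph with placeholder entities, not real medical facts.
TOY_KG: List[Triple] = [
    ("condition_x", "increases_risk_of", "outcome_y"),
    ("drug_m", "treats", "condition_x"),
    ("condition_z", "increases_risk_of", "outcome_w"),
]


def retrieve_triples(kg: List[Triple], patient_terms: Set[str]) -> List[Triple]:
    """Keep triples whose head or tail matches one of the patient's terms."""
    return [t for t in kg if t[0] in patient_terms or t[2] in patient_terms]


def build_prompt(patient_terms: Set[str], triples: List[Triple],
                 question: str) -> str:
    """Assemble the retrieved facts into a prompt for a downstream LLM."""
    facts = "\n".join(f"- {h} {r} {t}" for h, r, t in triples)
    return (
        f"Patient conditions: {', '.join(sorted(patient_terms))}\n"
        f"Relevant knowledge:\n{facts}\n"
        f"Question: {question}"
    )


terms = {"condition_x"}
print(build_prompt(terms, retrieve_triples(TOY_KG, terms),
                   "Is outcome_y likely?"))
```

Restricting the prompt to facts tied to the patient's own conditions is what keeps the context fine-grained, and the listed triples double as a trace of which knowledge the prediction could have relied on.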

Additionally, generating synthetic data with LLMs is emerging as a powerful technique for improving disease entity recognition and normalization, particularly when training data are sparse. This approach improves overall performance and yields significant gains on out-of-distribution data, making it a valuable tool for building more robust biomedical information extraction systems.
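
A minimal sketch of how synthetic mentions might be turned into normalization training pairs follows. It assumes a mapping from concept identifiers to canonical names; `build_synthetic_pairs`, `stub_generate`, and the `DEMO:0001` identifier are hypothetical, and the stub generator replaces an LLM that would propose surface-form variants.

```python
from typing import Callable, Dict, List, Tuple


def build_synthetic_pairs(
    concepts: Dict[str, str],
    generate: Callable[[str], List[str]],
) -> List[Tuple[str, str]]:
    """Pair each canonical name and its generated variants with its concept ID."""
    pairs: List[Tuple[str, str]] = []
    seen = set()
    for concept_id, name in concepts.items():
        for mention in [name] + generate(name):
            key = (mention.lower(), concept_id)
            if key not in seen:  # drop case-insensitive duplicates
                seen.add(key)
                pairs.append((mention, concept_id))
    return pairs


def stub_generate(name: str) -> List[str]:
    """Stand-in for an LLM; returns made-up variants of the name."""
    return [name.upper(), f"{name} disorder", name]


pairs = build_synthetic_pairs({"DEMO:0001": "example disease"}, stub_generate)
print(pairs)
```

The resulting (mention, concept ID) pairs can be mixed into the existing training set; deduplicating them first avoids overweighting variants the generator repeats.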

Noteworthy Papers

Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models: Introduces FREEFORM, a novel framework that leverages LLMs for feature selection and engineering in genotype data, outperforming traditional methods in low-shot regimes.
Latent Feature Mining for Predictive Model Enhancement with Large Language Models: Proposes FLAME, a framework that uses LLMs to infer latent features, significantly enhancing predictive models in domains with limited data availability.
Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval: Introduces KARE, a framework that integrates KG community-level retrieval with LLM reasoning, achieving state-of-the-art results in clinical prediction tasks.
LLaVA Needs More Knowledge: Retrieval Augmented Natural Language Generation with Knowledge Graph for Explaining Thoracic Pathologies: Proposes a KG-augmented Vision-Language framework that improves the quality of natural language explanations for medical images, achieving state-of-the-art results on the MIMIC-NLE dataset.
Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions: Demonstrates that LLM-generated synthetic data significantly improves disease entity normalization, particularly in out-of-distribution scenarios.
