Synthetic Data and Privacy in EHR Research

Recent work on electronic health records (EHRs) and synthetic data generation is advancing data harmonization, privacy preservation, and model evaluation. Generative AI models are increasingly used to create synthetic medical data, including text, time series, and longitudinal records, which helps address data scarcity and class imbalance while improving privacy protection. Models built on adversarial networks and large language models in particular show promise in generating high-fidelity synthetic data. There is also growing attention to the empirical evaluation of privacy risks in synthetic data, which highlights the need for realistic threat models and computationally feasible privacy assessments. Auto-evaluation techniques, such as post-hoc regression with few labels, are emerging to reduce the cost of model evaluation. Together, these advances support more robust and privacy-conscious data management in healthcare research.
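
To make the idea of an empirical privacy check concrete, here is a minimal sketch of one common heuristic, distance to closest record (DCR): synthetic rows that sit much closer to the training data than to a held-out real set suggest the generator may be copying training records. This is an illustration only and not the method of any source listed below. The function names, the toy two-feature data, and the "leaky" and "fresh" generators are all hypothetical, and the sketch assumes numeric features that are already normalized.

```python
import math
import random


def nearest_distance(point, records):
    """Euclidean distance from `point` to its closest row in `records`."""
    return min(math.dist(point, r) for r in records)


def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def dcr_summary(synthetic, train, holdout):
    """Median DCR of synthetic rows to the training set and to a holdout set."""
    to_train = [nearest_distance(s, train) for s in synthetic]
    to_holdout = [nearest_distance(s, holdout) for s in synthetic]
    return median(to_train), median(to_holdout)


# Toy data (hypothetical): two normalized numeric features per record.
rng = random.Random(0)
train = [(rng.random(), rng.random()) for _ in range(200)]
holdout = [(rng.random(), rng.random()) for _ in range(200)]

# A "leaky" generator that copies training rows with tiny noise.
leaky = [(x + rng.gauss(0, 1e-4), y + rng.gauss(0, 1e-4)) for x, y in train[:100]]
# A "fresh" generator that samples new points from the same distribution.
fresh = [(rng.random(), rng.random()) for _ in range(100)]

leaky_train, leaky_holdout = dcr_summary(leaky, train, holdout)
fresh_train, fresh_holdout = dcr_summary(fresh, train, holdout)

print("leaky: train DCR", leaky_train, "holdout DCR", leaky_holdout)
print("fresh: train DCR", fresh_train, "holdout DCR", fresh_holdout)
```

In this toy setup the leaky generator's median distance to the training set is far smaller than its distance to the holdout set, while the fresh generator's two distances are of similar size. A gap like this is a warning sign, not a privacy guarantee; a realistic assessment would still need an explicit threat model.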

Sources

EHRs Data Harmonization Platform, an easy-to-use shiny app based on recodeflow for harmonizing and deriving clinical features

Empirical Privacy Evaluations of Generative and Predictive Machine Learning Models -- A review and challenges for practice

A Review on Generative AI Models for Synthetic Medical Text, Time Series, and Longitudinal Data

Auto-Evaluation with Few Labels through Post-hoc Regression

SynEHRgy: Synthesizing Mixed-Type Structured Electronic Health Records using Decoder-Only Transformers
