Synthetic Data and Privacy in EHR Research

Recent work on electronic health records (EHRs) and synthetic data generation is advancing data harmonization, privacy preservation, and model evaluation. Generative AI models are increasingly used to create synthetic medical data, including text, time series, and longitudinal records, which helps address data scarcity and class imbalance while improving privacy protection. Models built on adversarial networks and large language models show particular promise for generating high-fidelity synthetic data. There is also growing attention to empirically evaluating the privacy risks of synthetic data, which highlights the need for realistic threat models and computationally feasible privacy assessments. Auto-evaluation techniques, such as post-hoc regression with few labels, are emerging to reduce the cost of model evaluation. Together, these advances support more robust and privacy-conscious data management in healthcare research.
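To make the idea of post-hoc regression with few labels concrete, here is a minimal sketch. It rests on an assumption not stated in the summary: that every example has a cheap automatic score, that only a small subset also has a human label, and that a simple linear fit from automatic scores to human labels is adequate. The function names `fit_line` and `calibrated_mean` are hypothetical, and this is not necessarily the method used in the work summarized here.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x on the labeled subset."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        # All automatic scores are identical: fall back to the label mean.
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def calibrated_mean(auto_scores, labeled_idx, human_labels):
    """Estimate the mean human-judged quality over all examples.

    auto_scores  : automatic score for every example (assumed cheap to compute)
    labeled_idx  : indices of the few examples that also have a human label
    human_labels : human labels, in the same order as labeled_idx
    """
    xs = [auto_scores[i] for i in labeled_idx]
    a, b = fit_line(xs, human_labels)
    # Map every automatic score through the fitted line, then average.
    return mean(a + b * s for s in auto_scores)


if __name__ == "__main__":
    scores = [0.0, 1.0, 2.0, 3.0]  # illustrative values only
    estimate = calibrated_mean(scores, labeled_idx=[0, 2], human_labels=[0.1, 0.5])
    print(f"calibrated mean estimate: {estimate:.3f}")
```

The design choice is that human labels are only needed to fit the two regression coefficients, so the labeling cost stays fixed while the automatic scores cover the full evaluation set. A linear fit is the simplest option; it would mislead if the relationship between automatic scores and human judgments is strongly nonlinear.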
