QLS Seminar Series - Lamin Juwara
Mitigating the impact of data bias through synthetic data generators
Lamin Juwara, CHEO & University of Ottawa.
Tuesday, March 28, 12-1pm
Zoom Link: https://mcgill.zoom.us/j/86855481591
Abstract: Data bias is a pervasive problem in biomedical research, especially in large-scale observational studies. During statistical modeling, underrepresentation of specific covariate categories (e.g., gender or ethnicity) in the training cohort typically results in inconsistent estimates and imprecise predictions. While various bias-mitigating approaches have been proposed in recent years, these methods are not always effective, especially when the source of bias is unclear or the severity is extreme (e.g., more than 50% of a covariate category missing). We propose a novel bias-mitigating approach that combines the simplicity of random oversampling with the utility of synthetic data generation. The approach augments the biased training cohort with randomly selected synthetic samples of the minor covariate category in order to rebalance the covariate distributions. It is termed Synthetic Minor Augmentation (SMA) and is demonstrated through extensive simulations and applications to several real data examples.
In this talk, I will review the current standards for mitigating data bias at the data analysis stage. I will then demonstrate how synthetic data generation can be used simultaneously to preserve data privacy and mitigate the impact of data bias. In particular, I will show how light gradient boosting machines can be used for data synthesis to generate suitable supplementary samples of the underrepresented covariate categories. The resulting data model is compared to some current standards, including subsampling and matching.
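
The rebalancing step described in the abstract can be illustrated with a minimal sketch. This is not the speaker's implementation: the synthesizer here is a simple independent-Gaussian placeholder standing in for the light gradient boosting synthesizer mentioned above, and all names (fit_gaussian_generator, synthetic_minor_augmentation, the "sex", "age", and "bmi" columns, pool_size) are hypothetical and chosen only for illustration. The sketch assumes a single categorical covariate and rebalances the minor category up to the size of the largest one.

```python
import random
import statistics
from collections import Counter


def fit_gaussian_generator(rows, features):
    """Placeholder synthesizer (an assumption, not the method from the talk):
    fits an independent Gaussian to each numeric feature of the given rows."""
    params = {
        f: (statistics.mean([r[f] for r in rows]),
            statistics.stdev([r[f] for r in rows]))
        for f in features
    }

    def generate(n, rng):
        # Draw n synthetic rows, one Gaussian draw per feature.
        return [{f: rng.gauss(mu, sd) for f, (mu, sd) in params.items()}
                for _ in range(n)]

    return generate


def synthetic_minor_augmentation(cohort, group_key, features,
                                 pool_size=1000, seed=0):
    """Augment the biased cohort with randomly selected synthetic samples
    of the least-represented category until it matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(r[group_key] for r in cohort)
    minor = min(counts, key=counts.get)
    deficit = max(counts.values()) - counts[minor]

    # Fit the generator on the minor-category rows only.
    minor_rows = [r for r in cohort if r[group_key] == minor]
    generate = fit_gaussian_generator(minor_rows, features)

    # Generate a synthetic pool, then randomly select just enough rows.
    pool = generate(max(pool_size, deficit), rng)
    picked = rng.sample(pool, deficit)
    return cohort + [{group_key: minor, **row} for row in picked]


# Toy biased cohort: 90 rows in one category, 10 in the other.
rng = random.Random(42)
cohort = (
    [{"sex": "M", "age": rng.gauss(50, 10), "bmi": rng.gauss(27, 4)}
     for _ in range(90)]
    + [{"sex": "F", "age": rng.gauss(45, 8), "bmi": rng.gauss(25, 3)}
       for _ in range(10)]
)
augmented = synthetic_minor_augmentation(cohort, "sex", ["age", "bmi"])
print(Counter(r["sex"] for r in cohort))
print(Counter(r["sex"] for r in augmented))
```

Unlike plain random oversampling, which duplicates existing minor-category records, the added rows here are newly generated, so the quality of the rebalanced cohort depends on how well the chosen synthesizer captures the minor-category distribution.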