Variable selection methods in high-dimensional genetic data
Sahir Bhatnagar, McGill University
Tuesday, February 23, 12-1pm
Zoom Link: https://mcgill.zoom.us/j/91589192037
Abstract: In high-dimensional (HD) data, where the number of covariates (p) greatly exceeds the number of observations (n), estimation can benefit from the bet-on-sparsity principle, i.e., the assumption that only a small number of predictors are relevant to the response. This assumption can lead to more interpretable models, improved predictive accuracy, and computationally efficient algorithms. In genetic studies, where sample sizes are small relative to the number of measured features, we must often assume a sparse model because there isn’t enough information to estimate p parameters. Even when the sample size is large enough to estimate many parameters (e.g., 500k individuals in the UK Biobank), new challenges arise, such as compute time, memory management, and file formats. In this talk, I introduce some popular variable selection techniques, with a focus on the optimization algorithms and software implementations. I then share some recent applications of these methods to variant discovery and genetic risk prediction. If time permits, I will end with an opinionated view of why statistical methods have failed to keep up with the complexity of the data being generated.