Ameer Dharamshi (University of Washington)
TITLE: “Expanding the scope of post-selection inference”
ABSTRACT: Contemporary data analysis pipelines often use the same data both to generate and to test a null hypothesis. This practice is problematic: classical testing procedures that do not account for the data-dependence of the hypothesis fail to control the Type I error rate. This problem, commonly referred to as post-selection inference, is pervasive in modern science. One way to perform valid post-selection inference is to test the hypothesis conditional on the event that the data led to its selection. However, for the resulting conditional distribution to be tractable, the selection event must be amenable to mathematical characterization, and multivariate Gaussianity of the data is typically required. In practice, such assumptions are rigid and limit applicability.
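The failure of classical tests under data-dependent hypotheses is easy to demonstrate by simulation. The sketch below (not from the talk; the sample sizes and selection rule are illustrative) generates data under a global null, selects the feature with the largest sample mean, and then applies a naive z-test to that same feature. The rejection rate lands far above the nominal 5% level:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, alpha, reps = 20, 50, 0.05, 2000

rejections = 0
for _ in range(reps):
    X = rng.normal(size=(n, p))      # global null: every feature has mean 0
    j = np.argmax(X.mean(axis=0))    # select the most promising feature...
    z = X[:, j].mean() * np.sqrt(n)  # ...then test it with a classical z-test
    rejections += abs(z) > 1.96      # two-sided test at the 5% level

rate = rejections / reps
print(rate)  # far above the nominal 0.05, since the test ignores selection
```

Conditioning on the selection event (or splitting the data, as discussed next) is what restores validity.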
In this talk, I will discuss a sequence of projects that expand the scope of post-selection inference through the careful use of external randomness. I first present “data thinning”, a strategy for partitioning each entry of a data matrix into two independent pieces, one for exploration and one for testing; because the folds are independent, any selection algorithm can be used for exploration, and classical testing procedures can be applied for inference. Data thinning enables valid post-selection inference with data generated from a broad class of distributions, both within and beyond the exponential family, and is particularly useful when the sample size is small, the data are non-identically distributed, or selection involves unsupervised learning algorithms. For settings in which data thinning is not available, I present a second strategy in which each entry of a data matrix is partitioned into two dependent pieces. As before, the first piece is used to generate a hypothesis. Inference is conducted by orthogonalizing the second piece with respect to the first under the selected null, then testing whether the orthogonalization was successful. Together, these frameworks provide analysts with a suite of tools for conducting valid post-selection inference in diverse settings.
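Data thinning is easiest to see in the Poisson case: a count X ~ Poisson(λ) can be split by binomial sampling into two counts that are marginally Poisson(ελ) and Poisson((1−ε)λ) and are independent of one another. A minimal sketch (the function name and the ε = 0.5 split are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_thin(X, eps=0.5, rng=rng):
    """Split a matrix of Poisson counts into two independent folds.

    If X[i, j] ~ Poisson(lam), then X_train[i, j] ~ Poisson(eps * lam),
    X_test[i, j] ~ Poisson((1 - eps) * lam), and the folds are independent.
    """
    X = np.asarray(X)
    X_train = rng.binomial(X, eps)  # binomial thinning of each count
    X_test = X - X_train            # the remainder forms the second fold
    return X_train, X_test

# Example: thin a 100 x 5 Poisson matrix; the folds add back to the original.
X = rng.poisson(10.0, size=(100, 5))
X_train, X_test = poisson_thin(X, eps=0.5)
```

Because X_train and X_test are independent, any exploratory algorithm may be run on X_train, and classical tests applied to X_test remain valid.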