ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping
Jingyi Jessica Li, PhD
Professor of Statistics and Data Science
University of California, Los Angeles (UCLA)
WHEN: Wednesday, January 17, 2024, from 3:30 to 4:30 p.m.
WHERE: hybrid | 2001 McGill College Avenue, room 1140; Zoom
NOTE: Dr. Li will be presenting from UCLA
Abstract
In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find discrete cell clusters as putative cell types, and then a statistical test is employed to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure suffers from the "double dipping" issue: the same data are used twice, first to find discrete cell clusters as putative cell types and then to find DE genes as potential cell-type marker genes, leading to false-positive cell-type marker genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE method for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality, which can work as an add-on to popular pipelines such as Seurat. The core idea of ClusterDE is to generate real-data-based synthetic null data containing only one cell type, as a contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulations and real data analysis, we show that ClusterDE has solid FDR control and the ability to identify canonical cell-type marker genes as top DE genes, distinguishing them from common housekeeping genes. Notably, the DE genes identified by ClusterDE are informative markers for discrete cell types and can guide the merging of spurious clusters. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests.
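The double-dipping problem described in the abstract can be illustrated with a small simulation. The sketch below is not ClusterDE and does not use Seurat; it is a toy example under stated assumptions: 400 simulated cells drawn from a single homogeneous population (20 independent standard-normal "genes"), a plain 2-means clustering, and a large-sample two-sided z-test per gene. All function names (`two_means`, `fraction_significant`) and parameter values are hypothetical choices made for illustration. Because there is only one cell type, every gene called DE is a false positive.

```python
import math
import random
import statistics


def two_means(cells, rng, n_iter=25):
    """Plain Lloyd's algorithm with k=2; returns a 0/1 label per cell."""
    centers = rng.sample(cells, 2)  # two distinct cells as starting centers
    labels = [0] * len(cells)
    for _ in range(n_iter):
        # Assign each cell to the nearest center (squared Euclidean distance).
        labels = [
            min((0, 1), key=lambda k: sum((x - c) ** 2 for x, c in zip(cell, centers[k])))
            for cell in cells
        ]
        # Recompute each center as the mean of its assigned cells.
        for k in (0, 1):
            members = [cell for cell, lab in zip(cells, labels) if lab == k]
            if members:
                centers[k] = [statistics.fmean(col) for col in zip(*members)]
    return labels


def fraction_significant(cells, labels, alpha=0.05):
    """Fraction of genes with p < alpha in a two-group z-test (normal approximation)."""
    n_genes = len(cells[0])
    hits = 0
    for g in range(n_genes):
        a = [cell[g] for cell, lab in zip(cells, labels) if lab == 0]
        b = [cell[g] for cell, lab in zip(cells, labels) if lab == 1]
        se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
        z = (statistics.fmean(a) - statistics.fmean(b)) / se
        p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
        hits += p < alpha
    return hits / n_genes


rng = random.Random(0)
n_cells, n_genes = 400, 20

# One homogeneous population: no gene is truly differentially expressed.
cells = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_cells)]

cluster_labels = two_means(cells, rng)                       # labels learned from the data
random_labels = [rng.randrange(2) for _ in range(n_cells)]   # labels independent of the data

dipped = fraction_significant(cells, cluster_labels)
honest = fraction_significant(cells, random_labels)

print(f"Fraction of genes called DE with cluster labels (double dipping): {dipped:.2f}")
print(f"Fraction of genes called DE with random labels (no double dipping): {honest:.2f}")
```

With data-independent labels, the fraction of genes called DE should sit near the nominal 5% level, whereas labels derived from the same data tend to push it well above that, since the clustering itself separates cells along directions in which the genes differ. This only demonstrates the problem; ClusterDE's remedy, as the abstract states, is to run the same clustering-plus-DE procedure on synthetic null data and use it as a contrast.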
Speaker bio
Jingyi Jessica Li, Professor of Statistics and Data Science (also affiliated with Biostatistics, Computational Medicine, and Human Genetics), leads a research group titled the Junction of Statistics and Biology at UCLA. With a Ph.D. from UC Berkeley and a B.S. from Tsinghua University, Dr. Li focuses on developing interpretable statistical methods for biomedical data. Her research delves into quantifying the central dogma, extracting hidden information from transcriptomics data, and ensuring statistical rigor in data analysis by employing synthetic negative controls. A recipient of multiple awards, including the NSF CAREER Award, Sloan Research Fellowship, ISCB Overton Prize, and COPSS Emerging Leaders Award, she has gained recognition for her contributions in the fields of computational biology and statistics. Website: http://jsb.ucla.edu/