Pre-processing and quality control of whole genome sequencing data: a case study using 9000 whole genomes from the GENESIS-HD study

Published: 15 December 2022

Andreas Ziegler, Dr. rer. nat.

Scientific Director and CEO | Research Group Cardio-CARE

Where: Hybrid Event | 2001 McGill College, Room 1140; Zoom


Rapid advances in high-throughput DNA sequencing technologies have enabled large-scale whole genome sequencing (WGS) studies. In this presentation, we describe the pre-processing pipeline and quality control framework we selected for the GENEtic SequencIng Study Hamburg-Davos (GENESIS-HD), a study involving more than 9000 human whole genomes. All samples were sequenced on a single Illumina NovaSeq 6000 with an average coverage of 35x using a PCR-free protocol and unique dual indices (UDI). For quality control, one Genome in a Bottle (GIAB) trio was sequenced in triplicate, and one GIAB sample was sequenced 70 times in different runs. First, we explain the sequencing approach using illustrations. We then describe important quality control metrics on the raw data (fastq file), after mapping and alignment (bam file), after variant calling (gvcf file), and after multi-sample calling (msvcf file). We provide empirical data on efficient sample storage using original read archive (ORA) compression of fastq files. Finally, we sketch methods tailored for downstream association analysis and their incorporation into our analysis pipeline. The most important quality metrics for sample filtering were ancestry, sample cross-contamination, deviation from the expected Het/Hom ratio, relatedness, and insufficient coverage. We detected patterns of sample cross-contamination that point to carry-over through a multichannel pipette. When fastq files were compressed using ORA compression, the resulting file size was approximately 1/5 of the original file size, and compression time increased linearly with the number of mismatch bases. In summary, the pre-processing, joint calling, and QC of large WGS studies are now feasible in reasonable time, and efficient quality control procedures are readily available.

Speaker Bio

Andreas is Scientific Director and CEO of the non-profit research group Cardio-CARE, Davos, Switzerland, a wholly owned subsidiary of the Kühne Foundation since 2020. Previously, he was director of the Institute of Medical Biometry and Statistics at the University of Lübeck, Germany. He has served as president of the German Region of the International Biometric Society and of the International Genetic Epidemiology Society. His research covers several areas, including machine learning, clinical trials for medical devices, and genetic epidemiology. He has authored or co-authored more than 500 research articles and 8 books, including the textbook "A Statistical Approach to Genetic Epidemiology". Over the past three years, Andreas’ main focus has been the whole genome sequencing experiment described in the presentation.


