• Media type: E-article
  • Title: CALQ: compression of quality values of aligned sequencing data
  • Contributors: Voges, Jan; Ostermann, Jörn; Hernaez, Mikel
  • Published: Oxford University Press (OUP), 2018
  • Published in: Bioinformatics, 34 (2018) 10, pages 1650-1658
  • Language: English
  • DOI: 10.1093/bioinformatics/btx737
  • ISSN: 1367-4803; 1367-4811
  • Keywords: Computational Mathematics ; Computational Theory and Mathematics ; Computer Science Applications ; Molecular Biology ; Biochemistry ; Statistics and Probability
  • Origin:
  • Notes:
  • Description: Abstract. Motivation: Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results: We analyze the performance of several lossy compressors for quality values in terms of the trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as well as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets. Availability and implementation: CALQ is written in C++ and can be downloaded from https://github.com/voges/calq. Supplementary information: Supplementary data are available at Bioinformatics online.
  • Access status: Free access