• Media type: Electronic Conference Proceeding; E-Article; Text
  • Title: Efficient Reconciliation of Genomic Datasets of High Similarity
  • Contributor: Shibuya, Yoshihiro [Author]; Belazzougui, Djamal [Author]; Kucherov, Gregory [Author]
  • imprint: Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022
  • Language: English
  • DOI: https://doi.org/10.4230/LIPIcs.WABI.2022.14
  • Keywords: MinHash ; Invertible Bloom Lookup Tables ; k-mers ; IBLT ; minimizers ; sketching ; syncmers
  • Origination:
  • Footnote: Diese Datenquelle enthält auch Bestandsnachweise, die nicht zu einem Volltext führen.
  • Description: We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originated from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating Jaccard similarity of underlying k-mer sets, compared to MinHash which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that involved data structures require space proportional to the difference of k-mer sets and are independent of the size of sets themselves. As another application, we show how our ideas can be applied in order to efficiently compute (an approximation of) k-mers that differ between two datasets, still using space only proportional to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus Pneumoniae genomes).
  • Access State: Open Access