• Media type: E-Article
  • Title: False gene and chromosome losses in genome assemblies caused by GC content variation and repeats
  • Contributor: Kim, Juwan; Lee, Chul; Ko, Byung June; Yoo, Dong Ahn; Won, Sohyoung; Phillippy, Adam M.; Fedrigo, Olivier; Zhang, Guojie; Howe, Kerstin; Wood, Jonathan; Durbin, Richard; Formenti, Giulio; Brown, Samara; Cantin, Lindsey; Mello, Claudio V.; Cho, Seoae; Rhie, Arang; Kim, Heebal; Jarvis, Erich D.
  • Published: Springer Science and Business Media LLC, 2022
  • Published in: Genome Biology, 23 (2022) 1
  • Language: English
  • DOI: 10.1186/s13059-022-02765-0
  • ISSN: 1474-760X
  • Description: Abstract. Background: Many short-read genome assemblies have been found to be incomplete and contain mis-assemblies. The Vertebrate Genomes Project has been producing new reference genome assemblies with an emphasis on being as complete and error-free as possible, which requires utilizing long reads, long-range scaffolding data, new assembly algorithms, and manual curation. A more thorough evaluation of the recent references relative to prior assemblies can provide a detailed overview of the types and magnitude of improvements. Results: Here we evaluate new vertebrate genome references relative to the previous assemblies for the same species and, in two cases, the same individuals, including a mammal (platypus), two birds (zebra finch, Anna’s hummingbird), and a fish (climbing perch). We find that up to 11% of genomic sequence is entirely missing in the previous assemblies. In the Vertebrate Genomes Project zebra finch assembly, we identify eight new GC- and repeat-rich micro-chromosomes with high gene density. The impact of missing sequences is biased towards GC-rich 5′-proximal promoters and 5′ exon regions of protein-coding genes and long non-coding RNAs. Between 26 and 60% of genes include structural or sequence errors that could lead to misunderstanding of their function when using the previous genome assemblies. Conclusions: Our findings reveal novel regulatory landscapes and protein coding sequences that have been greatly underestimated in previous assemblies and are now present in the Vertebrate Genomes Project reference genomes.
  • Access State: Open Access