1
Calamari ZT, Song A, Cohen E, Akter M, Das Roy R, Hallikas O, Christensen MM, Li P, Marangoni P, Jernvall J, Klein OD. Bank vole genomics links determinate and indeterminate growth of teeth. BMC Genomics 2024; 25:1000. [PMID: 39472825 PMCID: PMC11523675 DOI: 10.1186/s12864-024-10901-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2024] [Accepted: 10/14/2024] [Indexed: 11/02/2024]
Abstract
BACKGROUND: Continuously growing teeth are an important innovation in mammalian evolution, yet genetic regulation of continuous growth by stem cells remains incompletely understood. Dental stem cells responsible for tooth crown growth are lost at the onset of tooth root formation. Genetic signaling that initiates this loss is difficult to study with the ever-growing incisor and rooted molars of mice, the most common mammalian dental model species, because signals for root formation overlap with signals that pattern tooth size and shape (i.e., cusp patterns). Bank and prairie voles (Cricetidae, Rodentia, Glires) have evolved rooted and unrooted molars while retaining similar size and shape, providing alternative models for studying roots.
RESULTS: We assembled a de novo genome of Myodes glareolus, a vole with high-crowned, rooted molars, and performed genomic and transcriptomic analyses in a broad phylogenetic context of Glires (rodents and lagomorphs) to assess differential selection and evolution in tooth-forming genes. Bulk transcriptomics comparisons of embryonic molar development between bank voles and mice demonstrated overall conservation of gene expression levels, with species-specific differences corresponding to the accelerated and more extensive patterning of the vole molar. We leverage convergent evolution of unrooted molars across the clade to examine changes that may underlie the evolution of unrooted molars. We identified 15 dental genes with changing synteny relationships and six dental genes undergoing positive selection across Glires, two of which were undergoing positive selection in species with unrooted molars, Dspp and Aqp1. Decreased expression of both genes in prairie voles with unrooted molars compared to bank voles supports the presence of positive selection and may underlie differences in root formation.
CONCLUSIONS: Our results support ongoing evolution of dental genes across Glires and identify candidate genes for mechanistic studies of root formation. Comparative research using the bank vole as a model species can reveal the complex evolutionary background of convergent evolution for ever-growing molars.
Affiliation(s)
- Zachary T Calamari
  - Baruch College, City University of New York, One Bernard Baruch Way, New York, NY, 10010, USA
  - The Graduate Center, City University of New York, 365 Fifth Ave, New York, NY, 10016, USA
  - Program in Craniofacial Biology, Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA, 94158, USA
  - Division of Paleontology, American Museum of Natural History, Central Park West at 79th Street, New York, NY, 10024, USA
- Andrew Song
  - Baruch College, City University of New York, One Bernard Baruch Way, New York, NY, 10010, USA
  - Cornell University, 616 Thurston Ave, Ithaca, NY, 14853, USA
- Emily Cohen
  - Baruch College, City University of New York, One Bernard Baruch Way, New York, NY, 10010, USA
  - New York University College of Dentistry, 345 E 34th St, New York, NY, 10010, USA
- Muspika Akter
  - Baruch College, City University of New York, One Bernard Baruch Way, New York, NY, 10010, USA
- Rishi Das Roy
  - Institute of Biotechnology, University of Helsinki, Helsinki, FI-00014, Finland
- Outi Hallikas
  - Institute of Biotechnology, University of Helsinki, Helsinki, FI-00014, Finland
- Mona M Christensen
  - Institute of Biotechnology, University of Helsinki, Helsinki, FI-00014, Finland
- Pengyang Li
  - Program in Craniofacial Biology, Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA, 94158, USA
  - Department of Pediatrics, Cedars-Sinai Guerin Children's, 8700 Beverly Blvd., Suite 2416, Los Angeles, CA, 90048, USA
  - Department of Bioengineering, Stanford University, 443 Via Ortega, Rm 119, Stanford, CA, 94305, USA
- Pauline Marangoni
  - Program in Craniofacial Biology, Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA, 94158, USA
  - Department of Pediatrics, Cedars-Sinai Guerin Children's, 8700 Beverly Blvd., Suite 2416, Los Angeles, CA, 90048, USA
- Jukka Jernvall
  - Institute of Biotechnology, University of Helsinki, Helsinki, FI-00014, Finland
  - Department of Geosciences and Geography, University of Helsinki, Helsinki, FI-00014, Finland
- Ophir D Klein
  - Program in Craniofacial Biology, Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA, 94158, USA
  - Department of Pediatrics, Cedars-Sinai Guerin Children's, 8700 Beverly Blvd., Suite 2416, Los Angeles, CA, 90048, USA
2
Kumar M, Tibocha-Bonilla JD, Füssy Z, Lieng C, Schwenck SM, Levesque AV, Al-Bassam MM, Passi A, Neal M, Zuniga C, Kaiyom F, Espinoza JL, Lim H, Polson SW, Allen LZ, Zengler K. Mixotrophic growth of a ubiquitous marine diatom. Science Advances 2024; 10:eado2623. [PMID: 39018398 PMCID: PMC466952 DOI: 10.1126/sciadv.ado2623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Accepted: 06/12/2024] [Indexed: 07/19/2024]
Abstract
Diatoms are major players in the global carbon cycle, and their metabolism is affected by ocean conditions. Understanding the impact of changing inorganic nutrients in the oceans on diatoms is crucial, given the changes in global carbon dioxide levels. Here, we present a genome-scale metabolic model (iMK1961) for Cylindrotheca closterium, an in silico resource to understand uncharacterized metabolic functions in this ubiquitous diatom. iMK1961 represents the largest diatom metabolic model to date, comprising 1961 open reading frames and 6718 reactions. With iMK1961, we identified the metabolic response signature to cope with drastic changes in growth conditions. Comparing model predictions with Tara Oceans transcriptomics data unraveled C. closterium's metabolism in situ. Unexpectedly, the diatom only grows photoautotrophically in 21% of the sunlit ocean samples, while the majority of the samples indicate a mixotrophic (71%) or, in some cases, even a heterotrophic (8%) lifestyle in the light. Our findings highlight C. closterium's metabolic flexibility and its potential role in global carbon cycling.
Affiliation(s)
- Manish Kumar
  - Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Juan D. Tibocha-Bonilla
  - Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Zoltán Füssy
  - Department of Parasitology, Faculty of Science, Charles University, BIOCEV, Vestec, Czech Republic
- Chloe Lieng
  - Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Sarah M. Schwenck
  - Scripps Institution of Oceanography, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Alice V. Levesque
  - Scripps Institution of Oceanography, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Mahmoud M. Al-Bassam
  - Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Anurag Passi
  - Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Maxwell Neal
  - Department of Bioengineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Cristal Zuniga
  - Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Farrah Kaiyom
  - Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Josh L. Espinoza
  - Department of Microbial and Environmental Genomics, J. Craig Venter Institute, 4120 Capricorn Way, La Jolla, CA 92037, USA
- Hyungyu Lim
  - Department of Bioengineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
- Shawn W. Polson
  - Department of Computer and Information Sciences, University of Delaware, 18 Amstel Ave., Newark, DE 19716, USA
  - Center for Bioinformatics and Computational Biology, University of Delaware, 590 Avenue 1743, Newark, DE 19713, USA
- Lisa Zeigler Allen
  - Scripps Institution of Oceanography, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
  - Department of Microbial and Environmental Genomics, J. Craig Venter Institute, 4120 Capricorn Way, La Jolla, CA 92037, USA
- Karsten Zengler
  - Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
  - Department of Bioengineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
  - Center for Microbiome Innovation, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
  - Program in Materials Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
3
Calamari ZT, Song A, Cohen E, Akter M, Roy RD, Hallikas O, Christensen MM, Li P, Marangoni P, Jernvall J, Klein OD. Vole genomics links determinate and indeterminate growth of teeth. bioRxiv 2024:2023.12.18.572015. [PMID: 38187646 PMCID: PMC10769287 DOI: 10.1101/2023.12.18.572015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
Continuously growing teeth are an important innovation in mammalian evolution, yet genetic regulation of continuous growth by stem cells remains incompletely understood. Dental stem cells responsible for tooth crown growth are lost at the onset of tooth root formation. Genetic signaling that initiates this loss is difficult to study with the ever-growing incisor and rooted molars of mice, the most common mammalian dental model species, because signals for root formation overlap with signals that pattern tooth size and shape (i.e., cusp patterns). Different species of voles (Cricetidae, Rodentia, Glires) have evolved rooted and unrooted molars that have similar size and shape, providing alternative models for studying roots. We assembled a de novo genome of Myodes glareolus, a vole with high-crowned, rooted molars, and performed genomic and transcriptomic analyses in a broad phylogenetic context of Glires (rodents and lagomorphs) to assess differential selection and evolution in tooth forming genes. We identified 15 dental genes with changing synteny relationships and six dental genes undergoing positive selection across Glires, two of which were undergoing positive selection in species with unrooted molars, Dspp and Aqp1. Decreased expression of both genes in prairie voles with unrooted molars compared to bank voles supports the presence of positive selection and may underlie differences in root formation. Bulk transcriptomics analyses of embryonic molar development in bank voles also demonstrated conserved patterns of dental gene expression compared to mice, with species-specific variation likely related to developmental timing and morphological differences between mouse and vole molars. Our results support ongoing evolution of dental genes across Glires, revealing the complex evolutionary background of convergent evolution for ever-growing molars.
Affiliation(s)
- Zachary T. Calamari
  - Baruch College, City University of New York, One Bernard Baruch Way, New York, NY 10010, USA
  - The Graduate Center, City University of New York, 365 Fifth Ave, New York, NY 10016, USA
  - Program in Craniofacial Biology and Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
  - Division of Paleontology, American Museum of Natural History, Central Park West at 79th Street, New York, NY, 10024, USA
- Andrew Song
  - Baruch College, City University of New York, One Bernard Baruch Way, New York, NY 10010, USA
  - Cornell University, 616 Thurston Ave, Ithaca, NY 14853, USA
- Emily Cohen
  - Baruch College, City University of New York, One Bernard Baruch Way, New York, NY 10010, USA
  - New York University College of Dentistry, 345 E 34th St, New York, NY 10010
- Muspika Akter
  - Baruch College, City University of New York, One Bernard Baruch Way, New York, NY 10010, USA
- Rishi Das Roy
  - Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
- Outi Hallikas
  - Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
- Mona M. Christensen
  - Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
- Pengyang Li
  - Program in Craniofacial Biology and Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
  - Department of Pediatrics, Cedars-Sinai Guerin Children’s, 8700 Beverly Blvd., Suite 2416, Los Angeles, CA 90048, USA
- Pauline Marangoni
  - Program in Craniofacial Biology and Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
  - Department of Pediatrics, Cedars-Sinai Guerin Children’s, 8700 Beverly Blvd., Suite 2416, Los Angeles, CA 90048, USA
- Jukka Jernvall
  - Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
  - Department of Geosciences and Geography, University of Helsinki, FI-00014 Helsinki, Finland
- Ophir D. Klein
  - Program in Craniofacial Biology and Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
  - Department of Pediatrics, Cedars-Sinai Guerin Children’s, 8700 Beverly Blvd., Suite 2416, Los Angeles, CA 90048, USA
4
Glidden-Handgis G, Wheeler TJ. WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences. Bioinformatics Advances 2024; 4:vbae052. [PMID: 38764475 PMCID: PMC11099658 DOI: 10.1093/bioadv/vbae052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Revised: 03/31/2024] [Accepted: 04/04/2024] [Indexed: 05/21/2024]
Abstract
Background: Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match's score by chance. E-values accurately predict false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often develop benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats, while disrupting the original sequence's functional properties. However, we and others have observed that sequences appear to produce high scoring alignments to their reversals with surprising frequency, leading to overstatement of false match risk that may negatively affect downstream analysis.
Results: We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that (even randomly shuffled) sequences contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. Though the expected palindrome length is only slightly larger than the expected LCS, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to greatly increased frequency of high-scoring alignments to reversed sequences.
Impact: Overestimates of false match risk can motivate unnecessarily high score thresholds, leading to potentially reduced true match sensitivity. Also, when tool sensitivity is only reported up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. As a result of these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.
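The comparison described in this abstract, longest palindrome within a sequence versus longest common substring between two shuffles of it, can be explored with a small simulation. The following is a minimal sketch and not the authors' code: the function names, the DNA alphabet, the sequence length, and the trial count are all assumptions chosen for illustration, and exact (not approximate) palindromes are measured.

```python
import random

def longest_palindrome(s: str) -> int:
    """Length of the longest exact palindromic substring (expand around each center)."""
    best = 0
    n = len(s)
    for center in range(2 * n - 1):
        lo, hi = center // 2, (center + 1) // 2
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            lo -= 1
            hi += 1
        best = max(best, hi - lo - 1)
    return best

def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b (dynamic programming)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0] * (len(b) + 1)
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def shuffled(s: str, rng: random.Random) -> str:
    """Return a random permutation of the letters of s."""
    chars = list(s)
    rng.shuffle(chars)
    return "".join(chars)

def compare(length: int = 200, trials: int = 50, alphabet: str = "ACGT", seed: int = 0):
    """Mean longest-palindrome length and mean LCS length against a shuffled copy."""
    rng = random.Random(seed)
    pal_total = lcs_total = 0
    for _ in range(trials):
        s = "".join(rng.choice(alphabet) for _ in range(length))
        pal_total += longest_palindrome(s)
        lcs_total += longest_common_substring(s, shuffled(s, rng))
    return pal_total / trials, lcs_total / trials
```

Calling `compare()` returns the two means for inspection. Any palindrome in S is also a substring shared between S and its reversal, which is why palindrome length is the relevant quantity when reversed sequences serve as decoys.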
Affiliation(s)
- Travis J Wheeler
  - R. Ken Coit College of Pharmacy, University of Arizona, Tucson, AZ 85721, United States
5
Matsushima W, Planet E, Trono D. Ancestral genome reconstruction enhances transposable element annotation by identifying degenerate integrants. Cell Genomics 2024; 4:100497. [PMID: 38295789 PMCID: PMC10879028 DOI: 10.1016/j.xgen.2024.100497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Revised: 08/09/2023] [Accepted: 01/06/2024] [Indexed: 02/17/2024]
Abstract
Growing evidence indicates that transposable elements (TEs) play important roles in evolution by providing genomes with coding and non-coding sequences. Identification of TE-derived functional elements, however, has relied on TE annotations in individual species, which limits its scope to relatively intact TE sequences. Here, we report a novel approach to uncover previously unannotated degenerate TEs (degTEs) by probing multiple ancestral genomes reconstructed from hundreds of species. We applied this method to the human genome and achieved a 10.8% increase in coverage over the most recent annotation. Further, we discovered that degTEs contribute to various cis-regulatory elements and transcription factor binding sites, including those of a known TE-controlling family, the KRAB zinc-finger proteins. We also report unannotated chimeric transcripts between degTEs and human genes expressed in embryos. This study provides a novel methodology and a freely available resource that will facilitate the investigation of TE co-option events on a full scale.
Affiliation(s)
- Wayo Matsushima
  - School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
- Evarist Planet
  - School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
- Didier Trono
  - School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
6
Nguyen-Vo TH, Trinh QH, Nguyen L, Nguyen-Hoang PU, Rahardja S, Nguyen BP. i4mC-GRU: Identifying DNA N4-Methylcytosine sites in mouse genomes using bidirectional gated recurrent unit and sequence-embedded features. Comput Struct Biotechnol J 2023; 21:3045-3053. [PMID: 37273848 PMCID: PMC10238585 DOI: 10.1016/j.csbj.2023.05.014] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 08/03/2022] [Revised: 05/12/2023] [Accepted: 05/12/2023] [Indexed: 06/06/2023]
Abstract
N4-methylcytosine (4mC) is one of the most common DNA methylation modifications found in both prokaryotic and eukaryotic genomes. Since the 4mC has various essential biological roles, determining its location helps reveal unexplored physiological and pathological pathways. In this study, we propose an effective computational method called i4mC-GRU using a gated recurrent unit and duplet sequence-embedded features to predict potential 4mC sites in mouse (Mus musculus) genomes. To fairly assess the performance of the model, we compared our method with several state-of-the-art methods using two different benchmark datasets. Our results showed that i4mC-GRU achieved area under the receiver operating characteristic curve values of 0.97 and 0.89 and area under the precision-recall curve values of 0.98 and 0.90 on the first and second benchmark datasets, respectively. Briefly, our method outperformed existing methods in predicting 4mC sites in mouse genomes. Also, we deployed i4mC-GRU as an online web server, supporting users in genomics studies.
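The abstract does not spell out how "duplet" features are built. One plausible reading, taken here as an assumption, is that each DNA sequence is split into overlapping 2-mers that are mapped to integer tokens before being passed to an embedding layer. The sketch below illustrates only that tokenization step; the names `DUPLET_INDEX` and `encode_duplets` are hypothetical and the actual i4mC-GRU preprocessing may differ.

```python
from itertools import product
from typing import List

ALPHABET = "ACGT"

# Map each of the 16 possible duplets to a token id in 1..16; 0 is reserved
# for duplets containing an unexpected symbol (for example N).
DUPLET_INDEX = {"".join(pair): i + 1 for i, pair in enumerate(product(ALPHABET, repeat=2))}

def encode_duplets(seq: str, stride: int = 1) -> List[int]:
    """Turn a DNA string into a list of integer tokens, one per 2-mer window."""
    seq = seq.upper()
    return [DUPLET_INDEX.get(seq[i:i + 2], 0) for i in range(0, len(seq) - 1, stride)]

tokens = encode_duplets("ACGT")  # windows AC, CG, GT
```

A sequence of length n yields n - 1 tokens at stride 1, and those token ids are the kind of input an embedding layer followed by a recurrent unit would consume.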
Affiliation(s)
- Thanh-Hoang Nguyen-Vo
  - School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
  - School of Innovation, Design and Technology, Wellington Institute of Technology, Wellington 5012, New Zealand
- Quang H. Trinh
  - School of Information and Communication Technology, Hanoi University of Science and Technology, Hanoi 100000, Vietnam
- Loc Nguyen
  - School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
- Phuong-Uyen Nguyen-Hoang
  - Computational Biology Center, International University - VNU HCMC, Ho Chi Minh City 700000, Vietnam
- Susanto Rahardja
  - School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
  - Infocomm Technology Cluster, Singapore Institute of Technology, Singapore 138683, Singapore
- Binh P. Nguyen
  - School of Mathematics and Statistics, Victoria University of Wellington, Wellington 6140, New Zealand
7
Sun YH, Cui H, Song C, Shen JT, Zhuo X, Wang RH, Yu X, Ndamba R, Mu Q, Gu H, Wang D, Murthy GG, Li P, Liang F, Liu L, Tao Q, Wang Y, Orlowski S, Xu Q, Zhou H, Jagne J, Gokcumen O, Anthony N, Zhao X, Li XZ. Amniotes co-opt intrinsic genetic instability to protect germ-line genome integrity. Nat Commun 2023; 14:812. [PMID: 36781861 PMCID: PMC9925758 DOI: 10.1038/s41467-023-36354-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 07/14/2022] [Accepted: 01/27/2023] [Indexed: 02/15/2023]
Abstract
Unlike PIWI-interacting RNA (piRNA) in other species that mostly target transposable elements (TEs), >80% of piRNAs in adult mammalian testes lack obvious targets. However, mammalian piRNA sequences and piRNA-producing loci evolve more rapidly than the rest of the genome for unknown reasons. Here, through comparative studies of chickens, ducks, mice, and humans, as well as long-read nanopore sequencing on diverse chicken breeds, we find that piRNA loci across amniotes experience: (1) a high local mutation rate of structural variations (SVs, mutations ≥ 50 bp in size); (2) positive selection to suppress young and actively mobilizing TEs commencing at the pachytene stage of meiosis during germ cell development; and (3) negative selection to purge deleterious SV hotspots. Our results indicate that genetic instability at pachytene piRNA loci, while producing certain pathogenic SVs, also protects genome integrity against TE mobilization by driving the formation of rapid-evolving piRNA sequences.
Affiliation(s)
- Yu H Sun
  - Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
- Hongxiao Cui
  - College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Chi Song
  - College of Public Health, Division of Biostatistics, The Ohio State University, Columbus, OH, 43210, USA
- Jiafei Teng Shen
  - International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, 322000, China
- Xiaoyu Zhuo
  - Department of Genetics, The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA
- Ruoqiao Huiyi Wang
  - Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
  - College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Xiaohui Yu
  - College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China
- Rudo Ndamba
  - Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
- Qian Mu
  - Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
- Hanwen Gu
  - Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
- Duolin Wang
  - Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
- Gayathri Guru Murthy
  - Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
- Pidong Li
  - Grandomics Biosciences Co., Ltd, Beijing, 102206, China
- Fan Liang
  - Grandomics Biosciences Co., Ltd, Beijing, 102206, China
- Lei Liu
  - Grandomics Biosciences Co., Ltd, Beijing, 102206, China
- Qing Tao
  - Grandomics Biosciences Co., Ltd, Beijing, 102206, China
- Ying Wang
  - Department of Animal Science, University of California, Davis, CA, 95616, USA
- Sara Orlowski
  - Department of Poultry Science, University of Arkansas, Fayetteville, AR, 72701, USA
- Qi Xu
  - Department of Animal Science, McGill University, Quebec, H9X 3V9, Canada
- Huaijun Zhou
  - Department of Animal Science, University of California, Davis, CA, 95616, USA
- Jarra Jagne
  - Animal Health Diagnostic Center, Cornell University College of Veterinary Medicine, Ithaca, NY, 14850, USA
- Omer Gokcumen
  - Department of Biological Sciences, University at Buffalo, State University of New York, Buffalo, NY, 14260, USA
- Nick Anthony
  - Department of Poultry Science, University of Arkansas, Fayetteville, AR, 72701, USA
- Xin Zhao
  - Department of Animal Science, McGill University, Quebec, H9X 3V9, Canada
- Xin Zhiguo Li
  - Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
8
Löytynoja A. Thousands of human mutation clusters are explained by short-range template switching. Genome Res 2022; 32:1437-1447. [PMID: 35760560 PMCID: PMC9435742 DOI: 10.1101/gr.276478.121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2021] [Accepted: 06/21/2022] [Indexed: 02/03/2023]
Abstract
Variation within human genomes is unevenly distributed, and variants show spatial clustering. DNA replication-related template switching is a poorly known mutational mechanism capable of causing major chromosomal rearrangements as well as creating short inverted sequence copies that appear as local mutation clusters in sequence comparisons. In this study, haplotype-resolved genome assemblies representing 25 human populations and multinucleotide variants aggregated from 140,000 human sequencing experiments were reanalyzed. Local template switching could explain thousands of complex mutation clusters across the human genome, the loci segregating within and between populations. During the study, computational tools were developed for identification of template switch events using both short-read sequencing data and genotype data, and for genotyping candidate loci using short-read data. The characteristics of template-switch mutations complicate their detection, and widely used analysis pipelines for short-read sequencing data, normally capable of identifying single nucleotide changes, were found to miss template-switch mutations of tens of base pairs, potentially invalidating medical genetic studies searching for a causative allele behind genetic diseases. Combined with the massive sequencing data now available for humans, the novel tools described here enable building catalogs of affected loci and studying the cellular mechanisms behind template switching in both healthy organisms and disease.
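The "short inverted sequence copies" mentioned in this abstract can be illustrated with a toy check: if a stretch of apparently clustered mutations equals the reverse complement of a nearby reference segment, a single template-switch event may explain it better than several independent changes. This is a minimal sketch under that assumption, not the tools developed in the study; `reverse_complement` and `find_inverted_copy` are hypothetical helper names, and only exact matches are handled.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def find_inverted_copy(reference: str, segment: str) -> int:
    """Return the 0-based start in `reference` of the region whose reverse
    complement equals `segment`, or -1 if no exact inverted copy exists."""
    return reference.find(reverse_complement(segment))

# Toy example: the novel segment GCTT is the reverse complement of AAGC,
# which occurs in the reference window at position 3.
position = find_inverted_copy("TTTAAGCGG", "GCTT")
```

In practice the candidate segment would come from comparing a sample haplotype against the reference around a mutation cluster, and a real detector would need to tolerate mismatches and restrict the search to a local window.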
Affiliation(s)
- Ari Löytynoja
  - Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
9
Madison RW, Hu X, Ramanan V, Xu Z, Huang RSP, Sokol ES, Frampton GM, Schrock AB, Ali SM, Ganesan S, De S. Clustered 8-Oxo-Guanine Mutations and Oncogenic Gene Fusions in Microsatellite-Unstable Colorectal Cancer. JCO Precis Oncol 2022; 6:e2100477. [PMID: 35584350 PMCID: PMC9200390 DOI: 10.1200/po.21.00477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Abstract
Colorectal carcinomas (CRCs) with microsatellite-instability (MSI) are enriched for oncogenic kinase fusions (KFs), including NTRK1, RET, and BRAF, but the mechanism underlying this finding is unclear. Clustered 8-oxo-guanine mutations promote oncogenic fusions in MSI colorectal tumor.
Affiliation(s)
- Xiaoju Hu
  - Rutgers Cancer Institute of New Jersey and Rutgers Robert Wood Johnson Medical School, Rutgers University, New Brunswick, NJ
- Zhuxuan Xu
  - Rutgers Cancer Institute of New Jersey and Rutgers Robert Wood Johnson Medical School, Rutgers University, New Brunswick, NJ
- Shridar Ganesan
  - Rutgers Cancer Institute of New Jersey and Rutgers Robert Wood Johnson Medical School, Rutgers University, New Brunswick, NJ
- Subhajyoti De
  - Rutgers Cancer Institute of New Jersey and Rutgers Robert Wood Johnson Medical School, Rutgers University, New Brunswick, NJ
10
Storer JM, Hubley R, Rosen J, Smit AFA. Methodologies for the De novo Discovery of Transposable Element Families. Genes (Basel) 2022; 13:709. [PMID: 35456515 PMCID: PMC9025800 DOI: 10.3390/genes13040709] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/23/2022] [Revised: 04/14/2022] [Accepted: 04/15/2022] [Indexed: 02/07/2023]
Abstract
The discovery and characterization of transposable element (TE) families are crucial tasks in the process of genome annotation. Careful curation of TE libraries for each organism is necessary as each has been exposed to a unique and often complex set of TE families. De novo methods have been developed; however, a fully automated and accurate approach to the development of complete libraries remains elusive. In this review, we cover established methods and recent developments in de novo TE analysis. We also present various methodologies used to assess these tools and discuss opportunities for further advancement of the field.
Affiliation(s)
- Arian F. A. Smit
- Institute for Systems Biology, Seattle, WA 98109, USA; (J.M.S.); (R.H.); (J.R.)
|
11
|
Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A 2020; 117:9451-9457. [PMID: 32300014 PMCID: PMC7196820 DOI: 10.1073/pnas.1921046117] [Citation(s) in RCA: 1342] [Impact Index Per Article: 335.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/13/2022] Open
Abstract
The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).
Affiliation(s)
- Jullien M Flynn
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853
- Clément Goubert
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853
- Jeb Rosen
- Institute for Systems Biology, Seattle, WA 98109
- Andrew G Clark
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853
- Cédric Feschotte
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853
- Arian F Smit
- Institute for Systems Biology, Seattle, WA 98109
|
12
|
Hoeppner MP, Denisenko E, Gardner PP, Schmeier S, Poole AM. An Evaluation of Function of Multicopy Noncoding RNAs in Mammals Using ENCODE/FANTOM Data and Comparative Genomics. Mol Biol Evol 2019; 35:1451-1462. [PMID: 29617896 PMCID: PMC5967550 DOI: 10.1093/molbev/msy046] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/02/2023] Open
Abstract
Mammalian diversification has coincided with a rapid proliferation of various types of noncoding RNAs, including members of both snRNAs and snoRNAs. The significance of this expansion however remains obscure. While some ncRNA copy-number expansions have been linked to functionally tractable effects, such events may equally likely be neutral, perhaps as a result of random retrotransposition. Hindering progress in our understanding of such observations is the difficulty in establishing function for the diverse features that have been identified in our own genome. Projects such as ENCODE and FANTOM have revealed a hidden world of genomic expression patterns, as well as a host of other potential indicators of biological function. However, such projects have been criticized, particularly from practitioners in the field of molecular evolution, where many suspect these data provide limited insight into biological function. The molecular evolution community has largely taken a skeptical view, thus it is important to establish tests of function. We use a range of data, including data drawn from ENCODE and FANTOM, to examine the case for function for the recent copy number expansion in mammals of six evolutionarily ancient RNA families involved in splicing and rRNA maturation. We use several criteria to assess evidence for function: conservation of sequence and structure, genomic synteny, evidence for transposition, and evidence for species-specific expression. Applying these criteria, we find that only a minority of loci show strong evidence for function and that, for the majority, we cannot reject the null hypothesis of no function.
Affiliation(s)
- Marc P Hoeppner
- Institute of Clinical Molecular Biology, Christian-Albrechts-University of Kiel, Kiel, Germany
- Elena Denisenko
- Institute of Natural and Mathematical Sciences, Massey University, Auckland, New Zealand
- Paul P Gardner
- Biomolecular Interaction Centre, School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
- Sebastian Schmeier
- Institute of Natural and Mathematical Sciences, Massey University, Auckland, New Zealand
- Anthony M Poole
- Bioinformatics Institute, School of Biological Sciences, University of Auckland, Auckland, New Zealand
|
13
|
Yao Y, Liu Z, Wei Q, Ramsey SA. CERENKOV2: improved detection of functional noncoding SNPs using data-space geometric features. BMC Bioinformatics 2019; 20:63. [PMID: 30727967 PMCID: PMC6364436 DOI: 10.1186/s12859-019-2637-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2018] [Accepted: 01/18/2019] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND We previously reported on CERENKOV, an approach for identifying regulatory single nucleotide polymorphisms (rSNPs) that is based on 246 annotation features. CERENKOV uses the xgboost classifier and is designed to be used to find causal noncoding SNPs in loci identified by genome-wide association studies (GWAS). We reported that CERENKOV has state-of-the-art performance (by two traditional measures and a novel GWAS-oriented measure, AVGRANK) in a comparison to nine other tools for identifying functional noncoding SNPs, using a comprehensive reference SNP set (OSU17, 15,331 SNPs). Given that SNPs are grouped within loci in the reference SNP set and given the importance of the data-space manifold geometry for machine-learning model selection, we hypothesized that within-locus inter-SNP distances would have class-based distributional biases that could be exploited to improve rSNP recognition accuracy. We thus defined an intralocus SNP "radius" as the average data-space distance from a SNP to the other intralocus neighbors, and explored radius likelihoods for five distance measures. RESULTS We expanded the set of reference SNPs to 39,083 (the OSU18 set) and extracted CERENKOV SNP feature data. We computed radius empirical likelihoods and likelihood densities for rSNPs and control SNPs, and found significant likelihood differences between rSNPs and control SNPs. We fit parametric models of likelihood distributions for five different distance measures to obtain ten log-likelihood features that we combined with the 248-dimensional CERENKOV feature matrix. On the OSU18 SNP set, we measured the classification accuracy of CERENKOV with and without the new distance-based features, and found that the addition of distance-based features significantly improves rSNP recognition performance as measured by AUPVR, AUROC, and AVGRANK. 
Along with feature data for the OSU18 set, the software code for extracting the base feature matrix, estimating ten distance-based likelihood ratio features, and scoring candidate causal SNPs, are released as open-source software CERENKOV2. CONCLUSIONS Accounting for the locus-specific geometry of SNPs in data-space significantly improved the accuracy with which noncoding rSNPs can be computationally identified.
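The intralocus "radius" defined in this abstract (the average data-space distance from a SNP to its other intralocus neighbors) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary inputs, and the choice of Euclidean distance are hypothetical and are not taken from the CERENKOV2 code, which explores five distance measures.

```python
import math
from collections import defaultdict


def intralocus_radius(features, loci):
    """Mean Euclidean distance from each SNP to the other SNPs in its locus.

    features: dict mapping SNP id -> list of floats (hypothetical feature vectors)
    loci:     dict mapping SNP id -> locus id (hypothetical grouping)
    A SNP that is alone in its locus gets None, since it has no neighbors.
    """
    by_locus = defaultdict(list)
    for snp, locus in loci.items():
        by_locus[locus].append(snp)

    radius = {}
    for members in by_locus.values():
        for snp in members:
            others = [m for m in members if m != snp]
            if not others:
                radius[snp] = None
                continue
            total = sum(math.dist(features[snp], features[o]) for o in others)
            radius[snp] = total / len(others)
    return radius
```

Swapping `math.dist` for another metric would give the radius under a different distance measure; turning radii into likelihood features, as the abstract describes, is a separate modelling step not shown here.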
Affiliation(s)
- Yao Yao
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, 97330 OR USA
- Department of Biomedical Sciences, Oregon State University, 106 Dryden Hall, Corvallis, 97330 OR USA
- Zheng Liu
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, 97330 OR USA
- Department of Biomedical Sciences, Oregon State University, 106 Dryden Hall, Corvallis, 97330 OR USA
- Qi Wei
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, 97330 OR USA
- Department of Biomedical Sciences, Oregon State University, 106 Dryden Hall, Corvallis, 97330 OR USA
- Stephen A. Ramsey
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, 97330 OR USA
- Department of Biomedical Sciences, Oregon State University, 106 Dryden Hall, Corvallis, 97330 OR USA
|
14
|
|
15
|
Olson D, Wheeler T. ULTRA: A Model Based Tool to Detect Tandem Repeats. ACM-BCB ... ... : THE ... ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE. ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY AND BIOMEDICINE 2018; 2018:37-46. [PMID: 31080962 DOI: 10.1145/3233547.3233604] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
Abstract
In biological sequences, tandem repeats consist of tens to hundreds of residues of a repeated pattern, such as atgatgatgatgatg ('atg' repeated), often the result of replication slippage. Over time, these repeats decay so that the original sharp pattern of repetition is somewhat obscured, but even degenerate repeats pose a problem for sequence annotation: when two sequences both contain shared patterns of similar repetition, the result can be a false signal of sequence homology. We describe an implementation of a new hidden Markov model for detecting tandem repeats that shows substantially improved sensitivity to labeling decayed repetitive regions, presents low and reliable false annotation rates across a wide range of sequence composition, and produces scores that follow a stable distribution. On typical genomic sequence, the time and memory requirements of the resulting tool (ULTRA) are competitive with the most heavily used tool for repeat masking (TRF). ULTRA is released under an open source license and lays the groundwork for inclusion of the model in sequence alignment tools and annotation pipelines.
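To make the notion of a tandem repeat concrete, here is a minimal sketch of a naive scanner for exact back-to-back repeats such as the 'atg' example above. It is deliberately not the hidden Markov model that ULTRA implements and it does not tolerate decayed repeats; the function name and the `max_unit` and `min_copies` parameters are assumptions made for illustration.

```python
def find_tandem_repeats(seq, max_unit=6, min_copies=3):
    """Report exact tandem repeats as (start, unit, copies) tuples.

    Naive scan: at each position, try every unit length up to max_unit and
    count how many times the unit repeats back to back. The longest covered
    span wins. This is not an HMM and it ignores mismatches and indels.
    """
    seq = seq.lower()
    hits = []
    i = 0
    while i < len(seq):
        best = None  # (unit, copies) covering the longest span so far
        for k in range(1, max_unit + 1):
            unit = seq[i:i + k]
            if len(unit) < k:
                break  # ran off the end of the sequence
            copies = 1
            while seq[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= min_copies and (
                best is None or copies * k > best[1] * len(best[0])
            ):
                best = (unit, copies)
        if best:
            hits.append((i, best[0], best[1]))
            i += len(best[0]) * best[1]  # skip past the reported repeat
        else:
            i += 1
    return hits
```

A scanner like this misses repeats once substitutions or indels accumulate, which is the sensitivity gap the abstract says a model-based approach is meant to close.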
|
16
|
Lee HO, Choi JW, Baek JH, Oh JH, Lee SC, Kim CK. Assembly of the Mitochondrial Genome in the Campanulaceae Family Using Illumina Low-Coverage Sequencing. Genes (Basel) 2018; 9:E383. [PMID: 30061537 PMCID: PMC6116063 DOI: 10.3390/genes9080383] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2018] [Revised: 07/24/2018] [Accepted: 07/25/2018] [Indexed: 11/16/2022] Open
Abstract
Platycodon grandiflorus (balloon flower) and Codonopsis lanceolata (bonnet bellflower) are important herbs used in Asian traditional medicine, and both belong to the botanical family Campanulaceae. In this study, we designed and implemented a de novo DNA sequencing and assembly strategy to map the complete mitochondrial genomes of the first two members of the Campanulaceae using low-coverage Illumina DNA sequencing data. We produced a total of 28.9 Gb of paired-end sequencing data from the genomic DNA of P. grandiflorus (20.9 Gb) and C. lanceolata (8.0 Gb). The assembled mitochondrial genome of P. grandiflorus was found to consist of two circular chromosomes; the master circle contains 56 genes, and the minor circle contains 42 genes. The C. lanceolata mitochondrial genome consists of a single circle harboring 54 genes. Using a comparative genome structure and a pattern of repeated sequences, we show that the P. grandiflorus minor circle resulted from a recombination event involving the direct repeats of the master circle. Our dataset will be useful for comparative genomics and for evolutionary studies, and will facilitate further biological and phylogenetic characterization of species in the Campanulaceae.
Affiliation(s)
- Hyun-Oh Lee
- Phyzen Genomics Institute, Seongnam 13558, Korea.
- Department of Plant Science, Seoul National University, Seoul 08826, Korea.
- Ji-Weon Choi
- Postharvest Technology Division, National Institute of Horticultural and Herbal Science, Wanju 55365, Korea.
- Jeong-Ho Baek
- Gene Engineering Division, National Institute of Agricultural Sciences, RDA, Jeonju 54874, Korea.
- Jae-Hyeon Oh
- Genomics Division, National Institute of Agricultural Sciences, RDA, Jeonju 54874, Korea.
- Chang-Kug Kim
- Genomics Division, National Institute of Agricultural Sciences, RDA, Jeonju 54874, Korea.
|
17
|
Hombach D, Schwarz JM, Robinson PN, Schuelke M, Seelow D. A systematic, large-scale comparison of transcription factor binding site models. BMC Genomics 2016; 17:388. [PMID: 27209209 PMCID: PMC4875604 DOI: 10.1186/s12864-016-2729-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2016] [Accepted: 05/06/2016] [Indexed: 11/10/2022] Open
Abstract
Background The modelling of gene regulation is a major challenge in biomedical research. This process is dominated by transcription factors (TFs) and mutations in their binding sites (TFBSs) may cause the misregulation of genes, eventually leading to disease. The consequences of DNA variants on TF binding are modelled in silico using binding matrices, but it remains unclear whether these are capable of accurately representing in vivo binding. In this study, we present a systematic comparison of binding models for 82 human TFs from three freely available sources: JASPAR matrices, HT-SELEX-generated models and matrices derived from protein binding microarrays (PBMs). We determined their ability to detect experimentally verified “real” in vivo TFBSs derived from ENCODE ChIP-seq data. As negative controls we chose random downstream exonic sequences, which are unlikely to harbour TFBS. All models were assessed by receiver operating characteristics (ROC) analysis. Results While the area-under-curve was low for most of the tested models with only 47 % reaching a score of 0.7 or higher, we noticed strong differences between the various position-specific scoring matrices with JASPAR and HT-SELEX models showing higher success rates than PBM-derived models. In addition, we found that while TFBS sequences showed a higher degree of conservation than randomly chosen sequences, there was a high variability between individual TFBSs. Conclusions Our results show that only few of the matrix-based models used to predict potential TFBS are able to reliably detect experimentally confirmed TFBS. We compiled our findings in a freely accessible web application called ePOSSUM (http://mutationtaster.charite.de/ePOSSUM/) which uses a Bayes classifier to assess the impact of genetic alterations on TF binding in user-defined sequences. Additionally, ePOSSUM provides information on the reliability of the prediction using our test set of experimentally confirmed binding sites. 
Electronic supplementary material The online version of this article (doi:10.1186/s12864-016-2729-8) contains supplementary material, which is available to authorized users.
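The ROC analysis mentioned in this abstract summarizes each binding model by its area under the curve. As a rough sketch, assuming each model assigns one numeric score per sequence, the AUC can be computed as the probability that a randomly chosen true binding site outscores a randomly chosen negative control. The function name and inputs below are hypothetical and are not drawn from the ePOSSUM code.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the positive
    scores higher, with ties counted as one half.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

Under this definition a model with no discriminative power scores about 0.5, which gives context for the abstract's 0.7 threshold. The pairwise loop is quadratic in the number of scores; a rank-based formulation would be preferable for large sets.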
Affiliation(s)
- Daniela Hombach
- Department of Neuropaediatrics, Charité-Universitätsmedizin Berlin, Berlin, Germany; NeuroCure Clinical Research Center, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Jana Marie Schwarz
- Department of Neuropaediatrics, Charité-Universitätsmedizin Berlin, Berlin, Germany; NeuroCure Clinical Research Center, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Peter N Robinson
- Institute for Medical Genetics and Human Genetics, Charité-Universitätsmedizin Berlin, Berlin, Germany
- Markus Schuelke
- Department of Neuropaediatrics, Charité-Universitätsmedizin Berlin, Berlin, Germany; NeuroCure Clinical Research Center, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Dominik Seelow
- Department of Neuropaediatrics, Charité-Universitätsmedizin Berlin, Berlin, Germany; NeuroCure Clinical Research Center, Charité - Universitätsmedizin Berlin, Berlin, Germany; Berliner Institut für Gesundheitsforschung / Berlin Institute of Health, Berlin, Germany
|
18
|
Trimming of sequence reads alters RNA-Seq gene expression estimates. BMC Bioinformatics 2016; 17:103. [PMID: 26911985 PMCID: PMC4766705 DOI: 10.1186/s12859-016-0956-2] [Citation(s) in RCA: 101] [Impact Index Per Article: 12.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2015] [Accepted: 02/19/2016] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature. One pre-processing step that is widely but heterogeneously applied is trimming, in which low quality bases, identified by the probability that they are called incorrectly, are removed. However, the impact of trimming on subsequent alignment to a genome could influence downstream analyses including gene expression estimation; we hypothesized that this might occur in an inconsistent manner across different genes, resulting in differential bias. RESULTS To assess the effects of trimming on gene expression, we generated RNA-Seq data sets from four samples of larval Drosophila melanogaster sensory neurons, and used three trimming algorithms (SolexaQA, Trimmomatic, and ConDeTri) to perform quality-based trimming across a wide range of stringencies. After aligning the reads to the D. melanogaster genome with TopHat2, we used Cuffdiff2 to compare the original, untrimmed gene expression estimates to those following trimming. With the most aggressive trimming parameters, over ten percent of genes had significant changes in their estimated expression levels. This trend was seen with two additional RNA-Seq data sets and with alternative differential expression analysis pipelines. We found that the majority of the expression changes could be mitigated by imposing a minimum length filter following trimming, suggesting that the differential gene expression was primarily being driven by spurious mapping of short reads. Slight differences with the untrimmed data set remained after length filtering, which were associated with genes with low exon numbers and high GC content. 
Finally, an analysis of paired RNA-seq/microarray data sets suggests that no or modest trimming results in the most biologically accurate gene expression estimates. CONCLUSIONS We find that aggressive quality-based trimming has a large impact on the apparent makeup of RNA-Seq-based gene expression estimates, and that short reads can have a particularly strong impact. We conclude that implementation of trimming in RNA-Seq analysis workflows warrants caution, and if used, should be used in conjunction with a minimum read length filter to minimize the introduction of unpredictable changes in expression estimates.
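The recommendation to pair trimming with a minimum read length filter can be illustrated with a small sketch. It assumes reads are given as (sequence, list of Phred quality scores) pairs and uses a simple 3' end trimming rule. It does not reproduce the algorithms of SolexaQA, Trimmomatic, or ConDeTri, and the function names and default thresholds are hypothetical.

```python
def trim_3prime(seq, quals, min_q=20):
    """Remove bases from the 3' end until one with quality >= min_q is reached."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def trim_and_filter(reads, min_q=20, min_len=5):
    """Trim each (seq, quals) read, then discard reads shorter than min_len.

    The length filter is the step the abstract recommends, since very short
    trimmed reads are prone to spurious mapping.
    """
    kept = []
    for seq, quals in reads:
        t_seq, t_quals = trim_3prime(seq, quals, min_q)
        if len(t_seq) >= min_len:
            kept.append((t_seq, t_quals))
    return kept
```

In practice `min_len` would be chosen relative to the aligner and read length; the value here only keeps the example small.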
|
19
|
Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, Smit AFA, Wheeler TJ. The Dfam database of repetitive DNA families. Nucleic Acids Res 2015; 44:D81-9. [PMID: 26612867 PMCID: PMC4702899 DOI: 10.1093/nar/gkv1272] [Citation(s) in RCA: 421] [Impact Index Per Article: 46.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/02/2015] [Accepted: 11/03/2015] [Indexed: 11/20/2022] Open
Abstract
Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Dfam is an open access database of families of repetitive DNA elements, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). The initial release of Dfam, featured in the 2013 NAR Database Issue, contained 1143 families of repetitive elements found in humans, and was used to produce more than 100 Mb of additional annotation of TE-derived regions in the human genome, with improved speed. Here, we describe recent advances, most notably expansion to 4150 total families including a comprehensive set of known repeat families from four new organisms (mouse, zebrafish, fly and nematode). We describe improvements to coverage, and to our methods for identifying and reducing false annotation. We also describe updates to the website interface. The Dfam website has moved to http://dfam.org. Seed alignments, profile HMMs, hit lists and other underlying data are available for download.
Affiliation(s)
- Robert Hubley
- Institute for Systems Biology, Seattle, WA 98109, USA
- Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1RQ, UK
- Jody Clements
- HHMI Janelia Research Campus, Ashburn, VA 20147, USA
- Sean R Eddy
- Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA
- Thomas A Jones
- Howard Hughes Medical Institute, Harvard University, Cambridge, MA 02138, USA
- Weidong Bao
- Genetic Information Research Institute, Los Altos, CA 94022, USA
|
20
|
Hoen DR, Hickey G, Bourque G, Casacuberta J, Cordaux R, Feschotte C, Fiston-Lavier AS, Hua-Van A, Hubley R, Kapusta A, Lerat E, Maumus F, Pollock DD, Quesneville H, Smit A, Wheeler TJ, Bureau TE, Blanchette M. A call for benchmarking transposable element annotation methods. Mob DNA 2015; 6:13. [PMID: 26244060 PMCID: PMC4524446 DOI: 10.1186/s13100-015-0044-6] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2015] [Accepted: 07/22/2015] [Indexed: 12/31/2022] Open
Abstract
DNA derived from transposable elements (TEs) constitutes large parts of the genomes of complex eukaryotes, with major impacts not only on genomic research but also on how organisms evolve and function. Although a variety of methods and tools have been developed to detect and annotate TEs, there are as yet no standard benchmarks, that is, no standard way to measure or compare their accuracy. This lack of accuracy assessment calls into question conclusions from a wide range of research that depends explicitly or implicitly on TE annotation. In the absence of standard benchmarks, toolmakers are impeded in improving their tools, annotators cannot properly assess which tools might best suit their needs, and downstream researchers cannot judge how accuracy limitations might impact their studies. We therefore propose that the TE research community create and adopt standard TE annotation benchmarks, and we call for other researchers to join the authors in making this long-overdue effort a success.
Affiliation(s)
- Douglas R Hoen
- School of Computer Science, McGill University, McConnell Engineering Bldg., Rm. 318, 3480 Rue University, Montréal, Québec H3A 0E9 Canada; Department of Biology, McGill University, Stewart Biology Bldg., 1205 Ave. du Docteur-Penfield, Montréal, Québec H3A 1B1 Canada
- Glenn Hickey
- School of Computer Science, McGill University, McConnell Engineering Bldg., Rm. 318, 3480 Rue University, Montréal, Québec H3A 0E9 Canada; McGill Centre for Bioinformatics, McGill University, Montréal, Québec Canada
- Guillaume Bourque
- Department of Human Genetics, McGill University, Montréal, Québec Canada; McGill University and Génome Québec Innovation Center, Montréal, Québec Canada
- Josep Casacuberta
- Centre for Research in Agricultural Genomics CSIC-IRTA-UAB-UB, 08193 Barcelona, Spain
- Richard Cordaux
- Université de Poitiers, UMR CNRS 7267 Ecologie et Biologie des Interactions, Equipe Ecologie Evolution Symbiose, 5 Rue Albert Turpin, 86073 Poitiers Cedex 9, France
- Cédric Feschotte
- Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, UT 84112 USA
- Anna-Sophie Fiston-Lavier
- Institut des Sciences de l'Evolution de Montpellier (ISE-M), Equipe Evolution, Vecteurs, Adaptation et Symbiose, UMR5554 CNRS-Université Montpellier, Montpellier, 34090 cedex 05 France
- Aurélie Hua-Van
- Laboratoire Evolution, Génomes, Comportement Ecologie, CNRS-Université Paris-Sud (UMR 9191)-IRD (UMR 247)-Université Paris-Saclay, F-91198 Gif-sur-Yvette, France
- Robert Hubley
- Institute for Systems Biology, 401 Terry Ave. N, Seattle, WA 98109 USA
- Aurélie Kapusta
- Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, UT 84112 USA
- Emmanuelle Lerat
- Laboratoire Biometrie et Biologie Evolutive, Universite Claude Bernard-Lyon 1, UMR-CNRS 5558-Bat. Mendel, 43 bd du 11 novembre 1918, 69622 Villeurbanne cedex, France
- Florian Maumus
- INRA, UR1164 URGI-Research Unit in Genomics-Info, INRA de Versailles-Grignon, Route de Saint-Cyr, Versailles, 78026 France
- David D Pollock
- University of Colorado School of Medicine, Aurora, CO 80045 USA
- Hadi Quesneville
- INRA, UR1164 URGI-Research Unit in Genomics-Info, INRA de Versailles-Grignon, Route de Saint-Cyr, Versailles, 78026 France
- Arian Smit
- Institute for Systems Biology, 401 Terry Ave. N, Seattle, WA 98109 USA
- Travis J Wheeler
- Department of Computer Science, University of Montana, Missoula, MT 59812 USA
- Thomas E Bureau
- Department of Biology, McGill University, Stewart Biology Bldg., 1205 Ave. du Docteur-Penfield, Montréal, Québec H3A 1B1 Canada
- Mathieu Blanchette
- School of Computer Science, McGill University, McConnell Engineering Bldg., Rm. 318, 3480 Rue University, Montréal, Québec H3A 0E9 Canada; McGill Centre for Bioinformatics, McGill University, Montréal, Québec Canada
|