Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Caballero J, Smit AFA, Hood L, Glusman G. Realistic artificial DNA sequences as negative controls for computational genomics. Nucleic Acids Res 2014;42:e99. [PMID: 24803667 PMCID: PMC4081056 DOI: 10.1093/nar/gku356] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Indexed: 11/14/2022]


Cited by Other Article(s)

Calamari ZT, Song A, Cohen E, Akter M, Das Roy R, Hallikas O, Christensen MM, Li P, Marangoni P, Jernvall J, Klein OD. Bank vole genomics links determinate and indeterminate growth of teeth. BMC Genomics 2024;25:1000. [PMID: 39472825 PMCID: PMC11523675 DOI: 10.1186/s12864-024-10901-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2024] [Accepted: 10/14/2024] [Indexed: 11/02/2024]

Abstract

BACKGROUND

RESULTS

We assembled a de novo genome of Myodes glareolus, a vole with high-crowned, rooted molars, and performed genomic and transcriptomic analyses in a broad phylogenetic context of Glires (rodents and lagomorphs) to assess differential selection and evolution in tooth forming genes. Bulk transcriptomics comparisons of embryonic molar development between bank voles and mice demonstrated overall conservation of gene expression levels, with species-specific differences corresponding to the accelerated and more extensive patterning of the vole molar. We leverage convergent evolution of unrooted molars across the clade to examine changes that may underlie the evolution of unrooted molars. We identified 15 dental genes with changing synteny relationships and six dental genes undergoing positive selection across Glires, two of which were undergoing positive selection in species with unrooted molars, Dspp and Aqp1. Decreased expression of both genes in prairie voles with unrooted molars compared to bank voles supports the presence of positive selection and may underlie differences in root formation.

CONCLUSIONS

Our results support ongoing evolution of dental genes across Glires and identify candidate genes for mechanistic studies of root formation. Comparative research using the bank vole as a model species can reveal the complex evolutionary background of convergent evolution for ever-growing molars.


Affiliation(s)

Zachary T Calamari: Baruch College, City University of New York, One Bernard Baruch Way, New York, NY, 10010, USA; The Graduate Center, City University of New York, 365 Fifth Ave, New York, NY, 10016, USA; Program in Craniofacial Biology, Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA, 94158, USA; Division of Paleontology, American Museum of Natural History, Central Park West at 79th Street, New York, NY, 10024, USA
Andrew Song: Baruch College, City University of New York, One Bernard Baruch Way, New York, NY, 10010, USA; Cornell University, 616 Thurston Ave, Ithaca, NY, 14853, USA
Emily Cohen: Baruch College, City University of New York, One Bernard Baruch Way, New York, NY, 10010, USA; New York University College of Dentistry, 345 E 34th St, New York, NY, 10010, USA
Muspika Akter: Baruch College, City University of New York, One Bernard Baruch Way, New York, NY, 10010, USA
Rishi Das Roy: Institute of Biotechnology, University of Helsinki, Helsinki, FI-00014, Finland
Outi Hallikas: Institute of Biotechnology, University of Helsinki, Helsinki, FI-00014, Finland
Mona M Christensen: Institute of Biotechnology, University of Helsinki, Helsinki, FI-00014, Finland
Pengyang Li: Program in Craniofacial Biology, Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA, 94158, USA; Department of Pediatrics, Cedars-Sinai Guerin Children's, 8700 Beverly Blvd., Suite 2416, Los Angeles, CA, 90048, USA; Department of Bioengineering, Stanford University, 443 Via Ortega, Rm 119, Stanford, CA, 94305, USA
Pauline Marangoni: Program in Craniofacial Biology, Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA, 94158, USA; Department of Pediatrics, Cedars-Sinai Guerin Children's, 8700 Beverly Blvd., Suite 2416, Los Angeles, CA, 90048, USA
Jukka Jernvall: Institute of Biotechnology, University of Helsinki, Helsinki, FI-00014, Finland; Department of Geosciences and Geography, University of Helsinki, Helsinki, FI-00014, Finland
Ophir D Klein: Program in Craniofacial Biology, Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA, 94158, USA; Department of Pediatrics, Cedars-Sinai Guerin Children's, 8700 Beverly Blvd., Suite 2416, Los Angeles, CA, 90048, USA


Kumar M, Tibocha-Bonilla JD, Füssy Z, Lieng C, Schwenck SM, Levesque AV, Al-Bassam MM, Passi A, Neal M, Zuniga C, Kaiyom F, Espinoza JL, Lim H, Polson SW, Allen LZ, Zengler K. Mixotrophic growth of a ubiquitous marine diatom. Science Advances 2024;10:eado2623. [PMID: 39018398 PMCID: PMC466952 DOI: 10.1126/sciadv.ado2623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Accepted: 06/12/2024] [Indexed: 07/19/2024]

Affiliation(s)

Manish Kumar: Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Juan D. Tibocha-Bonilla: Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Zoltán Füssy: Department of Parasitology, Faculty of Science, Charles University, BIOCEV, Vestec, Czech Republic
Chloe Lieng: Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Sarah M. Schwenck: Scripps Institution of Oceanography, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Alice V. Levesque: Scripps Institution of Oceanography, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Mahmoud M. Al-Bassam: Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Anurag Passi: Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Maxwell Neal: Department of Bioengineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Cristal Zuniga: Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Farrah Kaiyom: Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Josh L. Espinoza: Department of Microbial and Environmental Genomics, J. Craig Venter Institute, 4120 Capricorn Way, La Jolla, CA 92037, USA
Hyungyu Lim: Department of Bioengineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
Shawn W. Polson: Department of Computer and Information Sciences, University of Delaware, 18 Amstel Ave., Newark, DE 19716, USA; Center for Bioinformatics and Computational Biology, University of Delaware, 590 Avenue 1743, Newark, DE 19713, USA
Lisa Zeigler Allen: Scripps Institution of Oceanography, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA; Department of Microbial and Environmental Genomics, J. Craig Venter Institute, 4120 Capricorn Way, La Jolla, CA 92037, USA
Karsten Zengler: Department of Pediatrics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA; Department of Bioengineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA; Center for Microbiome Innovation, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA; Program in Materials Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA


Calamari ZT, Song A, Cohen E, Akter M, Roy RD, Hallikas O, Christensen MM, Li P, Marangoni P, Jernvall J, Klein OD. Vole genomics links determinate and indeterminate growth of teeth. bioRxiv 2024:2023.12.18.572015. [PMID: 38187646 PMCID: PMC10769287 DOI: 10.1101/2023.12.18.572015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]

Abstract

Continuously growing teeth are an important innovation in mammalian evolution, yet genetic regulation of continuous growth by stem cells remains incompletely understood. Dental stem cells responsible for tooth crown growth are lost at the onset of tooth root formation. Genetic signaling that initiates this loss is difficult to study with the ever-growing incisor and rooted molars of mice, the most common mammalian dental model species, because signals for root formation overlap with signals that pattern tooth size and shape (i.e., cusp patterns). Different species of voles (Cricetidae, Rodentia, Glires) have evolved rooted and unrooted molars that have similar size and shape, providing alternative models for studying roots. We assembled a de novo genome of Myodes glareolus, a vole with high-crowned, rooted molars, and performed genomic and transcriptomic analyses in a broad phylogenetic context of Glires (rodents and lagomorphs) to assess differential selection and evolution in tooth forming genes. We identified 15 dental genes with changing synteny relationships and six dental genes undergoing positive selection across Glires, two of which were undergoing positive selection in species with unrooted molars, Dspp and Aqp1. Decreased expression of both genes in prairie voles with unrooted molars compared to bank voles supports the presence of positive selection and may underlie differences in root formation. Bulk transcriptomics analyses of embryonic molar development in bank voles also demonstrated conserved patterns of dental gene expression compared to mice, with species-specific variation likely related to developmental timing and morphological differences between mouse and vole molars. Our results support ongoing evolution of dental genes across Glires, revealing the complex evolutionary background of convergent evolution for ever-growing molars.


Affiliation(s)

Zachary T. Calamari: Baruch College, City University of New York, One Bernard Baruch Way, New York, NY 10010, USA; The Graduate Center, City University of New York, 365 Fifth Ave, New York, NY 10016, USA; Program in Craniofacial Biology and Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA 94158, USA; Division of Paleontology, American Museum of Natural History, Central Park West at 79th Street, New York, NY, 10024, USA
Andrew Song: Baruch College, City University of New York, One Bernard Baruch Way, New York, NY 10010, USA; Cornell University, 616 Thurston Ave, Ithaca, NY 14853, USA
Emily Cohen: Baruch College, City University of New York, One Bernard Baruch Way, New York, NY 10010, USA; New York University College of Dentistry, 345 E 34th St, New York, NY 10010
Muspika Akter: Baruch College, City University of New York, One Bernard Baruch Way, New York, NY 10010, USA
Rishi Das Roy: Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
Outi Hallikas: Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
Mona M. Christensen: Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland
Pengyang Li: Program in Craniofacial Biology and Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA 94158, USA; Department of Pediatrics, Cedars-Sinai Guerin Children’s, 8700 Beverly Blvd., Suite 2416, Los Angeles, CA 90048, USA
Pauline Marangoni: Program in Craniofacial Biology and Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA 94158, USA; Department of Pediatrics, Cedars-Sinai Guerin Children’s, 8700 Beverly Blvd., Suite 2416, Los Angeles, CA 90048, USA
Jukka Jernvall: Institute of Biotechnology, University of Helsinki, FI-00014 Helsinki, Finland; Department of Geosciences and Geography, University of Helsinki, FI-00014 Helsinki, Finland
Ophir D. Klein: Program in Craniofacial Biology and Department of Orofacial Sciences, University of California, San Francisco, San Francisco, CA 94158, USA; Department of Pediatrics, Cedars-Sinai Guerin Children’s, 8700 Beverly Blvd., Suite 2416, Los Angeles, CA 90048, USA


Glidden-Handgis G, Wheeler TJ. WAS IT A MATch I SAW? Approximate palindromes lead to overstated false match rates in benchmarks using reversed sequences. Bioinformatics Advances 2024;4:vbae052. [PMID: 38764475 PMCID: PMC11099658 DOI: 10.1093/bioadv/vbae052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Revised: 03/31/2024] [Accepted: 04/04/2024] [Indexed: 05/21/2024]

Abstract

Background

Software for labeling biological sequences typically produces a theory-based statistic for each match (the E-value) that indicates the likelihood of seeing that match's score by chance. E-values accurately predict false match rate for comparisons of random (shuffled) sequences, and thus provide a reasoned mechanism for setting score thresholds that enable high sensitivity with low expected false match rate. This threshold-setting strategy is challenged by real biological sequences, which contain regions of local repetition and low sequence complexity that cause excess matches between non-homologous sequences. Knowing this, tool developers often develop benchmarks that use realistic-seeming decoy sequences to explore empirical tradeoffs between sensitivity and false match rate. A recent trend has been to employ reversed biological sequences as realistic decoys, because these preserve the distribution of letters and the existence of local repeats, while disrupting the original sequence's functional properties. However, we and others have observed that sequences appear to produce high scoring alignments to their reversals with surprising frequency, leading to overstatement of false match risk that may negatively affect downstream analysis.

Results

We demonstrate that an alignment between a sequence S and its (possibly mutated) reversal tends to produce higher scores than alignment between truly unrelated sequences, even when S is a shuffled string with no notable repetitive or low-complexity regions. This phenomenon is due to the unintuitive fact that (even randomly shuffled) sequences contain palindromes that are on average longer than the longest common substrings (LCS) shared between permuted variants of the same sequence. Though the expected palindrome length is only slightly larger than the expected LCS, the distribution of alignment scores involving reversed sequences is strongly right-shifted, leading to greatly increased frequency of high-scoring alignments to reversed sequences.

Impact

Overestimates of false match risk can motivate unnecessarily high score thresholds, leading to potentially reduced true match sensitivity. Also, when tool sensitivity is only reported up to the score of the first matched decoy sequence, a large decoy set consisting of reversed sequences can obscure sensitivity differences between tools. As a result of these observations, we advise that reversed biological sequences be used as decoys only when care is taken to remove positive matches in the original (un-reversed) sequences, or when overstatement of false labeling is not a concern. Though the primary focus of the analysis is on sequence annotation, we also demonstrate that the prevalence of internal palindromes may lead to an overstatement of the rate of false labels in protein identification with mass spectrometry.
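The comparison described in the Results above can be explored with a small experiment. The following is a minimal sketch, not the authors' benchmark: it assumes uniformly random DNA strings, measures exact palindromes and exact longest common substrings only (no alignment scoring, no mutated reversals), and the function names `longest_palindrome_length`, `longest_common_substring_length`, and `compare_decoys` are hypothetical.

```python
import random


def longest_palindrome_length(s: str) -> int:
    """Length of the longest palindromic substring, found by expanding around each center."""
    best = 0
    n = len(s)
    for center in range(2 * n - 1):
        left = center // 2
        right = left + center % 2  # even center: single letter; odd center: between letters
        while left >= 0 and right < n and s[left] == s[right]:
            left -= 1
            right += 1
        best = max(best, right - left - 1)
    return best


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b (row-wise DP)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def compare_decoys(length: int = 300, trials: int = 20, seed: int = 0):
    """For random sequences, return (palindrome, LCS vs reversal, LCS vs shuffle) per trial."""
    rng = random.Random(seed)
    rows = []
    for _ in range(trials):
        seq = "".join(rng.choice("ACGT") for _ in range(length))
        shuffled = "".join(rng.sample(seq, len(seq)))  # permuted variant of the same letters
        rows.append((
            longest_palindrome_length(seq),
            longest_common_substring_length(seq, seq[::-1]),
            longest_common_substring_length(seq, shuffled),
        ))
    return rows
```

Every palindrome in a sequence also occurs verbatim in its reversal, so the second value in each row is never smaller than the first. Averaging the second and third columns over many trials gives a rough feel for how reversed decoys differ from shuffled ones; it does not reproduce the alignment-score distributions discussed in the abstract.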


Matsushima W, Planet E, Trono D. Ancestral genome reconstruction enhances transposable element annotation by identifying degenerate integrants. Cell Genomics 2024;4:100497. [PMID: 38295789 PMCID: PMC10879028 DOI: 10.1016/j.xgen.2024.100497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Revised: 08/09/2023] [Accepted: 01/06/2024] [Indexed: 02/17/2024]

Nguyen-Vo TH, Trinh QH, Nguyen L, Nguyen-Hoang PU, Rahardja S, Nguyen BP. i4mC-GRU: Identifying DNA N⁴-Methylcytosine sites in mouse genomes using bidirectional gated recurrent unit and sequence-embedded features. Comput Struct Biotechnol J 2023;21:3045-3053. [PMID: 37273848 PMCID: PMC10238585 DOI: 10.1016/j.csbj.2023.05.014] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 08/03/2022] [Revised: 05/12/2023] [Accepted: 05/12/2023] [Indexed: 06/06/2023]

Sun YH, Cui H, Song C, Shen JT, Zhuo X, Wang RH, Yu X, Ndamba R, Mu Q, Gu H, Wang D, Murthy GG, Li P, Liang F, Liu L, Tao Q, Wang Y, Orlowski S, Xu Q, Zhou H, Jagne J, Gokcumen O, Anthony N, Zhao X, Li XZ. Amniotes co-opt intrinsic genetic instability to protect germ-line genome integrity. Nat Commun 2023;14:812. [PMID: 36781861 PMCID: PMC9925758 DOI: 10.1038/s41467-023-36354-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 07/14/2022] [Accepted: 01/27/2023] [Indexed: 02/15/2023]

Affiliation(s)

Yu H Sun: Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
Hongxiao Cui: College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China
Chi Song: College of Public Health, Division of Biostatistics, The Ohio State University, Columbus, OH, 43210, USA
Jiafei Teng Shen: International Institutes of Medicine, The Fourth Affiliated Hospital, Zhejiang University School of Medicine, Yiwu, Zhejiang, 322000, China
Xiaoyu Zhuo: Department of Genetics, The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO, 63110, USA
Ruoqiao Huiyi Wang: Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA; College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China
Xiaohui Yu: College of Animal Science and Technology, Northwest A&F University, Yangling, Shaanxi, 712100, China
Rudo Ndamba: Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
Qian Mu: Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
Hanwen Gu: Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
Duolin Wang: Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
Gayathri Guru Murthy: Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA
Pidong Li: Grandomics Biosciences Co., Ltd, Beijing, 102206, China
Fan Liang: Grandomics Biosciences Co., Ltd, Beijing, 102206, China
Lei Liu: Grandomics Biosciences Co., Ltd, Beijing, 102206, China
Qing Tao: Grandomics Biosciences Co., Ltd, Beijing, 102206, China
Ying Wang: Department of Animal Science, University of California, Davis, CA, 95616, USA
Sara Orlowski: Department of Poultry Science, University of Arkansas, Fayetteville, AR, 72701, USA
Qi Xu: Department of Animal Science, McGill University, Quebec, H9X 3V9, Canada
Huaijun Zhou: Department of Animal Science, University of California, Davis, CA, 95616, USA
Jarra Jagne: Animal Health Diagnostic Center, Cornell University College of Veterinary Medicine, Ithaca, NY, 14850, USA
Omer Gokcumen: Department of Biological Sciences, University at Buffalo, State University of New York, Buffalo, NY, 14260, USA
Nick Anthony: Department of Poultry Science, University of Arkansas, Fayetteville, AR, 72701, USA
Xin Zhao: Department of Animal Science, McGill University, Quebec, H9X 3V9, Canada
Xin Zhiguo Li: Center for RNA Biology: From Genome to Therapeutics, Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, 14642, USA


Löytynoja A. Thousands of human mutation clusters are explained by short-range template switching. Genome Res 2022;32:1437-1447. [PMID: 35760560 PMCID: PMC9435742 DOI: 10.1101/gr.276478.121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2021] [Accepted: 06/21/2022] [Indexed: 02/03/2023]

Madison RW, Hu X, Ramanan V, Xu Z, Huang RSP, Sokol ES, Frampton GM, Schrock AB, Ali SM, Ganesan S, De S. Clustered 8-Oxo-Guanine Mutations and Oncogenic Gene Fusions in Microsatellite-Unstable Colorectal Cancer. JCO Precis Oncol 2022;6:e2100477. [PMID: 35584350 PMCID: PMC9200390 DOI: 10.1200/po.21.00477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]

Storer JM, Hubley R, Rosen J, Smit AFA. Methodologies for the De novo Discovery of Transposable Element Families. Genes (Basel) 2022;13:709. [PMID: 35456515 PMCID: PMC9025800 DOI: 10.3390/genes13040709] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/23/2022] [Revised: 04/14/2022] [Accepted: 04/15/2022] [Indexed: 02/07/2023]

Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. RepeatModeler2 for automated genomic discovery of transposable element families. Proc Natl Acad Sci U S A 2020;117:9451-9457. [PMID: 32300014 PMCID: PMC7196820 DOI: 10.1073/pnas.1921046117] [Citation(s) in RCA: 1342] [Impact Index Per Article: 335.5] [Indexed: 12/13/2022]

Abstract

The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs are highly variable across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but recalcitrant to automated identification because of their size and sequence complexity. We benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified approximately 3 times more consensus sequences matching with >95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement is for LTR retroelements. Thus, RepeatModeler2 represents a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).


Hoeppner MP, Denisenko E, Gardner PP, Schmeier S, Poole AM. An Evaluation of Function of Multicopy Noncoding RNAs in Mammals Using ENCODE/FANTOM Data and Comparative Genomics. Mol Biol Evol 2019;35:1451-1462. [PMID: 29617896 PMCID: PMC5967550 DOI: 10.1093/molbev/msy046] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 01/02/2023]

Yao Y, Liu Z, Wei Q, Ramsey SA. CERENKOV2: improved detection of functional noncoding SNPs using data-space geometric features. BMC Bioinformatics 2019;20:63. [PMID: 30727967 PMCID: PMC6364436 DOI: 10.1186/s12859-019-2637-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 07/05/2018] [Accepted: 01/18/2019] [Indexed: 02/07/2023]

Abstract

BACKGROUND

We previously reported on CERENKOV, an approach for identifying regulatory single nucleotide polymorphisms (rSNPs) that is based on 246 annotation features. CERENKOV uses the xgboost classifier and is designed to be used to find causal noncoding SNPs in loci identified by genome-wide association studies (GWAS). We reported that CERENKOV has state-of-the-art performance (by two traditional measures and a novel GWAS-oriented measure, AVGRANK) in a comparison to nine other tools for identifying functional noncoding SNPs, using a comprehensive reference SNP set (OSU17, 15,331 SNPs). Given that SNPs are grouped within loci in the reference SNP set and given the importance of the data-space manifold geometry for machine-learning model selection, we hypothesized that within-locus inter-SNP distances would have class-based distributional biases that could be exploited to improve rSNP recognition accuracy. We thus defined an intralocus SNP "radius" as the average data-space distance from a SNP to the other intralocus neighbors, and explored radius likelihoods for five distance measures.

RESULTS

We expanded the set of reference SNPs to 39,083 (the OSU18 set) and extracted CERENKOV SNP feature data. We computed radius empirical likelihoods and likelihood densities for rSNPs and control SNPs, and found significant likelihood differences between rSNPs and control SNPs. We fit parametric models of likelihood distributions for five different distance measures to obtain ten log-likelihood features that we combined with the 248-dimensional CERENKOV feature matrix. On the OSU18 SNP set, we measured the classification accuracy of CERENKOV with and without the new distance-based features, and found that the addition of distance-based features significantly improves rSNP recognition performance as measured by AUPVR, AUROC, and AVGRANK. Along with feature data for the OSU18 set, the software code for extracting the base feature matrix, estimating ten distance-based likelihood ratio features, and scoring candidate causal SNPs, are released as open-source software CERENKOV2.

CONCLUSIONS

Accounting for the locus-specific geometry of SNPs in data-space significantly improved the accuracy with which noncoding rSNPs can be computationally identified.
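The intralocus "radius" defined in the Background above (the average data-space distance from a SNP to its other intralocus neighbors) can be illustrated as follows. This is a minimal sketch rather than the CERENKOV2 implementation: it assumes Euclidean distance (the abstract mentions five distance measures without naming them), the function name `intralocus_radii` is hypothetical, and returning NaN for a SNP that is alone in its locus is a choice made here for illustration.

```python
import math
from collections import defaultdict


def intralocus_radii(features, loci):
    """Mean Euclidean distance from each SNP to the other SNPs in the same locus.

    features: one numeric feature vector per SNP.
    loci: the locus label of each SNP, in the same order as features.
    A SNP with no intralocus neighbors gets NaN (an assumption of this sketch).
    """
    by_locus = defaultdict(list)
    for index, locus in enumerate(loci):
        by_locus[locus].append(index)

    radii = [math.nan] * len(features)
    for members in by_locus.values():
        if len(members) < 2:
            continue  # no neighbors, so the radius is left undefined
        for i in members:
            total = sum(
                math.dist(features[i], features[j]) for j in members if j != i
            )
            radii[i] = total / (len(members) - 1)
    return radii
```

For example, with three SNPs in one locus at (0, 0), (3, 4), and (6, 8), the middle SNP is 5 units from each neighbor and has radius 5, while each outer SNP has radius (5 + 10) / 2 = 7.5.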


Goerner-Potvin P, Bourque G. Computational tools to unmask transposable elements. Nat Rev Genet 2018;19:688-704. [DOI: 10.1038/s41576-018-0050-x] [Citation(s) in RCA: 126] [Impact Index Per Article: 21.0] [Indexed: 12/18/2022]

Olson D, Wheeler T. ULTRA: A Model Based Tool to Detect Tandem Repeats. ACM-BCB: ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2018;2018:37-46. [PMID: 31080962 DOI: 10.1145/3233547.3233604] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Indexed: 10/28/2022]

Lee HO, Choi JW, Baek JH, Oh JH, Lee SC, Kim CK. Assembly of the Mitochondrial Genome in the Campanulaceae Family Using Illumina Low-Coverage Sequencing. Genes (Basel) 2018;9:E383. [PMID: 30061537 PMCID: PMC6116063 DOI: 10.3390/genes9080383] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 06/28/2018] [Revised: 07/24/2018] [Accepted: 07/25/2018] [Indexed: 11/16/2022]

Hombach D, Schwarz JM, Robinson PN, Schuelke M, Seelow D. A systematic, large-scale comparison of transcription factor binding site models. BMC Genomics 2016;17:388. [PMID: 27209209 PMCID: PMC4875604 DOI: 10.1186/s12864-016-2729-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 04/15/2016] [Accepted: 05/06/2016] [Indexed: 11/10/2022]

Abstract

Background

The modelling of gene regulation is a major challenge in biomedical research. This process is dominated by transcription factors (TFs) and mutations in their binding sites (TFBSs) may cause the misregulation of genes, eventually leading to disease. The consequences of DNA variants on TF binding are modelled in silico using binding matrices, but it remains unclear whether these are capable of accurately representing in vivo binding.

In this study, we present a systematic comparison of binding models for 82 human TFs from three freely available sources: JASPAR matrices, HT-SELEX-generated models and matrices derived from protein binding microarrays (PBMs). We determined their ability to detect experimentally verified “real” in vivo TFBSs derived from ENCODE ChIP-seq data. As negative controls we chose random downstream exonic sequences, which are unlikely to harbour TFBS. All models were assessed by receiver operating characteristics (ROC) analysis.

Results

While the area-under-curve was low for most of the tested models, with only 47% reaching a score of 0.7 or higher, we noticed strong differences between the various position-specific scoring matrices, with JASPAR and HT-SELEX models showing higher success rates than PBM-derived models. In addition, we found that while TFBS sequences showed a higher degree of conservation than randomly chosen sequences, there was a high variability between individual TFBSs.

Conclusions

Our results show that only a few of the matrix-based models used to predict potential TFBSs are able to reliably detect experimentally confirmed TFBSs.

We compiled our findings in a freely accessible web application called ePOSSUM (http://mutationtaster.charite.de/ePOSSUM/), which uses a Bayes classifier to assess the impact of genetic alterations on TF binding in user-defined sequences. Additionally, ePOSSUM provides information on the reliability of the prediction using our test set of experimentally confirmed binding sites.
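As a rough illustration of the evaluation described in this abstract (scoring sequences with a position-specific matrix, then summarising separation of positives from negative controls by ROC AUC), here is a minimal sketch. It is not the authors' pipeline: the names `TOY_PWM`, `log_odds_score`, `best_window_score`, and `roc_auc` are hypothetical, the matrix values are invented, and a uniform background of 0.25 per base is assumed.

```python
import math

# Hypothetical 3-position probability matrix (invented values, no zero entries).
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

def log_odds_score(pwm, site, background=0.25):
    """Sum of per-position log2 odds for a site as long as the matrix."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(site))

def best_window_score(pwm, sequence):
    """Highest log-odds score over all windows of the sequence."""
    width = len(pwm)
    return max(
        log_odds_score(pwm, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    )

def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (Mann-Whitney formulation).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

With this formulation, an AUC of 0.5 means the model separates bound sites from negative controls no better than chance, which gives context for the 0.7 threshold quoted in the Results.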

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-016-2729-8) contains supplementary material, which is available to authorized users.


Trimming of sequence reads alters RNA-Seq gene expression estimates. BMC Bioinformatics 2016;17:103. [PMID: 26911985 PMCID: PMC4766705 DOI: 10.1186/s12859-016-0956-2] [Citation(s) in RCA: 101] [Impact Index Per Article: 12.6] [Received: 10/09/2015] [Accepted: 02/19/2016] [Indexed: 01/08/2023]

Abstract

BACKGROUND

High-throughput RNA-Sequencing (RNA-Seq) has become the preferred technique for studying gene expression differences between biological samples and for discovering novel isoforms, though the techniques to analyze the resulting data are still immature. One pre-processing step that is widely but heterogeneously applied is trimming, in which low quality bases, identified by the probability that they are called incorrectly, are removed. However, the impact of trimming on subsequent alignment to a genome could influence downstream analyses including gene expression estimation; we hypothesized that this might occur in an inconsistent manner across different genes, resulting in differential bias.

RESULTS

To assess the effects of trimming on gene expression, we generated RNA-Seq data sets from four samples of larval Drosophila melanogaster sensory neurons, and used three trimming algorithms (SolexaQA, Trimmomatic, and ConDeTri) to perform quality-based trimming across a wide range of stringencies. After aligning the reads to the D. melanogaster genome with TopHat2, we used Cuffdiff2 to compare the original, untrimmed gene expression estimates to those following trimming. With the most aggressive trimming parameters, over ten percent of genes had significant changes in their estimated expression levels. This trend was seen with two additional RNA-Seq data sets and with alternative differential expression analysis pipelines. We found that the majority of the expression changes could be mitigated by imposing a minimum length filter following trimming, suggesting that the differential gene expression was primarily being driven by spurious mapping of short reads. Slight differences with the untrimmed data set remained after length filtering, which were associated with genes with low exon numbers and high GC content. Finally, an analysis of paired RNA-Seq/microarray data sets suggests that no or modest trimming results in the most biologically accurate gene expression estimates.

CONCLUSIONS

We find that aggressive quality-based trimming has a large impact on the apparent makeup of RNA-Seq-based gene expression estimates, and that short reads can have a particularly strong impact. We conclude that implementation of trimming in RNA-Seq analysis workflows warrants caution, and if used, should be used in conjunction with a minimum read length filter to minimize the introduction of unpredictable changes in expression estimates.
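The recommendation above (pair quality-based trimming with a minimum read length filter) can be sketched as follows. This is only an illustrative 3'-end trimmer, not the algorithm used by SolexaQA, Trimmomatic, or ConDeTri; the function names and the default thresholds (`min_qual=20`, `min_len=20`) are assumptions chosen for the example.

```python
def trim_3prime(seq, quals, min_qual=20):
    """Drop trailing bases whose Phred quality is below min_qual."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_qual:
        end -= 1
    return seq[:end], quals[:end]

def trim_and_filter(reads, min_qual=20, min_len=20):
    """Trim each (sequence, qualities) read, then discard reads shorter than min_len.

    The length filter is the step the abstract suggests for limiting
    spurious mapping of very short trimmed reads.
    """
    kept = []
    for seq, quals in reads:
        trimmed_seq, trimmed_quals = trim_3prime(seq, quals, min_qual)
        if len(trimmed_seq) >= min_len:
            kept.append((trimmed_seq, trimmed_quals))
    return kept
```

For example, calling `trim_and_filter(reads, min_qual=20, min_len=5)` on a six-base read whose last two qualities are 10 and 5 trims it to four bases and then drops it, while a read with uniformly high qualities passes through unchanged.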


Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, Smit AFA, Wheeler TJ. The Dfam database of repetitive DNA families. Nucleic Acids Res 2015;44:D81-9. [PMID: 26612867 PMCID: PMC4702899 DOI: 10.1093/nar/gkv1272] [Citation(s) in RCA: 421] [Impact Index Per Article: 46.8] [Received: 10/02/2015] [Accepted: 11/03/2015] [Indexed: 11/20/2022]

Hoen DR, Hickey G, Bourque G, Casacuberta J, Cordaux R, Feschotte C, Fiston-Lavier AS, Hua-Van A, Hubley R, Kapusta A, Lerat E, Maumus F, Pollock DD, Quesneville H, Smit A, Wheeler TJ, Bureau TE, Blanchette M. A call for benchmarking transposable element annotation methods. Mob DNA 2015;6:13. [PMID: 26244060 PMCID: PMC4524446 DOI: 10.1186/s13100-015-0044-6] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Received: 06/25/2015] [Accepted: 07/22/2015] [Indexed: 12/31/2022]

Affiliation(s)

Douglas R Hoen: School of Computer Science, McGill University, McConnell Engineering Bldg., Rm. 318, 3480 Rue University, Montréal, Québec H3A 0E9 Canada; Department of Biology, McGill University, Stewart Biology Bldg., 1205 Ave. du Docteur-Penfield, Montréal, Québec H3A 1B1 Canada
Glenn Hickey: School of Computer Science, McGill University, McConnell Engineering Bldg., Rm. 318, 3480 Rue University, Montréal, Québec H3A 0E9 Canada; McGill Centre for Bioinformatics, McGill University, Montréal, Québec Canada
Guillaume Bourque: Department of Human Genetics, McGill University, Montréal, Québec Canada; McGill University and Génome Québec Innovation Center, Montréal, Québec Canada
Josep Casacuberta: Centre for Research in Agricultural Genomics CSIC-IRTA-UAB-UB, 08193 Barcelona, Spain
Richard Cordaux: Université de Poitiers, UMR CNRS 7267 Ecologie et Biologie des Interactions, Equipe Ecologie Evolution Symbiose, 5 Rue Albert Turpin, 86073 Poitiers Cedex 9, France
Cédric Feschotte: Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, UT 84112 USA
Anna-Sophie Fiston-Lavier: Institut des Sciences de l'Evolution de Montpellier (ISE-M), Equipe Evolution, Vecteurs, Adaptation et Symbiose, UMR5554 CNRS-Université Montpellier, Montpellier, 34090 cedex 05 France
Aurélie Hua-Van: Laboratoire Evolution, Génomes, Comportement Ecologie, CNRS-Université Paris-Sud (UMR 9191)-IRD (UMR 247)-Université Paris-Saclay, F-91198 Gif-sur-Yvette, France
Robert Hubley: Institute for Systems Biology, 401 Terry Ave. N, Seattle, WA 98109 USA
Aurélie Kapusta: Department of Human Genetics, University of Utah School of Medicine, Salt Lake City, UT 84112 USA
Emmanuelle Lerat: Laboratoire Biometrie et Biologie Evolutive, Universite Claude Bernard-Lyon 1, UMR-CNRS 5558-Bat. Mendel, 43 bd du 11 novembre 1918, 69622 Villeurbanne cedex, France
Florian Maumus: INRA, UR1164 URGI-Research Unit in Genomics-Info, INRA de Versailles-Grignon, Route de Saint-Cyr, Versailles, 78026 France
David D Pollock: University of Colorado School of Medicine, Aurora, CO 80045 USA
Hadi Quesneville: INRA, UR1164 URGI-Research Unit in Genomics-Info, INRA de Versailles-Grignon, Route de Saint-Cyr, Versailles, 78026 France
Arian Smit: Institute for Systems Biology, 401 Terry Ave. N, Seattle, WA 98109 USA
Travis J Wheeler: Department of Computer Science, University of Montana, Missoula, MT 59812 USA
Thomas E Bureau: Department of Biology, McGill University, Stewart Biology Bldg., 1205 Ave. du Docteur-Penfield, Montréal, Québec H3A 1B1 Canada
Mathieu Blanchette: School of Computer Science, McGill University, McConnell Engineering Bldg., Rm. 318, 3480 Rue University, Montréal, Québec H3A 0E9 Canada; McGill Centre for Bioinformatics, McGill University, Montréal, Québec Canada
