1
Wesp V, Theißen G, Schuster S. Statistical analysis of synonymous and stop codons in pseudo-random and real sequences as a function of GC content. Sci Rep 2023; 13:22996. [PMID: 38151539 PMCID: PMC10752896 DOI: 10.1038/s41598-023-49626-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Accepted: 12/10/2023] [Indexed: 12/29/2023]
Abstract
Knowledge of the frequencies of synonymous triplets in protein-coding and non-coding DNA stretches can be used in gene finding. These frequencies depend on the GC content of the genome or parts of it. An example of interest is provided by stop codons. This is relevant for the definition of Open Reading Frames. A generic case is provided by pseudo-random sequences, especially when they code for complex proteins or when they are non-coding and not subject to selection pressure. Here, we calculate, for such sequences and for all 25 known genetic codes, the frequency of each amino acid and stop codon based on their set of codons and as a function of GC content. The amino acids can be classified into five groups according to the GC content where their expected frequency reaches its maximum. We determine the overall Shannon information based on groups of synonymous codons and show that it becomes maximum at a percent GC of 43.3% (for the standard code). This is in line with the observation that in most fungi, plants, and animals, this genomic parameter is in the range from 35 to 50%. By analysing natural sequences, we show that there is a clear bias for triplets corresponding to stop codons near the 5'- and 3'-splice sites in the introns of various clades.
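As a minimal sketch of the kind of calculation this abstract describes, the following assumes the simplest pseudo-random model, in which nucleotides are drawn independently with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2. The independence model and all function names are assumptions made for illustration, not necessarily the authors' procedure, and only the three stop codons of the standard code (TAA, TAG, TGA) are considered.

```python
# Sketch: expected stop-codon frequency as a function of GC content,
# assuming independently drawn nucleotides (an illustrative assumption).

STANDARD_STOP_CODONS = ("TAA", "TAG", "TGA")  # stop codons of the standard genetic code


def nucleotide_probs(gc):
    """Return per-nucleotide probabilities for a GC fraction gc in [0, 1]."""
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must lie between 0 and 1")
    return {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}


def codon_prob(codon, gc):
    """Probability of one triplet under the independence model."""
    probs = nucleotide_probs(gc)
    p = 1.0
    for base in codon:
        p *= probs[base]
    return p


def stop_codon_freq(gc, stops=STANDARD_STOP_CODONS):
    """Expected fraction of triplets that are stop codons at GC fraction gc."""
    return sum(codon_prob(codon, gc) for codon in stops)


if __name__ == "__main__":
    for gc in (0.3, 0.5, 0.7):
        print(f"GC={gc:.1f}  P(stop)={stop_codon_freq(gc):.4f}")
```

Under this model, an AT-rich sequence is expected to contain more stop triplets than a GC-rich one, because every stop codon of the standard code begins with T. Passing a different `stops` tuple would adapt the sketch to another genetic code.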
Affiliation(s)
- Valentin Wesp
- Department of Bioinformatics, Matthias Schleiden Institute, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743, Jena, Germany
- Günter Theißen
- Department of Genetics, Matthias Schleiden Institute, Friedrich Schiller University Jena, Philosophenweg 12, 07743, Jena, Germany
- Stefan Schuster
- Department of Bioinformatics, Matthias Schleiden Institute, Friedrich Schiller University Jena, Ernst-Abbe-Platz 2, 07743, Jena, Germany.
2
Tay Fernandez CG, Bayer PE, Petereit J, Varshney R, Batley J, Edwards D. The conservation of gene models can support genome annotation. The Plant Genome 2023; 16:e20377. [PMID: 37602500 DOI: 10.1002/tpg2.20377] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Revised: 07/19/2023] [Accepted: 07/24/2023] [Indexed: 08/22/2023]
Abstract
Many genome annotations include false-positive gene models, leading to errors in phylogenetic and comparative studies. Here, we propose a method to support gene model prediction based on evolutionary conservation and use it to identify potentially erroneous annotations. Using this method, we developed a set of 15,345 representative gene models from 12 legume assemblies that can be used to support genome annotations for other legumes.
Affiliation(s)
- Cassandria G Tay Fernandez
- School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, Western Australia, Australia
- Philipp E Bayer
- School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, Western Australia, Australia
- Jakob Petereit
- School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, Western Australia, Australia
- Rajeev Varshney
- State Agricultural Biotechnology Centre, Centre for Crop and Food Innovation, Food Futures Institute, Murdoch University, Murdoch, Western Australia, Australia
- Centre of Excellence in Genomics & Systems Biology, International Crops Research Institute for the Semi-Arid Tropics (ICRISAT), Hyderabad, Telangana, India
- Jacqueline Batley
- School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, Western Australia, Australia
- David Edwards
- School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, Western Australia, Australia
3
Singh N, Nath R, Singh DB. Splice-site identification for exon prediction using bidirectional LSTM-RNN approach. Biochem Biophys Rep 2022; 30:101285. [PMID: 35663929 PMCID: PMC9157471 DOI: 10.1016/j.bbrep.2022.101285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2022] [Revised: 05/18/2022] [Accepted: 05/19/2022] [Indexed: 11/24/2022]
Abstract
Machine learning methods have played a major role in improving the accuracy of prediction and classification of DNA (deoxyribonucleic acid) and protein sequences. In eukaryotes, however, splice-site identification and prediction is not straightforward because of numerous false positives. To address this problem, we present a bidirectional Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) based deep learning model developed to identify and predict splice sites for the prediction of exons from eukaryotic DNA sequences. During splicing of the primary mRNA transcript, the introns (the non-coding regions of the gene) are spliced out and the exons (the coding regions of the gene) are joined. The bidirectional LSTM-RNN model uses intron features that start with the splice-site donor (GT) and end with the splice-site acceptor (AG), subject to length constraints. The model was improved by increasing the number of training epochs and achieved a maximum accuracy of 95.5%. It was trained and tested on genome data of Cryptosporidium parvum, shows better accuracy than some other deep learning methods, and is compatible with large sequential data such as a complete genome. The exons predicted by this method aid in protein modeling for specific drug targets.
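The intron feature mentioned in this abstract (a span that begins with the donor dinucleotide GT and ends with the acceptor dinucleotide AG, within length limits) can be illustrated with a simple scan. This is only a sketch of candidate enumeration, not the bidirectional LSTM-RNN model; the function name and the bounds `min_len` and `max_len` are hypothetical and chosen purely for illustration.

```python
# Sketch: enumerate candidate introns delimited by GT ... AG.
# The length bounds are hypothetical illustration values, not the paper's.


def candidate_introns(seq, min_len=4, max_len=50):
    """Return (start, end) pairs, end exclusive, such that seq[start:end]
    starts with GT, ends with AG, and has min_len <= length <= max_len.

    min_len should be at least 4 so that the donor and acceptor
    dinucleotides cannot overlap.
    """
    seq = seq.upper()
    # Positions of every donor (GT) and acceptor (AG) dinucleotide.
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]

    candidates = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d  # span from the G of GT through the G of AG
            if min_len <= length <= max_len:
                candidates.append((d, a + 2))
    return candidates


if __name__ == "__main__":
    dna = "CCGTAAAAGCC"
    for start, end in candidate_introns(dna):
        print(start, end, dna[start:end])
```

Because GT and AG are frequent dinucleotides, a scan like this returns many spurious candidates on real sequence, which is the false-positive problem that motivates a learned classifier.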
4
Ray A. Machine learning in postgenomic biology and personalized medicine. Wiley Interdisciplinary Reviews. Data Mining and Knowledge Discovery 2022; 12:e1451. [PMID: 35966173 PMCID: PMC9371441 DOI: 10.1002/widm.1451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2020] [Accepted: 12/22/2021] [Indexed: 06/15/2023]
Abstract
In recent years Artificial Intelligence in the form of machine learning has been revolutionizing biology, biomedical sciences, and gene-based agricultural technology capabilities. Massive data generated in biological sciences by rapid and deep gene sequencing and protein or other molecular structure determination, on the one hand, requires data analysis capabilities using machine learning that are distinctly different from classical statistical methods; on the other, these large datasets are enabling the adoption of novel data-intensive machine learning algorithms for the solution of biological problems that until recently had relied on mechanistic model-based approaches that are computationally expensive. This review provides a bird's eye view of the applications of machine learning in post-genomic biology. Attempt is also made to indicate as far as possible the areas of research that are poised to make further impacts in these areas, including the importance of explainable artificial intelligence (XAI) in human health. Further contributions of machine learning are expected to transform medicine, public health, agricultural technology, as well as to provide invaluable gene-based guidance for the management of complex environments in this age of global warming.
Affiliation(s)
- Animesh Ray
- Riggs School of Applied Life Sciences, Keck Graduate Institute, 535 Watson Drive, Claremont, CA 91711, USA
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
5
Dimonaco NJ, Aubrey W, Kenobi K, Clare A, Creevey CJ. No one tool to rule them all: prokaryotic gene prediction tool annotations are highly dependent on the organism of study. Bioinformatics 2021; 38:1198-1207. [PMID: 34875010 PMCID: PMC8825762 DOI: 10.1093/bioinformatics/btab827] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 05/26/2021] [Revised: 11/13/2021] [Accepted: 12/02/2021] [Indexed: 01/06/2023]
Abstract
MOTIVATION The biases in CoDing Sequence (CDS) prediction tools, which have been based on historic genomic annotations from model organisms, impact our understanding of novel genomes and metagenomes. This hinders the discovery of new genomic information as it results in predictions being biased towards existing knowledge. To date, users have lacked a systematic and replicable approach to identify the strengths and weaknesses of any CDS prediction tool and allow them to choose the right tool for their analysis. RESULTS We present an evaluation framework (ORForise) based on a comprehensive set of 12 primary and 60 secondary metrics that facilitate the assessment of the performance of CDS prediction tools. This makes it possible to identify which performs better for specific use-cases. We use this to assess 15 ab initio- and model-based tools representing those most widely used (historically and currently) to generate the knowledge in genomic databases. We find that the performance of any tool is dependent on the genome being analysed, and no individual tool ranked as the most accurate across all genomes or metrics analysed. Even the top-ranked tools produced conflicting gene collections, which could not be resolved by aggregation. The ORForise evaluation framework provides users with a replicable, data-led approach to make informed tool choices for novel genome annotations and for refining historical annotations. AVAILABILITY AND IMPLEMENTATION Code and datasets for reproduction and customisation are available at https://github.com/NickJD/ORForise. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nicholas J Dimonaco
- Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth SY23 3PD, UK
- Wayne Aubrey
- Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, UK
- Kim Kenobi
- Department of Mathematics, Aberystwyth University, Aberystwyth SY23 3BZ, UK
- Amanda Clare
- Department of Computer Science, Aberystwyth University, Aberystwyth SY23 3DB, UK
6
Li R, Li L, Xu Y, Yang J. Machine learning meets omics: applications and perspectives. Brief Bioinform 2021; 23:6425809. [PMID: 34791021 DOI: 10.1093/bib/bbab460] [Citation(s) in RCA: 40] [Impact Index Per Article: 13.3] [Received: 08/23/2021] [Revised: 09/29/2021] [Accepted: 10/07/2021] [Indexed: 02/07/2023]
Abstract
The innovation of biotechnologies has allowed the accumulation of omics data at an alarming rate, thus introducing the era of 'big data'. Extracting inherent valuable knowledge from various omics data remains a daunting problem in bioinformatics. Better solutions often need some kind of more innovative methods for efficient handlings and effective results. Recent advancements in integrated analysis and computational modeling of multi-omics data helped address such needs in an increasingly harmonious manner. The development and application of machine learning have largely advanced our insights into biology and biomedicine and greatly promoted the development of therapeutic strategies, especially for precision medicine. Here, we propose a comprehensive survey and discussion on what happened, is happening and will happen when machine learning meets omics. Specifically, we describe how artificial intelligence can be applied to omics studies and review recent advancements at the interface between machine learning and the ever-widest range of omics including genomics, transcriptomics, proteomics, metabolomics, radiomics, as well as those at the single-cell resolution. We also discuss and provide a synthesis of ideas, new insights, current challenges and perspectives of machine learning in omics.
Affiliation(s)
- Rufeng Li
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
- Lixin Li
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
- Yungang Xu
- School of Electronics and Information, Northwestern Polytechnical University, Xi'an, 710129, China
- Juan Yang
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China; Key Laboratory of Environment and Genes Related to Diseases (Xi'an Jiaotong University), Ministry of Education of China, Xi'an 710061, P. R. China
7
In-Depth Annotation of the Drosophila Bithorax-Complex Reveals the Presence of Several Alternative ORFs That Could Encode for Motif-Rich Peptides. Cells 2021; 10:cells10112983. [PMID: 34831206 PMCID: PMC8616405 DOI: 10.3390/cells10112983] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2021] [Revised: 10/17/2021] [Accepted: 10/26/2021] [Indexed: 11/19/2022]
Abstract
It is recognized that a large proportion of eukaryotic RNAs and proteins is not produced from conventional genes but from short and alternative (alt) open reading frames (ORFs) that are not captured by gene prediction programs. Here we present an in silico prediction of altORFs by applying several selecting filters based on evolutionary conservation and annotations of previously characterized altORF peptides. Our work was performed in the Bithorax-complex (BX-C), which was one of the first genomic regions described to contain long non-coding RNAs in Drosophila. We showed that several altORFs could be predicted from coding and non-coding sequences of BX-C. In addition, the selected altORFs encode for proteins that contain several interesting molecular features, such as the presence of transmembrane helices or a general propensity to be rich in short interaction motifs. Of particular interest, one altORF encodes for a protein that contains a peptide sequence found in specific isoforms of two Drosophila Hox proteins. Our work thus suggests that several altORF proteins could be produced from a particular genomic region known for its critical role during Drosophila embryonic development. The molecular signatures of these altORF proteins further suggests that several of them could make numerous protein–protein interactions and be of functional importance in vivo.
8
Pseudogene ACTBP2 increases blood-brain barrier permeability by promoting KHDRBS2 transcription through recruitment of KMT2D/WDR5 in Aβ 1-42 microenvironment. Cell Death Discov 2021; 7:142. [PMID: 34127651 PMCID: PMC8203645 DOI: 10.1038/s41420-021-00531-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/11/2021] [Revised: 04/26/2021] [Accepted: 05/23/2021] [Indexed: 11/29/2022]
Abstract
The blood–brain barrier (BBB) has a vital role in maintaining the homeostasis of the central nervous system (CNS). Changes in the structure and function of BBB can accelerate Alzheimer’s disease (AD) development. β-Amyloid (Aβ) deposition is the major pathological event of AD. We elucidated the function and possible molecular mechanisms of the effect of pseudogene ACTBP2 on the permeability of BBB in Aβ1–42 microenvironment. BBB model treated with Aβ1–42 for 48 h were used to simulate Aβ-mediated BBB dysfunction in AD. We proved that pseudogene ACTBP2, RNA-binding protein KHDRBS2, and transcription factor HEY2 are highly expressed in ECs that were obtained in a BBB model in vitro in Aβ1–42 microenvironment. In Aβ1–42-incubated ECs, ACTBP2 recruits methyltransferases KMT2D and WDR5, binds to KHDRBS2 promoter, and promotes KHDRBS2 transcription. The interaction of KHDRBS2 with the 3′UTR of HEY2 mRNA increases the stability of HEY2 and promotes its expression. HEY2 increases BBB permeability in Aβ1–42 microenvironment by transcriptionally inhibiting the expression of ZO-1, occludin, and claudin-5. We confirmed that knocking down of Khdrbs2 or Hey2 increased the expression levels of ZO-1, occludin, and claudin-5 in APP/PS1 mice brain microvessels. ACTBP2/KHDRBS2/HEY2 axis has a crucial role in the regulation of BBB permeability in Aβ1–42 microenvironment, which may provide a novel target for the therapy of AD.
9
Ejigu GF, Jung J. Review on the Computational Genome Annotation of Sequences Obtained by Next-Generation Sequencing. Biology 2020; 9:E295. [PMID: 32962098 PMCID: PMC7565776 DOI: 10.3390/biology9090295] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 08/21/2020] [Revised: 09/13/2020] [Accepted: 09/16/2020] [Indexed: 12/16/2022]
Abstract
Next-Generation Sequencing (NGS) has made it easier to obtain genome-wide sequence data and it has shifted the research focus into genome annotation. The challenging tasks involved in annotation rely on the currently available tools and techniques to decode the information contained in nucleotide sequences. This information will improve our understanding of general aspects of life and evolution and improve our ability to diagnose genetic disorders. Here, we present a summary of both structural and functional annotations, as well as the associated comparative annotation tools and pipelines. We highlight visualization tools that immensely aid the annotation process and the contributions of the scientific community to the annotation. Further, we discuss quality-control practices and the need for re-annotation, and highlight the future of annotation.
Affiliation(s)
- Jaehee Jung
- Department of Information and Communication Engineering, Myongji University, Yongin-si 17058, Gyeonggi-do, Korea
10
Eisenberg AR, Higdon AL, Hollerer I, Fields AP, Jungreis I, Diamond PD, Kellis M, Jovanovic M, Brar GA. Translation Initiation Site Profiling Reveals Widespread Synthesis of Non-AUG-Initiated Protein Isoforms in Yeast. Cell Syst 2020; 11:145-160.e5. [PMID: 32710835 PMCID: PMC7508262 DOI: 10.1016/j.cels.2020.06.011] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 03/10/2020] [Revised: 05/18/2020] [Accepted: 06/24/2020] [Indexed: 12/27/2022]
Abstract
Genomic analyses in budding yeast have helped define the foundational principles of eukaryotic gene expression. However, in the absence of empirical methods for defining coding regions, these analyses have historically excluded specific classes of possible coding regions, such as those initiating at non-AUG start codons. Here, we applied an experimental approach to globally annotate translation initiation sites in yeast and identified 149 genes with alternative N-terminally extended protein isoforms initiating from near-cognate codons upstream of annotated AUG start codons. These isoforms are produced in concert with canonical isoforms and translated with high specificity, resulting from initiation at only a small subset of possible start codons. The non-AUG initiation driving their production is enriched during meiosis and induced by low eIF5A, which is seen in this context. These findings reveal widespread production of non-canonical protein isoforms and unexpected complexity to the rules by which even a simple eukaryotic genome is decoded.
Affiliation(s)
- Amy R Eisenberg
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Andrea L Higdon
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA 94720, USA; Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Ina Hollerer
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Alexander P Fields
- Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA 94158, USA
- Irwin Jungreis
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Paige D Diamond
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Manolis Kellis
- MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Marko Jovanovic
- Department of Biological Sciences, Columbia University, New York, NY 10027, USA
- Gloria A Brar
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA 94720, USA; Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA.
11
Hypoxia-induced lncRNA PDIA3P1 promotes mesenchymal transition via sponging of miR-124-3p in glioma. Cell Death Dis 2020; 11:168. [PMID: 32127518 PMCID: PMC7054337 DOI: 10.1038/s41419-020-2345-z] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 01/03/2020] [Revised: 02/09/2020] [Accepted: 02/10/2020] [Indexed: 12/12/2022]
Abstract
Hypoxia is a critical factor in the malignant progression of glioma, especially for the highly-invasive mesenchymal (MES) subtype. But the detailed mechanisms in hypoxia-induced glioma MES transition remain elusive. Pseudogenes, once considered to be non-functional relics of evolution, are emerging as a critical factor in human tumorigenesis and progression. Here, we investigated the clinical significance, biological function, and mechanisms of protein disulfide isomerase family A member 3 pseudogene 1 (PDIA3P1) in hypoxia-induced glioma MES transition. In this study, we found that PDIA3P1 expression was closely related to tumor degree, transcriptome subtype, and prognosis in glioma patients. Enrichment analysis found that high PDIA3P1 expression was associated with epithelial-mesenchymal transition, extracellular matrix (ECM) disassembly, and angiogenesis. In vitro study revealed that overexpression of PDIA3P1 enhanced the migration and invasion capacity of glioma cells, while knockdown of PDIA3P1 induced the opposite effect. Further studies revealed that PDIA3P1 functions as a ceRNA, sponging miR-124-3p to modulate RELA expression and activate the downstream NF-κB pathway, thus promoting the MES transition of glioma cells. In addition, Hypoxia Inducible Factor 1 was confirmed to directly bind to the PDIA3P1 promotor region and activate its transcription. In conclusion, PDIA3P1 is a crucial link between hypoxia and glioma MES transition through the PDIA3P1-miR-124-3p-RELA axis, which may serve as a prognostic indicator and potential therapeutic target for glioma treatment.
12
Wilbrandt J, Misof B, Panfilio KA, Niehuis O. Repertoire-wide gene structure analyses: a case study comparing automatically predicted and manually annotated gene models. BMC Genomics 2019; 20:753. [PMID: 31623555 PMCID: PMC6798390 DOI: 10.1186/s12864-019-6064-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/18/2018] [Accepted: 08/27/2019] [Indexed: 02/06/2023]
Abstract
Background The location and modular structure of eukaryotic protein-coding genes in genomic sequences can be automatically predicted by gene annotation algorithms. These predictions are often used for comparative studies on gene structure, gene repertoires, and genome evolution. However, automatic annotation algorithms do not yet correctly identify all genes within a genome, and manual annotation is often necessary to obtain accurate gene models and gene sets. As manual annotation is time-consuming, only a fraction of the gene models in a genome is typically manually annotated, and this fraction often differs between species. To assess the impact of manual annotation efforts on genome-wide analyses of gene structural properties, we compared the structural properties of protein-coding genes in seven diverse insect species sequenced by the i5k initiative. Results Our results show that the subset of genes chosen for manual annotation by a research community (3.5–7% of gene models) may have structural properties (e.g., lengths and exon counts) that are not necessarily representative for a species’ gene set as a whole. Nonetheless, the structural properties of automatically generated gene models are only altered marginally (if at all) through manual annotation. Major correlative trends, for example a negative correlation between genome size and exonic proportion, can be inferred from either the automatically predicted or manually annotated gene models alike. Vice versa, some previously reported trends did not appear in either the automatic or manually annotated gene sets, pointing towards insect-specific gene structural peculiarities. Conclusions In our analysis of gene structural properties, automatically predicted gene models proved to be sufficiently reliable to recover the same gene-repertoire-wide correlative trends that we found when focusing on manually annotated gene models only. 
We acknowledge that analyses on the individual gene level clearly benefit from manual curation. However, as genome sequencing and annotation projects often differ in the extent of their manual annotation and curation efforts, our results indicate that comparative studies analyzing gene structural properties in these genomes can nonetheless be justifiable and informative.
Affiliation(s)
- Jeanne Wilbrandt
- Center for molecular Biodiversity Research, Zoological Research Museum Alexander Koenig (ZFMK), Adenauerallee 160, 53113, Bonn, Germany. Present address: Hoffmann Research Group, Leibniz Institute on Aging - Fritz Lipmann Institute, Beutenbergstraße 11, 07745, Jena, Germany.
- Bernhard Misof
- Center for molecular Biodiversity Research, Zoological Research Museum Alexander Koenig (ZFMK), Adenauerallee 160, 53113, Bonn, Germany
- Kristen A Panfilio
- School of Life Sciences, University of Warwick, Gibbet Hill Campus, Coventry, CV4 7AL, UK
- Oliver Niehuis
- Evolutionary Biology and Ecology, Institute of Biology I (Zoology), Albert Ludwig University, Hauptstr. 1, 79104, Freiburg, Germany
13
Hatje K, Mühlhausen S, Simm D, Kollmar M. The Protein-Coding Human Genome: Annotating High-Hanging Fruits. Bioessays 2019; 41:e1900066. [PMID: 31544971 DOI: 10.1002/bies.201900066] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/15/2019] [Revised: 08/07/2019] [Indexed: 12/19/2022]
Abstract
The major transcript variants of human protein-coding genes are annotated to a certain degree of accuracy combining manual curation, transcript data, and proteomics evidence. However, there is considerable disagreement on the annotation of about 2000 genes-they can be protein-coding, noncoding, or pseudogenes-and on the annotation of most of the predicted alternative transcripts. Pure transcriptome mapping approaches seem to be limited in discriminating functional expression from noise. These limitations have partially been overcome by dedicated algorithms to detect alternative spliced micro-exons and wobble splice variants. Recently, knowledge about splice mechanism and protein structure are incorporated into an algorithm to predict neighboring homologous exons, often spliced in a mutually exclusive manner. Predicted exons are evaluated by transcript data, structural compatibility, and evolutionary conservation, revealing hundreds of novel coding exons and splice mechanism re-assignments. The emerging human pan-genome is necessitating distinctive annotations incorporating differences between individuals and between populations.
Affiliation(s)
- Klas Hatje
- Roche Pharmaceutical Research and Early Development, Pharmaceutical Sciences, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Grenzacherstr. 124, 4070, Basel, Switzerland
- Stefanie Mühlhausen
- Group Systems Biology of Motor Proteins, Department of NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, 37077, Göttingen, Germany
- Dominic Simm
- Group Systems Biology of Motor Proteins, Department of NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, 37077, Göttingen, Germany; Theoretical Computer Science and Algorithmic Methods, Institute of Computer Science, Georg-August-University Göttingen, Goldschmidtstr. 7, 37077, Göttingen, Germany
- Martin Kollmar
- Group Systems Biology of Motor Proteins, Department of NMR-based Structural Biology, Max-Planck-Institute for Biophysical Chemistry, Am Fassberg 11, 37077, Göttingen, Germany
14
Abstract
Every microarray experiment is based on a common format. First, a large number of nucleotide "spots" are arrayed onto a substrate, typically a glass slide, a silicon chip, or microbeads. Second, a complex population of nucleic acids (isolated from cells, selected from in vitro-synthesized libraries, or obtained from another source) is labeled, typically with fluorescent dyes. Third, the labeled nucleic acids are allowed to hybridize to their complementary spot(s) on the microarray. Fourth, the hybridized microarray is washed, allowing the amount of hybridized label to then be quantified. Analysis of the raw data generates a readout of the levels of each species of RNA in the original complex population. This introduction includes several examples of microarray applications and provides a discussion of the basic steps of most microarray experiments.
15
Sieber P, Voigt K, Kämmer P, Brunke S, Schuster S, Linde J. Comparative Study on Alternative Splicing in Human Fungal Pathogens Suggests Its Involvement During Host Invasion. Front Microbiol 2018; 9:2313. [PMID: 30333805 PMCID: PMC6176087 DOI: 10.3389/fmicb.2018.02313] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 07/15/2018] [Accepted: 09/11/2018] [Indexed: 11/13/2022]
Abstract
Alternative splicing (AS) is an important regulatory mechanism in eukaryotes, but little is known about its impact in fungi. Human fungal pathogens are of high clinical interest because they cause recurrent or life-threatening infections. AS can be investigated genome-wide and quantitatively with RNA-Seq. Here, we systematically studied AS in human fungal pathogens based on RNA-Seq data. To do so, we investigated its effect in seven fungi during conditions simulating ex vivo infection processes and during in vitro stress. Genes undergoing AS are species-specific and act independently of differentially expressed genes, pointing to an independent mechanism to change abundance and functionality. Candida species stand out with a low number of introns with higher and more varying lengths and more alternative splice sites. Moreover, we identified a functional difference between the response to host and other stress conditions: during stress, AS affects more genes and is involved in diverse regulatory functions. In contrast, during response-to-host conditions, genes undergoing AS have membrane functionalities and might be involved in the interaction with the host. We assume that AS plays a crucial regulatory role in pathogenic fungi and is important in both response to host and stress conditions.
Affiliation(s)
- Patricia Sieber
- Department of Bioinformatics, Faculty of Biological Sciences, Friedrich Schiller University, Jena, Germany.,Research Group Systems Biology, Bioinformatics, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
| | - Kerstin Voigt
- Jena Microbial Resource Collection, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany.,Institute of Microbiology, Faculty of Biological Sciences, Friedrich Schiller University, Jena, Germany
| | - Philipp Kämmer
- Microbial Pathogenicity Mechanisms, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
| | - Sascha Brunke
- Microbial Pathogenicity Mechanisms, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany
| | - Stefan Schuster
- Department of Bioinformatics, Faculty of Biological Sciences, Friedrich Schiller University, Jena, Germany
| | - Jörg Linde
- Research Group PiDOMICS, Leibniz Institute for Natural Product Research and Infection Biology - Hans Knöll Institute, Jena, Germany.,Institute for Bacterial Infections and Zoonoses, Federal Research Institute for Animal Health-Friedrich-Loeffler-Institute, Jena, Germany
|
16
|
Budamgunta H, Olexiouk V, Luyten W, Schildermans K, Maes E, Boonen K, Menschaert G, Baggerman G. Comprehensive Peptide Analysis of Mouse Brain Striatum Identifies Novel sORF-Encoded Polypeptides. Proteomics 2018; 18:e1700218. [DOI: 10.1002/pmic.201700218] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 12/21/2017] [Revised: 03/30/2018] [Indexed: 11/10/2022]
Affiliation(s)
| | - Volodimir Olexiouk
- BioBix; Lab for Bioinformatics and Computational Genomics; Department of Mathematical Modelling; Statistics and Bio-informatics; Ghent University; Ghent Belgium
| | - Walter Luyten
- Animal Physiology and Neurobiology; KULeuven; Leuven Belgium
| | | | - Evelyne Maes
- Centre for Proteomics; UAntwerp; Antwerp Belgium
- Proteins and Biomaterials; AgResearch; Christchurch New Zealand
| | - Kurt Boonen
- Centre for Proteomics; UAntwerp; Antwerp Belgium
- Unit Environmental Risk and Health; VITO; Mol Belgium
| | - Gerben Menschaert
- BioBix; Lab for Bioinformatics and Computational Genomics; Department of Mathematical Modelling; Statistics and Bio-informatics; Ghent University; Ghent Belgium
| | - Geert Baggerman
- Centre for Proteomics; UAntwerp; Antwerp Belgium
- Unit Environmental Risk and Health; VITO; Mol Belgium
|
17
|
The Definition of Open Reading Frame Revisited. Trends Genet 2018; 34:167-170. [DOI: 10.1016/j.tig.2017.12.009] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 02/28/2017] [Revised: 12/07/2017] [Accepted: 12/13/2017] [Indexed: 11/22/2022]
|
18
|
Li LJ, Leng RX, Fan YG, Pan HF, Ye DQ. Translation of noncoding RNAs: Focus on lncRNAs, pri-miRNAs, and circRNAs. Exp Cell Res 2017; 361:1-8. [PMID: 29031633 DOI: 10.1016/j.yexcr.2017.10.010] [Citation(s) in RCA: 87] [Impact Index Per Article: 12.4] [Received: 07/26/2017] [Revised: 09/17/2017] [Accepted: 10/11/2017] [Indexed: 02/06/2023]
Abstract
The mammalian genome is pervasively transcribed, producing a large number of noncoding RNAs (ncRNAs), including long noncoding RNAs (lncRNAs), primary miRNAs (pri-miRNAs), and circular RNAs (circRNAs). The translation of these ncRNAs has long been overlooked. A growing number of studies based on ribosome profiling in various organisms, however, provide important clues to the unanticipated translation potential of lncRNAs. Moreover, a few functional peptides encoded by lncRNAs and pri-miRNAs underline the significance of their translation. Recently, several studies have also provided evidence for the translation of endogenous circRNAs. Given the functional significance exemplified by peptides translated from some ncRNAs and their pervasive translation, it is not too far-fetched to imagine that abnormal translation of ncRNAs may contribute to human diseases. Though challenging, deciphering ncRNA translation is required for a comprehensive understanding of biology and medicine. In this review, we first present evidence concerning the translation potential of lncRNAs and go on to introduce a few functional short peptides encoded by lncRNAs. Then, salient observations showing translation of pri-miRNAs and circRNAs are described in detail. We end by discussing the impact of ncRNA translation beyond producing peptides and referring briefly to the potential role of abnormal ncRNA translation in human diseases.
Affiliation(s)
- Lian-Ju Li
- Department of Epidemiology and Biostatistics, School of Public Health, Anhui Medical University, 81 Meishan Road, Hefei 230032, Anhui, China; Anhui Province Key Laboratory of Major Autoimmune Diseases, Hefei 230032, Anhui, China
| | - Rui-Xue Leng
- Department of Epidemiology and Biostatistics, School of Public Health, Anhui Medical University, 81 Meishan Road, Hefei 230032, Anhui, China; Anhui Province Key Laboratory of Major Autoimmune Diseases, Hefei 230032, Anhui, China
| | - Yin-Guang Fan
- Department of Epidemiology and Biostatistics, School of Public Health, Anhui Medical University, 81 Meishan Road, Hefei 230032, Anhui, China; Anhui Province Key Laboratory of Major Autoimmune Diseases, Hefei 230032, Anhui, China
| | - Hai-Feng Pan
- Department of Epidemiology and Biostatistics, School of Public Health, Anhui Medical University, 81 Meishan Road, Hefei 230032, Anhui, China; Anhui Province Key Laboratory of Major Autoimmune Diseases, Hefei 230032, Anhui, China
| | - Dong-Qing Ye
- Department of Epidemiology and Biostatistics, School of Public Health, Anhui Medical University, 81 Meishan Road, Hefei 230032, Anhui, China; Anhui Province Key Laboratory of Major Autoimmune Diseases, Hefei 230032, Anhui, China.
|
19
|
Chan KL, Tatarinova TV, Rosli R, Amiruddin N, Azizi N, Halim MAA, Sanusi NSNM, Jayanthi N, Ponomarenko P, Triska M, Solovyev V, Firdaus-Raih M, Sambanthamurthi R, Murphy D, Low ETL. Evidence-based gene models for structural and functional annotations of the oil palm genome. Biol Direct 2017; 12:21. [PMID: 28886750 PMCID: PMC5591544 DOI: 10.1186/s13062-017-0191-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Received: 04/27/2017] [Accepted: 08/07/2017] [Indexed: 11/13/2022] Open
Abstract
Background: Oil palm is an important source of edible oil. The importance of the crop, as well as its long breeding cycle (10-12 years), led to the sequencing of its genome in 2013 to pave the way for genomics-guided breeding. Nevertheless, the first set of gene predictions, although useful, had many fragmented genes. Classification and characterization of genes associated with traits of interest, such as those for fatty acid biosynthesis and disease resistance, were also limited. Lipid-, especially fatty acid (FA)-related genes are of particular interest for the oil palm as they specify oil yields and quality. This paper presents the characterization of the oil palm genome using different gene prediction methods and comparative genomics analysis, identification of FA biosynthesis and disease resistance genes, and the development of an annotation database and bioinformatics tools.
Results: Using two independent gene-prediction pipelines, Fgenesh++ and Seqping, 26,059 oil palm genes with transcriptome and RefSeq support were identified from the oil palm genome. These coding regions of the genome have a characteristic broad distribution of GC3 (fraction of cytosine and guanine in the third position of a codon) with over half the GC3-rich genes (GC3 ≥ 0.75286) being intronless. In comparison, only one-seventh of the oil palm genes identified are intronless. Using comparative genomics analysis, characterization of conserved domains and active sites, and expression analysis, 42 key genes involved in FA biosynthesis in oil palm were identified. For three of them, namely EgFABF, EgFABH and EgFAD3, segmental duplication events were detected. Our analysis also identified 210 candidate resistance genes in six classes, grouped by their protein domain structures.
Conclusions: We present an accurate and comprehensive annotation of the oil palm genome, focusing on analysis of important categories of genes (GC3-rich and intronless), as well as those associated with important functions, such as FA biosynthesis and disease resistance. The study demonstrated the advantages of having an integrated approach to gene prediction and developed a computational framework for combining multiple genome annotations. These results, available in the oil palm annotation database (http://palmxplore.mpob.gov.my), will provide important resources for studies on the genomes of oil palm and related crops.
Reviewers: This article was reviewed by Alexander Kel, Igor Rogozin, and Vladimir A. Kuznetsov.
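The GC3 measure defined in this abstract (fraction of cytosine and guanine in the third codon position) can be illustrated with a minimal Python sketch. The function name `gc3`, the constant name, and the example sequence are hypothetical and chosen only for illustration; the sketch assumes an in-frame coding sequence, counts the stop codon like any other codon, and is not the authors' pipeline. The threshold value is the one quoted in the abstract.

```python
# Threshold for "GC3-rich" genes as quoted in the abstract above.
GC3_RICH_THRESHOLD = 0.75286


def gc3(cds: str) -> float:
    """Return the fraction of G or C at third codon positions.

    Assumes `cds` is an in-frame coding sequence starting at codon position 1.
    A trailing partial codon, if any, is ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3   # drop an incomplete final codon
    third = seq[2:usable:3]            # every third base: codon position 3
    if not third:
        raise ValueError("sequence shorter than one codon")
    return sum(base in "GC" for base in third) / len(third)


# Hypothetical toy sequence: codons ATG GCC AAG TGA, third bases G, C, G, A.
example = "ATGGCCAAGTGA"
value = gc3(example)
print(value, value >= GC3_RICH_THRESHOLD)
```

For the toy sequence, three of the four third-position bases are G or C, so the value falls just below the quoted threshold.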
Affiliation(s)
- Kuang-Lim Chan
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia.,Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
| | - Tatiana V Tatarinova
- Department of Biology, University of La Verne, La Verne, California, 91750, USA.,Spatial Sciences Institute, University of Southern California, Los Angeles, CA, 90089, USA
| | - Rozana Rosli
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia.,Genomics and Computational Biology Research Group, University of South Wales, Pontypridd, CF371DL, UK
| | - Nadzirah Amiruddin
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
| | - Norazah Azizi
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
| | - Mohd Amin Ab Halim
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
| | - Nik Shazana Nik Mohd Sanusi
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
| | - Nagappan Jayanthi
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
| | - Petr Ponomarenko
- Spatial Sciences Institute, University of Southern California, Los Angeles, CA, 90089, USA
| | - Martin Triska
- Children's Hospital Los Angeles, University of Southern California, Los Angeles, CA, 90089, USA
| | - Victor Solovyev
- Softberry Inc., 116 Radio Circle, Suite 400, Mount Kisco, NY, 10549, USA
| | - Mohd Firdaus-Raih
- Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
| | - Ravigadevi Sambanthamurthi
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia
| | - Denis Murphy
- Genomics and Computational Biology Research Group, University of South Wales, Pontypridd, CF371DL, UK
| | - Eng-Ti Leslie Low
- Advanced Biotechnology and Breeding Centre, Malaysian Palm Oil Board, No. 6, Persiaran Institusi, Bandar Baru Bangi, 43000 Kajang, Selangor, Malaysia.
|
20
|
Vitale L, Caracausi M, Casadei R, Pelleri MC, Piovesan A. Difficulty in obtaining the complete mRNA coding sequence at 5' region (5' end mRNA artifact): Causes, consequences in biology and medicine and possible solutions for obtaining the actual amino acid sequence of proteins (Review). Int J Mol Med 2017; 39:1063-1071. [PMID: 28393177 DOI: 10.3892/ijmm.2017.2942] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/22/2016] [Accepted: 03/16/2017] [Indexed: 11/06/2022] Open
Abstract
The known difficulty in obtaining the actual full length, complete sequence of a messenger RNA (mRNA) may lead to the erroneous determination of its coding sequence at the 5' region (5' end mRNA artifact), and consequently to the wrong assignment of the translation start codon, leading to the inaccurate prediction of the encoded polypeptide at its amino terminus. Among the known human genes whose study was affected by this artifact, we can include disco interacting protein 2 homolog A (DIP2A; KIAA0184), Down syndrome critical region 1 (DSCR1), SON DNA binding protein (SON), trefoil factor 3 (TFF3) and URB1 ribosome biogenesis 1 homolog (URB1; KIAA0539) on chromosome 21, as well as receptor for activated C kinase 1 (RACK1, also known as GNB2L1), glutaminyl‑tRNA synthetase (QARS) and tyrosyl-DNA phosphodiesterase 2 (TDP2) along with another 474 loci, including interleukin 16 (IL16). In this review, we discuss the causes of this issue, its quantitative incidence in biomedical research, the consequences in biology and medicine, and the possible solutions for obtaining the actual amino acid sequence of proteins in the post-genomics era.
Affiliation(s)
- Lorenza Vitale
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, I‑40126 Bologna, Italy
| | - Maria Caracausi
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, I‑40126 Bologna, Italy
| | - Raffaella Casadei
- Department for Life Quality Studies, University of Bologna, I‑47921 Rimini, Italy
| | - Maria Chiara Pelleri
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, I‑40126 Bologna, Italy
| | - Allison Piovesan
- Department of Experimental, Diagnostic and Specialty Medicine (DIMES), Unit of Histology, Embryology and Applied Biology, University of Bologna, I‑40126 Bologna, Italy
|
21
|
Zhang J, Yang MK, Zeng H, Ge F. GAPP: A Proteogenomic Software for Genome Annotation and Global Profiling of Post-translational Modifications in Prokaryotes. Mol Cell Proteomics 2016; 15:3529-3539. [PMID: 27630248 DOI: 10.1074/mcp.m116.060046] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 04/07/2016] [Indexed: 11/06/2022] Open
Abstract
Although the number of sequenced prokaryotic genomes is growing rapidly, experimentally verified annotation of prokaryotic genomes remains patchy and challenging. To facilitate genome annotation efforts for prokaryotes, we developed open-source software called GAPP for genome annotation and global profiling of post-translational modifications (PTMs) in prokaryotes. With a single command, it provides a standard workflow to validate and refine predicted genetic models and discover diverse PTM events. We demonstrated the utility of GAPP using proteomic data from Helicobacter pylori, one of the major human pathogens that is responsible for many gastric diseases. Our results confirmed 84.9% of the existing predicted H. pylori proteins, identified 20 novel protein-coding genes, and corrected four existing gene models with regard to translation initiation sites. In particular, GAPP revealed a large repertoire of PTMs using the same proteomic data and provided a rich resource that can be used to examine the functions of reversible modifications in this human pathogen. This software is a powerful tool for genome annotation and global discovery of PTMs and is applicable to any sequenced prokaryotic organism; we expect that it will become an integral part of ongoing genome annotation efforts for prokaryotes. GAPP is freely available at https://sourceforge.net/projects/gappproteogenomic/.
Affiliation(s)
- Jia Zhang
- Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
| | - Ming-Kun Yang
- Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
| | - Honghui Zeng
- Wuhan Branch, Supercomputing Center, Chinese Academy of Sciences, China
| | - Feng Ge
- Key Laboratory of Algal Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China.,Wuhan Branch, Supercomputing Center, Chinese Academy of Sciences, China
|
22
|
Making sense of genomes of parasitic worms: Tackling bioinformatic challenges. Biotechnol Adv 2016; 34:663-686. [DOI: 10.1016/j.biotechadv.2016.03.001] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 11/05/2015] [Revised: 02/25/2016] [Accepted: 03/01/2016] [Indexed: 01/25/2023]
|
23
|
Leelananda SP, Kloczkowski A, Jernigan RL. Fold-specific sequence scoring improves protein sequence matching. BMC Bioinformatics 2016; 17:328. [PMID: 27578239 PMCID: PMC5006591 DOI: 10.1186/s12859-016-1198-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 03/21/2016] [Accepted: 08/24/2016] [Indexed: 11/10/2022] Open
Abstract
Background: Sequence matching is extremely important for applications throughout biology, particularly for discovering information such as functional and evolutionary relationships, and also for discriminating between unimportant and disease mutants. At present the functions of a large fraction of genes are unknown; improvements in sequence matching will improve gene annotations. Universal amino acid substitution matrices such as Blosum62 are used to measure sequence similarities and to identify distant homologues, regardless of the structure class. However, such single matrices do not take into account important structural information evident within the different topologies of proteins and treat substitutions within all protein folds identically. Others have suggested that the use of structural information can lead to significant improvements in sequence matching, but this has not yet been very effective. Here we develop novel substitution matrices that include not only general sequence information but also a topology-specific component that is unique for each CATH topology. This novel feature of using a combination of sequence and structure information for each protein topology significantly improves the sequence matching scores for the sequence pairs tested. We have used a novel multi-structure alignment method for each homology level of CATH in order to extract topological information.
Results: We obtain statistically significant improved sequence matching scores for 73% of the alpha helical test cases. On average, 61% of the test cases showed improvements in homology detection when structure information was incorporated into the substitution matrices. On average, z-scores for homology detection are improved by more than 54% for all cases, and some individual cases have z-scores more than twice those obtained using generic matrices. Our topology-specific similarity matrices also outperform other traditional similarity matrices and single-matrix-based structure methods. When the default amino acid substitution matrix in the Psi-blast algorithm is replaced by our structure-based matrices, the structure matching is significantly improved over conventional Psi-blast. It also outperforms results obtained for the corresponding HMM profiles generated for each topology.
Conclusions: We show that by incorporating topology-specific structure information in addition to sequence information into specific amino acid substitution matrices, the sequence matching scores and homology detection are significantly improved. Our topology-specific similarity matrices outperform other traditional similarity matrices and single-matrix-based structure methods, and also show improvement over conventional Psi-blast and HMM-profile-based methods in sequence matching. The results support the discriminatory ability of the new amino acid similarity matrices to distinguish between distant homologs and structurally dissimilar pairs.
Affiliation(s)
- Sumudu P Leelananda
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA.,Laurence H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA.,Present Address: 2120 Newman and Wolfrom Laboratory, The Ohio State University, 100 W 18th Ave, Columbus, OH, 43210, USA.,Present Address: Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, OH, 43205, USA
| | - Andrzej Kloczkowski
- Present Address: Battelle Center for Mathematical Medicine, The Research Institute at Nationwide Children's Hospital, Columbus, OH, 43205, USA.,Present Address: Department of Pediatrics, The Ohio State University College of Medicine, Columbus, OH, 43205, USA
| | - Robert L Jernigan
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA.,Laurence H. Baker Center for Bioinformatics and Biological Statistics, Iowa State University, 112 Office and Lab Building, Ames, IA, 50011-3020, USA
| |
Collapse
|
24
|
Aken BL, Ayling S, Barrell D, Clarke L, Curwen V, Fairley S, Fernandez Banet J, Billis K, García Girón C, Hourlier T, Howe K, Kähäri A, Kokocinski F, Martin FJ, Murphy DN, Nag R, Ruffier M, Schuster M, Tang YA, Vogel JH, White S, Zadissa A, Flicek P, Searle SMJ. The Ensembl gene annotation system. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw093. [PMID: 27337980 PMCID: PMC4919035 DOI: 10.1093/database/baw093] [Citation(s) in RCA: 662] [Impact Index Per Article: 82.8] [Received: 01/11/2016] [Accepted: 05/09/2016] [Indexed: 12/12/2022]
Abstract
The Ensembl gene annotation system has been used to annotate over 70 different vertebrate species across a wide range of genome projects. Furthermore, it generates the automatic alignment-based annotation for the human and mouse GENCODE gene sets. The system is based on the alignment of biological sequences, including cDNAs, proteins and RNA-seq reads, to the target genome in order to construct candidate transcript models. Careful assessment and filtering of these candidate transcripts ultimately leads to the final gene set, which is made available on the Ensembl website. Here, we describe the annotation process in detail. Database URL: http://www.ensembl.org/index.html.
Affiliation(s)
- Bronwen L Aken
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.,Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Sarah Ayling
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.,Present addresses: The Genome Analysis Centre, Norwich Research Park, Norwich NR4 7UH, UK
| | - Daniel Barrell
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.,Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.,Eagle Genomics Ltd, Babraham Research Campus, Cambridge CB22 3AT, UK
| | - Laura Clarke
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.,European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Valery Curwen
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Susan Fairley
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.,European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Julio Fernandez Banet
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.,Pfizer Inc, 10646 Science Center Dr, San Diego, CA 92121, USA
| | - Konstantinos Billis
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.,Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Carlos García Girón
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.,Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Thibaut Hourlier
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.,Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Kevin Howe
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.,European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Andreas Kähäri
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.,Institutionen för cell- och molekylärbiologi, Uppsala University, Husargatan 3, Uppsala 752 37, Sweden
| | - Felix Kokocinski
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Fergal J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.,Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Daniel N Murphy
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.,Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Rishi Nag
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.,Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Magali Ruffier
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.,European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Michael Schuster
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.,CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna A-1090, Austria
| | - Y Amy Tang
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.,European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Jan-Hinnerk Vogel
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.,Genentech Inc, 1 DNA Way, South San Francisco, CA 94080, USA
| | - Simon White
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.,The Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Amonida Zadissa
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK.,European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK.,Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
| | - Stephen M J Searle
- Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SA, UK
|
25
|
Katoh K, Standley DM. A simple method to control over-alignment in the MAFFT multiple sequence alignment program. Bioinformatics 2016; 32:1933-42. [PMID: 27153688 PMCID: PMC4920119 DOI: 10.1093/bioinformatics/btw108] [Citation(s) in RCA: 318] [Impact Index Per Article: 39.8] [Received: 10/05/2015] [Accepted: 02/19/2016] [Indexed: 12/17/2022] Open
Abstract
Motivation: We present a new feature of the MAFFT multiple alignment program for suppressing over-alignment (aligning unrelated segments). Conventional MAFFT is highly sensitive in aligning conserved regions in remote homologs, but the risk of over-alignment has recently been growing, as low-quality or noisy sequences are increasing in protein sequence databases, due, for example, to sequencing errors and difficulty in gene prediction. Results: The proposed method utilizes a variable scoring matrix for different pairs of sequences (or groups) in a single multiple sequence alignment, based on the global similarity of each pair. This method significantly increases the correctly gapped sites in real examples and in simulations under various conditions. Regarding sensitivity, the effect of the proposed method is slightly negative in real protein-based benchmarks, and mostly neutral in simulation-based benchmarks. This approach is based on natural biological reasoning and should be compatible with many methods based on dynamic programming for multiple sequence alignment. Availability and implementation: The new feature is available in MAFFT versions 7.263 and higher. http://mafft.cbrc.jp/alignment/software/ Contact: katoh@ifrec.osaka-u.ac.jp Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Kazutaka Katoh
- Immunology Frontier Research Center, Osaka University, Suita 565-0871, Japan
| | - Daron M Standley
- Immunology Frontier Research Center, Osaka University, Suita 565-0871, Japan
- Institute for Virus Research, Kyoto University, Kyoto 606-8507, Japan
| |
|
26
|
Mouilleron H, Delcourt V, Roucou X. Death of a dogma: eukaryotic mRNAs can code for more than one protein. Nucleic Acids Res 2016; 44:14-23. [PMID: 26578573 PMCID: PMC4705651 DOI: 10.1093/nar/gkv1218] [Citation(s) in RCA: 63] [Impact Index Per Article: 7.9] [Received: 07/08/2015] [Revised: 10/26/2015] [Accepted: 10/28/2015] [Indexed: 12/13/2022] Open
Abstract
mRNAs carry the genetic information that is translated by ribosomes. The traditional view of a mature eukaryotic mRNA is a molecule with three main regions, the 5' UTR, the protein coding open reading frame (ORF) or coding sequence (CDS), and the 3' UTR. This concept assumes that ribosomes translate one ORF only, generally the longest one, and produce one protein. As a result, in the early days of genomics and bioinformatics, one CDS was associated with each protein-coding gene. This fundamental concept of a single CDS is being challenged by increasing experimental evidence indicating that annotated proteins are not the only proteins translated from mRNAs. In particular, mass spectrometry (MS)-based proteomics and ribosome profiling have detected productive translation of alternative open reading frames. In several cases, the alternative and annotated proteins interact. Thus, the expression of two or more proteins translated from the same mRNA may offer a mechanism to ensure the co-expression of proteins which have functional interactions. Translational mechanisms already described in eukaryotic cells indicate that the cellular machinery is able to translate different CDSs from a single viral or cellular mRNA. In addition to summarizing data showing that the protein coding potential of eukaryotic mRNAs has been underestimated, this review aims to challenge the single translated CDS dogma.
Affiliation(s)
- Hélène Mouilleron
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Quebec J1E 4K8, Canada
- PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Quebec, Canada
| | - Vivian Delcourt
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Quebec J1E 4K8, Canada
- PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Quebec, Canada
- Inserm U-1192, Laboratoire de Protéomique, Réponse Inflammatoire, Spectrométrie de Masse (PRISM), Université de Lille 1, Cité Scientifique, 59655 Villeneuve D'Ascq, France
| | - Xavier Roucou
- Department of Biochemistry, Université de Sherbrooke, Sherbrooke, Quebec J1E 4K8, Canada
- PROTEO, Quebec Network for Research on Protein Function, Structure, and Engineering, Quebec, Canada
| |
|
27
|
Höglund JK, Buitenhuis B, Guldbrandtsen B, Lund MS, Sahana G. Genome-wide association study for female fertility in Nordic Red cattle. BMC Genet 2015; 16:110. [PMID: 26369327 PMCID: PMC4570259 DOI: 10.1186/s12863-015-0269-x] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.3] [Received: 05/13/2015] [Accepted: 09/04/2015] [Indexed: 12/28/2022] Open
Abstract
Background: The Nordic Red Cattle (NRC) consists of animals belonging to the Danish Red, Finnish Ayrshire, and Swedish Red breeds. Compared to the Holstein breed, NRC animals are smaller and have a shorter calving interval, lower mastitis incidence and lower rates of stillborn calves; however, they produce less milk, fat and protein. Female fertility is an important trait for the dairy cattle farmer. Selection decisions for female fertility in NRC are based on the female fertility index (FTI). FTI is a composite index including a number of sub-indices describing aspects of female fertility in dairy cattle. The sub-traits of FTI are: number of inseminations per conception (AIS) in cows (C) and heifers (H), the length in days of the interval from calving to first insemination (ICF) in cows, days from first to last insemination (IFL) in cows and heifers, and 56-day non-return rate (NRR) in cows and heifers. The aim of this study was first to identify QTL for FTI by conducting a genome scan for variants associated with the fertility index using imputed whole genome sequence data based on 4207 Nordic Red sires, and subsequently to determine which of the sub-traits were affected by each FTI QTL by testing the QTL for association with the sub-traits. Results: A total of 17,388 significant SNP markers (−log10(P) > 8.25) were detected for FTI, distributed over 25 chromosomes. The chromosomes with the most significant markers were tested for associations with the underlying sub-traits: BTA1 (822 SNP), BTA2 (220 SNP), BTA3 (83 SNP), BTA5 (195 SNP), two regions on BTA6 (503 SNP), BTA13 (980 SNP), BTA15 (23 SNP), BTA20 (345 SNP), and BTA24 (104 SNP). The fertility traits underlying the FTI peak area were: BTA1 (IFLC, IFLH), BTA2 (AISH, IFLH, NRRH), BTA3 (AISH, NRRH), BTA5 (AISC, AISH, IFLH), BTA6 (region 1: AISH, NRRH; region 2: AISH, IFLH), BTA13 (IFLH, IFLC), BTA15 (IFLC, NRRH), and BTA24 (AISH, IFLH). For BTA20 all sub-traits had SNP markers with a −log10(P) > 10. Furthermore, the genes assigned to the most significant SNP for FTI were located on BTA6 (GPR125), BTA13 (ANKRD60), BTA15 (GRAMD1B), and BTA24 (ZNF521). Conclusion: This study 1) shows that many markers within FTI QTL regions were significantly associated with both AISH and IFLH; 2) identified candidate genes for FTI located on BTA6 (GPR125), BTA13 (ANKRD60), BTA15 (GRAMD1B), and BTA24 (ZNF521), although it is not known how the genes/variants identified in this study regulate female fertility, and the majority of these genes are involved in protein binding; and 3) notes that a SNP in a QTL region for FTI on BTA20 was previously validated in three cattle breeds. Electronic supplementary material: The online version of this article (doi:10.1186/s12863-015-0269-x) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Johanna K Höglund
- Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics, Aarhus University, P.O. Box 50, DK-8830, Tjele, Denmark. Present address: Department of Animal Science, Aarhus University, P.O. Box 50, DK-8830, Tjele, Denmark.
| | - Bart Buitenhuis
- Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics Aarhus University, P.O. Box 50, DK 8830, Tjele, Denmark.
| | - Bernt Guldbrandtsen
- Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics Aarhus University, P.O. Box 50, DK 8830, Tjele, Denmark.
| | - Mogens S Lund
- Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics Aarhus University, P.O. Box 50, DK 8830, Tjele, Denmark.
| | - Goutam Sahana
- Center for Quantitative Genetics and Genomics, Department of Molecular Biology and Genetics Aarhus University, P.O. Box 50, DK 8830, Tjele, Denmark.
| |
|
28
|
Trends in genome dynamics among major orders of insects revealed through variations in protein families. BMC Genomics 2015; 16:583. [PMID: 26251035 PMCID: PMC4528696 DOI: 10.1186/s12864-015-1771-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/02/2014] [Accepted: 07/13/2015] [Indexed: 01/22/2023] Open
Abstract
Background: Insects belong to a class that accounts for the majority of animals on earth. With over one million identified species, insects display a huge diversity and occupy extreme environments. At present, there are dozens of fully sequenced insect genomes that cover a range of habitats, social behavior and morphologies. In view of such a diverse collection of genomes, revealing evolutionary trends and charting functional relationships of proteins remain challenging. Results: We analyzed the relatedness of 17 complete proteomes representative of insects, including louse, bee, beetle, ants, flies and mosquitoes, as well as an out-group from the crustaceans. The analyzed proteomes mostly represented the orders of Hymenoptera and Diptera. The 287,405 protein sequences from the 18 proteomes were automatically clustered into 20,933 families, including 799 singletons. A comprehensive analysis based on statistical considerations identified the families that were significantly expanded or reduced in any of the studied organisms. Among all the tested species, ants are characterized by an exceptionally high rate of family gain and loss. By assigning annotations to hundreds of species-specific families, the functional diversity among species and between the major clades (Diptera and Hymenoptera) is revealed. We found that many species-specific families are associated with receptor signaling, stress-related functions and proteases. The highest variability among insects is associated with the function of transposition and nucleic acids processes (collectively coined TNAP). Specifically, the wasp and ants have an order of magnitude more TNAP families and proteins relative to species that belong to Diptera (mosquitoes and flies). Conclusions: An unsupervised clustering methodology combined with a comparative functional analysis unveiled proteomic signatures in the major clades of winged insects. We propose that the expansion of TNAP families in Hymenoptera potentially contributes to the accelerated genome dynamics that characterize the wasp and ants. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-015-1771-2) contains supplementary material, which is available to authorized users.
|
29
|
Carnielli CM, Winck FV, Paes Leme AF. Functional annotation and biological interpretation of proteomics data. Biochimica et Biophysica Acta - Proteins and Proteomics 2015; 1854:46-54. [DOI: 10.1016/j.bbapap.2014.10.019] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 09/11/2014] [Revised: 10/07/2014] [Accepted: 10/21/2014] [Indexed: 12/22/2022]
|
30
|
Prabakaran S, Hemberg M, Chauhan R, Winter D, Tweedie-Cullen RY, Dittrich C, Hong E, Gunawardena J, Steen H, Kreiman G, Steen JA. Quantitative profiling of peptides from RNAs classified as noncoding. Nat Commun 2014; 5:5429. [PMID: 25403355 PMCID: PMC4416701 DOI: 10.1038/ncomms6429] [Citation(s) in RCA: 47] [Impact Index Per Article: 4.7] [Received: 05/28/2014] [Accepted: 09/30/2014] [Indexed: 01/28/2023] Open
Abstract
Only a small fraction of the mammalian genome codes for messenger RNAs destined to be translated into proteins, and it is generally assumed that a large portion of transcribed sequences--including introns and several classes of noncoding RNAs (ncRNAs)--do not give rise to peptide products. A systematic examination of translation and physiological regulation of ncRNAs has not been conducted. Here we use computational methods to identify the products of non-canonical translation in mouse neurons by analysing unannotated transcripts in combination with proteomic data. This study supports the existence of non-canonical translation products from both intragenic and extragenic genomic regions, including peptides derived from antisense transcripts and introns. Moreover, the studied novel translation products exhibit temporal regulation similar to that of proteins known to be involved in neuronal activity processes. These observations highlight a potentially large and complex set of biologically regulated translational events from transcripts formerly thought to lack coding potential.
Affiliation(s)
- Sudhakaran Prabakaran
- Proteomics Center, Boston Children’s Hospital, Boston, MA 02115, USA
- Department of Systems Biology, Harvard Medical School, Boston MA 02115, USA
| | - Martin Hemberg
- Department of Ophthalmology, Boston Children’s Hospital, Boston, MA 02115, USA
| | - Ruchi Chauhan
- Proteomics Center, Boston Children’s Hospital, Boston, MA 02115, USA
- F.M. Kirby Neurobiology Center, Boston Children’s Hospital, Boston, MA 02115, USA
| | - Dominic Winter
- Proteomics Center, Boston Children’s Hospital, Boston, MA 02115, USA
| | - Ry Y. Tweedie-Cullen
- F.M. Kirby Neurobiology Center, Boston Children’s Hospital, Boston, MA 02115, USA
| | - Christian Dittrich
- F.M. Kirby Neurobiology Center, Boston Children’s Hospital, Boston, MA 02115, USA
| | - Elizabeth Hong
- F.M. Kirby Neurobiology Center, Boston Children’s Hospital, Boston, MA 02115, USA
| | - Jeremy Gunawardena
- Department of Systems Biology, Harvard Medical School, Boston MA 02115, USA
| | - Hanno Steen
- Proteomics Center, Boston Children’s Hospital, Boston, MA 02115, USA
- Department of Pathology, Boston Children’s Hospital and Harvard Medical School, Boston, MA 02115, USA
| | - Gabriel Kreiman
- Department of Ophthalmology, Boston Children’s Hospital, Boston, MA 02115, USA
- F.M. Kirby Neurobiology Center, Boston Children’s Hospital, Boston, MA 02115, USA
| | - Judith A. Steen
- Proteomics Center, Boston Children’s Hospital, Boston, MA 02115, USA
- F.M. Kirby Neurobiology Center, Boston Children’s Hospital, Boston, MA 02115, USA
| |
|
31
|
DFA7, a new method to distinguish between intron-containing and intronless genes. PLoS One 2014; 9:e101363. [PMID: 25036549 PMCID: PMC4103774 DOI: 10.1371/journal.pone.0101363] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 02/02/2014] [Accepted: 06/05/2014] [Indexed: 11/23/2022] Open
Abstract
Intron-containing and intronless genes have different biological properties and statistical characteristics. Here we propose a new computational method to distinguish between intron-containing and intronless gene sequences. Seven feature parameters based on detrended fluctuation analysis (DFA) are used together, so that a 7-dimensional feature vector can be computed for any given gene sequence to be discriminated. Furthermore, a support vector machine (SVM) classifier with a Gaussian radial basis kernel function is applied to this feature space to classify the genes into intron-containing and intronless. We investigate the performance of the proposed method in comparison with other state-of-the-art algorithms on biological datasets. The experimental results show that our new method significantly improves the accuracy over those existing techniques.
|
32
|
Improving mRNA 5' coding sequence determination in the mouse genome. Mamm Genome 2014; 25:149-59. [PMID: 24504701 DOI: 10.1007/s00335-013-9498-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 09/26/2013] [Accepted: 12/09/2013] [Indexed: 10/25/2022]
Abstract
The incomplete determination of the mRNA 5' end sequence may lead to the incorrect assignment of the first AUG codon and to errors in the prediction of the encoded protein product. Due to the significance of the mouse as a model organism in biomedical research, we performed a systematic identification of coding regions at the 5' end of all known mouse mRNAs, using an automated expressed sequence tag (EST)-based approach which we have previously described. By parsing almost 4 million BLAT alignments we found 351 mouse loci, out of 20,221 analyzed, in which an extension of the mRNA 5' coding region was identified. Proof-of-concept confirmation was obtained by in vitro cloning and sequencing for Apc2 and Mknk2 cDNAs. We also generated a list of 16,330 mouse mRNAs where the presence of an in-frame stop codon upstream of the known start codon indicates that the coding sequence is complete at the 5' end in its current form. Systematic searches in the main mouse genome databases and genome browsers showed that 82% of our results are original and have not been identified by their annotation pipelines. Moreover, the same information is not easily derivable from RNA-Seq data, due to short sequence length and the laboriousness of building full-length transcript structures. In conclusion, our results improve the determination of full-length 5' coding sequences and might be useful for reducing errors when studying mouse gene structure and function in biomedical research.
|
33
|
Combining in silico prediction and ribosome profiling in a genome-wide search for novel putatively coding sORFs. BMC Genomics 2013; 14:648. [PMID: 24059539 PMCID: PMC3852105 DOI: 10.1186/1471-2164-14-648] [Citation(s) in RCA: 72] [Impact Index Per Article: 6.5] [Received: 01/23/2013] [Accepted: 09/13/2013] [Indexed: 11/23/2022] Open
Abstract
Background: It was long assumed that proteins are at least 100 amino acids (AAs) long. Moreover, the detection of short translation products (e.g. coded from small Open Reading Frames, sORFs) is very difficult as the short length makes it hard to distinguish true coding ORFs from ORFs occurring by chance. Nevertheless, over the past few years many such non-canonical genes (with ORFs < 100 AAs) have been discovered in different organisms like Arabidopsis thaliana, Saccharomyces cerevisiae, and Drosophila melanogaster. Thanks to advances in sequencing, bioinformatics and computing power, it is now possible to scan the genome with unprecedented scrutiny, for example in a search for this type of small ORF. Results: Using bioinformatics methods, we performed a systematic search for putatively functional sORFs in the Mus musculus genome. A genome-wide scan detected all sORFs, which were subsequently analyzed for their coding potential, based on evolutionary conservation at the AA level, and ranked using a Support Vector Machine (SVM) learning model. The ranked sORFs were finally overlapped with ribosome profiling data, hinting at sORF translation. All candidates were visually inspected using an in-house developed genome browser. In this way dozens of highly conserved sORFs targeted by ribosomes were identified in the mouse genome, putatively encoding micropeptides. Conclusion: Our combined genome-wide approach leads to the prediction of a comprehensive but manageable set of putatively coding sORFs, a very important first step towards the identification of a new class of bioactive peptides, called micropeptides.
|
34
|
Krug K, Carpy A, Behrends G, Matic K, Soares NC, Macek B. Deep coverage of the Escherichia coli proteome enables the assessment of false discovery rates in simple proteogenomic experiments. Mol Cell Proteomics 2013; 12:3420-30. [PMID: 23908556 DOI: 10.1074/mcp.m113.029165] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Indexed: 11/06/2022] Open
Abstract
Recent advances in mass spectrometry (MS) have led to increased applications of shotgun proteomics to the refinement of genome annotation. The typical "proteo-genomic" workflows rely on the mapping of peptide MS/MS spectra onto databases derived via six-frame translation of the genome sequence. These databases contain a large proportion of spurious protein sequences which make the statistical confidence of the resulting peptide spectrum matches difficult to assess. Here we performed a comprehensive analysis of the Escherichia coli proteome using LTQ-Orbitrap MS and mapped the corresponding MS/MS spectra onto a six-frame translation of the E. coli genome. We hypothesized that the protein-coding part of the E. coli genome approaches complete annotation and that the majority of six frame-specific (novel) peptide spectrum matches can be considered as false positive identifications. We confirm our hypothesis by showing that the posterior error probability distribution of novel hits is almost identical to that of reversed (decoy) hits; this enables us to estimate the sensitivity, specificity, accuracy, and false discovery rate in a typical bacterial proteo-genomic dataset. We use two complementary computational frameworks for processing and statistical assessment of MS/MS data: MaxQuant and Trans-Proteomic Pipeline. We show that MaxQuant achieves a more sensitive six-frame database search with an acceptable false discovery rate and is therefore well suited for global genome reannotation applications, whereas the Trans-Proteomic Pipeline achieves higher specificity and is well suited for high-confidence validation. The use of a small and well-annotated bacterial genome enables us to address genome coverage achieved in state-of-the-art bacterial proteomics: identified peptide sequences mapped to all expressed E. coli proteins but covered 31.7% of the protein-coding genome sequence. Our results show that false discovery rates can be substantially underestimated even in "simple" proteo-genomic experiments obtained by means of high-accuracy MS and point to the necessity of further improvements concerning the coverage of peptide sequences by MS-based methods.
Affiliation(s)
- Karsten Krug
- Proteome Center Tuebingen, University of Tuebingen, 72076 Tuebingen, Germany
| |
|
35
|
Abstract
By its very nature, genomics produces large, high-dimensional datasets that are well suited to analysis by machine learning approaches. Here, we explain some key aspects of machine learning that make it useful for genome annotation, with illustrative examples from ENCODE.
Affiliation(s)
- Kevin Y Yip
- Program in Computational Biology and Bioinformatics, Yale University, 260/266 Whitney Avenue, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, 260/266 Whitney Avenue, New Haven, CT 06520, USA
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- Hong Kong Bioinformatics Centre, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
- CUHK-BGI Innovation Institute of Trans-omics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong
| | - Chao Cheng
- Department of Genetics, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Institute for Quantitative Biomedical Sciences, Norris Cotton Cancer Center, Geisel School of Medicine at Dartmouth, Lebanon, NH 03766, USA
| | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, 260/266 Whitney Avenue, New Haven, CT 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, 260/266 Whitney Avenue, New Haven, CT 06520, USA
- Department of Computer Science, Yale University, 51 Prospect Street, New Haven, CT 06511, USA
| |
|
36
|
Wijaya E, Frith MC, Horton P, Asai K. Finding protein-coding genes through human polymorphisms. PLoS One 2013; 8:e54210. [PMID: 23349826 PMCID: PMC3551959 DOI: 10.1371/journal.pone.0054210] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/17/2012] [Accepted: 12/10/2012] [Indexed: 11/29/2022] Open
Abstract
Human gene catalogs are fundamental to the study of human biology and medicine. But they are all based on open reading frames (ORFs) in a reference genome sequence (with allowance for introns). Individual genomes, however, are polymorphic: their sequences are not identical. There has been much research on how polymorphism affects previously-identified genes, but no research has been done on how it affects gene identification itself. We computationally predict protein-coding genes in a straightforward manner, by finding long ORFs in mRNA sequences aligned to the reference genome. We systematically test the effect of known polymorphisms with this procedure. Polymorphisms can not only disrupt ORFs, they can also create long ORFs that do not exist in the reference sequence. We found 5,737 putative protein-coding genes that do not exist in the reference, whose protein-coding status is supported by homology to known proteins. On average 10% of these genes are located in the genomic regions devoid of annotated genes in 12 other catalogs. Our statistical analysis showed that these ORFs are unlikely to occur by chance.
Affiliation(s)
- Edward Wijaya
- Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan.
| |
|
37
|
Maiolica A, Jünger MA, Ezkurdia I, Aebersold R. Targeted proteome investigation via selected reaction monitoring mass spectrometry. J Proteomics 2012; 75:3495-513. [PMID: 22579752 DOI: 10.1016/j.jprot.2012.04.048] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.3] [Received: 11/21/2011] [Revised: 04/27/2012] [Accepted: 04/29/2012] [Indexed: 12/20/2022]
Abstract
Due to the enormous complexity of proteomes which constitute the entirety of protein species expressed by a certain cell or tissue, proteome-wide studies performed in discovery mode are still limited in their ability to reproducibly identify and quantify all proteins present in complex biological samples. Therefore, the targeted analysis of informative subsets of the proteome has been beneficial to generate reproducible data sets across multiple samples. Here we review the repertoire of antibody- and mass spectrometry (MS) -based analytical tools which is currently available for the directed analysis of predefined sets of proteins. The topics of emphasis for this review are Selected Reaction Monitoring (SRM) mass spectrometry, emerging tools to control error rates in targeted proteomic experiments, and some representative examples of applications. The ability to cost- and time-efficiently generate specific and quantitative assays for large numbers of proteins and posttranslational modifications has the potential to greatly expand the range of targeted proteomic coverage in biological studies. This article is part of a Special Section entitled: Understanding genome regulation and genetic diversity by mass spectrometry.
Affiliation(s)
- Alessio Maiolica
- Department of Biology, Institute of Molecular Systems Biology, Zurich, Switzerland
| |
|
38
|
|
39
|
Ladoukakis E, Pereira V, Magny EG, Eyre-Walker A, Couso JP. Hundreds of putatively functional small open reading frames in Drosophila. Genome Biol 2011; 12:R118. [PMID: 22118156 PMCID: PMC3334604 DOI: 10.1186/gb-2011-12-11-r118] [Citation(s) in RCA: 120] [Impact Index Per Article: 9.2] [Received: 05/25/2011] [Revised: 11/04/2011] [Accepted: 11/25/2011] [Indexed: 12/22/2022] Open
Abstract
Background: The relationship between DNA sequence and encoded information is still an unsolved puzzle. The number of protein-coding genes in higher eukaryotes identified by genome projects is lower than was expected, while a considerable amount of putatively non-coding transcription has been detected. Functional small open reading frames (smORFs) are known to exist in several organisms. However, coding sequence detection methods are biased against detecting such very short open reading frames. Thus, a substantial number of non-canonical coding regions encoding short peptides might await characterization. Results: Using bio-informatics methods, we have searched for smORFs of less than 100 amino acids in the putatively non-coding euchromatic DNA of Drosophila melanogaster, and initially identified nearly 600,000 of them. We have studied the pattern of conservation of these smORFs as coding entities between D. melanogaster and Drosophila pseudoobscura, their presence in syntenic and in transcribed regions of the genome, and their ratio of conservative versus non-conservative nucleotide changes. For negative controls, we compared the results with those obtained using random short sequences, while a positive control was provided by smORFs validated by proteomics data. Conclusions: The combination of these analyses led us to postulate the existence of at least 401 functional smORFs in Drosophila, with the possibility that as many as 4,561 such functional smORFs may exist.
|
40
|
Hawkins T, Kihara D. Function prediction of uncharacterized proteins. J Bioinform Comput Biol 2011; 5:1-30. [PMID: 17477489 DOI: 10.1142/s0219720007002503] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.8] [Received: 06/20/2006] [Revised: 09/23/2006] [Accepted: 10/10/2006] [Indexed: 11/18/2022]
Abstract
Function prediction of uncharacterized protein sequences generated by genome projects has emerged as an important focus for computational biology. We have categorized several approaches beyond traditional sequence similarity that utilize the overwhelmingly large amounts of available data for computational function prediction, including structure-, association (genomic context)-, interaction (cellular context)-, process (metabolic context)-, and proteomics-experiment-based methods. Because they incorporate structural and experimental data that is not used in sequence-based methods, they can provide additional accuracy and reliability to protein function prediction. Here, first we review the definition of protein function. Then the recent developments of these methods are introduced with special focus on the type of predictions that can be made. The need for further development of comprehensive systems biology techniques that can utilize the ever-increasing data presented by the genomics and proteomics communities is emphasized. For the readers' convenience, tables of useful online resources in each category are included. The role of computational scientists in the near future of biological research and the interplay between computational and experimental biology are also addressed.
Affiliation(s)
- Troy Hawkins
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA.
| |
|
41
|
Ingolia NT, Lareau LF, Weissman JS. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell 2011; 147:789-802. [PMID: 22056041 DOI: 10.1016/j.cell.2011.10.002] [Citation(s) in RCA: 1556] [Impact Index Per Article: 119.7] [Received: 02/26/2011] [Revised: 06/19/2011] [Accepted: 09/13/2011] [Indexed: 12/14/2022]
Abstract
The ability to sequence genomes has far outstripped approaches for deciphering the information they encode. Here we present a suite of techniques, based on ribosome profiling (the deep sequencing of ribosome-protected mRNA fragments), to provide genome-wide maps of protein synthesis as well as a pulse-chase strategy for determining rates of translation elongation. We exploit the propensity of harringtonine to cause ribosomes to accumulate at sites of translation initiation together with a machine learning algorithm to define protein products systematically. Analysis of translation in mouse embryonic stem cells reveals thousands of strong pause sites and unannotated translation products. These include amino-terminal extensions and truncations and upstream open reading frames with regulatory potential, initiated at both AUG and non-AUG codons, whose translation changes after differentiation. We also define a class of short, polycistronic ribosome-associated coding RNAs (sprcRNAs) that encode small proteins. Our studies reveal an unanticipated complexity to mammalian proteomes.
Affiliation(s)
- Nicholas T Ingolia
- Howard Hughes Medical Institute, Department of Cellular and Molecular Pharmacology, University of California, San Francisco, San Francisco, CA 94158, USA.
|
42
|
Integrative analysis of transcriptome and genome indicates two potential genomic islands are associated with pathogenesis of Mycobacterium tuberculosis. Gene 2011; 489:21-9. [PMID: 21924330 DOI: 10.1016/j.gene.2011.08.019] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2011] [Revised: 07/20/2011] [Accepted: 08/26/2011] [Indexed: 11/20/2022]
Abstract
Mycobacterium tuberculosis (M.tb) is a successful human pathogen and widely prevalent throughout the world. Genomic islands (GIs) are thought to be related to pathogenicity. In this study, we predicted two potential genomic islands in M.tb genome, respectively named as GI-1 and GI-2. It is indicated that the genes belong to PE_PGRS family in GI-1 and genes involved in sulfolipid-1 (SL-1) synthesis in GI-2 are strongly associated with M.tb pathogenesis. Sequence analysis revealed that the five PGRS genes are more polymorphic than other PGRS members in full virulence M.tb complex strains at significance level 0.01 but not in attenuated strains. Expression analysis of microarrays collected from literatures displayed that GI-1 genes, especially Rv3508 might be correlated with the response to the inhibition of aerobic respiration. Microarray analysis also showed that SL-1 cluster genes are drastically down-expressed in attenuated strains relative to full virulence strains. We speculated that the effect of SL-1 on M.tb pathogenicity could be associated with long-term survival and persistence establishment during infection. Additionally, the gene Rv3508 in GI-1 was under positive selection. Rv3508 may involve the response of M.tb to the inhibition of aerobic respiration by low oxygen or drug PA-824, and it may be a common feature of genes in GI-1. These findings may provide some novel insights into M.tb physiology and pathogenesis.
|
43
|
Nesbitt MJ, Moerman DG, Chen N. Identifying novel genes in C. elegans using SAGE tags. BMC Mol Biol 2010; 11:96. [PMID: 21143975 PMCID: PMC3017025 DOI: 10.1186/1471-2199-11-96] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2010] [Accepted: 12/10/2010] [Indexed: 11/10/2022] Open
Abstract
Background: Despite extensive efforts devoted to predicting protein-coding genes in genome sequences, many bona fide genes have not been found and many existing gene models are not accurate in all sequenced eukaryote genomes. This situation is partly explained by the fact that gene prediction programs have been developed based on our incomplete understanding of gene feature information such as splicing and promoter characteristics. Additionally, full-length cDNAs of many genes and their isoforms are hard to obtain due to their low level or rare expression. In order to obtain full-length sequences of all protein-coding genes, alternative approaches are required. Results: In this project, we have developed a method of reconstructing full-length cDNA sequences based on short expressed sequence tags which is called sequence tag-based amplification of cDNA ends (STACE). Expressed tags are used as anchors for retrieving full-length transcripts in two rounds of PCR amplification. We have demonstrated the application of STACE in reconstructing full-length cDNA sequences using expressed tags mined in an array of serial analysis of gene expression (SAGE) of C. elegans cDNA libraries. We have successfully applied STACE to recover sequence information for 12 genes, for two of which we found isoforms. STACE was used to successfully recover full-length cDNA sequences for seven of these genes. Conclusions: The STACE method can be used to effectively reconstruct full-length cDNA sequences of genes that are under-represented in cDNA sequencing projects and have been missed by existing gene prediction methods, but their existence has been suggested by short sequence tags such as SAGE tags.
Affiliation(s)
- Matthew J Nesbitt
- Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, Canada
|
44
|
de Souza GA, Arntzen MØ, Fortuin S, Schürch AC, Målen H, McEvoy CRE, van Soolingen D, Thiede B, Warren RM, Wiker HG. Proteogenomic analysis of polymorphisms and gene annotation divergences in prokaryotes using a clustered mass spectrometry-friendly database. Mol Cell Proteomics 2010; 10:M110.002527. [PMID: 21030493 PMCID: PMC3013451 DOI: 10.1074/mcp.m110.002527] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
Abstract
Precise annotation of genes or open reading frames is still a difficult task that results in divergence even for data generated from the same genomic sequence. This has an impact in further proteomic studies, and also compromises the characterization of clinical isolates with many specific genetic variations that may not be represented in the selected database. We recently developed software called multistrain mass spectrometry prokaryotic database builder (MSMSpdbb) that can merge protein databases from several sources and be applied on any prokaryotic organism, in a proteomic-friendly approach. We generated a database for the Mycobacterium tuberculosis complex (using three strains of Mycobacterium bovis and five of M. tuberculosis), and analyzed data collected from two laboratory strains and two clinical isolates of M. tuberculosis. We identified 2561 proteins, of which 24 were present in M. tuberculosis H37Rv samples, but not annotated in the M. tuberculosis H37Rv genome. We were also able to identify 280 nonsynonymous single amino acid polymorphisms and confirm 367 translational start sites. As a proof of concept we applied the database to whole-genome DNA sequencing data of one of the clinical isolates, which allowed the validation of 116 predicted single amino acid polymorphisms and the annotation of 131 N-terminal start sites. Moreover we identified regions not present in the original M. tuberculosis H37Rv sequence, indicating strain divergence or errors in the reference sequence. In conclusion, we demonstrated the potential of using a merged database to better characterize laboratory or clinical bacterial strains.
Affiliation(s)
- Gustavo A de Souza
- The Gade Institute, Section for Microbiology and Immunology, University of Bergen, N-5021 Bergen, Norway
|
45
|
Risueño A, Fontanillo C, Dinger ME, De Las Rivas J. GATExplorer: genomic and transcriptomic explorer; mapping expression probes to gene loci, transcripts, exons and ncRNAs. BMC Bioinformatics 2010; 11:221. [PMID: 20429936 PMCID: PMC2875241 DOI: 10.1186/1471-2105-11-221] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2009] [Accepted: 04/29/2010] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Genome-wide expression studies have developed exponentially in recent years as a result of extensive use of microarray technology. However, expression signals are typically calculated using the assignment of "probesets" to genes, without addressing the problem of "gene" definition or proper consideration of the location of the measuring probes in the context of the currently known genomes and transcriptomes. Moreover, as our knowledge of metazoan genomes improves, the number of both protein-coding and noncoding genes, as well as their associated isoforms, continues to increase. Consequently, there is a need for new databases that combine genomic and transcriptomic information and provide updated mapping of expression probes to current genomic annotations. RESULTS: GATExplorer (Genomic and Transcriptomic Explorer) is a database and web platform that integrates a gene loci browser with nucleotide level mappings of oligo probes from expression microarrays. It allows interactive exploration of gene loci, transcripts and exons of human, mouse and rat genomes, and shows the specific location of all mappable Affymetrix microarray probes and their respective expression levels in a broad set of biological samples. The web site allows visualization of probes in their genomic context together with any associated protein-coding or noncoding transcripts. In the case of all-exon arrays, this provides a means by which the expression of the individual exons within a gene can be compared, thereby facilitating the identification and analysis of alternatively spliced exons. The application integrates data from four major source databases: Ensembl, RNAdb, Affymetrix and GeneAtlas; and it provides the users with a series of files and packages (R CDFs) to analyze particular query expression datasets. The maps cover both the widely used Affymetrix GeneChip microarrays based on 3' expression (e.g. human HG U133 series) and the all-exon expression microarrays (Gene 1.0 and Exon 1.0). CONCLUSIONS: GATExplorer is an integrated database that combines genomic/transcriptomic visualization with nucleotide-level probe mapping. By considering expression at the nucleotide level rather than the gene level, it shows that the arrays detect expression signals from entities that most researchers do not contemplate or discriminate. This approach provides the means to undertake a higher resolution analysis of microarray data and potentially extract considerably more detailed and biologically accurate information from existing and future microarray experiments.
Affiliation(s)
- Alberto Risueño
- Bioinformatics and Functional Genomics Research Group, Cancer Research Center (CiC-IBMCC, CSIC/USAL), Salamanca, Spain
|
46
|
de Souza GA, Søfteland T, Koehler CJ, Thiede B, Wiker HG. Validating divergent ORF annotation of the Mycobacterium leprae genome through a full translation data set and peptide identification by tandem mass spectrometry. Proteomics 2009; 9:3233-43. [PMID: 19562797 DOI: 10.1002/pmic.200800955] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023]
Abstract
Mycobacterium leprae has undergone extensive degenerative evolution, with a large number of pseudogenes. It is also the organism with the greatest divergence between gene annotations from independent institutes. Therefore, M. leprae is a good model to verify the currently predicted coding sequence regions between different annotations, to identify new ones and to investigate the expression of pseudogenes. We submitted a total extract of the bacteria isolated from Armadillo to Gel-LC-MS/MS using a linear quadrupole ion trap-Orbitrap mass spectrometer. Spectra were analyzed using the Leproma (1614 genes and 1133 pseudogenes) and TIGR (5446 genes) databases and a database containing the full genome translation. We identified a total of 1046 proteins, including five proteins encoded by previously predicted pseudogenes, which upon closer inspection appeared to be proper genes. Only 11 of the additional annotations by TIGR were verified. We also identified six tryptic peptides from five proteins from regions not considered to be coding sequences, in addition to peptides from two unannotated gene candidates that overlap with other genes. Our data show that the Leproma annotation of M. leprae is quite accurate, and there were no peptide observations corresponding to true pseudogenes, except for a new gene candidate, overlapping with an essential enolase on the complementary strand.
Affiliation(s)
- Gustavo A de Souza
- The Gade Institute, Section for Microbiology and Immunology, University of Bergen, Norway
|
47
|
Abstract
Proteolytic enzymes play an essential role in many biological and pathological processes. Taking advantage of the recent availability of several mammalian genome sequences and by using a set of computational approaches, we have annotated and compared the degradome or complete repertoire of proteases of different mammalian species including human, mouse, rat, and chimpanzee. These studies have allowed us to expand our knowledge about the complexity, evolution, and diversity of proteolytic systems, which represent about 2% of the studied genomes. In this chapter, we review the genomic and computational methodologies used in this degradomic analysis and summarize the main findings derived from comparison of mammalian degradomes.
Affiliation(s)
- Gonzalo R Ordóñez
- Departamento de Bioquímica y Biología Molecular, Facultad de Medicina, Instituto Universitario de Oncología, Universidad de Oviedo, Oviedo, Spain
|
48
|
Abstract
In mammalian cells, apoptotic and anti-apoptotic pathways may be investigated using a variety of biochemical, molecular, and genetic approaches. Retrovirus mediated genetic screens have proven a powerful tool in mapping out the network of players in a number of signaling pathways. We have developed the ERM (for enhanced retroviral mutagen) mutagenesis approach to identify novel players in the growth factor dependent survival pathways. ERM has been shown to be efficient and amenable to genome wide genetic screens in mammalian cells without the need of cDNA library construction. The advantages of the ERM method include regulatable expression, flexible design, and efficiency.
Affiliation(s)
- Dan Liu
- Verna and Marrs McLean Department of Biochemistry and Molecular Biology, Baylor College of Medicine, Houston, TX, USA
|
49
|
Abstract
Genetic screens have been proven powerful for the identification of components of various signaling pathways. For mammalian cells, methods for genetic screens are limited. We have developed the ERM (enhanced retroviral mutagen) mutagenesis approach that has been shown to be efficient and amenable to genomewide genetic screens in mammalian cells without the need of cDNA library construction. The ERM method offers several advantages, including conditional gene expression and the flexibility to tag endogenous genes with different epitope-tag and marker sequences. This chapter will discuss general design, procedures, and applications of the ERM strategy.
|
50
|
de Souza GA, Målen H, Søfteland T, Saelensminde G, Prasad S, Jonassen I, Wiker HG. High accuracy mass spectrometry analysis as a tool to verify and improve gene annotation using Mycobacterium tuberculosis as an example. BMC Genomics 2008; 9:316. [PMID: 18597682 PMCID: PMC2483986 DOI: 10.1186/1471-2164-9-316] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2008] [Accepted: 07/02/2008] [Indexed: 01/23/2023] Open
Abstract
Background: While the genomic annotations of diverse lineages of the Mycobacterium tuberculosis complex are available, divergences between gene prediction methods are still a challenge for unbiased protein dataset generation. M. tuberculosis gene annotation is an example, where the most used datasets from two independent institutions (Sanger Institute and Institute of Genomic Research-TIGR) differ up to 12% in the number of annotated open reading frames, and 46% of the genes contained in both annotations have different start codons. Such differences emphasize the importance of the identification of the sequence of protein products to validate each gene annotation including its sequence coding area. Results: With this objective, we submitted a culture filtrate sample from M. tuberculosis to a high-accuracy LTQ-Orbitrap mass spectrometer analysis and applied refined N-terminal prediction to perform comparison of two gene annotations. From a total of 449 proteins identified from the MS data, we validated 35 tryptic peptides that were specific to one of the two datasets, representing 24 different proteins. From those, 5 proteins were only annotated in the Sanger database. In the remaining proteins, the observed differences were due to differences in annotation of transcriptional start sites. Conclusion: Our results indicate that, even in a less complex sample likely to represent only 10% of the bacterial proteome, we were still able to detect major differences between different gene annotation approaches. This gives hope that high-throughput proteomics techniques can be used to improve and validate gene annotations, and in particular for verification of high-throughput, automatic gene annotations.
Affiliation(s)
- Gustavo A de Souza
- Section for Microbiology and Immunology, The Gade Institute, University of Bergen, Bergen, Norway.
|