1
Yadalam PK, Anegundi RV, Saravanan M, Meles HN, Heboyan A. Deep Learning Prediction of Inflammatory Inducing Protein Coding mRNA in P. gingivalis Released Outer Membrane Vesicles. Biomed Eng Comput Biol 2024; 15:11795972241277081. [PMID: 39221175 PMCID: PMC11365027 DOI: 10.1177/11795972241277081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Accepted: 08/06/2024] [Indexed: 09/04/2024]
Abstract
Aim This in silico study uses deep learning algorithms to predict protein-coding P. gingivalis mRNA sequences. Material and methods The GEO2R tool was applied to the NCBI GEO dataset GSE218606 to identify the most differentially expressed mRNAs in P. gingivalis outer membrane vesicles (OMVs). GeneMANIA was used to analyze the differentially expressed gene networks. Transcriptomics data were collected and labeled as P. gingivalis protein-coding mRNA sequence, pseudogene, lincRNA, or bidirectional promoter lincRNA. Orange, a machine learning tool, was used to analyze and predict the data after preprocessing. The data were partitioned into training and testing sets and modeled with Naïve Bayes, neural networks, and gradient boosting. Cross-validation, model accuracy, and ROC curves were evaluated after model validation. Results Three models, Neural Networks, Naive Bayes, and Gradient Boosting, were evaluated using metrics such as Area Under the Curve (AUC), Classification Accuracy (CA), F1 Score, Precision, Recall, and Specificity. Gradient Boosting achieved a balanced performance (AUC: 0.72, CA: 0.41, F1: 0.32) compared to Neural Networks (AUC: 0.721, CA: 0.391, F1: 0.314) and Naive Bayes (AUC: 0.701, CA: 0.172, F1: 0.114). While statistical tests revealed no significant differences between the models, Gradient Boosting exhibited a more balanced precision-recall relationship. Conclusion In silico analysis using machine learning techniques predicted protein-coding mRNA sequences within Porphyromonas gingivalis OMVs. Gradient Boosting outperformed the other models (Neural Networks, Naive Bayes) by achieving a balanced performance across metrics such as AUC, classification accuracy, and precision-recall, suggesting its potential as a reliable tool for protein-coding mRNA prediction in P. gingivalis OMVs.
Affiliation(s)
- Pradeep Kumar Yadalam
- Department of Periodontics, Saveetha Dental College, Saveetha Institute of Medical and Technical Sciences (SIMATS), Saveetha University, Chennai, Tamil Nadu, India
- Raghavendra Vamsi Anegundi
- Department of Periodontics, Saveetha Dental College, Saveetha Institute of Medical and Technical Sciences (SIMATS), Saveetha University, Chennai, Tamil Nadu, India
- Muthupandian Saravanan
- AMR and Nanotherapeutics Lab, Department of Pharmacology, Saveetha Dental college and Hospitals, Saveetha Institute of Medical and Technical Sciences, Chennai, Tamilnadu, India
- Hadush Negash Meles
- Unit of Medical Microbiology, Department of Medical Laboratory Sciences, College of Medicine and Health Science, Adigrat University, Adigrat, Ethiopia
- Artak Heboyan
- Department of Prosthodontics, Faculty of Stomatology, Yerevan State Medical University after Mkhitar Heratsi, Yerevan, Armenia
2
Chen Z, Ain NU, Zhao Q, Zhang X. From tradition to innovation: conventional and deep learning frameworks in genome annotation. Brief Bioinform 2024; 25:bbae138. [PMID: 38581418 PMCID: PMC10998533 DOI: 10.1093/bib/bbae138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 03/08/2024] [Accepted: 03/10/2024] [Indexed: 04/08/2024]
Abstract
Following the milestone success of the Human Genome Project, the 'Encyclopedia of DNA Elements (ENCODE)' initiative was launched in 2003 to unearth information about the numerous functional elements within the genome. This endeavor coincided with the emergence of numerous novel technologies, accompanied by the provision of vast amounts of whole-genome sequences and high-throughput data such as ChIP-Seq and RNA-Seq. Extracting biologically meaningful information from these massive datasets has become a critical aspect of many recent studies, particularly in annotating and predicting the functions of unknown genes. The core idea behind genome annotation is to identify genes and various functional elements within the genome sequence and infer their biological functions. Traditional wet-lab experimental methods still rely on extensive efforts for functional verification. However, early bioinformatics algorithms and software primarily employed shallow learning techniques; thus, their capacity for data characterization and feature learning was limited. With the widespread adoption of RNA-Seq technology, scientists from the biological community began to harness the potential of machine learning and deep learning approaches for gene structure prediction and functional annotation. In this context, we reviewed both conventional methods and contemporary deep learning frameworks, and highlighted novel perspectives on the challenges arising during annotation, underscoring the dynamic nature of this evolving scientific landscape.
Affiliation(s)
- Zhaojia Chen
- National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangzhou 518120, China
- College of Biomedical Engineering, Taiyuan University of Technology, Jinzhong 030600, China
- Noor ul Ain
- National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangzhou 518120, China
- Qian Zhao
- State Key Laboratory for Ecological Pest Control of Fujian/Taiwan Crops and College of Life Science, Fujian Agriculture and Forestry University, Fuzhou, 350002, China
- Xingtan Zhang
- National Key Laboratory for Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangzhou 518120, China
3
Benfica LF, Brito LF, do Bem RD, Mulim HA, Glessner J, Braga LG, Gloria LS, Cyrillo JNSG, Bonilha SFM, Mercadante MEZ. Genome-wide association study between copy number variation and feeding behavior, feed efficiency, and growth traits in Nellore cattle. BMC Genomics 2024; 25:54. [PMID: 38212678 PMCID: PMC10785391 DOI: 10.1186/s12864-024-09976-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2023] [Accepted: 01/04/2024] [Indexed: 01/13/2024]
Abstract
BACKGROUND Feeding costs represent the largest expenditures in beef production. Therefore, the animal's efficiency in converting feed into high-quality protein for human consumption plays a major role in the environmental impact of the beef industry and in the beef producers' profitability. In this context, breeding animals for improved feed efficiency through genomic selection has been considered a strategic practice in modern breeding programs around the world. Copy number variation (CNV) is a less-studied source of genetic variation that can contribute to phenotypic variability in complex traits. This study aimed to: (1) identify CNVs and CNV regions (CNVRs) in the genome of Nellore cattle (Bos taurus indicus); (2) assess potential associations between the identified CNVRs and weaning weight (W210), body weight measured at the time of selection (WSel), average daily gain (ADG), dry matter intake (DMI), residual feed intake (RFI), time spent at the feed bunk (TF), and frequency of visits to the feed bunk (FF); and (3) perform functional enrichment analyses of the significant CNVRs identified for each of the traits evaluated. RESULTS A total of 3,161 CNVs and 561 CNVRs ranging from 4,973 bp to 3,215,394 bp were identified. The CNVRs covered up to 99,221,894 bp (3.99%) of the Nellore autosomal genome. Seventeen CNVRs were significantly associated with dry matter intake and feeding frequency (number of daily visits to the feed bunk). The functional annotation of the associated CNVRs revealed important candidate genes related to metabolism that may be associated with the phenotypic expression of the evaluated traits. Furthermore, Gene Ontology (GO) analyses revealed 19 enrichment processes associated with FF. CONCLUSIONS A total of 3,161 CNVs and 561 CNVRs were identified and characterized in a Nellore cattle population. Various CNVRs were significantly associated with DMI and FF, indicating that CNVs play an important role in key biological pathways and in the phenotypic expression of feeding behavior and growth traits in Nellore cattle.
Affiliation(s)
- Lorena F Benfica
- Department of Animal Sciences, Purdue University, 270 S. Russell Street, West Lafayette, IN, 47907, USA.
- Department of Animal Science, Faculty of Agricultural and Veterinary Sciences, Sao Paulo State University, Jaboticabal, SP, Brazil.
- Luiz F Brito
- Department of Animal Sciences, Purdue University, 270 S. Russell Street, West Lafayette, IN, 47907, USA
- Ricardo D do Bem
- Department of Animal Science, Faculty of Agricultural and Veterinary Sciences, Sao Paulo State University, Jaboticabal, SP, Brazil
- Henrique A Mulim
- Department of Animal Sciences, Purdue University, 270 S. Russell Street, West Lafayette, IN, 47907, USA
- Joseph Glessner
- Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Larissa G Braga
- Department of Animal Science, Faculty of Agricultural and Veterinary Sciences, Sao Paulo State University, Jaboticabal, SP, Brazil
- Leonardo S Gloria
- Department of Animal Sciences, Purdue University, 270 S. Russell Street, West Lafayette, IN, 47907, USA
4
Choi Y, Nam MW, Lee HK, Choi KC. Use of cutting-edge RNA-sequencing technology to identify biomarkers and potential therapeutic targets in canine and feline cancers and other diseases. J Vet Sci 2023; 24:e71. [PMID: 38031650 PMCID: PMC10556291 DOI: 10.4142/jvs.23036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2023] [Revised: 06/16/2023] [Accepted: 06/18/2023] [Indexed: 12/01/2023]
Abstract
With the growing interest in companion animals, the rapidly expanding animal healthcare and pharmaceuticals market worldwide, and advancements in RNA-sequencing (RNA-seq) technology, RNA-seq has become a valuable tool for understanding biological processes in companion animals and has multiple applications in animal healthcare. Historically, veterinary diagnoses and treatments relied solely on clinical symptoms and drugs used in human diseases. However, RNA-seq has emerged as an effective technology for studying companion animals, providing insights into their genetic information. The sequencing technology has revealed that not only messenger RNAs (mRNAs) but also non-coding RNAs (ncRNAs) such as long ncRNAs and microRNAs can serve as biomarkers. Based on the examination of RNA-seq applications in veterinary medicine, particularly in dogs and cats, this review concludes that RNA-seq has significant potential as a diagnostic and research tool. It has enabled the identification of potential biomarkers for cancer and other diseases in companion animals. Further research and development are required to maximize the utilization of RNA-seq for improved disease diagnosis and therapeutic targeting in companion animals.
Affiliation(s)
- Youngdong Choi
- Laboratory of Biochemistry and Immunology, College of Veterinary Medicine, Chungbuk National University, Cheongju 28644, Korea
- Min-Woo Nam
- Laboratory of Biochemistry and Immunology, College of Veterinary Medicine, Chungbuk National University, Cheongju 28644, Korea
- Hong Kyu Lee
- Laboratory of Biochemistry and Immunology, College of Veterinary Medicine, Chungbuk National University, Cheongju 28644, Korea
- Kyung-Chul Choi
- Laboratory of Biochemistry and Immunology, College of Veterinary Medicine, Chungbuk National University, Cheongju 28644, Korea.
5
Lovero D, Porcelli D, Giordano L, Lo Giudice C, Picardi E, Pesole G, Pignataro E, Palazzo A, Marsano RM. Structural and Comparative Analyses of Insects Suggest the Presence of an Ultra-Conserved Regulatory Element of the Genes Encoding Vacuolar-Type ATPase Subunits and Assembly Factors. BIOLOGY 2023; 12:1127. [PMID: 37627011 PMCID: PMC10452791 DOI: 10.3390/biology12081127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Revised: 07/28/2023] [Accepted: 08/09/2023] [Indexed: 08/27/2023]
Abstract
Gene and genome comparisons represent an invaluable tool to identify evolutionarily conserved sequences with possible functional significance. In this work, we have analyzed orthologous genes encoding subunits and assembly factors of the V-ATPase complex, an important enzymatic complex of the vacuolar and lysosomal compartments of the eukaryotic cell with storage and recycling functions, respectively, as well as the main pump in the plasma membrane that energizes the epithelial transport in insects. This study involves 70 insect species belonging to eight insect orders. We highlighted the conservation of a short sequence in the genes encoding subunits of the V-ATPase complex and their assembly factors, analyzed with respect to the exon-intron organization of those genes. This study offers the possibility to study ultra-conserved regulatory elements from an evolutionary perspective, with the aim of expanding our knowledge of the regulation of complex gene networks at the basis of organellar biogenesis and cellular organization.
Affiliation(s)
- Domenica Lovero
- Dipartimento di Bioscienze Biotecnologie e Ambiente, Università Degli Studi di Bari Aldo Moro, Via Orabona 4, 70125 Bari, Italy
- MASMEC Biomed S.p.A., Via Delle Violette 14, 70026 Modugno, Italy
- Damiano Porcelli
- Dipartimento di Bioscienze Biotecnologie e Ambiente, Università Degli Studi di Bari Aldo Moro, Via Orabona 4, 70125 Bari, Italy
- METALABS S.R.L., Corso A. De Gasperi 381/1, 70125 Bari, Italy
- Luca Giordano
- Cardio-Pulmonary Institute (CPI), Universities of Giessen and Marburg Lung Center (UGMLC), Member of the German Center for Lung Research (DZL), Justus-Liebig-University, Aulweg 130, 35392 Giessen, Germany
- Claudio Lo Giudice
- Istituto di Tecnologie Biomediche (ITB), Consiglio Nazionale Delle Ricerche, Via Giovanni Amendola, 122, 70126 Bari, Italy
- Ernesto Picardi
- Dipartimento di Bioscienze Biotecnologie e Ambiente, Università Degli Studi di Bari Aldo Moro, Via Orabona 4, 70125 Bari, Italy
- Graziano Pesole
- Dipartimento di Bioscienze Biotecnologie e Ambiente, Università Degli Studi di Bari Aldo Moro, Via Orabona 4, 70125 Bari, Italy
- Eugenia Pignataro
- Dipartimento di Bioscienze Biotecnologie e Ambiente, Università Degli Studi di Bari Aldo Moro, Via Orabona 4, 70125 Bari, Italy
- Antonio Palazzo
- Dipartimento di Bioscienze Biotecnologie e Ambiente, Università Degli Studi di Bari Aldo Moro, Via Orabona 4, 70125 Bari, Italy
- René Massimiliano Marsano
- Dipartimento di Bioscienze Biotecnologie e Ambiente, Università Degli Studi di Bari Aldo Moro, Via Orabona 4, 70125 Bari, Italy
6
Bopape FL, Chiulele RM, Shonhai A, Gwata ET. The Genome of a Pigeonpea Compatible Rhizobial Strain '10ap3' Appears to Lack Common Nodulation Genes. Genes (Basel) 2023; 14:1084. [PMID: 37239443 PMCID: PMC10217799 DOI: 10.3390/genes14051084] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/13/2023] [Revised: 05/10/2023] [Accepted: 05/13/2023] [Indexed: 05/28/2023]
Abstract
The symbiotic fixation of atmospheric nitrogen (N) in root nodules of tropical legumes such as pigeonpea (Cajanus cajan) is a complex process, which is regulated by multiple genetic factors at the host plant genotype-microsymbiont interface. The process involves multiple genes with various modes of action and is accomplished only when both organisms are compatible. Therefore, it is necessary to develop tools for the genetic manipulation of the host or bacterium towards improving N fixation. In this study, we sequenced the genome of a robust rhizobial strain, Rhizobium tropici '10ap3', that was compatible with pigeonpea, and we determined its genome size. The genome consisted of a large circular chromosome (6,297,373 bp) and contained 6013 genes, of which 99.13% were coding sequences. However, only 5833 of the genes were associated with proteins that could be assigned to specific functions. The genes for nitrogen, phosphorus and iron metabolism, stress response and the adenosine monophosphate nucleoside for purine conversion were present in the genome. However, the genome contained no common nod genes, suggesting that an alternative pathway involving a purine derivative was involved in the symbiotic association with pigeonpea.
Affiliation(s)
- Francina L. Bopape
- Agricultural Research Council, Plant Health and Protection (ARC-PHP), Private Bag X134, Pretoria 0121, South Africa
- Department of Plant and Soil Sciences, Faculty of Science, Engineering and Agriculture, University of Venda, Private Bag X5050, Thohoyandou 0950, South Africa
- Rogerio M. Chiulele
- Centre of Excellence in Agri-Food Systems and Nutrition, Eduardo Mondlane University, 5th Floor, Rectory Building, 25th June Square, Maputo 1100, Mozambique
- Faculty of Agronomy and Forestry Engineering, Eduardo Mondlane University, Julius Nyerere Avenue, Maputo 1100, Mozambique
- Addmore Shonhai
- Department of Biochemistry and Microbiology, Faculty of Science, Engineering and Agriculture, University of Venda, Private Bag X5050, Thohoyandou 0950, South Africa
- Eastonce T. Gwata
- Department of Plant and Soil Sciences, Faculty of Science, Engineering and Agriculture, University of Venda, Private Bag X5050, Thohoyandou 0950, South Africa
7
Cho HN, Chaves de Souza L, Johnson C, Klein JR, Kirkpatrick TC, Silva R, Letra A. Differentially expressed genes in dental pulp tissues of individuals with symptomatic irreversible pulpitis with and without history of COVID-19. J Endod 2023:S0099-2399(23)00244-3. [PMID: 37178757 PMCID: PMC10174733 DOI: 10.1016/j.joen.2023.05.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2023] [Revised: 05/01/2023] [Accepted: 05/02/2023] [Indexed: 05/15/2023]
Abstract
INTRODUCTION Increased levels of pro-inflammatory markers have been reported in tissues of individuals with Coronavirus Disease 2019 (COVID-19). We hypothesize that inflamed dental pulp tissues of individuals with a previous history of COVID-19 may present a differential inflammatory gene expression profile in comparison to individuals who never had COVID-19. MATERIALS AND METHODS Dental pulp tissues were collected from 27 individuals referred for endodontic treatment due to symptomatic irreversible pulpitis. Of these, 16 individuals had a history of COVID-19 (6 months to 1 year post infection) and 11 individuals had no previous history of COVID-19 (controls). Total RNA from pulp tissue samples was extracted and subjected to RNA sequencing for comparison of differentially expressed genes (DEGs) among groups. DEGs showing log2(fold-change) > 1 or < -1, and P < .05 were considered significantly dysregulated. RESULTS RNA sequencing identified 1461 genes as differentially expressed among the groups. Of these, 311 were protein-coding genes, of which 252 (81%) were upregulated and 59 (19%) were downregulated in the COVID group compared to controls. The top upregulated genes in the COVID group were HSFX1 (4.12 fold-change) and LINGO3 (2.06 fold-change); significantly downregulated genes were LYZ (-1.52 fold-change), CCL15 and IL8 (-1.45 fold-change). CONCLUSIONS Differential gene expression in dental pulp tissues of COVID and non-COVID groups suggests a potential contribution of COVID-19 to dysregulating inflammatory gene expression in the inflamed dental pulp.
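The significance rule stated in this abstract (log2(fold-change) > 1 or < -1, and P < .05) can be sketched as a small filter. This is an illustration only, not the authors' pipeline: the `GeneResult` type, the function names, and the gene labels and values in `example` are all hypothetical, and the sketch assumes each gene already has a log2 fold-change and a P value.

```python
from dataclasses import dataclass


@dataclass
class GeneResult:
    gene: str                # hypothetical gene label
    log2_fold_change: float  # COVID group relative to controls
    p_value: float


def is_dysregulated(result, lfc_cutoff=1.0, p_cutoff=0.05):
    """Apply the rule from the abstract: |log2(fold-change)| > 1 and P < .05."""
    return abs(result.log2_fold_change) > lfc_cutoff and result.p_value < p_cutoff


def split_by_direction(results):
    """Return (upregulated, downregulated) gene labels among significant results."""
    significant = [r for r in results if is_dysregulated(r)]
    up = [r.gene for r in significant if r.log2_fold_change > 0]
    down = [r.gene for r in significant if r.log2_fold_change < 0]
    return up, down


# Made-up values for illustration only; these are not data from the study.
example = [
    GeneResult("GENE_A", 2.3, 0.001),   # passes both cutoffs, upregulated
    GeneResult("GENE_B", -1.4, 0.020),  # passes both cutoffs, downregulated
    GeneResult("GENE_C", 0.6, 0.001),   # fold-change too small
    GeneResult("GENE_D", 1.8, 0.200),   # P value too large
]
up, down = split_by_direction(example)
print(up, down)
```

Both cutoffs are strict inequalities, so a gene with a log2 fold-change of exactly 1 is not counted as dysregulated under this reading of the rule.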
Affiliation(s)
- Han Na Cho
- Department of Endodontics, The University of Texas Health Science Center at Houston School of Dentistry, Houston, TX
- Leticia Chaves de Souza
- Department of Endodontics, The University of Texas Health Science Center at Houston School of Dentistry, Houston, TX
- Cleverick Johnson
- Department of General Practice and Dental Public Health, The University of Texas Health Science Center at Houston School of Dentistry, Houston, TX
- John R Klein
- Department of Diagnostic and Biomedical Sciences, The University of Texas Health Science Center at Houston School of Dentistry, Houston, TX
- Timothy C Kirkpatrick
- Department of Endodontics, The University of Texas Health Science Center at Houston School of Dentistry, Houston, TX
- Renato Silva
- Department of Endodontics, The University of Pittsburgh School of Dental Medicine, Pittsburgh, PA
- Ariadne Letra
- Department of Endodontics, The University of Pittsburgh School of Dental Medicine, Pittsburgh, PA; Department of Oral and Craniofacial Sciences, The University of Pittsburgh School of Dental Medicine, Pittsburgh, PA; Center for Craniofacial and Dental Genetics, The University of Pittsburgh School of Dental Medicine, Pittsburgh, PA.
8
Sattar S, Kabat J, Jerome K, Feldmann F, Bailey K, Mehedi M. Nuclear translocation of spike mRNA and protein is a novel feature of SARS-CoV-2. Front Microbiol 2023; 14:1073789. [PMID: 36778849 PMCID: PMC9909199 DOI: 10.3389/fmicb.2023.1073789] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 10/18/2022] [Accepted: 01/12/2023] [Indexed: 01/27/2023]
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes severe pathophysiology in vulnerable older populations and appears to be highly pathogenic and more transmissible than other coronaviruses. The spike (S) protein appears to be a major pathogenic factor that contributes to the unique pathogenesis of SARS-CoV-2. Although the S protein is a surface transmembrane type 1 glycoprotein, it has been predicted to be translocated into the nucleus due to the novel nuclear localization signal (NLS) "PRRARSV," which is absent from the S protein of other coronaviruses. Indeed, S proteins translocate into the nucleus in SARS-CoV-2-infected cells. S mRNAs also translocate into the nucleus. S mRNA colocalizes with S protein, aiding the nuclear translocation of S mRNA. While nuclear translocation of nucleoprotein (N) has been shown in many coronaviruses, the nuclear translocation of both S mRNA and S protein reveals a novel feature of SARS-CoV-2.
Affiliation(s)
- Sarah Sattar
- Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND, United States
- Juraj Kabat
- Biological Imaging Section, Research Technology Branch, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD, United States
- Kailey Jerome
- Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND, United States
- Friederike Feldmann
- Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Hamilton, MT, United States
- Kristina Bailey
- Department of Internal Medicine, Pulmonary, Critical Care, and Sleep and Allergy, University of Nebraska Medical Center, Omaha, NE, United States
- Masfique Mehedi
- Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND, United States
9
Sattar S, Kabat J, Jerome K, Feldmann F, Bailey K, Mehedi M. Nuclear translocation of spike mRNA and protein is a novel pathogenic feature of SARS-CoV-2. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2022:2022.09.27.509633. [PMID: 36203551 PMCID: PMC9536038 DOI: 10.1101/2022.09.27.509633] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/24/2022]
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) causes severe pathophysiology in vulnerable older populations and appears to be highly pathogenic and more transmissible than SARS-CoV or MERS-CoV [1, 2]. The spike (S) protein appears to be a major pathogenic factor that contributes to the unique pathogenesis of SARS-CoV-2. Although the S protein is a surface transmembrane type 1 glycoprotein, it has been predicted to be translocated into the nucleus due to the novel nuclear localization signal (NLS) "PRRARSV", which is absent from the S protein of other coronaviruses. Indeed, S proteins translocate into the nucleus in SARS-CoV-2-infected cells. To our surprise, S mRNAs also translocate into the nucleus. S mRNA colocalizes with S protein, aiding the nuclear translocation of S mRNA. While nuclear translocation of nucleoprotein (N) has been shown in many coronaviruses, the nuclear translocation of both S mRNA and S protein reveals a novel pathogenic feature of SARS-CoV-2. Author summary One of the novel sequence insertions resides at the S1/S2 boundary of Spike (S) protein and constitutes a functional nuclear localization signal (NLS) motif "PRRARSV", which may supersede the importance of previously proposed polybasic furin cleavage site "RRAR". Indeed, S protein's NLS-driven nuclear translocation and its possible role in S mRNA's nuclear translocation reveal a novel pathogenic feature of SARS-CoV-2.
Affiliation(s)
- Sarah Sattar
- Department of Biomedical Sciences, University of North Dakota School of Medicine & Health Sciences, Grand Forks, ND, USA
- Juraj Kabat
- Biological Imaging Section, Research Technology Branch, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD, USA
- Kailey Jerome
- Department of Biomedical Sciences, University of North Dakota School of Medicine & Health Sciences, Grand Forks, ND, USA
- Friederike Feldmann
- Division of Intramural Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Hamilton, MT, USA
- Kristina Bailey
- Department of Internal Medicine, Pulmonary, Critical Care, and Sleep and Allergy, University of Nebraska Medical Center, Omaha, NE, USA
- Masfique Mehedi
- Department of Biomedical Sciences, University of North Dakota School of Medicine & Health Sciences, Grand Forks, ND, USA
10
Chae YK, Shin HB, Woo TR. Identification of interaction partners using protein aggregation and NMR spectroscopy. PLoS One 2022; 17:e0270058. [PMID: 36084098 PMCID: PMC9462707 DOI: 10.1371/journal.pone.0270058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2022] [Accepted: 06/02/2022] [Indexed: 11/19/2022]
Abstract
The interaction among proteins is one of the most fundamental methods of information transfer in the living system. Many methods have been developed to identify the interaction pairs or groups either in vivo or in vitro. The in vitro pulldown/coprecipitation assay directly observes the protein that binds to the target. This method involves electrophoresis, which is a low-resolution and low-throughput technique. As a better alternative, we propose a new method based on NMR spectroscopy. This method utilizes the aggregation of the target protein and the concomitant signal disappearance of the interacting partner. The aggregation is accomplished by the elastin-like polypeptide, which is fused to the target. If a protein binds to this supramolecular complex, its NMR signal becomes too broad to be observed, which is a basic phenomenon of NMR spectroscopy. Thus, the protein that loses its signal is the one that binds to the target. A compound that interferes with these types of binding among the proteins can be identified by observing the reappearance of the protein signals with the simultaneous disappearance of the signals of the compound. This technique will be applied to find an interaction pair in the information transfer pathway as well as a compound that disrupts it. The proposed method should be able to work with a mixture of proteins and provide a higher resolution for finding the binding partner in a higher-throughput fashion.
Affiliation(s)
- Young Kee Chae
- Department of Chemistry, Sejong University, Seoul, Korea
- Han Bin Shin
- Department of Chemistry, Sejong University, Seoul, Korea
- Tae Rin Woo
- Department of Chemistry, Sejong University, Seoul, Korea
11
A Comprehensive Analysis of Human Endogenous Retroviruses HERV-K (HML.2) from Teratocarcinoma Cell Lines and Detection of Viral Cargo in Microvesicles. Int J Mol Sci 2021; 22:ijms222212398. [PMID: 34830279 PMCID: PMC8619701 DOI: 10.3390/ijms222212398] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/15/2021] [Revised: 11/14/2021] [Accepted: 11/15/2021] [Indexed: 12/12/2022]
Abstract
About 8% of our genome is composed of sequences from Human Endogenous Retroviruses (HERVs). The HERV-K (HML.2) family, here abbreviated HML.2, is able to produce virus particles that were detected in cell lines, malignant tumors and in autoimmune diseases. Parameters and properties of HML.2 released from teratocarcinoma cell lines GH and Tera-1 were investigated in detail. In most experiments, analyzed viruses were purified by density gradient centrifugation. HML.2 structural proteins, reverse transcriptase (RT) activity, viral RNA (vRNA) and particle morphology were analyzed. The HML.2 markers were predominantly detected in fractions with a buoyant density of 1.16 g/cm³. Deglycosylation of TM revealed truncated forms of transmembrane (TM) protein. Free virions and extracellular vesicles (presumably microvesicles, MVs) with HML.2 elements, including budding intermediates, were detected by electron microscopy. Viral elements and assembled virions captured and exported by MVs can boost specific immune responses and trigger immunomodulation in recipient cells. Sequencing of cDNA clones demonstrated exclusive presence of HERV-K108 env in HML.2 from Tera-1 cells. Not counting two recombinant variants, four known env sequences were found in HML.2 from GH cells. Obtained results shed light on parameters and morphology of HML.2. A possible mechanism of HML.2-induced diseases is discussed.
12
Wang Z, Cheng H. Single-Trait and Multiple-Trait Genomic Prediction From Multi-Class Bayesian Alphabet Models Using Biological Information. Front Genet 2021; 12:717457. [PMID: 34707638 PMCID: PMC8542848 DOI: 10.3389/fgene.2021.717457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2021] [Accepted: 08/23/2021] [Indexed: 11/13/2022]
Abstract
Genomic prediction has been widely used in multiple areas and various genomic prediction methods have been developed. The majority of these methods, however, focus on statistical properties and ignore the abundant useful biological information like genome annotation or previously discovered causal variants. Therefore, to improve prediction performance, several methods have been developed to incorporate biological information into genomic prediction, mostly in single-trait analysis. A commonly used method to incorporate biological information is allocating molecular markers into different classes based on the biological information and assigning separate priors to molecular markers in different classes. It has been shown that such methods can achieve higher prediction accuracy than conventional methods in some circumstances. However, these methods mainly focus on single-trait analysis, and available priors of these methods are limited. Thus, in both single-trait and multiple-trait analysis, we propose the multi-class Bayesian Alphabet methods, in which multiple Bayesian Alphabet priors, including RR-BLUP, BayesA, BayesB, BayesCπ, and Bayesian LASSO, can be used for markers allocated to different classes. The superior performance of the multi-class Bayesian Alphabet in genomic prediction is demonstrated using both real and simulated data. The software tool JWAS offers open-source routines to perform these analyses.
Affiliation(s)
- Zigui Wang
- Department of Animal Science, University of California, Davis, Davis, CA, United States
- Hao Cheng
- Department of Animal Science, University of California, Davis, Davis, CA, United States
13
Yu T, Han R, Fang Z, Mu Z, Zheng H, Liu J. TransRef enables accurate transcriptome assembly by redefining accurate neo-splicing graphs. Brief Bioinform 2021; 22:6319943. [PMID: 34254977 DOI: 10.1093/bib/bbab261] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/17/2021] [Revised: 06/09/2021] [Accepted: 01/22/2020] [Indexed: 11/14/2022]
Abstract
RNA-seq technology is widely employed in various research areas related to transcriptome analyses, and the identification of all the expressed transcripts from short sequencing reads presents a considerable computational challenge. In this study, we introduce TransRef, a new computational algorithm for accurate transcriptome assembly by redefining a novel graph model, the neo-splicing graph, and then iteratively applying a constrained dynamic programming to reconstruct all the expressed transcripts for each graph. When TransRef is utilized to analyze both real and simulated datasets, its performance is notably better than those of several state-of-the-art assemblers, including StringTie2, Cufflinks and Scallop. In particular, the performance of TransRef is notably strong in identifying novel transcripts and transcripts with low-expression levels, while the other assemblers are less effective.
Affiliation(s)
- Ting Yu
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
- Renmin Han
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
- Zhaoyuan Fang
- Key Laboratory of Systems Biology, CAS Center for Excellence in Molecular Cell Science, Institute of Biochemistry and Cell Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai, China
- Zengchao Mu
- School of Mathematics from Shandong University, China
- Hongyu Zheng
- Department of Radiation Oncology, Qilu Hospital, Cheeloo College of Medicine, Shandong University, Jinan, China
- Juntao Liu
- School of Mathematics and Statistics at Shandong University, Weihai, China
14
Transcriptome Analyses Implicate Endogenous Retroviruses Involved in the Host Antiviral Immune System through the Interferon Pathway. Virol Sin 2021; 36:1315-1326. [PMID: 34009516 PMCID: PMC8131884 DOI: 10.1007/s12250-021-00370-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/10/2020] [Accepted: 02/08/2021] [Indexed: 12/19/2022]
Abstract
Human endogenous retroviruses (HERVs) are the remains of ancient retroviruses that invaded our ancestors' germline cell and were integrated into the genome. The expression of HERVs has always been a cause for concern because of its association with various cancers and diseases. However, few previous studies have focused on specific activation of HERVs by viral infections. Our previous study has shown that dengue virus type 2 (DENV-2) infection induces the transcription of a large number of abnormal HERVs loci; therefore, the purpose of this study was to explore the relationship between exogenous viral infection and HERV activation further. In this study, we retrieved and reanalyzed published data on 21 transcriptomes of human cells infected with various viruses. We found that infection with different viruses could induce transcriptional activation of HERV loci. Through the comparative analysis of all viral datasets, we identified 43 key HERV loci that were up-regulated by DENV-2, influenza A virus, influenza B virus, Zika virus, measles virus, and West Nile virus infections. Furthermore, the neighboring genes of these HERVs were simultaneously up-regulated, and almost all such neighboring genes were interferon-stimulated genes (ISGs), which are enriched in the host's antiviral immune response pathways. Our data supported the hypothesis that activation of HERVs, probably via an interferon-mediated mechanism, plays an important role in innate immunity against viral infections.
15
Zhang Z, Zhang S, Li X, Zhao Z, Chen C, Zhang J, Li M, Wei Z, Jiang W, Pan B, Li Y, Liu Y, Cao Y, Zhao W, Gu Y, Yu Y, Meng Q, Qi L. Reference genome and annotation updates lead to contradictory prognostic predictions in gene expression signatures: a case study of resected stage I lung adenocarcinoma. Brief Bioinform 2020; 22:5834482. [PMID: 32383445 DOI: 10.1093/bib/bbaa081] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/24/2020] [Revised: 04/02/2020] [Accepted: 04/18/2020] [Indexed: 12/28/2022]
Abstract
RNA-sequencing enables accurate and low-cost transcriptome-wide detection. However, expression estimates vary as reference genomes and gene annotations are updated, confounding existing expression-based prognostic signatures. Herein, a prognostic 9-gene pair signature (9-GPS) was applied to 197 patients with stage I lung adenocarcinoma derived from previous and latest data from The Cancer Genome Atlas (TCGA) processed with different reference genomes and annotations. For 9-GPS, 6.6% of patients exhibited discordant risk classifications between the two TCGA versions. Similar results were observed for other prognostic signatures, including IRGPI, 15-gene and ORACLE. We found that conflicting annotations for gene length and overlap were the major cause of their discordant risk classification. Therefore, we constructed a prognostic 40-GPS based on stable genes across GENCODE v20-v30 and validated it using public data of 471 stage I samples (log-rank P < 0.0010). Risk classification was still stable in RNA-sequencing data processed with the newest GENCODE v32 versus GENCODE v20-v30. Specifically, 40-GPS could predict survival for 30 stage I samples with formalin-fixed paraffin-embedded tissues (log-rank P = 0.0177). In conclusion, this method overcomes the vulnerability of existing prognostic signatures due to reference genome and annotation updates. 40-GPS may offer individualized clinical applications due to its prognostic accuracy and classification stability.
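The discordance figure reported in this abstract can be made concrete with a small sketch. It assumes each patient receives a binary high/low risk label under each data version; the labels and the function name `discordance_rate` are hypothetical illustrations, not data or code from the study.

```python
from typing import Sequence


def discordance_rate(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of samples whose risk label differs between two versions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    if not labels_a:
        raise ValueError("label vectors must not be empty")
    # Count positions where the two classifications disagree.
    changed = sum(a != b for a, b in zip(labels_a, labels_b))
    return changed / len(labels_a)


# Hypothetical risk labels for five patients under two annotation versions.
old_version = ["high", "low", "low", "high", "low"]
new_version = ["high", "low", "high", "high", "low"]
rate = discordance_rate(old_version, new_version)
```

For scale, a discordance of 6.6% among 197 patients corresponds to about 13 patients (0.066 × 197 ≈ 13).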
16
Lu S, Zhang J, Lian X, Sun L, Meng K, Chen Y, Sun Z, Yin X, Li Y, Zhao J, Wang T, Zhang G, He QY. A hidden human proteome encoded by 'non-coding' genes. Nucleic Acids Res 2019; 47:8111-8125. [PMID: 31340039 PMCID: PMC6735797 DOI: 10.1093/nar/gkz646] [Citation(s) in RCA: 108] [Impact Index Per Article: 21.6] [Received: 05/25/2019] [Revised: 07/07/2019] [Accepted: 07/15/2019] [Indexed: 01/27/2023]
Abstract
It has been a long debate whether the 98% ‘non-coding’ fraction of human genome can encode functional proteins besides short peptides. With full-length translating mRNA sequencing and ribosome profiling, we found that up to 3330 long non-coding RNAs (lncRNAs) were bound to ribosomes with active translation elongation. With shotgun proteomics, 308 lncRNA-encoded new proteins were detected. A total of 207 unique peptides of these new proteins were verified by multiple reaction monitoring (MRM) and/or parallel reaction monitoring (PRM); and 10 new proteins were verified by immunoblotting. We found that these new proteins deviated from the canonical proteins with various physical and chemical properties, and emerged mostly in primates during evolution. We further deduced the protein functions by the assays of translation efficiency, RNA folding and intracellular localizations. As the new protein UBAP1-AST6 is localized in the nucleoli and is preferentially expressed by lung cancer cell lines, we biologically verified that it has a function associated with cell proliferation. In sum, we experimentally evidenced a hidden human functional proteome encoded by purported lncRNAs, suggesting a resource for annotating new human proteins.
Affiliation(s)
- Shaohua Lu
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Jing Zhang
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Xinlei Lian
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China; Laboratory of Veterinary Pharmacology, College of Veterinary Medicine, South China Agricultural University, Guangzhou 510642, China
- Li Sun
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Kun Meng
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Yang Chen
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Zhenghua Sun
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Xingfeng Yin
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Yaxing Li
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Jing Zhao
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Tong Wang
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Gong Zhang
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
- Qing-Yu He
- Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Institute of Life and Health Engineering, College of Life Science and Technology, Jinan University, Guangzhou 510632, China
17
Wilbrandt J, Misof B, Panfilio KA, Niehuis O. Repertoire-wide gene structure analyses: a case study comparing automatically predicted and manually annotated gene models. BMC Genomics 2019; 20:753. [PMID: 31623555 PMCID: PMC6798390 DOI: 10.1186/s12864-019-6064-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/18/2018] [Accepted: 08/27/2019] [Indexed: 02/06/2023]
Abstract
Background The location and modular structure of eukaryotic protein-coding genes in genomic sequences can be automatically predicted by gene annotation algorithms. These predictions are often used for comparative studies on gene structure, gene repertoires, and genome evolution. However, automatic annotation algorithms do not yet correctly identify all genes within a genome, and manual annotation is often necessary to obtain accurate gene models and gene sets. As manual annotation is time-consuming, only a fraction of the gene models in a genome is typically manually annotated, and this fraction often differs between species. To assess the impact of manual annotation efforts on genome-wide analyses of gene structural properties, we compared the structural properties of protein-coding genes in seven diverse insect species sequenced by the i5k initiative. Results Our results show that the subset of genes chosen for manual annotation by a research community (3.5–7% of gene models) may have structural properties (e.g., lengths and exon counts) that are not necessarily representative for a species’ gene set as a whole. Nonetheless, the structural properties of automatically generated gene models are only altered marginally (if at all) through manual annotation. Major correlative trends, for example a negative correlation between genome size and exonic proportion, can be inferred from either the automatically predicted or manually annotated gene models alike. Vice versa, some previously reported trends did not appear in either the automatic or manually annotated gene sets, pointing towards insect-specific gene structural peculiarities. Conclusions In our analysis of gene structural properties, automatically predicted gene models proved to be sufficiently reliable to recover the same gene-repertoire-wide correlative trends that we found when focusing on manually annotated gene models only. 
We acknowledge that analyses on the individual gene level clearly benefit from manual curation. However, as genome sequencing and annotation projects often differ in the extent of their manual annotation and curation efforts, our results indicate that comparative studies analyzing gene structural properties in these genomes can nonetheless be justifiable and informative.
Affiliation(s)
- Jeanne Wilbrandt
- Center for molecular Biodiversity Research, Zoological Research Museum Alexander Koenig (ZFMK), Adenauerallee 160, 53113, Bonn, Germany; present address: Hoffmann Research Group, Leibniz Institute on Aging - Fritz Lipmann Institute, Beutenbergstraße 11, 07745, Jena, Germany
- Bernhard Misof
- Center for molecular Biodiversity Research, Zoological Research Museum Alexander Koenig (ZFMK), Adenauerallee 160, 53113, Bonn, Germany
- Kristen A Panfilio
- School of Life Sciences, University of Warwick, Gibbet Hill Campus, Coventry, CV4 7AL, UK
- Oliver Niehuis
- Evolutionary Biology and Ecology, Institute of Biology I (Zoology), Albert Ludwig University, Hauptstr. 1, 79104, Freiburg, Germany
18
Babarinde IA, Li Y, Hutchins AP. Computational Methods for Mapping, Assembly and Quantification for Coding and Non-coding Transcripts. Comput Struct Biotechnol J 2019; 17:628-637. [PMID: 31193391 PMCID: PMC6526290 DOI: 10.1016/j.csbj.2019.04.012] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 02/25/2019] [Revised: 04/24/2019] [Accepted: 04/29/2019] [Indexed: 12/17/2022]
Abstract
The measurement of gene expression has long provided significant insight into biological functions. The development of high-throughput short-read sequencing technology has revealed transcriptional complexity at an unprecedented scale, and informed almost all areas of biology. However, as researchers have sought to gather more insights from the data, these new technologies have also increased the computational analysis burden. In this review, we describe typical computational pipelines for RNA-Seq analysis and discuss their strengths and weaknesses for the assembly, quantification and analysis of coding and non-coding RNAs. We also discuss the assembly of transposable elements into transcripts, and the difficulty these repetitive elements pose. In summary, RNA-Seq is a powerful technology that is likely to remain a key asset in the biologist's toolkit.
Affiliation(s)
- Andrew P. Hutchins
- Department of Biology, Southern University of Science and Technology, 1088 Xueyuan Lu, Shenzhen, China
19
Ahmad SF, Martins C. The Modern View of B Chromosomes Under the Impact of High Scale Omics Analyses. Cells 2019; 8:E156. [PMID: 30781835 PMCID: PMC6406668 DOI: 10.3390/cells8020156] [Citation(s) in RCA: 49] [Impact Index Per Article: 9.8] [Received: 01/28/2019] [Revised: 02/10/2019] [Accepted: 02/12/2019] [Indexed: 12/11/2022]
Abstract
Supernumerary B chromosomes (Bs) are extra karyotype units in addition to A chromosomes, and are found in some fungi and thousands of animals and plant species. Bs are uniquely characterized due to their non-Mendelian inheritance, and represent one of the best examples of genomic conflict. Over the last decades, their genetic composition, function and evolution have remained an unresolved query, although a few successful attempts have been made to address these phenomena. A classical concept based on cytogenetics and genetics is that Bs are selfish and abundant with DNA repeats and transposons, and in most cases, they do not carry any function. However, recently, the modern quantum development of high scale multi-omics techniques has shifted B research towards a new-born field that we call "B-omics". We review the recent literature and add novel perspectives to the B research, discussing the role of new technologies to understand the mechanistic perspectives of the molecular evolution and function of Bs. The modern view states that B chromosomes are enriched with genes for many significant biological functions, including but not limited to the interesting set of genes related to cell cycle and chromosome structure. Furthermore, the presence of B chromosomes could favor genomic rearrangements and influence the nuclear environment affecting the function of other chromatin regions. We hypothesize that B chromosomes might play a key function in driving their transmission and maintenance inside the cell, as well as offer an extra genomic compartment for evolution.
Affiliation(s)
- Syed Farhan Ahmad
- Department of Morphology, Institute of Biosciences at Botucatu, Sao Paulo State University (UNESP), CEP 18618689, Botucatu, SP, Brazil
- Cesar Martins
- Department of Morphology, Institute of Biosciences at Botucatu, Sao Paulo State University (UNESP), CEP 18618689, Botucatu, SP, Brazil
20
Crawley AB, Barrangou R. Conserved Genome Organization and Core Transcriptome of the Lactobacillus acidophilus Complex. Front Microbiol 2018; 9:1834. [PMID: 30150974 PMCID: PMC6099100 DOI: 10.3389/fmicb.2018.01834] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 02/19/2018] [Accepted: 07/23/2018] [Indexed: 01/08/2023]
Abstract
The Lactobacillus genus encompasses a genetically and functionally diverse group of species, and contains many strains widely formulated in the human food supply chain as probiotics and starter cultures. Within this genetically expansive group, there are several distinct clades that have high levels of homology, one of which is the Lactobacillus acidophilus group. Among the uniting features, small genomes, low GC content, adaptation to dairy environments, and fastidious growth requirements are some of the most defining characteristics of this group. To better understand what truly links and defines this clade, we sought to characterize the genomic organization and content of the genomes of several members of this group. Through core genome analysis we explored the synteny and intrinsic genetic underpinnings of the L. acidophilus clade, and observed key features related to the evolution and adaptation of these organisms. While genetic content is able to provide a large map of the potential of each organism, it does not always reflect their functionality. Through transcriptomic data we inferred the core transcriptome of the L. acidophilus complex to better define the true metabolic capabilities that unite this clade. Using this approach we have identified seven small ORFs that are both highly conserved and transcribed in diverse members of this clade and could be potential novel small peptide or untranslated RNA regulators. Overall, our results reveal the core features of the L. acidophilus complex and open new avenues for the enhancement and formulation of next-generation probiotics and starter cultures.
Affiliation(s)
- Alexandra B Crawley
- Genomic Sciences Program, NC State University, Raleigh, NC, United States; Department of Food, Bioprocessing and Nutrition Sciences, NC State University, Raleigh, NC, United States
- Rodolphe Barrangou
- Genomic Sciences Program, NC State University, Raleigh, NC, United States; Department of Food, Bioprocessing and Nutrition Sciences, NC State University, Raleigh, NC, United States
21
Bányai L, Kerekes K, Trexler M, Patthy L. Morphological Stasis and Proteome Innovation in Cephalochordates. Genes (Basel) 2018; 9:genes9070353. [PMID: 30013013 PMCID: PMC6071037 DOI: 10.3390/genes9070353] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/24/2018] [Revised: 07/11/2018] [Accepted: 07/11/2018] [Indexed: 11/16/2022]
Abstract
Lancelets, extant representatives of basal chordates, are prototypic examples of evolutionary stasis; they preserved a morphology and body-plan most similar to the fossil chordates from the early Cambrian. Such a low level of morphological evolution is in harmony with a low rate of amino acid substitution; cephalochordate proteins were shown to evolve slower than those of the slowest evolving vertebrate, the elephant shark. Surprisingly, a study comparing the predicted proteomes of Chinese amphioxus, Branchiostoma belcheri and the Florida amphioxus, Branchiostoma floridae has led to the conclusion that the rate of creation of novel domain combinations is orders of magnitude greater in lancelets than in any other Metazoa, a finding that contradicts the notion that high rates of protein innovation are usually associated with major evolutionary innovations. Our earlier studies on a representative sample of proteins have provided evidence suggesting that the differences in the domain architectures of predicted proteins of these two lancelet species reflect annotation errors, rather than true innovations. In the present work, we have extended these studies to include a larger sample of genes and two additional lancelet species, Asymmetron lucayanum and Branchiostoma lanceolatum. These analyses have confirmed that the domain architecture differences of orthologous proteins of the four lancelet species are because of errors of gene prediction, the error rate in the given species being inversely related to the quality of the transcriptome dataset that was used to aid gene prediction.
Affiliation(s)
- László Bányai
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, H-1117 Budapest, Hungary
- Krisztina Kerekes
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, H-1117 Budapest, Hungary
- Mária Trexler
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, H-1117 Budapest, Hungary
- László Patthy
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, H-1117 Budapest, Hungary
22
Lagarde J, Uszczynska-Ratajczak B, Santoyo-Lopez J, Gonzalez JM, Tapanari E, Mudge JM, Steward CA, Wilming L, Tanzer A, Howald C, Chrast J, Vela-Boza A, Rueda A, Lopez-Domingo FJ, Dopazo J, Reymond A, Guigó R, Harrow J. Extension of human lncRNA transcripts by RACE coupled with long-read high-throughput sequencing (RACE-Seq). Nat Commun 2016; 7:12339. [PMID: 27531712 PMCID: PMC4992054 DOI: 10.1038/ncomms12339] [Citation(s) in RCA: 53] [Impact Index Per Article: 6.6] [Received: 04/20/2016] [Accepted: 06/23/2016] [Indexed: 12/22/2022]
Abstract
Long non-coding RNAs (lncRNAs) constitute a large, yet mostly uncharacterized fraction of the mammalian transcriptome. Such characterization requires a comprehensive, high-quality annotation of their gene structure and boundaries, which is currently lacking. Here we describe RACE-Seq, an experimental workflow designed to address this based on RACE (rapid amplification of cDNA ends) and long-read RNA sequencing. We apply RACE-Seq to 398 human lncRNA genes in seven tissues, leading to the discovery of 2,556 on-target, novel transcripts. About 60% of the targeted loci are extended in either 5′ or 3′, often reaching genomic hallmarks of gene boundaries. Analysis of the novel transcripts suggests that lncRNAs are as long, have as many exons and undergo as much alternative splicing as protein-coding genes, contrary to current assumptions. Overall, we show that RACE-Seq is an effective tool to annotate an organism's deep transcriptome, and compares favourably to other targeted sequencing techniques. Long non-coding RNAs are increasingly recognised to be important factors in regulating cellular processes and comprise a large fraction of the transcriptome; however, most are uncharacterised. Here the authors present RACE-Seq, a tool to improve and extend the annotation of low-expression transcripts.
Affiliation(s)
- Julien Lagarde
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Dr Aiguader 88, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Barbara Uszczynska-Ratajczak
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Dr Aiguader 88, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Electra Tapanari
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
- Jonathan M Mudge
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
- Charles A Steward
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
- Laurens Wilming
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
- Andrea Tanzer
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Dr Aiguader 88, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Cédric Howald
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
- Jacqueline Chrast
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
- Alicia Vela-Boza
- Genomics and Bioinformatics Platform of Andalusia (GBPA), 41092 Seville, Spain; Roche Diagnostics, 08174 Sant Cugat Del Vallès, Barcelona, Spain
- Antonio Rueda
- Genomics and Bioinformatics Platform of Andalusia (GBPA), 41092 Seville, Spain
- Joaquin Dopazo
- Genomics and Bioinformatics Platform of Andalusia (GBPA), 41092 Seville, Spain; Computational Genomics Department, Centro de Investigación Príncipe Felipe, 46012 Valencia, Spain; Functional Genomics Node (INB), Centro de Investigación Príncipe Felipe, 46012 Valencia, Spain
- Alexandre Reymond
- Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland
- Roderic Guigó
- Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Dr Aiguader 88, 08003 Barcelona, Spain; Universitat Pompeu Fabra (UPF), Barcelona, Spain
- Jennifer Harrow
- Wellcome Trust Sanger Institute, Hinxton, Cambridgeshire CB10 1HH, UK
23
Putative extremely high rate of proteome innovation in lancelets might be explained by high rate of gene prediction errors. Sci Rep 2016; 6:30700. [PMID: 27476717 PMCID: PMC4967905 DOI: 10.1038/srep30700] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 04/26/2016] [Accepted: 07/06/2016] [Indexed: 01/17/2023]
Abstract
A recent analysis of the genomes of Chinese and Florida lancelets has concluded that the rate of creation of novel protein domain combinations is orders of magnitude greater in lancelets than in other metazoa and it was suggested that continuous activity of transposable elements in lancelets is responsible for this increased rate of protein innovation. Since morphologically Chinese and Florida lancelets are highly conserved, this finding would contradict the observation that high rates of protein innovation are usually associated with major evolutionary innovations. Here we show that the conclusion that the rate of proteome innovation is exceptionally high in lancelets may be unjustified: the differences observed in domain architectures of orthologous proteins of different amphioxus species probably reflect high rates of gene prediction errors rather than true innovation.
|
24
|
Vanhoutreve R, Kress A, Legrand B, Gass H, Poch O, Thompson JD. LEON-BIS: multiple alignment evaluation of sequence neighbours using a Bayesian inference system. BMC Bioinformatics 2016; 17:271. [PMID: 27387560 PMCID: PMC4936259 DOI: 10.1186/s12859-016-1146-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 03/29/2016] [Accepted: 07/01/2016] [Indexed: 11/13/2022] Open
Abstract
Background A standard procedure in many areas of bioinformatics is to use a multiple sequence alignment (MSA) as the basis for various types of homology-based inference. Applications include 3D structure modelling, protein functional annotation, prediction of molecular interactions, etc. These applications, however sophisticated, are generally highly sensitive to the alignment used, and neglecting non-homologous or uncertain regions in the alignment can lead to significant bias in the subsequent inferences. Results Here, we present a new method, LEON-BIS, which uses a robust Bayesian framework to estimate the homologous relations between sequences in a protein multiple alignment. Sequences are clustered into sub-families and relations are predicted at different levels, including ‘core blocks’, ‘regions’ and full-length proteins. The accuracy and reliability of the predictions are demonstrated in large-scale comparisons using well annotated alignment databases, where the homologous sequence segments are detected with very high sensitivity and specificity. Conclusions LEON-BIS uses robust Bayesian statistics to distinguish the portions of multiple sequence alignments that are conserved either across the whole family or within subfamilies. LEON-BIS should thus be useful for automatic, high-throughput genome annotations, 2D/3D structure predictions, protein-protein interaction predictions etc.
Affiliation(s)
- Renaud Vanhoutreve
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France
- Arnaud Kress
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France
- Baptiste Legrand
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France
- Hélène Gass
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France
- Olivier Poch
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France
- Julie D Thompson
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle de Strasbourg, Strasbourg, France.
|
25
|
Improving GENCODE reference gene annotation using a high-stringency proteogenomics workflow. Nat Commun 2016; 7:11778. [PMID: 27250503 PMCID: PMC4895710 DOI: 10.1038/ncomms11778] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Received: 12/22/2015] [Accepted: 04/28/2016] [Indexed: 12/16/2022] Open
Abstract
Complete annotation of the human genome is indispensable for medical research. The GENCODE consortium strives to provide this, augmenting computational and experimental evidence with manual annotation. The rapidly developing field of proteogenomics provides evidence for the translation of genes into proteins and can be used to discover and refine gene models. However, for both the proteomics and annotation groups, there is a lack of guidelines for integrating these data. Here we report a stringent workflow for the interpretation of proteogenomic data that could be used by the annotation community to interpret novel proteogenomic evidence. Based on reprocessing of three large-scale publicly available human data sets, we show that a conservative approach using stringent filtering is required to generate valid identifications. Evidence has been found supporting 16 novel protein-coding genes being added to GENCODE. Despite this, many peptide identifications in pseudogenes cannot be annotated due to the absence of orthogonal supporting evidence. Identifying and annotating functional elements in the human genome remains a challenging but important task. Here the authors propose a priority annotation score to rank identifications and suggest how proteogenomics evidence can be interpreted and what additional information substantiates protein-coding potential for annotation.
|
26
|
McGuire AB, Rafi SK, Manzardo AM, Butler MG. Morphometric Analysis of Recognized Genes for Autism Spectrum Disorders and Obesity in Relationship to the Distribution of Protein-Coding Genes on Human Chromosomes. Int J Mol Sci 2016; 17:E673. [PMID: 27164088 PMCID: PMC4881499 DOI: 10.3390/ijms17050673] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/10/2016] [Revised: 04/11/2016] [Accepted: 04/28/2016] [Indexed: 12/20/2022] Open
Abstract
Mammalian chromosomes are comprised of complex chromatin architecture with the specific assembly and configuration of each chromosome influencing gene expression and function in yet undefined ways by varying degrees of heterochromatinization that result in Giemsa (G) negative euchromatic (light) bands and G-positive heterochromatic (dark) bands. We carried out morphometric measurements of high-resolution chromosome ideograms for the first time to characterize the total euchromatic and heterochromatic chromosome band length, distribution and localization of 20,145 known protein-coding genes, 790 recognized autism spectrum disorder (ASD) genes and 365 obesity genes. The individual lengths of G-negative euchromatin and G-positive heterochromatin chromosome bands were measured in millimeters and recorded from scaled and stacked digital images of 850-band high-resolution ideograms supplied by the International Society of Chromosome Nomenclature (ISCN) 2013. Our overall measurements followed established banding patterns based on chromosome size. G-negative euchromatic band regions contained 60% of protein-coding genes while the remaining 40% were distributed across the four heterochromatic dark band sub-types. ASD genes were disproportionately overrepresented in the darker heterochromatic sub-bands, while the obesity gene distribution pattern did not significantly differ from protein-coding genes. Our study supports recent trends implicating genes located in heterochromatin regions playing a role in biological processes including neurodevelopment and function, specifically genes associated with ASD.
Affiliation(s)
- Ann M Manzardo
- Departments of Psychiatry & Behavioral Sciences and Pediatrics, University of Kansas Medical Center, Kansas City, KS 66160, USA.
- Merlin G Butler
- Departments of Psychiatry & Behavioral Sciences and Pediatrics, University of Kansas Medical Center, Kansas City, KS 66160, USA.
|
27
|
Singh P, Irwin DM. Contrasting Patterns in the Evolution of Vertebrate MLX Interacting Protein (MLXIP) and MLX Interacting Protein-Like (MLXIPL) Genes. PLoS One 2016; 11:e0149682. [PMID: 26910886 PMCID: PMC4766361 DOI: 10.1371/journal.pone.0149682] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 11/23/2015] [Accepted: 02/03/2016] [Indexed: 01/09/2023] Open
Abstract
ChREBP and MondoA are glucose-sensitive transcription factors that regulate aspects of energy metabolism. Here we performed a phylogenomic analysis of Mlxip (encoding MondoA) and Mlxipl (encoding ChREBP) genes across vertebrates. Analysis of extant Mlxip and Mlxipl genes suggests that the most recent common ancestor of these genes was composed of 17 coding exons. Single copy genes encoding both ChREBP and MondoA, along with their interacting partner Mlx, were found in diverse vertebrate genomes, including fish that have experienced a genome duplication. This observation suggests that a single Mlx gene has been retained to maintain coordinate regulation of ChREBP and MondoA. The ChREBP-β isoform, the more potent and constitutively active isoform, appeared with the evolution of tetrapods and is absent from the Mlxipl genes of fish. Evaluation of the conservation of ChREBP and MondoA sequences demonstrates that MondoA is better conserved and potentially mediates a more ancient function in glucose metabolism.
Affiliation(s)
- Parmveer Singh
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
- David M. Irwin
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
- Banting and Best Diabetes Centre, University of Toronto, Toronto, Ontario, Canada
|
28
|
Not low hanging but still sweet: Metabolic proteomes in cardiovascular disease. J Mol Cell Cardiol 2015; 90:70-3. [PMID: 26611885 DOI: 10.1016/j.yjmcc.2015.11.022] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/07/2015] [Revised: 11/18/2015] [Accepted: 11/19/2015] [Indexed: 11/22/2022]
Abstract
The application of proteomics in biology and medicine has reached a moment of truth. The demand of biologists for transformative insights into how cells work, plus the mandate of basic science research to ultimately impact clinical medicine, crystallize as a test on the rigor and reproducibility of any 'omics measurement. Studies like that by Boylston et al. indicate that proteomics can pass that test.
|
29
|
Baričević A, Štifanić M, Hamer B, Batel R. p63 gene structure in the phylum mollusca. Comp Biochem Physiol B Biochem Mol Biol 2015; 186:51-8. [PMID: 25936268 DOI: 10.1016/j.cbpb.2015.04.011] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/08/2015] [Revised: 04/22/2015] [Accepted: 04/22/2015] [Indexed: 11/26/2022]
Abstract
Roles of the p53 family ancestor (p63) in the organisms' response to stressful environmental conditions (mainly pollution) have been studied among molluscs, especially in the genus Mytilus, within the last 15 years. Nevertheless, information about the gene structure of this regulatory gene in molluscs is scarce. Here we report the first complete genomic structure of the p53 family orthologue in the mollusc Mediterranean mussel Mytilus galloprovincialis and confirm its similarity to the vertebrate p63 gene. Our searches within the available molluscan genomes (Aplysia californica, Lottia gigantea, Crassostrea gigas and Biomphalaria glabrata) found only one p53 family member present in a single copy per haploid genome. Comparative analysis of those orthologues additionally confirmed the conserved p63 gene structure. The conserved p63 gene structure can be a helpful tool to complement and/or revise gene annotations of any future p63 genomic sequence records in molluscs, but also in other animal phyla. Knowledge of the correct gene structure will enable better prediction of possible protein isoforms and their functions. Our analyses also pointed out possible mis-annotations of the p63 gene in sequenced molluscan genomes and stressed the value of manual inspection (based on alignments of cDNA and protein onto the genome sequence) for a reliable and complete gene annotation.
Affiliation(s)
- Ana Baričević
- Ruđer Bošković Institute, Center for Marine Research, Giordano Paliaga 5, 52210 Rovinj, Croatia.
- Bojan Hamer
- Ruđer Bošković Institute, Center for Marine Research, Giordano Paliaga 5, 52210 Rovinj, Croatia.
- Renato Batel
- Ruđer Bošković Institute, Center for Marine Research, Giordano Paliaga 5, 52210 Rovinj, Croatia.
|
30
|
WISCOD: a statistical web-enabled tool for the identification of significant protein coding regions. BioMed Research International 2014; 2014:282343. [PMID: 25313355 PMCID: PMC4181902 DOI: 10.1155/2014/282343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2013] [Revised: 12/18/2013] [Accepted: 02/11/2014] [Indexed: 11/17/2022]
Abstract
Classically, gene prediction programs are based on detecting signals such as boundary sites (splice sites, starts, and stops) and coding regions in the DNA sequence in order to build potential exons and join them into a gene structure. Although nowadays it is possible to improve their performance with additional information from related species and/or cDNA databases, further improvement at any step could help to obtain better predictions. Here, we present WISCOD, a web-enabled tool for the identification of significant protein coding regions, a novel software tool that tackles the exon prediction problem in eukaryotic genomes. WISCOD has the capacity to detect real exons from large lists of potential exons, and it provides an easy-to-use global P value, called the expected probability of being a false exon (EPFE), that is useful for ranking potential exons in a probabilistic framework without additional computational costs. The advantage of our approach is that it significantly increases the specificity and sensitivity (both between 80% and 90%) in comparison to other ab initio methods (where they are in the range of 70–75%). WISCOD is written in Java and R and is available to download and to run in a local mode on Linux and Windows platforms.
|
31
|
Yu CY, Liu HJ, Hung LY, Kuo HC, Chuang TJ. Is an observed non-co-linear RNA product spliced in trans, in cis or just in vitro? Nucleic Acids Res 2014; 42:9410-23. [PMID: 25053845 PMCID: PMC4132752 DOI: 10.1093/nar/gku643] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.3] [Indexed: 12/14/2022] Open
Abstract
Global transcriptome investigations often result in the detection of an enormous number of transcripts composed of non-co-linear sequence fragments. Such ‘aberrant’ transcript products may arise from post-transcriptional events or genetic rearrangements, or may otherwise be false positives (sequencing/alignment errors or in vitro artifacts). Moreover, post-transcriptionally non-co-linear (‘PtNcl’) transcripts can arise from trans-splicing or back-splicing in cis (to generate so-called ‘circular RNA’). Here, we collected previously-predicted human non-co-linear RNA candidates, and designed a validation procedure integrating in silico filters with multiple experimental validation steps to examine their authenticity. We showed that >50% of the tested candidates were in vitro artifacts, even though some had been previously validated by RT-PCR. After excluding the possibility of genetic rearrangements, we distinguished between trans-spliced and circular RNAs, and confirmed that these two splicing forms can share the same non-co-linear junction. Importantly, the experimentally-confirmed PtNcl RNA events and their corresponding PtNcl splicing types (i.e. trans-splicing, circular RNA, or both sharing the same junction) were all expressed in rhesus macaque, and some were even expressed in mouse. Our study thus describes an essential procedure for confirming PtNcl transcripts, and provides further insight into the evolutionary role of PtNcl RNA events, opening up this important, but understudied, class of post-transcriptional events for comprehensive characterization.
Affiliation(s)
- Chun-Ying Yu
- Institute of Cellular and Organismic Biology, Academia Sinica, Taipei 11529, Taiwan
- Hsiao-Jung Liu
- Institute of Cellular and Organismic Biology, Academia Sinica, Taipei 11529, Taiwan
- Li-Yuan Hung
- Division of Physical and Computational Genomics, Genomics Research Center, Academia Sinica, Taipei 11529, Taiwan
- Hung-Chih Kuo
- Institute of Cellular and Organismic Biology, Academia Sinica, Taipei 11529, Taiwan
- Trees-Juen Chuang
- Division of Physical and Computational Genomics, Genomics Research Center, Academia Sinica, Taipei 11529, Taiwan
|
32
|
Gotoh O, Morita M, Nelson DR. Assessment and refinement of eukaryotic gene structure prediction with gene-structure-aware multiple protein sequence alignment. BMC Bioinformatics 2014; 15:189. [PMID: 24927652 PMCID: PMC4065584 DOI: 10.1186/1471-2105-15-189] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 03/19/2014] [Accepted: 06/09/2014] [Indexed: 03/29/2024] Open
Abstract
Background Accurate computational identification of eukaryotic gene organization is a long-standing problem. Despite the fundamental importance of precise annotation of genes encoded in newly sequenced genomes, the accuracy of predicted gene structures has not been critically evaluated, mostly due to the scarcity of proper assessment methods. Results We present a gene-structure-aware multiple sequence alignment method for gene prediction using amino acid sequences translated from homologous genes from many genomes. The approach provides rich information concerning the reliability of each predicted gene structure. We have also devised an iterative method that attempts to improve the structures of suspiciously predicted genes based on a spliced alignment algorithm using consensus sequences or reliable homologs as templates. Application of our methods to cytochrome P450 and ribosomal proteins from 47 plant genomes indicated that 50–60% of the annotated gene structures are likely to contain some defects. Whereas more than half of the defect-containing genes may be intrinsically broken, i.e. they are pseudogenes or gene fragments, located in unfinished sequencing areas, or corresponding to non-productive isoforms, the defects found in a majority of the remaining gene candidates can be remedied by our iterative refinement method. Conclusions Refinement of eukaryotic gene structures mediated by gene-structure-aware multiple protein sequence alignment is a useful strategy to dramatically improve the overall prediction quality of a set of homologous genes. Our method will be applicable to various families of protein-coding genes if their domain structures are evolutionarily stable. It is also feasible to apply our method to gene families from all kingdoms of life, not just plants.
Affiliation(s)
- Osamu Gotoh
- Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST), Koto-ku, Tokyo 135-0064, Japan.
|
33
|
Khenoussi W, Vanhoutrève R, Poch O, Thompson JD. SIBIS: a Bayesian model for inconsistent protein sequence estimation. Bioinformatics 2014; 30:2432-9. [PMID: 24825613 DOI: 10.1093/bioinformatics/btu329] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The prediction of protein coding genes is a major challenge that depends on the quality of genome sequencing, the accuracy of the model used to elucidate the exonic structure of the genes and the complexity of the gene splicing process leading to different protein variants. As a consequence, today's protein databases contain a huge amount of inconsistency, due to both natural variants and sequence prediction errors. RESULTS We have developed a new method, called SIBIS, to detect such inconsistencies based on the evolutionary information in multiple sequence alignments. A Bayesian framework, combined with Dirichlet mixture models, is used to estimate the probability of observing specific amino acids and to detect inconsistent or erroneous sequence segments. We evaluated the performance of SIBIS on a reference set of protein sequences with experimentally validated errors and showed that the sensitivity is significantly higher than previous methods, with only a small loss of specificity. We also assessed a large set of human sequences from the UniProt database and found evidence of inconsistency in 48% of the previously uncharacterized sequences. We conclude that the integration of quality control methods like SIBIS in automatic analysis pipelines will be critical for the robust inference of structural, functional and phylogenetic information from these sequences. AVAILABILITY AND IMPLEMENTATION Source code, implemented in C on a Linux system, and the datasets of protein sequences are freely available for download at http://www.lbgi.fr/~julie/SIBIS.
Affiliation(s)
- Walyd Khenoussi
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
- Renaud Vanhoutrève
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
- Olivier Poch
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
- Julie D Thompson
- Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fédération de médecine translationnelle, Strasbourg, F-67085, France
|
34
|
Nasiri J, Naghavi M, Rad SN, Yolmeh T, Shirazi M, Naderi R, Nasiri M, Ahmadi S. Gene identification programs in bread wheat: a comparison study. Nucleosides, Nucleotides & Nucleic Acids 2014; 32:529-54. [PMID: 24124688 DOI: 10.1080/15257770.2013.832773] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/12/2022]
Abstract
Seven ab initio web-based gene prediction programs (AUGUSTUS, BGF, Fgenesh, Fgenesh+, GeneID, Genemark.hmm, and HMMgene) were compared for prediction accuracy on protein-coding sequences of bread wheat. At both the nucleotide and exon levels, Fgenesh+ was the best program, followed by BGF and then Fgenesh. At the gene level, by contrast, Fgenesh performed best, predicting more than 75% of all genes exactly. Fgenesh+, BGF, and Fgenesh yielded the highest percentages of correctly predicted exons and therefore appear to give the most trustworthy results, whereas GeneID and HMMgene would be expected to produce more false negatives. For initial exons, the 3' boundary was recognized accurately significantly more often than the 5' boundary; the reverse was true for terminal exons. Lastly, HMMgene and Genemark.hmm were largely insensitive to GC content, while the other programs appear to be slightly more sensitive when GC-poor sequences are used. Overall, our results show that gene finders still need further improvement to deliver reliable results.
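Nucleotide-level comparisons of this kind are conventionally reported as sensitivity and specificity over coding positions. The following is a minimal sketch, assuming the usual gene-finder convention in which specificity means TP / (TP + FP); the function name and the toy exon coordinates are hypothetical and are not taken from the study.

```python
def nucleotide_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) over coding nucleotides.

    Exons are (start, end) pairs, 0-based with an exclusive end.
    Specificity follows the gene-prediction convention TP / (TP + FP).
    """
    true_pos = {i for s, e in true_exons for i in range(s, e)}
    pred_pos = {i for s, e in predicted_exons for i in range(s, e)}
    tp = len(true_pos & pred_pos)  # coding positions found by the predictor
    sensitivity = tp / len(true_pos) if true_pos else 0.0
    specificity = tp / len(pred_pos) if pred_pos else 0.0
    return sensitivity, specificity


# Toy example: annotated exons versus exons from a hypothetical predictor.
annotated = [(100, 200), (300, 400)]
predicted = [(120, 200), (300, 450)]
sn, sp = nucleotide_accuracy(annotated, predicted)
print(f"sensitivity={sn:.3f} specificity={sp:.3f}")
```

Exon-level and gene-level scores are stricter, since they typically count a prediction as correct only when both boundaries (or all exons of the gene) match exactly.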
Affiliation(s)
- Jaber Nasiri
- Department of Agronomy and Plant Breeding, Division of Molecular Plant Genetics, College of Agricultural & Natural Resources, University of Tehran, Karaj, Tehran, Iran
|
35
|
Nagy A, Patthy L. FixPred: a resource for correction of erroneous protein sequences. Database: The Journal of Biological Databases and Curation 2014; 2014:bau032. [PMID: 24705206 PMCID: PMC3975993 DOI: 10.1093/database/bau032] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 11/29/2022]
Abstract
Protein databases are heavily contaminated with erroneous (mispredicted, abnormal and incomplete) sequences and these erroneous data significantly distort the conclusions drawn from genome-scale protein sequence analyses. In our earlier work we described the MisPred resource that serves to identify erroneous sequences; here we present the FixPred computational pipeline that automatically corrects sequences identified by MisPred as erroneous. The current version of the associated FixPred database contains corrected UniProtKB/Swiss-Prot and NCBI/RefSeq sequences from Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Danio rerio, Fugu rubripes, Ciona intestinalis, Branchiostoma floridae, Drosophila melanogaster and Caenorhabditis elegans; future releases of the FixPred database will include corrected sequences of additional Metazoan species. The FixPred computational pipeline and database (http://www.fixpred.com) are easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. Database URL: http://www.fixpred.com
Affiliation(s)
- László Patthy
- *Corresponding author: Tel: +361 279 3100; Fax: +361 466 5465
|
36
|
Abstract
Genome editing in human cells is of great value in research, medicine, and biotechnology. Programmable nucleases including zinc-finger nucleases, transcription activator-like effector nucleases, and RNA-guided engineered nucleases recognize a specific target sequence and make a double-strand break at that site, which can result in gene disruption, gene insertion, gene correction, or chromosomal rearrangements. The target sequence complexities of these programmable nucleases are higher than 3.2 giga base pairs, the size of the haploid human genome. Here, we briefly introduce the structure of the human genome and the characteristics of each programmable nuclease, and review their applications in human cells including pluripotent stem cells. In addition, we discuss various delivery methods for nucleases, programmable nickases, and enrichment of gene-edited human cells, all of which facilitate efficient and precise genome editing in human cells.
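One way to read the complexity claim is to count how many distinct sequences a recognition site of a given length can take and compare that with the genome size. The back-of-the-envelope sketch below assumes four equiprobable bases and a haploid human genome of roughly 3.2 × 10^9 base pairs; the function name is hypothetical, and this is not a model of real off-target behaviour.

```python
HAPLOID_GENOME_BP = 3.2e9  # assumed approximate size of the haploid human genome


def min_site_length(genome_bp: float) -> int:
    """Smallest site length n for which 4**n exceeds genome_bp."""
    n = 1
    while 4 ** n <= genome_bp:
        n += 1
    return n


n = min_site_length(HAPLOID_GENOME_BP)
print(n, 4 ** n)
```

Under these assumptions, a recognition site of 16 bp or longer has more possible sequences than there are positions on one strand of the haploid genome, so a given site would be expected to occur less than once by chance in a random sequence of that length.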
Affiliation(s)
- Minjung Song
- Graduate School of Biomedical Science and Engineering, College of Medicine, Hanyang University, Seoul, South Korea
- Young-Hoon Kim
- Graduate School of Biomedical Science and Engineering, College of Medicine, Hanyang University, Seoul, South Korea
- Jin-Soo Kim
- Center for Genome Engineering, Institute for Basic Science, Seoul, South Korea; Department of Chemistry, Seoul National University, Seoul, South Korea.
- Hyongbum Kim
- Graduate School of Biomedical Science and Engineering, College of Medicine, Hanyang University, Seoul, South Korea.
|
37
|
Adi SS, Ferreira CE. Syntenic global alignment and its application to the gene prediction problem. Journal of the Brazilian Computer Society 2013. [DOI: 10.1007/s13173-013-0115-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/26/2022]
Abstract
Given the increasing number of available genomic sequences, one now faces the task of identifying their protein coding regions. The gene prediction problem can be addressed in several ways, and one of the most promising methods makes use of information derived from the comparison of homologous sequences. In this work, we develop a new comparative-based gene prediction program, called Exon_Finder2. This tool is based on a new type of alignment we propose, called syntenic global alignment, that can deal satisfactorily with sequences that share regions with different rates of conservation. In addition to this new type of alignment itself, we also describe a dynamic programming algorithm that computes a best syntenic global alignment of two sequences, as well as its related score. The applicability of our approach was validated by the promising initial results achieved by Exon_Finder2. On a benchmark including 120 pairs of human and mouse genomic sequences, most of their encoded genes were successfully identified by our program.
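The abstract does not spell out the recurrence used for syntenic global alignment, so the sketch below shows only the classical global alignment (Needleman-Wunsch) dynamic program that comparative methods of this kind build on. The scoring values and the function name are assumptions chosen for illustration; they are not parameters of Exon_Finder2.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Score of a best global alignment of a and b with a linear gap cost."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

Keeping only two rows of the table gives the score in memory linear in the length of the second sequence; recovering the alignment itself would require the full table or a divide-and-conquer traceback.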
|
38
|
Nagy A, Patthy L. MisPred: a resource for identification of erroneous protein sequences in public databases. Database: The Journal of Biological Databases and Curation 2013; 2013:bat053. [PMID: 23864220 PMCID: PMC3713709 DOI: 10.1093/database/bat053] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 11/12/2022]
Abstract
Correct prediction of the structure of protein-coding genes of higher eukaryotes is still a difficult task; therefore, public databases are heavily contaminated with mispredicted sequences. The high rate of misprediction has serious consequences because it significantly affects the conclusions that may be drawn from genome-scale sequence analyses of eukaryotic genomes. Here we present the MisPred database and computational pipeline that provide efficient means for the identification of erroneous sequences in public databases. The MisPred database contains a collection of abnormal, incomplete and mispredicted protein sequences from 19 metazoan species identified as erroneous by MisPred quality control tools in the UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, NCBI/RefSeq and EnsEMBL databases. Major releases of the database are automatically generated and updated regularly. The database (http://www.mispred.com) is easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in a variety of formats. DATABASE URL: http://www.mispred.com.
Affiliation(s)
- Alinda Nagy
- Institute of Enzymology, Research Centre for Natural Sciences, Hungarian Academy of Sciences, H-1113 Budapest, Hungary
|
39
|
McGraw S, Shojaei Saadi HA, Robert C. Meeting the methodological challenges in molecular mapping of the embryonic epigenome. Mol Hum Reprod 2013; 19:809-27. [PMID: 23783346 DOI: 10.1093/molehr/gat046] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 12/27/2022] Open
Abstract
The past decade of life sciences research has been driven by progress in genomics. Many voices are already proclaiming the post-genomics era, in which phenomena other than sequence polymorphism influence gene expression and also explain complex phenotypes. One of these burgeoning fields is the study of the epigenome. Although the mechanisms by which chromatin structure and reorganization as well as cytosine methylation influence gene expression are not fully understood, they are being invoked to explain the now-accepted long-term impact of the environment on gene expression, which appears to be a factor in the development of numerous diseases. Such studies are particularly relevant in early embryonic development, during which waves of epigenetic reprogramming are known to have profound impacts. Since gametes and zygotes are in the process of resetting the genome in order to create embryonic stem cells that will each differentiate to create one of many specific tissue types, this phase of life is now viewed as a window of susceptibility to epigenetic reprogramming errors. Epigenetics could explain the influence of factors such as the nutritional/metabolic status of the mother or the artificial environment of assisted reproductive technologies. However, the peculiar nature of early embryos in addition to their scarcity poses numerous technological challenges that are slowly being overcome. The principal aim of this article is to review the suitability of various current and emerging technological platforms for studying the oocyte and early embryonic epigenome, with emphasis on DNA methylation. Furthermore, the constraint of sample size, inherent to the study of preimplantation embryo development, is put in perspective with the various molecular platforms described.
Affiliation(s)
- Serge McGraw
- Department of Human Genetics, Montreal Children's Hospital Research Institute, McGill University, Montréal, QC H3Z 2Z3, Canada
|
40
|
Huang YH, Wu HY, Wu KM, Liu TT, Liou RF, Tsai SF, Shiao MS, Ho LT, Tzean SS, Yang UC. Generation and analysis of the expressed sequence tags from the mycelium of Ganoderma lucidum. PLoS One 2013; 8:e61127. [PMID: 23658685 PMCID: PMC3642047 DOI: 10.1371/journal.pone.0061127] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2011] [Accepted: 03/07/2013] [Indexed: 12/24/2022] Open
Abstract
Ganoderma lucidum (G. lucidum) is a medicinal mushroom renowned in East Asia for its potential biological effects. To enable a systematic exploration of the genes associated with the various phenotypes of the fungus, the genome consortium of G. lucidum has carried out an expressed sequence tag (EST) sequencing project. Using a Sanger sequencing based approach, 47,285 ESTs were obtained from in vitro cultures of G. lucidum mycelium of various durations. These ESTs were further clustered and merged into 7,774 non-redundant expressed loci. The features of these expressed contigs were explored in terms of over-representation, alternative splicing, and natural antisense transcripts. Our results provide an invaluable information resource for exploring the G. lucidum transcriptome and its regulation. Many cases of the genes over-represented in fast-growing dikaryotic mycelium are closely related to growth, such as cell wall and bioactive compound synthesis. In addition, the EST-genome alignments containing putative cassette exons and retained introns were manually curated and then used to make inferences about the predominating splice-site recognition mechanism of G. lucidum. Moreover, a number of putative antisense transcripts have been pinpointed, from which we noticed that two cases are likely to reveal hitherto undiscovered biological pathways. To allow users to access the data and the initial analysis of the results of this project, a dedicated web site has been created at http://csb2.ym.edu.tw/est/.
Affiliation(s)
- Yen-Hua Huang
- Department of Biochemistry, Faculty of Medicine, School of Medicine, National Yang-Ming University, Taipei City, Taiwan, R.O.C.
- Center for Systems and Synthetic Biology, National Yang-Ming University, Taipei City, Taiwan, R.O.C.
- Hung-Yi Wu
- Department of Plant Pathology and Microbiology, National Taiwan University, Taipei City, Taiwan, R.O.C.
- Keh-Ming Wu
- VYM Genome Research Center, National Yang-Ming University, Taipei City, Taiwan, R.O.C.
- Tze-Tze Liu
- VYM Genome Research Center, National Yang-Ming University, Taipei City, Taiwan, R.O.C.
- Ruey-Fen Liou
- Department of Plant Pathology and Microbiology, National Taiwan University, Taipei City, Taiwan, R.O.C.
- Shih-Feng Tsai
- VYM Genome Research Center, National Yang-Ming University, Taipei City, Taiwan, R.O.C.
- Ming-Shi Shiao
- Medical Research and Education Department, Taipei Veterans General Hospital, Taipei City, Taiwan, R.O.C.
- Low-Tone Ho
- Medical Research and Education Department, Taipei Veterans General Hospital, Taipei City, Taiwan, R.O.C.
- Shean-Shong Tzean
- Department of Plant Pathology and Microbiology, National Taiwan University, Taipei City, Taiwan, R.O.C.
- Ueng-Cheng Yang
- Institute of Biomedical Informatics, College of Life Science, National Yang-Ming University, Taipei City, Taiwan, R.O.C.
- Center for Systems and Synthetic Biology, National Yang-Ming University, Taipei City, Taiwan, R.O.C.
|
41
|
Moore AD, Grath S, Schüler A, Huylmans AK, Bornberg-Bauer E. Quantification and functional analysis of modular protein evolution in a dense phylogenetic tree. BIOCHIMICA ET BIOPHYSICA ACTA-PROTEINS AND PROTEOMICS 2013; 1834:898-907. [PMID: 23376183 DOI: 10.1016/j.bbapap.2013.01.007] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/19/2012] [Revised: 01/06/2013] [Accepted: 01/09/2013] [Indexed: 12/24/2022]
Abstract
Modularity is a hallmark of molecular evolution. Whether considering gene regulation, the components of metabolic pathways or signaling cascades, the ability to reuse autonomous modules in different molecular contexts can expedite evolutionary innovation. Similarly, protein domains are the modules of proteins, and modular domain rearrangements can create diversity with seemingly few operations in turn allowing for swift changes to an organism's functional repertoire. Here, we assess the patterns and functional effects of modular rearrangements at high resolution. Using a well resolved and diverse group of pancrustaceans, we illustrate arrangement diversity within closely related organisms, estimate arrangement turnover frequency and establish, for the first time, branch-specific rate estimates for fusion, fission, domain addition and terminal loss. Our results show that roughly 16 new arrangements arise per million years and that between 64% and 81% of these can be explained by simple, single-step modular rearrangement events. We find evidence that the frequencies of fission and terminal deletion events increase over time, and that modular rearrangements impact all levels of the cellular signaling apparatus and thus may have strong adaptive potential. Novel arrangements that cannot be explained by simple modular rearrangements contain a significant amount of repeat domains that occur in complex patterns which we term "supra-repeats". Furthermore, these arrangements are significantly longer than those with a single-step rearrangement solution, suggesting that such arrangements may result from multi-step events. In summary, our analysis provides an integrated view and initial quantification of the patterns and functional impact of modular protein evolution in a well resolved phylogenetic tree. This article is part of a Special Issue entitled: The emerging dynamic view of proteins: Protein plasticity in allostery, evolution and self-assembly.
Affiliation(s)
- Andrew D Moore
- Institute for Evolution and Biodiversity, Münster, Germany
|
42
|
Goodswen SJ, Kennedy PJ, Ellis JT. Evaluating high-throughput ab initio gene finders to discover proteins encoded in eukaryotic pathogen genomes missed by laboratory techniques. PLoS One 2012; 7:e50609. [PMID: 23226328 PMCID: PMC3511556 DOI: 10.1371/journal.pone.0050609] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2012] [Accepted: 10/24/2012] [Indexed: 11/25/2022] Open
Abstract
Next generation sequencing technology is advancing genome sequencing at an unprecedented level. By unravelling the code within a pathogen’s genome, every possible protein (prior to post-translational modifications) can theoretically be discovered, irrespective of life cycle stages and environmental stimuli. Now more than ever there is a great need for high-throughput ab initio gene finding. Ab initio gene finders use statistical models to predict genes and their exon-intron structures from the genome sequence alone. This paper evaluates whether existing ab initio gene finders can effectively predict genes to deduce proteins that have so far escaped capture by laboratory techniques. One aim is to identify possible patterns of prediction inaccuracies for gene finders as a whole, irrespective of the target pathogen. All currently available ab initio gene finders are considered in the evaluation but only four fulfil high-throughput capability: AUGUSTUS, GeneMark_hmm, GlimmerHMM, and SNAP. These gene finders require training data specific to a target pathogen and consequently the evaluation results are inextricably linked to the availability and quality of the data. The pathogen, Toxoplasma gondii, is used to illustrate the evaluation methods. The results support current opinion that exons predicted by ab initio gene finders are inaccurate in the absence of experimental evidence. However, the results reveal some patterns of inaccuracy that are common to all gene finders and these inaccuracies may provide a focus area for future gene finder developers.
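As an illustration of how exon predictions from a gene finder might be scored against a reference annotation, the sketch below computes exon-level sensitivity and specificity in the convention common to gene-prediction benchmarks (specificity here is the fraction of predicted exons that are exactly correct). The abstract does not state which metrics the authors used, so the metric choice, the function name `exon_level_accuracy`, and all coordinates are assumptions made for illustration only.

```python
from typing import Set, Tuple

# An exon is identified by (sequence id, start, end, strand).
Exon = Tuple[str, int, int, str]


def exon_level_accuracy(reference: Set[Exon], predicted: Set[Exon]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) counting only exact exon matches.

    sensitivity = exactly matched exons / reference exons
    specificity = exactly matched exons / predicted exons
    """
    exact = reference & predicted
    sensitivity = len(exact) / len(reference) if reference else 0.0
    specificity = len(exact) / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical coordinates, not taken from the paper.
reference = {
    ("chrI", 100, 250, "+"),
    ("chrI", 400, 520, "+"),
    ("chrI", 900, 1100, "-"),
}
predicted = {
    ("chrI", 100, 250, "+"),    # exact match
    ("chrI", 410, 520, "+"),    # wrong start boundary, counted as a miss
    ("chrI", 900, 1100, "-"),   # exact match
    ("chrI", 1500, 1600, "+"),  # exon absent from the reference
}

sn, sp = exon_level_accuracy(reference, predicted)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```

Requiring exact boundary matches is deliberately strict: a prediction that is off by a few bases at one splice site counts as both a missed reference exon and a wrong prediction, which is one way systematic boundary errors show up in such an evaluation.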
Affiliation(s)
- Stephen J. Goodswen
- School of Medical and Molecular Sciences, and the Ithree Institute at the University of Technology Sydney (UTS), New South Wales, Australia
- Paul J. Kennedy
- School of Software, Faculty of Engineering and Information Technology and the Centre for Quantum Computation and Intelligent Systems at the University of Technology Sydney (UTS), New South Wales, Australia
- John T. Ellis
- School of Medical and Molecular Sciences, and the Ithree Institute at the University of Technology Sydney (UTS), New South Wales, Australia
|
43
|
Moore N, Jaromczyk JW, Schardl CL. Finding long protein products of alternatively spliced genes. BMC Bioinformatics 2012. [PMCID: PMC3409078 DOI: 10.1186/1471-2105-13-s12-a12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
|
44
|
Guo B, Zou M, Wagner A. Pervasive indels and their evolutionary dynamics after the fish-specific genome duplication. Mol Biol Evol 2012; 29:3005-22. [PMID: 22490820 DOI: 10.1093/molbev/mss108] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023] Open
Abstract
Insertions and deletions (indels) in protein-coding genes are important sources of genetic variation. Their role in creating new proteins may be especially important after gene duplication. However, little is known about how indels affect the divergence of duplicate genes. We here study thousands of duplicate genes in five fish (teleost) species with completely sequenced genomes. The ancestor of these species has been subject to a fish-specific genome duplication (FSGD) event that occurred approximately 350 Ma. We find that duplicate genes contain at least 25% more indels than single-copy genes. These indels accumulated preferentially in the first 40 my after the FSGD. A lack of widespread asymmetric indel accumulation indicates that both members of a duplicate gene pair typically experience relaxed selection. Strikingly, we observe a 30-80% excess of deletions over insertions that is consistent for indels of various lengths and across the five genomes. We also find that indels preferentially accumulate inside loop regions of protein secondary structure and in regions where amino acids are exposed to solvent. We show that duplicate genes with high indel density also show high DNA sequence divergence. Indel density, but not amino acid divergence, can explain a large proportion of the tertiary structure divergence between proteins encoded by duplicate genes. Our observations are consistent across all five fish species. Taken together, they suggest a general pattern of duplicate gene evolution in which indels are important driving forces of evolutionary change.
Affiliation(s)
- Baocheng Guo
- Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
|
45
|
Prosdocimi F, Linard B, Pontarotti P, Poch O, Thompson JD. Controversies in modern evolutionary biology: the imperative for error detection and quality control. BMC Genomics 2012; 13:5. [PMID: 22217008 PMCID: PMC3311146 DOI: 10.1186/1471-2164-13-5] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2011] [Accepted: 01/04/2012] [Indexed: 12/03/2022] Open
Abstract
Background The data from high throughput genomics technologies provide unique opportunities for studies of complex biological systems, but also pose many new challenges. The shift to the genome scale in evolutionary biology, for example, has led to many interesting, but often controversial studies. It has been suggested that part of the conflict may be due to errors in the initial sequences. Most gene sequences are predicted by bioinformatics programs and a number of quality issues have been raised, concerning DNA sequencing errors or badly predicted coding regions, particularly in eukaryotes. Results We investigated the impact of these errors on evolutionary studies and specifically on the identification of important genetic events. We focused on the detection of asymmetric evolution after duplication, which has been the subject of controversy recently. Using the human genome as a reference, we established a reliable set of 688 duplicated genes in 13 complete vertebrate genomes, where significantly different evolutionary rates are observed. We estimated the rates at which protein sequence errors occur and are accumulated in the higher-level analyses. We showed that the majority of the detected events (57%) are in fact artifacts due to the putative erroneous sequences and that these artifacts are sufficient to mask the true functional significance of the events. Conclusions Initial errors are accumulated throughout the evolutionary analysis, generating artificially high rates of event predictions and leading to substantial uncertainty in the conclusions. This study emphasizes the urgent need for error detection and quality control strategies in order to efficiently extract knowledge from the new genome data.
Affiliation(s)
- Francisco Prosdocimi
- Department of Integrated Structural Biology, IGBMC (Institut de Génétique et de Biologie Moléculaire et Cellulaire) CNRS/INSERM/Université de Strasbourg, 1 rue Laurent Fries, Illkirch, F-67404, France
|
46
|
Abstract
Evolutionary genomics is a field that relies heavily upon comparing genomes, that is, the full complement of genes of one species with another. However, given a genome sequence and little else, as is now often the case, genes must first be found and annotated before downstream analyses can be done. Computational gene prediction techniques are brought to bear on the problem of constructing a genome annotation as manual annotation is extremely time-consuming and costly. This chapter reviews the methods by which the individual components of a typical gene structure are detected in genomic sequence and then discusses several popular statistical frameworks for integrated gene prediction on eukaryotic genome sequences.
Affiliation(s)
- Tyler Alioto
- Centro Nacional de Análisis Genómico, Barcelona, Spain.
|
47
|
Fernández-Pozo N, Canales J, Guerrero-Fernández D, Villalobos DP, Díaz-Moreno SM, Bautista R, Flores-Monterroso A, Guevara MÁ, Perdiguero P, Collada C, Cervera MT, Soto A, Ordás R, Cantón FR, Avila C, Cánovas FM, Claros MG. EuroPineDB: a high-coverage web database for maritime pine transcriptome. BMC Genomics 2011; 12:366. [PMID: 21762488 PMCID: PMC3152544 DOI: 10.1186/1471-2164-12-366] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2011] [Accepted: 07/15/2011] [Indexed: 11/30/2022] Open
Abstract
Background Pinus pinaster is an economically and ecologically important species that is becoming a woody gymnosperm model. Its enormous genome size makes whole-genome sequencing approaches hard to apply. Therefore, the expressed portion of the genome has to be characterised and the results and annotations have to be stored in dedicated databases. Description EuroPineDB is the largest sequence collection available for a single pine species, Pinus pinaster (maritime pine), since it comprises 951 641 raw sequence reads obtained from non-normalised cDNA libraries and high-throughput sequencing from adult (xylem, phloem, roots, stem, needles, cones, strobili) and embryonic (germinated embryos, buds, callus) maritime pine tissues. Using open-source tools, sequences were optimally pre-processed, assembled, and extensively annotated (GO, EC and KEGG terms, descriptions, SNPs, SSRs, ORFs and InterPro codes). As a result, a 10.5× P. pinaster genome was covered and assembled in 55 322 UniGenes. A total of 32 919 (59.5%) of P. pinaster UniGenes were annotated with at least one description, revealing at least 18 466 different genes. The complete database, which is designed to be scalable, maintainable, and expandable, is freely available at: http://www.scbi.uma.es/pindb/. It can be retrieved by gene libraries, pine species, annotations, UniGenes and microarrays (i.e., the sequences are distributed in two-colour microarrays; this is the only conifer database that provides this information) and will be periodically updated. Small assemblies can be viewed using a dedicated visualisation tool that connects them with SNPs. Any sequence or annotation set shown on-screen can be downloaded. Retrieval mechanisms for sequences and gene annotations are provided.
Conclusions The EuroPineDB with its integrated information can be used to reveal new knowledge, offers an easy-to-use collection of information to directly support experimental work (including microarray hybridisation), and provides deeper knowledge on the maritime pine transcriptome.
Affiliation(s)
- Noé Fernández-Pozo
- Departamento de Biología Molecular y Bioquímica, Facultad de Ciencias, Campus de Teatinos s/n, Universidad de Málaga, 29071 Málaga, Spain
|
48
|
Reassessing domain architecture evolution of metazoan proteins: major impact of gene prediction errors. Genes (Basel) 2011; 2:449-501. [PMID: 24710207 PMCID: PMC3927609 DOI: 10.3390/genes2030449] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2011] [Revised: 06/14/2011] [Accepted: 06/20/2011] [Indexed: 11/17/2022] Open
Abstract
In view of the fact that the appearance of novel protein domain architectures (DA) is closely associated with biological innovations, there is a growing interest in the genome-scale reconstruction of the evolutionary history of the domain architectures of multidomain proteins. In such analyses, however, it is usually ignored that a significant proportion of the Metazoan sequences analyzed is mispredicted and that this may seriously affect the validity of the conclusions. To estimate the contribution of errors in gene prediction to differences in DA of predicted proteins, we have used the high quality manually curated UniProtKB/Swiss-Prot database as a reference. For genome-scale analysis of domain architectures of predicted proteins we focused on RefSeq, EnsEMBL and NCBI's GNOMON predicted sequences of Metazoan species with completely sequenced genomes. Comparison of the DA of UniProtKB/Swiss-Prot sequences of worm, fly, zebrafish, frog, chick, mouse, rat and orangutan with those of human Swiss-Prot entries has identified relatively few cases where orthologs had different DA, although the percentage with different DA increased with evolutionary distance. In contrast with this, comparison of the DA of human, orangutan, rat, mouse, chicken, frog, zebrafish, worm and fly RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences with those of the corresponding/orthologous human Swiss-Prot entries identified a significantly higher proportion of domain architecture differences than in the case of the comparison of Swiss-Prot entries. Analysis of RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences with DAs different from those of their Swiss-Prot orthologs confirmed that the higher rate of domain architecture differences is due to errors in gene prediction, the majority of which could be corrected with our FixPred protocol.
We have also demonstrated that contamination of databases with incomplete, abnormal or mispredicted sequences introduces a bias in DA differences inasmuch as it increases the proportion of terminal over internal DA differences. Here we have shown that in the case of RefSeq, EnsEMBL and NCBI's GNOMON predicted protein sequences of Metazoan species, the contribution of gene prediction errors to domain architecture differences of orthologs is comparable to or greater than that due to true gene rearrangements. We have also demonstrated that domain architecture comparison may serve as a useful tool for the quality control of gene predictions and may thus guide the correction of sequence errors. Our findings caution that earlier genome-scale studies based on comparison of predicted (frequently mispredicted) protein sequences may have led to some erroneous conclusions about the evolution of novel domain architectures of multidomain proteins. A reassessment of the DA evolution of orthologous and paralogous proteins is presented in an accompanying paper [1].
|
49
|
Cruz JA, Westhof E. Identification and annotation of noncoding RNAs in Saccharomycotina. C R Biol 2011; 334:671-8. [PMID: 21819949 DOI: 10.1016/j.crvi.2011.05.016] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2010] [Accepted: 03/23/2011] [Indexed: 11/16/2022]
Abstract
The importance of ncRNAs in biological processes makes their annotation an essential component of any genome-sequencing project. The identification of ncRNAs in genomes requires specific expertise and tools that are distinct from the traditional protein gene annotation tools. Here, we describe the assembly of two automatic annotation pipelines, integrating publicly available tools, for homology and de novo ncRNA search in genomes. We applied both pipelines to 10 Saccharomycotina genomes and were able to find and annotate 693 ncRNA genes, corresponding to 81% of the ncRNAs expected for those genomes assuming the number of ncRNAs in Saccharomyces cerevisiae (86) as a reference. Several new ncRNAs, not yet known in the Saccharomycotina clade, were also detected. The results show the feasibility of automatic search for ncRNAs in full genomes and the utility of such approaches in large multi-genome sequencing and annotation projects.
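The 81% figure above can be reproduced with a short calculation, assuming (as the abstract's wording suggests, though it does not spell this out) that the expected total is the S. cerevisiae reference count of 86 ncRNAs applied to each of the 10 genomes. The variable names are illustrative only.

```python
# Figures taken from the abstract.
reference_ncrnas_per_genome = 86  # ncRNAs in Saccharomyces cerevisiae
genomes = 10                      # Saccharomycotina genomes analysed
annotated = 693                   # ncRNA genes found and annotated

# Assumption: every genome is expected to carry the reference number of ncRNAs.
expected = reference_ncrnas_per_genome * genomes
recovered_pct = 100 * annotated / expected

print(f"{annotated}/{expected} = {recovered_pct:.1f}%")  # prints "693/860 = 80.6%"
```

Rounded to the nearest whole percent, this gives the 81% reported in the abstract.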
Affiliation(s)
- José Almeida Cruz
- Architecture et Réactivité de l'ARN, Institut de Biologie Moléculaire et Cellulaire du CNRS, Université de Strasbourg, 15 rue René-Descartes, 67084 Strasbourg cedex, France.
|
50
|
Gao Y, Wahlberg P, Marthey S, Esquerré D, Jaffrézic F, Lecardonnel J, Hugot K, Rogel-Gaillard C. Analysis of porcine MHC using microarrays. Vet Immunol Immunopathol 2011; 148:78-84. [PMID: 21561666 DOI: 10.1016/j.vetimm.2011.04.007] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2010] [Revised: 03/23/2011] [Accepted: 04/09/2011] [Indexed: 11/26/2022]
Abstract
The major histocompatibility complex (MHC) in mammals is one of the most gene-dense regions of the genome and contains the polymorphic histocompatibility gene families known to be involved in pathogen response and control of auto-immunity. The MHC is a complex genetic system that provides an interesting model system to study genome expression regulation and genetic diversity at the megabase scale. The pig MHC or SLA (Swine Leucocyte Antigen) complex spans 2.4 megabases and 151 loci have been annotated. We review key results from previous RNA expression studies using microarrays containing probes specific to annotated loci within SLA and in addition present novel data obtained using high-density tiling arrays encompassing the whole SLA complex. We have focused on transcriptome modifications of porcine peripheral blood mononuclear cells stimulated with a mixture of phorbol myristate acetate and ionomycin known to activate B and T cell proliferation. Our results show that numerous loci mapping to the SLA complex are affected by the treatment. A general decrease in the expression of class I and II genes and an up-regulation of genes involved in peptide processing and transport were observed. Tiling array-based experiments contributed to refined gene annotations as presented for one SLA class I gene referred to as SLA-11. In conclusion, high-density tiling arrays can serve as an excellent tool to draw comprehensive transcription maps, and improve genome annotations for the SLA complex. We are currently studying their relevance to characterize SLA genetic diversity in combination with high throughput next generation sequencing.
Affiliation(s)
- Yu Gao
- INRA, UMR 1313 de Génétique Animale et Biologie Intégrative, Domaine de Vilvert, 78350 Jouy-en-Josas, France
|