151
|
Machine Learning for detection of viral sequences in human metagenomic datasets. BMC Bioinformatics 2018; 19:336. [PMID: 30249176 PMCID: PMC6154907 DOI: 10.1186/s12859-018-2340-x] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Received: 09/26/2017] [Accepted: 08/28/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Detection of highly divergent or yet unknown viruses from metagenomics sequencing datasets is a major bioinformatics challenge. When human samples are sequenced, a large proportion of assembled contigs are classified as "unknown", as conventional methods find no similarity to known sequences. We wished to explore whether machine learning algorithms using Relative Synonymous Codon Usage frequency (RSCU) could improve the detection of viral sequences in metagenomic sequencing data. RESULTS We trained Random Forest and Artificial Neural Network using metagenomic sequences taxonomically classified into virus and non-virus classes. The algorithms achieved accuracies well beyond chance level, with area under ROC curve 0.79. Two codons (TCG and CGC) were found to have a particularly strong discriminative capacity. CONCLUSION RSCU-based machine learning techniques applied to metagenomic sequencing data can help identify a large number of putative viral sequences and provide an addition to conventional methods for taxonomic classification.
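The RSCU feature named in this abstract can be illustrated with a minimal sketch. It assumes the usual definition (observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally) and the standard genetic code; the function name `rscu` and the choice to skip stop codons and single-codon amino acids are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(cds):
    """Return {codon: RSCU} for an in-frame coding sequence (hypothetical helper)."""
    cds = cds.upper().replace("U", "T")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group codons into synonymous families by encoded amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, family in families.items():
        total = sum(counts[c] for c in family)
        # Skip stops, single-codon amino acids (Met, Trp) and unobserved amino acids.
        if aa == "*" or len(family) == 1 or total == 0:
            continue
        expected = total / len(family)  # count per codon under equal usage
        for c in family:
            values[c] = counts[c] / expected
    return values
```

For example, `rscu("CAGCAGCAGCAA")` gives 1.5 for CAG and 0.5 for CAA. A vector of such values per contig is the kind of input a classifier could be trained on, although the exact feature construction used by the authors is not stated here.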
|
152
|
Medina-Carmona E, Betancor-Fernández I, Santos J, Mesa-Torres N, Grottelli S, Batlle C, Naganathan AN, Oppici E, Cellini B, Ventura S, Salido E, Pey AL. Insight into the specificity and severity of pathogenic mechanisms associated with missense mutations through experimental and structural perturbation analyses. Hum Mol Genet 2018; 28:1-15. [DOI: 10.1093/hmg/ddy323] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 08/14/2018] [Accepted: 09/09/2018] [Indexed: 12/21/2022] Open
Abstract
Most pathogenic missense mutations cause specific molecular phenotypes through protein destabilization. However, how protein destabilization is manifested as a given molecular phenotype is not well understood. We develop here a structural and energetic approach to describe mutational effects on specific traits such as function, regulation, stability, subcellular targeting or aggregation propensity. This approach is tested using large-scale experimental and structural perturbation analyses in over thirty mutations in three different proteins (cancer-associated NQO1, transthyretin related with amyloidosis and AGT linked to primary hyperoxaluria type I) and comprising five very common pathogenic mechanisms (loss-of-function and gain-of-toxic function aggregation, enzyme inactivation, protein mistargeting and accelerated degradation). Our results revealed that the magnitude of destabilizing effects and, particularly, their propagation through the structure to promote disease-associated conformational states largely determine the severity and molecular mechanisms of disease-associated missense mutations. Modulation of the structural perturbation at a mutated site is also shown to cause switches between different molecular phenotypes. When very common disease-associated missense mutations were investigated, we also found that they were not among the most deleterious possible missense mutations at those sites, and required additional contributions from codon bias and effects of CpG sites to explain their high frequency in patients. Our work sheds light on the molecular basis of pathogenic mechanisms and genotype–phenotype relationships, with implications for discriminating between pathogenic and neutral changes within human genome variability from whole genome sequencing studies.
Affiliation(s)
- Encarnación Medina-Carmona
  - Department of Physical Chemistry, University of Granada, Granada, Spain
  - Department of Experimental Medicine, University of Perugia, Piazzale Gambuli, Perugia
- Isabel Betancor-Fernández
  - Centre for Biomedical Research on Rare Diseases, Hospital Universitario de Canarias, Tenerife, Spain
- Jaime Santos
  - Institut de Biotecnologia i de Biomedicina and Departament de Bioquímica i Biologia Molecular, Universitat Autónoma de Barcelona, Bellaterra, Spain
- Noel Mesa-Torres
  - Department of Physical Chemistry, University of Granada, Granada, Spain
- Silvia Grottelli
  - Department of Experimental Medicine, University of Perugia, Piazzale Gambuli, Perugia
- Cristina Batlle
  - Institut de Biotecnologia i de Biomedicina and Departament de Bioquímica i Biologia Molecular, Universitat Autónoma de Barcelona, Bellaterra, Spain
- Athi N Naganathan
  - Department of Biotechnology, Bhupat & Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras (IITM), Chennai, India
- Elisa Oppici
  - Department of Neurosciences, Biomedicine and Movement Sciences, Section of Biological Chemistry, University of Verona, Strada Le Grazie, Verona, Italy
- Barbara Cellini
  - Department of Experimental Medicine, University of Perugia, Piazzale Gambuli, Perugia
- Salvador Ventura
  - Institut de Biotecnologia i de Biomedicina and Departament de Bioquímica i Biologia Molecular, Universitat Autónoma de Barcelona, Bellaterra, Spain
- Eduardo Salido
  - Centre for Biomedical Research on Rare Diseases, Hospital Universitario de Canarias, Tenerife, Spain
- Angel L Pey
  - Department of Physical Chemistry, University of Granada, Granada, Spain
|
153
|
Mauro VP. Codon Optimization in the Production of Recombinant Biotherapeutics: Potential Risks and Considerations. BioDrugs 2018; 32:69-81. [PMID: 29392566 DOI: 10.1007/s40259-018-0261-x] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Indexed: 12/21/2022]
Abstract
Biotherapeutics are increasingly becoming the mainstay in the treatment of a variety of human conditions, particularly in oncology and hematology. The production of therapeutic antibodies, cytokines, and fusion proteins has markedly accelerated these fields over the past decade and is probably the major contributor to improved patient outcomes. Today, most protein therapeutics are expressed as recombinant proteins in mammalian cell lines. An expression technology commonly used to increase protein levels involves codon optimization. This approach is possible because degeneracy of the genetic code enables most amino acids to be encoded by more than one synonymous codon and because codon usage can have a pronounced influence on levels of protein expression. Indeed, codon optimization has been reported to increase protein expression by > 1000-fold. The primary tactic of codon optimization is to increase the rate of translation elongation by overcoming limitations associated with species-specific differences in codon usage and transfer RNA (tRNA) abundance. However, in mammalian cells, assumptions underlying codon optimization appear to be poorly supported or unfounded. Moreover, because not all synonymous codon mutations are neutral, codon optimization can lead to alterations in protein conformation and function. This review discusses codon optimization for therapeutic protein production in mammalian cells.
|
154
|
A Novel Marsupial Hepatitis A Virus Corroborates Complex Evolutionary Patterns Shaping the Genus Hepatovirus. J Virol 2018; 92:JVI.00082-18. [PMID: 29695421 PMCID: PMC6002732 DOI: 10.1128/jvi.00082-18] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 01/15/2018] [Accepted: 04/12/2018] [Indexed: 11/30/2022] Open
Abstract
The discovery of highly diverse nonprimate hepatoviruses illuminated the evolutionary origins of hepatitis A virus (HAV) ancestors in mammals other than primates. Marsupials are ancient mammals that diverged from other Eutheria during the Jurassic. Viruses from marsupials may thus provide important insight into virus evolution. To investigate Hepatovirus macroevolutionary patterns, we sampled 112 opossums in northeastern Brazil. A novel marsupial HAV (MHAV) in the Brazilian common opossum (Didelphis aurita) was detected by nested reverse transcription-PCR (RT-PCR). MHAV concentration in the liver was high, at 2.5 × 10^9 RNA copies/g, and at least 300-fold higher than those in other solid organs, suggesting hepatotropism. Hepatovirus seroprevalence in D. aurita was 26.6% as determined using an enzyme-linked immunosorbent assay (ELISA). Endpoint titers in confirmatory immunofluorescence assays were high, and marsupial antibodies colocalized with anti-HAV control sera, suggesting specificity of serological detection and considerable antigenic relatedness between HAV and MHAV. MHAV showed all genomic hallmarks defining hepatoviruses, including late-domain motifs likely involved in quasi-envelope acquisition, a predicted C-terminal pX extension of VP1, strong avoidance of CpG dinucleotides, and a type 3 internal ribosomal entry site. Translated polyprotein gene sequence distances of at least 23.7% from other hepatoviruses suggested that MHAV represents a novel Hepatovirus species. Conserved predicted cleavage sites suggested similarities in polyprotein processing between HAV and MHAV. MHAV was nested within rodent hepatoviruses in phylogenetic reconstructions, suggesting an ancestral hepatovirus host switch from rodents into marsupials. Cophylogenetic reconciliations of host and hepatovirus phylogenies confirmed that host-independent macroevolutionary patterns shaped the phylogenetic relationships of extant hepatoviruses.
Although marsupials are synanthropic and consumed as wild game in Brazil, HAV community protective immunity may limit the zoonotic potential of MHAV. IMPORTANCE Hepatitis A virus (HAV) is a ubiquitous cause of acute hepatitis in humans. Recent findings revealed the evolutionary origins of HAV and the genus Hepatovirus defined by HAV in mammals other than primates in general and in small mammals in particular. The factors shaping the genealogy of extant hepatoviruses are unclear. We sampled marsupials, one of the most ancient mammalian lineages, and identified a novel marsupial HAV (MHAV). The novel MHAV shared specific features with HAV, including hepatotropism, antigenicity, genome structure, and a common ancestor in phylogenetic reconstructions. Coevolutionary analyses revealed that host-independent evolutionary patterns contributed most to the current phylogeny of hepatoviruses and that MHAV was the most drastic example of a cross-order host switch of any hepatovirus observed so far. The divergence of marsupials from other mammals offers unique opportunities to investigate HAV species barriers and whether mechanisms of HAV immune control are evolutionarily conserved.
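The "strong avoidance of CpG dinucleotides" mentioned in this abstract is commonly quantified as an observed/expected ratio. The sketch below is one such measure, assuming the widely used form (CpG count × sequence length) / (C count × G count); the abstract does not say which statistic the authors used, and the function name `cpg_obs_exp` is hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio for a nucleotide sequence (illustrative helper).

    Values well below 1.0 indicate CpG depletion relative to base composition.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA genomes written with U
    n = len(seq)
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible, so report zero rather than divide by zero
    cpg_count = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return cpg_count * n / (c_count * g_count)
```

As a quick check, `cpg_obs_exp("CCGG")` returns 1.0 (one CpG where one is expected), while `cpg_obs_exp("GGCC")` returns 0.0.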
|
155
|
Mier P, Andrade-Navarro MA. Glutamine Codon Usage and polyQ Evolution in Primates Depend on the Q Stretch Length. Genome Biol Evol 2018; 10:816-825. [PMID: 29608721 PMCID: PMC5841385 DOI: 10.1093/gbe/evy046] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Accepted: 02/19/2018] [Indexed: 12/16/2022] Open
Abstract
Amino acid usage in a proteome depends mostly on its taxonomy, as does codon usage in transcriptomes. Here, we explore the level of variation in the codon usage of a specific amino acid, glutamine, in relation to the number of consecutive glutamine residues. We show that CAG triplets are consistently more abundant in short glutamine homorepeats (polyQ, four to eight residues) than in shorter glutamine stretches (one to three residues), leading to the evolutionary growth of the repeat region in a CAG-dependent manner. The length of orthologous polyQ regions is mostly stable in primates, particularly the short ones. Interestingly, given a short polyQ, the CAG usage is higher in orthologous polyQ regions that are unstable in length. This indicates that CAG triplets produce the necessary instability for a glutamine stretch to grow. Proteins related to polyQ-associated diseases behave in a more extreme way, with longer glutamine stretches in human and evolutionarily closer nonhuman primates, and an overall higher CAG usage. In the light of our results, we suggest an evolutionary model to explain the glutamine codon usage in polyQ regions.
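The comparison described in this abstract (CAG usage in glutamine stretches of one to three residues versus runs of four or more) can be sketched as follows. This is an illustration under stated assumptions: the input is a single in-frame coding sequence, glutamine is encoded only by CAA and CAG, and the function name, dictionary keys and `min_polyq` parameter are hypothetical rather than the authors' pipeline.

```python
def cag_fraction_by_run_length(cds, min_polyq=4):
    """Fraction of glutamine codons that are CAG, split by Q-run length.

    Runs shorter than min_polyq go to "stretch_1_to_3"; longer runs go to
    "polyq_4_plus" (key names assume the default threshold of four).
    """
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

    # Each tally holds [CAG codons, total glutamine codons].
    tallies = {"stretch_1_to_3": [0, 0], "polyq_4_plus": [0, 0]}
    run = []
    for codon in codons + [None]:  # trailing None flushes a run at the very end
        if codon in ("CAA", "CAG"):
            run.append(codon)
            continue
        if run:
            key = "polyq_4_plus" if len(run) >= min_polyq else "stretch_1_to_3"
            tallies[key][0] += run.count("CAG")
            tallies[key][1] += len(run)
            run = []

    # None marks a class with no glutamine codons at all.
    return {k: (cag / total if total else None) for k, (cag, total) in tallies.items()}
```

For the toy sequence `ATG CAA GCT CAG CAG CAA CAG TAA`, the isolated glutamine contributes a CAG fraction of 0.0 and the four-residue run a fraction of 0.75.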
Affiliation(s)
- Pablo Mier
  - Faculty of Biology, Johannes Gutenberg University Mainz, Germany
  - Institute of Molecular Biology, Mainz, Germany
- Miguel A Andrade-Navarro
  - Faculty of Biology, Johannes Gutenberg University Mainz, Germany
  - Institute of Molecular Biology, Mainz, Germany
|
156
|
Analysis of codon usage bias of Crimean-Congo hemorrhagic fever virus and its adaptation to hosts. INFECTION GENETICS AND EVOLUTION 2017; 58:1-16. [PMID: 29198972 DOI: 10.1016/j.meegid.2017.11.027] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 05/29/2017] [Revised: 11/02/2017] [Accepted: 11/28/2017] [Indexed: 01/05/2023]
Abstract
Crimean-Congo hemorrhagic fever virus (CCHFV) is a negative-sense, single-stranded RNA virus with a three-segmented genome that belongs to the genus Nairovirus within the family Bunyaviridae. CCHFV uses Hyalomma ticks as a vector to infect humans with a wide range of clinical signs, from asymptomatic to Zika-like syndrome. Despite significant progress in genomic analyses, the influences of viral relationships with different hosts on overall viral fitness, survival, and evasion of the host's immune system remain unknown. To better understand the evolutionary characteristics of CCHFV, we performed a comprehensive analysis of the codon usage pattern in 179 CCHFV strains by calculating the relative synonymous codon usage (RSCU), effective number of codons (ENC), codon adaptation index (CAI), and other indicators. The results indicate that the codon usage bias of CCHFV is relatively low. Several lines of evidence support the hypothesis that a translation selection factor is shaping the codon usage pattern in this virus. A correspondence analysis (CA) showed that other factors, such as base composition, aromaticity, and hydrophobicity, may also be involved in shaping the codon usage pattern of CCHFV. Additionally, the results from a comparative analysis of RSCU between CCHFV and its hosts suggest that CCHFV tends to evolve codon usage patterns that are comparable to those of its hosts. Furthermore, the selection pressures from Homo sapiens, Bos taurus, and Ovis aries on the CCHFV RSCU patterns were dominant when compared with selection pressure from Hyalomma spp. vectors. Taken together, both natural selection and mutation pressure are important for shaping the codon usage pattern of CCHFV. We believe that such findings will assist researchers in understanding the evolution of CCHFV and its adaptation to its hosts.
|
157
|
Rodriguez A, Wright G, Emrich S, Clark PL. %MinMax: A versatile tool for calculating and comparing synonymous codon usage and its impact on protein folding. Protein Sci 2017; 27:356-362. [PMID: 29090506 DOI: 10.1002/pro.3336] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Received: 10/06/2017] [Revised: 10/27/2017] [Accepted: 10/30/2017] [Indexed: 11/09/2022]
Abstract
Most amino acids can be encoded by more than one synonymous codon, but these are rarely used with equal frequency. In many coding sequences the usage patterns of rare versus common synonymous codons are nonrandom and under selection. Moreover, synonymous substitutions that alter these patterns can have a substantial impact on the folding efficiency of the encoded protein. This has ignited broad interest in exploring synonymous codon usage patterns. For many protein chemists, biophysicists and structural biologists, the primary motivation for codon analysis is identifying and preserving usage patterns most likely to impact high-yield production of functional proteins. Here we describe the core functions and new features of %MinMax, a codon usage calculator freely available as a web-based portal and downloadable script (http://www.codons.org). %MinMax evaluates the relative usage frequencies of the synonymous codons used to encode a protein sequence of interest and compares these results to a rigorous null model. Crucially, for analyzing codon usage in common host organisms %MinMax requires only the coding sequence as input; with a user-input codon frequency table, %MinMax can be used to evaluate synonymous codon usage patterns for any coding sequence from any fully sequenced genome. %MinMax makes no assumptions regarding the impact of transfer ribonucleic acid concentrations or other molecular-level interactions on translation rates, yet its output is sufficient to predict the effects of synonymous codon substitutions on cotranslational folding mechanisms. A simple calculation included within %MinMax can be used to harmonize codon usage frequencies for heterologous gene expression.
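A sliding-window calculation in the spirit of %MinMax might look like the sketch below. The abstract does not give the formula, so the version here is an assumption: within each window, the summed usage frequency of the actual codons is compared with the sums of the maximum, minimum and mean frequencies of their synonymous alternatives, giving a value from -100 (rarest codons throughout) to +100 (most common codons throughout). The function name, the 18-codon default window and the requirement for a complete 64-codon frequency table are illustrative choices, not the tool's documented interface.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: table 1).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def percent_minmax(cds, codon_freq, window=18):
    """Return one %MinMax-style score per window position (hypothetical helper).

    codon_freq maps all 64 codons to a usage frequency for the host organism;
    cds is assumed to contain only standard, in-frame codons.
    """
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

    # Group codons into synonymous families by encoded amino acid.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    # Per codon: (own frequency, family max, family min, family mean).
    stats = {}
    for family in families.values():
        freqs = [codon_freq[c] for c in family]
        mean = sum(freqs) / len(freqs)
        for c in family:
            stats[c] = (codon_freq[c], max(freqs), min(freqs), mean)

    scores = []
    for start in range(len(codons) - window + 1):
        rows = [stats[c] for c in codons[start:start + window]]
        actual, high, low, mean = (sum(r[i] for r in rows) for i in range(4))
        if actual >= mean:
            span = high - mean
            scores.append(100.0 * (actual - mean) / span if span else 0.0)
        else:
            scores.append(-100.0 * (mean - actual) / (mean - low))
    return scores
```

With a toy table in which CAG is three times as frequent as CAA, a window of CAG codons scores 100.0, a window of CAA codons scores -100.0, and an even mix scores 0.0.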
Affiliation(s)
- Anabel Rodriguez
  - Department of Chemistry & Biochemistry, University of Notre Dame, Notre Dame, Indiana, 46556
- Gabriel Wright
  - Department of Computer Science & Engineering, University of Notre Dame, Notre Dame, Indiana, 46556
- Scott Emrich
  - Department of Computer Science & Engineering, University of Notre Dame, Notre Dame, Indiana, 46556
- Patricia L Clark
  - Department of Chemistry & Biochemistry, University of Notre Dame, Notre Dame, Indiana, 46556
  - Department of Chemical & Biomolecular Engineering, University of Notre Dame, Notre Dame, Indiana, 46556
|