1
Abstract
Escherichia coli was one of the first species to have its genome sequenced and remains one of the best-characterized model organisms. Thus, it is perhaps surprising that recent studies have shown that a substantial number of genes have been overlooked. Genes encoding more than 140 small proteins, defined as those containing 50 or fewer amino acids, have been identified in E. coli in the past 10 years, and there is substantial evidence indicating that many more remain to be discovered. This review covers the methods that have been successful in identifying small proteins and the short open reading frames that encode them. The small proteins that have been functionally characterized to date in this model organism are also discussed. It is hoped that the review, along with the associated databases of known as well as predicted but undetected small proteins, will aid in and provide a roadmap for the continued identification and characterization of these proteins in E. coli as well as other bacteria.
2
Nasir MA, Nawaz S, Huang J. A Mini-review of Computational Approaches to Predict Functions and Findings of Novel Micro Peptides. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200811130522]
Abstract
New techniques in bioinformatics and transcriptome-wide studies have revealed that a larger part of the genome is translated than previously thought, producing a variety of RNAs with protein-coding and noncoding potential. Many RNA molecules have been classified as noncoding for a number of reasons, yet emerging evidence indicates that many sORFs encoding functional micro peptides have been neglected because of their small size.
Recent studies reveal major biological functions for these sORFs and their encoded micro peptides in a wide range of species. Progress in identifying these sORFs and micro peptides is due to advances in bioinformatics and high-throughput sequencing methods. The field has attracted growing attention as large numbers of additional sORFs and micro peptides have been detected. COVID-19 currently commands much of the scientific community's attention as a sudden outbreak, and the sORFs of the COVID-19 virus should be identified to open new ways of understanding it. This review discusses ongoing progress in methods for the detection and identification of sORFs and micro peptides.
Affiliation(s)
- Mohsin Ali Nasir
- Center for Informational Biology, University of Electronic Science and Technology of China, No. 2006, Xiyuan Ave, West Hi-Tech Zone, Chengdu 611731, China
- Samia Nawaz
- Center for Informational Biology, University of Electronic Science and Technology of China, No. 2006, Xiyuan Ave, West Hi-Tech Zone, Chengdu 611731, China
- Jian Huang
- Center for Informational Biology, University of Electronic Science and Technology of China, No. 2006, Xiyuan Ave, West Hi-Tech Zone, Chengdu 611731, China
3
Orr MW, Mao Y, Storz G, Qian SB. Alternative ORFs and small ORFs: shedding light on the dark proteome. Nucleic Acids Res 2020; 48:1029-1042. [PMID: 31504789 DOI: 10.1093/nar/gkz734] [Received: 07/04/2019] [Revised: 08/03/2019] [Accepted: 08/15/2019] [Open access]
Abstract
Traditional annotation of protein-encoding genes relied on assumptions, such as one open reading frame (ORF) encodes one protein and minimal lengths for translated proteins. With the serendipitous discoveries of translated ORFs encoded upstream and downstream of annotated ORFs, from alternative start sites nested within annotated ORFs and from RNAs previously considered noncoding, it is becoming clear that these initial assumptions are incorrect. The findings have led to the realization that genetic information is more densely coded and that the proteome is more complex than previously anticipated. As such, interest in the identification and characterization of the previously ignored 'dark proteome' is increasing, though we note that research in eukaryotes and bacteria has largely progressed in isolation. To bridge this gap and illustrate exciting findings emerging from studies of the dark proteome, we highlight recent advances in both eukaryotic and bacterial cells. We discuss progress in the detection of alternative ORFs as well as in the understanding of functions and the regulation of their expression and posit questions for future work.
Affiliation(s)
- Mona Wu Orr
- Division of Molecular and Cellular Biology, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Bethesda, MD 20892, USA
- Yuanhui Mao
- Division of Nutritional Sciences, Cornell University, Ithaca, NY 14853, USA
- Gisela Storz
- Division of Molecular and Cellular Biology, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Bethesda, MD 20892, USA
- Shu-Bing Qian
- Division of Nutritional Sciences, Cornell University, Ithaca, NY 14853, USA
4
Cerqueira FR, Vasconcelos ATR. OCCAM: prediction of small ORFs in bacterial genomes by means of a target-decoy database approach and machine learning techniques. Database: The Journal of Biological Databases and Curation 2020; 2020:5989499. [PMID: 33206960 PMCID: PMC7673341 DOI: 10.1093/database/baaa067] [Received: 05/26/2020] [Revised: 07/11/2020] [Accepted: 07/27/2020]
Abstract
Small open reading frames (ORFs) have been systematically disregarded by automatic genome annotation. The difficulty in finding patterns in tiny sequences is the main reason that makes small ORFs to be overlooked by computational procedures. However, advances in experimental methods show that small proteins can play vital roles in cellular activities. Hence, it is urgent to make progress in the development of computational approaches to speed up the identification of potential small ORFs. In this work, our focus is on bacterial genomes. We improve a previous approach to identify small ORFs in bacteria. Our method uses machine learning techniques and decoy subject sequences to filter out spurious ORF alignments. We show that an advanced multivariate analysis can be more effective in terms of sensitivity than applying the simplistic and widely used e-value cutoff. This is particularly important in the case of small ORFs for which alignments present higher e-values than usual. Experiments with control datasets show that the machine learning algorithms used in our method to curate significant alignments can achieve average sensitivity and specificity of 97.06% and 99.61%, respectively. Therefore, an important step is provided here toward the construction of more accurate computational tools for the identification of small ORFs in bacteria.
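The target-decoy idea mentioned in this abstract can be illustrated with a generic sketch. This is not the OCCAM implementation, whose details the abstract does not give; it assumes only that each alignment has a numeric score where higher is better, and that hits against decoy sequences approximate the number of spurious target hits. The function name `fdr_cutoff` and the example scores are hypothetical.

```python
def fdr_cutoff(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    The FDR at a cutoff is estimated as
    (decoy hits >= cutoff) / (target hits >= cutoff).
    Returns None if no cutoff satisfies the bound.
    """
    best = None
    # Candidate cutoffs are the observed target scores, highest first,
    # so the last cutoff that passes is the most permissive one.
    for cutoff in sorted(set(target_scores), reverse=True):
        n_target = sum(s >= cutoff for s in target_scores)
        n_decoy = sum(s >= cutoff for s in decoy_scores)
        if n_decoy / n_target <= max_fdr:
            best = cutoff
    return best


# Hypothetical alignment scores, for illustration only.
targets = [10, 9, 8, 7, 3, 2]
decoys = [4, 2.5, 1]
print(fdr_cutoff(targets, decoys, max_fdr=0.25))
```

A single score threshold like this is the simple baseline; the abstract's point is that a multivariate classifier trained on several alignment features can be more sensitive than any one cutoff.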
Affiliation(s)
- Fabio R Cerqueira
- Department of Production Engineering, Universidade Federal Fluminense, Rua Domingos Silvério s/n, Petrópolis, 25650-050, Rio de Janeiro, Brazil; Graduate Program in Computer Science, Universidade Federal de Viçosa, 36570-900, Minas Gerais, Brazil
5
VanOrsdel CE, Kelly JP, Burke BN, Lein CD, Oufiero CE, Sanchez JF, Wimmers LE, Hearn DJ, Abuikhdair FJ, Barnhart KR, Duley ML, Ernst SEG, Kenerson BA, Serafin AJ, Hemm MR. Identifying New Small Proteins in Escherichia coli. Proteomics 2018; 18:e1700064. [PMID: 29645342 PMCID: PMC6001520 DOI: 10.1002/pmic.201700064] [Received: 09/05/2017] [Revised: 03/05/2018]
Abstract
The number of small proteins (SPs) encoded in the Escherichia coli genome is unknown, as current bioinformatics and biochemical techniques make short gene and small protein identification challenging. One method of small protein identification involves adding an epitope tag to the 3′ end of a short open reading frame (sORF) on the chromosome, with synthesis confirmed by immunoblot assays. In this study, this strategy was used to identify new E. coli small proteins, tagging 80 sORFs in the E. coli genome, and assayed for protein synthesis. The selected sORFs represent diverse sequence characteristics, including degrees of sORF conservation, predicted transmembrane domains, sORF direction with respect to flanking genes, ribosome binding site (RBS) prediction, and ribosome profiling results. Of 80 sORFs, 36 resulted in encoded synthesized proteins—a 45% success rate. Modeling of detected versus non‐detected small proteins analysis showed predictions based on RBS prediction, transcription data, and ribosome profiling had statistically‐significant correlation with protein synthesis; however, there was no correlation between current sORF annotation and protein synthesis. These results suggest substantial numbers of small proteins remain undiscovered in E. coli, and existing bioinformatics techniques must continue to improve to facilitate identification.
Affiliation(s)
- Caitlin E VanOrsdel
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- John P Kelly
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- Brittany N Burke
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- Christina D Lein
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- Joseph F Sanchez
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- Larry E Wimmers
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- David J Hearn
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- Fatimeh J Abuikhdair
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- Kathryn R Barnhart
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- Michelle L Duley
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- Sarah E G Ernst
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- Briana A Kenerson
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- Aubrey J Serafin
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
- Matthew R Hemm
- Department of Biological Sciences, Smith Hall, Towson University, Towson, MD, USA
6
Yagoub D, Tay AP, Chen Z, Hamey JJ, Cai C, Chia SZ, Hart-Smith G, Wilkins MR. Proteogenomic Discovery of a Small, Novel Protein in Yeast Reveals a Strategy for the Detection of Unannotated Short Open Reading Frames. J Proteome Res 2015; 14:5038-47. [DOI: 10.1021/acs.jproteome.5b00734]
Affiliation(s)
- Daniel Yagoub
- Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
- Aidan P. Tay
- Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
- Zhiliang Chen
- Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
- Joshua J. Hamey
- Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
- Curtis Cai
- Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
- Samantha Z. Chia
- Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
- Gene Hart-Smith
- Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
- Marc R. Wilkins
- Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, New South Wales 2052, Australia
7
Cooke IR, Jones D, Bowen JK, Deng C, Faou P, Hall NE, Jayachandran V, Liem M, Taranto AP, Plummer KM, Mathivanan S. Proteogenomic analysis of the Venturia pirina (Pear Scab Fungus) secretome reveals potential effectors. J Proteome Res 2014; 13:3635-44. [PMID: 24965097 DOI: 10.1021/pr500176c]
Abstract
A proteogenomic analysis is presented for Venturia pirina, a fungus that causes scab disease on European pear (Pyrus communis). V. pirina is host-specific, and the infection is thought to be mediated by secreted effector proteins. Currently, only 36 V. pirina proteins are catalogued in GenBank, and the genome sequence is not publicly available. To identify putative effectors, V. pirina was grown in vitro on and in cellophane sheets mimicking its growth in infected leaves. Secreted extracts were analyzed by tandem mass spectrometry, and the data (ProteomeXchange identifier PXD000710) were queried against a protein database generated by combining in silico predicted transcripts with six frame translations of a whole genome sequence of V. pirina (GenBank Accession JEMP00000000). We identified 1088 distinct V. pirina protein groups (FDR 1%) including 1085 detected for the first time. Thirty novel (not in silico predicted) proteins were found, of which 14 were identified as potential effectors based on characteristic features of fungal effector protein sequences. We also used evidence from semitryptic peptides at the protein N-terminus to corroborate in silico signal peptide predictions for 22 proteins, including several potential effectors. The analysis highlights the utility of proteogenomics in the study of secreted effectors.
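The six-frame translation step named in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes the standard genetic code, uppercase or lowercase A/C/G/T input, and marks any codon containing another character as "X". The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq):
    """Return translations for frames +1..+3 and -1..-3, keyed by frame name."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

In a proteogenomic search, the segments between stop codons in each frame would be written to a protein database so that spectra can match peptides absent from the predicted gene models.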
Affiliation(s)
- Ira R Cooke
- Department of Biochemistry, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, Victoria 3086, Australia
8
Pang CNI, Tay AP, Aya C, Twine NA, Harkness L, Hart-Smith G, Chia SZ, Chen Z, Deshpande NP, Kaakoush NO, Mitchell HM, Kassem M, Wilkins MR. Tools to covisualize and coanalyze proteomic data with genomes and transcriptomes: validation of genes and alternative mRNA splicing. J Proteome Res 2013; 13:84-98. [PMID: 24152167 DOI: 10.1021/pr400820p]
Abstract
Direct links between proteomic and genomic/transcriptomic data are not frequently made, partly because of lack of appropriate bioinformatics tools. To help address this, we have developed the PG Nexus pipeline. The PG Nexus allows users to covisualize peptides in the context of genomes or genomic contigs, along with RNA-seq reads. This is done in the Integrated Genome Viewer (IGV). A Results Analyzer reports the precise base position where LC-MS/MS-derived peptides cover genes or gene isoforms, on the chromosomes or contigs where this occurs. In prokaryotes, the PG Nexus pipeline facilitates the validation of genes, where annotation or gene prediction is available, or the discovery of genes using a "virtual protein"-based unbiased approach. We illustrate this with a comprehensive proteogenomics analysis of two strains of Campylobacter concisus. For higher eukaryotes, the PG Nexus facilitates gene validation and supports the identification of mRNA splice junction boundaries and splice variants that are protein-coding. This is illustrated with an analysis of splice junctions covered by human phosphopeptides, and other examples of relevance to the Chromosome-Centric Human Proteome Project. The PG Nexus is open-source and available from https://github.com/IntersectAustralia/ap11_Samifier. It has been integrated into Galaxy and made available in the Galaxy tool shed.
Affiliation(s)
- Chi Nam Ignatius Pang
- Systems Biology Initiative, The University of New South Wales, Sydney, New South Wales 2052, Australia
9
Chen S, Zhang CY, Song K. Recognizing short coding sequences of prokaryotic genome using a novel iteratively adaptive sparse partial least squares algorithm. Biol Direct 2013; 8:23. [PMID: 24067167 PMCID: PMC3852556 DOI: 10.1186/1745-6150-8-23] [Received: 04/01/2013] [Accepted: 09/23/2013] [Open access]
Abstract
BACKGROUND: Significant efforts have been made to address the problem of identifying short genes in prokaryotic genomes. However, most known methods are not effective in detecting short genes. Because of the limited information contained in short DNA sequences, it is very difficult to accurately distinguish between protein coding and non-coding sequences in prokaryotic genomes. We have developed a new Iteratively Adaptive Sparse Partial Least Squares (IASPLS) algorithm as the classifier to improve the accuracy of the identification process.
RESULTS: For testing, we chose the short coding and non-coding sequences from seven prokaryotic organisms. We used seven feature sets (including GC content, Z-curve, etc.) of short genes. In comparison with the GeneMarkS, Metagene, Orphelia, and Heuristic Approaches methods, our model achieved the best prediction performance in identification of short prokaryotic genes. Even when we focused on the very short length group ([60, 100) nt), our model provided sensitivity as high as 83.44% and specificity as high as 92.8%. These values are two or three times higher than three of the other methods, while Metagene fails to recognize genes in this length range. The experiments also proved that the IASPLS can improve the identification accuracy in comparison with other widely used classifiers, i.e. Logistic, Random Forest (RF) and K nearest neighbors (KNN). The accuracy in using IASPLS was improved 5.90% or more in comparison with the other methods. In addition to the improvements in accuracy, IASPLS required ten times less computer time than using KNN or RF.
CONCLUSIONS: It is conclusive that our method is preferable for application as an automated method of short gene classification. Its linearity and easily optimized parameters make it practicable for predicting short genes of newly-sequenced or under-studied species.
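Two of the feature sets named in this abstract, GC content and the Z-curve, can be computed from base counts alone. The sketch below is a simplified illustration, not the authors' feature extraction: it assumes the basic three-component Z-curve taken over the whole sequence and normalized by length, whereas gene-finding applications often compute the components separately for each codon position. The function names are hypothetical.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases that are G or C."""
    counts = Counter(seq.upper())
    return (counts["G"] + counts["C"]) / len(seq)


def z_curve(seq):
    """Length-normalized Z-curve components (x, y, z) for a DNA string.

    x: purines (A, G) minus pyrimidines (C, T)
    y: amino bases (A, C) minus keto bases (G, T)
    z: weak hydrogen-bond bases (A, T) minus strong ones (G, C)
    """
    c = Counter(seq.upper())
    n = len(seq)
    x = ((c["A"] + c["G"]) - (c["C"] + c["T"])) / n
    y = ((c["A"] + c["C"]) - (c["G"] + c["T"])) / n
    z = ((c["A"] + c["T"]) - (c["G"] + c["C"])) / n
    return x, y, z


candidate = "ATGGCC"  # toy sequence, for illustration only
print(gc_content(candidate), z_curve(candidate))
```

Feature vectors like these would then be passed to a classifier; the abstract reports that IASPLS outperformed logistic regression, random forest and KNN on such features.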
Affiliation(s)
- Sun Chen
- School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China.
10