1
Pandey S, Avuthu N, Guda C. StrainIQ: A Novel n-Gram-Based Method for Taxonomic Profiling of Human Microbiota at the Strain Level. Genes (Basel) 2023; 14:1647. [PMID: 37628698 PMCID: PMC10454763 DOI: 10.3390/genes14081647] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2023] [Revised: 08/13/2023] [Accepted: 08/15/2023] [Indexed: 08/27/2023] Open
Abstract
The emergence of next-generation sequencing (NGS) technology has greatly influenced microbiome research and led to the development of novel bioinformatics tools to deeply analyze metagenomics datasets. Identifying strain-level variations in microbial communities is important to understanding the onset and progression of diseases, host-pathogen interrelationships, and drug resistance, in addition to designing new therapeutic regimens. In this study, we developed a novel tool called StrainIQ (strain identification and quantification) based on a new n-gram-based (series of n number of adjacent nucleotides in the DNA sequence) algorithm for predicting and quantifying strain-level taxa from whole-genome metagenomic sequencing data. We thoroughly evaluated our method using simulated and mock metagenomic datasets and compared its performance with existing methods. On average, it showed 85.8% sensitivity and 78.2% specificity on simulated datasets. It also showed higher specificity and sensitivity using n-gram models built from reduced reference genomes and on models with lower coverage sequencing data. It outperforms alternative approaches in genus- and strain-level prediction and strain abundance estimation. Overall, the results show that StrainIQ achieves high accuracy by implementing customized model-building and is an efficient tool for site-specific microbial community profiling.
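The abstract defines an n-gram as a series of n adjacent nucleotides in a DNA sequence. A minimal sketch of that definition is shown here; the function name `count_ngrams` and the toy sequence are illustrative assumptions, and this is not StrainIQ's actual model-building code.

```python
from collections import Counter

def count_ngrams(sequence: str, n: int) -> Counter:
    """Count every overlapping n-gram (n adjacent nucleotides) in a DNA string."""
    seq = sequence.upper()
    # Slide a window of width n across the sequence, one base at a time.
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

# Toy example (made-up sequence, not real data).
counts = count_ngrams("ACGTACGT", 4)
```

A sequence shorter than n simply yields an empty counter, so no special casing is needed for short reads.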
Affiliation(s)
- Sanjit Pandey
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Nagavardhini Avuthu
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, USA
- Chittibabu Guda
  - Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE 68198, USA
  - Center for Biomedical Informatics Research and Innovation, University of Nebraska Medical Center, Omaha, NE 68198, USA
2
Numeric Lyndon-based feature embedding of sequencing reads for machine learning approaches. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2022.06.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
3
Semwal R, Aier I, Raj U, Varadwaj PK. Pr[m]: An Algorithm for Protein Motif Discovery. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:585-592. [PMID: 32750855 DOI: 10.1109/tcbb.2020.2999262] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Motifs are evolutionarily conserved patterns that are reported to serve crucial structural and functional roles. Identification of motif patterns in a set of protein sequences has been a prime concern for researchers in computational biology. The discovery of such a protein motif using existing algorithms is purely based on the parameters derived from sequence composition and length. However, the discovery of variable-length motifs remains a challenging task, as it is not possible to determine the length of a motif in advance. In the current work, a k-mer based motif discovery approach called Pr[m] is proposed for the detection of statistically significant un-gapped motif patterns, with or without wildcard characters. In order to analyze the performance of the proposed approach, a comparative study was performed with MEME and GLAM2, which are two widely used non-discriminative methods for motif discovery. A set of 7,500 test datasets was used to compare the performance of the proposed tool and the ones mentioned above. Pr[m] outperformed the existing methods in terms of predictive quality and performance. The proposed approach is hosted at https://bioserver.iiita.ac.in/Pr[m].
4
Jing X, Dong Q, Hong D, Lu R. Amino Acid Encoding Methods for Protein Sequences: A Comprehensive Review and Assessment. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1918-1931. [PMID: 30998480 DOI: 10.1109/tcbb.2019.2911677] [Citation(s) in RCA: 35] [Impact Index Per Article: 8.8] [Indexed: 05/20/2023]
Abstract
As the first step of machine-learning based protein structure and function prediction, amino acid encoding plays a fundamental role in the final success of those methods. Unlike protein sequence encoding, amino acid encoding can be used in both residue-level and sequence-level prediction of protein properties by combining it with different algorithms. However, it has not attracted enough attention in the past decades, and there are no comprehensive reviews and assessments of encoding methods so far. In this article, we make a systematic classification and propose a comprehensive review and assessment of various amino acid encoding methods. Those methods are grouped into five categories according to their information sources and information extraction methodologies: binary encoding, physicochemical properties encoding, evolution-based encoding, structure-based encoding, and machine-learning encoding. Then, 16 representative methods from the five categories are selected and compared on protein secondary structure prediction and protein fold recognition tasks by using large-scale benchmark datasets. The results show that the evolution-based position-dependent encoding method PSSM achieved the best performance, and that the structure-based and machine-learning encoding methods also show some potential for further application. In particular, the neural network based distributed representation of amino acids may bring new light to this area. We hope that the review and assessment are useful for future studies in amino acid encoding.
5
Yadav AK, Singla D. VacPred: Sequence-based prediction of plant vacuole proteins using machine-learning techniques. J Biosci 2020. [DOI: 10.1007/s12038-020-00076-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
6
Jin C, Cukier RI. Machine learning can be used to distinguish protein families and generate new proteins belonging to those families. J Chem Phys 2019; 151:175102. [PMID: 31703505 DOI: 10.1063/1.5126225] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/14/2022] Open
Abstract
Proteins are classified into families based on evolutionary relationships and common structure-function characteristics. Availability of large data sets of gene-derived protein sequences drives this classification. Sequence space is exponentially large, making it difficult to characterize family differences. In this work, we show that Machine Learning (ML) methods can be trained to distinguish between protein families. A number of supervised ML algorithms are explored to this end. The most accurate is a Long Short Term Memory (LSTM) classification method that accounts for the sequence context of the amino acids. Sequences for a number of protein families where there are sufficient data to be used in ML are studied. By splitting the data into training and testing sets, we find that this LSTM classifier can be trained to successfully classify the test sequences for all pairs of the families. Also investigated is whether the addition of structural information increases the accuracy of the binary comparisons. It does, but because there is much less available structural than sequence information, the quality of the training degrades. Another variety of LSTM, LSTM_wordGen, a context-dependent word generation algorithm, is used to generate new protein sequences based on seed sequences for the families considered here. Using the original sequences as training data and the generated sequences as test data, the LSTM classification method classifies the generated sequences almost as accurately as the true family members do. Thus, in principle, we have generated new members of these protein families.
Affiliation(s)
- Chi Jin
  - Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, USA
- Robert I Cukier
  - Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, USA
7
Novel Bioactive Peptides from Meretrix meretrix Protect Caenorhabditis elegans against Free Radical-Induced Oxidative Stress through the Stress Response Factor DAF-16/FOXO. Mar Drugs 2018; 16:md16110444. [PMID: 30423886 PMCID: PMC6265947 DOI: 10.3390/md16110444] [Citation(s) in RCA: 27] [Impact Index Per Article: 4.5] [Received: 10/23/2018] [Revised: 11/03/2018] [Accepted: 11/06/2018] [Indexed: 12/19/2022] Open
Abstract
The hard clam Meretrix meretrix, which has been traditionally used as medicine and seafood, was used in this study to isolate antioxidant peptides. First, a peptide-rich extract was tested for its protective effect against paraquat-induced oxidative stress using the nematode model Caenorhabditis elegans. Then, three novel antioxidant peptides, MmP4 (LSDRLEETGGASS), MmP11 (KEGCREPETEKGHR) and MmP19 (IVTNWDDMEK), were identified and were found to increase the resistance of nematodes against paraquat. Circular dichroism spectroscopy revealed that MmP4 was predominantly in beta-sheet conformation, while MmP11 and MmP19 were primarily in random coil conformation. Using transgenic nematode models, the peptides were shown to promote nuclear translocation of the DAF-16/FOXO transcription factor, a pivotal regulator of stress response and lifespan, and to induce the expression of superoxide dismutase 3 (SOD-3), an antioxidant enzyme. Analysis of DAF-16 target genes by real-time PCR revealed that sod-3 was up-regulated by MmP4, MmP11 and MmP19, while ctl-1 and ctl-2 were also up-regulated by MmP4. Further examination of daf-16 using RNA interference suggests that the peptide-increased resistance of C. elegans to oxidative stress was DAF-16 dependent. Taken together, these data demonstrate the antioxidant activity of M. meretrix peptides, which is associated with activation of the stress response factor DAF-16 and regulation of the antioxidant enzyme genes.
8
Frades I, Andreasson E. Phytophthora infestans specific phosphorylation patterns and new putative control targets. Fungal Biol 2016; 120:631-644. [PMID: 27020162 DOI: 10.1016/j.funbio.2016.01.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2015] [Revised: 12/16/2015] [Accepted: 01/06/2016] [Indexed: 11/15/2022]
Abstract
In this study we applied biomathematical searches of gene regulatory mechanisms to learn more about oomycete biology and to identify new putative targets for pesticides or biological control against Phytophthora infestans. First, oomycete phylum-specific phosphorylation motifs were found by discriminative n-gram analysis. We found 11,600 P. infestans specific n-grams, mapping to 642 phosphoproteins. The most abundant group among these related to phosphatidylinositol metabolism. Due to the large number of possible targets found and our hypothesis that multi-level control is a sign of usefulness as targets for intervention, we identified overlapping targets with a second screen. This was performed to identify proteins dually regulated by small RNA and phosphorylation. We found 164 proteins to be regulated by both sRNA and phosphorylation, and the dominating functions were phosphatidylinositol signalling/metabolism, endocytosis, and autophagy. Furthermore, we performed a similar regulatory study and discriminative n-gram analysis of proteins with no clear orthologs in other species and proteins that are known to be unique to P. infestans, such as the RxLR effectors, Crinkler (CRN) proteins and elicitins. We identified CRN proteins with specific phospho-motifs present in all life stages. PITG_12626, PITG_14042 and PITG_23175 are CRN proteins that have species-specific phosphorylation motifs and are subject to dual regulation.
Affiliation(s)
- Itziar Frades
  - Department of Plant Protection Biology, Swedish University of Agricultural Sciences, Alnarp, SE-230 53, Sweden
- Erik Andreasson
  - Department of Plant Protection Biology, Swedish University of Agricultural Sciences, Alnarp, SE-230 53, Sweden
9
Asgari E, Mofrad MRK. Continuous Distributed Representation of Biological Sequences for Deep Proteomics and Genomics. PLoS One 2015; 10:e0141287. [PMID: 26555596 PMCID: PMC4640716 DOI: 10.1371/journal.pone.0141287] [Citation(s) in RCA: 349] [Impact Index Per Article: 38.8] [Received: 07/09/2015] [Accepted: 10/05/2015] [Indexed: 12/22/2022] Open
Abstract
We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%±0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. Importantly, this model needs to be trained only once and can then be applied to extract a comprehensive set of information regarding proteins of interest. 
Moreover, this representation can be considered as pre-training for various applications of deep learning in bioinformatics. The related data is available at Life Language Processing Website: http://llp.berkeley.edu and Harvard Dataverse: http://dx.doi.org/10.7910/DVN/JMFHTN.
Affiliation(s)
- Ehsaneddin Asgari
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California 94720, United States of America
- Mohammad R. K. Mofrad
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, California 94720, United States of America
  - Physical Biosciences Division, Lawrence Berkeley National Lab, Berkeley, California 94720, United States of America
10
Frades I, Resjö S, Andreasson E. Comparison of phosphorylation patterns across eukaryotes by discriminative N-gram analysis. BMC Bioinformatics 2015. [PMID: 26224486 PMCID: PMC4520095 DOI: 10.1186/s12859-015-0657-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022] Open
Abstract
Background How protein phosphorylation relates to kingdom/phylum divergence is largely unknown, and the amino acid residues surrounding the phosphorylation site have profound importance for protein kinase–substrate interactions. Standard motif analysis is not adequate for large-scale comparative analysis because each phosphopeptide is assigned to a unique motif and the analysis performs poorly with the unbalanced nature of the input datasets. Results First, the discriminative n-grams of five species from five different kingdoms/phyla were identified. A signature with 5540 discriminative n-grams that could be found in other species from the same kingdoms/phyla was created. Using a test data set, the ability of the signature to classify species in their corresponding kingdom/phylum was confirmed using classification methods. Lastly, ortholog proteins among proteins with n-grams were identified in order to determine to what degree the identity of the detected n-grams was a property of phosphosites rather than a consequence of species-specific or kingdom/phylum-specific protein inventory. The motifs were grouped in clusters of equal physico-chemical nature, and their distribution was similar between species in the same kingdom/phylum, while clear differences were found among species of different kingdoms/phyla. For example, the animal-specific top discriminative n-grams contained many basic amino acids and the plant-specific motifs were mainly acidic. Secondary structure prediction methods show that the discriminative n-grams in the majority of cases lack a regular secondary structure, as on average they had 88 % random coil compared to 66 % found in the phosphoproteins they were derived from.
Conclusions The discriminative n-grams were able to classify organisms in their corresponding kingdom/phylum, they show different patterns among species of different kingdom/phylum and these regions can contribute to evolutionary divergence as they are in disordered regions that can evolve rapidly. The differences found possibly reflect group-specific differences in the kinomes of the different groups of species. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0657-2) contains supplementary material, which is available to authorized users.
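The abstract does not state how an n-gram is judged discriminative, so the following is only a sketch under a simple assumed criterion: an n-gram counts as discriminative for group A when the share of group-A sequences containing it exceeds the share in group B by at least `min_diff`. The function names, the threshold, and the toy peptides are all hypothetical, and both groups are assumed to be non-empty.

```python
from collections import Counter

def ngram_set(seq, n):
    """Return the set of distinct n-grams occurring in one sequence."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def discriminative_ngrams(group_a, group_b, n=3, min_diff=0.5):
    """Return sorted n-grams whose sequence-level frequency in group_a
    exceeds that in group_b by at least min_diff (assumed criterion)."""
    def doc_freq(group):
        freq = Counter()
        for seq in group:
            freq.update(ngram_set(seq, n))  # count each n-gram once per sequence
        return {gram: c / len(group) for gram, c in freq.items()}

    fa, fb = doc_freq(group_a), doc_freq(group_b)
    return sorted(g for g, f in fa.items() if f - fb.get(g, 0.0) >= min_diff)

# Toy peptides (invented for illustration): basic-rich vs. acidic-rich.
basic_rich = ["RRSSP", "KRSSP"]
acidic_rich = ["DESDE", "EESDD"]
shared_by_all = discriminative_ngrams(basic_rich, acidic_rich, n=3, min_diff=1.0)
```

With `min_diff=1.0` only n-grams present in every group-A sequence and absent from group B survive; lowering the threshold admits n-grams that are merely enriched.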
Affiliation(s)
- Itziar Frades
  - Department of Plant Protection Biology, Swedish University of Agricultural Sciences, Alnarp, SE-230 53, Sweden
- Svante Resjö
  - Department of Plant Protection Biology, Swedish University of Agricultural Sciences, Alnarp, SE-230 53, Sweden
- Erik Andreasson
  - Department of Plant Protection Biology, Swedish University of Agricultural Sciences, Alnarp, SE-230 53, Sweden
11
Zhang Q, Xu Y. Motif mining based on network space compression. BioData Min 2014; 8:29. [PMID: 25525470 PMCID: PMC4269098 DOI: 10.1186/s13040-014-0029-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 01/14/2014] [Accepted: 11/25/2014] [Indexed: 11/25/2022] Open
Abstract
A network motif is a recurring subnetwork within a network, and it takes on certain functions in practical biological macromolecule applications. Previous algorithms have focused on the computational efficiency of network motif detection, but some problems in storage space and searching time manifested during earlier studies. The considerable computational and spatial complexity also presents a significant challenge. In this paper, we provide a new approach for motif mining based on compressing the searching space. According to the characteristics of the parity nodes, we cut down the searching space and storage space in real graphs and random graphs, thereby reducing the computational cost of verifying the isomorphism of sub-graphs. We obtain a new network with smaller size after removing parity nodes and the “repeated edges” connected with the parity nodes. Random graph structure and sub-graph searching are based on the Back Tracking Method; all sub-graphs can be searched for by adding edges progressively. Experimental results show that this algorithm has higher speed and better stability than its alternatives.
Affiliation(s)
- Qiang Zhang
  - Key Laboratory of Advanced Design and Intelligent Computing, (Dalian university), Ministry of Education, Dalian, 116622 China
- Yuan Xu
  - Key Laboratory of Advanced Design and Intelligent Computing, (Dalian university), Ministry of Education, Dalian, 116622 China
12
Li A, Zhang J, Zhou Z. PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme. BMC Bioinformatics 2014; 15:311. [PMID: 25239089 PMCID: PMC4177586 DOI: 10.1186/1471-2105-15-311] [Citation(s) in RCA: 415] [Impact Index Per Article: 41.5] [Received: 11/18/2013] [Accepted: 09/01/2014] [Indexed: 12/21/2022] Open
Abstract
Background High-throughput transcriptome sequencing (RNA-seq) technology promises to discover novel protein-coding and non-coding transcripts, particularly the identification of long non-coding RNAs (lncRNAs) from de novo sequencing data. This requires tools that are not restricted by prior gene annotations, genomic sequences and high-quality sequencing. Results We present an alignment-free tool called PLEK (predictor of long non-coding RNAs and messenger RNAs based on an improved k-mer scheme), which uses a computational pipeline based on an improved k-mer scheme and a support vector machine (SVM) algorithm to distinguish lncRNAs from messenger RNAs (mRNAs), in the absence of genomic sequences or annotations. The performance of PLEK was evaluated on well-annotated mRNA and lncRNA transcripts. 10-fold cross-validation tests on human RefSeq mRNAs and GENCODE lncRNAs indicated that our tool could achieve accuracy of up to 95.6%. We demonstrated the utility of PLEK on transcripts from other vertebrates using the model built from human datasets. PLEK attained >90% accuracy on most of these datasets. PLEK also performed well using a simulated dataset and two real de novo assembled transcriptome datasets (sequenced by PacBio and 454 platforms) with relatively high indel sequencing errors. In addition, PLEK is approximately eightfold faster than a newly developed alignment-free tool, named Coding-Non-Coding Index (CNCI), and 244 times faster than the most popular alignment-based tool, Coding Potential Calculator (CPC), in a single-threading running manner. Conclusions PLEK is an efficient alignment-free computational tool to distinguish lncRNAs from mRNAs in RNA-seq transcriptomes of species lacking reference genomes. PLEK is especially suitable for PacBio or 454 sequencing data and large-scale transcriptome data. Its open-source software can be freely downloaded from https://sourceforge.net/projects/plek/files/. 
Electronic supplementary material The online version of this article (doi:10.1186/1471-2105-15-311) contains supplementary material, which is available to authorized users.
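The abstract says PLEK feeds k-mer based features to a support vector machine but does not spell out the "improved" scheme. As a rough illustration of the plain, alignment-free starting point only, the sketch below turns a transcript into a fixed-length vector of relative k-mer frequencies. The function name, the default `k`, and the handling of ambiguous bases are assumptions, and PLEK's actual weighting and SVM step are not reproduced.

```python
from itertools import product

def kmer_frequencies(seq, k=2, alphabet="ACGT"):
    """Return relative frequencies of all len(alphabet)**k k-mers, in a fixed order."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA letters alike
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing N or other ambiguity codes
            counts[kmer] += 1
            total += 1
    # An all-zero vector is returned when no valid k-mer was found.
    return [counts[m] / total if total else 0.0 for m in kmers]

features = kmer_frequencies("ACGU", k=1)
```

Because the vector length depends only on `k` and the alphabet, transcripts of any length map to comparable feature vectors that a classifier could consume.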
Affiliation(s)
- Junying Zhang
  - School of Computer Science and Technology, Xidian University, Xi'an, PR China
13
Srinivasan SM, Guda C. MetaID: a novel method for identification and quantification of metagenomic samples. BMC Genomics 2013; 14 Suppl 8:S4. [PMID: 24564518 PMCID: PMC4042266 DOI: 10.1186/1471-2164-14-s8-s4] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Advances in next-generation sequencing (NGS) technology have provided us with an opportunity to analyze and evaluate the rich microbial communities present in all natural environments. The shorter reads obtained from shotgun technology have paved the way for determining the taxonomic profile of a community by simply aligning the reads against the available reference genomes. While several computational methods are available for taxonomic profiling at the genus and species level, none of these methods is effective at strain-level identification due to the increasing difficulty in detecting variation at that level. Here, we present MetaID, an alignment-free n-gram based approach that can accurately identify microorganisms at the strain level and estimate the abundance of each organism in a sample, given a metagenomic sequencing dataset. RESULTS MetaID is an n-gram based method that calculates the profile of unique and common n-grams from the dataset of 2,031 prokaryotic genomes and assigns weights to each n-gram using a scoring function. This scoring function assigns higher weightage to the n-grams that appear in fewer genomes and vice versa, thus allowing for effective use of both unique and common n-grams for species identification. Our 10-fold cross-validation results on a simulated dataset show a remarkable accuracy of 99.7% at the strain-level identification of the organisms in the gut microbiome. We also demonstrated that our model shows impressive performance even when using only 25% or 50% of the genome sequences for modeling. In addition to identification of the species, our method can also estimate the relative abundance of each species in the simulated metagenomic samples. The generic approach employed in this method can be applied for accurate identification of a wide variety of microbial species (viruses, prokaryotes and eukaryotes) present in any environmental sample.
CONCLUSIONS The proposed scoring function and approach is able to accurately identify and estimate the entire taxa in any metagenomic community. The weights assigned to the common n-grams by our scoring function are precisely calibrated to match the reads up to the strain level. Our multipronged validation tests demonstrate that MetaID is sufficiently robust to accurately identify and estimate the abundance of each taxon in any natural environment even when using incomplete or partially sequenced genomes.
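The abstract describes a scoring function that gives higher weight to n-grams found in fewer genomes. The exact formula is not given there, so the sketch below assumes the simplest form consistent with that description, a weight of 1 divided by the number of genomes containing the n-gram. The function names and the two toy "strain" sequences are hypothetical; this is not MetaID's published scoring function.

```python
from collections import defaultdict

def ngram_set(seq, n):
    """Return the set of distinct n-grams occurring in one sequence."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def build_weights(genomes, n):
    """Map each n-gram to {genome: weight}; assumed weight = 1 / number of
    genomes containing it, so unique n-grams weigh 1.0 and common ones less."""
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for gram in ngram_set(seq, n):
            owners[gram].add(name)
    return {gram: {g: 1.0 / len(names) for g in names}
            for gram, names in owners.items()}

def score_read(read, weights, n):
    """Sum, per genome, the weights of the read's n-grams found in that genome."""
    scores = defaultdict(float)
    for gram in ngram_set(read, n):
        for genome, w in weights.get(gram, {}).items():
            scores[genome] += w
    return dict(scores)

# Toy reference "genomes" (invented sequences).
genomes = {"strain_A": "ACGTTGCA", "strain_B": "ACGTAGGA"}
weights = build_weights(genomes, 4)
scores = score_read("GTTGC", weights, 4)
```

A read made only of shared n-grams splits its score evenly between strains, while a read carrying strain-specific n-grams is pulled toward one strain, which is the behavior the abstract attributes to using both unique and common n-grams.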