1
Chaturvedi A, Borkar K, Priyakumar UD, Vinod P. PREHOST: Host prediction of coronaviridae family using machine learning. Heliyon 2023; 9:e13646. [PMID: 36816252 PMCID: PMC9922161 DOI: 10.1016/j.heliyon.2023.e13646] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/11/2022] [Revised: 02/05/2023] [Accepted: 02/06/2023] [Indexed: 02/13/2023] Open
Abstract
Coronavirus, a zoonotic virus capable of transmitting infections from animals to humans, recently emerged as a pandemic. In such circumstances, it is essential to understand the virus's origin. In this study, we present PreHost, a novel machine-learning pipeline for host prediction in the family Coronaviridae. We leverage the complete viral genome and sequences at the protein level (spike protein, membrane protein, and nucleocapsid protein). Compared with the current state-of-the-art approaches, the random forest model attained high accuracy and recall scores of 99.91% and 0.98, respectively, for genome sequences. Our study shows that, in addition to spike protein sequences, membrane and nucleocapsid protein sequences can be used to predict the host of viruses. We also identified important sites in the viral sequences that help distinguish between different host classes. The host prediction pipeline PreHost will serve as a valuable tool for taking effective measures to control the transmission of future viruses.
2
Soni KK, Rasool A. Quantum-effective exact multiple patterns matching algorithms for biological sequences. PeerJ Comput Sci 2022; 8:e957. [PMID: 35634119 PMCID: PMC9138144 DOI: 10.7717/peerj-cs.957] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2021] [Accepted: 04/01/2022] [Indexed: 06/15/2023]
Abstract
This article presents efficient quantum solutions for exact multiple pattern matching on biological sequences. The classical solution takes O(mN) time to match m patterns against a text database of size N. The quantum search mechanism is the core of pattern matching, as it reduces time complexity and achieves a computational speedup. Few quantum methods are available for multiple pattern matching, and they execute the search oracle for each pattern in successive iterations. Such solutions are likely acceptable because they are quantum designs equivalent to the classical ones. However, these methods are constrained by the multiplicative factor m in their complexities. An optimal quantum design would execute multiple search oracles in parallel on a single-core quantum processing unit, which would completely remove the multiplicative factor m; however, this method is impractical to design. At present there are no effective quantum solutions for processing multiple patterns. Therefore, we propose quantum algorithms that use a quantum processing unit with C quantum cores working on shared quantum memory. This quantum parallel design would be effective for searching all t exact occurrences of each pattern. To our knowledge, no attempts have been made to design multiple pattern matching algorithms on a quantum multicore processor. Thus, some notable quantum exact single-pattern matching algorithms are enhanced here into equivalent versions for multiple pattern matching, namely the enhanced quantum memory processing based exact algorithm and the enhanced quantum-based combined exact algorithm. Our quantum solutions find all t exact occurrences of each pattern inside the biological sequence in O((m/C)N) and O((m/C)t) time complexities. This article shows the hybrid simulation of the quantum algorithms to validate the quantum solutions. Our theoretical and experimental results show that these algorithms significantly outperform the existing classical solutions and are effective compared with their quantum counterparts.
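For orientation, the classical baseline that the abstract contrasts against can be sketched as below. This is only an illustrative sketch of scanning the text once per pattern, so the work grows with both the number of patterns and the text length; it is not the authors' quantum method, and the function name `find_all_occurrences` and the toy sequence are assumptions made for this example.

```python
from typing import Dict, List


def find_all_occurrences(text: str, patterns: List[str]) -> Dict[str, List[int]]:
    """Return every start index at which each pattern occurs in text.

    Hypothetical helper for illustration: one scan of the text per pattern.
    """
    hits: Dict[str, List[int]] = {}
    for pattern in patterns:  # one pass over the text for each of the m patterns
        positions: List[int] = []
        start = text.find(pattern)
        while start != -1:
            positions.append(start)
            # Advance by one so overlapping occurrences are reported too.
            start = text.find(pattern, start + 1)
        hits[pattern] = positions
    return hits


# Toy DNA string and patterns (made up for this example).
occurrences = find_all_occurrences("ACGTACGTAA", ["ACG", "AA", "TTT"])
print(occurrences)
```

Each pattern triggers its own scan, which is the multiplicative factor m that the abstract says a multicore quantum design aims to divide by C.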
3
Li J, Lee JY, Liao L. A new algorithm to train hidden Markov models for biological sequences with partial labels. BMC Bioinformatics 2021; 22:162. [PMID: 33771095 PMCID: PMC7995745 DOI: 10.1186/s12859-021-04080-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/21/2020] [Accepted: 03/16/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Hidden Markov models (HMMs) are a powerful tool for analyzing biological sequences in a wide variety of applications, from profiling functional protein families to identifying functional domains. The standard method for HMM training is either maximum likelihood by counting, when sequences are labelled, or expectation maximization, such as the Baum-Welch algorithm, when sequences are unlabelled. However, there are increasingly situations where sequences are only partially labelled. In this paper, we design a new training method based on the Baum-Welch algorithm to train HMMs when only partial labelling is available for certain biological problems. RESULTS: Compared with a similar, previously reported method designed for active learning in text mining, our method achieves significant improvements in model training, as demonstrated by higher accuracy when the trained models are tested for decoding with both synthetic data and real data. CONCLUSIONS: A novel training method is developed to improve the training of hidden Markov models by utilizing partially labelled data. The method will have an impact on detecting de novo motifs and signals in biological sequence data. In particular, the method will be deployed in active learning mode in ongoing research on detecting plasmodesmata targeting signals, and its performance will be assessed with validations from wet-lab experiments.
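To make the fully labelled case mentioned in the background concrete, the following is a minimal sketch of maximum-likelihood HMM training by counting. It does not implement the paper's partial-label extension of Baum-Welch; the function names, the two toy states "B" (background) and "M" (motif), and the omission of smoothing and of an initial-state distribution are all assumptions made for brevity.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# A labelled sequence is a list of (hidden state, emitted symbol) pairs.
LabelledSequence = List[Tuple[str, str]]


def normalise(counts: Counter) -> Dict[str, float]:
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def train_by_counting(sequences: List[LabelledSequence]):
    """Estimate transition and emission probabilities from labelled data.

    Hypothetical helper: no smoothing, so unseen events get no entry.
    """
    transition_counts = defaultdict(Counter)
    emission_counts = defaultdict(Counter)
    for sequence in sequences:
        for state, symbol in sequence:
            emission_counts[state][symbol] += 1
        # Count each adjacent pair of states within the same sequence.
        for (prev_state, _), (next_state, _) in zip(sequence, sequence[1:]):
            transition_counts[prev_state][next_state] += 1
    transitions = {s: normalise(c) for s, c in transition_counts.items()}
    emissions = {s: normalise(c) for s, c in emission_counts.items()}
    return transitions, emissions


# Toy data, made up for this example.
toy = [[("B", "A"), ("B", "C"), ("M", "G"), ("M", "G"), ("B", "A")]]
transitions, emissions = train_by_counting(toy)
print(transitions)
print(emissions)
```

When some positions lack labels these counts are no longer directly observable, which is the situation the abstract's Baum-Welch-based method is designed to handle.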
Affiliation(s)
- Jiefu Li
  - Computer and Information Sciences, University of Delaware, 101 Smith Hall, Newark, DE, 19716, USA
- Jung-Youn Lee
  - Plant and Soil Sciences, University of Delaware, 15 Innovation Way, Newark, 19716, USA
  - Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Newark, 19716, USA
- Li Liao
  - Computer and Information Sciences, University of Delaware, 101 Smith Hall, Newark, DE, 19716, USA
  - Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Newark, 19716, USA
  - Data Science Institute, University of Delaware, 100 Discovery Blvd, Newark, 19713, USA
4
Pozzi FI, Green GY, Barbona IG, Rodríguez GR, Felitti SA. CleanBSequences: an efficient curator of biological sequences in R. Mol Genet Genomics 2020; 295:837-841. [PMID: 32300860 DOI: 10.1007/s00438-020-01671-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/12/2019] [Accepted: 03/30/2020] [Indexed: 10/24/2022]
Abstract
This work presents a new method and tool to solve a common problem of molecular biologists and geneticists who use molecular markers in their scientific research and development: curation of sequences. Omic studies conducted by molecular biologists and geneticists usually involve the use of molecular markers. AFLP, cDNA-AFLP, and MSAP are examples of markers that provide information at the genomics, transcriptomics, and epigenomics levels, respectively. These three types of molecular markers use adaptors that are the template for PCR amplification. The adaptor sequences have to be eliminated before the results are analyzed. Since a large number of sequences are usually obtained in these studies, this clean-up of the data can demand considerable time and work. To automate this work, an R package named CleanBSequences was created that allows sequences to be curated in bulk, quickly, and without errors, and that can be used offline. The curation is performed by aligning the forward and/or reverse primers or ends of cloning vectors with the sequences to be removed. After the alignment, new subsequences are generated without the biological fragments not desired by the user, i.e., sequences needed by the techniques. In conclusion, the CleanBSequences tool facilitates the work of researchers, reducing time, effort, and working errors. The tool therefore addresses the problems related to the curation of sequences obtained from some types of molecular markers. In addition, being open source, CleanBSequences is a flexible tool that can be improved in the future to respond to new problems.
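The clean-up step described above can be illustrated with a deliberately simplified sketch. It strips adaptors only when they match exactly at the ends of a read, whereas the abstract describes an alignment-based procedure in an R package; the function name `trim_adaptors`, the adaptor strings, and the reads are all hypothetical and do not come from CleanBSequences.

```python
from typing import List


def trim_adaptors(sequence: str, forward: str, reverse: str) -> str:
    """Remove an exact forward adaptor from the start and an exact
    reverse adaptor from the end of a read, if present.

    Simplified stand-in for alignment-based curation (an assumption).
    """
    if forward and sequence.startswith(forward):
        sequence = sequence[len(forward):]
    if reverse and sequence.endswith(reverse):
        sequence = sequence[:-len(reverse)]
    return sequence


# Made-up adaptors and reads, for illustration only.
FORWARD_ADAPTOR = "GACTG"
REVERSE_ADAPTOR = "TCAT"
reads: List[str] = ["GACTGAAGGTTCCTCAT", "AAGGTTCC"]
cleaned = [trim_adaptors(r, FORWARD_ADAPTOR, REVERSE_ADAPTOR) for r in reads]
print(cleaned)
```

Exact matching would miss adaptors containing sequencing errors, which is one reason an alignment-based approach such as the one the abstract describes may be preferable in practice.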
Affiliation(s)
- Florencia I Pozzi
  - Instituto de Tecnología Agropecuaria, EEA Marcos Juárez, Ruta 12 km. 3, 2580, Marcos Juárez, Córdoba, Argentina
  - Cátedra de Microbiología, Facultad de Ciencias Agrarias, Universidad Nacional de Rosario, S2125ZAA, Zavalla, Santa Fe, Argentina
- Gisela Y Green
  - Cátedra de Epidemiología, Facultad de Ciencias Veterinarias, Universidad Nacional de Rosario, S2170, Casilda, Santa Fe, Argentina
- Ivana G Barbona
  - Cátedra de Estadística, Facultad de Ciencias Agrarias, Universidad Nacional de Rosario, S2125ZAA, Zavalla, Santa Fe, Argentina
- Gustavo R Rodríguez
  - Instituto de Investigaciones en Ciencias Agrarias de Rosario (IICAR) (CONICET-UNR), Zavalla, Argentina
  - Cátedra de Genética, Facultad de Ciencias Agrarias, Universidad Nacional de Rosario, S2125ZAA, Zavalla, Santa Fe, Argentina
- Silvina A Felitti
  - Instituto de Investigaciones en Ciencias Agrarias de Rosario (IICAR) (CONICET-UNR), Zavalla, Argentina
5
Hattori LT, Gutoski M, Vargas Benítez CM, Nunes LF, Lopes HS. A benchmark of optimally folded protein structures using integer programming and the 3D-HP-SC model. Comput Biol Chem 2020; 84:107192. [PMID: 31918170 DOI: 10.1016/j.compbiolchem.2019.107192] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/06/2019] [Revised: 12/09/2019] [Accepted: 12/10/2019] [Indexed: 01/04/2023]
Abstract
The Protein Structure Prediction (PSP) problem comprises, among other issues, forecasting the three-dimensional native structure of proteins using only their primary structure information. Most computational studies in this area use synthetic data instead of real biological data. However, the closer the data are to the real world, the greater the impact and applicability of the results. This work presents 17 real protein sequences extracted from the Protein Data Bank as a benchmark for the PSP problem using the three-dimensional Hydrophobic-Polar with Side-Chains model (3D-HP-SC). The native structure of these proteins was found by maximizing the number of hydrophobic contacts between the side-chains of amino acids. The problem was treated as an optimization problem and solved by means of an Integer Programming approach. Although the method solves the problem optimally, the processing time shows an exponential trend. Therefore, due to computational limitations, the method is a proof of concept and is not applicable to large sequences. For unknown sequences, an upper bound on the number of hydrophobic contacts (using this model) can be found, owing to a linear relationship with the number of hydrophobic residues. The comparison between the predicted and the biological structures showed that the highest similarity between them was found with distance thresholds around 5.2-8.2 Å. Both the dataset and the programs developed will be freely available to foster further research in the area.
Affiliation(s)
- Leandro Takeshi Hattori
  - Bioinformatics and Computational Intelligence Laboratory, Federal University of Technology Paraná (UTFPR), Av. 7 de Setembro, 3165, 80230-901 Curitiba (PR), Brazil
- Matheus Gutoski
  - Bioinformatics and Computational Intelligence Laboratory, Federal University of Technology Paraná (UTFPR), Av. 7 de Setembro, 3165, 80230-901 Curitiba (PR), Brazil
- César Manuel Vargas Benítez
  - Bioinformatics and Computational Intelligence Laboratory, Federal University of Technology Paraná (UTFPR), Av. 7 de Setembro, 3165, 80230-901 Curitiba (PR), Brazil
- Luiz Fernando Nunes
  - Bioinformatics and Computational Intelligence Laboratory, Federal University of Technology Paraná (UTFPR), Av. 7 de Setembro, 3165, 80230-901 Curitiba (PR), Brazil
- Heitor Silvério Lopes
  - Bioinformatics and Computational Intelligence Laboratory, Federal University of Technology Paraná (UTFPR), Av. 7 de Setembro, 3165, 80230-901 Curitiba (PR), Brazil