1
Adjeroh DA, Zhou X, Paschoal AR, Dimitrova N, Derevyanchuk EG, Shkurat TP, Loeb JA, Martinez I, Lipovich L. Challenges in LncRNA Biology: Views and Opinions. Noncoding RNA 2024; 10:43. [PMID: 39195572 DOI: 10.3390/ncrna10040043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2024] [Revised: 06/26/2024] [Accepted: 07/04/2024] [Indexed: 08/29/2024] Open
Abstract
This is a mini-review capturing the views and opinions of selected participants at the 2021 IEEE BIBM 3rd Annual LncRNA Workshop, held in Dubai, UAE. The views and opinions are expressed on five broad themes related to problems in lncRNA, namely, challenges in the computational analysis of lncRNAs, lncRNAs and cancer, lncRNAs in sports, lncRNAs and COVID-19, and lncRNAs in human brain activity.
Affiliation(s)
- Donald A Adjeroh
- Lane Department of Computer Science and Electrical Engineering, West Virginia University (WVU), Morgantown, WV 26506, USA
- Xiaobo Zhou
- Department of Bioinformatics and Systems Medicine, University of Texas Health Science Center, Houston, TX 77030, USA
- Alexandre Rossi Paschoal
- Department of Computer Science, Bioinformatics and Pattern Recognition Group, Federal University of Technology-Paraná-UTFPR, Curitiba 86300-000, Brazil
- Rosalind Franklin Institute, Harwell Science and Innovation Campus, Didcot OX11 0FA, UK
- Nadya Dimitrova
- Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT 06520, USA
- Tatiana P Shkurat
- Department of Genetics, Southern Federal University, Rostov-on-Don 344090, Russia
- Jeffrey A Loeb
- Department of Neurology and Rehabilitation, The Center for Clinical and Translational Science, The University of Illinois NeuroRepository, University of Illinois, Chicago, IL 60607, USA
- Ivan Martinez
- Department of Microbiology, Immunology & Cell Biology, WVU Cancer Institute, West Virginia University (WVU) School of Medicine, Morgantown, WV 26505, USA
- Leonard Lipovich
- Shenzhen Huayuan Biological Science Research Institute, Shenzhen Huayuan Biotechnology Co., Ltd., Shenzhen 518000, China
- Center for Molecular Medicine and Genetics, School of Medicine, Wayne State University, Detroit, MI 48201, USA
- College of Science, Mathematics and Technology, Wenzhou-Kean University, Wenzhou 325060, China
2
Gugulothu P, Bhukya R. Coot-Lion optimized deep learning algorithm for COVID-19 point mutation rate prediction using genome sequences. Comput Methods Biomech Biomed Engin 2024; 27:1410-1429. [PMID: 37668061 DOI: 10.1080/10255842.2023.2244109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2022] [Revised: 07/08/2023] [Accepted: 07/28/2023] [Indexed: 09/06/2023]
Abstract
In this study, a deep quantum neural network (Deep QNN) optimized with the Lion-based Coot algorithm (LBCA) is employed to predict COVID-19 point mutation rates. Features are first extracted from the genome sequences and then fused using the Bray-Curtis distance and a deep belief network (DBN). The LBCA is obtained by integrating the Coot algorithm with LOA. The COVID-19 predictions are made using mutation points. The LBCA-based Deep QNN achieved a testing accuracy of 0.941, a true positive rate of 0.931, and a false positive rate of 0.869.
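The Bray-Curtis distance mentioned in this abstract can be illustrated with a short sketch. The abstract does not say how the distance enters the feature-fusion step, so the code below only shows the standard dissimilarity between two non-negative feature vectors; the function name and the example count vectors are hypothetical.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(abs(a + b) for a, b in zip(u, v))
    # Two all-zero vectors are treated as identical (dissimilarity 0).
    return numerator / denominator if denominator else 0.0


# Hypothetical feature vectors (for example, k-mer counts of two sequences).
counts_a = [4, 0, 2, 6]
counts_b = [2, 2, 2, 6]
print(bray_curtis(counts_a, counts_b))
```

For non-negative inputs the value lies between 0 (identical vectors) and 1 (no shared non-zero components).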
Affiliation(s)
- Praveen Gugulothu
- Department of Computer Science and Engineering, National Institute of Technology Warangal, Hanamkonda, Telangana 506004, India
- Raju Bhukya
- Department of Computer Science and Engineering, National Institute of Technology Warangal, Hanamkonda, Telangana 506004, India
3
Abbass J, Parisi C. Machine learning-based prediction of proteins' architecture using sequences of amino acids and structural alphabets. J Biomol Struct Dyn 2024:1-16. [PMID: 38505995 DOI: 10.1080/07391102.2024.2328736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Accepted: 03/05/2024] [Indexed: 03/21/2024]
Abstract
In addition to the growth of protein structures generated through wet laboratory experiments and deposited in the PDB repository, AlphaFold predictions have significantly contributed to the creation of a much larger database of protein structures. Annotating such a vast number of structures has become an increasingly challenging task. CATH is widely recognized as one of the most common platforms for addressing this challenge, as it classifies proteins based on their structural and evolutionary relationships, offering the scientific community an invaluable resource for uncovering various properties, including functional annotations. While CATH annotation involves, to some extent, human intervention, keeping up with the classification of the rapidly expanding repositories of protein structures has become exceedingly difficult. Therefore, there is a pressing need for a fully automated approach. On the other hand, the abundance of protein sequences stemming from next generation sequencing technologies, lacking structural annotations, presents an additional challenge to the scientific community. Consequently, 'pre-annotating' protein sequences with structural features, ensuring a high level of precision, could prove highly advantageous. In this paper, after a thorough investigation, we introduce a novel machine-learning model capable of classifying any protein domain, whether it has a known structure or not, into one of the 40 main CATH Architectures. We achieve an F1 Score of 0.92 using only the amino acid sequence and a score of 0.94 using both the sequence of amino acids and the sequence of structural alphabets. Communicated by Ramaswamy H. Sarma.
Affiliation(s)
- Jad Abbass
- School of Computer Science and Mathematics, Kingston University, London, UK
- Charles Parisi
- School of Computer Science and Mathematics, Kingston University, London, UK
- Telecom Physique Strasbourg, Strasbourg University, Strasbourg, France
4
Avila Santos AP, de Almeida BLS, Bonidia RP, Stadler PF, Stefanic P, Mandic-Mulec I, Rocha U, Sanches DS, de Carvalho ACPLF. BioDeepfuse: a hybrid deep learning approach with integrated feature extraction techniques for enhanced non-coding RNA classification. RNA Biol 2024; 21:1-12. [PMID: 38528797 DOI: 10.1080/15476286.2024.2329451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 01/23/2024] [Indexed: 03/27/2024] Open
Abstract
The accurate classification of non-coding RNA (ncRNA) sequences is pivotal for advanced non-coding genome annotation and analysis, a fundamental aspect of genomics that facilitates understanding of ncRNA functions and regulatory mechanisms in various biological processes. While traditional machine learning approaches have been employed for distinguishing ncRNA, these often necessitate extensive feature engineering. Recently, deep learning algorithms have provided advancements in ncRNA classification. This study presents BioDeepFuse, a hybrid deep learning framework integrating convolutional neural networks (CNN) or bidirectional long short-term memory (BiLSTM) networks with handcrafted features for enhanced accuracy. This framework employs a combination of k-mer one-hot, k-mer dictionary, and feature extraction techniques for input representation. Extracted features, when embedded into the deep network, enable optimal utilization of spatial and sequential nuances of ncRNA sequences. Using benchmark datasets and real-world RNA samples from bacterial organisms, we evaluated the performance of BioDeepFuse. Results exhibited high accuracy in ncRNA classification, underscoring the robustness of our tool in addressing complex ncRNA sequence data challenges. The effective melding of CNN or BiLSTM with external features heralds promising directions for future research, particularly in refining ncRNA classifiers and deepening insights into ncRNAs in cellular processes and disease manifestations. In addition to its original application in the context of bacterial organisms, the methodologies and techniques integrated into our framework can potentially render BioDeepFuse effective in various and broader domains.
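The k-mer dictionary representation named in this abstract can be sketched as follows. This is a generic illustration, not BioDeepFuse's actual code: the function names, the choice of k = 2, the RNA alphabet, and the convention of reserving index 0 for k-mers with unknown symbols are all assumptions.

```python
from itertools import product


def kmers(sequence, k):
    """Return the overlapping k-mers of a sequence, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_kmer_index(k, alphabet="ACGU"):
    """Map every possible k-mer over the alphabet to an integer, starting at 1."""
    return {"".join(p): i
            for i, p in enumerate(product(alphabet, repeat=k), start=1)}


def encode(sequence, k, index):
    """Integer-encode a sequence; 0 marks k-mers containing unknown symbols."""
    return [index.get(kmer, 0) for kmer in kmers(sequence.upper(), k)]


index = build_kmer_index(2)
print(encode("ACGUN", 2, index))  # [2, 7, 12, 0]
```

An integer sequence of this kind is the usual input to an embedding layer in front of a CNN or BiLSTM. A k-mer one-hot variant would instead expand each integer into a vector of length 4^k.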
Affiliation(s)
- Anderson P Avila Santos
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil
- Department of Applied Microbial Ecology, Helmholtz Centre for Environmental Research - UFZ GmbH, Leipzig, Saxony, Germany
- Breno L S de Almeida
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil
- Robson P Bonidia
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos, Brazil
- Department of Computer Science, Federal University of Technology - Paraná, UTFPR, Cornélio Procópio, Brazil
- Peter F Stadler
- Department of Computer Science and Interdisciplinary Center of Bioinformatics, University of Leipzig, Leipzig, Saxony, Germany
- Polonca Stefanic
- Department of Food Science and Technology, Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
- Ines Mandic-Mulec
- Department of Food Science and Technology, Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
- Ulisses Rocha
- Department of Applied Microbial Ecology, Helmholtz Centre for Environmental Research - UFZ GmbH, Leipzig, Saxony, Germany
- Danilo S Sanches
- Department of Computer Science, Federal University of Technology - Paraná, UTFPR, Cornélio Procópio, Brazil
5
Wang H, Zeng W, Huang X, Liu Z, Sun Y, Zhang L. MTTLm6A: A multi-task transfer learning approach for base-resolution mRNA m6A site prediction based on an improved transformer. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:272-299. [PMID: 38303423 DOI: 10.3934/mbe.2024013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2024]
Abstract
N6-methyladenosine (m6A) is a crucial RNA modification involved in various biological activities. Computational methods have been developed for the detection of m6A sites in Saccharomyces cerevisiae at base-resolution due to their cost-effectiveness and efficiency. However, the generalization of these methods has been hindered by limited base-resolution datasets. Additionally, RMBase contains a vast number of low-resolution m6A sites for Saccharomyces cerevisiae, and base-resolution sites are often inferred from these low-resolution results through post-calibration. We propose MTTLm6A, a multi-task transfer learning approach for base-resolution mRNA m6A site prediction based on an improved transformer. First, the RNA sequences are encoded by using one-hot encoding. Then, we construct a multi-task model that combines a convolutional neural network with a multi-head-attention deep framework. This model not only detects low-resolution m6A sites, it also assigns reasonable probabilities to the predicted sites. Finally, we employ transfer learning to predict base-resolution m6A sites based on the low-resolution m6A sites. Experimental results on Saccharomyces cerevisiae m6A and Homo sapiens m1A data demonstrate that MTTLm6A respectively achieved area under the receiver operating characteristic (AUROC) values of 77.13% and 92.9%, outperforming the state-of-the-art models. At the same time, it shows that the model has strong generalization ability. To enhance user convenience, we have made a user-friendly web server for MTTLm6A publicly available at http://47.242.23.141/MTTLm6A/index.php.
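The one-hot encoding step described in this abstract can be sketched in a few lines. The column order A, C, G, U and the all-zero row for unknown bases are assumptions made for illustration; the abstract does not specify either.

```python
def one_hot(sequence, alphabet="ACGU"):
    """Encode an RNA sequence as a list of rows, one 0/1 vector per base.

    Bases outside the alphabet (for example N) become all-zero rows.
    """
    return [[1 if base == symbol else 0 for symbol in alphabet]
            for base in sequence.upper()]


# Hypothetical 5-nt window centred on a candidate adenosine.
for row in one_hot("GGACU"):
    print(row)
```

The resulting length-by-4 matrix is the kind of input a convolutional layer can consume directly.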
Affiliation(s)
- Honglei Wang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- School of Information Engineering, Xuzhou College of Industrial Technology, Xuzhou, China
- Wenliang Zeng
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Xiaoling Huang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Zhaoyang Liu
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Yanjing Sun
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
- Lin Zhang
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou, China
6
Shen Y, Gong L, Xu F, Wang S, Liu H, Wang Y, Hu L, Zhu L. Insight into the lncRNA-mRNA Co-Expression Profile and ceRNA Network in Lipopolysaccharide-Induced Acute Lung Injury. Curr Issues Mol Biol 2023; 45:6170-6189. [PMID: 37504305 PMCID: PMC10378513 DOI: 10.3390/cimb45070389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2023] [Revised: 07/16/2023] [Accepted: 07/20/2023] [Indexed: 07/29/2023] Open
Abstract
Long non-coding RNAs (lncRNAs) participate in acute lung injury (ALI). However, their latent biological function and molecular mechanism have not been fully understood. In the present study, the global expression profiles of lncRNAs and mRNAs between the control and lipopolysaccharide (LPS)-stimulated groups of human normal lung epithelial cells (BEAS-2B) were determined using high-throughput sequencing. Overall, a total of 433 lncRNAs and 183 mRNAs were differentially expressed. A lncRNA-mRNA co-expression network was established, and then the top 10 lncRNAs were screened using topological methods. Gene Ontology and Kyoto Encyclopedia of Genes and Genomes analysis results showed that the key lncRNAs targeting mRNAs were mostly enriched in the inflammatory-related biological processes. Gene set variation analysis and Pearson's correlation coefficients confirmed the close correlation for the top 10 lncRNAs with inflammatory responses. A protein-protein interaction network analysis was conducted based on the key lncRNAs targeting mRNAs, where IL-1β, IL-6, and CXCL8 were regarded as the hub genes. A competing endogenous RNA (ceRNA) modulatory network was created with five lncRNAs, thirteen microRNAs, and twelve mRNAs. Finally, real-time quantitative reverse transcription-polymerase chain reaction was employed to verify the expression levels of several key lncRNAs in BEAS-2B cells and human serum samples.
Affiliation(s)
- Yue Shen
- Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai 200032, China
- Linjing Gong
- Department of Respiratory and Critical Care Medicine, West China Hospital, Sichuan University, Chengdu 610041, China
- Fan Xu
- Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai 200032, China
- Sijiao Wang
- Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai 200032, China
- Hanhan Liu
- Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai 200032, China
- Yali Wang
- Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai 200032, China
- Lijuan Hu
- Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai 200032, China
- Lei Zhu
- Department of Pulmonary Medicine, Zhongshan Hospital, Fudan University, Shanghai 200032, China
- Department of Pulmonary and Critical Care Medicine, Huadong Hospital, Fudan University, Shanghai 200040, China
7
Liu Z, Lan P, Liu T, Liu X, Liu T. m6Aminer: Predicting the m6Am Sites on mRNA by Fusing Multiple Sequence-Derived Features into a CatBoost-Based Classifier. Int J Mol Sci 2023; 24:ijms24097878. [PMID: 37175594 PMCID: PMC10177809 DOI: 10.3390/ijms24097878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Revised: 04/20/2023] [Accepted: 04/24/2023] [Indexed: 05/15/2023] Open
Abstract
As one of the most important post-transcriptional modifications, m6Am plays a fairly important role in conferring mRNA stability and in the progression of cancers. The accurate identification of the m6Am sites is critical for explaining its biological significance and developing its application in the medical field. However, conventional experimental approaches are time-consuming and expensive, making them unsuitable for the large-scale identification of the m6Am sites. To address this challenge, we exploit a CatBoost-based method, m6Aminer, to identify the m6Am sites on mRNA. For feature extraction, nine different feature-encoding schemes (pseudo electron-ion interaction potential, hash decimal conversion method, dinucleotide binary encoding, nucleotide chemical properties, pseudo k-tuple composition, dinucleotide numerical mapping, K monomeric units, series correlation pseudo trinucleotide composition, and K-spaced nucleotide pair frequency) were utilized to form the initial feature space. To obtain the optimized feature subset, the ExtraTreesClassifier algorithm was adopted to perform feature importance ranking, and the top 300 features were selected as the optimal feature subset. With different performance assessment methods, 10-fold cross-validation and independent test, m6Aminer achieved average AUC of 0.913 and 0.754, demonstrating a competitive performance with the state-of-the-art models m6AmPred (0.905 and 0.735) and DLm6Am (0.897 and 0.730). The prediction model developed in this study can be used to identify the m6Am sites in the whole transcriptome, laying a foundation for the functional research of m6Am.
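One of the nine encodings listed in this abstract, the K-spaced nucleotide pair frequency, can be sketched as below. This is a generic version of the descriptor, not the m6Aminer implementation: the function name, the RNA alphabet, and the choice to normalise by the number of valid pairs are assumptions.

```python
from itertools import product


def kspaced_pair_frequency(sequence, gap, alphabet="ACGU"):
    """Frequency of each nucleotide pair separated by `gap` positions.

    Returns a 16-element vector in the order AA, AC, AG, ..., UU.
    Pairs containing symbols outside the alphabet are ignored.
    """
    pairs = ["".join(p) for p in product(alphabet, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence) - gap - 1):
        pair = sequence[i] + sequence[i + gap + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    return [counts[p] / total if total else 0.0 for p in pairs]


# Hypothetical 6-nt fragment; with gap=1 the pairs are A_G, C_U, G_A, U_C.
vector = kspaced_pair_frequency("ACGUAC", gap=1)
print(vector)
```

Concatenating the vectors for several gap values (for example 0 to 5) gives one block of the initial feature space, which a tree-based ranking step can then prune.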
Affiliation(s)
- Ze Liu
- College of Water Resources and Architectural Engineering, Northwest A&F University, Xianyang 712100, China
- Pengfei Lan
- College of Water Resources and Architectural Engineering, Northwest A&F University, Xianyang 712100, China
- Ting Liu
- College of Water Resources and Architectural Engineering, Northwest A&F University, Xianyang 712100, China
- Department of Mechanical Engineering, Faculty of Engineering, The University of Hong Kong, Hong Kong 999077, China
- Xudong Liu
- College of Water Resources and Architectural Engineering, Northwest A&F University, Xianyang 712100, China
- College of Control Science and Engineering, Zhejiang University, Hangzhou 310027, China
- Tao Liu
- College of Water Resources and Architectural Engineering, Northwest A&F University, Xianyang 712100, China
- Key Laboratory of Agricultural Soil and Water Engineering in Arid and Semiarid Areas, Ministry of Education, Northwest A&F University, Xianyang 712100, China
8
Wang Y, Wang X, Cui X, Meng J, Rong R. Self-attention enabled deep learning of dihydrouridine (D) modification on mRNAs unveiled a distinct sequence signature from tRNAs. MOLECULAR THERAPY. NUCLEIC ACIDS 2023; 31:411-420. [PMID: 36845339 PMCID: PMC9945750 DOI: 10.1016/j.omtn.2023.01.014] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/22/2022] [Accepted: 01/23/2023] [Indexed: 01/28/2023]
Abstract
Dihydrouridine (D) is a modified pyrimidine nucleotide universally found in viral, prokaryotic, and eukaryotic species. It serves as a metabolic modulator for various pathological conditions, and its elevated levels in tumors are associated with a series of cancers. Precise identification of D sites on RNA is vital for understanding its biological function. A number of computational approaches have been developed for predicting D sites on tRNAs; however, none have considered mRNAs. We present here DPred, the first computational tool for predicting D on mRNAs in yeast from the primary RNA sequences. Built on a local self-attention layer and a convolutional neural network (CNN) layer, the proposed deep learning model outperformed classic machine learning approaches (random forest, support vector machines, etc.) and achieved reasonable accuracy and reliability with areas under the curve of 0.9166 and 0.9027 in jackknife cross-validation and on an independent testing dataset, respectively. Importantly, we showed that distinct sequence signatures are associated with the D sites on mRNAs and tRNAs, implying potentially different formation mechanisms and putative divergent functionality of this modification on the two types of RNA. DPred is available as a user-friendly Web server.
Affiliation(s)
- Yue Wang
- Department of Mathematical Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Department of Computer Science, University of Liverpool, L69 7ZB Liverpool, UK
- Xuan Wang
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Xiaodong Cui
- School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, Shaanxi 710072, China
- Jia Meng
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- AI University Research Centre, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Institute of Systems, Molecular and Integrative Biology, University of Liverpool, L69 7ZB Liverpool, UK
- Rong Rong
- Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China
- Corresponding author: Rong Rong, Department of Biological Sciences, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu 215123, China.
9
Zhang H, Wang Y, Pan Z, Sun X, Mou M, Zhang B, Li Z, Li H, Zhu F. ncRNAInter: a novel strategy based on graph neural network to discover interactions between lncRNA and miRNA. Brief Bioinform 2022; 23:6747810. [PMID: 36198065 DOI: 10.1093/bib/bbac411] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 06/07/2022] [Revised: 08/04/2022] [Accepted: 08/23/2022] [Indexed: 12/14/2022] Open
Abstract
In recent years, many studies have illustrated the significant role that non-coding RNA (ncRNA) plays in biological activities, in which lncRNA, miRNA and especially their interactions have been shown to affect many biological processes. Some in silico methods have been proposed and applied to identify novel lncRNA-miRNA interactions (LMIs), but their RNA representation and information extraction approaches remain imperfect, which leaves room for improving their performance. Meanwhile, only a few of them are accessible at present, which limits their practical applications. The construction of a new tool for LMI prediction is thus imperative for a better understanding of the relevant biological mechanisms. This study proposed a novel method, ncRNAInter, for LMI prediction. A comprehensive strategy for RNA representation and an optimized graph neural network deep learning algorithm were utilized in this study. ncRNAInter was robust and achieved a Matthews correlation coefficient 26.7% higher than existing reputable methods for human LMI prediction. In addition, ncRNAInter proved its universal applicability in dealing with LMIs from various species and successfully identified novel LMIs associated with various diseases, which further verified its effectiveness and usability. All source code and datasets are freely available at https://github.com/idrblab/ncRNAInter.
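The Matthews correlation coefficient used as the headline metric in this abstract can be computed from a binary confusion matrix. The sketch below shows the standard formula; the function name and the convention of returning 0 when a marginal total is empty are illustrative choices, and the example counts are hypothetical.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a predictor of interacting vs non-interacting pairs.
print(mcc(tp=90, tn=85, fp=15, fn=10))
```

Unlike accuracy, the coefficient ranges from -1 to 1 and stays near 0 for a classifier that guesses, even when the two classes are imbalanced.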
Affiliation(s)
- Hanyu Zhang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
- Yunxia Wang
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Ziqi Pan
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Xiuna Sun
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Minjie Mou
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Bing Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
- Zhaorong Li
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
- Honglin Li
- School of Computer Science and Technology, East China Normal University, Shanghai 200062, China
- Shanghai Key Laboratory of New Drug Design, East China University of Science and Technology, Shanghai 200237, China
- Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
10
Bonidia RP, Avila Santos AP, de Almeida BLS, Stadler PF, Nunes da Rocha U, Sanches DS, de Carvalho ACPLF. Information Theory for Biological Sequence Classification: A Novel Feature Extraction Technique Based on Tsallis Entropy. ENTROPY (BASEL, SWITZERLAND) 2022; 24:1398. [PMID: 37420418 DOI: 10.3390/e24101398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2022] [Revised: 09/16/2022] [Accepted: 09/24/2022] [Indexed: 07/09/2023]
Abstract
In recent years, there has been an exponential growth in sequencing projects due to accelerated technological advances, leading to a significant increase in the amount of data and resulting in new challenges for biological sequence analysis. Consequently, the use of techniques capable of analyzing large amounts of data has been explored, such as machine learning (ML) algorithms. ML algorithms are being used to analyze and classify biological sequences, despite the intrinsic difficulty in extracting and finding representative biological sequence methods suitable for them. Thereby, extracting numerical features to represent sequences makes it statistically feasible to use universal concepts from Information Theory, such as Tsallis and Shannon entropy. In this study, we propose a novel Tsallis entropy-based feature extractor to provide useful information to classify biological sequences. To assess its relevance, we prepared five case studies: (1) an analysis of the entropic index q; (2) performance testing of the best entropic indices on new datasets; (3) a comparison made with Shannon entropy and (4) generalized entropies; (5) an investigation of the Tsallis entropy in the context of dimensionality reduction. As a result, our proposal proved to be effective, being superior to Shannon entropy and robust in terms of generalization, and also potentially representative for collecting information in fewer dimensions compared with methods such as Singular Value Decomposition and Uniform Manifold Approximation and Projection.
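To make the role of the entropic index q concrete, the sketch below computes the Tsallis entropy of a sequence's k-mer distribution. It is a minimal illustration under stated assumptions, not the authors' feature extractor: the function names, the use of k-mer frequencies as the probability distribution, and the natural-log Shannon limit at q = 1 are choices made here.

```python
from collections import Counter
from math import log


def kmer_probabilities(sequence, k):
    """Relative frequencies of the overlapping k-mers in a sequence."""
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return [c / total for c in counts.values()] if total else []


def tsallis_entropy(probabilities, q):
    """Tsallis entropy S_q = (1 - sum(p_i ** q)) / (q - 1).

    As q approaches 1 this tends to the Shannon entropy (natural log),
    which is returned directly for q == 1.
    """
    if abs(q - 1.0) < 1e-12:
        return -sum(p * log(p) for p in probabilities if p > 0)
    return (1.0 - sum(p ** q for p in probabilities)) / (q - 1.0)


# Hypothetical sequence; one feature per (k, q) combination.
sequence = "ACGUACGUAAGG"
features = [tsallis_entropy(kmer_probabilities(sequence, k), q=2.0)
            for k in (1, 2, 3)]
print(features)
```

Sweeping q over a grid and keeping the values that separate the classes best corresponds to the kind of entropic-index analysis the first case study describes.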
Affiliation(s)
- Robson P Bonidia
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
- Anderson P Avila Santos
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
- Department of Environmental Microbiology, Helmholtz Centre for Environmental Research-UFZ GmbH, 04318 Leipzig, Germany
- Breno L S de Almeida
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
- Peter F Stadler
- Department of Computer Science and Interdisciplinary Center of Bioinformatics, University of Leipzig, 04107 Leipzig, Germany
- Ulisses Nunes da Rocha
- Department of Environmental Microbiology, Helmholtz Centre for Environmental Research-UFZ GmbH, 04318 Leipzig, Germany
- Danilo S Sanches
- Department of Computer Science, Federal University of Technology-Paraná-UTFPR, Cornélio Procópio 86300-000, Brazil
- André C P L F de Carvalho
- Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
11
Mathematical Modeling and Computational Prediction of High-Risk Types of Human Papillomaviruses. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:1515810. [PMID: 35912141 PMCID: PMC9334084 DOI: 10.1155/2022/1515810] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/22/2022] [Accepted: 06/28/2022] [Indexed: 11/17/2022]
Abstract
Cervical cancer is one of the main causes of cancer death all over the world. Most diseases such as cervical epithelial atypical hyperplasia and invasive cervical cancer are closely related to the continuous infection of high-risk types of human papillomavirus. Therefore, the high-risk types of human papillomavirus are the key to the prevention and treatment of cervical cancer. With the accumulation of high-throughput and clinical data, the use of systematic and quantitative methods for mathematical modeling and computational prediction has become more and more important. This paper summarizes the mathematical models and prediction methods of the risk types of human papillomavirus, especially around the key steps such as feature extraction, feature selection, and prediction algorithms. We summarized and discussed the advantages and disadvantages of existing algorithms, which provides a theoretical basis for follow-up research.
12
Zandavi SM, Koch FC, Vijayan A, Zanini F, Mora F, Ortega D, Vafaee F. Disentangling single-cell omics representation with a power spectral density-based feature extraction. Nucleic Acids Res 2022; 50:5482-5492. [PMID: 35639509 PMCID: PMC9178020 DOI: 10.1093/nar/gkac436] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/21/2021] [Revised: 04/26/2022] [Accepted: 05/10/2022] [Indexed: 12/13/2022] Open
Abstract
Emerging single-cell technologies provide high-resolution measurements of distinct cellular modalities opening new avenues for generating detailed cellular atlases of many and diverse tissues. The high dimensionality, sparsity, and inaccuracy of single cell sequencing measurements, however, can obscure discriminatory information, mask cellular subtype variations and complicate downstream analyses which can limit our understanding of cell function and tissue heterogeneity. Here, we present a novel pre-processing method (scPSD) inspired by power spectral density analysis that enhances the accuracy for cell subtype separation from large-scale single-cell omics data. We comprehensively benchmarked our method on a wide range of single-cell RNA-sequencing datasets and showed that scPSD pre-processing, while being fast and scalable, significantly reduces data complexity, enhances cell-type separation, and enables rare cell identification. Additionally, we applied scPSD to transcriptomics and chromatin accessibility cell atlases and demonstrated its capacity to discriminate over 100 cell types across the whole organism and across different modalities of single-cell omics data.
Affiliation(s)
- Seid Miad Zandavi
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales (UNSW Sydney), Australia
  - Programs in Metabolism and Medical & Population Genetics, Broad Institute, Cambridge, MA, USA
  - Division of Genetics and Genomics, Boston Children's Hospital, Boston, MA, USA
  - Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Forrest C Koch
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales (UNSW Sydney), Australia
- Abhishek Vijayan
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales (UNSW Sydney), Australia
- Fabio Zanini
  - Prince of Wales Clinical School, UNSW Sydney, Australia
  - Cellular Genomics Future Institute, UNSW Sydney, Australia
- Fatima Valdes Mora
  - Children's Cancer Institute, Lowy Cancer Research Centre, UNSW Sydney, Australia
  - School of Women's and Children's Health, Faculty of Medicine, UNSW, Sydney, Australia
- David Gallego Ortega
  - School of Biomedical Engineering, University of Technology Sydney (UTS), Australia
- Fatemeh Vafaee
  - School of Biotechnology and Biomolecular Sciences, University of New South Wales (UNSW Sydney), Australia
  - Cellular Genomics Future Institute, UNSW Sydney, Australia
  - UNSW Data Science Hub (uDASH), UNSW Sydney, Australia
13
Bonidia RP, Domingues DS, Sanches DS, de Carvalho ACPLF. MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. Brief Bioinform 2022; 23:bbab434. [PMID: 34750626 PMCID: PMC8769707 DOI: 10.1093/bib/bbab434] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 06/29/2021] [Revised: 09/18/2021] [Accepted: 09/20/2021] [Indexed: 12/24/2022] Open
Abstract
One of the main challenges in applying machine learning algorithms to biological sequence data is how to represent a sequence as a numeric input vector. Feature extraction techniques capable of extracting numerical information from biological sequences have been reported in the literature. However, many of these techniques, such as mathematical descriptors, are not available in existing packages. This paper presents a new package, MathFeature, which implements mathematical descriptors able to extract relevant numerical information from biological sequences, i.e. DNA, RNA and proteins (prediction of structural features along the primary sequence of amino acids). MathFeature makes available 20 numerical feature extraction descriptors based on approaches found in the literature, e.g. multiple numeric mappings, genomic signal processing, chaos game theory, entropy and complex networks. MathFeature also allows the extraction of alternative features, complementing the existing packages. To ensure that our descriptors are robust and to assess their relevance, experimental results are presented in nine case studies. According to these results, the features extracted by MathFeature showed high performance (accuracy of 0.6350-0.9897), both when applying only mathematical descriptors and when hybridizing them with well-known descriptors from the literature. Finally, through MathFeature, we outperformed several studies on eight benchmark datasets, exemplifying the robustness and viability of the proposed package. MathFeature advances the area by bringing descriptors not available in other packages, as well as allowing non-experts to use feature extraction techniques.
Affiliation(s)
- Robson P Bonidia
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
- Douglas S Domingues
  - Group of Genomics and Transcriptomes in Plants, Institute of Biosciences, São Paulo State University (UNESP), Rio Claro 13506-900, Brazil
- Danilo S Sanches
  - Department of Computer Science, Federal University of Technology - Paraná, UTFPR, Cornélio Procópio 86300-000, Brazil
- André C P L F de Carvalho
  - Institute of Mathematics and Computer Sciences, University of São Paulo, São Carlos 13566-590, Brazil
14
Lood C, Boeckaerts D, Stock M, De Baets B, Lavigne R, van Noort V, Briers Y. Digital phagograms: predicting phage infectivity through a multilayer machine learning approach. Curr Opin Virol 2021; 52:174-181. [PMID: 34952265 DOI: 10.1016/j.coviro.2021.12.004] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/14/2021] [Revised: 11/26/2021] [Accepted: 12/04/2021] [Indexed: 12/19/2022]
Abstract
Machine learning has been broadly implemented to investigate biological systems. In this regard, the field of phage biology has embraced machine learning to elucidate and predict phage-host interactions, based on receptor-binding proteins, (anti-)defense systems, prophage detection, and life cycle recognition. Here, we highlight the enormous potential of integrating information from omics data with insights from systems biology to better understand phage-host interactions. We conceptualize and discuss the potential of a multilayer model that mirrors the phage infection process, integrating adsorption, bacterial pan-immune components and hijacking of the bacterial metabolism to predict phage infectivity. In the future, this model can offer insights into the underlying mechanisms of the infection process, and digital phagograms can support phage cocktail design and phage engineering.
Affiliation(s)
- Cédric Lood
  - Laboratory of Gene Technology, Department of Biosystems, KU Leuven, Leuven, Belgium; Centre of Microbial and Plant Genetics, Department of Microbial and Molecular Systems, KU Leuven, Leuven, Belgium
- Dimitri Boeckaerts
  - Laboratory of Applied Biotechnology, Department of Biotechnology, Ghent University, Ghent, Belgium; KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Michiel Stock
  - KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium; BIOBIX, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Bernard De Baets
  - KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Rob Lavigne
  - Laboratory of Gene Technology, Department of Biosystems, KU Leuven, Leuven, Belgium
- Vera van Noort
  - Centre of Microbial and Plant Genetics, Department of Microbial and Molecular Systems, KU Leuven, Leuven, Belgium; Institute of Biology, Leiden University, Leiden, The Netherlands
- Yves Briers
  - Laboratory of Applied Biotechnology, Department of Biotechnology, Ghent University, Ghent, Belgium