1. Soleymani F, Paquet E, Viktor HL, Michalowski W. Structure-based protein and small molecule generation using EGNN and diffusion models: A comprehensive review. Comput Struct Biotechnol J 2024; 23:2779-2797. [PMID: 39050782 PMCID: PMC11268121 DOI: 10.1016/j.csbj.2024.06.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 06/13/2024] [Accepted: 06/18/2024] [Indexed: 07/27/2024] Open
Abstract
Recent breakthroughs in deep learning have revolutionized protein sequence and structure prediction. These advancements are built on decades of protein design efforts, and are overcoming traditional time and cost limitations. Diffusion models, at the forefront of these innovations, significantly enhance design efficiency by automating knowledge acquisition. In the field of de novo protein design, the goal is to create entirely novel proteins with predetermined structures. Given the arbitrary positions of proteins in 3-D space, graph representations and their properties are widely used in protein generation studies. A critical requirement in protein modelling is maintaining spatial relationships under transformations (rotations, translations, and reflections). This property, known as equivariance, ensures that predicted protein characteristics adapt seamlessly to changes in orientation or position. Equivariant graph neural networks offer a solution to this challenge. By incorporating equivariant graph neural networks to learn the score of the probability density function in diffusion models, one can generate proteins with robust 3-D structural representations. This review examines the latest deep learning advancements, specifically focusing on frameworks that combine diffusion models with equivariant graph neural networks for protein generation.
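The equivariance property described in this abstract can be checked numerically. The sketch below is not taken from the review: it assumes a simplified EGNN-style coordinate update in which a fixed scalar function of the squared distance stands in for the learned network, and the names `egnn_style_update` and `rotate_z_and_translate` are hypothetical. It covers only a rotation about the z-axis plus a translation, and confirms that transforming the input and then updating gives the same result as updating and then transforming.

```python
import math

def egnn_style_update(coords):
    """One simplified coordinate update: x_i' = x_i + mean_j (x_i - x_j) * w(d_ij^2).

    w is a fixed scalar function here (an assumption); a real model would learn it.
    """
    n = len(coords)
    updated = []
    for i, xi in enumerate(coords):
        shift = [0.0, 0.0, 0.0]
        for j, xj in enumerate(coords):
            if i == j:
                continue
            diff = [a - b for a, b in zip(xi, xj)]
            sq_dist = sum(d * d for d in diff)
            weight = 1.0 / (1.0 + sq_dist)  # depends only on distance, so it is invariant
            for k in range(3):
                shift[k] += diff[k] * weight
        updated.append([xi[k] + shift[k] / (n - 1) for k in range(3)])
    return updated

def rotate_z_and_translate(coords, angle, offset):
    """Rotate every point about the z-axis by `angle`, then add `offset`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        [c * x - s * y + offset[0], s * x + c * y + offset[1], z + offset[2]]
        for x, y, z in coords
    ]

# Hypothetical toy coordinates, not real protein data.
points = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 2.0, 0.5], [1.0, 1.0, 1.0]]
angle, offset = 0.7, [3.0, -1.0, 2.0]

transform_then_update = egnn_style_update(rotate_z_and_translate(points, angle, offset))
update_then_transform = rotate_z_and_translate(egnn_style_update(points), angle, offset)

max_gap = max(
    abs(p - q)
    for row_a, row_b in zip(transform_then_update, update_then_transform)
    for p, q in zip(row_a, row_b)
)
print(max_gap < 1e-9)
```

The update is equivariant because the weights depend only on pairwise distances, which rotations and translations leave unchanged, while the difference vectors rotate together with the input.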
Affiliation(s)
- Farzan Soleymani
  - Telfer School of Management, University of Ottawa, ON, K1N 6N5, Canada
- Eric Paquet
  - National Research Council, 1200 Montreal Road, Ottawa, ON, K1A 0R6, Canada
  - School of Electrical Engineering and Computer Science, University of Ottawa, ON, K1N 6N5, Canada
- Herna Lydia Viktor
  - School of Electrical Engineering and Computer Science, University of Ottawa, ON, K1N 6N5, Canada
2. Pratiwi NKC, Tayara H, Chong KT. An Ensemble Classifiers for Improved Prediction of Native-Non-Native Protein-Protein Interaction. Int J Mol Sci 2024; 25:5957. [PMID: 38892144 PMCID: PMC11172808 DOI: 10.3390/ijms25115957] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Revised: 05/27/2024] [Accepted: 05/27/2024] [Indexed: 06/21/2024] Open
Abstract
In this study, we present an innovative approach to improve the prediction of protein-protein interactions (PPIs) through the utilization of an ensemble classifier, specifically focusing on distinguishing between native and non-native interactions. Leveraging the strengths of various base models, including random forest, gradient boosting, extreme gradient boosting, and light gradient boosting, our ensemble classifier integrates these diverse predictions using a logistic regression meta-classifier. Our model was evaluated using a comprehensive dataset generated from molecular dynamics simulations. While the gains in AUC and other metrics might seem modest, they contribute to a model that is more robust, consistent, and adaptable. To assess the effectiveness of various approaches, we compared the performance of logistic regression to four baseline models. Our results indicate that logistic regression consistently underperforms across all evaluated metrics. This suggests that it may not be well-suited to capture the complex relationships within this dataset. Tree-based models, on the other hand, appear to be more effective for problems involving molecular dynamics simulations. Extreme gradient boosting (XGBoost) and light gradient boosting (LightGBM) are optimized for performance and speed, handling datasets effectively and incorporating regularizations to avoid over-fitting. Our findings indicate that the ensemble method enhances the predictive capability of PPIs, offering a promising tool for computational biology and drug discovery by accurately identifying potential interaction sites and facilitating the understanding of complex protein functions within biological systems.
Affiliation(s)
- Nor Kumalasari Caecar Pratiwi
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
  - Department of Electrical Engineering, Telkom University, Bandung 40257, West Java, Indonesia
- Hilal Tayara
  - School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, Republic of Korea
- Kil To Chong
  - Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, Republic of Korea
  - Advances Electronics and Information Research Centre, Jeonbuk National University, Jeonju 54896, Republic of Korea
3. Qi X, Zhao Y, Qi Z, Hou S, Chen J. Machine Learning Empowering Drug Discovery: Applications, Opportunities and Challenges. Molecules 2024; 29:903. [PMID: 38398653 PMCID: PMC10892089 DOI: 10.3390/molecules29040903] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2024] [Revised: 02/08/2024] [Accepted: 02/14/2024] [Indexed: 02/25/2024] Open
Abstract
Drug discovery plays a critical role in advancing human health by developing new medications and treatments to combat diseases. How to accelerate the pace and reduce the costs of new drug discovery has long been a key concern for the pharmaceutical industry. Fortunately, by leveraging advanced algorithms, computational power and biological big data, artificial intelligence (AI) technology, especially machine learning (ML), holds the promise of making the hunt for new drugs more efficient. Recently, the Transformer-based models that have achieved revolutionary breakthroughs in natural language processing have sparked a new era of their applications in drug discovery. Herein, we introduce the latest applications of ML in drug discovery, highlight the potential of advanced Transformer-based ML models, and discuss the future prospects and challenges in the field.
Affiliation(s)
- Xin Qi
  - School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China
- Yuanchun Zhao
  - School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China
- Zhuang Qi
  - School of Software, Shandong University, Jinan 250101, China
- Siyu Hou
  - School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China
- Jiajia Chen
  - School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou 215011, China
4. Ghosh S, Mitra P. MaTPIP: A deep-learning architecture with eXplainable AI for sequence-driven, feature mixed protein-protein interaction prediction. Comput Methods Programs Biomed 2024; 244:107955. [PMID: 38064959 DOI: 10.1016/j.cmpb.2023.107955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 11/09/2023] [Accepted: 11/26/2023] [Indexed: 01/26/2024]
Abstract
BACKGROUND AND OBJECTIVE: Protein-protein interaction (PPI) is a vital process in all living cells, controlling essential cell functions such as cell cycle regulation, signal transduction, and metabolic processes with broad applications that include antibody therapeutics, vaccines, and drug discovery. The problem of sequence-based PPI prediction has been a long-standing issue in computational biology.
METHODS: We introduce MaTPIP, a cutting-edge deep-learning framework for predicting PPI. MaTPIP stands out due to its innovative design, fusing pre-trained Protein Language Model (PLM)-based features with manually curated protein sequence attributes, emphasizing the part-whole relationship by incorporating two-dimensional granular part (amino-acid) level features and one-dimensional whole-level (protein) features. What sets MaTPIP apart is its ability to integrate these features across three different input terminals seamlessly. MaTPIP also includes a distinctive configuration of Convolutional Neural Network (CNN) with Transformer components for concurrent utilization of CNN and sequential characteristics in each iteration and a one-dimensional to two-dimensional converter followed by a unified embedding. The statistical significance of this classifier is validated using McNemar's test.
RESULTS: MaTPIP outperformed the existing methods on both the Human PPI benchmark and cross-species PPI testing datasets, demonstrating its immense generalization capability for PPI prediction. We used seven diverse datasets with varying PPI target class distributions. Notably, within the novel PPI scenario, the most challenging category for Human PPI Benchmark, MaTPIP improves the existing state-of-the-art score from 74.1% to 78.6% (measured in Area under ROC Curve), from 23.2% to 32.8% (in average precision) and from 4.9% to 9.5% (in precision at 3% recall) for 50%, 10% and 0.3% target class distributions, respectively. In cross-species PPI evaluation, hybrid MaTPIP establishes a new benchmark score (measured in Area Under precision-recall curve) of 81.1% from the previous 60.9% for Mouse, 80.9% from 56.2% for Fly, 78.1% from 55.9% for Worm, 59.9% from 41.7% for Yeast, and 66.2% from 58.8% for E. coli. Our eXplainable AI-based assessment reveals an average contribution of different feature families per prediction on these datasets.
CONCLUSIONS: MaTPIP mixes manually curated features with the feature extracted from the pre-trained PLM to predict sequence-based protein-protein association. Furthermore, MaTPIP demonstrates strong generalization capabilities for cross-species PPI predictions.
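The abstract states that the classifier's statistical significance is validated with McNemar's test but gives no details. As a generic illustration, the sketch below implements the standard chi-square form of the test, with an optional continuity correction, for comparing two classifiers on the same test set. The function name `mcnemar_test` and the disagreement counts are hypothetical and do not come from the paper.

```python
import math

def mcnemar_test(b, c, continuity=True):
    """McNemar's chi-square test on the discordant pairs of two classifiers.

    b: samples that only classifier A labelled correctly.
    c: samples that only classifier B labelled correctly.
    Returns (statistic, p_value) using a chi-square distribution with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0  # the classifiers never disagree, so there is no evidence of a difference
    gap = abs(b - c) - 1 if continuity else abs(b - c)
    gap = max(gap, 0)
    statistic = gap * gap / (b + c)
    # Survival function of chi-square with 1 degree of freedom: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical counts for illustration only.
stat, p = mcnemar_test(b=40, c=15)
print(f"statistic={stat:.3f}, p={p:.4f}")
```

Only the discordant pairs enter the statistic. Samples that both classifiers get right, or both get wrong, carry no information about which of the two is better.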
Affiliation(s)
- Shubhrangshu Ghosh
  - Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal, India
  - TCS Research, Tata Consultancy Services Limited, Kolkata, West Bengal, India
- Pralay Mitra
  - Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal, India
5. Sun DZ, Sun ZL, Liu M, Yong SH. LPI-SKMSC: Predicting LncRNA-Protein Interactions with Segmented k-mer Frequencies and Multi-space Clustering. Interdiscip Sci 2024:10.1007/s12539-023-00598-4. [PMID: 38206558 DOI: 10.1007/s12539-023-00598-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2023] [Revised: 11/25/2023] [Accepted: 12/05/2023] [Indexed: 01/12/2024]
Abstract
Long noncoding RNAs (lncRNAs) have significant regulatory roles in gene expression. Interactions with proteins are one of the ways lncRNAs play their roles. Since experiments to determine lncRNA-protein interactions (LPIs) are expensive and time-consuming, many computational methods for predicting LPIs have been proposed as alternatives. In the LPIs prediction problem, there commonly exists the imbalance in the distribution of positive and negative samples. However, there are few existing methods that give specific consideration to this problem. In this paper, we proposed a new clustering-based LPIs prediction method using segmented k-mer frequencies and multi-space clustering (LPI-SKMSC). It was dedicated to handling the imbalance of positive and negative samples. We constructed segmented k-mer frequencies to obtain global and local features of lncRNA and protein sequences. Then, the multi-space clustering was applied to LPI-SKMSC. The convolutional neural network (CNN)-based encoders were used to map different features of a sample to different spaces. It used multiple spaces to jointly constrain the classification of samples. Finally, the distances between the output features of the encoder and the cluster center in each space were calculated. The sum of distances in all spaces was compared with the cluster radius to predict the LPIs. We performed cross-validation on 3 public datasets and LPI-SKMSC showed the best performance compared to other existing methods. Experimental results showed that LPI-SKMSC could predict LPIs more effectively when faced with imbalanced positive and negative samples. In addition, we illustrated that our model was better at uncovering potential lncRNA-protein interaction pairs.
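The abstract does not say how the segmented k-mer frequencies are built, so the following is only one plausible reading. It assumes that global features are k-mer frequencies over the whole sequence and local features are the same frequencies computed on equal-length segments. It also assumes that a sample whose summed distance to the cluster centers falls within the radius is predicted as interacting. All function names, the toy sequence, and the parameter values are hypothetical.

```python
from itertools import product

def kmer_frequencies(seq, k, alphabet):
    """Normalized frequency of every possible k-mer over `alphabet` in `seq`."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing symbols outside the alphabet
            counts[kmer] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]

def segmented_kmer_features(seq, k, n_segments, alphabet):
    """Global k-mer frequencies followed by per-segment (local) frequencies."""
    features = kmer_frequencies(seq, k, alphabet)  # global view of the sequence
    bounds = [i * len(seq) // n_segments for i in range(n_segments + 1)]
    for start, end in zip(bounds, bounds[1:]):
        features.extend(kmer_frequencies(seq[start:end], k, alphabet))  # local view
    return features

def predicts_interaction(distances, radius):
    """Assumed decision rule: summed distance over all spaces within the cluster radius."""
    return sum(distances) <= radius

# Hypothetical toy RNA sequence, not drawn from any dataset in the paper.
rna = "AUGGCUACGUAGCUAGGCAU"
features = segmented_kmer_features(rna, k=2, n_segments=2, alphabet="ACGU")
print(len(features))
```

With k=2 over a four-letter alphabet there are 16 possible k-mers, so one global block plus two segment blocks gives a 48-dimensional vector. In the method described above, such vectors would then be passed to the CNN-based encoders.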
Affiliation(s)
- Dian-Zheng Sun
  - School of Electrical Engineering and Automation, Anhui University, Hefei, 230601, China
- Zhan-Li Sun
  - School of Electrical Engineering and Automation, Anhui University, Hefei, 230601, China
- Mengya Liu
  - School of Computer Science and Technology, Anhui University, Hefei, 230601, China
- Shuang-Hao Yong
  - School of Electrical Engineering and Automation, Anhui University, Hefei, 230601, China
6. Rehana H, Çam NB, Basmaci M, Zheng J, Jemiyo C, He Y, Özgür A, Hur J. Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text. ArXiv 2023:arXiv:2303.17728v2. [PMID: 38764593 PMCID: PMC11101131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2024]
Abstract
Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. However, with the fast-paced growth of biomedical literature, there is a growing need for automated and accurate extraction of PPIs to facilitate scientific knowledge discovery. Pre-trained language models, such as generative pre-trained transformers (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks. We evaluated the performance of PPI identification of multiple GPT and BERT models using three manually curated gold-standard corpora: Learning Language in Logic (LLL) with 164 PPIs in 77 sentences, Human Protein Reference Database with 163 PPIs in 145 sentences, and Interaction Extraction Performance Assessment with 335 PPIs in 486 sentences. BERT-based models achieved the best overall performance, with BioBERT achieving the highest recall (91.95%) and F1-score (86.84%) and PubMedBERT achieving the highest precision (85.25%). Interestingly, despite not being explicitly trained for biomedical texts, GPT-4 achieved commendable performance, comparable to the top-performing BERT models. It achieved a precision of 88.37%, a recall of 85.14%, and an F1-score of 86.49% on the LLL dataset. These results suggest that GPT models can effectively detect PPIs from text data, offering promising avenues for application in biomedical literature mining. Further research could explore how these models might be fine-tuned for even more specialized tasks within the biomedical domain.
Affiliation(s)
- Hasin Rehana
  - Computer Science Graduate Program, University of North Dakota, Grand Forks, North Dakota, 58202, USA
- Nur Bengisu Çam
  - Department of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey
- Mert Basmaci
  - Department of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey
- Jie Zheng
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, 48109, USA
- Christianah Jemiyo
  - Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, North Dakota, 58202, USA
- Yongqun He
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, 48109, USA
  - Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, 48109, USA
- Arzucan Özgür
  - Department of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey
- Junguk Hur
  - Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, North Dakota, 58202, USA
7. Braun M, Gruber CC, Krassnigg A, Kummer A, Lutz S, Oberdorfer G, Siirola E, Snajdrova R. Accelerating Biocatalysis Discovery with Machine Learning: A Paradigm Shift in Enzyme Engineering, Discovery, and Design. ACS Catal 2023; 13:14454-14469. [PMID: 37942268 PMCID: PMC10629211 DOI: 10.1021/acscatal.3c03417] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 07/25/2023] [Revised: 09/29/2023] [Accepted: 10/03/2023] [Indexed: 11/10/2023]
Abstract
Emerging computational tools promise to revolutionize protein engineering for biocatalytic applications and accelerate the development timelines previously needed to optimize an enzyme to its more efficient variant. For over a decade, the benefits of predictive algorithms have helped scientists and engineers navigate the complexity of functional protein sequence space. More recently, spurred by dramatic advances in underlying computational tools, the promise of faster, cheaper, and more accurate enzyme identification, characterization, and engineering has catapulted terms such as artificial intelligence and machine learning to the must-have vocabulary in the field. This Perspective aims to showcase the current status of applications in pharmaceutical industry and also to discuss and celebrate the innovative approaches in protein science by highlighting their potential in selected recent developments and offering thoughts on future opportunities for biocatalysis. It also critically assesses the technology's limitations, unanswered questions, and unmet challenges.
Affiliation(s)
- Markus Braun
  - Department of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010 Graz, Austria
- Christian C Gruber
  - Enzyme and Drug Discovery, Innophore, 1700 Montgomery Street, San Francisco, California 94111, United States
- Andreas Krassnigg
  - Enzyme and Drug Discovery, Innophore, 1700 Montgomery Street, San Francisco, California 94111, United States
- Arkadij Kummer
  - Moderna, Inc., 200 Technology Square, Cambridge, Massachusetts 02139, United States
- Stefan Lutz
  - Codexis Inc., 200 Penobscot Drive, Redwood City, California 94063, United States
- Gustav Oberdorfer
  - Department of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010 Graz, Austria
- Elina Siirola
  - Novartis Institute for Biomedical Research, Global Discovery Chemistry, Basel CH-4108, Switzerland
- Radka Snajdrova
  - Novartis Institute for Biomedical Research, Global Discovery Chemistry, Basel CH-4108, Switzerland
8. Lee M. Recent Advances in Deep Learning for Protein-Protein Interaction Analysis: A Comprehensive Review. Molecules 2023; 28:5169. [PMID: 37446831 DOI: 10.3390/molecules28135169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Revised: 06/30/2023] [Accepted: 06/30/2023] [Indexed: 07/15/2023] Open
Abstract
Deep learning, a potent branch of artificial intelligence, is steadily leaving its transformative imprint across multiple disciplines. Within computational biology, it is expediting progress in the understanding of Protein-Protein Interactions (PPIs), key components governing a wide array of biological functionalities. Hence, an in-depth exploration of PPIs is crucial for decoding the intricate biological system dynamics and unveiling potential avenues for therapeutic interventions. As the deployment of deep learning techniques in PPI analysis proliferates at an accelerated pace, there exists an immediate demand for an exhaustive review that encapsulates and critically assesses these novel developments. Addressing this requirement, this review offers a detailed analysis of the literature from 2021 to 2023, highlighting the cutting-edge deep learning methodologies harnessed for PPI analysis. Thus, this review stands as a crucial reference for researchers in the discipline, presenting an overview of the recent studies in the field. This consolidation helps elucidate the dynamic paradigm of PPI analysis, the evolution of deep learning techniques, and their interdependent dynamics. This scrutiny is expected to serve as a vital aid for researchers, both well-established and newcomers, assisting them in maneuvering the rapidly shifting terrain of deep learning applications in PPI analysis.
Affiliation(s)
- Minhyeok Lee
  - School of Electrical and Electronics Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
9. Baquero F, Martínez JL, Sánchez A, Fernández-de-Bobadilla MD, San-Millán A, Rodríguez-Beltrán J. Bacterial Subcellular Architecture, Structural Epistasis, and Antibiotic Resistance. Biology 2023; 12:640. [PMID: 37237454 DOI: 10.3390/biology12050640] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/16/2023] [Revised: 04/08/2023] [Accepted: 04/20/2023] [Indexed: 05/28/2023]
Abstract
Epistasis refers to the way in which genetic interactions between some genetic loci affect phenotypes and fitness. In this study, we propose the concept of "structural epistasis" to emphasize the role of the variable physical interactions between molecules located in particular spaces inside the bacterial cell in the emergence of novel phenotypes. The architecture of the bacterial cell (typically Gram-negative), which consists of concentrical layers of membranes, particles, and molecules with differing configurations and densities (from the outer membrane to the nucleoid) determines and is in turn determined by the cell shape and size, depending on the growth phases, exposure to toxic conditions, stress responses, and the bacterial environment. Antibiotics change the bacterial cell's internal molecular topology, producing unexpected interactions among molecules. In contrast, changes in shape and size may alter antibiotic action. The mechanisms of antibiotic resistance (and their vectors, as mobile genetic elements) also influence molecular connectivity in the bacterial cell and can produce unexpected phenotypes, influencing the action of other antimicrobial agents.
Affiliation(s)
- Fernando Baquero
  - Department of Microbiology, Ramón y Cajal University Hospital, Ramón y Cajal Institute for Health Research (IRYCIS), 28034 Madrid, Spain
  - CIBER en Epidemiología y Salud Pública (CIBERESP), 28034 Madrid, Spain
- Alvaro Sánchez
  - Centro Nacional de Biotecnología, CSIC, 28049 Madrid, Spain
- Miguel D Fernández-de-Bobadilla
  - Department of Microbiology, Ramón y Cajal University Hospital, Ramón y Cajal Institute for Health Research (IRYCIS), 28034 Madrid, Spain
  - CIBER en Enfermedades Infecciosas (CIBERINFECT), 28034 Madrid, Spain
- Alvaro San-Millán
  - Centro Nacional de Biotecnología, CSIC, 28049 Madrid, Spain
  - CIBER en Enfermedades Infecciosas (CIBERINFECT), 28034 Madrid, Spain
- Jerónimo Rodríguez-Beltrán
  - Department of Microbiology, Ramón y Cajal University Hospital, Ramón y Cajal Institute for Health Research (IRYCIS), 28034 Madrid, Spain
  - CIBER en Enfermedades Infecciosas (CIBERINFECT), 28034 Madrid, Spain