1
|
Newaz K, Schaefers C, Weisel K, Baumbach J, Frishman D. Prognostic importance of splicing-triggered aberrations of protein complex interfaces in cancer. NAR Genom Bioinform 2024; 6:lqae133. [PMID: 39328266 PMCID: PMC11426328 DOI: 10.1093/nargab/lqae133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2024] [Revised: 08/30/2024] [Accepted: 09/13/2024] [Indexed: 09/28/2024] Open
Abstract
Aberrant alternative splicing (AS) is a prominent hallmark of cancer. AS can perturb protein-protein interactions (PPIs) by adding or removing interface regions encoded by individual exons. Identifying prognostic exon-exon interactions (EEIs) from PPI interfaces can help discover AS-affected cancer-driving PPIs that can serve as potential drug targets. Here, we assessed the prognostic significance of EEIs across 15 cancer types by integrating RNA-seq data with three-dimensional (3D) structures of protein complexes. By analyzing the resulting EEI network we identified patient-specific perturbed EEIs (i.e., EEIs present in healthy samples but absent from the paired cancer samples or vice versa) that were significantly associated with survival. We provide the first evidence that EEIs can be used as prognostic biomarkers for cancer patient survival. Our findings provide mechanistic insights into AS-affected PPI interfaces. Given the ongoing expansion of available RNA-seq data and the number of 3D structurally-resolved (or confidently predicted) protein complexes, our computational framework will help accelerate the discovery of clinically important cancer-promoting AS events.
Collapse
Affiliation(s)
- Khalique Newaz
- Institute for Computational Systems Biology and Center for Data and Computing in Natural Sciences, Universität Hamburg, 22761 Hamburg, Germany
| | - Christoph Schaefers
- Department of Oncology, Hematology and Bone Marrow Transplantation with Division of Pneumology, Universitätsklinikum Hamburg-Eppendorf, 20251 Hamburg, Germany
| | - Katja Weisel
- Department of Oncology, Hematology and Bone Marrow Transplantation with Division of Pneumology, Universitätsklinikum Hamburg-Eppendorf, 20251 Hamburg, Germany
| | - Jan Baumbach
- Institute for Computational Systems Biology and Center for Data and Computing in Natural Sciences, Universität Hamburg, 22761 Hamburg, Germany
- Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
| | - Dmitrij Frishman
- Department of Bioinformatics, School of Life Sciences, Technical University of Munich, 85354 Freising, Germany
| |
Collapse
|
2
|
Zitnik M, Li MM, Wells A, Glass K, Morselli Gysi D, Krishnan A, Murali TM, Radivojac P, Roy S, Baudot A, Bozdag S, Chen DZ, Cowen L, Devkota K, Gitter A, Gosline SJC, Gu P, Guzzi PH, Huang H, Jiang M, Kesimoglu ZN, Koyuturk M, Ma J, Pico AR, Pržulj N, Przytycka TM, Raphael BJ, Ritz A, Sharan R, Shen Y, Singh M, Slonim DK, Tong H, Yang XH, Yoon BJ, Yu H, Milenković T. Current and future directions in network biology. BIOINFORMATICS ADVANCES 2024; 4:vbae099. [PMID: 39143982 PMCID: PMC11321866 DOI: 10.1093/bioadv/vbae099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/27/2023] [Revised: 05/31/2024] [Accepted: 07/08/2024] [Indexed: 08/16/2024]
Abstract
Summary Network biology is an interdisciplinary field bridging computational and biological sciences that has proved pivotal in advancing the understanding of cellular functions and diseases across biological systems and scales. Although the field has been around for two decades, it remains nascent. It has witnessed rapid evolution, accompanied by emerging challenges. These stem from various factors, notably the growing complexity and volume of data together with the increased diversity of data types describing different tiers of biological organization. We discuss prevailing research directions in network biology, focusing on molecular/cellular networks but also on other biological network types such as biomedical knowledge graphs, patient similarity networks, brain networks, and social/contact networks relevant to disease spread. In more detail, we highlight areas of inference and comparison of biological networks, multimodal data integration and heterogeneous networks, higher-order network analysis, machine learning on networks, and network-based personalized medicine. Following the overview of recent breakthroughs across these five areas, we offer a perspective on future directions of network biology. Additionally, we discuss scientific communities, educational initiatives, and the importance of fostering diversity within the field. This article establishes a roadmap for an immediate and long-term vision for network biology. Availability and implementation Not applicable.
Collapse
Affiliation(s)
- Marinka Zitnik
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
| | - Michelle M Li
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, United States
| | - Aydin Wells
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
- Lucy Family Institute for Data and Society, University of Notre Dame, Notre Dame, IN 46556, United States
- Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, United States
| | - Kimberly Glass
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, United States
| | - Deisy Morselli Gysi
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, United States
- Department of Statistics, Federal University of Paraná, Curitiba, Paraná 81530-015, Brazil
- Department of Physics, Northeastern University, Boston, MA 02115, United States
| | - Arjun Krishnan
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, United States
| | - T M Murali
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States
| | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, United States
| | - Sushmita Roy
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53715, United States
- Wisconsin Institute for Discovery, Madison, WI 53715, United States
| | - Anaïs Baudot
- Aix Marseille Université, INSERM, MMG, Marseille, France
| | - Serdar Bozdag
- Department of Computer Science and Engineering, University of North Texas, Denton, TX 76203, United States
- Department of Mathematics, University of North Texas, Denton, TX 76203, United States
| | - Danny Z Chen
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
| | - Lenore Cowen
- Department of Computer Science, Tufts University, Medford, MA 02155, United States
| | - Kapil Devkota
- Department of Computer Science, Tufts University, Medford, MA 02155, United States
| | - Anthony Gitter
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53715, United States
- Morgridge Institute for Research, Madison, WI 53715, United States
| | - Sara J C Gosline
- Biological Sciences Division, Pacific Northwest National Laboratory, Seattle, WA 98109, United States
| | - Pengfei Gu
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
| | - Pietro H Guzzi
- Department of Medical and Surgical Sciences, University Magna Graecia of Catanzaro, Catanzaro, 88100, Italy
| | - Heng Huang
- Department of Computer Science, University of Maryland College Park, College Park, MD 20742, United States
| | - Meng Jiang
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
| | - Ziynet Nesibe Kesimoglu
- Department of Computer Science and Engineering, University of North Texas, Denton, TX 76203, United States
- National Center of Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20814, United States
| | - Mehmet Koyuturk
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, OH 44106, United States
| | - Jian Ma
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, United States
| | - Alexander R Pico
- Institute of Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA 94158, United States
| | - Nataša Pržulj
- Department of Computer Science, University College London, London, WC1E 6BT, England
- ICREA, Catalan Institution for Research and Advanced Studies, Barcelona, 08010, Spain
- Barcelona Supercomputing Center (BSC), Barcelona, 08034, Spain
| | - Teresa M Przytycka
- National Center of Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20814, United States
| | - Benjamin J Raphael
- Department of Computer Science, Princeton University, Princeton, NJ 08544, United States
| | - Anna Ritz
- Department of Biology, Reed College, Portland, OR 97202, United States
| | - Roded Sharan
- School of Computer Science, Tel Aviv University, Tel Aviv, 69978, Israel
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
| | - Mona Singh
- Department of Computer Science, Princeton University, Princeton, NJ 08544, United States
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, United States
| | - Donna K Slonim
- Department of Computer Science, Tufts University, Medford, MA 02155, United States
| | - Hanghang Tong
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
| | - Xinan Holly Yang
- Department of Pediatrics, University of Chicago, Chicago, IL 60637, United States
| | - Byung-Jun Yoon
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
- Computational Science Initiative, Brookhaven National Laboratory, Upton, NY 11973, United States
| | - Haiyuan Yu
- Department of Computational Biology, Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, NY 14853, United States
| | - Tijana Milenković
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, United States
- Lucy Family Institute for Data and Society, University of Notre Dame, Notre Dame, IN 46556, United States
- Eck Institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, United States
| |
Collapse
|
3
|
Hamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, Strauss CEM, Leman JK, Cho K, Bonneau R. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol 2024; 42:975-985. [PMID: 37679542 PMCID: PMC11180608 DOI: 10.1038/s41587-023-01917-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2022] [Accepted: 07/26/2023] [Indexed: 09/09/2023]
Abstract
Exploiting sequence-structure-function relationships in biotechnology requires improved methods for aligning proteins that have low sequence similarity to previously annotated proteins. We develop two deep learning methods to address this gap, TM-Vec and DeepBLAST. TM-Vec allows searching for structure-structure similarities in large sequence databases. It is trained to accurately predict TM-scores as a metric of structural similarity directly from sequence pairs without the need for intermediate computation or solution of structures. Once structurally similar proteins have been identified, DeepBLAST can structurally align proteins using only sequence information by identifying structurally homologous regions between proteins. It outperforms traditional sequence alignment methods and performs similarly to structure-based alignment methods. We show the merits of TM-Vec and DeepBLAST on a variety of datasets, including better identification of remotely homologous proteins compared with state-of-the-art sequence alignment and structure prediction methods.
Collapse
Grants
- R35GM122515 National Science Foundation (NSF)
- IOS-1546218 National Science Foundation (NSF)
- R35 GM122515 NIGMS NIH HHS
- R01 DK103358 NIDDK NIH HHS
- CBET- 1728858 National Science Foundation (NSF)
- R01 AI130945 NIAID NIH HHS
- This research was supported by NIH R01DK103358, the Simons Foundation, NSF- IOS-1546218, R35GM122515, NSF CBET- 1728858, NIH R01AI130945, to T.H. This research was supported by the intramural research program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) to J.T.M. This research was supported by the Flatiron Institute as part of the Simons Foundation to Robert Blackwell, J.K.L., and N.C. This research was supported by Los Alamos National Lab to C.S. This research was supported by the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI), Samsung Research (Improving Deep Learning using Latent Structure), and NSF Award 1922658 to K.C.
- Simons Foundation
- U.S. Department of Health & Human Services | NIH | Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD)
Collapse
Affiliation(s)
- Tymor Hamamsy
- Center for Data Science, New York University, New York, NY, USA
| | - James T Morton
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
| | - Robert Blackwell
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Daniel Berenberg
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
- Prescient Design, New York, NY, USA
| | - Nicholas Carriero
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
| | | | | | - Julia Koehler Leman
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Kyunghyun Cho
- Center for Data Science, New York University, New York, NY, USA.
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.
- Prescient Design, New York, NY, USA.
- CIFAR, Toronto, Ontario, Canada.
| | - Richard Bonneau
- Center for Data Science, New York University, New York, NY, USA.
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.
- Prescient Design, New York, NY, USA.
- Department of Biology, New York University, New York, NY, USA.
| |
Collapse
|
4
|
Huang J, Li J, Chen Q, Wang X, Chen G, Tang J. Freeprotmap: waiting-free prediction method for protein distance map. BMC Bioinformatics 2024; 25:176. [PMID: 38704533 PMCID: PMC11069170 DOI: 10.1186/s12859-024-05771-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2023] [Accepted: 04/09/2024] [Indexed: 05/06/2024] Open
Abstract
BACKGROUND Protein residue-residue distance maps are used for remote homology detection, protein information estimation, and protein structure research. However, existing prediction approaches are time-consuming, and hundreds of millions of proteins are discovered each year, necessitating the development of a rapid and reliable prediction method for protein residue-residue distances. Moreover, because many proteins lack known homologous sequences, a waiting-free and alignment-free deep learning method is needed. RESULT In this study, we propose a learning framework named FreeProtMap. In terms of protein representation processing, the proposed group pooling in FreeProtMap effectively mitigates issues arising from high-dimensional sparseness in protein representation. In terms of model structure, we have made several careful designs. Firstly, it is designed based on the locality of protein structures and triangular inequality distance constraints to improve prediction accuracy. Secondly, inference speed is improved by using additive attention and lightweight design. Besides, the generalization ability is improved by using bottlenecks and a neural network block named local microformer. As a result, FreeProtMap can predict protein residue-residue distances in tens of milliseconds and has higher precision than the best structure prediction method. CONCLUSION Several groups of comparative experiments and ablation experiments verify the effectiveness of the designs. The results demonstrate that FreeProtMap significantly outperforms other state-of-the-art methods in accurate protein residue-residue distance prediction, which is beneficial for lots of protein research works. It is worth mentioning that we could scan all proteins discovered each year based on FreeProtMap to find structurally similar proteins in a short time because the fact that the structure similarity calculation method based on distance maps is much less time-consuming than algorithms based on 3D structures.
Collapse
Affiliation(s)
- Jiajian Huang
- Zhejiang Lab, Zhejiang, China.
- Dalian University of Technology, Liaoning, China.
| | - Jinpeng Li
- Zhejiang Lab, Zhejiang, China
- The Chinese University of Hong Kong, Hong Kong, China
| | | | - Xia Wang
- Zhejiang Lab, Zhejiang, China.
- Dalian University of Technology, Liaoning, China.
| | | | | |
Collapse
|
5
|
Lecca P, Lecca M. Graph embedding and geometric deep learning relevance to network biology and structural chemistry. Front Artif Intell 2023; 6:1256352. [PMID: 38035201 PMCID: PMC10687447 DOI: 10.3389/frai.2023.1256352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/10/2023] [Accepted: 10/16/2023] [Indexed: 12/02/2023] Open
Abstract
Graphs are used as a model of complex relationships among data in biological science since the advent of systems biology in the early 2000. In particular, graph data analysis and graph data mining play an important role in biology interaction networks, where recent techniques of artificial intelligence, usually employed in other type of networks (e.g., social, citations, and trademark networks) aim to implement various data mining tasks including classification, clustering, recommendation, anomaly detection, and link prediction. The commitment and efforts of artificial intelligence research in network biology are motivated by the fact that machine learning techniques are often prohibitively computational demanding, low parallelizable, and ultimately inapplicable, since biological network of realistic size is a large system, which is characterised by a high density of interactions and often with a non-linear dynamics and a non-Euclidean latent geometry. Currently, graph embedding emerges as the new learning paradigm that shifts the tasks of building complex models for classification, clustering, and link prediction to learning an informative representation of the graph data in a vector space so that many graph mining and learning tasks can be more easily performed by employing efficient non-iterative traditional models (e.g., a linear support vector machine for the classification task). The great potential of graph embedding is the main reason of the flourishing of studies in this area and, in particular, the artificial intelligence learning techniques. In this mini review, we give a comprehensive summary of the main graph embedding algorithms in light of the recent burgeoning interest in geometric deep learning.
Collapse
Affiliation(s)
- Paola Lecca
- Faculty of Engineering, Free University of Bozen-Bolzano, Bolzano, Italy
| | - Michela Lecca
- Fondazione Bruno Kessler, Digital Industry Center, Technologies of Vision, Trento, Italy
| |
Collapse
|
6
|
Piccardi C. Metrics for network comparison using egonet feature distributions. Sci Rep 2023; 13:14657. [PMID: 37669967 PMCID: PMC10480166 DOI: 10.1038/s41598-023-40938-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2023] [Accepted: 08/18/2023] [Indexed: 09/07/2023] Open
Abstract
Identifying networks with similar characteristics in a given ensemble, or detecting pattern discontinuities in a temporal sequence of networks, are two examples of tasks that require an effective metric capable of quantifying network (dis)similarity. Here we propose a method based on a global portrait of graph properties built by processing local nodes features. More precisely, a set of dissimilarity measures is defined by elaborating the distributions, over the network, of a few egonet features, namely the degree, the clustering coefficient, and the egonet persistence. The method, which does not require the alignment of the two networks being compared, exploits the statistics of the three features to define one- or multi-dimensional distribution functions, which are then compared to define a distance between the networks. The effectiveness of the method is evaluated using a standard classification test, i.e., recognizing the graphs originating from the same synthetic model. Overall, the proposed distances have performances comparable to the best state-of-the-art techniques (graphlet-based methods) with similar computational requirements. Given its simplicity and flexibility, the method is proposed as a viable approach for network comparison tasks.
Collapse
Affiliation(s)
- Carlo Piccardi
- Department of Electronics, Information and Bioengineering, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133, Milan, Italy.
| |
Collapse
|
7
|
Koehler Leman J, Szczerbiak P, Renfrew PD, Gligorijevic V, Berenberg D, Vatanen T, Taylor BC, Chandler C, Janssen S, Pataki A, Carriero N, Fisk I, Xavier RJ, Knight R, Bonneau R, Kosciolek T. Sequence-structure-function relationships in the microbial protein universe. Nat Commun 2023; 14:2351. [PMID: 37100781 PMCID: PMC10133388 DOI: 10.1038/s41467-023-37896-w] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2022] [Accepted: 04/05/2023] [Indexed: 04/28/2023] Open
Abstract
For the past half-century, structural biologists relied on the notion that similar protein sequences give rise to similar structures and functions. While this assumption has driven research to explore certain parts of the protein universe, it disregards spaces that don't rely on this assumption. Here we explore areas of the protein universe where similar protein functions can be achieved by different sequences and different structures. We predict ~200,000 structures for diverse protein sequences from 1,003 representative genomes across the microbial tree of life and annotate them functionally on a per-residue basis. Structure prediction is accomplished using the World Community Grid, a large-scale citizen science initiative. The resulting database of structural models is complementary to the AlphaFold database, with regards to domains of life as well as sequence diversity and sequence length. We identify 148 novel folds and describe examples where we map specific functions to structural motifs. We also show that the structural space is continuous and largely saturated, highlighting the need for a shift in focus across all branches of biology, from obtaining structures to putting them into context and from sequence-based to sequence-structure-function based meta-omics analyses.
Collapse
Affiliation(s)
- Julia Koehler Leman
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA.
- Department of Biology, New York University, New York, NY, USA.
| | - Pawel Szczerbiak
- Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
| | - P Douglas Renfrew
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Department of Biology, New York University, New York, NY, USA
| | - Vladimir Gligorijevic
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
| | - Daniel Berenberg
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
- Center for Data Science, New York University, New York, NY, 10011, USA
- Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, NY, USA
| | - Tommi Vatanen
- Broad Institute, Cambridge, MA, USA
- Liggins Institute, University of Auckland, Auckland, New Zealand
- Research Program for Clinical and Molecular Metabolism, Faculty of Medicine, 00014 University of Helsinki, Helsinki, Finland
| | - Bryn C Taylor
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- In Silico Discovery and External Innovation, Janssen Research and Development, San Diego, CA, 92122, USA
| | - Chris Chandler
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Stefan Janssen
- Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, 92093, USA
- Algorithmic Bioinformatics, Justus Liebig University Giessen, Giessen, Germany
| | - Andras Pataki
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Nick Carriero
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Ian Fisk
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Ramnik J Xavier
- Broad Institute, Cambridge, MA, USA
- Center for Microbiome Informatics and Therapeutics, MIT, Cambridge, MA, 02139, USA
| | - Rob Knight
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, 92093, USA
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
- Department of Bioengineering, University of California, San Diego, USA
| | - Richard Bonneau
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Department of Biology, New York University, New York, NY, USA
- Center for Data Science, New York University, New York, NY, 10011, USA
- Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, New York, NY, USA
- Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
| | - Tomasz Kosciolek
- Malopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland.
| |
Collapse
|
8
|
Newaz K, Piland J, Clark PL, Emrich SJ, Li J, Milenković T. Multi-layer sequential network analysis improves protein 3D structural classification. Proteins 2022; 90:1721-1731. [PMID: 35441395 PMCID: PMC9356989 DOI: 10.1002/prot.26349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2021] [Revised: 03/04/2022] [Accepted: 03/30/2022] [Indexed: 11/08/2022]
Abstract
Protein structural classification (PSC) is a supervised problem of assigning proteins into pre-defined structural (e.g., CATH or SCOPe) classes based on the proteins' sequence or 3D structural features. We recently proposed PSC approaches that model protein 3D structures as protein structure networks (PSNs) and analyze PSN-based protein features, which performed better than or comparable to state-of-the-art sequence or other 3D structure-based PSC approaches. However, existing PSN-based PSC approaches model the whole 3D structure of a protein as a static (i.e., single-layer) PSN. Because folding of a protein is a dynamic process, where some parts (i.e., sub-structures) of a protein fold before others, modeling the 3D structure of a protein as a PSN that captures the sub-structures might further help improve the existing PSC performance. Here, we propose to model 3D structures of proteins as multi-layer sequential PSNs that approximate 3D sub-structures of proteins, with the hypothesis that this will improve upon the current state-of-the-art PSC approaches that are based on single-layer PSNs (and thus upon the existing state-of-the-art sequence and other 3D structural approaches). Indeed, we confirm this on 72 datasets spanning ~44 000 CATH and SCOPe protein domains.
Collapse
Affiliation(s)
- Khalique Newaz
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA,Center for Data and Computing in Natural Sciences (CDCS), Institute for Computational Systems Biology, Universität Hamburg, Hamburg, 20146, Germany
| | - Jacob Piland
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
| | - Patricia L. Clark
- Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, IN 46556, USA
| | - Scott J. Emrich
- Department of Electrical Engineering and Computer Science; University of Tennessee, Knoxville, TN 37996, USA
| | - Jun Li
- Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556, USA
| | - Tijana Milenković
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
| |
Collapse
|
9
|
Gu S, Jiang M, Guzzi PH, Milenković T. Modeling multi-scale data via a network of networks. Bioinformatics 2022; 38:2544-2553. [PMID: 35238343 PMCID: PMC9048659 DOI: 10.1093/bioinformatics/btac133] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2021] [Revised: 02/01/2022] [Accepted: 02/28/2022] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Prediction of node and graph labels are prominent network science tasks. Data analyzed in these tasks are sometimes related: entities represented by nodes in a higher-level (higher scale) network can themselves be modeled as networks at a lower level. We argue that systems involving such entities should be integrated with a 'network of networks' (NoNs) representation. Then, we ask whether entity label prediction using multi-level NoN data via our proposed approaches is more accurate than using each of single-level node and graph data alone, i.e. than traditional node label prediction on the higher-level network and graph label prediction on the lower-level networks. To obtain data, we develop the first synthetic NoN generator and construct a real biological NoN. We evaluate accuracy of considered approaches when predicting artificial labels from the synthetic NoNs and proteins' functions from the biological NoN. RESULTS For the synthetic NoNs, our NoN approaches outperform or are as good as node- and network-level ones depending on the NoN properties. For the biological NoN, our NoN approaches outperform the single-level approaches for just under half of the protein functions, and for 30% of the functions, only our NoN approaches make meaningful predictions, while node- and network-level ones achieve random accuracy. So, NoN-based data integration is important. AVAILABILITY AND IMPLEMENTATION The software and data are available at https://nd.edu/~cone/NoNs. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Shawn Gu
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
| | - Meng Jiang
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
| | - Pietro Hiram Guzzi
- Department of Surgical and Medical Sciences, University Magna Graecia of Catanzaro, Catanzaro 88100, Italy
| | - Tijana Milenković
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
| |
Collapse
|
10
|
Prabantu VM, Gadiyaram V, Vishveshwara S, Srinivasan N. Understanding structural variability in proteins using protein structural networks. Curr Res Struct Biol 2022; 4:134-145. [PMID: 35586857 PMCID: PMC9108755 DOI: 10.1016/j.crstbi.2022.04.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2021] [Revised: 03/01/2022] [Accepted: 04/09/2022] [Indexed: 11/13/2022] Open
Abstract
Proteins perform their function by accessing a suitable conformer from the ensemble of available conformations. The conformational diversity of a chosen protein structure can be obtained by experimental methods under different conditions. A key issue is the accurate comparison of different conformations. A gold standard used for such a comparison is the root mean square deviation (RMSD) between the two structures. While extensive refinements of RMSD evaluation at the backbone level are available, a comprehensive framework including the side chain interaction is not well understood. Here we employ protein structure network (PSN) formalism, with the non-covalent interactions of side chain, explicitly treated. The PSNs thus constructed are compared through graph spectral method, which provides a comparison at the local and at the global structural level. In this work, PSNs of multiple crystal conformers of single-chain, single-domain proteins, are subject to pair-wise analysis to examine the dissimilarity in their network topologies and in order to determine the conformational diversity of their native structures. This information is utilized to classify the structural domains of proteins into different categories. It is observed that proteins typically tend to retain structure and interactions at the backbone level. However, some of them also depict variability in either their overall structure or only in their inter-residue connectivity at the sidechain level, or both. Variability of sub-networks based on solvent accessibility and secondary structure is studied. The types of specific interactions are found to contribute differently to structure variability. An ensemble analysis by computing the mathematical variance of edge-weights across multiple conformers provided information on the contribution to overall variability from each edge of the PSN. Interactions that are highly variable are identified and their impact on structure variability has been discussed with the help of a case study. The classification based on the present side-chain network-based studies provides a framework to correlate the structure-function relationships in protein structures. Monomeric, single domain protein structures can exhibit non-rigid behaviour and be highly variable. The comparison of protein structural networks can better discriminate conformations with similar backbones. Specific interactions between solvent accessible and inaccessible residues are poorly preserved. Network edge-variation offers insights on which interacting residues are likely to influence their dynamics and function. These side-chain network-based studies provide a framework to correlate protein structure-function relationships.
Collapse
|
11
|
Brysbaert G, Lensink MF. Centrality Measures in Residue Interaction Networks to Highlight Amino Acids in Protein–Protein Binding. FRONTIERS IN BIOINFORMATICS 2021; 1:684970. [PMID: 36303777 PMCID: PMC9581030 DOI: 10.3389/fbinf.2021.684970] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/24/2021] [Accepted: 05/17/2021] [Indexed: 12/21/2022] Open
Abstract
Residue interaction networks (RINs) describe a protein structure as a network of interacting residues. Central nodes in these networks, identified by centrality analyses, highlight those residues that play a role in the structure and function of the protein. However, little is known about the capability of such analyses to identify residues involved in the formation of macromolecular complexes. Here, we performed six different centrality measures on the RINs generated from the complexes of the SKEMPI 2 database of changes in protein–protein binding upon mutation in order to evaluate the capability of each of these measures to identify major binding residues. The analyses were performed with and without the crystallographic water molecules, in addition to the protein residues. We also investigated the use of a weight factor based on the inter-residue distances to improve the detection of these residues. We show that for the identification of major binding residues, closeness, degree, and PageRank result in good precision, whereas betweenness, eigenvector, and residue centrality analyses give a higher sensitivity. Including water in the analysis improves the sensitivity of all measures without losing precision. Applying weights only slightly raises the sensitivity of eigenvector centrality analysis. We finally show that a combination of multiple centrality analyses is the optimal approach to identify residues that play a role in protein–protein interaction.
Collapse
|
12
|
Newaz K, Wright G, Piland J, Li J, Clark PL, Emrich SJ, Milenković T. Network analysis of synonymous codon usage. Bioinformatics 2020; 36:4876-4884. [PMID: 32609328 DOI: 10.1093/bioinformatics/btaa603] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2019] [Revised: 05/05/2020] [Accepted: 06/22/2020] [Indexed: 12/25/2022] Open
Abstract
MOTIVATION Most amino acids are encoded by multiple synonymous codons, some of which are used more rarely than others. Analyses of positions of such rare codons in protein sequences revealed that rare codons can impact co-translational protein folding and that positions of some rare codons are evolutionarily conserved. Analyses of their positions in protein 3-dimensional structures, which are richer in biochemical information than sequences alone, might further explain the role of rare codons in protein folding. RESULTS We model protein structures as networks and use network centrality to measure the structural position of an amino acid. We first validate that amino acids buried within the structural core are network-central, and those on the surface are not. Then, we study potential differences between network centralities and thus structural positions of amino acids encoded by conserved rare, non-conserved rare and commonly used codons. We find that in 84% of proteins, the three codon categories occupy significantly different structural positions. We examine protein groups showing different codon centrality trends, i.e. different relationships between structural positions of the three codon categories. We see several cases of all proteins from our data with some structural or functional property being in the same group. Also, we see a case of all proteins in some group having the same property. Our work shows that codon usage is linked to the final protein structure and thus possibly to co-translational protein folding. AVAILABILITY AND IMPLEMENTATION https://nd.edu/∼cone/CodonUsage/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Khalique Newaz
- Department of Computer Science and Engineering.,Center for Network and Data Science.,Eck institute for Global Health
| | - Gabriel Wright
- Department of Computer Science and Engineering.,Eck institute for Global Health
| | - Jacob Piland
- Department of Computer Science and Engineering.,Center for Network and Data Science.,Eck institute for Global Health
| | - Jun Li
- Department of Applied and Computational Mathematics and Statistics
| | - Patricia L Clark
- Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, IN 46556, USA
| | - Scott J Emrich
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996, USA
| | - Tijana Milenković
- Department of Computer Science and Engineering.,Center for Network and Data Science.,Eck institute for Global Health
| |
Collapse
|
13
|
Abstract
BACKGROUND Essential genes are those genes that are critical for the survival of an organism. The prediction of essential genes in bacteria can provide targets for the design of novel antibiotic compounds or antimicrobial strategies. RESULTS We propose a deep neural network for predicting essential genes in microbes. Our architecture called DEEPLYESSENTIAL makes minimal assumptions about the input data (i.e., it only uses gene primary sequence and the corresponding protein sequence) to carry out the prediction thus maximizing its practical application compared to existing predictors that require structural or topological features which might not be readily available. We also expose and study a hidden performance bias that effected previous classifiers. Extensive results show that DEEPLYESSENTIAL outperform existing classifiers that either employ down-sampling to balance the training set or use clustering to exclude multiple copies of orthologous genes. CONCLUSION Deep neural network architectures can efficiently predict whether a microbial gene is essential (or not) using only its sequence information.
Collapse
Affiliation(s)
- Md Abid Hasan
- Department of Computer Science and Engineering, University of California Riverside, 900 University Ave, Riverside, 92507 CA USA
| | - Stefano Lonardi
- Department of Computer Science and Engineering, University of California Riverside, 900 University Ave, Riverside, 92507 CA USA
| |
Collapse
|
14
|
Abstract
MOTIVATION The structure of chromatin impacts gene expression. Its alteration has been shown to coincide with the occurrence of cancer. A key challenge is in understanding the role of chromatin structure (CS) in cellular processes and its implications in diseases. RESULTS We propose a comparative pipeline to analyze CSs and apply it to study chronic lymphocytic leukemia (CLL). We model the chromatin of the affected and control cells as networks and analyze the network topology by state-of-the-art methods. Our results show that CSs are a rich source of new biological and functional information about DNA elements and cells that can complement protein-protein and co-expression data. Importantly, we show the existence of structural markers of cancer-related DNA elements in the chromatin. Surprisingly, CLL driver genes are characterized by specific local wiring patterns not only in the CS network of CLL cells, but also of healthy cells. This allows us to successfully predict new CLL-related DNA elements. Importantly, this shows that we can identify cancer-related DNA elements in other cancer types by investigating the CS network of the healthy cell of origin, a key new insight paving the road to new therapeutic strategies. This gives us an opportunity to exploit chromosome conformation data in healthy cells to predict new drivers. AVAILABILITY AND IMPLEMENTATION Our predicted CLL genes and RNAs are provided as a free resource to the community at https://life.bsc.es/iconbi/chromatin/index.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- N Malod-Dognin
- Department of Life Sciences, Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain
- Department of Computer Science, University College London, London WC1E 6BT, UK
| | - V Pancaldi
- Department of Life Sciences, Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain
- Centre de Recherches en Cancérologie de Toulouse (CRCT), Toulouse 31037, France
- University Paul Sabatier III, Toulouse 31330, France
| | - A Valencia
- Department of Life Sciences, Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain
- ICREA, Pg. Lluís Companys 23, Barcelona 08010, Spain
- Coordination Node, Spanish National Bioinformatics Institute, ELIXIR-Spain (INB, ELIXIR-ES), Madrid 28029, Spain
| | - N Pržulj
- Department of Life Sciences, Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain
- Department of Computer Science, University College London, London WC1E 6BT, UK
- ICREA, Pg. Lluís Companys 23, Barcelona 08010, Spain
| |
Collapse
|
15
|
Newaz K, Ghalehnovi M, Rahnama A, Antsaklis PJ, Milenković T. Network-based protein structural classification. ROYAL SOCIETY OPEN SCIENCE 2020; 7:191461. [PMID: 32742675 PMCID: PMC7353965 DOI: 10.1098/rsos.191461] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/23/2019] [Accepted: 05/05/2020] [Indexed: 06/11/2023]
Abstract
Experimental determination of protein function is resource-consuming. As an alternative, computational prediction of protein function has received attention. In this context, protein structural classification (PSC) can help, by allowing for determining structural classes of currently unclassified proteins based on their features, and then relying on the fact that proteins with similar structures have similar functions. Existing PSC approaches rely on sequence-based or direct three-dimensional (3D) structure-based protein features. By contrast, we first model 3D structures of proteins as protein structure networks (PSNs). Then, we use network-based features for PSC. We propose the use of graphlets, state-of-the-art features in many research areas of network science, in the task of PSC. Moreover, because graphlets can deal only with unweighted PSNs, and because accounting for edge weights when constructing PSNs could improve PSC accuracy, we also propose a deep learning framework that automatically learns network features from weighted PSNs. When evaluated on a large set of approximately 9400 CATH and approximately 12 800 SCOP protein domains (spanning 36 PSN sets), the best of our proposed approaches are superior to existing PSC approaches in terms of accuracy, with comparable running times. Our data and code are available at https://doi.org/10.5281/zenodo.3787922.
Collapse
Affiliation(s)
- Khalique Newaz
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
- Center for Network and Data Science, University of Notre Dame, Notre Dame, IN 46556, USA
- Eck institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, USA
| | - Mahboobeh Ghalehnovi
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
| | - Arash Rahnama
- Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
| | - Panos J. Antsaklis
- Department of Electrical Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
| | - Tijana Milenković
- Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
- Center for Network and Data Science, University of Notre Dame, Notre Dame, IN 46556, USA
- Eck institute for Global Health, University of Notre Dame, Notre Dame, IN 46556, USA
| |
Collapse
|
16
|
Quantitative description and classification of protein structures by a novel robust amino acid network: interaction selective network (ISN). Sci Rep 2019. [PMCID: PMC6853966 DOI: 10.1038/s41598-019-52766-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
To quantitatively categorize protein structures, we developed a quantitative coarse-grained model of protein structures with a novel amino acid network, the interaction selective network (ISN), characterized by the links based on interactions in both the main and side chains. We found that the ISN is a novel robust network model to show the higher classification probability in the plots of average vertex degree (k) versus average clustering coefficient (C), both of which are typical network parameters for protein structures, and successfully distinguished between “all-α” and “all-β” proteins. On the other hand, one of the typical conventional networks, the α-carbon network (CAN), was found to be less robust than the ISN, and another typical network, atomic distance network (ADN), failed to distinguish between these two protein structures. Considering that the links in the CAN and ADN are defined by the interactions only between the main chain atoms and by the distance of the closest atom pair between the two amino acid residues, respectively, we can conclude that reflecting structural information from both secondary and tertiary structures in the network parameters improves the quantitative evaluation and robustness in network models, resulting in a quantitative and more robust description of three-dimensional protein structures in the ISN.
Collapse
|
17
|
Tantardini M, Ieva F, Tajoli L, Piccardi C. Comparing methods for comparing networks. Sci Rep 2019; 9:17557. [PMID: 31772246 PMCID: PMC6879644 DOI: 10.1038/s41598-019-53708-y] [Citation(s) in RCA: 62] [Impact Index Per Article: 12.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2019] [Accepted: 10/25/2019] [Indexed: 11/17/2022] Open
Abstract
With the impressive growth of available data and the flexibility of network modelling, the problem of devising effective quantitative methods for the comparison of networks arises. Plenty of such methods have been designed to accomplish this task: most of them deal with undirected and unweighted networks only, but a few are capable of handling directed and/or weighted networks too, thus properly exploiting richer information. In this work, we contribute to the effort of comparing the different methods for comparing networks and providing a guide for the selection of an appropriate one. First, we review and classify a collection of network comparison methods, highlighting the criteria they are based on and their advantages and drawbacks. The set includes methods requiring known node-correspondence, such as DeltaCon and Cut Distance, as well as methods not requiring a priori known node-correspondence, such as alignment-based, graphlet-based, and spectral methods, and the recently proposed Portrait Divergence and NetLSD. We test the above methods on synthetic networks and we assess their usability and the meaningfulness of the results they provide. Finally, we apply the methods to two real-world datasets, the European Air Transportation Network and the FAO Trade Network, in order to discuss the results that can be drawn from this type of analysis.
Collapse
Affiliation(s)
| | - Francesca Ieva
- MOX - Modelling and Scientific Computing Lab, Department of Mathematics, Politecnico di Milano, Via Bonardi 9, 20133, Milano, Italy.,CADS - Center for Analysis, Decisions and Society, Human Technopole, 20157, Milano, Italy
| | - Lucia Tajoli
- Department of Management, Economics and Industrial Engineering, Politecnico di Milano, Via Lambruschini 4/b, 20156, Milano, Italy
| | - Carlo Piccardi
- Department of Electronics, Information and Bioengineering, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133, Milano, Italy.
| |
Collapse
|
18
|
Malod-Dognin N, Pržulj N. Functional geometry of protein interactomes. Bioinformatics 2019; 35:3727-3734. [PMID: 30821317 DOI: 10.1093/bioinformatics/btz146] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2018] [Revised: 01/15/2019] [Accepted: 02/25/2019] [Indexed: 12/14/2022] Open
Abstract
MOTIVATION Protein-protein interactions (PPIs) are usually modeled as networks. These networks have extensively been studied using graphlets, small induced subgraphs capturing the local wiring patterns around nodes in networks. They revealed that proteins involved in similar functions tend to be similarly wired. However, such simple models can only represent pairwise relationships and cannot fully capture the higher-order organization of protein interactomes, including protein complexes. RESULTS To model the multi-scale organization of these complex biological systems, we utilize simplicial complexes from computational geometry. The question is how to mine these new representations of protein interactomes to reveal additional biological information. To address this, we define simplets, a generalization of graphlets to simplicial complexes. By using simplets, we define a sensitive measure of similarity between simplicial complex representations that allows for clustering them according to their data types better than clustering them by using other state-of-the-art measures, e.g. spectral distance, or facet distribution distance. We model human and baker's yeast protein interactomes as simplicial complexes that capture PPIs and protein complexes as simplices. On these models, we show that our newly introduced simplet-based methods cluster proteins by function better than the clustering methods that use the standard PPI networks, uncovering the new underlying functional organization of the cell. We demonstrate the existence of the functional geometry in the protein interactome data and the superiority of our simplet-based methods to effectively mine for new biological information hidden in the complexity of the higher-order organization of protein interactomes. AVAILABILITY AND IMPLEMENTATION Codes and datasets are freely available at http://www0.cs.ucl.ac.uk/staff/natasa/Simplets/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Noël Malod-Dognin
- Department of Life Sciences, Barcelona Supercomputing Center, Barcelona, Spain
| | - Nataša Pržulj
- Department of Life Sciences, Barcelona Supercomputing Center, Barcelona, Spain.,ICREA, Pg. Lluís Companys 23, Barcelona, Spain
| |
Collapse
|
19
|
Zhao Y, Xue X, Xie X. An alignment-free measure based on physicochemical properties of amino acids for protein sequence comparison. Comput Biol Chem 2019; 80:10-15. [PMID: 30851619 DOI: 10.1016/j.compbiolchem.2019.01.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2018] [Revised: 12/30/2018] [Accepted: 01/17/2019] [Indexed: 01/21/2023]
Abstract
Sequence comparison is an important topic in bioinformatics. With the exponential increase of biological sequences, the traditional protein sequence comparison methods - the alignment methods become limited, so the alignment-free methods are widely proposed in the past two decades. In this paper, we considered not only the six typical physicochemical properties of amino acids, but also their frequency and positional distribution. A 51-dimensional vector was obtained to describe the protein sequence. We got a pairwise distance matrix by computing the standardized Euclidean distance, and discriminant analysis and phylogenetic analysis can be made. The results on the Influenza A virus and ND5 datasets indicate that our method is accurate and efficient for classifying proteins and inferring the phylogeny of species.
Collapse
Affiliation(s)
- Yunxiu Zhao
- College of Science, Northwest A&F University, Yangling, Shaanxi 712100, PR China
| | - Xiaolong Xue
- College of Science, Northwest A&F University, Yangling, Shaanxi 712100, PR China
| | - Xiaoli Xie
- College of Science, Northwest A&F University, Yangling, Shaanxi 712100, PR China.
| |
Collapse
|