1
Bovo S, Martelli PL, Di Lena P, Casadio R. NETGE-PLUS: Standard and Network-Based Gene Enrichment Analysis in Human and Model Organisms. J Proteome Res 2020; 19:2873-2878. [PMID: 31971806 DOI: 10.1021/acs.jproteome.9b00749] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Omics techniques provide a spectrum of information at the genomic level, whose analysis can characterize complex traits at a molecular level. The relationship among genotype and phenotype implies that from genome information the molecular pathways and biological processes underlying a given phenotype are discovered. In dealing with this problem, gene enrichment analysis has become the most widely adopted strategy. Here we present NETGE-PLUS, a Web server for standard and network-based functional interpretation of gene sets of human and of model organisms, including Sus scrofa, Saccharomyces cerevisiae, Escherichia coli, and Arabidopsis thaliana. NETGE-PLUS enables the functional enrichment of both simple and ranked lists of genes, introducing also the possibility of exploring relationships among KEGG pathways. A Web interface makes data retrieval complete and user-friendly. NETGE-PLUS is publicly available at http://net-ge2.biocomp.unibo.it.
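Standard gene enrichment analysis of the kind mentioned above is commonly carried out as a one-sided hypergeometric (over-representation) test. The following is a minimal sketch under that assumption; it is not taken from NETGE-PLUS, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(population_size, annotated_in_population,
                                sample_size, annotated_in_sample):
    """Return P(X >= annotated_in_sample) for a hypergeometric variable X.

    population_size: number of background genes (N)
    annotated_in_population: background genes carrying the term (K)
    sample_size: number of genes in the query list (n)
    annotated_in_sample: query genes carrying the term (k)
    """
    upper = min(annotated_in_population, sample_size)
    total = comb(population_size, sample_size)
    # Sum the probability mass of observing k, k+1, ..., upper annotated genes.
    tail = sum(
        comb(annotated_in_population, i)
        * comb(population_size - annotated_in_population, sample_size - i)
        for i in range(annotated_in_sample, upper + 1)
    )
    return tail / total


# Toy numbers (hypothetical): 10 background genes, 5 annotated,
# a query list of 5 genes, all 5 annotated.
p_value = hypergeom_enrichment_pvalue(10, 5, 5, 5)
```

In practice the resulting p-values are usually corrected for multiple testing across all terms examined; that step is omitted here.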
Affiliation(s)
- Samuele Bovo
- Biocomputing Group, Department of Pharmacy and Biotechnology (FABIT), University of Bologna, Via San Giacomo 9/2, 40126 Bologna, Italy; Department of Agricultural and Food Sciences (DISTAL), Division of Animal Sciences, University of Bologna, Viale Fanin 46, 40127 Bologna, Italy
- Pier Luigi Martelli
- Biocomputing Group, Department of Pharmacy and Biotechnology (FABIT), University of Bologna, Via San Giacomo 9/2, 40126 Bologna, Italy
- Pietro Di Lena
- Department of Computer Science and Engineering (DISI), University of Bologna, Mura Anteo Zamboni 7, 40126 Bologna, Italy
- Rita Casadio
- Biocomputing Group, Department of Pharmacy and Biotechnology (FABIT), University of Bologna, Via San Giacomo 9/2, 40126 Bologna, Italy; Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies (IBIOM), Italian National Research Council (CNR), 70126 Bari, Italy
2
Suzek BE, Wang Y, Huang H, McGarvey PB, Wu CH. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 2014; 31:926-32. [PMID: 25398609 PMCID: PMC4375400 DOI: 10.1093/bioinformatics/btu739] [Citation(s) in RCA: 982] [Impact Index Per Article: 98.2] [Indexed: 11/15/2022]
Abstract
MOTIVATION: UniRef databases provide full-scale clustering of UniProtKB sequences and are utilized for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. Our hypothesis is that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters. RESULTS: Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. Results show that UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50 followed by expansion of the hit lists with cluster members demonstrated advantages compared with searches against UniProtKB sequences; the searches are concise (∼7 times shorter hit list before expansion), faster (∼6 times) and more sensitive in detection of remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforce their reliability for use in functional annotation.
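The search strategy described in the results (search against cluster representatives, then expand the hit list with cluster members) can be illustrated with a small sketch. This is a toy illustration only: the cluster identifiers, member accessions, and the `expand_hits` helper are hypothetical and do not correspond to any UniRef or BLAST API.

```python
def expand_hits(representative_hits, cluster_members):
    """Expand a ranked list of cluster hits into a ranked list of member sequences.

    representative_hits: cluster identifiers in hit order (best first)
    cluster_members: mapping from cluster identifier to its member accessions
    """
    expanded = []
    seen = set()
    for cluster_id in representative_hits:
        # A cluster missing from the mapping is kept as a single-entry hit.
        for member in cluster_members.get(cluster_id, [cluster_id]):
            if member not in seen:  # keep each sequence once, in first-seen order
                seen.add(member)
                expanded.append(member)
    return expanded


# Hypothetical toy data: two clusters and a short hit list.
clusters = {
    "UniRef50_A": ["A", "A2", "A3"],
    "UniRef50_B": ["B", "B2"],
}
hits = ["UniRef50_B", "UniRef50_A"]
expanded = expand_hits(hits, clusters)
```

The hit list handled by the search step stays short (two entries here), while the expanded list recovers all five member sequences.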
Affiliation(s)
- Baris E Suzek
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA; Department of Computer Engineering, Muğla Sıtkı Koçman University, Muğla 48000, Turkey; Center for Bioinformatics and Computational Biology and Protein Information Resource, University of Delaware, Newark, DE 19711, USA; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; and Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
- Yuqi Wang
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA; Department of Computer Engineering, Muğla Sıtkı Koçman University, Muğla 48000, Turkey; Center for Bioinformatics and Computational Biology and Protein Information Resource, University of Delaware, Newark, DE 19711, USA; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; and Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
- Hongzhan Huang
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA; Department of Computer Engineering, Muğla Sıtkı Koçman University, Muğla 48000, Turkey; Center for Bioinformatics and Computational Biology and Protein Information Resource, University of Delaware, Newark, DE 19711, USA; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; and Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
- Peter B McGarvey
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA; Department of Computer Engineering, Muğla Sıtkı Koçman University, Muğla 48000, Turkey; Center for Bioinformatics and Computational Biology and Protein Information Resource, University of Delaware, Newark, DE 19711, USA; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; and Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
- Cathy H Wu
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA; Department of Computer Engineering, Muğla Sıtkı Koçman University, Muğla 48000, Turkey; Center for Bioinformatics and Computational Biology and Protein Information Resource, University of Delaware, Newark, DE 19711, USA; European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK; and Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
3
Nguyen NT, Zhang X, Wu C, Lange RA, Chilton RJ, Lindsey ML, Jin YF. Integrative computational and experimental approaches to establish a post-myocardial infarction knowledge map. PLoS Comput Biol 2014; 10:e1003472. [PMID: 24651374 PMCID: PMC3961365 DOI: 10.1371/journal.pcbi.1003472] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 07/26/2013] [Accepted: 01/02/2014] [Indexed: 01/04/2023]
Abstract
Vast research efforts have been devoted to providing clinical diagnostic markers of myocardial infarction (MI), leading to over one million abstracts associated with “MI” and “Cardiovascular Diseases” in PubMed. The accumulation of these research results makes them challenging to integrate and interpret. To address this problem and better understand how the left ventricle (LV) remodels post-MI at both the molecular and cellular levels, we propose here an integrative framework that couples computational methods and experimental data. We selected an initial set of MI-related proteins from published human studies and constructed an MI-specific protein-protein-interaction network (MIPIN). Structural and functional analysis of the MIPIN showed that the post-MI LV exhibited increased representation of proteins involved in transcriptional activity, inflammatory response, and extracellular matrix (ECM) remodeling. Known plasma or serum expression changes of the MIPIN proteins in patients with MI were acquired by data mining of PubMed and the UniProt knowledgebase, and served as a training set to predict unlabeled MIPIN protein changes post-MI. The predictions were validated with published results in PubMed, suggesting prognostic capability of the MIPIN. Further, we established the first knowledge map related to the post-MI response, providing a major step towards enhancing our understanding of molecular interactions specific to MI and linking the molecular interactions, cellular responses, and biological processes to quantify LV remodeling.
Heart attack, known medically as myocardial infarction, often occurs as a result of a partial shortage of blood supply to a portion of the heart, leading to the death of heart muscle cells. Following myocardial infarction, complications might arise, including arrhythmia, myocardial rupture, left ventricular dysfunction, and heart failure. Although myocardial infarction can be quickly diagnosed using a variety of tests, including blood tests and electrocardiography, there have been no prognostic tests available to predict the long-term outcome following myocardial infarction. Here, we present a framework to analyze how the left ventricle responds to myocardial infarction by combining the protein interactome and experimental results retrieved from published human studies. The framework organizes current understanding of molecular interactions specific to myocardial infarction, cellular responses, and biological processes to quantify the left ventricular remodeling process. Specifically, our knowledge map showed that transcriptional activity, inflammatory response, and extracellular matrix remodeling are the main functional themes post myocardial infarction. In addition, text analytics of relevant abstracts revealed differential protein expression in plasma or serum from patients with myocardial infarction. Using these data, we predicted expression levels of other proteins following myocardial infarction.
Affiliation(s)
- Nguyen T. Nguyen
- Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, Texas, United States of America
- San Antonio Cardiovascular Proteomics Center, University of Texas Health Science Center at San Antonio, San Antonio, Texas, United States of America
- Xiaolin Zhang
- Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, Texas, United States of America
- Cathy Wu
- Center for Bioinformatics and Computational Biology and Protein Information Resource, University of Delaware, Newark, Delaware, United States of America
- Richard A. Lange
- San Antonio Cardiovascular Proteomics Center, University of Texas Health Science Center at San Antonio, San Antonio, Texas, United States of America
- Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, Texas, United States of America
- Robert J. Chilton
- San Antonio Cardiovascular Proteomics Center, University of Texas Health Science Center at San Antonio, San Antonio, Texas, United States of America
- Department of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, Texas, United States of America
- Merry L. Lindsey
- San Antonio Cardiovascular Proteomics Center, University of Texas Health Science Center at San Antonio, San Antonio, Texas, United States of America
- Mississippi Center for Heart Research, University of Mississippi Medical Center, Jackson, Mississippi, United States of America
- Research Service, G.V. (Sonny) Montgomery Veterans Affairs Medical Center, Jackson, Mississippi, United States of America
- Yu-Fang Jin
- Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, Texas, United States of America
- San Antonio Cardiovascular Proteomics Center, University of Texas Health Science Center at San Antonio, San Antonio, Texas, United States of America
4
Higdon R, Haynes W, Stanberry L, Stewart E, Yandl G, Howard C, Broomall W, Kolker N, Kolker E. Unraveling the Complexities of Life Sciences Data. Big Data 2013; 1:42-50. [PMID: 27447037 DOI: 10.1089/big.2012.1505] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 06/06/2023]
Abstract
The life sciences have entered into the realm of big data and data-enabled science, where data can either empower or overwhelm. These data bring the challenges of the 5 Vs of big data: volume, veracity, velocity, variety, and value. Both independently and through our involvement with DELSA Global (Data-Enabled Life Sciences Alliance, DELSAglobal.org), the Kolker Lab (kolkerlab.org) is creating partnerships that identify data challenges and solve community needs. We specialize in solutions to complex biological data challenges, as exemplified by the community resource of MOPED (Model Organism Protein Expression Database, MOPED.proteinspire.org) and the analysis pipeline of SPIRE (Systematic Protein Investigative Research Environment, PROTEINSPIRE.org). Our collaborative work extends into the computationally intensive tasks of analysis and visualization of millions of protein sequences through innovative implementations of sequence alignment algorithms and creation of the Protein Sequence Universe tool (PSU). Pushing into the future together with our collaborators, our lab is pursuing integration of multi-omics data and exploration of biological pathways, as well as assigning function to proteins and porting solutions to the cloud. Big data have come to the life sciences; discovering the knowledge in the data will bring breakthroughs and benefits.
Affiliation(s)
- Roger Higdon
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Winston Haynes
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Larissa Stanberry
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Elizabeth Stewart
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Gregory Yandl
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Chris Howard
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- 5 Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- William Broomall
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Natali Kolker
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- Eugene Kolker
- 1 Bioinformatics and High-throughput Analysis Laboratory, Seattle Children's Research Institute, Seattle, Washington
- 2 High-throughput Analysis Core, Center for Developmental Therapeutics, Seattle Children's Research Institute, Seattle, Washington
- 3 Predictive Analytics, Seattle Children's, Seattle, Washington
- 4 Data-Enabled Life Sciences Alliance (DELSA Global), Seattle, Washington
- 6 Departments of Biomedical Informatics & Medical Education and Pediatrics, University of Washington, Seattle, Washington
5
Higdon R, Louie B, Kolker E. Modeling sequence and function similarity between proteins for protein functional annotation. Proceedings of the ... International Symposium on High Performance Distributed Computing 2010; 2010:499-502. [PMID: 25101328 DOI: 10.1145/1851476.1851548] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/19/2022]
Abstract
A common task in biological research is to predict function for proteins by comparing sequences between proteins of known and unknown function. This is often done using pair-wise sequence alignment algorithms (e.g. BLAST). A problem with this approach is the assumption of a simple equivalence between a minimum sequence similarity threshold and the function similarity between proteins. This assumption is based on the binary concept of homology, in that proteins either are or are not homologous. The relationship between sequence and function, however, is more complex, and it is pertinent for predicting protein function, e.g. when evaluating BLAST alignments or developing training sets for profile models based on functional rather than homologous groupings. Our motivation for this study was to model sequence and function similarity between proteins to gain insights into the "sequence-function similarity" relationship between proteins for predicting function. Using our model we found that function similarity generally increases with sequence similarity but with a high degree of variability. This result has implications for pair-wise approaches in that it appears sequence similarity must be very high to ensure high function similarity. Profile models, which enable higher sensitivity, are a potential solution. However, multiple sequence alignments (a necessary prerequisite) are a problem in that current algorithms have difficulty aligning sequences with very low sequence similarity, which is common in our data set, or are intractable for high numbers of sequences. Given the importance of predicting protein function and the need for multiple sequence alignments, algorithms for accomplishing this task should be further refined and developed.
Affiliation(s)
- Roger Higdon
- Seattle Children's Research Institute, 1900 Ninth Avenue, Seattle, WA 98101
- Brenton Louie
- Seattle Children's Research Institute, 1900 Ninth Avenue, Seattle, WA 98101
- Eugene Kolker
- Seattle Children's Research Institute, 1900 Ninth Avenue, Seattle, WA 98101