1
Chaudhari JK, Pant S, Jha R, Pathak RK, Singh DB. Biological big-data sources, problems of storage, computational issues, and applications: a comprehensive review. Knowl Inf Syst 2024; 66:3159-3209. [DOI: 10.1007/s10115-023-02049-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 09/12/2023] [Accepted: 12/11/2023] [Indexed: 01/03/2025]

2
Mohammed A, Biegert G, Adamec J, Helikar T. CancerDiscover: an integrative pipeline for cancer biomarker and cancer class prediction from high-throughput sequencing data. Oncotarget 2017; 9:2565-2573. [PMID: 29416792 PMCID: PMC5788660 DOI: 10.18632/oncotarget.23511] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 09/28/2017] [Accepted: 12/09/2017] [Indexed: 11/25/2022] Open
Abstract
Accurate identification of cancer biomarkers and classification of cancer type and subtype from High Throughput Sequencing (HTS) data is a challenging problem because it requires manual processing of raw HTS data from various sequencing platforms, quality control, and normalization, which are tedious and time-consuming. Machine learning techniques for cancer class prediction and biomarker discovery can hasten cancer detection and significantly improve prognosis. To date, great research efforts have been made toward cancer biomarker identification and cancer class prediction. However, currently available tools and pipelines lack flexibility in data preprocessing and in running multiple feature selection methods and learning algorithms; researchers therefore need a freely available and easy-to-use program. Here, we propose CancerDiscover, an integrative open-source software pipeline that allows users to automatically and efficiently process large raw high-throughput datasets, normalize them, and select the best-performing features from multiple feature selection algorithms. Additionally, the integrative pipeline lets users apply different feature thresholds to identify cancer biomarkers and build various training models to distinguish different types and subtypes of cancer. The open-source software is available at https://github.com/HelikarLab/CancerDiscover and is free for use under the GPL3 license.
Affiliation(s)
- Akram Mohammed
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America
- Greyson Biegert
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America
- Jiri Adamec
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America
- Tomáš Helikar
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, Nebraska, United States of America

3
GenExp: an interactive web-based genomic DAS client with client-side data rendering. PLoS One 2011; 6:e21270. [PMID: 21750706 PMCID: PMC3130039 DOI: 10.1371/journal.pone.0021270] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 08/10/2010] [Accepted: 05/27/2011] [Indexed: 02/01/2023] Open
Abstract
Background: The Distributed Annotation System (DAS) offers a standard protocol for sharing and integrating annotations on biological sequences. There are more than 1000 DAS sources available and the number is steadily increasing. Clients are an essential part of the DAS system and integrate data from several independent sources in order to create a useful representation for the user. While web-based DAS clients exist, most of them do not have direct interaction capabilities such as dragging and zooming with the mouse. Results: Here we present GenExp, a web-based and fully interactive visual DAS client. GenExp is a genome-oriented DAS client capable of creating informative representations of genomic data, zooming out from base level to complete chromosomes. It proposes a novel approach to genomic data rendering and uses the latest HTML5 web technologies to create the data representation inside the client browser. Thanks to client-side rendering, most position changes do not need a network request to the server, and so responses to zooming and panning are almost immediate. In GenExp it is possible to explore the genome intuitively, moving it with the mouse just as in geographical map applications. Additionally, in GenExp it is possible to have more than one data viewer at the same time and to save the current state of the application to revisit it later on. Conclusions: GenExp is a new interactive web-based client for DAS and addresses some of the shortcomings of the existing clients. It uses client-side data rendering techniques, resulting in easier genome browsing and exploration. GenExp is open source under the GPL license and is freely available at http://gralggen.lsi.upc.edu/recerca/genexp.

4
Sullivan DE, Gabbard JL, Shukla M, Sobral B. Data integration for dynamic and sustainable systems biology resources: challenges and lessons learned. Chem Biodivers 2010; 7:1124-41. [PMID: 20491070 DOI: 10.1002/cbdv.200900317] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.5] [Indexed: 11/11/2022]
Abstract
Systems-biology and infectious-disease (host-pathogen-environment) research and development is becoming increasingly dependent on integrating data from diverse and dynamic sources. Maintaining integrated resources over long periods of time presents distinct challenges. This review describes experiences and lessons learned from integrating data in two five-year projects focused on pathosystems biology: the Pathosystems Resource Integration Center (PATRIC, http://patric.vbi.vt.edu/), with a goal of developing bioinformatics resources for the research and countermeasures-development communities based on genomics data, and the Resource Center for Biodefense Proteomics Research (RCBPR, http://www.proteomicsresource.org/), with a goal of developing resources based on experiment data, such as microarray and proteomics data, from diverse sources and technologies. Some challenges include integrating genomic sequence and experiment data, data synchronization, data quality control, and usability engineering. We present examples of a variety of data-integration problems drawn from our experiences with PATRIC and RCBPR, as well as open research questions related to long-term sustainability, and describe the next steps toward meeting these challenges. Novel contributions of this work include 1) an approach for addressing discrepancies between experiment results and interpreted results, and 2) expanding the range of data-integration techniques to include usability engineering at the presentation level.
Affiliation(s)
- Daniel E Sullivan
- CyberInfrastructure Section, Virginia Bioinformatics Institute, Washington Street, MC 0477, Virginia Tech, Blacksburg, Virginia 24061, USA.

5
Krallinger M, Leitner F, Valencia A. Analysis of biological processes and diseases using text mining approaches. Methods Mol Biol 2010; 593:341-382. [PMID: 19957157 DOI: 10.1007/978-1-60327-194-3_16] [Citation(s) in RCA: 50] [Impact Index Per Article: 3.3] [Indexed: 05/28/2023]
Abstract
A number of biomedical text mining systems have been developed to extract biologically relevant information directly from the literature, complementing bioinformatics methods in the analysis of experimentally generated data. We provide a short overview of the general characteristics of natural language data, existing biomedical literature databases, and lexical resources relevant in the context of biomedical text mining. A selected number of practically useful systems are introduced together with the type of user queries supported and the results they generate. The extraction of biological relationships, such as protein-protein interactions as well as metabolic and signaling pathways using information extraction systems, will be discussed through example cases of cancer-relevant proteins. Basic strategies for detecting associations of genes to diseases together with literature mining of mutations, SNPs, and epigenetic information (methylation) are described. We provide an overview of disease-centric and gene-centric literature mining methods for linking genes to phenotypic and genotypic aspects. Moreover, we discuss recent efforts for finding biomarkers through text mining and for gene list analysis and prioritization. Some relevant issues for implementing a customized biomedical text mining system will be pointed out. To demonstrate the usefulness of literature mining for the molecular oncology domain, we implemented two cancer-related applications. The first tool consists of a literature mining system for retrieving human mutations together with supporting articles. Specific gene mutations are linked to a set of predefined cancer types. The second application consists of a text categorization system supporting breast cancer-specific literature search and document-based breast cancer gene ranking. Future trends in text mining emphasize the importance of community efforts such as the BioCreative challenge for the development and integration of multiple systems into a common platform provided by the BioCreative Metaserver.

6
Kuzniar A, Lin K, He Y, Nijveen H, Pongor S, Leunissen JAM. ProGMap: an integrated annotation resource for protein orthology. Nucleic Acids Res 2009; 37:W428-34. [PMID: 19494185 PMCID: PMC2703891 DOI: 10.1093/nar/gkp462] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/20/2022] Open
Abstract
Current protein sequence databases employ different classification schemes that often provide conflicting annotations, especially for poorly characterized proteins. ProGMap (Protein Group Mappings, http://www.bioinformatics.nl/progmap) is a web-tool designed to help researchers and database annotators to assess the coherence of protein groups defined in various databases and thereby facilitate the annotation of newly sequenced proteins. ProGMap is based on a non-redundant dataset of over 6.6 million protein sequences which is mapped to 240 000 protein group descriptions collected from UniProt, RefSeq, Ensembl, COG, KOG, OrthoMCL-DB, HomoloGene, TRIBES and PIRSF. ProGMap combines the underlying classification schemes via a network of links constructed by a fast and fully automated mapping approach originally developed for document classification. The web interface enables queries to be made using sequence identifiers, gene symbols, protein functions or amino acid and nucleotide sequences. For the latter query type BLAST similarity search and QuickMatch identity search services have been incorporated, for finding sequences similar (or identical) to a query sequence. ProGMap is meant to help users of high throughput methodologies who deal with partially annotated genomic data.
Affiliation(s)
- Arnold Kuzniar
- Laboratory of Bioinformatics, Wageningen University and Research Centre (WUR), Dreijenlaan 3, 6703 HA Wageningen, The Netherlands

7
Aravindhan G, Kumar GR, Kumar RS, Subha K. AJAX Interface: A Breakthrough in Bioinformatics Web Applications. PROTEOMICS INSIGHTS 2009. [DOI: 10.4137/pri.s2261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Abstract
Bioinformatics applications generally depend on multiple servers and must communicate with several information repositories to carry out an analysis. These applications remain computationally intensive and time-consuming because they involve large data transfers. Hence they face a major bottleneck when ported to the web. Browser-based web applications normally follow the classical request-response approach. If the response is delayed, as is expected for long-running bioinformatics programs, the web server (Apache) may give up waiting and a request timeout error might occur. Alternative approaches such as “Client-Pull” models, which poll the server through an unpredictable number of page refreshes, only tend to intensify network traffic. Hence a technology capable of supporting these varied, exhaustive bioinformatics processes becomes essential. In this review, we describe how AJAX can provide a concise framework within bioinformatics applications that eliminates the page-refresh nuisance and provides a better user experience.
Affiliation(s)
- G. Aravindhan
- Bioinformatics Division, AU-KBC Research Centre, MIT Campus, Anna University, Chennai-600 044, India
- G. Ramesh Kumar
- Bioinformatics Division, AU-KBC Research Centre, MIT Campus, Anna University, Chennai-600 044, India
- R. Sathish Kumar
- NRCFOSS, AU-KBC Research Centre, MIT Campus, Anna University, Chennai-600 044, India
- K. Subha
- Bioinformatics Division, AU-KBC Research Centre, MIT Campus, Anna University, Chennai-600 044, India

8
Messina DN, Sonnhammer ELL. DASher: a stand-alone protein sequence client for DAS, the Distributed Annotation System. Bioinformatics 2009; 25:1333-4. [PMID: 19297349 DOI: 10.1093/bioinformatics/btp153] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022]
Abstract
Summary: The rise in biological sequence data has led to a proliferation of separate, specialized databases. While there is great value in having many independent annotations, it is critical that there be a way to integrate them in one combined view. The Distributed Annotation System (DAS) was developed for that very purpose. There are currently no DAS clients that are open source, specialized for aggregating and comparing protein sequence annotation, and that can run as a self-contained application outside of a web browser. The speed, flexibility and extensibility that come with a stand-alone application motivated us to create DASher, an open-source Java DAS client. Given a UniProt sequence identifier, DASher automatically queries DAS-supporting servers worldwide for any information on that sequence and then displays the annotations in an interactive viewer for easy comparison. DASher is a fast, Java-based DAS client optimized for viewing protein sequence annotation and compliant with the latest DAS protocol specification 1.53E. Availability: DASher is available for direct use and download at http://dasher.sbc.su.se including examples and source code under the GPLv3 licence. Java version 6 or higher is required.
Affiliation(s)
- David N Messina
- Stockholm Bioinformatics Centre, Stockholm University, 10691 Stockholm, Sweden

9
Furney SJ, Calvo B, Larrañaga P, Lozano JA, Lopez-Bigas N. Prioritization of candidate cancer genes--an aid to oncogenomic studies. Nucleic Acids Res 2008; 36:e115. [PMID: 18710882 PMCID: PMC2566894 DOI: 10.1093/nar/gkn482] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Indexed: 11/29/2022] Open
Abstract
The development of techniques for oncogenomic analyses such as array comparative genomic hybridization, messenger RNA expression arrays and mutational screens has come to the fore in modern cancer research. Studies utilizing these techniques are able to highlight panels of genes that are altered in cancer. However, these candidate cancer genes must then be scrutinized to reveal whether they contribute to oncogenesis or are coincidental and non-causative. We present a computational method for the prioritization of candidate (i) proto-oncogenes and (ii) tumour suppressor genes from oncogenomic experiments. We constructed computational classifiers using different combinations of sequence and functional data including sequence conservation, protein domains and interactions, and regulatory data. We found that these classifiers are able to distinguish between known cancer genes and other human genes. Furthermore, the classifiers also discriminate candidate cancer genes from a recent mutational screen from other human genes. We provide a web-based facility through which cancer biologists may access our results and we propose computational cancer gene classification as a useful method of prioritizing candidate cancer genes identified in oncogenomic studies.
Affiliation(s)
- Simon J Furney
- Research Unit on Biomedical Informatics, Experimental and Health Science Department, Universitat Pompeu Fabra, Barcelona 08080, Spain

10
Gómez-López G, Valencia A. Bioinformatics and cancer research: building bridges for translational research. Clin Transl Oncol 2008; 10:85-95. [DOI: 10.1007/s12094-008-0161-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 02/05/2023]

11
Kim S, Shin SY, Lee IH, Kim SJ, Sriram R, Zhang BT. PIE: an online prediction system for protein-protein interactions from text. Nucleic Acids Res 2008; 36:W411-5. [PMID: 18508809 PMCID: PMC2447724 DOI: 10.1093/nar/gkn281] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Indexed: 11/13/2022] Open
Abstract
Protein–protein interaction (PPI) extraction has been an important research topic in the bio-text mining area, since PPI information is critical for understanding biological processes. However, there are very few open systems available on the Web, and most of them focus on keyword searching based on predefined PPIs. PIE (Protein Interaction information Extraction system) is a configurable Web service to extract PPIs from literature, including user-provided papers as well as PubMed articles. After abstracts or papers are provided, the prediction results are displayed in an easily readable form with essential, yet compact features. The PIE interface supports further features such as PDF file extraction, a PubMed search tool and network communication, which are useful for biologists and bio-system developers. The PIE system utilizes natural language processing techniques and machine learning methodologies to predict PPI sentences, which results in high precision performance for Web users. PIE is freely available at http://bi.snu.ac.kr/pie/.
Affiliation(s)
- Sun Kim
- Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University, Seoul 151-744, Korea

12
Prlić A, Down TA, Kulesha E, Finn RD, Kähäri A, Hubbard TJP. Integrating sequence and structural biology with DAS. BMC Bioinformatics 2007; 8:333. [PMID: 17850653 PMCID: PMC2031907 DOI: 10.1186/1471-2105-8-333] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.2] [Received: 05/30/2007] [Accepted: 09/12/2007] [Indexed: 11/16/2022] Open
Abstract
Background: The Distributed Annotation System (DAS) is a network protocol for exchanging biological data. It is frequently used to share annotations of genomes and protein sequences. Results: Here we present several extensions to the current DAS 1.5 protocol. These provide new commands to share alignments and three-dimensional molecular structure data, add the possibility for registration and discovery of DAS servers, and provide a convention for serving different types of data plots. We present examples of web sites and applications that use the new extensions. We operate a public registry of DAS sources, which now includes entries for more than 250 distinct sources. Conclusion: Our DAS extensions are essential for the management of the growing number of services and exchange of diverse biological data sets. In addition, the extensions allow new types of applications to be developed and scientific questions to be addressed. The registry of DAS sources is available at
Affiliation(s)
- Andreas Prlić
- The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
- Thomas A Down
- Wellcome Trust/Cancer Research UK Gurdon Institute, Cambridge University, Cambridge, UK
- Eugene Kulesha
- European Bioinformatics Institute, Hinxton, Cambridge, UK
- Robert D Finn
- The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
- Andreas Kähäri
- European Bioinformatics Institute, Hinxton, Cambridge, UK
- Tim JP Hubbard
- The Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK