1
|
DNAPred_Prot: Identification of DNA-Binding Proteins Using Composition- and Position-Based Features. Appl Bionics Biomech 2022; 2022:5483115. [PMID: 35465187 PMCID: PMC9020926 DOI: 10.1155/2022/5483115] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 09/15/2021] [Revised: 12/25/2021] [Accepted: 02/05/2022] [Indexed: 12/29/2022] Open
Abstract
In the domain of genome annotation, the identification of DNA-binding protein is one of the crucial challenges. DNA is considered a blueprint for the cell. It contained all necessary information for building and maintaining the trait of an organism. It is DNA, which makes a living thing, a living thing. Protein interaction with DNA performs an essential role in regulating DNA functions such as DNA repair, transcription, and regulation. Identification of these proteins is a crucial task for understanding the regulation of genes. Several methods have been developed to identify the binding sites of DNA and protein depending upon the structures and sequences, but they were costly and time-consuming. Therefore, we propose a methodology named “DNAPred_Prot”, which uses various position and frequency-dependent features from protein sequences for efficient and effective prediction of DNA-binding proteins. Using testing techniques like 10-fold cross-validation and jackknife testing an accuracy of 94.95% and 95.11% was yielded, respectively. The results of SVM and ANN were also compared with those of a random forest classifier. The robustness of the proposed model was evaluated by using the independent dataset PDB186, and an accuracy of 91.47% was achieved by it. From these results, it can be predicted that the suggested methodology performs better than other extant methods for the identification of DNA-binding proteins.
Collapse
|
2
|
Pallotta S, Cascianelli S, Masseroli M. RGMQL: scalable and interoperable computing of heterogeneous omics big data and metadata in R/Bioconductor. BMC Bioinformatics 2022; 23:123. [PMID: 35392801 PMCID: PMC8991469 DOI: 10.1186/s12859-022-04648-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 07/24/2021] [Accepted: 03/23/2022] [Indexed: 12/25/2022] Open
Abstract
Background Heterogeneous omics data, increasingly collected through high-throughput technologies, can contain hidden answers to very important and still unsolved biomedical questions. Their integration and processing are crucial mostly for tertiary analysis of Next Generation Sequencing data, although suitable big data strategies still address mainly primary and secondary analysis. Hence, there is a pressing need for algorithms specifically designed to explore big omics datasets, capable of ensuring scalability and interoperability, possibly relying on high-performance computing infrastructures. Results We propose RGMQL, a R/Bioconductor package conceived to provide a set of specialized functions to extract, combine, process and compare omics datasets and their metadata from different and differently localized sources. RGMQL is built over the GenoMetric Query Language (GMQL) data management and computational engine, and can leverage its open curated repository as well as its cloud-based resources, with the possibility of outsourcing computational tasks to GMQL remote services. Furthermore, it overcomes the limits of the GMQL declarative syntax, by guaranteeing a procedural approach in dealing with omics data within the R/Bioconductor environment. But mostly, it provides full interoperability with other packages of the R/Bioconductor framework and extensibility over the most used genomic data structures and processing functions. Conclusions RGMQL is able to combine the query expressiveness and computational efficiency of GMQL with a complete processing flow in the R environment, being a fully integrated extension of the R/Bioconductor framework. Here we provide three fully reproducible example use cases of biological relevance that are particularly explanatory of its flexibility of use and interoperability with other R/Bioconductor packages. They show how RGMQL can easily scale up from local to parallel and cloud computing while it combines and analyzes heterogeneous omics data from local or remote datasets, both public and private, in a completely transparent way to the user. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04648-4.
Collapse
Affiliation(s)
- Simone Pallotta
- Dipartimento di Elettronica, Informazione e Bioingegneria, Via Ponzio 34/5, 20133, Milan, Italy
| | - Silvia Cascianelli
- Dipartimento di Elettronica, Informazione e Bioingegneria, Via Ponzio 34/5, 20133, Milan, Italy.
| | - Marco Masseroli
- Dipartimento di Elettronica, Informazione e Bioingegneria, Via Ponzio 34/5, 20133, Milan, Italy
| |
Collapse
|
3
|
Chen J, Li F, Wang M, Li J, Marquez-Lago TT, Leier A, Revote J, Li S, Liu Q, Song J. BigFiRSt: A Software Program Using Big Data Technique for Mining Simple Sequence Repeats From Large-Scale Sequencing Data. Front Big Data 2022; 4:727216. [PMID: 35118375 PMCID: PMC8805145 DOI: 10.3389/fdata.2021.727216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 06/18/2021] [Accepted: 12/13/2021] [Indexed: 11/22/2022] Open
Abstract
Background Simple Sequence Repeats (SSRs) are short tandem repeats of nucleotide sequences. It has been shown that SSRs are associated with human diseases and are of medical relevance. Accordingly, a variety of computational methods have been proposed to mine SSRs from genomes. Conventional methods rely on a high-quality complete genome to identify SSRs. However, the sequenced genome often misses several highly repetitive regions. Moreover, many non-model species have no entire genomes. With the recent advances of next-generation sequencing (NGS) techniques, large-scale sequence reads for any species can be rapidly generated using NGS. In this context, a number of methods have been proposed to identify thousands of SSR loci within large amounts of reads for non-model species. While the most commonly used NGS platforms (e.g., Illumina platform) on the market generally provide short paired-end reads, merging overlapping paired-end reads has become a common way prior to the identification of SSR loci. This has posed a big data analysis challenge for traditional stand-alone tools to merge short read pairs and identify SSRs from large-scale data. Results In this study, we present a new Hadoop-based software program, termed BigFiRSt, to address this problem using cutting-edge big data technology. BigFiRSt consists of two major modules, BigFLASH and BigPERF, implemented based on two state-of-the-art stand-alone tools, FLASH and PERF, respectively. BigFLASH and BigPERF address the problem of merging short read pairs and mining SSRs in the big data manner, respectively. Comprehensive benchmarking experiments show that BigFiRSt can dramatically reduce the execution times of fast read pairs merging and SSRs mining from very large-scale DNA sequence data. Conclusions The excellent performance of BigFiRSt mainly resorts to the Big Data Hadoop technology to merge read pairs and mine SSRs in parallel and distributed computing on clusters. We anticipate BigFiRSt will be a valuable tool in the coming biological Big Data era.
Collapse
Affiliation(s)
- Jinxiang Chen
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Fuyi Li
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
- Department of Microbiology and Immunity, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, VIC, Australia
| | - Miao Wang
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Junlong Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Tatiana T. Marquez-Lago
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
| | - André Leier
- Department of Genetics, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Jerico Revote
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
| | - Shuqin Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
| | - Quanzhong Liu
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling, China
- Quanzhong Liu
| | - Jiangning Song
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC, Australia
- Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia
- *Correspondence: Jiangning Song
| |
Collapse
|
4
|
Koppad S, B A, Gkoutos GV, Acharjee A. Cloud Computing Enabled Big Multi-Omics Data Analytics. Bioinform Biol Insights 2021; 15:11779322211035921. [PMID: 34376975 PMCID: PMC8323418 DOI: 10.1177/11779322211035921] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 05/01/2021] [Accepted: 07/12/2021] [Indexed: 12/27/2022] Open
Abstract
High-throughput experiments enable researchers to explore complex multifactorial
diseases through large-scale analysis of omics data. Challenges for such
high-dimensional data sets include storage, analyses, and sharing. Recent
innovations in computational technologies and approaches, especially in cloud
computing, offer a promising, low-cost, and highly flexible solution in the
bioinformatics domain. Cloud computing is rapidly proving increasingly useful in
molecular modeling, omics data analytics (eg, RNA sequencing, metabolomics, or
proteomics data sets), and for the integration, analysis, and interpretation of
phenotypic data. We review the adoption of advanced cloud-based and big data
technologies for processing and analyzing omics data and provide insights into
state-of-the-art cloud bioinformatics applications.
Collapse
Affiliation(s)
- Saraswati Koppad
- Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
| | - Annappa B
- Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, India
| | - Georgios V Gkoutos
- Institute of Cancer and Genomic Sciences and Centre for Computational Biology, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK.,Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK.,NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospitals Birmingham, Birmingham, UK.,MRC Health Data Research UK (HDR UK), London, UK.,NIHR Experimental Cancer Medicine Centre, Birmingham, UK.,NIHR Biomedical Research Centre, University Hospitals Birmingham, Birmingham, UK
| | - Animesh Acharjee
- Institute of Cancer and Genomic Sciences and Centre for Computational Biology, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK.,Institute of Translational Medicine, University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK.,NIHR Surgical Reconstruction and Microbiology Research Centre, University Hospitals Birmingham, Birmingham, UK
| |
Collapse
|
5
|
Zou Y, Zhu Y, Li Y, Wu FX, Wang J. Parallel computing for genome sequence processing. Brief Bioinform 2021; 22:6210355. [PMID: 33822883 DOI: 10.1093/bib/bbab070] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 12/04/2020] [Revised: 01/26/2021] [Accepted: 02/10/2021] [Indexed: 01/08/2023] Open
Abstract
The rapid increase of genome data brought by gene sequencing technologies poses a massive challenge to data processing. To solve the problems caused by enormous data and complex computing requirements, researchers have proposed many methods and tools which can be divided into three types: big data storage, efficient algorithm design and parallel computing. The purpose of this review is to investigate popular parallel programming technologies for genome sequence processing. Three common parallel computing models are introduced according to their hardware architectures, and each of which is classified into two or three types and is further analyzed with their features. Then, the parallel computing for genome sequence processing is discussed with four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each kind of application, its background is firstly introduced, and then a list of tools or algorithms are summarized in the aspects of principle, hardware platform and computing efficiency. The programming model of each hardware and application provides a reference for researchers to choose high-performance computing tools. Finally, we discuss the limitations and future trends of parallel computing technologies.
Collapse
Affiliation(s)
- You Zou
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yuejie Zhu
- Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China
| | - Yaohang Li
- computer science at Old Dominion University, USA
| | - Fang-Xiang Wu
- College of Engineering and the Department of Computer Science at the University of Saskatchewan, Saskatoon, Canada
| | - Jianxin Wang
- School of Computer Science and Engineering at Central South University, Changsha, Hunan, China
| |
Collapse
|
6
|
Xie W, Feng YE. Prediction of the Disordered Regions of Intrinsically Disordered Proteins Based on the Molecular Functions. Protein Pept Lett 2020; 27:279-286. [PMID: 30819075 DOI: 10.2174/0929866526666190226160629] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 12/08/2018] [Revised: 01/03/2019] [Accepted: 02/08/2019] [Indexed: 01/29/2023]
Abstract
BACKGROUND Intrinsically disordered proteins lack a well-defined three dimensional structure under physiological conditions while possessing the essential biological functions. They take part in various physiological processes such as signal transduction, transcription and posttranslational modifications and etc. The disordered regions are the main functional sites for intrinsically disordered proteins. Therefore, the research of the disordered regions has become a hot issue. OBJECTIVE In this paper, our motivation is to analysis of the features of disordered regions with different molecular functions and predict of different disordered regions using valid features. METHODS In this article, according to the different molecular function, we firstly divided intrinsically disordered proteins into six classes in DisProt database. Then, we extracted four features using bioinformatics methods, namely, Amino Acid Index (AAIndex), codon frequency (Codon), three kinds of protein secondary structure compositions (3PSS) and Chemical Shifts (CSs), and used these features to predict the disordered regions of the different functions by Support Vector Machine (SVM). RESULTS The best overall accuracy was 99.29% using the chemical shift (CSs) as feature. In feature fusion, the overall accuracy can reach 88.70% by using CSs+AAIndex as features. The overall accuracy was up to 86.09% by using CSs+AAIndex+Codon+3PSS as features. CONCLUSION We predicted and analyzed the disordered regions based on the molecular functions. The results showed that the prediction performance can be improved by adding chemical shifts and AAIndex as features, especially chemical shifts. Moreover, the chemical shift was the most effective feature in the prediction. We hoped that our results will be constructive for the study of intrinsically disordered proteins.
Collapse
Affiliation(s)
- WeiXia Xie
- College of Science, Inner Mongolia Agriculture University, Hohhot 010018, China
| | - Yong E Feng
- College of Science, Inner Mongolia Agriculture University, Hohhot 010018, China
| |
Collapse
|
7
|
Zhan Q, Fu Y, Jiang Q, Liu B, Peng J, Wang Y. SpliVert: A Protein Multiple Sequence Alignment Refinement Method Based on Splitting-Splicing Vertically. Protein Pept Lett 2020; 27:295-302. [PMID: 31385760 DOI: 10.2174/0929866526666190806143959] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 04/06/2019] [Revised: 04/26/2019] [Accepted: 06/14/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Multiple Sequence Alignment (MSA) is a fundamental task in bioinformatics and is required for many biological analysis tasks. The more accurate the alignments are, the more credible the downstream analyses. Most protein MSA algorithms realign an alignment to refine it by dividing it into two groups horizontally and then realign the two groups. However, this strategy does not consider that different regions of the sequences have different conservation; this property may lead to incorrect residue-residue or residue-gap pairs, which cannot be corrected by this strategy. OBJECTIVE In this article, our motivation is to develop a novel refinement method based on splitting- splicing vertically. METHODS Here, we present a novel refinement method based on splitting-splicing vertically, called SpliVert. For an alignment, we split it vertically into 3 parts, remove the gap characters in the middle, realign the middle part alone, and splice the realigned middle parts with the other two initial pieces to obtain a refined alignment. In the realign procedure of our method, the aligner will only focus on a certain part, ignoring the disturbance of the other parts, which could help fix the incorrect pairs. RESULTS We tested our refinement strategy for 2 leading MSA tools on 3 standard benchmarks, according to the commonly used average SP (and TC) score. The results show that given appropriate proportions to split the initial alignment, the average scores are increased comparably or slightly after using our method. We also compared the alignments refined by our method with alignments directly refined by the original alignment tools. The results suggest that using our SpliVert method to refine alignments can also outperform direct use of the original alignment tools. CONCLUSION The results reveal that splitting vertically and realigning part of the alignment is a good strategy for the refinement of protein multiple sequence alignments.
Collapse
Affiliation(s)
- Qing Zhan
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Yilei Fu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Qinghua Jiang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Bo Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| | - Jiajie Peng
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
| | - Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
Collapse
|
8
|
Mrozek D. A review of Cloud computing technologies for comprehensive microRNA analyses. Comput Biol Chem 2020; 88:107365. [PMID: 32906056 DOI: 10.1016/j.compbiolchem.2020.107365] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 07/09/2020] [Revised: 08/05/2020] [Accepted: 08/18/2020] [Indexed: 01/08/2023]
Abstract
Cloud computing revolutionized many fields that require ample computational power. Cloud platforms may also provide huge support for microRNA analysis mainly through disclosing scalable resources of different types. In Clouds, these resources are available as services, which simplifies their allocation and releasing. This feature is especially useful during the analysis of large volumes of data, like the one produced by next generation sequencing experiments, which require not only extended storage space but also a distributed computing environment. In this paper, we show which of the Cloud properties and service models can be especially beneficial for microRNA analysis. We also explain the most useful services of the Cloud (including storage space, computational power, web application hosting, machine learning models, and Big Data frameworks) that can be used for microRNA analysis. At the same time, we review several solutions for microRNA and show that the utilization of the Cloud in this field is still weak, but can increase in the future when the awareness of their applicability grows.
Collapse
Affiliation(s)
- Dariusz Mrozek
- Department of Applied Informatics, Silesian University of Technology, Akademicka 16, 44-100 Gliwice, Poland.
| |
Collapse
|
9
|
Li F, Wang Y, Li C, Marquez-Lago TT, Leier A, Rawlings ND, Haffari G, Revote J, Akutsu T, Chou KC, Purcell AW, Pike RN, Webb GI, Ian Smith A, Lithgow T, Daly RJ, Whisstock JC, Song J. Twenty years of bioinformatics research for protease-specific substrate and cleavage site prediction: a comprehensive revisit and benchmarking of existing methods. Brief Bioinform 2019; 20:2150-2166. [PMID: 30184176 PMCID: PMC6954447 DOI: 10.1093/bib/bby077] [Citation(s) in RCA: 58] [Impact Index Per Article: 9.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 06/14/2018] [Revised: 07/26/2018] [Accepted: 08/01/2018] [Indexed: 01/06/2023] Open
Abstract
The roles of proteolytic cleavage have been intensively investigated and discussed during the past two decades. This irreversible chemical process has been frequently reported to influence a number of crucial biological processes (BPs), such as cell cycle, protein regulation and inflammation. A number of advanced studies have been published aiming at deciphering the mechanisms of proteolytic cleavage. Given its significance and the large number of functionally enriched substrates targeted by specific proteases, many computational approaches have been established for accurate prediction of protease-specific substrates and their cleavage sites. Consequently, there is an urgent need to systematically assess the state-of-the-art computational approaches for protease-specific cleavage site prediction to further advance the existing methodologies and to improve the prediction performance. With this goal in mind, in this article, we carefully evaluated a total of 19 computational methods (including 8 scoring function-based methods and 11 machine learning-based methods) in terms of their underlying algorithm, calculated features, performance evaluation and software usability. Then, extensive independent tests were performed to assess the robustness and scalability of the reviewed methods using our carefully prepared independent test data sets with 3641 cleavage sites (specific to 10 proteases). The comparative experimental results demonstrate that PROSPERous is the most accurate generic method for predicting eight protease-specific cleavage sites, while GPS-CCD and LabCaS outperformed other predictors for calpain-specific cleavage sites. Based on our review, we then outlined some potential ways to improve the prediction performance and ease the computational burden by applying ensemble learning, deep learning, positive unlabeled learning and parallel and distributed computing techniques. We anticipate that our study will serve as a practical and useful guide for interested readers to further advance next-generation bioinformatics tools for protease-specific cleavage site prediction.
Collapse
Affiliation(s)
- Fuyi Li
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Yanan Wang
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Chen Li
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Department of Biology, Institute of Molecular Systems Biology,ETH Zürich, Zürich 8093, Switzerland
| | - Tatiana T Marquez-Lago
- Department of Genetics and Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - André Leier
- Department of Genetics and Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA
| | - Neil D Rawlings
- EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Wellcome Trust Genome Campus,Hinxton, Cambridgeshire CB10 1SD, UK
| | - Gholamreza Haffari
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Jerico Revote
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto 611-0011, Japan
| | - Kuo-Chen Chou
- Gordon Life Science Institute, Boston, MA 02478, USA
- Center for Informational Biology, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Anthony W Purcell
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Robert N Pike
- La Trobe Institute for Molecular Science, La Trobe University, Melbourne, VIC 3086, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| | - Geoffrey I Webb
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - A Ian Smith
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| | - Trevor Lithgow
- Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, Victoria 3800, Australia
| | - Roger J Daly
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - James C Whisstock
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| | - Jiangning Song
- Biomedicine Discovery Institute and Department of Biochemistry & Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| |
Collapse
|
10
|
Ge J, Meng J, Guo N, Wei Y, Balaji P, Feng S. Counting Kmers for Biological Sequences at Large Scale. Interdiscip Sci 2019; 12:99-108. [PMID: 31734873 DOI: 10.1007/s12539-019-00348-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 07/08/2019] [Revised: 08/19/2019] [Accepted: 10/25/2019] [Indexed: 11/25/2022]
Abstract
Counting the abundance of all the distinct kmers in biological sequence data is a fundamental step in bioinformatics. These applications include de novo genome assembly, error correction, etc. With the development of sequencing technology, the sequence data in a single project can reach Petabyte-scale or Terabyte-scale nucleotides. Counting demand for the abundance of these sequencing data is beyond the memory and computing capacity of single computing node, and how to process it efficiently is a challenge on a high-performance computing cluster. As such, we propose SWAPCounter, a highly scalable distributed approach for kmer counting. This approach is embedded with an MPI streaming I/O module for loading huge data set at high speed, and a counting bloom filter module for both memory and communication efficiency. By overlapping all the counting steps, SWAPCounter achieves high scalability with high parallel efficiency. The experimental results indicate that SWAPCounter has competitive performance with two other tools on shared memory environment, KMC2, and MSPKmerCounter. Moreover, SWAPCounter also shows the highest scalability under strong scaling experiments. In our experiment on Cetus supercomputer, SWAPCounter scales to 32,768 cores with 79% parallel efficiency (using 2048 cores as baseline) when processing 4 TB sequence data of 1000 Genomes. The source code of SWAPCounter is publicly available at https://github.com/mengjintao/SWAPCounter.
Collapse
Affiliation(s)
- Jianqiu Ge
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
| | - Jintao Meng
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
| | - Ning Guo
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
| | - Yanjie Wei
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China.
| | - Pavan Balaji
- Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, IL, 60439-4844, USA
| | - Shengzhong Feng
- Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Beijing, 518055, China
| |
Collapse
|
11
|
Rao B, Zhou C, Zhang G, Su R, Wei L. ACPred-Fuse: fusing multi-view information improves the prediction of anticancer peptides. Brief Bioinform 2019; 21:1846-1855. [DOI: 10.1093/bib/bbz088] [Citation(s) in RCA: 57] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 04/24/2019] [Revised: 06/06/2019] [Accepted: 06/22/2019] [Indexed: 02/04/2023] Open
Abstract
Abstract
Fast and accurate identification of the peptides with anticancer activity potential from large-scale proteins is currently a challenging task. In this study, we propose a new machine learning predictor, namely, ACPred-Fuse, that can automatically and accurately predict protein sequences with or without anticancer activity in peptide form. Specifically, we establish a feature representation learning model that can explore class and probabilistic information embedded in anticancer peptides (ACPs) by integrating a total of 29 different sequence-based feature descriptors. In order to make full use of various multiview information, we further fused the class and probabilistic features with handcrafted sequential features and then optimized the representation ability of the multiview features, which are ultimately used as input for training our prediction model. By comparing the multiview features and existing feature descriptors, we demonstrate that the fused multiview features have more discriminative ability to capture the characteristics of ACPs. In addition, the information from different views is complementary for the performance improvement. Finally, our benchmarking comparison results showed that the proposed ACPred-Fuse is more precise and promising in the identification of ACPs than existing predictors. To facilitate the use of the proposed predictor, we built a web server, which is now freely available via http://server.malab.cn/ACPred-Fuse.
Collapse
Affiliation(s)
- Bing Rao
- School of Mechanical Electronic & Information Engineering, China University of Mining &Technology, Beijing, China
| | - Chen Zhou
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Guoying Zhang
- School of Mechanical Electronic & Information Engineering, China University of Mining &Technology, Beijing, China
| | - Ran Su
- School of Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Leyi Wei
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| |
Collapse
|
12
|
Hansen AW, Murugan M, Li H, Khayat MM, Wang L, Rosenfeld J, Andrews BK, Jhangiani SN, Coban Akdemir ZH, Sedlazeck FJ, Ashley-Koch AE, Liu P, Muzny DM, Davis EE, Katsanis N, Sabo A, Posey JE, Yang Y, Wangler MF, Eng CM, Sutton VR, Lupski JR, Boerwinkle E, Gibbs RA. A Genocentric Approach to Discovery of Mendelian Disorders. Am J Hum Genet 2019; 105:974-986. [PMID: 31668702 PMCID: PMC6849092 DOI: 10.1016/j.ajhg.2019.09.027] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 06/06/2019] [Accepted: 09/27/2019] [Indexed: 12/20/2022] Open
Abstract
The advent of inexpensive, clinical exome sequencing (ES) has led to the accumulation of genetic data from thousands of samples from individuals affected with a wide range of diseases, but for whom the underlying genetic and molecular etiology of their clinical phenotype remains unknown. In many cases, detailed phenotypes are unavailable or poorly recorded and there is little family history to guide study. To accelerate discovery, we integrated ES data from 18,696 individuals referred for suspected Mendelian disease, together with relatives, in an Apache Hadoop data lake (Hadoop Architecture Lake of Exomes [HARLEE]) and implemented a genocentric analysis that rapidly identified 154 genes harboring variants suspected to cause Mendelian disorders. The approach did not rely on case-specific phenotypic classifications but was driven by optimization of gene- and variant-level filter parameters utilizing historical Mendelian disease-gene association discovery data. Variants in 19 of the 154 candidate genes were subsequently reported as causative of a Mendelian trait and additional data support the association of all other candidate genes with disease endpoints.
Collapse
Affiliation(s)
- Adam W Hansen
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Mullai Murugan
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - He Li
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Michael M Khayat
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Liwen Wang
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Jill Rosenfeld
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - B Kim Andrews
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Shalini N Jhangiani
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Zeynep H Coban Akdemir
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Fritz J Sedlazeck
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Allison E Ashley-Koch
- Duke Molecular Physiology Institute, Duke University Medical Center, Durham, NC 27710, USA; Department of Medicine, Duke University Medical Center, Durham, NC 27710, USA
| | - Pengfei Liu
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Donna M Muzny
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Erica E Davis
- Pediatric Genetic and translational Medicine Center (P-GeM), Stanley Manne Children's Research Institute, Chicago, IL 60611, USA; Department of Pediatrics, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Nicholas Katsanis
- Pediatric Genetic and translational Medicine Center (P-GeM), Stanley Manne Children's Research Institute, Chicago, IL 60611, USA; Department of Pediatrics, Feinberg School of Medicine, Northwestern University, Chicago, IL 60611, USA
| | - Aniko Sabo
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA
| | - Jennifer E Posey
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Yaping Yang
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Michael F Wangler
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Christine M Eng
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA
| | - V Reid Sutton
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Texas Children's Hospital, Houston, TX 77030, USA
| | - James R Lupski
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA; Texas Children's Hospital, Houston, TX 77030, USA; Department of Pediatrics, Baylor College of Medicine, Houston, TX 77030, USA
| | - Eric Boerwinkle
- Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA; School of Public Health, UTHealth, Houston, TX 77030, USA
| | - Richard A Gibbs
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX 77030, USA; Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX 77030, USA.
| |
Collapse
|
13
|
Pan L, Deng Y. An association coefficient of a belief function and its application in a target recognition system. INT J INTELL SYST 2019. [DOI: 10.1002/int.22200] [Citation(s) in RCA: 59] [Impact Index Per Article: 9.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Indexed: 12/20/2022]
Affiliation(s)
- Lipeng Pan
- Institute of Fundamental and Frontier ScienceUniversity of Electronic Science and Technology of ChinaChengdu China
| | - Yong Deng
- Institute of Fundamental and Frontier ScienceUniversity of Electronic Science and Technology of ChinaChengdu China
| |
Collapse
|
14
|
Bai N, Tang S, Yu C, Fu H, Wang C, Chen X. GMSA: A Data Sharing System for Multiple Sequence Alignment Across Multiple Users. Curr Bioinform 2019. [DOI: 10.2174/1574893614666190111160101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Indexed: 11/22/2022]
Abstract
Background:
In recent years, the rapid growth of biological datasets in Bioinformatics
has made the computation of Multiple Sequence Alignment (MSA) become extremely slow. Using
the GPU to accelerate MSA has shown to be an effective approach. Moreover, there is a trend that
many bioinformatic researchers or institutes setup a shared server for remote users to submit MSA
jobs via provided web-pages or tools.
Objective:
Given the fact that different MSA jobs submitted by users often process similar datasets,
there can be an opportunity for users to share their computation results between each other,
which can avoid the redundant computation and thereby reduce the overall computing time. Furthermore,
in the heterogeneous CPU/GPU platform, many existing applications assign their computation
on GPU devices only, which leads to a waste of the CPU resources. Co-run computation
can increase the utilization of computing resources on both CPUs and GPUs by dispatching workloads
onto them simultaneously.
Methods:
In this paper, we propose an efficient MSA system called GMSA for multi-users on
shared heterogeneous CPU/GPU platforms. To accelerate the computation of jobs from multiple
users, data sharing is considered in GMSA due to the fact that different MSA jobs often have a
percentage of the same data and tasks. Additionally, we also propose a scheduling strategy based
on the similarity in datasets or tasks between MSA jobs. Furthermore, co-run computation model
is adopted to take full use of both CPUs and GPUs.
Results:
We use four protein datasets which were redesigned according to different similarity. We
compare GMSA with ClustalW and CUDA-ClustalW in multiple users scenarios. Experiments results
showed that GMSA can achieve a speedup of up to 32X.
Conclusion:
GMSA is a system designed for accelerating the computation of MSA jobs with
shared input datasets on heterogeneous CPU/GPU platforms. In this system, a strategy was proposed
and implemented to find the common datasets among jobs submitted by multiple users, and
a scheduling algorithm is presented based on it. To utilize the overall resource of both CPU and
GPU, GMSA employs the co-run computation model. Results showed that it can speed up the total
computation of jobs efficiently.
Collapse
Affiliation(s)
- Na Bai
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| | - Shanjiang Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| | - Ce Yu
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| | - Hao Fu
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| | - Chen Wang
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana- Champaign, Illinois, United States
| | - Xi Chen
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, P.O. Box 300350, China
| |
Collapse
|
15
|
Liu B, Li S. ProtDet-CCH: Protein Remote Homology Detection by Combining Long Short-Term Memory and Ranking Methods. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:1203-1210. [PMID: 29993950 DOI: 10.1109/tcbb.2018.2789880] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Academic Contribution Register] [Indexed: 06/08/2023]
Abstract
As one of the most challenging tasks in sequence analysis, protein remote homology detection has been extensively studied. Methods based on discriminative models and ranking approaches have achieved the state-of-the-art performance, and these two kinds of methods are complementary. In this study, three LSTM models have been applied to construct the predictors for protein remote homology detection, including ULSTM, BLSTM, and CNN-BLSTM. They are able to automatically extract the local and global sequence order information. Combined with PSSMs, the CNN-BLSTM achieved the best performance among the three LSTM-based models. We named this method as CNN-BLSTM-PSSM. Finally, a new method called ProtDet-CCH was proposed by combining CNN-BLSTM-PSSM and a ranking method HHblits. Tested on a widely used SCOP benchmark dataset, ProtDet-CCH achieved an ROC score of 0.998, and an ROC50 score of 0.982, significantly outperforming other existing state-of-the-art methods. Experimental results on two updated SCOPe independent datasets showed that ProtDet-CCH can achieve stable performance. Furthermore, our method can provide useful insights for studying the features and motifs of protein families and superfamilies. It is anticipated that ProtDet-CCH will become a very useful tool for protein remote homology detection.
Collapse
|
16
|
Xuan P, Sun C, Zhang T, Ye Y, Shen T, Dong Y. Gradient Boosting Decision Tree-Based Method for Predicting Interactions Between Target Genes and Drugs. Front Genet 2019; 10:459. [PMID: 31214240 PMCID: PMC6555260 DOI: 10.3389/fgene.2019.00459] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 02/12/2019] [Accepted: 04/30/2019] [Indexed: 02/01/2023] Open
Abstract
Determining the target genes that interact with drugs—drug–target interactions—plays an important role in drug discovery. Identification of drug–target interactions through biological experiments is time consuming, laborious, and costly. Therefore, using computational approaches to predict candidate targets is a good way to reduce the cost of wet-lab experiments. However, the known interactions (positive samples) and the unknown interactions (negative samples) display a serious class imbalance, which has an adverse effect on the accuracy of the prediction results. To mitigate the impact of class imbalance and completely exploit the negative samples, we proposed a new method, named DTIGBDT, based on gradient boosting decision trees, for predicting candidate drug–target interactions. We constructed a drug–target heterogeneous network that contains the drug similarities based on the chemical structures of drugs, the target similarities based on target sequences, and the known drug–target interactions. The topological information of the network was captured by random walks to update the similarities between drugs or targets. The paths between drugs and targets could be divided into multiple categories, and the features of each category of paths were extracted. We constructed a prediction model based on gradient boosting decision trees. The model establishes multiple decision trees with the extracted features and obtains the interaction scores between drugs and targets. DTIGBDT is a method of ensemble learning, and it effectively reduces the impact of class imbalance. The experimental results indicate that DTIGBDT outperforms several state-of-the-art methods for drug–target interaction prediction. In addition, case studies on Quetiapine, Clozapine, Olanzapine, Aripiprazole, and Ziprasidone demonstrate the ability of DTIGBDT to discover potential drug–target interactions.
Collapse
Affiliation(s)
- Ping Xuan
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Chang Sun
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Tiangang Zhang
- School of Mathematical Science, Heilongjiang University, Harbin, China
| | - Yilin Ye
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Tonghui Shen
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| | - Yihua Dong
- School of Computer Science and Technology, Heilongjiang University, Harbin, China
| |
Collapse
|
17
|
Affiliation(s)
- Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
18
|
Ru X, Li L, Wang C. Identification of Phage Viral Proteins With Hybrid Sequence Features. Front Microbiol 2019; 10:507. [PMID: 30972038 PMCID: PMC6443926 DOI: 10.3389/fmicb.2019.00507] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 12/24/2018] [Accepted: 02/27/2019] [Indexed: 02/01/2023] Open
Abstract
The uniqueness of bacteriophages plays an important role in bioinformatics research. In real applications, the function of the bacteriophage virion proteins is the main area of interest. Therefore, it is very important to classify bacteriophage virion proteins and non-phage virion proteins accurately. Extracting comprehensive and effective sequence features from proteins plays a vital role in protein classification. In order to more fully represent protein information, this paper is more comprehensive and effective by combining the features extracted by the feature information representation algorithm based on sequence information (CCPA) and the feature representation algorithm based on sequence and structure information. After extracting features, the Max-Relevance-Max-Distance (MRMD) algorithm is used to select the optimal feature set with the strongest correlation between class labels and low redundancy between features. Given the randomness of the samples selected by the random forest classification algorithm and the randomness features for producing each node variable, a random forest method is employed to perform 10-fold cross-validation on the bacteriophage protein classification. The accuracy of this model is as high as 93.5% in the classification of phage proteins in this study. This study also found that, among the eight physicochemical properties considered, the charge property has the greatest impact on the classification of bacteriophage proteins These results indicate that the model discussed in this paper is an important tool in bacteriophage protein research.
Collapse
Affiliation(s)
- Xiaoqing Ru
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
| | - Lihong Li
- School of Information and Electrical Engineering, Hebei University of Engineering, Handan, China
| | - Chunyu Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
| |
Collapse
|
19
|
Guo M, Xu E, Ai D. Inferring Bacterial Infiltration in Primary Colorectal Tumors From Host Whole Genome Sequencing Data. Front Genet 2019; 10:213. [PMID: 30930939 PMCID: PMC6428740 DOI: 10.3389/fgene.2019.00213] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 01/19/2019] [Accepted: 02/27/2019] [Indexed: 12/13/2022] Open
Abstract
Colorectal cancer is the third most common cancer worldwide with abysmal survival, thus requiring novel therapy strategies. Numerous studies have frequently observed infiltrating bacteria within the primary tumor tissues derived from patients. These studies have implicated the relative abundance of these bacteria as a contributing factor in tumor progression. Infiltrating bacteria are believed to be among the major drivers of tumorigenesis, progression, and metastasis and, hence, promising targets for new treatments. However, measuring their abundance directly remains challenging. One potential approach is to use the unmapped reads of host whole genome sequencing (hWGS) data, which previous studies have considered as contaminants and discarded. Here, we developed rigorous bioinformatics and statistical procedures to identify tumor-infiltrating bacteria associated with colorectal cancer from such whole genome sequencing data. Our approach used the reads of whole genome sequencing data of colon adenocarcinoma tissues not mapped to the human reference genome, including unmapped paired-end read pairs and single-end reads, the mates of which were mapped. We assembled the unmapped read pairs, remapped all those reads to the collection of human microbiome reference, and then computed their relative abundance of microbes by maximum likelihood (ML) estimation. We analyzed and compared the relative abundance and diversity of infiltrating bacteria between primary tumor tissues and associated normal blood samples. Our results showed that primary tumor tissues contained far more diverse total infiltrating bacteria than normal blood samples. The relative abundance of Bacteroides fragilis, Bacteroides dorei, and Fusobacterium nucleatum was significantly higher in primary colorectal tumors. These three bacteria were among the top ten microbes in the primary tumor tissues, yet were rarely found in normal blood samples. As a validation step, most of these bacteria were also closely associated with colorectal cancer in previous studies with alternative approaches. In summary, our approach provides a new analytic technique for investigating the infiltrating bacterial community within tumor tissues. Our novel cloud-based bioinformatics and statistical pipelines to analyze the infiltrating bacteria in colorectal tumors using the unmapped reads of whole genome sequences can be freely accessed from GitHub at https://github.com/gutmicrobes/UMIB.git.
Collapse
Affiliation(s)
- Man Guo
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China
| | - Er Xu
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China
| | - Dongmei Ai
- School of Mathematics and Physics, University of Science and Technology Beijing, Beijing, China
- Basic Experimental of Natural Science, University of Science and Technology Beijing, Beijing, China
| |
Collapse
|
20
|
Li Y, Niu M, Zou Q. ELM-MHC: An Improved MHC Identification Method with Extreme Learning Machine Algorithm. J Proteome Res 2019; 18:1392-1401. [DOI: 10.1021/acs.jproteome.9b00012] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Indexed: 01/28/2023]
Affiliation(s)
- Yanjuan Li
- School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
| | - Mengting Niu
- School of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China
| |
Collapse
|
21
|
Mirza B, Wang W, Wang J, Choi H, Chung NC, Ping P. Machine Learning and Integrative Analysis of Biomedical Big Data. Genes (Basel) 2019; 10:E87. [PMID: 30696086 PMCID: PMC6410075 DOI: 10.3390/genes10020087] [Citation(s) in RCA: 172] [Impact Index Per Article: 28.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 12/02/2018] [Revised: 01/08/2019] [Accepted: 01/21/2019] [Indexed: 12/11/2022] Open
Abstract
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
Collapse
Affiliation(s)
- Bilal Mirza
- NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
| | - Wei Wang
- NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Computer Science, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Scalable Analytics Institute (ScAi), University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA.
| | - Jie Wang
- NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
| | - Howard Choi
- NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA.
| | - Neo Christopher Chung
- NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Institute of Informatics, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Banacha 2, 02-097 Warsaw, Poland.
| | - Peipei Ping
- NIH BD2K Center of Excellence for Biomedical Computing, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Physiology, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Scalable Analytics Institute (ScAi), University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Bioinformatics, University of California Los Angeles, Los Angeles, CA 90095, USA.
- Department of Medicine (Cardiology), University of California Los Angeles, Los Angeles, CA 90095, USA.
| |
Collapse
|
22
|
Yan K, Fang X, Xu Y, Liu B. Protein fold recognition based on multi-view modeling. Bioinformatics 2019; 35:2982-2990. [DOI: 10.1093/bioinformatics/btz040] [Citation(s) in RCA: 51] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 07/06/2018] [Revised: 12/29/2018] [Accepted: 01/16/2019] [Indexed: 12/22/2022] Open
Abstract
Abstract
Motivation
Protein fold recognition has attracted increasing attention because it is critical for studies of the 3D structures of proteins and drug design. Researchers have been extensively studying this important task, and several features with high discriminative power have been proposed. However, the development of methods that efficiently combine these features to improve the predictive performance remains a challenging problem.
Results
In this study, we proposed two algorithms: MV-fold and MT-fold. MV-fold is a new computational predictor based on the multi-view learning model for fold recognition. Different features of proteins were treated as different views of proteins, including the evolutionary information, secondary structure information and physicochemical properties. These different views constituted the latent space. The ε-dragging technique was employed to enlarge the margins between different protein folds, improving the predictive performance of MV-fold. Then, MV-fold was combined with two template-based methods: HHblits and HMMER. The ensemble method is called MT-fold incorporating the advantages of both discriminative methods and template-based methods. Experimental results on five widely used benchmark datasets (DD, RDD, EDD, TG and LE) showed that the proposed methods outperformed some state-of-the-art methods in this field, indicating that MV-fold and MT-fold are useful computational tools for protein fold recognition and protein homology detection and would be efficient tools for protein sequence analysis. Finally, we constructed an update and rigorous benchmark dataset based on SCOPe (version 2.07) to fairly evaluate the performance of the proposed method, and our method achieved stable performance on this new dataset. This new benchmark dataset will become a widely used benchmark dataset to fairly evaluate the performance of different methods for fold recognition.
Supplementary information
Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Ke Yan
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Xiaozhao Fang
- School of Computer Science and Technology, Guangdong University of Technology, Guangzhou, China
| | - Yong Xu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
23
|
Jiang T, Chen B, Li J, Xu G. Indexing and Search of Order-Preserving Submatrix for Gene Expression Data. IEEE ACCESS 2019; 7:184769-184785. [DOI: 10.1109/access.2019.2960856] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Academic Contribution Register] [Indexed: 09/09/2024]
|
24
|
Banegas-Luna AJ, Imbernón B, Llanes Castro A, Pérez-Garrido A, Cerón-Carrasco JP, Gesing S, Merelli I, D'Agostino D, Pérez-Sánchez H. Advances in distributed computing with modern drug discovery. Expert Opin Drug Discov 2018; 14:9-22. [PMID: 30484337 DOI: 10.1080/17460441.2019.1552936] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Indexed: 12/17/2022]
Abstract
INTRODUCTION Computational chemistry dramatically accelerates the drug discovery process and high-performance computing (HPC) can be used to speed up the most expensive calculations. Supporting a local HPC infrastructure is both costly and time-consuming, and, therefore, many research groups are moving from in-house solutions to remote-distributed computing platforms. Areas covered: The authors focus on the use of distributed technologies, solutions, and infrastructures to gain access to HPC capabilities, software tools, and datasets to run the complex simulations required in computational drug discovery (CDD). Expert opinion: The use of computational tools can decrease the time to market of new drugs. HPC has a crucial role in handling the complex algorithms and large volumes of data required to achieve specificity and avoid undesirable side-effects. Distributed computing environments have clear advantages over in-house solutions in terms of cost and sustainability. The use of infrastructures relying on virtualization reduces set-up costs. Distributed computing resources can be difficult to access, although web-based solutions are becoming increasingly available. There is a trade-off between cost-effectiveness and accessibility in using on-demand computing resources rather than free/academic resources. Graphics processing unit computing, with its outstanding parallel computing power, is becoming increasingly important.
Collapse
Affiliation(s)
- Antonio Jesús Banegas-Luna
- a Bioinformatics and High Performance Computing Research Group (BIO-HPC) , Universidad Católica de Murcia (UCAM) , Murcia , Spain
| | - Baldomero Imbernón
- a Bioinformatics and High Performance Computing Research Group (BIO-HPC) , Universidad Católica de Murcia (UCAM) , Murcia , Spain
| | - Antonio Llanes Castro
- a Bioinformatics and High Performance Computing Research Group (BIO-HPC) , Universidad Católica de Murcia (UCAM) , Murcia , Spain
| | - Alfonso Pérez-Garrido
- a Bioinformatics and High Performance Computing Research Group (BIO-HPC) , Universidad Católica de Murcia (UCAM) , Murcia , Spain
| | - José Pedro Cerón-Carrasco
- a Bioinformatics and High Performance Computing Research Group (BIO-HPC) , Universidad Católica de Murcia (UCAM) , Murcia , Spain
| | - Sandra Gesing
- b Center for Research Computing , University of Notre Dame , Notre Dame , IN , USA
| | - Ivan Merelli
- c Institute for Biomedical Technologies , National Research Council of Italy , Segrate (Milan) , Italy
| | - Daniele D'Agostino
- d Institute for Applied Mathematics and Information Technologies "E. Magenes" , National Research Council of Italy , Genoa , Italy
| | - Horacio Pérez-Sánchez
- a Bioinformatics and High Performance Computing Research Group (BIO-HPC) , Universidad Católica de Murcia (UCAM) , Murcia , Spain
| |
Collapse
|
25
|
Zou Q, Lin G, Jiang X, Liu X, Zeng X. Sequence clustering in bioinformatics: an empirical study. Brief Bioinform 2018; 21:1-10. [PMID: 30239587 DOI: 10.1093/bib/bby090] [Citation(s) in RCA: 72] [Impact Index Per Article: 10.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 07/03/2018] [Revised: 08/18/2018] [Accepted: 08/18/2018] [Indexed: 12/13/2022] Open
Abstract
Sequence clustering is a basic bioinformatics task that is attracting renewed attention with the development of metagenomics and microbiomics. The latest sequencing techniques have decreased costs and as a result, massive amounts of DNA/RNA sequences are being produced. The challenge is to cluster the sequence data using stable, quick and accurate methods. For microbiome sequencing data, 16S ribosomal RNA operational taxonomic units are typically used. However, there is often a gap between algorithm developers and bioinformatics users. Different software tools can produce diverse results and users can find them difficult to analyze. Understanding the different clustering mechanisms is crucial to understanding the results that they produce. In this review, we selected several popular clustering tools, briefly explained the key computing principles, analyzed their characters and compared them using two independent benchmark datasets. Our aim is to assist bioinformatics users in employing suitable clustering tools effectively to analyze big sequencing data. Related data, codes and software tools were accessible at the link http://lab.malab.cn/∼lg/clustering/.
Collapse
Affiliation(s)
- Quan Zou
- Tianjin University.,University of Electronic Science and Technology of China
| | | | | | | | | |
Collapse
|
26
|
Peng S, Cheng M, Huang K, Cui Y, Zhang Z, Guo R, Zhang X, Yang S, Liao X, Lu Y, Zou Q, Shi B. Efficient computation of motif discovery on Intel Many Integrated Core (MIC) Architecture. BMC Bioinformatics 2018; 19:282. [PMID: 30367570 PMCID: PMC6101076 DOI: 10.1186/s12859-018-2276-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Novel sequence motifs detection is becoming increasingly essential in computational biology. However, the high computational cost greatly constrains the efficiency of most motif discovery algorithms. RESULTS In this paper, we accelerate MEME algorithm targeted on Intel Many Integrated Core (MIC) Architecture and present a parallel implementation of MEME called MIC-MEME base on hybrid CPU/MIC computing framework. Our method focuses on parallelizing the starting point searching method and improving iteration updating strategy of the algorithm. MIC-MEME has achieved significant speedups of 26.6 for ZOOPS model and 30.2 for OOPS model on average for the overall runtime when benchmarked on the experimental platform with two Xeon Phi 3120 coprocessors. CONCLUSIONS Furthermore, MIC-MEME has been compared with state-of-arts methods and it shows good scalability with respect to dataset size and the number of MICs. Source code: https://github.com/hkwkevin28/MIC-MEME .
Collapse
Affiliation(s)
- Shaoliang Peng
- College of Computer Science and Electronic Engineering & National Supercomputing Centre in Changsha, Hunan University, Changsha, 410082, China. .,School of Computer Science, National University of Defense Technology, Changsha, 410073, China.
| | - Minxia Cheng
- College of Computer Science and Electronic Engineering & National Supercomputing Centre in Changsha, Hunan University, Changsha, 410082, China
| | - Kaiwen Huang
- School of Computer Science, National University of Defense Technology, Changsha, 410073, China
| | - YingBo Cui
- School of Computer Science, National University of Defense Technology, Changsha, 410073, China
| | - Zhiqiang Zhang
- School of Computer Science, National University of Defense Technology, Changsha, 410073, China
| | - Runxin Guo
- School of Computer Science, National University of Defense Technology, Changsha, 410073, China
| | - Xiaoyu Zhang
- School of Computer Science, National University of Defense Technology, Changsha, 410073, China
| | - Shunyun Yang
- School of Computer Science, National University of Defense Technology, Changsha, 410073, China
| | - Xiangke Liao
- School of Computer Science, National University of Defense Technology, Changsha, 410073, China
| | - Yutong Lu
- National Supercomputer Center in Guangzhou, Guangzhou, 510275, China
| | - Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, 300350, China.
| | - Benyun Shi
- School of Cyberspace, Hangzhou Dianzi University, Hangzhou, 310018, China.
| |
Collapse
|
27
|
Guo R, Zhao Y, Zou Q, Fang X, Peng S. Bioinformatics applications on Apache Spark. Gigascience 2018; 7:5067872. [PMID: 30101283 PMCID: PMC6113509 DOI: 10.1093/gigascience/giy098] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 04/11/2018] [Accepted: 07/28/2018] [Indexed: 11/13/2022] Open
Abstract
With the rapid development of next-generation sequencing technology, ever-increasing quantities of genomic data pose a tremendous challenge to data processing. Therefore, there is an urgent need for highly scalable and powerful computational systems. Among the state-of–the-art parallel computing platforms, Apache Spark is a fast, general-purpose, in-memory, iterative computing framework for large-scale data processing that ensures high fault tolerance and high scalability by introducing the resilient distributed dataset abstraction. In terms of performance, Spark can be up to 100 times faster in terms of memory access and 10 times faster in terms of disk access than Hadoop. Moreover, it provides advanced application programming interfaces in Java, Scala, Python, and R. It also supports some advanced components, including Spark SQL for structured data processing, MLlib for machine learning, GraphX for computing graphs, and Spark Streaming for stream computing. We surveyed Spark-based applications used in next-generation sequencing and other biological domains, such as epigenetics, phylogeny, and drug discovery. The results of this survey are used to provide a comprehensive guideline allowing bioinformatics researchers to apply Spark in their own fields.
Collapse
Affiliation(s)
- Runxin Guo
- College of Computer, National University of Defense Technology, No.109, Deya Road, Kaifu District, Changsha, 410073, China
| | - Yi Zhao
- Institute of Computing Technology, Chinese Academy of Sciences, No.6, South Road of the Academy of Sciences, Haidian District, Beijing, 100190, China
| | - Quan Zou
- School of Computer Science and Technology, No.135, Yaguan Road, Jinnan District, Tianjin University, Tianjin, 300050, China
| | - Xiaodong Fang
- BGI Genomics, BGI-Shenzhen, No.21, Mingzhu Road, Yantian District, Shenzhen, 518083, China
| | - Shaoliang Peng
- College of Computer, National University of Defense Technology, No.109, Deya Road, Kaifu District, Changsha, 410073, China.,College of Computer Science and Electronic Engineering & National Supercomputer Centre in Changsha, Hunan University, No.252, Shannan Road, Yuelu District, Changsha, 410082, China
| |
Collapse
|
28
|
Abstract
Nowadays, the analysis of transcriptome sequencing (RNA-seq) data has become the standard method for quantifying the levels of gene expression. In RNA-seq experiments, the mapping of short reads to a reference genome or transcriptome is considered a crucial step that remains as one of the most time-consuming. With the steady development of Next Generation Sequencing (NGS) technologies, unprecedented amounts of genomic data introduce significant challenges in terms of storage, processing and downstream analysis. As cost and throughput continue to improve, there is a growing need for new software solutions that minimize the impact of increasing data volume on RNA read alignment. In this work we introduce HSRA, a Big Data tool that takes advantage of the MapReduce programming model to extend the multithreading capabilities of a state-of-the-art spliced read aligner for RNA-seq data (HISAT2) to distributed memory systems such as multi-core clusters or cloud platforms. HSRA has been built upon the Hadoop MapReduce framework and supports both single- and paired-end reads from FASTQ/FASTA datasets, providing output alignments in SAM format. The design of HSRA has been carefully optimized to avoid the main limitations and major causes of inefficiency found in previous Big Data mapping tools, which cannot fully exploit the raw performance of the underlying aligner. On a 16-node multi-core cluster, HSRA is on average 2.3 times faster than previous Hadoop-based tools. Source code in Java as well as a user’s guide are publicly available for download at http://hsra.dec.udc.es.
Collapse
|
29
|
Mishra A, Pokhrel P, Hoque MT. StackDPPred: a stacking based prediction of DNA-binding protein from sequence. Bioinformatics 2018; 35:433-441. [DOI: 10.1093/bioinformatics/bty653] [Citation(s) in RCA: 64] [Impact Index Per Article: 9.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 03/27/2018] [Accepted: 07/18/2018] [Indexed: 12/12/2022] Open
Affiliation(s)
- Avdesh Mishra
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
| | - Pujan Pokhrel
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
| | - Md Tamjidul Hoque
- Department of Computer Science, University of New Orleans, New Orleans, LA, USA
| |
Collapse
|
30
|
Wang J, Guo M. A review of metrics measuring dissimilarity for rooted phylogenetic networks. Brief Bioinform 2018; 20:1972-1980. [DOI: 10.1093/bib/bby062] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 05/22/2018] [Revised: 06/20/2018] [Indexed: 11/14/2022] Open
Abstract
Abstract
A rooted phylogenetic network is an important structure in the description of evolutionary relationships. Computing the distance (topological dissimilarity) between two rooted phylogenetic networks is a fundamental in phylogenic analysis. During the past few decades, several polynomial-time computable metrics have been described. Here, we give a comprehensive review and analysis on those metrics, including the correlation among metrics and the distribution of distance values computed by each metric. Moreover, we describe the software and website, CDRPN (Computing Distance for Rooted Phylogenetic Networks), for measuring the topological dissimilarity between rooted phylogenetic networks.
Availability
http://bioinformatics.imu.edu.cn/distance/
Contact
guomaozu@bucea.edu.cn
Collapse
Affiliation(s)
- Juan Wang
- School of Computer Science, Inner Mongolia University, Hohhot, Inner Mongolia 010021, P.R. China
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, P.R. ChinaBeijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing 100044, P.R. China
| |
Collapse
|
31
|
High-throughput and scalable protein function identification with Hadoop and Map-only pattern of the MapReduce processing model. Knowl Inf Syst 2018. [DOI: 10.1007/s10115-018-1245-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Indexed: 10/28/2022]
|
32
|
Lladós J, Cores F, Guirado F. Scalable Consistency in T-Coffee Through Apache Spark and Cassandra Database. J Comput Biol 2018; 25:894-906. [PMID: 30004242 DOI: 10.1089/cmb.2018.0084] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Indexed: 11/12/2022] Open
Abstract
Next-generation sequencing, also known as high-throughput sequencing, has increased the volume of genetic data processed by sequencers. In the bioinformatic scientific area, highly rated multiple sequence alignment tools, such as MAFFT, ProbCons, and T-Coffee (TC), use the probabilistic consistency as a prior step to the progressive alignment stage to improve the final accuracy. However, such methods are severely limited by the memory required to store the consistency information. Big data processing and persistence techniques are used to manage and store the huge amount of information that is generated. Although these techniques have significant advantages, few biological applications have adopted them. In this article, a novel approach named big data tree-based consistency objective function for alignment evaluation (BDT-Coffee) is presented. BDT-Coffee is based on the integration of consistency information through Cassandra database in TC, previously generated by the MapReduce processing paradigm, to enable large data sets to be processed with the aim of improving the performance and scalability of the original algorithm.
Collapse
Affiliation(s)
- Jordi Lladós
- INSPIRES Research Center, Universitat de Lleida , Lleida, Spain
| | - Fernando Cores
- INSPIRES Research Center, Universitat de Lleida , Lleida, Spain
| | | |
Collapse
|
33
|
Song H, Chen G, Yang W. An Image Classification Algorithm and its Parallel Implementation Based on ANL-RBM. JOURNAL OF INFORMATION TECHNOLOGY RESEARCH 2018. [DOI: 10.4018/jitr.2018070103] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Indexed: 11/09/2022]
Abstract
This article describes how when using Restricted Boltzmann Machine (RBM) algorithm to design the image classification network. The node number in each hidden layer, and the layer number of the entire network are designed by experiments, it increases the complexity for the RBM design. In order to solve the problem, this article proposes an image classification algorithm based on ANL-RBM (Adaptive Nodes and Layers Restricted Boltzmann Machine). The algorithm can automatically calculate the node number in each hidden layer and the layer number of the entire network. It can reduce the complexity for the RBM design. In the meantime, this article has designed the parallel model of the algorithm in the Hadoop platform. The experimental results showed that the image classification algorithm based on an ANL-RBM has a higher execution efficiency, better speedup, better scalability and it is suitable for massive amounts of image data processing.
Collapse
Affiliation(s)
| | | | - Weiwei Yang
- Harbin Engineering University, Harbin, China
| |
Collapse
|
34
|
Pan Y, Gao H, Lin H, Liu Z, Tang L, Li S. Identification of Bacteriophage Virion Proteins Using Multinomial Naïve Bayes with g-Gap Feature Tree. Int J Mol Sci 2018; 19:E1779. [PMID: 29914091 PMCID: PMC6032154 DOI: 10.3390/ijms19061779] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 05/24/2018] [Revised: 06/12/2018] [Accepted: 06/12/2018] [Indexed: 01/29/2023] Open
Abstract
Bacteriophages, which are tremendously important to the ecology and evolution of bacteria, play a key role in the development of genetic engineering. Bacteriophage virion proteins are essential materials of the infectious viral particles and in charge of several of biological functions. The correct identification of bacteriophage virion proteins is of great importance for understanding both life at the molecular level and genetic evolution. However, few computational methods are available for identifying bacteriophage virion proteins. In this paper, we proposed a new method to predict bacteriophage virion proteins using a Multinomial Naïve Bayes classification model based on discrete feature generated from the g-gap feature tree. The accuracy of the proposed model reaches 98.37% with MCC of 96.27% in 10-fold cross-validation. This result suggests that the proposed method can be a useful approach in identifying bacteriophage virion proteins from sequence information. For the convenience of experimental scientists, a web server (PhagePred) that implements the proposed predictor is available, which can be freely accessed on the Internet.
Collapse
Affiliation(s)
- Yanyuan Pan
- School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Hui Gao
- School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Hao Lin
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Zhen Liu
- School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Lixia Tang
- Key Laboratory for Neuro-Information of Ministry of Education, School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| | - Songtao Li
- School of Computer Science and Engineering, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China.
| |
Collapse
|
35
|
Pashazadeh A, Navimipour NJ. Big data handling mechanisms in the healthcare applications: A comprehensive and systematic literature review. J Biomed Inform 2018; 82:47-62. [PMID: 29655946 DOI: 10.1016/j.jbi.2018.03.014] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 06/22/2017] [Revised: 11/19/2017] [Accepted: 03/23/2018] [Indexed: 01/08/2023]
Abstract
Healthcare provides many services such as diagnosing, treatment, prevention of diseases, illnesses, injuries, and other physical and mental disorders. Large-scale distributed data processing applications in healthcare as a basic concept operates on large amounts of data. Therefore, big data application functions are the main part of healthcare operations, but there was not any comprehensive and systematic survey about studying and evaluating the important techniques in this field. Therefore, this paper aims at providing the comprehensive, detailed, and systematic study of the state-of-the-art mechanisms in the big data related to healthcare applications in five categories, including machine learning, cloud-based, heuristic-based, agent-based, and hybrid mechanisms. Also, this paper displayed a systematic literature review (SLR) of the big data applications in the healthcare literature up to the end of 2016. Initially, 205 papers were identified, but a paper selection process reduced the number of papers to 29 important studies.
Collapse
Affiliation(s)
- Asma Pashazadeh
- Department of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran
| | - Nima Jafari Navimipour
- Department of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran.
| |
Collapse
|
36
|
Abstract
BACKGROUND Building the evolutionary trees for massive unaligned DNA sequences is challenging and crucial. However, reconstructing evolutionary tree for ultra-large sequences is hard. Massive multiple sequence alignment is also challenging and time/space consuming. Hadoop and Spark are developed recently, which bring spring light for the classical computational biology problems. In this paper, we tried to solve the multiple sequence alignment and evolutionary reconstruction in parallel. RESULTS HPTree, which is developed in this paper, can deal with big DNA sequence files quickly. It works well on the >1GB files, and gets better performance than other evolutionary reconstruction tools. Users could use HPTree for reonstructing evolutioanry trees on the computer clusters or cloud platform (eg. Amazon Cloud). HPTree could help on population evolution research and metagenomics analysis. CONCLUSIONS In this paper, we employ the Hadoop and Spark platform and design an evolutionary tree reconstruction software tool for unaligned massive DNA sequences. Clustering and multiple sequence alignment are done in parallel. Neighbour-joining model was employed for the evolutionary tree building. We opened our software together with source codes via http://lab.malab.cn/soft/HPtree/ .
Collapse
Affiliation(s)
- Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, People's Republic of China
- Guangdong Province Key Laboratory of Popular High Performance Computers, Shenzhen University, Shenzhen, China
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
| | - Shixiang Wan
- School of Computer Science and Technology, Tianjin University, Tianjin, People's Republic of China
| | - Xiangxiang Zeng
- Department of Computer Science, Xiamen University, Xiamen, China.
| | - Zhanshan Sam Ma
- State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China.
| |
Collapse
|
37
|
Yang X, Wu C, Lu K, Fang L, Zhang Y, Li S, Guo G, Du Y. An Interface for Biomedical Big Data Processing on the Tianhe-2 Supercomputer. Molecules 2017; 22:molecules22122116. [PMID: 29194413 PMCID: PMC6149962 DOI: 10.3390/molecules22122116] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 10/25/2017] [Accepted: 11/29/2017] [Indexed: 11/16/2022] Open
Abstract
Big data, cloud computing, and high-performance computing (HPC) are at the verge of convergence. Cloud computing is already playing an active part in big data processing with the help of big data frameworks like Hadoop and Spark. The recent upsurge of high-performance computing in China provides extra possibilities and capacity to address the challenges associated with big data. In this paper, we propose Orion—a big data interface on the Tianhe-2 supercomputer—to enable big data applications to run on Tianhe-2 via a single command or a shell script. Orion supports multiple users, and each user can launch multiple tasks. It minimizes the effort needed to initiate big data applications on the Tianhe-2 supercomputer via automated configuration. Orion follows the “allocate-when-needed” paradigm, and it avoids the idle occupation of computational resources. We tested the utility and performance of Orion using a big genomic dataset and achieved a satisfactory performance on Tianhe-2 with very few modifications to existing applications that were implemented in Hadoop/Spark. In summary, Orion provides a practical and economical interface for big data processing on Tianhe-2.
Collapse
Affiliation(s)
- Xi Yang
- College of Computer, National University of Defense Technology, Changsha 410073, China.
| | - Chengkun Wu
- College of Computer, National University of Defense Technology, Changsha 410073, China.
| | - Kai Lu
- College of Computer, National University of Defense Technology, Changsha 410073, China.
| | - Lin Fang
- National Supercomputing Center of Guangzhou, Guangzhou 510006, China.
| | - Yong Zhang
- National Supercomputing Center of Guangzhou, Guangzhou 510006, China.
| | - Shengkang Li
- National Supercomputing Center of Guangzhou, Guangzhou 510006, China.
| | - Guixin Guo
- Beijing Genomics Institute (BGI) Shenzhen, Shenzhen 518083, China.
| | - YunFei Du
- School of Data and Computer Science, Sun Yat-Sen University, Guangzhou 510000, China.
| |
Collapse
|
38
|
Abstract
Background Methylation is a common modification of DNA. It has been a very important and hot topic to study the correlation between methylation and diseases in medical science. Because of the special process with bisulfite treatment, traditional mapping tools do not work well with such methylation experimental reads. Traditional aligners are not designed for mapping bisulfite-treated reads, where the un-methylated ‘C’s are converted to ‘T’s. Results In this paper, we develop a reliable and visual tool, named VAliBS, for mapping bisulfate sequences to a genome reference. VAliBS works well even on large scale data or high noise data. By comparing with other state-of-the-art tools (BisMark, BSMAP, BS-Seeker2), VAliBS can improve the accuracy of bisulfite mapping. Moreover, VAliBS is a visual tool which makes its operations more easily and the alignment results are shown with colored marks which makes it easier to be read. VAliBS provides fast and accurate mapping of bisulfite-converted reads, and a friendly window system to visualize the detail of mapping of each read. Conclusions VAliBS works well on both simulated data and real data. It can be useful in DNA methylation research. VALiBS implements an X-Window user interface where the methylation positions are visual and the operations are friendly.
Collapse
Affiliation(s)
- Min Li
- School of Information Science and Engineering, Central South University, Changsha, 410083, China
| | - Ping Huang
- School of Information Science and Engineering, Central South University, Changsha, 410083, China
| | - Xiaodong Yan
- School of Information Science and Engineering, Central South University, Changsha, 410083, China
| | - Jianxin Wang
- School of Information Science and Engineering, Central South University, Changsha, 410083, China.
| | - Yi Pan
- School of Information Science and Engineering, Central South University, Changsha, 410083, China. .,Department of Computer Science, Georgia State University, Atlanta, GA, 30302-4110, USA.
| | - Fang-Xiang Wu
- School of Information Science and Engineering, Central South University, Changsha, 410083, China.,Division of Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, SK, S7N 5A9, Canada
| |
Collapse
|
39
|
A parallel approximate string matching under Levenshtein distance on graphics processing units using warp-shuffle operations. PLoS One 2017; 12:e0186251. [PMID: 29016700 PMCID: PMC5634649 DOI: 10.1371/journal.pone.0186251] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 07/03/2017] [Accepted: 09/27/2017] [Indexed: 12/17/2022] Open
Abstract
Approximate string matching with k-differences has a number of practical applications, ranging from pattern recognition to computational biology. This paper proposes an efficient memory-access algorithm for parallel approximate string matching with k-differences on Graphics Processing Units (GPUs). In the proposed algorithm, all threads in the same GPUs warp share data using warp-shuffle operation instead of accessing the shared memory. Moreover, we implement the proposed algorithm by exploiting the memory structure of GPUs to optimize its performance. Experiment results for real DNA packages revealed that the performance of the proposed algorithm and its implementation archived up to 122.64 and 1.53 times compared to that of sequential algorithm on CPU and previous parallel approximate string matching algorithm on GPUs, respectively.
Collapse
|
40
|
Abstract
Background The Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) represents one of the most effective tools to reconstruct gene regulatory networks from large-scale molecular profile datasets. However, previous implementations require intensive computing resources and, in some cases, restrict the number of samples that can be used. These issues can be addressed elegantly in a GPU computing framework, where repeated mathematical computation can be done efficiently, but requires extensive redesign to apply parallel computing techniques to the original serial algorithm, involving detailed optimization efforts based on a deep understanding of both hardware and software architecture. Result Here, we present an accelerated parallel implementation of ARACNE (GPU-ARACNE). By taking advantage of multi-level parallelism and the Compute Unified Device Architecture (CUDA) parallel kernel-call library, GPU-ARACNE successfully parallelizes a serial algorithm and simplifies the user experience from multi-step operations to one step. Using public datasets on comparable hardware configurations, we showed that GPU-ARACNE is faster than previous implementations and is able to reconstruct equally valid gene regulatory networks. Conclusion Given that previous versions of ARACNE are extremely resource demanding, either in computational time or in hardware investment, GPU-ARACNE is remarkably valuable for researchers who need to build complex regulatory networks from large expression datasets, but with limited budget on computational resources. In addition, our GPU-centered optimization of adaptive partitioning for Mutual Information (MI) estimation provides lessons that are applicable to other domains. Electronic supplementary material The online version of this article (doi:10.1186/s12918-017-0458-5) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Jing He
- Department of Biomedical Informatics, Columbia University, 168th Street, New York, 10032, NY, USA.,Department of Systems Biology, 1130 St Nicholas Street, New York, 10032, NY, USA
| | - Zhou Zhou
- Department of Computer Science, New York, 10027, NY, USA
| | - Michael Reed
- Department of Computer Science, New York, 10027, NY, USA
| | - Andrea Califano
- Department of Systems Biology, 1130 St Nicholas Street, New York, 10032, NY, USA.
| |
Collapse
|
41
|
Wang YB, You ZH, Li LP, Huang YA, Yi HC. Detection of Interactions between Proteins by Using Legendre Moments Descriptor to Extract Discriminatory Information Embedded in PSSM. Molecules 2017; 22:molecules22081366. [PMID: 28820478 PMCID: PMC6152086 DOI: 10.3390/molecules22081366] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 07/24/2017] [Accepted: 08/15/2017] [Indexed: 11/16/2022] Open
Abstract
Protein-protein interactions (PPIs) play a very large part in most cellular processes. Although a great deal of research has been devoted to detecting PPIs through high-throughput technologies, these methods are clearly expensive and cumbersome. Compared with the traditional experimental methods, computational methods have attracted much attention because of their good performance in detecting PPIs. In our work, a novel computational method named as PCVM-LM is proposed which combines the probabilistic classification vector machine (PCVM) model and Legendre moments (LMs) to predict PPIs from amino acid sequences. The improvement mainly comes from using the LMs to extract discriminatory information embedded in the position-specific scoring matrix (PSSM) combined with the PCVM classifier to implement prediction. The proposed method was evaluated on Yeast and Helicobacter pylori datasets with five-fold cross-validation experiments. The experimental results show that the proposed method achieves high average accuracies of 96.37% and 93.48%, respectively, which are much better than other well-known methods. To further evaluate the proposed method, we also compared the proposed method with the state-of-the-art support vector machine (SVM) classifier and other existing methods on the same datasets. The comparison results clearly show that our method is better than the SVM-based method and other existing methods. The promising experimental results show the reliability and effectiveness of the proposed method, which can be a useful decision support tool for protein research.
Collapse
Affiliation(s)
- Yan-Bin Wang
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
- University of Chinese Academy of Sciences, Beijing 100049, China.
| | - Zhu-Hong You
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| | - Li-Ping Li
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| | - Yu-An Huang
- Department of Computing, Hong Kong Polytechnic University, Hong Kong, China.
| | - Hai-Cheng Yi
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| |
Collapse
|
42
|
PPCAS: Implementation of a Probabilistic Pairwise Model for Consistency-Based Multiple Alignment in Apache Spark. ACTA ACUST UNITED AC 2017. [DOI: 10.1007/978-3-319-65482-9_45] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Indexed: 04/13/2023]
|
43
|
Nascimento M, Silva FFE, Sáfadi T, Nascimento ACC, Ferreira TEM, Barroso LMA, Ferreira Azevedo C, Guimarães SEF, Serão NVL. Independent Component Analysis (ICA) based-clustering of temporal RNA-seq data. PLoS One 2017; 12:e0181195. [PMID: 28715507 PMCID: PMC5513449 DOI: 10.1371/journal.pone.0181195] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 01/31/2017] [Accepted: 06/27/2017] [Indexed: 11/19/2022] Open
Abstract
Gene expression time series (GETS) analysis aims to characterize sets of genes according to their longitudinal patterns of expression. Due to the large number of genes evaluated in GETS analysis, an useful strategy to summarize biological functional processes and regulatory mechanisms is through clustering of genes that present similar expression pattern over time. Traditional cluster methods usually ignore the challenges in GETS, such as the lack of data normality and small number of temporal observations. Independent Component Analysis (ICA) is a statistical procedure that uses a transformation to convert raw time series data into sets of values of independent variables, which can be used for cluster analysis to identify sets of genes with similar temporal expression patterns. ICA allows clustering small series of distribution-free data while accounting for the dependence between subsequent time-points. Using temporal simulated and real (four libraries of two pig breeds at 21, 40, 70 and 90 days of gestation) RNA-seq data set we present a methodology (ICAclust) that jointly considers independent components analysis (ICA) and a hierarchical method for clustering GETS. We compare ICAclust results with those obtained for K-means clustering. ICAclust presented, on average, an absolute gain of 5.15% over the best K-means scenario. Considering the worst scenario for K-means, the gain was of 84.85%, when compared with the best ICAclust result. For the real data set, genes were grouped into six distinct clusters with 89, 51, 153, 67, 40, and 58 genes each, respectively. In general, it can be observed that the 6 clusters presented very distinct expression patterns. Overall, the proposed two-step clustering method (ICAclust) performed well compared to K-means, a traditional method used for cluster analysis of temporal gene expression data. In ICAclust, genes with similar expression pattern over time were clustered together.
Collapse
Affiliation(s)
- Moysés Nascimento
- Department of Statistics, Federal University of Viçosa, Viçosa, Minas Gerais, Brazil
| | | | - Thelma Sáfadi
- Department of Exact Sciences, Federal University of Lavras, Lavras, Minas Gerais, Brazil
| | | | | | | | | | | | | |
Collapse
|
44
|
Huang T, Sun J, Zhou S, Gao J, Liu Y. Identification of Direct Activator of Adenosine Monophosphate-Activated Protein Kinase (AMPK) by Structure-Based Virtual Screening and Molecular Docking Approach. Int J Mol Sci 2017; 18:ijms18071408. [PMID: 28665353 PMCID: PMC5535900 DOI: 10.3390/ijms18071408] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 05/11/2017] [Revised: 06/24/2017] [Accepted: 06/27/2017] [Indexed: 12/25/2022] Open
Abstract
Adenosine monophosphate-activated protein kinase (AMPK) plays a critical role in the regulation of energy metabolism and has been targeted for drug development of therapeutic intervention in Type II diabetes and related diseases. Recently, there has been renewed interest in the development of direct β1-selective AMPK activators to treat patients with diabetic nephropathy. To investigate the details of AMPK domain structure, sequence alignment and structural comparison were used to identify the key amino acids involved in the interaction with activators and the structure difference between β1 and β2 subunits. Additionally, a series of potential β1-selective AMPK activators were identified by virtual screening using molecular docking. The retrieved hits were filtered on the basis of Lipinski’s rule of five and drug-likeness. Finally, 12 novel compounds with diverse scaffolds were obtained as potential starting points for the design of direct β1-selective AMPK activators.
Collapse
Affiliation(s)
- Tonghui Huang
- Jiangsu Key Laboratory of New Drug Research and Clinical Pharmacy, School of Pharmacy, Xuzhou Medical University, Xuzhou 221004, China.
| | - Jie Sun
- Jiangsu Key Laboratory of New Drug Research and Clinical Pharmacy, School of Pharmacy, Xuzhou Medical University, Xuzhou 221004, China.
| | - Shanshan Zhou
- Jiangsu Key Laboratory of New Drug Research and Clinical Pharmacy, School of Pharmacy, Xuzhou Medical University, Xuzhou 221004, China.
| | - Jian Gao
- Jiangsu Key Laboratory of New Drug Research and Clinical Pharmacy, School of Pharmacy, Xuzhou Medical University, Xuzhou 221004, China.
| | - Yi Liu
- Jiangsu Key Laboratory of New Drug Research and Clinical Pharmacy, School of Pharmacy, Xuzhou Medical University, Xuzhou 221004, China.
| |
Collapse
|
45
|
Elitist Binary Wolf Search Algorithm for Heuristic Feature Selection in High-Dimensional Bioinformatics Datasets. Sci Rep 2017; 7:4354. [PMID: 28659577 PMCID: PMC5489518 DOI: 10.1038/s41598-017-04037-5] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 11/21/2016] [Accepted: 05/09/2017] [Indexed: 12/02/2022] Open
Abstract
Due to the high-dimensional characteristics of dataset, we propose a new method based on the Wolf Search Algorithm (WSA) for optimising the feature selection problem. The proposed approach uses the natural strategy established by Charles Darwin; that is, ‘It is not the strongest of the species that survives, but the most adaptable’. This means that in the evolution of a swarm, the elitists are motivated to quickly obtain more and better resources. The memory function helps the proposed method to avoid repeat searches for the worst position in order to enhance the effectiveness of the search, while the binary strategy simplifies the feature selection problem into a similar problem of function optimisation. Furthermore, the wrapper strategy gathers these strengthened wolves with the classifier of extreme learning machine to find a sub-dataset with a reasonable number of features that offers the maximum correctness of global classification models. The experimental results from the six public high-dimensional bioinformatics datasets tested demonstrate that the proposed method can best some of the conventional feature selection methods up to 29% in classification accuracy, and outperform previous WSAs by up to 99.81% in computational time.
Collapse
|
46
|
Bhattacharya A, Cui Y. A GPU-accelerated algorithm for biclustering analysis and detection of condition-dependent coexpression network modules. Sci Rep 2017. [PMID: 28646174 PMCID: PMC5482832 DOI: 10.1038/s41598-017-04070-4] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Indexed: 02/02/2023] Open
Abstract
In the analysis of large-scale gene expression data, it is important to identify groups of genes with common expression patterns under certain conditions. Many biclustering algorithms have been developed to address this problem. However, comprehensive discovery of functionally coherent biclusters from large datasets remains a challenging problem. Here we propose a GPU-accelerated biclustering algorithm, based on searching for the largest Condition-dependent Correlation Subgroups (CCS) for each gene in the gene expression dataset. We compared CCS with thirteen widely used biclustering algorithms. CCS consistently outperformed all the thirteen biclustering algorithms on both synthetic and real gene expression datasets. As a correlation-based biclustering method, CCS can also be used to find condition-dependent coexpression network modules. We implemented the CCS algorithm using C and implemented the parallelized CCS algorithm using CUDA C for GPU computing. The source code of CCS is available from https://github.com/abhatta3/Condition-dependent-Correlation-Subgroups-CCS.
Collapse
Affiliation(s)
- Anindya Bhattacharya
- Department of Microbiology, Immunology and Biochemistry, Memphis, TN, 38163, USA. .,Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA. .,Department of Computer Science and Engineering, University of California, San Diego, CA, 92093, USA.
| | - Yan Cui
- Department of Microbiology, Immunology and Biochemistry, Memphis, TN, 38163, USA. .,Center for Integrative and Translational Genomics, University of Tennessee Health Science Center, Memphis, TN, 38163, USA.
| |
Collapse
|
47
|
Wang J, Guo M. A Metric on the Space of kth-order reduced Phylogenetic Networks. Sci Rep 2017; 7:3189. [PMID: 28600511 DOI: 10.1038/s41598-017-03363-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 08/05/2016] [Accepted: 04/27/2017] [Indexed: 11/09/2022] Open
Abstract
Phylogenetic networks can be used to describe the evolutionary history of species which experience a certain number of reticulate events, and represent conflicts in phylogenetic trees that may be due to inadequacies of the evolutionary model used in the construction of the trees. Measuring the dissimilarity between two phylogenetic networks is at the heart of our understanding of the evolutionary history of species. This paper proposes a new metric, i.e. kth-distance, for the space of kth-order reduced phylogenetic networks that can be calculated in polynomial time in the size of the compared networks.
Collapse
Affiliation(s)
- Juan Wang
- School of Computer Science, Inner Mongolia University, Hohhot, 010021, P.R. China
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, 100044, P.R. China.
| |
Collapse
|
48
|
Wang Y, You Z, Li X, Chen X, Jiang T, Zhang J. PCVMZM: Using the Probabilistic Classification Vector Machines Model Combined with a Zernike Moments Descriptor to Predict Protein-Protein Interactions from Protein Sequences. Int J Mol Sci 2017; 18:ijms18051029. [PMID: 28492483 PMCID: PMC5454941 DOI: 10.3390/ijms18051029] [Citation(s) in RCA: 46] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 03/24/2017] [Revised: 04/24/2017] [Accepted: 04/29/2017] [Indexed: 01/08/2023] Open
Abstract
Protein–protein interactions (PPIs) are essential for most living organisms’ process. Thus, detecting PPIs is extremely important to understand the molecular mechanisms of biological systems. Although many PPIs data have been generated by high-throughput technologies for a variety of organisms, the whole interatom is still far from complete. In addition, the high-throughput technologies for detecting PPIs has some unavoidable defects, including time consumption, high cost, and high error rate. In recent years, with the development of machine learning, computational methods have been broadly used to predict PPIs, and can achieve good prediction rate. In this paper, we present here PCVMZM, a computational method based on a Probabilistic Classification Vector Machines (PCVM) model and Zernike moments (ZM) descriptor for predicting the PPIs from protein amino acids sequences. Specifically, a Zernike moments (ZM) descriptor is used to extract protein evolutionary information from Position-Specific Scoring Matrix (PSSM) generated by Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST). Then, PCVM classifier is used to infer the interactions among protein. When performed on PPIs datasets of Yeast and H. Pylori, the proposed method can achieve the average prediction accuracy of 94.48% and 91.25%, respectively. In order to further evaluate the performance of the proposed method, the state-of-the-art support vector machines (SVM) classifier is used and compares with the PCVM model. Experimental results on the Yeast dataset show that the performance of PCVM classifier is better than that of SVM classifier. The experimental results indicate that our proposed method is robust, powerful and feasible, which can be used as a helpful tool for proteomics research.
Collapse
Affiliation(s)
- Yanbin Wang
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| | - Zhuhong You
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| | - Xiao Li
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| | - Xing Chen
- School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China.
| | - Tonghai Jiang
- Xinjiang Technical Institutes of Physics and Chemistry, Chinese Academy of Science, Urumqi 830011, China.
| | - Jingting Zhang
- Department of Mathematics and Statistics, Henan University, Kaifeng 100190, China.
| |
Collapse
|
49
|
Decap D, Reumers J, Herzeel C, Costanza P, Fostier J. Halvade-RNA: Parallel variant calling from transcriptomic data using MapReduce. PLoS One 2017; 12:e0174575. [PMID: 28358893 PMCID: PMC5373595 DOI: 10.1371/journal.pone.0174575] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Academic Contribution Register] [Received: 09/12/2016] [Accepted: 03/10/2017] [Indexed: 12/30/2022] Open
Abstract
Given the current cost-effectiveness of next-generation sequencing, the amount of DNA-seq and RNA-seq data generated is ever increasing. One of the primary objectives of NGS experiments is calling genetic variants. While highly accurate, most variant calling pipelines are not optimized to run efficiently on large data sets. However, as variant calling in genomic data has become common practice, several methods have been proposed to reduce runtime for DNA-seq analysis through the use of parallel computing. Determining the effectively expressed variants from transcriptomics (RNA-seq) data has only recently become possible, and as such does not yet benefit from efficiently parallelized workflows. We introduce Halvade-RNA, a parallel, multi-node RNA-seq variant calling pipeline based on the GATK Best Practices recommendations. Halvade-RNA makes use of the MapReduce programming model to create and manage parallel data streams on which multiple instances of existing tools such as STAR and GATK operate concurrently. Whereas the single-threaded processing of a typical RNA-seq sample requires ∼28h, Halvade-RNA reduces this runtime to ∼2h using a small cluster with two 20-core machines. Even on a single, multi-core workstation, Halvade-RNA can significantly reduce runtime compared to using multi-threading, thus providing for a more cost-effective processing of RNA-seq data. Halvade-RNA is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR.
Collapse
Affiliation(s)
- Dries Decap
- Department of Information Technology, IDLab, Ghent University - imec, Ghent, Belgium
- ExaScience Life Lab, Leuven, Belgium
| | - Joke Reumers
- Janssen Research & Development, a division of Janssen Pharmaceutica N.V., Beerse, Belgium
- ExaScience Life Lab, Leuven, Belgium
| | | | - Pascal Costanza
- Intel Corporation Belgium, Leuven, Belgium
- ExaScience Life Lab, Leuven, Belgium
| | - Jan Fostier
- Department of Information Technology, IDLab, Ghent University - imec, Ghent, Belgium
- ExaScience Life Lab, Leuven, Belgium
- * E-mail:
| |
Collapse
|
50
|
MapReduce Algorithms for Inferring Gene Regulatory Networks from Time-Series Microarray Data Using an Information-Theoretic Approach. BIOMED RESEARCH INTERNATIONAL 2017; 2017:6261802. [PMID: 28243601 PMCID: PMC5294223 DOI: 10.1155/2017/6261802] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Academic Contribution Register] [Received: 06/07/2016] [Revised: 11/14/2016] [Accepted: 12/13/2016] [Indexed: 12/15/2022]
Abstract
Gene regulation is a series of processes that control gene expression and its extent. The connections among genes and their regulatory molecules, usually transcription factors, and a descriptive model of such connections are known as gene regulatory networks (GRNs). Elucidating GRNs is crucial to understand the inner workings of the cell and the complexity of gene interactions. To date, numerous algorithms have been developed to infer gene regulatory networks. However, as the number of identified genes increases and the complexity of their interactions is uncovered, networks and their regulatory mechanisms become cumbersome to test. Furthermore, prodding through experimental results requires an enormous amount of computation, resulting in slow data processing. Therefore, new approaches are needed to expeditiously analyze copious amounts of experimental data resulting from cellular GRNs. To meet this need, cloud computing is promising as reported in the literature. Here, we propose new MapReduce algorithms for inferring gene regulatory networks on a Hadoop cluster in a cloud environment. These algorithms employ an information-theoretic approach to infer GRNs using time-series microarray data. Experimental results show that our MapReduce program is much faster than an existing tool while achieving slightly better prediction accuracy than the existing tool.
Collapse
|