1
Taufer M, Estrada T, Johnston T. A survey of algorithms for transforming molecular dynamics data into metadata for in situ analytics based on machine learning methods. Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences 2020; 378:20190063. [PMID: 31955686 PMCID: PMC7015296 DOI: 10.1098/rsta.2019.0063] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Accepted: 12/04/2019] [Indexed: 06/10/2023]
Abstract
This paper presents a survey of three algorithms to transform atomic-level molecular snapshots from molecular dynamics (MD) simulations into metadata representations that are suitable for in situ analytics based on machine learning methods. MD simulations studying the classical time evolution of a molecular system at atomic resolution are widely recognized in the fields of chemistry, material sciences, molecular biology and drug design; they are among the most common simulations run on supercomputers. Next-generation supercomputers will have a dramatically higher performance than current systems, generating more data that needs to be analysed (e.g. in terms of number and length of MD trajectories). In the future, the coordination of data generation and analysis can no longer rely on manual, centralized analysis traditionally performed after the simulation is completed or on current data representations that have been defined for traditional visualization tools. Powerful data preparation phases (i.e. phases in which original raw data is transformed to concise and still meaningful representations) will need to precede data analysis phases. Here, we discuss three algorithms for transforming traditionally used molecular representations into concise and meaningful metadata representations. The transformations can be performed locally. The new metadata can be fed into machine learning methods for runtime in situ analysis of larger MD trajectories supported by high-performance computing. In this paper, we provide an overview of the three algorithms and their use for three different applications: protein-ligand docking in drug design; protein folding simulations; and protein engineering based on analytics of protein functions depending on proteins' three-dimensional structures. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
Affiliation(s)
- Michela Taufer
- Electrical Engineering and Computer Science Department, The University of Tennessee Knoxville, 401 Min H. Kao Bldg., 1520 Middle Drive, Knoxville, TN 37996-2250, USA
- Trilce Estrada
- Computer Science Department, University of New Mexico, MSC01 1130, Albuquerque, NM 87131-1070, USA
- Travis Johnston
- Oak Ridge National Laboratory, PO Box 2008, Oak Ridge, TN 37831, USA
2
Alnasir JJ, Shanahan HP. The application of Hadoop in structural bioinformatics. Brief Bioinform 2018; 21:96-105. [PMID: 30462158 DOI: 10.1093/bib/bby106] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/27/2018] [Revised: 09/20/2018] [Accepted: 10/05/2018] [Indexed: 11/13/2022] Open
Abstract
The paper reviews the use of the Hadoop platform in structural bioinformatics applications. For structural bioinformatics, Hadoop provides a new framework to analyse large fractions of the Protein Data Bank that is key for high-throughput studies of, for example, protein-ligand docking, clustering of protein-ligand complexes and structural alignment. Specifically, we review a number of implementations in the literature that use Hadoop for high-throughput analyses, and their scalability. We find that these deployments for the most part use known executables called from MapReduce rather than rewriting the algorithms. The scalability exhibits a variable behaviour in comparison with other batch schedulers, particularly as direct comparisons on the same platform are generally not available. Direct comparisons of Hadoop with batch schedulers are absent in the literature but we note there is some evidence that Message Passing Interface implementations scale better than Hadoop. A significant barrier to the use of the Hadoop ecosystem is the difficulty of the interface and configuration of a resource to use Hadoop. This will improve over time as interfaces to Hadoop (e.g. Spark) improve, usage of cloud platforms (e.g. Azure and Amazon Web Services (AWS)) increases and standardised approaches such as Workflow Languages (i.e. Workflow Definition Language, Common Workflow Language and Nextflow) are taken up.
Affiliation(s)
- Jamie J Alnasir
- Institute of Cancer Research, Old Brompton Road, London, United Kingdom
- Hugh P Shanahan
- Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, United Kingdom
3
Pashazadeh A, Navimipour NJ. Big data handling mechanisms in the healthcare applications: A comprehensive and systematic literature review. J Biomed Inform 2018; 82:47-62. [PMID: 29655946 DOI: 10.1016/j.jbi.2018.03.014] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 06/22/2017] [Revised: 11/19/2017] [Accepted: 03/23/2018] [Indexed: 01/08/2023]
Abstract
Healthcare provides many services, such as the diagnosis, treatment, and prevention of diseases, illnesses, injuries, and other physical and mental disorders. Large-scale distributed data processing applications in healthcare operate, as a basic concept, on large amounts of data. Big data application functions are therefore a main part of healthcare operations, but there has been no comprehensive and systematic survey that studies and evaluates the important techniques in this field. This paper therefore aims to provide a comprehensive, detailed, and systematic study of the state-of-the-art mechanisms in big data related to healthcare applications in five categories: machine learning, cloud-based, heuristic-based, agent-based, and hybrid mechanisms. The paper also presents a systematic literature review (SLR) of big data applications in the healthcare literature up to the end of 2016. Initially, 205 papers were identified, but a paper selection process reduced the number of papers to 29 important studies.
Affiliation(s)
- Asma Pashazadeh
- Department of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran
- Nima Jafari Navimipour
- Department of Computer Engineering, Tabriz Branch, Islamic Azad University, Tabriz, Iran
4
Tchagna Kouanou A, Tchiotsop D, Kengne R, Zephirin DT, Adele Armele NM, Tchinda R. An optimal big data workflow for biomedical image analysis. Informatics in Medicine Unlocked 2018. [DOI: 10.1016/j.imu.2018.05.001] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 12/21/2022] Open
5
Alexander NS, Palczewski K. Crowd sourcing difficult problems in protein science. Protein Sci 2017; 26:2118-2125. [PMID: 28762619 DOI: 10.1002/pro.3247] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 07/13/2017] [Accepted: 07/21/2017] [Indexed: 11/08/2022]
Abstract
Dedicated computing resources are expensive to develop, maintain, and administrate. Frequently, research groups require bursts of computing power, during which progress is still limited by available computing resources. One way to alleviate this bottleneck would be to use additional computing resources. Today, many computing devices remain idle most of the time. Passive volunteer computing exploits this unemployed reserve of computing power by allowing device-owners to donate computing time on their own devices. Another complementary way to alleviate bottlenecks in computing resources is to use more efficient algorithms. Engaging volunteer computing employs human intuition to help solve challenging problems for which efficient algorithms are difficult to develop or unavailable. Designing engaging volunteer computing projects is challenging but can result in high-quality solutions. Here, we highlight four examples.
Affiliation(s)
- Nathan S Alexander
- Department of Pharmacology, School of Medicine, Case Western Reserve University, Cleveland, Ohio, 44106
- Krzysztof Palczewski
- Department of Pharmacology, School of Medicine, Case Western Reserve University, Cleveland, Ohio, 44106; Cleveland Center for Membrane and Structural Biology, Case Western Reserve University, Cleveland, Ohio, 44106
6
Johnston T, Zhang B, Liwo A, Crivelli S, Taufer M. In situ data analytics and indexing of protein trajectories. J Comput Chem 2017; 38:1419-1430. [PMID: 28093787 DOI: 10.1002/jcc.24729] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 06/25/2016] [Revised: 10/22/2016] [Accepted: 10/27/2016] [Indexed: 11/06/2022]
Abstract
The transition toward exascale computing will be accompanied by a performance dichotomy. Computational peak performance will rapidly increase; I/O performance will either grow slowly or be completely stagnant. Essentially, the rate at which data are generated will grow much faster than the rate at which data can be read from and written to the disk. MD simulations will soon face the I/O problem of efficiently writing to and reading from disk on the next generation of supercomputers. This article targets MD simulations at the exascale and proposes a novel technique for in situ data analysis and indexing of MD trajectories. Our technique maps individual trajectories' substructures (i.e., α-helices, β-strands) to metadata frame by frame. The metadata captures the conformational properties of the substructures. The ensemble of metadata can be used for automatic, strategic analysis within a trajectory or across trajectories, without manually identifying those portions of trajectories in which critical changes take place. We demonstrate our technique's effectiveness by applying it to 26.3k helices and 31.2k strands from 9917 PDB proteins and by providing three empirical case studies. © 2017 Wiley Periodicals, Inc.
Affiliation(s)
- Travis Johnston
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA
- Boyu Zhang
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA
- Adam Liwo
- Department of Theoretical Chemistry, University of Gdansk, 80-952, Gdańsk, Poland
- Silvia Crivelli
- Department of Computer Science, University of California, Davis, CA, 95616, USA
- Michela Taufer
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19711, USA
7
Roche DB, Brackenridge DA, McGuffin LJ. Proteins and Their Interacting Partners: An Introduction to Protein-Ligand Binding Site Prediction Methods. Int J Mol Sci 2015; 16:29829-42. [PMID: 26694353 PMCID: PMC4691145 DOI: 10.3390/ijms161226202] [Citation(s) in RCA: 49] [Impact Index Per Article: 5.4] [Received: 09/20/2015] [Revised: 12/02/2015] [Accepted: 12/10/2015] [Indexed: 01/14/2023] Open
Abstract
Elucidating the biological and biochemical roles of proteins, and subsequently determining their interacting partners, can be difficult and time consuming using in vitro and/or in vivo methods, and consequently the majority of newly sequenced proteins will have unknown structures and functions. However, in silico methods for predicting protein-ligand binding sites and protein biochemical functions offer an alternative practical solution. The characterisation of protein-ligand binding sites is essential for investigating new functional roles, which can impact the major biological research spheres of health, food, and energy security. In this review we discuss the role in silico methods play in 3D modelling of protein-ligand binding sites, along with their role in predicting biochemical functionality. In addition, we describe in detail some of the key alternative in silico prediction approaches that are available, as well as discussing the Critical Assessment of Techniques for Protein Structure Prediction (CASP) and the Continuous Automated Model EvaluatiOn (CAMEO) projects, and their impact on developments in the field. Furthermore, we discuss the importance of protein function prediction methods for tackling 21st century problems.
Affiliation(s)
- Daniel Barry Roche
- Institut de Biologie Computationnelle, LIRMM, CNRS, Université de Montpellier, Montpellier 34095, France.
- Centre de Recherche de Biochimie Macromoléculaire, CNRS-UMR 5237, Montpellier 34293, France.
8
Komiyama Y, Banno M, Ueki K, Saad G, Shimizu K. Automatic generation of bioinformatics tools for predicting protein-ligand binding sites. Bioinformatics 2015; 32:901-7. [PMID: 26545824 PMCID: PMC4803387 DOI: 10.1093/bioinformatics/btv593] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 04/27/2015] [Accepted: 10/12/2015] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Predictive tools that model protein-ligand binding on demand are needed to promote ligand research in an innovative drug-design environment. However, it takes considerable time and effort to develop predictive tools that can be applied to individual ligands. An automated production pipeline that can rapidly and efficiently develop user-friendly protein-ligand binding predictive tools would be useful. RESULTS: We developed a system for automatically generating protein-ligand binding predictions. Implementation of this system in a pipeline of Semantic Web technique-based web tools will allow users to specify a ligand and receive the tool within 0.5-1 day. We demonstrated high prediction accuracy for three machine learning algorithms and eight ligands. AVAILABILITY AND IMPLEMENTATION: The source code and web application are freely available for download at http://utprot.net. They are implemented in Python and supported on Linux. CONTACT: shimizu@bi.a.u-tokyo.ac.jp SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yusuke Komiyama
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan
- Masaki Banno
- Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan
- Kokoro Ueki
- Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan
- Gul Saad
- Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan
- Kentaro Shimizu
- Department of Biotechnology, Graduate School of Agricultural and Life Sciences, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan
9
Abstract
We provide a future perspective of the virtual screening field. A number of challenges will be highlighted that virtual screening will likely face as compound data continue to grow at or beyond current rates and as much more target information becomes available. These challenges go beyond computational efficiency issues (that will of course also play a critical role). For example, for structure-based approaches, the accuracy of scoring functions and energy calculations will need to be improved. For ligand-based approaches, the compound class-dependence of similarity methods needs to be further explored and relationships between molecular similarity and activity similarity need to be established. We also comment on the current and future value of virtual screening. Opportunities for further development in a postgenome era are also discussed. It is hoped that some of the views and hypotheses we articulate might stimulate further discussion about the virtual screening field going forward.
Affiliation(s)
- Kathrin Heikamp
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany