1
Jang KW, Jeong WJ, Kang Y. Development of a GPU-Accelerated NDT Localization Algorithm for GNSS-Denied Urban Areas. Sensors (Basel) 2022; 22:1913. PMID: 35271060. DOI: 10.3390/s22051913.
Abstract
There are numerous global navigation satellite system-denied regions in urban areas, where the localization of autonomous driving remains a challenge. To address this problem, a high-resolution light detection and ranging (LiDAR) sensor was recently developed. Various methods have been proposed to improve the accuracy of localization using precise distance measurements derived from LiDAR sensors. This study proposes an algorithm to accelerate the computational speed of LiDAR localization while maintaining the original accuracy of lightweight map-matching algorithms. To this end, first, a point cloud map was transformed into a normal distribution (ND) map. During this process, vector-based normal distribution transform, suitable for graphics processing unit (GPU) parallel processing, was used. In this study, we introduce an algorithm that enabled GPU parallel processing of an existing ND map-matching process. The performance of the proposed algorithm was verified using an open dataset and simulations. To verify the practical performance of the proposed algorithm, the real-time serial and parallel processing performances of the localization were compared using high-performance and embedded computers, respectively. The distance root-mean-square error and computational time of the proposed algorithm were compared. The algorithm increased the computational speed of the embedded computer almost 100-fold while maintaining high localization precision.
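The per-point independence that makes ND map matching GPU-friendly can be illustrated with a small CPU sketch. This is a toy 2-D NDT with invented function names, not the paper's vector-based GPU implementation: each scan point's likelihood term depends only on its own voxel's stored mean and inverse covariance, so all points can be scored in parallel.

```python
import numpy as np

def build_nd_map(points, cell_size=1.0):
    """Group points into voxels and store a mean and inverse covariance per
    cell: the normal-distribution representation used by NDT matching."""
    cells = {}
    for p in points:
        key = tuple(np.floor(p / cell_size).astype(int))
        cells.setdefault(key, []).append(p)
    nd_map = {}
    for key, pts in cells.items():
        pts = np.asarray(pts)
        if len(pts) < 3:                      # too few points for a stable covariance
            continue
        mean = pts.mean(axis=0)
        cov = np.cov(pts.T) + 1e-6 * np.eye(pts.shape[1])  # regularize
        nd_map[key] = (mean, np.linalg.inv(cov))
    return nd_map

def ndt_score(scan, nd_map, cell_size=1.0):
    """Sum of per-point Gaussian likelihood terms; every term is independent,
    which is what makes the evaluation embarrassingly parallel on a GPU."""
    score = 0.0
    for p in scan:
        key = tuple(np.floor(p / cell_size).astype(int))
        if key in nd_map:
            mean, inv_cov = nd_map[key]
            d = p - mean
            score += np.exp(-0.5 * d @ inv_cov @ d)
    return score
```

In a real localizer this score is maximized over candidate poses; here a scan point near the map's point cluster simply scores higher than one far from it.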
2
Fang J, Wei Z, Yang H. Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU. Micromachines (Basel) 2021; 12:1262. PMID: 34683312. PMCID: PMC8537857. DOI: 10.3390/mi12101262.
Abstract
GPGPUs have gradually become a mainstream acceleration component in high-performance computing, but the long latency of memory operations remains the bottleneck of GPU performance. In a GPU, threads are grouped into warps for scheduling and execution. The L1 data cache has a small capacity that is shared by many warps, so it suffers heavy contention and frequent pipeline stalls. We propose Locality-Based Cache Management (LCM), combined with Locality-Based Warp Scheduling (LWS), to reduce cache contention and improve GPU performance. Each load instruction can be classified into one of three locality types: streaming data, used only once; intra-warp locality, accessed multiple times within the same warp; and inter-warp locality, accessed by different warps. According to the locality of each load instruction, LCM bypasses the cache for streaming requests to improve cache utilization, extends inter-warp memory request coalescing to make full use of inter-warp locality, and combines with LWS to alleviate cache contention. Together, LCM and LWS effectively improve cache performance and thereby overall GPU performance. In our experimental evaluation, LCM and LWS obtain an average performance improvement of 26% over the baseline GPU.
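The three locality classes can be illustrated with a toy trace classifier. This is a sketch of the categories described above using invented function names and a simplified (warp, PC, address) trace format, not the paper's hardware detection mechanism:

```python
from collections import defaultdict

def classify_loads(trace):
    """Classify each load PC by the locality of the addresses it touches:
    'streaming' (each address used once), 'intra-warp' (an address reused,
    but only within one warp), or 'inter-warp' (an address reused across
    warps). trace is a list of (warp_id, pc, address) tuples."""
    addr_warps = defaultdict(lambda: defaultdict(set))  # pc -> addr -> warps
    addr_count = defaultdict(lambda: defaultdict(int))  # pc -> addr -> uses
    for warp, pc, addr in trace:
        addr_warps[pc][addr].add(warp)
        addr_count[pc][addr] += 1
    labels = {}
    for pc in addr_warps:
        if any(len(ws) > 1 for ws in addr_warps[pc].values()):
            labels[pc] = "inter-warp"
        elif any(n > 1 for n in addr_count[pc].values()):
            labels[pc] = "intra-warp"
        else:
            labels[pc] = "streaming"
    return labels
```

A cache manager in the spirit of LCM would then bypass the L1 for PCs labelled "streaming" and try to coalesce requests for PCs labelled "inter-warp".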
Affiliation(s)
- Juan Fang
- Correspondence: ; Tel.: +86-139-1129-6256
3
Ruf B, Mohrs J, Weinmann M, Hinz S, Beyerer J. ReS²tAC: UAV-Borne Real-Time SGM Stereo Optimized for Embedded ARM and CUDA Devices. Sensors (Basel) 2021; 21:3938. PMID: 34200481. DOI: 10.3390/s21113938.
Abstract
With the emergence of low-cost robotic systems such as unmanned aerial vehicles (UAVs), the importance of embedded high-performance image processing has increased. For a long time, FPGAs were the only processing hardware capable of high-performance computing while preserving the low power consumption essential for embedded systems. However, the increasing availability of embedded GPU-based systems, such as the NVIDIA Jetson series comprising an ARM CPU and an NVIDIA Tegra GPU, allows for massively parallel embedded computing on graphics hardware. With this in mind, we propose an approach for real-time embedded stereo processing on ARM and CUDA-enabled devices, based on the popular and widely used Semi-Global Matching (SGM) algorithm. We optimize the algorithm for embedded CUDA GPUs using massively parallel computing, and for embedded ARM CPUs using NEON intrinsics for vectorized SIMD processing. We evaluated our approach in different configurations on two public stereo benchmark datasets, demonstrating an error rate as low as 3.3%. Furthermore, our experiments show that the fastest configuration reaches up to 46 FPS at VGA image resolution. Finally, in a use-case-specific qualitative evaluation, we measured the power consumption of our approach and deployed it on a DJI Manifold 2-G attached to a DJI Matrice 210v2 RTK UAV, demonstrating its suitability for real-time stereo processing onboard a UAV.
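The core of SGM is a per-scanline cost-aggregation recurrence. A minimal single-direction sketch is shown below; the penalty values P1 and P2 are assumed placeholders, and a full SGM implementation sums this aggregation over several path directions per pixel:

```python
import numpy as np

def sgm_aggregate_row(cost, p1=10.0, p2=120.0):
    """Aggregate a matching-cost scanline with the SGM recurrence
    L(x,d) = C(x,d) + min(L(x-1,d), L(x-1,d±1)+P1, min_k L(x-1,k)+P2)
             - min_k L(x-1,k).
    cost: (width, num_disparities) array for one left-to-right path."""
    w, nd = cost.shape
    L = np.zeros((w, nd), dtype=np.float64)
    L[0] = cost[0]
    for x in range(1, w):
        prev = L[x - 1]
        best_prev = prev.min()          # min_k L(x-1,k), subtracted to bound L
        for d in range(nd):
            candidates = [prev[d],
                          (prev[d - 1] + p1) if d > 0 else np.inf,
                          (prev[d + 1] + p1) if d < nd - 1 else np.inf,
                          best_prev + p2]
            L[x, d] = cost[x, d] + min(candidates) - best_prev
    return L
```

On a GPU, the disparity loop maps naturally onto threads, while NEON versions vectorize it across SIMD lanes; the sequential dependency remains only along x.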
4
Klippel H, Süssmaier S, Röthlin M, Afrasiabi M, Pala U, Wegener K. Simulation of the ductile machining mode of silicon. Int J Adv Manuf Technol 2021; 115:1565-1578. PMID: 34776579. PMCID: PMC8550667. DOI: 10.1007/s00170-021-07167-3.
Abstract
Diamond wire sawing has been developed to reduce the cutting loss when cutting silicon wafers from ingots. The surface of silicon solar cells must be flawless in order to achieve the highest possible efficiency; however, the surface is damaged during sawing. The extent of the damage depends primarily on the material removal mode. Under certain conditions, the generally brittle material can be machined in ductile mode, whereby considerably fewer cracks occur in the surface than with brittle material removal. In this paper, a numerical model is developed to support the optimisation of the machining process with respect to the transition between ductile and brittle material removal. The simulations are performed with a GPU-accelerated, in-house developed code using mesh-free methods, which easily handle large deformations, whereas classical methods such as FEM would require intensive remeshing. The Johnson-Cook flow stress model is implemented and used to evaluate the applicability of a ductile material model in the transition zone between the ductile and brittle removal modes. The simulation results are compared with results obtained from single-grain scratch experiments using a real, non-idealised grain geometry as present in the diamond wire sawing process.
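The Johnson-Cook flow stress model mentioned above has a standard closed form; a sketch with placeholder constants (not calibrated silicon parameters) is:

```python
import math

def johnson_cook_stress(strain, strain_rate, T,
                        A=200e6, B=300e6, n=0.3, C=0.02, m=1.0,
                        eps0=1.0, T_room=293.0, T_melt=1687.0):
    """Johnson-Cook flow stress
        sigma = (A + B*eps^n) * (1 + C*ln(epsdot/eps0)) * (1 - T*^m),
    with homologous temperature T* = (T - T_room)/(T_melt - T_room).
    A, B, n, C, m here are illustrative placeholders, not fitted values."""
    T_star = (T - T_room) / (T_melt - T_room)
    return ((A + B * strain ** n)
            * (1.0 + C * math.log(max(strain_rate, 1e-12) / eps0))
            * (1.0 - T_star ** m))
```

At room temperature, the reference strain rate, and zero plastic strain the model reduces to the yield stress A, which is a convenient sanity check when implementing it in a solver.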
Collapse
Affiliation(s)
- Hagen Klippel
- Department of Mechanical Engineering, Institute of Machine Tools and Manufacturing (IWF), ETH Zürich, Zürich, Switzerland
- Stefan Süssmaier
- Department of Mechanical Engineering, Institute of Machine Tools and Manufacturing (IWF), ETH Zürich, Zürich, Switzerland
- Matthias Röthlin
- Operation Center 1 at Federal Office of Meteorology & Climatology, MeteoSwiss, Switzerland
- Konrad Wegener
- Department of Mechanical Engineering, Institute of Machine Tools and Manufacturing (IWF), ETH Zürich, Zürich, Switzerland
5
Niedzwiedzki J, Niewola A, Lipinski P, Swaczyna P, Bobinski A, Poryzala P, Podsedkowski L. Real-Time Parallel-Serial LiDAR-Based Localization Algorithm with Centimeter Accuracy for GPS-Denied Environments. Sensors (Basel) 2020; 20:7123. PMID: 33322587. PMCID: PMC7764368. DOI: 10.3390/s20247123.
Abstract
In this paper, we introduce a real-time parallel-serial algorithm for autonomous robot positioning in GPS-denied, dark environments such as caves and mine galleries. To achieve a good complexity-accuracy trade-off, we fuse data from light detection and ranging (LiDAR) and an inertial measurement unit (IMU). The main novelty of the proposed algorithm is that, unlike most approaches, we apply an extended Kalman filter (EKF) to each LiDAR scan point and calculate the location relative to a triangular mesh. We also present three implementations of the algorithm: serial, parallel, and parallel-serial. The first implementation verifies the correctness of our approach but is too slow for real-time execution. The second implements a well-known parallel data-fusion approach, but is still too slow for our application. The third and final implementation, combined with state-of-the-art GPU data structures, achieves real-time performance. According to our experimental findings, our algorithm outperforms the reference Gaussian mixture model (GMM) localization algorithm in terms of accuracy by a factor of two.
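Applying an EKF update per scan point reduces, for each point, to a scalar innovation against the matched mesh plane. The toy 2-D sketch below shows only that algebra, with an assumed state layout and noise value, not the paper's full pose filter:

```python
import numpy as np

def ekf_point_update(x, P, point, normal, d_plane, sigma=0.05):
    """One EKF measurement update for a single LiDAR point. State x is a
    2-D position (a toy stand-in for the full pose); the measurement is
    the signed distance of the shifted point to the matched mesh plane,
    n.(x + p) - d, which should be zero when the robot is localized."""
    n = np.asarray(normal, dtype=float)
    z_pred = n @ (x + point) - d_plane        # predicted residual (scalar)
    H = n.reshape(1, -1)                      # Jacobian d(residual)/dx
    S = H @ P @ H.T + sigma ** 2              # innovation covariance (1x1)
    K = P @ H.T / S                           # Kalman gain
    x_new = x + (K * (0.0 - z_pred)).ravel()  # pull residual toward zero
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new
```

Because each point contributes a tiny, cheap update like this, the work can be split between GPU (point-mesh association) and CPU (sequential filter) stages, which is the parallel-serial idea the abstract describes.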
Collapse
Affiliation(s)
- Jakub Niedzwiedzki
- Institute of Machine Tools and Production Engineering, Lodz University of Technology, ul. Stefanowskiego 1/15, 90-924 Lodz, Poland
- Adam Niewola
- Institute of Machine Tools and Production Engineering, Lodz University of Technology, ul. Stefanowskiego 1/15, 90-924 Lodz, Poland
- Piotr Lipinski
- Institute of Information Technology, Lodz University of Technology, ul. Wolczanska 215, 90-924 Lodz, Poland
- Correspondence:
- Piotr Swaczyna
- Institute of Machine Tools and Production Engineering, Lodz University of Technology, ul. Stefanowskiego 1/15, 90-924 Lodz, Poland
- Aleksander Bobinski
- Institute of Machine Tools and Production Engineering, Lodz University of Technology, ul. Stefanowskiego 1/15, 90-924 Lodz, Poland
- Pawel Poryzala
- Institute of Electronics, Lodz University of Technology, ul. Wolczanska 211/215, 93-005 Lodz, Poland
- Leszek Podsedkowski
- Institute of Machine Tools and Production Engineering, Lodz University of Technology, ul. Stefanowskiego 1/15, 90-924 Lodz, Poland
6
Kasahara K, Terazawa H, Itaya H, Goto S, Nakamura H, Takahashi T, Higo J. myPresto/omegagene 2020: a molecular dynamics simulation engine for virtual-system coupled sampling. Biophys Physicobiol 2020; 17:140-146. PMID: 33240741. PMCID: PMC7671739. DOI: 10.2142/biophysico.bsj-2020013.
Abstract
The molecular dynamics (MD) method is a promising approach for investigating the molecular mechanisms of microscopic phenomena. In particular, generalized ensemble MD methods can efficiently explore conformational spaces with rugged free-energy surfaces. However, implementing each generalized ensemble method, and acquiring the technical knowledge to use it, is not straightforward for end users. Here, we present a new version of the myPresto/omegagene software, an MD simulation engine tailored for a series of generalized ensemble methods: virtual-system coupled multicanonical MD (V-McMD), virtual-system coupled adaptive umbrella sampling (V-AUS), and virtual-system coupled canonical MD (VcMD). This program has been applied in several studies analyzing the free-energy landscapes of a variety of molecular systems with all-atom simulations. The updated version adds support for coarse-grained simulations based on the hydrophobicity scale method. The software package includes a step-by-step tutorial on enhanced conformational sampling of a poly-glutamine (poly-Q) oligomer expressed as a one-bead-per-residue model. The myPresto/omegagene software is freely available under the Apache2 license at https://github.com/kotakasahara/omegagene.
Collapse
Affiliation(s)
- Kota Kasahara
- College of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan
- Hiroki Terazawa
- Graduate School of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan
- Hayato Itaya
- Graduate School of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan
- Satoshi Goto
- Graduate School of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan
- Haruki Nakamura
- Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
- Takuya Takahashi
- College of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan
- Junichi Higo
- Graduate School of Simulation Studies, University of Hyogo, Kobe, Hyogo 650-0047, Japan
7
Kabiri Chimeh M, Heywood P, Pennisi M, Pappalardo F, Richmond P. Parallelisation strategies for agent based simulation of immune systems. BMC Bioinformatics 2019; 20:579. PMID: 31823716. DOI: 10.1186/s12859-019-3181-y.
Abstract
BACKGROUND In recent years, the study of immune response behaviour using a bottom-up approach, Agent-Based Modeling (ABM), has attracted considerable effort. ABM is a common technique in the biological domain, driven by the demand for large-scale analysis tools that collect and interpret information to solve biological problems. Simulating massive multi-agent systems (i.e. simulations containing large numbers of agents/entities) requires major computational effort, which is only achievable through parallel computing. RESULTS This paper explores different approaches to parallelising the key component of biological and immune system ABMs: pairwise interactions. The focus is on the performance and algorithmic design choices for cell interactions in continuous and discrete space, where agents/entities compete to interact with one another within a parallel environment. CONCLUSIONS Our performance results demonstrate the applicability of these methods to a broader class of biological systems exhibiting typical cell-to-cell interactions. The advantages and disadvantages of each implementation are discussed, showing that each can serve as the basis for developing complete immune system models on parallel hardware.
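A common way to make pairwise interactions in continuous space parallelisable is uniform-grid spatial binning, which bounds the work per agent to its own cell and the adjacent cells. The serial sketch below illustrates the idea (invented function name, not the paper's implementation):

```python
from collections import defaultdict
from itertools import product

def neighbour_pairs(positions, radius):
    """Find all agent pairs within `radius` of each other using a uniform
    grid with cell size equal to the interaction radius. Each agent only
    inspects its own cell and the 8 adjacent cells, so per-agent work is
    bounded and independent, which is what a GPU version exploits."""
    cell = radius
    grid = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        grid[(int(x // cell), int(y // cell))].append(i)
    pairs = set()
    for (cx, cy), members in grid.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            for i in members:
                for j in grid.get((cx + dx, cy + dy), ()):
                    if i < j:
                        xi, yi = positions[i]
                        xj, yj = positions[j]
                        if (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2:
                            pairs.add((i, j))
    return pairs
```

In a parallel implementation the outer loop over agents becomes one thread per agent, and the grid is rebuilt (or sorted) each timestep.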
8
Varga L, Kovács A, Grósz T, Thury G, Hadarits F, Dégi R, Dombi J. Automatic segmentation of hyperreflective foci in OCT images. Comput Methods Programs Biomed 2019; 178:91-103. PMID: 31416566. DOI: 10.1016/j.cmpb.2019.06.019.
Abstract
BACKGROUND AND OBJECTIVE The leading cause of vision loss in the Western world is Age-related Macular Degeneration (AMD); alongside modern medication, tracking the number of Hyperreflective Foci (HF) on Optical Coherence Tomography (OCT) images can assist the treatment of patients. Here, we developed a deep-learning-based framework for the automatic segmentation of HF in OCT images. METHODS We collected and annotated OCT images, which then underwent image preprocessing and feature extraction. Using the prepared data, we trained different types of conventional, deep, and convolutional neural networks to perform automatic HF segmentation. RESULTS We evaluated the various neural networks by performing HF segmentation on clinical data from patients whose data were excluded from the training process. The results suggest that our systems achieve high Dice coefficient values, comparable with the inter-rater similarity between manual annotations by different physicians (in most cases above 95%). CONCLUSION We conclude that neural networks can accurately segment HF in OCT images. The results are sufficiently accurate to be incorporated into the next phase of the research: building a decision support system for everyday clinical practice.
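The Dice coefficient used for evaluation is straightforward to compute from two binary masks; a minimal sketch:

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice similarity between two binary masks:
    Dice = 2|P & T| / (|P| + |T|), the overlap metric used to compare
    network segmentations against physicians' annotations."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0          # both masks empty: perfect agreement by convention
    return 2.0 * np.logical_and(pred, truth).sum() / denom
```

The empty-mask convention (returning 1.0) is a choice, not a universal rule; some evaluation protocols instead exclude empty cases.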
Collapse
Affiliation(s)
- László Varga
- University of Szeged, Interdisciplinary Excellence Centre, Hungary
- Attila Kovács
- University of Szeged, Interdisciplinary Excellence Centre, Hungary
- Tamás Grósz
- University of Szeged, Interdisciplinary Excellence Centre, Hungary
- Géza Thury
- University of Szeged, Interdisciplinary Excellence Centre, Hungary
- Flóra Hadarits
- University of Szeged, Interdisciplinary Excellence Centre, Hungary
- Rózsa Dégi
- University of Szeged, Interdisciplinary Excellence Centre, Hungary
- József Dombi
- University of Szeged, Interdisciplinary Excellence Centre, Hungary
9
Na JC, Lee I, Rhee JK, Shin SY. Fast single individual haplotyping method using GPGPU. Comput Biol Med 2019; 113:103421. PMID: 31499396. DOI: 10.1016/j.compbiomed.2019.103421.
Abstract
BACKGROUND Most bioinformatics tools for next-generation sequencing (NGS) data are computationally intensive, requiring a large amount of computational power for processing and analysis. Here, the utility of graphics processing units (GPUs) for NGS data computation is assessed. METHOD In a previous study, we developed the probabilistic evolutionary algorithm with toggling for haplotyping (PEATH), based on the estimation-of-distribution algorithm and a toggling heuristic. Here, we parallelized the PEATH method (PEATH/G) using general-purpose computing on GPUs (GPGPU). RESULTS PEATH/G runs approximately 46.8 times and 25.4 times faster than PEATH on the NA12878 fosmid-sequencing dataset and the HuRef dataset, respectively, on an NVIDIA GeForce GTX 1660 Ti. Moreover, PEATH/G is approximately 13.3 times faster on the fosmid-sequencing dataset even with an inexpensive consumer GPU (NVIDIA GeForce GTX 950). CONCLUSIONS PEATH/G can serve as a practical single individual haplotyping tool in terms of both accuracy and speed, and GPGPU can help reduce the running time of NGS analysis tools.
Collapse
Affiliation(s)
- Joong Chae Na
- Department of Computer Science and Engineering, Sejong University, Seoul, 05006, South Korea
- Inbok Lee
- Department of Software, Korea Aerospace University, Goyang, 10540, South Korea
- Je-Keun Rhee
- School of Systems Biomedical Science, Soongsil University, Seoul, 06978, South Korea
- Soo-Yong Shin
- Department of Digital Health, SAIHST, Sungkyunkwan University, Seoul, 06351, South Korea; Big Data Research Center, Samsung Medical Center, Seoul, 06351, South Korea
10
Hernandez-Fernandez M, Reguly I, Jbabdi S, Giles M, Smith S, Sotiropoulos SN. Using GPUs to accelerate computational diffusion MRI: From microstructure estimation to tractography and connectomes. Neuroimage 2019; 188:598-615. PMID: 30537563. PMCID: PMC6614035. DOI: 10.1016/j.neuroimage.2018.12.015.
Abstract
The great potential of computational diffusion MRI (dMRI) relies on indirect inference of tissue microstructure and brain connections, since modelling and tractography frameworks map diffusion measurements to neuroanatomical features. This mapping, however, can be computationally very expensive, particularly given the trend of increasing dataset sizes and the complexity of biophysical modelling. Limitations on computing resources can restrict data exploration and methodology development. A step forward is to take advantage of the computational power offered by recent parallel computing architectures, especially Graphics Processing Units (GPUs). GPUs are massively parallel processors that offer trillions of floating-point operations per second and have made possible the solution of computationally intensive scientific problems that were previously intractable. However, they are not inherently suited to all problems. Here, we present two frameworks for accelerating dMRI computations using GPUs that cover the most typical dMRI applications: one for biophysical modelling and microstructure estimation, and a second for tractography and long-range connectivity estimation. The former provides a front-end that automatically generates a GPU executable from a user-specified biophysical model, allowing accelerated non-linear model fitting in both deterministic and stochastic (Bayesian inference) ways. The latter performs probabilistic tractography, can generate whole-brain connectomes, and supports new functionality for imposing anatomical constraints, such as inherent consideration of surface meshes (GIFTI files) alongside volumetric images. We validate the frameworks against well-established CPU-based implementations and show that, despite the very different challenges of parallelising these problems, a single GPU achieves better performance than 200 CPU cores thanks to our parallel designs.
Collapse
Affiliation(s)
- Moises Hernandez-Fernandez
- Wellcome Centre for Integrative Neuroimaging - Centre for Functional Magnetic Resonance Imaging of the Brain (FMRIB), University of Oxford, Oxford, United Kingdom; Center for Biomedical Image Computing and Analytics (CBICA), Department of Radiology, University of Pennsylvania, Philadelphia, PA, United States
- Istvan Reguly
- Faculty of Information Technology and Bionics, Pazmany Peter Catholic University, Budapest, Hungary
- Saad Jbabdi
- Wellcome Centre for Integrative Neuroimaging - Centre for Functional Magnetic Resonance Imaging of the Brain (FMRIB), University of Oxford, Oxford, United Kingdom
- Mike Giles
- Mathematical Institute, University of Oxford, Oxford, United Kingdom
- Stephen Smith
- Wellcome Centre for Integrative Neuroimaging - Centre for Functional Magnetic Resonance Imaging of the Brain (FMRIB), University of Oxford, Oxford, United Kingdom
- Stamatios N Sotiropoulos
- Wellcome Centre for Integrative Neuroimaging - Centre for Functional Magnetic Resonance Imaging of the Brain (FMRIB), University of Oxford, Oxford, United Kingdom; Sir Peter Mansfield Imaging Centre, School of Medicine, University of Nottingham, Nottingham, United Kingdom
11
Okada S, Murakami K, Incerti S, Amako K, Sasaki T. MPEXS-DNA, a new GPU-based Monte Carlo simulator for track structures and radiation chemistry at subcellular scale. Med Phys 2019; 46:1483-1500. PMID: 30593679. PMCID: PMC6850505. DOI: 10.1002/mp.13370.
Abstract
Purpose: Track structure simulation codes can accurately reproduce the stochastic nature of particle-matter interactions in order to quantitatively evaluate radiation damage in biological cells, such as DNA strand breaks and base damage. Such simulations handle large numbers of secondary charged particles and molecular species created in the irradiated medium. Every particle and molecular species is tracked step-by-step using a Monte Carlo method to calculate energy loss patterns and spatial distributions of molecular species inside a cell nucleus with high spatial accuracy. The Geant4-DNA extension of the Geant4 general-purpose Monte Carlo simulation toolkit allows for such track structure simulations and can be run on CPUs; however, long execution times have been observed for the simulation of DNA damage in cells. In this work, we improve the computing performance of such simulations using ultraparallel processing on a graphics processing unit (GPU).
Methods: A new Monte Carlo simulator named MPEXS-DNA, which achieves high computing performance on a GPU, has been developed for track structure and radiolysis simulations at the subcellular scale. The physics and chemical processes of MPEXS-DNA are based on the Geant4-DNA processes available in Geant4 version 10.02 p03. We reimplemented the Geant4-DNA process codes of the physics stage (electromagnetic processes of charged particles) and the chemical stage (diffusion and chemical reactions of molecular species) for microdosimetry simulation in the CUDA language. MPEXS-DNA can calculate the distribution of energy loss in the irradiated medium caused by charged particles, and can also simulate the production, diffusion, and chemical interactions of molecular species from water radiolysis to quantitatively assess initial damage to DNA. The physics and chemistry simulations of MPEXS-DNA were validated by comparing various distributions (the radial dose distributions for the physics stage, and the G-value profiles of each chemical product and their linear energy transfer dependency for the chemical stage) with existing experimental data and with simulation results obtained by other codes, including PARTRAC.
Results: For physics validation, the radial dose distributions calculated by MPEXS-DNA are consistent with experimental data and numerical simulations. For chemistry validation, MPEXS-DNA reproduces the G-value profiles of each molecular species with the same tendency as existing experimental data, and agrees reasonably well with PARTRAC simulations. However, we confirmed slight discrepancies in the G-value profiles calculated by MPEXS-DNA for molecular species such as H2 and H2O2 compared to experimental data and PARTRAC simulations; these differences are caused by the different sets of chemical reactions considered. MPEXS-DNA drastically boosts the computing performance of track structure and radiolysis simulations: using NVIDIA GPU devices with the Volta architecture, MPEXS-DNA achieves speedup factors of up to 2900 over Geant4-DNA simulations on a single CPU core.
Conclusion: The MPEXS-DNA Monte Carlo simulation achieves accuracy similar to Monte Carlo simulations performed with other codes such as Geant4-DNA and PARTRAC, and its predictions are consistent with experimental data. Notably, MPEXS-DNA allows calculations that are, at maximum, 2900 times faster than conventional simulations on a CPU.
Collapse
Affiliation(s)
- Shogo Okada
- KEK, 1-1, Oho, Tsukuba, Ibaraki, 305-0801, Japan
- Sebastien Incerti
- University of Bordeaux, CENBG, UMR 5797, Gradignan, F-33170, France; CNRS, IN2P3, CENBG, UMR 5797, Gradignan, F-33170, France
12
Abstract
Background We present a performance-per-watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal pairwise alignment of huge DNA sequences on multi-GPU platforms using the exact Smith-Waterman method. Results Our study covers acceleration factors, performance, scalability, power efficiency, and energy costs. We also quantify the influence of the contents of the compared sequences, identify potential scenarios for energy savings in speculative executions, and measure performance and energy-usage differences among distinct GPU generations and models. For a sequence alignment at chromosome-wide scale (around 2 Petacells), we reduce execution times from 9.5 h on a Kepler GPU to just 2.5 h on a Pascal counterpart, cutting energy costs by 60%. Conclusions We find GPUs to be an order of magnitude ahead of Xeon Phis in performance per watt. Finally, compared with typical low-power devices such as FPGAs, GPUs achieve similar GFLOPS/W ratios in 2017 while executing five times faster.
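The exact Smith-Waterman recurrence that CUDAlign parallelises can be sketched serially in a few lines. Scoring parameters here are toy values; real GPU implementations compute anti-diagonals of the matrix in parallel (every cell on one anti-diagonal depends only on the previous two) and tile the matrix to cope with chromosome-scale inputs:

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Exact Smith-Waterman local alignment score:
    H[i][j] = max(0, H[i-1][j-1] + s(a_i, b_j),
                     H[i-1][j] + gap, H[i][j-1] + gap),
    with the answer being the maximum cell anywhere in H."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

This O(len(a) x len(b)) matrix is exactly where the "Petacells" figure in the abstract comes from: the cell count of the full dynamic-programming matrix.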
13
Abstract
We introduce the Xpuck swarm, a research platform with an aggregate raw processing power in excess of two teraflops. The swarm uses 16 e-puck robots augmented with custom hardware that exploits the substantial CPU and GPU processing power available from modern mobile system-on-chip devices. The augmented robots, called Xpucks, have at least an order of magnitude greater performance than previous swarm robotics platforms. The platform enables new experiments that require high individual-robot computation and multiple robots. Uses include online evolution or learning of swarm controllers, simulation for answering what-if questions about possible actions, distributed supercomputing for mobile platforms, and real-world applications of swarm robotics that require image processing or SLAM. The teraflop swarm could also be used to explore swarming in nature by providing platforms with computational power similar to that of simple insects. We demonstrate the computational capability of the swarm by implementing a fast physics-based robot simulator and using it within a distributed island-model evolutionary system, all hosted on the Xpucks.
Collapse
Affiliation(s)
- Simon Jones
- University of Bristol, Bristol, United Kingdom; University of the West of England, Bristol, United Kingdom; Bristol Robotics Laboratory, University of the West of England, Bristol, United Kingdom
- Matthew Studley
- University of the West of England, Bristol, United Kingdom; Bristol Robotics Laboratory, University of the West of England, Bristol, United Kingdom
- Sabine Hauert
- University of Bristol, Bristol, United Kingdom; Bristol Robotics Laboratory, University of the West of England, Bristol, United Kingdom
- Alan Frank Thomas Winfield
- University of the West of England, Bristol, United Kingdom; Bristol Robotics Laboratory, University of the West of England, Bristol, United Kingdom
Collapse
|
14
|
Medina L, Diez-Ochoa M, Correal R, Cuenca-Asensi S, Serrano A, Godoy J, Martínez-Álvarez A, Villagra J. A Comparison of FPGA and GPGPU Designs for Bayesian Occupancy Filters. Sensors (Basel) 2017; 17:E2599. [PMID: 29137137] [DOI: 10.3390/s17112599]
Abstract
Grid-based perception techniques, which fuse information from different sensors into a robust representation of the environment, are proliferating in the automotive industry. However, one of their main drawbacks is the high computational performance they demand, traditionally prohibitive for embedded automotive systems. In this work, the capabilities of new computing architectures to embed these algorithms are assessed in a real car. The paper compares two ad hoc optimized designs of the Bayesian Occupancy Filter: one for a General-Purpose Graphics Processing Unit (GPGPU) and the other for a Field-Programmable Gate Array (FPGA). The resulting implementations are compared in terms of development effort, accuracy and performance, using datasets from a realistic simulator and from a real automated vehicle.
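The Bayesian Occupancy Filter tracks occupancy and velocity distributions per cell; as a much-reduced illustration of the underlying Bayesian-update idea only (not the BOF algorithm benchmarked in the paper), a static occupancy-grid cell update in log-odds form, with assumed sensor-model probabilities:

```python
import math

def logodds(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def prob(l):
    """Inverse of logodds: recover the probability from log-odds l."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))

def update_cell(l_prior, hit, p_hit=0.7, p_miss=0.4):
    """Bayesian occupancy update in log-odds form.

    With a uniform 0.5 initial prior (log-odds 0), Bayes' rule reduces
    to adding the inverse sensor model's log-odds to the running sum.
    p_hit/p_miss are assumed sensor-model values, not from the paper.
    """
    return l_prior + logodds(p_hit if hit else p_miss)
```

Working in log-odds turns the per-cell multiplication of Bayes' rule into an addition, which is why the update maps so well onto massively parallel hardware: every cell is one independent fused add.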
|
15
|
Abstract
BACKGROUND BarraCUDA is an open-source C program which uses the BWA algorithm in parallel with NVIDIA CUDA to align short next-generation DNA sequences against a reference genome. Recently its source code was optimised using "Genetic Improvement". RESULTS The genetically improved (GI) code is up to three times faster on short paired-end reads from The 1000 Genomes Project and 60% more accurate on a short BioPlanet.com GCAT alignment benchmark. GPGPU BarraCUDA running on a single K80 Tesla GPU can align short paired-end nextGen sequences up to ten times faster than bwa on a 12-core server. CONCLUSIONS The speed-up was such that the GI version was adopted and has been regularly downloaded from SourceForge for more than 12 months.
Affiliation(s)
- W B Langdon
- Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
- Brian Yee Hong Lam
- University of Cambridge Metabolic Research Laboratories, Addenbrooke's Hospital, Cambridge, UK
|
16
|
Abstract
Computational structure-based protein design (CSPD) is an important problem in computational biology, which aims to design or improve a prescribed protein function based on a protein structure template. It provides a practical tool for real-world protein engineering applications. A popular CSPD method that is guaranteed to find the global minimum energy conformation (GMEC) combines dead-end elimination (DEE) with A* tree search. However, in this framework the A* search can take exponential time in the worst case, which may become the computational bottleneck of large-scale protein design. To address this issue, we extend and add a new module to the OSPREY program previously developed in the Donald lab (Gainza et al., Methods Enzymol 523:87, 2013) that implements a GPU-based massively parallel A* algorithm to improve the protein design pipeline. By exploiting the modern GPU computational framework and optimizing the computation of the A* heuristic function, our new program, called gOSPREY, provides up to four orders of magnitude speedup in large protein design cases with a small memory overhead compared with the traditional A* search implementation, while still guaranteeing optimality. In addition, gOSPREY can be configured to run in a bounded-memory mode to tackle problems in which the conformation space is too large for the global optimal solution to be computed otherwise. Furthermore, the GPU-based A* algorithm in gOSPREY can be combined with state-of-the-art rotamer pruning algorithms such as iMinDEE (Gainza et al., PLoS Comput Biol 8:e1002335, 2012) and DEEPer (Hallen et al., Proteins 81:18-39, 2013) to also consider continuous backbone and side-chain flexibility.
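gOSPREY runs A* over a conformation tree with an energy-based heuristic. To fix the notation, here is generic A* on a toy grid with a Manhattan-distance heuristic; this is an illustrative sketch of the search framework only and is unrelated to gOSPREY's GPU kernels or its protein-specific heuristic:

```python
import heapq

def astar(grid, start, goal):
    """A* shortest path on a 0/1 grid (1 = blocked), 4-connected moves.

    Manhattan distance never overestimates here (it is admissible), so
    the first time the goal is popped its g-value is optimal -- the same
    property the DEE/A* framework relies on for provable optimality.
    """
    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    openq = [(h(start), 0, start)]     # (f = g + h, g, node)
    best_g = {start: 0}
    while openq:
        f, g, node = heapq.heappop(openq)
        if node == goal:
            return g
        if g > best_g.get(node, float("inf")):
            continue                   # stale queue entry
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(openq, (ng + h((nr, nc)), ng, (nr, nc)))
    return None                        # goal unreachable
```

The expensive step in practice is evaluating h for the many children of each expanded node, which is exactly what the paper offloads to the GPU.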
Affiliation(s)
- Yichao Zhou
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, P. R. China
- Bruce R Donald
- Department of Computer Science, Duke University, Durham, NC, USA
- Department of Biochemistry, Duke University Medical Center, Durham, NC, USA
- Jianyang Zeng
- Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, P. R. China
|
17
|
Teodoro G, Kurc T, Andrade G, Kong J, Ferreira R, Saltz J. Application Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs: A Case Study with Microscopy Image Analysis. Int J High Perform Comput Appl 2017; 31:32-51. [PMID: 28239253] [PMCID: PMC5319667] [DOI: 10.1177/1094342015594519]
Abstract
We carry out a comparative performance study of multi-core CPUs, GPUs and the Intel Xeon Phi (Many Integrated Core, MIC) with a microscopy image analysis application. We experimentally evaluate the performance of computing devices on the core operations of the application. We correlate the observed performance with the characteristics of the computing devices and with the data access patterns, computation complexities, and parallelization forms of the operations. The results show significant variability in the performance of operations with respect to the device used. For operations with regular data access, performance on a MIC is comparable to, and sometimes better than, that on a GPU. GPUs are more efficient than MICs for operations that access data irregularly, because of the MIC's lower bandwidth for random data accesses. We propose new performance-aware scheduling strategies that consider variabilities in operation speedups. Our scheduling strategies significantly improve application performance compared to classic strategies in hybrid configurations.
Affiliation(s)
- George Teodoro
- Department of Computer Science, University of Brasília, Brasília, DF, Brazil
- Tahsin Kurc
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Scientific Data Group, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- Guilherme Andrade
- Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
- Jun Kong
- Department of Biomedical Informatics, Emory University, Atlanta, GA, USA
- Renato Ferreira
- Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
- Joel Saltz
- Department of Biomedical Informatics, Stony Brook University, Stony Brook, NY, USA
- Scientific Data Group, Oak Ridge National Laboratory, Oak Ridge, TN, USA
|
18
|
Katouda M, Naruse A, Hirano Y, Nakajima T. Massively parallel algorithm and implementation of RI-MP2 energy calculation for peta-scale many-core supercomputers. J Comput Chem 2016; 37:2623-2633. [PMID: 27634573] [DOI: 10.1002/jcc.24491]
Abstract
A new parallel algorithm and its implementation for the RI-MP2 energy calculation utilizing peta-flop-class many-core supercomputers are presented. Two improvements over the previous algorithm (J. Chem. Theory Comput. 2013, 9, 5373) have been made: (1) a dual-level hierarchical parallelization scheme that enables the use of more than 10,000 Message Passing Interface (MPI) processes and (2) a new data communication scheme that reduces network communication overhead. A multi-node and multi-GPU implementation of the present algorithm is presented for calculations on a central processing unit (CPU)/graphics processing unit (GPU) hybrid supercomputer. Benchmark results of the new algorithm and its implementation using the K computer (CPU clustering system) and TSUBAME 2.5 (CPU/GPU hybrid system) demonstrate high efficiency. A peak performance of 3.1 PFLOPS is attained using 80,199 nodes of the K computer. The peak performance of the multi-node and multi-GPU implementation is 514 TFLOPS using 1349 nodes and 4047 GPUs of TSUBAME 2.5.
Affiliation(s)
- Michio Katouda
- Computational Molecular Science Research Team, RIKEN Advanced Institute for Computational Science, 7-1-26 Minatojima-minami-machi, Chuo-ku, Kobe, 650-0047, Japan
- Akira Naruse
- NVIDIA Corporation, 2-11-7 Akasaka, Minato-ku, Tokyo, 107-0052, Japan
- Yukihiko Hirano
- NVIDIA Corporation, 2-11-7 Akasaka, Minato-ku, Tokyo, 107-0052, Japan
- Takahito Nakajima
- Computational Molecular Science Research Team, RIKEN Advanced Institute for Computational Science, 7-1-26 Minatojima-minami-machi, Chuo-ku, Kobe, 650-0047, Japan
|
19
|
Adamo ME, Gerber SA. Tempest: Accelerated MS/MS Database Search Software for Heterogeneous Computing Platforms. Curr Protoc Bioinformatics 2016; 55:13.29.1-13.29.23. [PMID: 27603022] [PMCID: PMC5736398] [DOI: 10.1002/cpbi.15]
Abstract
MS/MS database search algorithms derive a set of candidate peptide sequences from in silico digest of a protein sequence database, and compute theoretical fragmentation patterns to match these candidates against observed MS/MS spectra. The original Tempest publication described these operations mapped to a CPU-GPU model, in which the CPU (central processing unit) generates peptide candidates that are asynchronously sent to a discrete GPU (graphics processing unit) to be scored against experimental spectra in parallel. The current version of Tempest expands this model, incorporating OpenCL to offer seamless parallelization across multicore CPUs, GPUs, integrated graphics chips, and general-purpose coprocessors. Three protocols describe how to configure and run a Tempest search, including discussion of how to leverage Tempest's unique feature set to produce optimal results.
Affiliation(s)
- Mark E Adamo
- Norris Cotton Cancer Center, Geisel School at Dartmouth, Lebanon, New Hampshire
- Scott A Gerber
- Norris Cotton Cancer Center, Geisel School at Dartmouth, Lebanon, New Hampshire
- Department of Genetics, Geisel School at Dartmouth, Lebanon, New Hampshire
- Department of Biochemistry, Geisel School at Dartmouth, Lebanon, New Hampshire
|
20
|
Kasahara K, Ma B, Goto K, Dasgupta B, Higo J, Fukuda I, Mashimo T, Akiyama Y, Nakamura H. myPresto/omegagene: a GPU-accelerated molecular dynamics simulator tailored for enhanced conformational sampling methods with a non-Ewald electrostatic scheme. Biophys Physicobiol 2016; 13:209-216. [PMID: 27924276] [PMCID: PMC5060096] [DOI: 10.2142/biophysico.13.0_209]
Abstract
Molecular dynamics (MD) is a promising computational approach for investigating the dynamical behavior of molecular systems at the atomic level. Here, we present a new MD simulation engine named "myPresto/omegagene" that is tailored for enhanced conformational sampling methods with a non-Ewald electrostatic potential scheme. Our enhanced conformational sampling methods, e.g., the virtual-system-coupled multi-canonical MD (V-McMD) method, replace a multi-process parallelized run with multiple independent runs to avoid inter-node communication overhead. In addition, adopting the non-Ewald-based zero-multipole summation method (ZMM) makes it possible to eliminate the Fourier-space calculations altogether. The combination of these state-of-the-art techniques enables efficient and accurate calculation of the conformational ensemble at an equilibrium state. Taking advantage of these features, myPresto/omegagene is specialized for single-process execution on a graphics processing unit (GPU). We performed benchmark simulations for the 20-residue peptide Trp-cage with explicit solvent. One of the most thermodynamically stable conformations generated by the V-McMD simulation is very similar to the experimentally solved native conformation. Furthermore, the computation speed is four times faster than that of our previous simulation engine, myPresto/psygene-G. The new simulator, myPresto/omegagene, is freely available at the following URLs: http://www.protein.osaka-u.ac.jp/rcsfp/pi/omegagene/ and http://presto.protein.osaka-u.ac.jp/myPresto4/.
Affiliation(s)
- Kota Kasahara
- College of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan
- Benson Ma
- College of Engineering, University of Illinois, Urbana-Champaign, United States
- Kota Goto
- School of Computing, Tokyo Institute of Technology, Tokyo 152-8550, Japan
- Bhaskar Dasgupta
- Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
- Technology Research Association for Next Generation Natural Products Chemistry, Tokyo 135-0064, Japan
- Junichi Higo
- Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
- Ikuo Fukuda
- Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
- Tadaaki Mashimo
- Technology Research Association for Next Generation Natural Products Chemistry, Tokyo 135-0064, Japan
- Yutaka Akiyama
- School of Computing, Tokyo Institute of Technology, Tokyo 152-8550, Japan
- Molecular Profiling Research Center for Drug Discovery, National Institute of Advanced Industrial Science and Technology, Tokyo 135-0064, Japan
- Haruki Nakamura
- Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan
|
21
|
Abstract
A new Newton–Raphson-based preconditioner for Krylov-type linear equation solvers on GPGPUs is developed, and its performance is investigated. Conventional preconditioners improve the convergence of Krylov-type solvers and perform well on CPUs. However, they do not perform well on GPGPUs, because powerful preconditioners are complex to implement on that architecture. The developed preconditioner is based on the BFGS Hessian-approximation technique, which is well known as a robust and fast nonlinear equation solver. Because the Hessian matrix in BFGS represents, in some sense, the coefficient matrix of a system of linear equations, the approximated Hessian matrix can serve as a preconditioner. On the other hand, BFGS requires storing dense matrices and inverting them, which should be avoided on modern computers and supercomputers. To overcome these disadvantages, we introduce limited-memory BFGS, which requires less memory space and less computational effort than BFGS. In addition, limited-memory BFGS can be implemented with BLAS libraries, which are well optimized for target architectures. Because the Hessian approximation improves as the Krylov iterations proceed, the preconditioning matrix varies across iterations, and only flexible Krylov solvers can work with the developed preconditioner. The GCR method, a flexible Krylov solver, is employed because of its prevalence as a Krylov solver with a variable preconditioner. The performance investigation indicates the following benefits of the new preconditioner: (1) it is robust, i.e., it converges where conventional preconditioners (diagonal scaling and SSOR) fail; (2) in the best-case scenarios, it is over 10 times faster than conventional preconditioners on a CPU; and (3) because it requires only simple operations, it performs well on a GPGPU.
In addition, the research has confirmed that the new preconditioner improves the condition of matrices from a mathematical point of view by calculating the condition numbers of preconditioned matrices, as anticipated by the theoretical analysis.
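The heart of limited-memory BFGS is the two-loop recursion, which applies the inverse-Hessian approximation to a vector using only stored (s, y) pairs and dot products/AXPYs, the BLAS-friendly operations the abstract refers to. A minimal pure-Python sketch of that recursion (illustrative only, not the paper's GPGPU preconditioner):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def axpy(alpha, x, y):
    """Return alpha*x + y elementwise."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def lbfgs_apply(pairs, g):
    """Two-loop recursion: apply the L-BFGS inverse-Hessian approximation
    built from (s_k, y_k) = (x_{k+1}-x_k, grad_{k+1}-grad_k) pairs to g.

    The result approximates H^{-1} g and can serve as a preconditioned
    residual. Only vectors are stored; no dense matrix is formed.
    """
    q = list(g)
    alphas = []
    for s, y in reversed(pairs):          # newest pair first
        rho = 1.0 / dot(y, s)
        a = rho * dot(s, q)
        alphas.append(a)
        q = axpy(-a, y, q)
    if pairs:                             # initial scaling H0 = (s.y/y.y) I
        s, y = pairs[-1]
        gamma = dot(s, y) / dot(y, y)
        q = [gamma * qi for qi in q]
    for (s, y), a in zip(pairs, reversed(alphas)):  # oldest pair first
        rho = 1.0 / dot(y, s)
        b = rho * dot(y, q)
        q = axpy(a - b, s, q)
    return q
```

For a quadratic system y_k = A s_k, so with pairs spanning the space the recursion reproduces A^{-1} g exactly on that span, which is what makes it useful as a preconditioner.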
|
22
|
Sayyid F, Kalvala S. On the importance of modelling the internal spatial dynamics of biological cells. Biosystems 2016; 145:53-66. [PMID: 27262415] [DOI: 10.1016/j.biosystems.2016.05.012]
Abstract
Spatial effects such as cell shape have very often been considered negligible in models of cellular pathways, and many existing simulation infrastructures do not take such effects into consideration. Recent experimental results are reversing this judgement by showing that very small spatial variations can make a big difference in the fate of a cell. This is particularly the case when considering eukaryotic cells, which have a complex physical structure and many subtle control mechanisms, but bacteria are also interesting for the huge variation in shape both between species and in different phases of their lifecycle. In this work we perform simulations that measure the effect of three common bacterial shapes on the behaviour of model cellular pathways. To perform these experiments we develop ReDi-Cell, a highly scalable GPGPU cell simulation infrastructure for the modelling of cellular pathways in spatially detailed environments. ReDi-Cell is validated against known-good simulations, prior to its use in new work. We then use ReDi-Cell to conduct novel experiments that demonstrate the effect that three common bacterial shapes (Cocci, Bacilli and Spirilli) have on the behaviour of model cellular pathways. Pathway wavefront shape, pathway concentration gradients, and chemical species distribution are measured in the three different shapes. We also quantify the impact of internal cellular clutter on the same pathways. Through this work we show that variations in the shape or configuration of these common cell shapes alter model cell behaviour.
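Simulators such as ReDi-Cell advance reaction-diffusion pathways on a spatial grid. The building block is an explicit finite-difference diffusion update; here is a 1-D sketch under assumed parameters (this illustrates the numerical scheme generically, not ReDi-Cell's GPGPU code):

```python
def diffuse_step(c, d=0.2):
    """One explicit Euler step of 1-D diffusion with reflecting boundaries.

    c is a list of concentrations per grid cell; d = D*dt/dx^2 is the
    dimensionless diffusion number (must be <= 0.5 for stability).
    Reflecting (no-flux) boundaries mean total mass is conserved.
    """
    n = len(c)
    out = c[:]
    for i in range(n):
        left = c[i - 1] if i > 0 else c[i]        # mirror ghost cell
        right = c[i + 1] if i < n - 1 else c[i]   # mirror ghost cell
        out[i] = c[i] + d * (left - 2 * c[i] + right)
    return out
```

Each output cell depends only on its immediate neighbours, so on a GPU every cell (or voxel, in 3-D) can be updated by an independent thread; cell shape enters simply as which grid cells are inside the membrane.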
Affiliation(s)
- Faiz Sayyid
- Department of Computer Science, University of Warwick, Coventry, West Midlands, United Kingdom
- Sara Kalvala
- Department of Computer Science, University of Warwick, Coventry, West Midlands, United Kingdom
|
23
|
Hatt CR, Speidel MA, Raval AN. Real-time pose estimation of devices from x-ray images: Application to x-ray/echo registration for cardiac interventions. Med Image Anal 2016; 34:101-108. [PMID: 27179366] [DOI: 10.1016/j.media.2016.04.008]
Abstract
In recent years, registration between x-ray fluoroscopy (XRF) and transesophageal echocardiography (TEE) has been rapidly developed, validated, and translated to the clinic as a tool for advanced image guidance of structural heart interventions. This technology relies on accurate pose-estimation of the TEE probe via standard 2D/3D registration methods. It has been shown that latencies caused by slow registrations can result in errors during untracked frames, and a real-time (>15 Hz) tracking algorithm is needed to minimize these errors. This paper presents two novel similarity metrics designed for accurate, robust, and extremely fast pose-estimation of devices from XRF images: Direct Splat Correlation (DSC) and Patch Gradient Correlation (PGC). Both metrics were implemented in CUDA C, and validated on simulated and clinical datasets against prior methods presented in the literature. It was shown that by combining DSC and PGC in a hybrid method (HYB), target registration errors comparable to previously reported methods were achieved, but at much higher speeds and lower failure rates. In simulated datasets, the proposed HYB method achieved a median projected target registration error (pTRE) of 0.33 mm and a mean registration frame-rate of 12.1 Hz, while previously published methods produced median pTREs greater than 1.5 mm and mean registration frame-rates less than 4 Hz. In clinical datasets, the HYB method achieved a median pTRE of 1.1 mm and a mean registration frame-rate of 20.5 Hz, while previously published methods produced median pTREs greater than 1.3 mm and mean registration frame-rates less than 12 Hz. The proposed hybrid method also had much lower failure rates than previously published methods.
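DSC and PGC are correlation-style similarity metrics between a rendering of the device model and the X-ray image. As a baseline for what "correlation metric" means in this setting, plain normalized cross-correlation over a flattened intensity patch (a generic sketch only; the paper's DSC and PGC metrics are substantially different and GPU-specific):

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length intensity lists.

    Returns 1.0 when the patches are identical up to a positive affine
    intensity change, -1.0 for a perfect inverted match, and 0.0 when
    either patch is constant (no variance to correlate).
    """
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db) if da and db else 0.0
```

Pose estimation then amounts to searching over the 6-DOF pose for the rendering that maximizes the similarity score, which is why a fast GPU evaluation of the metric dominates the frame rate.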
Affiliation(s)
- Charles R Hatt
- University of Wisconsin - Madison, Department of Medical Physics, 1111 Highland Ave, Rm 1005, Madison, WI 53705, United States
- Michael A Speidel
- University of Wisconsin - Madison, Department of Medical Physics, United States
- Amish N Raval
- University of Wisconsin - Madison, School of Medicine and Public Health, Division of Cardiovascular Medicine, United States
|
24
|
Li H, Yu D, Kumar A, Tu YC. Performance Modeling in CUDA Streams - A Means for High-Throughput Data Processing. Proc IEEE Int Conf Big Data 2015; 2014:301-310. [PMID: 26566545] [DOI: 10.1109/bigdata.2014.7004245]
Abstract
A push-based database management system (DBMS) is a new type of data-processing software that streams large volumes of data to concurrent query operators. The high data rate of such systems requires large computing power from the query engine. In our previous work, we built a push-based DBMS named G-SDMS to harness the unrivaled computational capabilities of modern GPUs. A major design goal of G-SDMS is to support concurrent processing of heterogeneous query operations and to enable resource allocation among them. Understanding the performance of operations as a function of resource consumption is thus a prerequisite for the design of G-SDMS. With NVIDIA's CUDA framework as the system implementation platform, we present our recent work on performance modeling of CUDA kernels running concurrently under a runtime mechanism named CUDA streams. Specifically, we explore the connection between performance and resource occupancy of compute-bound kernels and develop a model that can predict the performance of such kernels. Furthermore, we provide an in-depth anatomy of the CUDA stream mechanism and summarize its main kernel scheduling disciplines. Our models and derived scheduling disciplines are verified by extensive experiments using synthetic and real-world CUDA kernels.
|
25
|
Rajchl M, Baxter JS, McLeod AJ, Yuan J, Qiu W, Peters TM, Khan AR. Hierarchical max-flow segmentation framework for multi-atlas segmentation with Kohonen self-organizing map based Gaussian mixture modeling. Med Image Anal 2016; 27:45-56. [PMID: 26072170] [DOI: 10.1016/j.media.2015.05.005]
Abstract
The incorporation of intensity, spatial, and topological information into large-scale multi-region segmentation has been a topic of ongoing research in medical image analysis. Multi-region segmentation problems, such as segmentation of brain structures, pose unique challenges in that regions may not have a distinct intensity, spatial, or topological signature, but rely on a combination of the three. We propose a novel framework within the Advanced Segmentation Tools (ASETS), which combines large-scale Gaussian mixture models trained via Kohonen self-organizing maps with deformable registration and a convex max-flow optimization algorithm that incorporates region topology as a hierarchy or tree. Our framework is validated on two publicly available neuroimaging datasets, the OASIS and MRBrainS13 databases, against the more conventional Potts model, achieving more accurate segmentations. Each component is accelerated using general-purpose programming on graphics processing units to ensure computational feasibility.
|
26
|
Jing Y, Zeng W, Wang N, Ren T, Shi Y, Yin J, Xu Q. GPU-based parallel group ICA for functional magnetic resonance data. Comput Methods Programs Biomed 2015; 119:9-16. [PMID: 25704870] [DOI: 10.1016/j.cmpb.2015.02.002]
Abstract
The goal of our study is to develop a fast parallel implementation of group independent component analysis (ICA) for functional magnetic resonance imaging (fMRI) data using graphics processing units (GPU). Although ICA has become a standard method for identifying brain functional connectivity in fMRI data, it is computationally intensive, especially for group data analysis. GPUs, which offer high parallel computation power at low cost, are widely used for general-purpose computing and can contribute significantly to fMRI data analysis. In this study, a parallel group ICA (PGICA) on GPU, consisting mainly of GPU-based PCA using SVD and Infomax-ICA, is presented. In comparison to serial group ICA, the proposed method demonstrated a 6-11-fold speedup with comparable accuracy of the functional networks in our experiments. The proposed method is expected to enable real-time post-processing for fMRI data analysis.
Affiliation(s)
- Yanshan Jing
- Lab of Digital Image and Intelligent Computation, Shanghai Maritime University, Shanghai 201306, China
- Weiming Zeng
- Lab of Digital Image and Intelligent Computation, Shanghai Maritime University, Shanghai 201306, China
- Nizhuan Wang
- Lab of Digital Image and Intelligent Computation, Shanghai Maritime University, Shanghai 201306, China
- Tianlong Ren
- Lab of Digital Image and Intelligent Computation, Shanghai Maritime University, Shanghai 201306, China
- Yingchao Shi
- Lab of Digital Image and Intelligent Computation, Shanghai Maritime University, Shanghai 201306, China
- Jun Yin
- Lab of Digital Image and Intelligent Computation, Shanghai Maritime University, Shanghai 201306, China
- Qi Xu
- Lab of Digital Image and Intelligent Computation, Shanghai Maritime University, Shanghai 201306, China
|
27
|
Sumiyoshi K, Hirata K, Hiroi N, Funahashi A. Acceleration of discrete stochastic biochemical simulation using GPGPU. Front Physiol 2015; 6:42. [PMID: 25762936] [PMCID: PMC4327578] [DOI: 10.3389/fphys.2015.00042]
Abstract
For systems made up of a small number of molecules, such as a biochemical network in a single cell, a simulation requires a stochastic approach, instead of a deterministic approach. The stochastic simulation algorithm (SSA) simulates the stochastic behavior of a spatially homogeneous system. Since stochastic approaches produce different results each time they are used, multiple runs are required in order to obtain statistical results; this results in a large computational cost. We have implemented a parallel method for using SSA to simulate a stochastic model; the method uses a graphics processing unit (GPU), which enables multiple realizations at the same time, and thus reduces the computational time and cost. During the simulation, for the purpose of analysis, each time course is recorded at each time step. A straightforward implementation of this method on a GPU is about 16 times faster than a sequential simulation on a CPU with hybrid parallelization; each of the multiple simulations is run simultaneously, and the computational tasks within each simulation are parallelized. We also implemented an improvement to the memory access and reduced the memory footprint, in order to optimize the computations on the GPU. We also implemented an asynchronous data transfer scheme to accelerate the time course recording function. To analyze the acceleration of our implementation on various sizes of model, we performed SSA simulations on different model sizes and compared these computation times to those for sequential simulations with a CPU. When used with the improved time course recording function, our method was shown to accelerate the SSA simulation by a factor of up to 130.
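The paper's kernels run many SSA realizations in parallel, one per GPU thread; each realization executes the serial Gillespie direct method. In outline (a generic sketch of the direct method, not the paper's code; the reaction system below is an assumed example):

```python
import random

def ssa_direct(x, rates, stoich, t_end, rng):
    """Gillespie direct method for a well-mixed system.

    x: initial species counts; rates[j]: propensity function of the
    state for reaction j; stoich[j]: state-change vector of reaction j.
    Each step draws an exponential waiting time with rate a0 (the total
    propensity) and picks a reaction with probability a_j / a0.
    """
    t = 0.0
    x = list(x)
    while True:
        props = [r(x) for r in rates]
        a0 = sum(props)
        if a0 == 0.0:                  # no reaction can fire: absorbed
            return t, x
        t += rng.expovariate(a0)       # exponential waiting time
        if t > t_end:
            return t_end, x
        u = rng.random() * a0          # roulette-wheel reaction choice
        acc = 0.0
        for j, a in enumerate(props):
            acc += a
            if u < acc:
                for k, dv in enumerate(stoich[j]):
                    x[k] += dv
                break
```

Because realizations are independent, launching thousands of them with different RNG streams is embarrassingly parallel; the engineering effort in the paper goes into memory layout and recording the time courses efficiently.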
Affiliation(s)
- Kei Sumiyoshi
- Systems Biology Laboratory, Department of Biosciences and Informatics, Keio University, Yokohama, Japan
- Kazuki Hirata
- Systems Biology Laboratory, Department of Biosciences and Informatics, Keio University, Yokohama, Japan
- Noriko Hiroi
- Systems Biology Laboratory, Department of Biosciences and Informatics, Keio University, Yokohama, Japan
- Akira Funahashi
- Systems Biology Laboratory, Department of Biosciences and Informatics, Keio University, Yokohama, Japan
|
28
|
Teodoro G, Pan T, Kurc T, Kong J, Cooper L, Klasky S, Saltz J. Region Templates: Data Representation and Management for High-Throughput Image Analysis. Parallel Comput 2014; 40:589-610. [PMID: 26139953] [PMCID: PMC4484879] [DOI: 10.1016/j.parco.2014.09.003]
Abstract
We introduce a region template abstraction and framework for the efficient storage, management and processing of common data types in analysis of large datasets of high resolution images on clusters of hybrid computing nodes. The region template abstraction provides a generic container template for common data structures, such as points, arrays, regions, and object sets, within a spatial and temporal bounding box. It allows for different data management strategies and I/O implementations, while providing a homogeneous, unified interface to applications for data storage and retrieval. A region template application is represented as a hierarchical dataflow in which each computing stage may be represented as another dataflow of finer-grain tasks. The execution of the application is coordinated by a runtime system that implements optimizations for hybrid machines, including performance-aware scheduling for maximizing the utilization of computing devices and techniques to reduce the impact of data transfers between CPUs and GPUs. An experimental evaluation on a state-of-the-art hybrid cluster using a microscopy imaging application shows that the abstraction adds negligible overhead (about 3%) and achieves good scalability and high data transfer rates. Optimizations in a high speed disk based storage implementation of the abstraction to support asynchronous data transfers and computation result in an application performance gain of about 1.13×. Finally, a processing rate of 11,730 4K×4K tiles per minute was achieved for the microscopy imaging application on a cluster with 100 nodes (300 GPUs and 1,200 CPU cores). This computation rate enables studies with very large datasets.
Affiliation(s)
- George Teodoro
- Department of Computer Science, University of Brasília, Brasília, DF, Brazil
- Tony Pan
- Biomedical Informatics Department, Emory University, Atlanta, GA, USA
- Tahsin Kurc
- Biomedical Informatics Department, Stony Brook University, Stony Brook, NY, USA
- Scientific Data Group, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- Jun Kong
- Biomedical Informatics Department, Emory University, Atlanta, GA, USA
- Lee Cooper
- Biomedical Informatics Department, Emory University, Atlanta, GA, USA
- Scott Klasky
- Scientific Data Group, Oak Ridge National Laboratory, Oak Ridge, TN, USA
- Joel Saltz
- Biomedical Informatics Department, Stony Brook University, Stony Brook, NY, USA
29
Szénási S. Segmentation of colon tissue sample images using multiple graphics accelerators. Comput Biol Med 2014; 51:93-103. [PMID: 24893331] [DOI: 10.1016/j.compbiomed.2014.05.002]
Abstract
Nowadays, medical images are increasingly processed with digital imagery and custom software solutions. The distributed algorithm presented in this paper detects special tissue parts, the nuclei, in haematoxylin-and-eosin-stained colon tissue sample images. The main aim of this work is the development of a new data-parallel region growing algorithm that can be implemented even in an environment using multiple video accelerators. This new method has three levels of parallelism: (a) the parallel region growing itself, (b) running multiple region growings concurrently on one device, and (c) using more than one accelerator. We use a split-and-merge technique based on our existing data-parallel cell nuclei segmentation algorithm, extended with a fast, backtracking-based, non-overlapping cell filter method. This extension does not cause significant degradation of accuracy; the results are practically the same as those of the original sequential region growing method. However, as expected, using more devices usually means that less time is needed to process the tissue image; with a configuration of one central processing unit and two graphics cards, the average speed-up is about 4-6×. The implemented algorithm has the additional advantage of efficiently processing very large images with high memory requirements.
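The first level of parallelism, synchronous data-parallel region growing, can be sketched in NumPy: each iteration expands the entire frontier at once, and the per-pixel tests are independent, which is what maps onto a GPU. The function name and the simple intensity-tolerance criterion are assumptions, not the authors' actual segmentation rule:

```python
import numpy as np

def region_grow(img, seeds, tol):
    """Synchronous data-parallel region growing on a 2D image:
    every frontier pixel is examined in the same step."""
    grown = np.zeros(img.shape, dtype=bool)
    grown[tuple(np.array(seeds).T)] = True
    while True:
        # dilate the current region by one 4-connected step
        frontier = np.zeros_like(grown)
        frontier[1:, :] |= grown[:-1, :]
        frontier[:-1, :] |= grown[1:, :]
        frontier[:, 1:] |= grown[:, :-1]
        frontier[:, :-1] |= grown[:, 1:]
        frontier &= ~grown
        # accept frontier pixels whose intensity is within tolerance
        # of the first seed (illustrative homogeneity criterion)
        accept = frontier & (np.abs(img - img[seeds[0]]) <= tol)
        if not accept.any():
            return grown
        grown |= accept
```

Running many such growings from different seeds, and on different devices, gives the paper's second and third levels of parallelism.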
Affiliation(s)
- Sándor Szénási
- John von Neumann Faculty of Informatics, Óbuda University, 96/B Bécsi út, H-1034 Budapest, Hungary.
30
Mafi R, Sirouspour S. GPU-based acceleration of computations in nonlinear finite element deformation analysis. Int J Numer Method Biomed Eng 2014; 30:365-381. [PMID: 24166875] [DOI: 10.1002/cnm.2607]
Abstract
The physics of deformation for biological soft tissue is best described by nonlinear continuum mechanics-based models, which can then be discretized by the FEM for a numerical solution. However, the computational complexity of such models has limited their use in applications requiring real-time or fast response. In this work, we propose a graphics processing unit-based implementation of the FEM using implicit time integration for dynamic nonlinear deformation analysis. This is the most general formulation of the deformation analysis: it is valid for large deformations and strains and can account for material nonlinearities. The data-parallel nature and intense arithmetic computations of nonlinear FEM equations make them particularly suitable for implementation on a parallel computing platform such as a graphics processing unit. In this work, we present and compare two different designs based on the matrix-free and conventional preconditioned conjugate gradients algorithms for solving the FEM equations arising in deformation analysis. The speedup achieved with the proposed parallel implementations of the algorithms will be instrumental in the development of advanced surgical simulators and medical image registration methods involving soft-tissue deformation.
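The matrix-free variant the authors compare can be sketched as a Jacobi-preconditioned conjugate gradients loop in which the system matrix is only available through a matrix-vector product, so it is never assembled; this generic NumPy version is a stand-in for their GPU implementation, with all names assumed:

```python
import numpy as np

def pcg(apply_A, b, diag_A, tol=1e-10, max_iter=200):
    """Jacobi-preconditioned conjugate gradients where the SPD
    system matrix is accessed only via apply_A(v) -> A @ v, as in
    a matrix-free FEM solver."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    z = r / diag_A          # Jacobi (diagonal) preconditioner
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = r / diag_A
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# toy SPD system standing in for a FEM stiffness matrix
A = np.array([[4.0, 1.0], [1.0, 3.0]])
x = pcg(lambda v: A @ v, np.array([1.0, 2.0]), np.diag(A))
```

On a GPU, `apply_A` becomes a per-element kernel, which avoids storing and reading a large sparse matrix.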
Affiliation(s)
- Ramin Mafi
- McMaster University, 1280 Main St. W, Hamilton, ON, Canada, L8S 4K1
31
Martínez-Zarzuela M, Gómez C, Díaz-Pernas FJ, Fernández A, Hornero R. Cross-Approximate Entropy parallel computation on GPUs for biomedical signal analysis. Application to MEG recordings. Comput Methods Programs Biomed 2013; 112:189-199. [PMID: 23915803] [DOI: 10.1016/j.cmpb.2013.07.005]
Abstract
Cross-Approximate Entropy (Cross-ApEn) is a useful measure to quantify the statistical dissimilarity of two time series. Despite the advantage of Cross-ApEn over its one-dimensional counterpart (Approximate Entropy), only a few studies have applied it to biomedical signals, mainly because of its high computational cost. In this paper, we propose a fast GPU-based implementation of Cross-ApEn that makes its use over a large amount of multidimensional data feasible. The scheme followed is fully scalable, thus maximizing the use of the GPU regardless of the number of neural signals being processed. The approach consists in processing many trials or epochs simultaneously, independently of their origin. In the case of MEG data, these trials can come from different input channels or subjects. The proposed implementation achieves an average speedup greater than 250× over a CPU parallel version running on a six-core processor. A dataset of 30 subjects containing 148 MEG channels (49 epochs of 1024 samples per channel) can be analyzed using our development in about 30 min. The same processing takes 5 days on six cores and 15 days on a single core. The speedup is much larger when compared to a basic sequential Matlab® implementation, which would need 58 days per subject. To our knowledge, this is the first contribution of Cross-ApEn measure computation using GPUs. This study demonstrates that this hardware is, to date, the best option for the signal processing of biomedical data with Cross-ApEn.
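A common formulation of Cross-ApEn can be sketched with the all-pairs template distances computed in one vectorized step, which is precisely the part that parallelizes well on a GPU. This single-pair NumPy version follows the usual (m, r) parameterization and is not the authors' CUDA code:

```python
import numpy as np

def cross_apen(u, v, m=2, r=0.2):
    """Cross-ApEn(m, r) of series v with respect to u, with the
    all-pairs Chebyshev distance matrix built in one shot."""
    def phi(k):
        # overlapping templates of length k from each series
        x = np.lib.stride_tricks.sliding_window_view(u, k)
        y = np.lib.stride_tricks.sliding_window_view(v, k)
        # Chebyshev distance between every template pair at once
        d = np.abs(x[:, None, :] - y[None, :, :]).max(axis=2)
        # fraction of v-templates within r of each u-template
        c = (d <= r).mean(axis=1)
        return np.log(c).mean()
    return phi(m) - phi(m + 1)
```

In the GPU setting, many such pairs (channels, epochs, subjects) are processed simultaneously, which is where the reported 250× speedup comes from.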
Affiliation(s)
- Mario Martínez-Zarzuela
- Imaging and Telematics Group, E.T.S. Ingenieros de Telecomunicación, University of Valladolid, Paseo de Belén 15, 47011 Valladolid, Spain.
32
Franke R, Ivanova G. FALCON or how to compute measures time efficiently on dynamically evolving dense complex networks? J Biomed Inform 2013; 47:62-70. [PMID: 24060602] [DOI: 10.1016/j.jbi.2013.09.005]
Abstract
A large number of topics in biology, medicine, neuroscience, psychology and sociology can be generally described via complex networks in order to investigate fundamental questions of structure, connectivity, information exchange and causality. Research on biological networks, such as functional spatiotemporal brain activations and the changes caused by neuropsychiatric pathologies, is especially promising. When analyzing these so-called complex networks, the calculation of meaningful measures can be very time-consuming, depending on their size and structure. Even worse, in many labs only standard desktop computers are available to perform those calculations. Numerous investigations on complex networks concern huge but sparsely connected network structures, where most network nodes are connected to only a few others, and several libraries are available to tackle this kind of network. A problem arises when not just a few big and sparse networks have to be analyzed, but hundreds or thousands of smaller and conceivably dense networks (e.g. when measuring brain activation over time). Then every minute per network is crucial, and it is not sufficient to apply standard algorithms to dense graph characteristics. This article introduces the new library FALCON, developed especially for the exploration of dense complex networks. Currently, it offers 12 different measures (such as clustering coefficients), each for undirected-unweighted, undirected-weighted and directed-unweighted networks. It uses a multi-core approach in combination with comprehensive code and hardware optimizations, and there is an alternative massively parallel GPU implementation for the most time-consuming measures. Finally, a comparative benchmark is integrated to support the choice of the most suitable library for a particular network issue.
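Why dense networks reward different algorithms can be seen with the clustering coefficient: on a dense graph it can be cast as dense matrix products instead of per-node neighbour loops. The sketch below illustrates that reformulation in NumPy; it is not FALCON's actual implementation:

```python
import numpy as np

def clustering_coefficients(A):
    """Per-node clustering coefficient of an undirected, unweighted
    network, computed with dense matrix products - a BLAS-friendly
    formulation that pays off when the adjacency matrix is dense."""
    A = np.asarray(A, dtype=float)
    # diag(A^3) counts closed walks of length 3: 2 per triangle
    triangles = np.diag(A @ A @ A) / 2.0
    k = A.sum(axis=1)                    # node degrees
    possible = k * (k - 1) / 2.0         # connected neighbour pairs
    with np.errstate(divide="ignore", invalid="ignore"):
        c = np.where(possible > 0, triangles / possible, 0.0)
    return c
```

The same trick does not help a sparse graph, where neighbour-list traversal beats an O(n³) matrix product; hence the library's dense-network focus.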
Affiliation(s)
- R Franke
- Institute of Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany.
- G Ivanova
- Institute of Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany.
33
Teodoro G, Pan T, Kurc TM, Kong J, Cooper LAD, Podhorszki N, Klasky S, Saltz JH. High-throughput Analysis of Large Microscopy Image Datasets on CPU-GPU Cluster Platforms. Proc IPDPS (Conf) 2013; 2013:103-114. [PMID: 25419546] [PMCID: PMC4240318] [DOI: 10.1109/ipdps.2013.11]
Abstract
Analysis of large pathology image datasets offers significant opportunities for the investigation of disease morphology, but the resource requirements of analysis pipelines limit the scale of such studies. Motivated by a brain cancer study, we propose and evaluate a parallel image analysis application pipeline for high throughput computation of large datasets of high resolution pathology tissue images on distributed CPU-GPU platforms. To achieve efficient execution on these hybrid systems, we have built runtime support that allows us to express the cancer image analysis application as a hierarchical data processing pipeline. The application is implemented as a coarse-grain pipeline of stages, where each stage may be further partitioned into another pipeline of fine-grain operations. The fine-grain operations are efficiently managed and scheduled for computation on CPUs and GPUs using performance-aware scheduling techniques along with several optimizations, including architecture-aware process placement, data-locality-conscious task assignment, data prefetching, and asynchronous data copy. These optimizations are employed to maximize the utilization of the aggregate computing power of CPUs and GPUs and minimize data copy overheads. Our experimental evaluation shows that the cooperative use of CPUs and GPUs achieves significant improvements on top of GPU-only versions (up to 1.6×) and that the execution of the application as a set of fine-grain operations provides more opportunities for runtime optimizations and attains better performance than the coarser-grain, monolithic implementations used in other works. An implementation of the cancer image analysis pipeline using the runtime support was able to process an image dataset consisting of 36,848 4K×4K-pixel image tiles (about 1.8TB uncompressed) in less than 4 minutes (150 tiles/second) on 100 nodes of a state-of-the-art hybrid cluster system.
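The performance-aware scheduling idea can be illustrated with a tiny earliest-finish-time heuristic; the stage names and per-device timings below are invented, and the real runtime additionally accounts for data locality, prefetching, and asynchronous copies:

```python
def schedule(times):
    """Greedy earliest-finish-time assignment of fine-grain
    operations to heterogeneous devices; times[op][dev] is the
    operation's execution time on that device."""
    # every operation lists the same set of devices
    free_at = {dev: 0.0 for dev in next(iter(times.values()))}
    plan = {}
    for op, costs in times.items():
        # place the operation where it would finish soonest
        dev = min(costs, key=lambda d: free_at[d] + costs[d])
        free_at[dev] += costs[dev]
        plan[op] = dev
    return plan, max(free_at.values())

# hypothetical per-stage timings (seconds) for one CPU and one GPU
times = {
    "segment": {"cpu": 10.0, "gpu": 2.0},
    "features": {"cpu": 3.0, "gpu": 1.0},
    "classify": {"cpu": 2.0, "gpu": 4.0},
}
plan, makespan = schedule(times)
```

Note the heuristic sends GPU-friendly stages to the GPU but keeps work on the CPU when that finishes sooner, which is the intuition behind the cooperative CPU-GPU gains reported above.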
Affiliation(s)
- George Teodoro
- Center for Comprehensive Informatics, Emory University, Atlanta, GA
- Tony Pan
- Scientific Data Group, Oak Ridge National Laboratory, Oak Ridge, TN
34
Teodoro G, Pan T, Kurc T, Kong J, Cooper L, Saltz J. Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines. Parallel Comput 2013; 39:189-211. [PMID: 23908562] [PMCID: PMC3727669] [DOI: 10.1016/j.parco.2013.03.001]
Abstract
We address the problem of efficient execution of a computation pattern, referred to here as the irregular wavefront propagation pattern (IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in several image processing operations. In the IWPP, data elements in the wavefront propagate waves to their neighboring elements on a grid if a propagation condition is satisfied. Elements receiving the propagated waves become part of the wavefront. This pattern results in irregular data accesses and computations. We develop and evaluate strategies for efficient computation and propagation of wavefronts using a multi-level queue structure. This queue structure improves the utilization of fast memories in a GPU and reduces synchronization overheads. We also develop a tile-based parallelization strategy to support execution on multiple CPUs and GPUs. We evaluate our approaches on a state-of-the-art GPU accelerated machine (equipped with 3 GPUs and 2 multicore CPUs) using the IWPP implementations of two widely used image processing operations: morphological reconstruction and Euclidean distance transform. Our results show significant performance improvements on GPUs. The use of multiple CPUs and GPUs cooperatively attains speedups of 50× and 85× with respect to single core CPU executions for morphological reconstruction and Euclidean distance transform, respectively.
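The propagation condition can be made concrete with grayscale morphological reconstruction, one of the two operations evaluated. The sketch below is a sequential, single-queue rendering in Python; the paper's contribution replaces this single queue with a multi-level queue structure suited to GPU memories:

```python
from collections import deque
import numpy as np

def morph_reconstruct(marker, mask):
    """Grayscale morphological reconstruction by dilation as an
    irregular wavefront: active pixels push values to 4-neighbours
    while the propagation condition (value raised, mask respected)
    holds; newly raised pixels join the wavefront."""
    out = np.minimum(marker, mask).astype(float)
    h, w = out.shape
    # seed the wavefront with every pixel (simple but correct)
    q = deque((i, j) for i in range(h) for j in range(w))
    while q:
        i, j = q.popleft()
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < h and 0 <= nj < w:
                # propagate only if it raises the neighbour
                # without exceeding the mask
                val = min(out[i, j], mask[ni, nj])
                if val > out[ni, nj]:
                    out[ni, nj] = val
                    q.append((ni, nj))
    return out
```

The irregularity is visible here: which pixels are active, and for how long, depends entirely on the data, which is why a naive GPU mapping wastes threads and the queue structure matters.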