1. Kobayashi C, Jung J, Matsunaga Y, Mori T, Ando T, Tamura K, Kamiya M, Sugita Y. GENESIS 1.1: A hybrid-parallel molecular dynamics simulator with enhanced sampling algorithms on multiple computational platforms. J Comput Chem 2017; 38:2193-2206. PMID: 28718930. DOI: 10.1002/jcc.24874.
Abstract
GENeralized-Ensemble SImulation System (GENESIS) is a software package for molecular dynamics (MD) simulation of biological systems. It is designed to push the limits of system size and accessible time scale by adopting highly parallelized schemes and enhanced conformational sampling algorithms. In this new version, GENESIS 1.1, new functions and advanced algorithms have been added. The all-atom and coarse-grained potential energy functions used in the AMBER and GROMACS packages are now available in addition to the CHARMM energy functions. The performance of MD simulations has been greatly improved by further optimization, multiple time-step integration, and hybrid (CPU + GPU) computing. The string method and replica-exchange umbrella sampling with flexible collective-variable choice are used for finding the minimum free-energy pathway and obtaining free-energy profiles for conformational changes of a macromolecule. These new features increase the usefulness and power of GENESIS for modeling and simulation in biological research. © 2017 Wiley Periodicals, Inc.
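The multiple time-step idea mentioned in this abstract, evaluating slowly varying forces less often than fast ones, can be sketched independently of GENESIS. The toy r-RESPA-style integrator below uses an invented harmonic force split (`f_fast`, `f_slow`); it illustrates the scheme only and is not GENESIS code:

```python
import numpy as np

def respa_step(x, v, f_fast, f_slow, dt_outer, n_inner, m=1.0):
    """One outer step of a simple r-RESPA integrator: the slow force
    gets half-kicks at the outer time step, while the fast force is
    integrated with velocity Verlet at dt_outer / n_inner."""
    dt_inner = dt_outer / n_inner
    v = v + 0.5 * dt_outer * f_slow(x) / m      # slow half-kick
    for _ in range(n_inner):                    # fast inner loop
        v = v + 0.5 * dt_inner * f_fast(x) / m
        x = x + dt_inner * v
        v = v + 0.5 * dt_inner * f_fast(x) / m
    v = v + 0.5 * dt_outer * f_slow(x) / m      # slow half-kick
    return x, v

# Toy system: a stiff spring (fast) plus a weak spring (slow).
k_fast, k_slow = 100.0, 1.0
f_fast = lambda x: -k_fast * x
f_slow = lambda x: -k_slow * x

x, v = 1.0, 0.0
for _ in range(1000):
    x, v = respa_step(x, v, f_fast, f_slow, dt_outer=0.02, n_inner=10)
```

Because the splitting is symplectic, the total energy stays close to its initial value even though the slow force is evaluated ten times less often than the fast one.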
2. Wittek A, Joldes G, Couton M, Warfield SK, Miller K. Patient-specific non-linear finite element modelling for predicting soft organ deformation in real-time: application to non-rigid neuroimage registration. Prog Biophys Mol Biol 2010; 103:292-303. PMID: 20868706. PMCID: PMC3107968. DOI: 10.1016/j.pbiomolbio.2010.09.001.
Abstract
Long computation times of non-linear (i.e. accounting for geometric and material non-linearity) biomechanical models have been regarded as one of the key factors preventing application of such models in predicting organ deformation for image-guided surgery. This contribution presents real-time patient-specific computation of the deformation field within the brain for six cases of brain shift induced by craniotomy (i.e. surgical opening of the skull), using specialised non-linear finite element procedures implemented on a graphics processing unit (GPU). In contrast to commercial finite element codes, which rely on an updated Lagrangian formulation and implicit integration in the time domain for steady-state solutions, our procedures utilise the total Lagrangian formulation with explicit time stepping and dynamic relaxation. We used patient-specific finite element meshes consisting of hexahedral and non-locking tetrahedral elements, together with realistic material properties for the brain tissue and appropriate contact conditions at the boundaries. The loading was defined by prescribing deformations on the brain surface under the craniotomy. Application of the computed deformation fields to register (i.e. align) the preoperative and intraoperative images indicated that the models very accurately predict the intraoperative deformations within the brain. For each case, computing the brain deformation field took less than 4 s using an NVIDIA Tesla C870 GPU, a two-orders-of-magnitude reduction in computation time compared with our previous study, in which the brain deformation was predicted using a commercial finite element solver executed on a personal computer.
3. Joldes GR, Wittek A, Miller K. Real-Time Nonlinear Finite Element Computations on GPU - Application to Neurosurgical Simulation. Comput Methods Appl Mech Eng 2010; 199:3305-3314. PMID: 21179562. PMCID: PMC3003932. DOI: 10.1016/j.cma.2010.06.037.
Abstract
Application of biomechanical modeling techniques in the area of medical image analysis and surgical simulation implies two conflicting requirements: accurate results and high solution speeds. Accurate results can be obtained only by using appropriate models and solution algorithms. In our previous papers we have presented algorithms and solution methods for performing accurate nonlinear finite element analysis of brain shift (including mixed meshes, different non-linear material models, finite deformations, and brain-skull contact) in less than a minute on a personal computer, for models having up to 50,000 degrees of freedom. In this paper we present an implementation of our algorithms on a graphics processing unit (GPU) using the new NVIDIA Compute Unified Device Architecture (CUDA), which yields a more than 20-fold increase in computation speed. This makes possible the use of meshes with more elements, which better represent the geometry, are easier to generate, and provide more accurate results.
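The solution strategy behind this and the previous entry (explicit time stepping driven to a steady state by dynamic relaxation, rather than an implicit solve) can be caricatured with a deliberately simple 1D spring chain standing in for the finite element model. This sketch illustrates only the damped explicit iteration, not the authors' total Lagrangian implementation:

```python
import numpy as np

def dynamic_relaxation(n=11, k=1.0, load=0.1, dt=0.1, damping=0.5, steps=5000):
    """Damped explicit time stepping toward static equilibrium for a
    1D chain of linear springs, fixed at node 0 and pulled at the tip.
    Unit nodal masses; semi-implicit (symplectic) Euler updates."""
    u = np.zeros(n)                        # nodal displacements
    v = np.zeros(n)                        # nodal velocities
    for _ in range(steps):
        stretch = np.diff(u)               # spring elongations
        f_int = np.zeros(n)                # internal spring forces
        f_int[:-1] += k * stretch          # stretched spring pulls left node right
        f_int[1:] -= k * stretch           # ... and right node left
        f_ext = np.zeros(n)
        f_ext[-1] = load                   # tip load
        a = f_int + f_ext - damping * v    # mass damping drives relaxation
        v += dt * a
        u += dt * v
        u[0] = 0.0                         # fixed boundary
        v[0] = 0.0
    return u
```

At equilibrium every spring carries the tip load, so each of the 10 springs stretches by load/k = 0.1 and the tip displacement converges to 1.0.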
4. Ford TN, Lim D, Mertz J. Fast optically sectioned fluorescence HiLo endomicroscopy. J Biomed Opt 2012; 17:021105. PMID: 22463023. PMCID: PMC3382350. DOI: 10.1117/1.jbo.17.2.021105.
Abstract
We describe a nonscanning, fiber bundle endomicroscope that performs optically sectioned fluorescence imaging with fast frame rates and real-time processing. Our sectioning technique is based on HiLo imaging, wherein two widefield images are acquired under uniform and structured illumination and numerically processed to reject out-of-focus background. This work is an improvement upon an earlier demonstration of widefield optical sectioning through a flexible fiber bundle. The improved device features lateral and axial resolutions of 2.6 and 17 μm, respectively; a net frame rate of 9.5 Hz, achieved by real-time image processing on a graphics processing unit (GPU); and significantly reduced motion artifacts, achieved by the use of a double-shutter camera. We demonstrate the performance of our system with optically sectioned images and videos of a fluorescently labeled chorioallantoic membrane (CAM) in the developing Gallus gallus embryo. HiLo endomicroscopy is a candidate technique for low-cost, high-speed clinical optical biopsies.
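The HiLo principle, fusing low spatial frequencies from a structured-illumination-derived sectioned estimate with high frequencies from the uniform image, can be caricatured in one dimension. Everything below (the moving-average filters, the contrast estimate, the invented test scene) is a toy model of the idea, not the authors' processing chain:

```python
import numpy as np

def lowpass(sig, width):
    """Moving-average low-pass filter, same output length."""
    return np.convolve(sig, np.ones(width) / width, mode="same")

def hilo_fuse(uniform, structured, width=15, eta=1.0):
    """Toy HiLo fusion: low frequencies come from a sectioned estimate
    based on how strongly the illumination pattern survives (the
    pattern washes out for out-of-focus light); high frequencies come
    from the uniform image."""
    sectioned = np.abs(structured - uniform)   # pattern contrast ~ in-focus signal
    lo = lowpass(sectioned, width)
    hi = uniform - lowpass(uniform, width)
    return eta * lo + hi

# Invented test scene: an in-focus feature over a smooth out-of-focus
# background; only the in-focus part retains the illumination pattern.
x = np.arange(200)
in_focus = np.exp(-((x - 100) / 10.0) ** 2)
background = 0.5 * np.ones_like(x, dtype=float)
uniform = in_focus + background
structured = in_focus * (1 + np.cos(2 * np.pi * x / 8)) + background
fused = hilo_fuse(uniform, structured)
```

In the fused result the flat out-of-focus background is suppressed while the patterned in-focus feature survives, which is the optical-sectioning effect the abstract describes.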
5. Bittremieux W, Laukens K, Noble WS. Extremely Fast and Accurate Open Modification Spectral Library Searching of High-Resolution Mass Spectra Using Feature Hashing and Graphics Processing Units. J Proteome Res 2019; 18:3792-3799. PMID: 31448616. PMCID: PMC6886738. DOI: 10.1021/acs.jproteome.9b00291.
Abstract
Open modification searching (OMS) is a powerful search strategy to identify peptides with any type of modification. OMS works by using a very wide precursor mass window to allow modified spectra to match against their unmodified variants, after which the modification types can be inferred from the corresponding precursor mass differences. A disadvantage of this strategy, however, is the large computational cost, because each query spectrum has to be compared against a multitude of candidate peptides. We have previously introduced the ANN-SoLo tool for fast and accurate open spectral library searching. ANN-SoLo uses approximate nearest neighbor indexing to speed up OMS by selecting only a limited number of the most relevant library spectra to compare to an unknown query spectrum. Here we demonstrate how this candidate selection procedure can be further optimized using graphics processing units. Additionally, we introduce a feature hashing scheme to convert high-resolution spectra to low-dimensional vectors. On the basis of these algorithmic advances, along with low-level code optimizations, the new version of ANN-SoLo is up to an order of magnitude faster than its initial version. This makes it possible to efficiently perform open searches on a large scale to gain a deeper understanding of the protein modification landscape. We demonstrate the computational efficiency and identification performance of ANN-SoLo on a large dataset of the draft human proteome. ANN-SoLo is implemented in Python and C++. It is freely available under the Apache 2.0 license at https://github.com/bittremieux/ANN-SoLo.
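The feature-hashing step can be illustrated in a few lines: fragment m/z values are discretized into fine bins, and each bin index is hashed into a fixed-size vector, so high-resolution spectra become low-dimensional vectors whose dot products approximate spectral similarity. The bin width and dimensionality below are made up for illustration; this mirrors the idea, not the ANN-SoLo implementation:

```python
import numpy as np

def hash_spectrum(mz, intensity, dim=800, bin_width=0.05):
    """Hash a high-resolution spectrum into a low-dimensional vector.

    Each fragment is assigned to a fine m/z bin; the bin index is
    hashed into a fixed-size vector and collisions simply add."""
    vec = np.zeros(dim)
    for m, i in zip(mz, intensity):
        bin_idx = int(m / bin_width)
        vec[hash(bin_idx) % dim] += i
    # Normalize so dot products approximate cosine similarity.
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```

Two copies of the same spectrum hash to (nearly) identical vectors, while spectra with disjoint fragments usually share few hash buckets, so their dot product stays low.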
6. Sawaya NPD, Huh J, Fujita T, Saikin SK, Aspuru-Guzik A. Fast delocalization leads to robust long-range excitonic transfer in a large quantum chlorosome model. Nano Lett 2015; 15:1722-1729. PMID: 25694170. DOI: 10.1021/nl504399d.
Abstract
Chlorosomes are efficient light-harvesting antennas containing up to hundreds of thousands of bacteriochlorophyll molecules. With massively parallel computer hardware, we use a nonperturbative stochastic Schrödinger equation, while including an atomistically derived spectral density, to study excitonic energy transfer in a realistically sized chlorosome model. We find that fast short-range delocalization leads to robust long-range transfer due to the antennae's concentric-roll structure. Additionally, we discover anomalous behavior arising from different initial conditions, and outline general considerations for simulating excitonic systems on the nanometer to micrometer scale.
7. Leeser M, Mukherjee S, Brock J. Fast reconstruction of 3D volumes from 2D CT projection data with GPUs. BMC Res Notes 2014; 7:582. PMID: 25176282. PMCID: PMC4167268. DOI: 10.1186/1756-0500-7-582.
Abstract
BACKGROUND: Biomedical image reconstruction applications require producing high-fidelity images in or close to real time. We have implemented reconstruction of three-dimensional cone-beam computed tomography (CBCT) from two-dimensional projections. The algorithm takes slices of the target, weights and filters them to backproject the data, then creates the final 3D volume. We have implemented the algorithm using several hardware and software approaches, taking advantage of different types of parallelism in modern processors. The two hardware platforms used are a central processing unit (CPU) and a heterogeneous system combining CPU and GPU. On the CPU we implement serial MATLAB, parallel MATLAB, C, and parallel C with OpenMP extensions. These codes are compared against the heterogeneous versions written in CUDA-C and OpenCL. FINDINGS: Our results show that GPUs are particularly well suited to accelerating CBCT. Relative performance was evaluated on a mathematical phantom as well as on mouse data. Speedups of up to 200× are observed using an AMD GPU, compared with a parallel C version with OpenMP constructs. CONCLUSIONS: We have implemented the Feldkamp-Davis-Kress algorithm, compatible with Fessler's image reconstruction toolbox, and tested it on different hardware platforms, including a CPU and a combination of CPU and GPU. Both NVIDIA and AMD GPUs were used for performance evaluation. GPUs provide significant speedup over the parallel CPU version.
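The weight-filter-backproject pipeline is easiest to see in the 2D parallel-beam special case, where FDK reduces to ordinary filtered backprojection. The sketch below is a minimal CPU version for orientation only; the cone-beam weighting, interpolation quality, and GPU kernels of the paper are all omitted:

```python
import numpy as np

def ramp_filter(sinogram):
    """Apply a ramp (|f|) filter to each projection row via the FFT."""
    n = sinogram.shape[1]
    filt = np.abs(np.fft.fftfreq(n))
    return np.real(np.fft.ifft(np.fft.fft(sinogram, axis=1) * filt, axis=1))

def backproject(sinogram, angles):
    """Smear each filtered projection back across the image grid
    (nearest-neighbor sampling for brevity)."""
    n = sinogram.shape[1]
    c = n // 2
    y, x = np.mgrid[:n, :n] - c
    recon = np.zeros((n, n))
    for proj, theta in zip(sinogram, angles):
        t = x * np.cos(theta) + y * np.sin(theta) + c   # detector coordinate
        idx = np.clip(np.round(t).astype(int), 0, n - 1)
        recon += proj[idx]
    return recon * np.pi / len(angles)
```

On the GPU the backprojection loop over pixels is what parallelizes naturally: every output voxel accumulates its contributions independently, which is the structure the paper's CUDA and OpenCL implementations exploit.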
8. Hu S, Zhang Q, Wang J, Chen Z. Real-time particle filtering and smoothing algorithms for detecting abrupt changes in neural ensemble spike activity. J Neurophysiol 2017; 119:1394-1410. PMID: 29357468. DOI: 10.1152/jn.00684.2017.
Abstract
Sequential change-point detection from time series data is a common problem in many neuroscience applications, such as seizure detection, anomaly detection, and pain detection. In our previous work (Chen Z, Zhang Q, Tong AP, Manders TR, Wang J. J Neural Eng 14: 036023, 2017), we developed a latent state-space model, known as the Poisson linear dynamical system, for detecting abrupt changes in neuronal ensemble spike activity. In online brain-machine interface (BMI) applications, a recursive filtering algorithm is used to track the changes in the latent variable. However, previous methods have been restricted to Gaussian dynamical noise and have used Gaussian approximation for the Poisson likelihood. To improve the detection speed, we introduce non-Gaussian dynamical noise for modeling a stochastic jump process in the latent state space. To efficiently estimate the state posterior that accommodates non-Gaussian noise and non-Gaussian likelihood, we propose particle filtering and smoothing algorithms for the change-point detection problem. To speed up the computation, we implement the proposed particle filtering algorithms using advanced graphics processing unit computing technology. We validate our algorithms using both computer simulations and experimental data for acute pain detection. Finally, we discuss several important practical issues in the context of real-time closed-loop BMI applications. NEW & NOTEWORTHY Sequential change-point detection is an important problem in closed-loop neuroscience experiments. This study proposes novel sequential Monte Carlo methods to quickly detect the onset and offset of a stochastic jump process that drives the population spike activity. This new approach is robust with respect to spike sorting noise and varying levels of signal-to-noise ratio. The GPU implementation of the computational algorithm allows for parallel processing in real time.
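The core machinery (a bootstrap particle filter tracking a latent log rate that occasionally jumps, with a Poisson observation model) fits in a short sketch. The jump probability, jump scale, and diffusion noise below are invented; this is a toy stand-in for the model, not the authors' code, and on a GPU the per-particle propagate/weight steps would run in parallel:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_filter(counts, n_particles=2000, jump_prob=0.05,
                    jump_scale=2.0, diffusion=0.05):
    """Bootstrap particle filter for a latent log firing rate with
    sparse jumps, observed through Poisson spike counts."""
    particles = np.zeros(n_particles)          # latent log-rate samples
    estimates = []
    for y in counts:
        # Propagate: small Gaussian diffusion plus a sparse jump component.
        jumps = rng.random(n_particles) < jump_prob
        particles = (particles
                     + diffusion * rng.standard_normal(n_particles)
                     + jumps * jump_scale * rng.standard_normal(n_particles))
        # Weight by the Poisson log-likelihood: y*log(rate) - rate.
        logw = y * particles - np.exp(particles)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        estimates.append(float(np.sum(w * particles)))
        # Multinomial resampling (systematic resampling is the usual choice).
        particles = particles[rng.choice(n_particles, n_particles, p=w)]
    return np.array(estimates)
```

Fed a spike-count series whose rate jumps from 1 to 7 halfway through, the filtered log-rate estimate tracks from about 0 to about log 7 within a few steps of the change.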
9. Johnson TS, Li S, Franz E, Huang Z, Dan Li S, Campbell MJ, Huang K, Zhang Y. PseudoFuN: Deriving functional potentials of pseudogenes from integrative relationships with genes and microRNAs across 32 cancers. Gigascience 2019; 8:5480571. PMID: 31029062. PMCID: PMC6486473. DOI: 10.1093/gigascience/giz046.
Abstract
Background: Long thought to be “relics” of evolution, pseudogenes have only recently become of medical interest for their regulatory roles in cancer. Often, these regulatory roles are a direct by-product of their close sequence homology to protein-coding genes. Novel pseudogene-gene (PGG) functional associations can be identified through the integration of biomedical data, such as sequence homology, functional pathways, gene expression, pseudogene expression, and microRNA expression. However, not all of this information has been integrated, and almost all previous pseudogene studies relied on 1:1 pseudogene-parent gene relationships without leveraging other homologous genes/pseudogenes. Results: We produce PGG families that expand beyond the current 1:1 paradigm. First, we construct expansive PGG databases by (i) CUDAlign graphics processing unit (GPU)-accelerated local alignment of all pseudogenes to gene families (totaling 1.6 billion individual local alignments and >40,000 GPU hours) and (ii) BLAST-based assignment of pseudogenes to gene families. Second, we create an open-source web application (PseudoFuN [Pseudogene Functional Networks]) to search for integrative functional relationships among sequence homology, microRNA expression, gene expression, pseudogene expression, and gene ontology. We produce four “flavors” of CUDAlign-based databases (>462,000,000 PGG pairwise alignments and 133,770 PGG families) that can be queried and downloaded using PseudoFuN. These databases are consistent with previous 1:1 PGG annotation and are also much more powerful, including millions of de novo PGG associations. For example, we find multiple known (e.g., miR-20a-PTEN-PTENP1) and novel (e.g., miR-375-SOX15-PPP4R1L) microRNA-gene-pseudogene associations in prostate cancer. PseudoFuN provides a “one-stop shop” for identifying and visualizing thousands of potential regulatory relationships related to pseudogenes in The Cancer Genome Atlas cancers.
Conclusions: Thousands of new PGG associations can be explored in the context of microRNA-gene-pseudogene co-expression and differential expression with a simple-to-use online tool by bioinformaticians and oncologists alike.
10. Liquet B, Bottolo L, Campanella G, Richardson S, Chadeau-Hyam M. R2GUESS: A Graphics Processing Unit-Based R Package for Bayesian Variable Selection Regression of Multivariate Responses. J Stat Softw 2016; 69. PMID: 29568242. DOI: 10.18637/jss.v069.i02.
Abstract
Technological advances in molecular biology over the past decade have given rise to high-dimensional and complex datasets, offering the possibility to investigate biological associations between a range of genomic features and complex phenotypes. The analysis of this novel type of data has generated unprecedented computational challenges, which ultimately led to the definition and implementation of computationally efficient statistical models able to scale to genome-wide data, including Bayesian variable selection approaches. While extensive methodological work has been carried out in this area, only a few methods capable of handling hundreds of thousands of predictors have been implemented and distributed. Among these we recently proposed GUESS, a computationally optimised algorithm making use of graphics processing unit capabilities, which can accommodate multiple outcomes. In this paper we propose R2GUESS, an R package wrapping the original C++ source code. In addition to providing a user-friendly interface to the original code, automating its parametrisation and data handling, R2GUESS also incorporates many features to explore the data, to extend statistical inference beyond the native algorithm (e.g., effect size estimation, significance assessment), and to visualize outputs from the algorithm. We first detail the model and its parametrisation, then describe its optimised implementation in detail. Based on two examples, we finally illustrate its statistical performance and flexibility.
11. da Silva J, Ansorge R, Jena R. Fast Pencil Beam Dose Calculation for Proton Therapy Using a Double-Gaussian Beam Model. Front Oncol 2015; 5:281. PMID: 26734567. PMCID: PMC4683172. DOI: 10.3389/fonc.2015.00281.
Abstract
The highly conformal dose distributions produced by scanned proton pencil beams (PBs) are more sensitive to motion and anatomical changes than those produced by conventional radiotherapy. The ability to calculate the dose in real-time as it is being delivered would enable, for example, online dose monitoring, and is therefore highly desirable. We have previously described an implementation of a PB algorithm running on graphics processing units (GPUs) intended specifically for online dose calculation. Here, we present an extension to the dose calculation engine employing a double-Gaussian beam model to better account for the low-dose halo. To the best of our knowledge, it is the first such PB algorithm for proton therapy running on a GPU. We employ two different parameterizations for the halo dose, one describing the distribution of secondary particles from nuclear interactions found in the literature and one relying on directly fitting the model to Monte Carlo simulations of PBs in water. Despite the large width of the halo contribution, we show how in either case the second Gaussian can be included while prolonging the calculation of the investigated plans by no more than 16%, or the calculation of the most time-consuming energy layers by about 25%. Furthermore, the calculation time is relatively unaffected by the parameterization used, which suggests that these results should hold also for different systems. Finally, since the implementation is based on an algorithm employed by a commercial treatment planning system, it is expected that with adequate tuning, it should be able to reproduce the halo dose from a general beam line with sufficient accuracy.
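The double-Gaussian lateral model has a compact closed form: the dose at radius r from the pencil-beam axis is a weighted sum of a narrow primary Gaussian and a wide, low-weight halo Gaussian. A schematic version follows; the sigma values and halo weight are invented for illustration and are not the paper's fitted parameters:

```python
import numpy as np

def lateral_dose(r, sigma1, sigma2, w):
    """Double-Gaussian lateral pencil-beam profile: a narrow primary
    Gaussian plus a wide halo Gaussian carrying weight w, each
    normalized to unit integral over the transverse plane."""
    g1 = np.exp(-r**2 / (2 * sigma1**2)) / (2 * np.pi * sigma1**2)
    g2 = np.exp(-r**2 / (2 * sigma2**2)) / (2 * np.pi * sigma2**2)
    return (1 - w) * g1 + w * g2

# Example: primary sigma 5 mm, halo sigma 25 mm carrying 10% of the dose.
r = np.linspace(0.0, 200.0, 20001)
dose = lateral_dose(r, sigma1=5.0, sigma2=25.0, w=0.1)
```

Because both components are normalized, the radial integral of the profile over the plane stays at unity regardless of w, which is what lets the halo be added without rescaling the total delivered dose.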
12. Chang CH, Yu X, Ji JX. Compressed sensing MRI reconstruction from 3D multichannel data using GPUs. Magn Reson Med 2017; 78:2265-2274. PMID: 28198568. DOI: 10.1002/mrm.26636.
Abstract
PURPOSE: To accelerate iterative reconstructions of compressed sensing (CS) MRI from 3D multichannel data using graphics processing units (GPUs). METHODS: The sparsity of MRI signals and parallel array receivers can reduce the data acquisition requirements. However, iterative CS reconstructions from data acquired using an array system can take a significant amount of time, especially for a large number of parallel channels. This paper presents an efficient method for CS-MRI reconstruction from 3D multichannel data using GPUs. In this method, CS reconstructions were processed simultaneously in a channel-by-channel fashion on the GPU, so that the computations of the multiple-channel 3D-CS reconstructions are highly parallelized. The final image was then produced by a sum-of-squares method on the central processing unit. Implementation details, including the algorithm, data/memory management, and parallelization schemes, are reported in the paper. RESULTS: Both simulated data and in vivo MRI array data were tested. The results showed that the proposed method can significantly improve image reconstruction efficiency, typically shortening the runtime by a factor of 30. CONCLUSIONS: Using low-cost GPUs and an efficient algorithm allowed the 3D multislice compressed-sensing reconstruction to be performed in less than 1 s. The rapid reconstructions are expected to help bring high-dimensional, multichannel parallel CS MRI closer to clinical applications. Magn Reson Med 78:2265-2274, 2017. © 2017 International Society for Magnetic Resonance in Medicine.
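The channel-by-channel strategy is what makes the problem embarrassingly parallel: each coil's 3D reconstruction is independent, and only the final combination couples them. The sum-of-squares step is simple enough to state exactly (the generic formula, not code from the paper):

```python
import numpy as np

def sos_combine(channel_images):
    """Sum-of-squares combination of independently reconstructed
    per-channel images; abs() makes it work for complex data too."""
    stack = np.stack(channel_images)
    return np.sqrt(np.sum(np.abs(stack) ** 2, axis=0))
```

When the coil sensitivities' squared magnitudes sum to one, sum-of-squares recovers the underlying magnitude image exactly; in general it gives a sensitivity-weighted magnitude estimate without needing explicit coil maps.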
13. Williams-Young DB, de Jong WA, van Dam HJJ, Yang C. On the Efficient Evaluation of the Exchange Correlation Potential on Graphics Processing Unit Clusters. Front Chem 2020; 8:581058. PMID: 33363105. PMCID: PMC7758429. DOI: 10.3389/fchem.2020.581058.
Abstract
The predominance of Kohn–Sham density functional theory (KS-DFT) for the theoretical treatment of large experimentally relevant systems in molecular chemistry and materials science relies primarily on the existence of efficient software implementations which are capable of leveraging the latest advances in modern high-performance computing (HPC). With recent trends in HPC leading toward increasing reliance on heterogeneous accelerator-based architectures such as graphics processing units (GPU), existing code bases must embrace these architectural advances to maintain the high levels of performance that have come to be expected for these methods. In this work, we propose a three-level parallelism scheme for the distributed numerical integration of the exchange-correlation (XC) potential in the Gaussian basis set discretization of the Kohn–Sham equations on large computing clusters consisting of multiple GPUs per compute node. In addition, we propose and demonstrate the efficacy of the use of batched kernels, including batched level-3 BLAS operations, in achieving high levels of performance on the GPU. We demonstrate the performance and scalability of the implementation of the proposed method in the NWChemEx software package by comparing to the existing scalable CPU XC integration in NWChem.
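The batching idea, issuing one call for a whole stack of small matrix products instead of many tiny calls, can be demonstrated with NumPy's batched `matmul`; the GPU analogue would be a batched GEMM kernel. The shapes below are arbitrary and purely illustrative:

```python
import numpy as np

# One batched call multiplies 512 independent 8x8 matrix pairs at once,
# amortizing per-call overhead -- the same structure batched level-3
# BLAS kernels exploit for the many small per-batch contractions that
# arise in grid-based XC integration.
rng = np.random.default_rng(1)
A = rng.standard_normal((512, 8, 8))
B = rng.standard_normal((512, 8, 8))
C = np.matmul(A, B)   # batched over the leading axis
```

Each slice `C[i]` equals `A[i] @ B[i]`; the win on accelerators comes from launching one kernel for all 512 products rather than 512 kernels.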
14. Fu Z, Kirby RM, Whitaker RT. A fast iterative method for solving the eikonal equation on tetrahedral domains. SIAM J Sci Comput 2013; 35:C473-C494. PMID: 25221418. PMCID: PMC4162315. DOI: 10.1137/120881956.
Abstract
Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous two-dimensional strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers.
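The 2D structured-grid analogue of the method conveys the core loop: a Godunov upwind local solver plus an active list of nodes that is iterated until no value improves. The sketch below is for orientation only (serial, 2D, unit speed); the paper's tetrahedral local solver, anisotropic speed functions, and SIMD data layout are not represented:

```python
import numpy as np

def local_update(a, b, h):
    """Godunov upwind solution of |grad u| = 1 at one node, given the
    smaller horizontal (a) and vertical (b) upwind neighbor values."""
    if np.isinf(a) or np.isinf(b) or abs(a - b) >= h:
        return min(a, b) + h                       # one-sided update
    return 0.5 * (a + b + np.sqrt(2 * h * h - (a - b) ** 2))

def fast_iterative_method(n, sources, h=1.0, tol=1e-9):
    """Minimal fast iterative method on an n x n grid with unit speed:
    sweep an active set with the local update until nothing improves."""
    u = np.full((n, n), np.inf)
    for s in sources:
        u[s] = 0.0
    active = {(i, j) for i in range(n) for j in range(n)} - set(sources)
    while active:
        nxt = set()
        for i, j in active:
            a = min(u[i, j - 1] if j > 0 else np.inf,
                    u[i, j + 1] if j < n - 1 else np.inf)
            b = min(u[i - 1, j] if i > 0 else np.inf,
                    u[i + 1, j] if i < n - 1 else np.inf)
            if np.isinf(a) and np.isinf(b):
                nxt.add((i, j))                    # wavefront not here yet
                continue
            new = local_update(a, b, h)
            if new < u[i, j] - tol:
                u[i, j] = new                      # improved: neighbors may too
                nxt.add((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    if 0 <= i + di < n and 0 <= j + dj < n:
                        nxt.add((i + di, j + dj))
        active = nxt - set(sources)
    return u
```

Nodes re-enter the active list only when a neighbor improves, which is what makes the method amenable to asynchronous, fine-grained parallel updates on SIMD hardware.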
15. Gosui M, Yamazaki T. Real-World-Time Simulation of Memory Consolidation in a Large-Scale Cerebellar Model. Front Neuroanat 2016; 10:21. PMID: 26973472. PMCID: PMC4776399. DOI: 10.3389/fnana.2016.00021.
Abstract
We report the development of a large-scale spiking network model of the cerebellum composed of more than 1 million neurons. The model is implemented on graphics processing units (GPUs), which are dedicated hardware for parallel computing. Using 4 GPUs simultaneously, we achieve real-time simulation, in which a computer simulation of 1 s of cerebellar activity completes within 1 s of real-world time, with a temporal resolution of 1 ms. This allows us to carry out very long-term computer simulations of cerebellar activity in a practical time with millisecond temporal resolution. Using the model, we carry out a computer simulation of the long-term gain adaptation of optokinetic response (OKR) eye movements over 5 days, aimed at studying the neural mechanisms of posttraining memory consolidation. The simulation results are consistent with animal experiments and our theory of posttraining memory consolidation. These results suggest that real-time computing provides a useful means to study very slow neural processes such as memory consolidation in the brain.
16. Choi S, Kwon OK, Kim J, Kim WY. Performance of heterogeneous computing with graphics processing unit and many integrated core for Hartree potential calculations on a numerical grid. J Comput Chem 2016; 37:2193-2201. PMID: 27431905. DOI: 10.1002/jcc.24443.
Abstract
We investigated the performance of heterogeneous computing with graphics processing units (GPUs) and many integrated core (MIC) with 20 CPU cores (20×CPU). As a practical example toward large-scale electronic structure calculations using grid-based methods, we evaluated the Hartree potentials of silver nanoparticles of various sizes (3.1, 3.7, 4.9, 6.1, and 6.9 nm) via a direct integral method supported by the sinc basis set. The so-called work-stealing scheduler was used for efficient heterogeneous computing via the balanced dynamic distribution of workloads between all processors on a given architecture, without any prior information on their individual performances. 20×CPU + 1GPU was up to ∼1.5 and ∼3.1 times faster than 1GPU and 20×CPU, respectively. 20×CPU + 2GPU was ∼4.3 times faster than 20×CPU. The performance enhancement of CPU + MIC was considerably lower than expected because of the large initialization overhead of MIC, although its theoretical performance is similar to that of CPU + GPU. © 2016 Wiley Periodicals, Inc.
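Work stealing itself is easy to sketch: each worker drains its own deque and, when empty, steals from the opposite end of a peer's deque, so load balances without any prior knowledge of the processors' relative speeds. A minimal thread-based toy (not the authors' scheduler, and without the subtask spawning a production scheduler would handle):

```python
import threading
from collections import deque

def run_work_stealing(tasks, n_workers=4):
    """Minimal work-stealing pool: round-robin initial distribution,
    LIFO pops from the owner's end, FIFO steals from peers."""
    deques = [deque() for _ in range(n_workers)]
    for i, t in enumerate(tasks):
        deques[i % n_workers].append(t)
    results, lock = [], threading.Lock()

    def worker(wid):
        while True:
            try:
                task = deques[wid].pop()          # own deque: LIFO end
            except IndexError:
                task = None
                for other in range(n_workers):    # steal from FIFO end
                    if other == wid:
                        continue
                    try:
                        task = deques[other].popleft()
                        break
                    except IndexError:
                        continue
            if task is None:
                return                            # nothing left anywhere
            r = task()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Stealing from the end opposite the owner's reduces contention: the owner and the thief rarely touch the same element, and `deque`'s append/pop operations are thread-safe in CPython.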
17. Rovere M, Chen Z, Di Pilato A, Pantaleo F, Seez C. CLUE: A Fast Parallel Clustering Algorithm for High Granularity Calorimeters in High-Energy Physics. Front Big Data 2020; 3:591315. PMID: 33937749. PMCID: PMC8080903. DOI: 10.3389/fdata.2020.591315.
Abstract
One of the challenges of high granularity calorimeters, such as that to be built to cover the endcap region in the CMS Phase-2 Upgrade for HL-LHC, is that the large number of channels causes a surge in the computing load when clustering numerous digitized energy deposits (hits) in the reconstruction stage. In this article, we propose a fast and fully parallelizable density-based clustering algorithm, optimized for high-occupancy scenarios, where the number of clusters is much larger than the average number of hits in a cluster. The algorithm uses a grid spatial index for fast querying of neighbors and its timing scales linearly with the number of hits within the range considered. We also show a comparison of the performance on CPU and GPU implementations, demonstrating the power of algorithmic parallelization in the coming era of heterogeneous computing in high-energy physics.
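The combination the abstract describes, a grid spatial index for neighbor queries plus density-based cluster assignment, can be sketched serially. The Python below is a simplified, illustrative stand-in for CLUE (the function name, parameters, and outlier handling are assumptions, not the paper's): points are binned into cells of side `dc`, so each neighbor query only touches a 3×3 block of cells; local density maxima become seeds and every other point follows its nearest higher-density neighbor.

```python
import numpy as np
from collections import defaultdict

def clue_like(points, dc=1.0, rho_c=3):
    """Simplified CLUE-style clustering of 2-D points (illustrative only)."""
    n = len(points)
    keys = np.floor(points / dc).astype(int)
    grid = defaultdict(list)              # grid spatial index: cell -> point ids
    for i, k in enumerate(map(tuple, keys)):
        grid[k].append(i)

    def neighbors(i):
        kx, ky = keys[i]                  # only the 3x3 block of cells can
        for dx in (-1, 0, 1):             # contain points within distance dc
            for dy in (-1, 0, 1):
                for j in grid[(kx + dx, ky + dy)]:
                    if j != i and np.linalg.norm(points[i] - points[j]) <= dc:
                        yield j

    rho = np.array([1 + sum(1 for _ in neighbors(i)) for i in range(n)])

    parent = np.full(n, -1)               # nearest neighbor of higher density
    for i in range(n):
        best, best_d = -1, np.inf
        for j in neighbors(i):
            if rho[j] > rho[i] or (rho[j] == rho[i] and j < i):  # index breaks ties
                d = np.linalg.norm(points[i] - points[j])
                if d < best_d:
                    best, best_d = j, d
        parent[i] = best

    labels = np.full(n, -1)               # -1 = unassigned / outlier
    children = defaultdict(list)
    for i, p in enumerate(parent):
        if p >= 0:
            children[p].append(i)
    cid = 0
    for s in range(n):
        if parent[s] == -1 and rho[s] >= rho_c:   # density seed -> new cluster
            stack = [s]
            while stack:
                i = stack.pop()
                labels[i] = cid
                stack.extend(children[i])
            cid += 1
    return labels
```

Because each point inspects a bounded neighborhood, the cost grows linearly with the number of hits, which is the scaling property the paper emphasizes; the per-point loop is also independent across points, which is what makes the algorithm parallelizable.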
Landau W, Niemi J, Nettleton D. Fully Bayesian analysis of RNA-seq counts for the detection of gene expression heterosis. J Am Stat Assoc 2018; 114:610-621. [PMID: 31354180] [PMCID: PMC6660196] [DOI: 10.1080/01621459.2018.1497496]
Abstract
Heterosis, or hybrid vigor, is the enhancement of the phenotype of hybrid progeny relative to their inbred parents. Heterosis is extensively used in agriculture, but the underlying mechanisms remain unclear. To investigate the molecular basis of phenotypic heterosis, researchers search tens of thousands of genes for heterosis with respect to expression in the transcriptome. Difficulty arises in the assessment of heterosis due to composite null hypotheses and non-uniform distributions for p-values under these null hypotheses. Thus, we develop a general hierarchical model for count data and a fully Bayesian analysis in which an efficient parallelized Markov chain Monte Carlo algorithm ameliorates the computational burden. We use our method to detect gene expression heterosis in a two-hybrid plant-breeding scenario, both in a real RNA-seq maize dataset and in simulation studies. In the simulation studies, we show that our method has well-calibrated posterior probabilities and credible intervals when the model assumed in the analysis matches the model used to simulate the data. Although model misspecification can adversely affect calibration, the methodology is still able to accurately rank genes. Finally, we show that the hyperparameter posteriors are extremely narrow and that an empirical Bayes (eBayes) approach based on posterior means from the fully Bayesian analysis provides virtually equivalent posterior probabilities, credible intervals, and gene rankings relative to the fully Bayesian solution. This evidence of equivalence supports the use of eBayes procedures in RNA-seq data analysis if accurate hyperparameter estimates can be obtained.
Chen TW, Henke M, de Visser PHB, Buck-Sorlin G, Wiechers D, Kahlen K, Stützel H. What is the most prominent factor limiting photosynthesis in different layers of a greenhouse cucumber canopy? Ann Bot 2014; 114:677-88. [PMID: 24907313] [PMCID: PMC4217677] [DOI: 10.1093/aob/mcu100]
Abstract
BACKGROUND AND AIMS: Maximizing photosynthesis at the canopy level is important for enhancing crop yield, and this requires insights into the limiting factors of photosynthesis. Using greenhouse cucumber (Cucumis sativus) as an example, this study provides a novel approach to quantify different components of photosynthetic limitations at the leaf level and to upscale these limitations to different canopy layers and the whole plant.
METHODS: A static virtual three-dimensional canopy structure was constructed using digitized plant data in GroIMP. Light interception of the leaves was simulated by a ray-tracer and used to compute leaf photosynthesis. Different components of photosynthetic limitations, namely stomatal (S(L)), mesophyll (M(L)), biochemical (B(L)) and light (L(L)) limitations, were calculated by a quantitative limitation analysis of photosynthesis under different light regimes.
KEY RESULTS: In the virtual cucumber canopy, B(L) and L(L) were the most prominent factors limiting whole-plant photosynthesis. Diffusional limitations (S(L) + M(L)) contributed <15% to total limitation. Photosynthesis in the lower canopy was more limited by the biochemical capacity, and the upper canopy was more sensitive to light than other canopy parts. Although leaves in the upper canopy received more light, their photosynthesis was more light restricted than in the leaves of the lower canopy, especially when the light condition above the canopy was poor. An increase in whole-plant photosynthesis under diffuse light did not result from an improvement of light use efficiency but from an increase in light interception. Diffuse light increased the photosynthesis of leaves that were directly shaded by other leaves in the canopy by up to 55%.
CONCLUSIONS: Based on the results, maintaining biochemical capacity of the middle-lower canopy and increasing the leaf area of the upper canopy would be promising strategies to improve canopy photosynthesis in a high-wire cucumber cropping system. Further analyses using the approach described in this study can be expected to provide insights into the influences of horticultural practices on canopy photosynthesis and the design of optimal crop canopies.
Toward Optimal Computation of Ultrasound Image Reconstruction Using CPU and GPU. Sensors (Basel) 2016; 16:1986. [PMID: 27886149] [PMCID: PMC5190967] [DOI: 10.3390/s16121986]
Abstract
An ultrasound image is reconstructed from echo signals received by the array elements of a transducer. The time of flight of an echo depends on the distance from the focus to each array element, so the received echo signals have to be delayed to make their wave fronts phase-coherent before the signals are summed. In digital beamforming, the required delays do not always fall on the sampled points. Commonly, the values of the delayed signals are estimated by the values of the nearest samples; this method is fast and simple, but inaccurate. Other methods increase the accuracy of the delayed signals and, consequently, the quality of the beamformed signals; for example, in-phase (I)/quadrature (Q) interpolation is more time-consuming but provides more accurate values than the nearest samples. This paper compares the signals after dynamic receive beamforming in which the echo signals are delayed using two methods: the nearest-sample method and the I/Q interpolation method. The visual qualities of the reconstructed images and the qualities of the beamformed signals are compared. Moreover, the computational speeds of these methods are optimized by reorganizing the data-processing flow and by applying the graphics processing unit (GPU). The use of single- and double-precision floating-point formats for the intermediate data is also considered, and the speeds with and without these optimizations are compared.
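The gap between nearest-sample delays and interpolated delays is easy to see on a synthetic pulse. The Python sketch below is illustrative only: the 40 MHz sampling rate and 5 MHz tone are assumed values, and plain linear interpolation stands in for the paper's I/Q interpolation.

```python
import numpy as np

fs = 40e6                      # assumed sampling rate, 40 MHz
f0 = 5e6                       # assumed center frequency, 5 MHz
t = np.arange(256) / fs
x = np.sin(2 * np.pi * f0 * t)
d = 3.37                       # desired delay in (fractional) samples

def shift(x, k):
    # integer-sample delay via zero-padded shift
    return np.concatenate([np.zeros(k), x[:len(x) - k]])

# nearest-sample method: round the delay to a whole sample
y_nearest = shift(x, int(round(d)))

# interpolation: blend the two integer delays that bracket d
k, frac = int(np.floor(d)), d - np.floor(d)
y_interp = (1 - frac) * shift(x, k) + frac * shift(x, k + 1)

# analytically delayed signal for reference
y_exact = np.sin(2 * np.pi * f0 * (t - d / fs)) * (t >= d / fs)

err_nearest = np.sqrt(np.mean((y_nearest[8:] - y_exact[8:]) ** 2))
err_interp = np.sqrt(np.mean((y_interp[8:] - y_exact[8:]) ** 2))
```

Rounding a 0.37-sample delay error at 5 MHz on a 40 MHz clock is roughly a 17° phase error per channel, which is why the nearest-sample RMS error comes out several times larger than the interpolated one.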
Kuriyama R, Casellato C, D'Angelo E, Yamazaki T. Real-Time Simulation of a Cerebellar Scaffold Model on Graphics Processing Units. Front Cell Neurosci 2021; 15:623552. [PMID: 33897369] [PMCID: PMC8058369] [DOI: 10.3389/fncel.2021.623552]
Abstract
Large-scale simulation of detailed computational models of neuronal microcircuits plays a prominent role in reproducing and predicting the dynamics of the microcircuits. To reconstruct a microcircuit, one must choose neuron and synapse models, placements, connectivity, and numerical simulation methods according to anatomical and physiological constraints. For reconstruction and refinement, it is useful to be able to replace one module easily while leaving the others as they are. One way to achieve this is a scaffolding approach, in which the simulation code is built on independent modules for placements, connections, and network simulations. Owing to this modularity, researchers can improve the performance of the entire simulation by simply replacing a problematic module with an improved one. Casali et al. (2019) developed a spiking network model of the cerebellar microcircuit using this approach; while it reproduces electrophysiological properties of cerebellar neurons, it requires substantial computational time. Here, we followed this scaffolding approach and replaced the simulation module with an accelerated version on graphics processing units (GPUs). Our cerebellar scaffold model ran roughly 100 times faster than the original version. In fact, the model runs faster than real time, with good weak and strong scaling properties. To demonstrate an application of real-time simulation, we implemented synaptic plasticity mechanisms at parallel fiber-Purkinje cell synapses and carried out simulations of behavioral experiments known as gain adaptation of the optokinetic response. The computer simulation reproduced the experimental findings while completing in real time: simulating 2 s of biological time took only 750 ms. These results suggest that the scaffolding approach is a promising concept for the gradual development and refactoring of simulation code for large-scale, elaborate microcircuits. Moreover, a real-time version of the cerebellar scaffold model, enabled by GPU parallel computing, may be useful for large-scale simulations and engineering applications that require real-time signal processing and motor control.
Fu Z, Jeong WK, Pan Y, Kirby RM, Whitaker RT. A fast iterative method for solving the eikonal equation on triangulated surfaces. SIAM J Sci Comput 2011; 33:2468-2488. [PMID: 22641200] [PMCID: PMC3360588] [DOI: 10.1137/100788951]
Abstract
This paper presents an efficient, fine-grained parallel algorithm for solving the Eikonal equation on triangular meshes. The Eikonal equation, and the broader class of Hamilton-Jacobi equations to which it belongs, have a wide range of applications from geometric optics and seismology to biological modeling and analysis of geometry and images. The ability to solve such equations accurately and efficiently provides new capabilities for exploring and visualizing parameter spaces and for solving inverse problems that rely on such equations in the forward model. Efficient solvers on state-of-the-art, parallel architectures require new algorithms that are not, in many cases, optimal, but are better suited to synchronous updates of the solution. In previous work [W. K. Jeong and R. T. Whitaker, SIAM J. Sci. Comput., 30 (2008), pp. 2512-2534], the authors proposed the fast iterative method (FIM) to efficiently solve the Eikonal equation on regular grids. In this paper we extend the fast iterative method to solve Eikonal equations efficiently on triangulated domains on the CPU and on parallel architectures, including graphics processors. We propose a new local update scheme that provides solutions of first-order accuracy for both architectures. We also propose a novel triangle-based update scheme and its corresponding data structure for efficient irregular data mapping to parallel single-instruction multiple-data (SIMD) processors. We provide detailed descriptions of the implementations on a single CPU, a multicore CPU with shared memory, and SIMD architectures with comparative results against state-of-the-art Eikonal solvers.
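The fast iterative method itself is compact on a regular grid. The Python sketch below follows the regular-grid setting of the earlier Jeong-Whitaker paper, not this paper's triangulated update, and simplifies the active-list bookkeeping: nodes are relaxed with the upwind Godunov update, and neighbors are activated whenever a value improves.

```python
import numpy as np

def fim_eikonal(shape, sources, h=1.0, eps=1e-9):
    """FIM-style eikonal solver, unit speed, 2-D regular grid (illustrative)."""
    u = np.full(shape, np.inf)
    for s in sources:
        u[s] = 0.0

    def nbrs(i, j):
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ni, nj = i + di, j + dj
            if 0 <= ni < shape[0] and 0 <= nj < shape[1]:
                yield ni, nj

    def local_solve(i, j):
        # upwind Godunov update from the smaller value on each axis
        a = min(u[i - 1, j] if i > 0 else np.inf,
                u[i + 1, j] if i < shape[0] - 1 else np.inf)
        b = min(u[i, j - 1] if j > 0 else np.inf,
                u[i, j + 1] if j < shape[1] - 1 else np.inf)
        if min(a, b) == np.inf:
            return np.inf
        if abs(a - b) >= h:
            return min(a, b) + h
        return 0.5 * (a + b + np.sqrt(2 * h * h - (a - b) ** 2))

    active = {n for s in sources for n in nbrs(*s)}
    while active:
        nxt = set()
        for p in active:           # every active node can be relaxed in parallel
            new = local_solve(*p)
            if u[p] - new > eps:   # value improved: keep propagating
                u[p] = new
                nxt.update(nbrs(*p))
        active = nxt
    return u
```

The inner loop over the active list has no ordering dependence within a sweep, which is exactly the property that maps onto synchronous SIMD updates on a GPU; the price is that some nodes are recomputed several times before converging.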
Gao H, Phan L, Lin Y. Parallel multigrid solver of radiative transfer equation for photon transport via graphics processing unit. J Biomed Opt 2012; 17:096004. [PMID: 23085905] [PMCID: PMC3497889] [DOI: 10.1117/1.jbo.17.9.096004]
Abstract
A graphics processing unit (GPU)-based parallel multigrid solver for the radiative transfer equation with vacuum or reflection boundary conditions is presented for heterogeneous media with complex geometry, based on two-dimensional triangular meshes or three-dimensional tetrahedral meshes. The computational complexity of this parallel solver is linearly proportional to the degrees of freedom in both the angular and spatial variables, while the full multigrid method is utilized to minimize the number of iterations. The overall speed gain is roughly 30- to 300-fold with respect to our prior multigrid solver, depending on the underlying regime and on the parallelization. Numerical validations are presented with the MATLAB codes at https://sites.google.com/site/rtefastsolver/.
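The role multigrid plays here, keeping iteration counts low by attacking error at every length scale, can be illustrated on the simplest model problem. The Python sketch below is a 1-D Poisson V-cycle, not the paper's radiative-transfer solver; it shows the smooth-restrict-correct recursion that a full multigrid method builds on.

```python
import numpy as np

def vcycle(u, f, h, nu=3):
    """One multigrid V-cycle for -u'' = f on [0,1], u(0) = u(1) = 0."""
    n = len(u) - 1

    def smooth(u, f, h, iters):
        for _ in range(iters):            # weighted Jacobi, omega = 2/3
            u[1:-1] += (2.0 / 3.0) * 0.5 * (
                h * h * f[1:-1] + u[:-2] + u[2:] - 2 * u[1:-1])

    smooth(u, f, h, nu)                   # pre-smoothing kills high-frequency error
    if n > 2:
        # residual r = f - A u, restricted to the coarse grid (full weighting)
        r = np.zeros_like(u)
        r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
        rc = np.zeros(n // 2 + 1)
        rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
        ec = vcycle(np.zeros_like(rc), rc, 2 * h, nu)  # coarse-grid correction
        e = np.zeros_like(u)
        e[::2] = ec                       # prolongation: inject coarse values
        e[1::2] = 0.5 * (ec[:-1] + ec[1:])  # ... and interpolate between them
        u += e
    smooth(u, f, h, nu)                   # post-smoothing
    return u
```

Each V-cycle reduces the residual by a roughly constant factor independent of the grid size, which is why the iteration count stays bounded as the problem grows; the smoothing and residual computations are also local stencil operations, the kind that parallelize well on a GPU.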
van Vreumingen D, Tewari S, Verbeek F, van Ruitenbeek JM. Towards Controlled Single-Molecule Manipulation Using "Real-Time" Molecular Dynamics Simulation: A GPU Implementation. Micromachines (Basel) 2018; 9:270. [PMID: 30424203] [PMCID: PMC6187332] [DOI: 10.3390/mi9060270]
Abstract
Molecular electronics was born with the idea of building electronic circuitry with single molecules as individual components. Even though commercial applications are still modest, the field has played an important part in the study of fundamental physics at the scale of single atoms and molecules. It is now a routine procedure in many research groups around the world to connect a single molecule between two metallic leads; what remains unknown is the nature of the coupling between the molecule and the leads. We recently demonstrated (Tewari, 2018, Ph.D. Thesis) our new setup based on a scanning tunneling microscope, which can be used to controllably manipulate single molecules and atomic chains. In this article, we present the extension of our molecular dynamics simulator, attached to this system, for the manipulation of single molecules in real time using a graphics processing unit (GPU). This will not only aid controlled lift-off of single molecules, but will also provide details of changes in the molecular conformation during manipulation. This information could serve as important input for theoretical models and for bridging the gap between theory and experiment.
Accelerating 3-D GPU-based Motion Tracking for Ultrasound Strain Elastography Using Sum-Tables: Analysis and Initial Results. Appl Sci (Basel) 2019; 9. [PMID: 31372306] [PMCID: PMC6675029] [DOI: 10.3390/app9101991]
Abstract
With the availability of 3-D ultrasound data, considerable research effort is being devoted to developing 3-D ultrasound strain elastography (USE) systems. Because 3-D motion tracking, a core component of any 3-D USE system, is computationally intensive, many efforts are under way to accelerate it. In the literature, the sum-table concept has been used in serial computing environments to reduce the cost of computing signal correlation, the single most computationally intensive component of 3-D motion tracking. In this study, parallel programming on graphics processing units (GPUs) is used in conjunction with sum tables to improve the computational efficiency of 3-D motion tracking. To our knowledge, sum tables have not previously been used in a GPU environment for 3-D motion tracking. Our main objective is to investigate the feasibility of sum-table-based normalized correlation coefficient (ST-NCC) methods for the above-mentioned GPU-accelerated 3-D USE. More specifically, two different implementations of ST-NCC, proposed by Lewis et al. and by Luo-Konofagou, are compared against each other, with the conventional method for calculating the normalized correlation coefficient (NCC) as the baseline. All three methods were implemented in the compute unified device architecture (CUDA; Version 9.0, Nvidia Inc., CA, USA) and tested on a GeForce GTX TITAN X card (Nvidia Inc., CA, USA). Using 3-D ultrasound data acquired during a tissue-mimicking phantom experiment, both displacement-tracking accuracy and computational efficiency were evaluated for the three methods. Based on the data investigated, we found that under the GPU platform, the Luo-Konofagou method still improves computational efficiency (17-46%) compared to the classic NCC method implemented on the same GPU platform, whereas the Lewis method either does not improve computational efficiency in some configurations or improves it at a lower rate (7-23%). Comparable displacement-tracking accuracy was obtained by both methods.
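The sum-table idea is easiest to see in 2-D. The Python sketch below is illustrative and serial, not the paper's CUDA code, and it follows the Lewis-style formulation: integral images of the search region and of its square give every window's sum and sum of squares in four table lookups each, so only the cross term still needs a direct (or, in Lewis's method, FFT-based) correlation.

```python
import numpy as np

def integral(img):
    # zero-padded sum table: S[i, j] = sum of img[:i, :j]
    S = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    S[1:, 1:] = img.cumsum(0).cumsum(1)
    return S

def window_sums(S, h, w):
    # per-window sums of all h-by-w windows, four lookups per window
    return S[h:, w:] - S[:-h, w:] - S[h:, :-w] + S[:-h, :-w]

def ncc_sum_tables(img, tpl):
    """NCC map of tpl over img, normalization terms via sum tables."""
    h, w = tpl.shape
    n = h * w
    S1, S2 = integral(img), integral(img ** 2)
    s = window_sums(S1, h, w)          # sum(f) per window
    s2 = window_sums(S2, h, w)         # sum(f^2) per window

    t = tpl - tpl.mean()               # zero-mean template
    t_norm = np.sqrt((t ** 2).sum())

    # cross term, computed directly here (Lewis uses an FFT for this part)
    H, W = img.shape
    num = np.empty((H - h + 1, W - w + 1))
    for i in range(H - h + 1):
        for j in range(W - w + 1):
            num[i, j] = (img[i:i + h, j:j + w] * t).sum()

    var = s2 - s ** 2 / n              # sum((f - mean_f)^2) per window
    den = np.sqrt(np.maximum(var, 1e-12)) * t_norm
    return num / den
```

Without sum tables, every window would recompute its mean and variance in O(hw); with them, the normalization is O(1) per window, which is the saving both ST-NCC variants exploit, and the per-window work is independent, which is what the GPU parallelizes.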