26
Qazi SA, Tariq F, Ullah I, Omer H. Parallel implementation of L + S signal recovery in dynamic MRI. Magnetic Resonance Materials in Physics, Biology and Medicine 2020; 34:297-307. [PMID: 32601881] [DOI: 10.1007/s10334-020-00861-5]
Abstract
Dynamic MRI is useful for diagnosing diseases, e.g. cardiac ailments, by monitoring the structure and function of the heart and the blood flow through the valves. Faster data acquisition is highly desirable in dynamic MRI, but it may introduce aliasing artifacts due to under-sampling. Advanced image reconstruction algorithms are required to obtain aliasing-free MR images from the acquired under-sampled data. One major limitation of these advanced reconstruction algorithms is their computationally expensive and time-consuming nature, which makes them infeasible for clinical use, especially for applications like cardiac MRI. The L + S decomposition model is an approach from the literature that separates the sparse and low-rank information in dynamic MRI; however, it is computationally complex and demands significant computation time. In this paper, a parallel framework is proposed to accelerate the image reconstruction process of the L + S decomposition model using a GPU. Experiments are performed on a cardiac perfusion dataset ([Formula: see text]) and a cardiac cine dataset ([Formula: see text]) using NVIDIA's GeForce GTX 780 GPU and a Core i7 CPU. The results show that the proposed method provides up to 18× speed-up including the memory transfer time (i.e. data transfer between the CPU and GPU) and ~46× speed-up without memory transfer for the cardiac perfusion dataset in our experiments. This level of improvement in reconstruction time will increase the usefulness of L + S reconstruction by making it feasible for clinical applications.
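The L + S model referenced above can be illustrated with a minimal CPU-only sketch (not the authors' GPU implementation; the regularization weights `lam_l` and `lam_s` are arbitrary illustrative values): the image series, reshaped into a space-by-time matrix M, is split into a low-rank component L and a sparse component S by alternating two proximal steps.

```python
import numpy as np

def soft_threshold(x, tau):
    """Elementwise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def singular_value_threshold(x, tau):
    """Soft-threshold the singular values (proximal operator of the nuclear norm)."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u @ np.diag(soft_threshold(s, tau)) @ vt

def l_plus_s(m, lam_l=1.0, lam_s=0.05, iters=100):
    """Alternately update L (low-rank) and S (sparse) so that L + S approximates M."""
    l = np.zeros_like(m)
    s = np.zeros_like(m)
    for _ in range(iters):
        l = singular_value_threshold(m - s, lam_l)  # shrink singular values of M - S
        s = soft_threshold(m - l, lam_s)            # shrink entries of M - L
    return l, s
```

Because the last step soft-thresholds the residual M − L by `lam_s`, the reconstruction error M − (L + S) is bounded by `lam_s` elementwise by construction.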
27
Hattori LT, Pinheiro BA, Frigori RB, Benítez CMV, Lopes HS. PathMolD-AB: Spatiotemporal pathways of protein folding using parallel molecular dynamics with a coarse-grained model. Comput Biol Chem 2020; 87:107301. [PMID: 32554177] [DOI: 10.1016/j.compbiolchem.2020.107301]
Abstract
Solving the protein folding problem (PFP) is one of the grand challenges still open in computational biophysics. Globular proteins are believed to evolve from initial configurations through folding pathways connecting several thermodynamically accessible states in a free energy landscape until reaching its minimum, inhabited by the stable native structures. Despite its huge computational burden, molecular dynamics (MD) is the leading approach in PFP studies, as it preserves the Newtonian temporal evolution in the canonical ensemble. Non-trivial improvements are provided by highly parallel implementations of MD on cost-effective GPUs, concomitant with multiscale descriptions of proteins by coarse-grained minimalist models. In this vein, we present the PathMolD-AB framework, a comprehensive software package for massively parallel MD simulation in the canonical ensemble, structural analysis, and visualization of folding pathways using the minimalist AB model. It also includes a tool to compare the results with proteins re-scaled from the PDB. As case studies, we simulate and analyze the folding of four proteins: 13FIBO, 2GB1, 1PLC and 5ANZ, with 13, 55, 99 and 223 amino acids, respectively. The datasets generated from the simulations correspond to the MD evolution of 3500 folding pathways, encompassing 35 × 10^6 states, which contain the spatial amino acid positions, the protein free energies and the radii of gyration at each time step. Results indicate that the speedup of our approach grows logarithmically with protein length and is therefore suited for most proteins in the PDB. The structures predicted by PathMolD-AB were similar to the re-scaled biological structures, indicating that it is promising for the study of the PFP.
28
Isupov K. Performance data of multiple-precision scalar and vector BLAS operations on CPU and GPU. Data Brief 2020; 30:105506. [PMID: 32373682] [PMCID: PMC7195515] [DOI: 10.1016/j.dib.2020.105506]
Abstract
Many optimized linear algebra packages support the single- and double-precision floating-point data types. However, there are a number of important applications that require a higher level of precision, up to hundreds or even thousands of digits. This article presents performance data of four dense basic linear algebra subprograms – ASUM, DOT, SCAL, and AXPY – implemented using existing extended-/multiple-precision software for conventional central processing units and CUDA compatible graphics processing units. The following open source packages are considered: MPFR, MPDECIMAL, ARPREC, MPACK, XBLAS, GARPREC, CAMPARY, CUMP, and MPRES-BLAS. The execution time of CPU and GPU implementations is measured at a fixed problem size and various levels of numeric precision. The data in this article are related to the research article entitled “Design and implementation of multiple-precision BLAS Level 1 functions for graphics processing units” [1].
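For readers unfamiliar with multiple-precision BLAS, the scalar kernels benchmarked above (DOT and AXPY) can be mimicked in pure Python with the standard-library `decimal` module; this is only a conceptual sketch of what the measured packages do natively in optimized C/CUDA, with the 100-digit precision chosen arbitrarily here.

```python
from decimal import Decimal, getcontext

getcontext().prec = 100  # 100 significant digits, far beyond IEEE double precision

def mp_dot(x, y):
    """Multiple-precision DOT: sum of elementwise products."""
    return sum(a * b for a, b in zip(x, y))

def mp_axpy(alpha, x, y):
    """Multiple-precision AXPY: returns alpha*x + y elementwise."""
    return [alpha * a + b for a, b in zip(x, y)]
```

Every intermediate product and sum is rounded to 100 digits, so rounding error stays near 10^-100 instead of the ~10^-16 of double precision.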
29
Abstract
Probability density approximation (PDA) is a nonparametric method of calculating probability densities. When integrated into Bayesian estimation, it allows researchers to fit psychological processes for which analytic probability functions are unavailable, significantly expanding the scope of theories that can be quantitatively tested. PDA is, however, computationally intensive, requiring large numbers of Monte Carlo simulations in order to attain good precision. We introduce Parallel PDA (pPDA), a highly efficient implementation of this method utilizing the Armadillo C++ and CUDA C libraries to conduct millions of model simulations simultaneously in graphics processing units (GPUs). This approach provides a practical solution for rapidly approximating probability densities with high precision. In addition to demonstrating this method, we fit a piecewise linear ballistic accumulator model (Holmes, Trueblood, & Heathcote, 2016) to empirical data. Finally, we conducted simulation studies to investigate various issues associated with PDA and provide guidelines for pPDA applications to other complex cognitive models.
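The core of PDA can be sketched in a few lines (a CPU illustration of the idea, not the pPDA GPU code; the Gaussian kernel and Silverman's bandwidth rule are common choices assumed here): simulate the model many times, then smooth the simulated observations with a kernel to estimate the density at the data points.

```python
import numpy as np

def pda_density(simulate, n_sims, eval_points, rng):
    """Approximate the density of a simulation-only model: draw many synthetic
    observations, then smooth them with a Gaussian kernel density estimate."""
    sims = simulate(n_sims, rng)
    h = 1.06 * sims.std() * n_sims ** (-1 / 5)        # Silverman's bandwidth rule
    diffs = (eval_points[:, None] - sims[None, :]) / h
    k = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return k.mean(axis=1) / h

# Sanity check against a model whose density is known (standard normal):
rng = np.random.default_rng(1)
dens = pda_density(lambda n, r: r.normal(size=n), 100_000, np.array([0.0]), rng)
```

The precision of the estimate grows with the number of simulations, which is exactly the Monte Carlo cost the abstract describes offloading to the GPU.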
30
Uzelac I, Iravanian S, Fenton FH. Parallel Acceleration on Removal of Optical Mapping Baseline Wandering. Computing in Cardiology 2019; 46. [PMID: 35719209] [PMCID: PMC9202644] [DOI: 10.22489/cinc.2019.433]
Abstract
Optical mapping of hearts stained with fluorescent dyes is an imaging method widely accepted and recognized as a tool to study the complex spatiotemporal dynamics of cardiac electrophysiology. One shortcoming of the method is baseline wandering in the acquired fluorescence signals: the signals reflecting changes in transmembrane potential (Vm) and free intracellular calcium concentration ([Ca2+]i), the two most used dyes, are calculated as a relative signal change with respect to the fluorescence baseline, and these fractional changes are often smaller than 10%. The baseline fluorescence drifts due to dye photo-bleaching, heart contraction/movement artifacts, and the stability of the excitation light source over time. Depending on the experimental instrumentation, recording duration, signal-to-noise levels, and aims of the optical imaging study, many research groups have adopted their own techniques tailored to their specific experimental data. Here we present a technique based on finite impulse response (FIR) filters with parallel acceleration implemented on GPUs and multi-core CPUs in MATLAB.
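A minimal single-channel illustration of FIR-based baseline removal (not the authors' MATLAB/GPU code; the moving-average taps and odd window length are illustrative assumptions): low-pass the trace to estimate the slow baseline, then express the signal as a fractional change relative to that baseline.

```python
import numpy as np

def remove_baseline(signal, window):
    """Estimate the slow baseline with a moving-average FIR filter (odd window
    length, for symmetric alignment) and return the fractional change dF/F."""
    taps = np.ones(window) / window               # FIR low-pass: moving average
    pad = np.pad(signal, window // 2, mode='edge')  # edge-replicate to avoid shrinkage
    baseline = np.convolve(pad, taps, mode='valid')  # same length as the input
    return (signal - baseline) / baseline
```

Because the filter tracks only the slow drift, the small fractional deflections the abstract mentions survive the subtraction while photo-bleaching and drift are removed.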
31
Na JC, Lee I, Rhee JK, Shin SY. Fast single individual haplotyping method using GPGPU. Comput Biol Med 2019; 113:103421. [PMID: 31499396] [DOI: 10.1016/j.compbiomed.2019.103421]
Abstract
BACKGROUND Most bioinformatics tools for next generation sequencing (NGS) data are computationally intensive, requiring a large amount of computational power for processing and analysis. Here, the utility of graphics processing units (GPUs) for NGS data computation is assessed. METHOD In a previous study, we developed a probabilistic evolutionary algorithm with toggling for haplotyping (PEATH) based on the estimation of distribution algorithm and a toggling heuristic. Here, we parallelized the PEATH method (PEATH/G) using general-purpose computing on GPU (GPGPU). RESULTS PEATH/G runs approximately 46.8 times and 25.4 times faster than PEATH on the NA12878 fosmid-sequencing dataset and the HuRef dataset, respectively, with an NVIDIA GeForce GTX 1660 Ti. Moreover, PEATH/G is approximately 13.3 times faster on the fosmid-sequencing dataset even with an inexpensive conventional GPGPU (NVIDIA GeForce GTX 950). CONCLUSIONS PEATH/G can be a practical single individual haplotyping tool in terms of both accuracy and speed, and GPGPU can help reduce the running time of NGS analysis tools.
32
Subbiah A, Ogunfunmi T. A Flexible Hybrid BCH Decoder for Modern NAND Flash Memories Using General Purpose Graphical Processing Units (GPGPUs). Micromachines 2019; 10:365. [PMID: 31159191] [PMCID: PMC6632097] [DOI: 10.3390/mi10060365]
Abstract
Bose-Chaudhuri-Hocquenghem (BCH) codes are broadly used to correct errors in flash memory systems and digital communications. These codes are cyclic block codes whose arithmetic is fixed over the splitting field of their generator polynomial. Many solutions using CPUs, dedicated hardware, and Graphical Processing Units (GPUs) have been proposed for BCH decoders, and the performance of these decoders is of utmost importance for systems involving flash memory. However, it is essential to have a flexible solution that corrects multiple bit errors over different finite fields (GF(2^m)). In this paper, we propose a pragmatic approach to decoding BCH codes over different finite fields using hardware circuits and GPUs in tandem: a hardware design for a modified syndrome generator, and GPUs for the key-equation solver and the error corrector. Using this partition, we show the ability to support multiple bit errors across different BCH block codes without compromising performance. Furthermore, the proposed method for generating the modified syndrome has zero latency when no errors are present; when an error is detected, the GPUs are deployed to correct it using the iBM and Chien search algorithms. The results show that, using the modified syndrome approach, we can support multiple finite fields with high throughput.
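The zero-latency detection idea rests on a basic property of cyclic codes: the syndrome is the remainder of the received polynomial modulo the generator polynomial, and it is all-zero exactly when no detectable error occurred. A small GF(2) sketch (illustrative only, using the (7,4) code's generator x^3 + x + 1, not the paper's hardware design):

```python
def poly_mod2(received, generator):
    """Remainder of GF(2) polynomial division, coefficients highest-degree first.
    For a cyclic code this is the syndrome: all-zero iff no detectable error."""
    r = list(received)
    g = list(generator)
    for i in range(len(r) - len(g) + 1):
        if r[i]:                       # leading coefficient set: subtract (XOR) g
            for j, gj in enumerate(g):
                r[i + j] ^= gj
    return r[-(len(g) - 1):]           # remainder has deg(g) coefficients - 1
```

A codeword such as g(x)·x^3 divides evenly (syndrome 000), so a decoder can skip the expensive key-equation and Chien-search stages; flipping any bit makes the syndrome nonzero and triggers the GPU correction path.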
33
CuDDI: A CUDA-Based Application for Extracting Drug-Drug Interaction Related Substance Terms from PubMed Literature. Molecules 2019; 24:1081. [PMID: 30893816] [PMCID: PMC6470591] [DOI: 10.3390/molecules24061081]
Abstract
Drug-drug interaction (DDI) is becoming a serious issue in clinical pharmacy as the use of multiple medications becomes more common. The PubMed database is one of the biggest literature resources for DDI studies; it contains over 150,000 journal articles related to DDI and is still expanding at a rapid pace. The extraction of DDI-related information, including compounds and proteins, from PubMed is an essential step for DDI research. In this paper, we introduce a tool, CuDDI (compute unified device architecture-based DDI searching), for identification of DDI-related terms (including compounds and proteins) from PubMed. The application comprises three modules: automatic retrieval of substances from PubMed, identification of DDI-related terms, and display of the relationships among DDI-related terms. For DDI term identification, a speedup of 30-105 times was observed for the compute unified device architecture (CUDA)-based version compared with a CPU-based Python implementation. CuDDI can be used to discover DDI-related terms and the relationships among these terms, which has the potential to help clinicians and pharmacists better understand the mechanisms of DDIs. CuDDI is available at: https://github.com/chengusf/CuDDI.
34
Chang HH, Lin YJ, Zhuang AH. An Automatic Parameter Decision System of Bilateral Filtering with GPU-Based Acceleration for Brain MR Images. J Digit Imaging 2019; 32:148-161. [PMID: 30088157] [PMCID: PMC6382639] [DOI: 10.1007/s10278-018-0110-y]
Abstract
Bilateral filters have been extensively utilized in a number of image denoising applications such as segmentation, registration, and tissue classification. However, they require burdensome adjustment of the filter parameters to achieve the best performance for each individual image. To address this problem, this paper proposes a computer-aided parameter decision system based on image texture features associated with neural networks. In our approach, parallel computing with the GPU architecture is first developed to accelerate the computation of the conventional bilateral filter. Subsequently, a back propagation network (BPN) scheme using significant image texture features as the input is established to estimate the GPU-based bilateral filter parameters and drive its denoising process. The k-fold cross validation method is exploited to evaluate the performance of the proposed automatic restoration framework. A wide variety of T1-weighted brain MR images were employed to train and evaluate this parameter-free decision system with GPU-based bilateral filtering, which resulted in a speed-up factor of 208 compared to the CPU-based computation. The proposed filter parameter prediction system achieved a mean absolute percentage error (MAPE) of 6% and was classified as "high accuracy". Our automatic denoising framework dramatically removed noise in numerous brain MR images and outperformed several state-of-the-art methods based on the peak signal-to-noise ratio (PSNR). The use of image texture features associated with the BPN to estimate the GPU-based bilateral filter parameters and to automate the denoising process is feasible and validated. It is suggested that this automatic restoration system is advantageous for various brain MR image-processing applications.
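For reference, the conventional bilateral filter that this system accelerates and tunes combines a spatial Gaussian with a range (intensity) Gaussian, so edges are preserved while noise is averaged out; the two sigmas are exactly the parameters the abstract says must be tuned per image. A brute-force 1-D sketch (illustrative only, not the paper's GPU kernel):

```python
import numpy as np

def bilateral_1d(signal, sigma_s, sigma_r, radius):
    """Brute-force bilateral filter on a 1-D signal: each output sample is a
    weighted mean of its neighbours, weight = spatial Gaussian x range Gaussian."""
    out = np.empty_like(signal, dtype=float)
    for i in range(len(signal)):
        lo, hi = max(0, i - radius), min(len(signal), i + radius + 1)
        idx = np.arange(lo, hi)
        w = (np.exp(-0.5 * ((idx - i) / sigma_s) ** 2) *
             np.exp(-0.5 * ((signal[idx] - signal[i]) / sigma_r) ** 2))
        out[i] = np.sum(w * signal[idx]) / np.sum(w)
    return out
```

On a step edge the range term suppresses cross-edge neighbours, so the edge is preserved; the per-pixel independence of the loop is what makes the filter embarrassingly parallel on a GPU.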
35
Okada S, Murakami K, Incerti S, Amako K, Sasaki T. MPEXS-DNA, a new GPU-based Monte Carlo simulator for track structures and radiation chemistry at subcellular scale. Med Phys 2019; 46:1483-1500. [PMID: 30593679] [PMCID: PMC6850505] [DOI: 10.1002/mp.13370]
Abstract
Purpose Track structure simulation codes can accurately reproduce the stochastic nature of particle-matter interactions to quantitatively evaluate radiation damage in biological cells, such as DNA strand breaks and base damage. Such simulations handle large numbers of secondary charged particles and molecular species created in the irradiated medium. Every particle and molecular species is tracked step-by-step using a Monte Carlo method to calculate energy loss patterns and spatial distributions of molecular species inside a cell nucleus with high spatial accuracy. The Geant4-DNA extension of the Geant4 general-purpose Monte Carlo simulation toolkit allows for such track structure simulations and can be run on CPUs. However, long execution times have been observed for the simulation of DNA damage in cells. We present in this work an improvement of the computing performance of such simulations using ultraparallel processing on a graphics processing unit (GPU). Methods A new Monte Carlo simulator named MPEXS-DNA, achieving high computing performance by using a GPU, has been developed for track structure and radiolysis simulations at the subcellular scale. MPEXS-DNA physics and chemical processes are based on the Geant4-DNA processes available in Geant4 version 10.02 p03. We have reimplemented the Geant4-DNA process codes of the physics stage (electromagnetic processes of charged particles) and the chemical stage (diffusion and chemical reactions of molecular species) for microdosimetry simulation using the CUDA language. MPEXS-DNA can calculate the distribution of energy loss in the irradiated medium caused by charged particles and also simulate the production, diffusion, and chemical interactions of molecular species from water radiolysis to quantitatively assess initial damage to DNA.
The validation of MPEXS‐DNA physics and chemical simulations was performed by comparing various types of distributions, namely the radial dose distributions for the physics stage, and the G‐value profiles for each chemical product and their linear energy transfer dependency for the chemical stage, to existing experimental data and simulation results obtained by other simulation codes, including PARTRAC. Results For physics validation, radial dose distributions calculated by MPEXS‐DNA are consistent with experimental data and numerical simulations. For chemistry validation, MPEXS‐DNA can also reproduce G‐value profiles for each molecular species with the same tendency as existing experimental data. MPEXS‐DNA also agrees with simulations by PARTRAC reasonably well. However, we have confirmed that there are slight discrepancies in G‐value profiles calculated by MPEXS‐DNA for molecular species such as H2 and H2O2 when compared to experimental data and PARTRAC simulations. The differences in G‐value profiles between MPEXS‐DNA and PARTRAC are caused by the different chemical reactions considered. MPEXS‐DNA can drastically boost the computing performance of track structure and radiolysis simulations. By using NVIDIA's GPU devices adopting the Volta architecture, MPEXS‐DNA has achieved speedup factors up to 2900 against Geant4‐DNA simulations with a single CPU core. Conclusion The MPEXS‐DNA Monte Carlo simulation achieves similar accuracy to Monte Carlo simulations performed using other codes such as Geant4‐DNA and PARTRAC, and its predictions are consistent with experimental data. Notably, MPEXS‐DNA allows calculations that are, at maximum, 2900 times faster than conventional simulations using a CPU.
36
Abstract
Background We present a performance-per-watt analysis of CUDAlign 4.0, a parallel strategy to obtain the optimal pairwise alignment of huge DNA sequences on multi-GPU platforms using the exact Smith-Waterman method. Results Our study includes acceleration factors, performance, scalability, power efficiency and energy costs. We also quantify the influence of the contents of the compared sequences, identify potential scenarios for energy savings in speculative executions, and measure performance and energy usage differences among distinct GPU generations and models. For a sequence alignment on a chromosome-wide scale (around 2 Petacells), we are able to reduce execution times from 9.5 h on a Kepler GPU to just 2.5 h on a Pascal counterpart, with energy costs cut by 60%. Conclusions We find GPUs to be an order of magnitude ahead in performance per watt compared to Xeon Phis. Finally, versus typical low-power devices like FPGAs, GPUs maintained similar GFLOPS/W ratios in 2017 while executing five times faster.
37
Landau W, Niemi J, Nettleton D. Fully Bayesian analysis of RNA-seq counts for the detection of gene expression heterosis. J Am Stat Assoc 2018; 114:610-621. [PMID: 31354180] [PMCID: PMC6660196] [DOI: 10.1080/01621459.2018.1497496]
Abstract
Heterosis, or hybrid vigor, is the enhancement of the phenotype of hybrid progeny relative to their inbred parents. Heterosis is extensively used in agriculture, and the underlying mechanisms are unclear. To investigate the molecular basis of phenotypic heterosis, researchers search tens of thousands of genes for heterosis with respect to expression in the transcriptome. Difficulty arises in the assessment of heterosis due to composite null hypotheses and non-uniform distributions for p-values under these null hypotheses. Thus, we develop a general hierarchical model for count data and a fully Bayesian analysis in which an efficient parallelized Markov chain Monte Carlo algorithm ameliorates the computational burden. We use our method to detect gene expression heterosis in a two-hybrid plant-breeding scenario, both in a real RNA-seq maize dataset and in simulation studies. In the simulation studies, we show our method has well-calibrated posterior probabilities and credible intervals when the model assumed in analysis matches the model used to simulate the data. Although model misspecification can adversely affect calibration, the methodology is still able to accurately rank genes. Finally, we show that hyperparameter posteriors are extremely narrow and an empirical Bayes (eBayes) approach based on posterior means from the fully Bayesian analysis provides virtually equivalent posterior probabilities, credible intervals, and gene rankings relative to the fully Bayesian solution. This evidence of equivalence provides support for the use of eBayes procedures in RNA-seq data analysis if accurate hyperparameter estimates can be obtained.
38
Abbaszadeh O, Khanteymoori AR, Azarpeyvand A. Parallel Algorithms for Inferring Gene Regulatory Networks: A Review. Curr Genomics 2018; 19:603-614. [PMID: 30386172] [PMCID: PMC6194435] [DOI: 10.2174/1389202919666180601081718]
Abstract
Systems biology problems such as whole-genome network construction from large-scale gene expression data are sophisticated and time-consuming, so sequential algorithms cannot obtain a solution in an acceptable amount of time. Today, massively parallel computing makes it possible to infer large-scale gene regulatory networks. Recently, establishing gene regulatory networks from large-scale datasets has drawn notable attention from researchers in the fields of parallel computing and systems biology. In this paper, we provide a detailed overview of recent parallel algorithms for constructing gene regulatory networks. Firstly, the fundamentals of gene regulatory network inference and the challenges of large-scale datasets are given. Secondly, four parallel frameworks and libraries, CUDA, OpenMP, MPI, and Hadoop, are described in detail. Thirdly, parallel algorithms are reviewed. Finally, conclusions and guidelines for parallel reverse engineering are presented.
39
Awan MG, Eslami T, Saeed F. GPU-DAEMON: GPU algorithm design, data management & optimization template for array based big omics data. Comput Biol Med 2018; 101:163-173. [PMID: 30145436] [DOI: 10.1016/j.compbiomed.2018.08.015]
Abstract
In the age of ever-increasing data, faster and more efficient data processing algorithms are needed. Graphics Processing Units (GPUs) are emerging as a cost-effective alternative architecture for high-end computing. The optimal design of GPU algorithms is a challenging task that requires thorough understanding of the high performance computing architecture as well as of algorithmic design, and the steep learning curve needed for effective GPU-centric algorithm design and implementation demands considerable expertise, time, and resources. In this paper, we present GPU-DAEMON, a GPU Data Management, Algorithm Design and Optimization technique suitable for processing array-based big omics data. Our proposed GPU algorithm design template outlines generic methods to tackle critical bottlenecks, which can be followed to implement high performance, scalable GPU algorithms for a given big data problem. We study the capability of GPU-DAEMON by reviewing the implementation of GPU-DAEMON-based algorithms for three different big data problems. Speed-ups as large as 386× (over the sequential version) and 50× (over naive GPU design methods) are observed. The GPU-DAEMON template is available at https://github.com/pcdslab/GPU-DAEMON and the source codes for GPU-ArraySort, G-MSR and GPU-PCC are available at https://github.com/pcdslab.
40
Matić T, Aleksi I, Hocenski Ž, Kraus D. Real-time biscuit tile image segmentation method based on edge detection. ISA Transactions 2018; 76:246-254. [PMID: 29609803] [DOI: 10.1016/j.isatra.2018.03.015]
Abstract
In this paper we propose a novel real-time Biscuit Tile Segmentation (BTS) method for images from a ceramic tile production line. The BTS method is based on signal change detection and contour tracing, with the main goal of separating tile pixels from the background in images captured on the production line. Usually, human operators visually inspect and classify produced ceramic tiles. Computer vision and image processing techniques can automate the visual inspection process if they fulfill real-time requirements, and an important step in this process is real-time segmentation of tile pixels. The BTS method is implemented for parallel execution on a GPU device to satisfy the real-time constraints of the tile production line. It outperforms 2D threshold-based methods, 1D edge detection methods and contour-based methods, and is in use in the biscuit tile production line.
41
Eslami T, Saeed F. Fast-GPU-PCC: A GPU-Based Technique to Compute Pairwise Pearson's Correlation Coefficients for Time Series Data-fMRI Study. High Throughput 2018; 7:E11. [PMID: 29677161] [PMCID: PMC6023306] [DOI: 10.3390/ht7020011]
Abstract
Functional magnetic resonance imaging (fMRI) is a non-invasive brain imaging technique that has been used regularly in the past few years to study the brain's functional activity. A widely used measure for capturing functional associations in the brain is Pearson's correlation coefficient, which is commonly employed for constructing functional networks and studying the dynamic functional connectivity of the brain. These are useful measures for understanding the effects of brain disorders on connectivity among brain regions. fMRI scanners produce a huge number of voxels, and using traditional central processing unit (CPU)-based techniques for computing pairwise correlations is very time consuming, especially when a large number of subjects are studied. In this paper, we propose a graphics processing unit (GPU)-based algorithm called Fast-GPU-PCC for computing pairwise Pearson's correlation coefficients. Exploiting the symmetry of the correlation matrix, the approach returns the N(N-1)/2 coefficients of its strictly upper triangular part, stored in a one-dimensional array in an order that is convenient for further use. Our experiments on real and synthetic fMRI data with different numbers of voxels and varying lengths of time series show that the proposed approach outperformed state-of-the-art GPU-based techniques as well as sequential CPU-based versions. We show that Fast-GPU-PCC runs 62 times faster than the CPU-based version and about 2 to 3 times faster than two other state-of-the-art GPU-based methods.
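The symmetric-matrix trick described above is easy to sketch on a CPU with NumPy (illustrative only, not the Fast-GPU-PCC kernel): z-score each voxel's time series, after which the full correlation matrix is a single matrix product, from which the strict upper triangle is flattened into a 1-D array.

```python
import numpy as np

def pairwise_pcc_upper(x):
    """Pearson correlations of all row pairs of x (rows = voxel time series),
    computed as one matrix product of normalized rows; returns the strict
    upper triangle flattened into a 1-D array of length n*(n-1)/2."""
    z = x - x.mean(axis=1, keepdims=True)           # center each time series
    z /= np.linalg.norm(z, axis=1, keepdims=True)   # unit-norm each row
    corr = z @ z.T                                  # the GEMM a GPU accelerates
    iu = np.triu_indices(len(x), k=1)               # strict upper triangle, row-major
    return corr[iu]
```

Casting the computation as one large matrix multiply is exactly what lets a GPU BLAS routine do the heavy lifting, while the upper-triangle layout halves the memory and avoids storing redundant symmetric entries.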
42
Du H, Xia M, Zhao K, Liao X, Yang H, Wang Y, He Y. PAGANI Toolkit: Parallel graph-theoretical analysis package for brain network big data. Hum Brain Mapp 2018; 39:1869-1885. [PMID: 29417688] [DOI: 10.1002/hbm.23996]
Abstract
The recent collection of unprecedented quantities of neuroimaging data with high spatial resolution has led to brain network big data. However, a toolkit for fast and scalable computational solutions is still lacking. Here, we developed the PArallel Graph-theoretical ANalysIs (PAGANI) Toolkit based on a hybrid central processing unit-graphics processing unit (CPU-GPU) framework with a graphical user interface to facilitate the mapping and characterization of high-resolution brain networks. Specifically, the toolkit provides flexible parameters for users to customize computations of graph metrics in brain network analyses. As an empirical example, the PAGANI Toolkit was applied to individual voxel-based brain networks with ∼200,000 nodes that were derived from a resting-state fMRI dataset of 624 healthy young adults from the Human Connectome Project. Using a personal computer, this toolbox completed all computations in ∼27 h for one subject, which is markedly less than the 118 h required with a single-thread implementation. The voxel-based functional brain networks exhibited prominent small-world characteristics and densely connected hubs, which were mainly located in the medial and lateral fronto-parietal cortices. Moreover, the female group had significantly higher modularity and nodal betweenness centrality mainly in the medial/lateral fronto-parietal and occipital cortices than the male group. Significant correlations between the intelligence quotient and nodal metrics were also observed in several frontal regions. Collectively, the PAGANI Toolkit shows high computational performance and good scalability for analyzing connectome big data and provides a friendly interface without the complicated configuration of computing environments, thereby facilitating high-resolution connectomics research in health and disease.
|
43
|
Nobile MS, Cazzaniga P, Tangherloni A, Besozzi D. Graphics processing units in bioinformatics, computational biology and systems biology. Brief Bioinform 2017; 18:870-885. [PMID: 27402792 PMCID: PMC5862309 DOI: 10.1093/bib/bbw058] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2016] [Indexed: 01/18/2023] Open
Abstract
Several studies in Bioinformatics, Computational Biology and Systems Biology rely on the definition of physico-chemical or mathematical models of biological systems at different scales and levels of complexity, ranging from the interaction of atoms in single molecules up to genome-wide interaction networks. Traditional computational methods and software tools developed in these research fields share a common trait: they can be computationally demanding on Central Processing Units (CPUs), therefore limiting their applicability in many circumstances. To overcome this issue, general-purpose Graphics Processing Units (GPUs) are gaining increasing attention from the scientific community, as they can considerably reduce the running time required by standard CPU-based software, and allow more intensive investigations of biological systems. In this review, we present a collection of GPU tools recently developed to perform computational analyses in life science disciplines, emphasizing the advantages and the drawbacks in the use of these parallel architectures. The complete list of GPU-powered tools reviewed here is available at http://bit.ly/gputools.
|
44
|
Pryor A, Ophus C, Miao J. A streaming multi-GPU implementation of image simulation algorithms for scanning transmission electron microscopy. Adv Struct Chem Imaging 2017; 3:15. [PMID: 29104852 PMCID: PMC5656717 DOI: 10.1186/s40679-017-0048-z] [Citation(s) in RCA: 62] [Impact Index Per Article: 8.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2017] [Accepted: 10/13/2017] [Indexed: 11/25/2022]
Abstract
Simulation of atomic-resolution image formation in scanning transmission electron microscopy can require significant computation times using traditional methods. A recently developed method, termed plane-wave reciprocal-space interpolated scattering matrix (PRISM), demonstrates potential for significant acceleration of such simulations with negligible loss of accuracy. Here, we present a software package called Prismatic for parallelized simulation of image formation in scanning transmission electron microscopy (STEM) using both the PRISM and multislice methods. By distributing the workload between multiple CUDA-enabled GPUs and multicore processors, accelerations as high as 1000 × for PRISM and 15 × for multislice are achieved relative to traditional multislice implementations using a single 4-GPU machine. We demonstrate a potentially important application of Prismatic, using it to compute images for atomic electron tomography at sufficient speeds to include in the reconstruction pipeline. Prismatic is freely available both as an open-source CUDA/C++ package with a graphical user interface and as a Python package, PyPrismatic.
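The multi-device workload distribution described here can be illustrated at its simplest as a static split of independent work items (e.g. STEM probe positions) across workers. This is a sketch of the general idea only; Prismatic's streaming scheduler is more sophisticated, and all names below are ours:

```python
def partition_work(items, n_devices):
    """Split independent work items into near-equal contiguous chunks,
    one per GPU/CPU worker. Early workers absorb the remainder."""
    base, extra = divmod(len(items), n_devices)
    chunks, start = [], 0
    for d in range(n_devices):
        size = base + (1 if d < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

# 10 probe positions over 3 devices: chunk sizes 4, 3, 3.
chunks = partition_work(list(range(10)), 3)
```

A streaming implementation would instead hand out small batches on demand, so faster devices automatically take more work.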
|
45
|
He J, Zhou Z, Reed M, Califano A. Accelerated parallel algorithm for gene network reverse engineering. BMC SYSTEMS BIOLOGY 2017; 11:83. [PMID: 28950860 PMCID: PMC5615246 DOI: 10.1186/s12918-017-0458-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
Abstract
Background The Algorithm for the Reconstruction of Accurate Cellular Networks (ARACNE) represents one of the most effective tools to reconstruct gene regulatory networks from large-scale molecular profile datasets. However, previous implementations require intensive computing resources and, in some cases, restrict the number of samples that can be used. These issues can be addressed elegantly in a GPU computing framework, where repeated mathematical computation can be done efficiently, but doing so requires an extensive redesign to apply parallel computing techniques to the original serial algorithm, involving detailed optimization efforts based on a deep understanding of both hardware and software architecture. Results Here, we present an accelerated parallel implementation of ARACNE (GPU-ARACNE). By taking advantage of multi-level parallelism and the Compute Unified Device Architecture (CUDA) parallel kernel-call library, GPU-ARACNE successfully parallelizes the serial algorithm and simplifies the user experience from multi-step operations to one step. Using public datasets on comparable hardware configurations, we showed that GPU-ARACNE is faster than previous implementations and is able to reconstruct equally valid gene regulatory networks. Conclusion Given that previous versions of ARACNE are extremely resource demanding, either in computational time or in hardware investment, GPU-ARACNE is remarkably valuable for researchers who need to build complex regulatory networks from large expression datasets but have a limited budget for computational resources. In addition, our GPU-centered optimization of adaptive partitioning for Mutual Information (MI) estimation provides lessons that are applicable to other domains. Electronic supplementary material The online version of this article (doi:10.1186/s12918-017-0458-5) contains supplementary material, which is available to authorized users.
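ARACNE's core quantity is the mutual information between pairs of expression profiles. A minimal discrete plug-in estimator in plain Python shows the sum being parallelized; the paper's adaptive-partitioning estimator for continuous values is a refinement of this idea, and the names here are ours:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in MI estimate (in bits) from empirical joint frequencies:
    MI = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y)))."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        mi += p * math.log2(p / ((px[x] / n) * (py[y] / n)))
    return mi

# Identical binary variables share exactly 1 bit; independent ones share none.
dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Since MI must be computed for every gene pair, the workload is a large grid of independent estimates, which maps naturally onto CUDA kernels.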
|
46
|
Wei JD, Cheng HJ, Lin CY, Ye J, Yeh KY. Embedded-Based Graphics Processing Unit Cluster Platform for Multiple Sequence Alignments. Evol Bioinform Online 2017; 13:1176934317724764. [PMID: 28835734 PMCID: PMC5555494 DOI: 10.1177/1176934317724764] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2017] [Accepted: 07/12/2017] [Indexed: 11/20/2022] Open
Abstract
High-end graphics processing units (GPUs), such as NVIDIA Tesla/Fermi/Kepler series cards with thousands of cores per chip, have been widely applied to high-performance computing over the past decade. These desktop GPU cards must be installed in personal computers/servers with desktop CPUs, and the cost and power consumption of constructing a GPU cluster platform are very high. In recent years, NVIDIA released an embedded board, called Jetson Tegra K1 (TK1), which contains 4 ARM Cortex-A15 CPUs and 192 Compute Unified Device Architecture (CUDA) cores of the Kepler generation. Jetson Tegra K1 has several advantages, such as low cost, low power consumption, and high applicability, and it has been applied to several specific applications. In our previous work, a bioinformatics platform with a single TK1 (STK platform) was constructed, demonstrating that Web and mobile services can be implemented on the STK platform with a good cost-performance ratio compared with desktop CPUs and GPUs. In this work, an embedded-based GPU cluster platform is constructed with multiple TK1s (MTK platform). Complex system installation and setup are necessary first steps. Then, 2 job assignment modes are designed for the MTK platform to provide services to users. Finally, ClustalW v2.0.11 and ClustalWtk are ported to the MTK platform. The experimental results showed that the speedup ratios reached 5.5 and 4.8 times for ClustalW v2.0.11 and ClustalWtk, respectively, when comparing 6 TK1s with a single TK1. The MTK platform is proven to be useful for multiple sequence alignments.
|
47
|
Chang HH, Chang YN. CUDA-based acceleration and BPN-assisted automation of bilateral filtering for brain MR image restoration. Med Phys 2017; 44:1420-1436. [PMID: 28196280 DOI: 10.1002/mp.12157] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2016] [Revised: 02/02/2017] [Accepted: 02/08/2017] [Indexed: 11/11/2022] Open
Abstract
PURPOSE Bilateral filters have been substantially exploited in numerous magnetic resonance (MR) image restoration applications for decades. Due to the lack of a theoretical basis for setting the filter parameters, empirical manipulation with fixed values and noise variance-related adjustments has generally been employed. The outcome of these strategies is usually sensitive to variations in brain structure, and not all three parameter values are optimal. This article investigates the optimal setting of the bilateral filter, from which an accelerated and automated restoration framework is developed. METHODS To reduce the computational burden of the bilateral filter, parallel computing with the graphics processing unit (GPU) architecture is first introduced. The NVIDIA Tesla K40c GPU with the compute unified device architecture (CUDA) functionality is specifically utilized to emphasize thread usage and memory resources. To correlate the filter parameters with image characteristics for automation, optimal image texture features are then acquired based on the sequential forward floating selection (SFFS) scheme. Subsequently, the selected features are introduced into the back propagation network (BPN) model for filter parameter estimation. Finally, the k-fold cross validation method is adopted to evaluate the accuracy of the proposed filter parameter prediction framework. RESULTS A wide variety of T1-weighted brain MR images with various scenarios of noise levels and anatomic structures were utilized to train and validate this new parameter decision system with CUDA-based bilateral filtering. For a common brain MR image volume of 256 × 256 × 256 pixels, the speed-up gain reached 284. Six optimal texture features were acquired and associated with the BPN to establish a "high accuracy" parameter prediction system, which achieved a mean absolute percentage error (MAPE) of 5.6%. Automatic restoration results on 2460 brain MR images received an average relative error in terms of peak signal-to-noise ratio (PSNR) of less than 0.1%. In comparison with many state-of-the-art filters, the proposed automation framework with CUDA-based bilateral filtering provided more favorable results both quantitatively and qualitatively. CONCLUSIONS Possessing unique characteristics and demonstrating exceptional performances, the proposed CUDA-based bilateral filter adequately removed random noise in multifarious brain MR images for further study in neurosciences and radiological sciences. It requires no prior knowledge of the noise variance and automatically restores MR images while preserving fine details. The strategy of exploiting CUDA to accelerate the computation and incorporating texture features into the BPN to fully automate the bilateral filtering process is achievable and validated, from which the best performance is reached.
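The filter in question is the classic bilateral filter, whose per-pixel independence is what makes it amenable to CUDA parallelization. A minimal serial sketch in plain Python (illustrative only, not the paper's implementation; parameter values are ours):

```python
import math

def bilateral_filter(img, radius, sigma_s, sigma_r):
    """Classic bilateral filter: each output pixel is a normalized sum of
    neighbors weighted by both spatial distance and intensity difference."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc, norm = 0.0, 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        # spatial (domain) weight
                        ws = math.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))
                        # intensity (range) weight -- this is what preserves edges
                        diff = img[ny][nx] - img[y][x]
                        wr = math.exp(-(diff * diff) / (2 * sigma_r ** 2))
                        acc += ws * wr * img[ny][nx]
                        norm += ws * wr
            out[y][x] = acc / norm
    return out

# A flat region stays flat; a sharp edge survives because the range weight
# suppresses contributions from pixels across the edge.
flat = [[10.0] * 4 for _ in range(4)]
smoothed = bilateral_filter(flat, 1, 1.0, 10.0)
```

Each output pixel depends only on a small read-only neighborhood, so one CUDA thread per pixel is the natural mapping; the three parameters (radius, sigma_s, sigma_r) are exactly the ones the BPN in the paper learns to predict.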
|
48
|
Kobus R, Hundt C, Müller A, Schmidt B. Accelerating metagenomic read classification on CUDA-enabled GPUs. BMC Bioinformatics 2017; 18:11. [PMID: 28049411 PMCID: PMC5209836 DOI: 10.1186/s12859-016-1434-6] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2016] [Accepted: 12/16/2016] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Metagenomic sequencing studies are becoming increasingly popular, with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification; i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes, software tools for fast and accurate metagenomic read classification are urgently needed. RESULTS We present cuCLARK, a read-level classifier for CUDA-enabled GPUs, based on the fast and accurate classification of metagenomic sequences using reduced k-mers (CLARK) method. Using the processing power of a single Titan X GPU, cuCLARK can reach classification speeds of up to 50 million reads per minute. Corresponding speedups for species- (genus-)level classification range between 3.2 and 6.6 (3.7 and 6.4) compared to multi-threaded CLARK executed on a 16-core Xeon CPU workstation. CONCLUSION cuCLARK can perform metagenomic read classification at superior speeds on CUDA-enabled GPUs. It is free software licensed under GPL and can be downloaded at https://github.com/funatiq/cuclark free of charge.
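CLARK-style classification rests on discriminative k-mers, i.e. k-mers unique to a single taxon. A toy sketch in plain Python shows the idea; the reference sequences, the choice of k, and the function names are ours for illustration, not cuCLARK's code:

```python
def build_index(references, k):
    """Map each k-mer to the taxa containing it, then keep only
    discriminative k-mers (those unique to one taxon), as in CLARK."""
    kmer_taxa = {}
    for taxon, seq in references.items():
        for i in range(len(seq) - k + 1):
            kmer_taxa.setdefault(seq[i:i + k], set()).add(taxon)
    return {km: next(iter(t)) for km, t in kmer_taxa.items() if len(t) == 1}

def classify(read, index, k):
    """Assign the read to the taxon hit by the most discriminative k-mers,
    or None if no k-mer matches."""
    hits = {}
    for i in range(len(read) - k + 1):
        taxon = index.get(read[i:i + k])
        if taxon is not None:
            hits[taxon] = hits.get(taxon, 0) + 1
    return max(hits, key=hits.get) if hits else None

# Two tiny mock references; shared k-mers are discarded by the index.
refs = {"taxonA": "ACGTACGTGACC", "taxonB": "TTGACCAATGTT"}
idx = build_index(refs, 4)
label = classify("ACGTACGTG", idx, 4)
```

Because every read is classified independently against a shared read-only index, one GPU thread (or warp) per read is the natural parallelization.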
|
49
|
Abstract
Computational structure-based protein design (CSPD) is an important problem in computational biology, which aims to design or improve a prescribed protein function based on a protein structure template. It provides a practical tool for real-world protein engineering applications. A popular CSPD method that is guaranteed to find the global minimum energy solution (GMEC) is to combine both dead-end elimination (DEE) and A* tree search algorithms. However, in this framework, the A* search algorithm can run in exponential time in the worst case, which may become the computational bottleneck of a large-scale computational protein design process. To address this issue, we extend and add a new module to the OSPREY program previously developed in the Donald lab (Gainza et al., Methods Enzymol 523:87, 2013) to implement a GPU-based massively parallel A* algorithm for improving the protein design pipeline. By exploiting the modern GPU computational framework and optimizing the computation of the heuristic function for A* search, our new program, called gOSPREY, can provide up to four orders of magnitude speedup in large protein design cases with a small memory overhead compared with the traditional A* search algorithm implementation, while still guaranteeing optimality. In addition, gOSPREY can be configured to run in a bounded-memory mode to tackle problems in which the conformation space is too large and the global optimal solution could not be computed previously. Furthermore, the GPU-based A* algorithm implemented in the gOSPREY program can be combined with state-of-the-art rotamer pruning algorithms such as iMinDEE (Gainza et al., PLoS Comput Biol 8:e1002335, 2012) and DEEPer (Hallen et al., Proteins 81:18-39, 2013) to also consider continuous backbone and side-chain flexibility.
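The A* component being parallelized here is the textbook best-first search. A minimal serial version in plain Python, with a toy grid standing in for a conformation tree (the example domain and all names are ours, not gOSPREY's):

```python
import heapq

def astar(start, goal, neighbors, h):
    """Textbook A*: always expand the node minimizing g + h. With an
    admissible heuristic, the first expansion of the goal is optimal."""
    open_heap = [(h(start), 0, start)]
    g = {start: 0}
    closed = set()
    while open_heap:
        _, cost, node = heapq.heappop(open_heap)
        if node == goal:
            return cost
        if node in closed:
            continue
        closed.add(node)
        for nxt, step in neighbors(node):
            ng = cost + step
            if ng < g.get(nxt, float("inf")):
                g[nxt] = ng
                heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None  # goal unreachable

# Toy domain: a 5x5 4-connected grid with two blocked cells,
# using the admissible Manhattan-distance heuristic.
def grid_neighbors(p):
    x, y = p
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if 0 <= nx < 5 and 0 <= ny < 5 and (nx, ny) not in {(2, 1), (2, 2)}:
            yield (nx, ny), 1

cost = astar((0, 2), (4, 2), grid_neighbors, lambda p: abs(p[0] - 4) + abs(p[1] - 2))
```

The GPU opportunity the abstract points at lies in evaluating the heuristic for many frontier nodes at once, since each evaluation is independent.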
|
50
|
Toward Optimal Computation of Ultrasound Image Reconstruction Using CPU and GPU. SENSORS 2016; 16:s16121986. [PMID: 27886149 PMCID: PMC5190967 DOI: 10.3390/s16121986] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/13/2016] [Revised: 10/31/2016] [Accepted: 11/10/2016] [Indexed: 12/03/2022]
Abstract
An ultrasound image is reconstructed from echo signals received by the array elements of a transducer. The time of flight of the echo depends on the distance from the focus to the array elements. The received echo signals have to be delayed to make their wave fronts and phases coherent before summing the signals. In digital beamforming, the delays are not always located at the sampled points. Generally, the values of the delayed signals are estimated by the values of the nearest samples. This method is fast and easy, but inaccurate. Other methods are available for increasing the accuracy of the delayed signals and, consequently, the quality of the beamformed signals; for example, in-phase (I)/quadrature (Q) interpolation, which is more time consuming but provides more accurate values than the nearest samples. This paper compares the signals after dynamic receive beamforming, in which the echo signals are delayed using two methods: the nearest sample method and the I/Q interpolation method. Comparisons of the visual quality of the reconstructed images and the quality of the beamformed signals are reported. Moreover, the computational speeds of these methods are also optimized by reorganizing the data processing flow and by applying the graphics processing unit (GPU). The use of single and double precision floating-point formats for the intermediate data is also considered. The speeds with and without these optimizations are compared.
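The nearest-sample versus interpolated-delay trade-off the paper measures can be seen in a few lines. Below is a plain-Python sketch of delay-and-sum beamforming with both delay methods, using linear interpolation as a simple stand-in for the paper's I/Q interpolation; all names and signal values are ours:

```python
def delay_nearest(x, delay, fs):
    """Delay a sampled signal by rounding the delay to the nearest sample."""
    idx = int(delay * fs + 0.5)
    return [x[i - idx] if 0 <= i - idx < len(x) else 0.0 for i in range(len(x))]

def delay_linear(x, delay, fs):
    """Delay with linear interpolation between the two bracketing samples
    (a simple stand-in for the more accurate I/Q interpolation)."""
    d = delay * fs
    i0, frac = int(d), d - int(d)
    out = []
    for i in range(len(x)):
        a = x[i - i0] if 0 <= i - i0 < len(x) else 0.0            # sample after the delay point
        b = x[i - i0 - 1] if 0 <= i - i0 - 1 < len(x) else 0.0    # sample before it
        out.append(frac * b + (1 - frac) * a)
    return out

def delay_and_sum(channels, delays, fs, delay_fn):
    """Dynamic-receive beamforming core: delay each channel so the wave
    fronts are coherent, then sum across channels."""
    summed = [0.0] * len(channels[0])
    for sig, d in zip(channels, delays):
        for i, v in enumerate(delay_fn(sig, d, fs)):
            summed[i] += v
    return summed

# A half-sample delay on a ramp: nearest-sample snaps to a whole sample,
# linear interpolation lands between the two bracketing values.
ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
yn = delay_nearest(ramp, 0.5, 1.0)
yl = delay_linear(ramp, 0.5, 1.0)
```

Since every focus point and channel is processed independently, this inner loop is a natural fit for the GPU optimization the paper evaluates.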
|