1
Spouge JL, Das P, Chen Y, Frith M. The Statistics of Parametrized Syncmers in a Simple Mutation Process Without Spurious Matches. J Comput Biol 2024. [PMID: 39530391 DOI: 10.1089/cmb.2024.0508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2024] Open
Abstract
Introduction: Often, bioinformatics uses summary sketches to analyze next-generation sequencing data, but most sketches are not well understood statistically. Under a simple mutation model, Blanca et al. analyzed complete sketches, that is, the complete set of unassembled k-mers, from two closely related sequences. The analysis extracted a point mutation parameter θ quantifying the evolutionary distance between the two sequences. Methods: We extend the results of Blanca et al. for complete sketches to parametrized syncmer sketches with downsampling. A syncmer sketch can sample k-mers much more sparsely than a complete sketch. Consider the following simple mutation model disallowing insertions or deletions. Consider a reference sequence A (e.g., a subsequence from a reference genome), and mutate each nucleotide in it independently with probability θ to produce a mutated sequence B (corresponding to, e.g., a set of reads or draft assembly of a related genome). Then, syncmer counts alone yield an approximate Gaussian distribution for estimating θ. The assumption disallowing insertions and deletions motivates a check on the lengths of A and B. The syncmer count from B yields an approximate Gaussian distribution for its length, and a p-value can test the length of B against the length of A using syncmer counts alone. Results: The Gaussian distributions permit syncmer counts alone to estimate θ and mutated sequence length with a known sampling error. Under some circumstances, the results provide the sampling error for the Mash containment index when applied to syncmer counts. Conclusions: The approximate Gaussian distributions provide hypothesis tests and confidence intervals for phylogenetic distance and sequence length. Our methods are likely to generalize to sketches other than syncmers and may be useful in assembling reads and related applications.
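The mutation model described in this abstract can be illustrated with a small simulation. The following is a minimal Python sketch, not the authors' method: it uses closed syncmers with lexicographic s-mer ordering as a stand-in for the paper's parametrized syncmers, omits downsampling, and returns only a point estimate of θ (no Gaussian confidence interval). The function names and the parameters `k` and `s` are illustrative assumptions. The estimate relies on the fact that, under independent substitutions with probability θ, a k-mer survives unmutated with probability (1 − θ)^k.

```python
import random


def mutate(seq, theta, rng):
    """Substitute each nucleotide independently with probability theta (no indels)."""
    out = []
    for base in seq:
        if rng.random() < theta:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def is_closed_syncmer(kmer, s):
    """Assumed definition: the smallest s-mer (lexicographic order, leftmost
    on ties) starts at the first or the last possible position in the k-mer."""
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    pos = smers.index(min(smers))
    return pos == 0 or pos == len(smers) - 1


def estimate_theta(a, b, k, s):
    """Point estimate of theta from syncmers of A that are unchanged at the
    same position in B (position-wise comparison, so no spurious matches)."""
    total = kept = 0
    for i in range(len(a) - k + 1):
        kmer = a[i:i + k]
        if is_closed_syncmer(kmer, s):
            total += 1
            kept += kmer == b[i:i + k]
    if total == 0 or kept == 0:
        raise ValueError("not enough surviving syncmers to estimate theta")
    # A k-mer survives with probability (1 - theta)^k; invert that relation.
    return 1.0 - (kept / total) ** (1.0 / k)
```

For example, mutating a random sequence of 20,000 nucleotides with θ = 0.05 and calling `estimate_theta(a, b, k=15, s=5)` should return a value close to 0.05, with the sampling error that the paper characterizes analytically.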
Affiliation(s)
- John L Spouge
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Pijush Das
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Ye Chen
  - Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan
- Martin Frith
  - Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), AIST, Tokyo, Japan
2
Chen SL, Shen YJ, Chen GZ. RNA Sequencing Analysis of Patients with Chronic Hepatitis B Treated Using PEGylated Interferon. Int J Gen Med 2024; 17:4465-4474. [PMID: 39372134 PMCID: PMC11453141 DOI: 10.2147/ijgm.s474284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2024] [Accepted: 09/25/2024] [Indexed: 10/08/2024] Open
Abstract
Purpose: Worldwide, chronic hepatitis B virus (CHB) infection is a public health concern, ultimately leading to liver cirrhosis and hepatocellular carcinoma. Currently, patients with CHB can be treated using polyethylene glycol (PEG)ylated interferon (PEG-IFN) antiviral therapy, which has both immune modulatory and antiviral properties. This study aimed to reveal the mechanism underlying the effect of PEG-IFN therapy, to rationally optimize this therapeutic option. Patients and Methods: Ten patients with CHB who were positive for the hepatitis B virus e antigen (HBeAg) and were receiving PEG-IFN treatment were enrolled. Clinical and virological parameters were monitored during 48 weeks of treatment. In addition, peripheral blood mononuclear cells (PBMCs) were collected from the 10 patients at 0, 24, and 36 weeks. RNA sequencing technology was used to analyze the RNA expression profile in the PBMC samples. Results: Following PEG-IFN treatment, we identified 217 differentially expressed genes (DEGs), most of which were upregulated. Gene ontology enrichment analysis of the DEGs revealed that they were enriched in 29 clusters, mainly associated with "antiviral defense", "innate immunity", "immunity", "defense response to virus", "response to virus", "type I interferon signaling pathway", "negative regulation of viral genome replication", "innate immune response", and "RNA-binding". Conclusion: After PEG-IFN treatment, a certain mRNA expression profile was observed in patients with CHB, providing further mechanistic insights into the antiviral effect of this therapy.
Affiliation(s)
- Shao-Long Chen
  - Department of Infectious Disease Control and Prevention, Yueqing Center for Disease Control and Prevention, Wenzhou, 325600, People’s Republic of China
- Yao-Jie Shen
  - Department of Infectious Diseases, Huashan Hospital, Fudan University, Shanghai, 200040, People’s Republic of China
- Guo-Zhi Chen
  - Department of Infectious Disease Control and Prevention, Yueqing Center for Disease Control and Prevention, Wenzhou, 325600, People’s Republic of China
3
Yang J, Yang W, Hu Y, Tong L, Liu R, Liu L, Jiang B, Sun Z. Screening of genes co-associated with osteoporosis and chronic HBV infection based on bioinformatics analysis and machine learning. Front Immunol 2024; 15:1472354. [PMID: 39351238 PMCID: PMC11439653 DOI: 10.3389/fimmu.2024.1472354] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2024] [Accepted: 08/28/2024] [Indexed: 10/04/2024] Open
Abstract
Objective: To identify HBV-related genes (HRGs) implicated in osteoporosis (OP) pathogenesis and develop a diagnostic model for early OP detection in chronic HBV infection (CBI) patients. Methods: Five public sequencing datasets were collected from the GEO database. Gene differential expression and LASSO analyses identified genes linked to OP and CBI. Machine learning algorithms (random forests, support vector machines, and gradient boosting machines) further filtered these genes. The best diagnostic model was chosen based on accuracy and Kappa values. A nomogram model based on HRGs was constructed and assessed for reliability. OP patients were divided into two chronic HBV-related clusters using non-negative matrix factorization. Differential gene expression analysis, Gene Ontology, and KEGG enrichment analyses explored the roles of these genes in OP progression, using ssGSEA and GSVA. Differences in immune cell infiltration between clusters and the correlation between HRGs and immune cells were examined using ssGSEA and the Pearson method. Results: Differential gene expression analysis of CBI and combined OP dataset identified 822 and 776 differentially expressed genes, respectively, with 43 genes intersecting. Following LASSO analysis and various machine learning recursive feature elimination algorithms, 16 HRGs were identified. The support vector machine emerged as the best predictive model based on accuracy and Kappa values, with AUC values of 0.92, 0.83, 0.74, and 0.7 for the training set, validation set, GSE7429, and GSE7158, respectively. The nomogram model exhibited AUC values of 0.91, 0.79, and 0.68 in the training set, GSE7429, and GSE7158, respectively. Non-negative matrix factorization divided OP patients into two clusters, revealing statistically significant differences in 11 types of immune cell infiltration between clusters. Finally, intersecting the HRGs obtained from LASSO analysis with the HRGs identified three genes. Conclusion: This study successfully identified HRGs and developed an efficient diagnostic model based on HRGs, demonstrating high accuracy and strong predictive performance across multiple datasets. This research not only offers new insights into the complex relationship between OP and CBI but also establishes a foundation for the development of early diagnostic and personalized treatment strategies for chronic HBV-related OP.
Affiliation(s)
- Jia Yang
  - Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China
- Weiguang Yang
  - Department of Cardiovascular Surgery, Tianjin Medical University General Hospital, Tianjin, China
- Yue Hu
  - Clinical School of the Second People’s Hospital, Tianjin Medical University, Tianjin, China
- Linjian Tong
  - Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China
- Rui Liu
  - Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China
- Lice Liu
  - Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China
- Bei Jiang
  - Clinical School of the Second People’s Hospital, Tianjin Medical University, Tianjin, China
- Zhiming Sun
  - Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China
4
Schmidt B, Hildebrandt A. From GPUs to AI and quantum: three waves of acceleration in bioinformatics. Drug Discov Today 2024; 29:103990. [PMID: 38663581 DOI: 10.1016/j.drudis.2024.103990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 04/05/2024] [Accepted: 04/17/2024] [Indexed: 05/01/2024]
Abstract
The enormous growth in the amount of data generated by the life sciences is continuously shifting the field from model-driven science towards data-driven science. The need for efficient processing has led to the adoption of massively parallel accelerators such as graphics processing units (GPUs). Consequently, the development of bioinformatics methods nowadays often heavily depends on the effective use of these powerful technologies. Furthermore, progress in computational techniques and architectures continues to be highly dynamic, involving novel deep neural network models and artificial intelligence (AI) accelerators, and potentially quantum processing units in the future. These are expected to be disruptive for the life sciences as a whole and for drug discovery in particular. Here, we identify three waves of acceleration and their applications in a bioinformatics context: (i) GPU computing, (ii) AI and (iii) next-generation quantum computers.
Affiliation(s)
- Bertil Schmidt
  - Institut für Informatik, Johannes Gutenberg University, Mainz, Germany
5
Blay E, Hardyman E, Morovic W. PCR-based analytics of gene therapies using adeno-associated virus vectors: Considerations for cGMP method development. Mol Ther Methods Clin Dev 2023; 31:101132. [PMID: 37964893 PMCID: PMC10641278 DOI: 10.1016/j.omtm.2023.101132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2023]
Abstract
The field of gene therapy has evolved and improved so that today the treatment of thousands of genetic diseases is now possible. An integral aspect of the drug development process is generating analytical methods to be used throughout clinical and commercial manufacturing. Enumeration and identification assays using genetic testing are critical to ensure the safety, efficacy, and stability of many active pharmaceutical ingredients. While nucleic acid-based methods are already reliable and rapid, there are unique biological, technological, and regulatory aspects in gene therapies that must be considered. This review surveys aspects of method development and validation using nucleic acid-based testing of gene therapies by focusing on adeno-associated virus (AAV) vectors and their co-transfection factors. Key differences between quantitative PCR and droplet digital technologies are discussed to show how improvements can be made while still adhering to regulatory guidance. Example validation parameters for AAV genome titers are described to demonstrate the scope of analytical development. Finally, several areas for improving analytical testing are presented to inspire future innovation, including next-generation sequencing and artificial intelligence. Reviewing the broad characteristics of gene therapy assessment serves as an introduction for new researchers, while clarifying processes for professionals already involved in pharmaceutical manufacturing.
Affiliation(s)
- Emmanuel Blay
  - Gene & Cell Therapy, PPD GMP Laboratories, Part of ThermoFisher Scientific, Middleton, WI, USA
- Elaine Hardyman
  - Gene & Cell Therapy, PPD GMP Laboratories, Part of ThermoFisher Scientific, Middleton, WI, USA
- Wesley Morovic
  - Gene & Cell Therapy, PPD GMP Laboratories, Part of ThermoFisher Scientific, Middleton, WI, USA
6
Yan L, Yin Z, Zhang H, Zhao Z, Wang M, Müller A, Kallenborn F, Wichmann A, Wei Y, Niu B, Schmidt B, Liu W. RabbitQCPlus 2.0: More efficient and versatile quality control for sequencing data. Methods 2023; 216:39-50. [PMID: 37330158 DOI: 10.1016/j.ymeth.2023.06.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/27/2023] [Revised: 05/26/2023] [Accepted: 06/12/2023] [Indexed: 06/19/2023] Open
Abstract
Assessing the quality of sequencing data plays a crucial role in downstream data analysis. However, existing tools often achieve sub-optimal efficiency, especially when dealing with compressed files or performing complicated quality control operations such as over-representation analysis and error correction. We present RabbitQCPlus, an ultra-efficient quality control tool for modern multi-core systems. RabbitQCPlus uses vectorization, memory copy reduction, parallel (de)compression, and optimized data structures to achieve substantial performance gains. It is 1.1 to 5.4 times faster when performing basic quality control operations compared to state-of-the-art applications yet requires fewer compute resources. Moreover, RabbitQCPlus is at least 4 times faster than other applications when processing gzip-compressed FASTQ files and 1.3 times faster with the error correction module turned on. Furthermore, it takes less than 4 minutes to process 280 GB of plain FASTQ sequencing data, while other applications take at least 22 minutes on a 48-core server when enabling the per-read over-representation analysis. C++ sources are available at https://github.com/RabbitBio/RabbitQCPlus.
Affiliation(s)
- Lifeng Yan
  - School of Software, Shandong University, Jinan, China
- Zekun Yin
  - School of Software, Shandong University, Jinan, China
- Hao Zhang
  - School of Software, Shandong University, Jinan, China
- Zhan Zhao
  - School of Software, Shandong University, Jinan, China
- Mingkai Wang
  - School of Software, Shandong University, Jinan, China
- André Müller
  - Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Felix Kallenborn
  - Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Alexander Wichmann
  - Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Yanjie Wei
  - Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Beifang Niu
  - Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
- Bertil Schmidt
  - Institute for Computer Science, Johannes Gutenberg University, Mainz, Germany
- Weiguo Liu
  - School of Software, Shandong University, Jinan, China
7
Soundararajan S, Selvakumar J, Maria Joseph ZM, Gopinath Y, Saravanan V, Santhanam R. Investigating the modulatory effects of Moringa oleifera on the gut microbiota of chicken model through metagenomic approach. Front Vet Sci 2023; 10:1153769. [PMID: 37323848 PMCID: PMC10267347 DOI: 10.3389/fvets.2023.1153769] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 02/17/2023] [Accepted: 04/17/2023] [Indexed: 06/17/2023] Open
Abstract
Introduction: This study aimed to assess the effects of supplementing chicken feed with Moringa oleifera leaf powder, a phytobiotic, on the gastrointestinal microbiota. The objective was to examine the microbial changes induced by the supplementation. Methods: A total of 40 one-day-old chickens were fed their basal diet for 42 days and then divided into two groups: SG1 (basal diet) and SG2 (basal diet + 10 g/kg Moringa oleifera leaf powder). Metagenomics analysis was conducted to analyze operational taxonomic units (OTUs), species annotation, and biodiversity. Additionally, 16S rRNA sequencing was performed for molecular characterization of isolated gut bacteria, identified as Enterococcus faecium. The isolated bacteria were tested for essential metabolites, demonstrating antibacterial, antioxidant, and anticancer activities. Results and discussion: The analysis revealed variations in the microbial composition between the control group (SG1) and the M. oleifera-treated group (SG2). SG2 showed a 47% increase in Bacteroides and a 30% decrease in Firmicutes, Proteobacteria, Actinobacteria, and Tenericutes compared to SG1. TM7 bacteria were observed exclusively in the M. oleifera-treated group. These findings suggest that Moringa oleifera leaf powder acts as a modulator that enhances chicken gut microbiota, promoting the colonization of beneficial bacteria. PICRUSt analysis supported these findings, showing increased carbohydrate and lipid metabolism in the M. oleifera-treated gut microbiota. Conclusion: This study indicates that supplementing chicken feed with Moringa oleifera leaf powder as a phytobiotic enhances the gut microbiota in chicken models, potentially improving overall health. The observed changes in bacterial composition, increased presence of Bacteroides, and exclusive presence of TM7 bacteria suggest a positive modulation of microbial balance. The essential metabolites from isolated Enterococcus faecium bacteria further support the potential benefits of Moringa oleifera supplementation.
Affiliation(s)
- Sowmiya Soundararajan
  - Department of Biotechnology and Bioinformatics, Bishop Heber College (Autonomous), Affiliated With Bharathidasan University, Tiruchirappalli, Tamil Nadu, India
- Jasmine Selvakumar
  - Department of Biotechnology and Bioinformatics, Bishop Heber College (Autonomous), Affiliated With Bharathidasan University, Tiruchirappalli, Tamil Nadu, India
- Zion Mercy Maria Joseph
  - Department of Biotechnology and Bioinformatics, Bishop Heber College (Autonomous), Affiliated With Bharathidasan University, Tiruchirappalli, Tamil Nadu, India
- Yuvapriya Gopinath
  - Department of Biotechnology and Bioinformatics, Bishop Heber College (Autonomous), Affiliated With Bharathidasan University, Tiruchirappalli, Tamil Nadu, India
- Vaishali Saravanan
  - Department of Biotechnology and Bioinformatics, Bishop Heber College (Autonomous), Affiliated With Bharathidasan University, Tiruchirappalli, Tamil Nadu, India
- Rameshkumar Santhanam
  - Faculty of Science and Marine Environment, Universiti Malaysia Terengganu, Kuala Nerus, Terengganu, Malaysia
8
AL-Jumaili AHA, Muniyandi RC, Hasan MK, Paw JKS, Singh MJ. Big Data Analytics Using Cloud Computing Based Frameworks for Power Management Systems: Status, Constraints, and Future Recommendations. SENSORS (BASEL, SWITZERLAND) 2023; 23:2952. [PMID: 36991663 PMCID: PMC10051254 DOI: 10.3390/s23062952] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Revised: 02/03/2023] [Accepted: 02/10/2023] [Indexed: 06/19/2023]
Abstract
Traditional parallel computing for power management systems has prime challenges such as execution time, computational complexity, and efficiency like process time and delays in power system condition monitoring, particularly consumer power consumption, weather data, and power generation for detecting and predicting data mining in the centralized parallel processing and diagnosis. Due to these constraints, data management has become a critical research consideration and bottleneck. To cope with these constraints, cloud computing-based methodologies have been introduced for managing data efficiently in power management systems. This paper reviews the concept of cloud computing architecture that can meet the multi-level real-time requirements to improve monitoring and performance which is designed for different application scenarios for power system monitoring. Then, cloud computing solutions are discussed under the background of big data, and emerging parallel programming models such as Hadoop, Spark, and Storm are briefly described to analyze the advancement, constraints, and innovations. The key performance metrics of cloud computing applications such as core data sampling, modeling, and analyzing the competitiveness of big data was modeled by applying related hypotheses. Finally, it introduces a new design concept with cloud computing and eventually some recommendations focusing on cloud computing infrastructure, and methods for managing real-time big data in the power management system that solve the data mining challenges.
Affiliation(s)
- Ahmed Hadi Ali AL-Jumaili
  - Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia
  - Computer Centre Department, University of Fallujah, Anbar 00964, Iraq
- Ravie Chandren Muniyandi
  - Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia
- Mohammad Kamrul Hasan
  - Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia
- Johnny Koh Siaw Paw
  - Department of Electronic & Communication Engineering, Universiti Tenaga Nasional, Km 7, Jalan Ikram-Uniten, Kajang 43009, Selangor, Malaysia
- Mandeep Jit Singh
  - Department of Electrical, Electronic and System Engineering, Faculty of Engineering and Built Environment, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia
9
Zhang X, Zhao C, Shao M, Chen Y, Liu P, Chen G. The roadmap of bioeconomy in China. ENGINEERING BIOLOGY 2022; 6:71-81. [PMID: 36968339 PMCID: PMC9995158 DOI: 10.1049/enb2.12026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2022] [Accepted: 11/23/2022] [Indexed: 12/05/2022] Open
Abstract
The bioeconomy drives the development of life science and biotechnology as a blueprint for the future development of human society, and offers a cross-cutting perspective on the societal transformation towards long-term sustainability and the transition away from the non-renewable economy. Moreover, the sustainable bioeconomy strategies are consistent with the United Nations' (UN) Sustainable Development Goals (SDGs) and are becoming central to achieving the SDGs. The Chinese '14th Five-Year Plan for Bioeconomy Development' (2021-2025), including the development goals of China's bioeconomy containing biomedicine, agriculture, bio-manufacturing and bio-security as a strategic priority, is discussed. The plan offers three pathways to improve bioeconomy, including technological innovation, industrialisation and policy supports. Finally, it concludes China's first bioeconomy development plan as a success, suggesting the key role of industrial biotechnology in bioeconomy.
Affiliation(s)
- Xu Zhang
  - Key Laboratory of Industrial Biocatalysis, Ministry of Education, Department of Chemical Engineering, Center of Synthetic and Systems Biology, Tsinghua‐Peking Center of Life Sciences, Tsinghua University, Beijing, China
- Cuihuan Zhao
  - Center of Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, China
- Ming‐Wei Shao
  - Center of Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, China
- Yi‐Ling Chen
  - Center of Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, China
- Puyuan Liu
  - Faculty of Science, University of Waterloo, Waterloo, Ontario, Canada
- Guo‐Qiang Chen
  - Key Laboratory of Industrial Biocatalysis, Ministry of Education, Department of Chemical Engineering, Center of Synthetic and Systems Biology, Tsinghua‐Peking Center of Life Sciences, Tsinghua University, Beijing, China
  - Center of Synthetic and Systems Biology, School of Life Sciences, Tsinghua University, Beijing, China
10
Guryleva MV, Penzar DD, Chistyakov DV, Mironov AA, Favorov AV, Sergeeva MG. Investigation of the Role of PUFA Metabolism in Breast Cancer Using a Rank-Based Random Forest Algorithm. Cancers (Basel) 2022; 14:4663. [PMID: 36230586 PMCID: PMC9562210 DOI: 10.3390/cancers14194663] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2022] [Revised: 09/15/2022] [Accepted: 09/21/2022] [Indexed: 11/16/2022] Open
Abstract
Simple Summary: Polyunsaturated fatty acids (PUFAs) and their derivatives, oxylipins, are a constant focus of cancer research due to the relationship between cancer and processes of energy metabolism and inflammation, where a PUFA system is an active player. Only recently have methods been developed that allow for studying such complex systems. Using the Rank-based Random Forest (RF) model, we show that PUFA metabolism genes are critical for the pathogenesis of breast cancer (BC); BC subtypes differ in PUFA metabolism gene expression. The enrichment of BC subtypes with various genes associated with oxylipin signaling pathways indicates a different contribution of these compounds to the biology of subtypes. Abstract: Polyunsaturated fatty acid (PUFA) metabolism is currently a focus in cancer research due to PUFAs functioning as structural components of the membrane matrix, as fuel sources for energy production, and as sources of secondary messengers, so-called oxylipins, important players of inflammatory processes. Although breast cancer (BC) is the leading cause of cancer death among women worldwide, no systematic study of PUFA metabolism as a system of interrelated processes in this disease has been carried out. Here, we implemented a Boruta-based feature selection algorithm to determine the list of most important PUFA metabolism genes altered in breast cancer tissues compared with in normal tissues. A rank-based Random Forest (RF) model was built on the selected gene list (33 genes) and applied to predict the cancer phenotype to ascertain the PUFA genes involved in cancerogenesis. It showed high performance of dichotomic classification (balanced accuracy of 0.94, ROC AUC 0.99). We also retrieved a list of the important PUFA genes (46 genes) that differed between molecular subtypes at the level of breast cancer molecular subtypes. The balanced accuracy of the classification model built on the specified genes was 0.82, while the ROC AUC for the sensitivity analysis was 0.85. Specific patterns of PUFA metabolic changes were obtained for each molecular subtype of breast cancer. These results show evidence that (1) PUFA metabolism genes are critical for the pathogenesis of breast cancer; (2) BC subtypes differ in PUFA metabolism genes expression; and (3) the lists of genes selected in the models are enriched with genes involved in the metabolism of signaling lipids.
Affiliation(s)
- Mariia V. Guryleva
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119234 Moscow, Russia
- Dmitry D. Penzar
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119234 Moscow, Russia
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
- Dmitry V. Chistyakov
  - Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, 119992 Moscow, Russia
  - Correspondence: ; Tel.: +7-495-939-4332
- Andrey A. Mironov
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, 119234 Moscow, Russia
  - Kharkevich Institute of Information Transmission Problems, Russian Academy of Sciences, 127051 Moscow, Russia
- Alexander V. Favorov
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, 119991 Moscow, Russia
  - School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA
- Marina G. Sergeeva
  - Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, 119992 Moscow, Russia
11
Gudur VY, Maheshwari S, Acharyya A, Shafik R. An FPGA Based Energy-Efficient Read Mapper With Parallel Filtering and In-Situ Verification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2697-2711. [PMID: 34415836 DOI: 10.1109/tcbb.2021.3106311] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
In the assembly pipeline of Whole Genome Sequencing (WGS), read mapping is a widely used method to re-assemble the genome. It employs approximate string matching and dynamic programming-based algorithms on a large volume of data and associated structures, making it a computationally intensive process. Currently, the state-of-the-art data centers for genome sequencing incur substantial setup and energy costs for maintaining hardware, data storage and cooling systems. To enable low-cost genomics, we propose an energy-efficient architectural methodology for read mapping using a single system-on-chip (SoC) platform. The proposed methodology is based on the q-gram lemma and designed using a novel architecture for filtering and verification. The filtering algorithm is designed using a parallel sorted q-gram lemma based method for the first time, and it is complemented by an in-situ verification routine using parallel Myers bit-vector algorithm. We have implemented our design on the Zynq Ultrascale+ XCZU9EG MPSoC platform. It is then extensively validated using real genomic data to demonstrate up to 7.8× energy reduction and up to 13.3× less resource utilization when compared with the state-of-the-art software and hardware approaches.
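The filtering idea named in this abstract rests on the q-gram lemma: a read of length m that matches a text window with at most k edits must share at least m − q + 1 − kq of its q-grams with that window. The following Python sketch shows that bound as a plain software filter. It is an illustration under stated assumptions, not the parallel sorted FPGA design of the paper, and the function names are hypothetical.

```python
from collections import Counter


def qgrams(s, q):
    """Return the multiset of all overlapping q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_threshold(m, q, k):
    """q-gram lemma bound for a length-m read with at most k edits."""
    return m - q + 1 - k * q


def passes_filter(read, window, q, k):
    """Return True if the window cannot be ruled out by the q-gram lemma.

    Windows that pass still need verification (for example with an
    edit-distance routine); windows that fail can be skipped safely.
    """
    threshold = qgram_threshold(len(read), q, k)
    if threshold <= 0:
        return True  # the bound gives no pruning power for these parameters
    # Counter intersection keeps the minimum count of each shared q-gram.
    shared = sum((qgrams(read, q) & qgrams(window, q)).values())
    return shared >= threshold
```

With `read = "ACGTACGTAC"`, `q = 3`, and `k = 1`, the threshold is 5 shared q-grams, so a window with a single substitution still passes while an unrelated window is rejected before any dynamic programming is run.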
12
Gudur VY, Maheshwari S, Bhardwaj S, Acharyya A, Shafik R. Hardware-Algorithm Codesign for Fast and Energy Efficient Approximate String Matching on FPGA for Computational Biology. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2022; 2022:87-90. [PMID: 36086088 DOI: 10.1109/embc48229.2022.9870924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Myers bit-vector algorithm for approximate string matching (ASM) is a dynamic programming based approach that takes advantage of bit-parallel operations. It is one of the fastest algorithms to find the edit distance between two strings. In computational biology, ASM is used at various stages of the computational pipeline, including proteomics and genomics. The computationally intensive nature of the underlying algorithms for ASM operating on the large volume of data necessitates the acceleration of these algorithms. In this paper, we propose a novel ASM architecture based on Myers bit-vector algorithm for parallel searching of multiple query patterns in the biological databases. The proposed parallel architecture uses multiple processing engines and hardware/software codesign for an accelerated and energy-efficient design of ASM algorithm on hardware. In comparison with related literature, the proposed design achieves 22× better performance with a demonstrative energy efficiency of ∼500 × 10^9 cell updates per joule.
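To illustrate the bit-parallel idea, here is a minimal software sketch of the Myers bit-vector recurrence for the global edit distance between two strings, in the formulation commonly attributed to Hyyrö. It is an assumption-laden illustration, not the FPGA architecture described above: Python's unbounded integers stand in for a machine word, and the function name `myers_edit_distance` is hypothetical.

```python
def myers_edit_distance(pattern: str, text: str) -> int:
    """Global edit distance via the Myers bit-vector recurrence."""
    m = len(pattern)
    if m == 0:
        return len(text)
    mask = (1 << m) - 1          # keep only the m low bits
    high = 1 << (m - 1)          # bit for the last pattern row
    # peq[c] has bit i set where pattern[i] == c
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)
    pv, mv = mask, 0             # vertical +1 / -1 deltas of the DP column
    score = m                    # D[m][0] = m
    for c in text:
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 deltas
        mh = pv & xh                    # horizontal -1 deltas
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shifting in a 1 encodes the top boundary row D[0][j] = j.
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score


if __name__ == "__main__":
    print(myers_edit_distance("kitten", "sitting"))
```

Each text character updates a whole dynamic-programming column with a constant number of word operations, which is what makes the algorithm attractive for hardware. For approximate search (best match ending anywhere in the text), the usual variant shifts `ph` without setting the low bit.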
|
13
|
John Cremin C, Dash S, Huang X. Big Data: Historic Advances and Emerging Trends in Biomedical Research. Current Research in Biotechnology 2022. [DOI: 10.1016/j.crbiot.2022.02.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022] Open
|
14
|
Zhang F, Petersen M, Johnson L, Hall J, O’Bryant SE. Accelerating Hyperparameter Tuning in Machine Learning for Alzheimer's Disease With High Performance Computing. Front Artif Intell 2021; 4:798962. [PMID: 34957393 PMCID: PMC8692864 DOI: 10.3389/frai.2021.798962] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/20/2021] [Accepted: 11/15/2021] [Indexed: 11/27/2022] Open
Abstract
Driven by massive datasets that comprise biomarkers from both blood and magnetic resonance imaging (MRI), the need for advanced learning algorithms and accelerator architectures, such as GPUs and FPGAs, has increased. Machine learning (ML) methods have delivered remarkable prediction for the early diagnosis of Alzheimer's disease (AD). Although ML has improved the accuracy of AD prediction, the algorithms have grown more complex (for example, through hyperparameter tuning), which in turn increases their computational cost. Thus, accelerating high performance ML for AD is an important research challenge facing these fields. This work reports a multicore high performance support vector machine (SVM) hyperparameter tuning workflow with 100 times repeated 5-fold cross-validation for speeding up ML for AD. For demonstration and evaluation purposes, the high performance hyperparameter tuning model was applied to public MRI data for AD and included demographic factors such as age, sex and education. Results showed that computational efficiency increased by 96%, which helped to shed light on future diagnostic AD biomarker applications. The high performance hyperparameter tuning model can also be applied to other ML algorithms such as random forest, logistic regression, and XGBoost.
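The abstract names "100 times repeated 5-fold cross-validation" without describing the mechanics. The sketch below shows one plausible way to generate such splits using only the standard library; it is an illustration under assumptions, not the authors' workflow. The generator name `repeated_kfold` and the seed handling are hypothetical.

```python
import random


def repeated_kfold(n_samples: int, k: int = 5, repeats: int = 100, seed: int = 0):
    """Yield (repeat, fold, train_indices, test_indices) tuples.

    Each repeat reshuffles the sample indices and splits them into k
    disjoint folds; every fold serves once as the test set.
    """
    rng = random.Random(seed)
    for r in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[f::k] for f in range(k)]
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield r, f, train, test


if __name__ == "__main__":
    splits = list(repeated_kfold(n_samples=20, k=5, repeats=100))
    print(len(splits))  # 100 repeats x 5 folds
```

Every (repeat, fold) pair is independent of the others, so the model fits for one hyperparameter setting can be distributed across cores, which is the kind of parallelism a multicore tuning workflow can exploit.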
Affiliation(s)
- Fan Zhang
  - Institute for Translational Research, University of North Texas Health Science Center, Fort Worth, TX, United States
  - Department of Family Medicine, University of North Texas Health Science Center, Fort Worth, TX, United States
- Melissa Petersen
  - Institute for Translational Research, University of North Texas Health Science Center, Fort Worth, TX, United States
  - Department of Family Medicine, University of North Texas Health Science Center, Fort Worth, TX, United States
- Leigh Johnson
  - Institute for Translational Research, University of North Texas Health Science Center, Fort Worth, TX, United States
  - Department of Pharmacology and Neuroscience, University of North Texas Health Science Center, Fort Worth, TX, United States
- James Hall
  - Institute for Translational Research, University of North Texas Health Science Center, Fort Worth, TX, United States
  - Department of Pharmacology and Neuroscience, University of North Texas Health Science Center, Fort Worth, TX, United States
- Sid E. O’Bryant
  - Institute for Translational Research, University of North Texas Health Science Center, Fort Worth, TX, United States
  - Department of Pharmacology and Neuroscience, University of North Texas Health Science Center, Fort Worth, TX, United States
|
15
|
Tang B, Zhu J, Zhao Z, Lu C, Liu S, Fang S, Zheng L, Zhang N, Chen M, Xu M, Yu R, Ji J. Diagnosis and prognosis models for hepatocellular carcinoma patient's management based on tumor mutation burden. J Adv Res 2021; 33:153-165. [PMID: 34603786 PMCID: PMC8463909 DOI: 10.1016/j.jare.2021.01.018] [Citation(s) in RCA: 54] [Impact Index Per Article: 18.0] [Received: 10/30/2020] [Revised: 01/19/2021] [Accepted: 01/29/2021] [Indexed: 02/07/2023] Open
Abstract
Introduction: The development and prognosis of hepatocellular carcinoma (HCC) involve complex molecular mechanisms, which affect the effectiveness of its treatment strategies. Tumor mutational burden (TMB) is related to the efficacy of immunotherapy, but the prognostic role of TMB-related genes in HCC has not yet been clearly determined. Objectives: In this study, we identified TMB-specific genes with good prognostic value to build diagnostic and prognostic models and provide guidance for the treatment of HCC patients. Methods: Weighted gene co-expression network analysis (WGCNA) was applied to identify the TMB-specific genes. The LASSO method and Cox regression were used to establish the prognostic model. Results: The prognostic model based on SMG5 and MRPL9 showed that patients with higher prognostic risk had a markedly poorer survival probability than their counterparts with lower prognostic risk in both a TCGA cohort (P < 0.001, HR = 1.93) and an ICGC cohort (P < 0.001, HR = 3.58). In addition, higher infiltrating fractions of memory B cells, M0 macrophages, neutrophils, activated memory CD4+ T cells, follicular helper T cells and regulatory T cells and higher expression of B7H3, CTLA4, PD1, and TIM3 were present in the high-risk group than in the low-risk group (P < 0.05). Patients with high prognostic risk had higher resistance to some chemotherapy and targeted drugs, such as methotrexate, vinblastine and erlotinib, than those with low prognostic risk (P < 0.05). A diagnostic model based on the two genes was able to accurately distinguish patients with HCC from normal samples and those with dysplastic nodules. In addition, knockdown of SMG5 and MRPL9 significantly inhibited cell proliferation and migration in HCC. Conclusion: Our study helps to select patients suitable for chemotherapy, targeted drugs and immunotherapy and provides new ideas for developing treatment strategies to improve disease outcomes in HCC patients.
Affiliation(s)
- Bufu Tang
  - Key Laboratory of Imaging Diagnosis and Minimally Invasive Intervention Research, Lishui Hospital, School of Medicine, Zhejiang University, Lishui 323000, China; Department of Radiology, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
- Jinyu Zhu
  - Key Laboratory of Imaging Diagnosis and Minimally Invasive Intervention Research, Lishui Hospital, School of Medicine, Zhejiang University, Lishui 323000, China; Department of Radiology, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
- Zhongwei Zhao
  - Key Laboratory of Imaging Diagnosis and Minimally Invasive Intervention Research, Lishui Hospital, School of Medicine, Zhejiang University, Lishui 323000, China; Department of Radiology, the Fifth Affiliated Hospital of Wenzhou Medical University, Lishui 323000, China
- Chenying Lu
  - Key Laboratory of Imaging Diagnosis and Minimally Invasive Intervention Research, Lishui Hospital, School of Medicine, Zhejiang University, Lishui 323000, China; Department of Radiology, the Fifth Affiliated Hospital of Wenzhou Medical University, Lishui 323000, China
- Siyu Liu
  - Department of Laboratory, the Fifth Affiliated Hospital of Wenzhou Medical University, Lishui 323000, China
- Shiji Fang
  - Department of Radiology, the Fifth Affiliated Hospital of Wenzhou Medical University, Lishui 323000, China
- Liyun Zheng
  - Department of Radiology, the Fifth Affiliated Hospital of Wenzhou Medical University, Lishui 323000, China
- Nannan Zhang
  - Key Laboratory of Imaging Diagnosis and Minimally Invasive Intervention Research, Lishui Hospital, School of Medicine, Zhejiang University, Lishui 323000, China; Department of Radiology, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
- Minjiang Chen
  - Key Laboratory of Imaging Diagnosis and Minimally Invasive Intervention Research, Lishui Hospital, School of Medicine, Zhejiang University, Lishui 323000, China; Department of Radiology, the Fifth Affiliated Hospital of Wenzhou Medical University, Lishui 323000, China
- Min Xu
  - Key Laboratory of Imaging Diagnosis and Minimally Invasive Intervention Research, Lishui Hospital, School of Medicine, Zhejiang University, Lishui 323000, China; Department of Radiology, the Fifth Affiliated Hospital of Wenzhou Medical University, Lishui 323000, China
- Risheng Yu
  - Department of Radiology, Second Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
- Jiansong Ji
  - Key Laboratory of Imaging Diagnosis and Minimally Invasive Intervention Research, Lishui Hospital, School of Medicine, Zhejiang University, Lishui 323000, China; Department of Radiology, the Fifth Affiliated Hospital of Wenzhou Medical University, Lishui 323000, China
|
16
|
Gene Expression Analysis through Parallel Non-Negative Matrix Factorization. Computation 2021. [DOI: 10.3390/computation9100106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Gene expression analysis is a principal tool for explaining the behavior of genes in an organism exposed to different experimental conditions. Many clustering algorithms have been proposed in the state of the art. However, the amount of biological data is overwhelming, and its high-dimensional structure exceeds the capacity of most current computational architectures. Computational time and memory consumption therefore become decisive factors in choosing a clustering algorithm. We propose a clustering algorithm based on Non-negative Matrix Factorization and K-means that reduces data dimensionality while preserving the biological context and prioritizing gene selection; it is implemented in parallel GPU-based environments through the CUDA library. A well-known dataset is used in our tests, and the quality of the results is measured through the Rand and Accuracy Index. The results show a speedup of 6.22× compared to the sequential version. The algorithm is competitive in the analysis of biological datasets and is invariant with respect to the number of classes and the size of the gene expression matrix.
|
17
|
GAMUT: A genomics big data management tool. J Biosci 2021. [DOI: 10.1007/s12038-021-00213-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
|
18
|
Jarrige M, Polvèche H, Carteron A, Janczarski S, Peschanski M, Auboeuf D, Martinat C. SISTEMA: A large and standardized collection of transcriptome data sets for human pluripotent stem cell research. iScience 2021; 24:102767. [PMID: 34278269 PMCID: PMC8271161 DOI: 10.1016/j.isci.2021.102767] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/21/2020] [Revised: 03/29/2021] [Accepted: 06/21/2021] [Indexed: 12/16/2022] Open
Abstract
Human pluripotent stem cells have ushered in an exciting new era for disease modeling, drug discovery, and cell therapy development. Continued progress toward realizing the potential of human pluripotent stem cells will be facilitated by robust data sets and complementary resources that are easily accessed and interrogated by the stem cell community. In this context, we present SISTEMA, a quality-controlled curated gene expression database, built on a valuable catalog of human pluripotent stem cell lines, and their derivatives for which transcriptomic analyses have been generated using a single experimental pipeline. SISTEMA functions as a one-step resource that will assist the stem cell community to easily evaluate the expression level for genes of interest, while comparing them across different hPSC lines, cell types, pathological conditions, or after pharmacological treatments.
Affiliation(s)
- Stéphane Janczarski
  - LBMC, Univ Lyon, ENS de Lyon, Univ Claude Bernard, CNRS UMR 5239, INSERM U1210, 46 Allée d'Italie Site Jacques Monod, 69007 Lyon, France
- Didier Auboeuf
  - LBMC, Univ Lyon, ENS de Lyon, Univ Claude Bernard, CNRS UMR 5239, INSERM U1210, 46 Allée d'Italie Site Jacques Monod, 69007 Lyon, France
- Cécile Martinat
  - INSERM/UEVE UMR 861, Paris Saclay Univ I-STEM, 91100 Corbeil-Essonnes, France
|
19
|
Alam I, Kamau AA, Ngugi DK, Gojobori T, Duarte CM, Bajic VB. KAUST Metagenomic Analysis Platform (KMAP), enabling access to massive analytics of re-annotated metagenomic data. Sci Rep 2021; 11:11511. [PMID: 34075103 PMCID: PMC8169707 DOI: 10.1038/s41598-021-90799-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2020] [Accepted: 05/18/2021] [Indexed: 11/09/2022] Open
Abstract
The exponential rise of metagenomics sequencing is delivering massive functional environmental genomics data. However, this also generates a procedural bottleneck for ongoing re-analysis: as reference databases grow and methods improve, analyses need to be updated for consistency, which requires access to increasingly demanding bioinformatic and computational resources. Here, we present the KAUST Metagenomic Analysis Platform (KMAP), a new integrated open web-based tool for the comprehensive exploration of shotgun metagenomic data. We illustrate the capacities KMAP provides through the re-assembly of ~27,000 public metagenomic samples captured in ~450 studies sampled across ~77 diverse habitats. A small subset of these metagenomic assemblies is used in this pilot study, grouped into 36 new habitat-specific gene catalogs, all based on full-length (complete) genes. Extensive taxonomic and gene annotations are stored in Gene Information Tables (GITs), a simple tractable data integration format useful for analysis through the command line or for database management. The KMAP pilot study provides the exploration and comparison of microbial GITs across different habitats with over 275 million genes. KMAP access to data and analyses is available at https://www.cbrc.kaust.edu.sa/aamg/kmap.start.
Affiliation(s)
- Intikhab Alam
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Allan Anthony Kamau
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- David Kamanda Ngugi
  - Leibniz Institute DSMZ-German Collection of Microorganisms and Cell Cultures, Inhoffenstraße 7B, 38124, Brunswick, Germany
- Takashi Gojobori
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Carlos M Duarte
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia; Red Sea Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Vladimir B Bajic
  - Computational Bioscience Research Center, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
|
20
|
Banegas-Luna AJ, Peña-García J, Iftene A, Guadagni F, Ferroni P, Scarpato N, Zanzotto FM, Bueno-Crespo A, Pérez-Sánchez H. Towards the Interpretability of Machine Learning Predictions for Medical Applications Targeting Personalised Therapies: A Cancer Case Survey. Int J Mol Sci 2021; 22:4394. [PMID: 33922356 PMCID: PMC8122817 DOI: 10.3390/ijms22094394] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 03/30/2021] [Revised: 04/16/2021] [Accepted: 04/20/2021] [Indexed: 12/18/2022] Open
Abstract
Artificial Intelligence is providing astonishing results, with medicine being one of its favourite playgrounds. Machine Learning and, in particular, Deep Neural Networks are behind this revolution. Among the most challenging targets of interest in medicine are cancer diagnosis and therapies but, to start this revolution, software tools need to be adapted to cover the new requirements. In this sense, learning tools are becoming a commodity but, to be able to assist doctors on a daily basis, it is essential to fully understand how models can be interpreted. In this survey, we analyse current machine learning models and other in-silico tools as applied to medicine (specifically, to cancer research), and we discuss their interpretability, performance and the input data they are fed with. Artificial neural networks (ANN), logistic regression (LR) and support vector machines (SVM) have been observed to be the preferred models. In addition, convolutional neural networks (CNNs), supported by the rapid development of graphic processing units (GPUs) and high-performance computing (HPC) infrastructures, are gaining importance when image processing is feasible. However, the interpretability of machine learning predictions, which would allow doctors to understand them, trust them and gain useful insights for clinical practice, is still rarely considered. This factor needs to be improved to enhance doctors' predictive capacity and achieve individualised therapies in the near future.
Affiliation(s)
- Antonio Jesús Banegas-Luna
  - Structural Bioinformatics and High-Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), 30107 Murcia, Spain
- Jorge Peña-García
  - Structural Bioinformatics and High-Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), 30107 Murcia, Spain
- Adrian Iftene
  - Faculty of Computer Science, Universitatea Alexandru Ioan Cuza (UAIC), 700505 Jashi, Romania
- Fiorella Guadagni
  - Interinstitutional Multidisciplinary Biobank (BioBIM), IRCCS San Raffaele Roma, 00166 Rome, Italy
  - Department of Human Sciences and Promotion of the Quality of Life, San Raffaele Roma Open University, 00166 Rome, Italy
- Patrizia Ferroni
  - Interinstitutional Multidisciplinary Biobank (BioBIM), IRCCS San Raffaele Roma, 00166 Rome, Italy
  - Department of Human Sciences and Promotion of the Quality of Life, San Raffaele Roma Open University, 00166 Rome, Italy
- Noemi Scarpato
  - Department of Human Sciences and Promotion of the Quality of Life, San Raffaele Roma Open University, 00166 Rome, Italy
- Fabio Massimo Zanzotto
  - Dipartimento di Ingegneria dell’Impresa “Mario Lucertini”, University of Rome Tor Vergata, 00133 Rome, Italy
- Andrés Bueno-Crespo
  - Structural Bioinformatics and High-Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), 30107 Murcia, Spain
- Horacio Pérez-Sánchez
  - Structural Bioinformatics and High-Performance Computing Research Group (BIO-HPC), Universidad Católica de Murcia (UCAM), 30107 Murcia, Spain
|
21
|
Li LR, Du B, Liu HQ, Chen C. Artificial Intelligence for Personalized Medicine in Thyroid Cancer: Current Status and Future Perspectives. Front Oncol 2021; 10:604051. [PMID: 33634025 PMCID: PMC7899964 DOI: 10.3389/fonc.2020.604051] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 09/08/2020] [Accepted: 12/21/2020] [Indexed: 12/12/2022] Open
Abstract
Thyroid cancers (TC) have increasingly been detected following advances in diagnostic methods. Risk stratification guided by refined information becomes a crucial step toward the goal of personalized medicine. The diagnosis of TC mainly relies on imaging analysis, but visual examination may not reveal much information and may not enable comprehensive analysis. Artificial intelligence (AI) is a technology used to extract and quantify key image information by simulating complex human functions. This latent, precise information helps stratify TC by risk and drives tailored management to shift from the surface (population-based) to a point (individual-based). In this review, we start with several challenges regarding personalized care in TC, for example, inconsistent rating ability of ultrasound physicians, uncertainty in cytopathological diagnosis, difficulty in discriminating follicular neoplasms, and inaccurate prognostication. We then analyze and summarize the advances of AI in extracting and analyzing morphological, textural, and molecular features to reveal the ground truth of TC. Consequently, their combination with AI technology will make individual medical strategies possible.
Affiliation(s)
- Ling-Rui Li
  - Department of Breast and Thyroid Surgery, Renmin Hospital of Wuhan University, Wuhan, China
- Bo Du
  - School of Computer Science, Wuhan University, Wuhan, China; Institute of Artificial Intelligence, Wuhan University, Wuhan, China
- Han-Qing Liu
  - Department of Breast and Thyroid Surgery, Renmin Hospital of Wuhan University, Wuhan, China
- Chuang Chen
  - Department of Breast and Thyroid Surgery, Renmin Hospital of Wuhan University, Wuhan, China
|
22
|
Edgar R. Syncmers are more sensitive than minimizers for selecting conserved k‑mers in biological sequences. PeerJ 2021; 9:e10805. [PMID: 33604186 PMCID: PMC7869670 DOI: 10.7717/peerj.10805] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 10/15/2020] [Accepted: 12/30/2020] [Indexed: 12/19/2022] Open
Abstract
Minimizers are widely used to select subsets of fixed-length substrings (k-mers) from biological sequences in applications ranging from read mapping to taxonomy prediction and indexing of large datasets. The minimizer of a string of w consecutive k-mers is the k-mer with smallest value according to an ordering of all k-mers. Syncmers are defined here as a family of alternative methods which select k-mers by inspecting the position of the smallest-valued substring of length s < k within the k-mer. For example, a closed syncmer is selected if its smallest s-mer is at the start or end of the k-mer. At least one closed syncmer must be found in every window of length (k - s) k-mers. Unlike a minimizer, a syncmer is identified by its sequence alone, and is therefore synchronized in the following sense: if a given k-mer is selected from one sequence, it will also be selected from any other sequence. Also, minimizers can be deleted by mutations in flanking sequence, which cannot happen with syncmers. Experiments on minimizers with parameters used in the minimap2 read mapper and Kraken taxonomy prediction algorithm respectively show that syncmers can simultaneously achieve both lower density and higher conservation compared to minimizers.
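The closed-syncmer rule described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the reference implementation: s-mers are ordered lexicographically (practical tools typically order them by a hash), ties are broken by taking the leftmost minimum, and the function names `is_closed_syncmer` and `closed_syncmers` are hypothetical.

```python
def is_closed_syncmer(kmer: str, s: int) -> bool:
    """True if the smallest s-mer of kmer sits at its first or last position."""
    k = len(kmer)
    smers = [kmer[i:i + s] for i in range(k - s + 1)]
    pos = smers.index(min(smers))  # leftmost minimum under lexicographic order
    return pos == 0 or pos == k - s


def closed_syncmers(seq: str, k: int, s: int):
    """Return (position, k-mer) pairs for every closed syncmer in seq."""
    return [
        (i, seq[i:i + k])
        for i in range(len(seq) - k + 1)
        if is_closed_syncmer(seq[i:i + k], s)
    ]


if __name__ == "__main__":
    # Three consecutive 5-mers (k - s = 3): GTACG, TACGT, ACGTT.
    print(closed_syncmers("GTACGTT", k=5, s=2))  # [(2, 'ACGTT')]
```

Because the decision depends only on the k-mer's own sequence, the same k-mer is selected wherever it occurs, which is the synchronization property the abstract contrasts with minimizers.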
|
23
|
Schmidt B, Hildebrandt A. Deep learning in next-generation sequencing. Drug Discov Today 2021; 26:173-180. [PMID: 33059075 PMCID: PMC7550123 DOI: 10.1016/j.drudis.2020.10.002] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 03/20/2020] [Revised: 09/16/2020] [Accepted: 10/07/2020] [Indexed: 12/22/2022]
Abstract
Next-generation sequencing (NGS) methods lie at the heart of large parts of biological and medical research. Their fundamental importance has created a continuously increasing demand for processing and analysis methods of the data sets produced, addressing questions such as variant calling, metagenomic classification and quantification, genomic feature detection, or downstream analysis in larger biological or medical contexts. In addition to classical algorithmic approaches, machine-learning (ML) techniques are often used for such tasks. In particular, deep learning (DL) methods that use multilayered artificial neural networks (ANNs) for supervised, semisupervised, and unsupervised learning have gained significant traction for such applications. Here, we highlight important network architectures, application areas, and DL frameworks in an NGS context.
Affiliation(s)
- Bertil Schmidt
  - Institut für Informatik, Johannes Gutenberg University Mainz, Germany
|
24
|
Improving read alignment through the generation of alternative reference via iterative strategy. Sci Rep 2020; 10:18712. [PMID: 33127969 PMCID: PMC7599232 DOI: 10.1038/s41598-020-74526-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/30/2019] [Accepted: 09/30/2020] [Indexed: 11/08/2022] Open
Abstract
There is generally one standard reference sequence for each species. When extensive variations exist in other breeds of the species, this can lead to ambiguous alignment and inaccurate variant calling and, in turn, compromise the accuracy of downstream analysis. Here, with the help of the FPGA hardware platform, we present a method that generates an alternative reference via an iterative strategy to improve the read alignment for breeds that are genetically distant from the reference breed. Compared to the published reference genomes, the alternative reference sequences we built improved the mapping rates of Chinese indigenous pigs and chickens by 0.61-1.68% and 0.09-0.45%, respectively. These sequences also enable researchers to recover highly variable regions that could be missed using public reference sequences. We also determined that the optimal numbers of iterations needed to generate alternative reference sequences were seven and five for pigs and chickens, respectively. Our results show that, for genetically distant breeds, generating an alternative reference sequence can facilitate read alignment and variant calling and improve the accuracy of downstream analyses.
|
25
|
Abuín JM, Lopes N, Ferreira L, Pena TF, Schmidt B. Big Data in metagenomics: Apache Spark vs MPI. PLoS One 2020; 15:e0239741. [PMID: 33022000 PMCID: PMC7537910 DOI: 10.1371/journal.pone.0239741] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/06/2020] [Accepted: 09/14/2020] [Indexed: 11/23/2022] Open
Abstract
The progress of next-generation sequencing has led to the availability of massive data sets used by a wide range of applications in biology and medicine. This has sparked significant interest in using modern Big Data technologies to process this large amount of information in distributed memory clusters of commodity hardware. Several approaches based on solutions such as Apache Hadoop or Apache Spark have been proposed. These solutions allow developers to focus on the problem while ignoring low-level details such as data distribution schemes or communication patterns among processing nodes. However, performance and scalability are also of high importance when dealing with increasing problem sizes, which makes High Performance Computing (HPC) technologies such as the message passing interface (MPI) a promising alternative. Recently, MetaCacheSpark, an Apache Spark based software for detection and quantification of species composition in food samples, has been proposed. This tool can be used to analyze high throughput sequencing data sets of metagenomic DNA and allows for dealing with large-scale collections of complex eukaryotic and bacterial reference genomes. In this work, we propose MetaCache-MPI, a fast and memory efficient solution for computing clusters which is based on MPI instead of Apache Spark. To evaluate its performance, we compare the original single CPU version of MetaCache, the Spark version and the MPI version we are introducing. Results show that for 32 processes, MetaCache-MPI is 1.65× faster while consuming 48.12% of the RAM used by Spark for building a metagenomics database. For querying this database, also with 32 processes, the MPI version is 3.11× faster, while using 55.56% of the memory used by Spark.
We conclude that the new MetaCache-MPI version is faster in both building and querying the database and uses less RAM than MetaCacheSpark, while keeping the accuracy of the original implementation.
Affiliation(s)
- José M. Abuín
  - 2Ai—School of Technology, IPCA, Barcelos, Portugal
  - CiTIUS, Universidade de Santiago de Compostela, Santiago de Compostela, Spain
- Nuno Lopes
  - 2Ai—School of Technology, IPCA, Barcelos, Portugal
- Tomás F. Pena
  - CiTIUS, Universidade de Santiago de Compostela, Santiago de Compostela, Spain
- Bertil Schmidt
  - Department of Computer Science, Johannes Gutenberg University, Mainz, Germany
|
26
|
Xiong Y, Yang G, Wang K, Riaz M, Xu J, Lv Z, Zhou H, Li Q, Li W, Sun J, Tao T, Li J. Genome-Wide Transcriptional Analysis Reveals Alternative Splicing Event Profiles in Hepatocellular Carcinoma and Their Prognostic Significance. Front Genet 2020; 11:879. [PMID: 32849842 PMCID: PMC7432180 DOI: 10.3389/fgene.2020.00879] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 02/26/2020] [Accepted: 07/17/2020] [Indexed: 12/11/2022] Open
Abstract
Accumulating evidence indicates an unexpected role of aberrant splicing in hepatocellular carcinoma (HCC) that has been seriously neglected in previous studies. There is a need for a detailed analysis of alternative splicing (AS) and its underlying biological and clinical relevance in HCC. In this study, clinical information and corresponding RNA sequencing data of HCC patients were obtained from The Cancer Genome Atlas. Percent spliced in (PSI) values and transcriptional splicing patterns of genes were determined from the original RNA sequencing data using SpliceSeq. Then, based on the PSI values of AS events in different patients, a series of bioinformatics methods was used to identify differentially expressed AS events (DEAS), determine potential regulatory relationships, and investigate the correlation between DEAS and the patients' clinicopathological features. Finally, 25,934 AS events originating from 8,795 genes were screened with high reliability; 263 of these AS events were identified as DEAS. The parent genes of these DEAS formed an intricate network with roles in the regulation of cancer-related pathways and liver metabolism. In HCC, 36 splicing factors were involved in the dysregulation of some of the DEAS, 100 DEAS events were correlated with overall survival, and 71 DEAS events were correlated with disease-free survival. Stratifying HCC patients according to DEAS resulted in four clusters with different survival patterns. Significant variations in AS occurred during HCC initiation and maintenance; these are likely to be vital both for biological processes and in prognosis. The HCC-related AS events identified here and the splicing networks constructed will be valuable in deciphering the underlying role of AS in HCC.
Affiliation(s)
- Yongfu Xiong
  - Department of Hepatobiliary Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
  - North Sichuan Medical College, Institute of Hepato-Biliary-Pancreatic-Intestinal Disease, Nanchong, China
- Gang Yang
  - Department of Hepatobiliary Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Kang Wang
  - Department of Breast Surgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Muhammad Riaz
  - Department of Hepatobiliary Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Jian Xu
  - Department of Hepatobiliary Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Zhenbing Lv
  - Department of Gastrointestinal Surgery, Nanchong Central Hospital, Nanchong, China
- He Zhou
  - Department of Gastrointestinal Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Qiang Li
  - Department of Hepatobiliary Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Weinan Li
  - Department of Hepatobiliary Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Ji Sun
  - Department of Hepatobiliary Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Tang Tao
  - Department of Hepatobiliary Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
- Jingdong Li
  - Department of Hepatobiliary Surgery, Affiliated Hospital of North Sichuan Medical College, Nanchong, China
  - North Sichuan Medical College, Institute of Hepato-Biliary-Pancreatic-Intestinal Disease, Nanchong, China
|
27
|
Canakoglu A, Pinoli P, Gulino A, Nanni L, Masseroli M, Ceri S. Federated sharing and processing of genomic datasets for tertiary data analysis. Brief Bioinform 2020; 22:5868062. [PMID: 34020536 DOI: 10.1093/bib/bbaa091] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/06/2020] [Revised: 04/05/2020] [Accepted: 04/27/2020] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: With the spread of biological and clinical uses of next-generation sequencing (NGS) data, many laboratories and health organizations need to share NGS data resources and to easily access and process comprehensively shared genomic data; in most cases, primary and secondary data management of NGS data is done at sequencing stations, and sharing applies to processed data. Based on the previous single-instance GMQL system architecture, here we review the model, language and architectural extensions that make the GMQL centralized system innovatively open to federated computing. RESULTS: A well-designed extension of a centralized system architecture to support federated data sharing and query processing. Data is federated thanks to simple data sharing instructions. Queries are assigned to execution nodes; they are translated into an intermediate representation, whose computation drives data and processing distributions. The approach allows writing federated applications according to classical styles: centralized, distributed or externalized. AVAILABILITY: The federated genomic data management system is freely available for non-commercial use as an open source project at http://www.bioinformatics.deib.polimi.it/FederatedGMQLsystem/. CONTACT: {arif.canakoglu, pietro.pinoli}@polimi.it.
Affiliation(s)
- Andrea Gulino
  - Computer Science and Engineering at Politecnico di Milano
- Luca Nanni
  - Computer Science and Engineering at Politecnico di Milano
|
28
|
Herruzo JM, Gonzalez-Navarro S, Ibanez-Marin P, Vinals-Yufera V, Alastruey-Benede J, Plata O. Accelerating Sequence Alignments Based on FM-Index Using the Intel KNL Processor. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1093-1104. [PMID: 30530369 DOI: 10.1109/tcbb.2018.2884701] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
FM-index is a compact data structure suitable for fast matches of short reads to large reference genomes. The matching algorithm using this index exhibits irregular memory access patterns that cause frequent cache misses, resulting in a memory bound problem. This paper analyzes different FM-index versions presented in the literature, focusing on those computing aspects related to the data access. As a result of the analysis, we propose a new organization of FM-index that minimizes the demand for memory bandwidth, allowing a great improvement of performance on processors with high-bandwidth memory, such as the second-generation Intel Xeon Phi (Knights Landing, or KNL), integrating ultra high-bandwidth stacked memory technology. As the roofline model shows, our implementation reaches 95 percent of the peak random access bandwidth limit when executed on the KNL and almost all of the available bandwidth when executed on other Intel Xeon architectures with conventional DDR memory. In addition, the obtained throughput in KNL is much higher than the results reported for GPUs in the literature.
|
29
|
Wang L, Alexander CA. Big data analytics in medical engineering and healthcare: methods, advances and challenges. J Med Eng Technol 2020; 44:267-283. [PMID: 32498594 DOI: 10.1080/03091902.2020.1769758] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 01/11/2023]
Abstract
Big data analytics are gaining popularity in medical engineering and healthcare use cases. Stakeholders are finding big data analytics reduce medical costs and personalise medical services for each individual patient. Big data analytics can be used in large-scale genetics studies, public health, personalised and precision medicine, new drug development, etc. The introduction of the types, sources, and features of big data in healthcare as well as the applications and benefits of big data and big data analytics in healthcare is key to understanding healthcare big data and will be discussed in this article. Major methods, platforms and tools of big data analytics in medical engineering and healthcare are also presented. Advances and technology progress of big data analytics in healthcare are introduced, which includes artificial intelligence (AI) with big data, infrastructure and cloud computing, advanced computation and data processing, privacy and cybersecurity, health economic outcomes and technology management, and smart healthcare with sensing, wearable devices and Internet of things (IoT). Current challenges of dealing with big data and big data analytics in medical engineering and healthcare as well as future work are also presented.
Affiliation(s)
- Lidong Wang
  - Institute for Systems Engineering Research, Mississippi State University, Vicksburg, MS, USA
|
30
|
Lichman BR, Godden GT, Buell CR. Gene and genome duplications in the evolution of chemodiversity: perspectives from studies of Lamiaceae. Current Opinion in Plant Biology 2020; 55:74-83. [PMID: 32344371 DOI: 10.1016/j.pbi.2020.03.005] [Citation(s) in RCA: 34] [Impact Index Per Article: 8.5] [Received: 11/10/2019] [Revised: 02/19/2020] [Accepted: 03/04/2020] [Indexed: 05/28/2023]
Abstract
Plants are reservoirs of extreme chemical diversity, yet biosynthetic pathways remain underexplored in the majority of taxa. Access to improved, inexpensive genomic and computational technologies has recently enhanced our understanding of plant specialized metabolism at the biochemical and evolutionary levels including the elucidation of pathways leading to key metabolites. Furthermore, these approaches have provided insights into the mechanisms of chemical evolution, including neofunctionalization and subfunctionalization, structural variation, and modulation of gene expression. The broader utilization of genomic tools across the plant tree of life, and an expansion of genomic resources from multiple accessions within species or populations, will improve our overall understanding of chemodiversity. These data and knowledge will also lead to greater insight into the selective pressures contributing to and maintaining this diversity, which in turn will enable the development of more accurate predictive models of specialized metabolism in plants.
Affiliation(s)
- Benjamin R Lichman
  - Centre for Novel Agricultural Products, Department of Biology, University of York, York YO10 5DD, UK
- Grant T Godden
  - Florida Museum of Natural History, University of Florida, Gainesville, FL 32611, USA
- Carol Robin Buell
  - Department of Plant Biology, Michigan State University, 612 Wilson Road, East Lansing, MI 48824, USA
  - Plant Resilience Institute, Michigan State University, 612 Wilson Road, East Lansing, MI 48824, USA
  - MSU AgBioResearch, Michigan State University, 446 West Circle Drive, East Lansing, MI 48824, USA
|
31
|
Kobus R, Abuín JM, Müller A, Hellmann SL, Pichel JC, Pena TF, Hildebrandt A, Hankeln T, Schmidt B. A big data approach to metagenomics for all-food-sequencing. BMC Bioinformatics 2020; 21:102. [PMID: 32164527 PMCID: PMC7069206 DOI: 10.1186/s12859-020-3429-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 08/19/2019] [Accepted: 02/24/2020] [Indexed: 01/31/2023] Open
Abstract
BACKGROUND: All-Food-Sequencing (AFS) is an untargeted metagenomic sequencing method that allows for the detection and quantification of food ingredients including animals, plants, and microbiota. While this approach avoids some of the shortcomings of targeted PCR-based methods, it requires the comparison of sequence reads to large collections of reference genomes. The steadily increasing amount of available reference genomes establishes the need for efficient big data approaches. RESULTS: We introduce an alignment-free k-mer based method for detection and quantification of species composition in food and other complex biological matters. It is orders-of-magnitude faster than our previous alignment-based AFS pipeline. In comparison to the established tools CLARK, Kraken2, and Kraken2+Bracken it is superior in terms of false-positive rate and quantification accuracy. Furthermore, the usage of an efficient database partitioning scheme allows for the processing of massive collections of reference genomes with reduced memory requirements on a workstation (AFS-MetaCache) or on a Spark-based compute cluster (MetaCacheSpark). CONCLUSIONS: We present a fast yet accurate screening method for whole genome shotgun sequencing-based biosurveillance applications such as food testing. By relying on a big data approach it can scale efficiently towards large-scale collections of complex eukaryotic and bacterial reference genomes. AFS-MetaCache and MetaCacheSpark are suitable tools for broad-scale metagenomic screening applications. They are available at https://muellan.github.io/metacache/afs.html (C++ version for a workstation) and https://github.com/jmabuin/MetaCacheSpark (Spark version for big data clusters).
Affiliation(s)
- Robin Kobus
  - Department of Computer Science, Johannes Gutenberg University, Mainz, 55099 Germany
- José M. Abuín
  - IPCA, Polytechnic Institute of Cávado and Ave, Barcelos, 4750-810 Portugal
  - CiTIUS, Universidade de Santiago de Compostela, Santiago de Compostela, 15782 Spain
- André Müller
  - Department of Computer Science, Johannes Gutenberg University, Mainz, 55099 Germany
- Sören Lukas Hellmann
  - Molecular Genetics and Genome Analysis, Institute of Organismal and Molecular Evolution, Johannes Gutenberg University, Mainz, 55099 Germany
- Juan C. Pichel
  - CiTIUS, Universidade de Santiago de Compostela, Santiago de Compostela, 15782 Spain
- Tomás F. Pena
  - CiTIUS, Universidade de Santiago de Compostela, Santiago de Compostela, 15782 Spain
- Andreas Hildebrandt
  - Department of Computer Science, Johannes Gutenberg University, Mainz, 55099 Germany
- Thomas Hankeln
  - Molecular Genetics and Genome Analysis, Institute of Organismal and Molecular Evolution, Johannes Gutenberg University, Mainz, 55099 Germany
- Bertil Schmidt
  - Department of Computer Science, Johannes Gutenberg University, Mainz, 55099 Germany
|
32
|
Fostier J. BLAMM: BLAS-based algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs. BMC Bioinformatics 2020; 21:81. [PMID: 32164557 PMCID: PMC7068855 DOI: 10.1186/s12859-020-3348-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND: The identification of all matches of a large set of position weight matrices (PWMs) in long DNA sequences requires significant computational resources for which a number of efficient yet complex algorithms have been proposed. RESULTS: We propose BLAMM, a simple and efficient tool inspired by high performance computing techniques. The workload is expressed in terms of matrix-matrix products that are evaluated with high efficiency using optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs and has a runtime that is independent of the selected p-value. In terms of single-core performance, it is competitive with state-of-the-art software for PWM matching while being much more efficient when using multithreading. Additionally, BLAMM requires negligible memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 13 min with a p-value of 10⁻⁴ using a 36-core machine. On a dual GPU system, the same task can be performed in under 5 min. CONCLUSIONS: BLAMM is an efficient tool for identifying PWM matches in large DNA sequences. Its C++ source code is available under the GNU General Public License Version 3 at https://github.com/biointec/blamm.
Affiliation(s)
- Jan Fostier
  - Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, Ghent (Zwijnaarde), B-9052, Belgium
|
33
|
Chowdhury HA, Bhattacharyya DK, Kalita JK. Differential Expression Analysis of RNA-seq Reads: Overview, Taxonomy, and Tools. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:566-586. [PMID: 30281477 DOI: 10.1109/tcbb.2018.2873010] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 06/08/2023]
Abstract
Analysis of RNA-sequence (RNA-seq) data is widely used in transcriptomic studies and it has many applications. We review RNA-seq data analysis from RNA-seq reads to the results of differential expression analysis. In addition, we perform a descriptive comparison of tools used in each step of RNA-seq data analysis along with a discussion of important characteristics of these tools. A taxonomy of tools is also provided. A discussion of issues in quality control and visualization of RNA-seq data is also included along with useful tools. Finally, we provide some guidelines for the RNA-seq data analyst, along with research issues and challenges which should be addressed.
|
34
|
Yun Y, Hong SA, Kim KK, Baek D, Lee D, Londhe AM, Lee M, Yu J, McEachin ZT, Bassell GJ, Bowser R, Hales CM, Cho SR, Kim J, Pae AN, Cheong E, Kim S, Boulis NM, Bae S, Ha Y. CRISPR-mediated gene correction links the ATP7A M1311V mutations with amyotrophic lateral sclerosis pathogenesis in one individual. Commun Biol 2020; 3:33. [PMID: 31959876 PMCID: PMC6970999 DOI: 10.1038/s42003-020-0755-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/01/2019] [Accepted: 12/17/2019] [Indexed: 12/11/2022] Open
Abstract
Amyotrophic lateral sclerosis (ALS) is a severe disease causing motor neuron death, but a complete cure has not been developed and related genes have not been defined in more than 80% of cases. Here we compared whole genome sequencing results from a male ALS patient and his healthy parents to identify relevant variants, and chose one variant in the X-linked ATP7A gene, M1311V, as a strong disease-linked candidate after profound examination. Although this variant is not rare in the Ashkenazi Jewish population according to results in the genome aggregation database (gnomAD), CRISPR-mediated gene correction of this mutation in patient-derived and re-differentiated motor neurons drastically rescued neuronal activities and functions. These results suggest that the ATP7A M1311V mutation has a potential responsibility for ALS in this patient and might be a potential therapeutic target, revealed here by a personalized medicine strategy.
Affiliation(s)
- Yeomin Yun
  - Department of Neurosurgery, Spine & Spinal Cord Institute, College of Medicine, Yonsei University, Seoul, 03722, South Korea
  - Brain Korea 21 PLUS Project for Medical Science, College of Medicine, Yonsei University, Seoul, 03722, South Korea
- Sung-Ah Hong
  - Department of Chemistry, Hanyang University, Seoul, 04763, South Korea
  - Research Institute for Natural Sciences, Hanyang University, Seoul, 04763, South Korea
- Ka-Kyung Kim
  - Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, 03722, South Korea
- Daye Baek
  - Department of Neurosurgery, Spine & Spinal Cord Institute, College of Medicine, Yonsei University, Seoul, 03722, South Korea
  - Brain Korea 21 PLUS Project for Medical Science, College of Medicine, Yonsei University, Seoul, 03722, South Korea
- Dongsu Lee
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, 03722, South Korea
- Ashwini M Londhe
  - Convergence Research Center for Diagnosis, Treatment and Care System of Dementia, Korea Institute of Science and Technology, PO Box 131, Cheongryang, Seoul, 130-650, South Korea
  - Division of Bio-Medical Science & Technology, KIST School, Korea University of Science and Technology, Seoul, 02792, South Korea
- Minhyung Lee
  - Stem Cell Convergence Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, South Korea
  - Department of Functional Genomics, KRIBB School of Bioscience, Korea University of Science and Technology, Daejeon, 34113, South Korea
- Jihyeon Yu
  - Department of Chemistry, Hanyang University, Seoul, 04763, South Korea
- Zachary T McEachin
  - Laboratory of Translational Cell Biology, Emory University School of Medicine, Atlanta, GA, 30322, USA
- Gary J Bassell
  - Laboratory of Translational Cell Biology, Emory University School of Medicine, Atlanta, GA, 30322, USA
  - Department of Cell Biology, Emory University, Atlanta, GA, 30322, USA
- Robert Bowser
  - Department of Neurobiology, Barrow Neurological Institute and St. Joseph's Hospital and Medical Center, Phoenix, AZ, 85013, USA
- Chadwick M Hales
  - Department of Neurology, Emory University, Atlanta, GA, 30322, USA
- Sung-Rae Cho
  - Brain Korea 21 PLUS Project for Medical Science, College of Medicine, Yonsei University, Seoul, 03722, South Korea
  - Department and Research Institute of Rehabilitation Medicine, Yonsei University College of Medicine, Seoul, 03722, South Korea
- Janghwan Kim
  - Stem Cell Convergence Research Center, Korea Research Institute of Bioscience and Biotechnology (KRIBB), Daejeon, 34141, South Korea
  - Department of Functional Genomics, KRIBB School of Bioscience, Korea University of Science and Technology, Daejeon, 34113, South Korea
- Ae Nim Pae
  - Convergence Research Center for Diagnosis, Treatment and Care System of Dementia, Korea Institute of Science and Technology, PO Box 131, Cheongryang, Seoul, 130-650, South Korea
  - Division of Bio-Medical Science & Technology, KIST School, Korea University of Science and Technology, Seoul, 02792, South Korea
- Eunji Cheong
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, 03722, South Korea
- Sangwoo Kim
  - Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, 03722, South Korea
- Nicholas M Boulis
  - Department of Neurosurgery, Emory University School of Medicine, Atlanta, GA, 30322, USA
- Sangsu Bae
  - Department of Chemistry, Hanyang University, Seoul, 04763, South Korea
  - Research Institute for Natural Sciences, Hanyang University, Seoul, 04763, South Korea
- Yoon Ha
  - Department of Neurosurgery, Spine & Spinal Cord Institute, College of Medicine, Yonsei University, Seoul, 03722, South Korea
  - Brain Korea 21 PLUS Project for Medical Science, College of Medicine, Yonsei University, Seoul, 03722, South Korea
|
35
|
Masseroli M, Canakoglu A, Pinoli P, Kaitoua A, Gulino A, Horlova O, Nanni L, Bernasconi A, Perna S, Stamoulakatou E, Ceri S. Processing of big heterogeneous genomic datasets for tertiary analysis of Next Generation Sequencing data. Bioinformatics 2019; 35:729-736. [PMID: 30101316 DOI: 10.1093/bioinformatics/bty688] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 03/31/2018] [Revised: 08/01/2018] [Accepted: 08/06/2018] [Indexed: 01/17/2023] Open
Abstract
MOTIVATION: We previously proposed a paradigm shift in genomic data management, based on the Genomic Data Model (GDM) for mediating existing data formats and on the GenoMetric Query Language (GMQL) for supporting, at a high level of abstraction, data extraction and the most common data-driven computations required by tertiary data analysis of Next Generation Sequencing datasets. Here, we present a new GMQL-based system with enhanced accessibility, portability, scalability and performance. RESULTS: The new system has a well-designed modular architecture featuring: (i) an intermediate representation supporting many different implementations (including Spark, Flink and SciDB); (ii) a high-level technology-independent repository abstraction, supporting different repository technologies (e.g., local file system, Hadoop File System, database or others); (iii) several system interfaces, including a user-friendly Web-based interface, a Web Service interface, and a programmatic interface for Python language. Biological use case examples, using public ENCODE, Roadmap Epigenomics and TCGA datasets, demonstrate the relevance of our work. AVAILABILITY AND IMPLEMENTATION: The GMQL system is freely available for non-commercial use as open source project at: http://www.bioinformatics.deib.polimi.it/GMQLsystem/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Marco Masseroli
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Arif Canakoglu
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Pietro Pinoli
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Abdulrahman Kaitoua
  - The German Research Center for Artificial Intelligence (DFKI), Berlin, Germany
- Andrea Gulino
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Olha Horlova
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Luca Nanni
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Anna Bernasconi
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Stefano Perna
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Eirini Stamoulakatou
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
- Stefano Ceri
  - Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan, Italy
|
36
|
Mabvakure BM, Rott R, Dobrowsky L, Van Heusden P, Morris L, Scheepers C, Moore PL. Advancing HIV Vaccine Research With Low-Cost High-Performance Computing Infrastructure: An Alternative Approach for Resource-Limited Settings. Bioinform Biol Insights 2019; 13:1177932219882347. [PMID: 35173421 PMCID: PMC8842485 DOI: 10.1177/1177932219882347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2019] [Accepted: 09/21/2019] [Indexed: 11/17/2022] Open
Abstract
Next-generation sequencing (NGS) technologies have revolutionized biological research by generating genomic data that were once unaffordable by traditional first-generation sequencing technologies. These sequencing methodologies provide an opportunity for in-depth analyses of host and pathogen genomes as they are able to sequence millions of templates at a time. However, these large datasets can only be efficiently explored using bioinformatics analyses requiring huge data storage and computational resources adapted for high-performance processing. High-performance computing allows for efficient handling of large data and tasks that may require multi-threading and prolonged computational times, which is not feasible with ordinary computers. However, high-performance computing resources are costly and therefore not always readily available in low-income settings. We describe the establishment of an affordable high-performance computing bioinformatics cluster consisting of 3 nodes, constructed using ordinary desktop computers and open-source software including Linux Fedora, SLURM Workload Manager, and the Conda package manager. For the analysis of large antibody sequence datasets and for complex viral phylodynamic analyses, the cluster out-performed desktop computers. This has demonstrated that it is possible to construct high-performance computing capacity capable of analyzing large NGS data from relatively low-cost hardware and entirely free (open-source) software, even in resource-limited settings. Such a cluster design has broad utility beyond bioinformatics to other studies that require high-performance computing.
Affiliation(s)
- Batsirai M Mabvakure
  - Center for HIV and STIs, National Institute for Communicable Diseases, National Health Laboratory Service (NHLS), Johannesburg, South Africa
  - Antibody Immunity Research Unit, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
  - Division of Transfusion Medicine, Department of Pathology, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Peter Van Heusden
  - South African National Bioinformatics Institute, University of the Western Cape, Cape Town, South Africa
- Lynn Morris
  - Center for HIV and STIs, National Institute for Communicable Diseases, National Health Laboratory Service (NHLS), Johannesburg, South Africa
  - Antibody Immunity Research Unit, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
  - Centre for the AIDS Programme of Research in South Africa (CAPRISA), University of KwaZulu-Natal, Durban, South Africa
- Cathrine Scheepers
  - Center for HIV and STIs, National Institute for Communicable Diseases, National Health Laboratory Service (NHLS), Johannesburg, South Africa
  - Antibody Immunity Research Unit, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
- Penny L Moore
  - Center for HIV and STIs, National Institute for Communicable Diseases, National Health Laboratory Service (NHLS), Johannesburg, South Africa
  - Antibody Immunity Research Unit, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
  - Centre for the AIDS Programme of Research in South Africa (CAPRISA), University of KwaZulu-Natal, Durban, South Africa
|
37
|
Lightbody G, Haberland V, Browne F, Taggart L, Zheng H, Parkes E, Blayney JK. Review of applications of high-throughput sequencing in personalized medicine: barriers and facilitators of future progress in research and clinical application. Brief Bioinform 2019; 20:1795-1811. [PMID: 30084865 PMCID: PMC6917217 DOI: 10.1093/bib/bby051] [Citation(s) in RCA: 88] [Impact Index Per Article: 17.6] [Received: 01/30/2018] [Revised: 05/01/2018] [Indexed: 12/28/2022] Open
Abstract
There has been an exponential growth in the performance and output of sequencing technologies (omics data) with full genome sequencing now producing gigabases of reads on a daily basis. These data may hold the promise of personalized medicine, leading to routinely available sequencing tests that can guide patient treatment decisions. In the era of high-throughput sequencing (HTS), computational considerations, data governance and clinical translation are the greatest rate-limiting steps. To ensure that the analysis, management and interpretation of such extensive omics data is exploited to its full potential, key factors, including sample sourcing, technology selection and computational expertise and resources, need to be considered, leading to an integrated set of high-performance tools and systems. This article provides an up-to-date overview of the evolution of HTS and the accompanying tools, infrastructure and data management approaches that are emerging in this space, which, if used within a multidisciplinary context, may ultimately facilitate the development of personalized medicine.
Affiliation(s)
- Gaye Lightbody
  - School of Computing, Ulster University, Newtownabbey, UK
- Valeriia Haberland
  - MRC Integrative Epidemiology Unit, Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
- Fiona Browne
  - School of Computing, Ulster University, Newtownabbey, UK
- Huiru Zheng
  - School of Computing, Ulster University, Newtownabbey, UK
- Eileen Parkes
  - Centre for Cancer Research & Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University, Belfast, UK
- Jaine K Blayney
  - Centre for Cancer Research & Cell Biology, School of Medicine, Dentistry and Biomedical Sciences, Queen's University, Belfast, UK
|
38
|
Bagheri H, Muppirala U, Masonbrink RE, Severin AJ, Rajan H. Shared data science infrastructure for genomics data. BMC Bioinformatics 2019; 20:436. [PMID: 31438850 PMCID: PMC6704658 DOI: 10.1186/s12859-019-2967-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/02/2019] [Accepted: 06/25/2019] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND: Creating a scalable computational infrastructure to analyze the wealth of information contained in data repositories is difficult due to significant barriers in organizing, extracting and analyzing relevant data. Shared data science infrastructures like Boag are needed to efficiently process and parse data contained in large data repositories. The main features of Boag are inspired by existing languages for data intensive computing and can easily integrate data from biological data repositories. RESULTS: As a proof of concept, Boa for genomics, Boag, has been implemented to analyze RefSeq's 153,848 annotation (GFF) and assembly (FASTA) file metadata. Boag provides a massive improvement over existing solutions like Python and MongoDB, by utilizing a domain-specific language that uses Hadoop infrastructure for a smaller storage footprint that scales well and requires fewer lines of code. We execute scripts through Boag to answer questions about the genomes in RefSeq. We identify the largest and smallest genomes deposited, explore exon frequencies for assemblies after 2016, identify the most commonly used bacterial genome assembly program, and address how animal genome assemblies have improved since 2016. Boag databases provide a significant reduction in required storage of the raw data and a significant speed up in its ability to query large datasets due to automated parallelization and distribution of Hadoop infrastructure during computations. CONCLUSIONS: In order to keep pace with our ability to produce biological data, innovative methods are required. The Shared Data Science Infrastructure, Boag, gives researchers greater access to efficiently explore data in new ways. We demonstrate the potential of the domain-specific language Boag using the RefSeq database to explore how deposited genome assemblies and annotations are changing over time. This is a small example of how Boag could be used with large biological datasets.
Affiliation(s)
- Hamid Bagheri
  - Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, 50011 USA
- Usha Muppirala
  - Genome Informatics Facility, Iowa State University, 206 Science I, Ames, 50011 USA
- Rick E. Masonbrink
  - Genome Informatics Facility, Iowa State University, 206 Science I, Ames, 50011 USA
- Andrew J. Severin
  - Genome Informatics Facility, Iowa State University, 206 Science I, Ames, 50011 USA
- Hridesh Rajan
  - Department of Computer Science, Iowa State University, 226 Atanasoff Hall, Ames, 50011 USA
|
39
|
Abstract
Pancreatic ductal adenocarcinoma (PDAC) is lethal, and the majority of patients present with locally advanced or metastatic disease that is not amenable to cure. Thus, with surgical resection being the only curative modality, it is critical that disease is identified at an earlier stage to allow the appropriate therapy to be applied. Unfortunately, a specific biomarker for early diagnosis has not yet been identified; hence, no screening process exists. Recently, high-throughput screening and next-generation sequencing (NGS) have led to the identification of novel biomarkers for many disease processes, and work has commenced in PDAC. Genomic data generated by NGS not only have the potential to assist clinicians in early diagnosis and screening, especially in high-risk populations, but also may eventually allow the development of personalized treatment programs with targeted therapies, given the large number of gene mutations seen in PDAC. This review introduces the basic concepts of NGS and provides a comprehensive review of the current understanding of genetics in PDAC as related to discoveries made using NGS.
|
40
|
Zebrowska J, Jezewska-Frackowiak J, Wieczerzak E, Kasprzykowski F, Zylicz-Stachula A, Skowron PM. Novel parameter describing restriction endonucleases: Secondary-Cognate-Specificity and chemical stimulation of TsoI leading to substrate specificity change. Appl Microbiol Biotechnol 2019; 103:3439-3451. [PMID: 30879089 PMCID: PMC6449304 DOI: 10.1007/s00253-019-09731-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/23/2018] [Revised: 02/22/2019] [Accepted: 02/27/2019] [Indexed: 11/30/2022]
Abstract
Over 470 prototype Type II restriction endonucleases (REases) are currently known. Most recognise specific DNA sequences 4–8 bp long, with very few exceptions cleaving DNA more frequently. TsoI is a thermostable Type IIC enzyme that recognises the DNA sequence TARCCA (R = A or G) and cleaves downstream at N11/N9. The enzyme exhibits extensive top-strand nicking of the supercoiled single-site DNA substrate. The second DNA strand of such substrate is specifically cleaved only in the presence of duplex oligonucleotides containing a cognate site. We have previously shown that some Type IIC/IIG/IIS enzymes from the Thermus-family exhibit ‘affinity star’ activity, which can be induced by the S-adenosyl-L-methionine (SAM) cofactor analogue—sinefungin (SIN). Here, we define a novel type of inherently built-in ‘star’ activity, exemplified by TsoI. The TsoI ‘star’ activity cannot be described under the definition of the classic ‘star’ activity as it is independent of the reaction conditions used and cannot be separated from the cognate specificity. Therefore, we define this phenomenon as Secondary-Cognate-Specificity (SCS). The TsoI SCS comprises several degenerated variants of the cognate site. Although the efficiency of TsoI SCS cleavage is lower in comparison to the cognate TsoI recognition sequence, it can be stimulated by S-adenosyl-L-cysteine (SAC). We present a new route for the chemical synthesis of SAC. The TsoI/SAC REase may serve as a novel tool for DNA manipulation.
Affiliation(s)
- Joanna Zebrowska
  - Department of Molecular Biotechnology, Faculty of Chemistry, University of Gdansk, 63 Wita Stwosza Street, 80-308, Gdansk, Poland
- Joanna Jezewska-Frackowiak
  - Department of Molecular Biotechnology, Faculty of Chemistry, University of Gdansk, 63 Wita Stwosza Street, 80-308, Gdansk, Poland
- Ewa Wieczerzak
  - Department of Biomedical Chemistry, Faculty of Chemistry, University of Gdansk, 63 Wita Stwosza Street, 80-308, Gdansk, Poland
- Franciszek Kasprzykowski
  - Department of Biomedical Chemistry, Faculty of Chemistry, University of Gdansk, 63 Wita Stwosza Street, 80-308, Gdansk, Poland
- Agnieszka Zylicz-Stachula
  - Department of Molecular Biotechnology, Faculty of Chemistry, University of Gdansk, 63 Wita Stwosza Street, 80-308, Gdansk, Poland
- Piotr M Skowron
  - Department of Molecular Biotechnology, Faculty of Chemistry, University of Gdansk, 63 Wita Stwosza Street, 80-308, Gdansk, Poland
|
41
|
Renshaw AA, Birdsong GG. Freeing the data from cytology databases in order to improve the quality of cytology. Diagn Cytopathol 2018; 47:48-52. [PMID: 30478895 DOI: 10.1002/dc.24071] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 06/28/2017] [Accepted: 08/13/2018] [Indexed: 11/06/2022]
Abstract
INTRODUCTION To review how changes in data storage and analysis can impact quality and quality assessment in cytology. METHODS Review of the literature. RESULTS All quality assessment is dependent on the data available for review and the methods available for evaluation. Current laboratory information systems (LISs) incorporate both a relational or hierarchical database and built-in methods to analyze current quality assessment standards. In contrast, most information systems outside of medicine are separating data storage from analysis, allowing increasingly sophisticated forms of evaluation. CONCLUSION There is an opportunity for improvement in cytology by improving the way data can be extracted from the cytology LIS and analyzed.
|
42
|
Studying how genetic variants affect mechanism in biological systems. Essays Biochem 2018; 62:575-582. [PMID: 30315099 DOI: 10.1042/ebc20180021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2018] [Revised: 09/13/2018] [Accepted: 09/14/2018] [Indexed: 11/17/2022]
Abstract
Genetic variants are currently a major component of system-wide investigations into biological function or disease. Approaches to select variants (often out of thousands of candidates) that are responsible for a particular phenomenon have many clinical applications and can help illuminate differences between individuals. Selecting meaningful variants is greatly aided by integration with information about molecular mechanism, whether known from protein structures or interactions or biological pathways. In this review we discuss the nature of genetic variants, and recent studies highlighting what is currently known about the relationship between genetic variation, biomolecular function, and disease.
|
43
|
Jung J, Yi G. A performance analysis of genome search by matching whole targeted reads on different environments. Soft Comput 2018. [DOI: 10.1007/s00500-018-3573-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 12/31/2022]
|
44
|
HSRA: Hadoop-based spliced read aligner for RNA sequencing data. PLoS One 2018; 13:e0201483. [PMID: 30063721 PMCID: PMC6067734 DOI: 10.1371/journal.pone.0201483] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 04/19/2018] [Accepted: 07/16/2018] [Indexed: 01/18/2023] Open
Abstract
The analysis of transcriptome sequencing (RNA-seq) data has become the standard method for quantifying the levels of gene expression. In RNA-seq experiments, the mapping of short reads to a reference genome or transcriptome is a crucial step that remains one of the most time-consuming. With the steady development of Next Generation Sequencing (NGS) technologies, unprecedented amounts of genomic data introduce significant challenges in terms of storage, processing and downstream analysis. As cost and throughput continue to improve, there is a growing need for new software solutions that minimize the impact of increasing data volume on RNA read alignment. In this work we introduce HSRA, a Big Data tool that takes advantage of the MapReduce programming model to extend the multithreading capabilities of a state-of-the-art spliced read aligner for RNA-seq data (HISAT2) to distributed memory systems such as multi-core clusters or cloud platforms. HSRA has been built upon the Hadoop MapReduce framework and supports both single- and paired-end reads from FASTQ/FASTA datasets, providing output alignments in SAM format. The design of HSRA has been carefully optimized to avoid the main limitations and major causes of inefficiency found in previous Big Data mapping tools, which cannot fully exploit the raw performance of the underlying aligner. On a 16-node multi-core cluster, HSRA is on average 2.3 times faster than previous Hadoop-based tools. Source code in Java as well as a user’s guide are publicly available for download at http://hsra.dec.udc.es.
|
45
|
Danchin A, Ouzounis C, Tokuyasu T, Zucker JD. No wisdom in the crowd: genome annotation in the era of big data - current status and future prospects. Microb Biotechnol 2018; 11:588-605. [PMID: 29806194 PMCID: PMC6011933 DOI: 10.1111/1751-7915.13284] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Indexed: 12/15/2022] Open
Abstract
Science and engineering rely on the accumulation and dissemination of knowledge to make discoveries and create new designs. Discovery-driven genome research rests on knowledge passed on via gene annotations. In response to the deluge of sequencing big data, standard annotation practice employs automated procedures that rely on majority rules. We argue this hinders progress through the generation and propagation of errors, leading investigators into blind alleys. More subtly, this inductive process discourages the discovery of novelty, which remains essential in biological research and reflects the nature of biology itself. Annotation systems, rather than being repositories of facts, should be tools that support multiple modes of inference. By combining deduction, induction and abduction, investigators can generate hypotheses when accurate knowledge is extracted from model databases. A key stance is to depart from 'the sequence tells the structure tells the function' fallacy, placing function first. We illustrate our approach with examples of critical or unexpected pathways, using MicroScope to demonstrate how tools can be implemented following the principles we advocate. We end with a challenge to the reader.
Affiliation(s)
- Antoine Danchin
  - Integromics, Institute of Cardiometabolism and Nutrition, Hôpital de la Pitié-Salpêtrière, 47 Boulevard de l'Hôpital, 75013, Paris, France
  - School of Biomedical Sciences, Li KaShing Faculty of Medicine, Hong Kong University, 21 Sassoon Road, Pokfulam, Hong Kong
- Christos Ouzounis
  - Biological Computation and Process Laboratory, Centre for Research and Technology Hellas, Chemical Process and Energy Resources Institute, Thessalonica, 57001, Greece
- Taku Tokuyasu
  - Shenzhen Institutes of Advanced Technology, Institute of Synthetic Biology, Shenzhen University Town, 1068 Xueyuan Avenue, Shenzhen, China
- Jean-Daniel Zucker
  - Integromics, Institute of Cardiometabolism and Nutrition, Hôpital de la Pitié-Salpêtrière, 47 Boulevard de l'Hôpital, 75013, Paris, France
|
46
|
Li Z, Wang Y, Wang F. A study on fast calling variants from next-generation sequencing data using decision tree. BMC Bioinformatics 2018; 19:145. [PMID: 29673316 PMCID: PMC5907718 DOI: 10.1186/s12859-018-2147-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 10/30/2017] [Accepted: 04/03/2018] [Indexed: 12/30/2022] Open
Abstract
Background The rapid development of next-generation sequencing (NGS) technology has continuously increased the throughput of sequencing data. However, due to the lack of a tool that is both fast and accurate, the analysis of NGS data, especially low-coverage data, remains challenging. Results We proposed a decision-tree based variant calling algorithm. Experiments on a set of real data indicate that our algorithm achieves high accuracy and sensitivity for SNVs and indels and shows good adaptability on low-coverage data. In particular, our algorithm is clearly faster than 3 widely used tools in our experiments. Conclusions We implemented our algorithm in software named Fuwa and applied it together with 4 well-known variant callers, i.e., Platypus, GATK-UnifiedGenotyper, GATK-HaplotypeCaller and SAMtools, to three sequencing data sets of a well-studied sample NA12878, which were produced by whole-genome, whole-exome and low-coverage whole-genome sequencing technology respectively. We also conducted additional experiments on the WGS data of 4 newly released samples that have not been used to populate dbSNP.
Affiliation(s)
- Zhentang Li
  - Shanghai Key Lab of Intelligent Information Processing, Shanghai, China; School of Computer Science and Technology, Fudan University, Shanghai, China
- Yi Wang
  - MOE Key Laboratory of Contemporary Anthropology and State Key Laboratory of Genetic Engineering, Collaborative Innovation Center of Genetics and Developmental Biology and School of Life Sciences, Fudan University, Shanghai, 200438, China
- Fei Wang
  - Shanghai Key Lab of Intelligent Information Processing, Shanghai, China; School of Computer Science and Technology, Fudan University, Shanghai, China
|
47
|
D'Argenio V. The High-Throughput Analyses Era: Are We Ready for the Data Struggle? High Throughput 2018; 7:E8. [PMID: 29498666 PMCID: PMC5876534 DOI: 10.3390/ht7010008] [Citation(s) in RCA: 38] [Impact Index Per Article: 6.3] [Received: 12/28/2017] [Revised: 02/16/2018] [Accepted: 02/27/2018] [Indexed: 12/23/2022] Open
Abstract
Recent and rapid technological advances in molecular sciences have dramatically increased the ability to carry out high-throughput studies characterized by big data production. This, in turn, has had the negative effect of exposing a gap between data yield and data analysis. Indeed, big data management is becoming an increasingly important aspect of many fields of molecular research including the study of human diseases. Now, the challenge is to identify, within the huge amount of data obtained, that which is of clinical relevance. In this context, issues related to data interpretation, sharing and storage need to be assessed and standardized. Once this is achieved, the integration of data from different -omic approaches will improve the diagnosis, monitoring and therapy of diseases by allowing the identification of novel, potentially actionable biomarkers in view of personalized medicine.
Affiliation(s)
- Valeria D'Argenio
  - CEINGE-Biotecnologie Avanzate, via G. Salvatore 486, 80145 Naples, Italy
  - Department of Molecular Medicine and Medical Biotechnologies, University of Naples Federico II, via Pansini 5, 80131 Naples, Italy
|
48
|
Christensen PA, Ni Y, Bao F, Hendrickson HL, Greenwood M, Thomas JS, Long SW, Olsen RJ. Houston Methodist Variant Viewer: An Application to Support Clinical Laboratory Interpretation of Next-generation Sequencing Data for Cancer. J Pathol Inform 2017; 8:44. [PMID: 29226007 PMCID: PMC5719586 DOI: 10.4103/jpi.jpi_48_17] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 06/27/2017] [Accepted: 10/12/2017] [Indexed: 01/17/2023] Open
Abstract
Introduction Next-generation-sequencing (NGS) is increasingly used in clinical and research protocols for patients with cancer. NGS assays are routinely used in clinical laboratories to detect mutations bearing on cancer diagnosis, prognosis and personalized therapy. A typical assay may interrogate 50 or more gene targets that encompass many thousands of possible gene variants. Analysis of NGS data in cancer is a labor-intensive process that can become overwhelming to the molecular pathologist or research scientist. Although commercial tools for NGS data analysis and interpretation are available, they are often costly, lack key functionality or cannot be customized by the end user. Methods To facilitate NGS data analysis in our clinical molecular diagnostics laboratory, we created a custom bioinformatics tool termed Houston Methodist Variant Viewer (HMVV). HMVV is a Java-based solution that integrates sequencing instrument output, bioinformatics analysis, storage resources and end user interface. Results Compared to the predicate method used in our clinical laboratory, HMVV markedly simplifies the bioinformatics workflow for the molecular technologist and facilitates the variant review by the molecular pathologist. Importantly, HMVV reduces time spent researching the biological significance of the variants detected, standardizes the online resources used to perform the variant investigation and assists generation of the annotated report for the electronic medical record. HMVV also maintains a searchable variant database, including the variant annotations generated by the pathologist, which is useful for downstream quality improvement and research projects. Conclusions HMVV is a clinical grade, low-cost, feature-rich, highly customizable platform that we have made available for continued development by the pathology informatics community.
Affiliation(s)
- Paul A Christensen
  - Department of Pathology and Genomic Medicine, Houston Methodist Hospital, Weill Cornell Medical College of Cornell University, Houston, Texas, USA
- Yunyun Ni
  - Department of Pathology and Genomic Medicine, Houston Methodist Hospital, Weill Cornell Medical College of Cornell University, Houston, Texas, USA; Helix, San Carlos, California 94070, USA
- Feifei Bao
  - Department of Pathology and Genomic Medicine, Houston Methodist Hospital, Weill Cornell Medical College of Cornell University, Houston, Texas, USA
- Heather L Hendrickson
  - Department of Pathology and Genomic Medicine, Houston Methodist Hospital, Weill Cornell Medical College of Cornell University, Houston, Texas, USA
- Michael Greenwood
  - Department of Pathology and Genomic Medicine, Houston Methodist Hospital, Weill Cornell Medical College of Cornell University, Houston, Texas, USA
- Jessica S Thomas
  - Department of Pathology and Genomic Medicine, Houston Methodist Hospital, Weill Cornell Medical College of Cornell University, Houston, Texas, USA
- S Wesley Long
  - Department of Pathology and Genomic Medicine, Houston Methodist Hospital, Weill Cornell Medical College of Cornell University, Houston, Texas, USA
- Randall J Olsen
  - Department of Pathology and Genomic Medicine, Houston Methodist Hospital, Weill Cornell Medical College of Cornell University, Houston, Texas, USA
|
49
|
SPP1, analyzed by bioinformatics methods, promotes the metastasis in colorectal cancer by activating EMT pathway. Biomed Pharmacother 2017; 91:1167-1177. [PMID: 28531945 DOI: 10.1016/j.biopha.2017.05.056] [Citation(s) in RCA: 64] [Impact Index Per Article: 9.1] [Received: 03/28/2017] [Revised: 04/26/2017] [Accepted: 05/10/2017] [Indexed: 12/30/2022] Open
Abstract
OBJECTIVE Tumor metastasis is still a great challenge for the prognosis of colorectal cancer (CRC). Although secreted phosphoprotein 1 (SPP1) overexpression is confirmed to be associated with invasion and metastasis of CRC, the underlying mechanism by which it modulates CRC metastasis is still not fully explained. METHOD GDS4382 was obtained from the GEO database and differentially expressed genes (DEGs) were analyzed by bioinformatics methods. 55 paired samples of CRC and adjacent non-cancerous tissues were collected to detect the expression of SPP1 by q-PCR and western blot. Functional analysis of siRNA-SPP1, including proliferation, apoptosis, colony formation, cell cycle and migration, was carried out in CRC cell lines, and tumor xenografts were conducted in nude mice. Protein expression of E-cadherin and vimentin was detected by western blot. RESULTS 1887 DEGs were analyzed and selected from GDS4382, of which SPP1 and epithelial-mesenchymal transition (EMT) showed a close association by bioinformatics analysis. The mRNA and protein expression of SPP1 were significantly higher in CRC tissues than in adjacent non-cancerous tissues (P<0.05). Overexpression of SPP1 was closely associated with tumor invasion, metastasis and low survival in CRC. Moreover, siRNA-SPP1 repressed proliferation, cell cycle, colony formation, migration and tumor growth in vivo and promoted cell apoptosis in CRC cell lines. In addition, protein expression of E-cadherin was markedly up-regulated and vimentin was down-regulated in CRC cells after siRNA-SPP1 (P<0.05). CONCLUSION SPP1 expression was significantly up-regulated in CRC. SPP1 promoted the metastasis of CRC by activating EMT, and could be a potential therapeutic target for patients with CRC.
|
50
|
Jagodnik KM, Koplev S, Jenkins SL, Ohno-Machado L, Paten B, Schurer SC, Dumontier M, Verborgh R, Bui A, Ping P, McKenna NJ, Madduri R, Pillai A, Ma'ayan A. Developing a framework for digital objects in the Big Data to Knowledge (BD2K) commons: Report from the Commons Framework Pilots workshop. J Biomed Inform 2017; 71:49-57. [PMID: 28501646 DOI: 10.1016/j.jbi.2017.05.006] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 11/15/2016] [Revised: 05/01/2017] [Accepted: 05/08/2017] [Indexed: 12/11/2022]
Abstract
The volume and diversity of data in biomedical research have been rapidly increasing in recent years. While such data hold significant promise for accelerating discovery, their use entails many challenges including: the need for adequate computational infrastructure, secure processes for data sharing and access, tools that allow researchers to find and integrate diverse datasets, and standardized methods of analysis. These are just some elements of a complex ecosystem that needs to be built to support the rapid accumulation of these data. The NIH Big Data to Knowledge (BD2K) initiative aims to facilitate digitally enabled biomedical research. Within the BD2K framework, the Commons initiative is intended to establish a virtual environment that will facilitate the use, interoperability, and discoverability of shared digital objects used for research. The BD2K Commons Framework Pilots Working Group (CFPWG) was established to clarify goals and work on pilot projects that address existing gaps toward realizing the vision of the BD2K Commons. This report reviews highlights from a two-day meeting involving the BD2K CFPWG to provide insights on trends and considerations in advancing Big Data science for biomedical research in the United States.
Affiliation(s)
- Kathleen M Jagodnik
  - Department of Pharmacological Sciences, BD2K-LINCS Data Coordination and Integration Center, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1215, New York, NY 10029, USA
- Simon Koplev
  - Department of Pharmacological Sciences, BD2K-LINCS Data Coordination and Integration Center, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1215, New York, NY 10029, USA
- Sherry L Jenkins
  - Department of Pharmacological Sciences, BD2K-LINCS Data Coordination and Integration Center, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1215, New York, NY 10029, USA
- Lucila Ohno-Machado
  - Health System Department of Biomedical Informatics, University of California San Diego, 9500 Gilman Dr., La Jolla, CA 92083, USA; Health Services Research, San Diego Veterans Administration Health System, San Diego, CA 92083, USA
- Benedict Paten
  - UC Santa Cruz Genomics Institute, University of California, Santa Cruz, 1156 High St., Santa Cruz, CA 95060, USA
- Stephan C Schurer
  - Department of Molecular and Cellular Pharmacology, University of Miami, 331461120 NW 14th Street, CRB 650 (M-857), Miami, FL 33136, USA
- Michel Dumontier
  - Institute for Data Science, Universiteit Maastricht, Minderbroedersberg 4-6, 6211 LK Maastricht, Netherlands
- Ruben Verborgh
  - Ghent University - iMinds Research Foundation Flanders, St. Pietersnieuwstraat 33, 9000 Gent, Belgium
- Alex Bui
  - Department of Radiological Sciences, UCLA School of Medicine, Los Angeles, CA 90095, USA; Department of Bioengineering, UCLA Henri Samueli School of Engineering, Los Angeles, CA 90095, USA
- Peipei Ping
  - Departments of Physiology, Medicine, and Bioinformatics, UCLA School of Medicine, Los Angeles, CA 90095, USA
- Neil J McKenna
  - Department of Molecular and Cellular Biology, Baylor College of Medicine, 1 Baylor Plaza, Houston, TX 77030, USA
- Ravi Madduri
  - Department of Mathematics and Computer Science, Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, IL 60439, USA
- Ajay Pillai
  - Division of Genome Sciences, National Human Genome Research Institute, National Institutes of Health, 31 Center Drive, MSC 2152, 9000 Rockville Pike, Bethesda, MD 20892, USA
- Avi Ma'ayan
  - Department of Pharmacological Sciences, BD2K-LINCS Data Coordination and Integration Center, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1215, New York, NY 10029, USA
|