1. Chen C, Chen H, Zhang Y, Thomas HR, Frank MH, He Y, Xia R. TBtools: An Integrative Toolkit Developed for Interactive Analyses of Big Biological Data. Mol Plant 2020; 13:1194-1202. [PMID: 32585190; DOI: 10.1016/j.molp.2020.06.009]. Cited in RCA: 7050.
Abstract
The rapid development of high-throughput sequencing techniques has led biology into the big-data era. Data analyses using various bioinformatics tools rely on programming and command-line environments, which are challenging and time-consuming for most wet-lab biologists. Here, we present TBtools (a Toolkit for Biologists integrating various biological data-handling tools), a stand-alone software with a user-friendly interface. The toolkit incorporates over 130 functions, which are designed to meet the increasing demand for big-data analyses, ranging from bulk sequence processing to interactive data visualization. A wide variety of graphs can be prepared in TBtools using a new plotting engine ("JIGplot") developed to maximize their interactive ability; this engine allows quick point-and-click modification of almost every graphic feature. TBtools is platform-independent software that can be run under all operating systems with Java Runtime Environment 1.6 or newer. It is freely available to non-commercial users at https://github.com/CJ-Chen/TBtools/releases.
2. Pasolli E, Asnicar F, Manara S, Zolfo M, Karcher N, Armanini F, Beghini F, Manghi P, Tett A, Ghensi P, Collado MC, Rice BL, DuLong C, Morgan XC, Golden CD, Quince C, Huttenhower C, Segata N. Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle. Cell 2019; 176:649-662.e20. [PMID: 30661755; PMCID: PMC6349461; DOI: 10.1016/j.cell.2019.01.001]. Research Support, N.I.H., Extramural. Cited in RCA: 921.
Abstract
The body-wide human microbiome plays a role in health, but its full diversity remains uncharacterized, particularly outside of the gut and in international populations. We leveraged 9,428 metagenomes to reconstruct 154,723 microbial genomes (45% of high quality) spanning body sites, ages, countries, and lifestyles. We recapitulated 4,930 species-level genome bins (SGBs), 77% without genomes in public repositories (unknown SGBs [uSGBs]). uSGBs are prevalent (in 93% of well-assembled samples), expand underrepresented phyla, and are enriched in non-Westernized populations (40% of the total SGBs). We annotated 2.85 M genes in SGBs, many associated with conditions including infant development (94,000) or Westernization (106,000). SGBs and uSGBs permit deeper microbiome analyses and increase the average mappability of metagenomic reads from 67.76% to 87.51% in the gut (median 94.26%) and 65.14% to 82.34% in the mouth. We thus identify thousands of microbial genomes from yet-to-be-named species, expand the pangenomes of human-associated microbes, and allow better exploitation of metagenomic technologies.
3. Choi SW, O'Reilly PF. PRSice-2: Polygenic Risk Score software for biobank-scale data. Gigascience 2019; 8:giz082. [PMID: 31307061; PMCID: PMC6629542; DOI: 10.1093/gigascience/giz082]. Research article. Cited in RCA: 876.
Abstract
BACKGROUND Polygenic risk score (PRS) analyses have become an integral part of biomedical research, exploited to gain insights into shared aetiology among traits, to control for genomic profile in experimental studies, and to strengthen causal inference, among a range of applications. Substantial efforts are now devoted to biobank projects to collect large genetic and phenotypic data, providing unprecedented opportunity for genetic discovery and applications. To process the large-scale data provided by such biobank resources, highly efficient and scalable methods and software are required. RESULTS Here we introduce PRSice-2, an efficient and scalable software program for automating and simplifying PRS analyses on large-scale data. PRSice-2 handles both genotyped and imputed data, provides empirical association P-values free from inflation due to overfitting, supports different inheritance models, and can evaluate multiple continuous and binary target traits simultaneously. We demonstrate that PRSice-2 is dramatically faster and more memory-efficient than PRSice-1 and alternative PRS software, LDpred and lassosum, while having comparable predictive power. CONCLUSION PRSice-2's combination of efficiency and power will be increasingly important as data sizes grow and as the applications of PRS become more sophisticated, e.g., when incorporated into high-dimensional or gene set-based analyses. PRSice-2 is written in C++, with an R script for plotting, and is freely available for download from http://PRSice.info.
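For readers unfamiliar with the quantity PRSice-2 automates: a PRS is, at its core, a sum of GWAS effect sizes weighted by an individual's effect-allele dosages. The sketch below illustrates only that definition with invented numbers; it is not PRSice-2's implementation, which additionally handles clumping, P-value thresholding, and empirical P-value estimation.

```python
import numpy as np

# Hypothetical per-allele effect sizes (betas) for 4 SNPs from GWAS summary statistics
betas = np.array([0.12, -0.08, 0.05, 0.21])

# Genotype dosages (0-2 copies of the effect allele) for 3 individuals x 4 SNPs
dosages = np.array([[0, 1, 2, 1],
                    [2, 0, 1, 0],
                    [1, 1, 0, 2]], dtype=float)

# The polygenic risk score is the dosage-weighted sum of effect sizes per individual
prs = dosages @ betas
print(prs)
```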
4. Chen C, Wu Y, Li J, Wang X, Zeng Z, Xu J, Liu Y, Feng J, Chen H, He Y, Xia R. TBtools-II: A "one for all, all for one" bioinformatics platform for biological big-data mining. Mol Plant 2023; 16:1733-1742. [PMID: 37740491; DOI: 10.1016/j.molp.2023.09.010]. Cited in RCA: 772.
Abstract
Since the official release of the stand-alone bioinformatics toolkit TBtools in 2020, its superior functionality in data analysis has been demonstrated by its widespread adoption by many thousands of users and references in more than 5000 academic articles. Now, TBtools is a commonly used tool in biological laboratories. Over the past 3 years, thanks to invaluable feedback and suggestions from numerous users, we have optimized and expanded the functionality of the toolkit, leading to the development of an upgraded version, TBtools-II. In this upgrade, we have incorporated over 100 new features, such as those for comparative genomics analysis, phylogenetic analysis, and data visualization. Meanwhile, to better meet the increasing needs of personalized data analysis, we have launched the plugin mode, which enables users to develop their own plugins and manage their selection, installation, and removal according to individual needs. To date, the plugin store has amassed over 50 plugins, with more than half of them being independently developed and contributed by TBtools users. These plugins offer a range of data analysis options including co-expression network analysis, single-cell data analysis, and bulked segregant analysis sequencing data analysis. Overall, TBtools is now transforming from stand-alone software to a comprehensive bioinformatics platform of a vibrant and cooperative community in which users are also developers and contributors. By promoting the theme "one for all, all for one", we believe that TBtools-II will greatly benefit more biological researchers in this big-data era.
5. Tran HTN, Ang KS, Chevrier M, Zhang X, Lee NYS, Goh M, Chen J. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol 2020; 21:12. [PMID: 31948481; PMCID: PMC6964114; DOI: 10.1186/s13059-019-1850-9]. Research article. Cited in RCA: 562.
Abstract
BACKGROUND Large-scale single-cell transcriptomic datasets generated using different technologies contain batch-specific systematic variations that present a challenge to batch-effect removal and data integration. With continued growth expected in scRNA-seq data, achieving effective batch integration with available computational resources is crucial. Here, we perform an in-depth benchmark study on available batch correction methods to determine the most suitable method for batch-effect removal. RESULTS We compare 14 methods in terms of computational runtime, the ability to handle large datasets, and batch-effect correction efficacy while preserving cell type purity. Five scenarios are designed for the study: identical cell types with different technologies, non-identical cell types, multiple batches, big data, and simulated data. Performance is evaluated using four benchmarking metrics including kBET, LISI, ASW, and ARI. We also investigate the use of batch-corrected data to study differential gene expression. CONCLUSION Based on our results, Harmony, LIGER, and Seurat 3 are the recommended methods for batch integration. Due to its significantly shorter runtime, Harmony is recommended as the first method to try, with the other methods as viable alternatives.
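As a practical note on the recommendation above, Harmony can be applied to a merged single-cell dataset through scanpy's external interface. The following is a minimal sketch, assuming a combined AnnData file with a per-cell batch column (the file name and column name are hypothetical, and the harmonypy package must be installed); it is not the benchmark's own pipeline.

```python
import scanpy as sc

# Hypothetical merged dataset with a batch label in adata.obs["batch"]
adata = sc.read_h5ad("combined_batches.h5ad")

# Standard preprocessing before batch integration
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)
sc.pp.scale(adata)
sc.tl.pca(adata, n_comps=30)

# Harmony adjusts the PCA embedding for batch (requires the harmonypy package)
sc.external.pp.harmony_integrate(adata, key="batch")

# Downstream analysis on the corrected embedding
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
sc.tl.umap(adata)
```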
6. Moon KR, van Dijk D, Wang Z, Gigante S, Burkhardt DB, Chen WS, Yim K, Elzen AVD, Hirn MJ, Coifman RR, Ivanova NB, Wolf G, Krishnaswamy S. Visualizing structure and transitions in high-dimensional biological data. Nat Biotechnol 2019; 37:1482-1492. [PMID: 31796933; PMCID: PMC7073148; DOI: 10.1038/s41587-019-0336-3]. Research Support, N.I.H., Extramural. Cited in RCA: 465.
Abstract
The high-dimensional data created by high-throughput technologies require visualization tools that reveal data structure and patterns in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure using an information-geometric distance between data points. We compare PHATE to other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data, including continual progressions, branches and clusters, better than other tools. We define a manifold preservation metric, which we call denoised embedding manifold preservation (DEMaP), and show that PHATE produces lower-dimensional embeddings that are quantitatively better denoised as compared to existing visualization methods. An analysis of a newly generated single-cell RNA sequencing dataset on human germ-layer differentiation demonstrates how PHATE reveals unique biological insight into the main developmental branches, including identification of three previously undescribed subpopulations. We also show that PHATE is applicable to a wide variety of data types, including mass cytometry, single-cell RNA sequencing, Hi-C and gut microbiome data.
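The open-source phate package exposes PHATE through a scikit-learn-style estimator. The snippet below is a minimal usage sketch on random placeholder data rather than any dataset analyzed in the paper; parameter values are illustrative only.

```python
import numpy as np
import phate  # pip install phate

# Placeholder data matrix: 1,000 observations (e.g., cells) x 500 features
rng = np.random.default_rng(42)
X = rng.random((1000, 500))

# Compute a 2-D PHATE embedding
phate_op = phate.PHATE(n_components=2, knn=5, random_state=42)
Y = phate_op.fit_transform(X)  # Y has shape (1000, 2), ready for scatter plotting
```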
7. [Citation details not included in this listing.] Review. Cited in RCA: 443.
Abstract
Big data has become the ubiquitous watch word of medical innovation. The rapid development of machine-learning techniques and artificial intelligence in particular has promised to revolutionize medical practice from the allocation of resources to the diagnosis of complex diseases. But with big data comes big risks and challenges, among them significant questions about patient privacy. Here, we outline the legal and ethical challenges big data brings to patient privacy. We discuss, among other topics, how best to conceive of health privacy; the importance of equity, consent, and patient governance in data collection; discrimination in data uses; and how to handle data breaches. We close by sketching possible ways forward for the regulatory system.
8. Gupta R, Srivastava D, Sahu M, Tiwari S, Ambasta RK, Kumar P. Artificial intelligence to deep learning: machine intelligence approach for drug discovery. Mol Divers 2021; 25:1315-1360. [PMID: 33844136; PMCID: PMC8040371; DOI: 10.1007/s11030-021-10217-3]. Review. Cited in RCA: 354.
Abstract
Drug designing and development is an important area of research for pharmaceutical companies and chemical scientists. However, low efficacy, off-target delivery, time consumption, and high cost impose a hurdle and challenges that impact drug design and discovery. Further, complex and big data from genomics, proteomics, microarray data, and clinical trials also impose an obstacle in the drug discovery pipeline. Artificial intelligence and machine learning technology play a crucial role in drug discovery and development. In other words, artificial neural networks and deep learning algorithms have modernized the area. Machine learning and deep learning algorithms have been implemented in several drug discovery processes such as peptide synthesis, structure-based virtual screening, ligand-based virtual screening, toxicity prediction, drug monitoring and release, pharmacophore modeling, quantitative structure-activity relationship, drug repositioning, polypharmacology, and physiochemical activity. Evidence from the past strengthens the implementation of artificial intelligence and deep learning in this field. Moreover, novel data mining, curation, and management techniques provided critical support to recently developed modeling algorithms. In summary, artificial intelligence and deep learning advancements provide an excellent opportunity for rational drug design and discovery process, which will eventually impact mankind. The primary concern associated with drug design and development is time consumption and production cost. Further, inefficiency, inaccurate target delivery, and inappropriate dosage are other hurdles that inhibit the process of drug delivery and development. With advancements in technology, computer-aided drug design integrating artificial intelligence algorithms can eliminate the challenges and hurdles of traditional drug design and development. Artificial intelligence is referred to as a superset comprising machine learning, whereas machine learning comprises supervised learning, unsupervised learning, and reinforcement learning. Further, deep learning, a subset of machine learning, has been extensively implemented in drug design and development. The artificial neural network, deep neural network, support vector machines, classification and regression, generative adversarial networks, symbolic learning, and meta-learning are examples of the algorithms applied to the drug design and discovery process. Artificial intelligence has been applied to different areas of drug design and development process, such as from peptide synthesis to molecule design, virtual screening to molecular docking, quantitative structure-activity relationship to drug repositioning, protein misfolding to protein-protein interactions, and molecular pathway identification to polypharmacology. Artificial intelligence principles have been applied to the classification of active and inactive compounds, monitoring drug release, pre-clinical and clinical development, primary and secondary drug screening, biomarker development, pharmaceutical manufacturing, bioactivity identification and physiochemical properties, prediction of toxicity, and identification of mode of action.
9. Deutsch EW, Bandeira N, Sharma V, Perez-Riverol Y, Carver JJ, Kundu DJ, García-Seisdedos D, Jarnuczak AF, Hewapathirana S, Pullman BS, Wertz J, Sun Z, Kawano S, Okuda S, Watanabe Y, Hermjakob H, MacLean B, MacCoss MJ, Zhu Y, Ishihama Y, Vizcaíno JA. The ProteomeXchange consortium in 2020: enabling 'big data' approaches in proteomics. Nucleic Acids Res 2020; 48:D1145-D1152. [PMID: 31686107; PMCID: PMC7145525; DOI: 10.1093/nar/gkz984]. Research Support, N.I.H., Extramural. Cited in RCA: 349.
Abstract
The ProteomeXchange (PX) consortium of proteomics resources (http://www.proteomexchange.org) has standardized data submission and dissemination of mass spectrometry proteomics data worldwide since 2012. In this paper, we describe the main developments since the previous update manuscript was published in Nucleic Acids Research in 2017. Since then, in addition to the four existing PX members at the time (PRIDE, PeptideAtlas including the PASSEL resource, MassIVE and jPOST), two new resources have joined PX: iProX (China) and Panorama Public (USA). We first describe the updated submission guidelines, now expanded to include six members. Next, with current data submission statistics, we demonstrate that the proteomics field is now actively embracing public open data policies. At the end of June 2019, more than 14,100 datasets had been submitted to PX resources since 2012, and from those, more than 9,500 in just the last three years. In parallel, an unprecedented increase of data re-use activities in the field, including 'big data' approaches, is enabling novel research and new data resources. Finally, we also outline some of our future plans for the coming years.
10. Chen C, Chen H, Zhang Y, Thomas HR, Frank MH, He Y, Xia R. TBtools: An Integrative Toolkit Developed for Interactive Analyses of Big Biological Data. Mol Plant 2020. [PMID: 32585190; DOI: 10.1101/289660]. Cited in RCA: 304.
Abstract
The rapid development of high-throughput sequencing techniques has led biology into the big-data era. Data analyses using various bioinformatics tools rely on programming and command-line environments, which are challenging and time-consuming for most wet-lab biologists. Here, we present TBtools (a Toolkit for Biologists integrating various biological data-handling tools), a stand-alone software with a user-friendly interface. The toolkit incorporates over 130 functions, which are designed to meet the increasing demand for big-data analyses, ranging from bulk sequence processing to interactive data visualization. A wide variety of graphs can be prepared in TBtools using a new plotting engine ("JIGplot") developed to maximize their interactive ability; this engine allows quick point-and-click modification of almost every graphic feature. TBtools is platform-independent software that can be run under all operating systems with Java Runtime Environment 1.6 or newer. It is freely available to non-commercial users at https://github.com/CJ-Chen/TBtools/releases.
11. Yang J, Li Y, Liu Q, Li L, Feng A, Wang T, Zheng S, Xu A, Lyu J. Brief introduction of medical database and data mining technology in big data era. J Evid Based Med 2020; 13:57-69. [PMID: 32086994; PMCID: PMC7065247; DOI: 10.1111/jebm.12373]. Review. Cited in RCA: 286.
Abstract
Data mining technology can search for potentially valuable knowledge in large amounts of data; the process is mainly divided into data preparation, data mining, and the expression and analysis of results. It is a mature information-processing technology that builds on database technology. Database technology is a software science that researches, manages, and applies databases: data in a database are processed and analyzed by studying the underlying theory and implementation methods of its structure, storage, design, management, and application. Here we introduce several databases and data mining techniques to help a wide range of clinical researchers better understand and apply database technology.
12. Schüssler-Fiorenza Rose SM, Contrepois K, Moneghetti KJ, Zhou W, Mishra T, Mataraso S, Dagan-Rosenfeld O, Ganz AB, Dunn J, Hornburg D, Rego S, Perelman D, Ahadi S, Sailani MR, Zhou Y, Leopold SR, Chen J, Ashland M, Christle JW, Avina M, Limcaoco P, Ruiz C, Tan M, Butte AJ, Weinstock GM, Slavich GM, Sodergren E, McLaughlin TL, Haddad F, Snyder MP. A longitudinal big data approach for precision health. Nat Med 2019; 25:792-804. [PMID: 31068711; PMCID: PMC6713274; DOI: 10.1038/s41591-019-0414-6]. Research Support, N.I.H., Extramural. Cited in RCA: 269.
Abstract
Precision health relies on the ability to assess disease risk at an individual level, detect early preclinical conditions and initiate preventive strategies. Recent technological advances in omics and wearable monitoring enable deep molecular and physiological profiling and may provide important tools for precision health. We explored the ability of deep longitudinal profiling to make health-related discoveries, identify clinically relevant molecular pathways and affect behavior in a prospective longitudinal cohort (n = 109) enriched for risk of type 2 diabetes mellitus. The cohort underwent integrative personalized omics profiling from samples collected quarterly for up to 8 years (median, 2.8 years) using clinical measures and emerging technologies including genome, immunome, transcriptome, proteome, metabolome, microbiome and wearable monitoring. We discovered more than 67 clinically actionable health discoveries and identified multiple molecular pathways associated with metabolic, cardiovascular and oncologic pathophysiology. We developed prediction models for insulin resistance by using omics measurements, illustrating their potential to replace burdensome tests. Finally, study participation led the majority of participants to implement diet and exercise changes. Altogether, we conclude that deep longitudinal profiling can lead to actionable health discoveries and provide relevant information for precision health.
13. Lombardo MV, Lai MC, Baron-Cohen S. Big data approaches to decomposing heterogeneity across the autism spectrum. Mol Psychiatry 2019; 24:1435-1450. [PMID: 30617272; PMCID: PMC6754748; DOI: 10.1038/s41380-018-0321-0]. Review. Cited in RCA: 257.
Abstract
Autism is a diagnostic label based on behavior. While the diagnostic criteria attempt to maximize clinical consensus, they also mask a wide degree of heterogeneity between and within individuals at multiple levels of analysis. Understanding this multi-level heterogeneity is of high clinical and translational importance. Here we present organizing principles to frame research examining multi-level heterogeneity in autism. Theoretical concepts such as 'spectrum' or 'autisms' reflect non-mutually exclusive explanations regarding continuous/dimensional or categorical/qualitative variation between and within individuals. However, common practices of small sample size studies and case-control models are suboptimal for tackling heterogeneity. Big data are an important ingredient for furthering our understanding of heterogeneity in autism. In addition to being 'feature-rich', big data should be both 'broad' (i.e., large sample size) and 'deep' (i.e., multiple levels of data collected on the same individuals). These characteristics increase the likelihood that the study results are more generalizable and facilitate evaluation of the utility of different models of heterogeneity. A model's utility can be measured by its ability to explain clinically or mechanistically important phenomena, and also by explaining how variability manifests across different levels of analysis. The directionality for explaining variability across levels can be bottom-up or top-down, and should include the importance of development for characterizing changes within individuals. While progress can be made with 'supervised' models built upon a priori or theoretically predicted distinctions or dimensions of importance, it will become increasingly important to complement such work with unsupervised data-driven discoveries that leverage unknown and multivariate distinctions within big data. A better understanding of how to model heterogeneity between autistic people will facilitate progress towards precision medicine for symptoms that cause suffering, and person-centered support.
14. Zhao J, Li H, Kung D, Fisher M, Shen Y, Liu R. Impact of the COVID-19 Epidemic on Stroke Care and Potential Solutions. Stroke 2020; 51:1996-2001. [PMID: 32432997; PMCID: PMC7258753; DOI: 10.1161/strokeaha.120.030225]. Research article. Cited in RCA: 244.
Abstract
BACKGROUND AND PURPOSE When the coronavirus disease 2019 (COVID-19) outbreak became paramount, medical care for other devastating diseases was negatively impacted. In this study, we investigated the impact of the COVID-19 outbreak on stroke care across China. METHODS Data from the Big Data Observatory Platform for Stroke of China consisting of 280 hospitals across China demonstrated a significant drop in the number of cases of thrombolysis and thrombectomy. We designed a survey to investigate the major changes during the COVID-19 outbreak and potential causes of these changes. The survey was distributed to the leaders of stroke centers in these 280 hospitals. RESULTS From the data of Big Data Observatory Platform for Stroke of China, the total number of thrombolysis and thrombectomy cases dropped 26.7% (P<0.0001) and 25.3% (P<0.0001), respectively, in February 2020 as compared with February 2019. We retrieved 227 valid complete datasets from the 280 stroke centers. Nearly 50% of these hospitals were designated hospitals for COVID-19. The capacity for stroke care was reduced in the majority of the hospitals. Most of the stroke centers stopped or reduced their efforts in stroke education for the public. Hospital admissions related to stroke dropped ≈40%; thrombolysis and thrombectomy cases dropped ≈25%, which is similar to the results from the Big Data Observatory Platform for Stroke of China as compared with the same period in 2019. Many factors contributed to the reduced admissions and prehospital delays; lack of stroke knowledge and proper transportation were significant limiting factors. Patients not coming to the hospital for fear of virus infection was also a likely key factor. CONCLUSIONS The COVID-19 outbreak impacted stroke care significantly in China, including prehospital and in-hospital care, resulting in a significant drop in admissions, thrombolysis, and thrombectomy. Although many factors contributed, patients not coming to the hospital was probably the major limiting factor. Recommendations based on the data are provided.
15. Cave A, Kurz X, Arlett P. Real-World Data for Regulatory Decision Making: Challenges and Possible Solutions for Europe. Clin Pharmacol Ther 2019; 106:36-39. [PMID: 30970161; PMCID: PMC6617710; DOI: 10.1002/cpt.1426]. Journal article. Cited in RCA: 192.
16. Krittanawong C, Johnson KW, Rosenson RS, Wang Z, Aydar M, Baber U, Min JK, Tang WHW, Halperin JL, Narayan SM. Deep learning for cardiovascular medicine: a practical primer. Eur Heart J 2019; 40:2058-2073. [PMID: 30815669; PMCID: PMC6600129; DOI: 10.1093/eurheartj/ehz056]. Research Support, N.I.H., Extramural. Cited in RCA: 179.
Abstract
Deep learning (DL) is a branch of machine learning (ML) showing increasing promise in medicine, to assist in data classification, novel disease phenotyping and complex decision making. Deep learning is a form of ML typically implemented via multi-layered neural networks. Deep learning has been accelerated by recent advances in computer hardware and algorithms and is increasingly applied in e-commerce, finance, and voice and image recognition to learn and classify complex datasets. The current medical literature shows both strengths and limitations of DL. Strengths of DL include its ability to automate medical image interpretation, enhance clinical decision-making, identify novel phenotypes, and select better treatment pathways in complex diseases. Deep learning may be well-suited to cardiovascular medicine in which haemodynamic and electrophysiological indices are increasingly captured on a continuous basis by wearable devices as well as image segmentation in cardiac imaging. However, DL also has significant weaknesses including difficulties in interpreting its models (the 'black-box' criticism), its need for extensive adjudicated ('labelled') data in training, lack of standardization in design, lack of data-efficiency in training, limited applicability to clinical trials, and other factors. Thus, the optimal clinical application of DL requires careful formulation of solvable problems, selection of most appropriate DL algorithms and data, and balanced interpretation of results. This review synthesizes the current state of DL for cardiovascular clinicians and investigators, and provides technical context to appreciate the promise, pitfalls, near-term challenges, and opportunities for this exciting new area.
17. Mirza B, Wang W, Wang J, Choi H, Chung NC, Ping P. Machine Learning and Integrative Analysis of Biomedical Big Data. Genes (Basel) 2019; 10:E87. [PMID: 30696086; PMCID: PMC6410075; DOI: 10.3390/genes10020087]. Research Support, N.I.H., Extramural. Cited in RCA: 176.
Abstract
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.
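To make two of the listed challenges concrete, missing data and class imbalance are routinely handled in a single scikit-learn pipeline via imputation and class weighting. The toy example below is purely illustrative (synthetic data, generic estimator) and is not drawn from the review itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced dataset standing in for multi-omics features (90% vs 10% classes)
X, y = make_classification(n_samples=500, n_features=50, weights=[0.9, 0.1], random_state=0)
X[::17, 3] = np.nan  # sprinkle in some missing values

clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # handle missing data
    ("scale", StandardScaler()),                    # put heterogeneous features on one scale
    ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),  # class imbalance
])
print(cross_val_score(clf, X, y, scoring="balanced_accuracy", cv=5).mean())
```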
18. Vidaurre D, Abeysuriya R, Becker R, Quinn AJ, Alfaro-Almagro F, Smith SM, Woolrich MW. Discovering dynamic brain networks from big data in rest and task. Neuroimage 2018; 180:646-656. [PMID: 28669905; PMCID: PMC6138951; DOI: 10.1016/j.neuroimage.2017.06.077]. Review article. Cited in RCA: 174.
Abstract
Brain activity is a dynamic combination of the responses to sensory inputs and its own spontaneous processing. Consequently, such brain activity is continuously changing whether or not one is focusing on an externally imposed task. Previously, we have introduced an analysis method that allows us, using Hidden Markov Models (HMM), to model task or rest brain activity as a dynamic sequence of distinct brain networks, overcoming many of the limitations posed by sliding window approaches. Here, we present an advance that enables the HMM to handle very large amounts of data, making possible the inference of very reproducible and interpretable dynamic brain networks in a range of different datasets, including task, rest, MEG and fMRI, with potentially thousands of subjects. We anticipate that the generation of large and publicly available datasets from initiatives such as the Human Connectome Project and UK Biobank, in combination with computational methods that can work at this scale, will bring a breakthrough in our understanding of brain function in both health and disease.
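Conceptually, the approach assigns each time point to one of K recurring states, each with its own mean activity pattern and covariance. The generic sketch below conveys that idea with hmmlearn on synthetic data; the authors' method relies on their own stochastic-inference HMM implementation designed for very large neuroimaging datasets, not on hmmlearn.

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM  # pip install hmmlearn

# Synthetic multi-channel time series: 5 subjects x 1,000 time points x 10 channels
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 10))
lengths = [1000] * 5  # boundaries between concatenated subjects

# Each HMM state is a candidate "network": a mean activity pattern plus a covariance
model = GaussianHMM(n_components=6, covariance_type="full", n_iter=50, random_state=0)
model.fit(X, lengths=lengths)
states = model.predict(X, lengths=lengths)  # state time course across all subjects
```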
19
|
Xue Y, Bao Y, Zhang Z, Zhao W, Xiao J, He S, Zhang G, Li Y, Zhao G, Chen R, Song S, Ma L, Zou D, Tian D, Li C, Zhu J, Gong Z, Chen M, Wang A, Ma Y, Li M, Teng X, Cui Y, Duan G, Zhang M, Jin T, Shi C, Du Z, Zhang Y, Liu C, Li R, Zeng J, Hao L, Jiang S, Chen H, Han D, Xiao J, Zhang Z, Zhao W, Xue Y, Bao Y, Zhang T, Kang W, Yang F, Qu J, Zhang W, Bao Y, Liu GH, Liu L, Zhang Y, Niu G, Zhu T, Feng C, Liu X, Zhang Y, Li Z, Chen R, Li Q, Teng X, Ma L, Hua Z, Tian D, Jiang C, Chen Z, He F, Zhao Y, Jin Y, Zhang Z, Huang L, Song S, Yuan Y, Zhou C, Xu Q, He S, Ye W, Cao R, Wang P, Ling Y, Yan X, Wang Q, Zhang G, Li Z, Liu L, Jiang S, Li Q, Feng C, Du Q, Ma L, Zong W, Kang H, Zhang M, Xiong Z, Li R, Huan W, Ling Y, Zhang S, Xia Q, Cao R, Fan X, Wang Z, Zhang G, Chen X, Chen T, Zhang S, Tang B, Zhu J, Dong L, Zhang Z, Wang Z, Kang H, Wang Y, Ma Y, Wu S, Kang H, Chen M, Li C, Tian D, Tang B, Liu X, Teng X, Song S, Tian D, Liu X, Li C, Teng X, Song S, Zhang Y, Zou D, Zhu T, Chen M, Niu G, Liu C, Xiong Y, Hao L, Niu G, Zou D, Zhu T, Shao X, Hao L, Li Y, Zhou H, Chen X, Zheng Y, Kang Q, Hao D, Zhang L, Luo H, Hao Y, Chen R, Zhang P, He S, Zou D, Zhang M, Xiong Z, Nie Z, Yu S, Li R, Li M, Li R, Bao Y, Xiong Z, Li M, Yang F, Ma Y, Sang J, Li Z, Li R, Tang B, Zhang X, Dong L, Zhou Q, Cui Y, Zhai S, Zhang Y, Wang G, Zhao W, Wang Z, Zhu Q, Li X, Zhu J, Tian D, Kang H, Li C, Zhang S, Song S, Li M, Zhao W, Yan J, Sang J, Zou D, Li C, Wang Z, Zhang Y, Zhu T, Song S, Wang X, Hao L, Liu Y, Wang Z, Luo H, Zhu J, Wu X, Tian D, Li C, Zhao W, Jing HC, Chen M, Zou D, Hao L, Zhao L, Wang J, Li Y, Song T, Zheng Y, Chen R, Zhao Y, He S, Zou D, Mehmood F, Ali S, Ali A, Saleem S, Hussain I, Abbasi AA, Ma L, Zou D, Zou D, Jiang S, Zhang Z, Jiang S, Zhao W, Xiao J, Bao Y, Zhang Z, Zuo Z, Ren J, Zhang X, Xiao Y, Li X, Zhang X, Xiao Y, Li X, Tu Y, Xue Y, Wu W, Ji P, Zhao F, Meng X, Chen M, Peng D, Xue Y, Luo H, Gao F, Zhang X, Xiao Y, Li X, Ning W, Xue Y, Lin S, Xue Y, Liu T, Guo AY, Yuan H, Zhang YE, Tan X, Xue Y, Zhang W, Xue Y, Xie Y, Ren J, Wang C, Xue Y, Liu CJ, Guo AY, Yang DC, Tian F, Gao G, Tang D, Xue Y, Yao L, Xue Y, Cui Q, An NA, Li CY, Luo X, Ren J, Zhang X, Xiao Y, Li X. Database Resources of the National Genomics Data Center, China National Center for Bioinformation in 2021. Nucleic Acids Res 2021; 49:D18-D28. [PMID: 33175170 PMCID: PMC7779035 DOI: 10.1093/nar/gkaa1022] [Citation(s) in RCA: 153] [Impact Index Per Article: 38.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2020] [Revised: 10/13/2020] [Accepted: 10/16/2020] [Indexed: 12/20/2022] Open
Abstract
The National Genomics Data Center (NGDC), part of the China National Center for Bioinformation (CNCB), provides a suite of database resources to support worldwide research activities in both academia and industry. With the explosive growth of multi-omics data, CNCB-NGDC is continually expanding, updating and enriching its core database resources through big data deposition, integration and translation. In the past year, considerable efforts have been devoted to 2019nCoVR, a newly established resource providing a global landscape of SARS-CoV-2 genomic sequences, variants, and haplotypes, as well as Aging Atlas, BrainBase, GTDB (Glycosyltransferases Database), LncExpDB, and TransCirc (Translation potential for circular RNAs). Meanwhile, a series of resources have been updated and improved, including BioProject, BioSample, GWH (Genome Warehouse), GVM (Genome Variation Map), GEN (Gene Expression Nebulas) as well as several biodiversity and plant resources. Particularly, BIG Search, a scalable, one-stop, cross-database search engine, has been significantly updated by providing easy access to a large number of internal and external biological resources from CNCB-NGDC, our partners, EBI and NCBI. All of these resources along with their services are publicly accessible at https://bigd.big.ac.cn.
20. Luechtefeld T, Marsh D, Rowlands C, Hartung T. Machine Learning of Toxicological Big Data Enables Read-Across Structure Activity Relationships (RASAR) Outperforming Animal Test Reproducibility. Toxicol Sci 2018; 165:198-212. [PMID: 30007363; PMCID: PMC6135638; DOI: 10.1093/toxsci/kfy152]. Research Support, N.I.H., Extramural. Cited in RCA: 152.
Abstract
Earlier we created a chemical hazard database via natural language processing of dossiers submitted to the European Chemical Agency with approximately 10,000 chemicals. We identified repeat OECD guideline tests to establish reproducibility of acute oral and dermal toxicity, eye and skin irritation, mutagenicity and skin sensitization. Based on 350-700+ chemicals each, the probability that an OECD guideline animal test would output the same result in a repeat test was 78%-96% (sensitivity 50%-87%). An expanded database with more than 866,000 chemical properties/hazards was used as training data and to model health hazards and chemical properties. The constructed models automate and extend the read-across method of chemical classification. The novel models called RASARs (read-across structure activity relationship) use binary fingerprints and Jaccard distance to define chemical similarity. A large chemical similarity adjacency matrix is constructed from this similarity metric and is used to derive feature vectors for supervised learning. We show results on 9 health hazards from 2 kinds of RASARs: "Simple" and "Data Fusion". The "Simple" RASAR seeks to duplicate the traditional read-across method, predicting hazard from chemical analogs with known hazard data. The "Data Fusion" RASAR extends this concept by creating large feature vectors from all available property data rather than only the modeled hazard. Simple RASAR models tested in cross-validation achieve 70%-80% balanced accuracies with constraints on tested compounds. Cross-validation of Data Fusion RASARs shows balanced accuracies in the 80%-95% range across 9 health hazards with no constraints on tested compounds.
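To illustrate the similarity measure underlying read-across as described in this abstract, the sketch below computes Jaccard (Tanimoto) similarity between binary fingerprints and makes a similarity-weighted hazard prediction for a query chemical. The fingerprints and labels are invented, and the published RASAR models build far richer feature vectors on top of this basic idea.

```python
import numpy as np

def jaccard_similarity(fp_a, fp_b):
    """Jaccard (Tanimoto) similarity between two binary fingerprint vectors."""
    a, b = np.asarray(fp_a, bool), np.asarray(fp_b, bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

# Hypothetical binary fingerprints and known hazard labels for three analogues
fingerprints = np.array([[1, 0, 1, 1, 0],
                         [1, 1, 1, 0, 0],
                         [0, 0, 1, 1, 1]])
hazard = np.array([1, 1, 0])  # 1 = hazardous for a given endpoint

# Simple read-across: similarity-weighted vote over the analogues for a query chemical
query = np.array([1, 0, 1, 0, 0])
sims = np.array([jaccard_similarity(query, fp) for fp in fingerprints])
predicted = (sims @ hazard) / sims.sum()
print(sims, predicted)
```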
21. Jing Y, Bian Y, Hu Z, Wang L, Xie XQ. Deep Learning for Drug Design: an Artificial Intelligence Paradigm for Drug Discovery in the Big Data Era. AAPS J 2018; 20:58. [PMID: 29603063; PMCID: PMC6608578; DOI: 10.1208/s12248-018-0210-0]. Research Support, N.I.H., Extramural. Cited in RCA: 145.
Abstract
Over the last decade, deep learning (DL) methods have been extremely successful and widely used to develop artificial intelligence (AI) in almost every domain, especially after DL achieved its celebrated success at computational Go. Compared with traditional machine learning (ML) algorithms, however, DL methods still have a long way to go to achieve recognition in small-molecule drug discovery and development, and much work remains to popularize and apply DL for research purposes, e.g., small-molecule drug research and development. In this review, we mainly discuss the most powerful and mainstream architectures, including the convolutional neural network (CNN), recurrent neural network (RNN), and deep auto-encoder networks (DAENs), for supervised and unsupervised learning; summarize representative applications in small-molecule drug design; and briefly introduce how DL methods were used in those applications. We also discuss the pros and cons of DL methods and the main challenges that remain to be tackled.
22. Yang YJ, Bang CS. Application of artificial intelligence in gastroenterology. World J Gastroenterol 2019; 25:1666-1683. [PMID: 31011253; PMCID: PMC6465941; DOI: 10.3748/wjg.v25.i14.1666]. Minireview. Cited in RCA: 145.
Abstract
Artificial intelligence (AI) using deep learning (DL) has emerged as a breakthrough computer technology. In the era of big data, the accumulation of an enormous number of digital images and medical records has driven the need for AI to deal with these data efficiently, as they have become fundamental resources for machines to learn from. Among DL models, the convolutional neural network has shown outstanding performance in image analysis. In gastroenterology, physicians handle large amounts of clinical data and a variety of imaging devices such as endoscopy and ultrasound. AI has been applied in gastroenterology for diagnosis, prognosis, and image analysis. However, potential inherent selection bias cannot be excluded in retrospective studies. Because overfitting and spectrum bias (class imbalance) can lead to overestimated accuracy, external validation on datasets not used for model development, collected in a way that minimizes spectrum bias, is mandatory. For robust verification, prospective studies with adequate inclusion/exclusion criteria that represent the target populations are needed. DL also lacks interpretability; because interpretability is important for providing safety measures, detecting bias, and creating social acceptance, further investigation is needed.
23. [Citation details not included in this listing.] Research Support, N.I.H., Extramural. Cited in RCA: 144.
Abstract
The digital world is generating data at a staggering and still increasing rate. While these "big data" have unlocked novel opportunities to understand public health, they hold still greater potential for research and practice. This review explores several key issues that have arisen around big data. First, we propose a taxonomy of sources of big data to clarify terminology and identify threads common across some subtypes of big data. Next, we consider common public health research and practice uses for big data, including surveillance, hypothesis-generating research, and causal inference, while exploring the role that machine learning may play in each use. We then consider the ethical implications of the big data revolution with particular emphasis on maintaining appropriate care for privacy in a world in which technology is rapidly changing social norms regarding the need for (and even the meaning of) privacy. Finally, we make suggestions regarding structuring teams and training to succeed in working with big data in research and practice.
24. Panayides AS, Amini A, Filipovic ND, Sharma A, Tsaftaris SA, Young A, Foran D, Do N, Golemati S, Kurc T, Huang K, Nikita KS, Veasey BP, Zervakis M, Saltz JH, Pattichis CS. AI in Medical Imaging Informatics: Current Challenges and Future Directions. IEEE J Biomed Health Inform 2020; 24:1837-1857. [PMID: 32609615; PMCID: PMC8580417; DOI: 10.1109/jbhi.2020.2991043]. Research Support, N.I.H., Extramural. Cited in RCA: 140.
Abstract
This paper reviews state-of-the-art research solutions across the spectrum of medical imaging informatics, discusses clinical translation, and provides future directions for advancing clinical practice. More specifically, it summarizes advances in medical imaging acquisition technologies for different modalities, highlighting the necessity for efficient medical data management strategies in the context of AI in big healthcare data analytics. It then provides a synopsis of contemporary and emerging algorithmic methods for disease classification and organ/tissue segmentation, focusing on AI and deep learning architectures that have already become the de facto approach. The clinical benefits of in-silico modelling advances linked with evolving 3D reconstruction and visualization applications are further documented. In conclusion, integrative analytics approaches driven by the associated research branches highlighted in this study promise to revolutionize imaging informatics as known today across the healthcare continuum for both radiology and digital pathology applications. The latter is projected to enable informed, more accurate diagnosis, timely prognosis, and effective treatment planning, underpinning precision medicine.
25. Di Leo G, Sardanelli F. Statistical significance: p value, 0.05 threshold, and applications to radiomics-reasons for a conservative approach. Eur Radiol Exp 2020; 4:18. [PMID: 32157489; PMCID: PMC7064671; DOI: 10.1186/s41747-020-0145-y]. Research article. Cited in RCA: 138.
Abstract
Here, we summarise the unresolved debate about p value and its dichotomisation. We present the statement of the American Statistical Association against the misuse of statistical significance as well as the proposals to abandon the use of p value and to reduce the significance threshold from 0.05 to 0.005. We highlight reasons for a conservative approach, as clinical research needs dichotomic answers to guide decision-making, in particular in the case of diagnostic imaging and interventional radiology. With a reduced p value threshold, the cost of research could increase while spontaneous research could be reduced. Secondary evidence from systematic reviews/meta-analyses, data sharing, and cost-effective analyses are better ways to mitigate the false discovery rate and lack of reproducibility associated with the use of the 0.05 threshold. Importantly, when reporting p values, authors should always provide the actual value, not only statements of "p < 0.05" or "p ≥ 0.05", because p values give a measure of the degree of data compatibility with the null hypothesis. Notably, radiomics and big data, fuelled by the application of artificial intelligence, involve hundreds/thousands of tested features similarly to other "omics" such as genomics, where a reduction in the significance threshold, based on well-known corrections for multiple testing, has been already adopted.
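As a worked illustration of the multiple-testing point raised above: if 1,000 radiomic features are each tested at 0.05, dozens of false positives are expected by chance alone; a Bonferroni correction instead tests each feature at 0.05/1,000 = 5e-5, while Benjamini-Hochberg controls the false discovery rate. The snippet below uses simulated p values and is not taken from the article.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
p_values = rng.uniform(size=1000)             # stand-in for 1,000 radiomic feature tests
p_values[:20] = rng.uniform(0, 1e-4, 20)      # pretend 20 features carry a real signal

# Bonferroni controls the family-wise error rate (each test at 0.05/1000)
reject_bonf, _, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")
# Benjamini-Hochberg controls the false discovery rate instead
reject_bh, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(reject_bonf.sum(), reject_bh.sum())
```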