301
|
Roy A, Bruce C, Schulte P, Olson L, Pola M. Failure prediction using personalized models and an application to heart failure prediction. BIG DATA ANALYTICS 2020. [DOI: 10.1186/s41044-020-00044-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022] Open
Abstract
Background
To reduce disruptions of processes and the cost of maintenance, predicting the onset of failure (or a similar event) of a physical system (or components of a physical system) has become important. Prediction of onset of failure would allow appropriate corrective actions at the right time. In this paper, we present a method to predict the “onset” of failure (the start of a degradation process or similar types of events) of a physical system that minimizes data collection and personalizes it for the physical system. The method applies to situations where one monitors the operating characteristics of the physical system at regular time intervals by means of attached sensors and other measurement instruments. It creates a model of the physical system, during normal operations, using the time-series data produced by the sensors and measurement instruments. However, it does not create or use any time-series models. It simply examines the distribution of time-series data across different time periods. It uses this model of normal operations in subsequent time periods to monitor the physical system for deviations from normality.
Results
We illustrate this method with an application to predict the “onset” of subsequent decompensated heart failures for patients already treated for a heart failure at a hospital. As part of an NIH study, these heart failure patients received two ECG patches, an accelerometer and a bio-impedance measurement device for regular monitoring for a period after their release from the hospital.
Conclusions
When dealing with non-homogenous, disparate physical systems, personalized models can be better predictors of a phenomenon compared to generalized models based on data collected from an assortment of such physical systems. In medicine such models can be a powerful addition to the set of medical diagnostic tools. And such personalized models can be built rather quickly without waiting for extensive data collection.
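The monitoring idea described in this abstract (build a picture of normal operation from the distribution of sensor readings, then watch later periods for deviations) can be sketched as follows. This is a minimal illustration, not the authors' method: the abstract does not say which distribution comparison is used, so the two-sample Kolmogorov-Smirnov statistic, the function names, the threshold of 0.5, and the heart-rate-like readings are all assumptions.

```python
from bisect import bisect_right


def ks_statistic(baseline, window):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(baseline), sorted(window)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def flag_deviations(baseline, windows, threshold=0.5):
    """Indices of later windows whose distribution departs from the baseline."""
    return [i for i, w in enumerate(windows)
            if ks_statistic(baseline, w) > threshold]


# Hypothetical readings for one monitored system (one person, one sensor).
baseline = [70, 72, 71, 69, 73, 70, 71, 72]   # "normal operations" period
windows = [
    [71, 70, 72, 69, 73, 71],                 # later period, similar
    [88, 90, 87, 91, 89, 92],                 # later period, shifted
]
flagged = flag_deviations(baseline, windows)
print(flagged)
```

Because the baseline comes from a single system, the comparison is personalized in the sense the abstract describes: no data pooled from other systems is needed before monitoring can start.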
|
302
|
Jebb D, Huang Z, Pippel M, Hughes GM, Lavrichenko K, Devanna P, Winkler S, Jermiin LS, Skirmuntt EC, Katzourakis A, Burkitt-Gray L, Ray DA, Sullivan KAM, Roscito JG, Kirilenko BM, Dávalos LM, Corthals AP, Power ML, Jones G, Ransome RD, Dechmann DKN, Locatelli AG, Puechmaille SJ, Fedrigo O, Jarvis ED, Hiller M, Vernes SC, Myers EW, Teeling EC. Six reference-quality genomes reveal evolution of bat adaptations. Nature 2020; 583:578-584. [PMID: 32699395 PMCID: PMC8075899 DOI: 10.1038/s41586-020-2486-3] [Citation(s) in RCA: 177] [Impact Index Per Article: 44.3] [Received: 10/14/2019] [Accepted: 06/09/2020] [Indexed: 11/08/2022]
Abstract
Bats possess extraordinary adaptations, including flight, echolocation, extreme longevity and unique immunity. High-quality genomes are crucial for understanding the molecular basis and evolution of these traits. Here we incorporated long-read sequencing and state-of-the-art scaffolding protocols¹ to generate, to our knowledge, the first reference-quality genomes of six bat species (Rhinolophus ferrumequinum, Rousettus aegyptiacus, Phyllostomus discolor, Myotis myotis, Pipistrellus kuhlii and Molossus molossus). We integrated gene projections from our 'Tool to infer Orthologs from Genome Alignments' (TOGA) software with de novo and homology gene predictions as well as short- and long-read transcriptomics to generate highly complete gene annotations. To resolve the phylogenetic position of bats within Laurasiatheria, we applied several phylogenetic methods to comprehensive sets of orthologous protein-coding and noncoding regions of the genome, and identified a basal origin for bats within Scrotifera. Our genome-wide screens revealed positive selection on hearing-related genes in the ancestral branch of bats, which is indicative of laryngeal echolocation being an ancestral trait in this clade. We found selection and loss of immunity-related genes (including pro-inflammatory NF-κB regulators) and expansions of anti-viral APOBEC3 genes, which highlights molecular mechanisms that may contribute to the exceptional immunity of bats. Genomic integrations of diverse viruses provide a genomic record of historical tolerance to viral infection in bats. Finally, we found and experimentally validated bat-specific variation in microRNAs, which may regulate bat-specific gene-expression programs. Our reference-quality bat genomes provide the resources required to uncover and validate the genomic basis of adaptations of bats, and stimulate new avenues of research that are directly relevant to human health and disease¹.
Affiliation(s)
- David Jebb
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
  - Center for Systems Biology Dresden, Dresden, Germany
- Zixia Huang
  - School of Biology and Environmental Science, University College Dublin, Dublin, Ireland
- Martin Pippel
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Center for Systems Biology Dresden, Dresden, Germany
- Graham M Hughes
  - School of Biology and Environmental Science, University College Dublin, Dublin, Ireland
- Ksenia Lavrichenko
  - Neurogenetics of Vocal Communication Group, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Paolo Devanna
  - Neurogenetics of Vocal Communication Group, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
- Sylke Winkler
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
- Lars S Jermiin
  - School of Biology and Environmental Science, University College Dublin, Dublin, Ireland
  - Research School of Biology, Australian National University, Canberra, Australian Capital Territory, Australia
  - Earth Institute, University College Dublin, Dublin, Ireland
- Emilia C Skirmuntt
  - Peter Medawar Building for Pathogen Research, Department of Zoology, University of Oxford, Oxford, UK
- Aris Katzourakis
  - Peter Medawar Building for Pathogen Research, Department of Zoology, University of Oxford, Oxford, UK
- Lucy Burkitt-Gray
  - Conway Institute of Biomolecular and Biomedical Science, University College Dublin, Dublin, Ireland
- David A Ray
  - Department of Biological Sciences, Texas Tech University, Lubbock, TX, USA
- Kevin A M Sullivan
  - Department of Biological Sciences, Texas Tech University, Lubbock, TX, USA
- Juliana G Roscito
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
  - Center for Systems Biology Dresden, Dresden, Germany
- Bogdan M Kirilenko
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
  - Center for Systems Biology Dresden, Dresden, Germany
- Liliana M Dávalos
  - Department of Ecology and Evolution, Stony Brook University, Stony Brook, NY, USA
  - Consortium for Inter-Disciplinary Environmental Research, Stony Brook University, Stony Brook, NY, USA
- Megan L Power
  - School of Biology and Environmental Science, University College Dublin, Dublin, Ireland
- Gareth Jones
  - School of Biological Sciences, University of Bristol, Bristol, UK
- Roger D Ransome
  - School of Biological Sciences, University of Bristol, Bristol, UK
- Dina K N Dechmann
  - Department of Migration, Max Planck Institute of Animal Behavior, Radolfzell, Germany
  - Department of Biology, University of Konstanz, Konstanz, Germany
  - Smithsonian Tropical Research Institute, Panama City, Panama
- Andrea G Locatelli
  - School of Biology and Environmental Science, University College Dublin, Dublin, Ireland
- Sébastien J Puechmaille
  - ISEM, University of Montpellier, Montpellier, France
  - Zoological Institute and Museum, University of Greifswald, Greifswald, Germany
- Olivier Fedrigo
  - Vertebrate Genomes Laboratory, The Rockefeller University, New York, NY, USA
- Erich D Jarvis
  - Vertebrate Genomes Laboratory, The Rockefeller University, New York, NY, USA
  - Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Michael Hiller
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Max Planck Institute for the Physics of Complex Systems, Dresden, Germany
  - Center for Systems Biology Dresden, Dresden, Germany
- Sonja C Vernes
  - Neurogenetics of Vocal Communication Group, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
  - Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands
- Eugene W Myers
  - Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany
  - Center for Systems Biology Dresden, Dresden, Germany
  - Faculty of Computer Science, Technical University Dresden, Dresden, Germany
- Emma C Teeling
  - School of Biology and Environmental Science, University College Dublin, Dublin, Ireland
|
303
|
Nicora G, Vitali F, Dagliati A, Geifman N, Bellazzi R. Integrated Multi-Omics Analyses in Oncology: A Review of Machine Learning Methods and Tools. Front Oncol 2020; 10:1030. [PMID: 32695678 PMCID: PMC7338582 DOI: 10.3389/fonc.2020.01030] [Citation(s) in RCA: 110] [Impact Index Per Article: 27.5] [Received: 01/30/2020] [Accepted: 05/26/2020] [Indexed: 12/16/2022] Open
Abstract
In recent years, high-throughput sequencing technologies have provided an unprecedented opportunity to depict cancer samples at multiple molecular levels. The integration and analysis of these multi-omics datasets is a critical step to gain actionable knowledge in a precision medicine framework. This paper explores recent data-driven methodologies that have been developed and applied to respond to major challenges of stratified medicine in oncology, including patients' phenotyping, biomarker discovery, and drug repurposing. We systematically retrieved peer-reviewed journal articles published from 2014 to 2019, then selected and thoroughly described the tools presenting the most promising innovations regarding the integration of heterogeneous data, the machine learning methodologies that successfully tackled the complexity of multi-omics data, and the frameworks to deliver actionable results for clinical practice. The review is organized according to the applied methods: Deep Learning, Network-based Methods, Clustering, Feature Extraction and Transformation, and Factorization. We provide an overview of the tools available in each methodological group and underline the relationship among the different categories. Our analysis revealed how multi-omics datasets could be exploited to drive precision oncology, but also current limitations in the development of multi-omics data integration.
Affiliation(s)
- Giovanna Nicora
  - Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
- Francesca Vitali
  - Center for Innovation in Brain Science, University of Arizona, Tucson, AZ, United States
  - Department of Neurology, College of Medicine, University of Arizona, Tucson, AZ, United States
  - Center for Biomedical Informatics and Biostatistics, University of Arizona, Tucson, AZ, United States
- Arianna Dagliati
  - Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
  - Centre for Health Informatics, The University of Manchester, Manchester, United Kingdom
  - The Manchester Molecular Pathology Innovation Centre, The University of Manchester, Manchester, United Kingdom
- Nophar Geifman
  - Centre for Health Informatics, The University of Manchester, Manchester, United Kingdom
  - The Manchester Molecular Pathology Innovation Centre, The University of Manchester, Manchester, United Kingdom
- Riccardo Bellazzi
  - Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
|
304
|
An analysis of technological frameworks for data streams. PROGRESS IN ARTIFICIAL INTELLIGENCE 2020. [DOI: 10.1007/s13748-020-00210-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
|
305
|
Machine-Learned Association of Next-Generation Sequencing-Derived Variants in Thermosensitive Ion Channels Genes with Human Thermal Pain Sensitivity Phenotypes. Int J Mol Sci 2020; 21:ijms21124367. [PMID: 32575443 PMCID: PMC7352872 DOI: 10.3390/ijms21124367] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/14/2020] [Revised: 06/16/2020] [Accepted: 06/17/2020] [Indexed: 12/20/2022] Open
Abstract
Genetic association studies have shown their usefulness in assessing the role of ion channels in human thermal pain perception. We used machine learning to construct a complex phenotype from pain thresholds to thermal stimuli and associate it with the genetic information derived from the next-generation sequencing (NGS) of 15 ion channel genes which are involved in thermal perception, including ASIC1, ASIC2, ASIC3, ASIC4, TRPA1, TRPC1, TRPM2, TRPM3, TRPM4, TRPM5, TRPM8, TRPV1, TRPV2, TRPV3, and TRPV4. Phenotypic information was complete in 82 subjects and NGS genotypes were available in 67 subjects. A network of artificial neurons, implemented as emergent self-organizing maps, discovered two clusters characterized by high or low pain thresholds for heat and cold pain. A total of 1071 variants were discovered in the 15 ion channel genes. After feature selection, 80 genetic variants were retained for an association analysis based on machine learning. The measured performance of machine learning-mediated phenotype assignment based on this genetic information resulted in an area under the receiver operating characteristic curve of 77.2%, justifying a phenotype classification based on the genetic information. A further item categorization finally resulted in 38 genetic variants that contributed most to the phenotype assignment. Most of them (10) belonged to the TRPV3 gene, followed by TRPM3 (6). Therefore, the analysis successfully identified the particular importance of TRPV3 and TRPM3 for an average pain phenotype defined by the sensitivity to moderate thermal stimuli.
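The abstract reports classifier performance as an area under the receiver operating characteristic curve (AUC). As a reminder of what that number measures, the sketch below computes AUC as the fraction of (positive, negative) pairs that the classifier ranks correctly, with ties counted as half. The labels and scores are made-up illustrative values, not data from the study, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count correctly ordered pairs; a tie contributes half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical phenotype labels (1 = one cluster, 0 = the other) and
# hypothetical classifier scores for six subjects.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported 77.2% should be read.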
|
306
|
Abstract
Thanks to the rapid development of big data and high-performance computing, more data is available and more tasks can now be solved by machine learning. Even so, it is still difficult to maximize the power of big data because each dataset is isolated from the others. Although open source datasets are available, algorithms' performance is asymmetric with the data volume. Hence, the AI community wishes to develop a symmetric continuous learning architecture that can automatically learn and adapt to different tasks. Such a learning architecture is commonly called lifelong machine learning (LML). This learning paradigm can manage the learning process and accumulate meta-knowledge by itself while learning different tasks. The meta-knowledge is shared among all tasks symmetrically to help them improve performance. As the meta-knowledge grows, the performance of each task is expected to keep improving. To demonstrate the application of lifelong machine learning, this paper proposes a novel and symmetric lifelong learning approach for sentiment classification as an example to show how it adapts to different domains while remaining efficient.
|
307
|
Zwitter A, Gstrein OJ. Big data, privacy and COVID-19 - learning from humanitarian expertise in data protection. JOURNAL OF INTERNATIONAL HUMANITARIAN ACTION 2020; 5:4. [PMID: 38624331 PMCID: PMC7232912 DOI: 10.1186/s41018-020-00072-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 05/31/2023]
Abstract
The COVID-19 pandemic has led governments around the world to resort to tracking technology and other data-driven tools in order to monitor and curb the spread of SARS-CoV-2. Such large-scale incursion into privacy and data protection is unthinkable during times of normalcy. However, in times of a pandemic the use of location data provided by telecom operators and/or technology companies becomes a viable option. Importantly, legal regulations hardly protect people's privacy against governmental and corporate misuse. Established privacy regimes are focused on individual consent, and most human rights treaties allow derogations from privacy and data protection norms for states of emergency. This leaves few safeguards or remedies to guarantee individual and collective autonomy. However, the challenge of responsible data use during a crisis is not novel. The humanitarian sector has more than a decade of experience to offer. International organisations and humanitarian actors have developed detailed guidelines on how to use data responsibly under extreme circumstances. This article briefly addresses the legal gap of data protection and privacy during this global crisis. Then it outlines the state of the art in humanitarian practice and academia on data protection and data responsibility during crisis.
Affiliation(s)
- Andrej Zwitter
  - Data Research Centre, Campus Fryslân, University of Groningen, Leeuwarden, the Netherlands
- Oskar J. Gstrein
  - Data Research Centre, Campus Fryslân, University of Groningen, Leeuwarden, the Netherlands
|
308
|
Shopovska I, Jovanov L, Philips W. Efficient Training Procedures for Multi-Spectra Demosaicing. SENSORS (BASEL, SWITZERLAND) 2020; 20:s20102850. [PMID: 32429529 PMCID: PMC7287920 DOI: 10.3390/s20102850] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2020] [Revised: 05/12/2020] [Accepted: 05/15/2020] [Indexed: 06/11/2023]
Abstract
The simultaneous acquisition of multi-spectral images on a single sensor can be efficiently performed by single shot capture using a multi-spectral filter array. This paper focuses on the demosaicing of color and near-infrared bands and relies on a convolutional neural network (CNN). To train the deep learning model robustly and accurately, it is necessary to provide enough training data, with sufficient variability. We focused on the design of an efficient training procedure by discovering an optimal training dataset. We propose two data selection strategies, motivated by slightly different concepts. The general term that will be used for the proposed models trained using data selection is data selection-based multi-spectral demosaicing (DSMD). The first idea is clustering-based data selection (DSMD-C), with the goal of discovering a representative subset with a high variance so as to train a robust model. The second is adaptive data selection (DSMD-A), a self-guided approach that selects new data based on the current model accuracy. We performed a controlled experimental evaluation of the proposed training strategies and the results show that a careful selection of data does benefit the speed and accuracy of training. We are still able to achieve high reconstruction accuracy with a lightweight model.
|
309
|
Miyoshi T, Higaki A, Kawakami H, Yamaguchi O. Automated interpretation of the coronary angioscopy with deep convolutional neural networks. Open Heart 2020; 7:e001177. [PMID: 32404485 PMCID: PMC7228653 DOI: 10.1136/openhrt-2019-001177] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/27/2019] [Revised: 02/28/2020] [Accepted: 04/16/2020] [Indexed: 12/29/2022] Open
Abstract
Background
Coronary angioscopy (CAS) is a useful modality to assess atherosclerotic changes, but interpretation of the images requires expert knowledge. Deep convolutional neural networks (DCNN) can be used for diagnostic prediction and image synthesis.
Methods
A total of 107 images from 47 patients, who underwent CAS in our hospital between 2014 and 2017, and 864 images, selected from 142 MEDLINE-indexed articles published between 2000 and 2019, were analysed. First, we developed a prediction model for the angioscopic findings. Next, we made a generative adversarial network (GAN) model to simulate the CAS images. Finally, we tried to control the output images according to the angioscopic findings with a conditional GAN architecture.
Results
For both yellow colour (YC) grade and neointimal coverage (NC) grade, we could observe strong correlations between the true grades and the predicted values (YC grade, average r=0.80±0.02, p<0.001; NC grade, average r=0.73±0.02, p<0.001). The binary classification model for the red thrombus yielded an F1-score of 0.71±0.03, and the area under the receiver operating characteristic curve was 0.91±0.02. The standard GAN model could generate realistic CAS images (average Inception score=3.57±0.06). GAN-based data augmentation improved the performance of the prediction models. In the conditional GAN model, there were significant correlations between given values and the expert's diagnosis in YC grade but not in NC grade.
Conclusion
DCNN is useful in both predictive and generative modelling that can help develop the diagnostic support system for CAS.
Affiliation(s)
- Toru Miyoshi
  - Department of Cardiology, Ehime Prefectural Imabari Hospital, Imabari, Japan
  - Department of Cardiology, Pulmonology, Hypertension and Nephrology, Ehime University Graduate School of Medicine, Toon, Japan
- Akinori Higaki
  - Department of Cardiology, Pulmonology, Hypertension and Nephrology, Ehime University Graduate School of Medicine, Toon, Japan
  - Hypertension and Vascular Research Unit, Lady Davis Institute for Medical Research, Montreal, Quebec, Canada
- Hideo Kawakami
  - Department of Cardiology, Ehime Prefectural Imabari Hospital, Imabari, Japan
- Osamu Yamaguchi
  - Department of Cardiology, Pulmonology, Hypertension and Nephrology, Ehime University Graduate School of Medicine, Toon, Japan
|
310
|
Laflamme B, Dillon MM, Martel A, Almeida RND, Desveaux D, Guttman DS. The pan-genome effector-triggered immunity landscape of a host-pathogen interaction. Science 2020; 367:763-768. [PMID: 32054757 DOI: 10.1126/science.aax4079] [Citation(s) in RCA: 116] [Impact Index Per Article: 29.0] [Received: 03/20/2019] [Revised: 10/18/2019] [Accepted: 01/17/2020] [Indexed: 12/24/2022]
Abstract
Effector-triggered immunity (ETI), induced by host immune receptors in response to microbial effectors, protects plants against virulent pathogens. However, a systematic study of ETI prevalence against species-wide pathogen diversity is lacking. We constructed the Pseudomonas syringae Type III Effector Compendium (PsyTEC) to reduce the pan-genome complexity of 5127 unique effector proteins, distributed among 70 families from 494 strains, to 529 representative alleles. We screened PsyTEC on the model plant Arabidopsis thaliana and identified 59 ETI-eliciting alleles (11.2%) from 19 families (27.1%), with orthologs distributed among 96.8% of P. syringae strains. We also identified two previously undescribed host immune receptors, including CAR1, which recognizes the conserved effectors AvrE and HopAA1, and found that 94.7% of strains harbor alleles predicted to be recognized by either CAR1 or ZAR1.
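The percentages quoted in this abstract follow directly from the counts it gives. The short check below reproduces them; the variable names and the `pct` helper are illustrative, and the "roughly tenfold" reduction is simply the ratio of the two counts stated in the abstract, not a figure the authors report.

```python
# Counts taken from the abstract.
unique_effectors = 5127        # unique effector proteins in the pan-genome
effector_families = 70         # effector families
representative_alleles = 529   # alleles in the PsyTEC compendium
eti_alleles = 59               # ETI-eliciting alleles
eti_families = 19              # families containing an ETI-eliciting allele


def pct(part, whole):
    """Percentage rounded to one decimal place, as in the abstract."""
    return round(100 * part / whole, 1)


allele_share = pct(eti_alleles, representative_alleles)
family_share = pct(eti_families, effector_families)
reduction = round(unique_effectors / representative_alleles, 1)

print(allele_share, family_share, reduction)
```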
Affiliation(s)
- Bradley Laflamme
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON M5S 3B2, Canada
- Marcus M Dillon
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON M5S 3B2, Canada
- Alexandre Martel
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON M5S 3B2, Canada
- Renan N D Almeida
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON M5S 3B2, Canada
- Darrell Desveaux
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON M5S 3B2, Canada
- David S Guttman
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON M5S 3B2, Canada
  - Center for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, ON M5S 3B2, Canada
|
311
|
Nwadiugwu MC. Gene-Based Clustering Algorithms: Comparison Between Denclue, Fuzzy-C, and BIRCH. Bioinform Biol Insights 2020; 14:1177932220909851. [PMID: 32284672 PMCID: PMC7133071 DOI: 10.1177/1177932220909851] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 01/29/2020] [Accepted: 02/02/2020] [Indexed: 11/17/2022] Open
Abstract
The current study seeks to compare 3 clustering algorithms that can be used in gene-based bioinformatics research to understand disease networks, protein-protein interaction networks, and gene expression data. Denclue, Fuzzy-C, and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) were the 3 gene-based clustering algorithms selected. These algorithms were explored in relation to the subfield of bioinformatics that analyzes omics data, which include but are not limited to genomics, proteomics, metagenomics, transcriptomics, and metabolomics data. The objective was to compare the efficacy of the 3 algorithms and determine their strengths and drawbacks. The review showed that, unlike Denclue and Fuzzy-C, which are more efficient in handling noisy data, BIRCH can handle data sets with outliers and has a better time complexity.
Affiliation(s)
- Martin C Nwadiugwu
  - Department of Biomedical Informatics, University of Nebraska Omaha, Omaha, NE, USA
|
312
|
Cortés-Ibáñez JA, González S, Valle-Alonso JJ, Luengo J, García S, Herrera F. Preprocessing methodology for time series: An industrial world application case study. Inf Sci (N Y) 2020. [DOI: 10.1016/j.ins.2019.11.027] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/26/2022]
|
313
|
Regarding Smart Cities in China, the North and Emerging Economies—One Size Does Not Fit All. SMART CITIES 2020. [DOI: 10.3390/smartcities3020011] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/16/2022]
Abstract
This article explores the significance of the “Smart city” concept by reviewing its key components, namely: Internet of Things (IoT), big (urban) data, and urban informatics/analytics, which are discussed against the background of two ongoing trends impacting everyone in the world: the Fourth Paradigm (the digital revolution) and rapid urbanization. China is seen as a great success story in the sense of how urbanization has driven a significant improvement in the economic wellbeing and prosperity of many of its citizens. Chinese expansion has come at a cost, and the question remains concerning the sustainability of the Chinese model. Along with this, the article suggests some of the shortcomings of the components of the Smart city concept and reflects on the human resource skills that will be required to implement Smart cities in the north. This is contrasted with the piecemeal way in which elements of the Smart city are being implemented in emerging economies, a process that very much seems to reflect fundamental technical and capacity issues that may hinder any blanket application of the Smart city in the emerging economies for a long time.
|
314
|
RETRACTED: A Smart Social Insurance Big Data Analytics Framework Based on Machine Learning Algorithms. CYBERNETICS AND INFORMATION TECHNOLOGIES 2020. [DOI: 10.2478/cait-2020-0007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/20/2022]
Abstract
Social insurance is an individual’s protection against risks such as retirement, death or disability. Big data mining and analytics could help insurers and actuaries reach the optimal decision for the insured individuals. Accordingly, this paper proposes a novel analytic framework for Egyptian social insurance big data. NOSI’s data need some pre-processing after extraction, such as replacing missing values, standardization, and handling outlier/extreme values. The paper also applies mining methods, namely clustering and classification algorithms, to the Egyptian social insurance dataset in an experiment. In clustering, we used K-means clustering and the result showed a silhouette score of 0.138 with two clusters in the dataset features. In classification, we used the Support Vector Machine (SVM) classifier and classification results showed a high accuracy of 94%.
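For context on the reported silhouette score of 0.138: the silhouette coefficient ranges from -1 to 1, and values near 0 indicate overlapping clusters. The sketch below implements the standard definition in plain Python on made-up 2-D points; it is not the retracted paper's code or data, and the function name and sample points are assumptions.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (requires >= 2 clusters)."""
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if j != i and labels[j] == labels[i]]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(dist(p, q) for q, l in zip(points, labels) if l == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])   # labels match the groups
bad = silhouette_score(points, [0, 1, 0, 1])    # labels cut across the groups
print(round(good, 3), round(bad, 3))
```

On this toy data the matching labels score close to 1 and the mismatched labels score below 0, which gives a sense of how weak a separation a value of 0.138 represents.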
|
315
|
Attacker Behaviour Forecasting Using Methods of Intelligent Data Analysis: A Comparative Review and Prospects. INFORMATION 2020. [DOI: 10.3390/info11030168] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 11/16/2022] Open
Abstract
Early detection of security incidents and correct forecasting of attack development are the basis for an efficient and timely response to cyber threats. The development of an attack depends on the future steps available to the attackers, their goals, and their motivation, that is, on the attacker “profile” that defines the malefactor's behaviour in the system. Usually, the “attacker profile” is a set of the attacker’s attributes, both inner, such as motives and skills, and external, such as existing financial support and tools used. Defining the attacker’s profile makes it possible to determine the type of malefactor and the complexity of the countermeasures, and may significantly simplify the attacker attribution process when investigating security incidents. The goal of the paper is to analyze existing techniques of the attacker’s behaviour, the attacker’s profile specifications, and their application to forecasting the future steps of an attack. The analysis outlines the main advantages and limitations of the approaches to attack forecasting and attacker profile construction, existing challenges, and prospects in the area. An approach to implementing attack forecasting is suggested that specifies further research steps and is the basis for the development of an attacker behaviour forecasting technique.
|
316
|
Adaptive Indoor Area Localization for Perpetual Crowdsourced Data Collection. SENSORS 2020; 20:s20051443. [PMID: 32155807 PMCID: PMC7085741 DOI: 10.3390/s20051443] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/15/2020] [Revised: 03/02/2020] [Accepted: 03/03/2020] [Indexed: 11/21/2022]
Abstract
The accuracy of fingerprinting-based indoor localization correlates with the quality and up-to-dateness of collected training data. Perpetual crowdsourced data collection reduces manual labeling effort and provides a fresh data base. However, the decentralized collection comes at the cost of heterogeneous data that causes performance degradation. In settings with imperfect data, area localization can provide higher positioning guarantees than exact position estimation. Existing area localization solutions employ a static segmentation into areas that is independent of the available training data. This approach is not applicable to crowdsourced data collection, which features an unbalanced spatial training data distribution that evolves over time. A segmentation is required that utilizes the existing training data distribution and adapts once new data is accumulated. We propose an algorithm for data-aware floor plan segmentation and a selection metric that balances expressiveness (information gain) and performance (correctly classified examples) of area classifiers. We utilize supervised machine learning, in particular, deep learning, to train the area classifiers. We demonstrate how to regularly provide an area localization model that adapts its prediction space to the accumulating training data. The resulting models are shown to provide higher reliability compared to models that pinpoint the exact position.
|
317
|
Soni M, Dahiya R. Soft eSkin: distributed touch sensing with harmonized energy and computing. PHILOSOPHICAL TRANSACTIONS. SERIES A, MATHEMATICAL, PHYSICAL, AND ENGINEERING SCIENCES 2020; 378:20190156. [PMID: 31865882 PMCID: PMC6939237 DOI: 10.1098/rsta.2019.0156] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/03/2023]
Abstract
Inspired by biology, significant advances have been made in the field of electronic skin (eSkin) or tactile skin. Many of these advances have come through mimicking the morphology of human skin and by distributing a few touch sensors in an area. However, the complexity of human skin goes beyond mimicking a few morphological features or using a few sensors. For example, embedded computing (e.g. processing of tactile data at the point of contact) is central to the human skin, as some neuroscience studies show. Likewise, distributed cell or molecular energy is a key feature of human skin. The eSkin with such features, along with distributed and embedded sensors/electronics on soft substrates, is an interesting topic to explore. These features also make eSkin significantly different from conventional computing. For example, unlike conventional centralized computing enabled by miniaturized chips, the eSkin could be seen as a flexible and wearable large area computer with distributed sensors and harmonized energy. This paper discusses these advanced features in eSkin, particularly the distributed sensing harmoniously integrated with energy harvesters, storage devices and distributed computing to read and locally process the tactile sensory data. Rapid advances in neuromorphic hardware, flexible energy generation, energy-conscious electronics, flexible and printed electronics are also discussed. This article is part of the theme issue 'Harmonizing energy-autonomous computing and intelligence'.
|
318
|
Fuguang Y. Research on campus network cloud storage open platform based on cloud computing and big data technology. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2020. [DOI: 10.3233/jifs-179483] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023]
Affiliation(s)
- Yao Fuguang
- Information Center, Chongqing University of Education, Chongqing, China
|
319
|
Schulz F, Roux S, Paez-Espino D, Jungbluth S, Walsh DA, Denef VJ, McMahon KD, Konstantinidis KT, Eloe-Fadrosh EA, Kyrpides NC, Woyke T. Giant virus diversity and host interactions through global metagenomics. Nature 2020; 578:432-436. [PMID: 31968354 PMCID: PMC7162819 DOI: 10.1038/s41586-020-1957-x] [Citation(s) in RCA: 156] [Impact Index Per Article: 39.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2019] [Accepted: 01/09/2020] [Indexed: 12/11/2022]
Abstract
Our current knowledge about nucleocytoplasmic large DNA viruses (NCLDVs) is largely derived from viral isolates that are co-cultivated with protists and algae. Here we reconstructed 2,074 NCLDV genomes from sampling sites across the globe by building on the rapidly increasing amount of publicly available metagenome data. This led to an 11-fold increase in phylogenetic diversity and a parallel 10-fold expansion in functional diversity. Analysis of 58,023 major capsid proteins from large and giant viruses using metagenomic data revealed the global distribution patterns and cosmopolitan nature of these viruses. The discovered viral genomes encoded a wide range of proteins with putative roles in photosynthesis and diverse substrate transport processes, indicating that host reprogramming is probably a common strategy in the NCLDVs. Furthermore, inferences of horizontal gene transfer connected viral lineages to diverse eukaryotic hosts. We anticipate that the global diversity of NCLDVs that we describe here will establish giant viruses, which are associated with most major eukaryotic lineages, as important players in ecosystems across Earth's biomes.
Affiliation(s)
- Frederik Schulz
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
| | - Simon Roux
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - David Paez-Espino
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Sean Jungbluth
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - David A Walsh
- Groupe de recherche interuniversitaire en limnologie, Department of Biology, Concordia University, Montréal, Québec, Canada
| | - Vincent J Denef
- Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA
| | - Katherine D McMahon
- Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
- Department of Civil and Environmental Engineering, University of Wisconsin-Madison, Madison, WI, USA
| | | | - Emiley A Eloe-Fadrosh
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Nikos C Kyrpides
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Tanja Woyke
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
|
320
|
Maqbool Z, Aggarwal P, Pammi VSC, Dutt V. Cyber Security: Effects of Penalizing Defenders in Cyber-Security Games via Experimentation and Computational Modeling. Front Psychol 2020; 11:11. [PMID: 32063872 PMCID: PMC6999552 DOI: 10.3389/fpsyg.2020.00011] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2019] [Accepted: 01/06/2020] [Indexed: 11/13/2022] Open
Abstract
Cyber-attacks are deliberate attempts by adversaries to illegally access online information of other individuals or organizations. There are likely to be severe monetary consequences for organizations and their workers who face cyber-attacks. However, currently, little is known about how the monetary consequences of cyber-attacks may influence the decision-making of defenders and adversaries. In this research, using a cyber-security game, we evaluate the influence of monetary penalties on decisions made by people performing in the roles of human defenders and adversaries via experimentation and computational modeling. In a laboratory experiment, participants were randomly assigned to the role of "hackers" (adversaries) or "analysts" (defenders) across three between-subject conditions: Equal payoffs (EQP), penalizing defenders for false alarms (PDF) and penalizing defenders for misses (PDM). The PDF and PDM conditions were 10 times costlier for defender participants compared to the EQP condition, which served as a baseline. Results revealed that attack actions increased and defend actions decreased in the PDF condition, whereas attack actions decreased and defend actions increased in the PDM condition. Also, both attack-and-defend decisions deviated from Nash equilibria. To understand the reasons for our results, we calibrated a model based on Instance-Based Learning Theory (IBLT) to the attack-and-defend decisions collected in the experiment. The model's parameters revealed an excessive reliance on recency, frequency, and variability mechanisms by both defenders and adversaries. We discuss the implications of our results for different cyber-attack situations where defenders are penalized for their misses and false alarms.
Affiliation(s)
- Zahid Maqbool
- Applied Cognitive Science Laboratory, Indian Institute of Technology Mandi, Kamand, India
| | - Palvi Aggarwal
- Dynamic Decision Making Laboratory, Carnegie Mellon University, Pittsburgh, PA, United States
| | | | - Varun Dutt
- Applied Cognitive Science Laboratory, Indian Institute of Technology Mandi, Kamand, India
|
321
|
Najar IN, Sherpa MT, Das S, Thakur N. Bacterial diversity and functional metagenomics expounding the diversity of xenobiotics, stress, defense and CRISPR gene ontology providing eco-efficiency to Himalayan Hot Springs. Funct Integr Genomics 2020; 20:479-496. [PMID: 31897823 DOI: 10.1007/s10142-019-00723-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2018] [Revised: 10/17/2019] [Accepted: 11/05/2019] [Indexed: 11/26/2022]
Abstract
Sikkim is one of the bio-diverse states of India, which harbors diverse alkaline and sulfur-rich hot springs in its vicinity. However, there is a dearth of data on microbial and functional diversity, as only a few hot springs have been studied in this area. Thus, in this regard, microbial and functional diversity of two hot springs by NGS, PLFA, and culture-independent approaches were carried out. PLFA and culture-dependent analysis was complementary as the Gram-positive bacteria were abundant in both the hot springs with the dominance of phylum Firmicutes with Geobacillus. Metagenomic analysis revealed the abundance of Proteobacteria, Actinobacteria, and Firmicutes in both hot springs. Functional metagenomics suggested that both Yumthang and Reshi hot springs possess a diverse set of genes analogous to stress such as genes allied to osmotic, heat shock, and acid stresses; defense analogies such as multidrug resistance efflux pump, multidrug transport system, and β-lactamase; and CRISPR analogues such as related to Cas1, Cas2, Cas3, cmr1-5 proteins, CT1972, and CT1133 gene families. The xenobiotic analogues were found against benzoate, nitrotoluene, xylene, DDT, and chlorocyclohexane/chlorobenzene degradation. Thus, these defensive mechanisms against environmental and anthropogenic hiccups and hindrances provide the eco-efficiency to such thermal habitats. The higher enzymatic, degradation, defense, stress potential and the lower percentage identity (< 95%) of isolates encourage the further exploration and exploitation of these habitats for industrial and biotechnological purposes.
Affiliation(s)
- Ishfaq Nabi Najar
- Department of Microbiology, School of Life Sciences, Sikkim University, 6th Mile, Samdur, Tadong, Gangtok, Sikkim, 737102, India
| | - Mingma Thundu Sherpa
- Department of Microbiology, School of Life Sciences, Sikkim University, 6th Mile, Samdur, Tadong, Gangtok, Sikkim, 737102, India
| | - Sayak Das
- Department of Microbiology, School of Life Sciences, Sikkim University, 6th Mile, Samdur, Tadong, Gangtok, Sikkim, 737102, India
| | - Nagendra Thakur
- Department of Microbiology, School of Life Sciences, Sikkim University, 6th Mile, Samdur, Tadong, Gangtok, Sikkim, 737102, India.
- Department of Chemical Engineering and Biomolecular Engineering, Korean Advance Institute of Science and Technology, Daejeon, South Korea.
|
322
|
Zhang J, Chen S, Sun G. Spectral and chromatographic overall analysis: An insight into chemical equivalence assessment of traditional Chinese medicine. J Chromatogr A 2020; 1610:460556. [DOI: 10.1016/j.chroma.2019.460556] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2019] [Revised: 09/17/2019] [Accepted: 09/18/2019] [Indexed: 01/19/2023]
|
323
|
Hattingh M, Matthee M, Smuts H, Pappas I, Dwivedi YK, Mäntymäki M. Requirements of Data Visualisation Tools to Analyse Big Data: A Structured Literature Review. LECTURE NOTES IN COMPUTER SCIENCE 2020. [PMCID: PMC7134219 DOI: 10.1007/978-3-030-44999-5_39] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/02/2022]
Abstract
The continual growth of big data necessitates efficient ways of analysing these large datasets. Data visualisation and visual analytics have been identified as key tools in big data analysis because they draw on human visual and cognitive capabilities to analyse data quickly, intuitively and interactively. However, current visualisation tools and visual analytical systems fall short of providing a seamless user experience, and several improvements could be made to current commercially available visualisation tools. By conducting a systematic literature review, requirements of visualisation tools were identified and categorised into six groups: dimensionality reduction, data reduction, scalability and readability, interactivity, fast retrieval of results, and user assistance. The most common themes found in the literature were dimensionality reduction and interactive data exploration.
|
324
|
Towards a real-time processing framework based on improved distributed recurrent neural network variants with fastText for social big data analytics. Inf Process Manag 2020. [DOI: 10.1016/j.ipm.2019.102122] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
325
|
User-Generated Short Video Content in Social Media. A Case Study of TikTok. LECTURE NOTES IN COMPUTER SCIENCE 2020. [DOI: 10.1007/978-3-030-49576-3_8] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
|
326
|
Hattingh M, Matthee M, Smuts H, Pappas I, Dwivedi YK, Mäntymäki M. A Conceptual Model of the Challenges of Social Media Big Data for Citizen e-Participation: A Systematic Review. LECTURE NOTES IN COMPUTER SCIENCE 2020. [PMCID: PMC7134231 DOI: 10.1007/978-3-030-45002-1_21] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
The emergence of Citizen Relationship Management (CzRM) for government plays a central role in developing citizen relationships and e-participation. As such, the South African government has shown its commitment towards citizenry and the provision of effective service delivery. Social Media Analytics (SMA) has emerged as a potential new solution to support decision-making for service delivery in CzRM. It is believed that the demand for SMA adoption will increasingly rise. However, the reality of social media Big Data comes with the challenges of analysing it in a way that brings Big Value. The purpose of this paper is to identify the challenges of social media Big Data Analytics (BDA) and to incorporate these in a conceptual model that can be used by governments to support the e-participation of citizens. The model was developed through a systematic literature review (SLR). The findings revealed that data challenges relate to designing an optimal architecture for analysing data that caters for both historic data and real-time data at the same time. The paper highlights that process challenges relate to all the activities in the data lifecycle such as data acquisition and warehousing; data mining and cleaning; data aggregation and integration; analysis and modelling; and data interpretation. The paper also identifies six types of data management challenges: privacy, security, data governance, data and information sharing, cost/operational expenditures, and data ownership.
|
327
|
Tan Y, Shi Y, Tuba M. Swarm Intelligence in Data Science: Applications, Opportunities and Challenges. LECTURE NOTES IN COMPUTER SCIENCE 2020. [PMCID: PMC7354777 DOI: 10.1007/978-3-030-53956-6_1] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Swarm Intelligence (SI) algorithms have proved to be a comprehensive method for solving complex optimization problems by simulating the emergent behaviors of biological swarms. Nowadays, data science, which needs quick management and analysis of massive data, is getting more and more attention. Most traditional methods can only be applied to continuous and differentiable functions. Recent research works have shown that SI algorithms, as a set of population-based approaches, have great potential for relevant tasks in this field. In order to gain better insight into the utilization of these methods in data science and to provide a reference for future research, this paper focuses on the relationship between data science and swarm intelligence. After introducing the mainstream swarm intelligence algorithms and their common characteristics, the paper reviews both the theoretical and real-world applications in the literature that apply swarm intelligence to the related domains of data analytics. Based on the summary of the existing works, this paper also analyzes the opportunities and challenges in this field, in an attempt to shed some light on designing more effective algorithms to solve the problems in data science for real-world applications.
Affiliation(s)
- Ying Tan
- Peking University, Beijing, China
| | - Yuhui Shi
- Southern University of Science and Technology, Shenzhen, China
|
328
|
Singh A, Garg S, Kaur R, Batra S, Kumar N, Zomaya AY. Probabilistic data structures for big data analytics: A comprehensive review. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2019.104987] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
329
|
Uddin S, Khan A, Hossain ME, Moni MA. Comparing different supervised machine learning algorithms for disease prediction. BMC Med Inform Decis Mak 2019; 19:281. [PMID: 31864346 PMCID: PMC6925840 DOI: 10.1186/s12911-019-1004-8] [Citation(s) in RCA: 424] [Impact Index Per Article: 84.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/28/2019] [Accepted: 12/11/2019] [Indexed: 12/17/2022] Open
Abstract
Background Supervised machine learning algorithms have been a dominant method in the data mining field. Disease prediction using health data has recently shown a potential application area for these methods. This study aims to identify the key trends among different types of supervised machine learning algorithms, and their performance and usage for disease risk prediction. Methods In this study, extensive research efforts were made to identify those studies that applied more than one supervised machine learning algorithm on single disease prediction. Two databases (i.e., Scopus and PubMed) were searched for different types of search items. Thus, we selected 48 articles in total for the comparison among variants of supervised machine learning algorithms for disease prediction. Results We found that the Support Vector Machine (SVM) algorithm is applied most frequently (in 29 studies) followed by the Naïve Bayes algorithm (in 23 studies). However, the Random Forest (RF) algorithm showed superior accuracy comparatively. Of the 17 studies where it was applied, RF showed the highest accuracy in 9 of them, i.e., 53%. This was followed by SVM, which topped in 41% of the studies in which it was considered. Conclusion This study provides a wide overview of the relative performance of different variants of supervised machine learning algorithms for disease prediction. This important information on relative performance can be used to aid researchers in the selection of an appropriate supervised machine learning algorithm for their studies.
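The shares reported in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the counts stated above (17 RF studies, 9 top results, 29 SVM studies, 41%). The variable names are illustrative, and the SVM count it derives is an estimate implied by rounding, not a figure reported by the authors.

```python
# Counts taken from the abstract; variable names are illustrative.
rf_studies = 17       # studies in which Random Forest was applied
rf_top = 9            # studies in which RF had the highest accuracy
svm_studies = 29      # studies in which SVM was applied
svm_top_share = 0.41  # reported share of SVM studies in which SVM was best

# Share of RF studies in which RF was the most accurate algorithm.
rf_top_percent = round(100 * rf_top / rf_studies)

# Assumption: the 41% figure is a rounded share of the 29 SVM studies,
# so the implied count is the nearest whole number of studies.
svm_top_estimate = round(svm_top_share * svm_studies)

print(f"RF top share: {rf_top_percent}%")
print(f"Estimated SVM top count: {svm_top_estimate} of {svm_studies}")
```

The RF figure reproduces the 53% stated in the abstract; the SVM count is only what the rounded percentage implies.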
Affiliation(s)
- Shahadat Uddin
- Complex Systems Research Group, Faculty of Engineering, The University of Sydney, Room 524, SIT Building (J12), Darlington, NSW, 2008, Australia.
| | - Arif Khan
- Complex Systems Research Group, Faculty of Engineering, The University of Sydney, Room 524, SIT Building (J12), Darlington, NSW, 2008, Australia.,Health Market Quality Research Stream, Capital Markets CRC, Level 3, 55 Harrington Street, Sydney, NSW, Australia
| | - Md Ekramul Hossain
- Complex Systems Research Group, Faculty of Engineering, The University of Sydney, Room 524, SIT Building (J12), Darlington, NSW, 2008, Australia
| | - Mohammad Ali Moni
- Faculty of Medicine and Health, School of Medical Sciences, The University of Sydney, Camperdown, NSW, 2006, Australia
|
330
|
Lötsch J, Ultsch A. Current Projection Methods-Induced Biases at Subgroup Detection for Machine-Learning Based Data-Analysis of Biomedical Data. Int J Mol Sci 2019; 21:ijms21010079. [PMID: 31861946 PMCID: PMC6982269 DOI: 10.3390/ijms21010079] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2019] [Revised: 12/09/2019] [Accepted: 12/16/2019] [Indexed: 11/16/2022] Open
Abstract
Advances in flow cytometry enable the acquisition of large and high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, finally, the identification of relevant subgroups. Correct data visualizations and projections from the high-dimensional space to the visualization plane require the correct representation of the structures in the data. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is the t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure for homogeneously distributed data that did not contain any subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points belonging to different subgroups as if belonging to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach allowed the correct identification of homogeneous data, while in sets containing distance- or density-based subgroup structures, the number of subgroups and data point assignments were correctly displayed. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high dimensional cytometric data and suggest a robust alternative.
Affiliation(s)
- Jörn Lötsch
- Institute of Clinical Pharmacology, Goethe-University, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany
- Fraunhofer Institute for Molecular Biology and Applied Ecology IME, Project Group Translational Medicine and Pharmacology TMP, Theodor-Stern-Kai 7, 60590 Frankfurt am Main, Germany
- Correspondence: ; Tel.: +49-69-6301-4589; Fax: +49-69-6301-4354
| | - Alfred Ultsch
- DataBionics Research Group, University of Marburg, Hans-Meerwein-Straße, 35032 Marburg, Germany;
|
331
|
Gupta A, Kaur M. Text Summarisation Using Laplacian Centrality-Based Minimum Vertex Cover. JOURNAL OF INFORMATION & KNOWLEDGE MANAGEMENT 2019. [DOI: 10.1142/s0219649219500503] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Outdegree Centrality (OC) is a graph-based centrality measure that captures the local connectedness of a node in a graph. The measure has been used in the literature to highlight key sentences in a graph-based optimisation method for summarisation. It is observed in the resultant summaries that OC tends to be biased towards selecting introductory sentences of the document, producing only generic summaries. Different graph centrality measures lead to different interpretations of a summary. Therefore, the authors propose to use another suitable centrality measure in order to generate a more specific summary rather than a generic one. Such a summary is expected to be highly informative, covering all the subtopics of the source document. This requirement has led the authors to use the Laplacian Centrality (LC) measure to find the significance of the nodes. The essence of this measure lies in highlighting central nodes from subgraphs which contribute non-uniformly towards the common goal of the graph. The modified method has shown significant improvement in informativeness and coherence of summaries and outperformed state-of-the-art results.
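For readers unfamiliar with Laplacian centrality, a minimal sketch follows. It assumes the common definition for an unweighted, undirected graph, in which a node's centrality is the drop in the graph's Laplacian energy (sum of squared degrees plus twice the number of edges) when that node is removed. The toy sentence graph and all function names are hypothetical, and the sketch does not reproduce the authors' minimum vertex cover summarisation method.

```python
def laplacian_energy(adj):
    """Laplacian energy of an unweighted, undirected graph.

    adj maps each node to the set of its neighbours.
    """
    degree_term = sum(len(nbrs) ** 2 for nbrs in adj.values())
    edge_count = sum(len(nbrs) for nbrs in adj.values()) // 2
    return degree_term + 2 * edge_count


def remove_node(adj, node):
    """Return a copy of the graph without `node` and its incident edges."""
    return {u: nbrs - {node} for u, nbrs in adj.items() if u != node}


def laplacian_centrality(adj):
    """Energy drop caused by deleting each node from the graph."""
    total = laplacian_energy(adj)
    return {v: total - laplacian_energy(remove_node(adj, v)) for v in adj}


# Hypothetical sentence graph: nodes are sentences, edges mark similarity.
graph = {
    "s1": {"s2", "s3"},
    "s2": {"s1", "s3"},
    "s3": {"s1", "s2", "s4"},
    "s4": {"s3"},
}
scores = laplacian_centrality(graph)
most_central = max(scores, key=scores.get)
print(scores, most_central)
```

Because the score of a node depends on the degrees of its neighbours as well as its own degree, it reflects more of the surrounding structure than a purely local degree count.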
Affiliation(s)
- Anand Gupta
- Department of Computer and Engineering, Netaji Subhas University of Technology, New Delhi, India
| | - Manpreet Kaur
- Department of Computer and Engineering, Netaji Subhas University of Technology, New Delhi, India
|
332
|
Khan SA, Chang HT. Comparative analysis on Facebook post interaction using DNN, ELM and LSTM. PLoS One 2019; 14:e0224452. [PMID: 31714918 PMCID: PMC6850539 DOI: 10.1371/journal.pone.0224452] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2019] [Accepted: 10/14/2019] [Indexed: 11/18/2022] Open
Abstract
This study presents a novel research approach to predict user interaction for social media posts using machine learning algorithms. The posts are converted to vector form using word2vec and doc2vec models. These two methods are used to analyse the best approach for generating word embeddings. The generated word embeddings of a post, combined with other attributes such as post published time, type of post and total interactions, are used to train machine learning algorithms. Deep neural network (DNN), Extreme Learning Machine (ELM) and Long Short-Term Memory (LSTM) are used to compare the prediction of total interaction for a particular post. For word2vec, the word vectors are created using both continuous bag-of-words (CBOW) and skip-gram models. The pre-trained word vectors provided by Google are also used for the analysis. For doc2vec, the word embeddings are created using both the Distributed Memory model of Paragraph Vectors (PV-DM) and the Distributed Bag of Words model of Paragraph Vectors (PV-DBOW). A word embedding is also created using PV-DBOW combined with skip-gram.
Affiliation(s)
- Sabih Ahmad Khan
- Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan 33302, Taiwan
| | - Hsien-Tsung Chang
- Department of Computer Science and Information Engineering, Chang Gung University, Taoyuan 33302, Taiwan
- Department of Physical Medicine and Rehabilitation, Chang Gung Memorial Hospital, Taoyuan 33302, Taiwan
- * E-mail:
|
333
|
Dirmeier S, Emmenlauer M, Dehio C, Beerenwinkel N. PyBDA: a command line tool for automated analysis of big biological data sets. BMC Bioinformatics 2019; 20:564. [PMID: 31718539 PMCID: PMC6849186 DOI: 10.1186/s12859-019-3087-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2019] [Accepted: 09/09/2019] [Indexed: 12/02/2022] Open
Abstract
BACKGROUND Analysing large and high-dimensional biological data sets poses significant computational difficulties for bioinformaticians due to the lack of accessible tools that scale to hundreds of millions of data points. RESULTS We developed a novel machine learning command line tool called PyBDA for automated, distributed analysis of big biological data sets. By using Apache Spark in the backend, PyBDA scales to data sets beyond the size of current applications. It uses Snakemake in order to automatically schedule jobs to a high-performance computing cluster. We demonstrate the utility of the software by analyzing image-based RNA interference data of 150 million single cells. CONCLUSION PyBDA allows automated, easy-to-use data analysis using common statistical methods and machine learning algorithms. It can be used entirely with simple command line calls, making it accessible to a broad user base. PyBDA is available at https://pybda.rtfd.io.
Affiliation(s)
- Simon Dirmeier
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Mario Emmenlauer
- Biozentrum, University of Basel, Basel, Switzerland
- BioDataAnalysis GmbH, Munich, 81669 Germany
| | | | - Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
|
334
|
Natsidis P, Tsakogiannis A, Pavlidis P, Tsigenopoulos CS, Manousaki T. Phylogenomics investigation of sparids (Teleostei: Spariformes) using high-quality proteomes highlights the importance of taxon sampling. Commun Biol 2019; 2:400. [PMID: 31701028 PMCID: PMC6825128 DOI: 10.1038/s42003-019-0654-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2019] [Accepted: 10/08/2019] [Indexed: 12/29/2022] Open
Abstract
Sparidae (Teleostei: Spariformes) are a family of fish constituted by approximately 150 species with high popularity and commercial value, such as porgies and seabreams. Although the phylogeny of this family has been investigated multiple times, its position among other teleost groups remains ambiguous. Most studies have used a single or few genes to decipher the phylogenetic relationships of sparids. Here, we conducted a thorough phylogenomic analysis using five recently available Sparidae gene-sets and 26 high-quality, genome-predicted teleost proteomes. Our analysis suggested that Tetraodontiformes (puffer fish, sunfish) are the closest relatives to sparids than all other groups used. By analytically comparing this result to our own previous contradicting finding, we show that this discordance is not due to different orthology assignment algorithms; on the contrary, we prove that it is caused by the increased taxon sampling of the present study, outlining the great importance of this aspect in phylogenomic analyses in general.
Affiliation(s)
- Paschalis Natsidis
  - Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Greece
  - School of Medicine, University of Crete, Heraklion, Greece
- Alexandros Tsakogiannis
  - Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Greece
- Pavlos Pavlidis
  - Institute of Computer Science, Foundation for Research and Technology, Heraklion, Greece
- Costas S. Tsigenopoulos
  - Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Greece
- Tereza Manousaki
  - Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Heraklion, Greece
|
335
|
|
336
|
An Innovative Fingerprint Location Algorithm for Indoor Positioning Based on Array Pseudolite. SENSORS 2019; 19:s19204420. [PMID: 31614855 PMCID: PMC6832918 DOI: 10.3390/s19204420] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 08/16/2019] [Revised: 10/09/2019] [Accepted: 10/10/2019] [Indexed: 11/21/2022]
Abstract
Since the signals of the global navigation satellite system (GNSS) are blocked by buildings, accurate positioning cannot be achieved in an indoor environment. A pseudolite can simulate signals similar to outdoor satellite signals and can be used as a stable and reliable positioning signal source in indoor environments. Therefore, it has been proposed as a good substitute and has become a research hotspot in the field of indoor positioning. There are still some problems in the pseudolite positioning field, such as integer ambiguity of the carrier phase, initial position determination, and low signal coverage. To avoid the limitation of these factors, an indoor positioning system based on fingerprint database matching of homologous array pseudolite is proposed in this paper, which can achieve higher positioning accuracy. The realization of this positioning system mainly includes the offline phase and the online phase. In the offline phase, the carrier phase data in the indoor environment is first collected, and a fingerprint database is established. Then a variational auto-encoding (VAE) network with location information is used to learn the probability distribution characteristics of the carrier phase difference of pseudolite in the latent space to realize feature clustering. Finally, a deep neural network is constructed from the learned hidden features to further learn the mapping relationship between different carrier phases of pseudolite and different indoor locations. In the online phase, the trained model and real-time carrier phases of pseudolite are used to predict the location of the positioning terminal. In this paper, the performance of the pseudolite positioning system is evaluated through a large number of experiments under dynamic and static conditions. The effectiveness of the algorithm is evaluated through comparison experiments. The experimental results show that the average positioning accuracy of the positioning system in a real indoor scene is 0.39 m and the 95% positioning error is less than 0.85 m, which outperforms the traditional fingerprint positioning algorithms.
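The two figures quoted in this abstract (average error and 95% error) are common summary metrics for positioning systems. The following is a minimal sketch of how such metrics are often computed, assuming 2D coordinates in metres, Euclidean error, and a nearest-rank percentile. The abstract does not state the authors' exact definitions, and the function names and sample coordinates below are hypothetical.

```python
import math


def positioning_errors(predicted, truth):
    """Euclidean distance between each predicted and true position."""
    return [math.dist(p, t) for p, t in zip(predicted, truth)]


def mean_error(errors):
    """Average positioning error."""
    return sum(errors) / len(errors)


def percentile_error(errors, q=0.95):
    """Nearest-rank percentile: the smallest observed error e such that
    at least a fraction q of the samples have error <= e."""
    ranked = sorted(errors)
    rank = max(1, math.ceil(q * len(ranked)))
    return ranked[rank - 1]


# Hypothetical test points (metres), for illustration only.
predicted = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
truth = [(0.0, 0.0), (0.0, 0.0), (1.0, 2.0)]

errors = positioning_errors(predicted, truth)
print(mean_error(errors))              # average error
print(percentile_error(errors, 0.95))  # 95% error
```

Other percentile conventions (for example, linear interpolation between ranks) give slightly different values on small samples, so the convention should be stated alongside any reported figure.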
|
337
|
García‐Gil D, Luque‐Sánchez F, Luengo J, García S, Herrera F. From Big to Smart Data: Iterative ensemble filter for noise filtering in Big Data classification. INT J INTELL SYST 2019. [DOI: 10.1002/int.22193] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Indexed: 12/19/2022]
Affiliation(s)
- Diego García‐Gil
  - Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Francisco Luque‐Sánchez
  - Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Julián Luengo
  - Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Salvador García
  - Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
- Francisco Herrera
  - Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
  - Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
|
338
|
|
339
|
Abstract
In the last several decades, avian influenza virus has caused numerous outbreaks around the world. These outbreaks pose a significant threat to the poultry industry and also to public health. When an avian influenza (AI) outbreak occurs, it is critical to make informed decisions about the potential risks, impact, and control measures. To this end, many modeling approaches have been proposed to acquire knowledge from different sources of data and perspectives to enhance decision making. Although some of these approaches have been shown to be effective, they do not follow the process of knowledge discovery in databases (KDD). KDD is an iterative process, consisting of five steps, that aims at extracting unknown and useful information from the data. The present review surveys AI modeling methods in the context of the KDD process. We first divide the modeling techniques used in AI into two main categories: data-intensive modeling and small-data modeling. We then investigate the existing gaps in the literature and suggest several potential directions and techniques for future studies. Overall, this review provides insights into the control of AI in terms of the risk of introduction and spread of the virus.
|
340
|
Abstract
The intelligent use of deep learning (DL) techniques can assist in overcoming noise and uncertainty during fingerprinting-based localization. With the rise in the available computational power on mobile devices, it is now possible to employ DL techniques, such as convolutional neural networks (CNNs), for smartphones. In this paper, we introduce a CNN model based on received signal strength indicator (RSSI) fingerprint datasets and compare it with different CNN application models, such as AlexNet, ResNet, ZFNet, Inception v3, and MobileNet v2, for indoor localization. The experimental results show that the proposed CNN model can achieve a test accuracy of 94.45% and an average location error as low as 1.44 m. Therefore, our CNN model outperforms conventional CNN applications for RSSI-based indoor positioning.
|
341
|
Review of Big Data and Processing Frameworks for Disaster Response Applications. ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION 2019. [DOI: 10.3390/ijgi8090387] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 11/16/2022]
Abstract
Natural hazards result in devastating losses in human life, environmental assets, and personal, regional, and national economies. The availability of different big data sources such as satellite imagery, Global Positioning System (GPS) traces, mobile Call Detail Records (CDRs), social media posts, etc., in conjunction with advances in data analytic techniques (e.g., data mining and big data processing, machine learning and deep learning) can facilitate the extraction of geospatial information that is critical for rapid and effective disaster response. However, the development of disaster response systems usually requires the integration of data from different sources (streaming data sources and data sources at rest) with different characteristics and types, which consequently have different processing needs. Deciding which processing framework to use for a specific big data source to perform a given task is usually a challenge for researchers from the disaster management field. Therefore, this paper contributes in four aspects. Firstly, potential big data sources are described and characterized. Secondly, the big data processing frameworks are characterized and grouped based on the sources of data they handle. Then, a short description of each big data processing framework is provided and a comparison of processing frameworks in each group is carried out considering the main aspects such as computing cluster architecture, data flow, data processing model, fault-tolerance, scalability, latency, back-pressure mechanism, programming languages, and support for machine learning libraries, which are related to specific processing needs. Finally, a link between big data and processing frameworks is established, based on the processing provisioning for essential tasks in the response phase of disaster management.
|
342
|
Shi X, Wong YD, Li MZF, Palanisamy C, Chai C. A feature learning approach based on XGBoost for driving assessment and risk prediction. ACCIDENT; ANALYSIS AND PREVENTION 2019; 129:170-179. [PMID: 31154284 DOI: 10.1016/j.aap.2019.05.005] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Received: 11/30/2018] [Revised: 03/16/2019] [Accepted: 05/05/2019] [Indexed: 06/09/2023]
Abstract
This study designs a framework of feature extraction and selection to assess vehicle driving and predict risk levels. The framework integrates learning-based feature selection, unsupervised risk rating, and imbalanced data resampling. For each vehicle, about 1300 driving behaviour features are extracted from trajectory data, which produce in-depth and multi-view measures of behaviours. To estimate the risk potentials of vehicles in driving, unsupervised data labelling is proposed. Based on extracted risk indicator features, vehicles are clustered into various groups labelled with graded risk levels. Data under-sampling of the safe group is performed to reduce the risk-safe class imbalance. Afterwards, the linkages between behaviour features and corresponding risk levels are built using XGBoost, and key features are identified according to feature importance ranking and recursive elimination. The risk levels of vehicles in driving are predicted based on the key features selected. As a case study, NGSIM trajectory data are used, in which four risk levels are clustered by Fuzzy C-means, 64 key behaviour features are identified, and an overall accuracy of 89% is achieved for behaviour-based risk prediction. Findings show that this approach is effective and reliable for identifying important features for driving assessment, and achieves an accurate prediction of risk levels.
Affiliation(s)
- Xiupeng Shi
  - School of Civil & Environmental Engineering, Nanyang Technological University, 639798, Singapore.
- Yiik Diew Wong
  - School of Civil & Environmental Engineering, Nanyang Technological University, 639798, Singapore.
- Michael Zhi-Feng Li
  - Nanyang Business School, Nanyang Technological University, 639798, Singapore.
- Chen Chai
  - College of Transportation Engineering, Tongji University, 201804, China.
|
343
|
de Lima Nichio BT, de Oliveira AMR, de Pierri CR, Santos LGC, Lejambre AQ, Vialle RA, da Rocha Coimbra NA, Guizelini D, Marchaukoski JN, de Oliveira Pedrosa F, Raittz RT. RAFTS 3G: an efficient and versatile clustering software to analyses in large protein datasets. BMC Bioinformatics 2019; 20:392. [PMID: 31307371 PMCID: PMC6631606 DOI: 10.1186/s12859-019-2973-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 12/16/2018] [Accepted: 06/28/2019] [Indexed: 12/19/2022] Open
Abstract
Background
Clustering methods are essential for partitioning biological samples and are useful for reducing the information complexity of large datasets. Tools in this context usually generate data with greedy algorithms, which solve some data mining difficulties but can degrade biologically relevant information during the clustering process. The lack of standardized metrics and consistent bases also raises questions about the clustering efficiency of some methods. Benchmarks are needed to explore the full potential of clustering methods, among which alignment-free methods stand out, and a good choice of dataset is essential for them.
Results
Here we present a new approach to data mining in large protein sequence datasets, the Rapid Alignment Free Tool for Sequences Similarity Search to Groups (RAFTS3G), a clustering method that aims to lose less biological information while generating groups. The strategy developed in our algorithm is optimized to be more stringent, which is reflected in increased accuracy and sensitivity when generating clusters across a wide range of similarity. RAFTS3G is the better choice compared with three main methods when the user wants a more reliable result, even when the ideal clustering threshold is unknown.
Conclusion
In general, RAFTS3G is able to group up to millions of biological sequences in large datasets, which makes it a remarkably efficient option for clustering. Compared with other "gold-standard" methods for clustering large biological data, RAFTS3G maintains the balance between reducing redundancy in biological information and creating consistent groups. We apply the binary search concept to grouped sequences, which maintains the sensitivity/accuracy relation and reduces the time needed to generate data with the RAFTS3G process.
Affiliation(s)
- Bruno Thiago de Lima Nichio
  - Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
  - Department of Biochemistry, Biological Sciences Sector – Federal University of Paraná (UFPR), Curitiba, PR Brazil
- Aryel Marlus Repula de Oliveira
  - Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
- Camilla Reginatto de Pierri
  - Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
  - Department of Biochemistry, Biological Sciences Sector – Federal University of Paraná (UFPR), Curitiba, PR Brazil
- Leticia Graziela Costa Santos
  - Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
- Alexandre Quadros Lejambre
  - Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
- Ricardo Assunção Vialle
  - Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
- Nilson Antônio da Rocha Coimbra
  - Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
- Dieval Guizelini
  - Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
- Jeroniza Nunes Marchaukoski
  - Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
- Fabio de Oliveira Pedrosa
  - Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
  - Department of Biochemistry, Biological Sciences Sector – Federal University of Paraná (UFPR), Curitiba, PR Brazil
- Roberto Tadeu Raittz
  - Laboratory of Bioinformatics, Professional and Technical Education Sector from the Federal University of Paraná, Curitiba, PR Brazil
|
344
|
Xu L, Dong Z, Fang L, Luo Y, Wei Z, Guo H, Zhang G, Gu YQ, Coleman-Derr D, Xia Q, Wang Y. OrthoVenn2: a web server for whole-genome comparison and annotation of orthologous clusters across multiple species. Nucleic Acids Res 2019; 47:W52-W58. [PMID: 31053848 PMCID: PMC6602458 DOI: 10.1093/nar/gkz333] [Citation(s) in RCA: 577] [Impact Index Per Article: 115.4] [Received: 02/14/2019] [Revised: 04/16/2019] [Accepted: 04/25/2019] [Indexed: 12/28/2022] Open
Abstract
OrthoVenn is a powerful web platform for the comparison and analysis of whole-genome orthologous clusters. Here we present an updated version, OrthoVenn2, which provides new features that facilitate the comparative analysis of orthologous clusters among up to 12 species. Additionally, this update offers improvements to data visualization and interpretation, including an occurrence pattern table for interrogating the overlap of each orthologous group for the queried species. Within the occurrence table, the functional annotations and summaries of the disjunctions and intersections of clusters between the chosen species can be displayed through an interactive Venn diagram. To facilitate a broader range of comparisons, a larger number of species, including vertebrates, metazoa, protists, fungi, plants and bacteria, have been added in OrthoVenn2. Finally, a stand-alone version is available to perform large dataset comparisons and to visualize results locally without limitation of species number. In summary, OrthoVenn2 is an efficient and user-friendly web server freely accessible at https://orthovenn2.bioinfotoolkits.net.
Affiliation(s)
- Ling Xu
  - Biological Science Research Center, Southwest University, Chongqing 400715, China
- Zhaobin Dong
  - Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA 94710, USA
  - USDA-ARS, Plant Gene Expression Center, Albany, CA 94706, USA
- Lu Fang
  - Biological Science Research Center, Southwest University, Chongqing 400715, China
- Yongjiang Luo
  - Biological Science Research Center, Southwest University, Chongqing 400715, China
- Zhaoyuan Wei
  - Biological Science Research Center, Southwest University, Chongqing 400715, China
- Hailong Guo
  - Biological Science Research Center, Southwest University, Chongqing 400715, China
- Guoqing Zhang
  - Biological Science Research Center, Southwest University, Chongqing 400715, China
- Yong Q Gu
  - USDA-ARS, Western Regional Research Center, Crop Improvement and Genetics Research Unit, Albany, CA 94706, USA
- Devin Coleman-Derr
  - Department of Plant and Microbial Biology, University of California Berkeley, Berkeley, CA 94710, USA
  - USDA-ARS, Plant Gene Expression Center, Albany, CA 94706, USA
- Qingyou Xia
  - Biological Science Research Center, Southwest University, Chongqing 400715, China
- Yi Wang
  - Biological Science Research Center, Southwest University, Chongqing 400715, China
|
345
|
Alcalde-Barros A, García-Gil D, García S, Herrera F. DPASF: a flink library for streaming data preprocessing. BIG DATA ANALYTICS 2019. [DOI: 10.1186/s41044-019-0041-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 11/10/2022] Open
|
346
|
Community detection in large-scale social networks: state-of-the-art and future directions. SOCIAL NETWORK ANALYSIS AND MINING 2019. [DOI: 10.1007/s13278-019-0566-x] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Indexed: 11/27/2022]
|
347
|
Kumari A, Tanwar S, Tyagi S, Kumar N. Verification and validation techniques for streaming big data analytics in internet of things environment. IET NETWORKS 2019. [DOI: 10.1049/iet-net.2018.5187] [Citation(s) in RCA: 45] [Impact Index Per Article: 9.0] [Indexed: 11/19/2022]
Affiliation(s)
- Aparna Kumari
  - Department of Computer Science and Engineering, Institute of Technology, Nirma University, Ahmedabad, Gujarat, India
- Sudeep Tanwar
  - Department of Computer Science and Engineering, Institute of Technology, Nirma University, Ahmedabad, Gujarat, India
- Sudhanshu Tyagi
  - Department of Electronics and Communication Engineering, Thapar Institute of Engineering and Technology, Deemed University, Patiala, Pb, India
- Neeraj Kumar
  - Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Deemed University, Patiala, Pb, India
|
348
|
Swingler K. Learning and Searching Pseudo-Boolean Surrogate Functions from Small Samples. EVOLUTIONARY COMPUTATION 2019; 28:317-338. [PMID: 31038355 DOI: 10.1162/evco_a_00257] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 06/09/2023]
Abstract
When searching for input configurations that optimise the output of a system, it can be useful to build a statistical model of the system being optimised. This is done in approaches such as surrogate model-based optimisation, estimation of distribution algorithms, and linkage learning algorithms. This article presents a method for modelling pseudo-Boolean fitness functions using Walsh bases and an algorithm designed to discover the non-zero coefficients while attempting to minimise the number of fitness function evaluations required. The resulting models reveal linkage structure that can be used to guide a search of the model efficiently. It presents experimental results solving benchmark problems in fewer fitness function evaluations than those reported in the literature for other search methods such as EDAs and linkage learners.
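To make the idea of modelling a pseudo-Boolean function with Walsh bases concrete, here is a minimal sketch that computes every Walsh coefficient by exhaustive enumeration. This is not the article's algorithm, which aims to find the non-zero coefficients with few fitness evaluations. The sketch only illustrates the representation and how non-zero coefficients expose linkage. The function names and the toy fitness function are hypothetical.

```python
from itertools import product


def walsh_sign(mask, x):
    """Walsh basis function: +1 if x and mask share an even number of
    set bits, -1 otherwise."""
    parity = sum(m & b for m, b in zip(mask, x)) % 2
    return -1 if parity else 1


def walsh_coefficients(f, n):
    """Exhaustively compute all 2**n Walsh coefficients of f: {0,1}^n -> R."""
    points = list(product((0, 1), repeat=n))
    return {
        mask: sum(f(x) * walsh_sign(mask, x) for x in points) / len(points)
        for mask in points
    }


def evaluate(coeffs, x):
    """Evaluate the Walsh model at x as a weighted sum of basis functions."""
    return sum(w * walsh_sign(mask, x) for mask, w in coeffs.items())


# Hypothetical toy fitness: bit 0 acts alone, bits 1 and 2 interact (XOR).
def toy_fitness(x):
    return 2 * x[0] + 3 * (x[1] ^ x[2])


coeffs = walsh_coefficients(toy_fitness, 3)
nonzero = {mask: w for mask, w in coeffs.items() if w != 0}
print(nonzero)
# {(0, 0, 0): 2.5, (0, 1, 1): -1.5, (1, 0, 0): -1.0}
```

Only three of the eight coefficients are non-zero, and the mask `(0, 1, 1)` shows that bits 1 and 2 are linked while bit 0 is independent. Exhaustive enumeration costs 2**n evaluations, which is exactly the cost that a sample-efficient method tries to avoid.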
Affiliation(s)
- Kevin Swingler
  - Computing Science and Mathematics, University of Stirling, Stirling, FK9 4LA, Scotland
|
349
|
Streaming Data Fusion for the Internet of Things. SENSORS 2019; 19:s19081955. [PMID: 31027306 PMCID: PMC6514969 DOI: 10.3390/s19081955] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 03/31/2019] [Revised: 04/21/2019] [Accepted: 04/22/2019] [Indexed: 11/16/2022]
Abstract
To achieve the full analytical potential of the streaming data from the internet of things, the interconnection of various data sources is needed. By definition, those sources are heterogeneous and their integration is not a trivial task. A common approach to exploiting the potential of streaming sensor data is to use machine learning techniques for predictive analytics in a way that is agnostic to the domain knowledge. Such an approach can be easily integrated in various use cases. In this paper, we propose a novel framework for data fusion of a set of heterogeneous data streams. The proposed framework enriches streaming sensor data with the contextual and historical information relevant for describing the underlying processes. The final result of the framework is a feature vector, ready to be used in a machine learning algorithm. The framework has been applied in a cloud and on an edge device. In the latter case, incremental learning capabilities have been demonstrated. The reported results illustrate a significant improvement of data-driven models applied to sensor streams. Besides higher model accuracy, the platform offers easy setup and thus fast prototyping capabilities in real-world applications.
|
350
|
A Wavelet Scattering Feature Extraction Approach for Deep Neural Network Based Indoor Fingerprinting Localization. SENSORS 2019; 19:s19081790. [PMID: 31014005 PMCID: PMC6514606 DOI: 10.3390/s19081790] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 02/28/2019] [Revised: 03/30/2019] [Accepted: 04/11/2019] [Indexed: 11/16/2022]
Abstract
The performance of an Artificial Neural Network (ANN)-based algorithm is subject to the way the feature data is extracted. This is a common issue when applying the ANN to indoor fingerprinting-based localization where the signal is unstable. To date, there is no adequate feature extraction method that can significantly mitigate the influence of the received signal strength indicator (RSSI) variation that degrades the performance of the ANN-based indoor fingerprinting algorithm. In this work, a wavelet scattering transform is used to extract reliable features that are stable to small deformations and rotation-invariant. The extracted features are used by a deep neural network (DNN) model to predict the location. The zeroth and the first layer of decomposition coefficients were used as feature data by concatenating different scattering path coefficients. The proposed algorithm has been validated on real measurements and has achieved good performance. The experimental results demonstrate that the proposed feature extraction method is stable to the RSSI variation.
|