1
|
Qiao S, Li X, Olatosi B, Young SD. Utilizing Big Data analytics and electronic health record data in HIV prevention, treatment, and care research: a literature review. AIDS Care 2024; 36:583-603. [PMID: 34260325 DOI: 10.1080/09540121.2021.1948499] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/13/2021] [Accepted: 06/22/2021] [Indexed: 01/07/2023]
Abstract
Propelled by the transformative power of modern information and communication technologies, digitalization of data, and the increasing affordability of high-performance computing, Big Data science has brought forth revolutionary advancement in many areas of business, industry, health, and medicine. The HIV research and care service community is no exception to the benefits from the availability and utilization of Big Data analytics. Electronic health record (EHR) data (e.g., administrative and billing data, electronic medical records, or other digital records of information pertinent to individual or population health) are an essential source of health and disease outcome data because of the large amount of real-world, comprehensive, and often longitudinal data, which provide a good opportunity for leveraging advanced Big Data analytics in addressing challenges in HIV prevention, treatment, and care. This review focuses on studies that apply Big Data analytics to EHR data with aims to synthesize the HIV-related issues that EHR data studies can tackle, identify challenges in the utilization of EHR data in HIV research and practice, and discuss future needs and directions that can realize the promising potential role of Big Data in ending the HIV epidemic.
Affiliation(s)
- Shan Qiao
  - South Carolina SmartState Center for Healthcare Quality (CHQ), Columbia, SC, USA
  - University of South Carolina Big Data Health Science Center, Columbia, SC, USA
  - Department of Health Promotion, Education, and Behavior, Arnold School of Public Health, University of South Carolina, Columbia, SC, USA
- Xiaoming Li
  - South Carolina SmartState Center for Healthcare Quality (CHQ), Columbia, SC, USA
  - University of South Carolina Big Data Health Science Center, Columbia, SC, USA
  - Department of Health Promotion, Education, and Behavior, Arnold School of Public Health, University of South Carolina, Columbia, SC, USA
- Bankole Olatosi
  - South Carolina SmartState Center for Healthcare Quality (CHQ), Columbia, SC, USA
  - University of South Carolina Big Data Health Science Center, Columbia, SC, USA
  - Department of Health Services Policy and Management, Arnold School of Public Health, University of South Carolina, Columbia, SC, USA
- Sean D Young
  - Department of Emergency Medicine, Department of Informatics, Institute for Prediction Technology, University of California, Irvine, CA, USA
|
2
|
Purnomo AT, Komariah KS, Lin DB, Hendria WF, Sin BK, Ahmadi N. Non-Contact Supervision of COVID-19 Breathing Behaviour With FMCW Radar and Stacked Ensemble Learning Model in Real-Time. IEEE TRANSACTIONS ON BIOMEDICAL CIRCUITS AND SYSTEMS 2022; 16:664-678. [PMID: 35853073 PMCID: PMC9647724 DOI: 10.1109/tbcas.2022.3192359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Revised: 03/30/2022] [Accepted: 06/24/2022] [Indexed: 06/15/2023]
Abstract
Respiratory disorders in COVID-19 patients require close supervision by medical practitioners during the isolation period. A non-contact monitoring device is a suitable way to reduce the risk of spreading the virus while monitoring the patient. This study uses Frequency-Modulated Continuous Wave (FMCW) radar to obtain respiratory information and Machine Learning (ML) to analyze the respiratory signals. Multiple subjects in a room can be detected simultaneously by calculating the Angle of Arrival (AoA) of the received signal and exploiting the Multiple Input Multiple Output (MIMO) capability of the FMCW radar. A Fast Fourier Transform (FFT) and additional signal processing are applied to obtain a breathing waveform, and ML lets the system analyze the respiratory signals automatically. The paper also compares several ML algorithms, namely Multinomial Logistic Regression (MLR), Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), eXtreme Gradient Boosting (XGB), Light Gradient Boosting Machine (LGBM), CatBoosting (CB) Classifier, and Multilayer Perceptron (MLP), together with three proposed stacked ensemble models, namely Stacked Ensemble Classifier (SEC), Boosting Tree-based Stacked Classifier (BTSC), and Neural Stacked Ensemble Model (NSEM), to find the best ML model. The results show that NSEM achieves the best performance, with 97.1% accuracy. In the real-time implementation, the system simultaneously detected several objects with different breathing characteristics and classified the respiratory signals into five classes.
Affiliation(s)
- Ariana Tulus Purnomo
  - Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology, Taipei 10607, Taiwan
- Kokoy Siti Komariah
  - Department of AI Convergence and the Division of Computer Engineering (respectively), Pukyong National University, Busan 48513, Republic of Korea
- Ding-Bing Lin
  - Department of Electronic and Computer Engineering, National Taiwan University of Science and Technology, Taipei 10607, Taiwan
- Willy Fitra Hendria
  - Department of Intelligent Mechatronics Engineering, Sejong University, Seoul 05006, Republic of Korea
- Bong-Kee Sin
  - Department of AI Convergence and the Division of Computer Engineering (respectively), Pukyong National University, Busan 48513, Republic of Korea
- Nur Ahmadi
  - Center for Artificial Intelligence (U-CoE AI-VLB), School of Electrical Engineering and Informatics, Bandung Institute of Technology, Bandung 40132, Indonesia
|
3
|
Wheeler S, Elkhadrawi M, Stevens B, Wheeler B, Akcakaya M. Machine learning classification of false-positive human immunodeficiency virus screening results. J Pathol Inform 2021; 12:46. [PMID: 34934521 PMCID: PMC8652341 DOI: 10.4103/jpi.jpi_7_21] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/20/2021] [Revised: 06/29/2021] [Accepted: 07/13/2021] [Indexed: 11/04/2022] Open
|
4
|
Rodriguez J, Prieto S, Correa C, Melo M, Dominguez D, Olarte N, Suárez D, Aragón L, Torres F, Santacruz F. Prediction of CD4+ Cells Counts in HIV/AIDS Patients based on Sets and Probability Theories. Curr HIV Res 2019; 16:416-424. [PMID: 30843490 DOI: 10.2174/1570162x17666190306125819] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 01/30/2019] [Revised: 02/26/2019] [Accepted: 03/05/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND: Previous studies developed mathematical methodologies for predicting the number of CD4+ cells from the total leukocyte and lymphocyte counts, achieving prediction effectiveness above 90% for counts below 5000 leukocytes. OBJECTIVE: To improve the prediction probabilities of the methodology in the 5000-9000 leukocyte range. METHOD: Starting from sets A, B, C and D defined in a previous study, and based on CD4+ prediction from the total numbers of leukocytes and lymphocytes, an induction was performed using data from 10 patients with HIV, redefining sets A and C, which describe the behavior of lymphocytes relative to leukocytes. Subsequently, the prediction probabilities under the parameters of the previous research were evaluated in a sample of 100 patients by calculating, for predetermined leukocyte ranges, the probability of belonging to each of the defined sets, their unions and their intersections. The same procedure was then performed with the new sets, and the probability values obtained with the refined method were compared with the previous ones using sensitivity (SENS) and Negative Predictive Value (NPV) for each range. RESULTS: Probabilities greater than 0.83 were found in five of the nine ranges with the new sets. The probability for the set A∪C increased from 0.06 to 0.18, which means increases between 0.06 and 0.09 for the intersection (A∪C) ∩ (B∪D), making evident the improvement in prediction with the newly defined sets. CONCLUSION: The results show that the newly defined sets achieved a higher percentage of effectiveness in predicting the CD4+ cell value, which makes them a useful tool that can be proposed as a substitute for the clinical values obtained by flow cytometry.
Affiliation(s)
- Javier Rodriguez
  - Insight Group Director, Focusing Area and Special Internship "Physical and Mathematical Theories Applied to Medicine", Nueva Granada Military University - Clinica del Country Research Center, Bogota, Colombia
- Signed Prieto
  - Insight Group Researcher, Nueva Granada Military University, Clinica del Country Research Center, Bogota, Colombia
- Catalina Correa
  - Insight Group Researcher, Teacher of Major and Special "Physical and Mathematical Theories Applied to Medicine", Medicine Faculty, Nueva Granada Military University, Clinica del Country Research Center, Bogota, Colombia
- Martha Melo
  - Magister in Educational Institutions Management, FRACUMNG Group Researcher, Basic and Applied Sciences Faculty, Nueva Granada Military University, Bogota, Colombia
- Dario Dominguez
  - Magister in Economics, FRACUMNG Research Group Director, Basic and Applied Sciences Faculty, Nueva Granada Military University, Bogota, Colombia
- Nancy Olarte
  - Esp in Information Technologies Applied to Education, GI-iTEC Group Researcher, Engineering Faculty, Nueva Granada Military University, Bogota, Colombia
- Daniela Suárez
  - Special Internship and Focusing Area "Physical and Mathematical Theories Applied to Medicine", Medicine Faculty, Nueva Granada Military University, Bogota, Colombia
- Laura Aragón
  - Special Internship and Focusing Area "Physical and Mathematical Theories Applied to Medicine", Medicine Faculty, Nueva Granada Military University, Bogota, Colombia
- Fernando Torres
  - Special Internship and Focusing Area "Physical and Mathematical Theories Applied to Medicine", Medicine Faculty, Nueva Granada Military University, Bogota, Colombia
- Fernando Santacruz
  - Special Internship and Focusing Area "Physical and Mathematical Theories Applied to Medicine", Medicine Faculty, Nueva Granada Military University, Bogota, Colombia
|
5
|
Texier G, Allodji RS, Diop L, Meynard JB, Pellegrin L, Chaudet H. Using decision fusion methods to improve outbreak detection in disease surveillance. BMC Med Inform Decis Mak 2019; 19:38. [PMID: 30837003 PMCID: PMC6402142 DOI: 10.1186/s12911-019-0774-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 06/07/2018] [Accepted: 02/18/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: When outbreak detection algorithms (ODAs) are considered individually, the task of outbreak detection can be seen as a classification problem and the ODA as a sensor providing a binary decision (outbreak yes or no) for each day of surveillance. When they are considered jointly (in cases where several ODAs analyze the same surveillance signal), the outbreak detection problem should be treated as a decision fusion (DF) problem of multiple sensors. METHODS: This study evaluated the benefit for a decision support system of using DF methods (fusing multiple ODA decisions) compared to using a single method of outbreak detection. For each day, we merged the decisions of six ODAs using five DF methods (two voting methods, logistic regression, CART and Bayesian network - BN). Classical metrics of accuracy, prediction and timeliness were used during the evaluation steps. RESULTS: We observed the greatest gain (77%) in positive predictive value compared to the best ODA when we used DF methods with a learning step (BN, logistic regression, and CART). CONCLUSIONS: To identify disease outbreaks in systems using several ODAs to analyze surveillance data, we recommend using a DF method based on a Bayesian network. This method is at least equivalent to the best of the algorithms considered, regardless of the situation faced by the system. For those less familiar with this kind of technique, we propose that logistic regression be used when a training dataset is available.
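The simplest of the fusion rules mentioned in this abstract, voting over the binary decisions of several ODAs for one surveillance day, might be sketched as follows. The abstract does not say which two voting rules were used, so the plain majority rule, the `threshold` parameter, and the function names `majority_vote` and `fuse_days` are assumptions made for illustration only.

```python
from typing import List, Sequence


def majority_vote(decisions: Sequence[int], threshold: float = 0.5) -> int:
    """Fuse binary detector outputs (1 = outbreak, 0 = no outbreak) for one day.

    Returns 1 when the fraction of detectors voting 1 is strictly greater
    than `threshold` (an assumed rule, not taken from the paper).
    """
    if not decisions:
        raise ValueError("at least one decision is required")
    if any(d not in (0, 1) for d in decisions):
        raise ValueError("decisions must be 0 or 1")
    return int(sum(decisions) / len(decisions) > threshold)


def fuse_days(daily_decisions: Sequence[Sequence[int]]) -> List[int]:
    """Apply the vote independently to each day of surveillance."""
    return [majority_vote(day) for day in daily_decisions]


if __name__ == "__main__":
    # Two hypothetical days, six detectors each.
    print(fuse_days([[1, 1, 1, 1, 0, 0], [0, 0, 0, 0, 0, 1]]))
```

A vote of this kind needs no training data, which is the main contrast with the learned fusion methods (BN, logistic regression, CART) that the abstract reports as giving the largest gain.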
Affiliation(s)
- Gaëtan Texier
  - French Armed Forces Center for Epidemiology and Public Health (CESPA), SSA, Camp de Sainte Marthe, 13568, Marseille, France
  - UMR VITROME, IRD, AP-HM, SSA, IHU-Méditerranée Infection, Aix Marseille Univ, 13005, Marseille, France
- Rodrigue S Allodji
  - French Armed Forces Center for Epidemiology and Public Health (CESPA), SSA, Camp de Sainte Marthe, 13568, Marseille, France
  - CESP, Univ. Paris-Sud, UVSQ, INSERM, Université Paris-Saclay, Villejuif, France
  - Cancer and Radiation Team, Gustave Roussy Cancer Center, F-94805, Villejuif, France
- Loty Diop
  - International Food Policy Research Institute (IFPRI), Regional Office for West and Central Africa Regional Office, 24063, Dakar, Sénégal
- Jean-Baptiste Meynard
  - French Armed Forces Center for Epidemiology and Public Health (CESPA), SSA, Camp de Sainte Marthe, 13568, Marseille, France
  - UMR 912 - SESSTIM - INSERM/IRD/Aix-Marseille Université, 13385, Marseille, France
- Liliane Pellegrin
  - French Armed Forces Center for Epidemiology and Public Health (CESPA), SSA, Camp de Sainte Marthe, 13568, Marseille, France
  - UMR VITROME, IRD, AP-HM, SSA, IHU-Méditerranée Infection, Aix Marseille Univ, 13005, Marseille, France
- Hervé Chaudet
  - French Armed Forces Center for Epidemiology and Public Health (CESPA), SSA, Camp de Sainte Marthe, 13568, Marseille, France
  - UMR VITROME, IRD, AP-HM, SSA, IHU-Méditerranée Infection, Aix Marseille Univ, 13005, Marseille, France
|
6
|
HRGPred: Prediction of herbicide resistant genes with k-mer nucleotide compositional features and support vector machine. Sci Rep 2019; 9:778. [PMID: 30692561 PMCID: PMC6349872 DOI: 10.1038/s41598-018-37309-9] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 06/18/2018] [Accepted: 12/03/2018] [Indexed: 02/07/2023] Open
Abstract
Herbicide resistance (HR) is a major concern for agricultural producers as well as environmentalists. Resistance to commonly used herbicides is conferred by mutation(s) in the genes encoding herbicide target sites/proteins (GETS). Identification of these genes through wet-lab experiments is time consuming and expensive. Thus, a supervised learning-based computational model has been proposed in this study, which is the first of its kind for the prediction of seven classes of GETS. The cDNA sequences of the genes were initially transformed into numeric features based on the k-mer compositions and then supplied as input to the support vector machine. In the proposed SVM-based model, the prediction occurs in two stages: a binary classifier in the first stage discriminates the genes involved in conferring resistance to herbicides from other genes, and a multi-class classifier in the second stage assigns the herbicide resistant genes predicted in the first stage to one of the seven resistant classes. Overall classification accuracies were observed to be ~89% and >97% for binary and multi-class classifications respectively. The proposed model achieved higher accuracy than the homology-based algorithms, viz., BLAST and Hidden Markov Model. In addition, the developed computational model achieved ~87% accuracy when tested with an independent dataset. An online prediction server HRGPred (http://cabgrid.res.in:8080/hrgpred) has also been established to facilitate the prediction of GETS by the scientific community.
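The feature step described in this abstract, turning a nucleotide sequence into a numeric k-mer composition vector, can be illustrated with a minimal sketch. The abstract does not give the value of k or the normalization used, so the default `k = 3`, the relative-frequency normalization, the skipping of windows with ambiguous bases, and the function name `kmer_composition` are all assumptions.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_composition(sequence: str, k: int = 3) -> List[float]:
    """Return relative frequencies of all 4**k k-mers over the alphabet ACGT.

    Features are ordered lexicographically (AAA, AAC, ..., TTT for k = 3).
    Windows containing any other character (for example N) are ignored.
    """
    seq = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    features = kmer_composition("ATGGCGTACGTTAGC", k=2)
    print(len(features), round(sum(features), 6))
```

A fixed-length vector of this kind is what lets sequences of different lengths be fed to a classifier such as the support vector machine named in the abstract.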
|
7
|
Wang L, Law J, Kale SD, Murali TM, Pandey G. Large-scale protein function prediction using heterogeneous ensembles. F1000Res 2018; 7. [PMID: 30450194 PMCID: PMC6221071 DOI: 10.12688/f1000research.16415.1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Accepted: 09/26/2018] [Indexed: 12/24/2022] Open
Abstract
Heterogeneous ensembles are an effective approach in scenarios where the ideal data type and/or individual predictor are unclear for a given problem. These ensembles have shown promise for protein function prediction (PFP), but their ability to improve PFP at a large scale is unclear. The overall goal of this study is to critically assess this ability of a variety of heterogeneous ensemble methods across a multitude of functional terms, proteins and organisms. Our results show that these methods, especially Stacking using Logistic Regression, indeed produce more accurate predictions for a variety of Gene Ontology terms differing in size and specificity. To enable the application of these methods to other related problems, we have publicly shared the HPC-enabled code underlying this work as LargeGOPred (https://github.com/GauravPandeyLab/LargeGOPred).
Affiliation(s)
- Linhua Wang
  - Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
- Jeffrey Law
  - Genetics, Bioinformatics, and Computational Biology Ph.D. Program, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA
- Shiv D Kale
  - Biocomplexity Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA
- T M Murali
  - Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA, 24061, USA
- Gaurav Pandey
  - Department of Genetics and Genomic Sciences and Icahn Institute for Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
|
8
|
Stanescu A, Pandey G. LEARNING PARSIMONIOUS ENSEMBLES FOR UNBALANCED COMPUTATIONAL GENOMICS PROBLEMS. PACIFIC SYMPOSIUM ON BIOCOMPUTING. PACIFIC SYMPOSIUM ON BIOCOMPUTING 2017; 22:288-299. [PMID: 27896983 DOI: 10.1142/9789813207813_0028] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 12/17/2022]
Abstract
Prediction problems in biomedical sciences are generally quite difficult, partially due to incomplete knowledge of how the phenomenon of interest is influenced by the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor(s) for specific problems. In these situations, a powerful approach to improving prediction performance is to construct ensembles that combine the outputs of many individual base predictors, which have been successful for many biomedical prediction tasks. Moreover, selecting a parsimonious ensemble can be of even greater value for biomedical sciences, where it is not only important to learn an accurate predictor, but also to interpret what novel knowledge it can provide about the target problem. Ensemble selection is a promising approach for this task because of its ability to select a collectively predictive subset, often a relatively small one, of all input base predictors. One of the most well-known algorithms for ensemble selection, CES (Caruana et al.'s Ensemble Selection), generally performs well in practice, but faces several challenges due to the difficulty of choosing the right values of its various parameters. Since the choices made for these parameters are usually ad-hoc, good performance of CES is difficult to guarantee for a variety of problems or datasets. To address these challenges with CES and other such algorithms, we propose a novel heterogeneous ensemble selection approach based on the paradigm of reinforcement learning (RL), which offers a more systematic and mathematically sound methodology for exploring the many possible combinations of base predictors that can be selected into an ensemble. We develop three RL-based strategies for constructing ensembles and analyze their results on two unbalanced computational genomics problems, namely the prediction of protein function and splice sites in eukaryotic genomes. 
We show that the resultant ensembles are indeed substantially more parsimonious as compared to the full set of base predictors, yet still offer almost the same classification power, especially for larger datasets. The RL ensembles also yield a better combination of parsimony and predictive performance as compared to CES.
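For readers unfamiliar with the CES baseline named in this abstract, a simplified sketch of greedy forward ensemble selection in the spirit of Caruana et al. is given below. It is not the authors' RL-based method and not their CES configuration: the accuracy metric, the 0.5 cut-off, the stop-on-no-improvement rule, and the names `greedy_ensemble_selection` and `accuracy` are illustrative assumptions. For unbalanced problems such as those in the paper, a different validation metric would normally replace plain accuracy.

```python
from typing import Dict, List, Sequence


def accuracy(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of examples where thresholding the score at 0.5 matches the label."""
    hits = sum((s >= 0.5) == bool(y) for y, s in zip(y_true, scores))
    return hits / len(y_true)


def greedy_ensemble_selection(
    preds: Dict[str, Sequence[float]],
    y_true: Sequence[int],
    max_rounds: int = 10,
) -> List[str]:
    """Greedily add base predictors (with replacement) to an averaging ensemble.

    `preds` maps a predictor name to its validation-set scores. In each round
    the predictor whose addition gives the best averaged score is added; the
    loop stops when no addition strictly improves the validation accuracy.
    """
    selected: List[str] = []
    running_sum = [0.0] * len(y_true)
    best_score = float("-inf")
    for _ in range(max_rounds):
        best_name = None
        for name, p in preds.items():
            n = len(selected) + 1
            candidate = [(r + x) / n for r, x in zip(running_sum, p)]
            score = accuracy(y_true, candidate)
            if score > best_score:
                best_score, best_name = score, name
        if best_name is None:
            break
        selected.append(best_name)
        running_sum = [r + x for r, x in zip(running_sum, preds[best_name])]
    return selected


if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    base = {
        "a": [0.9, 0.6, 0.4, 0.1],
        "b": [0.4, 0.1, 0.9, 0.6],
        "c": [0.2, 0.8, 0.3, 0.7],
    }
    print(greedy_ensemble_selection(base, labels))
```

In this toy run the two complementary predictors are kept and the uninformative one is left out, which is the parsimony property the abstract is concerned with.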
Affiliation(s)
- Ana Stanescu
  - Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
|
9
|
Hou J, Gao H, Xia Q, Qi N. Feature Combination and the kNN Framework in Object Classification. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2016; 27:1368-1378. [PMID: 26316223 DOI: 10.1109/tnnls.2015.2461552] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 06/04/2023]
Abstract
In object classification, feature combination can usually be used to combine the strength of multiple complementary features and produce better classification results than any single one. While multiple kernel learning (MKL) is a popular approach to feature combination in object classification, it does not always perform well in practical applications. On one hand, the optimization process in MKL usually involves a huge consumption of computation and memory space. On the other hand, in some cases, MKL is found to perform no better than the baseline combination methods. This observation motivates us to investigate the underlying mechanism of feature combination with average combination and weighted average combination. As a result, we empirically find that in average combination, it is better to use a sample of the most powerful features instead of all, whereas in one type of weighted average combination, the best classification accuracy comes from a nearly sparse combination. We integrate these observations into the k-nearest neighbors (kNNs) framework, based on which we further discuss some issues related to sparse solution and MKL. Finally, by making use of the kNN framework, we present a new weighted average combination method, which is shown to perform better than MKL in both accuracy and efficiency in experiments. We believe that the work in this paper is helpful in exploring the mechanism underlying feature combination.
|
10
|
|
11
|
Nath A, Subbiah K. Maximizing lipocalin prediction through balanced and diversified training set and decision fusion. Comput Biol Chem 2015; 59 Pt A:101-10. [PMID: 26433483 DOI: 10.1016/j.compbiolchem.2015.09.011] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 12/02/2014] [Revised: 09/08/2015] [Accepted: 09/23/2015] [Indexed: 01/17/2023]
Abstract
Lipocalins are short in sequence length and perform several important biological functions. These proteins have less than 20% sequence similarity among paralogs. Identifying them experimentally is an expensive and time-consuming process. Computational methods based on sequence similarity also struggle to allocate putative members to this family because of the low sequence similarity among its members. Consequently, machine learning methods become a viable alternative for their prediction, using the underlying sequence- and structure-derived features as input. Ideally, any machine learning based prediction method must be trained with all possible variations in the input feature vector (all the sub-class input patterns) to achieve perfect learning. Near-perfect learning can be achieved by training the model with diverse types of input instances belonging to different regions of the entire input space. Furthermore, prediction performance can be improved by balancing the training set, as imbalanced data sets tend to bias prediction towards the majority class and its sub-classes. This paper aims to (i) achieve high generalization ability without any classification bias through diversified and balanced training sets and (ii) enhance prediction accuracy by combining the results of individual classifiers with an appropriate fusion scheme. Instead of creating the training set randomly, we first used the unsupervised Kmeans clustering algorithm to create diversified clusters of input patterns and then created the diversified and balanced training set by selecting an equal number of patterns from each of these clusters.
Finally, a probability-based classifier fusion scheme was applied to the boosted random forest algorithm (which produced greater sensitivity) and the K nearest neighbour algorithm (which produced greater specificity) to achieve better predictive performance than that of the individual base classifiers. Models trained on the Kmeans-preprocessed training set perform far better than those trained on randomly generated training sets. The proposed method achieved a sensitivity of 90.6%, specificity of 91.4% and accuracy of 91.0% on the first test set and a sensitivity of 92.9%, specificity of 96.2% and accuracy of 94.7% on the second blind test set. These results establish that diversifying the training set improves the performance of predictive models through superior generalization ability and that balancing the training set improves prediction accuracy. For smaller data sets, unsupervised Kmeans-based sampling can be a more effective technique for increasing generalization than the usual random splitting method.
Affiliation(s)
- Abhigyan Nath
  - Department of Computer Science, Banaras Hindu University, Varanasi 221005, India.
- Karthikeyan Subbiah
  - Department of Computer Science, Banaras Hindu University, Varanasi 221005, India.
|
12
|
Whalen S, Pandey OP, Pandey G. Predicting protein function and other biomedical characteristics with heterogeneous ensembles. Methods 2015; 93:92-102. [PMID: 26342255 DOI: 10.1016/j.ymeth.2015.08.016] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 05/21/2015] [Revised: 08/03/2015] [Accepted: 08/23/2015] [Indexed: 12/29/2022] Open
Abstract
Prediction problems in biomedical sciences, including protein function prediction (PFP), are generally quite difficult. This is due in part to incomplete knowledge of the cellular phenomenon of interest, the appropriateness and data quality of the variables and measurements used for prediction, as well as a lack of consensus regarding the ideal predictor for specific problems. In such scenarios, a powerful approach to improving prediction performance is to construct heterogeneous ensemble predictors that combine the output of diverse individual predictors that capture complementary aspects of the problems and/or datasets. In this paper, we demonstrate the potential of such heterogeneous ensembles, derived from stacking and ensemble selection methods, for addressing PFP and other similar biomedical prediction problems. Deeper analysis of these results shows that the superior predictive ability of these methods, especially stacking, can be attributed to their attention to the following aspects of the ensemble learning process: (i) better balance of diversity and performance, (ii) more effective calibration of outputs and (iii) more robust incorporation of additional base predictors. Finally, to make the effective application of heterogeneous ensembles to large complex datasets (big data) feasible, we present DataSink, a distributed ensemble learning framework, and demonstrate its sound scalability using the examined datasets. DataSink is publicly available from https://github.com/shwhalen/datasink.
Affiliation(s)
- Sean Whalen
  - Gladstone Institutes, University of California, San Francisco, CA, USA.
- Om Prakash Pandey
  - Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
- Gaurav Pandey
  - Icahn Institute for Genomics and Multiscale Biology and Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA; Graduate School of Biomedical Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
|
13
|
Predictions of CD4 lymphocytes' count in HIV patients from complete blood count. BMC MEDICAL PHYSICS 2013; 13:3. [PMID: 24034560 PMCID: PMC3847222 DOI: 10.1186/1756-6649-13-3] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 10/01/2012] [Accepted: 09/09/2013] [Indexed: 11/11/2022]
Abstract
Background: HIV diagnosis, prognosis and treatment require the T CD4 lymphocyte count from flow cytometry, an expensive technique often not available to people in developing countries. The aim of this work is to apply a previously developed methodology, based on sets theory, that predicts the T CD4 lymphocyte value from the total white blood cell (WBC) count and the lymphocyte count taken from the Complete Blood Count (CBC). Methods: Sets theory was used to classify into groups named A, B, C and D the number of leukocytes/mm³, lymphocytes/mm³, and the CD4/μL subpopulation per flow cytometry of 800 HIV diagnosed patients. The unions of sets A and C and of sets B and D were assessed, and the intersection of both unions was described in order to establish the percentage belonging to these sets. Results were classified into eight ranges in steps of 1000 leukocytes/mm³, calculating the belonging percentage of each range with respect to the whole sample. Results: The intersection (A ∪ C) ∩ (B ∪ D) showed a prediction effectiveness of 81.44% for the range between 4000 and 4999 leukocytes, 91.89% for the range between 3000 and 3999, and 100% for the range below 3000. Conclusions: The usefulness and clinical applicability of a methodology based on sets theory were confirmed for predicting the T CD4 lymphocyte value, starting from the WBC and lymphocyte counts from the CBC. This methodology is new, objective, and has lower costs than flow cytometry, which is currently considered the gold standard.
|
14
|
Beerenwinkel N, Montazeri H, Schuhmacher H, Knupfer P, von Wyl V, Furrer H, Battegay M, Hirschel B, Cavassini M, Vernazza P, Bernasconi E, Yerly S, Böni J, Klimkait T, Cellerai C, Günthard HF. The individualized genetic barrier predicts treatment response in a large cohort of HIV-1 infected patients. PLoS Comput Biol 2013; 9:e1003203. [PMID: 24009493 PMCID: PMC3757085 DOI: 10.1371/journal.pcbi.1003203] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 12/10/2012] [Accepted: 07/14/2013] [Indexed: 12/12/2022] Open
Abstract
The success of combination antiretroviral therapy is limited by the evolutionary escape dynamics of HIV-1. We used Isotonic Conjunctive Bayesian Networks (I-CBNs), a class of probabilistic graphical models, to describe this process. We employed partial order constraints among viral resistance mutations, which give rise to a limited set of mutational pathways, and we modeled phenotypic drug resistance as monotonically increasing along any escape pathway. Using this model, the individualized genetic barrier (IGB) to each drug is derived as the probability of the virus not acquiring additional mutations that confer resistance. Drug-specific IGBs were combined to obtain the IGB to an entire regimen, which quantifies the virus' genetic potential for developing drug resistance under combination therapy. The IGB was tested as a predictor of therapeutic outcome using between 2,185 and 2,631 treatment change episodes of subtype B infected patients from the Swiss HIV Cohort Study Database, a large observational cohort. Using logistic regression, significant univariate predictors included most of the 18 drugs and single-drug IGBs, the IGB to the entire regimen, the expert rules-based genotypic susceptibility score (GSS), several individual mutations, and the peak viral load before treatment change. In the multivariate analysis, the only genotype-derived variables that remained significantly associated with virological success were GSS and, with 10-fold stronger association, IGB to regimen. When predicting suppression of viral load below 400 cps/ml, IGB outperformed GSS and also improved GSS-containing predictors significantly, but the difference was not significant for suppression below 50 cps/ml. Thus, the IGB to regimen is a novel data-derived predictor of treatment outcome that has potential to improve the interpretation of genotypic drug resistance tests.
Affiliation(s)
- Niko Beerenwinkel
- Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland.
15
Rodríguez J, Prieto S, Correa C, Forero MF, Pérez C, Soracipa Y, Mora J, Rojas N, Pineda D, López F. Teoría de conjuntos aplicada al recuento de linfocitos y leucocitos: predicción de linfocitos T CD4 de pacientes con virus de la inmunodeficiencia humana/sida [Set theory applied to lymphocyte and leukocyte counts: prediction of CD4 T lymphocytes in patients with human immunodeficiency virus/AIDS]. ACTA ACUST UNITED AC 2013. [DOI: 10.1016/j.inmuno.2013.01.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 02/05/2023]
16
Svärd J, Sönnerborg A. Optimizing background therapy in treatment-experienced HIV-1 patients by rules-based algorithms and bioinformatics. Future Virol 2012. [DOI: 10.2217/fvl.12.66] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/21/2022]
Abstract
In HIV-1-infected patients with extensive drug resistance, the optimization of background antiretroviral therapy is essential when changing drugs after treatment failure. The genotypic sensitivity score (GSS) and phenotypic sensitivity score (PSS), determined by rules-based algorithms, are employed to predict which drugs to select in a background therapy in order to receive the best treatment response when a new drug will be used, both in investigational trials of new agents and in clinical care. However, the outcome of the GSS/PSS approach for the purpose of assessing antiretroviral efficacy in patients with multiresistance has become more problematic, despite improvements such as drug potency weighting and adding information on treatment history. Bioinformatics-based methods are more recent attractive alternatives that have demonstrated equal or better precision compared with rules-based algorithms. This review aims to discuss the usefulness of GSS/PSS and bioinformatics, respectively, for the optimization of anti-HIV background therapy in heavily treatment-experienced patients.
Affiliation(s)
- Jenny Svärd
- Unit of Infectious Diseases, Department of Medicine Huddinge, Karolinska Institutet, Stockholm, Sweden
- Anders Sönnerborg
- Unit of Infectious Diseases, Department of Medicine Huddinge, Karolinska Institutet, Stockholm, Sweden
- Division of Clinical Microbiology, Department of Laboratory Medicine, Karolinska Institutet, Stockholm, Sweden
17
Singh Y, Mars M. HIV Drug-Resistant Patient Information Management, Analysis, and Interpretation. JMIR Res Protoc 2012; 1:e3. [PMID: 23611761 PMCID: PMC3626142 DOI: 10.2196/resprot.1930] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/20/2011] [Revised: 01/27/2012] [Accepted: 04/22/2012] [Indexed: 02/05/2023] Open
Abstract
Introduction The science of information systems, management, and interpretation plays an important part in the continuity of care of patients. This is becoming more evident in the treatment of human immunodeficiency virus (HIV) and acquired immune deficiency syndrome (AIDS), the leading cause of death in sub-Saharan Africa. The high replication rates, selective pressure, and initial infection by resistant strains of HIV imply that drug resistance will inevitably become an important health care concern. This paper describes proposed research with the aim of developing a physician-administered, artificial intelligence-based decision support tool to facilitate the management of patients on antiretroviral therapy. Methods This tool will consist of (1) an artificial intelligence computer program that will determine HIV drug resistance information from genomic analysis; (2) a machine-learning algorithm that can predict future CD4 count information given a genomic sequence; and (3) the integration of these tools into an electronic medical record for storage and management. Conclusion The aim of the project is to create an electronic tool that assists clinicians in managing and interpreting patient information in order to determine the optimal therapy for drug-resistant HIV patients.
Affiliation(s)
- Yashik Singh
- Department of TeleHealth, Nelson R Mandela school of Medicine, University of KwaZulu-Natal, Durban, South Africa.
18
Zhang T, Song B, Zhu W, Xu X, Gong QQ, Morando C, Dassopoulos T, Newberry RD, Hunt SR, Li E. An ileal Crohn's disease gene signature based on whole human genome expression profiles of disease unaffected ileal mucosal biopsies. PLoS One 2012; 7:e37139. [PMID: 22606341 PMCID: PMC3351422 DOI: 10.1371/journal.pone.0037139] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 12/17/2011] [Accepted: 04/13/2012] [Indexed: 12/21/2022] Open
Abstract
Previous genome-wide expression studies have highlighted distinct gene expression patterns in inflammatory bowel disease (IBD) compared to control samples, but the interpretation of these studies has been limited by sample heterogeneity with respect to disease phenotype, disease activity, and anatomic sites. To further improve molecular classification of inflammatory bowel disease phenotypes we focused on a single anatomic site, the disease unaffected proximal ileal margin of resected ileum, and three phenotypes that were unlikely to overlap: ileal Crohn's disease (ileal CD), ulcerative colitis (UC), and control patients without IBD. Whole human genome (Agilent) expression profiling was conducted on two independent sets of disease-unaffected ileal samples collected from the proximal margin of resected ileum. Set 1 (47 ileal CD, 27 UC, and 25 Control non-IBD patients) was used as the training set and Set 2 was subsequently collected as an independent test set (10 ileal CD, 10 UC, and 10 control non-IBD patients). We compared the 17 gene signatures selected by four different feature-selection methods to distinguish ileal CD phenotype with non-CD phenotype. The four methods yielded different but overlapping solutions that were highly discriminating. All four of these methods selected FOLH1 as a common feature. This gene is an established biomarker for prostate cancer, but has not previously been associated with Crohn's disease. Immunohistochemical staining confirmed increased expression of FOLH1 in the ileal epithelium. These results provide evidence for convergent molecular abnormalities in the macroscopically disease unaffected proximal margin of resected ileum from ileal CD subjects.
Affiliation(s)
- Tianyi Zhang
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, New York, United States of America
- Bowen Song
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, New York, United States of America
- Wei Zhu
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, New York, United States of America
- Xiao Xu
- Department of Medicine, Stony Brook University, Stony Brook, New York, United States of America
- Qing Qing Gong
- Department of Medicine, Washington University-St. Louis School of Medicine, Saint Louis, Missouri, United States of America
- Christopher Morando
- Department of Medicine, Washington University-St. Louis School of Medicine, Saint Louis, Missouri, United States of America
- Themistocles Dassopoulos
- Department of Medicine, Washington University-St. Louis School of Medicine, Saint Louis, Missouri, United States of America
- Rodney D. Newberry
- Department of Medicine, Washington University-St. Louis School of Medicine, Saint Louis, Missouri, United States of America
- Steven R. Hunt
- Department of Surgery, Washington University-St. Louis School of Medicine, Saint Louis, Missouri, United States of America
- Ellen Li
- Department of Medicine, Stony Brook University, Stony Brook, New York, United States of America
- Department of Medicine, Washington University-St. Louis School of Medicine, Saint Louis, Missouri, United States of America
19
Rodríguez J, Prieto S, Bernal P, Pérez C, Correa C, Álvarez L, Bravo J, Perdomo N, Faccini Á. Predicción de la concentración de linfocitos T CD4 en sangre periférica con base en la teoría de la probabilidad. Aplicación clínica en poblaciones de leucocitos, linfocitos y CD4 de pacientes con VIH [Prediction of the CD4 T-lymphocyte concentration in peripheral blood based on probability theory. Clinical application in leukocyte, lymphocyte and CD4 populations of patients with HIV]. INFECTIO 2012. [DOI: 10.1016/s0123-9392(12)70053-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022] Open
20
Obermeier M, Pironti A, Berg T, Braun P, Däumer M, Eberle J, Ehret R, Kaiser R, Kleinkauf N, Korn K, Kücherer C, Müller H, Noah C, Stürmer M, Thielen A, Wolf E, Walter H. HIV-GRADE: a publicly available, rules-based drug resistance interpretation algorithm integrating bioinformatic knowledge. Intervirology 2012; 55:102-7. [PMID: 22286877 DOI: 10.1159/000331999] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND Genotypic drug resistance testing provides essential information for guiding treatment in HIV-infected patients. It may either be used for identifying patients with transmitted drug resistance or to clarify reasons for treatment failure and to check for remaining treatment options. While different approaches for the interpretation of HIV sequence information are already available, no other available rules-based systems specifically have looked into the effects of combinations of drugs. HIV-GRADE (Genotypischer Resistenz Algorithmus Deutschland) was planned as a countrywide approach to establish standardized drug resistance interpretation in Germany and also to introduce rules for estimating the influence of mutations on drug combinations. The rules for HIV-GRADE are taken from the literature, clinical follow-up data and from a bioinformatics-driven interpretation system (geno2pheno[resistance]). HIV-GRADE presents the option of seeing the rules and results of other drug resistance algorithms for a given sequence simultaneously. METHODS The HIV-GRADE rules-based interpretation system was developed by the members of the HIV-GRADE registered society. For continuous updates, this expert committee meets twice a year to analyze data from various sources. Besides data from clinical studies and the centers involved, published correlations for mutations with drug resistance and genotype-phenotype correlation data information from the bioinformatic models of geno2pheno are used to generate the rules for the HIV-GRADE interpretation system. A freely available online tool was developed on the basis of the Stanford HIVdb rules interpretation tool using the algorithm specification interface. Clinical validation of the interpretation system was performed on the data of treatment episodes consisting of sequence information, antiretroviral treatment and viral load, before and 3 months after treatment change. Data were analyzed using multiple linear regression.
RESULTS As the developed online tool allows easy comparison of different drug resistance interpretation systems, coefficients of determination (R²) were compared for the freely available rules-based systems. HIV-GRADE (R² = 0.40), Stanford HIVdb (R² = 0.40), REGA algorithm (R² = 0.36) and ANRS (R² = 0.35) had a very similar performance using this multiple linear regression model. CONCLUSION The performance of HIV-GRADE is comparable to alternative rules-based interpretation systems. While there is still room for improvement, HIV-GRADE has been made publicly available to allow access to our approach regarding the interpretation of resistance against single drugs and drug combinations.
21
Oette M, Schülter E, Rosen-Zvi M, Peres Y, Zazzi M, Sönnerborg A, Struck D, Altmann A, Kaiser R. Efficacy of antiretroviral therapy switch in HIV-infected patients: a 10-year analysis of the EuResist Cohort. Intervirology 2012; 55:160-6. [PMID: 22286887 DOI: 10.1159/000332018] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022] Open
Abstract
INTRODUCTION Highly active antiretroviral therapy (HAART) has been shown to be effective in many recent trials. However, there is limited data on time trends of HAART efficacy after treatment change. METHODS Data from different European cohorts were compiled within the EuResist Project. The efficacy of HAART defined by suppression of viral replication at 24 weeks after therapy switch was analyzed considering previous treatment modifications from 1999 to 2008. RESULTS Altogether, 12,323 treatment change episodes in 7,342 patients were included in the analysis. In 1999, HAART after treatment switch was effective in 38.0% of the patients who had previously undergone 1-5 therapies. This figure rose to 85.0% in 2008. In patients with more than 5 previous therapies, efficacy rose from 23.9 to 76.2% in the same time period. In patients with detectable viral load at therapy switch, the efficacy rose from 23.3 to 66.7% with 1-5 previous treatments and from 14.4 to 55.6% with more than 5 previous treatments. CONCLUSION The results of this large cohort show that the outcome of HAART switch has improved considerably over the last years. This result was particularly observed in the context after viral rebound. Thus, changing HAART is no longer associated with a high risk of treatment failure.
Affiliation(s)
- Mark Oette
- Clinic for General Medicine, Gastroenterology and Infectious Diseases, Augustinerinnen Hospital, Cologne, Germany.
22
Zazzi M, Incardona F, Rosen-Zvi M, Prosperi M, Lengauer T, Altmann A, Sonnerborg A, Lavee T, Schülter E, Kaiser R. Predicting Response to Antiretroviral Treatment by Machine Learning: The EuResist Project. Intervirology 2012; 55:123-7. [DOI: 10.1159/000332008] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.8] [Indexed: 11/19/2022] Open
23
Bejarano B, Bianco M, Gonzalez-Moron D, Sepulcre J, Goñi J, Arcocha J, Soto O, Del Carro U, Comi G, Leocani L, Villoslada P. Computational classifiers for predicting the short-term course of Multiple sclerosis. BMC Neurol 2011; 11:67. [PMID: 21649880 PMCID: PMC3118106 DOI: 10.1186/1471-2377-11-67] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.7] [Received: 09/26/2010] [Accepted: 06/07/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The aim of this study was to assess the diagnostic accuracy (sensitivity and specificity) of clinical, imaging and motor evoked potentials (MEP) for predicting the short-term prognosis of multiple sclerosis (MS). METHODS We obtained clinical data, MRI and MEP from a prospective cohort of 51 patients and 20 matched controls followed for two years. Clinical end-points recorded were: 1) expanded disability status scale (EDSS), 2) disability progression, and 3) new relapses. We constructed computational classifiers (Bayesian, random decision-trees, simple logistic-linear regression-and neural networks) and calculated their accuracy by means of a 10-fold cross-validation method. We also validated our findings with a second cohort of 96 MS patients from a second center. RESULTS We found that disability at baseline, grey matter volume and MEP were the variables that better correlated with clinical end-points, although their diagnostic accuracy was low. However, classifiers combining the most informative variables, namely baseline disability (EDSS), MRI lesion load and central motor conduction time (CMCT), were much more accurate in predicting future disability. Using the most informative variables (especially EDSS and CMCT) we developed a neural network (NNet) that attained a good performance for predicting the EDSS change. The predictive ability of the neural network was validated in an independent cohort obtaining similar accuracy (80%) for predicting the change in the EDSS two years later. CONCLUSIONS The usefulness of clinical variables for predicting the course of MS on an individual basis is limited, despite being associated with the disease course. By training a NNet with the most informative variables we achieved a good accuracy for predicting short-term disability.
24
Heider D, Verheyen J, Hoffmann D. Machine learning on normalized protein sequences. BMC Res Notes 2011; 4:94. [PMID: 21453485 PMCID: PMC3079662 DOI: 10.1186/1756-0500-4-94] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Received: 12/07/2010] [Accepted: 03/31/2011] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths. FINDINGS We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%. CONCLUSIONS We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus, is a promising alternative to existing methods, especially for protein sequences of variable length.
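The length normalization described in this abstract can be sketched in a few lines of Python. The abstract does not say how residues are turned into numbers, so the Kyte-Doolittle hydropathy scale used below is an assumption made for illustration, as are the function names `encode` and `normalize_length`. Only the linear interpolation step is shown; the random forest classification is not.

```python
# Kyte-Doolittle hydropathy values; using this scale is an assumption,
# the abstract does not name the numeric descriptor.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map an amino-acid string to a list of numeric descriptor values."""
    return [HYDROPATHY[residue] for residue in sequence]


def normalize_length(values, target_len):
    """Linearly interpolate a numeric sequence to exactly target_len points."""
    n = len(values)
    if n == 0 or target_len < 1:
        raise ValueError("need a non-empty input and a positive target length")
    if n == 1 or target_len == 1:
        return [float(values[0])] * target_len
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)  # position on the original axis
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


# Stretch a 6-residue fragment to 10 points, and shrink it to 4.
stretched = normalize_length(encode("PQITLW"), 10)
shrunk = normalize_length(encode("PQITLW"), 4)
```

Both outputs keep the first and last descriptor values of the input, so sequences with insertions or deletions end up as feature vectors of identical length that a standard classifier can consume.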
Affiliation(s)
- Dominik Heider
- Department of Bioinformatics, Center of Medical Biotechnology, University of Duisburg-Essen, Universitaetsstr. 2, 45117 Essen, Germany
- Jens Verheyen
- Institute of Virology, University of Cologne, Fuerst-Pueckler-Str. 56, 50935 Cologne, Germany
- Daniel Hoffmann
- Department of Bioinformatics, Center of Medical Biotechnology, University of Duisburg-Essen, Universitaetsstr. 2, 45117 Essen, Germany
25
Prosperi MCF, Rosen-Zvi M, Altmann A, Zazzi M, Di Giambenedetto S, Kaiser R, Schülter E, Struck D, Sloot P, van de Vijver DA, Vandamme AM, Sönnerborg A. Antiretroviral therapy optimisation without genotype resistance testing: a perspective on treatment history based models. PLoS One 2010; 5:e13753. [PMID: 21060792 PMCID: PMC2966424 DOI: 10.1371/journal.pone.0013753] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Received: 06/13/2010] [Accepted: 09/28/2010] [Indexed: 11/24/2022] Open
Abstract
Background Although genotypic resistance testing (GRT) is recommended to guide combination antiretroviral therapy (cART), funding and/or facilities to perform GRT may not be available in low to middle income countries. Since treatment history (TH) impacts response to subsequent therapy, we investigated a set of statistical learning models to optimise cART in the absence of GRT information. Methods and Findings The EuResist database was used to extract 8-week and 24-week treatment change episodes (TCE) with GRT and additional clinical, demographic and TH information. Random Forest (RF) classification was used to predict 8- and 24-week success, defined as undetectable HIV-1 RNA, comparing nested models including (i) GRT+TH and (ii) TH without GRT, using multiple cross-validation and area under the receiver operating characteristic curve (AUC). Virological success was achieved in 68.2% and 68.0% of TCE at 8- and 24-weeks (n = 2,831 and 2,579), respectively. RF (i) and (ii) showed comparable performances, with an average (st.dev.) AUC 0.77 (0.031) vs. 0.757 (0.035) at 8-weeks, 0.834 (0.027) vs. 0.821 (0.025) at 24-weeks. Sensitivity analyses, carried out on a data subset that included antiretroviral regimens commonly used in low to middle income countries, confirmed our findings. Training on subtype B and validation on non-B isolates resulted in a decline of performance for models (i) and (ii). Conclusions Treatment history-based RF prediction models are comparable to GRT-based for classification of virological outcome. These results may be relevant for therapy optimisation in areas where availability of GRT is limited. Further investigations are required in order to account for different demographics, subtypes and different therapy switching strategies.
Affiliation(s)
- Mattia C F Prosperi
- Clinic of Infectious Diseases, Catholic University of Sacred Heart, Rome, Italy.
26
Zazzi M, Kaiser R, Sönnerborg A, Struck D, Altmann A, Prosperi M, Rosen-Zvi M, Petroczi A, Peres Y, Schülter E, Boucher CA, Brun-Vezinet F, Harrigan PR, Morris L, Obermeier M, Perno CF, Phanuphak P, Pillay D, Shafer RW, Vandamme AM, van Laethem K, Wensing AMJ, Lengauer T, Incardona F. Prediction of response to antiretroviral therapy by human experts and by the EuResist data-driven expert system (the EVE study). HIV Med 2010; 12:211-8. [PMID: 20731728 DOI: 10.1111/j.1468-1293.2010.00871.x] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Indexed: 11/30/2022]
Abstract
OBJECTIVES The EuResist expert system is a novel data-driven online system for computing the probability of 8-week success for any given pair of HIV-1 genotype and combination antiretroviral therapy regimen plus optional patient information. The objective of this study was to compare the EuResist system vs. human experts (EVE) for the ability to predict response to treatment. METHODS The EuResist system was compared with 10 HIV-1 drug resistance experts for the ability to predict 8-week response to 25 treatment cases derived from the EuResist database validation data set. All current and past patient data were made available to simulate clinical practice. The experts were asked to provide a qualitative and quantitative estimate of the probability of treatment success. RESULTS There were 15 treatment successes and 10 treatment failures. In the classification task, the number of mislabelled cases was six for EuResist and 6-13 for the human experts [mean±standard deviation (SD) 9.1±1.9]. The accuracy of EuResist was higher than the average for the experts (0.76 vs. 0.64, respectively). The quantitative estimates computed by EuResist were significantly correlated (Pearson r=0.695, P<0.0001) with the mean quantitative estimates provided by the experts. However, the agreement among experts was only moderate (for the classification task, inter-rater κ=0.355; for the quantitative estimation, mean±SD coefficient of variation=55.9±22.4%). CONCLUSIONS With this limited data set, the EuResist engine performed comparably to or better than human experts. The system warrants further investigation as a treatment-decision support tool in clinical practice.
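The accuracy figures quoted in this abstract follow directly from the mislabelled-case counts, as the short Python check below shows. The helper name `accuracy` is introduced here for illustration only; the inputs (25 cases, 6 mislabelled for EuResist, a mean of 9.1 and a range of 6 to 13 for the experts) are taken from the abstract.

```python
def accuracy(n_cases, n_mislabelled):
    """Fraction of cases labelled correctly."""
    return (n_cases - n_mislabelled) / n_cases


N_CASES = 25

euresist = accuracy(N_CASES, 6)        # 19 of 25 cases correct
experts_mean = accuracy(N_CASES, 9.1)  # accuracy at the experts' mean error count
experts_best = accuracy(N_CASES, 6)    # best expert, 6 mislabelled
experts_worst = accuracy(N_CASES, 13)  # worst expert, 13 mislabelled

print(round(euresist, 2), round(experts_mean, 2))
print(round(experts_worst, 2), round(experts_best, 2))
```

Because accuracy is linear in the number of errors, the accuracy at the mean error count equals the mean of the individual expert accuracies. The best expert matches EuResist at 0.76, and the worst reaches 0.48.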
Affiliation(s)
- M Zazzi
- Department of Molecular Biology, University of Siena, Siena, Italy.
27
Weisser H, Altmann A, Sierra S, Incardona F, Struck D, Sönnerborg A, Kaiser R, Zazzi M, Tschochner M, Walter H, Lengauer T. Only slight impact of predicted replicative capacity for therapy response prediction. PLoS One 2010; 5:e9044. [PMID: 20140263 PMCID: PMC2815793 DOI: 10.1371/journal.pone.0009044] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/07/2009] [Accepted: 01/15/2010] [Indexed: 12/23/2022] Open
Abstract
Background Replication capacity (RC) of specific HIV isolates is occasionally blamed for unexpected treatment responses. However, the role of viral RC in response to antiretroviral therapy is not yet fully understood. Materials and Methods We developed a method for predicting RC from genotype using support vector machines (SVMs) trained on about 300 genotype-RC pairs. Next, we studied the impact of predicted viral RC (pRC) on the change of viral load (VL) and CD4+ T-cell count (CD4) during the course of therapy on about 3,000 treatment change episodes (TCEs) extracted from the EuResist integrated database. Specifically, linear regression models using either treatment activity scores (TAS), the drug combination, or pRC or any combination of these covariates were trained to predict change in VL and CD4, respectively. Results The SVM models achieved a Spearman correlation (ρ) of 0.54 between measured RC and pRC. The prediction of change in VL (CD4) was best at 180 (360) days, reaching a correlation of ρ = 0.45 (ρ = 0.27). In general, pRC was inversely correlated to drug resistance at treatment start (on average ρ = −0.38). Inclusion of pRC in the linear regression models significantly improved prediction of virological response to treatment based either on the drug combination or on the TAS (t-test; p-values range from 0.0247 to 4 × 10−6) but not for the model using both TAS and drug combination. For predicting the change in CD4 the improvement derived from inclusion of pRC was not significant. Conclusion Viral RC could be predicted from genotype with moderate accuracy and could slightly improve prediction of virological treatment response. However, the observed improvement could simply be a consequence of the significant correlation between pRC and drug resistance.
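The Spearman correlation (ρ) reported throughout this abstract is the Pearson correlation computed on ranks. The following is a minimal pure-Python sketch of that definition; the function names and the toy measured/predicted values are hypothetical and are not data from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical measured vs. predicted replication capacity values.
measured = [12.0, 35.0, 48.0, 80.0, 95.0]
predicted = [20.0, 30.0, 70.0, 60.0, 90.0]
rho = spearman(measured, predicted)
```

Because only the ordering matters, ρ is insensitive to any monotone rescaling of the predictions, which makes it a natural choice when predicted and measured RC are not on the same scale.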
Affiliation(s)
- Hendrik Weisser
- Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany
- André Altmann
- Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany
- Saleta Sierra
- Institute of Virology, University of Cologne, Cologne, Germany
- Daniel Struck
- Retrovirology Laboratory, CRP-Santé, Strassen, Luxembourg
- Anders Sönnerborg
- Department of Medicine, Division of Infectious Diseases, Karolinska Institute, Stockholm, Sweden
- Rolf Kaiser
- Institute of Virology, University of Cologne, Cologne, Germany
- Maurizio Zazzi
- Department of Molecular Biology, University of Siena, Siena, Italy
- Monika Tschochner
- Institute of Clinical and Molecular Virology, University of Erlangen, Erlangen, Germany
- Hauke Walter
- Institute of Clinical and Molecular Virology, University of Erlangen, Erlangen, Germany
- Thomas Lengauer
- Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany
28
Bozek K, Thielen A, Sierra S, Kaiser R, Lengauer T. V3 loop sequence space analysis suggests different evolutionary patterns of CCR5- and CXCR4-tropic HIV. PLoS One 2009; 4:e7387. [PMID: 19816596 PMCID: PMC2754612 DOI: 10.1371/journal.pone.0007387] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 07/17/2009] [Accepted: 09/18/2009] [Indexed: 11/29/2022] Open
Abstract
The V3 loop of human immunodeficiency virus type 1 (HIV-1) is critical for coreceptor binding and is the main determinant of which of the cellular coreceptors, CCR5 or CXCR4, the virus uses for cell entry. The aim of this study is to provide a large-scale data driven analysis of HIV-1 coreceptor usage with respect to the V3 loop evolution and to characterize CCR5- and CXCR4-tropic viral phenotypes previously studied in small- and medium-scale settings. We use different sequence similarity measures, phylogenetic and clustering methods in order to analyze the distribution in sequence space of roughly 1000 V3 loop sequences and their tropism phenotypes. This analysis affords a means of characterizing those sequences that are misclassified by several sequence-based coreceptor prediction methods, as well as predicting the coreceptor using the location of the sequence in sequence space and of relating this location to the CD4+ T-cell count of the patient. We support previous findings that the usage of CCR5 is correlated with relatively high sequence conservation whereas CXCR4-tropic viruses spread over larger regions in sequence space. The incorrectly predicted sequences are mostly located in regions in which their phenotype represents the minority or in close vicinity of regions dominated by the opposite phenotype. Nevertheless, the location of the sequence in sequence space can be used to improve the accuracy of the prediction of the coreceptor usage. Sequences from patients with high CD4+ T-cell counts are relatively highly conserved as compared to those of immunosuppressed patients. Our study thus supports hypotheses of an association of immune system depletion with an increase in V3 loop sequence variability and with the escape of the viral sequence to distant parts of the sequence space.
|
29
|
Thompson IR, Bidgood P, Petróczi A, Denholm-Price JCW, Fielder MD. An alternative methodology for the prediction of adherence to anti HIV treatment. AIDS Res Ther 2009; 6:9. [PMID: 19486507 PMCID: PMC2698819 DOI: 10.1186/1742-6405-6-9] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 10/10/2008] [Accepted: 06/01/2009] [Indexed: 11/23/2022] Open
Abstract
Background: Successful treatment of HIV-positive patients is fundamental to controlling the progression to AIDS. Treatment failure is caused by drug resistance, insufficient drug levels in the blood, or both. Severe side effects, coupled with the intense nature of many regimens, can lead to treatment fatigue and consequently to periodic or permanent non-adherence. Although non-adherence is a recognised problem in HIV treatment, it is still poorly detected in both clinical practice and research, and its detection is often based on unreliable information such as self-reports or, in a research setting, Medication Events Monitoring System caps or prescription refill rates. To meet the need for objective information on adherence, we propose a method using viral load and HIV genome sequence data to identify non-adherence amongst patients.
Presentation of the hypothesis: With non-adherence operationally defined as a sharp increase in viral load in the absence of mutation, it is hypothesised that periods of non-adherence can be identified retrospectively based on the observed relationship between changes in viral load and mutation.
Testing the hypothesis: Spikes in the viral load (VL) can be identified from time periods over which VL rises above the undetectable level to a point at which the VL decreases by a threshold amount. The presence of mutations can be established by comparing each sequence to a reference sequence and by comparing sequences in pairs taken sequentially in time, in order to identify changes within the sequences at or around 'treatment change events'. Observed spikes in VL measurements without mutation in the corresponding sequence data then serve as a proxy indicator of non-adherence.
Implications of the hypothesis: It is envisaged that the validation of the hypothesised approach will serve as a first step on the road to clinical practice. The information inferred from clinical data on adherence would be a crucially important feature of treatment prediction tools provided for practitioners to aid daily practice. In addition, distinct characteristics of biological markers routinely used to assess the state of the disease may be identified in the adherent and non-adherent groups. This latter approach would directly help clinicians to differentiate between non-responding and non-adherent patients.
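The hypothesised procedure (find viral-load spikes, then check whether the sequences around each spike changed) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the detection limit of 50 copies/mL, the 50% drop that closes a spike, the `Visit` record, and all function names are hypothetical, and only the pairwise comparison of sequential sequences is shown (comparison against a reference sequence is omitted).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed values for illustration only; the abstract does not specify them.
UNDETECTABLE = 50.0   # hypothetical detection limit, copies/mL
DROP_FRACTION = 0.5   # hypothetical: a spike ends once VL falls 50% below its peak


@dataclass
class Visit:
    day: int
    viral_load: float
    sequence: Optional[str] = None  # aligned genotype, if one was taken


def find_spikes(visits: List[Visit]) -> List[Tuple[int, int]]:
    """Return (start, end) visit indices for each viral-load spike."""
    spikes = []
    start = None
    peak = 0.0
    for i, visit in enumerate(visits):
        if start is None:
            # A spike starts when VL rises from undetectable to detectable.
            rose = i > 0 and visits[i - 1].viral_load <= UNDETECTABLE
            if rose and visit.viral_load > UNDETECTABLE:
                start, peak = i, visit.viral_load
        else:
            peak = max(peak, visit.viral_load)
            if visit.viral_load <= peak * (1.0 - DROP_FRACTION):
                spikes.append((start, i))
                start = None
    return spikes


def has_mutation(a: str, b: str) -> bool:
    """True if two aligned sequences differ anywhere."""
    return len(a) != len(b) or any(x != y for x, y in zip(a, b))


def flag_non_adherence(visits: List[Visit]) -> List[Tuple[int, int]]:
    """Return (start_day, end_day) of spikes with no sequence change."""
    flagged = []
    for start, end in find_spikes(visits):
        before = [v.sequence for v in visits[:start] if v.sequence]
        during = [v.sequence for v in visits[start:end + 1] if v.sequence]
        seqs = before[-1:] + during  # last pre-spike genotype, then in-spike ones
        if len(seqs) < 2:
            continue  # not enough sequence data to assess this spike
        if not any(has_mutation(a, b) for a, b in zip(seqs, seqs[1:])):
            flagged.append((visits[start].day, visits[end].day))
    return flagged


# Hypothetical patient history: the first spike shows no mutation, the second does.
EXAMPLE_VISITS = [
    Visit(0, 40, "ACGT"),
    Visit(30, 5000, "ACGT"),
    Visit(60, 1000),
    Visit(90, 40),
    Visit(120, 8000, "ACTT"),
    Visit(150, 45),
]

if __name__ == "__main__":
    print(flag_non_adherence(EXAMPLE_VISITS))
```

Spikes without at least two comparable sequences are skipped rather than flagged, since the proxy relies on observing the absence of mutation.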
|
30
|
Prosperi MCF, Altmann A, Rosen-Zvi M, Aharoni E, Borgulya G, Bazso F, Sönnerborg A, Schülter E, Struck D, Ulivi G, Vandamme AM, Vercauteren J, Zazzi M. Investigation of expert rule bases, logistic regression, and non-linear machine learning techniques for predicting response to antiretroviral treatment. Antivir Ther 2009. [DOI: 10.1177/135965350901400315] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 11/15/2022]
Abstract
Background: The extreme flexibility of the HIV type-1 (HIV-1) genome makes it challenging to build the ideal antiretroviral treatment regimen. Interpretation of HIV-1 genotypic drug resistance is evolving from rule-based systems guided by expert opinion to data-driven engines developed through machine learning methods.
Methods: The aim of the study was to investigate linear and non-linear statistical learning models for classifying short-term virological outcome of antiretroviral treatment. To optimize the model, different feature selection methods were considered. Robust extra-sample error estimation and different loss functions were used to assess model performance. The results were compared with widely used rule-based genotypic interpretation systems (Stanford HIVdb, Rega and ANRS).
Results: A set of 3,143 treatment change episodes were extracted from the EuResist database. The dataset included patient demographics, treatment history and viral genotypes. A logistic regression model using high order interaction variables performed better than rule-based genotypic interpretation systems (accuracy 75.63% versus 71.74–73.89%, area under the receiver operating characteristic curve [AUC] 0.76 versus 0.68–0.70) and was equivalent to a random forest model (accuracy 76.16%, AUC 0.77). However, when rule-based genotypic interpretation systems were coupled with additional patient attributes, and the combination was provided as input to the logistic regression model, the performance increased significantly, becoming comparable to the fully data-driven methods.
Conclusions: Patient-derived supplementary features significantly improved the accuracy of the prediction of response to treatment, both with rule-based and data-driven interpretation systems. Fully data-driven models derived from large-scale data sources show promise as antiretroviral treatment decision support tools.
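The two metrics used to compare the models, accuracy and AUC, can be computed as in the sketch below. It assumes binary outcome labels (1 = response, 0 = failure) and a predicted score per treatment change episode; the function names, the 0.5 threshold, and the example numbers are illustrative and are not data from the study.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of episodes whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen responder receives a
    higher score than a randomly chosen non-responder (ties count 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one episode of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for six episodes (not study data).
EXAMPLE_LABELS = [1, 1, 1, 0, 0, 0]
EXAMPLE_SCORES = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

if __name__ == "__main__":
    print(accuracy(EXAMPLE_LABELS, EXAMPLE_SCORES))
    print(auc(EXAMPLE_LABELS, EXAMPLE_SCORES))
```

Accuracy depends on the chosen threshold, whereas AUC summarises ranking quality across all thresholds, which is why abstracts of this kind typically report both.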
Affiliation(s)
- Mattia CF Prosperi
- Computer Science and Automation Department, Roma Tre University, Rome, Italy
- Informa, Rome, Italy
- Andre Altmann
- Max Planck Institute for Informatics, Saarbrücken, Germany
- Gabor Borgulya
- KFKI Research Institute for Particle and Nuclear Physics of the Hungarian Academy of Sciences, Budapest, Hungary
- Fulop Bazso
- KFKI Research Institute for Particle and Nuclear Physics of the Hungarian Academy of Sciences, Budapest, Hungary
- Daniel Struck
- Centre de Recherche Public-Santé, Luxembourg, Luxembourg
- Giovanni Ulivi
- Computer Science and Automation Department, Roma Tre University, Rome, Italy
|
31
|
Altmann A, Sing T, Vermeiren H, Winters B, Craenenbroeck EV, Van der Borght K, Rhee SY, Shafer RW, Schülter E, Kaiser R, Peres Y, Sönnerborg A, Fessel WJ, Incardona F, Zazzi M, Bacheler L, Vlijmen HV, Lengauer T. Advantages of predicted phenotypes and statistical learning models in inferring virological response to antiretroviral therapy from HIV genotype. Antivir Ther 2009. [DOI: 10.1177/135965350901400201] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/16/2022]
Abstract
Background: Inferring response to antiretroviral therapy from the viral genotype alone is challenging. The utility of an intermediate step of predicting in vitro drug susceptibility is currently controversial. Here, we provide a retrospective comparison of approaches using either genotype or predicted phenotypes alone, or in combination.
Methods: Treatment change episodes were extracted from two large databases from the USA (Stanford-California) and Europe (EuResistDB) comprising data from 6,706 and 13,811 patients, respectively. Response to antiretroviral treatment was dichotomized according to two definitions. Using the viral sequence and the treatment regimen as input, three expert algorithms (ANRS, Rega and HIVdb) were used to generate genotype-based encodings and VircoTYPE™ 4.0 (Virco BVBA, Mechelen, Belgium) was used to generate a predicted phenotype-based encoding. Single drug classifications were combined into a treatment score via simple summation and statistical learning using random forests. Classification performance was studied on Stanford-California data using cross-validation and, in addition, on the independent EuResistDB data.
Results: In all experiments, predicted phenotype was among the most sensitive approaches. Combining single drug classifications by statistical learning was significantly superior to unweighted summation (P<2.2×10⁻¹⁶). Classification performance could be increased further by combining predicted phenotypes and expert encodings but not by combinations of expert encodings alone. These results were confirmed on an independent test set comprising data solely from EuResistDB.
Conclusions: This study demonstrates consistent performance advantages in utilizing predicted phenotype in most scenarios over methods based on genotype alone in inferring virological response. Moreover, all approaches under study benefit significantly from statistical learning for merging single drug classifications into treatment scores.
Affiliation(s)
- André Altmann
- Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany
- Tobias Sing
- Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany
- Soo-Yon Rhee
- Division of Infectious Diseases, Stanford University, Stanford, CA, USA
- Robert W Shafer
- Division of Infectious Diseases, Stanford University, Stanford, CA, USA
- Eugen Schülter
- Institute of Virology, University of Cologne, Cologne, Germany
- Rolf Kaiser
- Institute of Virology, University of Cologne, Cologne, Germany
- Yardena Peres
- Health Care and Life Sciences Group, IBM Research, Haifa, Israel
- Anders Sönnerborg
- Division of Infectious Diseases, Karolinska Institute, Stockholm, Sweden
- Maurizio Zazzi
- Department of Molecular Biology, University of Siena, Siena, Italy
- Thomas Lengauer
- Computational Biology and Applied Algorithmics, Max Planck Institute for Informatics, Saarbrücken, Germany
|