1
|
EGGNet, a Generalizable Geometric Deep Learning Framework for Protein Complex Pose Scoring. ACS OMEGA 2024; 9:7471-7479. [PMID: 38405499 PMCID: PMC10882658 DOI: 10.1021/acsomega.3c04889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Revised: 01/19/2024] [Accepted: 01/23/2024] [Indexed: 02/27/2024]
Abstract
Computational prediction of molecule-protein interactions has been key for developing new molecules to interact with a target protein for therapeutics development. Previous work includes two independent streams of approaches: (1) predicting protein-protein interactions (PPIs) between naturally occurring proteins and (2) predicting binding affinities between proteins and small-molecule ligands [also known as drug-target interaction (DTI)]. Studying the two problems in isolation has limited the ability of these computational models to generalize across the PPI and DTI tasks, both of which ultimately involve noncovalent interactions with a protein target. In this work, we developed Equivariant Graph of Graphs neural Network (EGGNet), a geometric deep learning (GDL) framework, for molecule-protein binding predictions that can handle three types of molecules for interacting with a target protein: (1) small molecules, (2) synthetic peptides, and (3) natural proteins. EGGNet leverages a graph of graphs (GoG) representation constructed from the molecular structures at atomic resolution and utilizes a multiresolution equivariant graph neural network to learn from such representations. In addition, EGGNet leverages the underlying biophysics and makes use of both atom- and residue-level interactions, which improve EGGNet's ability to rank candidate poses from blind docking. EGGNet achieves competitive performance on both a public protein-small-molecule binding affinity prediction task (80.2% top 1 success rate on CASF-2016) and a synthetic protein interface prediction task (88.4% area under the precision-recall curve). We envision that the proposed GDL framework can generalize to many other protein interaction prediction problems, such as binding site prediction and molecular docking, helping accelerate protein engineering and structure-based drug development.
|
2
|
IDMIL: an alignment-free Interpretable Deep Multiple Instance Learning (MIL) for predicting disease from whole-metagenomic data. Bioinformatics 2021; 36:i39-i47. [PMID: 32657370 PMCID: PMC7355246 DOI: 10.1093/bioinformatics/btaa477] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/22/2022]
Abstract
Motivation: The human body hosts more microbial organisms than human cells. Analysis of this microbial diversity provides key insight into the role these microorganisms play in human health. Metagenomics is the collective DNA sequencing of coexisting microbial organisms in an environmental sample or a host. It has several applications in precision medicine, agriculture, environmental science and forensics. State-of-the-art predictive models for phenotype prediction from metagenomic data rely on alignments, assembly, extensive pruning, taxonomic profiling and reference sequence databases. These processes are time-consuming, and they do not consider novel microbial sequences when aligned with the reference genome, limiting the potential of whole metagenomics. We formulate the problem of predicting human disease from whole-metagenomic data using Multiple Instance Learning (MIL), a popular supervised learning paradigm. Our proposed alignment-free approach provides higher prediction accuracy by harnessing the capability of a deep convolutional neural network (CNN) within a MIL framework and provides interpretability via a neural attention mechanism. Results: The MIL formulation combined with the hierarchical feature extraction capability of the deep CNN provides significantly better predictive performance than popular existing approaches. The attention mechanism allows for the identification of groups of sequences that are likely to be correlated with diseases, providing much-needed interpretation. Our proposed approach does not rely on alignment, assembly or reference sequence databases, making it fast and scalable for large-scale metagenomic data. We evaluate our method on well-known large-scale metagenomic studies and show that it outperforms comparable state-of-the-art methods for disease prediction. Availability and implementation: https://github.com/mrahma23/IDMIL.
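As a rough illustration of the attention idea in the abstract above, the following minimal Python sketch pools a bag of instance embeddings into a single bag embedding using softmax attention weights. It is not the IDMIL implementation: the function name `attention_mil_pool`, the projection `V`, the scoring vector `w`, and the toy numbers are assumptions made for illustration, and the CNN feature extractor is left out entirely.

```python
import math

def attention_mil_pool(instances, V, w):
    """Pool a bag of instance embeddings into one bag embedding.

    instances: list of n vectors of length d (one per sequence read).
    V: list of h projection vectors of length d (hypothetical parameters).
    w: list of h scoring weights (hypothetical parameters).
    Returns (bag_embedding, attention_weights).
    """
    scores = []
    for x in instances:
        # Hidden representation: tanh of each projection of the instance.
        hidden = [math.tanh(sum(xi * vi for xi, vi in zip(x, col))) for col in V]
        scores.append(sum(h * wj for h, wj in zip(hidden, w)))
    # Numerically stable softmax over the instances in the bag.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Bag embedding is the attention-weighted sum of the instances.
    dim = len(instances[0])
    bag = [sum(a * x[k] for a, x in zip(weights, instances)) for k in range(dim)]
    return bag, weights

# Toy bag with three two-dimensional instance embeddings.
toy_bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_V = [[0.5, -0.5], [1.0, 1.0]]
toy_w = [1.0, 2.0]
bag_embedding, attention = attention_mil_pool(toy_bag, toy_V, toy_w)
```

Instances with larger attention weights contribute more to the bag-level prediction, which is the sense in which such weights can point to groups of sequences worth inspecting.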
|
3
|
Phenotype Prediction from Metagenomic Data Using Clustering and Assembly with Multiple Instance Learning (CAMIL). IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:828-840. [PMID: 28981422 DOI: 10.1109/tcbb.2017.2758782] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 06/07/2023]
Abstract
The recent advent of Metagenome Wide Association Studies (MGWAS) provides insight into the role of microbes on human health and disease. However, the studies present several computational challenges. In this paper, we demonstrate a novel, efficient, and effective Multiple Instance Learning (MIL) based computational pipeline to predict patient phenotype from metagenomic data. MIL methods have the advantage that besides predicting the clinical phenotype, we can infer the instance level label or role of microbial sequence reads in the specific disease. Specifically, we use a Bag of Words method, which has been shown to be one of the most effective and efficient MIL methods. This involves assembly of the metagenomic sequence data, clustering of the assembled contigs, extracting features from the contigs, and using an SVM classifier to predict patient labels and identify the most relevant sequence clusters. With the exception of the given labels for the patients, this entire process is de novo (unsupervised). We call our pipeline "CAMIL", which stands for Clustering and Assembly with Multiple Instance Learning. We use multiple state-of-the-art clustering methods for feature extraction, evaluation, and comparison of the performance of our proposed approach for each of these clustering methods. We also present a fast and scalable pre-clustering algorithm as a preprocessing step for our proposed pipeline. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using locality sensitive hashing (LSH). These canopies are then refined by using state-of-the-art sequence clustering algorithms. We use data from a well-known MGWAS study of patients with Type-2 Diabetes and show that our pipeline significantly outperforms the classifier used in that paper, as well as other common MIL methods.
|
4
|
Improving large-scale hierarchical classification by rewiring: a data-driven filter based approach. J Intell Inf Syst 2018. [DOI: 10.1007/s10844-018-0509-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/30/2022]
|
5
|
Abstract
Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across various clinical and ecological environments. Because community sequencing yields a large volume of random short sequences (reads), analyzing the diversity, abundance and functions of the different organisms within these communities is challenging. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined by using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTUs) and observe significant speedup in run time when compared to different clustering algorithms. We also make our source code publicly available on GitHub.
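The canopy step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say which three hashing schemes were compared, so a simple MinHash over k-mers is swapped in here, and the names `kmers`, `minhash_signature` and `build_canopies`, the k-mer length, and the number of hash functions are all hypothetical. Each resulting canopy would then be handed to a more expensive sequence clustering method for refinement.

```python
import hashlib
from collections import defaultdict

def kmers(read, k):
    """Return the set of k-mers in a read (the whole read if it is shorter than k)."""
    if len(read) < k:
        return {read}
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def minhash_signature(read, k=4, num_hashes=2):
    """MinHash signature: for each seeded hash, keep the smallest k-mer hash."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{kmer}".encode()).hexdigest(), 16)
            for kmer in kmers(read, k)
        ))
    return tuple(signature)

def build_canopies(reads, k=4, num_hashes=2):
    """Group reads whose signatures match exactly into the same canopy."""
    canopies = defaultdict(list)
    for read in reads:
        canopies[minhash_signature(read, k, num_hashes)].append(read)
    return list(canopies.values())

# Toy reads: the two identical reads fall into one canopy.
toy_reads = ["ACGTACGTAC", "ACGTACGTAC", "TTTTGGGGCC"]
toy_canopies = build_canopies(toy_reads)
```

Using more hash functions makes canopies smaller and purer, while fewer hash functions make them larger and cheaper to build; that trade-off is a design choice of this sketch, not a result from the paper.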
|
6
|
HierFlat: flattened hierarchies for improving top-down hierarchical classification. INTERNATIONAL JOURNAL OF DATA SCIENCE AND ANALYTICS 2017. [DOI: 10.1007/s41060-017-0070-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 10/18/2022]
|
7
|
Real-time, ultrasound-based control of a virtual hand by a trans-radial amputee. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2017; 2016:3219-3222. [PMID: 28268993 DOI: 10.1109/embc.2016.7591414] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 11/10/2022]
Abstract
Advancements in multiarticulate upper-limb prosthetics have outpaced the development of intuitive, non-invasive control mechanisms for implementing them. Surface electromyography is currently the most popular non-invasive control method, but presents a number of drawbacks including poor deep-muscle specificity. Previous research established the viability of ultrasound imaging as an alternative means of decoding movement intent, and demonstrated the ability to distinguish between complex grasps in able-bodied subjects via imaging of the anterior forearm musculature. In order to translate this work to clinical viability, able-bodied testing is insufficient. Amputation-induced changes in muscular geometry, dynamics, and imaging characteristics are all likely to influence the effectiveness of our existing techniques. In this work, we conducted preliminary trials with a transradial amputee participant to assess these effects, and potentially elucidate necessary refinements to our approach. Two trials were performed, the first using a set of three motion types, and the second using four. After a brief training period in each trial, the participant was able to control a virtual prosthetic hand in real-time; attempted grasps were successfully classified with a rate of 77% in trial 1, and 71% in trial 2. While the results are sub-optimal compared to our previous able-bodied testing, they are a promising step forward. More importantly, the data collected during these trials can provide valuable information for refining our image processing methods, especially via comparison to previously acquired data from able-bodied individuals. Ultimately, further work with amputees is a necessity for translation towards clinical application.
|
8
|
Classification methods for the analysis of LH-PCR data associated with inflammatory bowel disease patients. Int J Bioinform Res Appl 2015; 11:111-29. [PMID: 25786791 DOI: 10.1504/ijbra.2015.068087] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 11/21/2022]
Abstract
The human gut is one of the most densely populated microbial communities in the world. The interaction of microbes with human host cells is responsible for several disease conditions and is critical to human health. It is imperative to understand the relationships between these microbial communities within the human gut and their roles in disease. In this study we analyse the microbial communities within the human gut and their role in Inflammatory Bowel Disease (IBD). The bacterial communities were interrogated using Length Heterogeneity PCR (LH-PCR) fingerprinting of mucosal and luminal associated microbial communities for a class of healthy and diseased patients.
|
9
|
Predicting Protein Function Using Multiple Kernels. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:219-233. [PMID: 26357091 DOI: 10.1109/tcbb.2014.2351821] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Indexed: 06/05/2023]
Abstract
High-throughput experimental techniques provide a wide variety of heterogeneous proteomic data sources. To exploit the information spread across multiple sources for protein function prediction, these data sources are transformed into kernels and then integrated into a composite kernel. Several methods first optimize the weights on these kernels to produce a composite kernel, and then train a classifier on the composite kernel. As such, these approaches result in an optimal composite kernel, but not necessarily in an optimal classifier. On the other hand, some approaches optimize the loss of binary classifiers and learn weights for the different kernels iteratively. For multi-class or multi-label data, these methods have to optimize the kernel weights separately for each label, which is computationally expensive and ignores the correlation among labels. In this paper, we propose a method called Predicting Protein Function using Multiple Kernels (ProMK). ProMK iteratively alternates between learning optimal kernel weights and reducing the empirical loss of a multi-label classifier for all labels simultaneously. ProMK can integrate kernels selectively and downgrade the weights on noisy kernels. We investigate the performance of ProMK on several publicly available protein function prediction benchmarks and synthetic datasets. We show that the proposed approach performs better than previously proposed protein function prediction approaches that integrate multiple data sources and multi-label multiple kernel learning methods. The code for our proposed method is available at https://sites.google.com/site/guoxian85/promk.
|
10
|
Classifying Protein Sequences Using Regularized Multi-Task Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:1087-1098. [PMID: 26357046 DOI: 10.1109/tcbb.2014.2338303] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
Abstract
Classification problems in which several learning tasks are organized hierarchically pose a special challenge because the hierarchical structure of the problems needs to be considered. Multi-task learning (MTL) provides a framework for dealing with such interrelated learning tasks. When two different hierarchical sources organize similar information, in principle, this combined knowledge can be exploited to further improve classification performance. We have studied this problem in the context of protein structure classification by integrating the learning process for two hierarchical protein structure classification databases, SCOP and CATH. Our goal is to accurately predict whether a given protein belongs to a particular class in these hierarchies using only the amino acid sequences. We have utilized recent developments in multi-task learning to solve the interrelated classification problems. We have also evaluated how the various relationships between tasks affect the classification performance. Our evaluations show that learning schemes in which both classification databases are used outperform the schemes that utilize only one of them.
|
11
|
Guest Editorial for Special Section on BIOKDD2013. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:773-774. [PMID: 26605393 DOI: 10.1109/tcbb.2014.2348731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/05/2023]
|
12
|
Protein Function Prediction with Incomplete Annotations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:579-591. [PMID: 26356025 DOI: 10.1109/tcbb.2013.142] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 06/05/2023]
Abstract
Automated protein function prediction is one of the grand challenges in computational biology. Multi-label learning is widely used to predict functions of proteins. Most multi-label learning methods make predictions for unlabeled proteins under the assumption that the labeled proteins are completely annotated, i.e., without any missing functions. However, in practice, we may have only a subset of the ground-truth functions for a protein, and whether the protein has other functions is unknown. To predict protein functions with incomplete annotations, we propose a Protein Function Prediction method with Weak-label Learning (ProWL) and its variant ProWL-IF. Both ProWL and ProWL-IF can replenish the missing functions of proteins. In addition, ProWL-IF makes use of the knowledge that a protein cannot have certain functions, which can further boost the performance of protein function prediction. Our experimental results on protein-protein interaction networks and gene expression benchmarks validate the effectiveness of both ProWL and ProWL-IF.
|
13
|
Erratum to "Protein Function Prediction Using Multilabel Ensemble Classification". IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2014; 11:265. [PMID: 26355524 DOI: 10.1109/tcbb.2014.2299736] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/05/2023]
|
14
|
16S rRNA metagenome clustering and diversity estimation using locality sensitive hashing. BMC SYSTEMS BIOLOGY 2013; 7 Suppl 4:S11. [PMID: 24565031 PMCID: PMC3854655 DOI: 10.1186/1752-0509-7-s4-s11] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 11/24/2022]
Abstract
Background: Advances in biotechnology have changed the manner of characterizing large populations of microbial communities that are ubiquitous across several environments. "Metagenome" sequencing involves decoding the DNA of organisms co-existing within ecosystems ranging from ocean and soil to the human body. Several researchers are interested in metagenomics because it provides insight into the complex biodiversity across several environments. Clinicians are using metagenomics to determine the role played by collections of microbial organisms within the human body with respect to human health, wellness and disease. Results: We have developed an efficient and scalable species richness estimation algorithm that uses locality sensitive hashing (LSH). Our algorithm achieves efficiency by approximating the pairwise sequence comparison operations using hashing, and it also incorporates a criterion that matches fixed-length, gapless subsequences to improve the quality of sequence comparisons. We use an LSH-based similarity function to cluster similar sequences into individual groups, called operational taxonomic units (OTUs). We also compute different species diversity/richness metrics from the OTU assignment results to further extend our analysis. Conclusion: The algorithm is evaluated on synthetic samples and eight targeted 16S rRNA metagenome samples taken from seawater. We compare the performance of our algorithm with several competing diversity estimation algorithms. We show the benefits of our approach with respect to computational runtime and meaningful OTU assignments. We also demonstrate the practical significance of the developed algorithm by comparing bacterial diversity and structure across different skin locations. Website: http://www.cs.gmu.edu/~mlbio/LSH-DIV
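To make the step from OTU assignments to diversity and richness metrics concrete, here is a minimal Python sketch. The abstract does not name the metrics that were computed, so the Shannon index and the bias-corrected Chao1 estimator are used here only as common examples; the function names and the toy OTU labels are hypothetical.

```python
import math
from collections import Counter

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over OTU relative abundances."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def chao1(counts):
    """Bias-corrected Chao1 richness: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of OTUs seen exactly once and exactly twice.
    """
    observed = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    doubletons = sum(1 for c in counts.values() if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))

# Toy OTU assignment: one label per sequence read.
toy_assignments = ["otu1"] * 4 + ["otu2"] * 2 + ["otu3", "otu4"]
toy_counts = Counter(toy_assignments)
toy_shannon = shannon_index(toy_counts)
toy_chao1 = chao1(toy_counts)
```

Both functions take only the per-OTU read counts, so they can be applied to the output of any clustering method that assigns each read to an OTU.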
|
15
|
Novel Method for Predicting Dexterous Individual Finger Movements by Imaging Muscle Activity Using a Wearable Ultrasonic System. IEEE Trans Neural Syst Rehabil Eng 2013; 22:69-76. [PMID: 23996580 DOI: 10.1109/tnsre.2013.2274657] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Indexed: 11/08/2022]
Abstract
Recently there have been major advances in the electro-mechanical design of upper extremity prosthetics. However, the development of control strategies for such prosthetics has lagged significantly behind. Conventional noninvasive myoelectric control strategies rely on the amplitude of electromyography (EMG) signals from flexor and extensor muscles in the forearm. Surface EMG has limited specificity for deep contiguous muscles because of cross talk and cannot reliably differentiate between individual digit and joint motions. We present a novel ultrasound imaging based control strategy for upper arm prosthetics that can overcome many of the limitations of myoelectric control. Real-time ultrasound images of the forearm muscles were obtained using a wearable, mechanically scanned, single-element ultrasound system, and analyzed to create maps of muscle activity based on changes in the ultrasound echogenicity of the muscle during contraction. Individual digit movements were associated with unique maps of activity. These maps were correlated with previously acquired training data to classify individual digit movements. Preliminary results using ten healthy volunteers demonstrated that this approach could provide robust classification of individual finger movements with 98% accuracy (precision 96%-100% and recall 97%-100% for individual finger flexions). The change in ultrasound echogenicity was found to be proportional to the digit flexion speed (R² = 0.9), and thus our proposed strategy provided a proportional signal that can be used for fine control. We anticipate that ultrasound imaging based control strategies could be a significant improvement over conventional myoelectric control of prosthetics.
|
16
|
Protein function prediction using multilabel ensemble classification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2013; 10:1045-1057. [PMID: 24334396 DOI: 10.1109/tcbb.2013.111] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 06/03/2023]
Abstract
High-throughput experimental techniques produce several kinds of heterogeneous proteomic and genomic data sets. To computationally annotate proteins, it is necessary and promising to integrate these heterogeneous data sources. Some methods transform these data sources into different kernels or feature representations. Next, these kernels are linearly (or nonlinearly) combined into a composite kernel. The composite kernel is utilized to develop a predictive model to infer the function of proteins. A protein can have multiple roles and functions (or labels). Therefore, multilabel learning methods are also adapted for protein function prediction. We develop a transductive multilabel classifier (TMC) to predict multiple functions of proteins using several unlabeled proteins. We also propose a method called transductive multilabel ensemble classifier (TMEC) for integrating the different data sources using an ensemble approach. The TMEC trains a graph-based multilabel classifier on each single data source, and then combines the predictions of the individual classifiers. We use a directed birelational graph to capture the relationships between pairs of proteins, between pairs of functions, and between proteins and functions. We evaluate the effectiveness of the TMC and TMEC to predict the functions of proteins on three benchmarks. We show that our approaches perform better than recently proposed protein function prediction methods on composite and multiple kernels. The code, data sets used in this paper and supplemental material are available at https://sites.google.com/site/guoxian85/tmec.
|
17
|
|
18
|
Modulation of the metabiome by rifaximin in patients with cirrhosis and minimal hepatic encephalopathy. PLoS One 2013; 8:e60042. [PMID: 23565181 PMCID: PMC3615021 DOI: 10.1371/journal.pone.0060042] [Citation(s) in RCA: 302] [Impact Index Per Article: 27.5] [Received: 12/10/2012] [Accepted: 02/19/2013] [Indexed: 12/12/2022]
Abstract
Hepatic encephalopathy (HE) represents a dysfunctional gut-liver-brain axis in cirrhosis which can negatively impact outcomes. This altered gut-brain relationship has been treated using gut-selective antibiotics such as rifaximin, which improve cognitive function in HE, especially its subclinical form, minimal HE (MHE). However, the precise mechanism of action of rifaximin in MHE is unclear. We hypothesized that modulation of gut microbiota and their end-products by rifaximin would affect the gut-brain axis and improve cognitive performance in cirrhosis. Aim: To perform a systems biology analysis of the microbiome, metabolome and cognitive change after rifaximin in MHE. Methods: Twenty cirrhotics with MHE underwent cognitive testing, endotoxin analysis, urine/serum metabolomics (GC and LC-MS) and fecal microbiome assessment (multi-tagged pyrosequencing) at baseline and 8 weeks post-rifaximin 550 mg BID. Changes in cognition, endotoxin, serum/urine metabolites and microbiome were analyzed using recommended systems biology techniques. Specifically, correlation networks between microbiota and metabolome were analyzed before and after rifaximin. Results: There was a significant improvement in cognition (six of seven tests improved, p<0.01) and endotoxemia (0.55 to 0.48 Eu/ml, p = 0.02) after rifaximin. There was a significant increase in serum saturated (myristic, caprylic, palmitic, palmitoleic, oleic and eicosanoic) and unsaturated (linoleic, linolenic, gamma-linolenic and arachidonic) fatty acids post-rifaximin. No significant microbial change apart from a modest decrease in Veillonellaceae and increase in Eubacteriaceae was observed. Rifaximin resulted in a significant reduction in network connectivity and clustering on the correlation networks. The networks centered on Enterobacteriaceae, Porphyromonadaceae and Bacteroidaceae indicated a shift from pathogenic to beneficial metabolite linkages and better cognition, while those centered on autochthonous taxa remained similar. Conclusions: Rifaximin is associated with improved cognitive function and endotoxemia in MHE, which is accompanied by alteration of gut bacterial linkages with metabolites without significant change in microbial abundance. Trial registration: ClinicalTrials.gov NCT01069133.
|
19
|
Abstract
Several studies indicate the importance of colonic microbiota in metabolic and inflammatory disorders and the importance of diet for microbiota composition. The effects of alcohol, one of the prominent components of diet, on colonic bacterial composition are largely unknown. Mounting evidence suggests that gut-derived bacterial endotoxins are cofactors for alcohol-induced tissue injury and organ failure like alcoholic liver disease (ALD) that only occur in a subset of alcoholics. We hypothesized that chronic alcohol consumption results in alterations of the gut microbiome in a subgroup of alcoholics, and that this may be responsible for the observed inflammatory state and endotoxemia in alcoholics. Thus we interrogated the mucosa-associated colonic microbiome in 48 alcoholics with and without ALD as well as 18 healthy subjects. Colonic biopsy samples from subjects were analyzed for microbiota composition using length heterogeneity PCR fingerprinting and multitag pyrosequencing. A subgroup of alcoholics have an altered colonic microbiome (dysbiosis). The alcoholics with dysbiosis had lower median abundances of Bacteroidetes and higher ones of Proteobacteria. The observed alterations appear to correlate with high levels of serum endotoxin in a subset of the samples. Network topology analysis indicated that alcohol use is correlated with decreased connectivity of the microbial network, and this alteration is seen even after an extended period of sobriety. We show that the colonic mucosa-associated bacterial microbiome is altered in a subset of alcoholics. The altered microbiota composition is persistent and correlates with endotoxemia in a subgroup of alcoholics.
|
20
|
Abstract
Background: Metagenomic assembly is a challenging problem due to the presence of genetic material from multiple organisms. The problem becomes even more difficult when short reads produced by next generation sequencing technologies are used. Although whole genome assemblers are not designed to assemble metagenomic samples, they are being used for metagenomics due to the lack of assemblers capable of dealing with metagenomic samples. We present an evaluation of assembly of simulated short-read metagenomic samples using a state-of-the-art de Bruijn graph based assembler. Results: We assembled simulated metagenomic reads from datasets of various complexities using a state-of-the-art de Bruijn graph based parallel assembler. We have also studied the effect of the k-mer size used in the de Bruijn graph on metagenomic assembly and developed a clustering solution to pool the contigs obtained from different assembly runs, which allowed us to obtain longer contigs. We have also assessed the degree of chimericity of the assembled contigs using an entropy/impurity metric and compared the metagenomic assemblies to assemblies of isolated individual source genomes. Conclusions: Our results show that the accuracy of the assembled contigs was better than expected for the metagenomic samples with a few dominant organisms and was especially poor in samples containing many closely related strains. Clustering contigs from different k-mer parameters of the de Bruijn graph allowed us to obtain longer contigs; however, the clustering resulted in accumulation of erroneous contigs, thus increasing the error rate in clustered contigs.
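For readers unfamiliar with the data structure, the following minimal Python sketch shows how a de Bruijn graph is built from reads for a given k-mer size. It is a textbook construction, not the parallel assembler evaluated in the study; the function name `de_bruijn_graph` and the toy read are assumptions for illustration.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers.

    Each k-mer in a read adds one edge from its prefix (first k-1 bases)
    to its suffix (last k-1 bases). Repeated k-mers add repeated edges.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)

# Toy example with one read and k = 3.
toy_graph = de_bruijn_graph(["ACGTC"], k=3)
```

A smaller k links more reads together but makes repeats and closely related strains harder to separate, while a larger k does the opposite; this is the general trade-off behind studying the effect of k-mer size.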
|
21
|
Abstract
The diagnostic potential and health implications of volatile organic compounds (VOCs) present in human feces have begun to receive considerable attention. Headspace solid-phase microextraction (SPME) has greatly facilitated the isolation and analysis of VOCs from human feces. Pioneering human fecal VOC metabolomic investigations have utilized a single SPME fiber type for analyte extraction and analysis. However, we hypothesized that the multifarious nature of metabolites present in human feces dictates the use of several diverse SPME fiber coatings for more comprehensive metabolomic coverage. We report here an evaluation of eight different commercially available SPME fibers, in combination with both GC-MS and GC-FID, and identify the 50/30 µm CAR-DVB-PDMS, 85 µm CAR-PDMS, 65 µm DVB-PDMS, 7 µm PDMS, and 60 µm PEG SPME fibers as a minimal set of fibers appropriate for human fecal VOC metabolomics, collectively isolating approximately 90% of the total metabolites obtained when using all eight fibers. We also evaluate the effect of extraction duration on metabolite isolation and illustrate that ex vivo enteric microbial fermentation has no effect on metabolite composition during prolonged extractions if the SPME is performed as described herein.
|
22
|
TOPTMH: topology predictor for transmembrane alpha-helices. J Bioinform Comput Biol 2010; 8:39-57. [PMID: 20183873 DOI: 10.1142/s0219720010004501] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 10/19/2009] [Accepted: 10/22/2009] [Indexed: 11/18/2022]
Abstract
Alpha-helical transmembrane proteins mediate many key biological processes and represent 20%-30% of all genes in many organisms. Due to the difficulties in experimentally determining their high-resolution 3D structure, computational methods to predict the location and orientation of transmembrane helix segments using sequence information are essential. We present TOPTMH, a new transmembrane helix topology prediction method that combines support vector machines, hidden Markov models, and a widely used rule-based scheme. The contribution of this work is a prediction approach that first uses a binary SVM classifier to predict the helix residues and then employs a pair of HMMs that incorporate the SVM predictions and hydropathy-based features to identify the entire transmembrane helix segments by capturing the structural characteristics of these proteins. TOPTMH outperforms state-of-the-art prediction methods and achieves the best performance on an independent static benchmark.
|
23
|
Abstract
In this article, we used a network-based approach to characterize the microflora abundance in colonic mucosal samples and correlate potential interactions between the identified species with respect to the healthy and diseased states. We analyzed the modelled network by computing several local and global network statistics, identified recurring patterns or motifs, and fit the network models to a family of well-studied graph models. This study has demonstrated, for the first time, an approach that differentiated the gut microbiota in alcoholic subjects and healthy subjects using topological network analysis of the gut microbiome.
|
24
|
|
25
|
Mining manufacturing data for discovery of high productivity process characteristics. J Biotechnol 2010; 147:186-97. [PMID: 20416347 DOI: 10.1016/j.jbiotec.2010.04.005] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.4] [Received: 02/04/2010] [Revised: 04/01/2010] [Accepted: 04/13/2010] [Indexed: 11/29/2022]
Abstract
Modern manufacturing facilities for bioproducts are highly automated with advanced process monitoring and data archiving systems. The time dynamics of hundreds of process parameters and outcome variables over a large number of production runs are archived in the data warehouse. This vast amount of data is a vital resource to comprehend the complex characteristics of bioprocesses and enhance production robustness. Cell culture process data from 108 'trains' comprising production as well as inoculum bioreactors from Genentech's manufacturing facility were investigated. Each run comprises over one hundred on-line and off-line temporal parameters. A kernel-based approach combined with a maximum margin-based support vector regression algorithm was used to integrate all the process parameters and develop predictive models for a key cell culture performance parameter. The model was also used to identify and rank process parameters according to their relevance in predicting process outcome. Evaluation of cell culture stage-specific models indicates that production performance can be reliably predicted days prior to harvest. Strong associations between several temporal parameters at various manufacturing stages and final process outcome were uncovered. This model-based data mining represents an important step forward in establishing process data-driven knowledge discovery in bioprocesses. Implementation of this methodology on the manufacturing floor can facilitate real-time decision making and thereby improve the robustness of large-scale bioprocesses.
|
26
|
svmPRAT: SVM-based protein residue annotation toolkit. BMC Bioinformatics 2009; 10:439. [PMID: 20028521 PMCID: PMC2805646 DOI: 10.1186/1471-2105-10-439] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Received: 06/15/2009] [Accepted: 12/22/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Over the last decade several prediction methods have been developed for determining the structural and functional properties of individual protein residues using sequence and sequence-derived information. Most of these methods are based on support vector machines as they provide accurate and generalizable prediction models. RESULTS We present a general purpose protein residue annotation toolkit (svmPRAT) to allow biologists to formulate residue-wise prediction problems. svmPRAT formulates the annotation problem as a classification or regression problem using support vector machines. One of the key features of svmPRAT is its ease of use in incorporating any user-provided information in the form of feature matrices. For every residue, svmPRAT captures local information around the residue to create fixed-length feature vectors. svmPRAT implements accurate and fast kernel functions, and also introduces a flexible window-based encoding scheme that accurately captures signals and patterns for training effective predictive models. CONCLUSIONS In this work we evaluate svmPRAT on several classification and regression problems including disorder prediction, residue-wise contact order estimation, DNA-binding site prediction, and local structure alphabet prediction. svmPRAT has also been used for the development of a state-of-the-art transmembrane helix prediction method called TOPTMH, and a secondary structure prediction method called YASSPP. The toolkit provides practitioners with an efficient and easy-to-use tool for a wide variety of annotation problems. AVAILABILITY http://www.cs.gmu.edu/~mlbio/svmprat.
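To illustrate the idea of turning local information around each residue into a fixed-length vector, here is a minimal sketch of a plain sliding-window encoding. It is not svmPRAT's actual encoding scheme, which the abstract describes only as flexible and window-based; the function name `window_encode`, the half-width parameter `w`, and the zero padding at the sequence ends are all assumptions.

```python
import numpy as np


def window_encode(features, w):
    """Concatenate the feature rows of residues i-w .. i+w for every residue i.

    features: (n_residues, d) per-residue feature matrix (e.g. a sequence profile).
    w: hypothetical window half-width; positions past either end are zero-padded.
    Returns an (n_residues, (2*w + 1) * d) matrix of fixed-length vectors.
    """
    features = np.asarray(features, dtype=float)
    n, d = features.shape
    pad = np.zeros((w, d))
    padded = np.vstack([pad, features, pad])
    # Row i of `padded` is residue i - w, so the slice below is centred on residue i.
    return np.stack([padded[i:i + 2 * w + 1].ravel() for i in range(n)])


profile = np.arange(6).reshape(3, 2)  # toy input: 3 residues, 2 features each
encoded = window_encode(profile, w=1)
print(encoded.shape)  # (3, 6)
```

Each row of `encoded` could then be passed to any SVM classifier or regressor as the feature vector of the corresponding residue.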
|
27
|
Multi-Assay-Based Structure−Activity Relationship Models: Improving Structure−Activity Relationship Models by Incorporating Activity Information from Related Targets. J Chem Inf Model 2009; 49:2444-56. [DOI: 10.1021/ci900182q] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Indexed: 11/30/2022]
|
28
|
fRMSDPred: Predicting local RMSD between structural fragments using sequence information. Proteins 2008; 72:1005-18. [DOI: 10.1002/prot.21998] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Indexed: 11/09/2022]
|
29
|
Improving homology models for protein-ligand binding sites. COMPUTATIONAL SYSTEMS BIOINFORMATICS. COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE 2008; 7:211-222. [PMID: 19642282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/28/2023]
Abstract
In order to improve the prediction of protein-ligand binding sites through homology modeling, we incorporate knowledge of the binding residues into the modeling framework. Residues are identified as binding or nonbinding based on their true labels as well as labels predicted from structure and sequence. The sequence predictions were made using a support vector machine framework which employs a sophisticated window-based kernel. Binding labels are used with a very sensitive sequence alignment method to align the target and template. Relevant parameters governing the alignment process are searched for optimal values. Based on our results, homology models of the binding site can be improved if a priori knowledge of the binding residues is available. For target-template pairs with low sequence identity and high structural diversity our sequence-based prediction method provided sufficient information to realize this improvement.
|
30
|
Abstract
MOTIVATION Protein sequence alignment plays a critical role in computational biology as it is an integral part of many analysis tasks designed to solve problems in comparative genomics, structure and function prediction, and homology modeling. METHODS We have developed novel sequence alignment algorithms that compute the alignment between a pair of sequences based on short fixed- or variable-length high-scoring subsequences. Our algorithms build the alignments by repeatedly selecting the highest scoring pairs of subsequences and using them to construct small portions of the final alignment. We utilize PSI-BLAST generated sequence profiles and employ a profile-to-profile scoring scheme derived from PICASSO. RESULTS We evaluated the performance of the computed alignments on two recently published benchmark datasets and compared them against the alignments computed by existing state-of-the-art dynamic programming-based profile-to-profile local and global sequence alignment algorithms. Our results show that the new algorithms achieve alignments that are comparable with or better than those achieved by existing algorithms. Moreover, our results also show that these algorithms can be used to provide better information as to which of the aligned positions are more reliable, a critical piece of information for comparative modeling applications.
|
31
|
fRMSDPred: predicting local RMSD between structural fragments using sequence information. COMPUTATIONAL SYSTEMS BIOINFORMATICS. COMPUTATIONAL SYSTEMS BIOINFORMATICS CONFERENCE 2007; 6:311-322. [PMID: 17951834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/25/2023]
Abstract
The effectiveness of comparative modeling approaches for protein structure prediction can be substantially improved by incorporating predicted structural information in the initial sequence-structure alignment. Motivated by the approaches used to align protein structures, this paper focuses on developing machine learning approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the alignment, assess the quality of an alignment, and identify high-quality alignment segments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponential kernel functions. Our comprehensive empirical study shows superior results compared to the profile-to-profile scoring schemes.
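For context on the quantity being predicted, the sketch below computes the RMSD between two equal-length fragments after optimal rigid superposition using the standard Kabsch procedure. The abstract does not say how the target RMSD values were computed, so treating them as Kabsch-superposed RMSDs over matched atoms is an assumption, and `fragment_rmsd` is a hypothetical name.

```python
import numpy as np


def fragment_rmsd(P, Q):
    """RMSD between two (n, 3) coordinate arrays after optimal superposition.

    Assumes row i of P corresponds to row i of Q (e.g. matched C-alpha atoms).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (n, 3)")
    # Remove translation by centring both fragments on their centroids.
    P0 = P - P.mean(axis=0)
    Q0 = Q - Q.mean(axis=0)
    # Kabsch: SVD of the covariance matrix gives the optimal rotation.
    U, _, Vt = np.linalg.svd(P0.T @ Q0)
    # Flip the last axis if needed so the result is a proper rotation, not a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = P0 @ R.T - Q0
    return float(np.sqrt((diff ** 2).sum() / len(P)))


# A fragment compared with a rotated and shifted copy of itself has RMSD close to 0.
frag = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0], [0.0, 1.5, 1.0]])
a = np.pi / 6
Rz = np.array([[np.cos(a), -np.sin(a), 0.0], [np.sin(a), np.cos(a), 0.0], [0.0, 0.0, 1.0]])
print(fragment_rmsd(frag, frag @ Rz.T + np.array([3.0, -2.0, 1.0])) < 1e-8)  # True
```

A sequence-based predictor of this value is useful precisely because the computation above requires both structures, which are unavailable for the target at alignment time.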
|
32
|
Building multiclass classifiers for remote homology detection and fold recognition. BMC Bioinformatics 2006; 7:455. [PMID: 17042943 PMCID: PMC1635067 DOI: 10.1186/1471-2105-7-455] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Received: 06/07/2006] [Accepted: 10/16/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein remote homology detection and fold recognition are central problems in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for solving these problems. These methods are primarily used to solve binary classification problems and they have not been extensively used to solve the more general multiclass remote homology prediction and fold recognition problems. RESULTS We present a comprehensive evaluation of a number of methods for building SVM-based multiclass classification schemes in the context of the SCOP protein classification. These methods include schemes that directly build an SVM-based multiclass model, schemes that employ a second-level learning approach to combine the predictions generated by a set of binary SVM-based classifiers, and schemes that build and combine binary classifiers for various levels of the SCOP hierarchy beyond those defining the target classes. CONCLUSION Analyzing the performance achieved by the different approaches on four different datasets, we show that most of the proposed multiclass SVM-based classification approaches are quite effective in solving the remote homology prediction and fold recognition problems. Schemes that use predictions from binary models constructed for ancestral categories within the SCOP hierarchy tend not only to lead to lower error rates but also to reduce the number of errors in which a superfamily is assigned to an entirely different fold and a fold is predicted as being from a different SCOP class. Our results also show that the limited size of the training data makes it hard to learn complex second-level models, and that models of moderate complexity lead to consistently better results.
|
33
|
WE-C-330A-04: Effect of Projection Angles Used in Multi-View Reconstruction (MVR) Using Images From a Microangiographic (MA) Detector and An Image-Intensifier (II) System. Med Phys 2006. [DOI: 10.1118/1.2241680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022] Open
|
34
|
SU-FF-I-06: A Portable Test Platform for Image Acquisition and Calibration for Cone Beam Computed Tomography (CBCT) and Region of Interest CBCT (ROI-CBCT) On a Commercial X-Ray C-Arm System. Med Phys 2006. [DOI: 10.1118/1.2240244] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/07/2022] Open
|
35
|
Abstract
MOTIVATION Protein remote homology detection is a central problem in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for remote homology detection. The performance of these methods depends on how the protein sequences are modeled and on the method used to compute the kernel function between them. RESULTS We introduce two classes of kernel functions that are constructed by combining sequence profiles with new and existing approaches for determining the similarity between pairs of protein sequences. These kernels are constructed directly from these explicit protein similarity measures and employ effective profile-to-profile scoring schemes for measuring the similarity between pairs of proteins. Experiments with remote homology detection and fold recognition problems show that these kernels are capable of producing results that are substantially better than those produced by all of the existing state-of-the-art SVM-based methods. In addition, the experiments show that these kernels, even when used in the absence of profiles, produce results that are better than those produced by existing non-profile-based schemes. AVAILABILITY The programs for computing the various kernel functions are available on request from the authors.
|
36
|
Generation, observation, and free radical reactivity of aliphatic bisketenes: the solution to a long-standing problem. Org Lett 2001; 3:4095-8. [PMID: 11735593 DOI: 10.1021/ol016853t] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.5] [Indexed: 11/28/2022]
Abstract
[reaction: see text] Bisketenes O=C=CH(CH2)nCH=C=O (1b,c,d, n = 4, 3, 6) and (E)-O=C=CHCH=CHCH=C=O (E-13) were generated in solution by dehydrochlorination of bis(acyl chlorides) and by photochemical Wolff rearrangements and identified by their characteristic IR signals. The bisketenes react with aminoxyl radicals to give tetraaddition products for 1b and conjugate 1,6-diaddition for E-13.
|
37
|
|