1
|
Chamikara MAP, Chen YPP. MedFused: A framework to discover the relationships between drug chemical functional group impacts and side effects. Comput Biol Med 2021; 133:104361. [PMID: 33872968 DOI: 10.1016/j.compbiomed.2021.104361] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/07/2020] [Revised: 03/12/2021] [Accepted: 03/25/2021] [Indexed: 11/16/2022]
Abstract
The long-term use of certain medications is well known to cause side effects, which can range from mild dizziness to, at their most serious, death. The main factors that cause these side effects are the chemical composition, the mode of treatment, and the dose. The dynamics that govern the reaction of a drug depend heavily on its structural composition, which is defined by the arrangement of its basic chemical functional groups. Hence, to synthesize drugs with minimal side effects, it is essential to investigate how chemical functional groups influence side effects. To support this process, we developed a framework named MedFused (Medical Functional Group Side Effects Database), which comprises drugs (International Union of Pure and Applied Chemistry: IUPAC nomenclature), functional groups, and side effects, along with other valuable information such as the STITCH (search tool for interactions of chemicals) compound ID and the Unified Medical Language System (UMLS) concept ID. We developed a web application on top of the Django web framework that operates on the MedFused database. Our web server supports exploring the database and provides descriptive graph tools, including histograms, pie charts, and association charts, for further exploration. Beyond these basic tools, MedFused includes functionality to discover the impact of a drug's chemical functional groups on its side effects. The method conducts an association rule analysis by treating the MedFused database as a collection of transactions. Each transaction holds the list of functional groups of a drug and one side effect. Hence, a drug that has more than one side effect forms multiple transactions.
Next, we generate a binary feature matrix from the transactions and introduce a pruning mechanism that retains only those functional groups and side effects whose support (frequency) meets a predefined, adjustable threshold. As the current version of the MedFused database has a limited number of side effects (hence low support), we restricted the analysis to identifying the functional groups with the greatest potential to cause a particular side effect, based on a confidence value of 1. Our framework can be extended with more functions and tools because it follows the model view controller (MVC) architecture inherited from the Django Python web framework.
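The transaction construction, support-based pruning, and confidence filter described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the MedFused implementation: the drug names, functional groups, side effects, the function names, and the support threshold of 2 are all hypothetical, and the sketch considers only single-group rules of the form group -> side effect. Counting occurrences here plays the role of summing columns of the binary feature matrix.

```python
from collections import Counter
from itertools import chain

# Hypothetical toy data: drug -> (functional groups, side effects).
DRUGS = {
    "drug_a": ({"hydroxyl", "amine"}, {"nausea"}),
    "drug_b": ({"hydroxyl", "carboxyl"}, {"nausea"}),
    "drug_c": ({"amine"}, {"dizziness", "headache"}),
}


def build_transactions(drugs):
    """One transaction per (drug, side effect): the drug's groups plus that effect."""
    return [
        (frozenset(groups), effect)
        for groups, effects in drugs.values()
        for effect in sorted(effects)
    ]


def confident_rules(transactions, min_support=2, min_confidence=1.0):
    """Return {(group, effect): confidence} for single-group rules group -> effect."""
    # Support counts: how many transactions contain each group / each effect.
    group_count = Counter(chain.from_iterable(g for g, _ in transactions))
    effect_count = Counter(e for _, e in transactions)
    # Co-occurrence counts for each (group, effect) pair.
    pair_count = Counter((g, e) for groups, e in transactions for g in groups)

    rules = {}
    for (group, effect), n in pair_count.items():
        # Prune items whose support falls below the (assumed) threshold.
        if group_count[group] < min_support or effect_count[effect] < min_support:
            continue
        confidence = n / group_count[group]
        if confidence >= min_confidence:
            rules[(group, effect)] = confidence
    return rules


if __name__ == "__main__":
    print(confident_rules(build_transactions(DRUGS)))
```

Because a drug with several side effects contributes one transaction per effect, a group reaches confidence 1 for an effect only when every transaction containing that group carries that same effect, which is why the toy data keeps the hydroxyl-bearing drugs to a single side effect.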
Affiliation(s)
- Yi-Ping Phoebe Chen
- College of Science, Health and Engineering, La Trobe University, Melbourne, Australia.
|
2
|
Unadkat K, Whittall JB. Unexpected predicted length variation for the coding sequence of the sleep related gene, BHLHE41 in gorilla amidst strong purifying selection across mammals. PLoS One 2020; 15:e0223203. [PMID: 32287315 PMCID: PMC7156063 DOI: 10.1371/journal.pone.0223203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2019] [Accepted: 03/26/2020] [Indexed: 12/05/2022] Open
Abstract
There is a molecular basis for many sleep patterns and disorders involving circadian clock genes. In humans, "short-sleeper" behavior has been linked to specific amino acid substitutions in BHLHE41 (DEC2), yet little is known about variation at these sites and across this gene in mammals. We compare BHLHE41 coding sequences for 27 mammals. Approximately half of the coding sequence was invariable at the nucleotide level and close to three-quarters of the amino acid alignment was identical. No other mammals had the same "short-sleeper" amino acid substitutions previously described from humans. Phylogenetic analyses based on the nucleotides of the coding sequence alignment are consistent with established mammalian relationships confirming orthology among the sampled sequences. Significant purifying selection was detected in about two-thirds of the variable codons and no codons exhibited significant signs of positive selection. Unexpectedly, the gorilla BHLHE41 sequence has a 318 bp insertion at the 5' end of the coding sequence and a deletion of 195 bp near the 3' end of the coding sequence (including the two short sleeper variable sites). Given the strong signal of purifying selection across this gene, phylogenetic congruence with expected relationships and generally conserved function among mammals investigated thus far, we suggest the indels predicted in the gorilla BHLHE41 may represent an annotation error and warrant experimental validation.
Affiliation(s)
- Krishna Unadkat
- Department of Biology, Santa Clara University, Santa Clara, California, United States of America
- Justen B. Whittall
- Department of Biology, Santa Clara University, Santa Clara, California, United States of America
|
3
|
Nasiri J, Naghavi M, Rad SN, Yolmeh T, Shirazi M, Naderi R, Nasiri M, Ahmadi S. Gene identification programs in bread wheat: a comparison study. Nucleosides Nucleotides Nucleic Acids 2014; 32:529-54. [PMID: 24124688 DOI: 10.1080/15257770.2013.832773] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 12/12/2022]
Abstract
Seven ab initio web-based gene prediction programs (AUGUSTUS, BGF, Fgenesh, Fgenesh+, GeneID, Genemark.hmm, and HMMgene) were assessed for prediction accuracy using protein-coding sequences of bread wheat. At both the nucleotide and exon levels, Fgenesh+ performed best, followed by BGF and then Fgenesh. At the gene level, by contrast, Fgenesh was the best program, precisely predicting more than 75% of all genes. Fgenesh+, BGF, and Fgenesh also produced the highest percentages of correctly predicted exons and therefore appear more likely to give trustworthy results, whereas GeneID and HMMgene would be expected to yield a higher percentage of false negatives. For initial exons, the 3' boundary was recognized accurately significantly more often than the 5' boundary; the reverse was true for terminal exons. Lastly, HMMgene and Genemark.hmm were largely insensitive to GC content, while the other programs appear to be slightly more sensitive when GC-poor sequences are used. Overall, our results indicate that gene finders still need improvement to deliver reliably good results.
Affiliation(s)
- Jaber Nasiri
- Department of Agronomy and Plant Breeding, Division of Molecular Plant Genetics, College of Agricultural & Natural Resources, University of Tehran, Karaj, Tehran, Iran
|
4
|
Goodswen SJ, Kennedy PJ, Ellis JT. Evaluating high-throughput ab initio gene finders to discover proteins encoded in eukaryotic pathogen genomes missed by laboratory techniques. PLoS One 2012; 7:e50609. [PMID: 23226328 PMCID: PMC3511556 DOI: 10.1371/journal.pone.0050609] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 06/17/2012] [Accepted: 10/24/2012] [Indexed: 11/25/2022] Open
Abstract
Next generation sequencing technology is advancing genome sequencing at an unprecedented level. By unravelling the code within a pathogen’s genome, every possible protein (prior to post-translational modifications) can theoretically be discovered, irrespective of life cycle stages and environmental stimuli. Now more than ever there is a great need for high-throughput ab initio gene finding. Ab initio gene finders use statistical models to predict genes and their exon-intron structures from the genome sequence alone. This paper evaluates whether existing ab initio gene finders can effectively predict genes to deduce proteins that have presently missed capture by laboratory techniques. An aim here is to identify possible patterns of prediction inaccuracies for gene finders as a whole irrespective of the target pathogen. All currently available ab initio gene finders are considered in the evaluation but only four fulfil high-throughput capability: AUGUSTUS, GeneMark_hmm, GlimmerHMM, and SNAP. These gene finders require training data specific to a target pathogen and consequently the evaluation results are inextricably linked to the availability and quality of the data. The pathogen, Toxoplasma gondii, is used to illustrate the evaluation methods. The results support current opinion that predicted exons by ab initio gene finders are inaccurate in the absence of experimental evidence. However, the results reveal some patterns of inaccuracy that are common to all gene finders and these inaccuracies may provide a focus area for future gene finder developers.
Affiliation(s)
- Stephen J. Goodswen
- School of Medical and Molecular Sciences, and the Ithree Institute at the University of Technology Sydney (UTS), New South Wales, Australia
- Paul J. Kennedy
- School of Software, Faculty of Engineering and Information Technology and the Centre for Quantum Computation and Intelligent Systems at the University of Technology Sydney (UTS), New South Wales, Australia
- John T. Ellis
- School of Medical and Molecular Sciences, and the Ithree Institute at the University of Technology Sydney (UTS), New South Wales, Australia
|
5
|
Zhu P, Bowden P, Zhang D, Marshall JG. Mass spectrometry of peptides and proteins from human blood. Mass Spectrom Rev 2011; 30:685-732. [PMID: 24737629 DOI: 10.1002/mas.20291] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.9] [Received: 07/18/2008] [Revised: 12/09/2009] [Accepted: 01/19/2010] [Indexed: 06/03/2023]
Abstract
It is difficult to convey the accelerating rate and growing importance of mass spectrometry applications to human blood proteins and peptides. Mass spectrometry can rapidly detect and identify the ionizable peptides from the proteins in a simple mixture and reveal many of their post-translational modifications. However, blood is a complex mixture that may contain many proteins first expressed in cells and tissues. The complete analysis of blood proteins is a daunting task that will rely on a wide range of disciplines, including physics, chemistry, biochemistry, genetics, electromagnetic instrumentation, mathematics and computation. The comprehensive discovery and analysis of blood proteins will therefore rank among the great technical challenges and draw on many of mankind's cumulative scientific achievements. A variety of methods have been used to fractionate, analyze and identify proteins from blood, each yielding a small piece of the whole and throwing the great size of the task into sharp relief. The approaches attempted to date clearly indicate that enumerating the proteins and peptides of blood can be accomplished. There is no doubt that the mass spectrometry of blood will be crucial to the discovery and analysis of proteins, enzyme activities, and post-translational processes that underlie the mechanisms of disease. At present both discovery and quantification of proteins from blood are commonly reaching sensitivities of ∼1 ng/mL.
Affiliation(s)
- Peihong Zhu
- Department of Chemistry and Biology, Ryerson University, 350 Victoria Street, Toronto, Ontario, Canada M5B 2K3
|
6
|
Poptsova MS, Gogarten JP. Using comparative genome analysis to identify problems in annotated microbial genomes. Microbiology (Reading) 2010; 156:1909-1917. [DOI: 10.1099/mic.0.033811-0] [Citation(s) in RCA: 80] [Impact Index Per Article: 5.7] [Indexed: 11/18/2022] Open
Abstract
Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to the newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.
Affiliation(s)
- Maria S. Poptsova
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT 06269-3125, USA
- J. Peter Gogarten
- Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT 06269-3125, USA
|
7
|
Bowden P, Pendrak V, Zhu P, Marshall JG. Meta sequence analysis of human blood peptides and their parent proteins. J Proteomics 2010; 73:1163-75. [PMID: 20170764 DOI: 10.1016/j.jprot.2010.02.007] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Received: 09/03/2009] [Revised: 01/23/2010] [Accepted: 02/09/2010] [Indexed: 11/19/2022]
Abstract
Sequence analysis of the blood peptides and their qualities will be key to understanding the mechanisms that contribute to error in LC-ESI-MS/MS. Analysis of peptides and their proteins at the level of sequences is much more direct and informative than the comparison of disparate accession numbers. A portable database of all blood peptide and protein sequences with descriptor fields and gene ontology terms might be useful for designing immunological or MRM assays from human blood. The results of twelve studies of human blood peptides and/or proteins identified by LC-MS/MS and correlated against a disparate array of genetic libraries were parsed and matched to proteins from the human ENSEMBL, SwissProt and RefSeq databases by SQL. The reported peptide and protein sequences were organized into an SQL database with full protein sequences and up to five unique peptides in order of prevalence along with the peptide count for each protein. Structured query language or BLAST was used to acquire descriptive information in current databases. Sampling error at the level of peptides is the largest source of disparity between groups. Chi Square analysis of peptide to protein distributions confirmed the significant agreement between groups on identified proteins.
Affiliation(s)
- Peter Bowden
- Department of Chemistry and Biology, Ryerson University, Toronto, Canada
|
8
|
Bowden P, Beavis R, Marshall J. Tandem mass spectrometry of human tryptic blood peptides calculated by a statistical algorithm and captured by a relational database with exploration by a general statistical analysis system. J Proteomics 2009; 73:103-11. [PMID: 19703602 DOI: 10.1016/j.jprot.2009.08.004] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.7] [Received: 05/08/2009] [Revised: 08/04/2009] [Accepted: 08/17/2009] [Indexed: 01/23/2023]
Abstract
A goodness of fit test may be used to assign tandem mass spectra of peptides to amino acid sequences and to directly calculate the expected probability of mis-identification. The product of the peptide expectation values directly yields the probability that the parent protein has been mis-identified. A relational database could capture the mass spectral data and the best fit results, and permit subsequent calculations by a general statistical analysis system. The many files of the HUPO blood protein data correlated by X!TANDEM against the proteins of ENSEMBL were collected into a relational database. A redundant set of 247,077 proteins and peptides was correlated by X!TANDEM and collapsed to a set of 34,956 peptides from 13,379 distinct proteins. About 6875 distinct proteins were represented by only a single distinct peptide, 2866 proteins showed 2 distinct peptides, and 3454 proteins showed at least three distinct peptides by X!TANDEM. More than 99% of the peptides were associated with proteins that had cumulative expectation values, i.e. probability of false positive identification, of one in one hundred or less. The distribution of peptides per protein from X!TANDEM differed significantly from that expected from random assignment of peptides.
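The statement that the protein-level probability is the product of the peptide expectation values can be illustrated with a short sketch. This is an assumption-laden illustration and not X!TANDEM's code: the function name `protein_expectation` and the example values are hypothetical, and the product is taken in log space simply to avoid floating-point underflow when many small values are multiplied.

```python
import math


def protein_expectation(peptide_expectations):
    """Product of peptide expectation values, computed via a sum of log10 values."""
    if not peptide_expectations:
        raise ValueError("at least one peptide expectation value is required")
    if any(e <= 0 for e in peptide_expectations):
        raise ValueError("expectation values must be positive")
    log_sum = sum(math.log10(e) for e in peptide_expectations)
    return 10 ** log_sum


if __name__ == "__main__":
    # Hypothetical protein identified from two peptides.
    value = protein_expectation([0.1, 0.01])
    # Compare against the one-in-one-hundred level mentioned in the abstract.
    print(value, value <= 0.01)
```

Under this reading, a protein supported by several moderately confident peptides can reach a much smaller cumulative expectation value than any single peptide provides alone.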
Affiliation(s)
- Peter Bowden
- Department of Chemistry and Biology, Ryerson University, 350 Victoria Street, Toronto, ON, Canada M5B 2K3
|
9
|
An J, Chen YPP. Finding rule groups to classify high dimensional gene expression datasets. Comput Biol Chem 2009; 33:108-13. [DOI: 10.1016/j.compbiolchem.2008.07.031] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 09/26/2007] [Revised: 07/24/2008] [Accepted: 07/24/2008] [Indexed: 11/28/2022]
|
10
|
Knapp K, Chonka A, Chen YPP. POEM, A 3-dimensional exon taxonomy and patterns in untranslated exons. BMC Genomics 2008; 9:428. [PMID: 18803852 PMCID: PMC2561055 DOI: 10.1186/1471-2164-9-428] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/07/2008] [Accepted: 09/20/2008] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND The existence of exons and introns has been known for thirty years. Despite this knowledge, there is a lack of formal research into the categorization of exons. Exon taxonomies used by researchers tend to be selected ad hoc or based on an information-poor de facto standard. Exons have been shown to have specific properties and functions based on, among other things, their location and order. These factors should play a role in the naming to increase specificity about which exon type(s) are in question. RESULTS POEM (Protein Oriented Exon Monikers) is a new taxonomy focused on protein proximal exons. It integrates three dimensions of information (Global Position, Regional Position and Region), thus its exon categories are based on known statistical exon features. POEM is applied to two congruent untranslated exon datasets, resulting in the following statistical properties. Using the POEM taxonomy, previous wide-ranging estimates of initial 5' untranslated region exons are resolved. According to our datasets, 29-36% of genes have wholly untranslated first exons. Sequences containing untranslated exons are shown to have consistently up to 6 times more 5' untranslated exons than 3' untranslated exons. Finally, three exon patterns are determined which account for 70% of untranslated exon genes. CONCLUSION We describe a thorough three-dimensional exon taxonomy called POEM, which is biologically and statistically relevant. No previous taxonomy provides such fine-grained information and yet still includes all valid information dimensions. The use of POEM will improve the accuracy of genefinder comparisons and analysis by means of a common taxonomy. It will also facilitate unambiguous communication due to its fine granularity.
Affiliation(s)
- Keith Knapp
- Faculty of Science and Technology, Deakin University, Victoria, Australia.
|
11
|
Chen YPP, Chen F. Identifying targets for drug discovery using bioinformatics. Expert Opin Ther Targets 2008; 12:383-9. [PMID: 18348676 DOI: 10.1517/14728222.12.4.383] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.4] [Indexed: 01/20/2023]
Abstract
BACKGROUND Drug discovery is the process of discovering and designing drugs, which includes target identification, target validation, lead identification, lead optimization and introduction of the new drugs to the public. This process is very important, involving analyzing the causes of the diseases and finding ways to tackle them. OBJECTIVE The problems we must face include: i) that this process is so long and expensive that it might cost millions of dollars and take a dozen years; and ii) the accuracy of identification of targets is not good enough, which in turn delays the process. Introducing bioinformatics into the drug discovery process could contribute much to it. Bioinformatics is a booming subject combining biology with computer science. It can explore the causes of diseases at the molecular level, explain the phenomena of the diseases from the angle of the gene and make use of computer techniques, such as data mining, machine learning and so on, to decrease the scope of analysis and enhance the accuracy of the results so as to reduce the cost and time. METHODS Here we describe recent studies about how to apply bioinformatics techniques in the four phases of drug discovery, how these techniques improve the drug discovery process and some possible difficulties that should be dealt with. RESULTS We conclude that combining bioinformatics with drug discovery is a very promising method although it faces many problems currently.
|
12
|
Chen Y, Timms P, Chen YPP. CIDB: Chlamydia Interactive Database for cross-querying genomics, transcriptomics and proteomics data. Biomol Eng 2007; 24:603-8. [PMID: 17913579 DOI: 10.1016/j.bioeng.2007.08.017] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Received: 02/19/2007] [Revised: 08/13/2007] [Accepted: 08/13/2007] [Indexed: 10/22/2022]
Abstract
Chlamydiae are important pathogens of humans, birds and a wide range of animals. They are a unique group of bacteria, characterized by their developmental cycle. Chlamydiae have been difficult to study because of their obligate intracellular growth habit and lack of a genetic transformation system. However, the past 5 years have seen the full genome sequencing of seven strains of Chlamydia and a rapid expansion of genomic, transcriptomic (RT-PCR, microarray) and proteomic analysis of these pathogens. The Chlamydia Interactive Database (CIDB) described here is the first database of its type that holds genomic, RT-PCR, microarray and proteomics data sets that researchers can cross-query for patterns in the data. Combining the data of many research groups into a single database and cross-querying from different perspectives should enhance our understanding of the complex cell biology of these pathogens. The database is available at: http://www3.it.deakin.edu.au:8080/CIDB/.
Affiliation(s)
- Yan Chen
- School of Engineering and Information Technology, Deakin University, Australia
|
13
|
Nahar J, Ali S, Chen YPP. Microarray Data Classification Using Automatic SVM Kernel Selection. DNA Cell Biol 2007; 26:707-12. [PMID: 17685832 DOI: 10.1089/dna.2007.0590] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022] Open
Abstract
Microarray data classification is one of the most important emerging clinical applications in the medical community. Machine learning algorithms are most frequently used to complete this task. We selected one of the state-of-the-art kernel-based algorithms, the support vector machine (SVM), to classify microarray data. As a large number of kernels are available, a significant research question is what is the best kernel for patient diagnosis based on microarray data classification using SVM? We first suggest three solutions based on data visualization and quantitative measures. Different types of microarray problems then test the proposed solutions. Finally, we found that the rule-based approach is most useful for automatic kernel selection for SVM to classify microarray data.
Affiliation(s)
- Jesmin Nahar
- Faculty of Science and Technology, Deakin University, Victoria, Australia
|