1
|
Jia R, Guo X, Liu H, Zhao F, Fan Z, Wang M, Sui J, Yin B, Wang Z, Wang Z. Analysis of Staged Features of Gastritis-Cancer Transformation and Identification of Potential Biomarkers in Gastric Cancer. J Inflamm Res 2022; 15:6857-6868. [PMID: 36597437 PMCID: PMC9805741 DOI: 10.2147/jir.s390448] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2022] [Accepted: 12/16/2022] [Indexed: 12/29/2022] Open
Abstract
Purpose This work aims to elucidate the staged characteristics of gastritis-cancer transformation based on the transcriptome and to identify potential biomarkers using bioinformatics. Patients and Methods We collected blood samples from healthy controls and from patients with non-atrophic gastritis (NAG), atrophic gastritis (AG), and gastric cancer (CA), as well as tissue samples from patients with gastric cancer. RNA-seq was then performed. Differential expression analysis, weighted gene co-expression network analysis and functional enrichment analysis were used to illustrate the staged characteristics of gastritis-cancer transformation. Genes with diagnostic potential were further identified in combination with ROC analysis. Additionally, for the gastric cancer stage, the gene expression of the collected tissue transcriptome was validated using the Cancer Genome Atlas and combined with survival analysis to identify potential biomarkers. Results The 279 overlapping genes among the differentially expressed genes of NAG, AG and CA indicated that the expression characteristics differed between stages. However, the 2243 overlapping genes among the differential genes between adjacent stages indicated a certain consistency in the expression characteristics of stage development. The core functions of the different stages are strongly stage-specific and have essentially no similarities. Twenty genes with diagnostic potential for AG or CA were obtained, respectively, and no gene could effectively differentiate NAG samples. Thirty-four potential biomarkers for gastric cancer were identified, of which 14 genes have not been reported, including ACTG2, C1QTNF2, NCAPH and SORCS1. Conclusion There may be a stable development mechanism in the process of gastritis-carcinoma transformation, resulting in the differences in the performance of each stage. The newly discovered staging features and potential biomarkers in this work can provide references for related research.
Affiliation(s)
- Ruikang Jia
- The Affiliated Hospital and the Medical College, Hebei University of Engineering, Handan, Hebei Province, People’s Republic of China; Key Laboratory of Chinese Medicine for Gastric Medicine, Hebei Province, Handan Pharmaceutical Co. LTD, Handan, People’s Republic of China
| | - Xiaohui Guo
- Handan Central Hospital, Handan, Hebei Province, People’s Republic of China
| | - Huiyun Liu
- Key Laboratory of Chinese Medicine for Gastric Medicine, Hebei Province, Handan Pharmaceutical Co. LTD, Handan, People’s Republic of China
| | - Feiyue Zhao
- Key Laboratory of Chinese Medicine for Gastric Medicine, Hebei Province, Handan Pharmaceutical Co. LTD, Handan, People’s Republic of China
| | - Zhibin Fan
- Key Laboratory of Chinese Medicine for Gastric Medicine, Hebei Province, Handan Pharmaceutical Co. LTD, Handan, People’s Republic of China
| | - Menglei Wang
- Key Laboratory of Chinese Medicine for Gastric Medicine, Hebei Province, Handan Pharmaceutical Co. LTD, Handan, People’s Republic of China
| | - Jianliang Sui
- The Affiliated Hospital and the Medical College, Hebei University of Engineering, Handan, Hebei Province, People’s Republic of China; Key Laboratory of Chinese Medicine for Gastric Medicine, Hebei Province, Handan Pharmaceutical Co. LTD, Handan, People’s Republic of China
| | - Binghua Yin
- Handan Central Hospital, Handan, Hebei Province, People’s Republic of China
| | - Zhihong Wang
- People’s Hospital of Huangzhou District, Huanggang City, People’s Republic of China
| | - Zhen Wang
- The Affiliated Hospital and the Medical College, Hebei University of Engineering, Handan, Hebei Province, People’s Republic of China; Key Laboratory of Metabolism and Molecular Medicine, Ministry of Education, and Department of Biochemistry and Molecular Biology, Fudan University Shanghai Medical College, Shanghai, People’s Republic of China; Correspondence: Zhen Wang, The Affiliated Hospital and the Medical College, Hebei University of Engineering, Handan, Hebei Province, People’s Republic of China, Tel +8619903200632, Email
| |
|
2
|
van Bragt JJ, Brinkman P, de Vries R, Vijverberg SJ, Weersink EJ, Haarman EG, de Jongh FH, Kester S, Lucas A, in 't Veen JC, Sterk PJ, Bel EH, Maitland-van der Zee AH. Identification of recent exacerbations in COPD patients by electronic nose. ERJ Open Res 2020; 6:00307-2020. [PMID: 33447611 PMCID: PMC7792783 DOI: 10.1183/23120541.00307-2020] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 05/25/2020] [Accepted: 09/28/2020] [Indexed: 12/17/2022] Open
Abstract
Molecular profiling of exhaled breath by electronic nose (eNose) might be suitable as a noninvasive tool that can help in monitoring of clinically unstable COPD patients. However, supporting data are still lacking. Therefore, as a first step, this study aimed to determine the accuracy of exhaled breath analysis by eNose to identify COPD patients who recently exacerbated, defined as an exacerbation in the previous 3 months. Data for this exploratory, cross-sectional study were extracted from the multicentre BreathCloud cohort. Patients with a physician-reported diagnosis of COPD (n=364) on maintenance treatment were included in the analysis. Exacerbations were defined as a worsening of respiratory symptoms requiring treatment with oral corticosteroids, antibiotics or both. Data analysis involved eNose signal processing, ambient air correction and statistics based on principal component (PC) analysis followed by linear discriminant analysis (LDA). Before analysis, patients were randomly divided into a training (n=254) and validation (n=110) set. In the training set, LDA based on PCs 1-4 discriminated between patients with a recent exacerbation or no exacerbation with high accuracy (receiver operating characteristic (ROC)-area under the curve (AUC)=0.98, 95% CI 0.97-1.00). This high accuracy was confirmed in the validation set (AUC=0.98, 95% CI 0.94-1.00). Smoking, health status score, use of inhaled corticosteroids or vital capacity did not influence these results. Exhaled breath analysis by eNose can discriminate with high accuracy between COPD patients who experienced an exacerbation within 3 months prior to measurement and those who did not. This suggests that COPD patients who recently exacerbated have their own exhaled molecular fingerprint that could be valuable for monitoring purposes.
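The ROC-AUC figures quoted in this abstract can be reproduced from any set of classifier scores. The following is a minimal sketch, assuming binary labels (1 = recent exacerbation, 0 = none) and one discriminant score per patient; the function name `roc_auc` and the toy data are illustrative and are not taken from the study, which derived its scores from PCA followed by LDA.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count positive/negative pairs where the positive scores higher;
    # tied pairs contribute one half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): three exacerbators, three non-exacerbators.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
```

An AUC of 1.0 means every patient with a recent exacerbation scored above every patient without one; 0.5 corresponds to chance.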
Affiliation(s)
- Job J.M.H. van Bragt
- Amsterdam UMC, University of Amsterdam, Dept of Respiratory Medicine, Amsterdam, The Netherlands
| | - Paul Brinkman
- Amsterdam UMC, University of Amsterdam, Dept of Respiratory Medicine, Amsterdam, The Netherlands
| | - Rianne de Vries
- Amsterdam UMC, University of Amsterdam, Dept of Respiratory Medicine, Amsterdam, The Netherlands
- Breathomix BV, Leiden, The Netherlands
| | - Susanne J.H. Vijverberg
- Amsterdam UMC, University of Amsterdam, Dept of Respiratory Medicine, Amsterdam, The Netherlands
| | - Els J.M. Weersink
- Amsterdam UMC, University of Amsterdam, Dept of Respiratory Medicine, Amsterdam, The Netherlands
| | - Eric G. Haarman
- Amsterdam UMC, Vrije Universiteit Amsterdam, Dept of Pediatric Respiratory Medicine, Amsterdam, The Netherlands
| | - Frans H.C. de Jongh
- Medisch Spectrum Twente, Dept of Pulmonary Function, Enschede, The Netherlands
| | - Sigrid Kester
- Medisch Centrum Den Bosch Oost, ’s-Hertogenbosch, The Netherlands
| | | | | | - Peter J. Sterk
- Amsterdam UMC, University of Amsterdam, Dept of Respiratory Medicine, Amsterdam, The Netherlands
| | - Elisabeth H.D. Bel
- Amsterdam UMC, University of Amsterdam, Dept of Respiratory Medicine, Amsterdam, The Netherlands
| | | |
|
3
|
Delibaş E, Arslan A, Şeker A, Diri B. A novel alignment-free DNA sequence similarity analysis approach based on top-k n-gram match-up. J Mol Graph Model 2020; 100:107693. [PMID: 32805559 DOI: 10.1016/j.jmgm.2020.107693] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/12/2020] [Revised: 06/15/2020] [Accepted: 07/06/2020] [Indexed: 11/17/2022]
Abstract
DNA sequence similarity analysis is an essential task in computational biology and bioinformatics. In nearly all research that explores evolutionary relationships, gene function analysis, protein structure prediction and sequence retrieving, it is necessary to perform similarity calculations. As an alternative to alignment-based sequence comparison methods, which result in high computational cost, alignment-free methods have emerged that calculate similarity by digitizing the sequence in a different space. In this paper, we proposed an alignment-free DNA sequence similarity analysis method based on top-k n-gram matches, with the prediction that common repeating DNA subsections indicate high similarity between DNA sequences. In our method, we determined DNA sequence similarities by measuring similarity among feature vectors created according to top-k n-gram match-up scores without the use of similarity functions. We applied the similarity calculation for three different DNA data sets of different lengths. The phylogenetic relationships revealed by our method show that our trees coincide almost completely with the results of the MEGA software, which is based on sequence alignment. Our findings show that a certain number of frequently recurring common sequence patterns have the power to characterize DNA sequences.
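The idea of comparing sequences through their most frequent n-grams can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the defaults `n=3` and `k=5`, and the shared-fraction score are hypothetical stand-ins, and the authors' actual match-up scoring (which they describe as not relying on a similarity function) is not reproduced here.

```python
from collections import Counter


def top_k_ngrams(seq, n, k):
    """Return the k most frequent length-n substrings of seq."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:k]]


def top_k_overlap(seq_a, seq_b, n=3, k=5):
    """Fraction of top-k n-grams shared by two sequences (0.0 to 1.0)."""
    a = set(top_k_ngrams(seq_a, n, k))
    b = set(top_k_ngrams(seq_b, n, k))
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))
```

For example, `top_k_overlap("ACGTACGTACGT", "ACGTACGTACGT")` is 1.0 because both sequences share the same top n-grams, whereas two sequences with no n-gram in common score 0.0.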
Affiliation(s)
- Emre Delibaş
- Department of Computer Engineering, Faculty of Engineering, Sivas Cumhuriyet University, 58140, Sivas, Turkey.
| | - Ahmet Arslan
- Department of Computer Engineering, Faculty of Engineering, Selçuk University, 42250, Konya, Turkey.
| | - Abdulkadir Şeker
- Department of Computer Engineering, Faculty of Engineering, Sivas Cumhuriyet University, 58140, Sivas, Turkey.
| | - Banu Diri
- Department of Computer Engineering, Faculty of Electrical and Electronics, Yıldız Technical University, 34349, İstanbul, Turkey.
| |
|
4
|
Genetic evaluation of the Iberian lynx ex situ conservation programme. Heredity (Edinb) 2019; 123:647-661. [PMID: 30952964 DOI: 10.1038/s41437-019-0217-z] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 11/13/2018] [Revised: 03/08/2019] [Accepted: 03/11/2019] [Indexed: 11/09/2022] Open
Abstract
Ex situ programmes have become critical for improving the conservation of many threatened species, as they establish backup populations and provide individuals for reintroduction and reinforcement of wild populations. The Iberian lynx was considered the most threatened felid species in the world in the wake of a dramatic decline during the second half of the 20th century that reduced its numbers to around only 100 individuals. An ex situ conservation programme was established in 2003 with individuals from the two well-differentiated, remnant populations, with great success from a demographic point of view. Here, we evaluate the genetic status of the Iberian lynx captive population based on molecular data from 36 microsatellites, including patterns of relatedness and representativeness of the two remnant genetic backgrounds among founders, the evolution of diversity and inbreeding over the years, and genetic differentiation among breeding facilities. In general terms, the ex situ population harbours most of the genetic variability found in the two wild populations and has been able to maintain reasonably low levels of inbreeding and high diversity, thus validating the applied management measures and potentially representing a model for other species in need of conservation.
|
5
|
Cai S, Palazoglu A, Zhang L, Hu J. Process alarm prediction using deep learning and word embedding methods. ISA Transactions 2019; 85:274-283. [PMID: 30401489 DOI: 10.1016/j.isatra.2018.10.032] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 07/19/2018] [Revised: 09/21/2018] [Accepted: 10/19/2018] [Indexed: 06/08/2023]
Abstract
Industrial alarm systems play an essential role for the safe management of process operations. With the increase in automation and instrumentation of modern process plants, the number of alarms that the operators manage has also increased significantly. The operators are expected to make critical decisions in the presence of flooding alarms, poorly configured and maintained alarms and many nuisance alarms. In this environment, if the incoming alarms can be correctly predicted before they actually occur, the operators may have a chance to address and possibly avoid abnormal behaviors by taking corrective actions in time. Inspired by the application of deep learning in natural language processing, this paper presents an alarm prediction method based on word embedding and recurrent neural networks to predict the next alarm in a process setting. This represents both a novel approach to alarm management as well as a novel application of natural language processing and deep learning techniques to this problem. The proposed method is applied to an actual case study to demonstrate its performance.
Affiliation(s)
- Shuang Cai
- College of Safety and Ocean Engineering, State Key Laboratory of Petroleum Resources and Prospecting, China University of Petroleum, Beijing, China; Department of Chemical Engineering, University of California, Davis, CA 95616, USA
| | - Ahmet Palazoglu
- Department of Chemical Engineering, University of California, Davis, CA 95616, USA
| | - Laibin Zhang
- College of Safety and Ocean Engineering, State Key Laboratory of Petroleum Resources and Prospecting, China University of Petroleum, Beijing, China
| | - Jinqiu Hu
- College of Safety and Ocean Engineering, State Key Laboratory of Petroleum Resources and Prospecting, China University of Petroleum, Beijing, China.
| |
|
6
|
Kleinman-Ruiz D, Martínez-Cruz B, Soriano L, Lucena-Perez M, Cruz F, Villanueva B, Fernández J, Godoy JA. Novel efficient genome-wide SNP panels for the conservation of the highly endangered Iberian lynx. BMC Genomics 2017; 18:556. [PMID: 28732460 PMCID: PMC5522595 DOI: 10.1186/s12864-017-3946-5] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 04/26/2017] [Accepted: 07/13/2017] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND The Iberian lynx (Lynx pardinus) has been acknowledged as the most endangered felid species in the world. An intense contraction and fragmentation during the twentieth century left less than 100 individuals split in two isolated and genetically eroded populations by 2002. Genetic monitoring and management so far have been based on 36 STRs, but their limited variability and the more complex situation of current populations demand more efficient molecular markers. The recent characterization of the Iberian lynx genome identified more than 1.6 million SNPs, of which 1536 were selected and genotyped in an extended Iberian lynx sample. METHODS We validated 1492 SNPs and analysed their heterozygosity, Hardy-Weinberg equilibrium, and linkage disequilibrium. We then selected a panel of 343 minimally linked autosomal SNPs from which we extracted subsets optimized for four different typical tasks in conservation applications: individual identification, parentage assignment, relatedness estimation, and admixture classification, and compared their power to currently used STR panels. RESULTS We ascribed 21 SNPs to chromosome X based on their segregation patterns, and identified one additional marker that showed significant differentiation between sexes. For all applications considered, panels of autosomal SNPs showed higher power than the currently used STR set with only a very modest increase in the number of markers. CONCLUSIONS These novel panels of highly informative genome-wide SNPs provide more powerful, efficient, and flexible tools for the genetic management and non-invasive monitoring of Iberian lynx populations. This example highlights an important outcome of whole-genome studies in genetically threatened species.
Affiliation(s)
- Daniel Kleinman-Ruiz
- Departamento de Ecología Integrativa, Estación Biológica de Doñana (EBD-CSIC), Calle Americo Vespucio 26, 41092, Sevilla, Spain
| | - Begoña Martínez-Cruz
- Departamento de Ecología Integrativa, Estación Biológica de Doñana (EBD-CSIC), Calle Americo Vespucio 26, 41092, Sevilla, Spain
| | - Laura Soriano
- Departamento de Ecología Integrativa, Estación Biológica de Doñana (EBD-CSIC), Calle Americo Vespucio 26, 41092, Sevilla, Spain
| | - Maria Lucena-Perez
- Departamento de Ecología Integrativa, Estación Biológica de Doñana (EBD-CSIC), Calle Americo Vespucio 26, 41092, Sevilla, Spain
| | - Fernando Cruz
- Departamento de Ecología Integrativa, Estación Biológica de Doñana (EBD-CSIC), Calle Americo Vespucio 26, 41092, Sevilla, Spain; CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Baldiri i Reixac 4, 08028, Barcelona, Spain
| | - Beatriz Villanueva
- Departamento de Mejora Genética Animal, INIA, Carretera de la Coruña Km. 7, 28040, Madrid, Spain
| | - Jesús Fernández
- Departamento de Mejora Genética Animal, INIA, Carretera de la Coruña Km. 7, 28040, Madrid, Spain
| | - José A Godoy
- Departamento de Ecología Integrativa, Estación Biológica de Doñana (EBD-CSIC), Calle Americo Vespucio 26, 41092, Sevilla, Spain.
| |
|
7
|
Fan Y, Siklenka K, Arora SK, Ribeiro P, Kimmins S, Xia J. miRNet - dissecting miRNA-target interactions and functional associations through network-based visual analysis. Nucleic Acids Res 2016; 44:W135-41. [PMID: 27105848 PMCID: PMC4987881 DOI: 10.1093/nar/gkw288] [Citation(s) in RCA: 307] [Impact Index Per Article: 38.4] [Received: 02/01/2016] [Revised: 04/01/2016] [Accepted: 04/08/2016] [Indexed: 01/01/2023] Open
Abstract
MicroRNAs (miRNAs) can regulate nearly all biological processes and their dysregulation is implicated in various complex diseases and pathological conditions. Recent years have seen a growing number of functional studies of miRNAs using high-throughput experimental technologies, which have produced a large amount of high-quality data regarding miRNA target genes and their interactions with small molecules, long non-coding RNAs, epigenetic modifiers, disease associations, etc. These rich sets of information have enabled the creation of comprehensive networks linking miRNAs with various biologically important entities to shed light on their collective functions and regulatory mechanisms. Here, we introduce miRNet, an easy-to-use web-based tool that offers statistical, visual and network-based approaches to help researchers understand miRNA functions and regulatory mechanisms. The key features of miRNet include: (i) a comprehensive knowledge base integrating high-quality miRNA-target interaction data from 11 databases; (ii) support for differential expression analysis of data from microarray, RNA-seq and quantitative PCR; (iii) implementation of a flexible interface for data filtering, refinement and customization during network creation; (iv) a powerful fully featured network visualization system coupled with enrichment analysis. miRNet offers a comprehensive tool suite to enable statistical analysis and functional interpretation of various data generated from current miRNA studies. miRNet is freely available at http://www.mirnet.ca.
Affiliation(s)
- Yannan Fan
- Institute of Parasitology, McGill University, Sainte Anne de Bellevue, Québec H9X 3V9, Canada; Centre for Host-Parasite Interactions, McGill University, Sainte Anne de Bellevue, Québec H9X 3V9, Canada
| | - Keith Siklenka
- Department of Pharmacology and Therapeutics, McGill University, Montreal, Québec H3G 1Y6, Canada
| | - Simran K Arora
- Institute of Parasitology, McGill University, Sainte Anne de Bellevue, Québec H9X 3V9, Canada; Centre for Host-Parasite Interactions, McGill University, Sainte Anne de Bellevue, Québec H9X 3V9, Canada
| | - Paula Ribeiro
- Institute of Parasitology, McGill University, Sainte Anne de Bellevue, Québec H9X 3V9, Canada; Centre for Host-Parasite Interactions, McGill University, Sainte Anne de Bellevue, Québec H9X 3V9, Canada
| | - Sarah Kimmins
- Department of Pharmacology and Therapeutics, McGill University, Montreal, Québec H3G 1Y6, Canada; Department of Animal Science, McGill University, Sainte Anne de Bellevue, Québec H9X 3V9, Canada
| | - Jianguo Xia
- Institute of Parasitology, McGill University, Sainte Anne de Bellevue, Québec H9X 3V9, Canada; Centre for Host-Parasite Interactions, McGill University, Sainte Anne de Bellevue, Québec H9X 3V9, Canada; Department of Animal Science, McGill University, Sainte Anne de Bellevue, Québec H9X 3V9, Canada
| |
|
8
|
|
9
|
Huang HH, Yu C. Clustering DNA sequences using the out-of-place measure with reduced n-grams. J Theor Biol 2016; 406:61-72. [PMID: 27375217 DOI: 10.1016/j.jtbi.2016.06.029] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 12/10/2015] [Revised: 05/18/2016] [Accepted: 06/21/2016] [Indexed: 11/25/2022]
Abstract
The alignment-free n-gram based method, with the out-of-place measure as the distance, has been successfully applied to automatic categorization of texts and natural languages in real time. However, its performance and the appropriate choice of n for comparing genome sequences remain unclear. Here we propose a symmetric version of the out-of-place measure and a new approach for finding the optimal range of n to construct a phylogenetic tree with the symmetric out-of-place measures. Our method is then applied to real genome sequence datasets. The resulting phylogenetic trees match the standard biological classification. This shows that our proposed method is a very powerful tool for phylogenetic analysis in terms of both classification accuracy and computation efficiency.
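A rough sketch of an out-of-place distance between two sequences follows. It assumes the classic rank-displacement formulation from text categorization (each n-gram is penalized by how far its frequency rank moves between the two profiles), and it symmetrizes by summing both directions. That symmetrization, the missing-n-gram penalty and all function names are assumptions made for illustration; the paper's exact symmetric definition and its procedure for choosing n are not reproduced.

```python
from collections import Counter


def ngram_ranks(seq, n, top=None):
    """Map each n-gram of seq to its frequency rank (0 = most frequent)."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if top is not None:
        ranked = ranked[:top]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(ranks_a, ranks_b):
    """Sum of rank displacements of profile A's n-grams relative to profile B."""
    # Assumed penalty for an n-gram that is absent from profile B.
    penalty = max(len(ranks_a), len(ranks_b))
    return sum(abs(rank - ranks_b[gram]) if gram in ranks_b else penalty
               for gram, rank in ranks_a.items())


def symmetric_out_of_place(seq_a, seq_b, n=3, top=None):
    """Sum of the two one-directional measures, so d(a, b) == d(b, a)."""
    ranks_a = ngram_ranks(seq_a, n, top)
    ranks_b = ngram_ranks(seq_b, n, top)
    return out_of_place(ranks_a, ranks_b) + out_of_place(ranks_b, ranks_a)
```

A pairwise matrix of such distances could then be passed to any standard tree-building method; that step is outside this sketch.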
Affiliation(s)
- Hsin-Hsiung Huang
- Department of Statistics, University of Central Florida, Orlando, FL 32816, USA.
| | - Chenglong Yu
- Mind and Brain Theme, South Australian Health and Medical Research Institute, North Terrace, Adelaide, SA 5000, Australia; School of Medicine, Flinders University, Adelaide, SA 5001, Australia
| |
|
10
|
Frades I, Resjö S, Andreasson E. Comparison of phosphorylation patterns across eukaryotes by discriminative N-gram analysis. BMC Bioinformatics 2015. [PMID: 26224486 PMCID: PMC4520095 DOI: 10.1186/s12859-015-0657-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 11/10/2022] Open
Abstract
Background How protein phosphorylation relates to kingdom/phylum divergence is largely unknown and the amino acid residues surrounding the phosphorylation site have profound importance on protein kinase–substrate interactions. Standard motif analysis is not adequate for large scale comparative analysis because each phosphopeptide is assigned to a unique motif and it performs poorly with the unbalanced nature of the input datasets. Results First the discriminative n-grams of five species from five different kingdom/phyla were identified. A signature with 5540 discriminative n-grams that could be found in other species from the same kingdoms/phyla was created. Using a test data set, the ability of the signature to classify species in their corresponding kingdom/phylum was confirmed using classification methods. Lastly, ortholog proteins among proteins with n-grams were identified in order to determine to what degree the identity of the detected n-grams was a property of phosphosites rather than a consequence of species-specific or kingdom/phylum-specific protein inventory. The motifs were grouped in clusters of equal physico-chemical nature and their distribution was similar between species in the same kingdom/phylum while clear differences were found among species of different kingdom/phylum. For example, the animal-specific top discriminative n-grams contained many basic amino acids and the plant-specific motifs were mainly acidic. Secondary structure prediction methods show that the discriminative n-grams in the majority of cases lack a regular secondary structure, as on average they had 88 % of random coil compared to 66 % found in the phosphoproteins they were derived from.
Conclusions The discriminative n-grams were able to classify organisms in their corresponding kingdom/phylum, they show different patterns among species of different kingdom/phylum and these regions can contribute to evolutionary divergence as they are in disordered regions that can evolve rapidly. The differences found possibly reflect group-specific differences in the kinomes of the different groups of species. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0657-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Itziar Frades
- Department of Plant Protection Biology, Swedish University of Agricultural Sciences, Alnarp, SE-230 53, Sweden.
| | - Svante Resjö
- Department of Plant Protection Biology, Swedish University of Agricultural Sciences, Alnarp, SE-230 53, Sweden.
| | - Erik Andreasson
- Department of Plant Protection Biology, Swedish University of Agricultural Sciences, Alnarp, SE-230 53, Sweden.
| |
|
11
|
Maury JJP, Ng D, Bi X, Bardor M, Choo ABH. Multiple Reaction Monitoring Mass Spectrometry for the Discovery and Quantification of O-GlcNAc-Modified Proteins. Anal Chem 2013; 86:395-402. [DOI: 10.1021/ac401821d] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 01/08/2023]
Affiliation(s)
- Julien Jean Pierre Maury
- Bioprocessing Technology Institute, Agency for Science, Technology and Research (A*STAR), 20 Biopolis Way, #06-01 Centros, Singapore 138668
- Department of Bioengineering, Faculty of Engineering, National University of Singapore, Singapore 119077
| | - Daniel Ng
- Bioprocessing Technology Institute, Agency for Science, Technology and Research (A*STAR), 20 Biopolis Way, #06-01 Centros, Singapore 138668
| | - Xuezhi Bi
- Bioprocessing Technology Institute, Agency for Science, Technology and Research (A*STAR), 20 Biopolis Way, #06-01 Centros, Singapore 138668
| | - Muriel Bardor
- Bioprocessing Technology Institute, Agency for Science, Technology and Research (A*STAR), 20 Biopolis Way, #06-01 Centros, Singapore 138668
- Université de Rouen, Laboratoire Glycobiologie et Matrice Extracellulaire Végétale (Glyco-MEV) EA 4358, Institut de Recherche et d’Innovation Biomédicale (IRIB), Faculté des Sciences et Techniques, 76821 Mont-Saint-Aignan Cédex, France
| | - Andre Boon-Hwa Choo
- Bioprocessing Technology Institute, Agency for Science, Technology and Research (A*STAR), 20 Biopolis Way, #06-01 Centros, Singapore 138668
- Department of Bioengineering, Faculty of Engineering, National University of Singapore, Singapore 119077
| |
|
12
|
Zemková M, Trifonov EN, Zahradník D. One common structural feature of "words" in protein sequences and human texts. J Biomol Struct Dyn 2013; 32:1085-91. [PMID: 23808620 DOI: 10.1080/07391102.2013.809317] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/26/2022]
Abstract
The frequently discussed analogy between genetic and human texts is explored by comparing the alternation of polar and non-polar amino-acid residues in proteins with the alternation of consonants and vowels in human texts. In human languages, the usage of possible combinations of consonants and vowels is influenced by the pronounceability of the combinations. Similarly, the oligopeptide composition of proteins is influenced by the requirements of protein folding and stability. One special type of structure often present in proteins is the amphipathic α-helix, in which polar and non-polar amino acids alternate with a period of 3.5 residues, not unlike the alternation of consonants and vowels. In this study, we evaluated the contribution made by amphipathic alternations to the protein sequence texts (20-24%). Their proportion is lower than the respective values for alternating words in human texts (57-89%). The proteomes (full sets of proteins for selected organisms) were transformed into ranked sequences of n-grams (words of length n), including periodical amphipathic structures. Similarly, human texts were transformed into sequences of alternating consonants and vowels. Analysis of the vocabularies shows that in both types of texts (human languages and proteins) the alternating words are dominant or highly preferred, thus strengthening the analogy between these two types of texts. The contribution of amphipathic words in the upper parts of the ranked lists for 10 analyzed proteomes varies between 58 and 74%. In human texts the respective values range between 90 and 100%.
Affiliation(s)
- M Zemková
- a Faculty of Science, Department of Philosophy and History of Science , Charles University in Prague , Viničná 7, Praha CZ-12844 , Czech Republic
| | | | | |
|
13
|
Srinivasan SM, Vural S, King BR, Guda C. Mining for class-specific motifs in protein sequence classification. BMC Bioinformatics 2013; 14:96. [PMID: 23496846 PMCID: PMC3610217 DOI: 10.1186/1471-2105-14-96] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 05/09/2012] [Accepted: 12/17/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In protein sequence classification, identification of the sequence motifs or n-grams that can precisely discriminate between classes is a more interesting scientific question than the classification itself. A number of classification methods aim at accurate classification but fail to explain which sequence features indeed contribute to the accuracy. We hypothesize that sequences in lower denominations (n-grams) can be used to explore the sequence landscape and to identify class-specific motifs that discriminate between classes during classification. Discriminative n-grams are short peptide sequences that are highly frequent in one class but are either minimally present or absent in other classes. In this study, we present a new substitution-based scoring function for identifying discriminative n-grams that are highly specific to a class. RESULTS We present a scoring function based on discriminative n-grams that can effectively discriminate between classes. The scoring function, initially, harvests the entire set of 4- to 8-grams from the protein sequences of different classes in the dataset. Similar n-grams of the same size are combined to form new n-grams, where the similarity is defined by positive amino acid substitution scores in the BLOSUM62 matrix. Substitution has resulted in a large increase in the number of discriminatory n-grams harvested. Due to the unbalanced nature of the dataset, the frequencies of the n-grams are normalized using a dampening factor, which gives more weightage to the n-grams that appear in fewer classes and vice-versa. After the n-grams are normalized, the scoring function identifies discriminative 4- to 8-grams for each class that are frequent enough to be above a selection threshold. By mapping these discriminative n-grams back to the protein sequences, we obtained contiguous n-grams that represent short class-specific motifs in protein sequences. 
Our method fared well compared to an existing motif-finding method known as Wordspy. We have validated our enriched set of class-specific motifs against the functionally important motifs obtained from the NLSdb, Prosite and ELM databases. We demonstrate that this method is generic and can thus be widely applied to detect class-specific motifs in many protein sequence classification tasks. CONCLUSION The proposed scoring function and methodology are able to identify class-specific motifs using discriminative n-grams derived from the protein sequences. The implementation of amino acid substitution scores for similarity detection, and the dampening factor to normalize the unbalanced datasets, have a significant effect on the performance of the scoring function. Our multipronged validation tests demonstrate that this method can detect class-specific motifs from a wide variety of protein sequence classes with a potential application to detecting proteome-specific motifs of different organisms.
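As a rough illustration of the pipeline this abstract describes, the following Python sketch harvests 4- to 8-grams per class, weights them with a dampening factor, and keeps those above a selection threshold. It is not the authors' implementation: the function names, the IDF-style form of the dampening factor, and the threshold value are all assumptions made for illustration, and the BLOSUM62-based merging of similar n-grams is omitted.

```python
import math
from collections import Counter


def harvest_ngrams(seq, n_min=4, n_max=8):
    """Count every contiguous n-gram of length n_min..n_max in one sequence."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def discriminative_ngrams(classes, threshold=1.0, n_min=4, n_max=8):
    """classes maps a class label to a list of protein sequences.

    Returns {label: {ngram: score}} for n-grams scoring >= threshold.
    """
    # Pool n-gram counts over all sequences of each class.
    per_class = {}
    for label, seqs in classes.items():
        total = Counter()
        for s in seqs:
            total.update(harvest_ngrams(s, n_min, n_max))
        per_class[label] = total

    # In how many classes does each n-gram occur at least once?
    class_count = Counter()
    for total in per_class.values():
        class_count.update(total.keys())

    n_classes = len(classes)
    result = {}
    for label, total in per_class.items():
        n_seqs = len(classes[label])
        scored = {}
        for gram, freq in total.items():
            # Assumed dampening factor (IDF-style): larger when the n-gram
            # occurs in fewer classes. The paper's exact formula may differ.
            damp = math.log(1 + n_classes / class_count[gram])
            # Dividing by the class size compensates for unbalanced classes.
            score = (freq / n_seqs) * damp
            if score >= threshold:
                scored[gram] = score
        result[label] = scored
    return result
```

For example, calling `discriminative_ngrams({"A": [...], "B": [...]}, n_min=4, n_max=4)` on two small, hypothetical classes returns only the 4-grams that recur within one class and are absent from the other.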
Affiliation(s)
- Satish M Srinivasan
- Department of Genetics, Cell Biology and Anatomy, University of Nebraska Medical Center, Omaha, NE 68198-5145, USA
14
Motomura K, Fujita T, Tsutsumi M, Kikuzato S, Nakamura M, Otaki JM. Word decoding of protein amino acid sequences with availability analysis: a linguistic approach. PLoS One 2012; 7:e50039. [PMID: 23185527 PMCID: PMC3503725 DOI: 10.1371/journal.pone.0050039] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Received: 07/02/2012] [Accepted: 10/15/2012] [Indexed: 11/19/2022] Open
Abstract
The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information is related to structures and functions is still enigmatic. In this study, we show that at least a part of the sequence information can be extracted by treating amino acid sequences of proteins as a collection of English words, based on a working hypothesis that amino acid sequences of proteins are composed of short constituent amino acid sequences (SCSs) or "words". We first confirmed that the English language very likely follows Zipf's law, a special case of a power law. We found that the rank-frequency plot of SCSs in proteins exhibits a similar distribution when low-rank tails are excluded. In comparison with natural English and "compressed" English without spaces between words, amino acid sequences of proteins show larger linear ranges and smaller exponents with heavier low-rank tails, demonstrating that the SCS distribution in proteins is largely scale-free. The distribution pattern of SCSs in proteins is similar among species, but species-specific features are also present. Based on the availability scores of SCSs, we found that sequence motifs are enriched in high-availability sites (i.e., "key words") and vice versa. In fact, the highest availability peak within a given protein sequence often directly corresponds to a sequence motif. The amino acid composition of high-availability sites within motifs is different from that of entire motifs and all protein sequences, suggesting the possible functional importance of specific SCSs and their compositional amino acids within motifs. We anticipate that our availability-based word decoding approach is complementary to sequence alignment approaches in predicting functionally important sites of unknown proteins from their amino acid sequences.
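The rank-frequency analysis mentioned in this abstract can be sketched in a few lines of Python. The sketch below counts fixed-length subsequences, ranks them by frequency, and estimates a Zipf exponent by least squares on the log-log plot. The fixed word length `k`, the function names, and the plain least-squares fit are assumptions for illustration. The paper's availability score is not reproduced here.

```python
import math
from collections import Counter


def rank_frequency(sequences, k=3):
    """Return [(rank, frequency), ...] for all length-k subsequences,
    with rank 1 assigned to the most frequent one."""
    counts = Counter()
    for s in sequences:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_exponent(rank_freq):
    """Least-squares slope of log(frequency) against log(rank), sign-flipped.

    For an ideal Zipf distribution, frequency ~ rank ** (-a), this returns a.
    Needs at least two (rank, frequency) points.
    """
    xs = [math.log(r) for r, _ in rank_freq]
    ys = [math.log(f) for _, f in rank_freq]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx
```

Restricting the fit to a subset of ranks (for instance, dropping a tail of the list before calling `zipf_exponent`) is one way to examine how the linear range affects the estimated exponent.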
Affiliation(s)
- Kenta Motomura
- The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa, Japan
- Department of Information Science, University of the Ryukyus, Nishihara, Okinawa, Japan
- Tomohiro Fujita
- The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa, Japan
- Motosuke Tsutsumi
- The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa, Japan
- Satsuki Kikuzato
- The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa, Japan
- Morikazu Nakamura
- Department of Information Science, University of the Ryukyus, Nishihara, Okinawa, Japan
- Joji M. Otaki
- The BCPH Unit of Molecular Physiology, Department of Chemistry, Biology and Marine Science, University of the Ryukyus, Nishihara, Okinawa, Japan
15
Ganapathiraju MK, Mitchell AD, Thahir M, Motwani K, Ananthasubramanian S. Suite of tools for statistical N-gram language modeling for pattern mining in whole genome sequences. J Bioinform Comput Biol 2012; 10:1250016. [PMID: 22817111 DOI: 10.1142/s0219720012500163] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Genome sequences contain a number of patterns that have biomedical significance. Repetitive sequences of various kinds are a primary component of most of the genomic sequence patterns. We extended the suffix-array based Biological Language Modeling Toolkit to compute n-gram frequencies as well as n-gram language-model based perplexity in windows over the whole genome sequence to find biologically relevant patterns. We present the suite of tools and their application to the analysis of the whole human genome sequence.
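To make the idea of windowed n-gram perplexity concrete, here is a minimal Python sketch. It is not the Biological Language Modeling Toolkit and does not use suffix arrays: it trains a simple add-one-smoothed n-gram model with dictionary counts and then scores sliding windows. The function names, the smoothing scheme, and the window and step sizes are assumptions for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumed DNA alphabet; other symbols are not handled


def train_ngram_model(seq, n=3):
    """Count n-grams and their (n-1)-symbol contexts in a training sequence."""
    ngram_counts = Counter()
    context_counts = Counter()
    for i in range(len(seq) - n + 1):
        gram = seq[i:i + n]
        ngram_counts[gram] += 1
        context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def window_perplexity(window, model, n=3):
    """Perplexity of one window under the model, with add-one smoothing."""
    ngram_counts, context_counts = model
    log_prob = 0.0
    m = 0
    for i in range(len(window) - n + 1):
        gram = window[i:i + n]
        # P(last symbol | context), smoothed over the 4-letter alphabet.
        p = (ngram_counts[gram] + 1) / (context_counts[gram[:-1]] + len(ALPHABET))
        log_prob += math.log2(p)
        m += 1
    if m == 0:
        raise ValueError("window is shorter than n")
    return 2 ** (-log_prob / m)


def sliding_perplexity(seq, model, window=50, step=25, n=3):
    """Return [(start, perplexity), ...] for windows sliding over seq."""
    return [
        (start, window_perplexity(seq[start:start + window], model, n))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Under this scheme, windows that look like the training data score low perplexity, while an untrained model assigns every symbol probability 1/4 and therefore a perplexity of 4.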
Affiliation(s)
- Madhavi K Ganapathiraju
- Department of Biomedical Informatics, University of Pittsburgh, 5607 Baum Boulevard, Suite BAUM 423, Pittsburgh, PA 15206-3701, USA.
16
King BR, Vural S, Pandey S, Barteau A, Guda C. ngLOC: software and web server for predicting protein subcellular localization in prokaryotes and eukaryotes. BMC Res Notes 2012; 5:351. [PMID: 22780965 PMCID: PMC3532370 DOI: 10.1186/1756-0500-5-351] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.1] [Received: 03/02/2012] [Accepted: 06/22/2012] [Indexed: 01/04/2023] Open
Abstract
Background Understanding protein subcellular localization is a necessary component toward understanding the overall function of a protein. Numerous computational methods have been published over the past decade, with varying degrees of success. Despite the large number of published methods in this area, only a small fraction of them are available for researchers to use in their own studies. Of those that are available, many are limited by predicting only a small number of organelles in the cell. Additionally, the majority of methods predict only a single location for a sequence, even though it is known that a large fraction of the proteins in eukaryotic species shuttle between locations to carry out their function. Findings We present a software package and a web server for predicting the subcellular localization of protein sequences based on the ngLOC method. ngLOC is an n-gram-based Bayesian classifier that predicts subcellular localization of proteins both in prokaryotes and eukaryotes. The overall prediction accuracy varies from 89.8% to 91.4% across species. This program can predict 11 distinct locations each in plant and animal species. ngLOC also predicts 4 and 5 distinct locations on gram-positive and gram-negative bacterial datasets, respectively. Conclusions ngLOC is a generic method that can be trained by data from a variety of species or classes for predicting protein subcellular localization. The standalone software is freely available for academic use under GNU GPL, and the ngLOC web server is also accessible at http://ngloc.unmc.edu.
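The general idea of an n-gram-based Bayesian classifier can be sketched as follows. This is a generic multinomial naive Bayes model over n-gram counts with Laplace smoothing, written for illustration only. It is not the ngLOC software, and the class name, parameters, and toy location labels are hypothetical.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """List all contiguous n-grams of a sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


class NGramNaiveBayes:
    """Multinomial naive Bayes over n-gram counts (illustrative sketch)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha            # Laplace smoothing constant
        self.gram_counts = {}         # label -> Counter of n-grams
        self.class_counts = Counter() # label -> number of training sequences
        self.vocab = set()

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            self.class_counts[label] += 1
            counts = self.gram_counts.setdefault(label, Counter())
            grams = ngrams(seq, self.n)
            counts.update(grams)
            self.vocab.update(grams)
        return self

    def log_scores(self, seq):
        """Unnormalized log posterior of each class for one sequence."""
        total_seqs = sum(self.class_counts.values())
        v = len(self.vocab) + 1  # one extra slot for unseen n-grams
        scores = {}
        for label, counts in self.gram_counts.items():
            total = sum(counts.values())
            score = math.log(self.class_counts[label] / total_seqs)
            for g in ngrams(seq, self.n):
                score += math.log((counts[g] + self.alpha) / (total + self.alpha * v))
            scores[label] = score
        return scores

    def predict(self, seq, top_k=1):
        """Return the top_k labels, best first."""
        scores = self.log_scores(seq)
        return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Returning the `top_k` best labels instead of a single one is a simple way to surface more than one candidate location for a sequence, in the spirit of the multi-location point raised in the abstract.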
Affiliation(s)
- Brian R King
- Department of Computer Science, Bucknell University, One Dent Drive, Lewisburg, PA 17837, USA