Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Maetschke SR, Kassahn KS, Dunn JA, Han SP, Curley EZ, Stacey KJ, Ragan MA. A visual framework for sequence analysis using n-grams and spectral rearrangement. Bioinformatics 2010;26:737-44. [PMID: 20130028 DOI: 10.1093/bioinformatics/btq042] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022]

Cited by Other Article(s)

Luu VT, Forestier G, Weber J, Bourgeois P, Djelil F, Muller PA. A review of alignment based similarity measures for web usage mining. Artif Intell Rev 2020. [DOI: 10.1007/s10462-019-09712-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]

Bernard G, Chan CX, Chan YB, Chua XY, Cong Y, Hogan JM, Maetschke SR, Ragan MA. Alignment-free inference of hierarchical and reticulate phylogenomic relationships. Brief Bioinform 2019;20:426-435. [PMID: 28673025 PMCID: PMC6433738 DOI: 10.1093/bib/bbx067] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 02/08/2017] [Revised: 05/04/2017] [Indexed: 11/22/2022] Open

Zielezinski A, Vinga S, Almeida J, Karlowski WM. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biol 2017;18:186. [PMID: 28974235 PMCID: PMC5627421 DOI: 10.1186/s13059-017-1319-7] [Citation(s) in RCA: 244] [Impact Index Per Article: 34.9] [Indexed: 01/30/2023] Open

Vinga S. Information theory applications for biological sequence analysis. Brief Bioinform 2014;15:376-89. [PMID: 24058049 PMCID: PMC7109941 DOI: 10.1093/bib/bbt068] [Citation(s) in RCA: 67] [Impact Index Per Article: 6.7] [Received: 07/11/2013] [Accepted: 08/17/2013] [Indexed: 01/13/2023] Open

Srinivasan SM, Vural S, King BR, Guda C. Mining for class-specific motifs in protein sequence classification. BMC Bioinformatics 2013;14:96. [PMID: 23496846 PMCID: PMC3610217 DOI: 10.1186/1471-2105-14-96] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 05/09/2012] [Accepted: 12/17/2012] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

In protein sequence classification, identifying the sequence motifs or n-grams that can precisely discriminate between classes is a more interesting scientific question than the classification itself. A number of classification methods aim at accurate classification but fail to explain which sequence features actually contribute to the accuracy. We hypothesize that short subsequences (n-grams) can be used to explore the sequence landscape and to identify class-specific motifs that discriminate between classes during classification. Discriminative n-grams are short peptide sequences that are highly frequent in one class but are either minimally present or absent in other classes. In this study, we present a new substitution-based scoring function for identifying discriminative n-grams that are highly specific to a class.

RESULTS

We present a scoring function based on discriminative n-grams that can effectively discriminate between classes. The scoring function first harvests the entire set of 4- to 8-grams from the protein sequences of the different classes in the dataset. Similar n-grams of the same size are combined to form new n-grams, where similarity is defined by positive amino acid substitution scores in the BLOSUM62 matrix. Substitution resulted in a large increase in the number of discriminative n-grams harvested. Because the dataset is unbalanced, the frequencies of the n-grams are normalized using a dampening factor, which gives more weight to n-grams that appear in fewer classes and less weight to those that appear in more classes. After normalization, the scoring function identifies discriminative 4- to 8-grams for each class that are frequent enough to exceed a selection threshold. By mapping these discriminative n-grams back to the protein sequences, we obtained contiguous n-grams that represent short class-specific motifs in protein sequences. Our method fared well compared to an existing motif-finding method known as Wordspy. We validated our enriched set of class-specific motifs against the functionally important motifs obtained from the NLSdb, Prosite and ELM databases. We demonstrate that this method is generic and can thus be widely applied to detect class-specific motifs in many protein sequence classification tasks.

CONCLUSION

The proposed scoring function and methodology are able to identify class-specific motifs using discriminative n-grams derived from the protein sequences. The use of amino acid substitution scores for similarity detection and of the dampening factor to normalize the unbalanced datasets has a significant effect on the performance of the scoring function. Our multipronged validation tests demonstrate that this method can detect class-specific motifs from a wide variety of protein sequence classes, with a potential application to detecting proteome-specific motifs of different organisms.
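The harvesting, dampening and thresholding steps described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the dampening formula, so an IDF-style factor log(k / number of classes containing the n-gram) is assumed here, the BLOSUM62 merging step is omitted entirely, and the function names, the per-sequence relative frequency and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence (empty if seq is shorter than n)."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def discriminative_ngrams(classes, n_min=4, n_max=8, threshold=0.5):
    """classes maps a class label to a list of protein sequences.

    Returns, per class, the n-grams whose dampened frequency reaches the threshold.
    """
    # Relative frequency per class: fraction of sequences containing the n-gram.
    rel_freq = {}
    for label, seqs in classes.items():
        counts = Counter()
        for seq in seqs:
            seen = set()
            for n in range(n_min, n_max + 1):
                seen.update(ngrams(seq, n))
            counts.update(seen)  # each n-gram counted once per sequence
        rel_freq[label] = {g: c / len(seqs) for g, c in counts.items()}

    # Number of classes in which each n-gram occurs at least once.
    classes_with = Counter(g for freqs in rel_freq.values() for g in freqs)
    k = len(classes)

    selected = {}
    for label, freqs in rel_freq.items():
        # ASSUMPTION: IDF-style dampening; n-grams present in every class score 0.
        scores = {g: f * math.log(k / classes_with[g]) for g, f in freqs.items()}
        selected[label] = {g: s for g, s in scores.items() if s >= threshold}
    return selected


# Toy data, invented for illustration only.
toy = {
    "nuclear": ["MKRKRAA", "AAKRKRQ"],
    "cytosolic": ["MAAGLLV", "GLLVAAQ"],
}
hits = discriminative_ngrams(toy)
print({label: sorted(found) for label, found in hits.items()})
# {'nuclear': ['KRKR'], 'cytosolic': ['GLLV']}
```

In this toy input only `KRKR` and `GLLV` occur in every sequence of their own class and in no other class, so they are the only n-grams that clear the assumed threshold.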

Wymore T, Brooks CL. From Molecular Phylogenetics to Quantum Chemistry: Discovering Enzyme Design Principles through Computation. Comput Struct Biotechnol J 2012;2:e201209018. [PMID: 24688659 PMCID: PMC3962182 DOI: 10.5936/csbj.201209018] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/05/2012] [Revised: 11/14/2012] [Accepted: 11/15/2012] [Indexed: 11/22/2022] Open

Jiang X, Langille MGI, Neches RY, Elliot M, Levin SA, Eisen JA, Weitz JS, Dushoff J. Functional biogeography of ocean microbes revealed through non-negative matrix factorization. PLoS One 2012;7:e43866. [PMID: 23049741 PMCID: PMC3445553 DOI: 10.1371/journal.pone.0043866] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 02/12/2012] [Accepted: 07/30/2012] [Indexed: 01/12/2023] Open

Abstract

The direct “metagenomic” sequencing of genomic material from complex assemblages of bacteria, archaea, viruses and microeukaryotes has yielded new insights into the structure of microbial communities. For example, analysis of metagenomic data has revealed the existence of previously unknown microbial taxa whose spatial distributions are limited by environmental conditions, ecological competition, and dispersal mechanisms. However, differences in genotypes that might lead biologists to designate two microbes as taxonomically distinct need not necessarily imply differences in ecological function. Hence, there is a growing need for large-scale analysis of the distribution of microbial function across habitats. Here, we present a framework for investigating the biogeography of microbial function by analyzing the distribution of protein families inferred from environmental sequence data across a global collection of sites. We map over 6,000,000 protein sequences from unassembled reads from the Global Ocean Survey dataset to protein families, generating a protein family relative abundance matrix that describes the distribution of each protein family across sites. We then use non-negative matrix factorization (NMF) to approximate these protein family profiles as linear combinations of a small number of ecological components. Each component has a characteristic functional profile and site profile. Our approach identifies common functional signatures within several of the components. We use our method as a filter to estimate functional distance between sites, and find that an NMF-filtered measure of functional distance is more strongly correlated with environmental distance than a comparable PCA-filtered measure. We also find that functional distance is more strongly correlated with environmental distance than with geographic distance, in agreement with prior studies. 
We identify similar protein functions in several components and suggest that functional co-occurrence across metagenomic samples could lead to future methods for de novo functional prediction. We conclude by discussing how NMF and other dimension reduction methods can help enable a macroscopic functional description of marine ecosystems.
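The factorization step in this abstract, approximating a protein family by site abundance matrix as a product of two small non-negative matrices, can be sketched as follows. This is only an illustration under stated assumptions: it uses the standard multiplicative update rules for the Frobenius-norm objective, which the abstract does not confirm as the authors' algorithm, it is written with plain Python lists to avoid dependencies, and the matrix values and function names are invented.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(row) for row in zip(*m)]


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Approximate non-negative v (n x m) as w (n x k) times h (k x m).

    ASSUMPTION: multiplicative updates for the squared Frobenius error.
    """
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of the residual v - w h."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


# Invented toy matrix: rows are protein families, columns are sites.
abundance = [
    [1.0, 0.0, 2.0],
    [2.0, 0.0, 4.0],
    [0.0, 3.0, 0.0],
    [0.0, 1.0, 0.0],
]
w, h = nmf(abundance, k=2)
print(frobenius_error(abundance, w, h))
```

In the abstract's terms, each column of `w` would play the role of a component's functional profile and each row of `h` its site profile; a real analysis would more likely rely on an established numerical library than on this hand-rolled loop.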

The mammalian PYHIN gene family: phylogeny, evolution and expression. BMC Evol Biol 2012;12:140. [PMID: 22871040 PMCID: PMC3458909 DOI: 10.1186/1471-2148-12-140] [Citation(s) in RCA: 147] [Impact Index Per Article: 12.3] [Received: 02/21/2012] [Accepted: 07/27/2012] [Indexed: 01/13/2023] Open

Abstract

Background

Proteins of the mammalian PYHIN (IFI200/HIN-200) family are involved in defence against infection through recognition of foreign DNA. The family member absent in melanoma 2 (AIM2) binds cytosolic DNA via its HIN domain and initiates inflammasome formation via its pyrin domain. AIM2 lies within a cluster of related genes, many of which are uncharacterised in mouse. To better understand the evolution, orthology and function of these genes, we have documented the range of PYHIN genes present in representative mammalian species, and undertaken phylogenetic and expression analyses.

Results

No PYHIN genes are evident in non-mammals or monotremes, with a single member found in each of three marsupial genomes. Placental mammals show variable family expansions, from one gene in cow to four in human and 14 in mouse. A single HIN domain appears to have evolved in the common ancestor of marsupials and placental mammals, and duplicated to give rise to three distinct forms (HIN-A, -B and -C) in the placental mammal ancestor. Phylogenetic analyses showed that AIM2 HIN-C and pyrin domains clearly diverge from the rest of the family, and it is the only PYHIN protein with orthology across many species. Interestingly, although AIM2 is important in defence against some bacteria and viruses in mice, AIM2 is a pseudogene in cow, sheep, llama, dolphin, dog and elephant. The other 13 mouse genes have arisen by duplication and rearrangement within the lineage, which has allowed some diversification in expression patterns.

Conclusions

The role of AIM2 in forming the inflammasome is relatively well understood, but molecular interactions of other PYHIN proteins involved in defence against foreign DNA remain to be defined. The non-AIM2 PYHIN protein sequences are very distinct from AIM2, suggesting they vary in effector mechanism in response to foreign DNA, and may bind different DNA structures. The PYHIN family has highly varied gene composition between mammalian species due to lineage-specific duplication and loss, which probably indicates different adaptations for fighting infectious disease. Non-genomic DNA can indicate infection, or a mutagenic threat. We hypothesise that defence of the genome against endogenous retroelements has been an additional evolutionary driver for PYHIN proteins.

Jiang X, Weitz JS, Dushoff J. A non-negative matrix factorization framework for identifying modular patterns in metagenomic profile data. J Math Biol 2011;64:697-711. [PMID: 21630089 DOI: 10.1007/s00285-011-0428-2] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 10/04/2010] [Revised: 03/18/2011] [Indexed: 11/26/2022]

Osmanbeyoglu HU, Ganapathiraju MK. N-gram analysis of 970 microbial organisms reveals presence of biological language models. BMC Bioinformatics 2011;12:12. [PMID: 21219653 PMCID: PMC3027111 DOI: 10.1186/1471-2105-12-12] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 05/18/2010] [Accepted: 01/10/2011] [Indexed: 11/29/2022] Open

Abstract

Background

It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts, such as "signature-style" word usage indicative of authors or topics, and that algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of "biological language modeling", statistical n-gram analysis has been applied to the comparative analysis of whole proteome sequences of 44 organisms. It has been shown that a few particular amino acid n-grams are found in abundance in one organism but occur very rarely in other organisms, thereby serving as genome signatures. At that time proteomes of only 44 organisms were available, which limited the generalization of this hypothesis. Today nearly 1,000 genome sequences and corresponding translated sequences are available, making it feasible to test the existence of biological language models over the evolutionary tree.

Results

We studied whole proteome sequences of 970 microbial organisms using n-gram frequencies and cross-perplexity, employing the Biological Language Modeling Toolkit and the Patternix Revelio toolkit. Genus-specific signatures were observed even in a simple unigram distribution. Taking the statistical n-gram model of one organism as a reference and computing the cross-perplexity of all other microbial proteomes against it, we found cross-perplexity to be predictive of branch distance in the phylogenetic tree. For example, a 4-gram model from the proteome of Shigella flexneri 2a, which belongs to the Gammaproteobacteria class, showed a self-perplexity of 15.34, while the cross-perplexity of other organisms was in the range of 15.59 to 29.5 and was proportional to their branching distance in the evolutionary tree from S. flexneri. The organisms of this genus, which happen to be pathotypes of E. coli, also have the closest perplexity values to E. coli.

Conclusion

Whole proteome sequences of microbial organisms have been shown to contain particular n-gram sequences that are abundant in one organism but occur very rarely in other organisms, thereby serving as proteome signatures. Further, it has also been shown that perplexity, a statistical measure of similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree.
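The idea of scoring one proteome against the n-gram model of another can be illustrated with a small Python sketch. It does not reproduce the toolkits named in the abstract: it assumes a bigram model rather than the 4-gram model mentioned, add-one smoothing over the 20 standard amino acids, and invented toy sequences; the function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids


def train(seqs, n=2):
    """Count n-grams and their (n-1)-character contexts in the reference sequences."""
    grams, contexts = Counter(), Counter()
    for seq in seqs:
        for i in range(len(seq) - n + 1):
            grams[seq[i:i + n]] += 1
            contexts[seq[i:i + n - 1]] += 1
    return grams, contexts


def perplexity(model, seqs, n=2):
    """Perplexity of seqs under the model: exp of the mean negative log-probability.

    ASSUMPTION: add-one smoothing, so unseen n-grams get probability 1 / 20
    when their context was never observed. Requires at least one n-gram in seqs.
    """
    grams, contexts = model
    log_sum, total = 0.0, 0
    for seq in seqs:
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            p = (grams[gram] + 1) / (contexts[gram[:-1]] + len(ALPHABET))
            log_sum += math.log(p)
            total += 1
    return math.exp(-log_sum / total)


# Toy "proteomes", invented for illustration only.
reference = ["MKRKRKRKAA", "KRKRKRAAKR"]
close_relative = ["KRKRKRKR"]
distant_relative = ["GLLVGLLVGL"]

model = train(reference)
print(perplexity(model, reference))         # self-perplexity
print(perplexity(model, close_relative))    # lower cross-perplexity
print(perplexity(model, distant_relative))  # higher cross-perplexity
```

A lower cross-perplexity means the test sequences look more like the reference under the model, which is the quantity the abstract relates to branch distance.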
