1
Lee YW, Chen M, Chung IF, Chang TY. lncExplore: a database of pan-cancer analysis and systematic functional annotation for lncRNAs from RNA-sequencing data. Database: The Journal of Biological Databases and Curation 2021; 2021:6360505. [PMID: 34464437 PMCID: PMC8407485 DOI: 10.1093/database/baab053] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/26/2020] [Revised: 06/18/2021] [Accepted: 08/10/2021] [Indexed: 12/23/2022]
Abstract
Over the past few years, with the rapid growth of deep-sequencing technology and the development of computational prediction algorithms, a large number of long non-coding RNAs (lncRNAs) have been identified in various types of human cancers. Therefore, it has become critical to determine how to properly annotate the potential function of lncRNAs from RNA-sequencing (RNA-seq) data and arrange the robust information and analysis into a useful system readily accessible by biological and clinical researchers. In order to produce a collective interpretation of lncRNA functions, it is necessary to integrate different types of data regarding the important functional diversity and regulatory role of these lncRNAs. In this study, we utilized transcriptomic sequencing data to systematically observe and identify lncRNAs and their potential functions from 5034 The Cancer Genome Atlas RNA-seq datasets covering 24 cancers. Then, we constructed the 'lncExplore' database that was developed to comprehensively integrate various types of genomic annotation data for collective interpretation. The distinctive features in our lncExplore database include (i) novel lncRNAs verified by both coding potential and translation efficiency score, (ii) pan-cancer analysis for studying the significantly aberrant expression across 24 human cancers, (iii) genomic annotation of lncRNAs, such as cis-regulatory information and gene ontology, (iv) observation of the regulatory roles as enhancer RNAs and competing endogenous RNAs and (v) the findings of the potential lncRNA biomarkers for the user-interested cancers by integrating clinical information and disease specificity score. The lncExplore database is to our knowledge the first public lncRNA annotation database providing cancer-specific lncRNA expression profiles for not only known but also novel lncRNAs, enhancer RNAs annotation and clinical analysis based on pan-cancer analysis. 
lncExplore provides a more complete pathway to highly efficient, novel and more comprehensive translation of laboratory discoveries into the clinical context and will assist in reinterpreting the biological regulatory function of lncRNAs in cancer research. Database URL: http://lncexplore.bmi.nycu.edu.tw.
Affiliation(s)
- Yi-Wei Lee
  - Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, No.155, Sec. 2, Linong St., Beitou District, Taipei 11221, Taiwan
- Ming Chen
  - Department of Genomic Medicine and Center for Medical Genetics, Changhua Christian Hospital, No.176, Chong-Hua Rd., Changhua 50046, Taiwan; Research Department, Changhua Christian Hospital, No.135, Nan-Hsiao St., Changhua 50006, Taiwan; Department of Genomic Science and Technology, Changhua Christian Hospital Healthcare System, No.176, Chong-Hua Rd., Changhua 50046, Taiwan; Department of Obstetrics and Gynecology, Changhua Christian Hospital, No.135, Nan-Hsiao St., Changhua 50006, Taiwan; Department of Medical Genetics, National Taiwan University Hospital, No.7, Chung Shan S. Rd. (Zhongshan S. Rd.), Zhongzheng Dist., Taipei 10041, Taiwan; Department of Obstetrics and Gynecology, College of Medicine, National Taiwan University, No.7, Chung Shan S. Rd. (Zhongshan S. Rd.), Zhongzheng Dist., Taipei 10041, Taiwan; Department of Biomedical Science, Dayeh University, No.168, University Rd., Dacun, Changhua 51591, Taiwan; Department of Medical Science, National Tsing Hua University, No.101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan
- I-Fang Chung
  - Institute of Biomedical Informatics, National Yang Ming Chiao Tung University, No.155, Sec. 2, Linong St., Beitou District, Taipei 11221, Taiwan; Center for Systems and Synthetic Biology, National Yang Ming Chiao Tung University, No.155, Sec. 2, Linong St., Beitou District, Taipei 11221, Taiwan
- Ting-Yu Chang
  - Department of Genomic Medicine and Center for Medical Genetics, Changhua Christian Hospital, No.176, Chong-Hua Rd., Changhua 50046, Taiwan; Research Department, Changhua Christian Hospital, No.135, Nan-Hsiao St., Changhua 50006, Taiwan; Department of Genomic Science and Technology, Changhua Christian Hospital Healthcare System, No.176, Chong-Hua Rd., Changhua 50046, Taiwan; Department of Bioscience Technology, Chung Yuan Christian University, No.200, Chung Pei Road, Chung Li District, Taoyuan 32023, Taiwan
2
Jalali S, Gandhi S, Scaria V. Navigating the dynamic landscape of long noncoding RNA and protein-coding gene annotations in GENCODE. Hum Genomics 2016; 10:35. [PMID: 27793185 PMCID: PMC5084464 DOI: 10.1186/s40246-016-0090-2] [Citation(s) in RCA: 33] [Impact Index Per Article: 4.1] [Received: 09/01/2016] [Accepted: 10/16/2016] [Indexed: 12/03/2022] [Open access]
Abstract
Background: Our understanding of the transcriptional potential of the genome and its functional consequences has changed significantly in the last decade. This is largely due to improvements in technology that made it possible to annotate, and in many cases functionally characterize, a number of novel gene loci in the human genome. Keeping pace with advancements in this dynamic environment and systematically annotating a compendium of genes and transcripts is a formidable task. Of the many databases that have attempted to systematically annotate the genome, GENCODE has emerged as one of the largest and most popular compendia of human genome annotations. Results: The analysis of various versions of GENCODE revealed constant revision of transcripts for both protein-coding genes and long noncoding RNAs (lncRNAs), leading to conflicting annotations. GENCODE version 24 annotates 4.18% of the human genome as transcribed, an increase of 1.58% from its first version. Out of 251,614 transcripts annotated across GENCODE versions, only 21.7% were annotated consistently. We also found that the GENCODE consortium categorized transcripts into 70 biotypes, of which only 17 remained stable throughout. Conclusions: In this report, we review the dynamics of gene annotations, specifically lncRNA annotations, in GENCODE over the years. Our analysis suggests significant dynamism in gene annotations, reflecting the evolution of, and growing consensus on, gene nomenclature. While progressive changes in annotations and timely release of updates make the resource reliable for the community, the changes introduced with each release pose unique challenges to its users. Taking cues from other experiments with bio-curation, we propose potential avenues and methods to close the gap.
Electronic supplementary material: The online version of this article (doi:10.1186/s40246-016-0090-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Saakshi Jalali
  - GN Ramachandran Knowledge Center for Genome Informatics, CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, Delhi, 110 025, India; Academy of Scientific and Innovative Research (AcSIR), CSIR-IGIB South Campus, Mathura Road, Delhi, 110025, India
- Shrey Gandhi
  - GN Ramachandran Knowledge Center for Genome Informatics, CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, Delhi, 110 025, India
- Vinod Scaria
  - GN Ramachandran Knowledge Center for Genome Informatics, CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB), Mathura Road, Delhi, 110 025, India; Academy of Scientific and Innovative Research (AcSIR), CSIR-IGIB South Campus, Mathura Road, Delhi, 110025, India
3
Shimada MK, Sanbonmatsu R, Yamaguchi-Kabata Y, Yamasaki C, Suzuki Y, Chakraborty R, Gojobori T, Imanishi T. Selection pressure on human STR loci and its relevance in repeat expansion disease. Mol Genet Genomics 2016; 291:1851-69. [PMID: 27290643 DOI: 10.1007/s00438-016-1219-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 11/07/2015] [Accepted: 05/21/2016] [Indexed: 12/30/2022]
Abstract
Short Tandem Repeats (STRs) comprise repeats of one to several base pairs. Because of the high mutability due to strand slippage during DNA synthesis, rapid evolutionary change in the number of repeating units directly shapes the range of repeat-number variation according to selection pressure. However, the remaining questions include: Why are STRs causing repeat expansion diseases maintained in the human population; and why are these limited to neurodegenerative diseases? By evaluating the genome-wide selection pressure on STRs using the database we constructed, we identified two different patterns of relationship in repeat-number polymorphisms between DNA and amino-acid sequences, although both patterns are evolutionary consequences of avoiding the formation of harmful long STRs. First, a mixture of degenerate codons is represented in poly-proline (poly-P) repeats. Second, long poly-glutamine (poly-Q) repeats are favored at the protein level; however, at the DNA level, STRs encoding long poly-Qs are frequently divided by synonymous SNPs. Furthermore, significant enrichments of apoptosis and neurodevelopment were biological processes found specifically in genes encoding poly-Qs with repeat polymorphism. This suggests the existence of a specific molecular function for polymorphic and/or long poly-Q stretches. Given that the poly-Qs causing expansion diseases were longer than other poly-Qs, even in healthy subjects, our results indicate that the evolutionary benefits of long and/or polymorphic poly-Q stretches outweigh the risks of long CAG repeats predisposing to pathological hyper-expansions. Molecular pathways in neurodevelopment requiring long and polymorphic poly-Q stretches may provide a clue to understanding why poly-Q expansion diseases are limited to neurodegenerative diseases.
Affiliation(s)
- Makoto K Shimada
  - Institute for Comprehensive Medical Science, Fujita Health University, 1-98 Dengakugakubo, Kutsukake-cho, Toyoake, Aichi, 470-1192, Japan; National Institute of Advanced Industrial Science and Technology, 2-3-26 Aomi Koto-ku, Tokyo, 135-0064, Japan; Japan Biological Informatics Consortium, 10F TIME24 Building, 2-4-32 Aomi, Koto-ku, Tokyo, 135-8073, Japan
- Ryoko Sanbonmatsu
  - Japan Biological Informatics Consortium, 10F TIME24 Building, 2-4-32 Aomi, Koto-ku, Tokyo, 135-8073, Japan
- Yumi Yamaguchi-Kabata
  - National Institute of Advanced Industrial Science and Technology, 2-3-26 Aomi Koto-ku, Tokyo, 135-0064, Japan; Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, 980-8573, Japan
- Chisato Yamasaki
  - National Institute of Advanced Industrial Science and Technology, 2-3-26 Aomi Koto-ku, Tokyo, 135-0064, Japan; Japan Biological Informatics Consortium, 10F TIME24 Building, 2-4-32 Aomi, Koto-ku, Tokyo, 135-8073, Japan
- Yoshiyuki Suzuki
  - Graduate School of Natural Sciences, Nagoya City University, 1 Yamanohata, Mizuho-cho, Mizuho-ku, Nagoya, Aichi, 467-8501, Japan
- Ranajit Chakraborty
  - Health Science Center, University of North Texas, 3500 Camp Bowie Blvd., Fort Worth, TX, 76107, USA
- Takashi Gojobori
  - National Institute of Advanced Industrial Science and Technology, 2-3-26 Aomi Koto-ku, Tokyo, 135-0064, Japan; Computational Bioscience Research Center, King Abdullah University of Science and Technology, Ibn Al-Haytham Building (West), Thuwal, 23955-6900, Kingdom of Saudi Arabia
- Tadashi Imanishi
  - National Institute of Advanced Industrial Science and Technology, 2-3-26 Aomi Koto-ku, Tokyo, 135-0064, Japan; Department of Molecular Life Science, Tokai University School of Medicine, 143 Shimokasuya, Isehara, Kanagawa, 259-1193, Japan
4
Martins Conde PDR, Sauter T, Pfau T. Constraint Based Modeling Going Multicellular. Front Mol Biosci 2016; 3:3. [PMID: 26904548 PMCID: PMC4748834 DOI: 10.3389/fmolb.2016.00003] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 10/21/2015] [Accepted: 01/25/2016] [Indexed: 12/31/2022] [Open access]
Abstract
Constraint based modeling has seen applications in many microorganisms. For example, there are now established methods to determine potential genetic modifications and external interventions to increase the efficiency of microbial strains in chemical production pipelines. In addition, multiple models of multicellular organisms have been created including plants and humans. While initially the focus here was on modeling individual cell types of the multicellular organism, this focus recently started to switch. Models of microbial communities, as well as multi-tissue models of higher organisms have been constructed. These models thereby can include different parts of a plant, like root, stem, or different tissue types in the same organ. Such models can elucidate details of the interplay between symbiotic organisms, as well as the concerted efforts of multiple tissues and can be applied to analyse the effects of drugs or mutations on a more systemic level. In this review we give an overview of the recent development of multi-tissue models using constraint based techniques and the methods employed when investigating these models. We further highlight advances in combining constraint based models with dynamic and regulatory information and give an overview of these types of hybrid or multi-level approaches.
Affiliation(s)
- Patricia do Rosario Martins Conde
  - Systems Biology Group, Life Sciences Research Unit, Faculty of Sciences, Technology and Communications, University of Luxembourg, Luxembourg, Luxembourg
- Thomas Sauter
  - Systems Biology Group, Life Sciences Research Unit, Faculty of Sciences, Technology and Communications, University of Luxembourg, Luxembourg, Luxembourg
- Thomas Pfau
  - Systems Biology Group, Life Sciences Research Unit, Faculty of Sciences, Technology and Communications, University of Luxembourg, Luxembourg, Luxembourg; Department of Physics, Institute of Complex Systems and Mathematical Biology, University of Aberdeen, Aberdeen, UK
5
Systematic transcriptome analysis reveals tumor-specific isoforms for ovarian cancer diagnosis and therapy. Proc Natl Acad Sci U S A 2015; 112:E3050-7. [PMID: 26015570 PMCID: PMC4466751 DOI: 10.1073/pnas.1508057112] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.0] [Indexed: 02/05/2023] [Open access]
Abstract
Tumor-specific molecules are needed across diverse areas of oncology for use in early detection, diagnosis, prognosis and therapy. Large and growing public databases of transcriptome sequencing data (RNA-seq) derived from tumors and normal tissues hold the potential of yielding tumor-specific molecules, but because the data are new they have not been fully explored for this purpose. We have developed custom bioinformatic algorithms and used them with 296 high-grade serous ovarian (HGS-OvCa) tumor and 1,839 normal RNA-seq datasets to identify mRNA isoforms with tumor-specific expression. We rank prioritized isoforms by likelihood of being expressed in HGS-OvCa tumors and not in normal tissues and analyzed 671 top-ranked isoforms by high-throughput RT-qPCR. Six of these isoforms were expressed in a majority of the 12 tumors examined but not in 18 normal tissues. An additional 11 were expressed in most tumors and only one normal tissue, which in most cases was fallopian or colon. Of the 671 isoforms, the topmost 5% (n = 33) ranked based on having tumor-specific or highly restricted normal tissue expression by RT-qPCR analysis are enriched for oncogenic, stem cell/cancer stem cell, and early development loci--including ETV4, FOXM1, LSR, CD9, RAB11FIP4, and FGFRL1. Many of the 33 isoforms are predicted to encode proteins with unique amino acid sequences, which would allow them to be specifically targeted for one or more therapeutic strategies--including monoclonal antibodies and T-cell-based vaccines. The systematic process described herein is readily and rapidly applicable to the more than 30 additional tumor types for which sufficient amounts of RNA-seq already exist.
6
Identification and Validation of Evolutionarily Conserved Unusually Short Pre-mRNA Introns in the Human Genome. Int J Mol Sci 2015; 16:10376-88. [PMID: 25961948 PMCID: PMC4463651 DOI: 10.3390/ijms160510376] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 03/23/2015] [Revised: 04/13/2015] [Accepted: 04/13/2015] [Indexed: 11/29/2022] [Open access]
Abstract
According to the length distribution of human introns, there is a large population of short introns with a threshold of 65 nucleotides (nt) and a peak at 85 nt. Using human genome and transcriptome databases, we investigated the introns shorter than 66 nt, termed ultra-short introns, the identities of which are scarcely known. Here, we provide for the first time a list of bona fide human ultra-short introns, which have never been characterized elsewhere. By conducting BLAST searches of the databases, we screened 22 introns (37–65 nt) with conserved lengths and sequences among closely related species. We then provide experimental and bioinformatic evidence for the splicing of 15 introns, of which 12 introns were remarkably G-rich and 9 introns contained completely inefficient splice sites and/or branch sites. These unorthodox characteristics of ultra-short introns suggest that there are unknown splicing mechanisms that differ from the well-established mechanism.
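The length criterion in this abstract (introns shorter than 66 nt, i.e. at most 65 nt) can be illustrated with a minimal sketch. The record layout (`id`, `start`, `end` with 1-based inclusive coordinates) and the function name `ultra_short_introns` are assumptions made for illustration; the abstract does not describe the authors' actual data format or screening code.

```python
def intron_length(intron):
    """Length in nucleotides, assuming 1-based inclusive coordinates (an assumption)."""
    return intron["end"] - intron["start"] + 1


def ultra_short_introns(introns, max_len=65):
    """Keep introns no longer than max_len nt (65 nt is the threshold stated in the abstract)."""
    return [i for i in introns if intron_length(i) <= max_len]


# Hypothetical records, not real annotations.
example = [
    {"id": "intron_a", "start": 100, "end": 136},  # 37 nt
    {"id": "intron_b", "start": 1, "end": 65},     # 65 nt
    {"id": "intron_c", "start": 1, "end": 66},     # 66 nt, excluded
]
kept = ultra_short_introns(example)
```

A length filter of this kind would only be a first step; the abstract's conservation screening and splicing validation are separate analyses not sketched here.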
7
Zhang L, Hamad EA, Vausort M, Funakoshi H, Feldman AM, Wagner DR, Devaux Y. Identification of candidate long noncoding RNAs associated with left ventricular hypertrophy. Clin Transl Sci 2015; 8:100-6. [PMID: 25382655 PMCID: PMC5350985 DOI: 10.1111/cts.12234] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Indexed: 12/15/2022] [Open access]
Abstract
PURPOSE: Long noncoding RNAs (lncRNAs) constitute an emerging group of noncoding RNAs that regulate gene expression. Their role in cardiac disease is poorly understood. Here, we investigated the association between lncRNAs and left ventricular hypertrophy. METHODS: Wild-type and adenosine A2A receptor overexpressing mice (A2A-Tg) were subjected to transverse aortic constriction (TAC), and expression of lncRNAs in the heart was investigated using genome-wide microarrays and an analytical pipeline specifically developed for lncRNAs. RESULTS: Microarray analysis identified two lncRNAs up-regulated and three down-regulated in the hearts of A2A-Tg mice subjected to TAC. Quantitative PCR showed that lncRNAs 2900055J20Rik and Gm14005 were decreased in A2A-Tg mice (3.5- and 1.8-fold, p < 0.01). We found from a public microarray dataset that 2900055J20Rik and Gm14005 were increased in TAC mice compared to sham-operated animals (1.8- and 1.4-fold, after 28 days, p < 0.01). Interestingly, in this public dataset, the cardioprotective drug JQ1 decreased 2900055J20Rik and Gm14005 expression by 2.2- and 1.6-fold (p < 0.01). CONCLUSIONS: First, we have shown that data on lncRNAs can be obtained from gene expression microarrays. Second, expression of lncRNAs 2900055J20Rik and Gm14005 is regulated after TAC and can be modulated by cardioprotective molecules. These observations motivate further investigation of the therapeutic value of lncRNAs in the heart.
Affiliation(s)
- Lu Zhang
  - Laboratory of Cardiovascular Research, Public Research Center – Health (CRP-Santé), Luxembourg
- Eman A. Hamad
  - Department of Physiology, Cardiovascular Research Center, Temple University School of Medicine, Philadelphia, Pennsylvania, USA
- Mélanie Vausort
  - Laboratory of Cardiovascular Research, Public Research Center – Health (CRP-Santé), Luxembourg
- Arthur M. Feldman
  - Department of Physiology, Cardiovascular Research Center, Temple University School of Medicine, Philadelphia, Pennsylvania, USA
- Daniel R. Wagner
  - Laboratory of Cardiovascular Research, Public Research Center – Health (CRP-Santé), Luxembourg
  - Division of Cardiology, Hospital Center, Luxembourg
- Yvan Devaux
  - Laboratory of Cardiovascular Research, Public Research Center – Health (CRP-Santé), Luxembourg
8
LncRBase: an enriched resource for lncRNA information. PLoS One 2014; 9:e108010. [PMID: 25233092 PMCID: PMC4169474 DOI: 10.1371/journal.pone.0108010] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 03/28/2014] [Accepted: 08/11/2014] [Indexed: 11/19/2022] [Open access]
Abstract
Long noncoding RNAs (lncRNAs) are noncoding transcripts longer than 200 nucleotides, which show evidence of pervasive transcription and participate in a plethora of cellular regulatory processes. Although several noncoding transcripts have been functionally annotated as lncRNAs within the genome, not all have been proven to fulfill the criteria for a functional regulator and further analyses have to be done in order to include them in a functional cohort. LncRNAs are being classified and reclassified in an ongoing annotation process, and the challenge is fraught with ambiguity, as newer evidences of their biogenesis and functional implication come into light. In our effort to understand the complexity of this still enigmatic biomolecule, we have developed a new database entitled "LncRBase" where we have classified and characterized lncRNAs in human and mouse. It is an extensive resource of human and mouse lncRNA transcripts belonging to fourteen distinct subtypes, with a total of 83,201 entries for mouse and 133,361 entries for human: among these, we have newly annotated 8,507 mouse and 14,813 human non coding RNA transcripts (from UCSC and H-InvDB 8.0) as lncRNAs. We have especially considered protein coding gene loci which act as hosts for non coding transcripts. LncRBase includes different lncRNA transcript variants of protein coding genes within LncRBase. LncRBase provides information about the genomic context of different lncRNA subtypes, their interaction with small non coding RNAs (ncRNAs) viz. piwi interacting RNAs (piRNAs) and microRNAs (miRNAs) and their mode of regulation, via association with diverse other genomic elements. Adequate knowledge about genomic origin and molecular features of lncRNAs is essential to understand their functional and behavioral complexities. Overall, LncRBase provides a thorough study on various aspects of lncRNA origin and function and a user-friendly interface to search for lncRNA information. 
LncRBase is available at http://bicresources.jcbose.ac.in/zhumur/lncrbase.
9
Chen YA, Murakami Y, Ahmad S, Yoshimaru T, Katagiri T, Mizuguchi K. Brefeldin A-inhibited guanine nucleotide-exchange protein 3 (BIG3) is predicted to interact with its partner through an ARM-type α-helical structure. BMC Res Notes 2014; 7:435. [PMID: 24997568 PMCID: PMC4096751 DOI: 10.1186/1756-0500-7-435] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Received: 04/23/2014] [Accepted: 06/30/2014] [Indexed: 12/21/2022] [Open access]
Abstract
Background: Brefeldin A-inhibited guanine nucleotide-exchange protein 3 (BIG3) has been identified recently as a novel regulator of estrogen signalling in breast cancer cells. Despite being a potential target for new breast cancer treatment, its amino acid sequence suggests no association with any well-characterized protein family and provides few clues as to its molecular function. In this paper, we predicted the structure, function and interactions of BIG3 using a range of bioinformatic tools. Results: Homology search results showed that BIG3 had distinct features from its paralogues, BIG1 and BIG2, with a unique region between the two shared domains, Sec7 and DUF1981. Although BIG3 contains a Sec7 domain, the lack of the conserved motif and the critical glutamate residue suggested no potential guaninyl-exchange factor (GEF) activity. Fold recognition tools predicted BIG3 to adopt an α-helical repeat structure similar to that of the armadillo (ARM) family. Using state-of-the-art methods, we predicted interaction sites between BIG3 and its partner PHB2. Conclusions: The combined results of the structure and interaction prediction led to a novel hypothesis that one of the predicted helices of BIG3 might play an important role in binding to PHB2 and thereby preventing its translocation to the nucleus. This hypothesis has been subsequently verified experimentally.
Affiliation(s)
- Kenji Mizuguchi
  - National Institute of Biomedical Innovation, 7-6-8 Saito-asagi, Ibaraki city, Osaka 567-0085, Japan
10
Abstract
BACKGROUND: Genome wide association studies (GWAS) have revealed a large number of links between genome variation and complex disease. Among other benefits, it is expected that these insights will lead to new therapeutic strategies, particularly the identification of new drug targets. In this paper, we evaluate the power of GWAS studies to find drug targets by examining how many existing drug targets have been directly 'rediscovered' by this technique, and the extent to which GWAS results may be leveraged by network information to discover known and new drug targets. RESULTS: We find that only a very small fraction of drug targets are directly detected in the relevant GWAS studies. We investigate two possible explanations for this observation. First, we find evidence of negative selection acting on drug target genes as a consequence of strong coupling with the disease phenotype, so reducing the incidence of SNPs linked to the disease. Second, we find that GWAS genes are substantially longer on average than drug targets and than all genes, suggesting there is a length related bias in GWAS results. In spite of the low direct relationship between drug targets and GWAS reported genes, we found these two sets of genes are closely coupled in the human protein network. As a consequence, machine-learning methods are able to recover known drug targets based on network context and the set of GWAS reported genes for the same disease. We show the approach is potentially useful for identifying drug repurposing opportunities. CONCLUSIONS: Although GWA studies do not directly identify most existing drug targets, there are several reasons to expect that new targets will nevertheless be discovered using these data. Initial results on drug repurposing studies using network analysis are encouraging and suggest directions for future development.
11
Wu PY, Phan JH, Wang MD. Assessing the impact of human genome annotation choice on RNA-seq expression estimates. BMC Bioinformatics 2013; 14 Suppl 11:S8. [PMID: 24564364 PMCID: PMC3816316 DOI: 10.1186/1471-2105-14-s11-s8] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Indexed: 11/20/2022] [Open access]
Abstract
Background: Genome annotation is a crucial component of RNA-seq data analysis. Much effort has been devoted to producing an accurate and rational annotation of the human genome. An annotated genome provides a comprehensive catalogue of genomic functional elements. Currently, at least six human genome annotations are publicly available, including AceView Genes, Ensembl Genes, H-InvDB Genes, RefSeq Genes, UCSC Known Genes, and Vega Genes. Characteristics of these annotations differ because of variations in annotation strategies and information sources. When performing RNA-seq data analysis, researchers need to choose a genome annotation. However, the effect of genome annotation choice on downstream RNA-seq expression estimates is still unclear. This study (1) investigates the effect of different genome annotations on RNA-seq quantification and (2) provides guidelines for choosing a genome annotation based on research focus. Results: We define the complexity of human genome annotations in terms of the number of genes, isoforms, and exons. This definition facilitates an investigation of potential relationships between complexity and variations in RNA-seq quantification. We apply several evaluation metrics to demonstrate the impact of genome annotation choice on RNA-seq expression estimates. In the mapping stage, the least complex genome annotation, RefSeq Genes, appears to have the highest percentage of uniquely mapped short sequence reads. In the quantification stage, RefSeq Genes results in the most stable expression estimates in terms of the average coefficient of variation over all genes. Stable expression estimates in the quantification stage translate to accurate statistics for detecting differentially expressed genes. We observe that RefSeq Genes produces the most accurate fold-change measures with respect to a ground truth of RT-qPCR gene expression estimates.
Conclusions: Based on the observed variations in the mapping, quantification, and differential expression calling stages, we demonstrate that the selection of human genome annotation results in different gene expression estimates. When conducting research that emphasizes reproducible and robust gene expression estimates, a less complex genome annotation may be preferred. However, simpler genome annotations may limit opportunities for identifying or characterizing novel transcriptional or regulatory mechanisms. When conducting research that aims to be more exploratory, a more complex genome annotation may be preferred.
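The stability metric named in this abstract, the average coefficient of variation (CV) over all genes, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the input layout (a mapping from gene name to replicate expression values), the function names, and the choice to skip genes with zero mean are all assumptions.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean; None when the mean is zero."""
    m = mean(values)
    if m == 0:
        return None  # CV is undefined for a zero mean (handling chosen for this sketch)
    return stdev(values) / m


def average_cv(expression_by_gene):
    """Average the per-gene CV over all genes where it is defined."""
    cvs = [coefficient_of_variation(v) for v in expression_by_gene.values()]
    defined = [c for c in cvs if c is not None]
    return mean(defined)


# Hypothetical replicate expression estimates under one annotation.
estimates = {
    "geneA": [10.0, 10.0, 10.0],  # perfectly stable, CV = 0
    "geneB": [1.0, 2.0, 3.0],     # CV = 1.0 / 2.0 = 0.5
}
score = average_cv(estimates)
```

Under this reading, a lower average CV for one annotation than for another would correspond to what the abstract calls more stable expression estimates.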
12
Mochida K, Uehara-Yamaguchi Y, Takahashi F, Yoshida T, Sakurai T, Shinozaki K. Large-scale collection and analysis of full-length cDNAs from Brachypodium distachyon and integration with Pooideae sequence resources. PLoS One 2013; 8:e75265. [PMID: 24130698 PMCID: PMC3793998 DOI: 10.1371/journal.pone.0075265] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Received: 05/31/2013] [Accepted: 08/14/2013] [Indexed: 01/09/2023] [Open access]
Abstract
A comprehensive collection of full-length cDNAs is essential for correct structural gene annotation and functional analyses of genes. We constructed a mixed full-length cDNA library from 21 different tissues of Brachypodium distachyon Bd21, and obtained 78,163 high quality expressed sequence tags (ESTs) from both ends of ca. 40,000 clones (including 16,079 contigs). We updated gene structure annotations of Brachypodium genes based on full-length cDNA sequences in comparison with the latest publicly available annotations. About 10,000 non-redundant gene models were supported by full-length cDNAs; ca. 6,000 showed some transcription unit modifications. We also found ca. 580 novel gene models, including 362 newly identified in Bd21. Using the updated transcription start sites, we searched a total of 580 plant cis-motifs in the −3 kb promoter regions and determined a genome-wide Brachypodium promoter architecture. Furthermore, we integrated the Brachypodium full-length cDNAs and updated gene structures with available sequence resources in wheat and barley in a web-accessible database, the RIKEN Brachypodium FL cDNA database. The database represents a “one-stop” information resource for all genomic information in the Pooideae, facilitating functional analysis of genes in this model grass plant and seamless knowledge transfer to the Triticeae crops.
Affiliation(s)
- Keiichi Mochida
- Biomass Research Platform Team, Biomass Engineering Program Cooperation Division, RIKEN Center for Sustainable Resource Science, Tsurumi-ku, Yokohama, Kanagawa, Japan
- Kihara Institute for Biological Research, Yokohama City University, Totsuka-ku, Yokohama, Kanagawa, Japan
- Yukiko Uehara-Yamaguchi
- Biomass Research Platform Team, Biomass Engineering Program Cooperation Division, RIKEN Center for Sustainable Resource Science, Tsurumi-ku, Yokohama, Kanagawa, Japan
- Fuminori Takahashi
- Biomass Research Platform Team, Biomass Engineering Program Cooperation Division, RIKEN Center for Sustainable Resource Science, Tsurumi-ku, Yokohama, Kanagawa, Japan
- Takuhiro Yoshida
- Integrated Genome Informatics Research Unit, RIKEN Center for Sustainable Resource Science, Tsurumi-ku, Yokohama, Kanagawa, Japan
- Tetsuya Sakurai
- Integrated Genome Informatics Research Unit, RIKEN Center for Sustainable Resource Science, Tsurumi-ku, Yokohama, Kanagawa, Japan
- Kazuo Shinozaki
- Biomass Research Platform Team, Biomass Engineering Program Cooperation Division, RIKEN Center for Sustainable Resource Science, Tsurumi-ku, Yokohama, Kanagawa, Japan
|
13
|
Kikugawa S, Nishikata K, Murakami K, Sato Y, Suzuki M, Altaf-Ul-Amin M, Kanaya S, Imanishi T. PCDq: human protein complex database with quality index which summarizes different levels of evidences of protein complexes predicted from h-invitational protein-protein interactions integrative dataset. BMC SYSTEMS BIOLOGY 2012; 6 Suppl 2:S7. [PMID: 23282181 PMCID: PMC3521179 DOI: 10.1186/1752-0509-6-s2-s7] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.9] [Indexed: 12/28/2022]
Abstract
Background Proteins interact with other proteins or biomolecules in complexes to perform cellular functions. Existing protein-protein interaction (PPI) databases and protein complex databases for human proteins are not organized to provide protein complex information or facilitate the discovery of novel subunits. Data integration of PPIs focused specifically on protein complexes, subunits, and their functions is therefore needed. Predicted candidate complexes or subunits are also important for experimental biologists. Description Based on integrated PPI data and literature, we have developed a human protein complex database with a complex quality index (PCDq), which includes both known and predicted complexes and subunits. We integrated six PPI datasets (BIND, DIP, MINT, HPRD, IntAct, and GNP_Y2H), and predicted human protein complexes by finding densely connected regions in the PPI networks. They were curated with the literature so that missing proteins were complemented and some complexes were merged, resulting in 1,264 complexes comprising 9,268 proteins with 32,198 PPIs. The evidence level of each subunit was assigned as a categorical variable. This indicated whether it was a known subunit, and a specific function was inferable from sequence or network analysis. To summarize the categories of all the subunits in a complex, we devised a complex quality index (CQI) and assigned it to each complex. We examined the proportion of consistency of Gene Ontology (GO) terms among protein subunits of a complex. Next, we compared the expression profiles of the corresponding genes and found that many proteins in larger complexes tend to be expressed cooperatively at the transcript level. The proportion of duplicated genes in a complex was evaluated. Finally, we identified 78 hypothetical proteins that were annotated as subunits of 82 complexes, which included known complexes. Of these hypothetical proteins, after our prediction had been made, four were reported to be actual subunits of the assigned protein complexes. Conclusions We constructed a new protein complex database PCDq including both predicted and curated human protein complexes. CQI is a useful source of experimentally confirmed information about protein complexes and subunits. The predicted protein complexes can provide functional clues about hypothetical proteins. PCDq is freely available at http://h-invitational.jp/hinv/pcdq/.
Affiliation(s)
- Shingo Kikugawa
- Integrated Databases and Systems Biology Team, Biological Information Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
|
14
|
Abstract
The relationship between sequence polymorphisms and human disease has been studied mostly in terms of effects of single nucleotide polymorphisms (SNPs) leading to single amino acid substitutions that change protein structure and function. However, less attention has been paid to more drastic sequence polymorphisms which cause premature termination of a protein’s sequence or large changes, insertions, or deletions in the sequence. We have analyzed a large set (n = 512) of insertions and deletions (indels) and single nucleotide polymorphisms causing premature termination of translation in disease-related genes. Prediction of protein-destabilization effects was performed by graphical presentation of the locations of polymorphisms in the protein structure, using the Genomes TO Protein (GTOP) database, and manual annotation with a set of specific criteria. Protein destabilization was predicted for 44.4% of the nonsense SNPs, 32.4% of the frameshifting indels, and 9.1% of the non-frameshifting indels. A prediction of nonsense-mediated decay allowed us to infer which truncated proteins would actually be translated as defective proteins. These cases included proteins linked to dominantly inherited diseases, suggesting a relation between these diseases and toxic aggregation. Our approach would be useful in identifying potentially aggregation-inducing polymorphisms that may have pathological effects.
|
15
|
Wu PY, Phan JH, Wang MD. The Effect of Human Genome Annotation Complexity on RNA-Seq Gene Expression Quantification. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE WORKSHOPS. IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE 2012; 2012:712-717. [PMID: 27532059 DOI: 10.1109/bibmw.2012.6470224] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/06/2022]
Abstract
Next-generation sequencing (NGS) has brought human genomic research to an unprecedented era. RNA-Seq is a branch of NGS that can be used to quantify gene expression and depends on accurate annotation of the human genome (i.e., the definition of genes and all of their variants or isoforms). Multiple annotations of the human genome exist with varying complexity. However, it is not clear how the choice of genome annotation influences RNA-Seq gene expression quantification. We assess the effect of different genome annotations in terms of (1) mapping quality, (2) quantification variation, (3) quantification accuracy (i.e., by comparing to qRT-PCR data), and (4) the concordance of detecting differentially expressed genes. External validation with qRT-PCR suggests that more complex genome annotations result in higher quantification variation.
Affiliation(s)
- Po-Yen Wu
- Department of Electrical and Computer Engineering, Georgia Tech, Atlanta, GA, U.S.A.
- John H Phan
- The Wallace H. Coulter Department of Biomedical Engineering, Georgia Tech and Emory University, Atlanta, GA, U.S.A.
- May D Wang
- The Wallace H. Coulter Department of Biomedical Engineering, Georgia Tech and Emory University, Atlanta, GA, U.S.A.
|
16
|
Comparative genome analysis of three eukaryotic parasites with differing abilities to transform leukocytes reveals key mediators of Theileria-induced leukocyte transformation. mBio 2012; 3:e00204-12. [PMID: 22951932 PMCID: PMC3445966 DOI: 10.1128/mbio.00204-12] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.7] [Indexed: 11/20/2022]
Abstract
We sequenced the genome of Theileria orientalis, a tick-borne apicomplexan protozoan parasite of cattle. The focus of this study was a comparative genome analysis of T. orientalis relative to other highly pathogenic Theileria species, T. parva and T. annulata. T. parva and T. annulata induce transformation of infected cells of lymphocyte or macrophage/monocyte lineages; in contrast, T. orientalis does not induce uncontrolled proliferation of infected leukocytes and multiplies predominantly within infected erythrocytes. While synteny across homologous chromosomes of the three Theileria species was found to be well conserved overall, subtelomeric structures were found to differ substantially, as T. orientalis lacks the large tandemly arrayed subtelomere-encoded variable secreted protein-encoding gene family. Moreover, expansion of particular gene families by gene duplication was found in the genomes of the two transforming Theileria species, most notably, the TashAT/TpHN and Tar/Tpr gene families. Gene families that are present only in T. parva and T. annulata and not in T. orientalis, Babesia bovis, or Plasmodium were also identified. Identification of differences between the genome sequences of Theileria species with different abilities to transform and immortalize bovine leukocytes will provide insight into proteins and mechanisms that have evolved to induce and regulate this process. The T. orientalis genome database is available at http://totdb.czc.hokudai.ac.jp/.
|
17
|
Griss J, Martín M, O'Donovan C, Apweiler R, Hermjakob H, Vizcaíno JA. Consequences of the discontinuation of the International Protein Index (IPI) database and its substitution by the UniProtKB "complete proteome" sets. Proteomics 2011; 11:4434-8. [PMID: 21932440 PMCID: PMC3556690 DOI: 10.1002/pmic.201100363] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 07/04/2011] [Revised: 08/15/2011] [Accepted: 08/22/2011] [Indexed: 11/09/2022]
Abstract
The International Protein Index (IPI) database has been one of the most widely used protein databases in MS proteomics approaches. Recently, the closure of IPI in September 2011 was announced. Its recommended replacement is the new UniProt Knowledgebase (UniProtKB) "complete proteome" sets, launched in May 2011. Here, we analyze the consequences of IPI's discontinuation for human and mouse data, and the effect of its substitution with UniProtKB on two levels: (i) data already produced and (ii) newly performed experiments. To estimate the effect on existing data, we investigated how well IPI identifiers map to UniProtKB accessions. We found that 21% of human and 10% of mouse identifiers do not map to UniProtKB and would thus be "lost." To investigate the impact on new experiments, we compared the theoretical search space (i.e. the tryptic peptides) of both resources and found that it is decreased by 14.0% for human and 8.9% for mouse data through IPI's closure. An analysis on the experimental evidence for these "lost" peptides showed that the vast majority has not been identified in experiments available in the major proteomics repositories. It thus seems likely that the search space provided by UniProtKB is of higher quality than the one currently provided by IPI.
Affiliation(s)
- Johannes Griss
- EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
|
18
|
Overton IM, Barton GJ. Computational approaches to selecting and optimising targets for structural biology. Methods 2011; 55:3-11. [PMID: 21906678 PMCID: PMC3202631 DOI: 10.1016/j.ymeth.2011.08.014] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 06/27/2011] [Revised: 08/18/2011] [Accepted: 08/22/2011] [Indexed: 11/29/2022]
Abstract
Selection of protein targets for study is central to structural biology and may be influenced by numerous factors. A key aim is to maximise returns for effort invested by identifying proteins with the balance of biophysical properties that are conducive to success at all stages (e.g. solubility, crystallisation) in the route towards a high resolution structural model. Selected targets can be optimised through construct design (e.g. to minimise protein disorder), switching to a homologous protein, and selection of experimental methodology (e.g. choice of expression system) to prime for efficient progress through the structural proteomics pipeline. Here we discuss computational techniques in target selection and optimisation, with more detailed focus on tools developed within the Scottish Structural Proteomics Facility (SSPF); namely XANNpred, ParCrys, OB-Score (target selection) and TarO (target optimisation). TarO runs a large number of algorithms, searching for homologues and annotating the pool of possible alternative targets. This pool of putative homologues is presented in a ranked, tabulated format and results are also visualised as an automatically generated and annotated multiple sequence alignment. The target selection algorithms each predict the propensity of a selected protein target to progress through the experimental stages leading to diffracting crystals. This single predictor approach has advantages for target selection, when compared with an approach using two or more predictors that each predict for success at a single experimental stage. The tools described here helped SSPF achieve a high (21%) success rate in progressing cloned targets to diffraction-quality crystals.
Affiliation(s)
- Ian M Overton
- MRC Human Genetics Unit, Institute of Genetics and Molecular Medicine, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, United Kingdom.
|
19
|
Griss J, Côté RG, Gerner C, Hermjakob H, Vizcaíno JA. Published and perished? The influence of the searched protein database on the long-term storage of proteomics data. Mol Cell Proteomics 2011; 10:M111.008490. [PMID: 21700957 PMCID: PMC3186200 DOI: 10.1074/mcp.m111.008490] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Indexed: 11/06/2022]
Abstract
In proteomics, protein identifications are reported and stored using an unstable reference system: protein identifiers. These proprietary identifiers are created individually by every protein database and can change or may even be deleted over time. To estimate the effect of the searched protein sequence database on the long-term storage of proteomics data we analyzed the changes of reported protein identifiers from all public experiments in the Proteomics Identifications (PRIDE) database by November 2010. To map the submitted protein identifier to a currently active entry, two distinct approaches were used. The first approach used the Protein Identifier Cross Referencing (PICR) service at the EBI, which maps protein identifiers based on 100% sequence identity. The second one (called logical mapping algorithm) accessed the source databases and retrieved the current status of the reported identifier. Our analysis showed the differences between the main protein databases (International Protein Index (IPI), UniProt Knowledgebase (UniProtKB), National Center for Biotechnological Information nr database (NCBI nr), and Ensembl) in respect to identifier stability. For example, whereas 20% of submitted IPI entries were deleted after two years, virtually all UniProtKB entries remained either active or replaced. Furthermore, the two mapping algorithms produced markedly different results. For example, the PICR service reported 10% more IPI entries deleted compared with the logical mapping algorithm. We found several cases where experiments contained more than 10% deleted identifiers already at the time of publication. We also assessed the proportion of peptide identifications in these data sets that still fitted the originally identified protein sequences. Finally, we performed the same overall analysis on all records from IPI, Ensembl, and UniProtKB: two releases per year were used, from 2005. 
This analysis showed for the first time the true effect of changing protein identifiers on proteomics data. Based on these findings, UniProtKB seems the best database for applications that rely on the long-term storage of proteomics data.
Affiliation(s)
- Johannes Griss
- EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
|
20
|
Chen YA, Tripathi LP, Mizuguchi K. TargetMine, an integrated data warehouse for candidate gene prioritisation and target discovery. PLoS One 2011; 6:e17844. [PMID: 21408081 PMCID: PMC3050930 DOI: 10.1371/journal.pone.0017844] [Citation(s) in RCA: 86] [Impact Index Per Article: 6.6] [Received: 12/03/2010] [Accepted: 02/14/2011] [Indexed: 11/19/2022]
Abstract
Prioritising candidate genes for further experimental characterisation is a non-trivial challenge in drug discovery and biomedical research in general. An integrated approach that combines results from multiple data types is best suited for optimal target selection. We developed TargetMine, a data warehouse for efficient target prioritisation. TargetMine utilises the InterMine framework, with new data models such as protein-DNA interactions integrated in a novel way. It enables complicated searches that are difficult to perform with existing tools and it also offers integration of custom annotations and in-house experimental data. We proposed an objective protocol for target prioritisation using TargetMine and set up a benchmarking procedure to evaluate its performance. The results show that the protocol can identify known disease-associated genes with high precision and coverage. A demonstration version of TargetMine is available at http://targetmine.nibio.go.jp/.
Affiliation(s)
- Yi-An Chen
- National Institute of Biomedical Innovation, Saito-Asagi, Ibaraki, Osaka, Japan
- Graduate School of Frontier Biosciences, Osaka University, Yamadaoka, Suita, Osaka, Japan
- Lokesh P. Tripathi
- National Institute of Biomedical Innovation, Saito-Asagi, Ibaraki, Osaka, Japan
- Kenji Mizuguchi
- National Institute of Biomedical Innovation, Saito-Asagi, Ibaraki, Osaka, Japan
- Graduate School of Frontier Biosciences, Osaka University, Yamadaoka, Suita, Osaka, Japan
|
21
|
Sandve GK, Gundersen S, Rydbeck H, Glad IK, Holden L, Holden M, Liestøl K, Clancy T, Ferkingstad E, Johansen M, Nygaard V, Tøstesen E, Frigessi A, Hovig E. The Genomic HyperBrowser: inferential genomics at the sequence level. Genome Biol 2010; 11:R121. [PMID: 21182759 PMCID: PMC3046481 DOI: 10.1186/gb-2010-11-12-r121] [Citation(s) in RCA: 72] [Impact Index Per Article: 5.1] [Received: 08/27/2010] [Revised: 12/08/2010] [Accepted: 12/23/2010] [Indexed: 11/16/2022]
Abstract
The immense increase in the generation of genomic scale data poses an unmet analytical challenge, due to a lack of established methodology with the required flexibility and power. We propose a first principled approach to statistical analysis of sequence-level genomic information. We provide a growing collection of generic biological investigations that query pairwise relations between tracks, represented as mathematical objects, along the genome. The Genomic HyperBrowser implements the approach and is available at http://hyperbrowser.uio.no.
Affiliation(s)
- Geir K Sandve
- Department of Informatics, University of Oslo, Blindern, 0316 Oslo, Norway.
|
22
|
Michelhaugh SK, Lipovich L, Blythe J, Jia H, Kapatos G, Bannon MJ. Mining Affymetrix microarray data for long non-coding RNAs: altered expression in the nucleus accumbens of heroin abusers. J Neurochem 2010; 116:459-66. [PMID: 21128942 DOI: 10.1111/j.1471-4159.2010.07126.x] [Citation(s) in RCA: 136] [Impact Index Per Article: 9.7] [Indexed: 12/21/2022]
Abstract
Although recent data suggest that some long non-coding RNAs (lncRNAs) exert widespread effects on gene expression and organelle formation, lncRNAs as a group constitute a sizable but poorly characterized fraction of the human transcriptome. We investigated whether some human lncRNA sequences were fortuitously represented on commonly used microarrays, then used this annotation to assess lncRNA expression in human brain. A computational and annotation pipeline was developed to identify lncRNA transcripts represented on Affymetrix U133 arrays. A previously published dataset derived from human nucleus accumbens was then examined for potential lncRNA expression. Twenty-three lncRNAs were determined to be represented on U133 arrays. Of these, dataset analysis revealed that five lncRNAs were consistently detected in samples of human nucleus accumbens. Strikingly, the abundance of these lncRNAs was up-regulated in human heroin abusers compared to matched drug-free control subjects, a finding confirmed by quantitative PCR. This study presents a paradigm for examining existing Affymetrix datasets for the detection and potential regulation of lncRNA expression, including changes associated with human disease. The finding that all detected lncRNAs were up-regulated in heroin abusers is consonant with the proposed role of lncRNAs as mediators of widespread changes in gene expression as occur in drug abuse.
Affiliation(s)
- Sharon K Michelhaugh
- Department of Pharmacology, Wayne State University School of Medicine, Detroit, Michigan, USA.
|
23
|
Bareke E, Pierre M, Gaigneaux A, De Meulder B, Depiereux S, Berger F, Habra N, Depiereux E. PathEx: a novel multi factors based datasets selector web tool. BMC Bioinformatics 2010; 11:528. [PMID: 20969778 PMCID: PMC2978222 DOI: 10.1186/1471-2105-11-528] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 06/14/2010] [Accepted: 10/22/2010] [Indexed: 11/27/2022]
Abstract
Background Microarray experiments have become very popular in life science research. However, if such experiments are only considered independently, the possibilities for analysis and interpretation of many life science phenomena are reduced. The accumulation of publicly available data provides biomedical researchers with a valuable opportunity to either discover new phenomena or improve the interpretation and validation of other phenomena that are partially understood or well known. This can only be achieved by intelligently exploiting this rich mine of information. Description Considering that technologies like microarrays remain prohibitively expensive for researchers with limited means to order their own experimental chips, it would be beneficial to re-use previously published microarray data. For certain researchers interested in finding gene groups (requiring many replicates), there is a great need for tools to help them select appropriate datasets for analysis. These tools can be effective if and only if they are able to re-use previously deposited experiments or to create new experiments not initially envisioned by the depositors. However, the generation of new experiments requires that all published microarray data be completely annotated, which is not currently the case. Thus, we propose the PathEx approach. Conclusion This paper presents PathEx, a human-focused web solution built around a two-component system: one database component, enriched with relevant biological information (expression array, omics data, literature) from different sources, and another component comprising sophisticated web interfaces that allow users to perform complex dataset building queries on the contents integrated into the PathEx database.
Affiliation(s)
- Eric Bareke
- Molecular Biology Research Unit, University of Namur - FUNDP, Namur, Belgium.
|
24
|
Prediction of carbohydrate-binding proteins from sequences using support vector machines. Adv Bioinformatics 2010. [PMID: 20936154 PMCID: PMC2948896 DOI: 10.1155/2010/289301] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 03/06/2010] [Revised: 05/20/2010] [Accepted: 07/19/2010] [Indexed: 11/17/2022]
Abstract
Carbohydrate-binding proteins are proteins that can interact with sugar chains but do not modify them. They are involved in many physiological functions, and we have developed a method for predicting them from their amino acid sequences. Our method is based on support vector machines (SVMs). We first clarified the definition of carbohydrate-binding proteins and then constructed positive and negative datasets with which the SVMs were trained. By applying the leave-one-out test to these datasets, our method achieved an area under the receiver operating characteristic (ROC) curve of 0.92. We also examined two amino acid grouping methods that enable effective learning of sequence patterns and evaluated the performance of these methods. When we applied our method in combination with the homology-based prediction method to the annotated human genome database, H-invDB, we found that the true positive rate of prediction was improved.
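As a rough illustration of what the reported figure measures, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way in plain Python. The function name `roc_auc` and the toy labels and scores are assumptions made for illustration; they are not taken from the paper, and no SVM is trained here.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 1 (positive) or 0 (negative).
    scores: classifier outputs, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical held-out scores, e.g. one score per left-out sequence.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

In a leave-one-out setting, each score would come from a model trained on all other sequences; the quadratic pairwise loop is adequate for small datasets and keeps the definition explicit.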
|
25
|
Jia H, Osak M, Bogu GK, Stanton LW, Johnson R, Lipovich L. Genome-wide computational identification and manual annotation of human long noncoding RNA genes. RNA (NEW YORK, N.Y.) 2010; 16:1478-1487. [PMID: 20587619 DOI: 10.1261/rna.1951310.4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/20/2023]
Abstract
Experimental evidence suggests that half or more of the mammalian transcriptome consists of noncoding RNA. Noncoding RNAs are divided into short noncoding RNAs (including microRNAs) and long noncoding RNAs (lncRNAs). We defined complementary DNAs (cDNAs) lacking any positive-strand open reading frames (ORFs) longer than 30 amino acids, as well as cDNAs lacking any evidence of interspecies conservation of their longer-than-30-amino acid ORFs, as noncoding. We have identified 5446 lncRNA genes in the human genome from approximately 24,000 full-length cDNAs, using our new ORF-prediction pipeline. We combined them nonredundantly with lncRNAs from four published sources to derive 6736 lncRNA genes. In an effort to distinguish standalone and antisense lncRNA genes from database artifacts, we stratified our catalog of lncRNAs according to the distance between each lncRNA gene candidate and its nearest known protein-coding gene. We concurrently examined the protein-coding capacity of known genes overlapping with lncRNAs. Remarkably, 62% of known genes with "hypothetical protein" names actually lacked protein-coding capacity. This study has greatly expanded the known human lncRNA catalog, increased its accuracy through manual annotation of cDNA-to-genome alignments, and revealed that a large set of hypothetical-protein genes in GenBank lacks protein-coding capacity. In addition, we have developed, independently of existing NCBI tools, command-line programs with high-throughput ORF-finding and BLASTP-parsing functionality, suitable for future automated assessments of protein-coding capacity of novel transcripts.
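The length criterion in this definition can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the requirement that an ORF run from an ATG to an in-frame stop codon, and the treatment of the 30-amino-acid cutoff as inclusive are all assumptions, and the interspecies-conservation criterion is not modeled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_codons(seq: str) -> int:
    """Length in codons (stop excluded) of the longest ATG-to-stop ORF
    found in any of the three forward reading frames of seq."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None  # look for the next ORF in this frame
    return best


def looks_noncoding(seq: str, max_aa: int = 30) -> bool:
    """True if no positive-strand ORF is longer than max_aa amino acids."""
    return longest_orf_codons(seq) <= max_aa


# Hypothetical sequences: a 6-codon ORF and a 41-codon ORF in frame 1.
short_cdna = "ATG" + "GCT" * 5 + "TAA"
long_cdna = "C" + "ATG" + "GCT" * 40 + "TGA"
```

Only the given strand is scanned, matching the "positive-strand" wording of the abstract; ORFs that lack an in-frame stop codon are ignored under the assumption stated above.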
Affiliation(s)
- Hui Jia
- Center for Molecular Medicine and Genetics, Wayne State University, Detroit, MI 48202, USA
|
26
|
Jia H, Osak M, Bogu GK, Stanton LW, Johnson R, Lipovich L. Genome-wide computational identification and manual annotation of human long noncoding RNA genes. RNA (NEW YORK, N.Y.) 2010; 16:1478-87. [PMID: 20587619 PMCID: PMC2905748 DOI: 10.1261/rna.1951310] [Citation(s) in RCA: 294] [Impact Index Per Article: 21.0] [Indexed: 05/08/2023]
Abstract
Experimental evidence suggests that half or more of the mammalian transcriptome consists of noncoding RNA. Noncoding RNAs are divided into short noncoding RNAs (including microRNAs) and long noncoding RNAs (lncRNAs). We defined complementary DNAs (cDNAs) lacking any positive-strand open reading frames (ORFs) longer than 30 amino acids, as well as cDNAs lacking any evidence of interspecies conservation of their longer-than-30-amino acid ORFs, as noncoding. We have identified 5446 lncRNA genes in the human genome from approximately 24,000 full-length cDNAs, using our new ORF-prediction pipeline. We combined them nonredundantly with lncRNAs from four published sources to derive 6736 lncRNA genes. In an effort to distinguish standalone and antisense lncRNA genes from database artifacts, we stratified our catalog of lncRNAs according to the distance between each lncRNA gene candidate and its nearest known protein-coding gene. We concurrently examined the protein-coding capacity of known genes overlapping with lncRNAs. Remarkably, 62% of known genes with "hypothetical protein" names actually lacked protein-coding capacity. This study has greatly expanded the known human lncRNA catalog, increased its accuracy through manual annotation of cDNA-to-genome alignments, and revealed that a large set of hypothetical-protein genes in GenBank lacks protein-coding capacity. In addition, we have developed, independently of existing NCBI tools, command-line programs with high-throughput ORF-finding and BLASTP-parsing functionality, suitable for future automated assessments of protein-coding capacity of novel transcripts.
Affiliation(s)
- Hui Jia
- Center for Molecular Medicine and Genetics, Wayne State University, Detroit, MI 48202, USA
|
27
|
Navratil V, Lotteau V, Rabourdin-Combe C. [The virtual infected cell: a systems biology rational for antiviral drug discovery]. Med Sci (Paris) 2010; 26:603-9. [PMID: 20619162 DOI: 10.1051/medsci/2010266-7603] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/14/2022]
Abstract
Infection caused by pathogens kills millions of people every year. Comprehensive understanding of molecular pathogen-host interactions, i.e. the infectome, is one of the key steps towards the development of novel diagnostic, therapeutic and preventive strategies. In this quest, progress in high-throughput "omics" technologies applied to pathogens, i.e. infectomics, opens new perspectives toward systemic understanding of perturbations induced during infection. Deciphering the pathogen-host system also relies on the analytical and predictive power of molecular systems biology and on the development of in silico models that take into account the whole picture of the molecules and their interactions. In this context, we have reconstructed a prototype of the human virtual infected cell based on 30 years of intensive research in the field of molecular virology. This model contains more than one hundred viral infectomes, including major human pathogens (HCV, HBV, HIV, HHV, HPV) and has led to the generation of novel systems-level hypotheses that could be suitable for the development of innovative antiviral strategies based on the control of cellular functions.
28
|
Shimada MK, Hayakawa Y, Takeda JI, Gojobori T, Imanishi T. A comprehensive survey of human polymorphisms at conserved splice dinucleotides and its evolutionary relationship with alternative splicing. BMC Evol Biol 2010; 10:122. [PMID: 20433709 PMCID: PMC2882926 DOI: 10.1186/1471-2148-10-122] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 10/30/2009] [Accepted: 04/30/2010] [Indexed: 12/17/2022]
Abstract
BACKGROUND Alternative splicing (AS) is a key molecular process that endows biological functions with diversity and complexity. Generally, functional redundancy leads to the generation of new functions through relaxation of selective pressure in evolution, as exemplified by duplicated genes. It is also known that alternatively spliced exons (ASEs) are subject to relaxed selective pressure. Within consensus sequences at the splice junctions, the most conserved sites are dinucleotides at both ends of introns (splice dinucleotides). However, a small number of single nucleotide polymorphisms (SNPs) occur at splice dinucleotides. An intriguing question relating to the evolution of AS diversity is whether mutations at splice dinucleotides are maintained as polymorphisms and produce diversity in splice patterns within the human population. We therefore surveyed validated SNPs in the database dbSNP located at splice dinucleotides of all human genes that are defined by the H-Invitational Database. RESULTS We found 212 validated SNPs at splice dinucleotides (sdSNPs); these were confirmed to be consistent with the GT-AG rule at either allele. Moreover, 53 of them were observed to neighbor ASEs (AE dinucleotides). No significant differences were observed between sdSNPs at AE dinucleotides and those at constitutive exons (CE dinucleotides) in SNP properties including average heterozygosity, SNP density, ratio of predicted alleles consistent with the GT-AG rule, and scores of splice sites formed with the predicted allele. We also found that the proportion of non-conserved exons was higher for exons with sdSNPs than for other exons. CONCLUSIONS sdSNPs are found at CE dinucleotides in addition to those at AE dinucleotides, suggesting two possibilities. First, sdSNPs at CE dinucleotides may be robust against sdSNPs because of unknown mechanisms. 
Second, similar to sdSNPs at AE dinucleotides, those at CE dinucleotides cause differences in AS patterns because of the arbitrariness in the classification of exons into alternative and constitutive type that varies according to the dataset. Taking into account the absence of differences in sdSNP properties between those at AE and CE dinucleotides, the increased proportion of non-conserved exons found in exons flanked by sdSNPs suggests the hypothesis that sdSNPs are maintained at the splice dinucleotides of newly generated exons at which negative selection pressure is relaxed.
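The check that each allele of an sdSNP is consistent with the GT-AG rule can be sketched as follows. This is an illustration under stated assumptions, not the authors' method: the intron is given as a plain sequence on the transcribed strand, the SNP position is a 0-based offset within that intron, and the function names are hypothetical.

```python
# Illustrative only: test which alleles of a SNP at a splice dinucleotide
# leave the intron consistent with the canonical GT-AG rule.


def follows_gt_ag(intron: str) -> bool:
    """True when the intron starts with GT and ends with AG."""
    s = intron.upper()
    return len(s) >= 4 and s.startswith("GT") and s.endswith("AG")


def alleles_consistent_with_gt_ag(intron: str, snp_offset: int, alleles: list[str]) -> dict[str, bool]:
    """Substitute each allele at snp_offset (0-based, within the intron)
    and report whether the resulting intron still follows GT-AG."""
    result = {}
    for allele in alleles:
        variant = intron[:snp_offset] + allele + intron[snp_offset + 1:]
        result[allele] = follows_gt_ag(variant)
    return result
```

For an sdSNP at the first intron base, `alleles_consistent_with_gt_ag(intron, 0, ["G", "A"])` would mark only the `G` allele as consistent. Non-canonical splice site classes such as GC-AG are deliberately not handled in this sketch.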
Affiliation(s)
- Makoto K Shimada
- Biomedicinal Information Research Center, National Institute of Advanced Industrial Science and Technology, 2-42 Aomi Koto-ku, Tokyo 135-0064, Japan
- Japan Biological Informatics Consortium, 10F TIME24 Building, 2-45 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Institute for Comprehensive Medical Science, Fujita Health University, 1-98 Dengakugakubo, Kutsukake-cho, Toyoake, Aichi 470-1192, Japan
- Yosuke Hayakawa
- Japan Biological Informatics Consortium, 10F TIME24 Building, 2-45 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Hitachi Software Engineering Co., Ltd., 1-1-43 Suehirocho, Tsurumi-ku, Yokohama 230-0045, Japan
- Jun-ichi Takeda
- Biomedicinal Information Research Center, National Institute of Advanced Industrial Science and Technology, 2-42 Aomi Koto-ku, Tokyo 135-0064, Japan
- Japan Biological Informatics Consortium, 10F TIME24 Building, 2-45 Aomi, Koto-ku, Tokyo 135-0064, Japan
- Takashi Gojobori
- Biomedicinal Information Research Center, National Institute of Advanced Industrial Science and Technology, 2-42 Aomi Koto-ku, Tokyo 135-0064, Japan
- Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, 1111 Yata, Mishima, Shizuoka 411-8540, Japan
- Tadashi Imanishi
- Biomedicinal Information Research Center, National Institute of Advanced Industrial Science and Technology, 2-42 Aomi Koto-ku, Tokyo 135-0064, Japan
29
|
Risueño A, Fontanillo C, Dinger ME, De Las Rivas J. GATExplorer: genomic and transcriptomic explorer; mapping expression probes to gene loci, transcripts, exons and ncRNAs. BMC Bioinformatics 2010; 11:221. [PMID: 20429936 PMCID: PMC2875241 DOI: 10.1186/1471-2105-11-221] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.1] [Received: 10/23/2009] [Accepted: 04/29/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND Genome-wide expression studies have developed exponentially in recent years as a result of extensive use of microarray technology. However, expression signals are typically calculated using the assignment of "probesets" to genes, without addressing the problem of "gene" definition or proper consideration of the location of the measuring probes in the context of the currently known genomes and transcriptomes. Moreover, as our knowledge of metazoan genomes improves, the number of both protein-coding and noncoding genes, as well as their associated isoforms, continues to increase. Consequently, there is a need for new databases that combine genomic and transcriptomic information and provide updated mapping of expression probes to current genomic annotations. RESULTS GATExplorer (Genomic and Transcriptomic Explorer) is a database and web platform that integrates a gene loci browser with nucleotide level mappings of oligo probes from expression microarrays. It allows interactive exploration of gene loci, transcripts and exons of human, mouse and rat genomes, and shows the specific location of all mappable Affymetrix microarray probes and their respective expression levels in a broad set of biological samples. The web site allows visualization of probes in their genomic context together with any associated protein-coding or noncoding transcripts. In the case of all-exon arrays, this provides a means by which the expression of the individual exons within a gene can be compared, thereby facilitating the identification and analysis of alternatively spliced exons. The application integrates data from four major source databases: Ensembl, RNAdb, Affymetrix and GeneAtlas; and it provides the users with a series of files and packages (R CDFs) to analyze particular query expression datasets. The maps cover both the widely used Affymetrix GeneChip microarrays based on 3' expression (e.g. human HG U133 series) and the all-exon expression microarrays (Gene 1.0 and Exon 1.0).
CONCLUSIONS GATExplorer is an integrated database that combines genomic/transcriptomic visualization with nucleotide-level probe mapping. By considering expression at the nucleotide level rather than the gene level, it shows that the arrays detect expression signals from entities that most researchers do not contemplate or discriminate. This approach provides the means to undertake a higher resolution analysis of microarray data and potentially extract considerably more detailed and biologically accurate information from existing and future microarray experiments.
Affiliation(s)
- Alberto Risueño
- Bioinformatics and Functional Genomics Research Group, Cancer Research Center (CiC-IBMCC, CSIC/USAL), Salamanca, Spain
30
|
Mining mammalian transcript data for functional long non-coding RNAs. PLoS One 2010; 5:e10316. [PMID: 20428234 PMCID: PMC2859052 DOI: 10.1371/journal.pone.0010316] [Citation(s) in RCA: 112] [Impact Index Per Article: 8.0] [Received: 07/23/2009] [Accepted: 03/30/2010] [Indexed: 12/11/2022]
Abstract
Background The role of long non-coding RNAs (lncRNAs) in controlling gene expression has garnered increased interest in recent years. Sequencing projects, such as Fantom3 for mouse and H-InvDB for human, have generated abundant data on transcribed components of mammalian cells, the majority of which appear not to be protein-coding. However, much of the non-protein-coding transcriptome could merely be a consequence of ‘transcription noise’. It is therefore essential to use bioinformatic approaches to identify the likely functional candidates in a high throughput manner. Principal Findings We derived a scheme for classifying and annotating likely functional lncRNAs in mammals. Using the available experimental full-length cDNA data sets for human and mouse, we identified 78 lncRNAs that are either syntenically conserved between human and mouse, or that originate from the same protein-coding genes. Of these, 11 have significant sequence homology. We found that these lncRNAs exhibit: (i) patterns of codon substitution typical of non-coding transcripts; (ii) preservation of sequences in distant mammals such as dog and cow, (iii) significant sequence conservation relative to their corresponding flanking regions (in 50% cases, flanking regions do not have homology at all; and in the remaining, the degree of conservation is significantly less); (iv) existence mostly as single-exon forms (8/11); and, (v) presence of conserved and stable secondary structure motifs within them. We further identified orthologous protein-coding genes that are contributing to the pool of lncRNAs; of which, genes implicated in carcinogenesis are significantly over-represented. Conclusion Our comparative mammalian genomics approach coupled with evolutionary analysis identified a small population of conserved long non-protein-coding RNAs (lncRNAs) that are potentially functional across Mammalia. 
Additionally, our analysis indicates that amongst the orthologous protein-coding genes that produce lncRNAs, those implicated in cancer pathogenesis are significantly over-represented, suggesting that these lncRNAs could play an important role in cancer pathomechanisms.
31
|
Mochida K, Shinozaki K. Genomics and bioinformatics resources for crop improvement. PLANT & CELL PHYSIOLOGY 2010; 51:497-523. [PMID: 20208064 PMCID: PMC2852516 DOI: 10.1093/pcp/pcq027] [Citation(s) in RCA: 79] [Impact Index Per Article: 5.6] [Received: 02/08/2010] [Accepted: 03/01/2010] [Indexed: 05/19/2023]
Abstract
Recent remarkable innovations in platforms for omics-based research and application development provide crucial resources to promote research in model and applied plant species. A combinatorial approach using multiple omics platforms and integration of their outcomes is now an effective strategy for clarifying molecular systems integral to improving plant productivity. Furthermore, promotion of comparative genomics among model and applied plants allows us to grasp the biological properties of each species and to accelerate gene discovery and functional analyses of genes. Bioinformatics platforms and their associated databases are also essential for the effective design of approaches making the best use of genomic resources, including resource integration. We review recent advances in research platforms and resources in plant omics together with related databases and advances in technology.
32
|
Kawahara Y, Sakate R, Matsuya A, Murakami K, Sato Y, Zhang H, Gojobori T, Itoh T, Imanishi T. G-compass: a web-based comparative genome browser between human and other vertebrate genomes. Bioinformatics 2009; 25:3321-2. [PMID: 19846439 PMCID: PMC2788932 DOI: 10.1093/bioinformatics/btp594] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 12/02/2022]
Abstract
Summary: G-compass is designed for efficient comparative genome analysis between human and other vertebrate genomes. The current version of G-compass allows us to browse two corresponding genomic regions between human and another species in parallel. One-to-one evolutionarily conserved regions (i.e. orthologous regions) between species are highlighted along the genomes. Information such as locations of duplicated regions, copy number variations and mammalian ultra-conserved elements is also provided. These features of G-compass enable us to easily determine patterns of genomic rearrangements and changes in gene orders through evolutionary time. Since G-compass is a satellite database of H-InvDB, which is a comprehensive annotation resource for human genes and transcripts, users can easily refer to manually curated functional annotations and other abundant biological information for each human transcript. G-compass is expected to be a valuable tool for comparing human and model organisms and promoting the exchange of functional information. Availability: G-compass is freely available at http://www.h-invitational.jp/g-compass/. Contact: t.imanishi@aist.go.jp
Affiliation(s)
- Yoshihiro Kawahara
- Japan Biological Information Research Center, Japan Biological Informatics Consortium, Aomi 2-42, Koto-ku, Tokyo 135-0064, Japan
33
|
Ren J, Jiang C, Gao X, Liu Z, Yuan Z, Jin C, Wen L, Zhang Z, Xue Y, Yao X. PhosSNP for systematic analysis of genetic polymorphisms that influence protein phosphorylation. Mol Cell Proteomics 2009; 9:623-34. [PMID: 19995808 DOI: 10.1074/mcp.m900273-mcp200] [Citation(s) in RCA: 64] [Impact Index Per Article: 4.3] [Indexed: 12/12/2022]
Abstract
We are entering the era of personalized genomics as breakthroughs in sequencing technology have made it possible to sequence or genotype an individual person in an efficient and accurate manner. Preliminary results from HapMap and other similar projects have revealed the existence of tremendous genetic variations among world populations and among individuals. It is important to delineate the functional implication of such variations, i.e. whether they affect the stability and biochemical properties of proteins. It is also generally believed that the genetic variation is the main cause for different susceptibility to certain diseases or different response to therapeutic treatments. Understanding genetic variation in the context of human diseases thus holds the promise for "personalized medicine." In this work, we carried out a genome-wide analysis of single nucleotide polymorphisms (SNPs) that could potentially influence protein phosphorylation characteristics in human. Here, we defined a phosphorylation-related SNP (phosSNP) as a non-synonymous SNP (nsSNP) that affects the protein phosphorylation status. Using an in-house developed kinase-specific phosphorylation site predictor (GPS 2.0), we computationally detected that approximately 70% of the reported nsSNPs are potential phosSNPs. More interestingly, approximately 74.6% of these potential phosSNPs might also induce changes in protein kinase types in adjacent phosphorylation sites rather than creating or removing phosphorylation sites directly. Taken together, we proposed that a large proportion of the nsSNPs might affect protein phosphorylation characteristics and play important roles in rewiring biological pathways. Finally, all phosSNPs were integrated into the PhosSNP 1.0 database, which was implemented in JAVA 1.5 (J2SE 5.0). The PhosSNP 1.0 database is freely available for academic researchers.
Affiliation(s)
- Jian Ren
- Department of Systems Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, China
34
|
Takeda JI, Suzuki Y, Sakate R, Sato Y, Gojobori T, Imanishi T, Sugano S. H-DBAS: human-transcriptome database for alternative splicing: update 2010. Nucleic Acids Res 2009; 38:D86-90. [PMID: 19969536 PMCID: PMC2808982 DOI: 10.1093/nar/gkp984] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.3] [Indexed: 12/22/2022]
Abstract
H-DBAS (http://h-invitational.jp/h-dbas/) is a specialized database for human alternative splicing (AS) based on H-Invitational full-length cDNAs. In this update, for better annotations of AS events, we correlated RNA-Seq tag information to the AS exons and splice junctions. We generated a total of 148 376 598 RNA-Seq tags from RNAs extracted from cytoplasmic, nuclear and polysome fractions. Analysis of the RNA-Seq tags allowed us to identify 90 900 exons that are very likely to be used for protein synthesis. On the other hand, 254 AS junctions of human RefSeq transcripts are unique to nuclear RNA and may not have any translational consequences. We also present a new comparative genomics viewer so that users can empirically understand the evolutionary turnover of AS. With the unique experimental data closely connected with intensively curated cDNA information, H-DBAS provides a unique platform for the analysis of complex AS.
Affiliation(s)
- Jun-ichi Takeda
- Integrated Database and Systems Biology Team, Biomedicinal Information Research Center National Institute of Advanced Industrial Science and Technology, AIST Bio-IT Research Bldg Aomi 2-4-7, Koto-ku, Tokyo 135-0064, Japan
35
|
Yamasaki C, Murakami K, Takeda JI, Sato Y, Noda A, Sakate R, Habara T, Nakaoka H, Todokoro F, Matsuya A, Imanishi T, Gojobori T. H-InvDB in 2009: extended database and data mining resources for human genes and transcripts. Nucleic Acids Res 2009; 38:D626-32. [PMID: 19933760 PMCID: PMC2808976 DOI: 10.1093/nar/gkp1020] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Indexed: 11/13/2022]
Abstract
We report the extended database and data mining resources newly released in the H-Invitational Database (H-InvDB; http://www.h-invitational.jp/). H-InvDB is a comprehensive annotation resource of human genes and transcripts, and consists of two main views and six sub-databases. The latest release of H-InvDB (release 6.2) provides the annotation for 219,765 human transcripts in 43,159 human gene clusters based on human full-length cDNAs and mRNAs. H-InvDB now provides several new annotation features, such as mapping of microarray probes, new gene models, relation to known ncRNAs and information from the Glycogene database. H-InvDB also provides useful data mining resources: 'Navigation search', 'H-InvDB Enrichment Analysis Tool (HEAT)' and web service APIs. 'Navigation search' is an extended search system that enables complicated searches by combining 16 different search options. HEAT is a data mining tool for automatically identifying features specific to a given human gene set. HEAT searches for H-InvDB annotations that are significantly enriched in a user-defined gene set, as compared with the entire H-InvDB representative transcripts. H-InvDB now has web service APIs of SOAP and REST to allow the use of H-InvDB data in programs, providing the users extended data accessibility.
Affiliation(s)
- Chisato Yamasaki
- BIRC, NIG Waterfront Bio-IT Research Building, 4-7 Aomi, Tokyo 135-0064, Japan
36
|
Mochida K, Yoshida T, Sakurai T, Ogihara Y, Shinozaki K. TriFLDB: a database of clustered full-length coding sequences from Triticeae with applications to comparative grass genomics. PLANT PHYSIOLOGY 2009; 150:1135-46. [PMID: 19448038 PMCID: PMC2705016 DOI: 10.1104/pp.109.138214] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.8] [Received: 03/07/2009] [Accepted: 05/08/2009] [Indexed: 05/19/2023]
Abstract
The Triticeae Full-Length CDS Database (TriFLDB) contains available information regarding full-length coding sequences (CDSs) of the Triticeae crops wheat (Triticum aestivum) and barley (Hordeum vulgare) and includes functional annotations and comparative genomics features. TriFLDB provides a search interface using keywords for gene function and related Gene Ontology terms and a similarity search for DNA and deduced translated amino acid sequences to access annotations of Triticeae full-length CDS (TriFLCDS) entries. Annotations consist of similarity search results against several sequence databases and domain structure predictions by InterProScan. The deduced amino acid sequences in TriFLDB are grouped with the proteome datasets for Arabidopsis (Arabidopsis thaliana), rice (Oryza sativa), and sorghum (Sorghum bicolor) by hierarchical clustering in stepwise thresholds of sequence identity, providing hierarchical clustering results based on full-length protein sequences. The database also provides sequence similarity results based on comparative mapping of TriFLCDSs onto the rice and sorghum genome sequences, which together with current annotations can be used to predict gene structures for TriFLCDS entries. To provide the possible genetic locations of full-length CDSs, TriFLCDS entries are also assigned to the genetically mapped cDNA sequences of barley and diploid wheat, which are currently accommodated in the Triticeae Mapped EST Database. These relational data are searchable from the search interfaces of both databases. The current TriFLDB contains 15,871 full-length CDSs from barley and wheat and includes putative full-length cDNAs for barley and wheat, which are publicly accessible. This informative content provides an informatics gateway for Triticeae genomics and grass comparative genomics. TriFLDB is publicly available at http://TriFLDB.psc.riken.jp/.
37
|
Imanishi T, Nakaoka H. Hyperlink Management System and ID Converter System: enabling maintenance-free hyperlinks among major biological databases. Nucleic Acids Res 2009; 37:W17-22. [PMID: 19454601 PMCID: PMC2703917 DOI: 10.1093/nar/gkp355] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Indexed: 11/14/2022]
Abstract
Hyperlink Management System (HMS) is a system for automatically updating and maintaining hyperlinks among major public databases in the field of life science. Every day, we create correspondence tables of data IDs of major databases for human genes and proteins, and we provide a CGI program that returns correct and up-to-date URLs for showing data of various databases that correspond to user-specified IDs. The HMS can deal with various IDs: accession numbers of International Nucleotide Sequence Databases, HUGO Gene Symbols and IDs of UniProt, PDB, H-InvDB and others, and it can return URLs of various databases: H-InvDB, HUGO Gene Nomenclature Committee Database, NCBI Entrez Gene, UniProt, PDB and others. For example, 23 297 pages of Locus view of H-InvDB are reachable by using HUGO Gene Symbols through the HMS. In addition to the CGI program, the HMS provides a Web page for finding and opening URLs of these databases. Although hyperlinking is an effective way of relating biological data among different databases, updating hyperlinks has been laborious work. The HMS fully automates the job, enabling maintenance-free hyperlinks. We also developed the ID Converter System (ICS) for simply converting data IDs by using the correspondence tables in the HMS. The HMS and ICS are freely available at http://biodb.jp/.
Affiliation(s)
- Tadashi Imanishi
- Biomedicinal Information Research Center, National Institute of Advanced Industrial Science and Technology, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan.
38
|
Arvidsson S, Kwasniewski M, Riaño-Pachón DM, Mueller-Roeber B. QuantPrime--a flexible tool for reliable high-throughput primer design for quantitative PCR. BMC Bioinformatics 2008; 9:465. [PMID: 18976492 PMCID: PMC2612009 DOI: 10.1186/1471-2105-9-465] [Citation(s) in RCA: 383] [Impact Index Per Article: 23.9] [Received: 08/04/2008] [Accepted: 11/01/2008] [Indexed: 11/21/2022]
Abstract
Background Medium- to large-scale expression profiling using quantitative polymerase chain reaction (qPCR) assays are becoming increasingly important in genomics research. A major bottleneck in experiment preparation is the design of specific primer pairs, where researchers have to make several informed choices, often outside their area of expertise. Using currently available primer design tools, several interactive decisions have to be made, resulting in lengthy design processes with varying qualities of the assays. Results Here we present QuantPrime, an intuitive and user-friendly, fully automated tool for primer pair design in small- to large-scale qPCR analyses. QuantPrime can be used online through the internet or on a local computer after download; it offers design and specificity checking with highly customizable parameters and is ready to use with many publicly available transcriptomes of important higher eukaryotic model organisms and plant crops (currently 295 species in total), while benefiting from exon-intron border and alternative splice variant information in available genome annotations. Experimental results with the model plant Arabidopsis thaliana, the crop Hordeum vulgare and the model green alga Chlamydomonas reinhardtii show success rates of designed primer pairs exceeding 96%. Conclusion QuantPrime constitutes a flexible, fully automated web application for reliable primer design for use in larger qPCR experiments, as proven by experimental data. The flexible framework is also open for simple use in other quantification applications, such as hydrolyzation probe design for qPCR and oligonucleotide probe design for quantitative in situ hybridization. Future suggestions made by users can be easily implemented, thus allowing QuantPrime to be developed into a broad-range platform for the design of RNA expression assays.
Affiliation(s)
- Samuel Arvidsson
- Max-Planck Institute of Molecular Plant Physiology, Potsdam-Golm, Germany.
39
|
Shimada MK, Matsumoto R, Hayakawa Y, Sanbonmatsu R, Gough C, Yamaguchi-Kabata Y, Yamasaki C, Imanishi T, Gojobori T. VarySysDB: a human genetic polymorphism database based on all H-InvDB transcripts. Nucleic Acids Res 2008; 37:D810-5. [PMID: 18953038 PMCID: PMC2686441 DOI: 10.1093/nar/gkn798] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022]
Abstract
Creation of a vast variety of proteins is accomplished by genetic variation and a variety of alternative splicing transcripts. Currently, however, the abundant available data on genetic variation and the transcriptome are stored independently and in a dispersed fashion. In order to provide a research resource regarding the effects of human genetic polymorphism on various transcripts, we developed VarySysDB, a genetic polymorphism database based on 187,156 extensively annotated matured mRNA transcripts from 36,073 loci provided by H-InvDB. VarySysDB offers information encompassing published human genetic polymorphisms for each of these transcripts separately. This allows comparisons of effects derived from a polymorphism on different transcripts. The published information we analyzed includes single nucleotide polymorphisms and deletion-insertion polymorphisms from dbSNP, copy number variations from Database of Genomic Variants, short tandem repeats and single amino acid repeats from H-InvDB and linkage disequilibrium regions from D-HaploDB. The information can be searched and retrieved by features, functions and effects of polymorphisms, as well as by keywords. VarySysDB combines two kinds of viewers, GBrowse and Sequence View, to facilitate understanding of the positional relationship among polymorphisms, genome, transcripts, loci and functional domains. We expect that VarySysDB will yield useful information on polymorphisms affecting gene expression and phenotypes. VarySysDB is available at http://h-invitational.jp/varygene/.
Affiliation(s)
- Makoto K Shimada
- Integrated Database and Systems Biology Team, Biomedicinal Information Research Center, National Institute of Advanced Industrial Science and Technology, Japan Biological Informatics Consortium, Hitachi Software Engineering Co., Ltd., Tokyo, Japan
40
|
Yamaguchi-Kabata Y, Shimada MK, Hayakawa Y, Minoshima S, Chakraborty R, Gojobori T, Imanishi T. Distribution and effects of nonsense polymorphisms in human genes. PLoS One 2008; 3:e3393. [PMID: 18852891 PMCID: PMC2561068 DOI: 10.1371/journal.pone.0003393] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.9] [Received: 03/06/2008] [Accepted: 09/03/2008] [Indexed: 11/20/2022]
Abstract
Background A great amount of data has been accumulated on genetic variations in the human genome, but we still do not know much about how the genetic variations affect gene function. In particular, little is known about the distribution of nonsense polymorphisms in human genes despite their drastic effects on gene products. Methodology/Principal Findings To detect polymorphisms affecting gene function, we analyzed all publicly available polymorphisms in a database for single nucleotide polymorphisms (dbSNP build 125) located in the exons of 36,712 known and predicted protein-coding genes that were defined in an annotation project of all human genes and transcripts (H-InvDB ver3.8). We found a total of 252,555 single nucleotide polymorphisms (SNPs) and 8,479 insertion and deletions in the representative transcripts in these genes. The SNPs located in ORFs include 40,484 synonymous and 53,754 nonsynonymous SNPs, and 1,258 SNPs that were predicted to be nonsense SNPs or read-through SNPs. We estimated the density of nonsense SNPs to be 0.85×10⁻³ per site, which is lower than that of nonsynonymous SNPs (2.1×10⁻³ per site). On average, nonsense SNPs were located 250 codons upstream of the original termination codon, with the substitution occurring most frequently at the first codon position. Of the nonsense SNPs, 581 were predicted to cause nonsense-mediated decay (NMD) of transcripts that would prevent translation. We found that nonsense SNPs causing NMD were more common in genes involving kinase activity and transport. The remaining 602 nonsense SNPs are predicted to produce truncated polypeptides, with an average truncation of 75 amino acids. In addition, 110 read-through SNPs at termination codons were detected. Conclusion/Significance Our comprehensive exploration of nonsense polymorphisms showed that nonsense SNPs exist at a lower density than nonsynonymous SNPs, suggesting that nonsense mutations have more severe effects than amino acid changes.
The correspondence of nonsense SNPs to known pathological variants suggests that phenotypic effects of nonsense SNPs have been reported for only a small fraction of nonsense SNPs, and that nonsense SNPs causing NMD are more likely to be involved in phenotypic variations. These nonsense SNPs may include pathological variants that have not yet been reported. These data are available from Transcript View of H-InvDB and VarySysDB (http://h-invitational.jp/varygene/).
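The SNP categories used in this abstract (synonymous, nonsynonymous, nonsense, read-through) can be made concrete with a small sketch. It assumes the standard genetic code and a single-base substitution within one codon; it is not the authors' annotation pipeline, and the function name and argument layout are hypothetical. A stop-to-stop change is labeled synonymous here, which is a simplifying choice.

```python
# Illustrative only: classify a single-base change inside one codon
# using the standard genetic code.
from itertools import product

BASES = "TCAG"
# Amino acids for codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_coding_snp(ref_codon: str, pos: int, alt_base: str) -> str:
    """Classify the substitution of alt_base at position pos (0, 1 or 2)
    of ref_codon, given on the coding strand."""
    ref = ref_codon.upper()
    alt = ref[:pos] + alt_base.upper() + ref[pos + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref], CODON_TABLE[alt]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"      # a sense codon becomes a stop codon
    if ref_aa == "*":
        return "read-through"  # a stop codon becomes a sense codon
    return "nonsynonymous"
```

For example, `classify_coding_snp("TGG", 2, "A")` describes a tryptophan codon changed to TGA. Predicting whether a nonsense SNP triggers NMD, as the study does, would additionally require exon-junction positions and is outside this sketch.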
Affiliation(s)
- Yumi Yamaguchi-Kabata
- Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
- Makoto K. Shimada
- Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
- Japan Biological Information Research Center, Japan Biological Informatics Consortium, Tokyo, Japan
- Yosuke Hayakawa
- Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
- Japan Biological Information Research Center, Japan Biological Informatics Consortium, Tokyo, Japan
- Ranajit Chakraborty
- Center for Genome Information, University of Cincinnati, Cincinnati, Ohio, United States of America
- Takashi Gojobori
- Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
- Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Mishima, Shizuoka, Japan
- Tadashi Imanishi
- Biological Information Research Center, National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
|
41
|
Takeda JI, Suzuki Y, Sakate R, Sato Y, Seki M, Irie T, Takeuchi N, Ueda T, Nakao M, Sugano S, Gojobori T, Imanishi T. Low conservation and species-specific evolution of alternative splicing in humans and mice: comparative genomics analysis using well-annotated full-length cDNAs. Nucleic Acids Res 2008; 36:6386-95. [PMID: 18838389 PMCID: PMC2582632 DOI: 10.1093/nar/gkn677] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Indexed: 12/20/2022]
Abstract
Using full-length cDNA sequences, we compared alternative splicing (AS) in humans and mice. The alignment of the human and mouse genomes showed that 86% of 199 426 total exons in human AS variants were conserved in the mouse genome. Of the 20 392 total human AS variants, however, 59% consisted of all conserved exons. Comparing AS patterns between human and mouse transcripts revealed that only 431 transcripts from 189 loci were perfectly conserved AS variants. To exclude the possibility that the full-length human cDNAs used in the present study, especially those with retained introns, were cloning artefacts or prematurely spliced transcripts, we experimentally validated 34 such cases. Our results indicate that even retained-intron type transcripts are typically expressed in a highly controlled manner and interact with translating ribosomes. We found non-conserved AS exons to be predominantly outside the coding sequences (CDSs). This suggests that non-conserved exons in the CDSs of transcripts cause functional constraint. These findings should enhance our understanding of the relationship between AS and species specificity of human genes.
Affiliation(s)
- Jun-Ichi Takeda
- Integrated Database and Systems Biology Team, Biomedicinal Information Research Center, National Institute of Advanced Industrial Science and Technology, AIST Bio-IT Research Building, Aomi 2-42, Koto-ku, Tokyo, Japan
|
42
|
Li JT, Zhang Y, Kong L, Liu QR, Wei L. Trans-natural antisense transcripts including noncoding RNAs in 10 species: implications for expression regulation. Nucleic Acids Res 2008; 36:4833-44. [PMID: 18653530 PMCID: PMC2528163 DOI: 10.1093/nar/gkn470] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.0] [Indexed: 11/14/2022]
Abstract
Natural antisense transcripts are at least partially complementary to their sense transcripts. Cis-Sense/Antisense pairs (cis-SAs) have been extensively characterized and known to play diverse regulatory roles, whereas trans-Sense/Antisense pairs (trans-SAs) in animals are poorly studied. We identified long trans-SAs in human and nine other animals, using ESTs to increase coverage significantly over previous studies. The percentage of transcriptional units (TUs) involved in trans-SAs among all TUs was as high as 4.13%. Particularly 2896 human TUs (or 2.89% of all human TUs) were involved in 3327 trans-SAs. Sequence complementarities over multiple segments with predicted RNA hybridization indicated that some trans-SAs might have sophisticated RNA-RNA pairing patterns. One-fourth of human trans-SAs involved noncoding TUs, suggesting that many noncoding RNAs may function by a trans-acting antisense mechanism. TUs in trans-SAs were statistically significantly enriched in nucleic acid binding, ion/protein binding and transport and signal transduction functions and pathways; a significant number of human trans-SAs showed concordant or reciprocal expression pattern; a significant number of human trans-SAs were conserved in mouse. This evidence suggests important regulatory functions of trans-SAs. In 30 cases, trans-SAs were related to cis-SAs through paralogues, suggesting a possible mechanism for the origin of trans-SAs. All trans-SAs are available at http://trans.cbi.pku.edu.cn/.
Affiliation(s)
- Jiong-Tang Li
- Center for Bioinformatics, National Laboratory of Protein Engineering and Plant Genetic Engineering, College of Life Sciences, Peking University, Beijing, 100871, PR China
|
43
|
Isoform discovery by targeted cloning, 'deep-well' pooling and parallel sequencing. Nat Methods 2008; 5:597-600. [PMID: 18552854 DOI: 10.1038/nmeth.1224] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.6] [Received: 02/19/2008] [Accepted: 05/21/2008] [Indexed: 12/12/2022]
Abstract
Describing the 'ORFeome' of an organism, including all major isoforms, is essential for a system-level understanding of any species; however, conventional cloning and sequencing approaches are prohibitively costly and labor-intensive. We describe a potentially genome-wide methodology for efficiently capturing new coding isoforms using reverse transcriptase (RT)-PCR recombinational cloning, 'deep-well' pooling and a next-generation sequencing platform. This ORFeome discovery pipeline will be applicable to any eukaryotic species with a sequenced genome.
|
44
|
Matsuya A, Sakate R, Kawahara Y, Koyanagi KO, Sato Y, Fujii Y, Yamasaki C, Habara T, Nakaoka H, Todokoro F, Yamaguchi K, Endo T, Oota S, Makalowski W, Ikeo K, Suzuki Y, Hanada K, Hashimoto K, Hirai M, Iwama H, Saitou N, Hiraki AT, Jin L, Kaneko Y, Kanno M, Murakami K, Noda AO, Saichi N, Sanbonmatsu R, Suzuki M, Takeda JI, Tanaka M, Gojobori T, Imanishi T, Itoh T. Evola: Ortholog database of all human genes in H-InvDB with manual curation of phylogenetic trees. Nucleic Acids Res 2007; 36:D787-92. [PMID: 17982176 PMCID: PMC2238928 DOI: 10.1093/nar/gkm878] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Indexed: 11/15/2022]
Abstract
Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Currently, with the rapid growth of transcriptome data of various species, more reliable orthology information is prerequisite for further studies. However, detection of orthologs could be erroneous if pairwise distance-based methods, such as reciprocal BLAST searches, are utilized. Thus, as a sub-database of H-InvDB, an integrated database of annotated human genes (http://h-invitational.jp/), we constructed a fully curated database of evolutionary features of human genes, called ‘Evola’. In the process of the ortholog detection, computational analysis based on conserved genome synteny and transcript sequence similarity was followed by manual curation by researchers examining phylogenetic trees. In total, 18 968 human genes have orthologs among 11 vertebrates (chimpanzee, mouse, cow, chicken, zebrafish, etc.), either computationally detected or manually curated orthologs. Evola provides amino acid sequence alignments and phylogenetic trees of orthologs and homologs. In ‘dN/dS view’, natural selection on genes can be analyzed between human and other species. In ‘Locus maps’, all transcript variants and their exon/intron structures can be compared among orthologous gene loci. We expect the Evola to serve as a comprehensive and reliable database to be utilized in comparative analyses for obtaining new knowledge about human genes. Evola is available at http://www.h-invitational.jp/evola/.
Affiliation(s)
- Akihiro Matsuya
- Integrated Database Group, Japan Biological Information Research Center, Japan Biological Informatics Consortium, Tokyo, Japan
|