1
Sun Y, Wang HY, Liu B, Yue B, Liu Q, Liu Y, Rosa IF, Doretto LB, Han S, Lin L, Gong X, Shao C. CRISPR/dCas9-Mediated DNA Methylation Editing on emx2 in Chinese Tongue Sole (Cynoglossus semilaevis) Testis Cells. Int J Mol Sci 2024; 25:7637. [PMID: 39062879 PMCID: PMC11277268 DOI: 10.3390/ijms25147637] [Received: 05/27/2024] [Revised: 07/03/2024] [Accepted: 07/09/2024] [Indexed: 07/28/2024]
Abstract
DNA methylation is a key epigenetic mechanism orchestrating gene expression networks in many biological processes. Nonetheless, studying the role of specific gene methylation events in fish remains challenging. In this study, we validated the regulation of empty spiracles homeobox 2 (emx2) expression by DNA methylation using decitabine treatment in Chinese tongue sole testis cells. We used emx2 as the target gene and developed a new DNA methylation editing system by fusing dnmt3a with catalytically dead Cas9 (dCas9), and demonstrated its ability to perform sequence-specific DNA methylation editing. Using dCas9-dnmt3a to target the emx2 promoter region increased DNA methylation levels and decreased emx2 expression in Chinese tongue sole testis cells. More importantly, the DNA methylation editing significantly suppressed the expression of MYC proto-oncogene, bHLH transcription factor (myc), a target gene of emx2. Furthermore, we assessed the off-target effects of dCas9-dnmt3a and found no significant impact on the expression of predicted off-target genes. Taken together, we developed the first DNA methylation editing system in a marine species and demonstrated its effective editing ability in Chinese tongue sole cells. This provides a new strategy for both epigenetic research and molecular breeding of marine species.
Affiliation(s)
- Yanxu Sun
  - College of Fisheries and Life Science, Shanghai Ocean University, Shanghai 201306, China
  - State Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao 266071, China
- Hong-Yan Wang
  - State Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao 266071, China
- Binghua Liu
  - State Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao 266071, China
- Bowen Yue
  - College of Fisheries and Life Science, Shanghai Ocean University, Shanghai 201306, China
  - State Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao 266071, China
- Qian Liu
  - State Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao 266071, China
- Yuyan Liu
  - State Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao 266071, China
- Ivana F. Rosa
  - Department of Structural and Functional Biology, Institute of Biosciences, São Paulo State University (UNESP), Botucatu 01049-010, Brazil
- Lucas B. Doretto
  - State Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao 266071, China
- Shenglei Han
  - State Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao 266071, China
- Lei Lin
  - State Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao 266071, China
- Xiaoling Gong
  - College of Fisheries and Life Science, Shanghai Ocean University, Shanghai 201306, China
  - Key Laboratory of Exploration and Utilization of Aquatic Genetic Resources (Shanghai Ocean University), Ministry of Education, Shanghai 201306, China
  - National Demonstration Center for Experimental Fisheries Science Education, Shanghai Ocean University, Shanghai 201306, China
- Changwei Shao
  - State Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao 266071, China
  - Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao Marine Science and Technology Center, Qingdao 266237, China
2
Ulusoy E, Doğan T. Mutual annotation-based prediction of protein domain functions with Domain2GO. Protein Sci 2024; 33:e4988. [PMID: 38757367 PMCID: PMC11099699 DOI: 10.1002/pro.4988] [Received: 11/07/2023] [Revised: 02/25/2024] [Accepted: 03/30/2024] [Indexed: 05/18/2024]
Abstract
Identifying unknown functional properties of proteins is essential for understanding their roles in both health and disease states. The domain composition of a protein can reveal critical information in this context, as domains are structural and functional units that dictate how the protein should act at the molecular level. The expensive and time-consuming nature of wet-lab experimental approaches prompted researchers to develop computational strategies for predicting the functions of proteins. In this study, we proposed a new method called Domain2GO that infers associations between protein domains and function-defining gene ontology (GO) terms, thus redefining the problem as domain function prediction. Domain2GO uses documented protein-level GO annotations together with proteins' domain annotations. Co-annotation patterns of domains and GO terms in the same proteins are examined using statistical resampling to obtain reliable associations. As a use-case study, we evaluated the biological relevance of examples selected from the Domain2GO-generated domain-GO term mappings via literature review. Then, we applied Domain2GO to predict unknown protein functions by propagating domain-associated GO terms to proteins annotated with these domains. For function prediction performance evaluation and comparison against other methods, we employed Critical Assessment of Function Annotation 3 (CAFA3) challenge datasets. The results demonstrated the high potential of Domain2GO, particularly for predicting molecular function and biological process terms, along with advantages such as producing interpretable results and having an exceptionally low computational cost. The approach presented here can be extended to other ontologies and biological entities to investigate unknown relationships in complex and large-scale biological data. The source code, datasets, results, and user instructions for Domain2GO are available at https://github.com/HUBioDataLab/Domain2GO. 
Additionally, we offer a user-friendly online tool at https://huggingface.co/spaces/HUBioDataLab/Domain2GO, which simplifies the prediction of functions of previously unannotated proteins solely using amino acid sequences.
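The co-annotation idea in this abstract can be illustrated with a minimal Python sketch. It is not the Domain2GO implementation: the abstract says associations are obtained with statistical resampling, whereas this sketch substitutes a plain co-occurrence fraction with a cutoff. The function names, the `min_score` parameter, and the toy identifiers (`D1`, `GO:a`, and so on) are all hypothetical.

```python
from collections import Counter, defaultdict


def domain_go_associations(protein_domains, protein_go, min_score=0.5):
    """Map each domain to the GO terms it is co-annotated with.

    Assumption: the score is the fraction of proteins carrying the domain
    that also carry the GO term. This stands in for the resampling-based
    statistics described in the abstract.
    """
    domain_count = Counter()
    pair_count = Counter()
    for protein, domains in protein_domains.items():
        go_terms = protein_go.get(protein, set())
        for domain in domains:
            domain_count[domain] += 1
            for term in go_terms:
                pair_count[(domain, term)] += 1

    associations = defaultdict(set)
    for (domain, term), n in pair_count.items():
        if n / domain_count[domain] >= min_score:
            associations[domain].add(term)
    return dict(associations)


def propagate(domains, associations):
    """Predict GO terms for a protein from the domains it contains."""
    predicted = set()
    for domain in domains:
        predicted |= associations.get(domain, set())
    return predicted


# Toy annotations (hypothetical identifiers).
protein_domains = {"P1": {"D1"}, "P2": {"D1", "D2"}, "P3": {"D2"}}
protein_go = {"P1": {"GO:a"}, "P2": {"GO:a", "GO:b"}, "P3": {"GO:b"}}

associations = domain_go_associations(protein_domains, protein_go, min_score=0.75)
prediction = propagate({"D2"}, associations)
```

With the 0.75 cutoff, `D1` keeps only `GO:a` and `D2` keeps only `GO:b`, so an unannotated protein containing `D2` inherits `GO:b`. A lower cutoff would also admit the terms that co-occur only through the two-domain protein `P2`.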
Affiliation(s)
- Erva Ulusoy
  - Biological Data Science Lab, Department of Computer Engineering, Hacettepe University, Ankara, Turkey
  - Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey
- Tunca Doğan
  - Biological Data Science Lab, Department of Computer Engineering, Hacettepe University, Ankara, Turkey
  - Department of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey
3
Rutherford KM, Lera-Ramírez M, Wood V. PomBase: a Global Core Biodata Resource-growth, collaboration, and sustainability. Genetics 2024; 227:iyae007. [PMID: 38376816 PMCID: PMC11075564 DOI: 10.1093/genetics/iyae007] [Received: 11/28/2023] [Accepted: 01/13/2024] [Indexed: 02/21/2024]
Abstract
PomBase (https://www.pombase.org), the model organism database (MOD) for fission yeast, was recently awarded Global Core Biodata Resource (GCBR) status by the Global Biodata Coalition (GBC; https://globalbiodata.org/) after a rigorous selection process. In this MOD review, we present PomBase's continuing growth and improvement over the last 2 years. We describe these improvements in the context of the qualitative GCBR indicators related to scientific quality, comprehensivity, accelerating science, user stories, and collaborations with other biodata resources. This review also showcases the depth of existing connections both within the biocuration ecosystem and between PomBase and its user community.
Affiliation(s)
- Kim M Rutherford
  - Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK
- Manuel Lera-Ramírez
  - Department of Genetics, Evolution and Environment, University College London, London WC1E 6BT, UK
- Valerie Wood
  - Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK
4
Zhang T, Huang W, Zhang L, Li DZ, Qi J, Ma H. Phylogenomic profiles of whole-genome duplications in Poaceae and landscape of differential duplicate retention and losses among major Poaceae lineages. Nat Commun 2024; 15:3305. [PMID: 38632270 PMCID: PMC11024178 DOI: 10.1038/s41467-024-47428-9] [Received: 07/13/2023] [Accepted: 04/02/2024] [Indexed: 04/19/2024]
Abstract
Poaceae members shared a whole-genome duplication called rho. However, little is known about the evolutionary pattern of the rho-derived duplicates among Poaceae lineages and implications in adaptive evolution. Here we present phylogenomic/phylotranscriptomic analyses of 363 grasses covering all 12 subfamilies and report nine previously unknown whole-genome duplications. Furthermore, duplications from a single whole-genome duplication were mapped to multiple nodes on the species phylogeny; a whole-genome duplication was likely shared by woody bamboos with possible gene flow from herbaceous bamboos; and recent paralogues of a tetraploid Oryza are implicated in tolerance of seawater submergence. Moreover, rho duplicates showing differential retention among subfamilies include those with functions in environmental adaptations or morphogenesis, including ACOT for aquatic environments (Oryzoideae), CK2β for cold responses (Pooideae), SPIRAL1 for rapid cell elongation (Bambusoideae), and PAI1 for drought/cold responses (Panicoideae). This study presents a Poaceae whole-genome duplication profile with evidence for multiple evolutionary mechanisms that contribute to gene retention and losses.
Affiliation(s)
- Taikui Zhang
  - Department of Biology, the Eberly College of Science, and the Huck Institutes of the Life Sciences, the Pennsylvania State University, University Park, State College, PA, 16802, USA
  - Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, School of Life Sciences, Fudan University, Shanghai, 200438, China
- Weichen Huang
  - Department of Biology, the Eberly College of Science, and the Huck Institutes of the Life Sciences, the Pennsylvania State University, University Park, State College, PA, 16802, USA
- Lin Zhang
  - Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, School of Life Sciences, Fudan University, Shanghai, 200438, China
  - Chongqing Key Laboratory of Plant Resource Conservation and Germplasm Innovation, School of Life Sciences, Southwest University, Chongqing, 400715, China
- De-Zhu Li
  - Germplasm Bank of Wild Species, Kunming Institute of Botany, Chinese Academy of Sciences, Kunming, Yunnan, 650201, China
- Ji Qi
  - Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, School of Life Sciences, Fudan University, Shanghai, 200438, China
- Hong Ma
  - Department of Biology, the Eberly College of Science, and the Huck Institutes of the Life Sciences, the Pennsylvania State University, University Park, State College, PA, 16802, USA
5
Kyu KL, Taylor CM, Douglas CA, Malik AI, Colmer TD, Siddique KHM, Erskine W. Genetic diversity and candidate genes for transient waterlogging tolerance in mungbean at the germination and seedling stages. Front Plant Sci 2024; 15:1297096. [PMID: 38584945 PMCID: PMC10996369 DOI: 10.3389/fpls.2024.1297096] [Received: 09/19/2023] [Accepted: 02/26/2024] [Indexed: 04/09/2024]
Abstract
Mungbean [Vigna radiata var. radiata (L.) Wilczek] production in Asia is detrimentally affected by transient soil waterlogging caused by unseasonal and increasingly frequent extreme precipitation events. While mungbean exhibits sensitivity to waterlogging, there has been insufficient exploration of germplasm for waterlogging tolerance, as well as limited investigation into the genetic basis for tolerance to identify valuable loci. This research investigated the diversity of transient waterlogging tolerance in a mini-core germplasm collection of mungbean and identified candidate genes for adaptive traits of interest using genome-wide association studies (GWAS) at two critical stages of growth: germination and seedling stage (i.e., once the first trifoliate leaf had fully-expanded). In a temperature-controlled glasshouse, 292 genotypes were screened for tolerance after (i) 4 days of waterlogging followed by 7 days of recovery at the germination stage and (ii) 8 days of waterlogging followed by 7 days of recovery at the seedling stage. Tolerance was measured against drained controls. GWAS was conducted using 3,522 high-quality DArTseq-derived SNPs, revealing five significant associations with five phenotypic traits indicating improved tolerance. Waterlogging tolerance was positively correlated with the formation of adventitious roots and higher dry masses. FGGY carbohydrate kinase domain-containing protein was identified as a candidate gene for adventitious rooting and mRNA-uncharacterized LOC111241851, Caffeoyl-CoA O-methyltransferase At4g26220 and MORC family CW-type zinc finger protein 3 and zinc finger protein 2B genes for shoot, root, and total dry matter production. Moderate to high broad-sense heritability was exhibited for all phenotypic traits, including seed emergence (81%), adventitious rooting (56%), shoot dry mass (81%), root dry mass (79%) and SPAD chlorophyll content (70%). 
The heritability estimates, marker-trait associations, and identification of sources of waterlogging tolerant germplasm from this study demonstrate high potential for marker-assisted selection of tolerance traits to accelerate breeding of climate-resilient mungbean varieties.
Affiliation(s)
- Khin Lay Kyu
  - Centre for Plant Genetics and Breeding (PGB), UWA School of Agriculture and Environment, The University of Western Australia, Perth, WA, Australia
  - The UWA Institute of Agriculture, The University of Western Australia, Crawley, WA, Australia
- Colin Andrew Douglas
  - Department of Agriculture and Fisheries, Gatton Research Facility, Gatton, QLD, Australia
- Al Imran Malik
  - Centre for Plant Genetics and Breeding (PGB), UWA School of Agriculture and Environment, The University of Western Australia, Perth, WA, Australia
  - International Center for Tropical Agriculture (CIAT-Asia), Lao PDR Office, Vientiane, Lao People’s Democratic Republic
- Timothy David Colmer
  - The UWA Institute of Agriculture, The University of Western Australia, Crawley, WA, Australia
- Kadambot H. M. Siddique
  - The UWA Institute of Agriculture, The University of Western Australia, Crawley, WA, Australia
- William Erskine
  - Centre for Plant Genetics and Breeding (PGB), UWA School of Agriculture and Environment, The University of Western Australia, Perth, WA, Australia
  - The UWA Institute of Agriculture, The University of Western Australia, Crawley, WA, Australia
6
McCartney N, Kondakath G, Tai A, Trimmer BA. Functional annotation of insecta transcriptomes: A cautionary tale from Lepidoptera. Insect Biochem Mol Biol 2024; 165:104038. [PMID: 37952902 DOI: 10.1016/j.ibmb.2023.104038] [Received: 05/26/2023] [Revised: 10/30/2023] [Accepted: 11/07/2023] [Indexed: 11/14/2023]
Abstract
Functional annotation is a critical step in the analysis of genomic data, as it provides insight into the function of individual genes and the pathways in which they participate. Currently, there is no consensus on the best computational approach for assigning functional annotation. This study compares three functional annotation methods (BLAST, eggNOG-Mapper, and InterProScan) in their ability to assign Gene Ontology terms in two species of Insecta with differing levels of annotation, Bombyx mori and Manduca sexta. The methods were compared for their annotation coverage, number of term assignments, term agreement, and non-overlapping terms. Here we show that there are large discrepancies in Gene Ontology term assignment among the three computational methods, which could lead to confounding interpretations of data and non-comparable results. This study provides insight into the strengths and weaknesses of each computational method and highlights the need for more standardized methods of functional annotation.
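The comparison criteria named in this abstract (annotation coverage, term agreement, non-overlapping terms) can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not the study's pipeline: each method's output is assumed to be already parsed into a gene-to-GO-term mapping, Jaccard similarity is an assumed choice of agreement measure, and all function names and identifiers are hypothetical.

```python
def coverage(annotations, all_genes):
    """Fraction of genes that received at least one GO term."""
    annotated = sum(1 for gene in all_genes if annotations.get(gene))
    return annotated / len(all_genes)


def compare_annotations(method_a, method_b):
    """Per-gene agreement between two gene -> set-of-GO-terms mappings."""
    report = {}
    for gene in sorted(set(method_a) | set(method_b)):
        terms_a = method_a.get(gene, set())
        terms_b = method_b.get(gene, set())
        union = terms_a | terms_b
        report[gene] = {
            "shared": terms_a & terms_b,
            "only_a": terms_a - terms_b,
            "only_b": terms_b - terms_a,
            # Jaccard index: shared terms over all terms assigned by either method.
            "jaccard": len(terms_a & terms_b) / len(union) if union else 0.0,
        }
    return report


# Toy outputs from two hypothetical annotation runs.
genes = ["g1", "g2", "g3", "g4"]
run_a = {"g1": {"GO:1", "GO:2"}, "g2": {"GO:3"}}
run_b = {"g1": {"GO:2", "GO:4"}, "g3": {"GO:5"}}

report = compare_annotations(run_a, run_b)
cov_a = coverage(run_a, genes)
```

In this toy case both runs annotate half of the genes, yet they agree on only one of the three terms assigned to `g1` and share no terms for any other gene, which is the kind of discrepancy the abstract warns about.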
Affiliation(s)
- Naya McCartney
  - Department of Biology, Tufts University, 200 Boston Ave, Medford, MA, 02155, USA
- Gayathri Kondakath
  - Department of Biology, Tufts University, 200 Boston Ave, Medford, MA, 02155, USA
- Albert Tai
  - School of Medicine, Tufts University, 136 Harrison Ave, Boston, MA, 02111, USA
- Barry A Trimmer
  - Department of Biology, Tufts University, 200 Boston Ave, Medford, MA, 02155, USA
7
Ibtehaz N, Kagaya Y, Kihara D. Domain-PFP allows protein function prediction using function-aware domain embedding representations. Commun Biol 2023; 6:1103. [PMID: 37907681 PMCID: PMC10618451 DOI: 10.1038/s42003-023-05476-9] [Received: 05/17/2023] [Accepted: 10/17/2023] [Indexed: 11/02/2023]
Abstract
Domains are functional and structural units of proteins that govern various biological functions performed by the proteins. Therefore, the characterization of domains in a protein can serve as a proper functional representation of proteins. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. The domain embeddings we constructed turned out to be effective in performing actual function prediction tasks. Extensive evaluations showed that protein representations using the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method built on the domain embeddings, named Domain-PFP, substantially outperformed the state-of-the-art function predictors. Additionally, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving overall the best performance among the top teams that participated in the assessment.
Affiliation(s)
- Nabil Ibtehaz
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
- Yuki Kagaya
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
- Daisuke Kihara
  - Department of Computer Science, Purdue University, West Lafayette, IN, USA
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
8
Tripathi S, Shirnekhi HK, Gorman SD, Chandra B, Baggett DW, Park CG, Somjee R, Lang B, Hosseini SMH, Pioso BJ, Li Y, Iacobucci I, Gao Q, Edmonson MN, Rice SV, Zhou X, Bollinger J, Mitrea DM, White MR, McGrail DJ, Jarosz DF, Yi SS, Babu MM, Mullighan CG, Zhang J, Sahni N, Kriwacki RW. Defining the condensate landscape of fusion oncoproteins. Nat Commun 2023; 14:6008. [PMID: 37770423 PMCID: PMC10539325 DOI: 10.1038/s41467-023-41655-2] [Received: 11/10/2022] [Accepted: 09/13/2023] [Indexed: 09/30/2023]
Abstract
Fusion oncoproteins (FOs) arise from chromosomal translocations in ~17% of cancers and are often oncogenic drivers. Although some FOs can promote oncogenesis by undergoing liquid-liquid phase separation (LLPS) to form aberrant biomolecular condensates, the generality of this phenomenon is unknown. We explored this question by testing 166 FOs in HeLa cells and found that 58% formed condensates. The condensate-forming FOs displayed physicochemical features distinct from those of condensate-negative FOs and segregated into distinct feature-based groups that aligned with their sub-cellular localization and biological function. Using machine learning, we developed a predictor of FO condensation behavior and discovered that 67% of ~3000 additional FOs likely form condensates, with 35% of those predicted to function by altering gene expression. In contrast, 47% of the predicted condensate-negative FOs were associated with cell signaling functions, suggesting a functional dichotomy between condensate-positive and -negative FOs. Our datasets and reagents are rich resources to interrogate FO condensation in the future.
Affiliation(s)
- Swarnendu Tripathi
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Hazheen K Shirnekhi
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Scott D Gorman
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
  - Arrakis Therapeutics, 830 Winter St, Waltham, MA, 02451, USA
- Bappaditya Chandra
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- David W Baggett
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Cheon-Gil Park
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Ramiz Somjee
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
  - Rhodes College, Memphis, TN, USA
  - Washington University School of Medicine, 660 South Euclid Avenue, St. Louis, MO, 63110, USA
- Benjamin Lang
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
  - Center of Excellence for Data-Driven Discovery, Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Seyed Mohammad Hadi Hosseini
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
  - Center of Excellence for Data-Driven Discovery, Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Brittany J Pioso
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Yongsheng Li
  - Livestrong Cancer Institutes, Department of Oncology, Dell Medical School, The University of Texas at Austin, Austin, TX, 78712, USA
- Ilaria Iacobucci
  - Department of Pathology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Qingsong Gao
  - Department of Pathology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Michael N Edmonson
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Stephen V Rice
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Xin Zhou
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- John Bollinger
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Diana M Mitrea
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
  - Dewpoint Therapeutics, 451 D Street, Suite 104, Boston, MA, 02210, USA
- Michael R White
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
  - IDEXX Laboratories, Inc., One IDEXX Drive, Westbrook, ME, 04092, USA
- Daniel J McGrail
  - Center for Immunotherapy and Precision Immuno-Oncology, Cleveland Clinic, Cleveland, OH, USA
  - Lerner Research Institute, Cleveland Clinic, Cleveland, OH, USA
- Daniel F Jarosz
  - Department of Chemical and Systems Biology, Stanford University School of Medicine, Stanford, CA, USA
  - Department of Developmental Biology, Stanford University School of Medicine, Stanford, CA, USA
- S Stephen Yi
  - Livestrong Cancer Institutes, Department of Oncology, Dell Medical School, The University of Texas at Austin, Austin, TX, 78712, USA
  - Department of Biomedical Engineering, and Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX, USA
- M Madan Babu
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
  - Center of Excellence for Data-Driven Discovery, Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Charles G Mullighan
  - Department of Pathology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Jinghui Zhang
  - Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
- Nidhi Sahni
  - Department of Epigenetics and Molecular Carcinogenesis, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
  - Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
  - Program in Quantitative and Computational Biosciences, Baylor College of Medicine, Houston, TX, USA
- Richard W Kriwacki
  - Department of Structural Biology, St. Jude Children's Research Hospital, Memphis, TN, USA
  - Department of Microbiology, Immunology and Biochemistry, University of Tennessee Health Sciences Center, Memphis, TN, USA
9
Ibtehaz N, Kagaya Y, Kihara D. Domain-PFP: Protein Function Prediction Using Function-Aware Domain Embedding Representations. bioRxiv 2023:2023.08.23.554486. [PMID: 37662252 PMCID: PMC10473699 DOI: 10.1101/2023.08.23.554486] [Indexed: 09/05/2023]
Abstract
Domains are functional and structural units of proteins that govern various biological functions performed by the proteins. Therefore, the characterization of domains in a protein can serve as a proper functional representation of proteins. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. The domain embeddings we constructed turned out to be effective in performing actual function prediction tasks. Extensive evaluations showed that protein representations using the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method built on the domain embeddings, named Domain-PFP, significantly outperformed the state-of-the-art function predictors. Additionally, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving overall the best performance among the top teams that participated in the assessment.
Affiliation(s)
- Nabil Ibtehaz
  - Department of Computer Science, Purdue University, West Lafayette, IN, United States
- Yuki Kagaya
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
- Daisuke Kihara
  - Department of Computer Science, Purdue University, West Lafayette, IN, United States
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, United States
10
Liu H, Zhang Y, Chen J. Whole-genome sequencing and functional annotation of pathogenic Paraconiothyrium brasiliense causing human cellulitis. Hum Genomics 2023; 17:65. [PMID: 37461066 DOI: 10.1186/s40246-023-00512-5] [Received: 11/07/2022] [Accepted: 07/11/2023] [Indexed: 07/20/2023]
Abstract
BACKGROUND: A pathogenic filamentous fungus causing eyelid cellulitis was isolated from the secretion from a patient's left eyelid, and a phylogenetic analysis based on the rDNA internal transcribed spacer region (ITS) and single-copy gene families identified the isolated strain as Paraconiothyrium brasiliense. The genus Paraconiothyrium contains major plant pathogenic fungi, and in our study, P. brasiliense was identified for the first time as causing human infection. To comprehensively analyze the pathogenicity and proteomics of the isolated strain from a genetic perspective, whole-genome sequencing was performed with the Illumina NovaSeq and Oxford Nanopore Technologies platforms, and a bioinformatics analysis was performed with BLAST against genome sequences in various publicly available databases. RESULTS: The genome of P. brasiliense GGX 413 is 39.49 Mb in length, with a 51.2% GC content, and encodes 13,057 protein-coding genes and 181 noncoding RNAs. Functional annotation showed that 592 genes encode virulence factors that are involved in human disease, including 61 lethal virulence factors and 30 hypervirulence factors. Of these 592 virulence genes, 54 are related to carbohydrate-active enzymes, including 46 genes encoding secretory CAZymes; 119 are associated with peptidases, including 70 genes encoding secretory peptidases; and 27 are involved in secondary metabolite synthesis, including four that are associated with terpenoid metabolism. CONCLUSIONS: This study establishes the genomic resources of P. brasiliense and provides a theoretical basis for future studies of the pathogenic mechanism of its infection of humans, the treatment of the diseases caused, and related research.
Affiliation(s)
- Haibing Liu
- Department of Clinical Laboratory, The Affiliated People's Hospital of Jiangsu University, Zhenjiang, Jiangsu, China
| | - Yue Zhang
- Department of Clinical Laboratory, The Affiliated People's Hospital of Jiangsu University, Zhenjiang, Jiangsu, China
| | - Jianguo Chen
- Department of Clinical Laboratory, The Affiliated People's Hospital of Jiangsu University, Zhenjiang, Jiangsu, China.
| |
|
11
|
Dosch J, Bergmann H, Tran V, Ebersberger I. FAS: assessing the similarity between proteins using multi-layered feature architectures. Bioinformatics 2023; 39:btad226. [PMID: 37084276 PMCID: PMC10185405 DOI: 10.1093/bioinformatics/btad226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2022] [Revised: 02/23/2023] [Accepted: 04/13/2023] [Indexed: 04/23/2023] Open
Abstract
MOTIVATION: Protein sequence comparison is a fundamental element in the bioinformatics toolkit. When sequences are annotated with features such as functional domains, transmembrane domains, low complexity regions or secondary structure elements, the resulting feature architectures allow better informed comparisons. However, many existing schemes for scoring architecture similarities cannot cope with features arising from multiple annotation sources. Those that do fall short in the resolution of overlapping and redundant feature annotations. RESULTS: Here, we introduce FAS, a scoring method that integrates features from multiple annotation sources in a directed acyclic architecture graph. Redundancies are resolved as part of the architecture comparison by finding the paths through the graphs that maximize the pair-wise architecture similarity. In a large-scale evaluation on more than 10 000 human-yeast ortholog pairs, architecture similarities assessed with FAS are consistently more plausible than those obtained using e-values to resolve overlaps or leaving overlaps unresolved. Three case studies demonstrate the utility of FAS on architecture comparison tasks: benchmarking of orthology assignment software, identification of functionally diverged orthologs, and diagnosing protein architecture changes stemming from faulty gene predictions. With the help of FAS, feature architecture comparisons can now be routinely integrated into these and many other applications. AVAILABILITY AND IMPLEMENTATION: FAS is available as a Python package: https://pypi.org/project/greedyFAS/.
Affiliation(s)
- Julian Dosch
- Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
| | - Holger Bergmann
- Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
| | - Vinh Tran
- Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
| | - Ingo Ebersberger
- Applied Bioinformatics Group, Goethe University Frankfurt, Faculty of Biosciences, Institute of Cell Biology and Neuroscience, Frankfurt, 60438, Germany
- Senckenberg Biodiversity and Climate Research Centre (S-BIKF), Frankfurt, 60325, Germany
- LOEWE Centre for Translational Biodiversity Genomics (TBG), Frankfurt, 60325, Germany
| |
|
12
|
Reijnders MJMF, Waterhouse RM. CrowdGO: Machine learning and semantic similarity guided consensus Gene Ontology annotation. PLoS Comput Biol 2022; 18:e1010075. [PMID: 35560159 PMCID: PMC9132264 DOI: 10.1371/journal.pcbi.1010075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2021] [Revised: 05/25/2022] [Accepted: 04/04/2022] [Indexed: 11/29/2022] Open
Abstract
Characterising gene function for the ever-increasing number and diversity of species with annotated genomes relies almost entirely on computational prediction methods. These software tools are also numerous and diverse, each with different strengths and weaknesses as revealed through community benchmarking efforts. Meta-predictors that assess consensus and conflict from individual algorithms should deliver enhanced functional annotations. To exploit the benefits of meta-approaches, we developed CrowdGO, an open-source consensus-based Gene Ontology (GO) term meta-predictor that employs machine learning models with GO term semantic similarities and information contents. By re-evaluating each gene-term annotation, a consensus dataset is produced with high-scoring confident annotations and low-scoring rejected annotations. Applying CrowdGO to results from a deep learning-based method, a sequence similarity-based method, and two protein domain-based methods delivers consensus annotations with improved precision and recall. Furthermore, using standard evaluation measures, CrowdGO performance matches that of the community’s best performing individual methods. CrowdGO therefore offers a model-informed approach to leverage strengths of individual predictors and produce comprehensive and accurate gene functional annotations.
New technologies mean that we are able to read the genetic blueprints in the form of complete genome sequences from many different species. We are also able to use computational methods combined with evidence from experiments to map out the locations in the genomes of many thousands of genes and other important regions. However, discovering and characterising the biological functions of all these genes and their protein products requires considerably more experimental work. In order to gain insights into the possible functions of the many genes currently lacking functional information from experiments, we must therefore rely on methods that computationally predict protein functions. Many different software tools have been developed to tackle this challenge, each with its own strengths and weaknesses, as shown by several community-based competitions that assess the performance of the predictors. Taking advantage of powerful modern machine learning techniques, we developed CrowdGO, a new software tool that aims to combine predictions from several tools and produce comprehensive and accurate gene functional annotations. CrowdGO is able to computationally assess agreements and conflicts amongst annotations from different predictors to then re-evaluate the results and deliver enhanced predictions of protein functions.
Affiliation(s)
- Maarten J. M. F. Reijnders
- Department of Ecology and Evolution, University of Lausanne, and Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Robert M. Waterhouse
- Department of Ecology and Evolution, University of Lausanne, and Swiss Institute of Bioinformatics, Lausanne, Switzerland
| |
|
13
|
Thuy-Boun PS, Wang AY, Crissien-Martinez A, Xu JH, Chatterjee S, Stupp GS, Su AI, Coyle WJ, Wolan DW. Quantitative metaproteomics and activity-based protein profiling of patient fecal microbiome identifies host and microbial serine-type endopeptidase activity associated with ulcerative colitis. Mol Cell Proteomics 2022; 21:100197. [PMID: 35033677 PMCID: PMC8941213 DOI: 10.1016/j.mcpro.2022.100197] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/04/2021] [Revised: 01/10/2022] [Accepted: 01/11/2022] [Indexed: 12/12/2022] Open
Abstract
The gut microbiota plays an important yet incompletely understood role in the induction and propagation of ulcerative colitis (UC). Organism-level efforts to identify UC-associated microbes have revealed the importance of community structure, but less is known about the molecular effectors of disease. We performed 16S rRNA gene sequencing in parallel with label-free data-dependent LC-MS/MS proteomics to characterize the stool microbiomes of healthy (n = 8) and UC (n = 10) patients. Comparisons of taxonomic composition between techniques revealed major differences in community structure partially attributable to the additional detection of host, fungal, viral, and food peptides by metaproteomics. Differential expression analysis of metaproteomic data identified 176 significantly enriched protein groups between healthy and UC patients. Gene ontology analysis revealed several enriched functions with serine-type endopeptidase activity overrepresented in UC patients. Using a biotinylated fluorophosphonate probe and streptavidin-based enrichment, we show that serine endopeptidases are active in patient fecal samples and that additional putative serine hydrolases are detectable by this approach compared with unenriched profiling. Finally, as metaproteomic databases expand, they are expected to asymptotically approach completeness. Using ComPIL and de novo peptide sequencing, we estimate the size of the probable peptide space unidentified (“dark peptidome”) by our large database approach to establish a rough benchmark for database sufficiency. Despite high variability inherent in patient samples, our analysis yielded a catalog of differentially enriched proteins between healthy and UC fecal proteomes. This catalog provides a clinically relevant jumping-off point for further molecular-level studies aimed at identifying the microbial underpinnings of UC.
Identified 176 significantly altered protein groups between healthy and UC patients. Serine-type endopeptidase activity is overrepresented in UC patients. Fluorophosphonate ABPP shows that endopeptidases are active in fecal samples. ABPP enrichment helps identify additional putative serine hydrolases in samples. De novo sequencing used to estimate number of MS2 spectra unidentified by ComPIL.
Affiliation(s)
- Peter S Thuy-Boun
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA 92037
| | - Ana Y Wang
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA 92037
| | | | - Janice H Xu
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA 92037
| | - Sandip Chatterjee
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA 92037
| | - Gregory S Stupp
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037
| | - Andrew I Su
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037
| | - Walter J Coyle
- Scripps Clinic Gastroenterology Division, La Jolla, CA 92037
| | - Dennis W Wolan
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, CA 92037; Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037.
| |
|
14
|
Xu P, Zhao C, You X, Yang F, Chen J, Ruan Z, Gu R, Xu J, Bian C, Shi Q. Draft Genome of the Mirrorwing Flyingfish (Hirundichthys speculiger). Front Genet 2021; 12:695700. [PMID: 34306036 PMCID: PMC8294118 DOI: 10.3389/fgene.2021.695700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2021] [Accepted: 06/03/2021] [Indexed: 12/04/2022] Open
Affiliation(s)
- Pengwei Xu
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Chenxi Zhao
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
| | - Xinxin You
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China.,Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, BGI Academy of Marine Sciences, BGI Marine, BGI, Shenzhen, China
| | - Fan Yang
- Marine Geological Department, Marine Geological Survey Institute of Hainan Province, Haikou, China
| | - Jieming Chen
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China.,Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, BGI Academy of Marine Sciences, BGI Marine, BGI, Shenzhen, China
| | - Zhiqiang Ruan
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China.,Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, BGI Academy of Marine Sciences, BGI Marine, BGI, Shenzhen, China
| | - Ruobo Gu
- Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, BGI Academy of Marine Sciences, BGI Marine, BGI, Shenzhen, China
| | - Junmin Xu
- Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, BGI Academy of Marine Sciences, BGI Marine, BGI, Shenzhen, China
| | - Chao Bian
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China.,Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, BGI Academy of Marine Sciences, BGI Marine, BGI, Shenzhen, China
| | - Qiong Shi
- College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China.,Shenzhen Key Lab of Marine Genomics, Guangdong Provincial Key Lab of Molecular Breeding in Marine Economic Animals, BGI Academy of Marine Sciences, BGI Marine, BGI, Shenzhen, China
| |
|
15
|
Yang Y, Huang L, Xu C, Qi L, Wu Z, Li J, Chen H, Wu Y, Fu T, Zhu H, Saand MA, Li J, Liu L, Fan H, Zhou H, Qin W. Chromosome-scale genome assembly of areca palm (Areca catechu). Mol Ecol Resour 2021; 21:2504-2519. [PMID: 34133844 DOI: 10.1111/1755-0998.13446] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 09/16/2020] [Revised: 06/08/2021] [Accepted: 06/11/2021] [Indexed: 11/28/2022]
Abstract
Areca palm (Areca catechu L.; family Arecaceae) is an important tropical medicinal crop and is also used for masticatory and religious purposes in Asia. Improvements to areca properties made by traditional breeding tools have been very slow, and further advances in its cultivation and practical use require genomic information, which is still unavailable. Here, we present a chromosome-scale reference genome assembly for areca by combining Illumina and PacBio data with Hi-C mapping technologies, covering the predicted A. catechu genome length (2.59 Gb, variety "Reyan#1") to an estimated 240× read depth. The assembly was 2.51 Gb in length with a scaffold N50 of 1.7 Mb. The scaffolds were then further assembled into 16 pseudochromosomes, with an N50 of 172 Mb. Transposable elements comprised 80.37% of the areca genome, and 68.68% of them were long-terminal repeat retrotransposon elements. The areca palm genome was predicted to harbour 31,571 protein-coding genes, and overall 92.92% of genes were functionally annotated, including enriched and expanded families of genes responsible for biosynthesis of flavonoids, anthocyanins, monoterpenoids and their derivatives. Comparative analyses indicated that A. catechu probably diverged from its close relatives Elaeis guineensis and Cocos nucifera approximately 50.3 million years ago (Ma). Two whole genome duplication events in areca palm were found to be shared by palms and monocots, respectively. This genome assembly and associated resources represent an important addition to the palm genomics community and will be a valuable resource that will facilitate areca palm breeding and improve our understanding of areca palm biology and evolution.
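The N50 values quoted in this abstract follow the usual assembly statistic: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly. A minimal Python sketch of that calculation is given below. The function name `n50` and the toy scaffold lengths are illustrative assumptions, not data or code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sort lengths from longest to shortest and return the length at which
    the running total first reaches half of the summed length.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical scaffold lengths (arbitrary units), not from the areca assembly.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

Because joining scaffolds into pseudochromosomes replaces many short sequences with a few long ones, the same statistic rises sharply after Hi-C scaffolding, which is the pattern the abstract reports.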
Affiliation(s)
- Yaodong Yang
- Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Wenchang, China
| | - Liyun Huang
- Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Wenchang, China
| | - Chunyan Xu
- BGI Genomics, BGI-Shenzhen, Shenzhen, China
| | - Lan Qi
- Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Wenchang, China
| | | | - Jia Li
- Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Wenchang, China
| | | | - Yi Wu
- Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Wenchang, China
| | - Tao Fu
- BGI Genomics, BGI-Shenzhen, Shenzhen, China
| | - Hui Zhu
- Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Wenchang, China
| | - Mumtaz Ali Saand
- Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Wenchang, China
| | - Jing Li
- Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Wenchang, China
| | - Liyun Liu
- Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Wenchang, China
| | - Haikou Fan
- Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Wenchang, China
| | - Huanqi Zhou
- Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Wenchang, China
| | - Weiquan Qin
- Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Wenchang, China
| |
|
16
|
Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, Nuka G, Paysan-Lafosse T, Qureshi M, Raj S, Richardson L, Salazar GA, Williams L, Bork P, Bridge A, Gough J, Haft DH, Letunic I, Marchler-Bauer A, Mi H, Natale DA, Necci M, Orengo CA, Pandurangan AP, Rivoire C, Sigrist CJA, Sillitoe I, Thanki N, Thomas PD, Tosatto SCE, Wu CH, Bateman A, Finn RD. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res 2021; 49:D344-D354. [PMID: 33156333 PMCID: PMC7778928 DOI: 10.1093/nar/gkaa977] [Citation(s) in RCA: 1168] [Impact Index Per Article: 389.3] [Received: 09/14/2020] [Revised: 10/08/2020] [Accepted: 10/23/2020] [Indexed: 01/22/2023] Open
Abstract
The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. InterProScan is the underlying software that allows protein and nucleic acid sequences to be searched against InterPro's signatures. Signatures are predictive models which describe protein families, domains or sites, and are provided by multiple databases. InterPro combines signatures representing equivalent families, domains or sites, and provides additional information such as descriptions, literature references and Gene Ontology (GO) terms, to produce a comprehensive resource for protein classification. Founded in 1999, InterPro has become one of the most widely used resources for protein family annotation. Here, we report the status of InterPro (version 81.0) in its 20th year of operation, and its associated software, including updates to database content, the release of a new website and REST API, and performance improvements in InterProScan.
Affiliation(s)
- Matthias Blum
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Hsin-Yu Chang
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Sara Chuguransky
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Tiago Grego
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Swaathi Kandasaamy
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Alex Mitchell
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Gift Nuka
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Typhaine Paysan-Lafosse
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Matloob Qureshi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Shriya Raj
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Lorna Richardson
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Gustavo A Salazar
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Lowri Williams
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Peer Bork
- European Molecular Biology Laboratory, Structural and Computational Biology Unit, Meyerhofstraße 1, 69117 Heidelberg, Germany
| | - Alan Bridge
- Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
| | - Julian Gough
- Medical Research Council Laboratory of Molecular Biology, Cambridge Biomedical Campus, Francis Crick Ave, Trumpington, Cambridge CB2 0QH, UK
| | - Daniel H Haft
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda MD 20894 USA
| | - Ivica Letunic
- Biobyte Solutions GmbH, Bothestr 142, 69126 Heidelberg, Germany
| | - Aron Marchler-Bauer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda MD 20894 USA
| | - Huaiyu Mi
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Darren A Natale
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
| | - Marco Necci
- Department of Biomedical Sciences, University of Padua, via U. Bassi 58/b, 35131 Padua, Italy
| | - Christine A Orengo
- Department of Structural and Molecular Biology, University College London, Gower St, Bloomsbury, London WC1E 6BT, UK
| | - Arun P Pandurangan
- Medical Research Council Laboratory of Molecular Biology, Cambridge Biomedical Campus, Francis Crick Ave, Trumpington, Cambridge CB2 0QH, UK
| | - Catherine Rivoire
- Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
| | - Christian J A Sigrist
- Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva 4, Switzerland
| | - Ian Sillitoe
- Department of Structural and Molecular Biology, University College London, Gower St, Bloomsbury, London WC1E 6BT, UK
| | - Narmada Thanki
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda MD 20894 USA
| | - Paul D Thomas
- Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padua, via U. Bassi 58/b, 35131 Padua, Italy
| | - Cathy H Wu
- Protein Information Resource, Georgetown University Medical Center, Washington, DC 20007, USA
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
| |
|
17
|
Mi H, Ebert D, Muruganujan A, Mills C, Albou LP, Mushayamaha T, Thomas PD. PANTHER version 16: a revised family classification, tree-based classification tool, enhancer regions and extensive API. Nucleic Acids Res 2020; 49:D394-D403. [PMID: 33290554 PMCID: PMC7778891 DOI: 10.1093/nar/gkaa1106] [Citation(s) in RCA: 787] [Impact Index Per Article: 196.8] [Received: 09/15/2020] [Revised: 10/19/2020] [Accepted: 10/28/2020] [Indexed: 01/29/2023] Open
Abstract
PANTHER (Protein Analysis Through Evolutionary Relationships, http://www.pantherdb.org) is a resource for the evolutionary and functional classification of protein-coding genes from all domains of life. The evolutionary classification is based on a library of over 15,000 phylogenetic trees, and the functional classifications include Gene Ontology terms and pathways. Here, we analyze the current coverage of genes from genomes in different taxonomic groups, so that users can better understand what to expect when analyzing a gene list using PANTHER tools. We also describe extensive improvements to PANTHER made in the past two years. The PANTHER Protein Class ontology has been completely refactored, and 6101 PANTHER families have been manually assigned to a Protein Class, providing a high-level classification of protein families and their genes. Users can access the TreeGrafter tool to add their own protein sequences to the reference phylogenetic trees in PANTHER, to infer evolutionary context as well as fine-grained annotations. We have added human enhancer-gene links that associate non-coding regions with the annotated human genes in PANTHER. We have also expanded the available services for programmatic access to PANTHER tools and data via application programming interfaces (APIs). Other improvements include additional plant genomes and an updated PANTHER GO-slim.
Affiliation(s)
- Huaiyu Mi
- Correspondence may also be addressed to Huaiyu Mi.
| | - Dustin Ebert
- Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Anushya Muruganujan
- Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Caitlin Mills
- Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Laurent-Philippe Albou
- Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Tremayne Mushayamaha
- Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, CA 90033, USA
| | - Paul D Thomas
- To whom correspondence should be addressed. Tel: +1 323 442 7975;
| |
|
18
|
Rath PP, Gourinath S. The actin cytoskeleton orchestra in Entamoeba histolytica. Proteins 2020; 88:1361-1375. [PMID: 32506560 DOI: 10.1002/prot.25955] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 02/19/2020] [Revised: 04/17/2020] [Accepted: 05/27/2020] [Indexed: 12/14/2022]
Abstract
Years of evolution have kept actin conserved throughout various clades of life. It is an essential protein starring in many cellular processes. In a primitive eukaryote named Entamoeba histolytica, actin directs the process of phagocytosis. A finely tuned coordination between various actin-binding proteins (ABPs) choreographs this process and forms one of the virulence factors for this protist pathogen. The ever-expanding world of ABPs always has space to accommodate new and varied types of proteins in the existing repertoire. In this article, we report the identification of 390 ABPs from Entamoeba histolytica. These proteins are part of diverse families that have been known to regulate actin dynamics. Most of the proteins are largely uncharacterized in this organism; however, this study aims to annotate the ABPs based on their domain arrangements. A unique characteristic of some of the ABPs found is that they combine domains in arrangements unlike any reported to date. Calponin domain-containing proteins formed the largest group among all types with 38 proteins, followed by 29 proteins with the infamous BAR domain in them, and 23 proteins belonging to actin-related proteins. The other protein families had fewer members. The presence of exclusive domain arrangements in these proteins could guide us to yet unknown actin regulatory mechanisms prevalent in nature. This article is the first step to unraveling them.
|
19
|
Wood V, Carbon S, Harris MA, Lock A, Engel SR, Hill DP, Van Auken K, Attrill H, Feuermann M, Gaudet P, Lovering RC, Poux S, Rutherford KM, Mungall CJ. Term Matrix: a novel Gene Ontology annotation quality control system based on ontology term co-annotation patterns. Open Biol 2020; 10:200149. [PMID: 32875947 PMCID: PMC7536087 DOI: 10.1098/rsob.200149] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/22/2020] [Accepted: 08/06/2020] [Indexed: 12/11/2022] Open
Abstract
Biological processes are accomplished by the coordinated action of gene products. Gene products often participate in multiple processes, and can therefore be annotated to multiple Gene Ontology (GO) terms. Nevertheless, processes that are functionally, temporally and/or spatially distant may have few gene products in common, and co-annotation to unrelated processes probably reflects errors in literature curation, ontology structure or automated annotation pipelines. We have developed an annotation quality control workflow that uses rules based on mutually exclusive processes to detect annotation errors, based on and validated by case studies including the three we present here: fission yeast protein-coding gene annotations over time; annotations for cohesin complex subunits in human and model species; and annotations using a selected set of GO biological process terms in human and five model species. For each case study, we reviewed available GO annotations, identified pairs of biological processes which are unlikely to be correctly co-annotated to the same gene products (e.g. amino acid metabolism and cytokinesis), and traced erroneous annotations to their sources. To date we have generated 107 quality control rules, and corrected 289 manual annotations in eukaryotes and over 52 700 automatically propagated annotations across all taxa.
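The rule-based check described in this abstract can be illustrated with a small Python sketch: given a set of process pairs that should not be co-annotated, flag every gene product annotated to both members of a pair. This is only an illustration under simplifying assumptions and not the authors' implementation. The function name `find_violations`, the gene identifiers and the flat term labels are hypothetical, and the sketch ignores the ontology hierarchy (annotations to descendant terms), which a real quality control system would need to consider.

```python
from itertools import combinations


def find_violations(annotations, exclusive_pairs):
    """Flag genes annotated to two terms that a rule marks as mutually exclusive.

    annotations: dict mapping a gene identifier to a set of term labels.
    exclusive_pairs: iterable of (term_a, term_b) rules.
    Returns a list of (gene, term_a, term_b) tuples, sorted by gene.
    """
    rules = {frozenset(pair) for pair in exclusive_pairs}
    violations = []
    for gene, terms in sorted(annotations.items()):
        # Check every unordered pair of terms attached to this gene.
        for term_a, term_b in combinations(sorted(terms), 2):
            if frozenset((term_a, term_b)) in rules:
                violations.append((gene, term_a, term_b))
    return violations


# Hypothetical annotations; the rule pair is the example named in the abstract.
toy_annotations = {
    "geneA": {"amino acid metabolism", "cytokinesis"},
    "geneB": {"cytokinesis"},
}
toy_rules = [("amino acid metabolism", "cytokinesis")]
print(find_violations(toy_annotations, toy_rules))
```

Flagged pairs are candidates for review rather than confirmed errors; in the workflow described above, each one is traced back to its source before an annotation is corrected.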
Affiliation(s)
- Valerie Wood
  - Cambridge Systems Biology Centre, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
  - Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
- Seth Carbon
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Midori A. Harris
  - Cambridge Systems Biology Centre, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
  - Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
- Antonia Lock
  - Department of Genetics, Evolution and Environment, University College London, London WC1E 6B, UK
- Stacia R. Engel
  - Department of Genetics, Stanford University, Palo Alto, CA 94304-5477, USA
- David P. Hill
  - Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, ME 04609, USA
- Kimberly Van Auken
  - Division of Biology and Biological Engineering, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA
- Helen Attrill
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge CB2 3DY, UK
- Marc Feuermann
  - Swiss Institute of Bioinformatics, 1 Michel-Servet, 1204 Geneva, Switzerland
- Pascale Gaudet
  - Swiss Institute of Bioinformatics, 1 Michel-Servet, 1204 Geneva, Switzerland
- Ruth C. Lovering
  - Functional Gene Annotation, Preclinical and Fundamental Science, Institute of Cardiovascular Science, University College London, London WC1E 6JF, UK
- Sylvain Poux
  - Swiss Institute of Bioinformatics, 1 Michel-Servet, 1204 Geneva, Switzerland
- Kim M. Rutherford
  - Cambridge Systems Biology Centre, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
  - Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
- Christopher J. Mungall
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
20
Koo DCE, Bonneau R. Towards region-specific propagation of protein functions. Bioinformatics 2020; 35:1737-1744. [PMID: 30304483 PMCID: PMC6513163 DOI: 10.1093/bioinformatics/bty834] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/23/2018] [Revised: 08/23/2018] [Accepted: 10/08/2018] [Indexed: 01/06/2023] Open
Abstract
MOTIVATION: Due to the nature of experimental annotation, most protein function prediction methods operate at the protein level, where functions are assigned to full-length proteins based on overall similarities. However, most proteins function by interacting with other proteins or molecules, and many functional associations should be limited to specific regions rather than the entire protein length. Most domain-centric function prediction methods depend on accurate domain family assignments to infer relationships between domains and functions, with regions that are unassigned to a known domain family left out of functional evaluation. Given the abundance of residue-level annotations currently available, we present a function prediction methodology that automatically infers function labels of specific protein regions using protein-level annotations and multiple types of region-specific features. RESULTS: We apply this method to local features obtained from InterPro, UniProtKB and amino acid sequences and show that this method improves both the accuracy and region-specificity of protein function transfer and prediction. We compare region-level predictive performance of our method against that of a whole-protein baseline method using proteins with structurally verified binding sites and also compare protein-level temporal holdout predictive performances to expand the variety and specificity of GO terms we could evaluate. Our results can also serve as a starting point to categorize GO terms into region-specific and whole-protein terms and select prediction methods for different classes of GO terms. AVAILABILITY AND IMPLEMENTATION: The code and features are freely available at https://github.com/ek1203/rsfp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Da Chen Emily Koo
  - Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA
- Richard Bonneau
  - Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
  - Center for Data Science, New York University, New York, NY, USA
21
Kishore R, Arnaboldi V, Van Slyke CE, Chan J, Nash RS, Urbano JM, Dolan ME, Engel SR, Shimoyama M, Sternberg PW, The Alliance of Genome Resources. Automated generation of gene summaries at the Alliance of Genome Resources. Database (Oxford) 2020; 2020:baaa037. [PMID: 32559296 PMCID: PMC7304461 DOI: 10.1093/database/baaa037] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Received: 02/14/2020] [Revised: 04/06/2020] [Accepted: 04/29/2020] [Indexed: 12/28/2022]
Abstract
Short paragraphs that describe gene function, referred to as gene summaries, are valued by users of biological knowledgebases for the ease with which they convey key aspects of gene function. Manual curation of gene summaries, while desirable, is difficult for knowledgebases to sustain. We developed an algorithm that uses curated, structured gene data at the Alliance of Genome Resources (Alliance; www.alliancegenome.org) to automatically generate gene summaries that simulate natural language. The gene data used for this purpose include curated associations (annotations) to ontology terms from the Gene Ontology, Disease Ontology, model organism knowledgebase (MOK)-specific anatomy ontologies and Alliance orthology data. The method uses sentence templates for each data category included in the gene summary in order to build a natural language sentence from the list of terms associated with each gene. To improve readability of the summaries when numerous gene annotations are present, we developed a new algorithm that traverses ontology graphs in order to group terms by their common ancestors. The algorithm optimizes the coverage of the initial set of terms and limits the length of the final summary, using measures of information content of each ontology term as a criterion for inclusion in the summary. The automated gene summaries are generated with each Alliance release, ensuring that they reflect current data at the Alliance. Our method effectively leverages category-specific curation efforts of the Alliance member databases to create modular, structured and standardized gene summaries for seven member species of the Alliance. These automatically generated gene summaries make cross-species gene function comparisons tenable and increase discoverability of potential models of human disease. In addition to being displayed on Alliance gene pages, these summaries are also included on several MOK gene pages.
Affiliation(s)
- Ranjana Kishore
  - WormBase, Division of Biology and Biological Engineering, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA
- Valerio Arnaboldi
  - WormBase, Division of Biology and Biological Engineering, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA
- Ceri E Van Slyke
  - ZFIN, The Institute of Neuroscience, 222 Huestis Hall, University of Oregon, Eugene, OR 97403-1254, USA
- Juancarlos Chan
  - WormBase, Division of Biology and Biological Engineering, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA
- Robert S Nash
  - Saccharomyces Genome Database, Department of Genetics, Stanford University, 3165 Porter Drive, Palo Alto, CA 94304, USA
- Jose M Urbano
  - FlyBase, Department of Physiology, Development and Neuroscience, 7 Downing Pl, University of Cambridge, Cambridge CB2 3DY, UK
- Mary E Dolan
  - MGI, The Jackson Laboratory, Bar Harbor, ME 04609, USA
- Stacia R Engel
  - Saccharomyces Genome Database, Department of Genetics, Stanford University, 3165 Porter Drive, Palo Alto, CA 94304, USA
- Mary Shimoyama
  - Rat Genome Database, Department of Biomedical Engineering, Medical College of Wisconsin and Marquette University, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA
- Paul W Sternberg
  - WormBase, Division of Biology and Biological Engineering, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA
22
Tang H, Finn RD, Thomas PD. TreeGrafter: phylogenetic tree-based annotation of proteins with Gene Ontology terms and other annotations. Bioinformatics 2019; 35:518-520. [PMID: 30032202 PMCID: PMC6361231 DOI: 10.1093/bioinformatics/bty625] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 03/16/2018] [Accepted: 07/18/2018] [Indexed: 11/13/2022] Open
Abstract
SUMMARY: TreeGrafter is a new software tool for annotating protein sequences using pre-annotated phylogenetic trees. Currently, the tool provides annotations to Gene Ontology (GO) terms, and PANTHER family and subfamily. The approach is generalizable to any annotations that have been made to internal nodes of a reference phylogenetic tree. TreeGrafter takes each input query protein sequence, finds the best matching homologous family in a library of pre-calculated, pre-annotated gene trees, and then grafts it to the best location in the tree. It then annotates the sequence by propagating annotations from ancestral nodes in the reference tree. We show that TreeGrafter outperforms subfamily HMM scoring for correctly assigning subfamily membership, and that it produces highly specific annotations of GO terms based on annotated reference phylogenetic trees. This method will be further integrated into InterProScan, enabling an even broader user community. AVAILABILITY AND IMPLEMENTATION: TreeGrafter is freely available on the web at https://github.com/pantherdb/TreeGrafter, including as a Docker image. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Haiming Tang
  - Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
- Robert D Finn
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
- Paul D Thomas
  - Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
23
Shim JE, Kim JH, Shin J, Lee JE, Lee I. Pathway-specific protein domains are predictive for human diseases. PLoS Comput Biol 2019; 15:e1007052. [PMID: 31075101 PMCID: PMC6530867 DOI: 10.1371/journal.pcbi.1007052] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 09/14/2018] [Revised: 05/22/2019] [Accepted: 04/19/2019] [Indexed: 01/04/2023] Open
Abstract
Protein domains are basic functional units of proteins. Many protein domains are pervasive among diverse biological processes, yet some are associated with specific pathways. Human complex diseases are generally viewed as pathway-level disorders. Therefore, we hypothesized that pathway-specific domains could be highly informative for human diseases. To test the hypothesis, we developed a network-based scoring scheme to quantify specificity of domain-pathway associations. We first generated domain profiles for human proteins, then constructed a co-pathway protein network based on the associations between domain profiles. Based on the score, we classified human protein domains into pathway-specific domains (PSDs) and non-specific domains (NSDs). We found that PSDs contained more pathogenic variants than NSDs. PSDs were also enriched for disease-associated mutations that disrupt protein-protein interactions (PPIs) and tend to have a moderate number of domain interactions. These results suggest that mutations in PSDs are likely to disrupt within-pathway PPIs, resulting in functional failure of pathways. Finally, we demonstrated the prediction capacity of PSDs for disease-associated genes with experimental validations in zebrafish. Taken together, the network-based quantitative method of modeling domain-pathway associations presented herein suggested underlying mechanisms of how protein domains associated with specific pathways influence mutational impacts on diseases via perturbations in within-pathway PPIs, and provided a novel genomic feature for interpreting genetic variants to facilitate the discovery of human disease genes.
Author summary: Protein domains are basic functional units of proteins, yet domain-based pathway annotation of proteins is a challenging task because many domains are pervasive among diverse pathways. Therefore, we developed a network-based scoring scheme to measure pathway specificity of domains, and then used it to identify pathway-specific domains. Surprisingly, we observed substantially more disease mutations in pathway-specific domains than in non-specific domains. We found evidence that mutations of pathway-specific domains tend to perturb pathway integrity by disrupting within-pathway protein-protein interactions. We also demonstrated the prediction capacity of pathway-specific domains for complex diseases with experimental validations. Our study demonstrated the usefulness of pathway information for protein domains in interpreting the non-random distribution of disease mutations among domains and in identifying disease genes and variants.
Affiliation(s)
- Jung Eun Shim
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
  - Yonsei Biomedical Research Institute, Yonsei University College of Medicine, Seoul, Korea
- Ji Hyun Kim
  - Department of Health Sciences and Technology, SAIHST, Sungkyunkwan University, Seoul, Korea
- Junha Shin
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
- Ji Eun Lee
  - Department of Health Sciences and Technology, SAIHST, Sungkyunkwan University, Seoul, Korea
  - Samsung Biomedical Research Institute, Samsung Medical Center, Seoul, Korea
- Insuk Lee
  - Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
  - Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, Korea
24
Liu W, Cai Y, He P, Chen L, Bian Y. Comparative transcriptomics reveals potential genes involved in the vegetative growth of Morchella importuna. 3 Biotech 2019; 9:81. [PMID: 30800592 PMCID: PMC6374242 DOI: 10.1007/s13205-019-1614-y] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 09/21/2018] [Accepted: 02/02/2019] [Indexed: 12/16/2022] Open
Abstract
True morels (Morchella spp.) are edible, medicinal mushrooms that have recently been artificially cultivated in China, but stable production remains a problem. Here, for the first time, we describe a comprehensive transcriptome of Morchella importuna at the vegetative mycelium (VM), initial sclerotium (IS) and mature sclerotium (MS) stages, obtained by deep transcriptional sequencing and de novo assembly, which may provide useful information for improving its cultivation. A total of 26,496 genes were identified, with a contig N50 length of 1763 bp and an average length of over 1064 bp. Additionally, 11,957 open reading frames (ORFs) were predicted and 9676 of them (80.9%) were annotated. The 2605 differentially expressed genes (DEGs) identified by gene expression clustering were mainly involved in energy metabolism and could be divided into three broad clusters, of which genes in cluster_1 and cluster_2 were involved in the metabolic processes of carbohydrate, polysaccharide, hydrolase, caprolactam, beta-galactosidase, and disaccharide, respectively. Cluster_3 was the largest category, with genes mainly associated with catalytic activity and transporter activity. Overall, the enzymes involved in carbohydrate metabolism were highly expressed, and the CAZyme (carbohydrate-active enzyme) genes were significantly expressed within cluster_3. For expression verification, 16 CAZyme genes were selected for qRT-PCR, and the results suggested that the catabolism of carbohydrates occurs mainly in the vegetative mycelium stage, and that the anabolism of energy-rich substances is the main event of mycelial growth and sclerotial morphogenesis of M. importuna.
Affiliation(s)
- Wei Liu
  - Institute of Applied Mycology, Huazhong Agricultural University, 430070 Wuhan, China
  - Key Laboratory of Agro-Microbial Resource Comprehensive Utilization, Ministry of Agriculture, Huazhong Agricultural University, 430070 Wuhan, Hubei China
- Yingli Cai
  - Institute of Vegetable, Wuhan Academy of Agricultural Sciences, 430070 Wuhan, China
- Peixin He
  - School of Food and Biological Engineering, Zhengzhou University of Light Industry, 450001 Zhengzhou, China
- Lianfu Chen
  - Institute of Applied Mycology, Huazhong Agricultural University, 430070 Wuhan, China
  - Key Laboratory of Agro-Microbial Resource Comprehensive Utilization, Ministry of Agriculture, Huazhong Agricultural University, 430070 Wuhan, Hubei China
- Yinbing Bian
  - Institute of Applied Mycology, Huazhong Agricultural University, 430070 Wuhan, China
  - Key Laboratory of Agro-Microbial Resource Comprehensive Utilization, Ministry of Agriculture, Huazhong Agricultural University, 430070 Wuhan, Hubei China
25
Dhar D, Dey D, Basu S. Insights into the evolution of extracellular leucine-rich repeats in metazoans with special reference to Toll-like receptor 4. J Biosci 2019; 44:18. [PMID: 30837369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
The importance of the widespread leucine-rich repeat (LRR) motif was studied using TLRs, LRR-containing proteins involved in the animal immune response. These proteins connect intracellular signalling with a chain of molecular interactions through the presence of LRRs in the ectodomain and TIR in the endodomain. Domain analyses of human TLR1-9 revealed an ectodomain with tandem repeats, a transmembrane domain and a TIR domain. The repeat number varied across TLR members but remained characteristic of each member. Analysis of gene structure revealed an absence of codon interruption, with TLR3 and TLR4 as exceptions. Extensive study of TLR4 from metazoans confirmed the presence of 23 LRRs in tandem. Distinct clade formation using the coding and amino acid sequences of individual repeats illustrated independent evolution. Although the ectodomain and endodomain exhibited differential selection pressure, individual repeats within the ectodomain displayed positive, negative or neutral selection pressure depending on their structural and functional significance.
Affiliation(s)
- Dipanjana Dhar
  - Department of Microbiology, University of Calcutta, Kolkata 700 019, India
26
Insights into the evolution of extracellular leucine-rich repeats in metazoans with special reference to Toll-like receptor 4. J Biosci 2019. [DOI: 10.1007/s12038-018-9821-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 02/06/2023]
27
Garapati PV, Zhang J, Rey AJ, Marygold SJ. Towards comprehensive annotation of Drosophila melanogaster enzymes in FlyBase. Database (Oxford) 2019; 2019:5298334. [PMID: 30689844 PMCID: PMC6343044 DOI: 10.1093/database/bay144] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 10/31/2018] [Accepted: 12/18/2018] [Indexed: 11/13/2022]
Abstract
The catalytic activities of enzymes can be described using Gene Ontology (GO) terms and Enzyme Commission (EC) numbers. These annotations are available from numerous biological databases and are routinely accessed by researchers and bioinformaticians to direct their work. However, enzyme data may not be congruent between different resources, while the origin, quality and genomic coverage of these data within any one resource are often unclear. GO/EC annotations are assigned either manually by expert curators or inferred computationally, and there is potential for errors in both types of annotation. If such errors remain unchecked, false positive annotations may be propagated across multiple resources, significantly degrading the quality and usefulness of these data. Similarly, the absence of annotations (false negatives) from any one resource can lead to incorrect inferences or conclusions. We are systematically reviewing and enhancing the functional annotation of the enzymes of Drosophila melanogaster, focusing on improvements within the FlyBase (www.flybase.org) database. We have reviewed four major enzyme groups to date: oxidoreductases, lyases, isomerases and ligases. Herein, we describe our review workflow, the improvement in the quality and coverage of enzyme annotations within FlyBase and the wider impact of our work on other related databases.
Affiliation(s)
- Phani V Garapati
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
- Jingyao Zhang
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
- Alix J Rey
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
- Steven J Marygold
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
28
Fey P, Dodson RJ, Basu S, Hartline EC, Chisholm RL. dictyBase and the Dicty Stock Center (version 2.0) - a progress report. Int J Dev Biol 2019; 63:563-572. [PMID: 31840793 PMCID: PMC7409682 DOI: 10.1387/ijdb.190226pf] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 01/09/2023]
Abstract
After serving the Dictyostelium community for many years, the first version of dictyBase (Chisholm et al., 2006; Fey et al., 2006) was in need of a decisive update. The original dictyBase software was not adaptable to more current demands such as handling the import of large-scale data from recently sequenced genomes, keeping up with changes in the Gene Ontology (GO), or handling the automatic annotation of over 20,000 new strains. Therefore, we have embarked on a complete overhaul of dictyBase. The new infrastructure will allow the introduction of new data, such as more expressive GO annotations and Dictyostelium disease orthologs. A modern user interface aims to streamline usage of the database including orders from the Dicty Stock Center (DSC). New displays will allow novel views including the combination of data in two new tools. With the underlying software infrastructure now in place, dictyBase software engineers and curators are currently adding the user interfaces, new tools and content pages for the evolving version 2.0 of dictyBase. This review highlights the emerging status of the new dictyBase, updated pages and annotations that will soon be available in the new environment, an overview of our annotation procedures, and plans to involve the community in curation efforts.
Affiliation(s)
- Petra Fey
  - Northwestern University, Chicago, IL, USA
29
Gurdeep Singh R, Tanca A, Palomba A, Van der Jeugt F, Verschaffelt P, Uzzau S, Martens L, Dawyndt P, Mesuere B. Unipept 4.0: Functional Analysis of Metaproteome Data. J Proteome Res 2018; 18:606-615. [PMID: 30465426 DOI: 10.1021/acs.jproteome.8b00716] [Citation(s) in RCA: 83] [Impact Index Per Article: 13.8] [Indexed: 02/08/2023]
Abstract
Unipept (https://unipept.ugent.be) is a web application for metaproteome data analysis, with an initial focus on tryptic-peptide-based biodiversity analysis of MS/MS samples. Because the true potential of metaproteomics lies in gaining insight into the expressed functions of complex environmental samples, the 4.0 release of Unipept introduces complementary functional analysis based on GO terms and EC numbers. Integration of this new functional analysis with the existing biodiversity analysis is an important asset of the extended pipeline. As a proof of concept, a human faecal metaproteome data set from 15 healthy subjects was reanalyzed with Unipept 4.0, yielding fast, detailed, and straightforward characterization of taxon-specific catalytic functions that is shown to be consistent with previous results from a BLAST-based functional analysis of the same data.
Affiliation(s)
- Robbert Gurdeep Singh
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent B-9000, Belgium
- Alessandro Tanca
  - Porto Conte Ricerche, Science and Technology Park of Sardinia, Tramariglio, Alghero 07041, Italy
- Antonio Palomba
  - Porto Conte Ricerche, Science and Technology Park of Sardinia, Tramariglio, Alghero 07041, Italy
- Felix Van der Jeugt
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent B-9000, Belgium
- Pieter Verschaffelt
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent B-9000, Belgium
- Sergio Uzzau
  - Porto Conte Ricerche, Science and Technology Park of Sardinia, Tramariglio, Alghero 07041, Italy
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent B-9000, Belgium
  - Department of Biochemistry, Ghent University, Ghent B-9000, Belgium
- Peter Dawyndt
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent B-9000, Belgium
- Bart Mesuere
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent B-9000, Belgium
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent B-9000, Belgium
  - Department of Biochemistry, Ghent University, Ghent B-9000, Belgium
30
Zeng C, Zhan W, Deng L. SDADB: a functional annotation database of protein structural domains. Database (Oxford) 2018; 2018:5046758. [PMID: 29961821 PMCID: PMC6025185 DOI: 10.1093/database/bay064] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 10/11/2017] [Accepted: 06/04/2018] [Indexed: 12/27/2022]
Abstract
Annotating functional terms with individual domains is essential for understanding the functions of full-length proteins. We describe SDADB, a functional annotation database for structural domains. SDADB provides associations between Gene Ontology (GO) terms and SCOP domains calculated with an integrated framework. GO annotations are assigned probabilities of being correct, which are estimated with a Bayesian network by taking advantage of structural neighborhood mappings, SCOP-InterPro domain mapping information, position-specific scoring matrices (PSSMs) and sequence homolog features, with the most substantial contribution coming from high-coverage structure-based domain-protein mappings. The domain-protein mappings are computed using large-scale structure alignment. SDADB contains ontological terms with probabilistic scores for more than 214 000 distinct SCOP domains. It also provides additional features, including 3D structure alignment visualization, a GO hierarchical tree view, and search, browse and download options. Database URL: http://sda.denglab.org
Affiliation(s)
- Cheng Zeng
  - School of Software, Central South University, Changsha 410075, China
- Weihua Zhan
  - School of Electronics and Computer Science, Zhejiang Wanli University, Ningbo 315100, China
- Lei Deng
  - School of Software, Central South University, Changsha 410075, China
  - Shanghai Key Lab of Intelligent Information Processing, Shanghai 200433, China
31
Swapna LS, Molinaro AM, Lindsay-Mosher N, Pearson BJ, Parkinson J. Comparative transcriptomic analyses and single-cell RNA sequencing of the freshwater planarian Schmidtea mediterranea identify major cell types and pathway conservation. Genome Biol 2018; 19:124. [PMID: 30143032 PMCID: PMC6109357 DOI: 10.1186/s13059-018-1498-x] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 01/15/2018] [Accepted: 08/01/2018] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND: In the Lophotrochozoa/Spiralia superphylum, few organisms have as high a capacity for rapid testing of gene function and single-cell transcriptomics as the freshwater planaria. The species Schmidtea mediterranea in particular has become a powerful model to use in studying adult stem cell biology and mechanisms of regeneration. Despite this, systematic attempts to define gene complements and their annotations are lacking, restricting comparative analyses that detail the conservation of biochemical pathways and identify lineage-specific innovations. RESULTS: In this study we compare several transcriptomes and define a robust set of 35,232 transcripts. From this, we perform systematic functional annotations and undertake a genome-scale metabolic reconstruction for S. mediterranea. Cross-species comparisons of gene content identify conserved, lineage-specific, and expanded gene families, which may contribute to the regenerative properties of planarians. In particular, we find that the TRAF gene family has been greatly expanded in planarians. We further provide a single-cell RNA sequencing analysis of 2000 cells, revealing both known and novel cell types defined by unique signatures of gene expression. Among these are a novel mesenchymal cell population as well as a cell type involved in eye regeneration. Integration of our metabolic reconstruction further reveals the extent to which given cell types have adapted energy and nucleotide biosynthetic pathways to support their specialized roles. CONCLUSIONS: In general, S. mediterranea displays a high level of gene and pathway conservation compared with other model systems, rendering it a viable model to study the roles of these pathways in stem cell biology and regeneration.
Affiliation(s)
- Alyssa M Molinaro
  - Hospital for Sick Children, Toronto, ON, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Nicole Lindsay-Mosher
  - Hospital for Sick Children, Toronto, ON, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
- Bret J Pearson
  - Hospital for Sick Children, Toronto, ON, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
  - Ontario Institute for Cancer Research, Toronto, ON, Canada
- John Parkinson
  - Hospital for Sick Children, Toronto, ON, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON, Canada
  - Department of Biochemistry, University of Toronto, Toronto, ON, Canada
| |
Collapse
|
32
Xiao Y, Xu P, Fan H, Baudouin L, Xia W, Bocs S, Xu J, Li Q, Guo A, Zhou L, Li J, Wu Y, Ma Z, Armero A, Issali AE, Liu N, Peng M, Yang Y. The genome draft of coconut (Cocos nucifera). Gigascience 2018; 6:1-11. [PMID: 29048487 PMCID: PMC5714197 DOI: 10.1093/gigascience/gix095] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Received: 02/22/2017] [Accepted: 09/28/2017] [Indexed: 12/02/2022]
Abstract
Coconut palm (Cocos nucifera, 2n = 32), a member of genus Cocos and family Arecaceae (Palmaceae), is an important tropical fruit and oil crop. Currently, coconut palm is cultivated in 93 countries, including Central and South America, East and West Africa, Southeast Asia and the Pacific Islands, with a total growth area of more than 12 million hectares [1]. Coconut palm is generally classified into 2 main categories: “Tall” (flowering 8–10 years after planting) and “Dwarf” (flowering 4–6 years after planting), based on morphological characteristics and breeding habits. This Palmae species has a long growth period before reproductive years, which hinders conventional breeding progress. In spite of initial successes, improvements made by conventional breeding have been very slow. In the present study, we obtained de novo sequences of the Cocos nucifera genome: a major genomic resource that could be used to facilitate molecular breeding in Cocos nucifera and accelerate the breeding process in this important crop. A total of 419.67 gigabases (Gb) of raw reads were generated by the Illumina HiSeq 2000 platform using a series of paired-end and mate-pair libraries, covering the predicted Cocos nucifera genome length (2.42 Gb, variety “Hainan Tall”) to an estimated ×173.32 read depth. A total scaffold length of 2.20 Gb was generated (N50 = 418 Kb), representing 90.91% of the genome. The coconut genome was predicted to harbor 28 039 protein-coding genes, which is less than in Phoenix dactylifera (PDK30: 28 889), Phoenix dactylifera (DPV01: 41 660), and Elaeis guineensis (EG5: 34 802). BUSCO evaluation demonstrated that the obtained scaffold sequences covered 90.8% of the coconut genome and that the genome annotation was 74.1% complete. Genome annotation results revealed that 72.75% of the coconut genome consisted of transposable elements, of which long terminal repeat retrotransposon elements (LTRs) accounted for the largest proportion (92.23%).
Comparative analysis of the antiporter gene family and ion channel gene families between C. nucifera and Arabidopsis thaliana indicated that significant gene expansion may have occurred in the coconut involving Na+/H+ antiporter, carnitine/acylcarnitine translocase, potassium-dependent sodium-calcium exchanger, and potassium channel genes. Despite its agronomic importance, C. nucifera is still under-studied. In this report, we present a draft genome of C. nucifera and provide genomic information that will facilitate future functional genomics and molecular-assisted breeding in this crop species.
Affiliation(s)
- Yong Xiao
  - Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Av. Wenqing No. 496, Wenchang, Hainan 571339, P. R. China
- Pengwei Xu
  - BGI Genomics, BGI-Shenzhen, Building NO.7, BGI Park, No. 21 Hongan 3rd Street, Yantian District, Shenzhen 518083, China
- Haikuo Fan
  - Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Av. Wenqing No. 496, Wenchang, Hainan 571339, P. R. China
- Luc Baudouin
  - AGAP, Université de Montpellier, CIRAD, INRA, Montpellier Supagro, F-34398, Montpellier, France
  - CIRAD, UMR AGAP, F-34398, Montpellier France
- Wei Xia
  - Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Av. Wenqing No. 496, Wenchang, Hainan 571339, P. R. China
- Stéphanie Bocs
  - AGAP, Université de Montpellier, CIRAD, INRA, Montpellier Supagro, F-34398, Montpellier, France
  - CIRAD, UMR AGAP, F-34398, Montpellier France
- Junyang Xu
  - BGI Genomics, BGI-Shenzhen, Building NO.7, BGI Park, No. 21 Hongan 3rd Street, Yantian District, Shenzhen 518083, China
- Qiong Li
  - Institute of Tropical Bioscience and Biotechnology, Chinese Academy of Tropical Agricultural Science, Rd. Xueyuan No. 4, Haikou, Hainan 571101, P. R. China
- Anping Guo
  - Institute of Tropical Bioscience and Biotechnology, Chinese Academy of Tropical Agricultural Science, Rd. Xueyuan No. 4, Haikou, Hainan 571101, P. R. China
- Lixia Zhou
  - Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Av. Wenqing No. 496, Wenchang, Hainan 571339, P. R. China
- Jing Li
  - Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Av. Wenqing No. 496, Wenchang, Hainan 571339, P. R. China
- Yi Wu
  - Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Av. Wenqing No. 496, Wenchang, Hainan 571339, P. R. China
- Zilong Ma
  - Institute of Tropical Bioscience and Biotechnology, Chinese Academy of Tropical Agricultural Science, Rd. Xueyuan No. 4, Haikou, Hainan 571101, P. R. China
- Alix Armero
  - AGAP, Université de Montpellier, CIRAD, INRA, Montpellier Supagro, F-34398, Montpellier, France
  - Montpellier Supagro, UMR AGAP, F-34398, Montpellier, France
- Auguste Emmanuel Issali
  - Station Cocotier Marc Delorme, Centre National De Recherche Agronomique (CNRA) 07 B.P. 13, Port Bouet, Côte d'Ivoire
- Na Liu
  - BGI Genomics, BGI-Shenzhen, Building NO.7, BGI Park, No. 21 Hongan 3rd Street, Yantian District, Shenzhen 518083, China
- Ming Peng
  - Institute of Tropical Bioscience and Biotechnology, Chinese Academy of Tropical Agricultural Science, Rd. Xueyuan No. 4, Haikou, Hainan 571101, P. R. China
- Yaodong Yang
  - Hainan Key Laboratory of Tropical Oil Crops Biology/Coconut Research Institute, Chinese Academy of Tropical Agricultural Sciences, Av. Wenqing No. 496, Wenchang, Hainan 571339, P. R. China

33
Song H, Lin K, Hu J, Pang E. An Updated Functional Annotation of Protein-Coding Genes in the Cucumber Genome. Front Plant Sci 2018; 9:325. [PMID: 29599790 PMCID: PMC5863696 DOI: 10.3389/fpls.2018.00325] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2017] [Accepted: 02/27/2018] [Indexed: 06/08/2023]
Abstract
Background: Although the cucumber reference genome and its annotation were published several years ago, the functional annotation of predicted genes, particularly protein-coding genes, still requires further improvement. In general, accurately determining orthologous relationships between genes allows for better and more robust functional assignments of predicted genes. As one of the most reliable strategies, the determination of collinearity information may facilitate reliable orthology inferences among genes from multiple related genomes. Currently, the identification of collinear segments has mainly been based on conservation of gene order and orientation. Over the course of plant genome evolution, various evolutionary events have disrupted or distorted the order of genes along chromosomes, making it difficult to use those genes as genome-wide markers for plant genome comparisons. Results: Using the localized LASTZ/MULTIZ analysis pipeline, we aligned 15 genomes, including cucumber and other related angiosperm plants, and identified a set of genomic segments that are short in length, stable in structure, uniform in distribution and highly conserved across all 15 plants. Compared with protein-coding genes, these conserved segments were more suitable for use as genomic markers for detecting collinear segments among distantly divergent plants. Guided by this set of identified collinear genomic segments, we inferred 94,486 orthologous protein-coding gene pairs (OPPs) between cucumber and 14 other angiosperm species, which were used as proxies for transferring functional terms to cucumber genes from the annotations of the other 14 genomes. In total, 10,885 protein-coding genes were assigned Gene Ontology (GO) terms, which was nearly 1,300 more than the results collected in the Uniprot-proteomic database. Our results showed that annotation accuracy would be improved compared with other existing approaches.
Conclusions: In this study, we provided an alternative resource for the functional annotation of predicted cucumber protein-coding genes, which we expect will be beneficial for the cucumber's biological study, accessible from http://cmb.bnu.edu.cn/functional_annotation. Meanwhile, using the cucumber reference genome as a case study, we presented an efficient strategy for transferring gene functional information from previously well-characterized protein-coding genes in model species to newly sequenced or "non-model" plant species.
Affiliation(s)
- Hongtao Song
  - MOE Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
- Kui Lin
  - MOE Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
- Jinglu Hu
  - Graduate School of Information, Production and Systems, Waseda University, Kitakyushu-shi, Japan
- Erli Pang
  - MOE Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China

34
Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, von Mering C, Bork P. Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper. Mol Biol Evol 2018; 34:2115-2122. [PMID: 28460117 PMCID: PMC5850834 DOI: 10.1093/molbev/msx148] [Citation(s) in RCA: 1619] [Impact Index Per Article: 269.8] [Indexed: 12/20/2022]
Abstract
Orthology assignment is ideally suited for functional inference. However, because predicting orthology is computationally intensive at large scale, and most pipelines are relatively inaccessible (e.g., new assignments only available through database updates), less precise homology-based functional transfer is still the default for (meta-)genome annotation. We, therefore, developed eggNOG-mapper, a tool for functional annotation of large sets of sequences based on fast orthology assignments using precomputed clusters and phylogenies from the eggNOG database. To validate our method, we benchmarked Gene Ontology (GO) predictions against two widely used homology-based approaches: BLAST and InterProScan. Orthology filters applied to BLAST results reduced the rate of false positive assignments by 11%, and increased the ratio of experimentally validated terms recovered over all terms assigned per protein by 15%. Compared with InterProScan, eggNOG-mapper achieved similar proteome coverage and precision while predicting, on average, 41 more terms per protein and increasing the rate of experimentally validated terms recovered over total term assignments per protein by 35%. EggNOG-mapper predictions scored within the top-5 methods in the three GO categories using the CAFA2 NK-partial benchmark. Finally, we evaluated eggNOG-mapper for functional annotation of metagenomics data, yielding better performance than interProScan. eggNOG-mapper runs ∼15× faster than BLAST and at least 2.5× faster than InterProScan. The tool is available standalone and as an online service at http://eggnog-mapper.embl.de.
Affiliation(s)
- Jaime Huerta-Cepas
  - Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Kristoffer Forslund
  - Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Luis Pedro Coelho
  - Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Damian Szklarczyk
  - Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - Bioinformatics/Systems Biology Group, Swiss Institute of Bioinformatics (SIB), Zurich, Switzerland
- Lars Juhl Jensen
  - The Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Christian von Mering
  - Institute of Molecular Life Sciences, University of Zurich, Zurich, Switzerland
  - Bioinformatics/Systems Biology Group, Swiss Institute of Bioinformatics (SIB), Zurich, Switzerland
- Peer Bork
  - Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
  - Germany Molecular Medicine Partnership Unit (MMPU), University Hospital Heidelberg and European Molecular Biology Laboratory, Heidelberg, Germany
  - Max Delbrück Centre for Molecular Medicine, Berlin, Germany
  - Department of Bioinformatics, Biocenter University of Würzburg, Würzburg, Germany

35
Grove C, Cain S, Chen WJ, Davis P, Harris T, Howe KL, Kishore R, Lee R, Paulini M, Raciti D, Tuli MA, Van Auken K, Williams G. Using WormBase: A Genome Biology Resource for Caenorhabditis elegans and Related Nematodes. Methods Mol Biol 2018; 1757:399-470. [PMID: 29761466 DOI: 10.1007/978-1-4939-7737-6_14] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Indexed: 01/17/2023]
Abstract
WormBase (www.wormbase.org) provides the nematode research community with a centralized database for information pertaining to nematode genes and genomes. As more nematode genome sequences are becoming available and as richer data sets are published, WormBase strives to maintain updated information, displays, and services to facilitate efficient access to and understanding of the knowledge generated by the published nematode genetics literature. This chapter aims to provide an explanation of how to use basic features of WormBase, new features, and some commonly used tools and data queries. Explanations of the curated data and step-by-step instructions of how to access the data via the WormBase website and available data mining tools are provided.
Affiliation(s)
- Christian Grove
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Scott Cain
  - Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON, Canada
- Wen J Chen
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Paul Davis
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Todd Harris
  - Informatics and Bio-computing Platform, Ontario Institute for Cancer Research, Toronto, ON, Canada
- Kevin L Howe
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Ranjana Kishore
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Raymond Lee
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Michael Paulini
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Daniela Raciti
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Mary Ann Tuli
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Kimberly Van Auken
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Gary Williams
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK

36
Zhang L, Xu P, Cai Y, Ma L, Li S, Li S, Xie W, Song J, Peng L, Yan H, Zou L, Ma Y, Zhang C, Gao Q, Wang J. The draft genome assembly of Rhododendron delavayi Franch. var. delavayi. Gigascience 2017; 6:1-11. [PMID: 29020749 PMCID: PMC5632301 DOI: 10.1093/gigascience/gix076] [Citation(s) in RCA: 49] [Impact Index Per Article: 7.0] [Received: 02/07/2017] [Revised: 05/28/2017] [Accepted: 08/04/2017] [Indexed: 01/16/2023]
Abstract
Rhododendron delavayi Franch. is globally famous as an ornamental plant. Its distribution in southwest China covers several different habitats and environments. However, not much research had been conducted on Rhododendron spp. at the molecular level, which hinders understanding of its evolution, speciation, and synthesis of secondary metabolites, as well as its wide adaptability to different environments. Here, we report the genome assembly and gene annotation of R. delavayi var. delavayi (the second genome sequenced in the Ericaceae), which will facilitate the study of the family. The genome assembly will have further applications in genome-assisted cultivar breeding. The final size of the assembled R. delavayi var. delavayi genome (695.09 Mb) was close to the 697.94 Mb, estimated by k-mer analysis. A total of 336.83 gigabases (Gb) of raw Illumina HiSeq 2000 reads were generated from 9 libraries (with insert sizes ranging from 170 bp to 40 kb), achieving a raw sequencing depth of ×482.6. After quality filtering, 246.06 Gb of clean reads were obtained, giving ×352.55 coverage depth. Assembly using Platanus gave a total scaffold length of 695.09 Mb, with a contig N50 of 61.8 kb and a scaffold N50 of 637.83 kb. Gene prediction resulted in the annotation of 32 938 protein-coding genes. The genome completeness was evaluated by CEGMA and BUSCO and reached 95.97% and 92.8%, respectively. The gene annotation completeness was also evaluated by CEGMA and BUSCO and reached 97.01% and 87.4%, respectively. Genome annotation revealed that 51.77% of the R. delavayi genome is composed of transposable elements, and 37.48% of long terminal repeat elements (LTRs). The de novo assembled genome of R. delavayi var. delavayi (hereinafter referred to as R. delavayi) is the second genomic resource of the family Ericaceae and will provide a valuable resource for research on future comparative genomic studies in Rhododendron species. The availability of the R. delavayi genome sequence will hopefully provide a tool for scientists to tackle open questions regarding molecular mechanisms underlying environmental interactions in the genus Rhododendron, more accurately understand the evolutionary processes and systematics of the genus, facilitate the identification of genes encoding pharmaceutically important compounds, and accelerate molecular breeding to release elite varieties.
Affiliation(s)
- Lu Zhang
  - Flower Research Institute of Yunnan Academy of Agricultural Sciences, National Engineering Research Center For Ornamental Horticulture, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
  - Key Lab of Yunnan Flower Breeding, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
- Pengwei Xu
  - BGI-Shenzhen, BGI Park, No. 21 Hongan 3rd Street, Yantian District, Shenzhen 518083, China
- Yanfei Cai
  - Flower Research Institute of Yunnan Academy of Agricultural Sciences, National Engineering Research Center For Ornamental Horticulture, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
  - Key Lab of Yunnan Flower Breeding, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
- Lulin Ma
  - Flower Research Institute of Yunnan Academy of Agricultural Sciences, National Engineering Research Center For Ornamental Horticulture, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
  - Key Lab of Yunnan Flower Breeding, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
- Shifeng Li
  - Flower Research Institute of Yunnan Academy of Agricultural Sciences, National Engineering Research Center For Ornamental Horticulture, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
  - Key Lab of Yunnan Flower Breeding, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
- Shufa Li
  - Flower Research Institute of Yunnan Academy of Agricultural Sciences, National Engineering Research Center For Ornamental Horticulture, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
  - Key Lab of Yunnan Flower Breeding, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
- Weijia Xie
  - Flower Research Institute of Yunnan Academy of Agricultural Sciences, National Engineering Research Center For Ornamental Horticulture, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
  - Key Lab of Yunnan Flower Breeding, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
- Jie Song
  - Flower Research Institute of Yunnan Academy of Agricultural Sciences, National Engineering Research Center For Ornamental Horticulture, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
  - Key Lab of Yunnan Flower Breeding, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
- Lvchun Peng
  - Flower Research Institute of Yunnan Academy of Agricultural Sciences, National Engineering Research Center For Ornamental Horticulture, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
  - Key Lab of Yunnan Flower Breeding, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
- Huijun Yan
  - Flower Research Institute of Yunnan Academy of Agricultural Sciences, National Engineering Research Center For Ornamental Horticulture, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
  - Key Lab of Yunnan Flower Breeding, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
- Ling Zou
  - Flower Research Institute of Yunnan Academy of Agricultural Sciences, National Engineering Research Center For Ornamental Horticulture, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
  - Key Lab of Yunnan Flower Breeding, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
- Yongpeng Ma
  - Kunming Botanical Garden, Kunming Institute of Botany, Chinese Academy of Sciences, No. 132 Lanhei Road, Panlong District, Kunming, Yunnan 650201, China
- Chengjun Zhang
  - Germplasm Bank of Wild species, Kunming Institute of Botany, Chinese Academy of Sciences, No. 132 Lanhei Road, Panlong District, Kunming, Yunnan 650201, China
- Qiang Gao
  - BGI-Shenzhen, BGI Park, No. 21 Hongan 3rd Street, Yantian District, Shenzhen 518083, China
- Jihua Wang
  - Flower Research Institute of Yunnan Academy of Agricultural Sciences, National Engineering Research Center For Ornamental Horticulture, No. 2238 Beijing Road, Panlong District, Kunming 650205, China
  - Key Lab of Yunnan Flower Breeding, No. 2238 Beijing Road, Panlong District, Kunming 650205, China

37
Huerta-Cepas J, Forslund K, Coelho LP, Szklarczyk D, Jensen LJ, von Mering C, Bork P. Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper. Mol Biol Evol 2017. [PMID: 28460117 DOI: 10.1093/molbev/msx148] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/12/2022]
38
Hulo C, Masson P, Toussaint A, Osumi-Sutherland D, de Castro E, Auchincloss AH, Poux S, Bougueleret L, Xenarios I, Le Mercier P. Bacterial Virus Ontology; Coordinating across Databases. Viruses 2017; 9:E126. [PMID: 28545254 PMCID: PMC5490803 DOI: 10.3390/v9060126] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 04/13/2017] [Revised: 05/16/2017] [Accepted: 05/17/2017] [Indexed: 12/29/2022]
Abstract
Bacterial viruses, also called bacteriophages, display a great genetic diversity and utilize unique processes for infecting and reproducing within a host cell. All these processes were investigated and indexed in the ViralZone knowledge base. To facilitate standardizing data, a simple ontology of viral life-cycle terms was developed to provide a common vocabulary for annotating data sets. New terminology was developed to address unique viral replication cycle processes, and existing terminology was modified and adapted. Classically, the viral life-cycle is described by schematic pictures. Using this ontology, it can be represented by a combination of successive events: entry, latency, transcription/replication, host-virus interactions and virus release. Each of these parts is broken down into discrete steps. For example, enterobacteria phage lambda entry is broken down into: viral attachment to host adhesion receptor, viral attachment to host entry receptor, viral genome ejection and viral genome circularization. To demonstrate the utility of a standard ontology for virus biology, this work was completed by annotating virus data in the ViralZone, UniProtKB and Gene Ontology databases.
Affiliation(s)
- Chantal Hulo
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, CMU, University of Geneva Medical School, 1211 Geneva, Switzerland
- Patrick Masson
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, CMU, University of Geneva Medical School, 1211 Geneva, Switzerland
- Ariane Toussaint
  - University Libre de Bruxelles, Génétique et Physiologie Bactérienne (LGPB), 12 rue des Professeurs Jeener et Brachet, 6041 Charleroi, Belgium
- David Osumi-Sutherland
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton CB10 1SD, UK
- Edouard de Castro
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, CMU, University of Geneva Medical School, 1211 Geneva, Switzerland
- Andrea H Auchincloss
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, CMU, University of Geneva Medical School, 1211 Geneva, Switzerland
- Sylvain Poux
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, CMU, University of Geneva Medical School, 1211 Geneva, Switzerland
- Lydie Bougueleret
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, CMU, University of Geneva Medical School, 1211 Geneva, Switzerland
- Ioannis Xenarios
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, CMU, University of Geneva Medical School, 1211 Geneva, Switzerland
- Philippe Le Mercier
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, CMU, University of Geneva Medical School, 1211 Geneva, Switzerland

39
Oppenheim SJ, Rosenfeld JA, DeSalle R. Genome content analysis yields new insights into the relationship between the human malaria parasite Plasmodium falciparum and its anopheline vectors. BMC Genomics 2017; 18:205. [PMID: 28241792 PMCID: PMC5327517 DOI: 10.1186/s12864-017-3590-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 11/17/2016] [Accepted: 02/13/2017] [Indexed: 11/24/2022]
Abstract
Background: The persistent and growing gap between the availability of sequenced genomes and the ability to assign functions to sequenced genes led us to explore ways to maximize the information content of automated annotation for studies of anopheline mosquitos. Specifically, we use genome content analysis of a large number of previously sequenced anopheline mosquitos to follow the loss and gain of protein families over the evolutionary history of this group. The importance of this endeavor lies in the potential for comparative genomic studies between Anopheles and closely related non-vector species to reveal ancestral genome content dynamics involved in vector competence. In addition, comparisons within Anopheles could identify genome content changes responsible for variation in the vectorial capacity of this family of important parasite vectors.
Results: The competence and capacity of P. falciparum vectors do not appear to be phylogenetically constrained within the Anophelinae. Instead, using ancestral reconstruction methods, we suggest that a previously unexamined component of vector biology, anopheline nucleotide metabolism, may contribute to the unique status of anophelines as P. falciparum vectors. While the fitness effects of nucleotide co-option by P. falciparum parasites on their anopheline hosts are not yet known, our results suggest that anopheline genome content may be responding to selection pressure from P. falciparum. Whether this response is defensive, in an attempt to redress improper nucleotide balance resulting from P. falciparum infection, or perhaps symbiotic, resulting from an as-yet-unknown mutualism between anophelines and P. falciparum, is an open question that deserves further study.
Conclusions: Clearly, there is a wealth of functional information to be gained from detailed manual genome annotation, yet the rapid increase in the number of available sequences means that most researchers will not have the time or resources to manually annotate all the sequence data they generate. We believe that efforts to maximize the amount of information obtained from automated annotation can help address the functional annotation deficit that most evolutionary biologists now face, and here demonstrate the value of such an approach.
Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-3590-0) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Sara J Oppenheim
- Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY, 10024, USA
- Jeffrey A Rosenfeld
- Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY, 10024, USA; Cancer Institute of New Jersey, Rutgers University, New Brunswick, NJ, USA
- Rob DeSalle
- Sackler Institute for Comparative Genomics, American Museum of Natural History, New York, NY, 10024, USA
40
Abstract
Surveys of public sequence resources show that experimentally supported functional information is still completely missing for a considerable fraction of known proteins and is clearly incomplete for an even larger portion. Bioinformatics methods have long made use of very diverse data sources alone or in combination to predict protein function, with the understanding that different data types help elucidate complementary biological roles. This chapter focuses on methods accepting amino acid sequences as input and producing GO term assignments directly as outputs; the relevant biological and computational concepts are presented along with the advantages and limitations of individual approaches.
Affiliation(s)
- Domenico Cozzetto
- Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
- David T Jones
- Bioinformatics Group, Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
41
Abstract
The Gene Ontology (GO) project is the largest resource for cataloguing gene function. The combination of solid conceptual underpinnings and a practical set of features has made the GO a widely adopted resource in the research community and an essential resource for data analysis. In this chapter, we provide a concise primer for all users of the GO. We briefly introduce the structure of the ontology and explain how to interpret annotations associated with the GO.
Affiliation(s)
- Pascale Gaudet
- CALIPHO group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Michel-Servet, 1211, Geneva, Switzerland; Department of Human Protein Sciences, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Nives Škunca
- Department of Computer Science, ETH Zurich, Universitätstrasse 19, 8092, Zurich, Switzerland; SIB Swiss Institute of Bioinformatics, Universitätstr. 19, 8092, Zurich, Switzerland; University College London, Gower St, London, WC1E 6BT, UK
- James C Hu
- Department of Biochemistry and Biophysics, Texas A&M University and Texas AgriLife Research, College Station, TX, USA
- Christophe Dessimoz
- Department of Genetics, Evolution & Environment, University College London, Gower St, London, WC1E 6BT, UK; Swiss Institute of Bioinformatics, Biophore, 1015, Lausanne, Switzerland; Department of Ecology and Evolution, University of Lausanne, Street Biophore, 1015, Lausanne, Switzerland; Center of Integrative Genomics, University of Lausanne, Biophore, 1015, Lausanne, Switzerland; Department of Computer Science, University College London, Gower St, London, WC1E 6BT, UK
42
Abstract
The Gene Ontology (GO) is a formidable resource, but there are several considerations about it that are essential to understand the data and interpret it correctly. The GO is sufficiently simple that it can be used without deep understanding of its structure or how it is developed, which is both a strength and a weakness. In this chapter, we discuss some common misinterpretations of the ontology and the annotations. A better understanding of the pitfalls and the biases in the GO should help users make the most of this very rich resource. We also review some of the misconceptions and misleading assumptions commonly made about GO, including the effect of data incompleteness, the importance of annotation qualifiers, and the transitivity or lack thereof associated with different ontology relations. We also discuss several biases that can confound aggregate analyses such as gene enrichment analyses. For each of these pitfalls and biases, we suggest remedies and best practices.
Affiliation(s)
- Pascale Gaudet
- CALIPHO group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel-Servet, 1211, Geneva 4, Switzerland; Department of Human Protein Sciences, Faculty of Medicine, University of Geneva, 1211, Geneva, Switzerland
- Christophe Dessimoz
- Department of Genetics, Evolution & Environment, University College London, Gower St, London, WC1E 6BT, UK; Swiss Institute of Bioinformatics, Biophore Building, 1015, Lausanne, Switzerland; Department of Ecology and Evolution, University of Lausanne, Street Biophore, 1015, Lausanne, Switzerland; Center of Integrative Genomics, University of Lausanne, Biophore, 1015, Lausanne, Switzerland; Department of Computer Science, University College London, Gower St, WC1E 6BT, London, UK
43
Feuermann M, Gaudet P, Mi H, Lewis SE, Thomas PD. Large-scale inference of gene function through phylogenetic annotation of Gene Ontology terms: case study of the apoptosis and autophagy cellular processes. Database: The Journal of Biological Databases and Curation 2016; 2016:baw155. [PMID: 28025345 PMCID: PMC5199145 DOI: 10.1093/database/baw155] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 07/26/2016] [Revised: 10/10/2016] [Accepted: 11/01/2016] [Indexed: 01/30/2023]
Abstract
We previously reported a paradigm for large-scale phylogenomic analysis of gene families that takes advantage of the large corpus of experimentally supported Gene Ontology (GO) annotations. This 'GO Phylogenetic Annotation' approach integrates GO annotations from evolutionarily related genes across ∼100 different organisms in the context of a gene family tree, in which curators build an explicit model of the evolution of gene functions. GO Phylogenetic Annotation models the gain and loss of functions in a gene family tree, which is used to infer the functions of uncharacterized (or incompletely characterized) gene products, even for human proteins that are relatively well studied. Here, we report our results from applying this paradigm to two well-characterized cellular processes, apoptosis and autophagy. This revealed several important observations with respect to GO annotations and how they can be used for function inference. Notably, we applied only a small fraction of the experimentally supported GO annotations to infer function in other family members. The majority of other annotations describe indirect effects, phenotypes or results from high throughput experiments. In addition, we show here how feedback from phylogenetic annotation leads to significant improvements in the PANTHER trees, the GO annotations and GO itself. Thus GO phylogenetic annotation both increases the quantity and improves the accuracy of the GO annotations provided to the research community. We expect these phylogenetically based annotations to be of broad use in gene enrichment analysis as well as other applications of GO annotations. Database URL: http://amigo.geneontology.org/amigo
Affiliation(s)
- Pascale Gaudet
- CALIPHO group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland
- Huaiyu Mi
- Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA, USA
- Suzanna E Lewis
- Lawrence Berkeley National Laboratory, Genomics Division, Berkeley, CA, USA
- Paul D Thomas
- Division of Bioinformatics, Department of Preventive Medicine, Keck School of Medicine of USC, University of Southern California, Los Angeles, CA, USA
44
Falda M, Lavezzo E, Fontana P, Bianco L, Berselli M, Formentin E, Toppo S. Eliciting the Functional Taxonomy from protein annotations and taxa. Sci Rep 2016; 6:31971. [PMID: 27534507 PMCID: PMC4989186 DOI: 10.1038/srep31971] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/12/2016] [Accepted: 08/01/2016] [Indexed: 11/30/2022] Open
Abstract
Advances in omics technologies have triggered the production of an enormous volume of data coming from thousands of species. Meanwhile, joint international efforts like the Gene Ontology (GO) consortium have worked to provide functional information for a vast number of proteins. With these data available, we have developed FunTaxIS, a tool that is the first attempt to infer functional taxonomy (i.e. how functions are distributed over taxa) by combining functional and taxonomic information. FunTaxIS is able to define a taxon-specific functional space by exploiting annotation frequencies in order to establish whether a function can or cannot be used to annotate a certain species. The tool generates constraints between GO terms and taxa and then propagates these relations over the taxonomic tree and the GO graph. Since these constraints cover nearly the whole taxonomy, it is possible to obtain the mapping of a function over the taxonomy. FunTaxIS can be used to make functional comparative analyses among taxa, to detect improper associations between taxa and functions, and to discover how functional knowledge is either distributed or missing. A benchmark test set based on six different model species has been devised to get useful insights into the generated taxonomic rules.
Affiliation(s)
- Marco Falda
- Department of Molecular Medicine, University of Padova, Padova, 35131, Italy
- Enrico Lavezzo
- Department of Molecular Medicine, University of Padova, Padova, 35131, Italy
- Paolo Fontana
- Istituto Agrario San Michele all'Adige Research and Innovation Centre, Foundation Edmund Mach, Trento, 38010, Italy
- Luca Bianco
- Istituto Agrario San Michele all'Adige Research and Innovation Centre, Foundation Edmund Mach, Trento, 38010, Italy
- Michele Berselli
- Department of Molecular Medicine, University of Padova, Padova, 35131, Italy
- Elide Formentin
- Department of Biology, University of Padova, Padova, 35131, Italy
- Stefano Toppo
- Department of Molecular Medicine, University of Padova, Padova, 35131, Italy
45
Shim JE, Lee I. Weighted mutual information analysis substantially improves domain-based functional network models. Bioinformatics 2016; 32:2824-30. [PMID: 27207946 PMCID: PMC5018372 DOI: 10.1093/bioinformatics/btw320] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 01/12/2016] [Accepted: 05/16/2016] [Indexed: 11/30/2022] Open
Abstract
Motivation: Functional protein–protein interaction (PPI) networks elucidate molecular pathways underlying complex phenotypes, including those of human diseases. Extrapolation of domain–domain interactions (DDIs) from known PPIs is a major domain-based method for inferring functional PPI networks. However, the protein domain is a functional unit of the protein. Therefore, we should be able to effectively infer functional interactions between proteins based on the co-occurrence of domains.
Results: Here, we present a method for inferring accurate functional PPIs based on the similarity of domain composition between proteins by weighted mutual information (MI), which assigns different weights to the domains based on their genome-wide frequencies. Weighted MI outperforms other domain-based network inference methods and is highly predictive for pathways as well as phenotypes. A genome-scale human functional network determined by our method reveals numerous communities that are significantly associated with known pathways and diseases. Domain-based functional networks may, therefore, have potential applications in mapping domain-to-pathway or domain-to-phenotype associations.
Availability and Implementation: Source code for calculating weighted mutual information based on the domain profile matrix is available from www.netbiolab.org/w/WMI.
Contact: Insuklee@yonsei.ac.kr
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jung Eun Shim
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
- Insuk Lee
- Department of Biotechnology, College of Life Science and Biotechnology, Yonsei University, Seoul, Korea
46
Berardini TZ, Reiser L, Li D, Mezheritsky Y, Muller R, Strait E, Huala E. The Arabidopsis information resource: Making and mining the "gold standard" annotated reference plant genome. Genesis 2015; 53:474-85. [PMID: 26201819 PMCID: PMC4545719 DOI: 10.1002/dvg.22877] [Citation(s) in RCA: 640] [Impact Index Per Article: 71.1] [Received: 04/08/2015] [Revised: 07/15/2015] [Accepted: 07/15/2015] [Indexed: 11/09/2022]
Abstract
The Arabidopsis Information Resource (TAIR) is a continuously updated, online database of genetic and molecular biology data for the model plant Arabidopsis thaliana that provides a global research community with centralized access to data for over 30,000 Arabidopsis genes. TAIR's biocurators systematically extract, organize, and interconnect experimental data from the literature along with computational predictions, community submissions, and high throughput datasets to present a high quality and comprehensive picture of Arabidopsis gene function. TAIR provides tools for data visualization and analysis, and enables ordering of seed and DNA stocks, protein chips, and other experimental resources. TAIR actively engages with its users, who contribute expertise and data that augment the work of the curatorial staff. Within an extensive and evolving ecosystem of online resources for plant biology, TAIR focuses on the critically important role of extracting experimentally based research findings from the literature and making that information computationally accessible. In response to the loss of government grant funding, the TAIR team founded a nonprofit entity, Phoenix Bioinformatics, with the aim of developing sustainable funding models for biological databases, using TAIR as a test case. Phoenix has successfully transitioned TAIR to subscription-based funding while still keeping its data relatively open and accessible.
Affiliation(s)
- Donghui Li
- Phoenix Bioinformatics, Redwood City, California
- Emily Strait
- Phoenix Bioinformatics, Redwood City, California
- Eva Huala
- Phoenix Bioinformatics, Redwood City, California
47
Drabkin HJ, Christie KR, Dolan ME, Hill DP, Ni L, Sitnikov D, Blake JA. Application of comparative biology in GO functional annotation: the mouse model. Mamm Genome 2015; 26:574-83. [PMID: 26141960 PMCID: PMC4602061 DOI: 10.1007/s00335-015-9580-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 05/15/2015] [Accepted: 06/23/2015] [Indexed: 01/22/2023]
Abstract
The Gene Ontology (GO) is an important component of modern biological knowledge representation with great utility for computational analysis of genomic and genetic data. The Gene Ontology Consortium (GOC) consists of a large team of contributors including curation teams from most model organism database groups as well as curation teams focused on representation of data relevant to specific human diseases. Key to the generation of consistent and comprehensive annotations is the development and use of shared standards and measures of curation quality. The GOC engages all contributors to work to a defined standard of curation that is presented here in the context of annotation of genes in the laboratory mouse. Comprehensive understanding of the origin, epistemology, and coverage of GO annotations is essential for most effective use of GO resources. Here the application of comparative approaches to capturing functional data in the mouse system is described.
Affiliation(s)
- Mary E Dolan
- The Jackson Laboratory, Bar Harbor, ME, 04609, USA
- David P Hill
- The Jackson Laboratory, Bar Harbor, ME, 04609, USA
- Li Ni
- The Jackson Laboratory, Bar Harbor, ME, 04609, USA
48
Ito A, Ohkawa T. A method of searching for related literature on protein structure analysis by considering a user's intention. BMC Bioinformatics 2015; 16 Suppl 7:S4. [PMID: 25952498 PMCID: PMC4423583 DOI: 10.1186/1471-2105-16-s7-s4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
Background: In recent years, with advances in techniques for protein structure analysis, the knowledge about protein structure and function has been published in a vast number of articles. A method to search for specific publications from such a large pool of articles is needed. In this paper, we propose a method to search for related articles on protein structure analysis by using an article itself as a query.
Results: Each article is represented as a set of concepts in the proposed method. Then, by using similarities among concepts formulated from databases such as Gene Ontology, similarities between articles are evaluated. In this framework, the desired search results vary depending on the user's search intention because a variety of information is included in a single article. Therefore, the proposed method provides not only one input article (primary article) but also additional articles related to it as an input query to determine the search intention of the user, based on the relationship between two query articles. In other words, based on the concepts contained in the input article and additional articles, we actualize a relevant literature search that considers user intention by varying the degree of attention given to each concept and modifying the concept hierarchy graph.
Conclusions: We performed an experiment to retrieve relevant papers from articles on protein structure analysis registered in the Protein Data Bank by using three query datasets. The experimental results yielded search results with better accuracy than when user intention was not considered, confirming the effectiveness of the proposed method.
49
Huntley RP, Sawford T, Mutowo-Meullenet P, Shypitsyna A, Bonilla C, Martin MJ, O'Donovan C. The GOA database: Gene Ontology annotation updates for 2015. Nucleic Acids Res 2014; 43:D1057-63. [PMID: 25378336 PMCID: PMC4383930 DOI: 10.1093/nar/gku1113] [Citation(s) in RCA: 381] [Impact Index Per Article: 38.1] [Indexed: 01/01/2023] Open
Abstract
The Gene Ontology Annotation (GOA) resource (http://www.ebi.ac.uk/GOA) provides evidence-based Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB). Manual annotations provided by UniProt curators are supplemented by manual and automatic annotations from model organism databases and specialist annotation groups. GOA currently supplies 368 million GO annotations to almost 54 million proteins in more than 480 000 taxonomic groups. The resource now provides annotations to five times the number of proteins it did 4 years ago. As a member of the GO Consortium, we adhere to the most up-to-date Consortium-agreed annotation guidelines via the use of quality control checks that ensure that the GOA resource supplies high-quality functional information to proteins from a wide range of species. Annotations from GOA are freely available and are accessible through a powerful web browser as well as a variety of annotation file formats.
Affiliation(s)
- Rachael P Huntley
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Tony Sawford
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Prudence Mutowo-Meullenet
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Aleksandra Shypitsyna
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Carlos Bonilla
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Maria J Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
- Claire O'Donovan
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
50
Fang H. dcGOR: an R package for analysing ontologies and protein domain annotations. PLoS Comput Biol 2014; 10:e1003929. [PMID: 25356683 PMCID: PMC4214615 DOI: 10.1371/journal.pcbi.1003929] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 08/15/2014] [Accepted: 09/21/2014] [Indexed: 01/08/2023] Open
Abstract
I introduce an open-source R package, 'dcGOR', that makes it easy for the bioinformatics community to analyse ontologies and protein domain annotations, particularly those in the dcGO database. The dcGO is a comprehensive resource for protein domain annotations using a panel of ontologies including Gene Ontology. Although increasing in popularity, this database needs statistical and graphical support to meet its full potential. Moreover, there are no bioinformatics tools specifically designed for domain ontology analysis. As an add-on package built in the R software environment, dcGOR offers a basic infrastructure with great flexibility and functionality. It implements new data structures to represent domains, ontologies, annotations, and all analytical outputs. For each ontology, it provides various mining facilities, including: (i) domain-based enrichment analysis and visualisation; (ii) construction of a domain (semantic similarity) network according to ontology annotations; and (iii) significance analysis for estimating a contact (statistical significance) network. To reduce runtime, most analyses support high-performance parallel computing. Taking as input a list of protein domains of interest, the package can easily carry out in-depth analyses in terms of functional, phenotypic and disease relevance, and network-level understanding. More importantly, dcGOR is designed to allow users to import and analyse their own ontologies and annotations on domains (taken from SCOP, Pfam and InterPro) and RNAs (from Rfam) as well. The package is freely available at CRAN for easy installation, and also at GitHub for version control. The dedicated website with reproducible demos can be found at http://supfam.org/dcGOR.
Affiliation(s)
- Hai Fang
- Computational Genomics Group, Department of Computer Science, University of Bristol, Bristol, United Kingdom