1
Barker DJ, Maccari G, Georgiou X, Cooper MA, Flicek P, Robinson J, Marsh SGE. The IPD-IMGT/HLA Database. Nucleic Acids Res 2022; 51:D1053-D1060. [PMID: 36350643 PMCID: PMC9825470 DOI: 10.1093/nar/gkac1011] [Citation(s) in RCA: 603] [Impact Index Per Article: 301.5] [Received: 09/14/2022] [Revised: 10/14/2022] [Accepted: 10/21/2022] [Indexed: 11/10/2022]
Abstract
It is 24 years since the IPD-IMGT/HLA Database, http://www.ebi.ac.uk/ipd/imgt/hla/, was first released, providing the HLA community with a searchable repository of highly curated HLA sequences. The database now contains over 35 000 alleles of the human Major Histocompatibility Complex (MHC) named by the WHO Nomenclature Committee for Factors of the HLA System. This complex contains the most polymorphic genes in the human genome and is now considered hyperpolymorphic. The IPD-IMGT/HLA Database provides a stable and user-friendly repository for this information. Uptake of Next Generation Sequencing technology in recent years has driven an increase in the number of alleles and the length of sequences submitted. As the size of the database has grown, the traditional methods of accessing and presenting this data have been challenged. In response, we have developed a suite of tools providing an enhanced user experience to our traditional web-based users while creating new programmatic access for our bioinformatics user base. This suite of tools is powered by the IPD-API, an Application Programming Interface (API), providing scalable and flexible access to the database. The IPD-API provides a stable platform for our future development, allowing us to meet the future challenges of the HLA field and the needs of the community.
Affiliation(s)
- Dominic J Barker
- Anthony Nolan Research Institute, Royal Free Hospital, Pond Street, London, NW3 2QG, UK; UCL Cancer Institute, University College London (UCL), Royal Free Campus, Pond Street, London, NW3 2QG, UK
- Giuseppe Maccari
- Data Science for Health (DaScH) Lab, Fondazione Toscana Life Sciences, Siena, Italy
- Xenia Georgiou
- Anthony Nolan Research Institute, Royal Free Hospital, Pond Street, London, NW3 2QG, UK
- Michael A Cooper
- Anthony Nolan Research Institute, Royal Free Hospital, Pond Street, London, NW3 2QG, UK
- Paul Flicek
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- James Robinson
- To whom correspondence should be addressed. Tel: +44 20 7284 8307;
- Steven G E Marsh
- Correspondence may also be addressed to Steven G.E. Marsh. Tel: +44 20 7284 8321;
2
Prasad P, Khatoon U, Verma RK, Sawant SV, Bag SK. Data mining of transcriptional biomarkers at different cotton fiber developmental stages. Funct Integr Genomics 2022; 22:989-1002. [PMID: 35788822 DOI: 10.1007/s10142-022-00878-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/22/2022] [Revised: 06/13/2022] [Accepted: 06/21/2022] [Indexed: 11/04/2022]
Abstract
Advancement of the gene expression study provides comprehensive information on pivotal genes at different cotton fiber development stages. For the betterment of cotton fiber yield and quality, genetic improvement is a major target for the cotton community. Therefore, various studies were carried out to understand the transcriptional machinery of fiber, leading to a detailed, integrative, and innovative study. Through data mining and statistical approaches, we identified and validated the transcriptional biomarkers for stage-specific differentiation of fiber. With the unique mapping read matrix of ~200 cotton transcriptome data and sequential statistical analysis, we identified several important genes that have a deciding and specific role in the fiber cell commitment, initiation and elongation, or secondary cell wall synthesis stage. Based on the importance score and validation analysis, IQ domain 26, Aquaporin, Gibberellin regulated protein, methionine gamma lyase, alpha/beta hydrolases, and HAD-like superfamily have shown a specific and determining role for fiber developmental stages. These genes are represented as transcriptional biomarkers that provide a base for molecular characterization of cotton fiber development, which will ultimately determine the high yield.
Affiliation(s)
- Priti Prasad
- Molecular Biology and Biotechnology Division, CSIR-National Botanical Research Institute, Rana Pratap Marg, Lucknow, 226001, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
- Uzma Khatoon
- Molecular Biology and Biotechnology Division, CSIR-National Botanical Research Institute, Rana Pratap Marg, Lucknow, 226001, India; Department of Botany, University of Lucknow, Lucknow, 226001, India
- Rishi Kumar Verma
- Molecular Biology and Biotechnology Division, CSIR-National Botanical Research Institute, Rana Pratap Marg, Lucknow, 226001, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
- Samir V Sawant
- Molecular Biology and Biotechnology Division, CSIR-National Botanical Research Institute, Rana Pratap Marg, Lucknow, 226001, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
- Sumit K Bag
- Molecular Biology and Biotechnology Division, CSIR-National Botanical Research Institute, Rana Pratap Marg, Lucknow, 226001, India; Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
3
Gupta R, Srivastava D, Sahu M, Tiwari S, Ambasta RK, Kumar P. Artificial intelligence to deep learning: machine intelligence approach for drug discovery. Mol Divers 2021; 25:1315-1360. [PMID: 33844136 PMCID: PMC8040371 DOI: 10.1007/s11030-021-10217-3] [Citation(s) in RCA: 302] [Impact Index Per Article: 100.7] [Received: 01/29/2021] [Accepted: 03/22/2021] [Indexed: 02/06/2023]
Abstract
Drug designing and development is an important area of research for pharmaceutical companies and chemical scientists. However, low efficacy, off-target delivery, time consumption, and high cost impose a hurdle and challenges that impact drug design and discovery. Further, complex and big data from genomics, proteomics, microarray data, and clinical trials also impose an obstacle in the drug discovery pipeline. Artificial intelligence and machine learning technology play a crucial role in drug discovery and development. In other words, artificial neural networks and deep learning algorithms have modernized the area. Machine learning and deep learning algorithms have been implemented in several drug discovery processes such as peptide synthesis, structure-based virtual screening, ligand-based virtual screening, toxicity prediction, drug monitoring and release, pharmacophore modeling, quantitative structure-activity relationship, drug repositioning, polypharmacology, and physicochemical activity. Evidence from the past strengthens the implementation of artificial intelligence and deep learning in this field. Moreover, novel data mining, curation, and management techniques provided critical support to recently developed modeling algorithms. In summary, artificial intelligence and deep learning advancements provide an excellent opportunity for rational drug design and discovery process, which will eventually impact mankind. The primary concern associated with drug design and development is time consumption and production cost. Further, inefficiency, inaccurate target delivery, and inappropriate dosage are other hurdles that inhibit the process of drug delivery and development. With advancements in technology, computer-aided drug design integrating artificial intelligence algorithms can eliminate the challenges and hurdles of traditional drug design and development.
Artificial intelligence is referred to as a superset comprising machine learning, whereas machine learning comprises supervised learning, unsupervised learning, and reinforcement learning. Further, deep learning, a subset of machine learning, has been extensively implemented in drug design and development. The artificial neural network, deep neural network, support vector machines, classification and regression, generative adversarial networks, symbolic learning, and meta-learning are examples of the algorithms applied to the drug design and discovery process. Artificial intelligence has been applied to different areas of the drug design and development process, such as from peptide synthesis to molecule design, virtual screening to molecular docking, quantitative structure-activity relationship to drug repositioning, protein misfolding to protein-protein interactions, and molecular pathway identification to polypharmacology. Artificial intelligence principles have been applied to the classification of active and inactive, monitoring drug release, pre-clinical and clinical development, primary and secondary drug screening, biomarker development, pharmaceutical manufacturing, bioactivity identification and physicochemical properties, prediction of toxicity, and identification of mode of action.
Affiliation(s)
- Rohan Gupta
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University (Formerly DCE), Shahbad Daulatpur, Bawana Road, Delhi, 110042, India
- Devesh Srivastava
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University (Formerly DCE), Shahbad Daulatpur, Bawana Road, Delhi, 110042, India
- Mehar Sahu
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University (Formerly DCE), Shahbad Daulatpur, Bawana Road, Delhi, 110042, India
- Swati Tiwari
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University (Formerly DCE), Shahbad Daulatpur, Bawana Road, Delhi, 110042, India
- Rashmi K Ambasta
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University (Formerly DCE), Shahbad Daulatpur, Bawana Road, Delhi, 110042, India
- Pravir Kumar
- Molecular Neuroscience and Functional Genomics Laboratory, Department of Biotechnology, Delhi Technological University (Formerly DCE), Shahbad Daulatpur, Bawana Road, Delhi, 110042, India
4
Foroughmand-Araabi MH, Goliaei S, Goliaei B. A novel pattern matching algorithm for genomic patterns related to protein motifs. J Bioinform Comput Biol 2020; 18:2050011. [PMID: 32336249 DOI: 10.1142/s0219720020500110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Background: Patterns on proteins and genomic sequences are vastly analyzed, extracted and collected in databases. Although protein patterns originate from genomic coding regions, very few works have directly or indirectly dealt with coding region patterns induced from protein patterns. Results: In this paper, we have defined a new genomic pattern structure suitable for representing induced patterns from proteins. The provided pattern structure, which is called "Consecutive Positions Scoring Matrix (CPSSM)", is a replacement for protein patterns and profiles in the genomic context. CPSSMs can be identified, discovered, and searched in genomes. Then, we have presented a novel pattern matching algorithm between the defined genomic pattern and genomic sequences based on dynamic programming. In addition, we have modified the provided algorithm to support intronic gaps and huge sequences. We have implemented and tested the provided algorithm on real data. The results on Saccharomyces cerevisiae's genome show 132% more true positives and no false negatives and the results on human genome show no false negatives and 10 times as many true positives as those in previous works. Conclusion: CPSSM and provided methods could be used for open reading frame detection and gene finding. The application is available with source codes to run and download at http://app.foroughmand.ir/cpssm/.
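To make the idea of matching a scoring-matrix pattern against a genomic sequence concrete, here is a minimal Python sketch. It is not the CPSSM algorithm from the paper, which the abstract describes as dynamic-programming based with support for intronic gaps; it is a plain position-specific scoring matrix slid along a sequence. The names `scan_pssm`, `toy_pssm`, and `threshold`, and all scores, are illustrative assumptions.

```python
from typing import Dict, List, Tuple


def scan_pssm(sequence: str,
              pssm: List[Dict[str, float]],
              threshold: float) -> List[Tuple[int, float]]:
    """Return (start, score) for every window scoring at least `threshold`."""
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Sum the per-position scores; unknown symbols contribute 0.
        score = sum(col.get(base, 0.0) for col, base in zip(pssm, window))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy matrix favouring the codon ATG (hypothetical scores, not from the paper).
toy_pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
]
hits = scan_pssm("CCATGGATGA", toy_pssm, threshold=3.0)
print(hits)
```

A gapless scan like this runs in time proportional to sequence length times pattern width; allowing gaps between pattern positions is what typically motivates a dynamic-programming formulation.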
Affiliation(s)
- Sama Goliaei
- Faculty of New Sciences & Technologies, University of Tehran, Tehran, Iran
- Bahram Goliaei
- Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
5
Gruenstaeudl M, Hartmaring Y. EMBL2checklists: A Python package to facilitate the user-friendly submission of plant and fungal DNA barcoding sequences to ENA. PLoS One 2019; 14:e0210347. [PMID: 30629718 PMCID: PMC6328100 DOI: 10.1371/journal.pone.0210347] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/09/2018] [Accepted: 12/20/2018] [Indexed: 02/06/2023]
Abstract
Background: The submission of DNA sequences to public sequence databases is an essential, but insufficiently automated step in the process of generating and disseminating novel DNA sequence data. Despite the centrality of database submissions to biological research, the range of available software tools that facilitate the preparation of sequence data for database submissions is low, especially for sequences generated via plant and fungal DNA barcoding. Current submission procedures can be complex and prohibitively time expensive for any but a small number of input sequences. A user-friendly software tool is needed that streamlines the file preparation for database submissions of DNA sequences that are commonly generated in plant and fungal DNA barcoding. Methods: A Python package was developed that converts DNA sequences from the common EMBL and GenBank flat file formats to submission-ready, tab-delimited spreadsheets (so-called ‘checklists’) for a subsequent upload to the annotated sequence section of the European Nucleotide Archive (ENA). The software tool, titled ‘EMBL2checklists’, automatically converts DNA sequences, their annotation features, and associated metadata into the idiosyncratic format of marker-specific ENA checklists and, thus, generates files that can be uploaded via the interactive Webin submission system of ENA. Results: EMBL2checklists provides a simple, platform-independent tool that automates the conversion of common DNA barcoding sequences into easily editable spreadsheets that require no further processing but their upload to ENA via the interactive Webin submission system. The software is equipped with an intuitive graphical as well as an efficient command-line interface for its operation. The utility of the software is illustrated by its application in four recent investigations, including plant phylogenetic and fungal metagenomic studies.
Discussion: EMBL2checklists bridges the gap between common software suites for DNA sequence assembly and annotation and the interactive data submission process of ENA. It represents an easy-to-use solution for plant and fungal biologists without bioinformatics expertise to generate submission-ready checklists from common DNA sequence data. It allows the post-processing of checklists as well as work-sharing during the submission process and solves a critical bottleneck in the effort to increase participation in public data sharing.
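The conversion step described in the Methods, from sequence records to a tab-delimited spreadsheet, can be illustrated with a short sketch using only Python's standard `csv` module. This is not the EMBL2checklists code or its API; the function `records_to_checklist`, the column names, and the example records are all hypothetical, and real ENA checklists define their own marker-specific columns.

```python
import csv
import io
from typing import Dict, Iterable, List


def records_to_checklist(records: Iterable[Dict[str, str]],
                         columns: List[str]) -> str:
    """Render records as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Keep only the requested columns; missing fields become empty cells.
        writer.writerow({col: record.get(col, "") for col in columns})
    return buffer.getvalue()


# Hypothetical column layout and records, for illustration only.
columns = ["entry_number", "organism", "marker", "sequence"]
records = [
    {"entry_number": "1", "organism": "Arabidopsis thaliana",
     "marker": "trnK-matK", "sequence": "ATGGAA"},
    {"entry_number": "2", "organism": "Amanita muscaria",
     "marker": "ITS", "note": "not a checklist column"},
]
checklist = records_to_checklist(records, columns)
print(checklist)
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the same writer could target a file opened with `newline=""`.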
6
Robinson J, Soormally AR, Hayhurst JD, Marsh SGE. The IPD-IMGT/HLA Database - New developments in reporting HLA variation. Hum Immunol 2016; 77:233-237. [PMID: 26826444 DOI: 10.1016/j.humimm.2016.01.020] [Citation(s) in RCA: 85] [Impact Index Per Article: 10.6] [Received: 06/26/2015] [Revised: 12/18/2015] [Accepted: 01/23/2016] [Indexed: 10/22/2022]
Abstract
IPD-IMGT/HLA is a constituent of the Immuno Polymorphism Database (IPD), which was developed to provide a centralised system for the study of polymorphism in genes of the immune system. The IPD project works with specialist groups of nomenclature committees who provide and curate individual sections before they are submitted to IPD for online publication. The primary database within the IPD project is the IPD-IMGT/HLA Database, which provides a locus-specific database for the hyper-polymorphic allele sequences of the genes in the HLA system, also known as the human Major Histocompatibility Complex. The IPD-IMGT/HLA Database was first released over 17 years ago, building on the work of the WHO Nomenclature Committee for Factors of the HLA system that was initiated in 1968. The IPD-IMGT/HLA Database enhanced this work by providing the HLA community with an online, searchable repository of highly curated HLA sequences. Many of the genes encode proteins of the immune system and are hyper polymorphic, with some genes currently having over 4000 known allelic variants. Through the work of the HLA Informatics Group and in collaboration with the European Bioinformatics Institute we are able to provide public access to this data through the website, http://www.ebi.ac.uk/ipd/imgt/hla.
Affiliation(s)
- James Robinson
- Anthony Nolan Research Institute, Royal Free Hospital, Pond Street, Hampstead, London NW3 2QG, UK; UCL Cancer Institute, University College London, Royal Free Campus, Pond Street, Hampstead, London NW3 2QG, UK
- Anup R Soormally
- Anthony Nolan Research Institute, Royal Free Hospital, Pond Street, Hampstead, London NW3 2QG, UK
- James D Hayhurst
- Anthony Nolan Research Institute, Royal Free Hospital, Pond Street, Hampstead, London NW3 2QG, UK
- Steven G E Marsh
- Anthony Nolan Research Institute, Royal Free Hospital, Pond Street, Hampstead, London NW3 2QG, UK; UCL Cancer Institute, University College London, Royal Free Campus, Pond Street, Hampstead, London NW3 2QG, UK
7
Schmedes SE, King JL, Budowle B. Correcting Inconsistencies and Errors in Bacterial Genome Metadata Using an Automated Curation Tool in Excel (AutoCurE). Front Bioeng Biotechnol 2015; 3:138. [PMID: 26442252 PMCID: PMC4566056 DOI: 10.3389/fbioe.2015.00138] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 06/26/2015] [Accepted: 08/28/2015] [Indexed: 12/24/2022]
Abstract
Whole-genome data are invaluable for large-scale comparative genomic studies. Current sequencing technologies have made it feasible to sequence entire bacterial genomes relatively easily and quickly, with a substantially reduced cost per nucleotide and hence per genome. More than 3,000 bacterial genomes have been sequenced and are available at the finished status. Publicly available genomes can be readily downloaded; however, it is challenging to verify the specific supporting data contained within the download and to identify errors and inconsistencies that may be present within the organizational data content and metadata. AutoCurE, an automated tool for bacterial genome database curation in Excel, was developed to facilitate local database curation of supporting data that accompany downloaded genomes from the National Center for Biotechnology Information. AutoCurE provides an automated approach to curate local genomic databases by flagging inconsistencies or errors, comparing the downloaded supporting data to the genome reports to verify genome name, RefSeq accession numbers, the presence of archaea, BioProject/UIDs, and sequence file descriptions. Flags are generated for nine metadata fields if there are inconsistencies between the downloaded genomes and genome reports and if erroneous or missing data are evident. AutoCurE is an easy-to-use tool for local database curation for large-scale genome data prior to downstream analyses.
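The flagging approach described in this abstract, comparing downloaded metadata against a reference report field by field, can be sketched in a few lines of Python. AutoCurE itself is an Excel tool, so this is only an illustration of the comparison logic; the function name `flag_inconsistencies`, the field names, the flag labels, and the placeholder values are assumptions, not taken from the tool.

```python
from typing import Dict, List


def flag_inconsistencies(downloaded: Dict[str, str],
                         report: Dict[str, str],
                         fields: List[str]) -> Dict[str, str]:
    """Flag each field as 'missing' or 'mismatch'; consistent fields get no flag."""
    flags = {}
    for field in fields:
        local = downloaded.get(field, "").strip()
        reference = report.get(field, "").strip()
        if not local or not reference:
            flags[field] = "missing"      # absent or empty on either side
        elif local != reference:
            flags[field] = "mismatch"     # both present but different
    return flags


# Placeholder values, not real accessions or project identifiers.
downloaded = {"genome_name": "Example bacterium strain X",
              "refseq_accession": "NC_EXAMPLE.1",
              "bioproject": ""}
report = {"genome_name": "Example bacterium strain X",
          "refseq_accession": "NC_EXAMPLE.2",
          "bioproject": "PRJ_EXAMPLE"}
flags = flag_inconsistencies(
    downloaded, report, ["genome_name", "refseq_accession", "bioproject"])
print(flags)
```

Returning only the flagged fields keeps the output short for large collections, where most records are expected to be consistent.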
Affiliation(s)
- Sarah E Schmedes
- Institute of Applied Genetics, Department of Molecular and Medical Genetics, University of North Texas Health Science Center, Fort Worth, TX, USA
- Jonathan L King
- Institute of Applied Genetics, Department of Molecular and Medical Genetics, University of North Texas Health Science Center, Fort Worth, TX, USA
- Bruce Budowle
- Institute of Applied Genetics, Department of Molecular and Medical Genetics, University of North Texas Health Science Center, Fort Worth, TX, USA; Center of Excellence in Genomic Medicine Research, King Abdulaziz University, Jeddah, Saudi Arabia
8
Seetan RI, Denton AM, Al-Azzam O, Kumar A, Iqbal MJ, Kianian SF. Reliable Radiation Hybrid Maps: An Efficient Scalable Clustering-Based Approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11:788-800. [PMID: 26356853 DOI: 10.1109/tcbb.2014.2329310] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 06/05/2023]
Abstract
The process of mapping markers from radiation hybrid mapping (RHM) experiments is equivalent to the traveling salesman problem and, thereby, has combinatorial complexity. As an additional problem, experiments typically result in some unreliable markers that reduce the overall quality of the map. We propose a clustering approach for addressing both problems efficiently by eliminating unreliable markers without the need for mapping the complete set of markers. Traditional approaches for eliminating markers use resampling of the full data set, which has an even higher computational complexity than the original mapping problem. In contrast, the proposed approach uses a divide-and-conquer strategy to construct framework maps based on clusters that exclude unreliable markers. Clusters are ordered using parallel processing and are then combined to form the complete map. We present three algorithms that explore the trade-off between the number of markers included in the map and placement accuracy. Using an RHM data set of the human genome, we compare the framework maps from our proposed approaches with published physical maps and with the results of using the Carthagene tool. Overall, our approaches have a very low computational complexity and produce solid framework maps with good chromosome coverage and high agreement with the physical map marker order.
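As a rough illustration of the clustering idea in this abstract, the sketch below groups markers whose retention patterns across hybrids agree closely and treats markers left in singleton clusters as candidates for exclusion. This is an assumption-laden toy, not any of the paper's three algorithms: the single-linkage rule, the agreement measure, the `min_agreement` threshold, and the singleton-equals-unreliable heuristic are all choices made here for illustration, and the ordering of markers within clusters is not attempted.

```python
from itertools import combinations
from typing import Dict, List


def cluster_markers(vectors: Dict[str, str],
                    min_agreement: float) -> List[List[str]]:
    """Single-linkage clustering of markers by retention-pattern agreement."""
    parent = {name: name for name in vectors}

    def find(name: str) -> str:
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(vectors, 2):
        va, vb = vectors[a], vectors[b]
        # Fraction of hybrids in which both markers have the same call.
        agreement = sum(x == y for x, y in zip(va, vb)) / len(va)
        if agreement >= min_agreement:
            parent[find(a)] = find(b)

    clusters: Dict[str, List[str]] = {}
    for name in vectors:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Toy retention vectors (1 = marker retained in that hybrid); invented data.
vectors = {"m1": "110010", "m2": "110011", "m3": "001100",
           "m4": "001101", "m5": "101010"}
clusters = cluster_markers(vectors, min_agreement=0.8)
framework = [c for c in clusters if len(c) > 1]  # drop singleton markers
print(clusters, framework)
```

Each cluster is small enough to be ordered independently, which is where a divide-and-conquer strategy saves work compared with ordering the full marker set at once.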
9
Abstract
The IMGT/HLA Database (http://www.ebi.ac.uk/ipd/imgt/hla/) was first released over 15 years ago, providing the HLA community with a searchable repository of highly curated HLA sequences. The HLA complex is located within the 6p21.3 region of human chromosome 6 and contains more than 220 genes of diverse function. Many of the genes encode proteins of the immune system and are highly polymorphic, with some genes currently having over 3,000 known allelic variants. The Immuno Polymorphism Database (IPD) (http://www.ebi.ac.uk/ipd/) expands on this model, with a further set of specialist databases related to the study of polymorphic genes in the immune system. The IPD project works with specialist groups or nomenclature committees who provide and curate individual sections before they are submitted to IPD for online publication. IPD currently consists of four databases: IPD-KIR contains the allelic sequences of killer-cell immunoglobulin-like receptors; IPD-MHC is a database of sequences of the major histocompatibility complex of different species; IPD-HPA, alloantigens expressed only on platelets; and IPD-ESTDAB, which provides access to the European Searchable Tumour Cell-Line Database, a cell bank of immunologically characterized melanoma cell lines. Through the work of the HLA Informatics Group and in collaboration with the European Bioinformatics Institute we are able to provide public access to this data through the website http://www.ebi.ac.uk/ipd/.
Affiliation(s)
- James Robinson
- Anthony Nolan Research Institute, Royal Free Hospital, Pond Street, Hampstead, London, NW3 2QG, UK
10
Hunter S, Corbett M, Denise H, Fraser M, Gonzalez-Beltran A, Hunter C, Jones P, Leinonen R, McAnulla C, Maguire E, Maslen J, Mitchell A, Nuka G, Oisel A, Pesseat S, Radhakrishnan R, Rocca-Serra P, Scheremetjew M, Sterk P, Vaughan D, Cochrane G, Field D, Sansone SA. EBI metagenomics--a new resource for the analysis and archiving of metagenomic data. Nucleic Acids Res 2013; 42:D600-6. [PMID: 24165880 PMCID: PMC3965009 DOI: 10.1093/nar/gkt961] [Citation(s) in RCA: 77] [Impact Index Per Article: 7.0] [Indexed: 01/23/2023]
Abstract
Metagenomics is a relatively recently established but rapidly expanding field that uses high-throughput next-generation sequencing technologies to characterize the microbial communities inhabiting different ecosystems (including oceans, lakes, soil, tundra, plants and body sites). Metagenomics brings with it a number of challenges, including the management, analysis, storage and sharing of data. In response to these challenges, we have developed a new metagenomics resource (http://www.ebi.ac.uk/metagenomics/) that allows users to easily submit raw nucleotide reads for functional and taxonomic analysis by a state-of-the-art pipeline, and have them automatically stored (together with descriptive, standards-compliant metadata) in the European Nucleotide Archive.
Affiliation(s)
- Sarah Hunter
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, CB10 1SD, UK, Oxford e-Research Centre, University of Oxford, 7 Keble Road, Oxford, OX1 3QG, UK and NERC Centre for Ecology and Hydrology, Maclean Building, Benson Lane, Crowmarsh Gifford, Wallingford, OX10 8BB, UK
11
Minkiewicz P, Miciński J, Darewicz M, Bucholska J. Biological and Chemical Databases for Research into the Composition of Animal Source Foods. Food Reviews International 2013. [DOI: 10.1080/87559129.2013.818011] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 10/26/2022]
12
Carvalho BS, Rustici G. The challenges of delivering bioinformatics training in the analysis of high-throughput data. Brief Bioinform 2013; 14:538-47. [PMID: 23543353 PMCID: PMC3771233 DOI: 10.1093/bib/bbt018] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 01/27/2023]
Abstract
High-throughput technologies are widely used in the field of functional genomics and used in an increasing number of applications. For many ‘wet lab’ scientists, the analysis of the large amount of data generated by such technologies is a major bottleneck that can only be overcome through very specialized training in advanced data analysis methodologies and the use of dedicated bioinformatics software tools. In this article, we wish to discuss the challenges related to delivering training in the analysis of high-throughput sequencing data and how we addressed these challenges in the hands-on training courses that we have developed at the European Bioinformatics Institute.
Affiliation(s)
- Benilton S Carvalho
- *Functional Genomics Group, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton CB10 1SD, UK. Tel.: +44-1223-492539; Fax: +44-1223-494468;
13
Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, García-Girón C, Gordon L, Hourlier T, Hunt S, Juettemann T, Kähäri AK, Keenan S, Komorowska M, Kulesha E, Longden I, Maurel T, McLaren WM, Muffato M, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ritchie GRS, Ruffier M, Schuster M, Sheppard D, Sobral D, Taylor K, Thormann A, Trevanion S, White S, Wilder SP, Aken BL, Birney E, Cunningham F, Dunham I, Harrow J, Herrero J, Hubbard TJP, Johnson N, Kinsella R, Parker A, Spudich G, Yates A, Zadissa A, Searle SMJ. Ensembl 2013. Nucleic Acids Res 2012. [PMID: 23203987 PMCID: PMC3531136 DOI: 10.1093/nar/gks1236] [Citation(s) in RCA: 791] [Impact Index Per Article: 65.9] [Indexed: 11/12/2022]
Abstract
The Ensembl project (http://www.ensembl.org) provides genome information for sequenced chordate genomes with a particular focus on human, mouse, zebrafish and rat. Our resources include evidenced-based gene sets for all supported species; large-scale whole genome multiple species alignments across vertebrates and clade-specific alignments for eutherian mammals, primates, birds and fish; variation data resources for 17 species and regulation annotations based on ENCODE and other data sets. Ensembl data are accessible through the genome browser at http://www.ensembl.org and through other tools and programmatic interfaces.
Affiliation(s)
- Paul Flicek
- European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SD, UK.
14
Velankar S, Dana JM, Jacobsen J, van Ginkel G, Gane PJ, Luo J, Oldfield TJ, O'Donovan C, Martin MJ, Kleywegt GJ. SIFTS: Structure Integration with Function, Taxonomy and Sequences resource. Nucleic Acids Res 2012. [PMID: 23203869 PMCID: PMC3531078 DOI: 10.1093/nar/gks1258] [Citation(s) in RCA: 174] [Impact Index Per Article: 14.5] [Indexed: 12/04/2022]
Abstract
The Structure Integration with Function, Taxonomy and Sequences resource (SIFTS; http://pdbe.org/sifts) is a close collaboration between the Protein Data Bank in Europe (PDBe) and UniProt. The two teams have developed a semi-automated process for maintaining up-to-date cross-reference information to UniProt entries, for all protein chains in the PDB entries present in the UniProt database. This process is carried out for every weekly PDB release and the information is stored in the SIFTS database. The SIFTS process includes cross-references to other biological resources such as Pfam, SCOP, CATH, GO, InterPro and the NCBI taxonomy database. The information is exported in XML format, one file for each PDB entry, and is made available by FTP. Many bioinformatics resources use SIFTS data to obtain cross-references between the PDB and other biological databases so as to provide their users with up-to-date information.
Affiliation(s)
- Sameer Velankar
- Protein Data Bank in Europe, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
15
Ogasawara O, Mashima J, Kodama Y, Kaminuma E, Nakamura Y, Okubo K, Takagi T. DDBJ new system and service refactoring. Nucleic Acids Res 2012. [PMID: 23180790 PMCID: PMC3531146 DOI: 10.1093/nar/gks1152] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.2] [Indexed: 01/15/2023]
Abstract
The DNA Data Bank of Japan (DDBJ, http://www.ddbj.nig.ac.jp) maintains a primary nucleotide sequence database and provides analytical resources for biological information to researchers. This database content is exchanged with the US National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI) within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). Resources provided by the DDBJ include traditional nucleotide sequence data released in the form of 27 316 452 entries or 16 876 791 557 base pairs (as of June 2012), and raw reads of new generation sequencers in the sequence read archive (SRA). A Japanese researcher published his own genome sequence via DDBJ-SRA on 31 July 2012. To cope with the ongoing genomic data deluge, in March 2012, our previous computer system was totally replaced by a commodity cluster-based system that boasts 122.5 TFlops of CPU capacity and 5 PB of storage space. During this upgrade, it was considered crucial to replace and refactor substantial portions of the DDBJ software systems as well. As a result of the replacement process, which took more than 2 years to perform, we have achieved significant improvements in system performance.
Affiliation(s)
- Osamu Ogasawara
- DDBJ Center, National Institute of Genetics, Yata 1111, Mishima, Shizuoka 411-8540, Japan.
16
Nakamura Y, Cochrane G, Karsch-Mizrachi I. The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res 2012. [PMID: 23180798 PMCID: PMC3531182 DOI: 10.1093/nar/gks1084] [Citation(s) in RCA: 99] [Impact Index Per Article: 8.3] [Indexed: 11/15/2022]
Abstract
The International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org), one of the longest-standing global alliances of biological data archives, captures, preserves and provides comprehensive public domain nucleotide sequence information. Three partners of the INSDC work in cooperation to establish formats for data and metadata and protocols that facilitate reliable data submission to their databases and support continual data exchange around the world. This article presents the current status of the INSDC and its updates for 2012. Of the items discussed at the 2012 international collaboration meeting, the BioSample database and changes in submission are described.
Affiliation(s)
- Yasukazu Nakamura
- DDBJ Center, National Institute of Genetics, Research Organization for Information and Systems, Yata, Mishima 411-8510, Japan.
17
Robinson J, Halliwell JA, McWilliam H, Lopez R, Parham P, Marsh SGE. The IMGT/HLA database. Nucleic Acids Res 2012; 41:D1222-7. [PMID: 23080122 PMCID: PMC3531221 DOI: 10.1093/nar/gks949] [Citation(s) in RCA: 510] [Impact Index Per Article: 42.5] [Indexed: 02/06/2023]
Abstract
It is 14 years since the IMGT/HLA database was first released, providing the HLA community with a searchable repository of highly curated HLA sequences. The HLA complex is located within the 6p21.3 region of human chromosome 6 and contains more than 220 genes of diverse function. Of these, 21 genes encode proteins of the immune system that are highly polymorphic. The naming of these HLA genes and alleles and their quality control is the responsibility of the World Health Organization Nomenclature Committee for Factors of the HLA System. Through the work of the HLA Informatics Group and in collaboration with the European Bioinformatics Institute, we are able to provide public access to these data through the website http://www.ebi.ac.uk/imgt/hla/. Regular updates to the website ensure that new and confirmatory sequences are dispersed to the HLA community and the wider research and clinical communities. This article describes the latest updates and additional tools added to the IMGT/HLA project.
Affiliation(s)
- James Robinson
- HLA Informatics Group, Anthony Nolan Research Institute, Royal Free Hospital, Pond Street, Hampstead, London NW3 2QG, UK
18
Malkaram SA, Hassan YI, Zempleni J. Online tools for bioinformatics analyses in nutrition sciences. Adv Nutr 2012; 3:654-65. [PMID: 22983844 PMCID: PMC3648747 DOI: 10.3945/an.112.002477] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Indexed: 01/07/2023]
Abstract
Recent advances in "omics" research have resulted in the creation of large datasets that were generated by consortiums and centers, small datasets that were generated by individual investigators, and bioinformatics tools for mining these datasets. It is important for nutrition laboratories to take full advantage of the analysis tools to interrogate datasets for information relevant to genomics, epigenomics, transcriptomics, proteomics, and metabolomics. This review provides guidance regarding bioinformatics resources that are currently available in the public domain, with the intent to provide a starting point for investigators who want to take advantage of the opportunities provided by the bioinformatics field.
Affiliation(s)
- Sridhar A. Malkaram
- Department of Nutrition and Health Sciences, University of Nebraska, Lincoln, Nebraska
- Yousef I. Hassan
- Nutrition and Food Science Department, Faculty of Health Sciences, University of Kalamoon, Deirattiah, Syria
- Janos Zempleni
- Department of Nutrition and Health Sciences, University of Nebraska, Lincoln, Nebraska (corresponding author)
19
Galperin MY, Fernández-Suárez XM. The 2012 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res 2011; 40:D1-8. [PMID: 22144685 PMCID: PMC3245068 DOI: 10.1093/nar/gkr1196] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.8] [Indexed: 11/13/2022]
Abstract
The 19th annual Database Issue of Nucleic Acids Research features descriptions of 92 new online databases covering various areas of molecular biology and 100 papers describing recent updates to the databases previously described in NAR and other journals. The highlights of this issue include, among others, a description of neXtProt, a knowledgebase on human proteins; a detailed explanation of the principles behind the NCBI Taxonomy Database; NCBI and EBI papers on the recently launched BioSample databases that store sample information for a variety of database resources; descriptions of the recent developments in the Gene Ontology and UniProt Gene Ontology Annotation projects; updates on Pfam, SMART and InterPro domain databases; update papers on KEGG and TAIR, two universally acclaimed databases that face an uncertain future; and a separate section with 10 wiki-based databases, introduced in an accompanying editorial. The NAR online Molecular Biology Database Collection, available at http://www.oxfordjournals.org/nar/database/a/, has been updated and now lists 1380 databases. Brief machine-readable descriptions of the databases featured in this issue, according to the BioDBcore standards, will be provided at the http://biosharing.org/biodbcore web site. The full content of the Database Issue is freely available online on the Nucleic Acids Research web site (http://nar.oxfordjournals.org/).
Affiliation(s)
- Michael Y Galperin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA