1
|
Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004; 5:R80. [PMID: 15461798 PMCID: PMC545600 DOI: 10.1186/gb-2004-5-10-r80] [Citation(s) in RCA: 9574] [Impact Index Per Article: 455.9] [Received: 04/19/2004] [Revised: 07/01/2004] [Accepted: 08/03/2004] [Indexed: 12/12/2022]
Abstract
The Bioconductor project is an initiative for the collaborative creation of extensible software for computational biology and bioinformatics. The goals of the project include: fostering collaborative development and widespread use of innovative software, reducing barriers to entry into interdisciplinary scientific research, and promoting the achievement of remote reproducibility of research results. We describe details of our aims and methods, identify current challenges, compare Bioconductor to other open bioinformatics projects, and provide working examples.
|
research-article |
21 |
9574 |
2
|
Kong AT, Leprevost FV, Avtonomov DM, Mellacheruvu D, Nesvizhskii AI. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat Methods 2017; 14:513-520. [PMID: 28394336 PMCID: PMC5409104 DOI: 10.1038/nmeth.4256] [Citation(s) in RCA: 1287] [Impact Index Per Article: 160.9] [Received: 09/30/2016] [Accepted: 03/06/2017] [Indexed: 12/22/2022]
Abstract
There is a need to better understand and handle the 'dark matter' of proteomics: the vast diversity of post-translational and chemical modifications that are unaccounted for in a typical mass spectrometry-based analysis and thus remain unidentified. We present a fragment-ion indexing method, and its implementation in the peptide identification tool MSFragger, that enables a more than 100-fold improvement in speed over most existing proteome database search tools. Using several large proteomic data sets, we demonstrate how MSFragger empowers the open database search concept for comprehensive identification of peptides and all their modified forms, uncovering dramatic differences in modification rates across experimental samples and conditions. We further illustrate its utility using protein-RNA cross-linked peptide data and using affinity purification experiments where we observe, on average, a 300% increase in the number of identified spectra for enriched proteins. We also discuss the benefits of open searching for improved false discovery rate estimation in proteomics.
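The fragment-ion indexing idea can be illustrated with a minimal Python sketch. It is not MSFragger's implementation: the function names, the 0.02 bin width, the peptide labels, and the toy fragment masses are all assumptions made for illustration, and the theoretical fragment masses are assumed to have been computed beforehand. The point is only that each spectrum peak is looked up in an index of fragments, so no peptide is scored one by one.

```python
from collections import Counter, defaultdict


def build_fragment_index(theoretical_fragments, bin_width=0.02):
    """Map each binned fragment mass to the peptides that produce it.

    theoretical_fragments: dict of peptide label -> list of fragment masses
    (hypothetical, precomputed values; no chemistry is done here).
    """
    index = defaultdict(list)
    for peptide, masses in theoretical_fragments.items():
        for mass in masses:
            index[int(mass / bin_width)].append(peptide)
    return index


def count_matched_fragments(index, peaks, bin_width=0.02):
    """Count, per peptide, how many spectrum peaks hit one of its fragments.

    Neighbouring bins are also checked so that a peak close to a bin edge
    still finds its fragment. Each peak counts at most once per peptide.
    """
    counts = Counter()
    for mz in peaks:
        center = int(mz / bin_width)
        hits = set()
        for bin_id in (center - 1, center, center + 1):
            hits.update(index.get(bin_id, ()))
        counts.update(hits)
    return counts


# Toy data: two hypothetical peptides sharing one fragment mass.
fragments = {
    "PEPTIDE_A": [100.0, 200.0, 300.0],
    "PEPTIDE_B": [100.0, 250.0, 350.0],
}
fragment_index = build_fragment_index(fragments)
matches = count_matched_fragments(fragment_index, [100.005, 200.01, 300.0])
print(matches.most_common())
```

Because candidates are found by lookup rather than by restricting the precursor mass, a sketch like this makes no assumption about which modifications are present, which is the property an open search relies on.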
|
research-article |
8 |
1287 |
3
|
Jain M, Olsen HE, Paten B, Akeson M. The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community. Genome Biol 2016; 17:239. [PMID: 27887629 PMCID: PMC5124260 DOI: 10.1186/s13059-016-1103-0] [Citation(s) in RCA: 807] [Impact Index Per Article: 89.7] [Indexed: 12/15/2022]
Abstract
Nanopore DNA strand sequencing has emerged as a competitive, portable technology. Reads exceeding 150 kilobases have been achieved, as have in-field detection and analysis of clinical pathogens. We summarize key technical features of the Oxford Nanopore MinION, the dominant platform currently available. We then discuss pioneering applications executed by the genomics community.
|
Research Support, N.I.H., Extramural |
9 |
807 |
4
|
Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Güldener U, Mannhaupt G, Münsterkötter M, Mewes HW. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res 2004; 32:5539-45. [PMID: 15486203 PMCID: PMC524302 DOI: 10.1093/nar/gkh894] [Citation(s) in RCA: 769] [Impact Index Per Article: 36.6] [Indexed: 11/12/2022]
Abstract
In this paper, we present the Functional Catalogue (FunCat), a hierarchically structured, organism-independent, flexible and scalable controlled classification system enabling the functional description of proteins from any organism. FunCat has been applied for the manual annotation of prokaryotes, fungi, plants and animals. We describe how FunCat is implemented as a highly efficient and robust tool for the manual and automatic annotation of genomic sequences. Owing to its hierarchical architecture, FunCat has also proved to be useful for many subsequent downstream bioinformatic applications. This is illustrated by the analysis of large-scale experiments from various investigations in transcriptomics and proteomics, where FunCat was used to project experimental data into functional units, served as a 'gold standard' for functional classification methods, and was also used to compare the significance of different experimental methods. Over the last decade, FunCat has been established as a robust and stable annotation scheme that offers both a meaningful and manageable functional classification and ease of perception.
|
Research Support, Non-U.S. Gov't |
21 |
769 |
5
|
Khatri P, Drăghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005; 21:3587-95. [PMID: 15994189 PMCID: PMC2435250 DOI: 10.1093/bioinformatics/bti565] [Citation(s) in RCA: 564] [Impact Index Per Article: 28.2] [Indexed: 11/12/2022]
Abstract
Independent of the platform and the analysis methods used, the result of a microarray experiment is, in most cases, a list of differentially expressed genes. An automatic ontological analysis approach has been recently proposed to help with the biological interpretation of such results. Currently, this approach is the de facto standard for the secondary analysis of high throughput experiments and a large number of tools have been developed for this purpose. We present a detailed comparison of 14 such tools using the following criteria: scope of the analysis, visualization capabilities, statistical model(s) used, correction for multiple comparisons, reference microarrays available, installation issues and sources of annotation data. This detailed analysis of the capabilities of these tools will help researchers choose the most appropriate tool for a given type of analysis. More importantly, in spite of the fact that this type of analysis has been generally adopted, this approach has several important intrinsic drawbacks. These drawbacks are associated with all tools discussed and represent conceptual limitations of the current state-of-the-art in ontological analysis. We propose these as challenges for the next generation of secondary data analysis tools.
|
Research Support, U.S. Gov't, P.H.S. |
20 |
564 |
6
|
Abstract
SUMMARY Heterogeneity and genome search meta-analysis (HEGESMA) is a comprehensive software package for performing genome scan meta-analysis, a quantitative method to identify genetic regions (bins) with consistently increased linkage scores across multiple genome scans, and for testing the heterogeneity of the results of each bin across scans. The program outputs the average of ranks and three heterogeneity statistics, as well as the corresponding significance levels. Statistical inferences are based on Monte Carlo permutation tests. The program allows both unweighted and weighted analysis, with the weights for each study specified by the user. Furthermore, the program performs heterogeneity analyses restricted to the bins with similar average ranks. AVAILABILITY http://biomath.med.uth.gr.
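The rank-averaging and permutation steps described in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the HEGESMA program: the function names and the toy linkage scores are invented, ties are broken by bin order rather than averaged, and the three heterogeneity statistics are not reproduced.

```python
import random


def rank_bins(scores):
    """Rank bins within one scan; the highest linkage score gets the highest rank.

    Ties are broken by bin order (a simplification made for this sketch).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def average_ranks(rank_table, weights=None):
    """Weighted average rank per bin across scans (unweighted if weights is None)."""
    weights = weights or [1.0] * len(rank_table)
    total = sum(weights)
    n_bins = len(rank_table[0])
    return [
        sum(w * ranks[b] for w, ranks in zip(weights, rank_table)) / total
        for b in range(n_bins)
    ]


def permutation_p_values(rank_table, weights=None, n_perm=2000, seed=0):
    """Monte Carlo p-value per bin: how often a shuffled average rank is at
    least as high as the observed one when ranks are permuted within each scan."""
    rng = random.Random(seed)
    observed = average_ranks(rank_table, weights)
    exceed = [0] * len(observed)
    for _ in range(n_perm):
        shuffled = [rng.sample(ranks, len(ranks)) for ranks in rank_table]
        null = average_ranks(shuffled, weights)
        for b, (obs, nul) in enumerate(zip(observed, null)):
            if nul >= obs:
                exceed[b] += 1
    return [(count + 1) / (n_perm + 1) for count in exceed]


# Toy example: three scans, four bins; bin 1 scores highest in every scan.
scans = [
    [0.1, 2.5, 0.3, 0.2],
    [0.2, 3.1, 0.1, 0.4],
    [0.0, 1.9, 0.5, 0.3],
]
rank_table = [rank_bins(s) for s in scans]
avg = average_ranks(rank_table)
p_values = permutation_p_values(rank_table)
print(avg, p_values)
```

Passing a `weights` list (one value per scan) gives the weighted variant; how those weights should be chosen is left to the user, as in the abstract.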
|
Journal Article |
20 |
363 |
7
|
Eliceiri KW, Berthold MR, Goldberg IG, Ibáñez L, Manjunath B, Martone ME, Murphy RF, Peng H, Plant AL, Roysam B, Stuurman N, Swedlow JR, Tomancak P, Carpenter AE. Biological imaging software tools. Nat Methods 2012; 9:697-710. [PMID: 22743775 PMCID: PMC3659807 DOI: 10.1038/nmeth.2084] [Citation(s) in RCA: 351] [Impact Index Per Article: 27.0] [Indexed: 12/12/2022]
Abstract
Few technologies are more widespread in modern biological laboratories than imaging. Recent advances in optical technologies and instrumentation are providing hitherto unimagined capabilities. Almost all these advances have required the development of software to enable the acquisition, management, analysis and visualization of the imaging data. We review each computational step that biologists encounter when dealing with digital images, the inherent challenges and the overall status of available software for bioimage informatics, focusing on open-source options.
|
Research Support, N.I.H., Extramural |
13 |
351 |
8
|
Medema MH, Fischbach MA. Computational approaches to natural product discovery. Nat Chem Biol 2015; 11:639-48. [PMID: 26284671 PMCID: PMC5024737 DOI: 10.1038/nchembio.1884] [Citation(s) in RCA: 325] [Impact Index Per Article: 32.5] [Received: 03/30/2015] [Accepted: 07/07/2015] [Indexed: 01/13/2023]
Abstract
Starting with the earliest Streptomyces genome sequences, the promise of natural product genome mining has been captivating: genomics and bioinformatics would transform compound discovery from an ad hoc pursuit to a high-throughput endeavor. Until recently, however, genome mining has advanced natural product discovery only modestly. Here, we argue that the development of algorithms to mine the continuously increasing amounts of (meta)genomic data will enable the promise of genome mining to be realized. We review computational strategies that have been developed to identify biosynthetic gene clusters in genome sequences and predict the chemical structures of their products. We then discuss networking strategies that can systematize large volumes of genetic and chemical data and connect genomic information to metabolomic and phenotypic data. Finally, we provide a vision of what natural product discovery might look like in the future, specifically considering longstanding questions in microbial ecology regarding the roles of metabolites in interspecies interactions.
|
Research Support, N.I.H., Extramural |
10 |
325 |
9
|
Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L. The distributed annotation system. BMC Bioinformatics 2001; 2:7. [PMID: 11667947 PMCID: PMC58584 DOI: 10.1186/1471-2105-2-7] [Citation(s) in RCA: 271] [Impact Index Per Article: 11.3] [Received: 08/10/2001] [Accepted: 10/10/2001] [Indexed: 11/10/2022]
Abstract
BACKGROUND Currently, most genome annotation is curated by centralized groups with limited resources. Efforts to share annotations transparently among multiple groups have not yet been satisfactory. RESULTS Here we introduce a concept called the Distributed Annotation System (DAS). DAS allows sequence annotations to be decentralized among multiple third-party annotators and integrated on an as-needed basis by client-side software. The communication between client and servers in DAS is defined by the DAS XML specification. Annotations are displayed in layers, one per server. Any client or server adhering to the DAS XML specification can participate in the system; we describe a simple prototype client and server example. CONCLUSIONS The DAS specification is being used experimentally by Ensembl, WormBase, and the Berkeley Drosophila Genome Project. Continued success will depend on the readiness of the research community to adopt DAS and provide annotations. All components are freely available from the project website http://www.biodas.org/.
|
research-article |
24 |
271 |
10
|
Licursi V, Conte F, Fiscon G, Paci P. MIENTURNET: an interactive web tool for microRNA-target enrichment and network-based analysis. BMC Bioinformatics 2019; 20:545. [PMID: 31684860 PMCID: PMC6829817 DOI: 10.1186/s12859-019-3105-x] [Citation(s) in RCA: 256] [Impact Index Per Article: 42.7] [Received: 07/26/2019] [Accepted: 09/20/2019] [Indexed: 02/07/2023]
Abstract
BACKGROUND miRNAs regulate the expression of several genes, with one miRNA able to target multiple genes and one gene able to be targeted simultaneously by more than one miRNA. It has therefore become indispensable to shorten the long list of miRNA-target interactions and highlight the most relevant ones in order to gain insight into the regulatory mechanisms orchestrated by miRNAs in various cellular processes. A reasonable solution is to prioritize miRNA-target interactions to maximize the effectiveness of the downstream analysis. RESULTS We propose a new and easy-to-use web tool, MIENTURNET (MicroRNA ENrichment TURned NETwork), that takes as input a list of miRNAs or mRNAs and tackles the problem of prioritizing miRNA-target interactions by performing a statistical analysis followed by a fully featured network-based visualization and analysis. The statistical analysis assesses the significance of an over-representation of miRNA-target interactions, and MIENTURNET then filters the interactions based on the statistical significance associated with each one. In addition, the holistic approach of network theory is used to infer possible evidence of miRNA regulation by capturing emergent properties of the miRNA-target regulatory network that would not be evident through a pairwise analysis of the individual components. CONCLUSION MIENTURNET makes it possible to perform both statistical and network-based analyses consistently within a single tool, leading to a more effective prioritization of miRNA-target interactions. This can spare researchers without computational and informatics skills from navigating multiple websites, allowing them to investigate miRNA activity independently in every cellular process of interest in an easy and at the same time exhaustive way, thanks to the intuitive web interface.
The web application along with a well-documented and comprehensive user guide are freely available at http://userver.bio.uniroma1.it/apps/mienturnet/ without any login requirement.
|
research-article |
6 |
256 |
11
|
Kan Z, Rouchka EC, Gish WR, States DJ. Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res 2001; 11:889-900. [PMID: 11337482 PMCID: PMC311065 DOI: 10.1101/gr.155001] [Citation(s) in RCA: 255] [Impact Index Per Article: 10.6] [Indexed: 11/24/2022]
Abstract
With the availability of a nearly complete sequence of the human genome, aligning expressed sequence tags (ESTs) to the genomic sequence has become a practical and powerful strategy for gene prediction. Elucidating gene structure is a complex problem requiring the identification of splice junctions, gene boundaries, and alternative splicing variants. We have developed a software tool, Transcript Assembly Program (TAP), to delineate gene structures using genomically aligned EST sequences. TAP assembles the joint gene structure of the entire genomic region from individual splice junction pairs, using a novel algorithm that uses the EST-encoded connectivity and redundancy information to sort out the complex alternative splicing patterns. A method called polyadenylation site scan (PASS) has been developed to detect poly-A sites in the genome. TAP uses these predictions to identify gene boundaries by segmenting the joint gene structure at polyadenylated terminal exons. Reconstructing 1007 known transcripts, TAP scored a sensitivity (Sn) of 60% and a specificity (Sp) of 92% at the exon level. The gene boundary identification process was found to be accurate 78% of the time. TAP also reports alternative splicing patterns in EST alignments. An analysis of alternative splicing in 1124 genic regions suggested that more than half of human genes undergo alternative splicing. Surprisingly, we saw that an absolute majority of the detected alternative splicing events affect the coding region. Furthermore, the evolutionary conservation of alternative splicing between human and mouse was analyzed using an EST-based approach. (See http://stl.wustl.edu/~zkan/TAP/)
|
research-article |
24 |
255 |
12
|
Abstract
SUMMARY simuPOP is a forward-time population genetics simulation environment. The core of simuPOP is a scripting language (Python) that provides a large number of objects and functions to manipulate populations, and a mechanism to evolve populations forward in time. Using this R/Splus-like environment, users can create, manipulate and evolve populations interactively, or write a script and run it as a batch file. Owing to its flexible and extensible design, simuPOP can simulate large and complex evolutionary processes with ease. At a more user-friendly level, simuPOP provides an increasing number of built-in scripts that perform simulations ranging from implementations of basic population genetics models to the generation of datasets under complex evolutionary scenarios. AVAILABILITY simuPOP is freely available at http://simupop.sourceforge.net, distributed under the GPL license.
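To make "forward-time" concrete, the following plain-Python sketch evolves one biallelic locus under Wright-Fisher drift, generation by generation. It deliberately does not use the simuPOP API; the function name and parameters are assumptions for illustration, and the model ignores selection, mutation, and migration.

```python
import random


def wright_fisher(pop_size, init_freq, generations, seed=0):
    """Track an allele's frequency forward in time in a diploid population.

    Each generation, all 2 * pop_size gene copies are resampled from the
    previous generation's allele frequency (pure genetic drift).
    """
    rng = random.Random(seed)
    n_copies = 2 * pop_size
    count = round(init_freq * n_copies)
    trajectory = [count / n_copies]
    for _ in range(generations):
        freq = count / n_copies
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        trajectory.append(count / n_copies)
    return trajectory


# A small population drifts quickly; a fixed or lost allele stays that way.
drift = wright_fisher(pop_size=50, init_freq=0.5, generations=100)
print(drift[0], drift[-1])
```

Forward-time simulators follow every generation explicitly in this way, which is what allows more complex evolutionary scenarios to be layered onto the basic loop.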
|
|
20 |
213 |
13
|
Jiang W, Baker ML, Ludtke SJ, Chiu W. Bridging the information gap: computational tools for intermediate resolution structure interpretation. J Mol Biol 2001; 308:1033-44. [PMID: 11352589 DOI: 10.1006/jmbi.2001.4633] [Citation(s) in RCA: 211] [Impact Index Per Article: 8.8] [Indexed: 01/28/2023]
Abstract
Owing to their large size and complex nature, few large macromolecular complexes have been solved to atomic resolution. This has led to an under-representation of these structures, which are composed of novel and/or homologous folds, in the library of known structures and folds. While it is often difficult to achieve a high-resolution model for these structures, X-ray crystallography and electron cryomicroscopy are capable of determining structures of large assemblies at low to intermediate resolutions. To aid in the interpretation and analysis of such structures, we have developed two programs: helixhunter and foldhunter. Helixhunter is capable of reliably identifying helix position, orientation and length using a five-dimensional cross-correlation search of a three-dimensional density map followed by feature extraction. Helixhunter's results can in turn be used to probe a library of secondary structure elements derived from the structures in the Protein Data Bank (PDB). From this analysis, it is then possible to identify potential homologous folds or suggest novel folds based on the arrangement of alpha helix elements, resulting in a structure-based recognition of folds containing alpha helices. Foldhunter uses a six-dimensional cross-correlation search allowing a probe structure to be fitted within a region or component of a target structure. The structural fitting therefore provides a quantitative means to further examine the architecture and organization of large, complex assemblies. These two methods have been successfully tested with simulated structures modeled from the PDB at resolutions between 6 and 12 Å. With the integration of helixhunter and foldhunter into sequence and structural informatics techniques, we have the potential to deduce or confirm known or novel folds in domains or components within large complexes.
|
|
24 |
211 |
14
|
Hildebrand F, Tadeo R, Voigt AY, Bork P, Raes J. LotuS: an efficient and user-friendly OTU processing pipeline. Microbiome 2014; 2:30. [PMID: 27367037 PMCID: PMC4179863 DOI: 10.1186/2049-2618-2-30] [Citation(s) in RCA: 205] [Impact Index Per Article: 18.6] [Received: 04/01/2014] [Accepted: 08/23/2014] [Indexed: 05/05/2023]
Abstract
BACKGROUND 16S ribosomal DNA (rDNA) amplicon sequencing is frequently used to analyse the structure of bacterial communities from oceans to the human microbiota. However, computational power is still a major bottleneck in the analysis of continuously enlarging metagenomic data sets. Analysis is further complicated by the technical complexity of current bioinformatics tools. RESULTS Here we present the less operational taxonomic units scripts (LotuS), a fast and user-friendly open-source tool to calculate denoised, chimera-checked, operational taxonomic units (OTUs). These are the basis to generate taxonomic abundance tables and phylogenetic trees from multiplexed, next-generation sequencing data (454, Illumina MiSeq and HiSeq). LotuS is outstanding in its execution speed, as it can process 16S rDNA data up to two orders of magnitude faster than other existing pipelines. This is partly due to an included stand-alone, fast, simultaneous demultiplexer and quality filter written in C++, simple demultiplexer (sdm), which comes packaged with LotuS. Additionally, we sequenced two MiSeq runs of 40 technical replicates with the intent of validating future pipelines; these are made available in this work. CONCLUSION We show that LotuS analyses microbial 16S data with comparable or even better results than existing pipelines, requiring a fraction of the execution time and providing state-of-the-art denoising and phylogenetic reconstruction. LotuS is available through the following URL: http://psbweb05.psb.ugent.be/lotus .
|
product-review |
11 |
205 |
15
|
Ge H, Liu K, Juan T, Fang F, Newman M, Hoeck W. FusionMap: detecting fusion genes from next-generation sequencing data at base-pair resolution. Bioinformatics 2011; 27:1922-8. [PMID: 21593131 DOI: 10.1093/bioinformatics/btr310] [Citation(s) in RCA: 187] [Impact Index Per Article: 13.4] [Indexed: 12/25/2022]
Abstract
MOTIVATION Next generation sequencing technology generates high-throughput data, which allows us to detect fusion genes at both transcript and genomic levels. To detect fusion genes, the current bioinformatics tools heavily rely on paired-end approaches and overlook the importance of reads that span fusion junctions. Thus there is a need to develop an efficient aligner to detect fusion events by accurate mapping of these junction-spanning single reads, particularly when the read gets longer with the improvement in sequencing technology. RESULTS We present a novel method, FusionMap, which aligns fusion reads directly to the genome without prior knowledge of potential fusion regions. FusionMap can detect fusion events in both single- and paired-end datasets from either RNA-Seq or gDNA-Seq studies and characterize fusion junctions at base-pair resolution. We showed that FusionMap achieved high sensitivity and specificity in fusion detection on two simulated RNA-Seq datasets, which contained 75 nt paired-end reads. FusionMap achieved substantially higher sensitivity and specificity than the paired-end approach when the inner distance between read pairs was small. Using FusionMap to characterize fusion genes in K562 chronic myeloid leukemia cell line, we further demonstrated its accuracy in fusion detection in both single-end RNA-Seq and gDNA-Seq datasets. These combined results show that FusionMap provides an accurate and systematic solution to detecting fusion events through junction-spanning reads. AVAILABILITY FusionMap includes reference indexing, read filtering, fusion alignment and reporting in one package. The software is free for noncommercial use at (http://www.omicsoft.com/fusionmap).
|
Journal Article |
14 |
187 |
16
|
Aanensen DM, Huntley DM, Feil EJ, al-Own F, Spratt BG. EpiCollect: linking smartphones to web applications for epidemiology, ecology and community data collection. PLoS One 2009; 4:e6968. [PMID: 19756138 PMCID: PMC2735776 DOI: 10.1371/journal.pone.0006968] [Citation(s) in RCA: 174] [Impact Index Per Article: 10.9] [Received: 06/23/2009] [Accepted: 08/06/2009] [Indexed: 11/18/2022]
Abstract
Background Epidemiologists and ecologists often collect data in the field and, on returning to their laboratory, enter their data into a database for further analysis. The recent introduction of mobile phones that utilise the open source Android operating system, and which include (among other features) both GPS and Google Maps, provide new opportunities for developing mobile phone applications, which in conjunction with web applications, allow two-way communication between field workers and their project databases. Methodology Here we describe a generic framework, consisting of mobile phone software, EpiCollect, and a web application located within www.spatialepidemiology.net. Data collected by multiple field workers can be submitted by phone, together with GPS data, to a common web database and can be displayed and analysed, along with previously collected data, using Google Maps (or Google Earth). Similarly, data from the web database can be requested and displayed on the mobile phone, again using Google Maps. Data filtering options allow the display of data submitted by the individual field workers or, for example, those data within certain values of a measured variable or a time period. Conclusions Data collection frameworks utilising mobile phones with data submission to and from central databases are widely applicable and can give a field worker similar display and analysis tools on their mobile phone that they would have if viewing the data in their laboratory via the web. We demonstrate their utility for epidemiological data collection and display, and briefly discuss their application in ecological and community data collection. Furthermore, such frameworks offer great potential for recruiting ‘citizen scientists’ to contribute data easily to central databases through their mobile phone.
|
Research Support, Non-U.S. Gov't |
16 |
174 |
17
|
Zheng Y, Gao S, Padmanabhan C, Li R, Galvez M, Gutierrez D, Fuentes S, Ling KS, Kreuze J, Fei Z. VirusDetect: An automated pipeline for efficient virus discovery using deep sequencing of small RNAs. Virology 2016; 500:130-138. [PMID: 27825033 DOI: 10.1016/j.virol.2016.10.017] [Citation(s) in RCA: 141] [Impact Index Per Article: 15.7] [Received: 08/11/2016] [Revised: 10/17/2016] [Accepted: 10/20/2016] [Indexed: 12/14/2022]
Abstract
Accurate detection of viruses in plants and animals is critical for agriculture production and human health. Deep sequencing and assembly of virus-derived small interfering RNAs has proven to be a highly efficient approach for virus discovery. Here we present VirusDetect, a bioinformatics pipeline that can efficiently analyze large-scale small RNA (sRNA) datasets for both known and novel virus identification. VirusDetect performs both reference-guided assemblies through aligning sRNA sequences to a curated virus reference database and de novo assemblies of sRNA sequences with automated parameter optimization and the option of host sRNA subtraction. The assembled contigs are compared to a curated and classified reference virus database for known and novel virus identification, and evaluated for their sRNA size profiles to identify novel viruses. Extensive evaluations using plant and insect sRNA datasets suggest that VirusDetect is highly sensitive and efficient in identifying known and novel viruses. VirusDetect is freely available at http://bioinfo.bti.cornell.edu/tool/VirusDetect/.
|
Journal Article |
9 |
141 |
18
|
Guo J, Chen H, Sun Z, Lin Y. A novel method for protein secondary structure prediction using dual-layer SVM and profiles. Proteins 2004; 54:738-43. [PMID: 14997569 DOI: 10.1002/prot.10634] [Citation(s) in RCA: 137] [Impact Index Per Article: 6.5] [Indexed: 11/08/2022]
Abstract
A high-performance method was developed for protein secondary structure prediction based on the dual-layer support vector machine (SVM) and position-specific scoring matrices (PSSMs). SVM is a new machine learning technology that has been successfully applied in solving problems in the field of bioinformatics. The SVM's performance is usually better than that of traditional machine learning approaches. The performance was further improved by combining PSSM profiles with the SVM analysis. The PSSMs were generated from PSI-BLAST profiles, which contain important evolutionary information. The final prediction results were generated from the second SVM layer output. On the CB513 data set, the three-state overall per-residue accuracy, Q3, reached 75.2%, while segment overlap (SOV) accuracy increased to 80.0%. On the CB396 data set, the Q3 of our method reached 74.0% and the SOV reached 78.1%. A web server utilizing the method has been constructed and is available at http://www.bioinfo.tsinghua.edu.cn/pmsvm.
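The Q3 figure quoted above is the percentage of residues whose predicted state (helix, strand, or coil) matches the observed state. A minimal Python sketch follows; the function name, the H/E/C letter coding, and the example strings are assumptions for illustration, and the segment overlap (SOV) measure is not covered.

```python
def q3_accuracy(predicted, observed):
    """Three-state per-residue accuracy as a percentage.

    Both arguments are equal-length strings over the states H (helix),
    E (strand) and C (coil).
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Made-up eight-residue example with one misassigned residue.
score = q3_accuracy("HHHEECCC", "HHHEEECC")
print(score)  # 87.5
```

Q3 treats every residue independently, which is why segment-level measures such as SOV are usually reported alongside it.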
|
Research Support, Non-U.S. Gov't |
21 |
137 |
19
|
Parekh R, Ascoli GA. Neuronal morphology goes digital: a research hub for cellular and system neuroscience. Neuron 2013; 77:1017-38. [PMID: 23522039 PMCID: PMC3653619 DOI: 10.1016/j.neuron.2013.03.008] [Citation(s) in RCA: 132] [Impact Index Per Article: 11.0] [Accepted: 03/06/2013] [Indexed: 02/07/2023]
Abstract
The importance of neuronal morphology in brain function has been recognized for over a century. The broad applicability of "digital reconstructions" of neuron morphology across neuroscience subdisciplines has stimulated the rapid development of numerous synergistic tools for data acquisition, anatomical analysis, three-dimensional rendering, electrophysiological simulation, growth models, and data sharing. Here we discuss the processes of histological labeling, microscopic imaging, and semiautomated tracing. Moreover, we provide an annotated compilation of currently available resources in this rich research "ecosystem" as a central reference for experimental and computational neuroscience.
|
Research Support, N.I.H., Extramural |
12 |
132 |
20
|
Miho E, Yermanos A, Weber CR, Berger CT, Reddy ST, Greiff V. Computational Strategies for Dissecting the High-Dimensional Complexity of Adaptive Immune Repertoires. Front Immunol 2018; 9:224. [PMID: 29515569 PMCID: PMC5826328 DOI: 10.3389/fimmu.2018.00224] [Citation(s) in RCA: 128] [Impact Index Per Article: 18.3] [Received: 11/22/2017] [Accepted: 01/26/2018] [Indexed: 12/21/2022]
Abstract
The adaptive immune system recognizes antigens via an immense array of antigen-binding antibodies and T-cell receptors, the immune repertoire. The interrogation of immune repertoires is of high relevance for understanding the adaptive immune response in disease and infection (e.g., autoimmunity, cancer, HIV). Adaptive immune receptor repertoire sequencing (AIRR-seq) has driven the quantitative and molecular-level profiling of immune repertoires, thereby revealing the high-dimensional complexity of the immune receptor sequence landscape. Several methods for the computational and statistical analysis of large-scale AIRR-seq data have been developed to resolve immune repertoire complexity and to understand the dynamics of adaptive immunity. Here, we review the current research on (i) diversity, (ii) clustering and network, (iii) phylogenetic, and (iv) machine learning methods applied to dissect, quantify, and compare the architecture, evolution, and specificity of immune repertoires. We summarize outstanding questions in computational immunology and propose future directions for systems immunology toward coupling AIRR-seq with the computational discovery of immunotherapeutics, vaccines, and immunodiagnostics.
|
Review |
7 |
128 |
21
|
Zmasek CM, Eddy SR. RIO: analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics 2002; 3:14. [PMID: 12028595 PMCID: PMC116988 DOI: 10.1186/1471-2105-3-14] [Citation(s) in RCA: 125] [Impact Index Per Article: 5.4] [Received: 03/14/2002] [Accepted: 05/16/2002] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND: When analyzing protein sequences using sequence similarity searches, orthologous sequences (that diverged by speciation) are more reliable predictors of a new protein's function than paralogous sequences (that diverged by gene duplication). The utility of phylogenetic information in high-throughput genome annotation ("phylogenomics") is widely recognized, but existing approaches are either manual or not explicitly based on phylogenetic trees. RESULTS: Here we present RIO (Resampled Inference of Orthologs), a procedure for automated phylogenomics using explicit phylogenetic inference. RIO analyses are performed over bootstrap resampled phylogenetic trees to estimate the reliability of orthology assignments. We also introduce supplementary concepts that are helpful for functional inference. RIO has been implemented as a Perl pipeline connecting several C and Java programs. It is available at http://www.genetics.wustl.edu/eddy/forester/. A web server is at http://www.rio.wustl.edu/. RIO was tested on the Arabidopsis thaliana and Caenorhabditis elegans proteomes. CONCLUSION: The RIO procedure is particularly useful for the automated detection of first representatives of novel protein subfamilies. We also describe how some orthologies can be misleading for functional inference.
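The reliability estimate described above can be illustrated as the fraction of bootstrap trees in which a sequence is called an ortholog of the query. The following is a minimal Python sketch, assuming the per-tree orthology calls have already been inferred and are supplied as sets of sequence IDs. The function `orthology_support` and the IDs are hypothetical, and this is not RIO's actual implementation.

```python
from collections import Counter


def orthology_support(replicate_calls):
    """Map each sequence ID to the fraction of bootstrap replicates
    in which it was called orthologous to the query.

    replicate_calls: list of sets, one set of sequence IDs per bootstrap tree
    (assumed input format, for illustration only).
    """
    if not replicate_calls:
        raise ValueError("at least one bootstrap replicate is required")
    counts = Counter()
    for calls in replicate_calls:
        counts.update(set(calls))  # count each ID at most once per replicate
    n = len(replicate_calls)
    return {seq_id: count / n for seq_id, count in counts.items()}


# Illustrative (made-up) calls from four bootstrap trees.
replicates = [{"seqA", "seqB"}, {"seqA"}, {"seqA", "seqC"}, {"seqA", "seqB"}]
support = orthology_support(replicates)
print(support["seqA"], support["seqB"], support["seqC"])  # 1.0 0.5 0.25
```

A sequence with support near 1.0 is called an ortholog in almost every resampled tree, while low values flag assignments that depend on weakly supported branches.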
|
research-article |
23 |
125 |
22
|
Zhao YY, Wu SP, Liu S, Zhang Y, Lin RC. Ultra-performance liquid chromatography-mass spectrometry as a sensitive and powerful technology in lipidomic applications. Chem Biol Interact 2014; 220:181-192. [PMID: 25014415 DOI: 10.1016/j.cbi.2014.06.029] [Citation(s) in RCA: 125] [Impact Index Per Article: 11.4] [Received: 04/15/2014] [Revised: 05/31/2014] [Accepted: 06/30/2014] [Indexed: 11/15/2022]
Abstract
Lipidomics, the comprehensive illumination of lipid-based information in biological systems, involves identifying lipids and profiling lipids and lipid-derived mediators. The development of lipidomics enables the characterization of lipid species and detailed lipid profiling in body fluids, tissues, or cells, and allows for a wider understanding of the biological roles of lipid networks. Lipidomic research has been greatly facilitated by recent advances in ultra-performance liquid chromatography-mass spectrometry (UPLC-MS) and in the associated lipid extraction, lipid identification, and data analysis, supporting applications in the qualitative and quantitative assessment of multiple lipid species. UPLC techniques, different mass spectrometry techniques, lipid extraction, and data analysis in lipidomics are reviewed. Afterwards, examples are provided of the use of UPLC-MS for finding lipid biomarkers in the disease, drug, food, nutrition, and plant fields. We also discuss future perspectives for UPLC-MS-based lipidomics and their potential problems.
|
Review |
11 |
125 |
23
|
Leggett RM, Clark MD. A world of opportunities with nanopore sequencing. J Exp Bot 2017; 68:5419-5429. [PMID: 28992056 DOI: 10.1093/jxb/erx289] [Citation(s) in RCA: 118] [Impact Index Per Article: 14.8] [Indexed: 05/13/2023]
Abstract
Oxford Nanopore Technologies' MinION sequencer was launched in pre-release form in 2014 and represents an exciting new sequencing paradigm. The device offers multi-kilobase reads and a streamed mode of operation that allows processing of reads as they are generated. Crucially, it is an extremely compact device that is powered from the USB port of a laptop computer, enabling it to be taken out of the lab and allowing previously impossible in-field sequencing experiments to be undertaken. Many of the initial publications concerning the platform focused on providing tools to access and analyse the new sequence formats and then on demonstrating the assembly of microbial genomes. More recently, as throughput and accuracy have increased, it has been possible to begin work involving more complex genomes and metagenomes. With the release of the high-throughput GridION X5 and PromethION platforms, the sequencing of large genomes will become more cost efficient, and extremely long (>100 kb) reads can be leveraged for the resolution of complex genomic structures. This review provides a brief overview of nanopore sequencing technology, describes the growing range of nanopore bioinformatics tools, and highlights some of the most influential publications that have emerged over the last 2 years. Finally, we look to the future and the potential the platform has to disrupt work in human, microbiome, and plant genomics.
|
Review |
8 |
118 |
24
|
Abstract
MOTIVATION: Computational gene finding systems play an important role in finding new human genes, although no systems are yet accurate enough to predict all or even most protein-coding regions perfectly. Ab initio programs can be augmented by evidence such as expression data or protein sequence homology, which improves their performance. The amount of such evidence continues to grow, but computational methods continue to have difficulty predicting genes when the evidence is conflicting or incomplete. Genome annotation pipelines collect a variety of types of evidence about gene structure and synthesize the results, which can then be refined further through manual, expert curation of gene models. RESULTS: JIGSAW is a new gene finding system designed to automate the process of predicting gene structure from multiple sources of evidence, with results that often match the performance of human curators. JIGSAW computes the relative weight of different lines of evidence using statistics generated from a training set, and then combines the evidence using dynamic programming. Our results show that JIGSAW's performance is superior to ab initio gene finding methods and to other pipelines such as Ensembl. Even without evidence from alignment to known genes, JIGSAW can substantially improve gene prediction accuracy as compared with existing methods. AVAILABILITY: JIGSAW is available as an open source software package at http://cbcb.umd.edu/software/jigsaw.
MESH Headings
- Algorithms
- Animals
- Codon
- Computational Biology/instrumentation
- Computational Biology/methods
- DNA, Complementary/metabolism
- Databases, Factual
- Databases, Genetic
- Gene Expression Profiling
- Genes, Fungal
- Genes, Plant
- Genome, Human
- Humans
- Introns
- Markov Chains
- Models, Genetic
- Models, Statistical
- Open Reading Frames
- Proteins/chemistry
- Sequence Alignment
- Sequence Analysis, DNA
- Sequence Analysis, Protein
- Software
- Software Validation
|
Research Support, U.S. Gov't, P.H.S. |
20 |
97 |
25
|
Dutheil J, Gaillard S, Bazin E, Glémin S, Ranwez V, Galtier N, Belkhir K. Bio++: a set of C++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC Bioinformatics 2006; 7:188. [PMID: 16594991 PMCID: PMC1501049 DOI: 10.1186/1471-2105-7-188] [Citation(s) in RCA: 92] [Impact Index Per Article: 4.8] [Received: 12/06/2005] [Accepted: 04/04/2006] [Indexed: 11/10/2022] Open
Abstract
Background: A large number of bioinformatics applications in the fields of bio-sequence analysis, molecular evolution and population genetics typically share input/output methods, data storage requirements and data analysis algorithms. Such common features may be conveniently bundled into re-usable libraries, which enable the rapid development of new methods and robust applications. Results: We present Bio++, a set of object-oriented libraries written in C++. Available components include classes for data storage and handling (nucleotide/amino-acid/codon sequences, trees, distance matrices, population genetics datasets), various input/output formats, basic sequence manipulation (concatenation, transcription, translation, etc.), phylogenetic analysis (maximum parsimony, Markov models, distance methods, likelihood computation and maximization), population genetics/genomics (diversity statistics, neutrality tests, various multi-locus analyses) and various algorithms for numerical calculus. Conclusion: The implementation of methods aims at being both efficient and user-friendly. Special attention was given to the library design to enable easy extension and the development of new methods. We defined a general hierarchy of classes that allows developers to implement their own algorithms while remaining compatible with the rest of the libraries. Bio++ source code is distributed free of charge under the CeCILL general public licence from its website.
|
Research Support, Non-U.S. Gov't |
19 |
92 |