1
|
Teleman J, Dowsey AW, Gonzalez-Galarza FF, Perkins S, Pratt B, Röst HL, Malmström L, Malmström J, Jones AR, Deutsch EW, Levander F. Numerical compression schemes for proteomics mass spectrometry data. Mol Cell Proteomics 2014; 13:1537-42. [PMID: 24677029 DOI: 10.1074/mcp.o114.037879] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.6] [Indexed: 11/06/2022] Open
Abstract
The open XML format mzML, used for representation of MS data, is pivotal for the development of platform-independent MS analysis software. Although conversion from vendor formats to mzML must take place on a platform on which the vendor libraries are available (i.e. Windows), once mzML files have been generated, they can be used on any platform. However, the mzML format has turned out to be less efficient than vendor formats. In many cases, the naïve mzML representation is fourfold or even up to 18-fold larger compared with the original vendor file. In disk I/O limited setups, a larger data file also leads to longer processing times, which is a problem given the data production rates of modern mass spectrometers. In an attempt to reduce this problem, we here present a family of numerical compression algorithms called MS-Numpress, intended for efficient compression of MS data. To facilitate ease of adoption, the algorithms target the binary data in the mzML standard, and support in main proteomics tools is already available. Using a test set of 10 representative MS data files we demonstrate typical file size decreases of 90% when combined with traditional compression, as well as read time decreases of up to 50%. It is envisaged that these improvements will be beneficial for data handling within the MS community.
Affiliation(s)
- Johan Teleman
- Department of Immunotechnology, Lund University, Medicon Village building 406, 223 81 Lund, Sweden
- Andrew W Dowsey
- Institute of Human Development, Faculty of Medical and Human Sciences, University of Manchester, United Kingdom; Centre for Advanced Discovery and Experimental Therapeutics (CADET), University of Manchester and Central Manchester University Hospitals NHS Foundation Trust, Manchester Academic Health Sciences Centre, Oxford Road, Manchester M13 9WL, United Kingdom
- Simon Perkins
- Institute of Integrative Biology, University of Liverpool, Liverpool, L69 7ZB, United Kingdom
- Brian Pratt
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington, 98195, USA
- Hannes L Röst
- Department of Biology, Institute of Molecular Systems Biology, Eidgenössische Technische Hochschule Zürich, Wolfgang-Pauli Strasse 16, 8093 Zurich, Switzerland
- Lars Malmström
- Department of Biology, Institute of Molecular Systems Biology, Eidgenössische Technische Hochschule Zürich, Wolfgang-Pauli Strasse 16, 8093 Zurich, Switzerland
- Johan Malmström
- Department of Clinical Sciences, Faculty of Medicine, Lund University, SE-221 84 Lund, Sweden
- Andrew R Jones
- Institute of Integrative Biology, University of Liverpool, Liverpool, L69 7ZB, United Kingdom
- Eric W Deutsch
- Institute for Systems Biology, 401 Terry Avenue North, Seattle, Washington 98109, USA
- Fredrik Levander
- Department of Immunotechnology, Lund University, Medicon Village building 406, 223 81 Lund, Sweden; Bioinformatics Infrastructure for Life Sciences, Lund University, Sweden
|
2
|
Verheggen K, Barsnes H, Martens L. Distributed computing and data storage in proteomics: many hands make light work, and a stronger memory. Proteomics 2013; 14:367-77. [PMID: 24285552 DOI: 10.1002/pmic.201300288] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 07/12/2013] [Revised: 09/09/2013] [Accepted: 09/23/2013] [Indexed: 12/25/2022]
Abstract
Modern day proteomics generates ever more complex data, causing the requirements on the storage and processing of such data to outgrow the capacity of most desktop computers. To cope with the increased computational demands, distributed architectures have gained substantial popularity in the recent years. In this review, we provide an overview of the current techniques for distributed computing, along with examples of how the techniques are currently being employed in the field of proteomics. We thus underline the benefits of distributed computing in proteomics, while also pointing out the potential issues and pitfalls involved.
Affiliation(s)
- Kenneth Verheggen
- Department of Medical Protein Research, VIB, Ghent, Belgium; Department of Biochemistry, Faculty of Medicine and Health Sciences, Ghent University, Ghent, Belgium
|
3
|
Dowsey AW, English JA, Lisacek F, Morris JS, Yang GZ, Dunn MJ. Image analysis tools and emerging algorithms for expression proteomics. Proteomics 2010; 10:4226-57. [PMID: 21046614 PMCID: PMC3257807 DOI: 10.1002/pmic.200900635] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.7] [Received: 09/01/2009] [Accepted: 08/28/2010] [Indexed: 11/11/2022]
Abstract
Since their origins in academic endeavours in the 1970s, computational analysis tools have matured into a number of established commercial packages that underpin research in expression proteomics. In this paper we describe the image analysis pipeline for the established 2-DE technique of protein separation, and by first covering signal analysis for MS, we also explain the current image analysis workflow for the emerging high-throughput 'shotgun' proteomics platform of LC coupled to MS (LC/MS). The bioinformatics challenges for both methods are illustrated and compared, whereas existing commercial and academic packages and their workflows are described from both a user's and a technical perspective. Attention is given to the importance of sound statistical treatment of the resultant quantifications in the search for differential expression. Despite wide availability of proteomics software, a number of challenges have yet to be overcome regarding algorithm accuracy, objectivity and automation, generally due to deterministic spot-centric approaches that discard information early in the pipeline, propagating errors. We review recent advances in signal and image analysis algorithms in 2-DE, MS, LC/MS and Imaging MS. Particular attention is given to wavelet techniques, automated image-based alignment and differential analysis in 2-DE, Bayesian peak mixture models, and functional mixed modelling in MS, and group-wise consensus alignment methods for LC/MS.
Affiliation(s)
- Andrew W. Dowsey
- Institute of Biomedical Engineering, Imperial College London, South Kensington, London SW7 2AZ, U.K.
- Jane A. English
- Proteome Research Centre, UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Ireland
- Frederique Lisacek
- Proteome Informatics Group, Swiss Institute of Bioinformatics, CMU - 1, rue Michel Servet, CH-1211 Geneva, Switzerland
- Jeffrey S. Morris
- Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, Texas 77030-4009, U.S.A.
- Guang-Zhong Yang
- Institute of Biomedical Engineering, Imperial College London, South Kensington, London SW7 2AZ, U.K.
- Michael J. Dunn
- Proteome Research Centre, UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Ireland
|
4
|
Sobhani K. Urine proteomic analysis: use of two-dimensional gel electrophoresis, isotope coded affinity tags, and capillary electrophoresis. Methods Mol Biol 2010; 641:325-346. [PMID: 20407955 DOI: 10.1007/978-1-60761-711-2_18] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 05/29/2023]
Abstract
The identities and abundance levels of proteins excreted in urine are not only key indicators of diseases associated with renal function but are also indicators of the overall health of individuals. Urine specimens are readily available and provide a noninvasive means to assess and diagnose many disease states. Proteins in urine originate from two sources: the ultrafiltrate of plasma, and those that are shed from the urinary tract. The protein concentration in urine excreted from a normal adult is approximately 150 mg/day, and is typically not greater than 10 mg/100 mL in any single specimen. Following precipitation, concentration, and fractionation methods, proteins of interest from urine samples can be separated, identified, and quantified. One of the most commonly used techniques in the field of urine proteomics is gel electrophoresis followed by identification with mass spectrometry and protein database search algorithms. In this chapter, two-dimensional gel electrophoresis (2-DE) will be discussed, along with less frequently applied techniques, such as isotope coded affinity tags (ICAT) and capillary electrophoresis (CE). Publications discussing the application of these techniques to urine proteomic analyses of healthy individuals and urinary disease biomarker discovery will also be summarized.
Affiliation(s)
- Kimia Sobhani
- Department of Clinical Pathology, Cleveland Clinic, Cleveland, OH, USA.
|
5
|
Clark BN, Gutstein HB. The myth of automated, high-throughput two-dimensional gel analysis. Proteomics 2008; 8:1197-203. [PMID: 18283661 DOI: 10.1002/pmic.200700709] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.1] [Indexed: 11/11/2022]
Abstract
Many software packages have been developed to process and analyze 2-D gel images. Some programs have been touted as automated, high-throughput solutions. We tested five commercially available programs using 18 replicate gels of a rat brain protein extract. We determined computer processing time, approximate spot editing time, time required to correct spot mismatches, as well as total processing time. We also determined the number of spots automatically detected, number of spots kept after manual editing, and the percentage of automatically generated correct matches. We also determined the effect of increasing the number of replicate gels on spot matching efficiency for two of the programs. We found that for all programs tested, less than 3% of the total processing time was automated. The remainder of the time was spent in manual, subjective editing of detected spots and computer generated matches. Total processing time for 18 gels varied from 22 to 84 h. The percentage of correct matches generated automatically varied from 1 to 62%. Increasing the number of gels in an experiment dramatically reduced the percentage of automatically generated correct matches. Our results demonstrate that these 2-D gel analysis programs are not automatic or rapid, and also suggest that matching accuracy decreases as experiment size increases.
Affiliation(s)
- Brittan N Clark
- Department of Anesthesiology, The University of Texas, MD Anderson Cancer Center, Houston, TX 77030-4009, USA
|
6
|
Dowsey AW, Dunn MJ, Yang GZ. Automated image alignment for 2D gel electrophoresis in a high-throughput proteomics pipeline. Bioinformatics 2008; 24:950-7. [PMID: 18310057 DOI: 10.1093/bioinformatics/btn059] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.9] [Indexed: 11/13/2022]
Abstract
Motivation: The quest for high-throughput proteomics has revealed a number of challenges in recent years. Whilst substantial improvements in automated protein separation with liquid chromatography and mass spectrometry (LC/MS), aka 'shotgun' proteomics, have been achieved, large-scale open initiatives such as the Human Proteome Organization (HUPO) Brain Proteome Project have shown that maximal proteome coverage is only possible when LC/MS is complemented by 2D gel electrophoresis (2-DE) studies. Moreover, both separation methods require automated alignment and differential analysis to relieve the bioinformatics bottleneck and so make high-throughput protein biomarker discovery a reality. The purpose of this article is to describe a fully automatic image alignment framework for the integration of 2-DE into a high-throughput differential expression proteomics pipeline. Results: The proposed method is based on robust automated image normalization (RAIN) to circumvent the drawbacks of traditional approaches. These use symbolic representation at the very early stages of the analysis, which introduces persistent errors due to inaccuracies in modelling and alignment. In RAIN, a third-order volume-invariant B-spline model is incorporated into a multi-resolution schema to correct for geometric and expression inhomogeneity at multiple scales. The normalized images can then be compared directly in the image domain for quantitative differential analysis. Through evaluation against an existing state-of-the-art method on real and synthetically warped 2D gels, the proposed analysis framework demonstrates substantial improvements in matching accuracy and differential sensitivity. High-throughput analysis is established through an accelerated GPGPU (general purpose computation on graphics cards) implementation. Availability: Supplementary material, software and images used in the validation are available at http://www.proteomegrid.org/rain/.
Affiliation(s)
- Andrew W Dowsey
- Institute of Biomedical Engineering, Imperial College London, United Kingdom
|
7
|
Johnson CJ, Zhukovsky N, Cass AEG, Nagy JM. Proteomics, nanotechnology and molecular diagnostics. Proteomics 2008; 8:715-30. [DOI: 10.1002/pmic.200700665] [Citation(s) in RCA: 78] [Impact Index Per Article: 4.9] [Indexed: 12/31/2022]
|
8
|
Abstract
Proteomics holds the promise of evaluating global changes in protein expression and post-translational modification in response to environmental stimuli. However, difficulties in achieving cellular anatomic resolution and extracting specific types of proteins from cells have limited the efficacy of these techniques. Laser capture microdissection has provided a solution to the problem of anatomical resolution in tissues. New extraction methodologies have expanded the range of proteins identified in subsequent analyses. This review will examine the application of laser capture microdissection to proteomic tissue sampling, and subsequent extraction of these samples for differential expression analysis. Statistical and other quantitative issues important for the analysis of the highly complex datasets generated are also reviewed.
Affiliation(s)
- Howard B Gutstein
- MD Anderson Cancer Center, 1515 Holcombe Blvd, Box 110, Houston, TX 77030-4009, USA.
|
9
|
Lisacek F, Cohen-Boulakia S, Appel RD. Proteome informatics II: bioinformatics for comparative proteomics. Proteomics 2007; 6:5445-66. [PMID: 16991192 DOI: 10.1002/pmic.200600275] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Indexed: 02/04/2023]
Abstract
The present review attempts to cover the most recent initiatives directed towards representing, storing, displaying and processing protein-related data suited to undertake "comparative proteomics" studies. Data interpretation is brought into focus. Efforts invested into analysing and interpreting experimental data increasingly express the need for adding meaning. This trend is perceptible in work dedicated to determining ontologies, modelling interaction networks, etc. In parallel, technical advances in computer science are spurred by the development of the Web and the growing need to channel and understand massive volumes of data. Biology benefits from these advances as an application of choice for many generic solutions. Some examples of bioinformatics solutions are discussed and directions for on-going and future work conclude the review.
Affiliation(s)
- Frédérique Lisacek
- Proteome Informatics Group, Swiss Institute of Bioinformatics, Geneva, Switzerland.
|
10
|
Dowsey AW, English J, Pennington K, Cotter D, Stuehler K, Marcus K, Meyer HE, Dunn MJ, Yang GZ. Examination of 2-DE in the Human Proteome Organisation Brain Proteome Project pilot studies with the new RAIN gel matching technique. Proteomics 2006; 6:5030-47. [PMID: 16927431 DOI: 10.1002/pmic.200600152] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Indexed: 11/06/2022]
Abstract
The Human Proteome Organisation (HUPO) Brain Proteome Project (BPP) pilot studies have generated over 200 2-D gels from eight participating laboratories. This data includes 67 single-channel and 60 DIGE gels comparing 30 whole frozen C57/BL6 female mouse brains, ten each at embryonic day 16, postnatal day 7 (juvenile) and postnatal day 54-56 (adult); and ten single-channel and three DIGE gels comparing human epilepsy surgery of the temporal front lobe with a corresponding post-mortem specimen. The samples were generated centrally and distributed to the participating laboratories, but otherwise no restrictions were placed on sample preparation, running and staining protocols, nor on the 2-D gel analysis packages used. Spots were characterised by MS and the annotated gel images published on a ProteinScape web server. In order to examine the resultant differential expression and protein identifications, we have reprocessed a large subset of the gels using the newly developed RAIN (Robust Automated Image Normalisation) 2-D gel matching algorithm. Traditional approaches use symbolic representation of spots at the very early stages of the analysis, which introduces persistent errors due to inaccuracies in spot modelling and matching. With RAIN, image intensity distributions, rather than selected features, are used, where smooth geometric deformation and expression bias are modelled using multi-resolution image registration and bias-field correction. The method includes a new approach of volume-invariant warping which ensures the volume of protein expression under transformation is preserved. An image-based statistical expression analysis phase is then proposed, where small insignificant expression changes over one gel pair can be revealed when reinforced by the same consistent changes in others. 
Results of the proposed method as applied to the HUPO BPP data show significant intra-laboratory improvements in matching accuracy over a previous state-of-the-art technique, Multi-resolution Image Registration (MIR), and the commercial Progenesis PG240 package.
Affiliation(s)
- Andrew W Dowsey
- Royal Society / Wolfson Foundation Medical Image Computing Laboratory, Department of Computing, Imperial College London, UK
|
11
|
Abstract
The speed of the human genome project (Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C. et al., Nature 2001, 409, 860-921) was made possible, in part, by developments in automation of sequencing technologies. Before these technologies, sequencing was a laborious, expensive, and personnel-intensive task. Similarly, automation and robotics are changing the field of proteomics today. Proteomics is defined as the effort to understand and characterize proteins in the categories of structure, function and interaction (Englbrecht, C. C., Facius, A., Comb. Chem. High Throughput Screen. 2005, 8, 705-715). As such, this field nicely lends itself to automation technologies since these methods often require large economies of scale in order to achieve cost and time-saving benefits. This article describes some of the technologies and methods being applied in proteomics in order to facilitate automation within the field as well as in linking proteomics-based information with other related research areas.
Affiliation(s)
- Gil Alterovitz
- Division of Health Sciences and Technology, HST, Harvard Medical School and Massachusetts Institute of Technology, Boston, MA 02115, USA.
|
12
|
Kowalczewska M, Fenollar F, Lafitte D, Raoult D. Identification of candidate antigen in Whipple's disease using a serological proteomic approach. Proteomics 2006; 6:3294-305. [PMID: 16637011 DOI: 10.1002/pmic.200500171] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.8] [Indexed: 11/08/2022]
Abstract
Whipple's disease (WD) is a chronic multisystemic infection, caused by Tropheryma whipplei, a Gram-positive rod. Recently, a reliable method has been developed for cultivating T. whipplei in vitro. This together with the availability of complete genome sequence of T. whipplei prompted us to initiate proteome analysis of T. whipplei. The objective of the present study was to identify candidate proteins for serological diagnosis of WD. Immunoreactivities of sera collected from 18 patients with WD were compared with those of 24 control subjects who did not have WD. For this, we used 2-DE, immunoblotting, and MS. In total, we identified 23 candidate antigenic proteins. These included a subset of six proteins, each of which was found significantly more frequently in cases as compared to their controls. The remaining 17 proteins were found exclusively in cases. The methods we used in the current study enabled us to identify candidate antigens that, in our view, might be useful for serological diagnosis of WD.
|
13
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2005. [PMCID: PMC2447508 DOI: 10.1002/cfg.422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022] Open
|