1. Jiang Y, Rex DA, Schuster D, Neely BA, Rosano GL, Volkmar N, Momenzadeh A, Peters-Clarke TM, Egbert SB, Kreimer S, Doud EH, Crook OM, Yadav AK, Vanuopadath M, Hegeman AD, Mayta M, Duboff AG, Riley NM, Moritz RL, Meyer JG. Comprehensive Overview of Bottom-Up Proteomics Using Mass Spectrometry. ACS Measurement Science Au 2024; 4:338-417. [PMID: 39193565 PMCID: PMC11348894 DOI: 10.1021/acsmeasuresciau.3c00068] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/15/2023] [Revised: 05/03/2024] [Accepted: 05/03/2024] [Indexed: 08/29/2024]
Abstract
Proteomics is the large scale study of protein structure and function from biological systems through protein identification and quantification. "Shotgun proteomics" or "bottom-up proteomics" is the prevailing strategy, in which proteins are hydrolyzed into peptides that are analyzed by mass spectrometry. Proteomics studies can be applied to diverse studies ranging from simple protein identification to studies of proteoforms, protein-protein interactions, protein structural alterations, absolute and relative protein quantification, post-translational modifications, and protein stability. To enable this range of different experiments, there are diverse strategies for proteome analysis. The nuances of how proteomic workflows differ may be challenging to understand for new practitioners. Here, we provide a comprehensive overview of different proteomics methods. We cover from biochemistry basics and protein extraction to biological interpretation and orthogonal validation. We expect this Review will serve as a handbook for researchers who are new to the field of bottom-up proteomics.
Affiliation(s)
- Yuming Jiang
  - Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Devasahayam Arokia Balaya Rex
  - Center for Systems Biology and Molecular Medicine, Yenepoya Research Centre, Yenepoya (Deemed to be University), Mangalore 575018, India
- Dina Schuster
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich 8093, Switzerland
  - Department of Biology, Institute of Molecular Biology and Biophysics, ETH Zurich, Zurich 8093, Switzerland
  - Laboratory of Biomolecular Research, Division of Biology and Chemistry, Paul Scherrer Institute, Villigen 5232, Switzerland
- Benjamin A. Neely
  - Chemical Sciences Division, National Institute of Standards and Technology, NIST, Charleston, South Carolina 29412, United States
- Germán L. Rosano
  - Mass Spectrometry Unit, Institute of Molecular and Cellular Biology of Rosario, Rosario, 2000 Argentina
- Norbert Volkmar
  - Department of Biology, Institute of Molecular Systems Biology, ETH Zurich, Zurich 8093, Switzerland
- Amanda Momenzadeh
  - Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Trenton M. Peters-Clarke
  - Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, California, 94158, United States
- Susan B. Egbert
  - Department of Chemistry, University of Manitoba, Winnipeg, Manitoba, R3T 2N2 Canada
- Simion Kreimer
  - Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
- Emma H. Doud
  - Center for Proteome Analysis, Indiana University School of Medicine, Indianapolis, Indiana, 46202-3082, United States
- Oliver M. Crook
  - Oxford Protein Informatics Group, Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
- Amit Kumar Yadav
  - Translational Health Science and Technology Institute, NCR Biotech Science Cluster 3rd Milestone Faridabad-Gurgaon Expressway, Faridabad, Haryana 121001, India
- Adrian D. Hegeman
  - Departments of Horticultural Science and Plant and Microbial Biology, University of Minnesota, Twin Cities, Minnesota 55108, United States
- Martín L. Mayta
  - School of Medicine and Health Sciences, Center for Health Sciences Research, Universidad Adventista del Plata, Libertador San Martin 3103, Argentina
  - Molecular Biology Department, School of Pharmacy and Biochemistry, Universidad Nacional de Rosario, Rosario 2000, Argentina
- Anna G. Duboff
  - Department of Chemistry, University of Washington, Seattle, Washington 98195, United States
- Nicholas M. Riley
  - Department of Chemistry, University of Washington, Seattle, Washington 98195, United States
- Robert L. Moritz
  - Institute for Systems Biology, Seattle, Washington 98109, United States
- Jesse G. Meyer
  - Department of Computational Biomedicine, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Smidt Heart Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
  - Advanced Clinical Biosystems Research Institute, Cedars Sinai Medical Center, Los Angeles, California 90048, United States
2. Hu Y, Schnaubelt M, Chen L, Zhang B, Hoang T, Lih TM, Zhang Z, Zhang H. MS-PyCloud: A Cloud Computing-Based Pipeline for Proteomic and Glycoproteomic Data Analyses. Anal Chem 2024; 96:10145-10151. [PMID: 38869158 DOI: 10.1021/acs.analchem.3c01497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2024]
Abstract
Rapid development and wide adoption of mass spectrometry-based glycoproteomic technologies have empowered scientists to study proteins and protein glycosylation in complex samples on a large scale. This progress has also created unprecedented challenges for individual laboratories to store, manage, and analyze proteomic and glycoproteomic data, both in the cost for proprietary software and high-performance computing and in the long processing time that discourages on-the-fly changes of data processing settings required in explorative and discovery analysis. We developed an open-source, cloud computing-based pipeline, MS-PyCloud, with graphical user interface (GUI), for proteomic and glycoproteomic data analysis. The major components of this pipeline include data file integrity validation, MS/MS database search for spectral assignments to peptide sequences, false discovery rate estimation, protein inference, quantitation of global protein levels, and specific glycan-modified glycopeptides as well as other modification-specific peptides such as phosphorylation, acetylation, and ubiquitination. To ensure the transparency and reproducibility of data analysis, MS-PyCloud includes open-source software tools with comprehensive testing and versioning for spectrum assignments. Leveraging public cloud computing infrastructure via Amazon Web Services (AWS), MS-PyCloud scales seamlessly based on analysis demand to achieve fast and efficient performance. Application of the pipeline to the analysis of large-scale LC-MS/MS data sets demonstrated the effectiveness and high performance of MS-PyCloud. The software can be downloaded at https://github.com/huizhanglab-jhu/ms-pycloud.
Affiliation(s)
- Yingwei Hu
  - Department of Pathology, School of Medicine, Johns Hopkins University, Baltimore, Maryland 21231, United States
- Michael Schnaubelt
  - Department of Pathology, School of Medicine, Johns Hopkins University, Baltimore, Maryland 21231, United States
- Li Chen
  - Department of Pathology, School of Medicine, Johns Hopkins University, Baltimore, Maryland 21231, United States
- Bai Zhang
  - Department of Pathology, School of Medicine, Johns Hopkins University, Baltimore, Maryland 21231, United States
- Trung Hoang
  - Department of Pathology, School of Medicine, Johns Hopkins University, Baltimore, Maryland 21231, United States
- T Mamie Lih
  - Department of Pathology, School of Medicine, Johns Hopkins University, Baltimore, Maryland 21231, United States
- Zhen Zhang
  - Department of Pathology, School of Medicine, Johns Hopkins University, Baltimore, Maryland 21231, United States
- Hui Zhang
  - Department of Pathology, School of Medicine, Johns Hopkins University, Baltimore, Maryland 21231, United States
3. Deutsch EW, Mendoza L, Shteynberg DD, Hoopmann MR, Sun Z, Eng JK, Moritz RL. Trans-Proteomic Pipeline: Robust Mass Spectrometry-Based Proteomics Data Analysis Suite. J Proteome Res 2023; 22:615-624. [PMID: 36648445 PMCID: PMC10166710 DOI: 10.1021/acs.jproteome.2c00624] [Citation(s) in RCA: 32] [Impact Index Per Article: 32.0] [Indexed: 01/18/2023]
Abstract
The Trans-Proteomic Pipeline (TPP) mass spectrometry data analysis suite has been in continual development and refinement since its first tools, PeptideProphet and ProteinProphet, were published 20 years ago. The current release provides a large complement of tools for spectrum processing, spectrum searching, search validation, abundance computation, protein inference, and more. Many of the tools include machine-learning modeling to extract the most information from data sets and build robust statistical models to compute the probabilities that derived information is correct. Here we present the latest information on the many TPP tools, and how TPP can be deployed on various platforms from personal Windows laptops to Linux clusters and expansive cloud computing environments. We describe tutorials on how to use TPP in a variety of ways and describe synergistic projects that leverage TPP. We conclude with plans for continued development of TPP.
Affiliation(s)
- Eric W Deutsch
  - Institute for Systems Biology, Seattle, Washington 98109, United States
- Luis Mendoza
  - Institute for Systems Biology, Seattle, Washington 98109, United States
- Zhi Sun
  - Institute for Systems Biology, Seattle, Washington 98109, United States
- Jimmy K Eng
  - Proteomics Resource, University of Washington, Seattle, Washington 98195, United States
- Robert L Moritz
  - Institute for Systems Biology, Seattle, Washington 98109, United States
4. Halder A, Verma A, Biswas D, Srivastava S. Recent advances in mass-spectrometry based proteomics software, tools and databases. Drug Discovery Today. Technologies 2021; 39:69-79. [PMID: 34906327 DOI: 10.1016/j.ddtec.2021.06.007] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 02/10/2021] [Revised: 05/08/2021] [Accepted: 06/21/2021] [Indexed: 01/12/2023]
Abstract
The field of proteomics immensely depends on data generation and data analysis which are thoroughly supported by software and databases. There has been a massive advancement in mass spectrometry-based proteomics over the last 10 years which has compelled the scientific community to upgrade or develop algorithms, tools, and repository databases in the field of proteomics. Several standalone software, and comprehensive databases have aided the establishment of integrated omics pipeline and meta-analysis workflow which has contributed to understand the disease pathobiology, biomarker discovery and predicting new therapeutic modalities. For shotgun proteomics where Data Dependent Acquisition is performed, several user-friendly software are developed that can analyse the pre-processed data to provide mechanistic insights of the disease. Likewise, in Data Independent Acquisition, pipelines are emerged which can accomplish the task from building the spectral library to identify the therapeutic targets. Furthermore, in the age of big data analysis the implications of machine learning and cloud computing are appending robustness, rapidness and in-depth proteomics data analysis. The current review talks about the recent advancement, and development of software, tools, and database in the field of mass-spectrometry based proteomics.
Affiliation(s)
- Ankit Halder
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Ayushi Verma
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Deeptarup Biswas
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
- Sanjeeva Srivastava
  - Department of Biosciences and Bioengineering, Indian Institute of Technology Bombay, Powai, Mumbai 400076, India
5. van Wijk KJ, Leppert T, Sun Q, Boguraev SS, Sun Z, Mendoza L, Deutsch EW. The Arabidopsis PeptideAtlas: Harnessing worldwide proteomics data to create a comprehensive community proteomics resource. The Plant Cell 2021; 33:3421-3453. [PMID: 34411258 PMCID: PMC8566204 DOI: 10.1093/plcell/koab211] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 05/03/2021] [Accepted: 08/13/2021] [Indexed: 05/02/2023]
Abstract
We developed a resource, the Arabidopsis PeptideAtlas (www.peptideatlas.org/builds/arabidopsis/), to solve central questions about the Arabidopsis thaliana proteome, such as the significance of protein splice forms and post-translational modifications (PTMs), or simply to obtain reliable information about specific proteins. PeptideAtlas is based on published mass spectrometry (MS) data collected through ProteomeXchange and reanalyzed through a uniform processing and metadata annotation pipeline. All matched MS-derived peptide data are linked to spectral, technical, and biological metadata. Nearly 40 million out of ∼143 million MS/MS (tandem MS) spectra were matched to the reference genome Araport11, identifying ∼0.5 million unique peptides and 17,858 uniquely identified proteins (only isoform per gene) at the highest confidence level (false discovery rate 0.0004; 2 non-nested peptides ≥9 amino acid each), assigned canonical proteins, and 3,543 lower-confidence proteins. Physicochemical protein properties were evaluated for targeted identification of unobserved proteins. Additional proteins and isoforms currently not in Araport11 were identified that were generated from pseudogenes, alternative start, stops, and/or splice variants, and small Open Reading Frames; these features should be considered when updating the Arabidopsis genome. Phosphorylation can be inspected through a sophisticated PTM viewer. PeptideAtlas is integrated with community resources including TAIR, tracks in JBrowse, PPDB, and UniProtKB. Subsequent PeptideAtlas builds will incorporate millions more MS/MS data.
Affiliation(s)
- Klaas J van Wijk
  - Section of Plant Biology, School of Integrative Plant Sciences (SIPS), Cornell University, Ithaca, New York 14853, USA
  - Authors for correspondence: (K.J.V.W.), (E.W.D.)
- Tami Leppert
  - Institute for Systems Biology (ISB), Seattle, Washington 98109, USA
- Qi Sun
  - Computational Biology Service Unit, Cornell University, Ithaca, New York 14853, USA
- Sascha S Boguraev
  - Section of Plant Biology, School of Integrative Plant Sciences (SIPS), Cornell University, Ithaca, New York 14853, USA
- Zhi Sun
  - Institute for Systems Biology (ISB), Seattle, Washington 98109, USA
- Luis Mendoza
  - Institute for Systems Biology (ISB), Seattle, Washington 98109, USA
- Eric W Deutsch
  - Institute for Systems Biology (ISB), Seattle, Washington 98109, USA
  - Authors for correspondence: (K.J.V.W.), (E.W.D.)
6. Neely BA. Cloudy with a Chance of Peptides: Accessibility, Scalability, and Reproducibility with Cloud-Hosted Environments. J Proteome Res 2021; 20:2076-2082. [PMID: 33513299 PMCID: PMC8637422 DOI: 10.1021/acs.jproteome.0c00920] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 01/22/2023]
Abstract
Cloud-hosted environments offer known benefits when computational needs outstrip affordable local workstations, enabling high-performance computation without a physical cluster. What has been less apparent, especially to novice users, is the transformative potential for cloud-hosted environments to bridge the digital divide that exists between poorly funded and well-resourced laboratories, and to empower modern research groups with remote personnel and trainees. Using cloud-based proteomic bioinformatic pipelines is not predicated on analyzing thousands of files, but instead can be used to improve accessibility during remote work, extreme weather, or working with under-resourced remote trainees. The general benefits of cloud-hosted environments also allow for scalability and encourage reproducibility. Since one possible hurdle to adoption is awareness, this paper is written with the nonexpert in mind. The benefits and possibilities of using a cloud-hosted environment are emphasized by describing how to setup an example workflow to analyze a previously published label-free data-dependent acquisition mass spectrometry data set of mammalian urine. Cost and time of analysis are compared using different computational tiers, and important practical considerations are described. Overall, cloud-hosted environments offer the potential to solve large computational problems, but more importantly can enable and accelerate research in smaller research groups with inadequate infrastructure and suboptimal local computational resources.
Affiliation(s)
- Benjamin A Neely
  - Chemical Sciences Division, National Institute of Standards and Technology, Charleston, South Carolina 29412, United States
7. Verheggen K, Raeder H, Berven FS, Martens L, Barsnes H, Vaudel M. Anatomy and evolution of database search engines-a central component of mass spectrometry based proteomic workflows. Mass Spectrometry Reviews 2020; 39:292-306. [PMID: 28902424 DOI: 10.1002/mas.21543] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Received: 04/06/2016] [Accepted: 07/05/2017] [Indexed: 06/07/2023]
Abstract
Sequence database search engines are bioinformatics algorithms that identify peptides from tandem mass spectra using a reference protein sequence database. Two decades of development, notably driven by advances in mass spectrometry, have provided scientists with more than 30 published search engines, each with its own properties. In this review, we present the common paradigm behind the different implementations, and its limitations for modern mass spectrometry datasets. We also detail how the search engines attempt to alleviate these limitations, and provide an overview of the different software frameworks available to the researcher. Finally, we highlight alternative approaches for the identification of proteomic mass spectrometry datasets, either as a replacement for, or as a complement to, sequence database search engines.
Affiliation(s)
- Kenneth Verheggen
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  - Department of Biochemistry, Ghent University, Ghent, Belgium
  - Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
- Helge Raeder
  - KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
  - Department of Pediatrics, Haukeland University Hospital, Bergen, Norway
- Frode S Berven
  - Proteomics Unit, Department of Biomedicine, University of Bergen, Norway
- Lennart Martens
  - VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
  - Department of Biochemistry, Ghent University, Ghent, Belgium
  - Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
- Harald Barsnes
  - KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
  - Proteomics Unit, Department of Biomedicine, University of Bergen, Norway
  - Computational Biology Unit, Department of Informatics, University of Bergen, Norway
- Marc Vaudel
  - KG Jebsen Center for Diabetes Research, Department of Clinical Science, University of Bergen, Norway
  - Proteomics Unit, Department of Biomedicine, University of Bergen, Norway
  - Center for Medical Genetics and Molecular Medicine, Haukeland University Hospital, Bergen, Norway
8. Liang X, Xia Z, Jian L, Wang Y, Niu X, Link AJ. A cost-sensitive online learning method for peptide identification. BMC Genomics 2020; 21:324. [PMID: 32334531 PMCID: PMC7183122 DOI: 10.1186/s12864-020-6693-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/27/2019] [Accepted: 03/24/2020] [Indexed: 11/22/2022] [Open Access]
Abstract
BACKGROUND: Post-database search is a key procedure in peptide identification with tandem mass spectrometry (MS/MS) strategies for refining peptide-spectrum matches (PSMs) generated by database search engines. Although many statistical and machine learning-based methods have been developed to improve the accuracy of peptide identification, the challenge remains on large-scale datasets and datasets with a distribution of unbalanced PSMs. A more efficient learning strategy is required for improving the accuracy of peptide identification on challenging datasets. While complex learning models have larger power of classification, they may cause overfitting problems and introduce computational complexity on large-scale datasets. Kernel methods map data from the sample space to high dimensional spaces where data relationships can be simplified for modeling.
RESULTS: In order to tackle the computational challenge of using the kernel-based learning model for practical peptide identification problems, we present an online learning algorithm, OLCS-Ranker, which iteratively feeds only one training sample into the learning model at each round, and, as a result, the memory requirement for computation is significantly reduced. Meanwhile, we propose a cost-sensitive learning model for OLCS-Ranker by using a larger loss of decoy PSMs than that of target PSMs in the loss function.
CONCLUSIONS: The new model can reduce its false discovery rate on datasets with a distribution of unbalanced PSMs. Experimental studies show that OLCS-Ranker outperforms other methods in terms of accuracy and stability, especially on datasets with a distribution of unbalanced PSMs. Furthermore, OLCS-Ranker is 15-85 times faster than CRanker.
Affiliation(s)
- Xijun Liang
  - College of Science, China University of Petroleum, Changjiang West Road, Qingdao, 266580 China
- Zhonghang Xia
  - School of Engineering and Applied Science, Western Kentucky University, Bowling Green, 42101 KY USA
- Ling Jian
  - School of Economics and Management, China University of Petroleum, Changjiang West Road, Qingdao, 266580 China
- Yongxiang Wang
  - College of Science, China University of Petroleum, Changjiang West Road, Qingdao, 266580 China
- Xinnan Niu
  - Dept. of Pathology, Microbiology and Immunology, Vanderbilt University School of Medicine, Nashville, 37232 TN USA
- Andrew J. Link
  - Dept. of Pathology, Microbiology and Immunology, Vanderbilt University School of Medicine, Nashville, 37232 TN USA
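Illustrative note on entry 8: the abstract describes two ideas, feeding one training sample per round (online learning) and penalizing decoy PSMs more heavily than target PSMs (cost-sensitive loss). The sketch below shows both ideas in their simplest form. It is not OLCS-Ranker: it swaps the kernel-based model for a plain linear model with a hinge loss, and the function names, the `decoy_cost` and `target_cost` weights, the learning rate, and the feature layout are all assumptions made for illustration.

```python
import random


def train_online_cost_sensitive(samples, n_features, decoy_cost=2.0,
                                target_cost=1.0, lr=0.1, epochs=5, seed=0):
    """Train a linear scorer one PSM at a time (hypothetical helper).

    samples: list of (features, label) with label +1 for a target PSM
    and -1 for a decoy PSM. Only one sample is processed per update,
    so memory use does not grow with the dataset size.
    """
    rng = random.Random(seed)
    w = [0.0] * n_features
    b = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = samples[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:
                # Hinge-loss update, scaled by a class-dependent cost:
                # mistakes on decoys are weighted more than on targets.
                cost = decoy_cost if y < 0 else target_cost
                step = lr * cost * y
                w = [wj + step * xj for wj, xj in zip(w, x)]
                b += step
    return w, b


def score(w, b, x):
    """Linear score; higher means more target-like."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b
```

With `decoy_cost` larger than `target_cost`, a misclassified decoy moves the weights further than a misclassified target, which is the general mechanism the abstract credits for lowering the false discovery rate on unbalanced data.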
9. Jin P, Lan J, Wang K, Baker MS, Huang C, Nice EC. Pathology, proteomics and the pathway to personalised medicine. Expert Rev Proteomics 2018; 15:231-243. [PMID: 29310484 DOI: 10.1080/14789450.2018.1425618] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 02/08/2023]
Affiliation(s)
- Ping Jin
  - Key Laboratory of Tropical Diseases and Translational Medicine of Ministry of Education & Department of Neurology, The Affiliated Hospital of Hainan Medical College, Haikou, P.R. China
- Jiang Lan
  - Key Laboratory of Tropical Diseases and Translational Medicine of Ministry of Education & Department of Neurology, The Affiliated Hospital of Hainan Medical College, Haikou, P.R. China
  - West China School of Basic Medical Sciences & Forensic Medicine, and State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University and Collaborative Innovation Center for Biotherapy, Chengdu, 610041, P.R. China
- Kui Wang
  - West China School of Basic Medical Sciences & Forensic Medicine, and State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University and Collaborative Innovation Center for Biotherapy, Chengdu, 610041, P.R. China
- Mark S. Baker
  - Department of Biomedical Sciences, Faculty of Medicine & Health Sciences, Macquarie University, Sydney, Australia
- Canhua Huang
  - Key Laboratory of Tropical Diseases and Translational Medicine of Ministry of Education & Department of Neurology, The Affiliated Hospital of Hainan Medical College, Haikou, P.R. China
  - West China School of Basic Medical Sciences & Forensic Medicine, and State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University and Collaborative Innovation Center for Biotherapy, Chengdu, 610041, P.R. China
- Edouard C. Nice
  - Department of Biochemistry and Molecular Biology, Monash University, Clayton, Australia and Visiting Professor, State Key Laboratory of Biotherapy and Cancer Center, West China Hospital, Sichuan University, and Collaborative Innovation Center for Biotherapy, Chengdu, 610041, P.R. China
10. Maabreh M, Qolomany B, Alsmadi I, Gupta A. Deep Learning-based MSMS Spectra Reduction in Support of Running Multiple Protein Search Engines on Cloud. Proceedings. IEEE International Conference on Bioinformatics and Biomedicine 2017; 2017:1909-1914. [PMID: 34430067 PMCID: PMC8382039 DOI: 10.1109/bibm.2017.8217951] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Abstract
The diversity of the available protein search engines with respect to the utilized matching algorithms, the low overlap ratios among their results and the disparity of their coverage encourage the community of proteomics to utilize ensemble solutions of different search engines. The advancing in cloud computing technology and the availability of distributed processing clusters can also provide support to this task. However, data transferring and results' combining, in this case, could be the major bottleneck. The flood of billions of observed mass spectra, hundreds of Gigabytes or potentially Terabytes of data, could easily cause the congestions, increase the risk of failure, poor performance, add more computations' cost, and waste available resources. Therefore, in this study, we propose a deep learning model in order to mitigate the traffic over cloud network and, thus reduce the cost of cloud computing. The model, which depends on the top 50 intensities and their m/z values of each spectrum, removes any spectrum which is predicted not to pass the majority voting of the participated search engines. Our results using three search engines namely: pFind, Comet and X!Tandem, and four different datasets are promising and promote the investment in deep learning to solve such type of Big data problems.
Affiliation(s)
- Majdi Maabreh
  - Department of Computer Science, Western Michigan University, Kalamazoo, MI, USA
- Basheer Qolomany
  - Department of Computer Science, Western Michigan University, Kalamazoo, MI, USA
- Izzat Alsmadi
  - Department of Computing and Cyber Security, Texas A&M University, San Antonio, TX, USA
- Ajay Gupta
  - Department of Computer Science, Western Michigan University, Kalamazoo, MI, USA
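Illustrative note on entry 10: the abstract states that the model input depends on the top 50 intensities of each spectrum and their m/z values. One way such a fixed-length input could be built is sketched below. The peak ordering, the normalization to the base peak, the zero padding for sparse spectra, and the function name are assumptions for illustration and are not taken from the paper.

```python
def top_k_peak_features(peaks, k=50):
    """Build a fixed-length feature vector from one spectrum (hypothetical helper).

    peaks: iterable of (mz, intensity) pairs.
    Returns a flat list of length 2*k: the m/z values of the k most
    intense peaks, then their intensities scaled to the largest peak.
    Spectra with fewer than k peaks are zero-padded (an assumption).
    """
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:k]
    max_intensity = top[0][1] if top else 0.0
    mzs = [mz for mz, _ in top]
    intensities = [
        inten / max_intensity if max_intensity > 0 else 0.0
        for _, inten in top
    ]
    pad = k - len(top)
    return mzs + [0.0] * pad + intensities + [0.0] * pad
```

A vector like this could then be passed to any classifier that predicts whether a spectrum is worth sending to the search engines; the classifier itself is outside the scope of this sketch.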
11. Towards a one-stop solution for large-scale proteomics data analysis. Science China-Life Sciences 2017; 61:351-354. [PMID: 28801860 DOI: 10.1007/s11427-017-9113-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/20/2017] [Accepted: 06/27/2017] [Indexed: 10/19/2022]
12.
Abstract
Protein identification from tandem mass spectra is one of the most versatile and widely used proteomics workflows, able to identify proteins, characterize post-translational modifications, and provide semiquantitative measurements of relative protein abundance. This manuscript describes the concepts, prerequisites, and methods required to analyze a tandem mass spectrometry dataset in order to identify its proteins, by using a tandem mass spectrometry search engine to search protein sequence databases. The discussion includes instructions for extraction, preparation, and formatting of spectral datafiles, selection of appropriate search parameter settings, and basic interpretation of the results.
Affiliation(s)
- Nathan J Edwards
  - Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, 20007, USA
13. May JC, McLean JA. Advanced Multidimensional Separations in Mass Spectrometry: Navigating the Big Data Deluge. Annual Review of Analytical Chemistry (Palo Alto, Calif.) 2016; 9:387-409. [PMID: 27306312 PMCID: PMC5763907 DOI: 10.1146/annurev-anchem-071015-041734] [Citation(s) in RCA: 60] [Impact Index Per Article: 7.5] [Indexed: 05/10/2023]
Abstract
Hybrid analytical instrumentation constructed around mass spectrometry (MS) is becoming the preferred technique for addressing many grand challenges in science and medicine. From the omics sciences to drug discovery and synthetic biology, multidimensional separations based on MS provide the high peak capacity and high measurement throughput necessary to obtain large-scale measurements used to infer systems-level information. In this article, we describe multidimensional MS configurations as technologies that are big data drivers and review some new and emerging strategies for mining information from large-scale datasets. We discuss the information content that can be obtained from individual dimensions, as well as the unique information that can be derived by comparing different levels of data. Finally, we summarize some emerging data visualization strategies that seek to make highly dimensional datasets both accessible and comprehensible.
Affiliation(s)
- Jody C May
  - Department of Chemistry, Center for Innovative Technology, Vanderbilt Institute for Chemical Biology, Vanderbilt Institute for Integrative Biosystems Research and Education, Vanderbilt University, Nashville, Tennessee 37235
- John A McLean
  - Department of Chemistry, Center for Innovative Technology, Vanderbilt Institute for Chemical Biology, Vanderbilt Institute for Integrative Biosystems Research and Education, Vanderbilt University, Nashville, Tennessee 37235
14. Dinov ID. Methodological challenges and analytic opportunities for modeling and interpreting Big Healthcare Data. Gigascience 2016; 5:12. [PMID: 26918190 PMCID: PMC4766610 DOI: 10.1186/s13742-016-0117-6] [Citation(s) in RCA: 59] [Impact Index Per Article: 7.4] [Received: 09/08/2015] [Accepted: 02/09/2016] [Indexed: 11/25/2022] [Open Access]
Abstract
Managing, processing and understanding big healthcare data is challenging, costly and demanding. Without a robust fundamental theory for representation, analysis and inference, a roadmap for uniform handling and analyzing of such complex data remains elusive. In this article, we outline various big data challenges, opportunities, modeling methods and software techniques for blending complex healthcare data, advanced analytic tools, and distributed scientific computing. Using imaging, genetic and healthcare data we provide examples of processing heterogeneous datasets using distributed cloud services, automated and semi-automated classification techniques, and open-science protocols. Despite substantial advances, new innovative technologies need to be developed that enhance, scale and optimize the management and processing of large, complex and heterogeneous data. Stakeholder investments in data acquisition, research and development, computational infrastructure and education will be critical to realize the huge potential of big data, to reap the expected information benefits and to build lasting knowledge assets. Multi-faceted proprietary, open-source, and community developments will be essential to enable broad, reliable, sustainable and efficient data-driven discovery and analytics. Big data will affect every sector of the economy and their hallmark will be 'team science'.
Affiliation(s)
- Ivo D. Dinov
- Statistics Online Computational Resource (SOCR), Health Behavior and Biological Sciences, Michigan Institute for Data Science, University of Michigan, 426 N. Ingalls, Ann Arbor, MI 48109, USA
15
Big Data in Plant Science: Resources and Data Mining Tools for Plant Genomics and Proteomics. Methods Mol Biol 2016; 1415:533-47. [PMID: 27115651 DOI: 10.1007/978-1-4939-3572-7_27] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 01/11/2023]
Abstract
In modern plant biology, progress is increasingly defined by scientists' ability to gather and analyze data sets of high volume and complexity, otherwise known as "big data". Arguably, the largest increase in the volume of plant data sets over the last decade is a consequence of the application of next-generation sequencing and mass-spectrometry technologies to the study of experimental model and crop plants. The increase in quantity and complexity of biological data brings challenges, mostly associated with data acquisition, processing, and sharing within the scientific community. Nonetheless, big data in plant science create unique opportunities to advance our understanding of complex biological processes at a level of accuracy without precedent, and establish a base for plant systems biology. In this chapter, we summarize the major drivers of big data in plant science and big data initiatives in life sciences with a focus on the scope and impact of iPlant, a representative cyberinfrastructure platform for plant science.
16
Toga AW, Foster I, Kesselman C, Madduri R, Chard K, Deutsch EW, Price ND, Glusman G, Heavner BD, Dinov ID, Ames J, Van Horn J, Kramer R, Hood L. Big biomedical data as the key resource for discovery science. J Am Med Inform Assoc 2015; 22:1126-31. [PMID: 26198305 PMCID: PMC5009918 DOI: 10.1093/jamia/ocv077] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 03/04/2015] [Revised: 05/07/2015] [Accepted: 05/15/2015] [Indexed: 12/19/2022] Open
Abstract
Modern biomedical data collection is generating exponentially more data in a multitude of formats. This flood of complex data poses significant opportunities to discover and understand the critical interplay among such diverse domains as genomics, proteomics, metabolomics, and phenomics, including imaging, biometrics, and clinical data. The Big Data for Discovery Science Center is taking an "-ome to home" approach to discover linkages between these disparate data sources by mining existing databases of proteomic and genomic data, brain images, and clinical assessments. In support of this work, the authors developed new technological capabilities that make it easy for researchers to manage, aggregate, manipulate, integrate, and model large amounts of distributed data. Guided by biological domain expertise, the Center's computational resources and software will reveal relationships and patterns, aiding researchers in identifying biomarkers for the most confounding conditions and diseases, such as Parkinson's and Alzheimer's.
Affiliation(s)
- Arthur W Toga
- Laboratory of Neuro Imaging, USC Stevens Neuroimaging and Informatics Institute, University of Southern California, Los Angeles, CA, USA
- Ian Foster
- Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL, USA
- Carl Kesselman
- Information Sciences Institute, University of Southern California, Los Angeles, CA, USA
- Ravi Madduri
- Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL, USA
- Kyle Chard
- Computation Institute, University of Chicago and Argonne National Laboratory, Chicago, IL, USA
- Ivo D Dinov
- Statistics Online Computational Resource (SOCR), UMSN, University of Michigan, Ann Arbor, MI, USA
- Joseph Ames
- Laboratory of Neuro Imaging, USC Stevens Neuroimaging and Informatics Institute, University of Southern California, Los Angeles, CA, USA
- John Van Horn
- Laboratory of Neuro Imaging, USC Stevens Neuroimaging and Informatics Institute, University of Southern California, Los Angeles, CA, USA
- Leroy Hood
- Institute for Systems Biology, Seattle, WA, USA
17
Deutsch EW, Mendoza L, Shteynberg D, Slagel J, Sun Z, Moritz RL. Trans-Proteomic Pipeline, a standardized data processing pipeline for large-scale reproducible proteomics informatics. Proteomics Clin Appl 2015; 9:745-54. [PMID: 25631240 PMCID: PMC4506239 DOI: 10.1002/prca.201400164] [Citation(s) in RCA: 250] [Impact Index Per Article: 27.8] [Received: 10/21/2014] [Revised: 12/19/2014] [Accepted: 01/27/2015] [Indexed: 11/11/2022]
Abstract
Democratization of genomics technologies has enabled the rapid determination of genotypes. More recently, the democratization of comprehensive proteomics technologies is enabling the determination of the cellular phenotype and the molecular events that define its dynamic state. Core proteomic technologies include MS to define protein sequence, protein:protein interactions, and protein PTMs. Key enabling technologies for proteomics are bioinformatic pipelines to identify, quantitate, and summarize these events. The Trans-Proteomic Pipeline (TPP) is a robust open-source standardized data processing pipeline for large-scale reproducible quantitative MS proteomics. It supports all major operating systems and instrument vendors via open data formats. Here, we provide a review of the overall proteomics workflow supported by the TPP, its major tools, and how it can be used in its various modes from desktop to cloud computing. We describe new features for the TPP, including data visualization functionality. We conclude by describing some common perils that affect the analysis of MS/MS datasets, as well as some major upcoming features.
Affiliation(s)
- Zhi Sun
- Institute for Systems Biology, Seattle, WA, USA