1
Zhang D, Dai ZY, Sun XP, Wu XT, Li H, Tang L, He JH. A distributed data processing scheme based on Hadoop for synchrotron radiation experiments. J Synchrotron Radiat 2024; 31:635-645. [PMID: 38656774 DOI: 10.1107/s1600577524002637] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Accepted: 03/20/2024] [Indexed: 04/26/2024]
Abstract
With the development of synchrotron radiation sources and high-frame-rate detectors, the amount of experimental data collected at synchrotron radiation beamlines has increased exponentially. As a result, data processing for synchrotron radiation experiments has entered the era of big data. It is becoming increasingly important for beamlines to have the capability to process large-scale data in parallel to keep up with the rapid growth of data. Currently, there is no set of data processing solutions based on the big data technology framework for beamlines. Apache Hadoop is a widely used distributed system architecture for solving the problem of massive data storage and computation. This paper presents a set of distributed data processing schemes for beamlines with experimental data using Hadoop. The Hadoop Distributed File System is utilized as the distributed file storage system, and Hadoop YARN serves as the resource scheduler for the distributed computing cluster. A distributed data processing pipeline that can carry out massively parallel computation is designed and developed using Hadoop Spark. The entire data processing platform adopts a distributed microservice architecture, which makes the system easy to expand, reduces module coupling and improves reliability.
Affiliation(s)
- Ding Zhang
  - The Institute for Advanced Studies, Wuhan University, Wuhan 430072, People's Republic of China
- Ze Yi Dai
  - The Institute for Advanced Studies, Wuhan University, Wuhan 430072, People's Republic of China
- Xue Ping Sun
  - The Institute for Advanced Studies, Wuhan University, Wuhan 430072, People's Republic of China
- Xue Ting Wu
  - The Institute for Advanced Studies, Wuhan University, Wuhan 430072, People's Republic of China
- Hui Li
  - The Institute for Advanced Studies, Wuhan University, Wuhan 430072, People's Republic of China
- Lin Tang
  - The Institute for Advanced Studies, Wuhan University, Wuhan 430072, People's Republic of China
- Jian Hua He
  - The Institute for Advanced Studies, Wuhan University, Wuhan 430072, People's Republic of China
2
Usón I, Sheldrick GM. Modes and model building in SHELXE. Acta Crystallogr D Struct Biol 2024; 80:4-15. [PMID: 38088896 PMCID: PMC10833347 DOI: 10.1107/s2059798323010082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2023] [Accepted: 11/21/2023] [Indexed: 01/12/2024] Open
Abstract
Density modification is a standard step to provide a route for routine structure solution by any experimental phasing method, with single-wavelength or multi-wavelength anomalous diffraction being the most popular methods, as well as to extend fragments or incomplete models into a full solution. The effect of density modification on the starting maps from either source is illustrated in the case of SHELXE. The different modes in which the program can run are reviewed; these include less well known uses such as reading external phase values and weights or phase distributions encoded in Hendrickson-Lattman coefficients. Typically in SHELXE, initial phases are calculated from experimental data, from a partial model or map, or from a combination of both sources. The initial phase set is improved and extended by density modification and, if the resolution of the data and the type of structure permits, polyalanine tracing. As a feature to systematically eliminate model bias from phases derived from predicted models, the trace can be set to exclude the area occupied by the starting model. The trace now includes an extension into the gamma position or hydrophobic and aromatic side chains if a sequence is provided, which is performed in every tracing cycle. Once a correlation coefficient of over 30% between the structure factors calculated from such a trace and the native data indicates that the structure has been solved, the sequence is docked in all model-building cycles and side chains are fitted if the map supports it. The extensions to the tracing algorithm brought in to provide a complete model are discussed. The improvement in phasing performance is assessed using a set of tests.
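The 30% correlation-coefficient criterion mentioned in this abstract can be illustrated with a minimal sketch. It assumes a plain Pearson correlation between two lists of amplitudes; the abstract does not give the exact formula or weighting that SHELXE uses, so the function names `pearson_cc` and `looks_solved` and the input values are hypothetical and for illustration only.

```python
import math


def pearson_cc(calc, obs):
    """Pearson correlation between calculated and observed amplitudes.

    Assumption: an unweighted Pearson coefficient, used here only to
    illustrate the idea of a correlation-based success criterion.
    """
    if len(calc) != len(obs) or len(calc) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(calc)
    mean_c = sum(calc) / n
    mean_o = sum(obs) / n
    cov = sum((c - mean_c) * (o - mean_o) for c, o in zip(calc, obs))
    var_c = sum((c - mean_c) ** 2 for c in calc)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    if var_c == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_c * var_o)


def looks_solved(calc, obs, threshold=0.30):
    """Return True when the correlation exceeds the threshold (30% by default)."""
    return pearson_cc(calc, obs) > threshold
```

For example, `looks_solved([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])` returns `True` because the two lists are perfectly correlated, whereas anticorrelated lists fall below the threshold.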
Affiliation(s)
- Isabel Usón
  - ICREA, Institució Catalana de Recerca i Estudis Avançats, Passeig Lluís Companys, 23, Barcelona, E-08003, Spain
  - Crystallographic Methods, Institute of Molecular Biology of Barcelona (IBMB-CSIC), Barcelona Science Park, Helix Building, Baldiri Reixach, 15, Barcelona, 08028, Spain
- George M. Sheldrick
  - Department of Structural Chemistry, Georg-August Universität Göttingen, Tammannstrasse 4, 37077 Göttingen, Germany
3
Thorn A. Artificial intelligence in the experimental determination and prediction of macromolecular structures. Curr Opin Struct Biol 2022; 74:102368. [DOI: 10.1016/j.sbi.2022.102368] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 08/11/2021] [Revised: 02/22/2022] [Accepted: 03/08/2022] [Indexed: 11/26/2022]
4
Helliwell JR. Pre- and Post-publication Verification for Reproducible Data Mining in Macromolecular Crystallography. Methods Mol Biol 2022; 2449:235-261. [PMID: 35507266 DOI: 10.1007/978-1-0716-2095-3_10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 06/14/2023]
Abstract
Like an article narrative is deemed by an editor and referees to be worthy of being a version of record on acceptance as a publication, so must the underpinning data also be scrutinized before passing it as a version of record. Indeed without the underpinning data, a study and its conclusions cannot be reproduced at any stage of evaluation, pre- or post-publication. Likewise, an independent study without its own underpinning data also cannot be reproduced let alone be considered a replicate of the first study. The PDB is a modern marvel of achievement providing an organized open access to depositor and user of the data held there opening numerous applications. Methods for modeling protein structures and for determination of structures are still improving their precision, and artifacts of the method exist. So their accuracy is realized if they are reproduced by other methods. It is on such foundations that reproducible data mining is based. Data rates are expanding considerably be they at synchrotrons, the X-ray free electron lasers (XFELs), electron cryomicroscopes (cryoEM), or at the neutron facilities. The work of a person as a referee or user with a narrative and its underpinning data may well be complemented in future by artificial intelligence with machine learning, the former for specific refereeing and the latter for the more general validation, both ideally before publication. Examples are described involving rhenium theranostics, the anti-cancer platins and the SARS-CoV-2 main protease.
Affiliation(s)
- John R Helliwell
  - Department of Chemistry, University of Manchester, Manchester, UK
5
Alharbi E, Bond P, Calinescu R, Cowtan K. Predicting the performance of automated crystallographic model-building pipelines. Acta Crystallogr D Struct Biol 2021; 77:1591-1601. [PMID: 34866614 PMCID: PMC8647178 DOI: 10.1107/s2059798321010500] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2021] [Accepted: 10/10/2021] [Indexed: 12/02/2022] Open
Abstract
Proteins are macromolecules that perform essential biological functions which depend on their three-dimensional structure. Determining this structure involves complex laboratory and computational work. For the computational work, multiple software pipelines have been developed to build models of the protein structure from crystallographic data. Each of these pipelines performs differently depending on the characteristics of the electron-density map received as input. Identifying the best pipeline to use for a protein structure is difficult, as the pipeline performance differs significantly from one protein structure to another. As such, researchers often select pipelines that do not produce the best possible protein models from the available data. Here, a software tool is introduced which predicts key quality measures of the protein structures that a range of pipelines would generate if supplied with a given crystallographic data set. These measures are crystallographic quality-of-fit indicators based on included and withheld observations, and structure completeness. Extensive experiments carried out using over 2500 data sets show that the tool yields accurate predictions for both experimental phasing data sets (at resolutions between 1.2 and 4.0 Å) and molecular-replacement data sets (at resolutions between 1.0 and 3.5 Å). The tool can therefore provide a recommendation to the user concerning the pipelines that should be run in order to proceed most efficiently to a depositable model.
Affiliation(s)
- Emad Alharbi
  - Department of Computer Science, University of York, Heslington, York YO10 5GH, United Kingdom
  - Department of Information Technology, University of Tabuk, Tabuk, Saudi Arabia
- Paul Bond
  - Department of Chemistry, University of York, Heslington, York YO10 5DD, United Kingdom
- Radu Calinescu
  - Department of Computer Science, University of York, Heslington, York YO10 5GH, United Kingdom
- Kevin Cowtan
  - Department of Chemistry, University of York, Heslington, York YO10 5DD, United Kingdom
6
Affiliation(s)
- Melanie Vollmar
  - Diamond Light Source Ltd., Harwell Science & Innovation Campus, Didcot, UK
- Gwyndaf Evans
  - Diamond Light Source Ltd., Harwell Science & Innovation Campus, Didcot, UK
  - Rosalind Franklin Institute, Harwell Science & Innovation Campus, Didcot, UK
7
Kolenko P, Stránský J, Koval' T, Malý M, Dohnálek J. SHELIXIR: automation of experimental phasing procedures using SHELXC/D/E. J Appl Crystallogr 2021. [DOI: 10.1107/s1600576721002454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
The program SHELIXIR represents a simple and efficient tool for routine phase-problem solution using data for experimental phasing by the single-wavelength anomalous dispersion, multiwavelength anomalous dispersion, single isomorphous replacement with anomalous scattering and radiation-damage-induced phasing methods. As indicated in its name, all calculation procedures are performed with the SHELXC/D/E program package. SHELIXIR provides screening for alternative space groups, optimal solvent content, and high- and low-resolution limits. The procedures of SHELXE are parallelized to minimize the computational time. The automation and parallelization of such procedures are suitable for phasing at synchrotron beamlines directly or for finding the optimal parameters for further data processing. A simple graphical interface is designed to make use easier and to increase efficiency during beam time.
8
Bond PS, Wilson KS, Cowtan KD. Predicting protein model correctness in Coot using machine learning. Acta Crystallogr D Struct Biol 2020; 76:713-723. [PMID: 32744253 PMCID: PMC7397494 DOI: 10.1107/s2059798320009080] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 04/20/2020] [Accepted: 07/02/2020] [Indexed: 11/11/2022] Open
Abstract
Manually identifying and correcting errors in protein models can be a slow process, but improvements in validation tools and automated model-building software can contribute to reducing this burden. This article presents a new correctness score that is produced by combining multiple sources of information using a neural network. The residues in 639 automatically built models were marked as correct or incorrect by comparing them with the coordinates deposited in the PDB. A number of features were also calculated for each residue using Coot, including map-to-model correlation, density values, B factors, clashes, Ramachandran scores, rotamer scores and resolution. Two neural networks were created using these features as inputs: one to predict the correctness of main-chain atoms and the other for side chains. The 639 structures were split into 511 that were used to train the neural networks and 128 that were used to test performance. The predicted correctness scores could correctly categorize 92.3% of the main-chain atoms and 87.6% of the side chains. A Coot ML Correctness script was written to display the scores in a graphical user interface as well as for the automatic pruning of chains, residues and side chains with low scores. The automatic pruning function was added to the CCP4i2 Buccaneer automated model-building pipeline, leading to significant improvements, especially for high-resolution structures.
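The 511/128 division of the 639 structures reported in this abstract is consistent with roughly an 80/20 train/test split. The sketch below shows one way such a split could be made reproducibly. The abstract does not say how the structures were actually assigned, so the function `split_structures`, its `test_fraction` and `seed` parameters, and the integer identifiers are all hypothetical.

```python
import random


def split_structures(ids, test_fraction=0.2, seed=0):
    """Shuffle identifiers with a fixed seed and split them into train/test lists.

    Assumption: a simple random split. The abstract only reports the
    resulting set sizes, not the procedure used to obtain them.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical identifiers standing in for the 639 automatically built models.
train_ids, test_ids = split_structures(range(639))
print(len(train_ids), len(test_ids))
```

With 639 identifiers and a test fraction of 0.2, the split sizes match the 511 training and 128 test structures quoted in the abstract.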
Affiliation(s)
- Paul S. Bond
  - Department of Chemistry, University of York, York YO10 5DD, United Kingdom
- Keith S. Wilson
  - Department of Chemistry, University of York, York YO10 5DD, United Kingdom
- Kevin D. Cowtan
  - Department of Chemistry, University of York, York YO10 5DD, United Kingdom