51
|
Chen J, Zhu X, Liu H. A mutual neighbor-based clustering method and its medical applications. Comput Biol Med 2022; 150:106184. [PMID: 37859282 DOI: 10.1016/j.compbiomed.2022.106184] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2022] [Revised: 09/23/2022] [Accepted: 10/08/2022] [Indexed: 11/03/2022]
Abstract
Clustering analysis has been widely used in various real-world applications. Owing to its simplicity, K-means has become the most popular clustering technique in practice. Unfortunately, the performance of K-means relies heavily on the initial centers, which must be specified in advance. Besides, it cannot effectively identify manifold clusters. In this paper, we propose a novel clustering algorithm based on representative data objects derived from mutual neighbors to identify clusters of different shapes. Specifically, it first obtains mutual neighbors to estimate the density of each data object, and then identifies representative objects with high densities to represent the whole dataset. Moreover, a concept of path distance, derived from a minimum spanning tree, is introduced to measure the similarities of representative objects for manifold structures. Finally, an improved K-means with initial centers and path-based distances is proposed to group the representative objects into clusters. For non-representative objects, cluster labels are determined by neighborhood information. To verify the effectiveness of the proposed method, we conducted comparison experiments on synthetic data and further applied the method to medical scenarios. The results show that our clustering method identifies arbitrary-shaped clusters and disease types more effectively than state-of-the-art clustering methods.
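The mutual-neighbor density estimate at the core of this method can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation, assuming density is simply the count of a point's mutual k-nearest neighbors:

```python
import numpy as np
from scipy.spatial.distance import cdist

def mutual_neighbor_density(X, k=3):
    """Density of each point = number of its mutual k-nearest neighbors,
    i.e. neighbors j such that i is also among j's k nearest."""
    D = cdist(X, X)
    order = np.argsort(D, axis=1)[:, 1:k + 1]   # k nearest, excluding self
    knn = [set(row) for row in order]
    return np.array([sum(i in knn[j] for j in knn[i]) for i in range(len(X))])
```

Points inside a dense region score high, while an outlier's neighbors rarely point back at it, so it scores near zero; representative objects would then be chosen among the high-density points.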
|
52
|
Chen R, Li B, Jia B, Xu J, Ma L, Yang H, Wang H. Oil spill identification in X-band marine radar image using K-means and texture feature. PeerJ Comput Sci 2022; 8:e1133. [PMID: 36426254 PMCID: PMC9680884 DOI: 10.7717/peerj-cs.1133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2022] [Accepted: 09/26/2022] [Indexed: 06/16/2023]
Abstract
Marine oil pollution poses a serious threat to the marine ecological balance. Developing rapid and efficient oil spill detection methods is of great significance for mitigating marine oil spill pollution and restoring the marine ecological environment. X-band marine radar is an important monitoring device. In this article, we conduct an oil film extraction experiment on digital X-band radar images acquired by a "Sperry Marine" radar system. First, the de-noised image was obtained by preprocessing the original image in the Cartesian coordinate system. Second, the image was cut into slices. Third, the texture features of the slices were calculated based on the gray-level co-occurrence matrix (GLCM) and the K-means method to extract rough oil spill regions. Finally, the oil spill regions were segmented using the Sauvola threshold algorithm. The experimental results indicate that this study provides a scientific method for oil film extraction. Compared with other methods for oil spill extraction in X-band single-polarization marine radar images, the proposed technique is more intelligent and can provide technical support for marine oil spill emergency response in the future.
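As a rough illustration of the texture step, a gray-level co-occurrence matrix and a few classical features derived from it can be computed as follows (a simplified sketch, not the authors' code; the single offset, the quantization to 8 levels, and the chosen features are assumptions):

```python
import numpy as np

def glcm(img, dx=1, dy=0, levels=8):
    """Normalized co-occurrence matrix of quantized gray levels for one offset."""
    q = (img * levels / (img.max() + 1e-9)).astype(int).clip(0, levels - 1)
    M = np.zeros((levels, levels))
    h, w = q.shape
    for y in range(h - dy):
        for x in range(w - dx):
            M[q[y, x], q[y + dy, x + dx]] += 1
    return M / M.sum()

def texture_features(P):
    """Classical GLCM features: contrast, energy, homogeneity."""
    i, j = np.indices(P.shape)
    return {"contrast": float((P * (i - j) ** 2).sum()),
            "energy": float((P ** 2).sum()),
            "homogeneity": float((P / (1 + np.abs(i - j))).sum())}
```

Feature vectors computed per image slice can then be fed to K-means to separate oil-covered from clean sea surface.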
|
53
|
Chen TL, Fushing H, Chou EP. Learned Practical Guidelines for Evaluating Conditional Entropy and Mutual Information in Discovering Major Factors of Response-vs.-Covariate Dynamics. ENTROPY (BASEL, SWITZERLAND) 2022; 24:1382. [PMID: 37420402 DOI: 10.3390/e24101382] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Revised: 09/22/2022] [Accepted: 09/26/2022] [Indexed: 07/09/2023]
Abstract
We reformulate and reframe a series of increasingly complex parametric statistical topics into a framework of response-vs.-covariate (Re-Co) dynamics that is described without any explicit functional structures. We then resolve these topics' data analysis tasks by discovering the major factors underlying such Re-Co dynamics, making use only of the data's categorical nature. The major factor selection protocol at the heart of the Categorical Exploratory Data Analysis (CEDA) paradigm is illustrated and carried out by employing Shannon's conditional entropy (CE) and mutual information (I[Re;Co]) as the two key information-theoretical measurements. Through the process of evaluating these two entropy-based measurements and resolving statistical tasks, we acquire several computational guidelines for carrying out the major factor selection protocol in a do-and-learn fashion. Specifically, practical guidelines are established for evaluating CE and I[Re;Co] in accordance with the criterion called [C1:confirmable]. Following the [C1:confirmable] criterion, we make no attempt to acquire consistent estimates of these theoretical information measurements. All evaluations are carried out on a contingency table platform, upon which the practical guidelines also provide ways of lessening the effects of the curse of dimensionality. We explicitly work through six examples of Re-Co dynamics, within each of which several widely extended scenarios are also explored and discussed.
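The two measurements at the center of the protocol are straightforward to evaluate on a contingency table. A minimal sketch (the array layout and the base-2 logarithm are assumptions):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def ce_and_mi(table):
    """H(Re|Co) and I[Re;Co] from a response-by-covariate contingency table."""
    P = table / table.sum()
    p_re, p_co = P.sum(axis=1), P.sum(axis=0)
    h_cond = sum(p_co[j] * entropy(P[:, j] / p_co[j])
                 for j in range(len(p_co)) if p_co[j] > 0)
    return h_cond, entropy(p_re) - h_cond
```

For an independent table, the conditional entropy equals the marginal entropy and the mutual information is zero; perfect association drives H(Re|Co) to zero.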
|
54
|
Particle Swarm Optimization and Two-Way Fixed-Effects Analysis of Variance for Efficient Brain Tumor Segmentation. Cancers (Basel) 2022; 14:cancers14184399. [PMID: 36139559 PMCID: PMC9496881 DOI: 10.3390/cancers14184399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2022] [Revised: 09/04/2022] [Accepted: 09/07/2022] [Indexed: 11/29/2022] Open
Abstract
Simple Summary
Segmentation of brain tumor images from magnetic resonance imaging (MRI) is a challenging topic in medical image analysis. Brain tumors can take many shapes, and MRI images vary considerably in intensity, making lesion detection difficult for radiologists. This paper proposes a three-step approach to this problem: (1) pre-processing, based on morphological operations, is applied to remove the skull bone from the image; (2) the particle swarm optimization (PSO) algorithm, with a two-way fixed-effects analysis of variance (ANOVA)-based fitness function, is used to find the optimal block containing the brain lesion; (3) the K-means clustering algorithm is adopted to classify the detected block as tumor or non-tumor. An extensive experimental analysis, including visual and statistical evaluations, was conducted using two MRI databases: a private database provided by the Kouba imaging center—Algiers (KICA)—and the multimodal brain tumor segmentation challenge (BraTS) 2015 database. The results show that the proposed methodology achieved impressive performance compared with several competing approaches.
Abstract
Segmentation of brain tumor images, to refine the detection and understanding of abnormal masses in the brain, is an important research topic in medical imaging. This paper proposes a new segmentation method, consisting of three main steps, to detect brain lesions using magnetic resonance imaging (MRI). In the first step, the parts of the image delineating the skull bone are removed, to exclude insignificant data. In the second step, which is the main contribution of this study, the particle swarm optimization (PSO) technique is applied to detect the block that contains the brain lesions. The fitness function, used to determine the best block among all candidate blocks, is based on a two-way fixed-effects analysis of variance (ANOVA). In the last step of the algorithm, the K-means segmentation method is applied to the lesion block, to classify it as tumor or not. A thorough evaluation of the proposed algorithm was performed using: (1) a private MRI database provided by the Kouba imaging center—Algiers (KICA); (2) the multimodal brain tumor segmentation challenge (BraTS) 2015 database. Estimates of the selected fitness function were first compared to those based on the sum-of-absolute-differences (SAD) dissimilarity criterion, to demonstrate the efficiency and robustness of the ANOVA. The performance of the optimized brain tumor segmentation algorithm was then compared to the results of several state-of-the-art techniques. The results obtained using the Dice coefficient, Jaccard distance, correlation coefficient, and root mean square error (RMSE) measurements demonstrated the superiority of the proposed optimized segmentation algorithm over equivalent techniques.
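Step (2) relies on standard particle swarm optimization. A generic global-best PSO minimizer is sketched below; the inertia and acceleration constants are typical textbook values, not the paper's settings, and the ANOVA-based fitness is replaced by an arbitrary objective `f`:

```python
import numpy as np

def pso(f, bounds, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best particle swarm optimization (minimization)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, float).T
    x = rng.uniform(lo, hi, (n_particles, len(lo)))
    v = np.zeros_like(x)
    pbest, pbest_f = x.copy(), np.array([f(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)  # inertia + pulls
        x = np.clip(x + v, lo, hi)
        fx = np.array([f(p) for p in x])
        better = fx < pbest_f
        pbest[better], pbest_f[better] = x[better], fx[better]
        g = pbest[pbest_f.argmin()].copy()
    return g, float(pbest_f.min())
```

In the paper, each particle would encode a candidate block position, and the fitness would score the block via the two-way fixed-effects ANOVA.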
|
55
|
Li J, Huang J, Jiang T, Tu L, Cui L, Cui J, Ma X, Yao X, Shi Y, Wang S, Wang Y, Liu J, Li Y, Zhou C, Hu X, Xu J. A multi-step approach for tongue image classification in patients with diabetes. Comput Biol Med 2022; 149:105935. [PMID: 35986968 DOI: 10.1016/j.compbiomed.2022.105935] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2021] [Revised: 06/30/2022] [Accepted: 07/14/2022] [Indexed: 11/03/2022]
Abstract
BACKGROUND In China, diabetes is a common, high-incidence chronic disease and has become a severe public health problem, yet current diagnosis and treatment methods struggle to control its progression. Traditional Chinese Medicine (TCM) has become an option for the treatment of diabetes due to its low cost, good curative effect, and good accessibility. OBJECTIVE To achieve a fine-grained classification of the diabetic population based on tongue image data, provide a diagnostic basis for formulating individualized treatment plans for diabetes, ensure the accuracy and consistency of TCM diagnosis, and promote the objective and standardized development of TCM diagnosis. METHODS We used the TFDA-1 tongue examination instrument to collect tongue images of the subjects. The Tongue Diagnosis Analysis System (TDAS) was used to extract TDAS features from the tongue images, and a Vector Quantized Variational Autoencoder (VQ-VAE) extracted VQ-VAE features. Based on the VQ-VAE features, K-means clustered the tongue images, and the TDAS features were used to describe the differences between clusters. A Vision Transformer (ViT) combined with Gradient-weighted Class Activation Mapping (Grad-CAM) was used to verify the clustering results and compute localized diagnostic information. RESULTS Based on the VQ-VAE features, K-means divided the diabetic population into 4 clusters with clear boundaries. The silhouette, Calinski-Harabasz, and Davies-Bouldin scores were 0.391, 673.256, and 0.809, respectively. Cluster 1 had the highest Tongue Body L (TB-L) and Tongue Coating L (TC-L) and the lowest Tongue Coating angular second moment (TC-ASM), with a pale red tongue and white coating. Cluster 2 had the highest TC-b, with a yellow tongue coating. Cluster 3 had the highest TB-a, with a red tongue. Cluster 4 had the lowest TB-L, TC-L, and TB-b and the highest Per-all, with a purple tongue and the largest tongue coating area. ViT verified the clustering results of K-means: the highest Top-1 classification accuracy (CA) was 87.8%, and the average CA was 84.4%. CONCLUSIONS The study organically combined unsupervised, self-supervised, and supervised learning to design a complete diabetic tongue image classification method. This method does not rely on human intervention, makes decisions based entirely on tongue image data, and achieves state-of-the-art results. Our research will help TCM participate deeply in the individualized treatment of diabetes and provides new ideas for promoting the standardization of TCM diagnosis.
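The K-means step itself reduces to Lloyd's alternating assignment/update iterations. A compact sketch follows; the deterministic initialization from the first k rows is an assumption for illustration (the paper clusters VQ-VAE feature vectors):

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Lloyd's algorithm: assign points to the nearest center, move centers."""
    C = X[:k].astype(float).copy()                 # naive init: first k rows
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
        newC = np.array([X[labels == j].mean(0) if (labels == j).any() else C[j]
                         for j in range(k)])
        if np.allclose(newC, C):                   # converged
            break
        C = newC
    return labels, C
```

In practice the initialization matters (K-means++ or repeated random restarts are common); internal indices such as the silhouette, Calinski-Harabasz, and Davies-Bouldin scores cited in the abstract then rate how well separated the resulting clusters are.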
|
56
|
Zhan X, Li Y, Liu Y, Cecchi NJ, Gevaert O, Zeineh MM, Grant GA, Camarillo DB. Piecewise Multivariate Linearity Between Kinematic Features and Cumulative Strain Damage Measure (CSDM) Across Different Types of Head Impacts. Ann Biomed Eng 2022; 50:1596-1607. [PMID: 35922726 DOI: 10.1007/s10439-022-03020-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2022] [Accepted: 07/12/2022] [Indexed: 11/28/2022]
Abstract
In a previous study, we found that the relationship between brain strain and kinematic features cannot be described by a generalized linear model across different types of head impacts. In this study, we investigate whether such a linear relationship exists when partitioning head impacts using a data-driven approach. We applied the K-means clustering method to partition 3161 impacts from various sources, including simulation, college football, mixed martial arts, and car crashes. We found piecewise multivariate linearity between the cumulative strain damage measure (CSDM; assessed at a threshold of 0.15) and head kinematic features. Compared with linear regression models without partition, and with a partition according to the types of head impacts, the K-means-based data-driven partition showed significantly higher CSDM regression accuracy, which suggests the presence of piecewise multivariate linearity across types of head impacts. Additionally, we compared this piecewise linearity with partitions based on the individual features used in clustering. We found that partitioning at a maximum angular acceleration magnitude of 4706 rad/s² led to the highest piecewise linearity. This study may contribute to an improved method for the rapid prediction of CSDM in the future.
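The piecewise-multivariate-linearity scheme amounts to fitting a separate linear model inside each K-means partition. A minimal sketch of that idea (ordinary least squares per cluster; variable names are illustrative):

```python
import numpy as np

def piecewise_linear_fit(X, y, labels):
    """Fit an ordinary least-squares model within each cluster."""
    models = {}
    for c in np.unique(labels):
        m = labels == c
        A = np.c_[X[m], np.ones(m.sum())]              # add intercept column
        coef, *_ = np.linalg.lstsq(A, y[m], rcond=None)
        models[c] = coef
    return models

def piecewise_predict(X, labels, models):
    """Predict each sample with the model of its own cluster."""
    A = np.c_[X, np.ones(len(X))]
    return np.array([A[i] @ models[labels[i]] for i in range(len(X))])
```

If the within-cluster fits are much more accurate than one global fit, that is evidence of piecewise linearity, which is essentially the comparison reported above.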
|
57
|
Pérez-Campuzano D, Rubio Andrada L, Morcillo Ortega P, López-Lázaro A. Visualizing the historical COVID-19 shock in the US airline industry: A Data Mining approach for dynamic market surveillance. JOURNAL OF AIR TRANSPORT MANAGEMENT 2022; 101:102194. [PMID: 36568914 PMCID: PMC9759375 DOI: 10.1016/j.jairtraman.2022.102194] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/22/2021] [Revised: 02/21/2022] [Accepted: 02/21/2022] [Indexed: 06/17/2023]
Abstract
One of the purposes of Artificial Intelligence tools is to ease the analysis of large amounts of data. In order to support the strategic decision-making process of airlines, this paper proposes a Data Mining approach (focused on visualization) with the objective of extracting market knowledge from any database of industry players or competitors. The method combines two clustering techniques (Self-Organizing Maps, SOMs, and K-means) via unsupervised learning, with promising dynamic applications in different sectors. As a case study, 30 years of data from 18 diverse US passenger airlines are used to showcase the capabilities of this tool, including the identification and assessment of market trends, M&A events, and the consequences of COVID-19.
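A self-organizing map can be trained with the classic online rule in a few lines. The sketch below uses generic choices (grid size, linear decay schedules, Gaussian neighborhood), not the paper's settings; the resulting codebook vectors can then be grouped with K-means, as in the SOM + K-means combination described above:

```python
import numpy as np

def train_som(X, grid=(4, 4), iters=500, lr0=0.5, sigma0=2.0, seed=0):
    """Online SOM training with a Gaussian neighborhood that shrinks over time."""
    rng = np.random.default_rng(seed)
    gy, gx = np.meshgrid(range(grid[0]), range(grid[1]), indexing="ij")
    coords = np.c_[gy.ravel(), gx.ravel()].astype(float)   # unit grid positions
    W = rng.normal(size=(grid[0] * grid[1], X.shape[1]))   # codebook vectors
    for t in range(iters):
        x = X[rng.integers(len(X))]                        # random sample
        bmu = ((W - x) ** 2).sum(axis=1).argmin()          # best matching unit
        frac = 1.0 - t / iters                             # decay factor
        h = np.exp(-((coords - coords[bmu]) ** 2).sum(1)
                   / (2 * (sigma0 * frac + 1e-3) ** 2))
        W += lr0 * frac * h[:, None] * (x - W)             # pull toward sample
    return W
```

Because the SOM preserves topology, plotting the trained grid gives the visualization emphasized in the abstract, while clustering `W` summarizes the market segments.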
|
58
|
Parvizi S, Eslamian S, Gheysari M, Gohari A, Kopai SS. Regional frequency analysis of drought severity and duration in Karkheh River Basin, Iran using univariate L-moments method. ENVIRONMENTAL MONITORING AND ASSESSMENT 2022; 194:336. [PMID: 35389125 DOI: 10.1007/s10661-022-09977-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/29/2021] [Accepted: 03/19/2022] [Indexed: 06/14/2023]
Abstract
Drought is a natural disaster that causes great damage to human life and natural ecosystems. Its main distinguishing features are the gradual onset of its effects over a relatively long period, the impossibility of accurately determining when a drought begins and ends, and the geographical extent of the associated effects. Moreover, the lack of a universally accepted definition of drought has added to the complexity of this phenomenon. In the last decade, owing to the increasing frequency of drought in Iran and the reduction of water resources, its consequences have become apparent and have caused problems for planners and managers. Therefore, in this research, regional frequency analysis using L-moments methods was performed to investigate the severity and duration of the Standardized Precipitation Index (SPI), Standardized Evapotranspiration Index (SEI), Standardized Runoff Index (SRI), and Standardized Soil Moisture Index (SSI), and to study meteorological, agricultural, and hydrological droughts in the Karkheh River Basin in Iran. Using the K-means clustering method, the basin was divided into four homogeneous areas. Discordant stations in each cluster were removed. The best regional distribution function was selected for each homogeneous region, and the Pearson type III distribution was found to have the best fit to the data set in the basin. Based on the Hosking and Wallis heterogeneity test, the Karkheh Basin, with H1 < 1, was identified as acceptably homogeneous in all clusters. The results showed that hydrological drought occurs with a very short time delay after meteorological drought in the Karkheh River Basin, and the two indicators represent meteorological and hydrological drought conditions well. Agricultural drought occurs after meteorological and hydrological drought, respectively, and its severity and duration are less than those of the other indicators. Meteorological, hydrological, and agricultural droughts do not occur at the same time in all years.
In general, the SPI shows the most severe droughts compared with the other three indices. In this way, at the 5- to 20-year return period with an SPI severity of 3, and at the 20- to 100-year return period with an SPI severity of 7, region IV, the western and northwestern areas of the basin, has been affected by severe meteorological drought. Using the regional standardized quantiles, it is possible to estimate the probability of drought in any part of the catchment that does not have sufficient data for hydrological studies.
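Sample L-moments are computed from order statistics via probability-weighted moments. A minimal sketch (unbiased b-estimators; returns the mean, L-scale, and L-skewness τ₃):

```python
import numpy as np

def sample_l_moments(x):
    """First two sample L-moments and the L-skewness ratio t3 = l3 / l2."""
    x = np.sort(np.asarray(x, float))
    n = len(x)
    i = np.arange(1, n + 1)
    b0 = x.mean()
    b1 = (((i - 1) / (n - 1)) * x).sum() / n
    b2 = (((i - 1) * (i - 2) / ((n - 1) * (n - 2))) * x).sum() / n
    l1, l2, l3 = b0, 2 * b1 - b0, 6 * b2 - 6 * b1 + b0
    return l1, l2, l3 / l2
```

In regional frequency analysis, L-moment ratios such as τ₃ computed at each station feed the discordancy and heterogeneity tests (e.g., Hosking and Wallis's H1) and the choice of the regional distribution.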
|
59
|
Pathak S, Raj R, Singh K, Verma PK, Kumar B. Development of portable and robust cataract detection and grading system by analyzing multiple texture features for Tele-Ophthalmology. MULTIMEDIA TOOLS AND APPLICATIONS 2022; 81:23355-23371. [PMID: 35317470 PMCID: PMC8931454 DOI: 10.1007/s11042-022-12544-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 06/25/2020] [Revised: 02/18/2021] [Accepted: 01/31/2022] [Indexed: 06/14/2023]
Abstract
This paper presents a low-cost, robust, portable, and automated cataract detection system that can detect the presence of cataract from colored digital eye images and grade its severity. Ophthalmologists detect cataract through visual screening using ophthalmoscopes and slit lamps. Conventionally, a patient has to visit an ophthalmologist for eye screening, and treatment follows in due course. Developing countries lack proper health infrastructure and face a huge scarcity of trained medical professionals as well as technicians, and the situation is not much better in the rural and remote areas of developed nations. To bridge this gap between patients and the availability of resources, the current work focuses on the development of a portable, low-cost, robust cataract screening and grading system. Similar works use fundus and retinal images, which require costly imaging modules, and image-based detection algorithms built on complex neural network models. The current work benefits from advances in digital image processing techniques. A set of preprocessing steps was applied to the colored eye image, and texture information in the form of mean intensity, uniformity, standard deviation, and randomness was then calculated and mapped to the diagnostic opinion of a doctor for cataract screening of over 200 patients. For the different grades of cataract severity, the edge pixel count was calculated according to the doctor's opinion, and these data were later used to calculate thresholds using a hybrid k-means algorithm for deciding on the presence of cataract and grading its severity. A low value of uniformity and high values of the other texture parameters confirm the presence of cataract, since clouding of the eye lens gives the uniformity function a lower value due to the coarse texture. A higher edge pixel count confirms an early-stage cataract, as the solidified regions in the lens are non-uniform;
a lower value corresponds to a fully solidified region, i.e., a mature cataract. The proposed algorithm was initially developed in MATLAB and tested on over 300 patients in an eye camp. The system has shown more than 98% accuracy in the detection and grading of cataract. Later, a cloud-based system was developed with a 3D-printed image acquisition module to provide an automated, portable, and efficient cataract detection system for tele-ophthalmology. The proposed system uses a very simple and efficient technique that also maps the diagnostic opinion of the doctor, giving very promising results that suggest its potential use in tele-ophthalmology applications to reduce the cost of delivering eye care services and to increase their reach effectively. The developed system is simple in design, easy to operate, and suitable for mass screening of cataracts. Owing to the non-invasive, non-mydriatic, and mountable nature of the device, in-person screening is not required. Hence, social distancing norms are easy to follow, and the device is very useful in COVID-19-like situations.
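The threshold-from-k-means idea can be illustrated for a single feature as two-class 1-D clustering whose decision boundary is the midpoint between the converged centers (a simplified stand-in for the paper's hybrid k-means):

```python
import numpy as np

def kmeans_threshold(values, iters=50):
    """Two-class 1-D k-means; returns the midpoint between final centers."""
    v = np.asarray(values, float)
    c = np.array([v.min(), v.max()])               # init at the extremes
    for _ in range(iters):
        assign = np.abs(v[:, None] - c[None]).argmin(1)
        newc = np.array([v[assign == j].mean() if (assign == j).any() else c[j]
                         for j in range(2)])
        if np.allclose(newc, c):
            break
        c = newc
    return float(c.mean())
```

A feature value (e.g., uniformity or edge pixel count) on one side of the threshold would then vote for "cataract present" in the screening decision.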
|
60
|
Dong Q, Cao M, Gu F, Gong W, Cai Q. Method for puncture trajectory planning in liver tumors thermal ablation based on NSGA-III. Technol Health Care 2022; 30:1243-1256. [PMID: 35342068 DOI: 10.3233/thc-213592] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
BACKGROUND Thermal ablation is a conventional mode of treatment for liver tumors. In order to reduce the damage to normal tissue caused by thermal ablation, the physician needs to plan the puncture path before surgery. OBJECTIVE In this paper, a puncture trajectory planning method for thermal ablation of liver tumors based on NSGA-III is proposed. This method takes both clinical hard constraints and soft constraints into account. METHOD The feasible puncture region is determined from the hard constraints, after which the Pareto front points are obtained under the soft constraints. When assessing the feasible puncture region, an adaptive morphological closing operation based on the K-means algorithm is adopted to process the spherical-angle binary image of obstacles that might be encountered during puncture. RANSAC is used to fit the tangent plane of the liver surface when calculating the angle between the puncture trajectory and the liver surface. In order to evaluate the puncture paths obtained by this method, 6 tumors were selected as experimental subjects, and the Hausdorff distance and overlap rate of the Pareto front points with manually recommended points were calculated. RESULTS The average Hausdorff distance is 24.91 mm, and the mean overlap rate is 86.43%. CONCLUSION The proposed method can provide puncture routes with high safety and clinical practicability.
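NSGA-III, like other multi-objective evolutionary algorithms, is built around Pareto dominance. A minimal non-dominated filter over candidate trajectories scored on several soft constraints (minimization is assumed; this is purely illustrative, not the paper's implementation):

```python
def pareto_front(points):
    """Indices of non-dominated points, minimizing every objective."""
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(q[k] <= p[k] for k in range(len(p))) and
            any(q[k] < p[k] for k in range(len(p)))
            for j, q in enumerate(points) if j != i)
        if not dominated:
            front.append(i)
    return front
```

The Pareto front points mentioned in the abstract are exactly such non-dominated trade-offs among the soft constraints, from which the physician picks a trajectory.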
|
61
|
Deciphering heterogeneous populations of migrating cells based on the computational assessment of their dynamic properties. Stem Cell Reports 2022; 17:911-923. [PMID: 35303437 PMCID: PMC9023771 DOI: 10.1016/j.stemcr.2022.02.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2021] [Revised: 02/17/2022] [Accepted: 02/18/2022] [Indexed: 11/23/2022] Open
Abstract
Neuronal migration is a highly dynamic process, and multiple cell movement metrics can be extracted from time-lapse imaging datasets. However, these parameters alone are often insufficient to evaluate the heterogeneity of neuroblast populations. We developed an analytical pipeline based on reducing the dimensions of the dataset by principal component analysis (PCA) and determining sub-populations using k-means, supported by the elbow criterion method and validated by a decision tree algorithm. We showed that neuroblasts derived from the same adult neural stem cell (NSC) lineage, as well as across different lineages, are heterogeneous and can be sub-divided into different clusters based on their dynamic properties. Interestingly, we also observed overlapping clusters for neuroblasts derived from different NSC lineages. We further showed that genetic perturbations or environmental stimuli affect the migratory properties of neuroblasts in a sub-cluster-specific manner. Our data thus provide a framework for assessing the heterogeneity of migrating neuroblasts.
Highlights: a pipeline to study the heterogeneity of migrating cells based on their dynamic properties; neuroblasts derived from the same neural stem cell (NSC) lineage are heterogeneous; neuroblasts derived from different NSC lineages have overlapping and distinct clusters; these clusters are differently affected by genetic factors or environmental stimuli.
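The dimensionality-reduction step can be sketched with an SVD-based PCA (a generic implementation, not the authors' pipeline):

```python
import numpy as np

def pca(X, n_components=2):
    """Project centered data onto the top principal components via SVD.
    Returns the component scores and the per-component variances."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T, S ** 2 / (len(X) - 1)
```

The number of clusters k for the subsequent k-means step is then picked where the within-cluster variance curve bends (the elbow criterion), and a decision tree trained on the cluster labels can check that the clusters are recoverable from the original movement metrics.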
|
62
|
Clark S, Lomax N, Birkin M, Morris M. A foresight whole systems obesity classification for the English UK biobank cohort. BMC Public Health 2022; 22:349. [PMID: 35180877 PMCID: PMC8856870 DOI: 10.1186/s12889-022-12650-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2021] [Accepted: 01/18/2022] [Indexed: 12/20/2022] Open
Abstract
Background The number of people living with obesity or who are overweight presents a global challenge, and the development of effective interventions is hampered by a lack of research that takes a joined-up, whole-system approach, considering multiple elements of the complex obesity system together. We need to better understand the collective characteristics and behaviours of those who are overweight or have obesity and how these differ from those who maintain a healthy weight. Methods Using the UK Biobank cohort, we develop an obesity classification system using k-means clustering. Variable selection from the UK Biobank cohort is informed by the Foresight obesity system map across key domains (Societal Influences, Individual Psychology, Individual Physiology, Individual Physical Activity, Physical Activity Environment). Results Our classification identifies eight groups of people, similar in respect of their exposure to known drivers of obesity: ‘Younger, urban hard-pressed’, ‘Comfortable, fit families’, ‘Healthy, active and retirees’, ‘Content, rural and retirees’, ‘Comfortable professionals’, ‘Stressed and not in work’, ‘Deprived with less healthy lifestyles’ and ‘Active manual workers’. Pen portraits are developed to describe the characteristics of these different groups. Multinomial logistic regression demonstrates that the classification can effectively detect groups of individuals more likely to be living with overweight or obesity. The group identified as ‘Comfortable, fit families’ has a higher proportion of healthy weight, while three groups have an increased relative risk of being overweight or having obesity: ‘Active manual workers’, ‘Stressed and not in work’ and ‘Deprived with less healthy lifestyles’. Conclusions This paper presents the first study to adopt this whole-system approach to characterising UK Biobank participants.
It provides an innovative approach to better understanding the complex drivers of obesity, with the potential to produce meaningful tools for policy makers to target interventions across the whole system to reduce overweight and obesity.
|
63
|
Mansoldo FRP, Berrino E, Guglielmi P, Carradori S, Carta F, Secci D, Supuran CT, Vermelho AB. An innovative spectroscopic approach for qualitative and quantitative evaluation of Mb-CO from myoglobin carbonylation reaction through chemometrics methods. SPECTROCHIMICA ACTA. PART A, MOLECULAR AND BIOMOLECULAR SPECTROSCOPY 2022; 267:120602. [PMID: 34801390 DOI: 10.1016/j.saa.2021.120602] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/14/2021] [Revised: 09/13/2021] [Accepted: 11/07/2021] [Indexed: 06/13/2023]
Abstract
In this work, an innovative approach using K-means and a multivariate curve resolution-purity based algorithm (MCR-Purity) for the evaluation and quantification of carboxymyoglobin (Mb-CO) formation from Deoxy-Myoglobin (Deoxy-Mb) is presented. Through a multilevel multifactor experimental design, samples with different concentrations of Mb-CO were created. The UV-Vis spectra of these samples were submitted to K-means analysis, which found 3 clusters. The mean spectra of the clusters were extracted, and 2 fully distinguishable groups could be detected through the peaks at 423 and 434 nm, wavelengths related to the Mb-CO and Deoxy-Mb components, respectively. The spectral data were then subjected to MCR-Purity analysis, which successfully described the analyzed reaction, explaining more than 99.9% of the variance (R²) with a LOF of 1.43%. A predictive model of Mb-CO was then created through the linear relationship between the MCR-Purity contributions and known concentrations of Mb-CO. The performance parameters of the predictive model were R²cv = 0.98, RMSEcv = 0.58, and RPDcv = 7.8 for the training set, and R²p = 0.98, RMSEp = 0.7, and RPDp = 6.8 for the test set. Thus, the predictive model shows excellent performance, considering that the Mb-CO concentration ranges between 0 and 21 µM. These results demonstrate that applying the proposed strategy to spectral data with overlapping bands is feasible and robust.
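The reported figures of merit (R², RMSE, RPD) are standard chemometric quantities and can be computed as follows (a generic sketch; the cross-validated variants apply the same formulas to held-out predictions):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R^2, RMSE, and RPD (SD of the reference values over the RMSE)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    resid = y_true - y_pred
    rmse = float(np.sqrt((resid ** 2).mean()))
    r2 = 1.0 - (resid ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()
    rpd = float(y_true.std(ddof=1)) / rmse
    return r2, rmse, rpd
```

An RPD well above ~3 is conventionally read as a model fit for quantitative use, which is the sense in which the RPD values of 7.8 and 6.8 above indicate excellent performance.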
|
64
|
Hypoxia-related gene signature for predicting LUAD patients' prognosis and immune microenvironment. Cytokine 2022; 152:155820. [PMID: 35176657 DOI: 10.1016/j.cyto.2022.155820] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2021] [Revised: 01/10/2022] [Accepted: 01/29/2022] [Indexed: 12/11/2022]
Abstract
Lung adenocarcinoma (LUAD) is a prevalent lung cancer histology with high morbidity and mortality, and current approaches for assessing patients' prognoses remain ineffective. Based on mRNA expression and clinical data from the Cancer Genome Atlas (TCGA)-LUAD data set, we used the hypoxia-related gene set from the MSigDB database to identify hypoxia-related differentially expressed genes (DEGs). Based on the expression levels of these DEGs, K-means consensus clustering was used to divide LUAD patients into subgroups. After univariate, Lasso, and multivariate Cox regression analyses of the hypoxia-related DEGs, six genes were selected to construct a prognostic signature for LUAD patients. Using the median risk score derived from this signature as the threshold, LUAD patients were divided into high- and low-risk groups. Kaplan-Meier curves, receiver operating characteristic (ROC) curves, and univariate and multivariate Cox regression analyses verified that the hypoxia-related gene signature was a prognostic factor independent of clinical features. Gene set enrichment analysis (GSEA) showed that the pentose phosphate pathway and the p53 signaling pathway were differentially activated between the high- and low-risk groups. CIBERSORT was used to assess the infiltration level of each immune cell type in the two groups, revealing differences in the infiltration abundance of plasma cells, activated CD4+ memory T cells, and M1 macrophages between the high- and low-risk groups. We constructed a nomogram for predicting one-, three- and five-year survival of LUAD patients based on the risk scores of the hypoxia-related gene signature and six clinical factors; calibration curves showed close agreement between nomogram-predicted and actual survival. In conclusion, the hypoxia-related gene signature can be used to predict LUAD patients' prognosis and assess their immune microenvironment, guiding clinicians' decisions during the diagnosis and treatment of LUAD.
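The median-split risk stratification described above can be sketched as follows; the number of patients, the Cox coefficients, and the expression values are purely illustrative, not taken from the paper:

```python
import numpy as np

# Hypothetical Cox coefficients for a 6-gene hypoxia signature
# (gene names and weights are invented for illustration).
coefs = {"G1": 0.42, "G2": -0.31, "G3": 0.18,
         "G4": 0.55, "G5": -0.12, "G6": 0.27}

rng = np.random.default_rng(0)
# Toy expression matrix: rows = patients, columns = the 6 signature genes.
expr = rng.normal(size=(10, len(coefs)))

# Risk score = linear combination of expression levels and Cox coefficients.
w = np.array(list(coefs.values()))
risk = expr @ w

# Stratify patients at the median risk score, as in the abstract.
median = np.median(risk)
group = np.where(risk > median, "high", "low")
```

With an even number of patients and no ties, this yields equally sized high- and low-risk groups, which is the usual effect of a median split.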
Collapse
|
65
|
Hu J, Chen J, Zhu P, Hao S, Wang M, Li H, Liu N. Difference and Cluster Analysis on the Carbon Dioxide Emissions in China During COVID-19 Lockdown via a Complex Network Model. Front Psychol 2022; 12:795142. [PMID: 35095680 PMCID: PMC8790068 DOI: 10.3389/fpsyg.2021.795142] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2021] [Accepted: 12/16/2021] [Indexed: 12/23/2022] Open
Abstract
The continuous increase of carbon emissions is a serious challenge worldwide, and many countries are striving to solve this problem. Since 2020, widespread lockdowns imposed to prevent the spread of COVID-19 have severely restricted the movement of people and non-essential economic activities, which unexpectedly reduced carbon emissions. This paper analyzes the carbon emissions data of 30 Chinese provinces in 2020 and provides references for reducing emissions through epidemic lockdown measures. Based on time series visualization, we transform the time series data into complex networks to uncover the hidden information in these data. We found that lockdowns brought about a short-term decrease in carbon emissions, and that for most provinces the time point of impact was short and closely related to the level of economic development and industrial structure. These results provide insights into the evolution of carbon emissions under COVID-19 lockdown measures, as well as into energy conservation and responses to the energy crisis in the post-epidemic era.
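The abstract does not specify which time-series visualization method maps the data onto a network; a common choice in this literature is the natural visibility graph, sketched here on an invented emissions-like series with a lockdown dip:

```python
def visibility_graph(series):
    """Build a natural visibility graph from a time series:
    nodes are time points; a and b are connected if every point
    between them lies strictly below the straight sight line from a to b."""
    n = len(series)
    edges = set()
    for a in range(n):
        for b in range(a + 1, n):
            visible = True
            for c in range(a + 1, b):
                # Height of the a -> b sight line at time c.
                line = series[a] + (series[b] - series[a]) * (c - a) / (b - a)
                if series[c] >= line:
                    visible = False
                    break
            if visible:
                edges.add((a, b))
    return edges

# Toy monthly-emissions-like series with a lockdown dip (invented values).
emissions = [5.0, 5.2, 3.1, 2.8, 4.9, 5.1]
g = visibility_graph(emissions)
```

Adjacent points are always mutually visible, so the graph contains at least a path through all time points; dips and peaks change the longer-range connectivity, which is what the network analysis then exploits.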
Collapse
|
66
|
Sawalmeh A, Othman NS, Liu G, Khreishah A, Alenezi A, Alanazi A. Power-Efficient Wireless Coverage Using Minimum Number of UAVs. SENSORS 2021; 22:s22010223. [PMID: 35009766 PMCID: PMC8749821 DOI: 10.3390/s22010223] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/20/2021] [Revised: 12/17/2021] [Accepted: 12/23/2021] [Indexed: 11/16/2022]
Abstract
Unmanned aerial vehicles (UAVs) can be deployed as backup aerial base stations during cellular outages, either during or after a natural disaster. In this paper, a power-efficient multi-UAV three-dimensional (3D) deployment approach is proposed that minimizes the number of UAVs needed to provide wireless coverage to all outdoor and indoor users, while minimizing the required UAV transmit power and satisfying users' required data rates. More specifically, the proposed algorithm iteratively invokes a clustering algorithm and an efficient UAV 3D placement algorithm, aiming for maximum wireless coverage with the minimum number of UAVs and minimum required transmit power. Two scenarios, with uniformly and non-uniformly distributed users, were considered. The proposed algorithm with a Particle Swarm Optimization (PSO)-based clustering algorithm required fewer UAVs to serve all users than the same algorithm with K-means clustering. Furthermore, iteratively invoking the PSO-based clustering and PSO-based efficient UAV 3D placement algorithms reduced the execution time to roughly 1/17 and 1/79 of that of the Genetic Algorithm (GA)-based and Artificial Bee Colony (ABC)-based efficient UAV 3D placement algorithms, respectively. For the uniform distribution scenario, the proposed algorithm required six UAVs to ensure 100% user coverage, whilst a benchmark algorithm based on Circle Packing Theory (CPT) required five UAVs but achieved only 67% coverage density.
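The coverage-driven loop of the approach can be sketched as follows. This uses plain k-means, which the paper only benchmarks against (their method is PSO-based and also optimizes 3D placement and transmit power); the user positions and coverage radius are invented:

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means over 2D points (the baseline clusterer)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def min_uavs(users, radius):
    """Smallest k such that every user lies within `radius` of some
    cluster center (a candidate UAV ground projection)."""
    for k in range(1, len(users) + 1):
        centers = kmeans(users, k)
        if all(min(math.dist(u, c) for c in centers) <= radius for u in users):
            return k, centers
    return len(users), users

# Two invented user hotspots, e.g. two shelters after a disaster.
users = [(0, 0), (1, 1), (0, 1), (10, 10), (11, 10)]
k, centers = min_uavs(users, radius=2.0)
```

Two well-separated hotspots with a 2.0 coverage radius need two UAVs under this sketch; the paper's contribution is doing this placement in 3D with PSO while also minimizing transmit power.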
Collapse
|
67
|
Duan T, Kuang Z, Wang J, Ma Z. GBDTLRL2D Predicts LncRNA-Disease Associations Using MetaGraph2Vec and K-Means Based on Heterogeneous Network. Front Cell Dev Biol 2021; 9:753027. [PMID: 34977011 PMCID: PMC8718797 DOI: 10.3389/fcell.2021.753027] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2021] [Accepted: 11/22/2021] [Indexed: 12/16/2022] Open
Abstract
In recent years, long noncoding RNAs (lncRNAs) have been shown to be involved in many disease processes. Predicting lncRNA-disease associations helps clarify the mechanisms of disease occurrence and suggests new methods of disease prevention and treatment. Current methods for predicting potential lncRNA-disease associations seldom consider heterogeneous networks with complex node paths, and they suffer from unbalanced positive and negative samples. To solve this problem, a method based on the Gradient Boosting Decision Tree (GBDT) and logistic regression (LR), named GBDTLRL2D, is proposed in this paper to predict lncRNA-disease associations. MetaGraph2Vec is used for feature learning, and negative sample sets are selected using K-means clustering. The innovation of GBDTLRL2D is that a clustering algorithm is used to select a representative negative sample set, and that MetaGraph2Vec better retains the semantic and structural features of heterogeneous networks. The average area under the receiver operating characteristic curve (AUC) values of GBDTLRL2D on the three datasets are 0.98, 0.98, and 0.96 in 10-fold cross-validation.
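The K-means-based negative-sample selection can be sketched as follows: cluster the unlabeled candidate pairs (as feature vectors) and keep the point nearest each center, giving a small, representative negative set instead of a random draw. This pure-Python version with invented candidate vectors is an illustration, not the authors' implementation:

```python
import math
import random

def select_negatives(candidates, k, seed=0):
    """Cluster candidate feature vectors with k-means and return the
    candidate nearest to each center as a representative negative sample."""
    rng = random.Random(seed)
    centers = rng.sample(candidates, k)
    for _ in range(30):
        groups = [[] for _ in range(k)]
        for p in candidates:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        centers = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    # One representative per cluster: the candidate closest to its center.
    return [min(candidates, key=lambda p: math.dist(p, c)) for c in centers]

# Two invented groups of candidate pairs in feature space.
candidates = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
negs = select_negatives(candidates, k=2)
```

The selected negatives come from distinct regions of the feature space, which is the point of using clustering rather than uniform random sampling.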
Collapse
|
68
|
Alexander N, Alexander DC, Barkhof F, Denaxas S. Identifying and evaluating clinical subtypes of Alzheimer's disease in care electronic health records using unsupervised machine learning. BMC Med Inform Decis Mak 2021; 21:343. [PMID: 34879829 PMCID: PMC8653614 DOI: 10.1186/s12911-021-01693-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2021] [Accepted: 11/15/2021] [Indexed: 02/02/2023] Open
Abstract
BACKGROUND Alzheimer's disease (AD) is a highly heterogeneous disease with diverse trajectories and outcomes observed in clinical populations. Understanding this heterogeneity can enable better treatment, prognosis and disease management. Studies to date have mainly used imaging or cognition data and have been limited in data breadth and sample size. Here we examine the clinical heterogeneity of Alzheimer's disease patients using electronic health records (EHR) to identify and characterise disease subgroups with multiple clustering methods, identifying clusters which are clinically actionable. METHODS We identified AD patients in primary care EHR from the Clinical Practice Research Datalink (CPRD) using a previously validated rule-based phenotyping algorithm. We extracted a range of comorbidities, symptoms and demographic characteristics as patient features. We evaluated four clustering methods (k-means, kernel k-means, affinity propagation and latent class analysis), compared the clusters on clinically relevant outcomes, and evaluated each method using measures of cluster structure, stability, efficiency of outcome prediction and replicability in external data sets. RESULTS We identified 7,913 AD patients, with a mean age of 82 and 66.2% female, and included 21 features in our analysis. We observed 5, 2, 5 and 6 clusters with k-means, kernel k-means, affinity propagation and latent class analysis, respectively. K-means produced the most consistent results on the four evaluative measures. We discovered a consistent cluster, found in three of the four methods, composed of predominantly female patients with younger disease onset (43% between ages 42 and 73) and diagnoses of depression and anxiety, with a quicker rate of progression than the average across the other clusters. CONCLUSION Each clustering approach produced substantially different clusters, and k-means performed best of the four methods on the four evaluative criteria. However, the consistent appearance of one particular cluster across three of the four methods suggests the presence of a distinct disease subtype that merits further exploration. Our study underlines the variability of results obtained from different clustering approaches and the importance of systematically evaluating approaches for identifying disease subtypes in complex EHR.
Collapse
|
69
|
Roni RG, Tsipi H, Ofir BA, Nir S, Robert K. Disease evolution and risk-based disease trajectories in congestive heart failure patients. J Biomed Inform 2021; 125:103949. [PMID: 34875386 DOI: 10.1016/j.jbi.2021.103949] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2021] [Revised: 10/10/2021] [Accepted: 11/03/2021] [Indexed: 11/28/2022]
Abstract
Congestive Heart Failure (CHF) is among the most prevalent chronic diseases worldwide and is commonly associated with comorbidities and complex health conditions. Consequently, CHF patients are typically hospitalized frequently and are at a high risk of premature death. Early detection of a patient's envisaged disease trajectory is crucial for precision medicine. However, despite the abundance of patient-level data, cardiologists currently struggle to identify disease trajectories and track the evolution of the disease over time, especially in small groups of patients with specific disease subtypes. The present study proposes a five-step method for clustering CHF patients, detecting cluster similarity, and identifying disease trajectories, which promises to overcome these difficulties. The work is based on a rich dataset of patient records spanning ten years of hospital visits, containing all the health information documented during each visit, including diagnoses, lab results, clinical data, and demographics. It utilizes an innovative Cluster Evolution Analysis (CEA) method to analyze the complex CHF population, in which each subject is potentially associated with numerous variables. We defined sub-groups by mortality risk level and used them to characterize disease evolution, clustering the patients at three points in time over ten years and generating migration patterns across periods. The analysis elicited 18, 23, and 25 clusters for the first, second, and third visits, respectively, uncovering clinically interesting small sub-groups of patients. In the subsequent post-processing stage, we identified meaningful patterns: fine-grained patient clusters divided into several finite risk levels, including several small groups of high-risk patients, as well as longitudinal patterns in which patients' risk levels changed over time. Four types of disease trajectories were identified: decline, preserved state, improvement, and mixed progress. This stage is a unique contribution of the work. The resulting fine partitioning and longitudinal insights promise to significantly assist cardiologists in tailoring personalized interventions to improve care quality. Cardiologists could use these results to glean previously undetected relationships between symptoms and disease evolution, allowing more informed clinical decision-making and effective interventions.
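The cross-period migration patterns can be illustrated with a minimal sketch: count how patients move between clusters from one visit to the next. The patient IDs and cluster labels here are hypothetical:

```python
from collections import Counter

# Hypothetical cluster assignments for the same patients at two visits.
visit1 = {"p1": "A", "p2": "A", "p3": "B", "p4": "B", "p5": "C"}
visit2 = {"p1": "A", "p2": "B", "p3": "B", "p4": "C", "p5": "C"}

# Count transitions (cluster at visit 1 -> cluster at visit 2);
# off-diagonal mass reveals patients whose risk profile changed over time.
migration = Counter((visit1[p], visit2[p]) for p in visit1)
```

Diagonal entries such as ("A", "A") correspond to a preserved state, while off-diagonal entries such as ("A", "B") correspond to decline or improvement, depending on how the cluster risk levels are ordered.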
Collapse
|
70
|
Kamat PV, Sugandhi R, Kumar S. Deep learning-based anomaly-onset aware remaining useful life estimation of bearings. PeerJ Comput Sci 2021; 7:e795. [PMID: 34909464 PMCID: PMC8641573 DOI: 10.7717/peerj-cs.795] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2021] [Accepted: 11/03/2021] [Indexed: 06/01/2023]
Abstract
Remaining Useful Life (RUL) estimation of rotating machinery based on degradation data is vital for machine supervisors. Deep learning models are effective and popular methods for forecasting when rotating machinery such as bearings may malfunction and ultimately break down. During healthy functioning of the machinery, however, RUL is ill-defined. To address this issue, this study recommends using anomaly monitoring during both RUL estimator training and operation. Essential time-domain features are extracted from the raw bearing vibration data, and deep learning models are used to detect the onset of an anomaly, which then acts as a trigger for data-driven RUL estimation. The study employs an unsupervised clustering approach for anomaly trend analysis and a semi-supervised method for anomaly detection and RUL estimation. The combined deep-learning-based, anomaly-onset-aware RUL estimation framework showed enhanced results on the benchmark PRONOSTIA bearing dataset under non-varying operating conditions. The framework, consisting of autoencoder and Long Short-Term Memory variants, achieved an accuracy of over 90% in anomaly detection and RUL prediction. In the future, the framework can be deployed under varying operational conditions using transfer learning.
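The anomaly-onset trigger idea can be illustrated with a minimal sketch: watch a reconstruction-error trace (e.g. from an autoencoder) and activate the RUL estimator once the error stays elevated. The threshold, patience window, and error values are invented:

```python
def anomaly_onset(errors, threshold, patience=3):
    """First index where the reconstruction error exceeds `threshold`
    for `patience` consecutive steps; the RUL estimator is activated
    from this trigger onward. Returns None while the machine is healthy."""
    run = 0
    for i, e in enumerate(errors):
        run = run + 1 if e > threshold else 0
        if run == patience:
            return i - patience + 1
    return None

# Toy reconstruction-error trace: healthy noise, one spike, then degradation.
errors = [0.1, 0.12, 0.11, 0.4, 0.13, 0.5, 0.55, 0.6, 0.7]
onset = anomaly_onset(errors, threshold=0.3)
```

Requiring several consecutive exceedances (the patience window) keeps a single noisy spike, like the 0.4 above, from triggering RUL estimation prematurely.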
Collapse
|
71
|
Oukil S, Kasmi R, Mokrani K, García-Zapirain B. Automatic segmentation and melanoma detection based on color and texture features in dermoscopic images. Skin Res Technol 2021; 28:203-211. [PMID: 34779062 PMCID: PMC9907597 DOI: 10.1111/srt.13111] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2021] [Accepted: 09/25/2021] [Indexed: 02/06/2023]
Abstract
PURPOSE Melanoma is known as the most aggressive form of skin cancer and one of the fastest-growing malignant tumors worldwide. Several computer-aided diagnosis systems for melanoma have been proposed; still, these algorithms encounter difficulties with early-stage lesions. This paper aims to discriminate melanoma from benign skin lesions in dermoscopic images. METHODS The proposed algorithm is based on the color and texture of skin lesions, introducing a novel feature extraction technique. The algorithm uses an automatic k-means-based segmentation to generate a fairly accurate mask for each lesion. The feature extraction consists of existing and novel color and texture attributes measuring how color and texture vary inside the lesion. To find the optimal results, all the attributes are extracted from lesions in five different color spaces (RGB, HSV, Lab, XYZ, and YCbCr) and used as the inputs to three classifiers (K nearest neighbors, support vector machine, and artificial neural network). RESULTS The PH2 dataset is used to assess the performance of the proposed algorithm. The results are compared to those of published articles that used the same dataset, showing that the proposed method outperforms the state of the art, attaining a sensitivity of 99.25%, specificity of 99.58%, and accuracy of 99.51%. CONCLUSION The final results show that color combined with texture gives powerful and relevant attributes for melanoma detection, improving over the state of the art.
Collapse
|
72
|
da Costa JP, Garcia A. New confinement index and new perspective for comparing countries - COVID-19. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2021; 210:106346. [PMID: 34464767 PMCID: PMC8418097 DOI: 10.1016/j.cmpb.2021.106346] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/03/2021] [Accepted: 08/03/2021] [Indexed: 06/13/2023]
Abstract
BACKGROUND AND OBJECTIVE In the difficult problem of comparing countries regarding their lockdown measures or deaths caused by COVID-19, there is still no agreement on the best strategy to follow. We therefore propose a new way of comparing countries that avoids the main difficulties by using three-dimensional trajectories for this type of data. METHODS We introduce a new index to analyze the level of confinement each country was subject to over time, based on the Community Mobility Reports published by Google, using Principal Component Analysis. Subsequently, using longitudinal clustering, we divide the European countries into similar groups according to COVID-19 deaths and to the confinement index. To make the most of the clustering methods, we use artificial longitudinal data to evaluate both the methods and the indices. RESULTS Using artificial data, we found that the Calinski-Harabasz index outperformed the other internal indices in indicating the true number of clusters. The tests also suggested that K-means with Euclidean distance was the best of the methods studied. Applying the approach to both the mobility and fatalities datasets, we found two groups in each. CONCLUSIONS Our analysis shows that northern European countries had more mobility during the first confinement and that deaths caused by COVID-19 started to drop around the 40th day after the first death.
Collapse
|
73
|
Feature selection for unsupervised machine learning of accelerometer data physical activity clusters - A systematic review. Gait Posture 2021; 90:120-128. [PMID: 34438293 DOI: 10.1016/j.gaitpost.2021.08.007] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/02/2020] [Revised: 03/03/2021] [Accepted: 08/08/2021] [Indexed: 02/02/2023]
Abstract
BACKGROUND Identifying clusters of physical activity (PA) from accelerometer data is important for quantifying sedentary behaviour and physical activity associated with risks of serious health conditions, as well as time spent engaging in healthy PA. Unsupervised machine learning models can capture PA in everyday free-living activity without the need for labelled data. However, there is scant research addressing the selection of features from accelerometer data. Feature selection methods can reduce the complexity and computational burden of these models by removing less important features, and assist in understanding the relative importance of feature sets and individual features in clustering. The aim of this systematic review is to summarise feature selection techniques applied in studies concerned with unsupervised machine learning of physical activity obtained from accelerometer-based devices, and to identify commonly used features identified through these techniques. METHOD We conducted a systematic search of the Pubmed, Medline, Google Scholar, Scopus, Arxiv and Web of Science databases to identify studies published before January 2021 that used feature selection methods to derive PA clusters with unsupervised machine learning models. RESULTS A total of 13 studies were eligible for inclusion in the review. The most popular feature selection techniques were Principal Component Analysis (PCA) and correlation-based methods, with k-means frequently used to cluster accelerometer data. Cluster quality evaluation methods were diverse, including both external (e.g. cluster purity) and internal evaluation measures (most frequently the silhouette score). Only four of the 13 studies had more than 25 participants, and only four included two or more datasets. CONCLUSION There is a need to assess multiple feature selection methods on large cohort data consisting of multiple (3 or more) PA datasets. The cut-off criteria (e.g. number of components, pairwise correlation value, explained variance ratio for PCA) should be expressly stated, along with any hyperparameters used in clustering.
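The explained-variance-ratio cut-off for PCA mentioned in the conclusion can be sketched as follows; the 0.9 threshold and the synthetic data are assumptions for illustration:

```python
import numpy as np

def pca_components_for_variance(X, threshold=0.9):
    """Number of principal components needed to explain at least
    `threshold` of the total variance, via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    # Squared singular values are proportional to per-component variance.
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    ratio = np.cumsum(var) / var.sum()
    return int(np.searchsorted(ratio, threshold) + 1)

# Synthetic 2-feature data in which one feature is nearly a
# scaled copy of the other, so one component dominates.
rng = np.random.default_rng(1)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200)])
n_comp = pca_components_for_variance(X, threshold=0.9)
```

Reporting this threshold explicitly, as the review recommends, makes the resulting feature set reproducible across studies.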
Collapse
|
74
|
Saeidifar M, Yazdi M, Zolghadrasli A. Performance Improvement in Brain Tumor Detection in MRI Images Using a Combination of Evolutionary Algorithms and Active Contour Method. J Digit Imaging 2021; 34:1209-1224. [PMID: 34561783 DOI: 10.1007/s10278-021-00514-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2021] [Revised: 08/23/2021] [Accepted: 08/31/2021] [Indexed: 10/20/2022] Open
Abstract
The process of treating brain cancer depends on the experience and knowledge of the physician, and may therefore be subject to human error and vary from person to person. For this reason, it is important to use an automatic tumor detection algorithm to assist radiologists and physicians in brain tumor diagnosis. The aim of the present study is to automatically detect the location of a tumor in brain MRI images with high accuracy. To this end, the proposed algorithm first separates the skull from the brain using morphological operators. The image is then segmented by six evolutionary algorithms, i.e., Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC), Genetic Algorithm (GA), Differential Evolution (DE), Harmony Search (HS), and Gray Wolf Optimization (GWO), as well as two other frequently used techniques from the literature, i.e., K-means and Otsu thresholding. Afterwards, the tumor area is isolated from the brain using four features extracted from the main tumor. Evaluation of the segmented area revealed that PSO performs best among these approaches. The PSO segmentation results are then used as the initial curve for the active contour method to precisely delineate the tumor boundaries. The proposed algorithm was applied to fifty images with two different types of tumors. Experimental results on T1-weighted brain MRI images show better performance of the proposed algorithm compared to the other evolutionary algorithms, K-means, and Otsu thresholding.
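One of the compared baselines, Otsu thresholding, can be sketched in a few lines: it exhaustively picks the gray level that maximizes the between-class variance of the two resulting pixel classes. The toy pixel values are invented:

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t maximizing between-class variance
    when pixels are split into classes <= t and > t."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(levels):
        w0 += hist[t]          # pixels in the dark class
        if w0 == 0:
            continue
        w1 = total - w0        # pixels in the bright class
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# Toy bimodal "image": dark background and a bright tumor-like region.
pixels = [10, 12, 11, 13, 10] * 20 + [200, 205, 198, 210] * 10
t = otsu_threshold(pixels)
```

On a cleanly bimodal histogram like this toy one, the threshold lands between the two modes; real MRI histograms overlap, which is why evolutionary segmenters and active contours can improve on it.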
Collapse
|
75
|
Yang Z, Liu M, Wang B, Wang B. Classification of protein domains based on their three-dimensional shapes (CPD3DS). Synth Syst Biotechnol 2021; 6:224-230. [PMID: 34541344 PMCID: PMC8429105 DOI: 10.1016/j.synbio.2021.08.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2021] [Revised: 08/23/2021] [Accepted: 08/30/2021] [Indexed: 11/13/2022] Open
Abstract
Protein design has become a powerful method to expand the number of natural proteins and to design customized proteins on demand. Domain-based protein design spares the need to create novel elements from scratch, making it a more efficient strategy than scratch-based design for multi-domain proteins, protein complexes, and biomaterials. As surface shape plays a central role in domain-domain and protein-protein interactions, a global map of the surface shapes of all domains would be very beneficial for domain-based protein design. Therefore, in this study, we characterized the surface shapes of protein domains, collected from the CATH and SCOP databases, with their 3D Zernike descriptors (3DZDs). Similarities of domain shape features were then identified, and all domains were classified accordingly. The preferences for combinations of domains from different clusters were analyzed in natural proteins from the Protein Data Bank. A user-friendly website, termed CPD3DS, was also developed for the storage, retrieval, analysis, and visualization of our results. This work not only provides an overall view of protein domain shapes, showing their variety and similarities, but also opens a new avenue for understanding the properties of protein structural domains and the design principles of protein architectures.
Collapse
|