1
Datta S, Nabeel Asim M, Dengel A, Ahmed S. NTpred: a robust and precise machine learning framework for in silico identification of Tyrosine nitration sites in protein sequences. Brief Funct Genomics 2024; 23:163-179. [PMID: 37248673 DOI: 10.1093/bfgp/elad018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Revised: 04/12/2023] [Accepted: 05/02/2023] [Indexed: 05/31/2023]
Abstract
Post-translational modifications (PTMs) either enhance a protein's activity in various sub-cellular processes or degrade its activity, which leads to the failure of intracellular processes. Tyrosine nitration (NT) modification degrades a protein's activity, which initiates and propagates various diseases, including neurodegenerative, cardiovascular and autoimmune diseases and carcinogenesis. Identification of NT modification supports the development of novel therapies and drug discoveries for associated diseases. Identifying NT modification in biochemical labs is expensive, time-consuming and error-prone. To supplement this process, several computational approaches have been proposed. However, these approaches fail to precisely identify NT modification because they extract irrelevant, redundant and less discriminative features from protein sequences. This paper presents the NTpred framework, which extracts comprehensive features from raw protein sequences using four different sequence encoders. To reap the benefits of different encoders, it generates four additional feature spaces by fusing different combinations of individual encodings. Furthermore, it eliminates irrelevant and redundant features from the eight feature spaces through a Recursive Feature Elimination process. Selected features of the four individual encodings and the four feature fusion vectors are used to train eight different Gradient Boosted Tree classifiers. The probability scores from the trained classifiers are used to generate a new probabilistic feature space, which is used to train a Logistic Regression classifier. On the BD1 benchmark dataset, the proposed framework outperforms the existing best-performing predictor in 5-fold cross validation and independent test evaluation with a combined improvement of 13.7% in MCC and 20.1% in AUC. Similarly, on the BD2 benchmark dataset, the proposed framework outperforms the existing best-performing predictor with a combined improvement of 5.3% in MCC and 1.0% in AUC. NTpred is publicly available for further experimentation and predictive use at: https://sds_genetic_analysis.opendfki.de/PredNTS/.
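The final stacking stage described in the abstract (probability scores from base classifiers become the features of a Logistic Regression model) can be illustrated with a minimal sketch. This is not the authors' implementation: the Gradient Boosted Tree base classifiers and the Recursive Feature Elimination step are omitted, the toy scores come from three hypothetical base classifiers rather than eight, and the function names, learning rate and epoch count are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

def predict_proba(weights, bias, x):
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Hypothetical probabilistic feature space: each row holds the probability
# scores that three base classifiers assigned to one candidate site.
meta_features = [
    [0.9, 0.8, 0.7],
    [0.8, 0.6, 0.9],
    [0.7, 0.9, 0.6],
    [0.2, 0.3, 0.1],
    [0.1, 0.4, 0.3],
    [0.3, 0.2, 0.2],
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = nitration site, 0 = non-site (toy labels)

weights, bias = train_logistic_regression(meta_features, labels)
predictions = [1 if predict_proba(weights, bias, x) >= 0.5 else 0
               for x in meta_features]
print(predictions)
```

In a realistic setting the meta-features would be produced out-of-fold, so that the meta-classifier never sees scores from base classifiers trained on the same samples.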
Affiliation(s)
- Sourajyoti Datta
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern, 67663, Germany
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany
- Andreas Dengel
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern, 67663, Germany
  - German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany
- Sheraz Ahmed
  - German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany
2
Zehra SS, Magarini M, Qureshi R, Mustafa SMN, Farooq F. Proactive approach for preamble detection in 5G-NR PRACH using supervised machine learning and ensemble model. Sci Rep 2022; 12:8378. [PMID: 35589934 PMCID: PMC9120483 DOI: 10.1038/s41598-022-12349-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Accepted: 04/07/2022] [Indexed: 11/28/2022]
Abstract
The physical random access channel (PRACH) is used in the uplink of cellular systems for initial access requests from the users. It is very hard to achieve low latency by implementing conventional methods in 5G. The performance of the system degrades when multiple users try to access the PRACH receiver with the same preamble signature, resulting in a collision of request signals and dual peak occurrence. In this paper, we used two machine learning classification approaches, with signal samples as big data, to obtain the best proactive approach. First, we implemented three supervised learning algorithms, Decision Tree Classification (DTC), naïve Bayes (NB), and K-nearest neighbor (KNN), to classify the outcome into two classes, labeled ‘peak’ and ‘false peak’. For the second approach, we constructed a Bagged Tree Ensembler using multiple learners, which contributes to reducing the variance of DTC, and compared their asymptotes. The comparison shows that the Ensembler method proves to be a better proactive approach for the stated problem.
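The bagging idea in the second approach can be sketched as follows, under stated assumptions: depth-one trees (decision stumps) stand in for full decision trees, each sample is a single hypothetical correlation magnitude, and the data, function names and learner count are invented for illustration. Each stump is fitted on a bootstrap sample, and the ensemble predicts by majority vote, which is what damps the variance of any single tree.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Choose the threshold with the fewest errors on (value, label) pairs."""
    candidates = sorted({x for x, _ in sample}) + [float("inf")]

    def errors(threshold):
        # A stump predicts 'peak' whenever the value reaches the threshold.
        return sum((x >= threshold) != (label == "peak") for x, label in sample)

    return min(candidates, key=errors)

def fit_bagged_stumps(data, n_learners=25, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data])
            for _ in range(n_learners)]

def predict(thresholds, x):
    """Majority vote over all stumps."""
    votes = Counter("peak" if x >= t else "false peak" for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (correlation magnitude, label).
data = [
    (0.10, "false peak"), (0.20, "false peak"),
    (0.30, "false peak"), (0.35, "false peak"),
    (0.70, "peak"), (0.80, "peak"),
    (0.90, "peak"), (0.95, "peak"),
]

model = fit_bagged_stumps(data)
print(predict(model, 1.0), "/", predict(model, 0.0))
```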
Affiliation(s)
- Rehan Qureshi
  - Sir Syed University of Engineering and Technology, Karachi, Pakistan
- Faiza Farooq
  - Sir Syed University of Engineering and Technology, Karachi, Pakistan
3
Abstract
Assessing the threat posed by bacterial samples is fundamentally important to safeguarding human health. Whole-genome sequence analysis of bacteria provides a route to achieving this goal. However, this approach is fundamentally constrained by the scope, the diversity, and our understanding of the bacterial genome sequences that are available for devising threat assessment schemes. For example, genome-based strategies offer limited utility for assessing the threat associated with pathogens that exploit novel virulence mechanisms or are recently emergent. To address these limitations, we developed PathEngine, a machine learning strategy that features the use of phenotypic hallmarks of pathogenesis to assess pathogenic threat. PathEngine successfully classified potential pathogenic threats with high accuracy and thereby establishes a phenotype-based, sequence-independent pipeline for threat assessment.

Bacterial pathogen identification, which is critical for human health, has historically relied on culturing organisms from clinical specimens. More recently, the application of machine learning (ML) to whole-genome sequences (WGSs) has facilitated pathogen identification. However, relying solely on genetic information to identify emerging or new pathogens is fundamentally constrained, especially if novel virulence factors exist. In addition, even WGSs with ML pipelines are unable to discern phenotypes associated with cryptic genetic loci linked to virulence. Here, we set out to determine if ML using phenotypic hallmarks of pathogenesis could assess potential pathogenic threat without using any sequence-based analysis. This approach successfully classified potential pathogenic threat associated with previously machine-observed and unobserved bacteria with 99% and 85% accuracy, respectively. This work establishes a phenotype-based pipeline for potential pathogenic threat assessment, which we term PathEngine, and offers strategies for the identification of bacterial pathogens.
4
Singh SK, Taylor RW, Pradhan B, Shirzadi A, Pham BT. Predicting sustainable arsenic mitigation using machine learning techniques. Ecotoxicology and Environmental Safety 2022; 232:113271. [PMID: 35121252 DOI: 10.1016/j.ecoenv.2022.113271] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 02/15/2021] [Revised: 01/21/2022] [Accepted: 01/28/2022] [Indexed: 06/14/2023]
Abstract
This study evaluates state-of-the-art machine learning models in predicting the most sustainable arsenic mitigation preference. A Gaussian distribution-based Naïve Bayes (NB) classifier scored the highest Area Under the Curve (AUC) of the Receiver Operating Characteristic curve (0.82), followed by Nu Support Vector Classification (0.80), and K-Neighbors (0.79). Ensemble classifiers scored higher than 70% AUC, with Random Forest being the top performer (0.77), and Decision Tree model ranked fourth with an AUC of 0.77. The multilayer perceptron model also achieved high performance (AUC=0.75). Most linear classifiers underperformed, with the Ridge classifier at the top (AUC=0.73) and perceptron at the bottom (AUC=0.57). A Bernoulli distribution-based Naïve Bayes classifier was the poorest model (AUC=0.50). The Gaussian NB was also the most robust ML model with the slightest variation of Kappa score on training (0.58) and test data (0.64). The results suggest that nonlinear or ensemble classifiers could more accurately understand the complex relationships of socio-environmental data and help develop accurate and robust prediction models of sustainable arsenic mitigation. Furthermore, Gaussian NB is the best option when data is scarce.
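For readers unfamiliar with the Kappa score mentioned above, the following sketch computes it on toy data, assuming the abstract refers to Cohen's kappa (the abstract does not say which variant was used). The labels are hypothetical and were chosen only to give a value close to the reported training score; they are not the study's data.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of the marginal class frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical binary labels (1 = preferred mitigation option, 0 = other).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

kappa = cohen_kappa(y_true, y_pred)
print(round(kappa, 2))  # 0.58: observed agreement 0.80, chance agreement 0.52
```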
Affiliation(s)
- Sushant K Singh
  - Department of Earth and Environmental Studies, Montclair State University, New Jersey, USA; The Center for Artificial Intelligence and Environmental Sustainability (CAIES) Foundation, Patna, Bihar, India.
- Robert W Taylor
  - Department of Earth and Environmental Studies, Montclair State University, New Jersey, USA.
- Biswajeet Pradhan
  - Centre for Advanced Modelling and Geospatial Information Systems (CAMGIS), School of Civil and Environmental Engineering, University of Technology Sydney, NSW 2007, Australia; Department of Energy and Mineral Resources Engineering, Sejong University, Choongmu-gwan, 209 Neungdong-ro Gwangjin-gu, Seoul 05006, Republic of Korea; Center of Excellence for Climate Change Research, King Abdulaziz University, P. O. Box 80234, Jeddah 21589, Saudi Arabia; Earth Observation Centre, Institute of Climate Change, Universiti Kebangsaan Malaysia, 43600 UKM, Bangi, Selangor, Malaysia.
- Ataollah Shirzadi
  - College of Natural Resources, Department of Rangeland and Watershed Management Sciences, University of Kurdistan, Sanandaj, Iran.
- Binh Thai Pham
  - Department of Geotechnical Engineering, University of Transport Technology, 54 Trieu Khuc, Thanh Xuan, Ha Noi, Viet Nam.
5
Ismail A, Elpeltagy M, Zaki M, ElDahshan KA. Deepfake video detection: YOLO-Face convolution recurrent approach. PeerJ Comput Sci 2021; 7:e730. [PMID: 34712799 PMCID: PMC8507472 DOI: 10.7717/peerj-cs.730] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/08/2021] [Accepted: 09/02/2021] [Indexed: 06/13/2023]
Abstract
Recently, deepfake techniques for swapping faces have been spreading, allowing easy creation of hyper-realistic fake videos. Detecting the authenticity of a video has become increasingly critical because of the potential negative impact on the world. Here, a new method is introduced, You Only Look Once Convolution Recurrent Neural Networks (YOLO-CRNNs), to detect deepfake videos. The YOLO-Face detector detects face regions in each frame of the video, and a fine-tuned EfficientNet-B5 extracts the spatial features of these faces. These features are fed as a batch of input sequences into a Bidirectional Long Short-Term Memory (Bi-LSTM) network to extract the temporal features. The new scheme is then evaluated on a new large-scale dataset, CelebDF-FaceForensics++ (c23), which combines two popular datasets, FaceForensics++ (c23) and Celeb-DF. It achieves an Area Under the Receiver Operating Characteristic Curve (AUROC) score of 89.35%, 89.38% accuracy, 83.15% recall, 85.55% precision, and 84.33% F1-measure for the pasting data approach. The experimental analysis confirms the superiority of the proposed method over state-of-the-art methods.
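As a quick consistency check on the reported figures, the F1-measure is the harmonic mean of precision and recall. The short calculation below uses only the two percentages quoted in the abstract and reproduces the reported 84.33%.

```python
precision = 85.55  # percent, as reported in the abstract
recall = 83.15     # percent, as reported in the abstract

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 84.33
```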
Affiliation(s)
- Aya Ismail
  - Mathematics Department, Tanta University, Tanta, Al-Gharbia, Egypt
- Marwa Elpeltagy
  - Systems and Computers Department, Al-Azhar University, Cairo, Nasr City, Egypt
- Mervat Zaki
  - Mathematics Department, Al-Azhar University (Girls Branch), Cairo, Nasr City, Egypt
6
Development of accurate classification of heavenly bodies using novel machine learning techniques. Soft Comput 2021. [DOI: 10.1007/s00500-021-05687-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Heavenly bodies are objects that float in outer space. The classification of these objects is a challenging task for astronomers. This article presents a novel methodology that enables efficient and accurate classification of cosmic objects (3 classes) based on evolutionary optimization of classifiers. The data were collected from the Sloan Digital Sky Survey database. In this work, we propose a novel machine learning model to classify stellar spectra of stars, quasars and galaxies. First, the input data are normalized and then subjected to principal component analysis to reduce the dimensionality. Then, a genetic algorithm is applied to the data to find the optimal parameters for the classifiers. We used 21 classifiers to develop an accurate and robust classification with a fivefold cross-validation strategy. Our developed model achieved an improvement in accuracy for nineteen of the twenty-one models. We obtained the highest classification accuracy of 99.16%, precision of 98.78%, recall of 98.08% and F1-score of 98.32% using an evolutionary system based on a voting classifier. The developed machine learning prototype can help astronomers accurately classify heavenly bodies in the sky. The proposed evolutionary system can be used in other areas where accurate classification of many classes is required.
7
Qummar S, Khan FG, Shah S, Khan A, Din A, Gao J. Deep Learning Techniques for Diabetic Retinopathy Detection. Curr Med Imaging 2021; 16:1201-1213. [DOI: 10.2174/1573405616666200213114026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2019] [Revised: 11/26/2019] [Accepted: 12/19/2019] [Indexed: 11/22/2022]
Abstract
Diabetes occurs due to an excess of glucose in the blood, which may affect many organs of the body. Elevated blood sugar causes many problems, including Diabetic Retinopathy (DR). DR occurs due to damage to the blood vessels in the retina. The manual detection of DR by ophthalmologists is complicated and time-consuming. Therefore, automatic detection is required, and recently different machine and deep learning techniques have been applied to detect and classify DR. In this paper, we survey the various techniques available in the literature for the identification/classification of DR, discuss the strengths and weaknesses of the available datasets for each method, and provide future directions. Moreover, we also discuss the different steps of detection, namely segmentation of blood vessels in the retina, detection of lesions, and detection of other abnormalities of DR.
Affiliation(s)
- Sehrish Qummar
  - Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan
- Fiaz Gul Khan
  - Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan
- Sajid Shah
  - Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan
- Ahmad Khan
  - Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan
- Ahmad Din
  - Department of Computer Science, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, Pakistan
- Jinfeng Gao
  - Department of Information Engineering, Huanghuai University, Henan, China
8
Large J, Lines J, Bagnall A. A probabilistic classifier ensemble weighting scheme based on cross-validated accuracy estimates. Data Min Knowl Discov 2019; 33:1674-1709. [PMID: 31632184 PMCID: PMC6790343 DOI: 10.1007/s10618-019-00638-y] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/21/2017] [Accepted: 06/03/2019] [Indexed: 11/17/2022]
Abstract
Our hypothesis is that building ensembles of small sets of strong classifiers constructed with different learning algorithms is, on average, the best approach to classification for real-world problems. We propose a simple mechanism for building small heterogeneous ensembles based on exponentially weighting the probability estimates of the base classifiers with an estimate of the accuracy formed through cross-validation on the train data. We demonstrate through extensive experimentation that, given the same small set of base classifiers, this method has measurable benefits over commonly used alternative weighting, selection or meta-classifier approaches to heterogeneous ensembles. We also show how an ensemble of five well-known, fast classifiers can produce an ensemble that is not significantly worse than large homogeneous ensembles and tuned individual classifiers on datasets from the UCI archive. We provide evidence that the performance of the cross-validation accuracy weighted probabilistic ensemble (CAWPE) generalises to a completely separate set of datasets, the UCR time series classification archive, and we also demonstrate that our ensemble technique can significantly improve the state-of-the-art classifier for this problem domain. We investigate the performance in more detail, and find that the improvement is most marked in problems with smaller train sets. We perform a sensitivity analysis and an ablation study to demonstrate the robustness of the ensemble and the significant contribution of each design element of the classifier. We conclude that it is, on average, better to ensemble strong classifiers with a weighting scheme rather than perform extensive tuning and that CAWPE is a sensible starting point for combining classifiers.
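The weighting mechanism described in this abstract can be sketched in a few lines, under stated assumptions: each base classifier's probability estimates are scaled by its cross-validated accuracy raised to a power, then summed and renormalised. The function name, the exponent value of 4 and the toy numbers are illustrative assumptions, since the abstract gives neither the exponent nor any example.

```python
def cawpe_combine(prob_estimates, cv_accuracies, alpha=4.0):
    """Combine per-classifier class probabilities, weighting each
    classifier by its cross-validated accuracy raised to alpha."""
    weights = [acc ** alpha for acc in cv_accuracies]
    n_classes = len(prob_estimates[0])
    combined = [
        sum(w * probs[c] for w, probs in zip(weights, prob_estimates))
        for c in range(n_classes)
    ]
    total = sum(combined)
    return [value / total for value in combined]

# Hypothetical two-class estimates from three base classifiers.
prob_estimates = [
    [0.4, 0.6],  # strongest classifier leans towards class 1
    [0.7, 0.3],
    [0.6, 0.4],
]
cv_accuracies = [0.9, 0.6, 0.6]

weighted = cawpe_combine(prob_estimates, cv_accuracies)
unweighted = cawpe_combine(prob_estimates, cv_accuracies, alpha=0.0)
print(weighted, unweighted)
```

With `alpha=0.0` every classifier counts equally and the two weaker classifiers carry the vote for class 0. With the larger exponent, the more accurate classifier dominates and class 1 wins, which is the effect the exponential weighting is meant to produce.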
Affiliation(s)
- James Large
  - School of Computing Sciences, University of East Anglia, Norwich, UK
- Jason Lines
  - School of Computing Sciences, University of East Anglia, Norwich, UK
- Anthony Bagnall
  - School of Computing Sciences, University of East Anglia, Norwich, UK
9
Imbalance-Aware Machine Learning for Predicting Rare and Common Disease-Associated Non-Coding Variants. Sci Rep 2017; 7:2959. [PMID: 28592878 PMCID: PMC5462751 DOI: 10.1038/s41598-017-03011-5] [Citation(s) in RCA: 43] [Impact Index Per Article: 6.1] [Received: 10/17/2016] [Accepted: 04/21/2017] [Indexed: 12/15/2022]
Abstract
Disease and trait-associated variants represent a tiny minority of all known genetic variation, and therefore there is necessarily an imbalance between the small set of available disease-associated variants and the much larger set of non-deleterious genomic variation, especially in non-coding regulatory regions of the human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants are faced with a chicken-and-egg problem: such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with the imbalanced data that naturally arise in several genome-wide variant scoring problems, resulting in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach, and that outperforms state-of-the-art methods in two different contexts: the prediction of non-coding variants associated with Mendelian diseases and with complex diseases. We show that imbalance-aware ML is a key issue for the design of robust and accurate prediction algorithms, and we provide a method and an easy-to-use software tool that can be effectively applied to this challenging prediction task.
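One common resampling strategy of the kind the abstract alludes to is random undersampling of the majority class, repeated to give several balanced training sets that an ensemble could then learn from. The sketch below illustrates only that idea; it is not the authors' method, and the function name, variant identifiers, subset count and seed are hypothetical.

```python
import random

def balanced_subsets(positives, negatives, n_subsets=3, seed=0):
    """Pair every positive with an equally sized random draw of negatives.

    Assumes there are at least as many negatives as positives.
    """
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        drawn = rng.sample(negatives, len(positives))  # without replacement
        subsets.append([(x, 1) for x in positives] + [(x, 0) for x in drawn])
    return subsets

# Hypothetical data: 2 disease-associated variants against 100 neutral ones.
positives = ["variant_a", "variant_b"]
negatives = [f"neutral_{i}" for i in range(100)]

subsets = balanced_subsets(positives, negatives)
for subset in subsets:
    print(subset)
```

Each subset could train one base learner, so that across the ensemble many different negatives are seen while every learner still trains on balanced data.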
10
Automatic Estimation of Osteoporotic Fracture Cases by Using Ensemble Learning Approaches. J Med Syst 2015; 40:61. [DOI: 10.1007/s10916-015-0413-1] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Received: 07/10/2015] [Accepted: 11/17/2015] [Indexed: 10/22/2022]
11
Kokkinos Y, Margaritis KG. Confidence ratio affinity propagation in ensemble selection of neural network classifiers for distributed privacy-preserving data mining. Neurocomputing 2015. [DOI: 10.1016/j.neucom.2014.07.065] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Indexed: 11/26/2022]
12
Nascimento DS, Coelho AL, Canuto AM. Integrating complementary techniques for promoting diversity in classifier ensembles: A systematic study. Neurocomputing 2014. [DOI: 10.1016/j.neucom.2014.01.027] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Indexed: 10/25/2022]
13
14
Determining the Number of Beams in 3D Conformal Radiotherapy: A Classification Approach. Procedia Technology 2013. [DOI: 10.1016/j.protcy.2013.12.107] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
15