1
Abbasi Holasou H, Panahi B, Shahi A, Nami Y. Integration of machine learning models with microsatellite markers: New avenue in world grapevine germplasm characterization. Biochem Biophys Rep 2024; 38:101678. [PMID: 38495412 PMCID: PMC10940787 DOI: 10.1016/j.bbrep.2024.101678] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/23/2023] [Revised: 02/09/2024] [Accepted: 02/27/2024] [Indexed: 03/19/2024]
Abstract
Efficient analytical techniques are required to interpret biological data effectively, generate novel hypotheses, and find critical predictive patterns. Machine learning algorithms provide a novel opportunity for development of low-cost and practical solutions in biology. In this study, we proposed a new integrated analytical approach using supervised machine learning algorithms and microsatellite data of worldwide Vitis populations. A total of 1378 wild (V. vinifera ssp. sylvestris) and cultivated (V. vinifera ssp. sativa) accessions of grapevine were investigated using 20 microsatellite markers. Data cleaning, feature selection, and supervised machine learning classification models, viz. Naive Bayes, Support Vector Machine (SVM), and Tree Induction methods, were applied to find the most indicative and diagnostic alleles representing the wild/cultivated status and geographic origin of each population. Our combined approaches revealed the microsatellite markers with the highest differentiating capacity and proved the efficiency of our pipeline for classification and prediction of Vitis accessions. Moreover, our study proposed the best combination of markers for better distinguishing of populations, which can be exploited in future germplasm conservation and breeding programs.
Affiliation(s)
- Hossein Abbasi Holasou
  - Department of Plant Breeding and Biotechnology, Faculty of Agriculture, University of Tabriz, Tabriz, Iran
- Bahman Panahi
  - Department of Genomics, Branch for Northwest and West Region, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research, Education and Extension Organization (AREEO), Tabriz, Iran
- Ali Shahi
  - Faculty of Agriculture (Meshgin Shahr Campus), Mohaghegh Ardabili University, Ardabil, Iran
- Yousef Nami
  - Department of Food Biotechnology, Branch for Northwest and West Region, Agricultural Biotechnology Research Institute of Iran (ABRII), Agricultural Research, Education and Extension Organization (AREEO), Tabriz, Iran
2
Agho CA, Śliwka J, Nassar H, Niinemets Ü, Runno-Paurson E. Machine Learning-Based Identification of Mating Type and Metalaxyl Response in Phytophthora infestans Using SSR Markers. Microorganisms 2024; 12:982. [PMID: 38792811 PMCID: PMC11124124 DOI: 10.3390/microorganisms12050982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2024] [Revised: 05/06/2024] [Accepted: 05/09/2024] [Indexed: 05/26/2024]
Abstract
Phytophthora infestans is the causal agent of late blight in potato. The occurrence of P. infestans with both A1 and A2 mating types in the field may result in sexual reproduction and the generation of recombinant strains. Such strains with new combinations of traits can be highly aggressive, resistant to fungicides, and can make the disease difficult to control in the field. Metalaxyl-resistant isolates are now more prevalent in potato fields. Understanding the genetic structure and rapid identification of mating types and metalaxyl response of P. infestans in the field is a prerequisite for effective late blight disease monitoring and management. Molecular and phenotypic assays involving molecular and phenotypic markers such as mating types and metalaxyl response are typically conducted separately in the studies of the genotypic and phenotypic diversity of P. infestans. As a result, there is a pressing need to reduce the experimental workload and more efficiently assess the aggressiveness of different strains. We think that employing genetic markers to not only estimate genotypic diversity but also to identify the mating type and fungicide response using machine learning techniques can guide and speed up the decision-making process in late blight disease management, especially when the mating type and metalaxyl resistance data are not available. This technique can also be applied to determine these phenotypic traits for dead isolates. In this study, over 600 P. infestans isolates from different populations (Estonia, Pskov region, and Poland) were classified for mating types and metalaxyl response using machine learning techniques based on simple sequence repeat (SSR) markers. For both traits, random forest and the support vector machine demonstrated good accuracy of over 70%, compared to the decision tree and artificial neural network models whose accuracy was lower. There were also associations (p < 0.05) between the traits and some of the alleles detected, but machine learning prediction techniques based on multilocus SSR genotypes offered better prediction accuracy.
Affiliation(s)
- Collins A. Agho
  - Institute of Agricultural and Environmental Sciences, Estonian University of Life Sciences, Kreutzwaldi 1, 51006 Tartu, Estonia
- Jadwiga Śliwka
  - Plant Breeding and Acclimatization Institute—National Research Institute in Radzików, Department of Potato Genetics and Parental Lines, Platanowa Str. 19, 05-831 Młochów, Poland
- Helina Nassar
  - Institute of Agricultural and Environmental Sciences, Estonian University of Life Sciences, Kreutzwaldi 1, 51006 Tartu, Estonia
- Ülo Niinemets
  - Institute of Agricultural and Environmental Sciences, Estonian University of Life Sciences, Kreutzwaldi 1, 51006 Tartu, Estonia
  - Estonian Academy of Sciences, Kohtu 6, 10130 Tallinn, Estonia
- Eve Runno-Paurson
  - Institute of Agricultural and Environmental Sciences, Estonian University of Life Sciences, Kreutzwaldi 1, 51006 Tartu, Estonia
3
Zhou Z, Huang C, Fu P, Huang H, Zhang Q, Wu X, Yu Q, Sun Y. Prediction of in-hospital hypokalemia using machine learning and first hospitalization day records in patients with traumatic brain injury. CNS Neurosci Ther 2022; 29:181-191. [PMID: 36258296 PMCID: PMC9804086 DOI: 10.1111/cns.13993] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2022] [Revised: 09/18/2022] [Accepted: 09/23/2022] [Indexed: 02/06/2023]
Abstract
AIMS Hypokalemia is a common complication following traumatic brain injury, which may complicate treatment and lead to unfavorable outcomes. Identifying patients at risk of hypokalemia on the first day of admission helps to implement prophylactic treatment, reduce complications, and improve prognosis. METHODS This multicenter retrospective study was performed between January 2017 and December 2020 using the electronic medical records of patients admitted due to traumatic brain injury. A propensity score matching approach was adopted with a ratio of 1:1 to overcome overfitting and data imbalance during subgroup analyses. Five machine learning algorithms were applied to generate a best-performing prediction model for in-hospital hypokalemia. Internal fivefold cross-validation and external validation were performed to demonstrate the interpretability and generalizability. RESULTS A total of 4445 TBI patients were recruited for analysis and model generation. Hypokalemia occurred in 46.55% of recruited patients and the incidences of mild, moderate, and severe hypokalemia were 32.06%, 12.69%, and 1.80%, respectively. Hypokalemia was associated with increased mortality, and severe hypokalemia had a greater impact. The logistic regression algorithm had the best performance in predicting decreased serum potassium and moderate-to-severe hypokalemia, with an AUC of 0.73 ± 0.011 and 0.74 ± 0.019, respectively. The prediction model was further verified using two external datasets, including our previously published data and the open-access Medical Information Mart for Intensive Care database. Linearized calibration curves showed no statistical difference (p > 0.05) from perfect predictions. CONCLUSIONS The occurrence of hypokalemia following traumatic brain injury can be predicted by first hospitalization day records and machine learning algorithms. The logistic regression algorithm showed an optimal predicting performance verified by both internal and external validation.
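The 1:1 propensity score matching mentioned in the METHODS can be illustrated with a minimal sketch. The abstract does not say which matching algorithm was used, so the greedy nearest-neighbour strategy, the `caliper` threshold, the function name `match_one_to_one`, and the toy scores below are all assumptions made for illustration, not the authors' procedure.

```python
from typing import Dict, List, Tuple


def match_one_to_one(
    cases: Dict[str, float],
    controls: Dict[str, float],
    caliper: float = 0.05,
) -> List[Tuple[str, str]]:
    """Greedy 1:1 nearest-neighbour matching on propensity scores.

    `cases` and `controls` map a (hypothetical) patient id to an already
    estimated propensity score. Each control is used at most once, and a
    case stays unmatched if no remaining control lies within `caliper`.
    """
    available = dict(controls)  # controls not yet used in a pair
    pairs: List[Tuple[str, str]] = []
    # Process cases in order of score so the result is deterministic.
    for case_id, case_score in sorted(cases.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control by absolute score difference.
        best = min(available, key=lambda c: abs(available[c] - case_score))
        if abs(available[best] - case_score) <= caliper:
            pairs.append((case_id, best))
            del available[best]
    return pairs


# Toy scores, invented for the example.
cases = {"t1": 0.30, "t2": 0.70, "t3": 0.95}
controls = {"c1": 0.32, "c2": 0.69, "c3": 0.10}
pairs = match_one_to_one(cases, controls)
```

In this toy run `t3` is left unmatched because the only remaining control is far outside the caliper; dropping such cases is what keeps the matched groups comparable.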
Affiliation(s)
- Zhengyu Zhou
  - Department of Anesthesia, Huashan Hospital, Fudan University, Shanghai, China
- Chiungwei Huang
  - Health Consultation and Physical Examination Center, Zhongshan Hospital, Fudan University, Shanghai, China
  - Department of Neurosurgery, Huashan Hospital, Shanghai Medical College, Fudan University, Shanghai, China
- Pengfei Fu
  - Department of Neurosurgery, Huashan Hospital, Shanghai Medical College, Fudan University, Shanghai, China
- Hong Huang
  - Information Center, Huashan Hospital, Fudan University, Shanghai, China
- Qi Zhang
  - Information Center, Huashan Hospital, Fudan University, Shanghai, China
- Xuehai Wu
  - Department of Neurosurgery, Huashan Hospital, Shanghai Medical College, Fudan University, Shanghai, China
  - National Center for Neurological Disorders, Shanghai, China
  - Shanghai Key Laboratory of Brain Function Restoration and Neural Regeneration, Shanghai, China
  - Neurosurgical Institute of Fudan University, Shanghai, China
  - Shanghai Clinical Medical Center of Neurosurgery, Shanghai, China
- Qiong Yu
  - Department of Anesthesia, Huashan Hospital, Fudan University, Shanghai, China
- Yirui Sun
  - Department of Neurosurgery, Huashan Hospital, Shanghai Medical College, Fudan University, Shanghai, China
  - National Center for Neurological Disorders, Shanghai, China
  - Shanghai Key Laboratory of Brain Function Restoration and Neural Regeneration, Shanghai, China
  - Neurosurgical Institute of Fudan University, Shanghai, China
  - Shanghai Clinical Medical Center of Neurosurgery, Shanghai, China
4
Jafari O, Ebrahimi M, Hedayati SAA, Zeinalabedini M, Poorbagher H, Nasrolahpourmoghadam M, Fernandes JMO. Integration of Morphometrics and Machine Learning Enables Accurate Distinction between Wild and Farmed Common Carp. LIFE (BASEL, SWITZERLAND) 2022; 12:life12070957. [PMID: 35888047 PMCID: PMC9315565 DOI: 10.3390/life12070957] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/04/2022] [Revised: 06/16/2022] [Accepted: 06/20/2022] [Indexed: 11/16/2022]
Abstract
Morphology and feature selection are key approaches to address several issues in fisheries science and stock management, such as the hypothesis of admixture of Caspian common carp (Cyprinus carpio) and farmed carp stocks in Iran. The present study was performed to investigate the population classification of common carp in the southern Caspian basin using data mining algorithms to find the most important characteristic(s) differing between Iranian and farmed common carp. A total of 74 individuals were collected from three locations within the southern Caspian basin and from one farm between November 2015 and April 2016. A dataset of 26 traditional morphometric (TMM) attributes and a dataset of 14 geometric landmark points were constructed and then subjected to various machine learning methods. In general, the machine learning methods had a higher prediction rate with TMM datasets. The highest decision tree accuracy of 77% was obtained by rule and decision tree parallel algorithms, and “head height on eye area” was selected as the best marker to distinguish between wild and farmed common carp. Various machine learning algorithms were evaluated, and we found that the linear discriminant was the best method, with 81.1% accuracy. The results obtained from this novel approach indicate that Darwin’s domestication syndrome is observed in common carp. Moreover, they pave the way for automated detection of farmed fish, which will be most beneficial to detect escapees and improve restocking programs.
Affiliation(s)
- Omid Jafari
  - International Sturgeon Research Institute, Iranian Fisheries Science Research Institute, Agricultural Research, Education and Extension Organization, Rasht 416353464, Iran
  - Correspondence: (O.J.); (J.M.O.F.)
- Mansour Ebrahimi
  - Department of Biology, School of Basic Science, University of Qom, Qom 3716146611, Iran
- Seyed Ali-Akbar Hedayati
  - Department of Fisheries, Faculty of Fisheries and Environmental Sciences, Gorgan University of Agricultural Sciences and Natural Resources, Gorgan 4913815739, Iran
- Mehrshad Zeinalabedini
  - Department of Genomics, Agricultural Biotechnology Research Institute of Iran (ABRII), Karaj 3135933151, Iran
- Hadi Poorbagher
  - Department of Fisheries Sciences, Faculty of Natural Resources, University of Tehran, Karaj 3158777871, Iran
- Maryam Nasrolahpourmoghadam
  - Department of Fisheries Sciences, Faculty of Natural Resources, University of Tehran, Karaj 3158777871, Iran
- Jorge M. O. Fernandes
  - Faculty of Biosciences and Aquaculture, Nord University, 8026 Bodø, Norway
  - Correspondence: (O.J.); (J.M.O.F.)
5
Karami K, Akbari M, Moradi MT, Soleymani B, Fallahi H. Survival prognostic factors in patients with acute myeloid leukemia using machine learning techniques. PLoS One 2021; 16:e0254976. [PMID: 34288963 PMCID: PMC8294525 DOI: 10.1371/journal.pone.0254976] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 09/21/2020] [Accepted: 07/07/2021] [Indexed: 12/26/2022]
Abstract
This paper identifies prognostic factors for survival in patients with acute myeloid leukemia (AML) using machine learning techniques. We have integrated machine learning with feature selection methods and have compared their performances to identify the most suitable factors in assessing the survival of AML patients. Here, six data mining algorithms including Decision Tree, Random Forest, Logistic Regression, Naive Bayes, W-Bayes Net, and Gradient Boosted Tree (GBT) are employed for the detection model and implemented using the common data mining tool RapidMiner and the open-source R package. To improve the predictive ability of our model, a set of features was selected by employing multiple feature selection methods. The accuracy of classification was obtained using 10-fold cross-validation for the various combinations of the feature selection methods and machine learning algorithms. The performance of the models was assessed by various measurement indexes including accuracy, kappa, sensitivity, specificity, positive predictive value, negative predictive value, and area under the ROC curve (AUC). Our results showed that GBT, with feature selection via the Relief algorithm, had the best performance in predicting the survival rate of AML patients, with an accuracy of 85.17% and an AUC of 0.930.
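The measurement indexes listed in this abstract have standard definitions, and a minimal sketch of how they are computed may help when comparing entries in this list. The function names `binary_metrics` and `auc_from_scores`, and the confusion-matrix counts and scores used in the example, are hypothetical; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    accuracy: float
    kappa: float
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> BinaryMetrics:
    """Summary indexes from a 2x2 confusion matrix.

    Assumes every row and column total is non-zero.
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals (Cohen's kappa).
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return BinaryMetrics(
        accuracy=accuracy,
        kappa=kappa,
        sensitivity=tp / (tp + fn),  # true positive rate
        specificity=tn / (tn + fp),  # true negative rate
        ppv=tp / (tp + fp),          # positive predictive value
        npv=tn / (tn + fn),          # negative predictive value
    )


def auc_from_scores(pos: Sequence[float], neg: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative.

    Ties count as half a win (the Mann-Whitney formulation).
    """
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented counts and scores, for illustration only.
m = binary_metrics(tp=40, fp=10, fn=5, tn=45)
auc = auc_from_scores([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

Kappa corrects accuracy for chance agreement, which is why it is usually reported alongside accuracy when class sizes are unequal.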
Affiliation(s)
- Keyvan Karami
  - Medical Biology Research Center, Kermanshah University of Medical Sciences, Kermanshah, Iran
  - Department of Animal Science, Ferdowsi University of Mashhad, Mashhad, Iran
- Mahboubeh Akbari
  - Department of Statistics, Ferdowsi University of Mashhad, Mashhad, Iran
- Mohammad-Taher Moradi
  - Medical Biology Research Center, Kermanshah University of Medical Sciences, Kermanshah, Iran
- Bijan Soleymani
  - Medical Biology Research Center, Kermanshah University of Medical Sciences, Kermanshah, Iran
  - * E-mail: (HF); (BS)
- Hossein Fallahi
  - Department of Biology, School of Sciences, Razi University, Kermanshah, Iran
  - * E-mail: (HF); (BS)
6
Machine learning and statistics to qualify environments through multi-traits in Coffea arabica. PLoS One 2021; 16:e0245298. [PMID: 33434204 PMCID: PMC7802962 DOI: 10.1371/journal.pone.0245298] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/21/2020] [Accepted: 12/25/2020] [Indexed: 11/30/2022]
Abstract
Several factors such as genotype, environment, and post-harvest processing can affect the responses of important traits in the coffee production chain. Determining the influence of these factors is of great relevance, as they can be indicators of the characteristics of the coffee produced. The choice of the most efficient models should take into account the variety of information and the particularities of each biological material. This study was developed to evaluate statistical and machine learning models that would better discriminate environments through multi-traits of coffee genotypes and identify the main agronomic and beverage quality traits responsible for the variation of the environments. For that, 31 morpho-agronomic and post-harvest traits were evaluated, from field experiments installed in three municipalities in the Matas de Minas region, in the State of Minas Gerais, Brazil. Two types of post-harvest processing were evaluated: natural and pulped. The apparent error rate was estimated for each method. The Multilayer Perceptron and Radial Basis Function networks were able to discriminate the coffee samples across multiple environments more efficiently than the other methods, identifying differences in multi-trait responses according to the production sites and type of post-harvest processing. The local factors did not present specific traits that favored the severity of diseases and differentiated vegetative vigor. The sensory traits acidity and fragrance/aroma score also made little contribution to the discrimination process, indicating that acidity and fragrance/aroma are characteristic of the coffee produced and that all coffee samples evaluated are of the special type in the Matas de Minas region. The main traits responsible for the differentiation of production sites are plant height, fruit size, and bean production. The sensory trait "Body" is the main one to discriminate the form of post-harvest processing.
7
Hyperspectral Reflectance as a Basis to Discriminate Olive Varieties—A Tool for Sustainable Crop Management. SUSTAINABILITY 2020. [DOI: 10.3390/su12073059] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/17/2022]
Abstract
Worldwide sustainable development is threatened by current agricultural land change trends, particularly by increasing rural farmland abandonment and agricultural intensification. In Mediterranean countries, these processes especially affect traditional olive groves, with enormous socio-economic costs to rural areas, endangering environmental sustainability and biodiversity. The abandonment and intensification of traditional olive groves are clearly related to the reduction of olive oil production income, leading to reduced economic viability. The most promising strategies to boost the competitiveness of traditional groves, such as olive oil differentiation through adoption of protected denomination of origin labels and development of value-added olive products, rely on knowledge of the olive varieties and the specific properties that confer their uniqueness and authenticity. Given the lack of information about olive varieties in traditional groves, a feasible and inexpensive method of variety identification is required. We analyzed leaf spectral information of ten Portuguese olive varieties with a powerful data-mining approach in order to verify the ability of satellite hyperspectral sensors to provide accurate olive variety identification. Our results show that these olive varieties are distinguishable by leaf reflectance information and suggest that even open-source satellite data could be used to map them. Additional advantages of olive variety mapping are also discussed.
8
9
Karami K, Zerehdaran S, Javadmanesh A, Shariati MM, Fallahi H. Characterization of bovine (Bos taurus) imprinted genes from genomic to amino acid attributes by data mining approaches. PLoS One 2019; 14:e0217813. [PMID: 31170205 PMCID: PMC6553745 DOI: 10.1371/journal.pone.0217813] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 03/17/2018] [Accepted: 05/21/2019] [Indexed: 01/05/2023]
Abstract
Genomic imprinting results in monoallelic expression of genes in mammals and flowering plants. Understanding the function of imprinted genes improves our knowledge of the regulatory processes in the genome. In this study, we have employed classification and clustering algorithms with attribute weighting to specify the unique attributes of both imprinted (monoallelic) and biallelically expressed genes. We have obtained characteristics of 22 known monoallelically expressed (imprinted) and 8 biallelically expressed genes that have been experimentally validated, alongside 208 randomly selected genes in bovine (Bos taurus). Attribute weighting methods and various supervised and unsupervised machine learning algorithms were applied. Unique characteristics were discovered and used to distinguish mono- and biallelically expressed genes from each other in bovine. To obtain the accuracy of classification, 10-fold cross-validation was used for each combination of attribute weighting (feature selection) and machine learning algorithm. Our approach was able to accurately predict mono- and biallelic genes using the genomic and proteomic attributes.
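The 10-fold cross-validation described here can be sketched as a simple index splitter. This is an illustrative sketch only: the function name `kfold_indices`, the shuffling seed, and the unstratified split are assumptions, and the sample count of 238 is simply the sum of the 22, 8, and 208 genes quoted in the abstract.

```python
import random
from typing import Iterator, List, Tuple


def kfold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation.

    Indices are shuffled once, then dealt into k folds whose sizes differ
    by at most one. Each sample lands in exactly one test fold.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Training set is every fold except the held-out one.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# 22 imprinted + 8 biallelic + 208 random genes, as quoted in the abstract.
splits = list(kfold_indices(238, k=10))
```

When feature selection is combined with a classifier, the usual precaution is to compute the attribute weights on each training fold only, so that the held-out fold does not influence which attributes are kept. The abstract does not state how the authors handled this.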
Affiliation(s)
- Keyvan Karami
  - Department of Animal Science, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
- Saeed Zerehdaran
  - Department of Animal Science, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
- Ali Javadmanesh
  - Department of Animal Science, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
- Mohammad Mahdi Shariati
  - Department of Animal Science, Faculty of Agriculture, Ferdowsi University of Mashhad, Mashhad, Iran
- Hossein Fallahi
  - Department of Biology, School of Sciences, Razi University, Kermanshah, Iran
10
Kargarfard F, Sami A, Mohammadi-Dehcheshmeh M, Ebrahimie E. Novel approach for identification of influenza virus host range and zoonotic transmissible sequences by determination of host-related associative positions in viral genome segments. BMC Genomics 2016; 17:925. [PMID: 27852224 PMCID: PMC5112743 DOI: 10.1186/s12864-016-3250-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 02/21/2016] [Accepted: 11/02/2016] [Indexed: 01/01/2023]
Abstract
BACKGROUND Recent (2013 and 2009) zoonotic transmission of avian or porcine influenza to humans highlights an increase in host range by evading species barriers. Gene reassortment or antigenic shift between viruses from two or more hosts can generate a new life-threatening virus when the new shuffled virus is no longer recognized by antibodies existing within human populations. There is no large-scale study to help understand the underlying mechanisms of host transmission. Furthermore, there is no clear understanding of how different segments of the influenza genome contribute to the final determination of host range. METHODS To obtain insight into the rules underpinning host range determination, various supervised machine learning algorithms were employed to mine reassortment changes in different viral segments in a range of hosts. Our multi-host dataset contained whole segments of 674 influenza strains organized into three host categories: avian, human, and swine. Some of the sequences were assigned to multiple hosts. The datasets are therefore multi-labeled, and we utilized a multi-label learning method to identify discriminative sequence sites. Then algorithms such as CBA, Ripper, and decision tree were applied to extract informative and descriptive association rules for each viral protein segment. RESULT We found informative rules in all segments that are common within the same host class but varied between different hosts. For example, for infection of an avian host, HA14V and NS1230S were the most important discriminative and combinatorial positions. CONCLUSION Host range identification is facilitated by the high-support combined rules in this study. Our major goal was to detect discriminative genomic positions that were able to identify multi-host viruses, because such viruses are likely to cause pandemics or disastrous epidemics.
Affiliation(s)
- Fatemeh Kargarfard
  - Department of Computer Science and Engineering, School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran
- Ashkan Sami
  - Department of Computer Science and Engineering, School of Electrical and Computer Engineering, Shiraz University, Shiraz, Iran
- Manijeh Mohammadi-Dehcheshmeh
  - School of Animal and Veterinary Sciences, The University of Adelaide, Adelaide, Australia
  - Institute of Biotechnology, Shiraz University, Shiraz, Iran
- Esmaeil Ebrahimie
  - School of Animal and Veterinary Sciences, The University of Adelaide, Adelaide, Australia
  - School of Medicine, Faculty of Health Sciences, The University of Adelaide, Adelaide, Australia
  - Institute of Biotechnology, Shiraz University, Shiraz, Iran
  - School of Information Technology and Mathematical Sciences, Division of Information Technology, Engineering and the Environment, University of South Australia, Adelaide, Australia
  - School of Biological Sciences, Faculty of Science and Engineering, Flinders University, Adelaide, Australia
11
Pashaiasl M, Khodadadi K, Kayvanjoo AH, Pashaei-asl R, Ebrahimie E, Ebrahimi M. Unravelling evolution of Nanog, the key transcription factor involved in self-renewal of undifferentiated embryonic stem cells, by pattern recognition in nucleotide and tandem repeats characteristics. Gene 2016; 578:194-204. [DOI: 10.1016/j.gene.2015.12.023] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 06/26/2015] [Revised: 12/10/2015] [Accepted: 12/10/2015] [Indexed: 12/27/2022]
12
Zinati Z, Alemzadeh A, KayvanJoo AH. Computational approaches for classification and prediction of P-type ATPase substrate specificity in Arabidopsis. PHYSIOLOGY AND MOLECULAR BIOLOGY OF PLANTS : AN INTERNATIONAL JOURNAL OF FUNCTIONAL PLANT BIOLOGY 2016; 22:163-174. [PMID: 27186030 PMCID: PMC4840148 DOI: 10.1007/s12298-016-0351-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 12/02/2015] [Revised: 03/15/2016] [Accepted: 03/28/2016] [Indexed: 06/05/2023]
Abstract
As an extended gamut of integral membrane (extrinsic) proteins, and based on their transporting specificities, P-type ATPases include five subfamilies in Arabidopsis, inter alia, P4 ATPases (phospholipid-transporting ATPase), P3A ATPases (plasma membrane H(+) pumps), P2A and P2B ATPases (Ca(2+) pumps) and P1B ATPases (heavy metal pumps). Although many different computational methods have been developed to predict the substrate specificity of unknown proteins, further investigation is needed to improve the efficiency and performance of the predictors. In this study, various attribute weighting and supervised clustering algorithms were employed to identify the main amino acid composition attributes that can influence the substrate specificity of ATPase pumps, to classify protein pumps, and to predict the substrate specificity of uncharacterized ATPase pumps. The results of this study indicate that both non-reduced coefficients pertaining to absorption and Cys extinction within 280 nm, the frequencies of hydrogen, Ala, Val, carbon, hydrophilic residues, the counts of Val, Asn, Ser, Arg, Phe, Tyr, hydrophilic residues, Phe-Phe, Ala-Ile, Phe-Leu, Val-Ala and length are specified as the most important amino acid attributes through applying the whole attribute weighting models. Here, learning algorithms engineered in a predictive machine (Naive Bayes) are proposed to predict the substrate specificities of Q9LVV1 and O22180 (P-type ATPase-like proteins) with 100% prediction confidence. For the first time, our analysis demonstrated a promising application of bioinformatics algorithms in classifying ATPase pumps. Moreover, we suggest predictive systems that can assist in predicting the substrate specificity of any new ATPase pump with the maximum possible prediction confidence.
Affiliation(s)
- Zahra Zinati
  - Department of Agroecology, College of Agriculture and Natural Resources of Darab, Shiraz University, Shiraz, Iran
- Abbas Alemzadeh
  - Department of Crop Production and Plant Breeding, College of Agriculture, Shiraz University, Shiraz, Iran
- Amir Hossein KayvanJoo
  - Bonn-Aachen International Center for Information Technology B-IT, University of Bonn, Bonn, Germany
13
Torkzaban B, Kayvanjoo AH, Ardalan A, Mousavi S, Mariotti R, Baldoni L, Ebrahimie E, Ebrahimi M, Hosseini-Mazinani M. Machine Learning Based Classification of Microsatellite Variation: An Effective Approach for Phylogeographic Characterization of Olive Populations. PLoS One 2015; 10:e0143465. [PMID: 26599001 PMCID: PMC4658005 DOI: 10.1371/journal.pone.0143465] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 06/30/2015] [Accepted: 11/05/2015] [Indexed: 11/24/2022]
Abstract
Finding efficient analytical techniques is increasingly becoming a bottleneck for the effective use of large biological data. Machine learning offers a novel and powerful tool to advance classification and modeling solutions in molecular biology. However, these methods have been less frequently used with empirical population genetics data. In this study, we developed a new combined approach that applies machine learning algorithms to microsatellite marker data from our previous studies of olive populations. Herein, 267 olive accessions of various origins including 21 reference cultivars, 132 local ecotypes, and 37 wild olive specimens from the Iranian plateau, together with 77 of the most represented Mediterranean varieties, were investigated using a finely selected panel of 11 microsatellite markers. We organized the data into two experiments, ‘4-targeted’ and ‘16-targeted’. A strategy of assaying different machine-based analyses (i.e. data cleaning, feature selection, and machine learning classification) was devised to identify the most informative loci and the most diagnostic alleles to represent the population and the geography of each olive accession. These analyses revealed the microsatellite markers with the highest differentiating capacity and proved the efficiency of our method of clustering olive accessions to reflect their regions of origin. A highlight of this study was the discovery of the best combination of markers for better differentiating of populations via machine learning models, which can be exploited to distinguish among other biological populations.
Affiliation(s)
- Bahareh Torkzaban
- National Institute of Genetic Engineering & Biotechnology, Tehran, Iran
- Arman Ardalan
- National Institute of Genetic Engineering & Biotechnology, Tehran, Iran
- Department of Gene Technology, KTH, Royal Institute of Technology, Science for Life Laboratory, Solna, Sweden
- Soraya Mousavi
- National Institute of Genetic Engineering & Biotechnology, Tehran, Iran
- Luciana Baldoni
- CNR, Institute of Biosciences & Bioresources, Perugia, Italy
- Esmaeil Ebrahimie
- Institute of Biotechnology, College of Agriculture, Shiraz University, Shiraz, Iran
- Department of Genetics and Evolution, School of Biological Sciences, University of Adelaide, Adelaide, Australia
- School of Information Technology and Mathematical Sciences, Division of Information Technology, Engineering and the Environment, University of South Australia, Adelaide, Australia
- School of Biological Sciences, Faculty of Science and Engineering, Flinders University, Adelaide, Australia
- Mansour Ebrahimi
- Department of Biology, School of Basic Science, University of Qom, Qom, Iran
- * E-mail: (MHM); (ME)
- Mehdi Hosseini-Mazinani
- National Institute of Genetic Engineering & Biotechnology, Tehran, Iran
- * E-mail: (MHM); (ME)
|
14
|
Gürüler H, Peker M, Baysal Ö. Soft computing model on genetic diversity and pathotype differentiation of pathogens: A novel approach. ELECTRON J BIOTECHN 2015. [DOI: 10.1016/j.ejbt.2015.06.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
|
15
|
Nasiri J, Naghavi MR, Kayvanjoo AH, Nasiri M, Ebrahimi M. Precision assessment of some supervised and unsupervised algorithms for genotype discrimination in the genus Pisum using SSR molecular data. J Theor Biol 2015; 368:122-32. [PMID: 25591889 DOI: 10.1016/j.jtbi.2015.01.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 07/10/2014] [Revised: 11/06/2014] [Accepted: 01/01/2015] [Indexed: 10/24/2022]
Abstract
For the first time, prediction accuracies of some supervised and unsupervised algorithms were evaluated in an SSR-based DNA fingerprinting study of a pea collection containing 20 cultivars and 57 wild samples. In general, according to the 10 attribute weighting models, the SSR alleles of PEAPHTAP-2 and PSBLOX13.2-1 were the two most important attributes for discriminating among eight different species and subspecies of the genus Pisum. In addition, K-Medoids unsupervised clustering run on the Chi squared dataset exhibited the best prediction accuracy (83.12%), while the lowest accuracy (25.97%) was obtained when the K-Means model was run on the FCdb database. Irrespective of some fluctuations, the overall accuracies of tree induction models were significantly high for many algorithms, and the attributes PSBLOX13.2-3 and PEAPHTAP could successfully separate Pisum fulvum accessions and cultivars from the others when two selected decision trees were taken into account. Meanwhile, the other supervised algorithms used exhibited reliable overall accuracies, although in some rare cases they gave low accuracies. Our results, altogether, demonstrate promising applications of both supervised and unsupervised algorithms as suitable data mining tools for accurate fingerprinting of different species and subspecies of the genus Pisum, a fundamental priority task in breeding programs of the crop.
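As an illustration of how SSR genotypes can be turned into allele-level attributes for weighting and classification, the following minimal Python sketch encodes each accession as a presence/absence vector over (locus, allele) pairs. The encoding scheme, the function name `encode_alleles`, the locus names and the allele sizes are all assumptions made for illustration; the abstract does not describe the study's actual data preparation.

```python
def encode_alleles(genotypes):
    """Encode diploid SSR genotypes as binary allele attributes.

    genotypes: {sample: {locus: (allele1, allele2)}}
    Returns (columns, matrix), where columns is a sorted list of
    (locus, allele) pairs and matrix maps each sample to a 0/1 vector.
    """
    # Collect every (locus, allele) pair observed in any sample.
    columns = sorted({(locus, allele)
                      for loci in genotypes.values()
                      for locus, pair in loci.items()
                      for allele in pair})
    matrix = {}
    for sample, loci in genotypes.items():
        present = {(locus, allele)
                   for locus, pair in loci.items()
                   for allele in pair}
        matrix[sample] = [1 if col in present else 0 for col in columns]
    return columns, matrix


# Hypothetical toy data: two accessions scored at two loci (sizes in bp).
toy = {
    "acc1": {"LocusA": (120, 124), "LocusB": (200, 200)},
    "acc2": {"LocusA": (124, 128), "LocusB": (200, 204)},
}
columns, matrix = encode_alleles(toy)
for sample, row in matrix.items():
    print(sample, row)
```

Each column of the resulting matrix can then be treated as one attribute by a weighting model or a classifier.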
Affiliation(s)
- Jaber Nasiri
- Department of Agronomy and Plant Breeding, Division of Molecular Plant Genetics, College of Agricultural & Natural Resources, University of Tehran, Karaj, Tehran, Iran.
- Mohammad Reza Naghavi
- Department of Agronomy and Plant Breeding, College of Agricultural & Natural Resources, University of Tehran, Karaj, Tehran, Iran.
- Mojtaba Nasiri
- School of Life Sciences, Biomedical Science, Division of Molecular Biology, University of Sussex, Falmer, Brighton, UK.
- Mansour Ebrahimi
- Department of Biology, School of Basic Sciences, University of Qom, Qom, Iran.
|
16
|
Comparison of two exploratory data analysis methods for classification of Phyllanthus chemical fingerprint: unsupervised vs. supervised pattern recognition technologies. Anal Bioanal Chem 2014; 407:1389-401. [DOI: 10.1007/s00216-014-8371-x] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 10/09/2014] [Revised: 11/17/2014] [Accepted: 11/26/2014] [Indexed: 12/20/2022]
|
17
|
New layers in understanding and predicting α-linolenic acid content in plants using amino acid characteristics of omega-3 fatty acid desaturase. Comput Biol Med 2014; 54:14-23. [DOI: 10.1016/j.compbiomed.2014.08.019] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 04/17/2014] [Revised: 08/16/2014] [Accepted: 08/17/2014] [Indexed: 12/11/2022]
|
18
|
KayvanJoo AH, Ebrahimi M, Haqshenas G. Prediction of hepatitis C virus interferon/ribavirin therapy outcome based on viral nucleotide attributes using machine learning algorithms. BMC Res Notes 2014; 7:565. [PMID: 25150834 PMCID: PMC4246553 DOI: 10.1186/1756-0500-7-565] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 03/27/2014] [Accepted: 08/10/2014] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Hepatitis C virus (HCV) causes chronic hepatitis C in 2-3% of world population and remains one of the health threatening human viruses, worldwide. In the absence of an effective vaccine, therapeutic approach is the only option to combat hepatitis C. Interferon-alpha (IFN-alpha) and ribavirin (RBV) combination alone or in combination with recently introduced new direct-acting antivirals (DAA) is used to treat patients infected with HCV. The present study utilized feature selection methods (Gini Index, Chi Squared and machine learning algorithms) and other bioinformatics tools to identify genetic determinants of therapy outcome within the entire HCV nucleotide sequence. RESULTS Using combination of several algorithms, the present study performed a comprehensive bioinformatics analysis and identified several nucleotide attributes within the full-length nucleotide sequences of HCV subtypes 1a and 1b that correlated with treatment outcome. Feature selection algorithms identified several nucleotide features (e.g. count of hydrogen and CG). Combination of algorithms utilized the selected nucleotide attributes and predicted HCV subtypes 1a and 1b therapy responders from non-responders with an accuracy of 75.00% and 85.00%, respectively. In addition, therapy responders and relapsers were categorized with an accuracy of 82.50% and 84.17%, respectively. Based on the identified attributes, decision trees were induced to differentiate different therapy response groups. CONCLUSIONS The present study identified new genetic markers that potentially impact the outcome of hepatitis C treatment. In addition, the results suggest new viral genomic attributes that might influence the outcome of IFN-mediated immune response to HCV infection.
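The Gini Index and Chi Squared criteria named in the abstract can be illustrated with a short Python sketch that scores a categorical attribute against an outcome label. This is a generic textbook formulation, not the study's implementation; the function names, the binned attribute values and the responder/non-responder labels below are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_gain(feature, labels):
    """Impurity decrease obtained by splitting on a categorical feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    weighted = sum(len(g) / n * gini_impurity(g) for g in groups.values())
    return gini_impurity(labels) - weighted


def chi_squared(feature, labels):
    """Pearson chi-squared statistic of the feature-by-label table."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    row_totals = Counter(feature)
    col_totals = Counter(labels)
    stat = 0.0
    for f, row_n in row_totals.items():
        for l, col_n in col_totals.items():
            expected = row_n * col_n / n
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat


# Hypothetical toy data: one informative and one uninformative attribute.
outcome = ["responder"] * 3 + ["non-responder"] * 3
informative = ["high", "high", "high", "low", "low", "low"]
uninformative = ["x", "y", "x", "y", "x", "y"]

for name, attr in [("informative", informative), ("uninformative", uninformative)]:
    print(name, gini_gain(attr, outcome), chi_squared(attr, outcome))
```

Under both criteria the attribute that separates the two outcome groups perfectly receives the higher score, which is the property a feature selection step relies on when ranking attributes.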
Affiliation(s)
- Mansour Ebrahimi
- Department of Biology, School of Basic Sciences, University of Qom, Qom, Iran.
|
19
|
Bakhtiarizadeh MR, Moradi-Shahrbabak M, Ebrahimi M, Ebrahimie E. Neural network and SVM classifiers accurately predict lipid binding proteins, irrespective of sequence homology. J Theor Biol 2014; 356:213-22. [PMID: 24819464 DOI: 10.1016/j.jtbi.2014.04.040] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Received: 01/15/2014] [Revised: 04/03/2014] [Accepted: 04/29/2014] [Indexed: 01/05/2023]
Abstract
Due to the central roles of lipid binding proteins (LBPs) in many biological processes, sequence-based identification of LBPs is of great interest. The major challenge is that LBPs are diverse in sequence, structure, and function, which results in low accuracy of sequence homology based methods. Therefore, there is a need for developing alternative functional prediction methods irrespective of sequence similarity. To distinguish LBPs from non-LBPs, the performances of support vector machine (SVM) and neural network were compared in this study. Comprehensive protein features and various techniques were employed to create datasets. Five-fold cross-validation (CV) and independent evaluation (IE) tests were used to assess the validity of the two methods. The results indicated that SVM outperforms neural network. SVM achieved 89.28% (CV) and 89.55% (IE) overall accuracy in distinguishing LBPs from non-LBPs and 92.06% (CV) and 92.90% (IE) (on average) for classification of different LBP classes. Increasing the number and the range of extracted protein features as well as optimization of the SVM parameters significantly increased the efficiency of LBP class prediction in comparison to the only previous report in this field. Altogether, the results showed that the SVM algorithm can be run on broad, computationally calculated protein features and offers a promising tool in detection of LBP classes. The proposed approach has the potential to integrate and improve the common sequence alignment based methods.
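The two validation schemes mentioned in the abstract, k-fold cross-validation and evaluation on an independent set, can be sketched in plain Python as follows. To keep the example self-contained, a nearest-centroid classifier stands in for the SVM and neural network used in the study; the classifier, the function names and the toy two-feature data are assumptions for illustration only.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_centroids(X, y):
    """Stand-in classifier: store the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    return min(centroids,
               key=lambda label: sum((a - b) ** 2
                                     for a, b in zip(row, centroids[label])))


def accuracy(centroids, X, y):
    hits = sum(predict(centroids, row) == label for row, label in zip(X, y))
    return hits / len(y)


def cross_val_accuracy(X, y, k=5, seed=0):
    """Mean held-out accuracy over k folds."""
    scores = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        scores.append(accuracy(model,
                               [X[i] for i in held_out],
                               [y[i] for i in held_out]))
    return sum(scores) / len(scores)


# Hypothetical toy data: two well-separated classes with two features each.
X = [[0.1 * i, 0.0] for i in range(10)] + [[5.0 + 0.1 * i, 5.0] for i in range(10)]
y = ["LBP"] * 10 + ["non-LBP"] * 10

print("5-fold CV accuracy:", cross_val_accuracy(X, y, k=5))

# Independent evaluation: fit on all training data, score on unseen samples.
X_new, y_new = [[0.3, 0.2], [5.2, 4.9]], ["LBP", "non-LBP"]
print("Independent accuracy:", accuracy(fit_centroids(X, y), X_new, y_new))
```

Cross-validation reuses the training data by rotating the held-out fold, whereas the independent evaluation scores the fitted model on samples that played no part in training.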
Affiliation(s)
- Mohammad Moradi-Shahrbabak
- Department of Animal Science, College of Agriculture and Natural Resources, University of Tehran, Karaj, Iran
- Mansour Ebrahimi
- Department of Biology, School of Basic Sciences, University of Qom, Qom, Iran
- Esmaeil Ebrahimie
- Department of Crop Production & Plant Breeding, College of Agriculture, Shiraz University, Shiraz, Iran; School of Molecular and Biomedical Science, The University of Adelaide, Adelaide, Australia.
|
20
|
Hosseinzadeh F, Kayvanjoo AH, Ebrahimi M, Goliaei B. Prediction of lung tumor types based on protein attributes by machine learning algorithms. SPRINGERPLUS 2013; 2:238. [PMID: 23888262 PMCID: PMC3710575 DOI: 10.1186/2193-1801-2-238] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Received: 01/16/2013] [Accepted: 03/21/2013] [Indexed: 01/15/2023]
Abstract
Early diagnosis of lung cancers and distinction between the tumor types (Small Cell Lung Cancer (SCLC) and Non-Small Cell Lung Cancer (NSCLC)) are very important to increase the survival rate of patients. Herein, we propose a diagnostic system based on sequence-derived structural and physicochemical attributes of proteins that are involved in both types of tumors, via feature extraction, feature selection and prediction models. A total of 1497 protein attributes were computed and important features were selected by 12 attribute weighting models; finally, machine learning models consisting of seven SVM models, three ANN models and two NB models were applied to the original database and to newly created ones from the attribute weighting models. Model accuracies were calculated through 10-fold cross-validation and wrapper validation (just for SVM algorithms). In line with our previous findings, dipeptide composition, autocorrelation and distribution descriptor were the most important protein features selected by bioinformatics tools. The algorithms' performance in lung cancer tumor type prediction increased when they were applied to datasets created by attribute weighting models rather than to the original dataset. Wrapper-Validation performed better than X-Validation; the best cancer type prediction resulted from SVM and SVM Linear models (82%). The best ANN accuracy was obtained when the Neural Net model was applied to the SVM dataset (88%). This is the first report suggesting that the combination of protein features and attribute weighting models with machine learning algorithms can be effectively used to predict the type of lung cancer tumors (SCLC and NSCLC).
Affiliation(s)
- Faezeh Hosseinzadeh
- Laboratory of biophysics and molecular biology, Institute of Biophysics and Biochemistry (IBB), University of Tehran, Tehran, Iran
|