1
Robson B, Cooper R. Glass Box and Black Box Machine Learning Approaches to Exploit Compositional Descriptors of Molecules in Drug Discovery and Aid the Medicinal Chemist. ChemMedChem 2024:e202400169. [PMID: 38837320 DOI: 10.1002/cmdc.202400169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2024] [Revised: 05/29/2024] [Accepted: 06/03/2024] [Indexed: 06/07/2024]
Abstract
The synthetic medicinal chemist plays a vital role in drug discovery. Today there are AI tools to guide next syntheses, but many are "Black Boxes" (BB). One learns little more than the prediction made. There are now also AI methods emphasizing visibility and "explainability" (thus explainable AI or XAI) that could help when "compositional data" are used, but they often still start from seemingly arbitrary learned weights and lack familiar probabilistic measures based on observation and counting from the outset. If probabilistic methods were used in a complementary way with BB methods and demonstrated comparable predictive power, they would provide guidelines about what groups to include and avoid in next syntheses and quantify the relationships in probabilistic terms. These points are demonstrated by blind test comparison of two main types of BB methods and a probabilistic "Glass Box" (GB) method new outside of medicine, but which appears well suited to the above. Because many probabilities can be involved, emphasis is on the predictive power of its simplest explanatory models. There are usually more inactive compounds by orders of magnitude, often a problem for machine learning methods. However, the approaches used here appear to work well for such "real world data".
Affiliation(s)
- Barry Robson
  - Ingine Inc., 2723 Rocklyn Road, Cleveland, OH-44122, USA
  - The Dirac Foundation, c/o The Academy Partnership Ltd., Windrush Park, Witney, OX2929, UK
- Richard Cooper
  - Oxford Drug Design, Oxford Centre for Innovation, New Rd, Oxford, OX1 3TA, UK
  - Department of Chemistry, 12 Mansfield Road, Oxford, OX1 1BY, UK
2
Robson B, Baek OK. Glass box machine learning for retrospective cohort studies using many patient records. The complex example of bleeding peptic ulcer. Comput Biol Med 2024; 173:108085. [PMID: 38513393 DOI: 10.1016/j.compbiomed.2024.108085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Revised: 01/26/2024] [Accepted: 01/27/2024] [Indexed: 03/23/2024]
Abstract
Glass Box Machine Learning is, in this study, a type of partially supervised data mining and prediction technique, like a neural network in which each weight or pattern of mutually relevant weights is now replaced by a meaningful "probabilistic knowledge element." We apply it to retrospective cohort studies using large numbers of structured medical records to help select candidate patients for future cohort studies and similar clinical trials. Here it is applied to aid analysis of approaches to aid Deep Learning, but the method lends itself well to direct computation of odds with "explainability" in study design that can complement "Black Box" Deep Learning. Cohort studies and clinical trials traditionally involved at least one 2 × 2 contingency table, but in the age of emerging personalized medicine and the use of machine learning to discover and incorporate further relevant factors, these tables can extend into many extra dimensions as a 2 × 2 × 2 × 2 × … data structure by considering different conditional demographic and clinical factors of a patient or group, as well as variations in treatment. We consider this in terms of multiple 2 × 2 × 2 data substructures where each one is summarized by an appropriate measure of risk and success called DOR*. This is the diagnostic odds ratio DOR for a specified disease conditional on a favorable outcome divided by the corresponding DOR conditional on an unfavorable outcome. Bleeding peptic ulcer was chosen as a complex disease with many influencing factors, one that is still subject to controversy and that highlights the challenges of using Real World Data.
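The DOR* measure described in this abstract can be illustrated with a minimal sketch. It assumes the standard definition of the diagnostic odds ratio for one 2 × 2 table, DOR = (TP × TN) / (FP × FN), and takes DOR* as the ratio of the DOR in the favorable-outcome stratum to the DOR in the unfavorable-outcome stratum. The function names and all counts are hypothetical and are not taken from the paper, which may also apply corrections (for example for zero counts) that are not shown here.

```python
def diagnostic_odds_ratio(tp, fp, fn, tn):
    """Standard DOR for one 2 x 2 table: (TP * TN) / (FP * FN)."""
    if fp == 0 or fn == 0:
        raise ValueError("DOR is undefined when FP or FN is zero")
    return (tp * tn) / (fp * fn)


def dor_star(favorable, unfavorable):
    """Ratio of the DOR given a favorable outcome to the DOR given an
    unfavorable outcome (one 2 x 2 x 2 substructure)."""
    return diagnostic_odds_ratio(*favorable) / diagnostic_odds_ratio(*unfavorable)


# Hypothetical counts in the order (tp, fp, fn, tn); illustration only.
favorable = (40, 10, 20, 30)     # DOR = (40 * 30) / (10 * 20) = 6.0
unfavorable = (30, 20, 30, 20)   # DOR = (30 * 20) / (20 * 30) = 1.0

result = dor_star(favorable, unfavorable)
print(result)  # 6.0
```

A value above 1 in this sketch would mean that the factor discriminates the disease more strongly among patients with a favorable outcome than among those with an unfavorable one.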
Affiliation(s)
- B Robson
  - Ingine Inc., Cleveland, OH, USA; Dirac Foundation, Oxfordshire, UK; Advisory Board European Society of Translational Medicine, Austria.
- O K Baek
  - Electronics and Telecommunications Research Institute, South Korea
3
Robson B, Baek O. An ontology for very large numbers of longitudinal health records to facilitate data mining and machine learning. Informatics in Medicine Unlocked 2023. [DOI: 10.1016/j.imu.2023.101204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2023] [Open Access]
4
Mahmoodi S, Amirzakaria JZ, Ghasemian A. In silico design and validation of a novel multi-epitope vaccine candidate against structural proteins of Chikungunya virus using comprehensive immunoinformatics analyses. PLoS One 2023; 18:e0285177. [PMID: 37146081 PMCID: PMC10162528 DOI: 10.1371/journal.pone.0285177] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 01/31/2023] [Accepted: 04/16/2023] [Indexed: 05/07/2023] [Open Access]
Abstract
Chikungunya virus (CHIKV) is an emerging viral infectious agent with the potential to cause a pandemic. There is neither a protective vaccine nor an approved drug against the virus. The aim of this study was the design of a novel multi-epitope vaccine (MEV) candidate against the CHIKV structural proteins using comprehensive immunoinformatics and immune simulation analyses. In this study, using comprehensive immunoinformatics approaches, we developed a novel MEV candidate using the CHIKV structural proteins (E1, E2, 6K, and E3). The polyprotein sequence was obtained from the UniProt Knowledgebase and saved in FASTA format. The helper and cytotoxic T lymphocyte (HTL and CTL, respectively) and B cell epitopes were predicted. The toll-like receptor 4 (TLR4) agonist RS09 and the PADRE epitope were employed as promising immunostimulatory adjuvant proteins. All vaccine components were fused using proper linkers. The MEV construct was checked in terms of antigenicity, allergenicity, immunogenicity, and physicochemical features. Docking of the MEV construct with TLR4 and molecular dynamics (MD) simulation were also performed to assess the binding stability. The designed construct was non-allergenic and immunogenic, and efficiently stimulated immune responses using the proper synthetic adjuvant. The MEV candidate exhibited acceptable physicochemical features. Immune provocation included prediction of HTL, B cell, and CTL epitopes. The docking and MD simulation confirmed the stability of the docked TLR4-MEV complex. High-level protein expression in the Escherichia coli (E. coli) host was observed through in silico cloning. In vitro, in vivo, and clinical trial investigations are required to verify the findings of the current study.
Affiliation(s)
- Shirin Mahmoodi
  - Department of Medical Biotechnology, School of Medicine, Fasa University of Medical Sciences, Fasa, Iran
- Javad Zamani Amirzakaria
  - Department of Plant Biotechnology, National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
- Abdolmajid Ghasemian
  - Noncommunicable Diseases Research Center, Fasa University of Medical Sciences, Fasa, Iran
5
Pan C, Poddar A, Mukherjee R, Ray AK. Impact of categorical and numerical features in ensemble machine learning frameworks for heart disease prediction. Biomed Signal Process Control 2022. [DOI: 10.1016/j.bspc.2022.103666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
6
Absar N, Das EK, Shoma SN, Khandaker MU, Miraz MH, Faruque MRI, Tamam N, Sulieman A, Pathan RK. The Efficacy of Machine-Learning-Supported Smart System for Heart Disease Prediction. Healthcare (Basel) 2022; 10:1137. [PMID: 35742188 PMCID: PMC9222326 DOI: 10.3390/healthcare10061137] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/13/2022] [Revised: 06/13/2022] [Accepted: 06/14/2022] [Indexed: 11/26/2022] [Open Access]
Abstract
The disease may be an explicit status that negatively affects human health. Cardiopathy is one of the common deadly diseases that is attributed to unhealthy human habits compared to alternative diseases. With the help of machine learning (ML) algorithms, heart disease can be detected in a short time as well as at a low cost. This study adopted four machine learning models, namely random forest (RF), decision tree (DT), AdaBoost (AB), and K-nearest neighbor (KNN), to detect heart disease. A generalized algorithm was constructed to analyze the strength of the relevant factors that contribute to heart disease prediction. The models were evaluated using the Cleveland, Hungary, Switzerland, and Long Beach (CHSLB) datasets, all of which were collected from Kaggle. On the CHSLB dataset, the RF, DT, AB, and KNN models achieved accuracies of 99.03%, 96.10%, 100%, and 100%, respectively. In the case of a single (Cleveland) dataset, only two models, namely RF and KNN, showed good accuracy, of 93.437% and 97.83%, respectively. Finally, the study used Streamlit, an internet-based cloud hosting platform, to develop a computer-aided smart system for disease prediction. It is expected that the proposed tool together with the ML algorithm will play a key role in diagnosing heart diseases in a very convenient manner. Above all, the study has made a substantial contribution to the computation of strength scores with significant predictors in the prognosis of heart disease.
Affiliation(s)
- Nurul Absar
  - Department of Computer Science and Engineering, BGC Trust University Bangladesh, Chittagong 4381, Bangladesh
- Emon Kumar Das
  - Department of Computer Science and Engineering, BGC Trust University Bangladesh, Chittagong 4381, Bangladesh
- Shamsun Nahar Shoma
  - Department of Computer Science and Engineering, BGC Trust University Bangladesh, Chittagong 4381, Bangladesh
- Mayeen Uddin Khandaker
  - Centre for Applied Physics and Radiation Technologies, School of Engineering and Technology, Sunway University, Petaling Jaya 47500, Selangor, Malaysia
  - Department of General Educational Development, Faculty of Science and Information Technology, Daffodil International University, DIU Rd, Dhaka 1341, Bangladesh
- Mahadi Hasan Miraz
  - Department of Business Analytics, Sunway University, Petaling Jaya 47500, Selangor, Malaysia
- M. R. I. Faruque
  - Space Science Center, Universiti Kebangsaan Malaysia, Bangi 43600, Selangor, Malaysia
- Nissren Tamam
  - Department of Physics, College of Science, Princess Nourah Bint Abdulrahman University, Riyadh 11671, Saudi Arabia
- Abdelmoneim Sulieman
  - Department of Radiology and Medical Imaging, Prince Sattam Bin Abdulaziz University, Alkharj 11942, Saudi Arabia
- Refat Khan Pathan
  - Department of Computing and Information Systems, School of Engineering and Technology, Sunway University, Petaling Jaya 47500, Selangor, Malaysia
7
Robson B, St Clair J. Principles of Quantum Mechanics for Artificial Intelligence in medicine. Discussion with reference to the Quantum Universal Exchange Language (Q-UEL). Comput Biol Med 2022; 143:105323. [PMID: 35240388 DOI: 10.1016/j.compbiomed.2022.105323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2021] [Revised: 01/30/2022] [Accepted: 02/13/2022] [Indexed: 11/22/2022]
Abstract
This paper reviews some basic principles of Quantum Mechanics, Quantum Computing, and Artificial Intelligence in terms of a specific unifying theme. This theme relates to the hyperbolic or split-complex imaginary numbers and their equivalent matrices, rediscovered by Dirac, and the underlying mathematics of the previously described Q-UEL language based on them. Hyperbolic imaginary numbers h have the property hh = +1: contrast the more familiar i such that ii = -1. Examples of analogous matrices include that for the Hadamard gate as used in quantum computing and the Pauli spin matrices, and all Hermitian matrices of interest in quantum computing can readily be derived from these. They also relate to Dirac dualization, spinor projectors of Quantum Field Theory, the non-wave-like part of quantum theory, collapse of the wave function, and a dualized form of classical probability theory that has advantages in automated reasoning for medicine.
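The contrast between hh = +1 and ii = -1 stated in this abstract can be checked with a small sketch. It assumes the usual split-complex product rule, (a + bh)(c + dh) = (ac + bd) + (ad + bc)h, and uses the Pauli X matrix as one example of a real matrix whose square is the identity. The class `SplitComplex` and the helper `matmul2` are hypothetical names introduced for illustration and are not part of Q-UEL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitComplex:
    """A number a + b*h, where the hyperbolic unit h satisfies h*h = +1."""
    re: float
    h: float

    def __mul__(self, other):
        # (a + b h)(c + d h) = (ac + bd) + (ad + bc) h, because h*h = +1
        return SplitComplex(
            self.re * other.re + self.h * other.h,
            self.re * other.h + self.h * other.re,
        )


def matmul2(a, b):
    """Product of two 2 x 2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


H_UNIT = SplitComplex(0.0, 1.0)
h_squared = H_UNIT * H_UNIT   # SplitComplex(re=1.0, h=0.0), i.e. +1
i_squared = 1j * 1j           # (-1+0j), the familiar imaginary unit

# Pauli X: a real matrix that squares to the identity, like h.
PAULI_X = ((0, 1), (1, 0))
x_squared = matmul2(PAULI_X, PAULI_X)

print(h_squared, i_squared, x_squared)
```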
Affiliation(s)
- Barry Robson
  - The Dirac Foundation, Oxfordshire, UK; Ingine Inc, USA.
- Jim St Clair
  - Linux Foundation Public Health, San Francisco, USA
8
Robson B. Towards faster response against emerging epidemics and prediction of variants of concern. Informatics in Medicine Unlocked 2022; 31:100966. [PMID: 35611320 PMCID: PMC9119712 DOI: 10.1016/j.imu.2022.100966] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2022] [Revised: 05/05/2022] [Accepted: 05/11/2022] [Indexed: 01/11/2023] [Open Access]
Abstract
The author, the journal Computers in Biology and Medicine (CBM), and Elsevier Press more generally, played a helpful, very early role in responding to COVID-19. Within a few days of the appearance of the "Wuhan Seafood isolate" genome on GenBank, a bioinformatics study was posted by the present author in ResearchGate in January 2020, "Preliminary Bioinformatics Studies on the Design of Synthetic Vaccines and Preventative Peptidomimetic Antagonists against the Wuhan Seafood Market Coronavirus. Possible Importance of the KRSFIEDLLFNKV Motif" DOI: 10.13140/RG.2.2.18275.09761. On February 2nd, 2020, a more thorough analysis was submitted to CBM, e-published on February 26, and formally published in April 2020, at about the same time as the virus named 2019-nCoV was identified as essentially SARS and renamed SARS-CoV-2. This was followed by four further papers describing in more detail some previously unreported aspects of the early investigation. The speed of research and writing of the papers was made possible by knowledge-gathering tools. Based on this and earlier experiences with fast responses to emerging epidemics such as HIV and Mad Cow Disease, it is possible to envisage the nature of a speedier response to emerging epidemics and new variants of concern in established epidemics.
Affiliation(s)
- B Robson
  - Ingine Inc., Cleveland, Ohio, USA; The Dirac Foundation, Oxfordshire, UK
9
Searching for the principles of a less artificial A.I. Informatics in Medicine Unlocked 2022. [DOI: 10.1016/j.imu.2022.101018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] [Open Access]
10
Robson B, Boray S, Weisman J. Mining real-world high dimensional structured data in medicine and its use in decision support. Some different perspectives on unknowns, interdependency, and distinguishability. Comput Biol Med 2021; 141:105118. [PMID: 34971979 DOI: 10.1016/j.compbiomed.2021.105118] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/24/2021] [Revised: 11/18/2021] [Accepted: 12/02/2021] [Indexed: 11/03/2022]
Abstract
There are many difficulties in extracting and using knowledge for medical analytic and predictive purposes from Real-World Data, even when the data is already well structured in the manner of a large spreadsheet. Preparative curation and standardization or "normalization" of such data involves a variety of chores but underlying them is an interrelated set of fundamental problems that can in part be dealt with automatically during the datamining and inference processes. These fundamental problems are reviewed here and illustrated and investigated with examples. They concern the treatment of unknowns, the need to avoid independency assumptions, and the appearance of entries that may not be fully distinguished from each other. Unknowns include errors detected as implausible (e.g., out of range) values that are subsequently converted to unknowns. These problems are further impacted by high dimensionality and problems of sparse data that inevitably arise from high-dimensional datamining even if the data is extensive. All these considerations are different aspects of incomplete information, though they also relate to problems that arise if care is not taken to avoid or ameliorate consequences of including the same information twice or more, or if misleading or inconsistent information is combined. This paper addresses these aspects from a slightly different perspective using the Q-UEL language and inference methods based on it by borrowing some ideas from the mathematics of quantum mechanics and information theory. It takes the view that detection and correction of probabilistic elements of knowledge subsequently used in inference need only involve testing and correction so that they satisfy certain extended notions of coherence between probabilities. This is by no means the only possible view, and it is explored here and later compared with a related notion of consistency.
Affiliation(s)
- Barry Robson
  - Ingine Inc, Ohio, USA; The Dirac Foundation, Oxfordshire, UK.
- J Weisman
  - The Dirac Foundation, Oxfordshire, UK.
11
Robson B. Testing machine learning techniques for general application by using protein secondary structure prediction. A brief survey with studies of pitfalls and benefits using a simple progressive learning approach. Comput Biol Med 2021; 138:104883. [PMID: 34598067 DOI: 10.1016/j.compbiomed.2021.104883] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/02/2021] [Revised: 09/05/2021] [Accepted: 09/17/2021] [Indexed: 01/05/2023]
Abstract
Many researchers have recently used the prediction of protein secondary structure (local conformational states of amino acid residues) to test advances in predictive and machine learning technology such as Neural Net Deep Learning. Protein secondary structure prediction continues to be a helpful tool in research in biomedicine and the life sciences, but it is also extremely enticing for testing predictive methods such as neural nets that are intended for different or more general purposes. A complication is highlighted here for researchers testing their methods for other applications. Modern protein databases inevitably contain important clues to the answer, so-called "strong buried clues", though often obscurely; they are hard to avoid. This is because most proteins or parts of proteins in a modern protein data base are related to others by biological evolution. For researchers developing machine learning and predictive methods, this can overstate and so confuse understanding of the true quality of a predictive method. However, for researchers using the algorithms as tools, understanding strong buried clues is of great value, because they need to make maximum use of all information available. A simple method related to the GOR methods but with some features of neural nets in the sense of progressive learning of large numbers of weights, is used to explore this. It can acquire tens of millions and hence gigabytes of weights, but they are learned stably by exhaustive sampling. The significance of the findings is discussed in the light of promising recent results from AlphaFold using Google's DeepMind.
Affiliation(s)
- Barry Robson
  - Ingine Inc., Ohio, USA, and the Dirac Foundation, Oxfordshire, UK.
12
Robson B. Bioinformatics studies on a function of the SARS-CoV-2 spike glycoprotein as the binding of host sialic acid glycans. Comput Biol Med 2020; 122:103849. [PMID: 32658736 PMCID: PMC7278709 DOI: 10.1016/j.compbiomed.2020.103849] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 05/07/2020] [Revised: 06/04/2020] [Accepted: 06/04/2020] [Indexed: 02/08/2023]
Abstract
SARS-CoV and SARS-CoV-2 do not appear to have functions of a hemagglutinin and neuraminidase. This is a mystery, because sugar binding activities appear essential to many other viruses including influenza and even most other coronaviruses in order to bind to and escape from the glycans (sugars, oligosaccharides or polysaccharides) characteristic of cell surfaces and saliva and mucin. The S1 N terminal Domains (S1-NTD) of the spike protein, largely responsible for the bulk of the characteristic knobs at the end of the spikes of SARS-CoV and SARS-CoV-2, are here predicted to be “hiding” sites for recognizing and binding glycans containing sialic acid. This may be important for infection and the ability of the virus to locate ACE2 as its known main host cell surface receptor, and if so it becomes a pharmaceutical target. It might even open up the possibility of an alternative receptor to ACE2. The prediction method developed, which uses amino acid residue sequence alone to predict domains or proteins that bind to sialic acids, is naïve, and will be advanced in future work. Nonetheless, it was surprising that such a very simple approach was so useful, and it can easily be reproduced in a very few lines of computer program to help make quick comparisons between SARS-CoV-2 sequences and to consider the effects of viral mutations. This paper extends the studies of the author's previous SARS-CoV-2 papers. Designing vaccine and drugs must seek to avoid escape mutations. Strangely, SARS-CoV and SARS-CoV-2 appear to lack sialic acid binding functions. Sequence motifs are found, but they require a simple prediction method.
Affiliation(s)
- B Robson
  - Ingine Inc., Cleveland, Ohio, USA, and the Dirac Foundation, Oxfordshire, UK.
13
Robson B. COVID-19 Coronavirus spike protein analysis for synthetic vaccines, a peptidomimetic antagonist, and therapeutic drugs, and analysis of a proposed achilles' heel conserved region to minimize probability of escape mutations and drug resistance. Comput Biol Med 2020; 121:103749. [PMID: 32568687 PMCID: PMC7151553 DOI: 10.1016/j.compbiomed.2020.103749] [Citation(s) in RCA: 91] [Impact Index Per Article: 22.8] [Received: 03/15/2020] [Revised: 04/03/2020] [Accepted: 04/03/2020] [Indexed: 12/17/2022]
Abstract
This paper continues a recent study of the spike protein sequence of the COVID-19 virus (SARS-CoV-2). It is also in part an introductory review to relevant computational techniques for tackling viral threats, using COVID-19 as an example. Q-UEL tools for facilitating access to knowledge and bioinformatics tools were again used for efficiency, but the focus in this paper is even more on the virus. Subsequence KRSFIEDLLFNKV of the S2′ spike glycoprotein proteolytic cleavage site continues to appear important. Here it is shown to be recognizable in the common cold coronaviruses, avian coronaviruses and possibly as traces in the nidoviruses of reptiles and fish. Its function or functions thus seem important to the coronaviruses. It might represent the SARS-CoV-2 Achilles’ heel, less likely to acquire resistance by mutation, as has happened in some early SARS vaccine studies discussed in the previous paper. Preliminary conformational analysis of the receptor (ACE2) binding site of the spike protein is carried out, suggesting that while it is somewhat conserved, it appears to be more variable than KRSFIEDLLFNKV. However, compounds like emodin that inhibit SARS entry, apparently by binding ACE2, might also have functions at several different human protein binding sites. The enzyme 11β-hydroxysteroid dehydrogenase type 1 is again argued to be a convenient model pharmacophore perhaps representing an ensemble of targets, and it is noted that it occurs both in lung and alimentary tract. Perhaps it benefits the virus to block an inflammatory response by inhibiting the dehydrogenase, but a fairly complex web involves several possible targets. This paper “drills down” into the studies of the author's previous COVID-19 paper. Designing vaccine and drugs must seek to avoid escape mutations. Subsequence KRSFIEDLLFNKV seems recognizable across many coronaviruses. The ACE2 binding domain is a target, but shows variation. A steroid dehydrogenase is argued to remain an interesting model pharmacophore.
Affiliation(s)
- B Robson
  - Ingine Inc., Cleveland, Ohio, USA; The Dirac Foundation, Oxfordshire, UK.
14
Robson B. Computers and viral diseases. Preliminary bioinformatics studies on the design of a synthetic vaccine and a preventative peptidomimetic antagonist against the SARS-CoV-2 (2019-nCoV, COVID-19) coronavirus. Comput Biol Med 2020; 119:103670. [PMID: 32209231 PMCID: PMC7094376 DOI: 10.1016/j.compbiomed.2020.103670] [Citation(s) in RCA: 126] [Impact Index Per Article: 31.5] [Received: 02/02/2020] [Revised: 02/17/2020] [Accepted: 02/17/2020] [Indexed: 12/19/2022]
Abstract
This paper concerns study of the genome of the Wuhan Seafood Market isolate believed to represent the causative agent of the disease COVID-19. This is to find a short section or sections of viral protein sequence suitable for preliminary design proposal for a peptide synthetic vaccine and a peptidomimetic therapeutic, and to explore some design possibilities. The project was originally directed towards a use case for the Q-UEL language and its implementation in a knowledge management and automated inference system for medicine called the BioIngine, but focus here remains mostly on the virus itself. However, using Q-UEL systems to access relevant and emerging literature, and to interact with standard publicly available bioinformatics tools on the Internet, did help quickly identify sequences of amino acids that are well conserved across many coronaviruses including 2019-nCoV. KRSFIEDLLFNKV was found to be particularly well conserved in this study and corresponds to the region around one of the known cleavage sites of the SARS virus that are believed to be required for virus activation for cell entry. This sequence motif and surrounding variations formed the basis for proposing a specific synthetic vaccine epitope and peptidomimetic agent. The work can, nonetheless, be described in traditional bioinformatics terms, and readily reproduced by others, albeit with the caveat that new data and research into 2019-nCoV is emerging and evolving at an explosive pace. Preliminary studies using molecular modeling and docking, and in that context the potential value of certain known herbal extracts, are also described. Bioinformatics studies are carried out on the COVID-19 virus. A sequence motif KRSFIEDLLFNKV is of particular interest. Based on the above, synthetic peptides are designed. Preliminary considerations are also given to non-peptide organic molecules.
Affiliation(s)
- B Robson
  - Ingine Inc., Cleveland, Ohio, USA; The Dirac Foundation, Oxfordshire, UK.
15
Robson B. Extension of the Quantum Universal Exchange Language to precision medicine and drug lead discovery. Preliminary example studies using the mitochondrial genome. Comput Biol Med 2020; 117:103621. [PMID: 32072972 DOI: 10.1016/j.compbiomed.2020.103621] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/10/2019] [Revised: 01/12/2020] [Accepted: 01/12/2020] [Indexed: 12/21/2022]
Abstract
The Quantum Universal Exchange Language (Q-UEL) based on Dirac notation and algebra from quantum mechanics, along with its associated data mining and Hyperbolic Dirac Net (HDN) for probabilistic inference, has proven to be a useful architectural principle for knowledge management, analysis and prediction systems in medicine. It has been described in several papers; here is described its extension to clinical genomics and precision medicine. Two use cases are studied: (a) bioinformatics in clinical decision support especially for risk for type 2 diabetes using mitochondrial patient DNA sequences, and (b) bioinformatics and computational biology (conformational) research examples related to drug discovery involving the recently discovered class of mitochondrial derived peptides (MDPs). MDPs were surprising when first discovered as coded in small open reading frames (sORFs), and are emerging as having a fundamental role in metabolic control, longevity and disease. This project originally represented a language specification study relating to what information related to genomics is essential or useful to carry, and what processing will be needed. However, novel aspects introduced or discovered include the HDN-like neural nets and their use, along with more established methods, for prediction of type 2 diabetes, and in particular for proposals for over 80 natural MDPs, most of which had not previously been described at the time of the study, as potential drug lead targets. Also, use of many medical records with simulated joining of mtDNA as performance tests led to some insightful observations regarding the behavior of HDN predictions where independent factors are involved.
Affiliation(s)
- Barry Robson
  - Ingine Inc., Delaware, USA; The Dirac Foundation, Oxfordshire, UK.
16
Robson B, Boray S. Studies in the use of data mining, prediction algorithms, and a universal exchange and inference language in the analysis of socioeconomic health data. Comput Biol Med 2019; 112:103369. [PMID: 31377681 DOI: 10.1016/j.compbiomed.2019.103369] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 05/10/2019] [Revised: 07/22/2019] [Accepted: 07/23/2019] [Indexed: 12/18/2022]
Abstract
While clinical and biomedical information in digital form has been escalating, it is socioeconomic factors that are important determinants of health on the national and global scale. We show how collective use of data mining and prediction algorithms to analyze socioeconomic population health data can stand beside classical correlation analysis in routine data analysis. The underlying theoretical basis is the Dirac notation and algebra that is a scientific standard but unusual outside of the physical sciences, combined with a theory of expected information first developed for analyzing sparse data but still largely confined to bioinformatics. The latter was important here because the records analyzed (which are for US counties and equivalents, not patients) are very few by contemporary data mining standards. The approach is very unlikely to be familiar to socioeconomic researchers, so the theory and the advantages of our inference nets over the Bayes Net are reviewed here, mostly using socioeconomic examples. While our expertise and focus is in regard to novel analytical methods rather than socioeconomics per se, a significant negative (countertrending) relationship between population health and equity was initially surprising, at least to the present authors. This encouraged deeper exploration including that of the relationship between our data mining methods and traditional Pearson's correlation. The latter is susceptible to giving wrong conclusions if a phenomenon called Simpson's paradox applies, so this is also investigated. Also discussed is that, even for very few records, associative data mining can still demand significant computational resources due to a combinatorial explosion.
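The susceptibility of Pearson's correlation to Simpson's paradox, mentioned in this abstract, can be illustrated with a small sketch. The data points are invented for illustration and have no connection to the county records analyzed in the paper. Within each of two groups the correlation is negative, yet pooling the groups yields a positive correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: two groups, each with a perfectly negative trend.
group_a_x, group_a_y = [1, 2, 3], [3, 2, 1]
group_b_x, group_b_y = [6, 7, 8], [8, 7, 6]

r_group_a = pearson(group_a_x, group_a_y)   # negative within group A
r_group_b = pearson(group_b_x, group_b_y)   # negative within group B

# Pooling the groups reverses the sign of the apparent relationship.
r_pooled = pearson(group_a_x + group_b_x, group_a_y + group_b_y)

print(r_group_a, r_group_b, r_pooled)
```

The sign reversal arises because group membership is associated with both variables, so a pooled coefficient on its own can suggest the opposite of the trend seen within every group.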
Affiliation(s)
- Barry Robson
- Ingine Inc., Virginia, USA and the Dirac Foundation, Oxfordshire, UK
- S Boray
- Ingine Inc., Virginia, USA and the Dirac Foundation, Oxfordshire, UK
17
Robson B. Bidirectional General Graphs for inference. Principles and implications for medicine. Comput Biol Med 2019; 108:382-399. [PMID: 31075569 DOI: 10.1016/j.compbiomed.2019.04.005] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 02/18/2019] [Revised: 04/03/2019] [Accepted: 04/04/2019] [Indexed: 12/17/2022]
Abstract
Probabilistic inference methods require a more general and realistic description of the world as a Bidirectional General Graph (BGG). While in its original form the Bayes Net (BN) has been promoted as a predictive tool, it is more immediately a way of testing a hypothesis or model about interactions in a system usually considered on a causal basis. Once established, the model can be used in a predictive way, but the problem here is that for a traditional BN the hypotheses or models that can be formed are limited to the Directed Acyclic Graph (DAG) by definition. Three interrelated features are highlighted that represent deficiencies of the DAG which are corrected by conversion to a method based on a BGG: (i) lack of intrinsic representation of coherence by Bayes' rule, (ii) relatedly, the need to consider interdependence in parent nodes, and (iii) the need for management of a property called recurrence. These deficiencies can represent large errors in absolute estimates of probabilities, and while relative and renormalized probabilities ameliorate that, they can often make much of a net superfluous through cancelations by division. The Hyperbolic Dirac Net (HDN) based on Dirac's quantum mechanics is a solution that led naturally to avoiding these deficiencies. It encodes bidirectional probabilities in an h-complex value rediscovered by Dirac, i.e. with the imaginary number h such that hh = +1. Properties of the HDN described previously are reviewed (though emphasis is on descriptions in familiar probability terms), the issue of recurrence is introduced, methods of construction are simplified, and the severity of the quantitative differences between BNs and analogous HDNs is exemplified. There is also discussion of how results compare with other approaches in practice.
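The arithmetic of an h-complex (split-complex) value with hh = +1 can be sketched as follows. The class `HComplex` and the helpers `encode` and `decode` are hypothetical names for this illustration. The particular encoding, which places a forward and a backward probability on the idempotents (1 - h)/2 and (1 + h)/2, is an assumption of this sketch and is not confirmed by the abstract as the paper's exact construction.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HComplex:
    """Split-complex number re + hpart*h, where h*h = +1."""
    re: float
    hpart: float

    def __add__(self, other):
        return HComplex(self.re + other.re, self.hpart + other.hpart)

    def __mul__(self, other):
        # (a + b h)(c + d h) = (ac + bd) + (ad + bc) h, because h*h = +1
        return HComplex(self.re * other.re + self.hpart * other.hpart,
                        self.re * other.hpart + self.hpart * other.re)

H = HComplex(0.0, 1.0)  # the unit h itself

def encode(p_forward, p_backward):
    """Assumed encoding: p_forward*(1 - h)/2 + p_backward*(1 + h)/2."""
    return HComplex((p_forward + p_backward) / 2, (p_backward - p_forward) / 2)

def decode(z):
    """Recover the (forward, backward) pair from an encoded value."""
    return (z.re - z.hpart, z.re + z.hpart)

# Multiplying encoded values multiplies each direction independently,
# since the two idempotents annihilate each other.
product = encode(0.5, 0.25) * encode(0.5, 0.5)
print(H * H)            # h squared is +1, not -1
print(decode(product))  # forward and backward products, kept separate
```

Under this assumed encoding a single multiplication carries both conditional directions at once, which is one way to read the abstract's statement that bidirectional probabilities are held in one h-complex value.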
Affiliation(s)
- Barry Robson
- Ingine Inc., Virginia, USA; The Dirac Foundation, Oxfordshire, UK
18
A rule-based semantic approach for data integration, standardization and dimensionality reduction utilizing the UMLS: Application to predicting bariatric surgery outcomes. Comput Biol Med 2019; 106:84-90. [DOI: 10.1016/j.compbiomed.2019.01.019] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 08/26/2018] [Revised: 01/21/2019] [Accepted: 01/21/2019] [Indexed: 11/24/2022]
19
Studies in the extensively automatic construction of large odds-based inference networks from structured data. Examples from medical, bioinformatics, and health insurance claims data. Comput Biol Med 2018; 95:147-166. [PMID: 29500985 DOI: 10.1016/j.compbiomed.2018.02.013] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 01/01/2018] [Revised: 02/19/2018] [Accepted: 02/19/2018] [Indexed: 12/11/2022]
Abstract
Theoretical and methodological principles are presented for the construction of very large inference nets for odds calculations, composed of hundreds or many thousands or more of elements, in this paper generated by structured data mining. It is argued that the usual small inference nets can sometimes represent rather simple, arbitrary estimates. Examples of applications in clinical and public health data analysis, medical claims data and detection of irregular entries, and bioinformatics data, are presented. Construction of large nets benefits from application of a theory of expected information for sparse data and the Dirac notation and algebra. The extent to which these are important here is briefly discussed. Purposes of the study include (a) exploration of the properties of large inference nets and of perturbation and tacit conditionality models, (b) using these to propose simpler models including one that a physician could use routinely, analogous to a "risk score", (c) examination of the merit of describing optimal performance in a single measure that combines accuracy, specificity, and sensitivity in place of a ROC curve, and (d) relationship to methods for detecting anomalous and potentially fraudulent data.
20
Neural Network-Based Coronary Heart Disease Risk Prediction Using Feature Correlation Analysis. Journal of Healthcare Engineering 2017; 2017:2780501. [PMID: 29065583 PMCID: PMC5606055 DOI: 10.1155/2017/2780501] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 04/10/2017] [Revised: 07/05/2017] [Accepted: 07/12/2017] [Indexed: 12/03/2022]
Abstract
Background: Of the machine learning techniques used in predicting coronary heart disease (CHD), the neural network (NN) is popularly used to improve performance accuracy. Objective: Even though NN-based systems provide meaningful results based on clinical experiments, medical experts are not satisfied with their predictive performance because the NN is trained in a “black-box” style. Method: We sought to devise an NN-based prediction of CHD risk using feature correlation analysis (NN-FCA) in two stages. First, in the feature selection stage, features are ranked according to their importance in predicting CHD risk; second, in the feature correlation analysis stage, correlations between feature relations and the data of each NN predictor output are identified. Result: Of the 4146 individuals in the Korean dataset evaluated, 3031 had low CHD risk and 1115 had high CHD risk. The area under the receiver operating characteristic (ROC) curve of the proposed model (0.749 ± 0.010) was larger than that of the Framingham risk score (FRS) (0.393 ± 0.010). Conclusions: The proposed NN-FCA, which utilizes feature correlation analysis, was found to be better than the FRS in terms of CHD risk prediction. Furthermore, the proposed model resulted in a larger area under the ROC curve and more accurate predictions of CHD risk in the Korean population than the FRS.
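For readers less familiar with the area under the ROC curve reported above, a minimal sketch of one standard way to compute it is given here: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The function name `roc_auc` and the scores are hypothetical and are not data from the study.

```python
def roc_auc(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case is
    scored above a randomly chosen negative case (ties count 0.5).
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical risk scores for high-risk (positive) and low-risk (negative) cases.
high_risk_scores = [0.9, 0.8, 0.4]
low_risk_scores = [0.7, 0.3]

auc = roc_auc(high_risk_scores, low_risk_scores)
print(f"AUC = {auc:.3f}")
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation. The quadratic pairwise loop is chosen here for clarity; rank-based formulas give the same result more efficiently on large datasets.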
21
Arabasadi Z, Alizadehsani R, Roshanzamir M, Moosaei H, Yarifard AA. Computer aided decision making for heart disease detection using hybrid neural network-Genetic algorithm. Computer Methods and Programs in Biomedicine 2017; 141:19-26. [PMID: 28241964 DOI: 10.1016/j.cmpb.2017.01.004] [Citation(s) in RCA: 139] [Impact Index Per Article: 19.9] [Received: 09/11/2016] [Revised: 12/18/2016] [Accepted: 01/12/2017] [Indexed: 05/28/2023]
Abstract
Cardiovascular disease is one of the most common causes of death around the world and is regarded as a major illness of middle and old age. Coronary artery disease, in particular, is a widespread cardiovascular malady entailing high mortality rates. Angiography is usually regarded as the best method for the diagnosis of coronary artery disease; on the other hand, it is associated with high costs and major side effects. Much research has therefore been conducted using machine learning and data mining to seek alternative modalities. Accordingly, we propose a highly accurate hybrid method for the diagnosis of coronary artery disease. The proposed method increases the performance of a neural network by approximately 10% by using a genetic algorithm to suggest better initial weights. With this methodology, we achieved accuracy, sensitivity and specificity rates of 93.85%, 97% and 92%, respectively, on the Z-Alizadeh Sani dataset.
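The three rates quoted in this abstract follow from the counts in a binary confusion matrix. The sketch below shows the standard definitions; the function name `classification_rates` and the counts are hypothetical and do not reproduce the paper's data.

```python
def classification_rates(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: diseased cases classified correctly/incorrectly
    tn/fp: healthy cases classified correctly/incorrectly
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,    # all correct calls
        "sensitivity": tp / (tp + fn),    # true positive rate
        "specificity": tn / (tn + fp),    # true negative rate
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
rates = classification_rates(tp=90, fn=10, tn=80, fp=20)
for name, value in rates.items():
    print(f"{name}: {value:.2%}")
```

Because accuracy pools both classes, it can differ noticeably from sensitivity and specificity when the diseased and healthy groups are of unequal size, which is why the three figures are usually reported together.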
Affiliation(s)
- Zeinab Arabasadi
- Department of Computer Engineering, University of Bojnord, Bojnord, Iran
- Roohallah Alizadehsani
- Department of Computer Engineering, Sharif University of Technology, Azadi Ave, Tehran, Iran
- Mohamad Roshanzamir
- Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, Iran
- Hossein Moosaei
- Department of Mathematics, Faculty of Science, University of Bojnord, Iran
22
Robson B. Studies in using a universal exchange and inference language for evidence based medicine. Semi-automated learning and reasoning for PICO methodology, systematic review, and environmental epidemiology. Comput Biol Med 2016; 79:299-323. [PMID: 27846446 DOI: 10.1016/j.compbiomed.2016.10.009] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 07/20/2016] [Revised: 09/28/2016] [Accepted: 10/11/2016] [Indexed: 11/24/2022]
Abstract
The Q-UEL language of XML-like tags and the associated software applications are providing a valuable toolkit for Evidence Based Medicine (EBM). In this paper the already existing applications, data bases, and tags are brought together with new ones. The particular Q-UEL embodiment used here is the BioIngine. The main challenge is one of bringing together the methods of symbolic reasoning and calculative probabilistic inference that underlie EBM and medical decision making. Some space is taken to review this background. The unification is greatly facilitated by Q-UEL's roots in the notation and algebra of Dirac, and by extending Q-UEL into the Wolfram programming environment. Further, the overall problem of integration is also a relatively simple one because of the nature of Q-UEL as a language for interoperability in healthcare and biomedicine, while the notion of workflow is facilitated because of the EBM best practice known as PICO. What remains difficult is achieving a high degree of overall automation because of a well-known difficulty in capturing human expertise in computers: the Feigenbaum bottleneck.
Affiliation(s)
- Barry Robson
- Ingine Inc. Delaware, USA, and The Dirac Foundation Clg, Oxfordshire, UK; St. Matthew's University School of Medicine, Cayman Islands, UK.
23
Ciaccio EJ. Honored papers 2015. Comput Biol Med 2016. [DOI: 10.1016/j.compbiomed.2016.05.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
24
Robson B, Boray S. Studies of the role of a smart web for precision medicine supported by biobanking. Per Med 2016; 13:361-380. [DOI: 10.2217/pme-2015-0012] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 11/21/2022]
Abstract
Both the extraction of medical knowledge from data mining many patient records and from authoritative natural language text on the Internet are important for clinical decision support and biomedical research. The samples in biobanks represent a further kind of information repository of recognized increasing importance, so mechanisms being developed for a smart web for medicine should take them into account. While this paper is primarily a review of Quantum Universal Exchange Language as an XML extension to enable a future smart web for healthcare and biomedicine, it is the first time that we have discussed the connection with biobanks and the design of Quantum Universal Exchange Language's XML-like tags to support their use.
Affiliation(s)
- Barry Robson
- Ingine Inc. 46581 Riverwood Terrace, Potomac Falls, VA 20165 AND DE, USA
- The Dirac Foundation clg, Oxfordshire, UK
- St Matthew's University, Grand Cayman, USA
- The University of Wisconsin Stout, USA
- Srinidhi Boray
- Ingine Inc. 46581 Riverwood Terrace, Potomac Falls, VA 20165 AND DE, USA
25
Robson B, Boray S. Data-mining to build a knowledge representation store for clinical decision support. Studies on curation and validation based on machine performance in multiple choice medical licensing examinations. Comput Biol Med 2016; 73:71-93. [PMID: 27089305 PMCID: PMC7094475 DOI: 10.1016/j.compbiomed.2016.02.010] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 11/18/2015] [Revised: 02/05/2016] [Accepted: 02/17/2016] [Indexed: 11/23/2022]
Abstract
Extracting medical knowledge by structured data mining of many medical records and from unstructured data mining of natural language source text on the Internet will become increasingly important for clinical decision support. Output from these sources can be transformed into large numbers of elements of knowledge in a Knowledge Representation Store (KRS), here using the notation and to some extent the algebraic principles of the Q-UEL Web-based universal exchange and inference language described previously, rooted in Dirac notation from quantum mechanics and linguistic theory. In a KRS, semantic structures or statements about the world of interest to medicine are analogous to natural language sentences seen as formed from noun phrases separated by verbs, prepositions and other descriptions of relationships. A convenient method of testing and better curating these elements of knowledge is by having the computer use them to take the test of a multiple choice medical licensing examination. It is a venture which perhaps tells us almost as much about the reasoning of students and examiners as it does about the requirements for Artificial Intelligence as employed in clinical decision making. It emphasizes the role of context and of contextual probabilities as opposed to the more familiar intrinsic probabilities, and of a preliminary form of logic that we call presyllogistic reasoning.
Affiliation(s)
- Barry Robson
- Ingine Inc., DE, USA; The Dirac Foundation clg, Oxfordshire, UK; St. Matthew's University School of Medicine, Cayman Islands.
- Srinidhi Boray
- Ingine Inc., DE, USA; The Dirac Foundation clg, Oxfordshire, UK