1
Robson B, Cooper R. Glass Box and Black Box Machine Learning Approaches to Exploit Compositional Descriptors of Molecules in Drug Discovery and Aid the Medicinal Chemist. ChemMedChem 2024:e202400169. [PMID: 38837320 DOI: 10.1002/cmdc.202400169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2024] [Revised: 05/29/2024] [Accepted: 06/03/2024] [Indexed: 06/07/2024]
Abstract
The synthetic medicinal chemist plays a vital role in drug discovery. Today there are AI tools to guide next syntheses, but many are "Black Boxes" (BB). One learns little more than the prediction made. There are now also AI methods emphasizing visibility and "explainability" (thus explainable AI or XAI) that could help when "compositional data" are used, but they often still start from seemingly arbitrary learned weights and lack familiar probabilistic measures based on observation and counting from the outset. If probabilistic methods were used in a complementary way with BB methods and demonstrated comparable predictive power, they would provide guidelines about what groups to include and avoid in next syntheses and quantify the relationships in probabilistic terms. These points are demonstrated by blind test comparison of two main types of BB methods and a probabilistic "Glass Box" (GB) method new outside of medicine, but which appears well suited to the above. Because many probabilities can be involved, emphasis is on the predictive power of its simplest explanatory models. There are usually more inactive compounds by orders of magnitude, often a problem for machine learning methods. However, the approaches used here appear to work well for such "real world data".
Affiliation(s)
- Barry Robson
- Ingine Inc., 2723 Rocklyn Road, Cleveland, OH-44122, USA
- The Dirac Foundation, c/o The Academy Partnership Ltd., Windrush Park, Witney, OX2929, UK
- Richard Cooper
- Oxford Drug Design, Oxford Centre for Innovation, New Rd, Oxford, OX1 3TA, UK
- Department of Chemistry, 12 Mansfield Road, Oxford, OX1 1BY, UK
2
Robson B, Baek OK. Glass box machine learning for retrospective cohort studies using many patient records. The complex example of bleeding peptic ulcer. Comput Biol Med 2024; 173:108085. [PMID: 38513393 DOI: 10.1016/j.compbiomed.2024.108085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Revised: 01/26/2024] [Accepted: 01/27/2024] [Indexed: 03/23/2024]
Abstract
Glass Box Machine Learning is, in this study, a type of partially supervised data mining and prediction technique, like a neural network in which each weight or pattern of mutually relevant weights is now replaced by a meaningful "probabilistic knowledge element." We apply it to retrospective cohort studies using large numbers of structured medical records to help select candidate patients for future cohort studies and similar clinical trials. Here it is applied to aid analysis of approaches to aid Deep Learning, but the method lends itself well to direct computation of odds with "explainability" in study design that can complement "Black Box" Deep Learning. Cohort studies and clinical trials traditionally involved at least one 2 × 2 contingency table, but in the age of emerging personalized medicine and the use of machine learning to discover and incorporate further relevant factors, these tables can extend into many extra dimensions as a 2 × 2 × 2 × 2 × … data structure by considering different conditional demographic and clinical factors of a patient or group, as well as variations in treatment. We consider this in terms of multiple 2 × 2 × 2 data substructures where each one is summarized by an appropriate measure of risk and success called DOR*. This is the diagnostic odds ratio DOR for a specified disease conditional on a favorable outcome divided by the corresponding DOR conditional on an unfavorable outcome. Bleeding peptic ulcer was chosen as a complex disease with many influencing factors, one that is still subject to controversy and that highlights the challenges of using Real World Data.
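As a rough illustration of the DOR* measure described in this abstract, the Python sketch below divides the odds ratio of one 2 × 2 table (favorable outcome) by the odds ratio of a second 2 × 2 table (unfavorable outcome). The cell layout, the function names `odds_ratio` and `dor_star`, and the counts are assumptions made for illustration only; none of them is taken from the paper.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c) of a 2 x 2 table laid out as [[a, b], [c, d]].

    Assumed layout (hypothetical): rows = factor present / absent,
    columns = disease present / absent.
    """
    if b == 0 or c == 0:
        # A zero cell makes the ratio undefined; a real analysis would
        # need some correction here, which this sketch does not attempt.
        raise ValueError("zero cell in the denominator of the odds ratio")
    return (a * d) / (b * c)


def dor_star(favorable, unfavorable):
    """DOR conditional on a favorable outcome divided by the DOR
    conditional on an unfavorable outcome."""
    return odds_ratio(*favorable) / odds_ratio(*unfavorable)


# Made-up counts for the two outcome strata of one 2 x 2 x 2 substructure.
favorable = (30, 10, 20, 40)
unfavorable = (15, 15, 25, 25)

result = dor_star(favorable, unfavorable)
print(result)
```

With these made-up counts the favorable stratum has an odds ratio of 6 and the unfavorable stratum an odds ratio of 1, so the ratio of the two is 6.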
Affiliation(s)
- B Robson
- Ingine Inc., Cleveland, OH, USA; Dirac Foundation, Oxfordshire, UK; Advisory Board European Society of Translational Medicine, Austria.
- O K Baek
- Electronics and Telecommunications Research Institute, South Korea
3
Robson B, Baek O. An ontology for very large numbers of longitudinal health records to facilitate data mining and machine learning. Informatics in Medicine Unlocked 2023. [DOI: 10.1016/j.imu.2023.101204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2023] Open
4
Robson B, St Clair J. Principles of Quantum Mechanics for Artificial Intelligence in medicine. Discussion with reference to the Quantum Universal Exchange Language (Q-UEL). Comput Biol Med 2022; 143:105323. [PMID: 35240388 DOI: 10.1016/j.compbiomed.2022.105323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2021] [Revised: 01/30/2022] [Accepted: 02/13/2022] [Indexed: 11/22/2022]
Abstract
This paper reviews some basic principles of Quantum Mechanics, Quantum Computing, and Artificial Intelligence in terms of a specific unifying theme. This theme relates to the hyperbolic or split-complex imaginary numbers and their equivalent matrices, rediscovered by Dirac, and the underlying mathematics of the previously described Q-UEL language based on them. Hyperbolic imaginary numbers h have the property hh = +1: contrast the more familiar i such that ii = -1. Examples of analogous matrices include that for the Hadamard gate as used in quantum computing and the Pauli spin matrices, and all Hermitian matrices of interest in quantum computing can readily be derived from these. They also relate to Dirac dualization, spinor projectors of Quantum Field Theory, the non-wave-like part of quantum theory, collapse of the wave function, and a dualized form of classical probability theory that has advantages in automated reasoning for medicine.
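The contrast between hh = +1 and ii = -1 can be checked with small matrices. The sketch below is only an illustration: it uses the standard 2 × 2 exchange matrix (the same matrix as the Pauli σx) as a stand-in for h and a 90-degree rotation matrix as a stand-in for i. The helper `matmul2` and the constant names are hypothetical, and nothing here is drawn from Q-UEL itself.

```python
def matmul2(a, b):
    """Multiply two 2 x 2 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


IDENTITY = [[1, 0], [0, 1]]
H_UNIT = [[0, 1], [1, 0]]    # matrix stand-in for h (same as Pauli sigma_x)
I_UNIT = [[0, -1], [1, 0]]   # matrix stand-in for the ordinary imaginary i

h_squared = matmul2(H_UNIT, H_UNIT)   # equals +identity, mirroring hh = +1
i_squared = matmul2(I_UNIT, I_UNIT)   # equals -identity, mirroring ii = -1

print(h_squared)
print(i_squared)
```

Both stand-ins are real matrices, which is why the two sign rules can be compared directly with integer arithmetic.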
Affiliation(s)
- Barry Robson
- The Dirac Foundation, Oxfordshire, UK; Ingine Inc, USA.
- Jim St Clair
- Linux Foundation Public Health, San Francisco, USA
5
Robson B. Towards faster response against emerging epidemics and prediction of variants of concern. Informatics in Medicine Unlocked 2022; 31:100966. [PMID: 35611320 PMCID: PMC9119712 DOI: 10.1016/j.imu.2022.100966] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2022] [Revised: 05/05/2022] [Accepted: 05/11/2022] [Indexed: 01/11/2023] Open
Abstract
The author, the journal, Computers in Biology and Medicine (CBM), and Elsevier Press more generally, played a helpful, very early role in responding to COVID-19. Within a few days of the appearance of the "Wuhan Seafood isolate" genome on GenBank, a bioinformatics study was posted by the present author on ResearchGate in January 2020, "Preliminary Bioinformatics Studies on the Design of Synthetic Vaccines and Preventative Peptidomimetic Antagonists against the Wuhan Seafood Market Coronavirus. Possible Importance of the KRSFIEDLLFNKV Motif" DOI: 10.13140/RG.2.2.18275.09761. On February 2nd, 2020, a more thorough analysis was submitted to CBM, e-published on February 26, and formally published in April 2020, at about the same time as the virus, then named 2019-nCoV, was identified as essentially SARS and renamed SARS-CoV-2. This was followed by four further papers describing in more detail some previously unreported aspects of the early investigation. The speed of research and writing of the papers was made possible by knowledge-gathering tools. Based on this and earlier experiences with fast responses to emerging epidemics such as HIV and Mad Cow Disease, it is possible to envisage the nature of a speedier response to emerging epidemics and new variants of concern in established epidemics.
Affiliation(s)
- B Robson
- Ingine Inc., Cleveland, Ohio, USA; The Dirac Foundation, Oxfordshire, UK
6
Searching for the principles of a less artificial A.I. Informatics in Medicine Unlocked 2022. [DOI: 10.1016/j.imu.2022.101018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022] Open
7
Robson B, Boray S, Weisman J. Mining real-world high dimensional structured data in medicine and its use in decision support. Some different perspectives on unknowns, interdependency, and distinguishability. Comput Biol Med 2021; 141:105118. [PMID: 34971979 DOI: 10.1016/j.compbiomed.2021.105118] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/24/2021] [Revised: 11/18/2021] [Accepted: 12/02/2021] [Indexed: 11/03/2022]
Abstract
There are many difficulties in extracting and using knowledge for medical analytic and predictive purposes from Real-World Data, even when the data is already well structured in the manner of a large spreadsheet. Preparative curation and standardization or "normalization" of such data involves a variety of chores but underlying them is an interrelated set of fundamental problems that can in part be dealt with automatically during the datamining and inference processes. These fundamental problems are reviewed here and illustrated and investigated with examples. They concern the treatment of unknowns, the need to avoid independency assumptions, and the appearance of entries that may not be fully distinguished from each other. Unknowns include errors detected as implausible (e.g., out of range) values that are subsequently converted to unknowns. These problems are further impacted by high dimensionality and problems of sparse data that inevitably arise from high-dimensional datamining even if the data is extensive. All these considerations are different aspects of incomplete information, though they also relate to problems that arise if care is not taken to avoid or ameliorate consequences of including the same information twice or more, or if misleading or inconsistent information is combined. This paper addresses these aspects from a slightly different perspective using the Q-UEL language and inference methods based on it by borrowing some ideas from the mathematics of quantum mechanics and information theory. It takes the view that detection and correction of probabilistic elements of knowledge subsequently used in inference need only involve testing and correction so that they satisfy certain extended notions of coherence between probabilities. This is by no means the only possible view, and it is explored here and later compared with a related notion of consistency.
Affiliation(s)
- Barry Robson
- Ingine Inc, Ohio, USA; The Dirac Foundation, Oxfordshire, UK.
- J Weisman
- The Dirac Foundation, Oxfordshire, UK.
8
Robson B. Testing machine learning techniques for general application by using protein secondary structure prediction. A brief survey with studies of pitfalls and benefits using a simple progressive learning approach. Comput Biol Med 2021; 138:104883. [PMID: 34598067 DOI: 10.1016/j.compbiomed.2021.104883] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/02/2021] [Revised: 09/05/2021] [Accepted: 09/17/2021] [Indexed: 01/05/2023]
Abstract
Many researchers have recently used the prediction of protein secondary structure (local conformational states of amino acid residues) to test advances in predictive and machine learning technology such as Neural Net Deep Learning. Protein secondary structure prediction continues to be a helpful tool in research in biomedicine and the life sciences, but it is also extremely enticing for testing predictive methods such as neural nets that are intended for different or more general purposes. A complication is highlighted here for researchers testing their methods for other applications. Modern protein databases inevitably contain important clues to the answer, so-called "strong buried clues", though often obscurely; they are hard to avoid. This is because most proteins or parts of proteins in a modern protein data base are related to others by biological evolution. For researchers developing machine learning and predictive methods, this can overstate and so confuse understanding of the true quality of a predictive method. However, for researchers using the algorithms as tools, understanding strong buried clues is of great value, because they need to make maximum use of all information available. A simple method related to the GOR methods but with some features of neural nets in the sense of progressive learning of large numbers of weights, is used to explore this. It can acquire tens of millions and hence gigabytes of weights, but they are learned stably by exhaustive sampling. The significance of the findings is discussed in the light of promising recent results from AlphaFold using Google's DeepMind.
Affiliation(s)
- Barry Robson
- Ingine Inc. Ohio, USA and the Dirac Foundation Oxfordshire, UK.
9
Robson B. The use of knowledge management tools in viroinformatics. Example study of a highly conserved sequence motif in Nsp3 of SARS-CoV-2 as a therapeutic target. Comput Biol Med 2020; 125:103963. [PMID: 32828990 PMCID: PMC7424310 DOI: 10.1016/j.compbiomed.2020.103963] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 07/06/2020] [Revised: 08/07/2020] [Accepted: 08/07/2020] [Indexed: 12/16/2022]
Abstract
Knowledge management tools that assist in systematic review and exploration of scientific knowledge are of obvious potential importance in evidence-based medicine in general, but also to the design of therapeutics based on the protein subsequences and fold motifs of virus proteins as considered here. Rapid access to bundles (clusters) of related elements of knowledge gathered from diverse sources on the Internet and from growing knowledge repositories seems particularly helpful when exploring less obvious therapeutic targets in viruses (for which knowledge new to the researcher is important), and when using the following concept. Subsequences of amino acid residue sequences of proteins that are conserved across strains and species are (a) more likely to be important targets and (b) less likely to exhibit escape mutations that would make them resistant to vaccines and therapeutic agents. However, the terms "conserved" and even "highly conserved" used by authors are matters of degree, depending on how distant from SARS-CoV-2 they wished to go in comparing other sequences. The binding site to the human ACE2 protein as virus receptor and the human antibody CR3022 binding site on the spike glycoprotein are rather variable by the criteria used in the present and preceding studies. To look for more strongly conserved targets, open reading frames of SARS-CoV-2 were examined for extremely highly conserved regions, meaning recognizable across many viruses and organisms. Most prominent is a motif found in SARS-CoV-2 non-structural protein 3 (Nsp3). It relates to a fold type called the macro domain and has remarkably wide distribution across organisms including humans, with significant homologies involving three especially conserved subsequences (a) VVVNAANVYLKHGGGVAGALNK, (b) LHVVGPNVNKG, and (c) PLLSAGIFG. Careful study of the variations of these and of the more variable sequences between and around them might provide a finer "scalpel" to ensure inhibition of a vital function of the virus without impairing the functions of related host macro domains.
Affiliation(s)
- B. Robson
- Ingine Inc., Cleveland, OH, USA; The Dirac Foundation, Oxfordshire, UK
10
Robson B. Bioinformatics studies on a function of the SARS-CoV-2 spike glycoprotein as the binding of host sialic acid glycans. Comput Biol Med 2020; 122:103849. [PMID: 32658736 PMCID: PMC7278709 DOI: 10.1016/j.compbiomed.2020.103849] [Citation(s) in RCA: 39] [Impact Index Per Article: 9.8] [Received: 05/07/2020] [Revised: 06/04/2020] [Accepted: 06/04/2020] [Indexed: 02/08/2023]
Abstract
SARS-CoV and SARS-CoV-2 do not appear to have functions of a hemagglutinin and neuraminidase. This is a mystery, because sugar binding activities appear essential to many other viruses including influenza and even most other coronaviruses in order to bind to and escape from the glycans (sugars, oligosaccharides or polysaccharides) characteristic of cell surfaces and saliva and mucin. The S1 N terminal Domains (S1-NTD) of the spike protein, largely responsible for the bulk of the characteristic knobs at the end of the spikes of SARS-CoV and SARS-CoV-2, are here predicted to be “hiding” sites for recognizing and binding glycans containing sialic acid. This may be important for infection and the ability of the virus to locate ACE2 as its known main host cell surface receptor, and if so it becomes a pharmaceutical target. It might even open up the possibility of an alternative receptor to ACE2. The prediction method developed, which uses amino acid residue sequence alone to predict domains or proteins that bind to sialic acids, is naïve, and will be advanced in future work. Nonetheless, it was surprising that such a very simple approach was so useful, and it can easily be reproduced in a very few lines of computer program to help make quick comparisons between SARS-CoV-2 sequences and to consider the effects of viral mutations. This paper extends the studies of the author's previous SARS-CoV-2 papers. Designing vaccine and drugs must seek to avoid escape mutations. Strangely, SARS-CoV and SARS-CoV-2 appear to lack sialic acid binding functions. Sequence motifs are found, but they require a simple prediction method.
Affiliation(s)
- B Robson
- Ingine Inc. Cleveland Ohio USA and the Dirac Foundation, Oxfordshire, UK.
11
Robson B. COVID-19 Coronavirus spike protein analysis for synthetic vaccines, a peptidomimetic antagonist, and therapeutic drugs, and analysis of a proposed achilles' heel conserved region to minimize probability of escape mutations and drug resistance. Comput Biol Med 2020; 121:103749. [PMID: 32568687 PMCID: PMC7151553 DOI: 10.1016/j.compbiomed.2020.103749] [Citation(s) in RCA: 90] [Impact Index Per Article: 22.5] [Received: 03/15/2020] [Revised: 04/03/2020] [Accepted: 04/03/2020] [Indexed: 12/17/2022]
Abstract
This paper continues a recent study of the spike protein sequence of the COVID-19 virus (SARS-CoV-2). It is also in part an introductory review to relevant computational techniques for tackling viral threats, using COVID-19 as an example. Q-UEL tools for facilitating access to knowledge and bioinformatics tools were again used for efficiency, but the focus in this paper is even more on the virus. Subsequence KRSFIEDLLFNKV of the S2′ spike glycoprotein proteolytic cleavage site continues to appear important. Here it is shown to be recognizable in the common cold coronaviruses, avian coronaviruses and possibly as traces in the nidoviruses of reptiles and fish. Its function or functions thus seem important to the coronaviruses. It might represent the SARS-CoV-2 Achilles’ heel, less likely to acquire resistance by mutation, as has happened in some early SARS vaccine studies discussed in the previous paper. Preliminary conformational analysis of the receptor (ACE2) binding site of the spike protein is carried out, suggesting that while it is somewhat conserved, it appears to be more variable than KRSFIEDLLFNKV. However, compounds like emodin that inhibit SARS entry, apparently by binding ACE2, might also have functions at several different human protein binding sites. The enzyme 11β-hydroxysteroid dehydrogenase type 1 is again argued to be a convenient model pharmacophore perhaps representing an ensemble of targets, and it is noted that it occurs both in lung and alimentary tract. Perhaps it benefits the virus to block an inflammatory response by inhibiting the dehydrogenase, but a fairly complex web involves several possible targets. This paper “drills down” into the studies of the author's previous COVID-19 paper. Designing vaccine and drugs must seek to avoid escape mutations. Subsequence KRSFIEDLLFNKV seems recognizable across many coronaviruses. The ACE2 binding domain is a target, but shows variation. A steroid dehydrogenase is argued to remain an interesting model pharmacophore.
Affiliation(s)
- B Robson
- Ingine Inc. Cleveland Ohio USA, The Dirac Foundation, Oxfordshire, UK.
12
Robson B. Computers and viral diseases. Preliminary bioinformatics studies on the design of a synthetic vaccine and a preventative peptidomimetic antagonist against the SARS-CoV-2 (2019-nCoV, COVID-19) coronavirus. Comput Biol Med 2020; 119:103670. [PMID: 32209231 PMCID: PMC7094376 DOI: 10.1016/j.compbiomed.2020.103670] [Citation(s) in RCA: 126] [Impact Index Per Article: 31.5] [Received: 02/02/2020] [Revised: 02/17/2020] [Accepted: 02/17/2020] [Indexed: 12/19/2022]
Abstract
This paper concerns a study of the genome of the Wuhan Seafood Market isolate believed to represent the causative agent of the disease COVID-19. This is to find a short section or sections of viral protein sequence suitable for a preliminary design proposal for a peptide synthetic vaccine and a peptidomimetic therapeutic, and to explore some design possibilities. The project was originally directed towards a use case for the Q-UEL language and its implementation in a knowledge management and automated inference system for medicine called the BioIngine, but focus here remains mostly on the virus itself. However, using Q-UEL systems to access relevant and emerging literature, and to interact with standard publicly available bioinformatics tools on the Internet, did help quickly identify sequences of amino acids that are well conserved across many coronaviruses including 2019-nCoV. KRSFIEDLLFNKV was found to be particularly well conserved in this study and corresponds to the region around one of the known cleavage sites of the SARS virus that are believed to be required for virus activation for cell entry. This sequence motif and surrounding variations formed the basis for proposing a specific synthetic vaccine epitope and peptidomimetic agent. The work can, nonetheless, be described in traditional bioinformatics terms, and readily reproduced by others, albeit with the caveat that new data and research into 2019-nCoV are emerging and evolving at an explosive pace. Preliminary studies using molecular modeling and docking, and in that context the potential value of certain known herbal extracts, are also described. Bioinformatics studies are carried out on the COVID-19 virus. A sequence motif KRSFIEDLLFNKV is of particular interest. Based on the above, synthetic peptides are designed. Preliminary considerations are also given to non-peptide organic molecules.
Affiliation(s)
- B Robson
- Ingine Inc., Cleveland, Ohio, USA; The Dirac Foundation, Oxfordshire, UK.
13
Robson B. Extension of the Quantum Universal Exchange Language to precision medicine and drug lead discovery. Preliminary example studies using the mitochondrial genome. Comput Biol Med 2020; 117:103621. [PMID: 32072972 DOI: 10.1016/j.compbiomed.2020.103621] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/10/2019] [Revised: 01/12/2020] [Accepted: 01/12/2020] [Indexed: 12/21/2022]
Abstract
The Quantum Universal Exchange Language (Q-UEL) based on Dirac notation and algebra from quantum mechanics, along with its associated data mining and Hyperbolic Dirac Net (HDN) for probabilistic inference, has proven to be a useful architectural principle for knowledge management, analysis and prediction systems in medicine. It has been described in several papers; described here is its extension to clinical genomics and precision medicine. Two use cases are studied: (a) bioinformatics in clinical decision support, especially for risk of type 2 diabetes using mitochondrial patient DNA sequences, and (b) bioinformatics and computational biology (conformational) research examples related to drug discovery involving the recently discovered class of mitochondrial derived peptides (MDPs). MDPs were surprising when first discovered as coded in small open reading frames (sORFs), and are emerging as having a fundamental role in metabolic control, longevity and disease. This project originally represented a language specification study relating to what information related to genomics is essential or useful to carry, and what processing will be needed. However, novel aspects introduced or discovered include the HDN-like neural nets and their use, along with more established methods, for prediction of type 2 diabetes, and in particular for proposals for over 80 natural MDPs, most of which had not previously been described at the time of the study, as potential drug lead targets. Also, use of many medical records with simulated joining of mtDNA as performance tests led to some insightful observations regarding the behavior of HDN predictions where independent factors are involved.
Affiliation(s)
- Barry Robson
- Ingine Inc., Delaware, USA; The Dirac Foundation, Oxfordshire, UK.
14
Robson B, Boray S. Studies in the use of data mining, prediction algorithms, and a universal exchange and inference language in the analysis of socioeconomic health data. Comput Biol Med 2019; 112:103369. [PMID: 31377681 DOI: 10.1016/j.compbiomed.2019.103369] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 05/10/2019] [Revised: 07/22/2019] [Accepted: 07/23/2019] [Indexed: 12/18/2022]
Abstract
While clinical and biomedical information in digital form has been escalating, it is socioeconomic factors that are important determinants of health on the national and global scale. We show how collective use of data mining and prediction algorithms to analyze socioeconomic population health data can stand beside classical correlation analysis in routine data analysis. The underlying theoretical basis is the Dirac notation and algebra that is a scientific standard but unusual outside of the physical sciences, combined with a theory of expected information first developed for analyzing sparse data but still largely confined to bioinformatics. The latter was important here because the records analyzed (which are for US counties and equivalents, not patients) are very few by contemporary data mining standards. The approach is very unlikely to be familiar to socioeconomic researchers, so the theory and the advantages of our inference nets over the Bayes Net are reviewed here, mostly using socioeconomic examples. While our expertise and focus is in regard to novel analytical methods rather than socioeconomics per se, a significant negative (countertrending) relationship between population health and equity was initially surprising, at least to the present authors. This encouraged deeper exploration including that of the relationship between our data mining methods and traditional Pearson's correlation. The latter is susceptible to giving wrong conclusions if a phenomenon called Simpson's paradox applies, so this is also investigated. Also discussed is that, even for very few records, associative data mining can still demand significant computational resources due to a combinatorial explosion.
Affiliation(s)
- Barry Robson
- Ingine Inc., Virginia, USA and the Dirac Foundation, Oxfordshire, UK.
- S Boray
- Ingine Inc., Virginia, USA and the Dirac Foundation, Oxfordshire, UK.
15
Robson B. Bidirectional General Graphs for inference. Principles and implications for medicine. Comput Biol Med 2019; 108:382-399. [PMID: 31075569 DOI: 10.1016/j.compbiomed.2019.04.005] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 02/18/2019] [Revised: 04/03/2019] [Accepted: 04/04/2019] [Indexed: 12/17/2022]
Abstract
Probabilistic inference methods require a more general and realistic description of the world as a Bidirectional General Graph (BGG). While in its original form the Bayes Net (BN) has been promoted as a predictive tool, it is more immediately a way of testing a hypothesis or model about interactions in a system usually considered on a causal basis. Once established, the model can be used in a predictive way, but the problem here is that for a traditional BN the hypotheses or models that can be formed are limited to the Directed Acyclic Graph (DAG) by definition. Three interrelated features are highlighted that represent deficiencies of the DAG which are corrected by conversion to a method based on a BGG: (i) lack of intrinsic representation of coherence by Bayes' rule, (ii) relatedly the need to consider interdependence in parent nodes, and (iii) the need for management of a property called recurrence. These deficiencies can represent large errors in absolute estimates of probabilities, and while relative and renormalized probabilities ameliorate that, they can often make much of a net superfluous through cancelations by division. The Hyperbolic Dirac Net (HDN) based on Dirac's quantum mechanics is a solution that led naturally to avoiding these deficiencies. It encodes bidirectional probabilities in an h-complex value rediscovered by Dirac, i.e. with the imaginary number h such that hh = +1. Properties of the HDN described previously are reviewed (though emphasis is on descriptions in familiar probability terms), the issue of recurrence is introduced, methods of construction are simplified, and the severity of the quantitative differences between BNs and analogous HDNs are exemplified. There is also discussion of how results compare with other approaches in practice.
Affiliation(s)
- Barry Robson
- Ingine Inc., Virginia, USA; The Dirac Foundation, Oxfordshire, UK.