1
Hobæk Haff I, Aas K, Frigessi A. On the simplified pair-copula construction — Simply useful or too simplistic? J Multivariate Anal 2010. [DOI: 10.1016/j.jmva.2009.12.001] [Citation(s) in RCA: 107] [Impact Index Per Article: 7.1] [Indexed: 10/20/2022]
2
Akbar R, Robert PA, Pavlović M, Jeliazkov JR, Snapkov I, Slabodkin A, Weber CR, Scheffer L, Miho E, Haff IH, Haug DTT, Lund-Johansen F, Safonova Y, Sandve GK, Greiff V. A compact vocabulary of paratope-epitope interactions enables predictability of antibody-antigen binding. Cell Rep 2021; 34:108856. [PMID: 33730590] [DOI: 10.1016/j.celrep.2021.108856] [Citation(s) in RCA: 82] [Impact Index Per Article: 20.5] [Received: 03/12/2020] [Revised: 11/29/2020] [Accepted: 02/22/2021] [Indexed: 12/16/2022] Open
Abstract
Antibody-antigen binding relies on the specific interaction of amino acids at the paratope-epitope interface. The predictability of antibody-antigen binding is a prerequisite for de novo antibody and (neo-)epitope design. A fundamental premise for the predictability of antibody-antigen binding is the existence of paratope-epitope interaction motifs that are universally shared among antibody-antigen structures. In a dataset of non-redundant antibody-antigen structures, we identify structural interaction motifs, which together compose a commonly shared structure-based vocabulary of paratope-epitope interactions. We show that this vocabulary enables the machine learnability of antibody-antigen binding on the paratope-epitope level using generative machine learning. The vocabulary (1) is compact, less than 10^4 motifs; (2) distinct from non-immune protein-protein interactions; and (3) mediates specific oligo- and polyreactive interactions between paratope-epitope pairs. Our work leverages combined structure- and sequence-based learning to demonstrate that machine-learning-driven predictive paratope and epitope engineering is feasible.
Journal Article
3
4
Akbar R, Robert PA, Weber CR, Widrich M, Frank R, Pavlović M, Scheffer L, Chernigovskaya M, Snapkov I, Slabodkin A, Mehta BB, Miho E, Lund-Johansen F, Andersen JT, Hochreiter S, Hobæk Haff I, Klambauer G, Sandve GK, Greiff V. In silico proof of principle of machine learning-based antibody design at unconstrained scale. MAbs 2022; 14:2031482. [PMID: 35377271] [PMCID: PMC8986205] [DOI: 10.1080/19420862.2022.2031482] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 08/17/2021] [Accepted: 01/17/2022] [Indexed: 12/15/2022] Open
Abstract
Generative machine learning (ML) has been postulated to become a major driver in the computational design of antigen-specific monoclonal antibodies (mAb). However, efforts to confirm this hypothesis have been hindered by the infeasibility of testing arbitrarily large numbers of antibody sequences for their most critical design parameters: paratope, epitope, affinity, and developability. To address this challenge, we leveraged a lattice-based antibody-antigen binding simulation framework, which incorporates a wide range of physiological antibody-binding parameters. The simulation framework enables the computation of synthetic antibody-antigen 3D-structures, and it functions as an oracle for unrestricted prospective evaluation and benchmarking of antibody design parameters of ML-generated antibody sequences. We found that a deep generative model, trained exclusively on antibody sequence (one dimensional: 1D) data can be used to design conformational (three dimensional: 3D) epitope-specific antibodies, matching, or exceeding the training dataset in affinity and developability parameter value variety. Furthermore, we established a lower threshold of sequence diversity necessary for high-accuracy generative antibody ML and demonstrated that this lower threshold also holds on experimental real-world data. Finally, we show that transfer learning enables the generation of high-affinity antibody sequences from low-N training data. Our work establishes a priori feasibility and the theoretical foundation of high-throughput ML-based mAb design.
report
5
Hobæk Haff I, Segers J. Nonparametric estimation of pair-copula constructions with the empirical pair-copula. Comput Stat Data Anal 2015. [DOI: 10.1016/j.csda.2014.10.020] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Indexed: 11/27/2022]
6
Robert PA, Akbar R, Frank R, Pavlović M, Widrich M, Snapkov I, Slabodkin A, Chernigovskaya M, Scheffer L, Smorodina E, Rawat P, Mehta BB, Vu MH, Mathisen IF, Prósz A, Abram K, Olar A, Miho E, Haug DTT, Lund-Johansen F, Hochreiter S, Haff IH, Klambauer G, Sandve GK, Greiff V. Unconstrained generation of synthetic antibody-antigen structures to guide machine learning methodology for antibody specificity prediction. Nature Computational Science 2022; 2:845-865. [PMID: 38177393] [DOI: 10.1038/s43588-022-00372-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 07/16/2021] [Accepted: 11/09/2022] [Indexed: 01/06/2024]
Abstract
Machine learning (ML) is a key technology for accurate prediction of antibody-antigen binding. Two orthogonal problems hinder the application of ML to antibody-specificity prediction and the benchmarking thereof: the lack of a unified ML formalization of immunological antibody-specificity prediction problems and the unavailability of large-scale synthetic datasets to benchmark real-world relevant ML methods and dataset design. Here we developed the Absolut! software suite that enables parameter-based unconstrained generation of synthetic lattice-based three-dimensional antibody-antigen-binding structures with ground-truth access to conformational paratope, epitope and affinity. We formalized common immunological antibody-specificity prediction problems as ML tasks and confirmed that for both sequence- and structure-based tasks, accuracy-based rankings of ML methods trained on experimental data hold for ML methods trained on Absolut!-generated data. The Absolut! framework has the potential to enable real-world relevant development and benchmarking of ML strategies for biotherapeutics design.
7
Slabodkin A, Chernigovskaya M, Mikocziova I, Akbar R, Scheffer L, Pavlović M, Bashour H, Snapkov I, Mehta BB, Weber CR, Gutierrez-Marcos J, Sollid LM, Haff IH, Sandve GK, Robert PA, Greiff V. Individualized VDJ recombination predisposes the available Ig sequence space. Genome Res 2021; 31:2209-2224. [PMID: 34815307] [PMCID: PMC8647828] [DOI: 10.1101/gr.275373.121] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 04/19/2021] [Accepted: 10/20/2021] [Indexed: 11/25/2022]
Abstract
The process of recombination between variable (V), diversity (D), and joining (J) immunoglobulin (Ig) gene segments determines an individual's naive Ig repertoire and, consequently, (auto)antigen recognition. VDJ recombination follows probabilistic rules that can be modeled statistically. So far, it remains unknown whether VDJ recombination rules differ between individuals. If these rules differed, identical (auto)antigen-specific Ig sequences would be generated with individual-specific probabilities, signifying that the available Ig sequence space is individual specific. We devised a sensitivity-tested distance measure that enables inter-individual comparison of VDJ recombination models. We discovered, accounting for several sources of noise as well as allelic variation in Ig sequencing data, that not only unrelated individuals but also human monozygotic twins and even inbred mice possess statistically distinguishable immunoglobulin recombination models. This suggests that, in addition to genetic, there is also nongenetic modulation of VDJ recombination. We demonstrate that population-wide individualized VDJ recombination can result in orders of magnitude of difference in the probability to generate (auto)antigen-specific Ig sequences. Our findings have implications for immune receptor-based individualized medicine approaches relevant to vaccination, infection, and autoimmunity.
research-article
8
9
Minotto T, Robert PA, Hobæk Haff I, Sandve GK. Assessing the feasibility of statistical inference using synthetic antibody-antigen datasets. Stat Appl Genet Mol Biol 2024; 23:sagmb-2023-0027. [PMID: 38563699] [DOI: 10.1515/sagmb-2023-0027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2023] [Accepted: 03/13/2024] [Indexed: 04/04/2024]
Abstract
Simulation frameworks are useful to stress-test predictive models when data is scarce, or to assert model sensitivity to specific data distributions. Such frameworks often need to recapitulate several layers of data complexity, including emergent properties that arise implicitly from the interaction between simulation components. Antibody-antigen binding is a complex mechanism by which an antibody sequence wraps itself around an antigen with high affinity. In this study, we use a synthetic simulation framework for antibody-antigen folding and binding on a 3D lattice that includes full details on the spatial conformation of both molecules. We investigate how emergent properties arise in this framework, in particular the physical proximity of amino acids, their presence on the binding interface, or the binding status of a sequence, and relate that to the individual and pairwise contributions of amino acids in statistical models for binding prediction. We show that weights learnt from a simple logistic regression model align with some but not all features of amino acids involved in the binding, and that predictive sequence binding patterns can be enriched. In particular, main effects correlated with the capacity of a sequence to bind any antigen, while statistical interactions were related to sequence specificity.
10
Chernigovskaya M, Pavlović M, Kanduri C, Gielis S, Robert P, Scheffer L, Slabodkin A, Haff IH, Meysman P, Yaari G, Sandve GK, Greiff V. Simulation of adaptive immune receptors and repertoires with complex immune information to guide the development and benchmarking of AIRR machine learning. Nucleic Acids Res 2025; 53:gkaf025. [PMID: 39873270] [PMCID: PMC11773363] [DOI: 10.1093/nar/gkaf025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2023] [Accepted: 01/25/2025] [Indexed: 01/30/2025] Open
Abstract
Machine learning (ML) has shown great potential in the adaptive immune receptor repertoire (AIRR) field. However, there is a lack of large-scale ground-truth experimental AIRR data suitable for AIRR-ML-based disease diagnostics and therapeutics discovery. Simulated ground-truth AIRR data are required to complement the development and benchmarking of robust and interpretable AIRR-ML methods where experimental data is currently inaccessible or insufficient. The challenge for simulated data to be useful is incorporating key features observed in experimental repertoires. These features, such as antigen or disease-associated immune information, cause AIRR-ML problems to be challenging. Here, we introduce LIgO, a software suite, which simulates AIRR data for the development and benchmarking of AIRR-ML methods. LIgO incorporates different types of immune information both on the receptor and the repertoire level and preserves native-like generation probability distribution. Additionally, LIgO assists users in determining the computational feasibility of their simulations. We show two examples where LIgO supports the development and validation of AIRR-ML methods: (i) how individuals carrying out-of-distribution immune information impact receptor-level prediction performance and (ii) how immune information co-occurring in the same AIRs impacts the performance of conventional receptor-level encoding and repertoire-level classification approaches. LIgO guides the advancement and assessment of interpretable AIRR-ML methods.
research-article
11
Scheffer L, Reber EE, Mehta BB, Pavlović M, Chernigovskaya M, Richardson E, Akbar R, Lund-Johansen F, Greiff V, Haff IH, Sandve GK. Predictability of antigen binding based on short motifs in the antibody CDRH3. Brief Bioinform 2024; 25:bbae537. [PMID: 39438077] [PMCID: PMC11495870] [DOI: 10.1093/bib/bbae537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Revised: 09/30/2024] [Accepted: 10/16/2024] [Indexed: 10/25/2024] Open
Abstract
Adaptive immune receptors, such as antibodies and T-cell receptors, recognize foreign threats with exquisite specificity. A major challenge in adaptive immunology is discovering the rules governing immune receptor-antigen binding in order to predict the antigen binding status of previously unseen immune receptors. Many studies assume that the antigen binding status of an immune receptor may be determined by the presence of a short motif in the complementarity determining region 3 (CDR3), disregarding other amino acids. To test this assumption, we present a method to discover short motifs which show high precision in predicting antigen binding and generalize well to unseen simulated and experimental data. Our analysis of a mutagenesis-based antibody dataset reveals 11 336 position-specific, mostly gapped motifs of 3-5 amino acids that retain high precision on independently generated experimental data. Using a subset of only 178 motifs, a simple classifier was made that on the independently generated dataset outperformed a deep learning model proposed specifically for such datasets. In conclusion, our findings support the notion that for some antibodies, antigen binding may be largely determined by a short CDR3 motif. As more experimental data emerge, our methodology could serve as a foundation for in-depth investigations into antigen binding signals.
research-article
12
Brant SB, Hobæk Haff I. The fraud loss for selecting the model complexity in fraud detection. J Appl Stat 2022; 50:2209-2227. [PMID: 37434626] [PMCID: PMC10332194] [DOI: 10.1080/02664763.2022.2070137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2021] [Accepted: 04/20/2022] [Indexed: 10/18/2022]
Abstract
Statistical fraud detection consists in making a system that automatically selects a subset of all cases (insurance claims, financial transactions, etc.) that are the most interesting for further investigation. The reason why such a system is needed is that the total number of cases typically is much higher than one realistically could investigate manually and that fraud tends to be quite rare. Further, the investigator is typically limited to controlling a restricted number k of cases, due to limited resources. The most efficient manner of allocating these resources is then to try selecting the k cases with the highest probability of being fraudulent. The prediction model used for this purpose must normally be regularised to avoid overfitting and consequently bad prediction performance. A loss function, denoted the fraud loss, is proposed for selecting the model complexity via a tuning parameter. A simulation study is performed to find the optimal settings for validation. Further, the performance of the proposed procedure is compared to the most relevant competing procedure, based on the area under the receiver operating characteristic curve (AUC), in a set of simulations, as well as on a credit card default dataset. Choosing the complexity of the model by the fraud loss resulted in either comparable or better results in terms of the fraud loss than choosing it according to the AUC.
research-article