1. Colas JT, O’Doherty JP, Grafton ST. Active reinforcement learning versus action bias and hysteresis: control with a mixture of experts and nonexperts. PLoS Comput Biol 2024; 20:e1011950. PMID: 38552190; PMCID: PMC10980507; DOI: 10.1371/journal.pcbi.1011950.
Abstract
Active reinforcement learning enables dynamic prediction and control, where one should not only maximize rewards but also minimize costs such as those of inference, decisions, actions, and time. For an embodied agent such as a human, decisions are also shaped by physical aspects of actions. Beyond the effects of reward outcomes on learning processes, to what extent can modeling of behavior in a reinforcement-learning task be complicated by other sources of variance in sequential action choices? What of the effects of action bias (for actions per se) and action hysteresis determined by the history of previously chosen actions? The present study addressed these questions with incremental assembly of models for the sequential choice data from a task with hierarchical structure for additional complexity in learning. With systematic comparison and falsification of computational models, human choices were tested for signatures of parallel modules representing not only an enhanced form of generalized reinforcement learning but also action bias and hysteresis. We found evidence for substantial differences in bias and hysteresis across participants, even comparable in magnitude to the individual differences in learning. Individuals who did not learn well revealed the greatest biases, but those who did learn accurately were also significantly biased. The direction of hysteresis varied among individuals as repetition or, more commonly, alternation biases persisting across multiple previous actions. Considering that these actions were button presses with trivial motor demands, the idiosyncratic forces biasing sequences of action choices were robust enough to suggest ubiquity across individuals and across tasks requiring various actions.
In light of how bias and hysteresis function as a heuristic for efficient control that adapts to uncertainty or low motivation by minimizing the cost of effort, these phenomena broaden the consilient theory of a mixture of experts to encompass a mixture of expert and nonexpert controllers of behavior.
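As a rough illustration of how such a mixture of expert and nonexpert controllers can be expressed, the sketch below combines learned action values with a static action bias and a hysteresis term in a softmax choice rule; this is a generic sketch, not the authors' fitted model, and all names and parameter values are illustrative. A negative hysteresis weight produces the alternation bias described above.

```python
import math

def choice_probs(q, bias, last_action, beta=3.0, kappa=-0.5):
    """Softmax choice rule combining learned action values (q, weighted
    by inverse temperature beta), a static per-action bias, and a
    hysteresis term kappa applied to the previously chosen action
    (kappa > 0 favors repetition; kappa < 0 favors alternation)."""
    logits = [beta * q[a] + bias[a] + (kappa if a == last_action else 0.0)
              for a in range(len(q))]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# With equal values and no static bias, a negative kappa (alternation
# bias) shifts choice probability away from the last action chosen.
p = choice_probs(q=[0.5, 0.5], bias=[0.0, 0.0], last_action=0)
```

Even when the learned values are identical, the nonexpert terms alone tilt the choice distribution, which is how bias and hysteresis can masquerade as (or mask) learning effects.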
Affiliation(s)
- Jaron T. Colas
- Department of Psychological and Brain Sciences, University of California, Santa Barbara, California, United States of America
- Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, California, United States of America
- Computation and Neural Systems Program, California Institute of Technology, Pasadena, California, United States of America
- John P. O’Doherty
- Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, California, United States of America
- Computation and Neural Systems Program, California Institute of Technology, Pasadena, California, United States of America
- Scott T. Grafton
- Department of Psychological and Brain Sciences, University of California, Santa Barbara, California, United States of America
2. Sato R, Shimomura K, Morita K. Opponent learning with different representations in the cortico-basal ganglia pathways can develop obsession-compulsion cycle. PLoS Comput Biol 2023; 19:e1011206. PMID: 37319256; PMCID: PMC10306209; DOI: 10.1371/journal.pcbi.1011206.
Abstract
Obsessive-compulsive disorder (OCD) has been suggested to be associated with impairment of model-based behavioral control. Meanwhile, recent work suggested a shorter memory trace for negative than for positive prediction errors (PEs) in OCD. We explored relations between these two suggestions through computational modeling. Based on the properties of the cortico-basal ganglia pathways, we modeled a human as an agent having a combination of a successor representation (SR)-based system that enables model-based-like control and an individual representation (IR)-based system that hosts only model-free control, with the two systems potentially learning from positive and negative PEs at different rates. We simulated the agent's behavior in the environmental model used in the recent work that describes potential development of the obsession-compulsion cycle. We found that the dual-system agent could develop an enhanced obsession-compulsion cycle, similarly to the agent with memory trace imbalance in the recent work, if the SR- and IR-based systems learned mainly from positive and negative PEs, respectively. We then simulated the behavior of such an opponent SR+IR agent in the two-stage decision task, in comparison with an agent having only SR-based control. Fitting the agents' behavior with the model weighing model-based and model-free control developed in the original two-stage-task study yielded smaller weights of model-based control for the opponent SR+IR agent than for the SR-only agent. These results reconcile the previous suggestions about OCD, i.e., impaired model-based control and memory trace imbalance, raising a novel possibility that opponent learning in model(SR)-based and model-free controllers underlies obsession-compulsion.
Our model cannot explain the behavior of OCD patients in punishment, rather than reward, contexts. This limitation could be resolved, however, if opponent SR+IR learning also operates in the recently revealed non-canonical cortico-basal ganglia-dopamine circuit for threat/aversiveness reinforcement learning; an aversive-SR + appetitive-IR agent could then actually develop obsession-compulsion if the environment is modeled differently.
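The opponent arrangement described above can be caricatured in a few lines: two parallel value systems share one reward prediction error but learn from its positive and negative parts at different rates. This is an illustrative sketch, not the authors' simulation code; all rate values are arbitrary.

```python
def opponent_update(v_sr, v_ir, reward,
                    a_sr_pos=0.3, a_sr_neg=0.05,
                    a_ir_pos=0.05, a_ir_neg=0.3):
    """One learning step for two parallel value systems that share a
    single reward prediction error (RPE) computed from their summed
    value. With these rates, the SR-like system learns mainly from
    positive RPEs and the IR-like system mainly from negative RPEs."""
    rpe = reward - (v_sr + v_ir)
    if rpe >= 0:
        v_sr += a_sr_pos * rpe
        v_ir += a_ir_pos * rpe
    else:
        v_sr += a_sr_neg * rpe
        v_ir += a_ir_neg * rpe
    return v_sr, v_ir, rpe
```

Over many trials the two systems converge to values of opposite sign around the shared estimate, which is the opponency exploited in the simulations above.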
Affiliation(s)
- Reo Sato
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo, Japan
- Kanji Shimomura
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo, Japan
- Kenji Morita
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo, Japan
- International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo, Tokyo, Japan
3. Morita K, Shimomura K, Kawaguchi Y. Opponent Learning with Different Representations in the Cortico-Basal Ganglia Circuits. eNeuro 2023; 10:ENEURO.0422-22.2023. PMID: 36653187; PMCID: PMC9884109; DOI: 10.1523/eneuro.0422-22.2023.
Abstract
The direct and indirect pathways of the basal ganglia (BG) have been suggested to learn mainly from positive and negative feedback, respectively. Since these pathways unevenly receive inputs from different cortical neuron types and/or regions, they may preferentially use different state/action representations. We explored whether such a combined use of different representations, coupled with different learning rates for positive and negative reward prediction errors (RPEs), has computational benefits. We modeled an animal as an agent equipped with two learning systems, each of which adopted an individual representation (IR) or a successor representation (SR) of states. Varying the combination of IR or SR as well as the learning rates from positive and negative RPEs in each system, we examined how the agent performed in a dynamic reward-navigation task. We found that the combination of an SR-based system learning mainly from positive RPEs and an IR-based system learning mainly from negative RPEs achieved good performance in the task, as compared with other combinations. In such a combination of appetitive SR-based and aversive IR-based systems, both systems showed activities of comparable magnitudes with opposite signs, consistent with the suggested profiles of the two BG pathways. Moreover, the architecture of such a combination provides a novel coherent explanation for the functional significance and underlying mechanisms of diverse findings about the cortico-BG circuits. These results suggest that combining different representations with appetitive and aversive learning could be an effective learning strategy in certain dynamic environments, and that it might actually be implemented in the cortico-BG circuits.
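For readers unfamiliar with the successor representation itself, a minimal tabular TD sketch follows (assumptions: a tiny three-state chain, arbitrary rates; this is not the paper's exact agent). Row M[s] predicts discounted future state occupancies, so value factorizes into occupancy predictions times learned per-state rewards.

```python
import numpy as np

def sr_td_step(M, w, s, s_next, reward, gamma=0.9, alpha=0.1, alpha_w=0.1):
    """One TD step for a tabular successor-representation agent:
    row M[s] holds discounted expected future occupancies of every
    state, and state value is the dot product of M[s] with the
    learned per-state reward weights w."""
    onehot = np.eye(M.shape[0])[s]
    # Bootstrap the occupancy predictions from the next state's row.
    M[s] += alpha * (onehot + gamma * M[s_next] - M[s])
    # Learn the per-state reward weight from the observed reward.
    w[s_next] += alpha_w * (reward - w[s_next])

def value(M, w, s):
    return float(M[s] @ w)

# Three states in a chain (0 -> 1 -> 2) with reward on entering state 2.
M, w = np.eye(3), np.zeros(3)
sr_td_step(M, w, 0, 1, 0.0)
sr_td_step(M, w, 1, 2, 1.0)
```

Because rewards enter only through w, an SR-based system can propagate value changes through its occupancy map in a partially model-based way, which an individual (one-hot) representation cannot.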
Affiliation(s)
- Kenji Morita
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo 113-0033, Japan
- International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo, Tokyo 113-0033, Japan
- Kanji Shimomura
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo 113-0033, Japan
- Department of Behavioral Medicine, National Institute of Mental Health, National Center of Neurology and Psychiatry, Kodaira 187-8551, Japan
- Yasuo Kawaguchi
- Brain Science Institute, Tamagawa University, Machida 194-8610, Japan
- National Institute for Physiological Sciences (NIPS), Okazaki 444-8787, Japan
4. Colas JT, Dundon NM, Gerraty RT, Saragosa‐Harris NM, Szymula KP, Tanwisuth K, Tyszka JM, van Geen C, Ju H, Toga AW, Gold JI, Bassett DS, Hartley CA, Shohamy D, Grafton ST, O'Doherty JP. Reinforcement learning with associative or discriminative generalization across states and actions: fMRI at 3 T and 7 T. Hum Brain Mapp 2022; 43:4750-4790. PMID: 35860954; PMCID: PMC9491297; DOI: 10.1002/hbm.25988.
Abstract
The model-free algorithms of "reinforcement learning" (RL) have gained clout across disciplines, but so too have model-based alternatives. The present study emphasizes other dimensions of this model space in consideration of associative or discriminative generalization across states and actions. This "generalized reinforcement learning" (GRL) model, a frugal extension of RL, parsimoniously retains the single reward-prediction error (RPE), but the scope of learning goes beyond the experienced state and action. Instead, the generalized RPE is efficiently relayed for bidirectional counterfactual updating of value estimates for other representations. Aided by structural information but as an implicit rather than explicit cognitive map, GRL provided the most precise account of human behavior and individual differences in a reversal-learning task with hierarchical structure that encouraged inverse generalization across both states and actions. Reflecting inference that could be true, false (i.e., overgeneralization), or absent (i.e., undergeneralization), state generalization distinguished those who learned well more so than action generalization. With high-resolution high-field fMRI targeting the dopaminergic midbrain, the GRL model's RPE signals (alongside value and decision signals) were localized within not only the striatum but also the substantia nigra and the ventral tegmental area, including specific effects of generalization that also extend to the hippocampus. Factoring in generalization as a multidimensional process in value-based learning, these findings shed light on complexities that, while challenging classic RL, can still be resolved within the bounds of its core computations.
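A minimal sketch of the bidirectional counterfactual-update idea follows (binary states and actions and all parameter values are assumed for illustration; this is not the fitted GRL model itself). One RPE from the experienced pair also drives sign-flipped updates of the alternative state and action.

```python
def grl_update(Q, s, a, reward, alpha=0.2, g_state=0.5, g_action=0.5):
    """Single-RPE update with inverse (sign-flipped) counterfactual
    generalization: the RPE from the experienced state-action pair
    also updates the alternative action and the alternative state in
    the opposite direction, scaled by g_action and g_state. Binary
    state and action spaces are assumed for simplicity."""
    rpe = reward - Q[s][a]
    s_alt, a_alt = 1 - s, 1 - a
    Q[s][a] += alpha * rpe                  # direct update
    Q[s][a_alt] -= alpha * g_action * rpe   # inverse generalization across actions
    Q[s_alt][a] -= alpha * g_state * rpe    # inverse generalization across states
    return rpe

Q = [[0.0, 0.0], [0.0, 0.0]]
rpe = grl_update(Q, 0, 0, 1.0)
```

Setting g_state or g_action to 0 recovers under-generalization, while values above the environment's true anticorrelation correspond to over-generalization, matching the inference taxonomy described above.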
Affiliation(s)
- Jaron T. Colas
- Department of Psychological and Brain Sciences, University of California, Santa Barbara, California, USA
- Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, California, USA
- Computation and Neural Systems Program, California Institute of Technology, Pasadena, California, USA
- Neil M. Dundon
- Department of Psychological and Brain Sciences, University of California, Santa Barbara, California, USA
- Department of Child and Adolescent Psychiatry, Psychotherapy, and Psychosomatics, University of Freiburg, Freiburg im Breisgau, Germany
- Raphael T. Gerraty
- Department of Psychology, Columbia University, New York, New York, USA
- Zuckerman Mind Brain Behavior Institute, Columbia University, New York, New York, USA
- Center for Science and Society, Columbia University, New York, New York, USA
- Natalie M. Saragosa‐Harris
- Department of Psychology, New York University, New York, New York, USA
- Department of Psychology, University of California, Los Angeles, California, USA
- Karol P. Szymula
- Department of Bioengineering, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Koranis Tanwisuth
- Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, California, USA
- Department of Psychology, University of California, Berkeley, California, USA
- J. Michael Tyszka
- Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, California, USA
- Camilla van Geen
- Zuckerman Mind Brain Behavior Institute, Columbia University, New York, New York, USA
- Department of Psychology, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Harang Ju
- Neuroscience Graduate Group, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Arthur W. Toga
- Laboratory of Neuro Imaging, USC Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, Los Angeles, California, USA
- Joshua I. Gold
- Department of Neuroscience, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Dani S. Bassett
- Department of Bioengineering, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Neurology, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Psychiatry, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Physics and Astronomy, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Santa Fe Institute, Santa Fe, New Mexico, USA
- Catherine A. Hartley
- Department of Psychology, New York University, New York, New York, USA
- Center for Neural Science, New York University, New York, New York, USA
- Daphna Shohamy
- Department of Psychology, Columbia University, New York, New York, USA
- Zuckerman Mind Brain Behavior Institute, Columbia University, New York, New York, USA
- Kavli Institute for Brain Science, Columbia University, New York, New York, USA
- Scott T. Grafton
- Department of Psychological and Brain Sciences, University of California, Santa Barbara, California, USA
- John P. O'Doherty
- Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, California, USA
- Computation and Neural Systems Program, California Institute of Technology, Pasadena, California, USA
5. Barakchian Z, Vahabie AH, Nili Ahmadabadi M. Implicit Counterfactual Effect in Partial Feedback Reinforcement Learning: Behavioral and Modeling Approach. Front Neurosci 2022; 16:631347. PMID: 35620668; PMCID: PMC9127865; DOI: 10.3389/fnins.2022.631347.
Abstract
Context remarkably affects learning behavior by adjusting option values according to the distribution of available options. Displaying counterfactual outcomes, i.e., the outcomes of the unchosen option alongside the chosen one (complete feedback), would increase the contextual effect by inducing participants to compare the two outcomes during learning. However, when the context consists only of the juxtaposition of several options and there is no such explicit counterfactual factor (i.e., only partial feedback is provided), it is not clear whether and how the contextual effect emerges. In this research, we employ partial- and complete-feedback paradigms in which options are associated with different reward distributions. Our modeling analysis shows that a model that uses the outcome of the chosen option to update the values of both the chosen and unchosen options in opposing directions can better account for the behavioral data. This is also in line with the diffusive effect of dopamine on the striatum. Furthermore, our data show that the contextual effect is not limited to probabilistic rewards but also extends to magnitude rewards. These results suggest that by extending the counterfactual concept to include the effect of the chosen outcome on the unchosen option, we can better explain why a contextual effect arises in situations with no extra information about the unchosen outcome.
Affiliation(s)
- Zahra Barakchian
- Department of Cognitive Neuroscience, Institute for Research in Fundamental Sciences, Tehran, Iran
- Correspondence: Zahra Barakchian
- Abdol-Hossein Vahabie
- Cognitive Systems Laboratory, Control and Intelligent Processing Center of Excellence, School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
- Department of Psychology, Faculty of Psychology and Education, University of Tehran, Tehran, Iran
- Majid Nili Ahmadabadi
- Cognitive Systems Laboratory, Control and Intelligent Processing Center of Excellence, School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
6. Phasic Dopamine Changes and Hebbian Mechanisms during Probabilistic Reversal Learning in Striatal Circuits: A Computational Study. Int J Mol Sci 2022; 23:3452. PMID: 35408811; PMCID: PMC8998230; DOI: 10.3390/ijms23073452.
Abstract
Cognitive flexibility is essential for modifying our behavior in a non-stationary environment and is often explored with reversal-learning tasks. The basal ganglia (BG) dopaminergic system, under top-down control of the prefrontal cortex, is known to be involved in flexible action selection through reinforcement learning. However, how adaptive dopamine changes regulate this process, and which learning mechanisms train the striatal synapses, remain open questions. The current study uses a neurocomputational model of the BG, based on dopamine-dependent direct (Go) and indirect (NoGo) pathways, to investigate reinforcement learning in a probabilistic environment through a task that associates different stimuli with different actions. Here, we investigated the efficacy of several versions of the Hebb rule, based on covariance between pre- and postsynaptic neurons, as well as the control of phasic dopamine changes required to achieve proper reversal learning. Furthermore, an original mechanism for modulating the phasic dopamine changes is proposed, assuming that the expected reward probability is coded by the activity of the winning Go neuron before a reward/punishment takes place. Simulations show that this original formulation for automatic phasic dopamine control allows good flexible reversal even in difficult conditions. The current outcomes may contribute to understanding the mechanisms for active control of dopamine changes during flexible behavior. In perspective, it may be applied to neuropsychiatric or neurological disorders, such as Parkinson's disease or schizophrenia, in which reinforcement learning is impaired.
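A covariance-style Hebb rule gated by phasic dopamine can be sketched generically as below; this illustrates the rule family examined, not the study's specific network, and the function name, learning rate, and activity vectors are all assumptions for the example.

```python
import numpy as np

def covariance_hebb(w, pre, post, dopamine, lr=0.01):
    """Covariance-style Hebbian update gated by the phasic dopamine
    change: weights grow for pre/post pairs whose activities deviate
    from their mean in the same direction when dopamine rises, and
    weaken for the same pairs when dopamine falls (punishment)."""
    dpre = pre - pre.mean()
    dpost = post - post.mean()
    return w + lr * dopamine * np.outer(dpost, dpre)

# One winner among two striatal units, rewarded (positive dopamine change):
w = covariance_hebb(np.zeros((2, 2)),
                    pre=np.array([1.0, 0.0]),
                    post=np.array([1.0, 0.0]),
                    dopamine=1.0)
```

Using deviations from mean activity (rather than raw activities) lets the same rule both potentiate the winning stimulus-action association and depress competing ones, which is what supports reversal when the dopamine sign flips.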
7. Morita K, Kato A. Dopamine ramps for accurate value learning under uncertainty. Trends Neurosci 2022; 45:254-256. PMID: 35181147; DOI: 10.1016/j.tins.2022.01.008.
Abstract
Dopamine signals ramping toward reward timings have become widely reported, but their functions remain elusive. Through modeling analyses and experiments in mice, a recent study by Mikhael, Kim et al. shows that such signals represent reward prediction errors used for accurate value learning under uncertainty about the upcoming state and its resolution by sensory feedback.
Affiliation(s)
- Kenji Morita
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo, Japan
- International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo, Tokyo, Japan
- Ayaka Kato
- Laboratory for Circuit Mechanisms of Sensory Perception, RIKEN Center for Brain Science, Wako, Japan
- Department of Life Sciences, Graduate School of Arts and Sciences, The University of Tokyo, Tokyo, Japan
8. Dopamine firing plays a dual role in coding reward prediction errors and signaling motivation in a working memory task. Proc Natl Acad Sci U S A 2022; 119:2113311119. PMID: 34992139; PMCID: PMC8764687; DOI: 10.1073/pnas.2113311119.
Abstract
Little is known about how dopamine (DA) neuron firing rates behave in cognitively demanding decision-making tasks. Here, we investigated midbrain DA activity in monkeys performing a discrimination task in which the animal had to use working memory (WM) to report which of two sequentially applied vibrotactile stimuli had the higher frequency. We found that perception was altered by an internal bias, likely generated by deterioration of the representation of the first frequency during the WM period. This bias greatly controlled the DA phasic response during the two stimulation periods, confirming that DA reward prediction errors reflected stimulus perception. In contrast, tonic dopamine activity during WM was not affected by the bias and did not encode the stored frequency. More interestingly, both delay-period activity and phasic responses before the second stimulus negatively correlated with reaction times of the animals after the trial start cue and thus represented motivated behavior on a trial-by-trial basis. During WM, this motivation signal underwent a ramp-like increase. At the same time, motivation positively correlated with accuracy, especially in difficult trials, probably by decreasing the effect of the bias. Overall, our results indicate that DA activity, in addition to encoding reward prediction errors, could at the same time be involved in motivation and WM. In particular, the ramping activity during the delay period suggests a possible DA role in stabilizing sustained cortical activity, hypothetically by increasing the gain communicated to prefrontal neurons in a motivation-dependent way.
9. Feng Z, Nagase AM, Morita K. A Reinforcement Learning Approach to Understanding Procrastination: Does Inaccurate Value Approximation Cause Irrational Postponing of a Task? Front Neurosci 2021; 15:660595. PMID: 34602962; PMCID: PMC8481628; DOI: 10.3389/fnins.2021.660595.
Abstract
Procrastination is the voluntary but irrational postponing of a task despite being aware that the delay can lead to worse consequences. It has been extensively studied in psychology, from contributing factors to theoretical models. From a value-based decision-making and reinforcement learning (RL) perspective, procrastination has been suggested to be caused by non-optimal choice resulting from cognitive limitations. Exactly what sort of cognitive limitations are involved, however, remains elusive. In the current study, we examined whether a particular type of cognitive limitation, namely inaccurate valuation resulting from inadequate state representation, would cause procrastination. Recent work has suggested that humans may adopt a particular type of state representation called the successor representation (SR) and that humans can learn to represent states by relatively low-dimensional features. Combining these suggestions, we assumed a dimension-reduced version of the SR. We modeled a series of behaviors of a "student" doing assignments during the school term, when putting off the assignments (i.e., procrastination) is not allowed, and during the vacation, when whether to procrastinate can be freely chosen. We assumed that the "student" had acquired a rigid reduced SR of each state, corresponding to each step in completing an assignment, under the policy without procrastination. The "student" learned the approximated value of each state, computed as a linear function of the features of the states in the rigid reduced SR, through temporal-difference (TD) learning. During the vacation, the "student" decided at each time step whether to procrastinate based on these approximated values. Simulation results showed that the reduced-SR-based RL model generated procrastination behavior, which worsened across episodes. According to the values approximated by the "student," procrastinating was the better choice, whereas not procrastinating was mostly better according to the true values. Thus, the current model generated procrastination behavior caused by inaccurate value approximation, which resulted from the adoption of the reduced SR as the state representation. These findings indicate that the reduced SR, or more generally dimension reduction in state representation, can be a form of cognitive limitation that leads to procrastination.
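The TD learning over fixed low-dimensional features described above can be sketched as a generic linear-approximation step (this stands in for the rigid reduced SR; it is not the study's full "student" model, and the feature vectors and rates are illustrative):

```python
import numpy as np

def td_step_linear(w, phi_s, phi_next, reward, gamma=0.95, alpha=0.1):
    """One temporal-difference step with linear value approximation
    over fixed state features: V(s) = w . phi(s), and the TD error
    updates w along the current feature vector. When features of
    distinct states overlap, their learned values become coupled,
    which is how systematic approximation errors can arise."""
    delta = reward + gamma * (w @ phi_next) - (w @ phi_s)
    w = w + alpha * delta * phi_s
    return w, delta

w = np.zeros(2)
w, delta = td_step_linear(w, np.array([1.0, 0.0]), np.array([0.0, 1.0]), 1.0)
```

With rigid (non-updatable) features, the learned w can make a postponed path look better than it truly is, which is the mechanism for procrastination proposed above.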
Affiliation(s)
- Zheyu Feng
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo, Japan
- Asako Mitsuto Nagase
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo, Japan
- Division of Neurology, Department of Brain and Neurosciences, Faculty of Medicine, Tottori University, Yonago, Japan
- Research Fellowship for Young Scientists, Japan Society for the Promotion of Science, Tokyo, Japan
- Department of Neurology, Faculty of Medicine, Shimane University, Izumo, Japan
- Kenji Morita
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo, Japan
- International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo, Tokyo, Japan
10. Suzuki S, Yamashita Y, Katahira K. Psychiatric symptoms influence reward-seeking and loss-avoidance decision-making through common and distinct computational processes. Psychiatry Clin Neurosci 2021; 75:277-285. PMID: 34151477; PMCID: PMC8457174; DOI: 10.1111/pcn.13279.
Abstract
AIM: Psychiatric symptoms are often accompanied by impairments in decision-making to attain rewards and avoid losses. However, due to the complex nature of mental disorders (e.g., high comorbidity), the symptoms that are specifically associated with deficits in decision-making remain unidentified. Furthermore, the influence of psychiatric symptoms on the computations underpinning reward-seeking and loss-avoidance decision-making remains elusive. Here, we aimed to address these issues by leveraging a large-scale online experiment and computational modeling.
METHODS: In the online experiment, we recruited 1,900 undiagnosed participants from the general population. They performed either a reward-seeking or a loss-avoidance decision-making task and subsequently completed questionnaires about psychiatric symptoms.
RESULTS: We found that one trans-diagnostic dimension of psychiatric symptoms related to compulsive behavior and intrusive thought (CIT) was negatively correlated with overall decision-making performance in both the reward-seeking and loss-avoidance tasks. A deeper analysis further revealed that, in both tasks, the CIT dimension was associated with lower preference for the options that recently led to better outcomes (i.e., reward or no loss). In the reward-seeking task only, the CIT dimension was also associated with lower preference for recently unchosen options.
CONCLUSION: These findings suggest that psychiatric symptoms influence the two types of decision-making, reward-seeking and loss-avoidance, through both common and distinct computational processes.
Affiliation(s)
- Shinsuke Suzuki
- Brain, Mind and Markets Laboratory, Department of Finance, Faculty of Business and Economics, The University of Melbourne, Melbourne, Victoria, Australia
- Frontier Research Institute for Interdisciplinary Sciences, Tohoku University, Sendai, Japan
- Yuichi Yamashita
- Department of Information Medicine, National Institute of Neuroscience, National Center of Neurology and Psychiatry, Tokyo, Japan
- Kentaro Katahira
- Department of Psychological and Cognitive Sciences, Graduate School of Informatics, Nagoya University, Nagoya, Japan
- Mental and Physical Functions Modeling Group, Human Informatics and Interaction Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Japan
11. Shimomura K, Kato A, Morita K. Rigid reduced successor representation as a potential mechanism for addiction. Eur J Neurosci 2021; 53:3768-3790. PMID: 33840120; PMCID: PMC8252639; DOI: 10.1111/ejn.15227.
Abstract
Difficulty in the cessation of drinking, smoking, or gambling has been widely recognized. Conventional theories proposed relative dominance of habitual over goal-directed control, but human studies have not convincingly supported them. Referring to the recently suggested "successor representation (SR)" of states, which enables partially goal-directed control, we propose a dopamine-related mechanism that makes resistance to habitual reward-obtaining particularly difficult. We considered that long-standing behavior towards a certain reward without resisting temptation can (but does not always) lead to the formation of a rigid dimension-reduced SR based on the goal state, which cannot be updated. In our model assuming such a rigid reduced SR, no reward prediction error (RPE) is generated at the goal while no resistance is made, but a sustained large positive RPE is generated upon goal-reaching once the person starts resisting temptation. Such a sustained RPE is somewhat similar to the hypothesized sustained fictitious RPE caused by drug-induced dopamine. In contrast, if a rigid reduced SR is not formed and states are represented individually, as in simple reinforcement learning models, no sustained RPE is generated at the goal. Formation of a rigid reduced SR also attenuates the resistance-dependent decrease in the value of the cue for behavior, makes subsequent introduction of punishment after the goal ineffective, and potentially enhances the propensity of non-resistance through the influence of RPEs via the spiral striatum-midbrain circuit. These results suggest that formation of a rigid reduced SR makes cessation of habitual reward-obtaining particularly difficult and can thus be a mechanism for addiction, common to substance and non-substance rewards.
Affiliation(s)
- Kanji Shimomura
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo, Japan
- Department of Behavioral Medicine, National Institute of Mental Health, National Center of Neurology and Psychiatry, Kodaira, Japan
- Ayaka Kato
- Department of Life Sciences, Graduate School of Arts and Sciences, The University of Tokyo, Tokyo, Japan
- Laboratory for Circuit Mechanisms of Sensory Perception, RIKEN Center for Brain Science, Wako, Japan
- Research Fellowship for Young Scientists, Japan Society for the Promotion of Science, Tokyo, Japan
- Kenji Morita
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo, Japan
- International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo, Tokyo, Japan
|
12
|
Revisiting the importance of model fitting for model-based fMRI: It does matter in computational psychiatry. PLoS Comput Biol 2021; 17:e1008738. [PMID: 33561125 PMCID: PMC7899379 DOI: 10.1371/journal.pcbi.1008738] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2020] [Revised: 02/22/2021] [Accepted: 01/25/2021] [Indexed: 11/19/2022] Open
Abstract
Computational modeling has been applied for data analysis in psychology, neuroscience, and psychiatry. One of its important uses is to infer the latent variables underlying behavior by which researchers can evaluate corresponding neural, physiological, or behavioral measures. This feature is especially crucial for computational psychiatry, in which altered computational processes underlying mental disorders are of interest. For instance, several studies employing model-based fMRI-a method for identifying brain regions correlated with latent variables-have shown that patients with mental disorders (e.g., depression) exhibit diminished neural responses to reward prediction errors (RPEs), which are the differences between experienced and predicted rewards. Such model-based analysis has the drawback that the parameter estimates and inference of latent variables are not necessarily correct-rather, they usually contain some errors. A previous study theoretically and empirically showed that the error in model-fitting does not necessarily cause a serious error in model-based fMRI. However, the study did not deal with certain situations relevant to psychiatry, such as group comparisons between patients and healthy controls. We developed a theoretical framework to explore such situations. We demonstrate that parameter misspecification can critically affect the results of group comparison. We demonstrate that even if the RPE response in patients is completely intact, a spurious difference from healthy controls is observable. Such a situation occurs when the ground-truth learning rate differs between groups but a common learning rate is used, as in previous studies. Furthermore, even if the parameters are appropriately fitted to individual participants, spurious group differences in RPE responses are observable when the model lacks a component that differs between groups.
These results highlight the importance of appropriate model-fitting and the need for caution when interpreting the results of model-based fMRI.
|
13
|
Wiencke K, Horstmann A, Mathar D, Villringer A, Neumann J. Dopamine release, diffusion and uptake: A computational model for synaptic and volume transmission. PLoS Comput Biol 2020; 16:e1008410. [PMID: 33253315 PMCID: PMC7728201 DOI: 10.1371/journal.pcbi.1008410] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2019] [Revised: 12/10/2020] [Accepted: 09/30/2020] [Indexed: 11/19/2022] Open
Abstract
Computational modeling of dopamine transmission is challenged by complex underlying mechanisms. Here we present a new computational model that (I) simultaneously regards release, diffusion and uptake of dopamine, (II) considers multiple terminal release events and (III) comprises both synaptic and volume transmission by incorporating the geometry of the synaptic cleft. We were able to validate our model in that it simulates concentration values comparable to physiological values observed in empirical studies. Further, although synaptic dopamine diffuses into extra-synaptic space, our model reflects a very localized signal occurring on the synaptic level, i.e. synaptic dopamine release is negligibly recognized by neighboring synapses. Moreover, increasing evidence suggests that cognitive performance can be predicted by signal variability of neuroimaging data (e.g. BOLD). Signal variability in target areas of dopaminergic neurons (striatum, cortex) may arise from dopamine concentration variability. On that account we compared spatio-temporal variability in a simulation mimicking normal dopamine transmission in striatum to scenarios of enhanced dopamine release and dopamine uptake inhibition. We found different variability characteristics between the three settings, which may in part account for differences in empirical observations. From a clinical perspective, differences in striatal dopaminergic signaling contribute to differential learning and reward processing, with relevant implications for addictive- and compulsive-like behavior. Specifically, dopaminergic tone is assumed to impact on phasic dopamine and hence on the integration of reward-related signals. However, in humans DA tone is classically assessed using PET, which is an indirect measure of endogenous DA availability and suffers from temporal and spatial resolution issues. 
We discuss how this can lead to discrepancies with observations from other methods such as microdialysis and show how computational modeling can help to refine our understanding of DA transmission.
Affiliation(s)
- Kathleen Wiencke
- IFB Adiposity Diseases, Leipzig University Medical Center, Germany
- Department of Neurology, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Annette Horstmann
- IFB Adiposity Diseases, Leipzig University Medical Center, Germany
- Department of Neurology, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Department of Psychology and Logopedics, Faculty of Medicine, University of Helsinki
- David Mathar
- Department of Psychology, Biological Psychology, University of Cologne, Cologne, Germany
- Arno Villringer
- IFB Adiposity Diseases, Leipzig University Medical Center, Germany
- Department of Neurology, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Clinic of Cognitive Neurology, University Hospital Leipzig, Germany
- Mind & Brain Institute, Berlin School of Mind and Brain, Humboldt-University, Berlin, Germany
- Jane Neumann
- IFB Adiposity Diseases, Leipzig University Medical Center, Germany
- Department of Neurology, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
- Department of Medical Engineering and Biotechnology, University of Applied Sciences, Jena, Germany
|
14
|
Tanimoto S, Kondo M, Morita K, Yoshida E, Matsuzaki M. Non-action Learning: Saving Action-Associated Cost Serves as a Covert Reward. Front Behav Neurosci 2020; 14:141. [PMID: 33100979 PMCID: PMC7498735 DOI: 10.3389/fnbeh.2020.00141] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2020] [Accepted: 07/22/2020] [Indexed: 01/20/2023] Open
Abstract
“To do or not to do” is a fundamental decision that has to be made in daily life. Behaviors related to multiple “to do” choice tasks have long been explained by reinforcement learning, and “to do or not to do” tasks such as the go/no-go task have also been recently discussed within the framework of reinforcement learning. In this learning framework, alternative actions and/or the non-action to take are determined by evaluating explicitly given (overt) reward and punishment. However, we assume that there are real life cases in which an action/non-action is repeated, even though there is no obvious reward or punishment, because implicitly given outcomes such as saving physical energy and regret (we refer to this as “covert reward”) can affect the decision-making. In the current task, mice chose to pull a lever or not according to two tone cues assigned with different water reward probabilities (70% and 30% in condition 1, and 30% and 10% in condition 2). As the mice learned, the probability that they would choose to pull the lever decreased (<0.25) in trials with a 30% reward probability cue (30% cue) in condition 1, and in trials with a 10% cue in condition 2, but increased (>0.8) in trials with a 70% cue in condition 1 and a 30% cue in condition 2, even though a non-pull was followed by neither an overt reward nor avoidance of overt punishment in any trial. This behavioral tendency was not well explained by a combination of commonly used Q-learning models, which take only the action choice with an overt reward outcome into account. Instead, we found that the non-action preference of the mice was best explained by Q-learning models, which regarded the non-action as the other choice, and updated non-action values with a covert reward. We propose that “doing nothing” can be actively chosen as an alternative to “doing something,” and that a covert reward could serve as a reinforcer of “doing nothing.”
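The best-fitting model class described above, with non-action treated as an explicit second choice reinforced by a covert reward, can be sketched as follows. All parameter values (covert reward, effort cost, learning rate, inverse temperature) are invented for illustration, not fitted to the mouse data:

```python
import numpy as np

# Hedged sketch of the model class the study favored: "no pull" is an
# explicit second action whose value is updated with a covert reward.
# All parameter values below are invented, not fitted to the mouse data.
rng = np.random.default_rng(1)
alpha, beta = 0.2, 5.0
covert_reward = 0.15               # assumed benefit of doing nothing
effort_cost = 0.1                  # assumed cost of pulling the lever
p_reward = {"70% cue": 0.7, "30% cue": 0.3}
Q = {cue: np.zeros(2) for cue in p_reward}   # [pull, no-pull]

for _ in range(3000):
    cue = "70% cue" if rng.random() < 0.5 else "30% cue"
    q = Q[cue]
    p_pull = 1.0 / (1.0 + np.exp(-beta * (q[0] - q[1])))
    if rng.random() < p_pull:      # pull: overt reward minus effort
        outcome = float(rng.random() < p_reward[cue]) - effort_cost
        q[0] += alpha * (outcome - q[0])
    else:                          # no pull: covert reward only
        q[1] += alpha * (covert_reward - q[1])

p_final = {cue: 1.0 / (1.0 + np.exp(-beta * (Q[cue][0] - Q[cue][1])))
           for cue in Q}
print({cue: round(p, 2) for cue, p in p_final.items()})
```

The key design choice mirrors the abstract: the non-action value is updated on no-pull trials even though no overt outcome occurs, so pull probability can fall below chance for low-value cues.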
Affiliation(s)
- Sai Tanimoto
- Department of Physiology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Masashi Kondo
- Department of Physiology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Kenji Morita
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo, Japan
- International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo Institutes for Advanced Study, Tokyo, Japan
- Eriko Yoshida
- Department of Physiology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Masanori Matsuzaki
- Department of Physiology, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo Institutes for Advanced Study, Tokyo, Japan
- Brain Functional Dynamics Collaboration Laboratory, RIKEN Center for Brain Science, Saitama, Japan
|
15
|
Abstract
This paper describes a framework for modelling dopamine function in the mammalian brain. It proposes that both learning and action planning involve processes minimizing prediction errors encoded by dopaminergic neurons. In this framework, dopaminergic neurons projecting to different parts of the striatum encode errors in predictions made by the corresponding systems within the basal ganglia. The dopaminergic neurons encode differences between rewards and expectations in the goal-directed system, and differences between the chosen and habitual actions in the habit system. These prediction errors trigger learning about rewards and habit formation, respectively. Additionally, dopaminergic neurons in the goal-directed system play a key role in action planning: They compute the difference between a desired reward and the reward expected from the current motor plan, and they facilitate action planning until this difference diminishes. Presented models account for dopaminergic responses during movements, effects of dopamine depletion on behaviour, and make several experimental predictions. In the brain, chemicals such as dopamine allow nerve cells to ‘talk’ to each other and to relay information from and to the environment. Dopamine, in particular, is released when pleasant surprises are experienced: this helps the organism to learn about the consequences of certain actions. If a new flavour of ice-cream tastes better than expected, for example, the release of dopamine tells the brain that this flavour is worth choosing again. However, dopamine has an additional role in controlling movement. When the cells that produce dopamine die, for instance in Parkinson’s disease, individuals may find it difficult to initiate deliberate movements. Here, Rafal Bogacz aimed to develop a comprehensive framework that could reconcile the two seemingly unrelated roles played by dopamine. 
The new theory proposes that dopamine is released when an outcome differs from expectations, which helps the organism to adjust and minimise these differences. In the ice-cream example, the difference is between how good the treat is expected to taste, and how tasty it really is. By learning to select the same flavour repeatedly, the brain aligns expectation and the result of the choice. This ability would also apply when movements are planned. In this case, the brain compares the desired reward with the predicted results of the planned actions. For example, while planning to get a spoonful of ice-cream, the brain compares the pleasure expected from the movement that is currently planned, and the pleasure of eating a full spoon of the treat. If the two differ, for example because no movement has been planned yet, the brain releases dopamine to form a better version of the action plan. The theory was then tested using a computer simulation of nerve cells that release dopamine; this showed that the behaviour of the virtual cells closely matched that of their real-life counterparts. This work offers a comprehensive description of the fundamental role of dopamine in the brain. The model now needs to be verified through experiments on living nerve cells; ultimately, it could help doctors and researchers to develop better treatments for conditions such as Parkinson’s disease or ADHD, which are linked to a lack of dopamine.
Affiliation(s)
- Rafal Bogacz
- MRC Brain Networks Dynamics Unit, University of Oxford, Oxford, United Kingdom
|
16
|
Song MR, Lee SW. Dynamic resource allocation during reinforcement learning accounts for ramping and phasic dopamine activity. Neural Netw 2020; 126:95-107. [PMID: 32203877 DOI: 10.1016/j.neunet.2020.03.005] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2019] [Revised: 01/22/2020] [Accepted: 03/02/2020] [Indexed: 11/29/2022]
Abstract
For an animal to learn about its environment with limited motor and cognitive resources, it should focus its resources on potentially important stimuli. However, too narrow focus is disadvantageous for adaptation to environmental changes. Midbrain dopamine neurons are excited by potentially important stimuli, such as reward-predicting or novel stimuli, and allocate resources to these stimuli by modulating how an animal approaches, exploits, explores, and attends. The current study examined the theoretical possibility that dopamine activity reflects the dynamic allocation of resources for learning. Dopamine activity may transition between two patterns: (1) phasic responses to cues and rewards, and (2) ramping activity arising as the agent approaches the reward. Phasic excitation has been explained by prediction errors generated by experimentally inserted cues. However, when and why dopamine activity transitions between the two patterns remain unknown. By parsimoniously modifying a standard temporal difference (TD) learning model to accommodate a mixed presentation of both experimental and environmental stimuli, we simulated dopamine transitions and compared them with experimental data from four different studies. The results suggested that dopamine transitions from ramping to phasic patterns as the agent focuses its resources on a small number of reward-predicting stimuli, thus leading to task dimensionality reduction. The opposite occurs when the agent re-distributes its resources to adapt to environmental changes, resulting in task dimensionality expansion. This research elucidates the role of dopamine in a broader context, providing a potential explanation for the diverse repertoire of dopamine activity that cannot be explained solely by prediction error.
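The "standard temporal difference (TD) learning model" that the study parsimoniously modifies can be sketched in a few lines on a toy state chain (not the paper's resource-allocation model): over training, the prediction error at reward delivery shrinks toward zero as value predictions propagate back along the chain.

```python
import numpy as np

# Standard TD(0) learning on a toy 5-state chain (the baseline the study
# modifies, not its resource-allocation model): the prediction error at
# reward delivery shrinks toward zero as values propagate backward.
gamma, alpha = 0.9, 0.1
V = np.zeros(5)                    # values of states s0..s4
reward_rpes = []
for trial in range(200):
    for s in range(5):
        r = 1.0 if s == 4 else 0.0             # reward on the final step
        v_next = V[s + 1] if s < 4 else 0.0    # terminal value is 0
        delta = r + gamma * v_next - V[s]      # TD error
        if s == 4:
            reward_rpes.append(delta)
        V[s] += alpha * delta
print(round(reward_rpes[0], 2), round(reward_rpes[-1], 3), round(V[0], 3))
```

On the first trial the full error (1.0) occurs at the reward; after training it is near zero there, and the earliest state's value approaches γ⁴ ≈ 0.656.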
Affiliation(s)
- Minryung R Song
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, South Korea
- Sang Wan Lee
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, 34141, South Korea
- Program of Brain and Cognitive Engineering, Daejeon, 34141, South Korea
- KAIST Institute for Health, Science, and Technology, Daejeon, 34141, South Korea
- KAIST Institute for Artificial Intelligence, Daejeon, 34141, South Korea
- KAIST Center for Neuroscience-inspired AI, Daejeon, 34141, South Korea
|
17
|
Adams RA, Moutoussis M, Nour MM, Dahoun T, Lewis D, Illingworth B, Veronese M, Mathys C, de Boer L, Guitart-Masip M, Friston KJ, Howes OD, Roiser JP. Variability in Action Selection Relates to Striatal Dopamine 2/3 Receptor Availability in Humans: A PET Neuroimaging Study Using Reinforcement Learning and Active Inference Models. Cereb Cortex 2020; 30:3573-3589. [PMID: 32083297 PMCID: PMC7233027 DOI: 10.1093/cercor/bhz327] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2019] [Revised: 11/18/2019] [Accepted: 12/05/2019] [Indexed: 12/17/2022] Open
Abstract
Choosing actions that result in advantageous outcomes is a fundamental function of nervous systems. All computational decision-making models contain a mechanism that controls the variability of (or confidence in) action selection, but its neural implementation is unclear-especially in humans. We investigated this mechanism using two influential decision-making frameworks: active inference (AI) and reinforcement learning (RL). In AI, the precision (inverse variance) of beliefs about policies controls action selection variability-similar to decision 'noise' parameters in RL-and is thought to be encoded by striatal dopamine signaling. We tested this hypothesis by administering a 'go/no-go' task to 75 healthy participants, and measuring striatal dopamine 2/3 receptor (D2/3R) availability in a subset (n = 25) using [11C]-(+)-PHNO positron emission tomography. In behavioral model comparison, RL performed best across the whole group but AI performed best in participants performing above chance levels. Limbic striatal D2/3R availability had linear relationships with AI policy precision (P = 0.029) as well as with RL irreducible decision 'noise' (P = 0.020), and this relationship with D2/3R availability was confirmed with a 'decision stochasticity' factor that aggregated across both models (P = 0.0006). These findings are consistent with occupancy of inhibitory striatal D2/3Rs decreasing the variability of action selection in humans.
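The shared quantity at stake here, a precision or inverse-temperature parameter scaling action-selection variability, works the same way in both frameworks. A minimal sketch with illustrative values:

```python
import numpy as np

# Minimal sketch of the quantity both frameworks share: a precision /
# inverse-temperature parameter that scales action-selection variability.
# The action values q and the beta values are illustrative.
def softmax(q, beta):
    z = beta * (q - q.max())          # subtract max for numerical safety
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

q = np.array([0.6, 0.4])
low_precision = softmax(q, beta=1.0)    # noisy, near-random selection
high_precision = softmax(q, beta=20.0)  # precise, near-deterministic
print(entropy(low_precision), entropy(high_precision))
```

Higher precision concentrates choice probability on the best action and lowers choice entropy, which is the behavioral signature the study relates to striatal D2/3R availability.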
Affiliation(s)
- Rick A Adams
- Institute of Cognitive Neuroscience, University College London, London WC1N 3AZ, UK
- Division of Psychiatry, University College London, London W1T 7NF, UK
- Psychiatric Imaging Group, Robert Steiner MRI Unit, MRC London Institute of Medical Sciences, Hammersmith Hospital, London W12 0NN, UK
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital, London W12 0NN, UK
- Michael Moutoussis
- Wellcome Centre for Human Neuroimaging, University College London, London WC1N 3BG, UK
- Max Planck-UCL Centre for Computational Psychiatry and Ageing Research, London WC1B 5EH, UK
- Matthew M Nour
- Psychiatric Imaging Group, Robert Steiner MRI Unit, MRC London Institute of Medical Sciences, Hammersmith Hospital, London W12 0NN, UK
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital, London W12 0NN, UK
- Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience (IoPPN), King's College London, London SE5 8AF, UK
- Tarik Dahoun
- Psychiatric Imaging Group, Robert Steiner MRI Unit, MRC London Institute of Medical Sciences, Hammersmith Hospital, London W12 0NN, UK
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital, London W12 0NN, UK
- Department of Psychiatry, University of Oxford, Warneford Hospital, Oxford OX3 7JX, UK
- Declan Lewis
- Institute of Cognitive Neuroscience, University College London, London WC1N 3AZ, UK
- Benjamin Illingworth
- Institute of Cognitive Neuroscience, University College London, London WC1N 3AZ, UK
- Mattia Veronese
- Centre for Neuroimaging Sciences, Institute of Psychiatry, Psychology & Neuroscience (IoPPN), King's College London, London SE5 8AF, UK
- Christoph Mathys
- Max Planck-UCL Centre for Computational Psychiatry and Ageing Research, London WC1B 5EH, UK
- Scuola Internazionale Superiore di Studi Avanzati (SISSA), 34136 Trieste, Italy
- Translational Neuromodeling Unit (TNU), Institute for Biomedical Engineering, University of Zurich and ETH Zurich, 8032 Zurich, Switzerland
- Lieke de Boer
- Aging Research Center, Karolinska Institute, 171 65 Stockholm, Sweden
- Marc Guitart-Masip
- Max Planck-UCL Centre for Computational Psychiatry and Ageing Research, London WC1B 5EH, UK
- Aging Research Center, Karolinska Institute, 171 65 Stockholm, Sweden
- Karl J Friston
- Wellcome Centre for Human Neuroimaging, University College London, London WC1N 3BG, UK
- Oliver D Howes
- Psychiatric Imaging Group, Robert Steiner MRI Unit, MRC London Institute of Medical Sciences, Hammersmith Hospital, London W12 0NN, UK
- Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, Hammersmith Hospital, London W12 0NN, UK
- Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience (IoPPN), King's College London, London SE5 8AF, UK
- Jonathan P Roiser
- Institute of Cognitive Neuroscience, University College London, London WC1N 3AZ, UK
|
18
|
Joshi VV, Patel ND, Rehan MA, Kuppa A. Mysterious Mechanisms of Memory Formation: Are the Answers Hidden in Synapses? Cureus 2019; 11:e5795. [PMID: 31728242 PMCID: PMC6827877 DOI: 10.7759/cureus.5795] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2019] [Accepted: 09/28/2019] [Indexed: 12/18/2022] Open
Abstract
After decades of research on memory formation and retention, we are still searching for the definite concept and process behind neuroplasticity. This review article will address the relationship between synapses, memory formation, and memory retention and their genetic correlations. In the last six decades, there have been enormous improvements in the neurochemistry domain, especially in the area of neural plasticity. In the central nervous system, the complexity of the synapses between neurons allows communication among them. It is believed that each time certain types of sensory signals pass through sequences of synapses, these synapses can transmit the same signals more efficiently the following time. The concept of Hebb synapse has provided revolutionary thinking about the nature of neural mechanisms of learning and memory formation. To improve the local circuitry for memory formation and behavioral change and stabilization in the mammalian central nervous system, long-term potentiation and long-term depression are the crucial components of Hebbian plasticity. In this review, we will be discussing the role of glutamatergic synapses, engram cells, cytokines, neuropeptides, neurosteroids and many aspects, covering the synaptic basis of memory. Lastly, we have tried to cover the etiology of neurodegenerative disorders due to synaptic dysfunction. To enhance pharmacological interventions for neurodegenerative diseases, we need more research in this direction. With the help of technology, and a better understanding of the disease etiology, not only can we identify the missing pieces of synaptic functions, but we might also cure or even prevent serious neurodegenerative diseases like Alzheimer's disease (AD).
Affiliation(s)
- Viraj V Joshi
- Neuropsychiatry, California Institute of Behavioral Neurosciences and Psychology, Fairfield, USA
- Nishita D Patel
- Research, California Institute of Behavioral Neurosciences & Psychology, Fairfield, USA
- Muhammad Awais Rehan
- Miscellaneous, California Institute of Behavioral Neurosciences & Psychology, Fairfield, USA
- Annapurna Kuppa
- Internal Medicine and Gastroenterology, University of Michigan, Ann Arbor, USA
|
19
|
Jordan J, Weidel P, Morrison A. A Closed-Loop Toolchain for Neural Network Simulations of Learning Autonomous Agents. Front Comput Neurosci 2019; 13:46. [PMID: 31427939 PMCID: PMC6687756 DOI: 10.3389/fncom.2019.00046] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2019] [Accepted: 06/25/2019] [Indexed: 11/17/2022] Open
Abstract
Neural network simulation is an important tool for generating and evaluating hypotheses on the structure, dynamics, and function of neural circuits. For scientific questions addressing organisms operating autonomously in their environments, in particular where learning is involved, it is crucial to be able to operate such simulations in a closed-loop fashion. In such a set-up, the neural agent continuously receives sensory stimuli from the environment and provides motor signals that manipulate the environment or move the agent within it. So far, most studies requiring such functionality have been conducted with custom simulation scripts and manually implemented tasks. This makes it difficult for other researchers to reproduce and build upon previous work and nearly impossible to compare the performance of different learning architectures. In this work, we present a novel approach to solve this problem, connecting benchmark tools from the field of machine learning and state-of-the-art neural network simulators from computational neuroscience. The resulting toolchain enables researchers in both fields to make use of well-tested high-performance simulation software supporting biologically plausible neuron, synapse and network models and allows them to evaluate and compare their approach on the basis of standardized environments with various levels of complexity. We demonstrate the functionality of the toolchain by implementing a neuronal actor-critic architecture for reinforcement learning in the NEST simulator and successfully training it on two different environments from the OpenAI Gym. We compare its performance to a previously suggested neural network model of reinforcement learning in the basal ganglia and a generic Q-learning algorithm.
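The closed-loop pattern the toolchain standardizes, with the agent emitting actions and the environment returning observations and rewards, follows the Gym-style reset/step interface. A dependency-free sketch with a stand-in environment (the bandit and its parameters are invented, not part of the toolchain):

```python
import random

# Dependency-free sketch of the closed-loop agent-environment pattern,
# using a Gym-style reset/step interface. The bandit environment and its
# reward probabilities are invented stand-ins for OpenAI Gym tasks.
class TwoArmedBandit:
    def reset(self):
        return 0                       # a single dummy observation
    def step(self, action):
        p = 0.8 if action == 0 else 0.2
        reward = 1.0 if random.random() < p else 0.0
        return 0, reward, False, {}    # obs, reward, done, info

random.seed(0)
env = TwoArmedBandit()
q = [0.0, 0.0]                         # simple value learner as the agent
alpha, eps = 0.1, 0.1
obs = env.reset()
for _ in range(2000):
    # closed loop: the agent acts, the environment responds
    a = random.randrange(2) if random.random() < eps else q.index(max(q))
    obs, reward, done, info = env.step(a)
    q[a] += alpha * (reward - q[a])
print([round(v, 2) for v in q])
```

Swapping in a spiking actor-critic for the table `q`, or a real Gym task for the bandit, changes nothing about the loop itself, which is the point of standardizing the interface.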
Affiliation(s)
- Jakob Jordan
- Department of Physiology, University of Bern, Bern, Switzerland
- Institute of Neuroscience and Medicine (INM-6) & Institute for Advanced Simulation (IAS-6) & JARA-Institute Brain Structure Function Relationship (JBI 1/INM-10), Research Centre Jülich, Jülich, Germany
- Philipp Weidel
- Institute of Neuroscience and Medicine (INM-6) & Institute for Advanced Simulation (IAS-6) & JARA-Institute Brain Structure Function Relationship (JBI 1/INM-10), Research Centre Jülich, Jülich, Germany
- aiCTX, Zurich, Switzerland
- Department of Computer Science, RWTH Aachen University, Aachen, Germany
- Abigail Morrison
- Institute of Neuroscience and Medicine (INM-6) & Institute for Advanced Simulation (IAS-6) & JARA-Institute Brain Structure Function Relationship (JBI 1/INM-10), Research Centre Jülich, Jülich, Germany
- Faculty of Psychology, Institute of Cognitive Neuroscience, Ruhr-University Bochum, Bochum, Germany
|
20
|
Moens V, Zénon A. Learning and forgetting using reinforced Bayesian change detection. PLoS Comput Biol 2019; 15:e1006713. [PMID: 30995214 PMCID: PMC6488101 DOI: 10.1371/journal.pcbi.1006713] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2018] [Revised: 04/29/2019] [Accepted: 12/09/2018] [Indexed: 12/17/2022] Open
Abstract
Agents living in volatile environments must be able to detect changes in contingencies while refraining to adapt to unexpected events that are caused by noise. In Reinforcement Learning (RL) frameworks, this requires learning rates that adapt to past reliability of the model. The observation that behavioural flexibility in animals tends to decrease following prolonged training in stable environment provides experimental evidence for such adaptive learning rates. However, in classical RL models, learning rate is either fixed or scheduled and can thus not adapt dynamically to environmental changes. Here, we propose a new Bayesian learning model, using variational inference, that achieves adaptive change detection by the use of Stabilized Forgetting, updating its current belief based on a mixture of fixed, initial priors and previous posterior beliefs. The weight given to these two sources is optimized alongside the other parameters, allowing the model to adapt dynamically to changes in environmental volatility and to unexpected observations. This approach is used to implement the "critic" of an actor-critic RL model, while the actor samples the resulting value distributions to choose which action to undertake. We show that our model can emulate different adaptation strategies to contingency changes, depending on its prior assumptions of environmental stability, and that model parameters can be fit to real data with high accuracy. The model also exhibits trade-offs between flexibility and computational costs that mirror those observed in real data. Overall, the proposed method provides a general framework to study learning flexibility and decision making in RL contexts.
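The stabilized-forgetting idea, updating beliefs from a mixture of the fixed initial prior and the previous posterior, can be sketched with a conjugate Beta-Bernoulli approximation rather than the paper's full variational model (the mixture weight, reward rates, and change point below are invented):

```python
import random

# Hedged sketch of "stabilized forgetting" via a conjugate Beta-Bernoulli
# approximation (the paper uses a fuller variational model). Before each
# observation the pseudo-counts are shrunk toward the fixed initial prior,
# so old evidence decays and contingency changes can be tracked.
random.seed(2)
a0, b0 = 1.0, 1.0          # fixed initial Beta prior
w = 0.95                   # weight on the previous posterior
a, b = a0, b0
estimates = []
p_true = 0.8
for t in range(400):
    if t == 200:
        p_true = 0.2       # hidden contingency change
    a = w * a + (1 - w) * a0   # mix previous posterior with initial prior
    b = w * b + (1 - w) * b0
    x = 1.0 if random.random() < p_true else 0.0
    a, b = a + x, b + (1 - x)
    estimates.append(a / (a + b))
print(round(estimates[199], 2), round(estimates[-1], 2))
```

Because the pseudo-counts saturate at a finite effective sample size, the estimate tracks the pre-change rate (near 0.8) and then relaxes toward the post-change rate (near 0.2) instead of averaging over the whole history.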
Affiliation(s)
- Vincent Moens
- CoAction Lab, Institute of Neuroscience, Université Catholique de Louvain, Bruxelles, Belgium
- Alexandre Zénon
- CoAction Lab, Institute of Neuroscience, Université Catholique de Louvain, Bruxelles, Belgium
- INCIA, Université de Bordeaux, Bordeaux, France
|
21
|
Möller M, Bogacz R. Learning the payoffs and costs of actions. PLoS Comput Biol 2019; 15:e1006285. [PMID: 30818357 PMCID: PMC6413954 DOI: 10.1371/journal.pcbi.1006285] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2018] [Revised: 03/12/2019] [Accepted: 01/15/2019] [Indexed: 11/19/2022] Open
Abstract
A set of sub-cortical nuclei called basal ganglia is critical for learning the values of actions. The basal ganglia include two pathways, which have been associated with approach and avoid behavior respectively and are differentially modulated by dopamine projections from the midbrain. Inspired by the influential opponent actor learning model, we demonstrate that, under certain circumstances, these pathways may represent learned estimates of the positive and negative consequences (payoffs and costs) of individual actions. In the model, the level of dopamine activity encodes the motivational state and controls to what extent payoffs and costs enter the overall evaluation of actions. We show that a set of previously proposed plasticity rules is suitable to extract payoffs and costs from a prediction error signal if they occur at different moments in time. For those plasticity rules, successful learning requires differential effects of positive and negative outcome prediction errors on the two pathways and a weak decay of synaptic weights over trials. We also confirm through simulations that the model reproduces drug-induced changes of willingness to work, as observed in classical experiments with the D2-antagonist haloperidol. The basal ganglia are structures underneath the surface of the vertebrate brain, associated with error-driven learning. Much is known about the anatomical and biological features of the basal ganglia; scientists now try to understand the algorithms implemented by these structures. Numerous models aspire to capture the learning functionality, but many of them only cover some specific aspect of the algorithm. Instead of further adding to that pool of partial models, we unify two existing ones—one which captures what the basal ganglia learn, and one that describes the learning mechanism itself. The first model suggests that the basal ganglia weigh positive against negative consequences of actions according to the motivational state. 
It hints at how payoff and cost might be represented, but does not explain how those representations arise. The other model consists of biologically plausible plasticity rules, which describe how learning takes place, but not how the brain makes use of what is learned. We show that the two theories are compatible. Together, they form a model of learning and decision making that integrates the motivational state as well as the learned payoffs and costs of opportunities.
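As a rough illustration of the scheme this abstract describes (two pathway weights tracking payoffs and costs, asymmetric updates from positive and negative prediction errors, weak decay of synaptic weights, and motivation-weighted evaluation), here is a minimal Python sketch. The zero-reward baseline, scalar update rules, and parameter values are illustrative simplifications, not the model's actual plasticity rules:

```python
def simulate_payoff_cost_learning(payoff=2.0, cost=1.0, n_trials=2000,
                                  alpha=0.1, decay=0.01):
    """Two-pathway learning: G tracks payoffs, N tracks costs.

    Positive prediction errors increment the payoff weight G and negative
    prediction errors increment the cost weight N; both weights also decay
    weakly each step, as the abstract requires for successful learning.
    """
    G, N = 0.0, 0.0
    for _ in range(n_trials):
        # the cost is incurred first and the payoff arrives later, so the
        # two consequences occur at different moments in time
        for outcome in (-cost, payoff):
            delta = outcome  # prediction error against a zero baseline
            if delta > 0:
                G += alpha * delta      # "direct" pathway learns payoffs
            else:
                N += alpha * (-delta)   # "indirect" pathway learns costs
            G *= 1.0 - decay            # weak synaptic decay
            N *= 1.0 - decay
    return G, N

def evaluate(G, N, dopamine):
    """Dopamine level (motivational state) weighs payoff against cost."""
    return dopamine * G - (1.0 - dopamine) * N
```

With a high dopamine level the payoff term dominates and the action is evaluated positively; with a low level the cost term dominates, in keeping with the drug-induced changes of willingness to work that the model reproduces.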
Affiliation(s)
- Moritz Möller
- MRC Brain Network Dynamics Unit, Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, United Kingdom
- Rafal Bogacz
- MRC Brain Network Dynamics Unit, Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford, United Kingdom
22
Morita K, Kawaguchi Y. A Dual Role Hypothesis of the Cortico-Basal-Ganglia Pathways: Opponency and Temporal Difference Through Dopamine and Adenosine. Front Neural Circuits 2019; 12:111. [PMID: 30687019 PMCID: PMC6338031 DOI: 10.3389/fncir.2018.00111] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2018] [Accepted: 11/29/2018] [Indexed: 01/07/2023] Open
Abstract
The hypothesis that the basal-ganglia direct and indirect pathways represent goodness (or benefit) and badness (or cost) of options, respectively, explains a wide range of phenomena. However, this hypothesis, named the Opponent Actor Learning (OpAL), still has limitations. Structurally, the OpAL model does not incorporate differentiation of the two types of cortical inputs to the basal-ganglia pathways received from intratelencephalic (IT) and pyramidal-tract (PT) neurons. Functionally, the OpAL model does not describe the temporal-difference (TD)-type reward-prediction-error (RPE), nor explains how RPE is calculated in the circuitry connecting to the DA neurons. In fact, there is a different hypothesis on the basal-ganglia pathways and DA, named the Cortico-Striatal-Temporal-Difference (CS-TD) model. The CS-TD model differentiates the IT and PT inputs, describes the TD-type RPE, and explains how TD-RPE is calculated. However, a critical difficulty in this model lies in its assumption that DA induces the same direction of plasticity in both direct and indirect pathways, which apparently contradicts the experimentally observed opposite effects of DA on these pathways. Here, we propose a new hypothesis that integrates the OpAL and CS-TD models. Specifically, we propose that the IT-basal-ganglia pathways represent goodness/badness of current options while the PT-indirect pathway represents the overall value of the previously chosen option, and both of these have influence on the DA neurons, through the basal-ganglia output, so that a variant of TD-RPE is calculated. A key assumption is that opposite directions of plasticity are induced upon phasic activation of DA neurons in the IT-indirect pathway and PT-indirect pathway because of different profiles of IT and PT inputs. 
Specifically, at PT→indirect-pathway-medium-spiny-neuron (iMSN) synapses, sustained glutamatergic inputs generate rich adenosine, which allosterically prevents DA-D2 receptor signaling and instead favors adenosine-A2A receptor signaling. Then, phasic DA-induced phasic adenosine, which reflects TD-RPE, causes long-term synaptic potentiation. In contrast, at IT→iMSN synapses where adenosine is scarce, phasic DA causes long-term synaptic depression via D2 receptor signaling. This new Opponency and Temporal-Difference (OTD) model provides unique predictions, part of which is potentially in line with recently reported activity patterns of neurons in the globus pallidus externus on the indirect pathway.
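The qualitative sign structure of this hypothesized plasticity rule can be written down directly. The toy function below encodes only the directions stated above; the sign reversal for dopamine dips is our added assumption for symmetry, and this is not a biophysical model:

```python
def imsn_plasticity_direction(input_type, phasic_da):
    """Plasticity direction at indirect-pathway (iMSN) synapses under the
    OTD hypothesis: positive phasic dopamine potentiates adenosine-rich
    PT synapses (A2A signaling) but depresses adenosine-scarce IT
    synapses (D2 signaling).

    Returns +1 for potentiation, -1 for depression, 0 for no change.
    The reversal for negative phasic dopamine is an added assumption.
    """
    if phasic_da == 0:
        return 0
    sign = 1 if phasic_da > 0 else -1
    if input_type == "PT":
        return sign    # adenosine-mediated LTP with positive TD-RPE
    if input_type == "IT":
        return -sign   # D2-mediated LTD with phasic dopamine
    raise ValueError("input_type must be 'PT' or 'IT'")
```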
Affiliation(s)
- Kenji Morita
- Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo, Japan; International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo Institutes for Advanced Study, Tokyo, Japan
- Yasuo Kawaguchi
- Division of Cerebral Circuitry, National Institute for Physiological Sciences, Okazaki, Japan; Department of Physiological Sciences, Graduate University for Advanced Studies, Okazaki, Japan
23
Hallquist MN, Dombrovski AY. Selective maintenance of value information helps resolve the exploration/exploitation dilemma. Cognition 2018; 183:226-243. [PMID: 30502584 DOI: 10.1016/j.cognition.2018.11.004] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2017] [Revised: 11/06/2018] [Accepted: 11/08/2018] [Indexed: 10/27/2022]
Abstract
In natural environments with many options of uncertain value, one faces a difficult tradeoff between exploiting familiar, valuable options or searching for better alternatives. Reinforcement learning models of this exploration/exploitation dilemma typically modulate the rate of exploratory choices or preferentially sample uncertain options. The extent to which such models capture human behavior remains unclear, in part because they do not consider the constraints on remembering what is learned. Using reinforcement-based timing as a motivating example, we show that selectively maintaining high-value actions compresses the amount of information to be tracked in learning, as quantified by Shannon's entropy. In turn, the information content of the value representation controls the balance between exploration (high entropy) and exploitation (low entropy). Selectively maintaining preferred action values while allowing others to decay renders the choices increasingly exploitative across learning episodes. To adjudicate among alternative maintenance and sampling strategies, we developed a new reinforcement learning model, StrategiC ExPloration/ExPloitation of Temporal Instrumental Contingencies (SCEPTIC). In computational studies, a resource-rational selective maintenance approach was as successful as more resource-intensive strategies. Furthermore, human behavior was consistent with selective maintenance; information compression was most pronounced in subjects with superior performance and non-verbal intelligence, and in learnable vs. unlearnable contingencies. Cognitively demanding uncertainty-directed exploration recovered a more accurate representation in simulations with no foraging advantage and was strongly unsupported in our human study.
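The core idea (letting unchosen action values decay compresses the value representation, so the entropy of the policy falls and behavior shifts toward exploitation) can be sketched in a few lines. This is a simplified discrete-option illustration with made-up parameters, not the radial-basis-function SCEPTIC model itself:

```python
import math
import random

def softmax(values, beta):
    """Softmax policy with inverse temperature beta."""
    m = max(values)
    exps = [math.exp(beta * (v - m)) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def selective_maintenance(true_rewards, n_trials=1000, alpha=0.2,
                          beta=3.0, gamma=0.05, seed=1):
    """Delta-rule learning in which unchosen values decay toward zero.

    gamma is the selective-decay rate; as the preferred value comes to
    dominate, the entropy of the softmax policy falls and choices become
    increasingly exploitative. All parameters are illustrative.
    """
    rng = random.Random(seed)
    v = [0.0] * len(true_rewards)
    for _ in range(n_trials):
        p = softmax(v, beta)
        c = rng.choices(range(len(v)), weights=p)[0]
        r = true_rewards[c] + rng.gauss(0.0, 0.1)
        v[c] += alpha * (r - v[c])       # update the chosen value
        for i in range(len(v)):
            if i != c:
                v[i] *= 1.0 - gamma      # decay unchosen values
    return v, entropy(softmax(v, beta))
```

After learning, the final policy entropy sits well below the uniform maximum of log(n options), which is the information-compression signature the study measures.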
Affiliation(s)
- Michael N Hallquist
- Penn State University, Department of Psychology, 309 Moore Building, University Park, PA 16801, USA; University of Pittsburgh, Department of Psychiatry, 3811 O'Hara St., BT 742, Pittsburgh, PA 15213, USA
- Alexandre Y Dombrovski
- University of Pittsburgh, Department of Psychiatry, 3811 O'Hara St., BT 742, Pittsburgh, PA 15213, USA
24
A Neural Circuit Mechanism for the Involvements of Dopamine in Effort-Related Choices: Decay of Learned Values, Secondary Effects of Depletion, and Calculation of Temporal Difference Error. eNeuro 2018; 5:eN-NWR-0021-18. [PMID: 29468191 PMCID: PMC5820541 DOI: 10.1523/eneuro.0021-18.2018] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2018] [Accepted: 01/11/2018] [Indexed: 12/17/2022] Open
Abstract
Dopamine has been suggested to be crucially involved in effort-related choices. Key findings are that dopamine depletion (i) changed preference for a high-cost, large-reward option to a low-cost, small-reward option, (ii) but not when the large-reward option was also low-cost or the small-reward option gave no reward, (iii) while increasing the latency in all the cases but only transiently, and (iv) that antagonism of either dopamine D1 or D2 receptors also specifically impaired selection of the high-cost, large-reward option. The underlying neural circuit mechanisms remain unclear. Here we show that findings i–iii can be explained by the dopaminergic representation of temporal-difference reward-prediction error (TD-RPE), whose mechanisms have now become clarified, if (1) the synaptic strengths storing the values of actions mildly decay in time and (2) the obtained-reward-representing excitatory input to dopamine neurons increases after dopamine depletion. The former is potentially caused by background neural activity–induced weak synaptic plasticity, and the latter is assumed to occur through post-depletion increase of neural activity in the pedunculopontine nucleus, where neurons representing obtained reward exist and presumably send excitatory projections to dopamine neurons. We further show that finding iv, which is nontrivial given the suggested distinct functions of the D1 and D2 corticostriatal pathways, can also be explained if we additionally assume a proposed mechanism of TD-RPE calculation, in which the D1 and D2 pathways encode the values of actions with a temporal difference. These results suggest a possible circuit mechanism for the involvements of dopamine in effort-related choices and, simultaneously, provide implications for the mechanisms of TD-RPE calculation.
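Assumption (1) above, a mild decay of stored action values alongside ordinary prediction-error learning, amounts to a one-line modification of a standard delta-rule update. This is a generic sketch with illustrative parameters and labels, not the paper's circuit model:

```python
def q_update_with_decay(q, chosen, reward, alpha=0.1, decay=0.02):
    """One trial of action-value learning with weak decay of all values.

    Every stored value relaxes toward zero each trial (the hypothesized
    background-activity-induced weak plasticity), so values must be
    continually re-earned; only the chosen action gets a delta-rule update.
    """
    q = {a: (1.0 - decay) * v for a, v in q.items()}  # weak forgetting
    q[chosen] += alpha * (reward - q[chosen])         # delta rule
    return q
```

Because of the decay term, learned values settle below the true reward and fall whenever an action stops being chosen, which is what lets the model capture transient post-depletion changes in choice and latency.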
25
Colas JT, Pauli WM, Larsen T, Tyszka JM, O’Doherty JP. Distinct prediction errors in mesostriatal circuits of the human brain mediate learning about the values of both states and actions: evidence from high-resolution fMRI. PLoS Comput Biol 2017; 13:e1005810. [PMID: 29049406 PMCID: PMC5673235 DOI: 10.1371/journal.pcbi.1005810] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2017] [Revised: 11/06/2017] [Accepted: 10/09/2017] [Indexed: 11/19/2022] Open
Abstract
Prediction-error signals consistent with formal models of "reinforcement learning" (RL) have repeatedly been found within dopaminergic nuclei of the midbrain and dopaminoceptive areas of the striatum. However, the precise form of the RL algorithms implemented in the human brain is not yet well determined. Here, we created a novel paradigm optimized to dissociate the subtypes of reward-prediction errors that function as the key computational signatures of two distinct classes of RL models-namely, "actor/critic" models and action-value-learning models (e.g., the Q-learning model). The state-value-prediction error (SVPE), which is independent of actions, is a hallmark of the actor/critic architecture, whereas the action-value-prediction error (AVPE) is the distinguishing feature of action-value-learning algorithms. To test for the presence of these prediction-error signals in the brain, we scanned human participants with a high-resolution functional magnetic-resonance imaging (fMRI) protocol optimized to enable measurement of neural activity in the dopaminergic midbrain as well as the striatal areas to which it projects. In keeping with the actor/critic model, the SVPE signal was detected in the substantia nigra. The SVPE was also clearly present in both the ventral striatum and the dorsal striatum. However, alongside these purely state-value-based computations we also found evidence for AVPE signals throughout the striatum. These high-resolution fMRI findings suggest that model-free aspects of reward learning in humans can be explained algorithmically with RL in terms of an actor/critic mechanism operating in parallel with a system for more direct action-value learning.
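The two prediction-error signatures contrasted here differ only in which value estimates enter the temporal-difference error. Their textbook forms, written out for reference (function and argument names are ours):

```python
def state_value_prediction_error(reward, v_next, v_current, gamma=1.0):
    """SVPE: TD error over state values, independent of the chosen action;
    the critic's teaching signal in an actor/critic architecture."""
    return reward + gamma * v_next - v_current

def action_value_prediction_error(reward, q_next, q_current, gamma=1.0):
    """AVPE: TD error over the value of the chosen action, the signature
    of direct action-value learning; in Q-learning, q_next is the maximum
    action value available in the next state."""
    return reward + gamma * q_next - q_current
```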
Affiliation(s)
- Jaron T. Colas
- Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA, United States of America
- Wolfgang M. Pauli
- Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA, United States of America
- Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, CA, United States of America
- Tobias Larsen
- Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, CA, United States of America
- Center for Mind/Brain Sciences, University of Trento, Trento, Italy
- J. Michael Tyszka
- Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, CA, United States of America
- John P. O’Doherty
- Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA, United States of America
- Division of the Humanities and Social Sciences, California Institute of Technology, Pasadena, CA, United States of America