1. Schütt HH, Kim D, Ma WJ. Reward prediction error neurons implement an efficient code for reward. Nat Neurosci 2024; 27:1333-1339. PMID: 38898182. DOI: 10.1038/s41593-024-01671-x.
Abstract
We use efficient coding principles borrowed from sensory neuroscience to derive the optimal neural population for encoding a reward distribution. We show that the responses of dopaminergic reward prediction error neurons in mouse and macaque resemble those of the efficient code in the following ways: the neurons have a broad distribution of midpoints covering the reward distribution; neurons with higher thresholds have higher gains, more convex tuning functions, and lower slopes; and their slopes are higher when the reward distribution is narrower. Furthermore, we derive learning rules that converge to the efficient code. The learning rule for a neuron's position on the reward axis closely resembles distributional reinforcement learning. Thus, reward prediction error neuron responses may be optimized to broadcast an efficient reward signal, forming a connection between efficient coding and reinforcement learning, two of the most successful theories in computational neuroscience.
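A minimal sketch may help picture the population described here: sigmoidal units whose midpoints tile the quantiles of the reward distribution, so coverage adapts to the distribution's width. This is an illustrative reconstruction under assumed tuning shapes, not the authors' code; `efficient_population` and all parameter values are hypothetical.
```python
# Illustrative sketch (assumptions, not the paper's code): a population of
# sigmoidal reward-tuned units whose midpoints tile the quantiles of the
# reward distribution, as in the efficient-coding account described above.
import numpy as np

def efficient_population(reward_samples, n_neurons=10, slope=1.0):
    """Place tuning-curve midpoints at evenly spaced quantiles."""
    quantile_levels = (np.arange(n_neurons) + 0.5) / n_neurons
    midpoints = np.quantile(reward_samples, quantile_levels)

    def responses(r):
        # Sigmoidal tuning: each unit signals how far r exceeds its midpoint.
        return 1.0 / (1.0 + np.exp(-slope * (r - midpoints)))

    return midpoints, responses

rng = np.random.default_rng(0)
rewards = rng.lognormal(mean=0.0, sigma=0.5, size=10_000)
midpoints, pop = efficient_population(rewards)
print(midpoints)   # midpoints spread across the reward distribution
print(pop(1.0))    # population response to a reward of 1.0
```
A narrower reward distribution pulls the midpoints closer together, which is one way to read the abstract's observation that slopes are higher when the reward distribution is narrower.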
Affiliation(s)
- Heiko H Schütt
- Center for Neural Science and Department of Psychology, New York University, New York, NY, USA.
- Department of Behavioural and Cognitive Sciences, Université du Luxembourg, Esch-Belval, Luxembourg.
- Dongjae Kim
- Center for Neural Science and Department of Psychology, New York University, New York, NY, USA
- Department of AI-Based Convergence, Dankook University, Yongin, Republic of Korea
- Wei Ji Ma
- Center for Neural Science and Department of Psychology, New York University, New York, NY, USA

2. Muller TH, Butler JL, Veselic S, Miranda B, Wallis JD, Dayan P, Behrens TEJ, Kurth-Nelson Z, Kennerley SW. Distributional reinforcement learning in prefrontal cortex. Nat Neurosci 2024; 27:403-408. PMID: 38200183. PMCID: PMC10917656. DOI: 10.1038/s41593-023-01535-w.
Abstract
The prefrontal cortex is crucial for learning and decision-making. Classic reinforcement learning (RL) theories center on learning the expectation of potential rewarding outcomes and explain a wealth of neural data in the prefrontal cortex. Distributional RL, on the other hand, learns the full distribution of rewarding outcomes and better explains dopamine responses. In the present study, we show that distributional RL also better explains macaque anterior cingulate cortex neuronal responses, suggesting that it is a common mechanism for reward-guided learning.
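For readers unfamiliar with the distinction being tested, a hedged sketch: classic RL tracks a single mean value, while distributional RL maintains a bank of value units with different asymmetries between the learning rates for positive and negative prediction errors. The code below is an illustrative expectile-style toy, not the paper's model; all names and parameters are assumptions.
```python
# Hedged toy contrast between classic and distributional RL: a bank of
# value units, each with its own asymmetry between learning rates for
# positive vs negative prediction errors. Equal rates recover classic
# mean-value RL. Illustrative only, not the paper's implementation.
import numpy as np

def distributional_update(values, reward, alpha_pos, alpha_neg):
    """One asymmetric (expectile-style) update for every unit in the bank."""
    delta = reward - values                    # per-unit prediction errors
    alpha = np.where(delta > 0, alpha_pos, alpha_neg)
    return values + alpha * delta

n_units = 7
taus = np.linspace(0.1, 0.9, n_units)          # optimism of each unit
alpha_pos, alpha_neg = 0.1 * taus, 0.1 * (1 - taus)
values = np.zeros(n_units)

rng = np.random.default_rng(1)
for _ in range(5_000):
    reward = float(rng.choice([0.0, 1.0]))     # bimodal reward distribution
    values = distributional_update(values, reward, alpha_pos, alpha_neg)

print(values.round(2))  # units fan out across the distribution; the middle
                        # (tau = 0.5) unit sits near the classic mean (~0.5)
```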
Affiliation(s)
- Timothy H Muller
- Department of Experimental Psychology, University of Oxford, Oxford, UK.
- Department of Clinical and Movement Neurosciences, University College London, London, UK.
- James L Butler
- Department of Experimental Psychology, University of Oxford, Oxford, UK
- Department of Clinical and Movement Neurosciences, University College London, London, UK
- Sebastijan Veselic
- Department of Experimental Psychology, University of Oxford, Oxford, UK
- Department of Clinical and Movement Neurosciences, University College London, London, UK
- Wellcome Trust Centre for Human Neuroimaging, University College London, London, UK
- Bruno Miranda
- Department of Clinical and Movement Neurosciences, University College London, London, UK
- Institute of Physiology and Institute of Molecular Medicine, Lisbon School of Medicine, University of Lisbon, Lisbon, Portugal
- Joni D Wallis
- Department of Psychology and Helen Wills Neuroscience Institute, University of California Berkeley, Berkeley, CA, USA
- Peter Dayan
- Max Planck Institute for Biological Cybernetics, Tübingen, Germany
- University of Tübingen, Tübingen, Germany
- Timothy E J Behrens
- Wellcome Trust Centre for Human Neuroimaging, University College London, London, UK
- Wellcome Centre for Integrative Neuroimaging, University of Oxford, John Radcliffe Hospital, Oxford, UK
- Sainsbury Wellcome Centre for Neural Circuits and Behaviour, University College London, London, UK
- Zeb Kurth-Nelson
- Google DeepMind, London, UK.
- Max Planck University College London Centre for Computational Psychiatry and Ageing Research, University College London, London, UK.
- Steven W Kennerley
- Department of Experimental Psychology, University of Oxford, Oxford, UK.
- Department of Clinical and Movement Neurosciences, University College London, London, UK.
- Wellcome Centre for Integrative Neuroimaging, University of Oxford, John Radcliffe Hospital, Oxford, UK.

3. Jin F, Yang L, Yang L, Li J, Li M, Shang Z. Dynamics Learning Rate Bias in Pigeons: Insights from Reinforcement Learning and Neural Correlates. Animals (Basel) 2024; 14:489. PMID: 38338131. PMCID: PMC10854969. DOI: 10.3390/ani14030489.
Abstract
Research in reinforcement learning indicates that animals respond differently to positive and negative reward prediction errors, a difference that can be modeled by assuming a learning rate bias. Many studies have shown that humans and other animals exhibit learning rate biases during learning, but it is unclear whether and how the bias changes over the course of learning. Here, we recorded behavioral data and local field potentials (LFPs) in the striatum of five pigeons performing a probabilistic learning task. Reinforcement learning models with and without learning rate biases were used to dynamically fit the pigeons' choice behavior and estimate the option values. Furthermore, we explored the correlation between striatal LFP power and the model-estimated option values. We found that the pigeons' learning rate bias shifted from negative to positive during learning, and that striatal gamma-band (31-80 Hz) power correlated with the option values modulated by the dynamic learning rate bias. In conclusion, our results support the hypothesis, at both the behavioral and neural levels, that pigeons employ a dynamic learning strategy during learning, providing valuable insights into the reinforcement learning mechanisms of non-human animals.
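The core model element here, separate learning rates for positive and negative reward prediction errors, can be written compactly. The sketch below is a generic Rescorla-Wagner-style illustration under assumed parameters, not the paper's fitting code; refitting a bias index such as (alpha_pos - alpha_neg) / (alpha_pos + alpha_neg) across learning stages would be one way to quantify the shift reported above.
```python
# Generic sketch of an asymmetric value update (illustrative assumptions,
# not the paper's fitted model): positive and negative reward prediction
# errors are scaled by different learning rates.
def rw_update(value, reward, alpha_pos, alpha_neg):
    """Update one option's value with a biased learning rate."""
    delta = reward - value                     # reward prediction error
    alpha = alpha_pos if delta > 0 else alpha_neg
    return value + alpha * delta

value = 0.5
for reward in [1, 0, 1, 1, 0, 1]:              # toy outcome sequence
    value = rw_update(value, reward, alpha_pos=0.2, alpha_neg=0.1)
print(round(value, 3))                         # positive bias inflates value
```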
Affiliation(s)
- Fuli Jin
- School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China
- Henan Key Laboratory of Brain Science and Brain-Computer Interface Technology, Zhengzhou 450001, China
- Lifang Yang
- School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China
- Henan Key Laboratory of Brain Science and Brain-Computer Interface Technology, Zhengzhou 450001, China
- Long Yang
- School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China
- Henan Key Laboratory of Brain Science and Brain-Computer Interface Technology, Zhengzhou 450001, China
- Jiajia Li
- School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China
- Henan Key Laboratory of Brain Science and Brain-Computer Interface Technology, Zhengzhou 450001, China
- Mengmeng Li
- School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China
- Henan Key Laboratory of Brain Science and Brain-Computer Interface Technology, Zhengzhou 450001, China
- Zhigang Shang
- School of Electrical and Information Engineering, Zhengzhou University, Zhengzhou 450001, China
- Henan Key Laboratory of Brain Science and Brain-Computer Interface Technology, Zhengzhou 450001, China
- Institute of Medical Engineering Technology and Data Mining, Zhengzhou University, Zhengzhou 450001, China

4. Payzan-LeNestour E, Doran J. Craving money? Evidence from the laboratory and the field. Sci Adv 2024; 10:eadi5034. PMID: 38215199. PMCID: PMC10786414. DOI: 10.1126/sciadv.adi5034.
Abstract
Continuing to gamble despite harmful consequences has plagued human life in many ways, from loss-chasing in problem gamblers to reckless investing during stock market bubbles. Here, we propose that these anomalies in human behavior can sometimes reflect Pavlovian perturbations on instrumental behavior. To show this, we combined key elements of Pavlovian psychology literature and standard economic theory into a single model. In it, when a gambling cue such as a gaming machine or a financial asset repeatedly delivers a good outcome, the agent may start engaging with the cue even when the expected value is negative. Next, we transported the theoretical framework into an experimental task and found that participants behaved like the agent in our model. Last, we applied the model to the domain of real-world financial trading and discovered an asset-pricing anomaly suggesting that market participants are susceptible to the purported Pavlovian bias.
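The proposed mechanism lends itself to a compact sketch: choice is driven by an instrumental value plus a Pavlovian term tied to the cue's learned value, so a cue with a strong payout history can sustain engagement even at negative expected value. The additive form, weights, and softmax choice rule below are illustrative assumptions, not the authors' model specification.
```python
# Hedged sketch of a Pavlovian perturbation on instrumental choice:
# engagement probability mixes instrumental value with a Pavlovian term
# driven by the cue's learned value. All parameters are illustrative.
import math

def p_gamble(q_instrumental, v_cue, pav_weight=0.6, beta=5.0):
    """Softmax probability of engaging with the gambling cue."""
    net = q_instrumental + pav_weight * v_cue   # Pavlovian perturbation
    return 1.0 / (1.0 + math.exp(-beta * net))

# Negative expected value, but a cue that has repeatedly paid out:
print(p_gamble(q_instrumental=-0.1, v_cue=0.5))  # > 0.5: keeps gambling
print(p_gamble(q_instrumental=-0.1, v_cue=0.0))  # < 0.5: declines
```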
Affiliation(s)
- James Doran
- University of New South Wales Business School, UNSW Sydney, Kensington NSW 2052, Australia

5. Lowet AS, Zheng Q, Meng M, Matias S, Drugowitsch J, Uchida N. An opponent striatal circuit for distributional reinforcement learning. bioRxiv [Preprint] 2024:2024.01.02.573966. PMID: 38260354. PMCID: PMC10802299. DOI: 10.1101/2024.01.02.573966.
Abstract
Machine learning research has achieved large performance gains on a wide range of tasks by expanding the learning target from mean rewards to entire probability distributions of rewards - an approach known as distributional reinforcement learning (RL) [1]. The mesolimbic dopamine system is thought to underlie RL in the mammalian brain by updating a representation of mean value in the striatum [2,3], but little is known about whether, where, and how neurons in this circuit encode information about higher-order moments of reward distributions [4]. To fill this gap, we used high-density probes (Neuropixels) to acutely record striatal activity from well-trained, water-restricted mice performing a classical conditioning task in which reward mean, reward variance, and stimulus identity were independently manipulated. In contrast to traditional RL accounts, we found robust evidence for abstract encoding of variance in the striatum. Remarkably, chronic ablation of dopamine inputs disorganized these distributional representations in the striatum without interfering with mean value coding. Two-photon calcium imaging and optogenetics revealed that the two major classes of striatal medium spiny neurons - D1 and D2 MSNs - contributed to this code by preferentially encoding the right and left tails of the reward distribution, respectively. We synthesize these findings into a new model of the striatum and mesolimbic dopamine that harnesses the opponency between D1 and D2 MSNs [5-15] to reap the computational benefits of distributional RL.
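One way to see how opponent tails could arise, sketched under illustrative assumptions (the same asymmetric update as in the sketch under entry 2, not the authors' model code): units with optimistic asymmetry settle above the mean and pessimistic units below it, so the gap between the two populations tracks reward variance while their average tracks the mean.
```python
# Hedged sketch of the opponency idea: "D1-like" units learn with
# optimistic asymmetry (right tail), "D2-like" units with pessimistic
# asymmetry (left tail). Their average tracks the mean; their gap grows
# with reward variance. The cell-class mapping follows the abstract; all
# numbers are illustrative.
import numpy as np

def learn_expectile(rewards, tau, alpha=0.05):
    """Asymmetric value update; tau > 0.5 is optimistic, < 0.5 pessimistic."""
    v = 0.0
    for r in rewards:
        delta = r - v
        v += alpha * (tau if delta > 0 else 1 - tau) * delta
    return v

rng = np.random.default_rng(2)
for sigma in (0.5, 2.0):                       # low- vs high-variance rewards
    rewards = rng.normal(1.0, sigma, 20_000)
    d1 = learn_expectile(rewards, tau=0.8)     # right-tail ("D1-like") unit
    d2 = learn_expectile(rewards, tau=0.2)     # left-tail ("D2-like") unit
    print(sigma, round((d1 + d2) / 2, 2), round(d1 - d2, 2))
    # the mean estimate stays near 1.0; the D1-D2 gap widens with variance
```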
Affiliation(s)
- Adam S. Lowet
- Center for Brain Science, Harvard University, Cambridge, MA, USA
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
- Program in Neuroscience, Harvard University, Boston, MA, USA
- Qiao Zheng
- Center for Brain Science, Harvard University, Cambridge, MA, USA
- Department of Neurobiology, Harvard Medical School, Boston, MA, USA
- Melissa Meng
- Center for Brain Science, Harvard University, Cambridge, MA, USA
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
- Sara Matias
- Center for Brain Science, Harvard University, Cambridge, MA, USA
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
- Jan Drugowitsch
- Center for Brain Science, Harvard University, Cambridge, MA, USA
- Department of Neurobiology, Harvard Medical School, Boston, MA, USA
- Naoshige Uchida
- Center for Brain Science, Harvard University, Cambridge, MA, USA
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA

6. Masset P, Tano P, Kim HR, Malik AN, Pouget A, Uchida N. Multi-timescale reinforcement learning in the brain. bioRxiv [Preprint] 2023:2023.11.12.566754. PMID: 38014166. PMCID: PMC10680596. DOI: 10.1101/2023.11.12.566754.
Abstract
To thrive in complex environments, animals and artificial agents must learn to act adaptively to maximize fitness and rewards. Such adaptive behavior can be learned through reinforcement learning [1], a class of algorithms that has been successful at training artificial agents [2-6] and at characterizing the firing of dopamine neurons in the midbrain [7-9]. In classical reinforcement learning, agents discount future rewards exponentially according to a single timescale, controlled by the discount factor. Here, we explore the presence of multiple timescales in biological reinforcement learning. We first show that reinforcement agents learning at a multitude of timescales possess distinct computational benefits. Next, we report that dopamine neurons in mice performing two behavioral tasks encode reward prediction error with a diversity of discount time constants. Our model explains the heterogeneity of temporal discounting in both cue-evoked transient responses and slower-timescale fluctuations known as dopamine ramps. Crucially, the measured discount factor of individual neurons is correlated across the two tasks, suggesting that it is a cell-specific property. Together, our results provide a new paradigm for understanding functional heterogeneity in dopamine neurons, a mechanistic basis for the empirical observation that humans and animals use non-exponential discounts in many situations [10-14], and open new avenues for the design of more efficient reinforcement learning algorithms.
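The single- versus multi-timescale contrast is easy to state in code: classical TD learning keeps one value function with one discount factor, while the multi-timescale account maintains a value estimate per discount factor, all updated from the same reward stream. The sketch below is an illustrative TD(0) toy under assumed parameters, not the authors' implementation.
```python
# Illustrative multi-timescale TD(0) sketch (assumptions, not the paper's
# code): one value function per discount factor, updated in parallel on a
# simple cyclic task with one reward per cycle.
import numpy as np

gammas = np.array([0.5, 0.9, 0.99])      # per-unit discount factors
n_states = 5
V = np.zeros((len(gammas), n_states))    # one value function per timescale
alpha = 0.1

state = 0
for _ in range(50_000):
    next_state = (state + 1) % n_states
    reward = 1.0 if next_state == 0 else 0.0   # reward once per cycle
    # TD(0) update applied to every timescale at once:
    delta = reward + gammas * V[:, next_state] - V[:, state]
    V[:, state] += alpha * delta
    state = next_state

print(V.round(2))   # longer-horizon (higher-gamma) rows hold value
                    # further from the rewarded transition
```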
Affiliation(s)
- Paul Masset
- Department of Molecular and Cellular Biology, Harvard University, USA
- Center for Brain Science, Harvard University, USA
- Pablo Tano
- Department of Basic Neuroscience, University of Geneva, Switzerland
- HyungGoo R Kim
- Department of Molecular and Cellular Biology, Harvard University, USA
- Center for Brain Science, Harvard University, USA
- Department of Biomedical Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea
- Center for Neuroscience Imaging Research, Institute for Basic Science (IBS), Suwon 16419, Republic of Korea
- Athar N Malik
- Department of Molecular and Cellular Biology, Harvard University, USA
- Center for Brain Science, Harvard University, USA
- Department of Neurosurgery, Warren Alpert Medical School of Brown University, USA
- Norman Prince Neurosciences Institute, Rhode Island Hospital, USA
- Alexandre Pouget
- Department of Basic Neuroscience, University of Geneva, Switzerland
- Naoshige Uchida
- Department of Molecular and Cellular Biology, Harvard University, USA
- Center for Brain Science, Harvard University, USA