Understanding dopamine and reinforcement learning: The dopamine reward prediction error hypothesis

A number of recent advances have been achieved in the study of midbrain dopaminergic neurons. Understanding these advances and how they relate to one another requires a deep understanding of the computational models that serve as an explanatory framework and guide ongoing experimental inquiry. This intertwining of theory and experiment now suggests very clearly that the phasic activity of the midbrain dopamine neurons provides a global mechanism for synaptic modification. These synaptic modifications, in turn, provide the mechanistic underpinning for a specific class of reinforcement learning mechanisms that now seem to underlie much of human and animal behavior. This review describes both the critical empirical findings that are at the root of this conclusion and the fantastic theoretical advances from which this conclusion is drawn.

The theory and data available today indicate that the phasic activity of midbrain dopamine neurons encodes a reward prediction error used to guide learning throughout the frontal cortex and the basal ganglia. Activity in these dopaminergic neurons is now believed to signal that a subject's estimate of the value of current and future events is in error, and to indicate the magnitude of that error. Most scholars active in dopamine studies believe that this combined signal adjusts synaptic strengths in a quantitative manner until the subject's estimate of the value of current and future events is accurately encoded in the frontal cortex and basal ganglia. Although some confusion remains within the larger neuroscience community, few data exist that are incompatible with this hypothesis. This review provides a brief overview of the explanatory synergy among behavioral, anatomical, physiological, and biophysical data that has been forged by recent computational advances. For a more detailed treatment of this hypothesis, refer to Niv and Montague (1) or Dayan and Abbott (2).
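
To make the hypothesized computation concrete, here is a minimal sketch, in Python, of the kind of prediction-error-driven value update the review describes. The learning rate, reward size, and trial count are illustrative assumptions, not values taken from the review.

```python
def rpe_update(value_estimate, reward, learning_rate=0.1):
    """One prediction-error-driven update of a stored value estimate."""
    prediction_error = reward - value_estimate  # signed error: obtained minus expected
    return value_estimate + learning_rate * prediction_error

v = 0.0                            # initial estimate of the reward's value
for trial in range(50):
    v = rpe_update(v, reward=1.0)  # repeated pairings with a reward of 1.0
print(round(v, 3))                 # estimate approaches 1.0 as the error shrinks
```

As the estimate converges, the prediction error shrinks toward zero; on the hypothesis described above, so does the phasic dopamine response.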

Dopamine and temporal difference learning: A fruitful relationship between neuroscience and AI

Learning and motivation are driven by internal and external rewards. Many of our day-to-day behaviours are guided by predicting, or anticipating, whether a given action will result in a positive (that is, rewarding) outcome. The study of how organisms learn from experience to correctly anticipate rewards has been a productive research field for well over a century, since Ivan Pavlov’s seminal psychological work. In his most famous experiment, dogs were trained to expect food some time after a buzzer sounded. These dogs began salivating as soon as they heard the sound, before the food had arrived, indicating they’d learned to predict the reward. In the original experiment, Pavlov estimated the dogs’ anticipation by measuring the volume of saliva they produced. But in recent decades, scientists have begun to decipher the inner workings of how the brain learns these expectations. Meanwhile, in close contact with this study of reward learning in animals, computer scientists have developed algorithms for reinforcement learning in artificial systems. These algorithms enable AI systems to learn complex strategies without external instruction, guided instead by reward predictions.
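
A toy temporal-difference (TD) learning model, sketched below in Python, makes the Pavlov example concrete. The step counts and parameters are made up for illustration; the point is that value learned at the time of the food propagates backwards, trial by trial, to the moment of the cue.

```python
import numpy as np

n_steps = 5                  # t = 0 is the buzzer; the reward arrives at the final step
values = np.zeros(n_steps)   # learned value of each moment following the cue
alpha, gamma = 0.1, 1.0      # learning rate and discount factor (illustrative)

for episode in range(1000):
    for t in range(n_steps):
        reward = 1.0 if t == n_steps - 1 else 0.0
        next_value = values[t + 1] if t < n_steps - 1 else 0.0
        td_error = reward + gamma * next_value - values[t]  # moment-to-moment "surprise"
        values[t] += alpha * td_error

print(values.round(2))  # every step approaches 1.0: the cue itself now predicts the reward
```

In this picture, the dogs' early salivation corresponds to the high learned value at the cue, while the TD error plays the role that the reward prediction error hypothesis assigns to phasic dopamine.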

The contribution of our new work, published in Nature, is to show that a recent development in computer science – one that yields significant improvements in performance on reinforcement learning problems – may provide a deep, parsimonious explanation for several previously unexplained features of reward learning in the brain. It also opens up new avenues of research into the brain's dopamine system, with potential implications for learning and motivation disorders.
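
Assuming the development in question is distributional reinforcement learning, in which a population of value predictors with varying degrees of optimism learns the whole distribution of possible rewards rather than only its mean, the Python sketch below gives a toy flavour of the idea. It is not the paper's method; the 50/50 gamble, the optimism levels (taus), and the learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_reward = lambda: float(rng.choice([0.0, 10.0]))  # a 50/50 gamble (assumed example)

taus = np.array([0.1, 0.5, 0.9])   # pessimistic, neutral, and optimistic predictors
values = np.zeros_like(taus)       # each predictor's current reward estimate
alpha = 0.02                       # learning rate (illustrative)

for _ in range(20000):
    delta = sample_reward() - values  # per-predictor prediction errors
    # The asymmetry is the key trick: each predictor weights positive errors by tau
    # and negative errors by (1 - tau), so the population fans out across the distribution.
    values += alpha * np.where(delta > 0, taus, 1.0 - taus) * delta

print(values.round(1))  # roughly [1.0, 5.0, 9.0]: together they trace the reward distribution
```

Because each predictor settles at a different point of the reward distribution, the population as a whole carries more information than any single averaged estimate.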