Abstract
A Pavlovian bias to approach reward-predictive cues and avoid punishment-predictive cues can conflict with instrumentally optimal actions. Here, we propose that the brain arbitrates between Pavlovian and instrumental control by inferring which is a better predictor of reward. The instrumental predictor is more flexible; it can learn values that depend on both stimuli and actions, whereas the Pavlovian predictor learns values that depend only on stimuli. The arbitration theory predicts that the Pavlovian predictor will be favored when rewards are relatively uncontrollable, because the additional flexibility of the instrumental predictor is not useful. Consistent with this hypothesis, we find that the Pavlovian approach bias is stronger in low-control than in high-control contexts.
Introduction
A longstanding distinction holds that a Pavlovian learning system controls behavioral responses based on stimulus-outcome relationships (independently of actions), whereas a separate instrumental learning system controls responses based on stimulus-action-outcome relationships. In violation of this strict dichotomy, Pavlovian processes are known to promote approach towards reward-predictive stimuli and avoidance of punishment-predictive stimuli^{1}, even when they produce maladaptive behavior^{2}. For example, Hershberger^{3} famously demonstrated that newborn chicks struggled to learn that they should walk away from a cup of food in order to obtain it. The chicks could not suppress their Pavlovian tendency to move toward the cup, which was rigged to move farther away as the chicks approached. Another example of this phenomenon, referred to as Pavlovian misbehavior, comes from studies of autoshaping, in which animals interact with a reward-predictive cue (e.g., pigeons will peck a keylight that precedes pellet delivery) despite the fact that these behaviors do not affect the reward outcome. If an omission contingency is then introduced, such that expression of these behaviors causes the reward to be withheld, animals will sometimes persist in performing the maladaptive behavior, a phenomenon known as negative automaintenance^{4}. Humans also exhibit Pavlovian misbehavior in Go/NoGo tasks, erroneously acting in response to reward-predictive stimuli when they should withhold action, and erroneously withholding action in response to punishment-predictive stimuli when they should act^{5,6}.
The idea that instrumental and Pavlovian processes coexist and compete for control of behavior has been a longstanding fixture of associative learning theory^{7,8,9}, and more recently has been formalized within the framework of modern reinforcement learning theories^{10}. These theories have typically assumed that instrumental and Pavlovian processes each provide action values, which are then linearly combined to produce composite action values that control behavior. A weighting parameter determines the degree of Pavlovian influence, and this parameter is fit to each participant in the experimental data set. In this paper, we argue that the weighting parameter is determined endogenously by an arbitration process, much like an influential proposal for the arbitration between model-based and model-free reinforcement learning strategies^{11}.
Our theory of arbitration is based on the idea that Pavlovian and instrumental processes can be understood as constituting different predictive models of reward (we will use the terms ‘predictor’ and ‘model’ interchangeably, except where we distinguish the brain’s internal models of the environment from our models of the brain). The instrumental predictor learns reward expectations as a function of both stimuli and actions, whereas the Pavlovian predictor learns reward expectations as a function only of stimuli. Thus, the instrumental predictor is strictly more complex than the Pavlovian predictor: it can capture any pattern that the Pavlovian predictor can capture, as well as patterns that the Pavlovian predictor cannot capture. The cost of this flexibility is that the instrumental predictor can also overfit on a finite data set, which means that it will generalize poorly due to fitting noise. The basic problem of arbitration is thus to negotiate a balance between capturing the patterns in the data (favoring the more complex instrumental predictor) and avoiding overfitting (favoring the less complex Pavlovian predictor).
Bayesian model averaging elegantly resolves this problem by weighting each predictor’s output by the posterior probability of the predictor given the data. The posterior will tend to favor predictors of intermediate complexity, due to what is known as Bayesian Occam’s razor^{12}. We can think of each predictive model as ‘betting’ on observing particular data sets (Fig. 1a). Simple models concentrate their bets on a relatively small number of data sets, whereas complex models distribute their bets across a larger number of data sets. If a simple model accurately predicts a particular data set, it is rewarded more than a complex model, because it bet more on that data set. If the model is too simple (i.e., its bets are too narrowly concentrated), it will fail to predict the observed data.
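The Occam's razor effect can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes binary rewards, uniform Beta(1, 1) priors, and equal prior odds, and the function names and example counts are hypothetical. The simple model pools rewards across actions (one reward probability per stimulus), whereas the complex model keeps a separate reward probability for each action.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(rewards, non_rewards, a=1.0, b=1.0):
    """Log marginal likelihood of a Bernoulli sequence under a Beta(a, b) prior."""
    return log_beta(a + rewards, b + non_rewards) - log_beta(a, b)

def posterior_simple(counts):
    """Posterior probability of the simple (action-independent) model.

    counts maps each action to (number of rewards, number of non-rewards).
    Equal prior odds for the two models are assumed.
    """
    total_r = sum(r for r, _ in counts.values())
    total_n = sum(n for _, n in counts.values())
    log_simple = log_marginal(total_r, total_n)
    log_complex = sum(log_marginal(r, n) for r, n in counts.values())
    return 1.0 / (1.0 + exp(log_complex - log_simple))

# Hypothetical data: reward rate is the same for both actions (uncontrollable) ...
uncontrollable = {"Go": (10, 10), "NoGo": (10, 10)}
# ... or differs strongly between actions (controllable).
controllable = {"Go": (16, 4), "NoGo": (4, 16)}

print(posterior_simple(uncontrollable))  # above 0.5: the simple model is favored
print(posterior_simple(controllable))    # near 0: the complex model is favored
```

When the two actions yield the same reward rate, the simple model predicts the data about as well and has concentrated its bets more narrowly, so it receives the higher posterior probability.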
Another perspective on the same idea comes from the bias-variance tradeoff^{13,14,15}. Any predictor’s generalization error (i.e., how poorly it predicts new data after learning from a finite amount of training data) can be decomposed into the sum of three components: squared bias, variance, and irreducible error. Bias is the systematic error incurred by adopting an overly simple model that cannot adequately capture the underlying regularities in the data. Variance is the random error incurred by adopting an overly complex model, which causes the model to overfit random noise in the training data. The irreducible error arises from the inherent stochasticity of the data-generating process, which is independent of the predictor. Bias can be reduced by increasing model complexity, but at the cost of increasing variance. Optimal generalization error is achieved at an intermediate level of complexity where the sum of squared bias and variance (i.e., the reducible error) is minimal (Fig. 1b). The bias-variance tradeoff is closely connected to the Bayesian model averaging perspective, because predictive models with higher posterior probability will tend to have lower generalization error^{16}.
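For squared-error loss, this three-part decomposition can be written out explicitly. The notation is an assumption introduced here for illustration and does not appear in the paper: \(y = f(x) + \varepsilon\) is a new observation with noise variance \(\sigma ^2\), \(\hat f\) is the predictor learned from a finite training set, and expectations are taken over training sets and noise.

\[{\Bbb E}\left[ {\left( {y - \hat f(x)} \right)^2} \right] = \left( {{\Bbb E}[\hat f(x)] - f(x)} \right)^2 + {\Bbb E}\left[ {\left( {\hat f(x) - {\Bbb E}[\hat f(x)]} \right)^2} \right] + \sigma ^2\]

The first term is the squared bias, the second is the variance, and the third is the irreducible error.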
Applying these ideas to arbitration between Pavlovian and instrumental control, a key determinant of the optimal model complexity is controllability of reward^{17,18}. If rewards are uncontrollable (actions do not affect reward rate), then the simpler Pavlovian predictor will be favored by the posterior, because the additional complexity of the instrumental predictor is not justified relative to the penalty imposed by the Bayesian Occam’s razor. Only when rewards are sufficiently controllable, or once sufficient data have been observed, will the instrumental predictor be favored (asymptotically, the instrumental predictor will always be favored, because the risk of overfitting noise disappears as the data set becomes large).
We test the predictions of the Bayesian arbitration model by manipulating reward controllability in two Go/NoGo experiments, using the Pavlovian Go bias observed in previous experiments^{19,20} as an index of Pavlovian control. As a complementary window into the arbitration process, we also explore how controllability affects the bias-variance tradeoff.
Results
Behavior and modeling
We describe the two experiments together because they are very similar in structure (Fig. 2). Experiment 1 manipulated reward controllability between participants, whereas Experiment 2 manipulated it within participants.
To investigate the extent to which participants relied on Pavlovian control, we measured their Go bias, defined as the difference in accuracy between Go-to-Win and NoGo-to-Win trials (see Supplementary Fig. 1 for the disaggregated data). Under purely instrumental control, the Go bias should be 0; hence values greater than 0 indicate the influence of Pavlovian control.
We developed two models of behavior on this task (see Methods for details). Both models consist of two subcomponents: a Pavlovian learning system and an instrumental learning system (Fig. 3). The Pavlovian system acquires reward expectations for each stimulus, and converts these expectations into action values by promoting Go responses to cues in proportion to their expected reward. The instrumental system acquires reward expectations for each stimulus-action combination, and converts these into action values by promoting Go responses to stimuli in proportion to their expected reward for Go relative to NoGo. The learning rules for both systems are the same.
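As a concrete illustration of this measure, the sketch below computes the Go bias from a list of trials. The data layout (pairs of condition label and correctness) and the labels `go_to_win` and `nogo_to_win` are assumptions made for the example, not the format of the published data set.

```python
def accuracy(trials, condition):
    """Proportion of correct responses on trials of the given condition."""
    outcomes = [correct for cond, correct in trials if cond == condition]
    if not outcomes:
        raise ValueError(f"no trials for condition {condition!r}")
    return sum(outcomes) / len(outcomes)

def go_bias(trials):
    """Accuracy on Go-to-Win trials minus accuracy on NoGo-to-Win trials."""
    return accuracy(trials, "go_to_win") - accuracy(trials, "nogo_to_win")

# Hypothetical participant: 3/4 correct on Go-to-Win, 2/4 correct on NoGo-to-Win.
example = (
    [("go_to_win", True)] * 3 + [("go_to_win", False)]
    + [("nogo_to_win", True)] * 2 + [("nogo_to_win", False)] * 2
)
print(go_bias(example))
```

A participant who is equally accurate on both trial types has a Go bias of 0, which is the value expected under purely instrumental control.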
The Pavlovian and instrumental values are integrated linearly according to a weighting parameter. The two models differ in terms of how the weighting parameter changes as a function of experience. In the fixed model, the weighting parameter is held constant, treated as a free parameter that we fit to behavior. In the adaptive model, the weighting parameter is updated after each trial based on the relative predictive ability of each system. Thus, the weight is not a free parameter in the adaptive model, but is instead determined endogenously by each participant’s experience in the task.
Figure 4 shows the time series of the adaptive Pavlovian weight for the model fitted to one participant, demonstrating the prediction that low control should tend to produce a higher Pavlovian weight (w), which will in turn cause a stronger Go bias.
Consistent with the model predictions, participants across both experiments showed an increased Go bias in the Low Control (LC) condition compared to the High Control (HC) condition (Experiment 1: t(183) = 2.06, p < 0.05, d = 0.31; Experiment 2: t(128) = 2.06, p < 0.05, d = 0.18; by two-sample t-test; Fig. 5). The adaptive model provided a quantitatively superior account relative to the fixed model, as assessed by random effects Bayesian model comparison^{21}. Specifically, we calculated the protected exceedance probability (PXP), the probability that a particular model is more frequent in the population than all other models under consideration, taking into account the possibility that some differences in model evidence are due to chance. For both experiments, the PXP favoring the adaptive model was >0.99.
To verify the quantitative accuracy of the adaptive model, we plotted the Go bias as a function of weight quantile (Fig. 5), finding a close fit between model and data (for both experiments, the t-test comparing the Go bias for the lowest and highest quantiles was significant; Experiment 1: t(177) = 12.30, p < 0.0001, d = 0.92; Experiment 2: t(54) = 6.16, p < 0.0001, d = 0.83), apart from a slight deviation in Experiment 2 for the lowest weight quantiles. Importantly, the quantiles were computed within participants, demonstrating that the model can capture variations in Pavlovian control over the course of a single experimental session.
The time series of weights generated by the adaptive model is, on average, correlated with the parameter estimates obtained from fitting the fixed model (Experiment 1: r = 0.51, p < 0.0001; Experiment 2: r = 0.63, p < 0.0001; by correlation). This demonstrates that the adaptive model, on average, produces behavior similar to that predicted by earlier models using fixed weights^{19,20}, but with the weight determined endogenously rather than fit as a free parameter.
We also tested the prediction that the Go bias should diminish over the course of training, and eventually disappear, as can be seen in the simulations (Fig. 4). Consistent with this prediction, the Go bias in both experiments declined over the course of trials, roughly exponentially fast (Fig. 6). Specifically, we regressed the Go bias against the log-transformed trial number and then carried out t-tests on the regression coefficient, finding a significant negative effect for Experiment 1 (t(184) = 2.93, p < 0.005, d = 0.22) and Experiment 2 (t(128) = 5.55, p < 0.0001, d = 0.49).
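The regression analysis can be sketched as follows. This is one plausible reading rather than the authors' pipeline: it assumes a per-participant series of Go bias values indexed by trial number, an ordinary least-squares slope on the natural log of the trial number, and a one-sample t statistic computed over participants' slopes. All function names and the example numbers are hypothetical.

```python
from math import log, sqrt
from statistics import mean, stdev

def log_trial_slope(go_bias_series):
    """OLS slope of Go bias regressed on log(trial number), trials numbered from 1."""
    xs = [log(t) for t in range(1, len(go_bias_series) + 1)]
    x_bar, y_bar = mean(xs), mean(go_bias_series)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, go_bias_series))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def one_sample_t(values):
    """t statistic for the null hypothesis that the population mean is 0."""
    return mean(values) / (stdev(values) / sqrt(len(values)))

# Hypothetical series that declines linearly in log(trial number).
series = [0.3 - 0.05 * log(t) for t in range(1, 41)]
print(log_trial_slope(series))

# Hypothetical slopes from three participants.
print(one_sample_t([-0.06, -0.04, -0.05]))
```

A negative slope indicates that the Go bias shrinks with training, and a reliably negative t statistic across participants corresponds to the significant negative effect reported in the text.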
Analysis of bias and variance
We also examined the effect of controllability on the bias-variance tradeoff (Fig. 7). Because controllability favors the more complex instrumental model, we hypothesized that the HC condition would produce lower bias and higher variance (note that this bias should not be confused with the Pavlovian Go bias; see Supplementary Fig. 3 for model simulations of bias and variance). This prediction was partially confirmed in both Experiment 1 (bias: t(183) = 2.06, p < 0.05, d = 0.31; variance: t(183) = 1.69, p = 0.09, d = 0.25) and Experiment 2 (bias: t(128) = 2.07, p < 0.05, d = 0.18; variance: t(128) = 2.37, p < 0.05, d = 0.21; by t-test).
Discussion
Taken together, our experimental data provide evidence consistent with a Bayesian model averaging theory of Pavlovian-instrumental arbitration. Our key finding was that the Pavlovian Go bias was stronger under conditions of low reward controllability, consistent with our model’s prediction. Analyses in terms of the bias-variance tradeoff supported the same conclusion: low controllability favors the simpler Pavlovian predictor, leading to high bias and low variance.
Our results cannot be explained by a nonspecific Go bias, whereby Go responses are rewarded more in the High control condition, inducing an overall tendency to produce Go responses. This would in fact predict the opposite effect (stronger Go bias under high reward controllability), contrary to our experimental findings. Even stronger evidence against a nonspecific Go bias would be provided by a version of the experiment in which participants must make Go/NoGo responses to avoid punishment.
The idea that Pavlovian-instrumental interactions are governed by probabilistic inference joins a number of related ideas in the theories of reinforcement learning. Most relevantly, Daw and colleagues suggested that arbitration between model-based and model-free control was determined by Bayesian arbitration^{11}, but they did not address Pavlovian-instrumental interactions. A number of earlier theories argued that certain reinforcement learning behaviors could be understood as arising from a model comparison process^{22,23,24,25}. However, to our knowledge, ours is the first account that directly addresses Pavlovian-instrumental interactions in terms of model comparison/averaging.
Recent work by Swart and colleagues complicates this picture by showing that the Go bias is at least partially accounted for by instrumental learning biases^{26}. In particular, participants in their study tended to learn more quickly from rewarded Go trials compared to rewarded NoGo trials, whereas they learned more slowly from punished NoGo trials compared to punished Go trials. This instrumental learning bias causes Go responses to appear more attractive when correct actions yield reward, and less attractive when correct actions yield avoidance of punishment. This phenomenon is not accounted for by our modeling framework.
Our results suggest several directions for future work. First, we have only studied the dynamics of the Pavlovian Go bias for rewards; earlier work suggests that we should find a symmetric pattern for punishments, with a stronger NoGo bias under low controllability^{6,27}. Second, neuroimaging could be used to identify the neural correlates of arbitration. If our account is correct, we would expect to see a signal in the brain that encodes the dynamically changing weight parameter. Third, an open theoretical task will be to generalize the model to explain other forms of Pavlovian-instrumental interactions, such as negative automaintenance and Pavlovian-instrumental transfer.
More broadly, our findings are consistent with the idea that agency is one factor that can mediate the tradeoff between learning processes, which has important implications for understanding psychopathology. For example, many studies in both humans and animals have shown that controllability (or lack thereof) influences future instrumental responding. Learned helplessness, where the experience of uncontrollable punishments leads to diminished instrumental learning (for example, failure to learn to escape an electric shock^{28}), is hypothesized to be a model of, and has been linked to, symptoms of depression and anxiety^{29}. Although the learned helplessness literature has focused on uncontrollable punishments, there is also evidence that individuals with depression are less likely to experience illusions of control with rewards^{30}. The idea that inferences about controllability underlie learned helplessness has been incorporated into formal Bayesian models that share some properties with the model proposed in this paper^{31}. In addition, recent work has shown that outcome controllability manipulations can induce learned helplessness in humans, and also enhance Pavlovian biases in a reinforcement learning context^{32}.
In conclusion, we have shown how the framework of Bayesian model averaging can shed light on the cognitive mechanisms underlying Pavlovian misbehavior. Although the simple model studied in this paper is not a comprehensive theory of Pavlovian-instrumental interactions, it points towards one mechanism that is likely to play an important role in future, more comprehensive theories.
Methods
Participants
We recruited two independent samples of adults from Amazon Mechanical Turk (Experiment 1: N = 271, Experiment 2: N = 183). The sample sizes were chosen in order to exceed sample sizes from previous, similar work^{5,6,19}. Participants for Experiment 2 were recruited from an existing pool of Amazon Mechanical Turk workers who had completed previous experiments for our lab and expressed interest in being recontacted for additional study opportunities. Participants were excluded for inaccuracy. Specifically, participants who made the incorrect action (either a button press on a NoGo trial, or the absence of a button press on a Go trial) on ≥50% of all trials were excluded from analyses. We also excluded any participants whose accuracy was <30% on any one condition. This left a total of 185 accurate participants for Experiment 1 and 129 accurate participants for Experiment 2 (see Supplementary Fig. 2 for experimental results without participant exclusions). The Harvard University Committee on the Use of Human Subjects approved the experimental procedures and participants provided informed consent prior to beginning the study.
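The two exclusion rules can be expressed compactly. The sketch below assumes each participant's data is a list of (condition, correct) pairs and that "condition" refers to the stimulus type; both are assumptions made for illustration, as are the function and argument names.

```python
from collections import defaultdict

def is_excluded(trials, max_error=0.5, min_condition_accuracy=0.3):
    """Return True if a participant meets either exclusion criterion.

    Rule 1: incorrect on at least `max_error` of all trials.
    Rule 2: accuracy below `min_condition_accuracy` on any single condition.
    """
    if not trials:
        raise ValueError("participant has no trials")
    n_correct = sum(1 for _, correct in trials if correct)
    if (len(trials) - n_correct) / len(trials) >= max_error:
        return True
    by_condition = defaultdict(list)
    for condition, correct in trials:
        by_condition[condition].append(correct)
    return any(
        sum(outcomes) / len(outcomes) < min_condition_accuracy
        for outcomes in by_condition.values()
    )

# Hypothetical participant: accurate overall, but only 20% correct on NoGo trials.
participant = [("go", True)] * 10 + [("nogo", True)] * 2 + [("nogo", False)] * 8
print(is_excluded(participant))
```

The second rule catches participants whose overall accuracy looks acceptable only because they gave the same response on nearly every trial.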
Procedure
Participants completed a modified Go/NoGo paradigm in which they decided on each trial whether to take or withhold an action in response to a stimulus in order to receive reward^{6,20}. Participants viewed a single colored square on each trial and had to learn the appropriate response for each square. There was a different correct response and reward probability combination for each square: one square was a Go stimulus, where a spacebar press was rewarded 75% of the time; one square was a NoGo stimulus, where the absence of a button press was rewarded 75% of the time; and the third square was a Decoy stimulus, where a spacebar press was rewarded with a probability that was manipulated based on experimental condition. In the Low control (LC) condition, the Decoy was rewarded 50% of the time, and in the High control (HC) condition, the Decoy was rewarded 80% of the time. Our task differed from previous Go/NoGo tasks in that it did not include any punishment conditions. Rewarded outcomes were represented with dollar signs, and unrewarded outcomes were represented with a neutral (white rectangle) cue. Participants were told that they would receive a small amount of real bonus money for each reward outcome, and their total bonus was summed and disclosed at the end of the experiment.
In Experiment 1, participants were randomly assigned to one decoy condition (LC or HC), so that each participant was exposed to three different stimuli (Go-to-Win, NoGo-to-Win, and either LC or HC). The experiment consisted of 120 trials, 40 trials for each type of stimulus, randomly interleaved. In Experiment 2, each participant experienced both decoy conditions in separate blocks, the order of which was randomized. The experiment consisted of 240 trials, 120 for each block, with 40 trials for each stimulus within a block. The experiment was coded in jsPsych, version 6.0.5^{33}.
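The block structure and reward contingencies can be sketched in a few lines. This is an illustrative Python reconstruction, not the jsPsych experiment code. The text specifies the reward probability only for the correct response (and for a spacebar press on the Decoy); the sketch assumes that the other response is rewarded with the complementary probability, which is an assumption and is flagged as such in the comments.

```python
import random

def reward_probability(stimulus, pressed, decoy="LC"):
    """Probability of reward for a response to a stimulus.

    Values for the correct response follow the task description.
    ASSUMPTION: the alternative response is rewarded with probability 1 - p.
    """
    if decoy not in ("LC", "HC"):
        raise ValueError("decoy must be 'LC' or 'HC'")
    if stimulus == "go":
        p_press = 0.75
    elif stimulus == "nogo":
        p_press = 0.25  # so that withholding is rewarded 75% of the time
    elif stimulus == "decoy":
        p_press = 0.50 if decoy == "LC" else 0.80
    else:
        raise ValueError(f"unknown stimulus {stimulus!r}")
    return p_press if pressed else 1.0 - p_press

def make_block(trials_per_stimulus=40, seed=None):
    """Randomly interleaved block with equal numbers of each stimulus type."""
    rng = random.Random(seed)
    block = ["go", "nogo", "decoy"] * trials_per_stimulus
    rng.shuffle(block)
    return block

block = make_block(seed=0)
print(len(block), block[:5])
```

Under this reading, pressing and withholding are equally rewarded for the Decoy in the LC condition, so the Decoy's outcome there does not depend on the action taken.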
Computational model
On each trial of the task, the participant must take an action (a) in response to a stimulus (s) in order to receive a reward (r). The problem facing the participant is to determine whether they are acting in an environment where outcomes are controllable (instrumental) or uncontrollable (Pavlovian).
Each model has a set of parameters θ that must be learned. The parameters for the uncontrollable model are indexed only by the stimulus (\(\theta _{s}\)), whereas the parameters for the controllable model are indexed by both the stimulus and action (\(\theta _{sa}\)). We will walk through the learning equations for the uncontrollable model, but the idea is essentially the same for the controllable model (see Supplementary Methods for complete derivations).
The posterior over parameters given data \({\cal{D}}\) (the history of stimuli, actions and rewards) and environment m ∈ {controllable, uncontrollable} is stipulated by Bayes’ rule:
\[P(\theta |{\cal{D}},m) = \frac{{P({\cal{D}}|\theta ,m)P(\theta |m)}}{{P({\cal{D}}|m)}},\]
where \(P({\cal{D}}|\theta ,m)\) is the likelihood of the data given hypothetical parameter values θ, and P(θ|m) is the prior probability of those parameter values. In the context of our task, where rewards are binary, \(\theta _s = {\Bbb E}[r|s]\) corresponds to the mean of a stimulus-specific Bernoulli distribution. When P(θ_{s}) is a \(Beta\left( {\theta _0\frac{{\eta _0}}{2},\left( {1 - \theta _0} \right)\frac{{\eta _0}}{2}} \right)\) distribution, the posterior mean \(\hat \theta _s\) (which is also the posterior predictive mean for reward) is initialized to θ_{0} and updated according to:
\[\hat \theta _s \leftarrow \hat \theta _s + \eta _s^{ - 1}\delta ,\]
where δ is the reward prediction error \((r - \hat \theta _s)\), and \(\eta _s^{ - 1}\) is the learning rate with counter η_{s} initialized to η_{0} and incremented by 1 every time stimulus s is encountered (in the controllable model, η is indexed by both s and a). Intuitively, θ_{0} corresponds to the prior mean (the reward expectation before any observations), and η_{0} corresponds to the prior confidence (how much deviation from the prior mean the agent expects).
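A minimal sketch of this learning rule follows. The class name and defaults are hypothetical, and one detail is an assumption: the counter is incremented before it is used as the learning-rate denominator, which the text does not state explicitly. The same class serves either model, with keys being stimuli for the uncontrollable model and (stimulus, action) pairs for the controllable model.

```python
class RewardLearner:
    """Running reward expectation with a count-based learning rate."""

    def __init__(self, theta0=0.5, eta0=1.0):
        self.theta0 = theta0  # prior mean
        self.eta0 = eta0      # prior confidence (initial counter value)
        self.theta = {}       # current expectation per key
        self.eta = {}         # current counter per key

    def predict(self, key):
        """Expected reward for a key, falling back to the prior mean."""
        return self.theta.get(key, self.theta0)

    def update(self, key, reward):
        """Apply one delta-rule update and return the prediction error."""
        # ASSUMPTION: increment the counter first, then use 1/eta as learning rate.
        eta = self.eta.get(key, self.eta0) + 1.0
        theta = self.predict(key)
        delta = reward - theta
        self.theta[key] = theta + delta / eta
        self.eta[key] = eta
        return delta

pavlovian = RewardLearner()
pavlovian.update("square_A", 1)
pavlovian.update("square_A", 1)
print(pavlovian.predict("square_A"))

instrumental = RewardLearner()
instrumental.update(("square_A", "Go"), 1)
print(instrumental.predict(("square_A", "Go")), instrumental.predict(("square_A", "NoGo")))
```

Because the learning rate shrinks as the counter grows, early outcomes move the expectation more than later ones, and a larger η_{0} makes the estimate stay closer to the prior mean.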
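Because the true environment is unknown, it must be inferred, which can be done using another application of Bayes’ rule:
\[P(m|{\cal{D}}) \propto P({\cal{D}}|m)P(m),\]
where
\[P({\cal{D}}|m) = {\int} P({\cal{D}}|\theta ,m)P(\theta |m)\,d\theta \]
is the marginal likelihood. The posterior can be updated in closed form. For clarity we adopt a log-odds convention, with the prior log-odds given by:
\[L_0 = \log \frac{{P(m = {\mathrm{uncontrollable}})}}{{P(m = {\mathrm{controllable}})}}.\]
The posterior log odds are initialized to L_{0} and updated according to:
\[L \leftarrow L + r\log \frac{{\hat \theta _s}}{{\hat \theta _{sa}}} + (1 - r)\log \frac{{1 - \hat \theta _s}}{{1 - \hat \theta _{sa}}}.\]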
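The sequential log-odds update can be sketched as below, under two assumptions that follow from the surrounding description rather than from code released with the paper: positive log odds favor the uncontrollable environment, and each model's evidence on a trial is the Bernoulli probability that its current reward expectation (before learning from that trial) assigns to the observed reward. The function name and the numerical floor `eps` are hypothetical.

```python
from math import log

def update_log_odds(log_odds, reward, theta_s, theta_sa, eps=1e-12):
    """Add one trial's log likelihood ratio (uncontrollable vs. controllable).

    theta_s:  reward expectation of the stimulus-only (Pavlovian) predictor
    theta_sa: reward expectation of the stimulus-action (instrumental) predictor
    Both should be the predictions held before this trial's outcome was learned.
    """
    p_uncontrollable = theta_s if reward else 1.0 - theta_s
    p_controllable = theta_sa if reward else 1.0 - theta_sa
    # eps guards against log(0) when a prediction is exactly 0 or 1.
    return log_odds + log(max(p_uncontrollable, eps)) - log(max(p_controllable, eps))

# A reward that the instrumental predictor expected more strongly (0.8 vs 0.5)
# shifts the log odds toward the controllable environment (more negative).
print(update_log_odds(0.0, 1, theta_s=0.5, theta_sa=0.8))
```

When both predictors assign the same probability to the outcome, the log odds are unchanged, so evidence accumulates only on trials where the two predictors disagree.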
Finally, we need to specify how each model maps reward predictions onto action values. For the instrumental model, we assume that action values simply correspond to the expected reward for a particular state-action pair: \(V_I\left( {s,a} \right) = \hat \theta _{sa}\). For the Pavlovian model, we assume that the action value is equal to V_{P}(s, a) = 0 for a = NoGo and \(V_P\left( {s,a} \right) = \hat \theta _s\) for a = Go. This assumption follows from the influential idea that Pavlovian reward expectations invigorate action^{5}. To combine the two action values into a single integrated value for action selection, we weight each model’s value by its corresponding posterior probability:
\[V(s,a) = wV_P(s,a) + (1 - w)V_I(s,a),\]
where
\[w = \frac{1}{{1 + \exp ( - L)}}\]
is the posterior probability of the uncontrollable environment.
To allow for stochasticity of behavior, we model the agent’s action selection according to a softmax, where β is an inverse temperature parameter controlling the level of choice stochasticity:
\[P(a|s) = \frac{{\exp [\beta V(s,a)]}}{{\mathop {\sum}\nolimits_{a^{\prime}} {\exp [\beta V(s,a^{\prime})]} }}.\]
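The value integration and softmax choice rule can be sketched for the two-action case. The sketch assumes that the Pavlovian weight is the logistic transform of the posterior log odds of the uncontrollable environment, and it uses the fact that a softmax over two actions reduces to a logistic function of the value difference. Function names are hypothetical.

```python
from math import exp

def pavlovian_weight(log_odds):
    """Posterior probability of the uncontrollable environment given its log odds."""
    return 1.0 / (1.0 + exp(-log_odds))

def integrated_values(theta_s, theta_go, theta_nogo, w):
    """Posterior-weighted action values for Go and NoGo.

    Pavlovian values: theta_s for Go and 0 for NoGo.
    Instrumental values: theta_go and theta_nogo.
    """
    v_go = w * theta_s + (1.0 - w) * theta_go
    v_nogo = (1.0 - w) * theta_nogo
    return v_go, v_nogo

def prob_go(v_go, v_nogo, beta):
    """Two-action softmax, written as a logistic function of the value difference."""
    return 1.0 / (1.0 + exp(-beta * (v_go - v_nogo)))

# Hypothetical NoGo-to-Win stimulus: NoGo is instrumentally better, but the
# stimulus itself predicts reward, so a high Pavlovian weight pushes toward Go.
for log_odds in (-3.0, 0.0, 3.0):
    w = pavlovian_weight(log_odds)
    v_go, v_nogo = integrated_values(theta_s=0.6, theta_go=0.3, theta_nogo=0.7, w=w)
    print(round(w, 3), round(prob_go(v_go, v_nogo, beta=5.0), 3))
```

As the weight grows, the probability of an erroneous Go response to a reward-predictive NoGo stimulus increases, which is the Pavlovian Go bias described in the Results.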
The model outlined above, which we will refer to as the adaptive model, updates the weighting parameter from trial to trial based on the relative predictive accuracy of the two controllers. We also fit a comparison model, which instead fits the weighting term as a free parameter. We refer to this comparison model as the fixed model. The models share the same underlying information processing architecture (Fig. 3) but differ in whether w is set exogenously (in the case of the fixed model) or endogenously (in the case of the adaptive model).
We fit each model’s free parameters using maximum likelihood estimation. The adaptive model had five free parameters: the inverse temperature β, and the parameters of the prior (θ_{0}, η_{0}) for each environment (High or Low Control). We also considered a model in which L_{0} was fit as a free parameter, but model comparison indicated that fixing L_{0} = 0.5 had greater support in our data sets. The fixed model had six free parameters: the same five as the adaptive model, plus the weighting parameter w. Average parameter estimates are reported in Supplementary Table 1.
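To make the fitting objective concrete, the sketch below computes the negative log-likelihood of one participant's choices under the adaptive model for a given parameter setting. It is a hedged reconstruction from the verbal description, not the authors' released code: the trial format, the function names, the increment-then-update ordering of the counters, and the use of pre-update predictions as trial evidence are all assumptions. A maximum likelihood fit would minimize this quantity over the free parameters with any numerical optimizer, which is not shown here.

```python
from math import exp, log

def sigmoid(x):
    """Logistic function with clamping to avoid overflow."""
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + exp(-x))

def safe_log(p, eps=1e-12):
    return log(max(p, eps))

def adaptive_nll(trials, beta, theta0_p, eta0_p, theta0_i, eta0_i, log_odds0=0.5):
    """Negative log-likelihood of choices under the adaptive model.

    trials: list of (stimulus, went, reward); went is True for Go, reward is 0 or 1.
    log_odds0 defaults to the fixed value of L_0 quoted in the text.
    """
    theta_p, eta_p, theta_i, eta_i = {}, {}, {}, {}
    log_odds, nll = log_odds0, 0.0
    for stimulus, went, reward in trials:
        ts = theta_p.get(stimulus, theta0_p)
        t_go = theta_i.get((stimulus, True), theta0_i)
        t_nogo = theta_i.get((stimulus, False), theta0_i)
        # Choice probability from posterior-weighted values.
        w = sigmoid(log_odds)
        v_go = w * ts + (1.0 - w) * t_go
        v_nogo = (1.0 - w) * t_nogo
        p_go = sigmoid(beta * (v_go - v_nogo))
        nll -= safe_log(p_go if went else 1.0 - p_go)
        # Evidence update, using predictions made before learning from this trial.
        tsa = t_go if went else t_nogo
        log_odds += safe_log(ts if reward else 1.0 - ts)
        log_odds -= safe_log(tsa if reward else 1.0 - tsa)
        # Delta-rule updates with count-based learning rates.
        eta_p[stimulus] = eta_p.get(stimulus, eta0_p) + 1.0
        theta_p[stimulus] = ts + (reward - ts) / eta_p[stimulus]
        key = (stimulus, went)
        eta_i[key] = eta_i.get(key, eta0_i) + 1.0
        theta_i[key] = tsa + (reward - tsa) / eta_i[key]
    return nll

# Hypothetical three-trial sequence.
demo = [("A", True, 1), ("A", False, 0), ("B", True, 0)]
print(adaptive_nll(demo, beta=5.0, theta0_p=0.5, eta0_p=2.0, theta0_i=0.5, eta0_i=2.0))
```

A fixed-weight variant would replace `w = sigmoid(log_odds)` with a constant `w` passed in as an additional free parameter, which is the sixth parameter mentioned in the text.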
Bias-variance analysis
To assess how controllability affects the biasvariance tradeoff, we calculated these quantities for each participant as follows:
where a_{t} is the chosen action on trial t, \(a_t^ \ast\) is the optimal action, \(\bar a_t = \frac{1}{T}\mathop {\sum}\nolimits_{t = 1}^T {\Bbb I} \left[ {a_t = Go} \right]\), T is the number of trials (note that the optimal action is not defined for the Low control decoy), and \({\Bbb I}[ \cdot ] = 1\) when its argument is true, and 0 otherwise.
Intuitively, this bias measures how much a participant’s actions deviate from the optimal policy. A bias of 0 indicates that the participant always follows the optimal policy. Positive values indicate that the participant is responding Go more frequently than optimal. The variance measures how much a participant’s actions deviate from their average policy. A variance of 0 indicates that the participant always gives the same response.
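One reading of these definitions that is consistent with the verbal description is sketched below. It is an assumption rather than a transcription of the paper's equations: bias is taken to be the mean signed difference between the Go indicator for the chosen action and the Go indicator for the optimal action, and variance is taken to be the mean squared deviation of the Go indicator from the participant's average Go rate. Trials without a defined optimal action are assumed to have been removed beforehand.

```python
def bias_and_variance(actions, optimal_actions):
    """Compute (bias, variance) from parallel lists of booleans (True = Go).

    ASSUMED definitions:
      bias     = mean over trials of I[a_t = Go] - I[a*_t = Go]
      variance = mean over trials of (I[a_t = Go] - mean Go rate)^2
    """
    if len(actions) != len(optimal_actions) or not actions:
        raise ValueError("inputs must be non-empty and of equal length")
    n_trials = len(actions)
    go = [1.0 if a else 0.0 for a in actions]
    optimal_go = [1.0 if a else 0.0 for a in optimal_actions]
    bias = sum(g - o for g, o in zip(go, optimal_go)) / n_trials
    mean_go = sum(go) / n_trials
    variance = sum((g - mean_go) ** 2 for g in go) / n_trials
    return bias, variance

# Hypothetical participant who responds Go on one trial where NoGo was optimal.
chosen = [True, True, True, False]
best = [True, False, True, False]
print(bias_and_variance(chosen, best))
```

Under this reading, a participant who always responds Go has zero variance and a positive bias equal to the proportion of trials on which NoGo was optimal.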
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Data availability
The data that support the findings of this study are available at: https://github.com/sjgershm/GoNogocontrol.
Code availability
Code to produce computational model results and plots is available at: https://github.com/sjgershm/GoNogocontrol and code for the experiment is available at: https://github.com/hayleydorfman/pavlovianinstrumentalarbitration.
Change history
08 July 2020
An amendment to this paper has been published and can be accessed via a link at the top of the paper.
References
1. Wasserman, E. A., Franklin, S. R. & Hearst, E. Pavlovian appetitive contingencies and approach versus withdrawal to conditioned stimuli in pigeons. J. Comp. Physiol. Psychol. 86, 616–627 (1974).
2. Breland, K. & Breland, M. The misbehavior of organisms. Am. Psychologist 16, 681–684 (1961).
3. Hershberger, W. A. An approach through the looking-glass. Anim. Learn. Behav. 14, 443–451 (1986).
4. Williams, D. R. & Williams, H. Automaintenance in the pigeon: sustained pecking despite contingent nonreinforcement. J. Exp. Anal. Behav. 12, 511–520 (1969).
5. Guitart-Masip, M., Duzel, E., Dolan, R. & Dayan, P. Action versus valence in decision making. Trends Cogn. Sci. 18, 194–202 (2014).
6. Guitart-Masip, M. et al. Go and no-go learning in reward and punishment: interactions between affect and effect. NeuroImage 62, 154–166 (2012).
7. Miller, S. & Konorski, J. On a particular form of conditioned reflex. J. Exp. Anal. Behav. 12, 187–189 (1969).
8. Mowrer, O. H. On the dual nature of learning: a reinterpretation of conditioning and problem-solving. Harv. Educ. Rev. 17, 102–148 (1947).
9. Rescorla, R. A. & Solomon, R. L. Two-process learning theory: relationships between Pavlovian conditioning and instrumental learning. Psychol. Rev. 74, 151–182 (1967).
10. Dayan, P., Niv, Y., Seymour, B. & Daw, N. D. The misbehavior of value and the discipline of the will. Neural Netw. 19, 1153–1160 (2006).
11. Daw, N. D., Niv, Y. & Dayan, P. Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nat. Neurosci. 8, 1704–1711 (2005).
12. MacKay, D. J. C. Information Theory, Inference and Learning Algorithms. (Cambridge University Press, 2003).
13. Geman, S. & Bienenstock, E. Neural networks and the bias/variance dilemma. Neural Comput. 4, 1–58 (1992).
14. Glaze, C. M., Filipowicz, A. L. S., Kable, J. W., Balasubramanian, V. & Gold, J. I. A bias–variance trade-off governs individual differences in on-line learning in an unpredictable environment. Nat. Hum. Behav. 2, 213–224 (2018).
15. Gigerenzer, G. & Brighton, H. Homo heuristicus: why biased minds make better inferences. Top. Cogn. Sci. 1, 107–143 (2009).
16. Germain, P., Bach, F., Lacoste, A. & Lacoste-Julien, S. PAC-Bayesian theory meets Bayesian inference. Neural Information Processing Systems. (2016).
17. Huys, Q. & Dayan, P. A Bayesian formulation of behavioral control. Cognition 113, 314–328 (2009).
18. Moscarello, J. M. & Hartley, C. A. Agency and the calibration of motivated behavior. Trends Cogn. Sci. 21, 725–735 (2017).
19. Guitart-Masip, M. et al. Action controls dopaminergic enhancement of reward representations. Proc. Natl Acad. Sci. USA 109, 7511–7516 (2012).
20. Cavanagh, J. F., Eisenberg, I., Guitart-Masip, M., Huys, Q. & Frank, M. J. Frontal theta overrides Pavlovian learning biases. J. Neurosci. 33, 8541–8548 (2013).
21. Stephan, K. E., Penny, W. D., Daunizeau, J., Moran, R. J. & Friston, K. J. Bayesian model selection for group studies. Neuroimage 46, 1004–1017 (2009).
22. Courville, A. C., Daw, N. D. & Touretzky, D. S. Bayesian theories of conditioning in a changing world. Trends Cogn. Sci. 10, 294–300 (2006).
23. Gershman, S. J., Blei, D. M. & Niv, Y. Context, learning, and extinction. Psychol. Rev. 117, 197–209 (2010).
24. Gershman, S. J. Deconstructing the human algorithms for exploration. Cognition 173, 34–42 (2017).
25. Tomov, M. S., Dorfman, H. M. & Gershman, S. J. Neural computations underlying causal structure learning. J. Neurosci. 38, 7143–7157 (2018).
26. Swart, J. C. et al. Catecholaminergic challenge uncovers distinct Pavlovian and instrumental mechanisms of motivated (in)action. eLife. https://doi.org/10.7554/eLife.22169.001 (2017).
27. Rigoli, F., Pavone, E. F. & Pezzulo, G. Interaction of goal-directed and Pavlovian systems in aversive domains. Front. Neurosci. 6, 134 (2012).
28. Maier, S. F. & Seligman, M. E. Learned helplessness: theory and evidence. J. Exp. Psychol. Gen. 105, 3–46 (1976).
29. Mineka, S. & Hendersen, R. W. Controllability and predictability in acquired motivation. Annu. Rev. Psychol. 36, 495–529 (1985).
30. Alloy, L. B. & Abramson, L. Y. Learned helplessness, depression, and the illusion of control. J. Pers. Soc. Psychol. 42, 1114–1126 (1982).
31. Lieder, F. & Goodman, N. D. Learned helplessness and generalization. Proc. 35th Annu. Conf. Cogn. Sci. Soc. (2013).
32. Csifcsák, G., Melsæter, E. & Mittner, M. Intermittent absence of control during reinforcement learning interferes with Pavlovian bias in action selection. Preprint at: https://psyarXiv.com/jpq6f/ (2019).
33. de Leeuw, J. R. jsPsych: a JavaScript library for creating behavioral experiments in a Web browser. Behav. Res. 47, 1–12 (2015).
Acknowledgements
We would like to thank Rebecca Hao for help with the initial setup for this study. This work was supported by the Office of Naval Research (N000141712984) and the Alfred P. Sloan Foundation.
Author information
Contributions
S.J.G. designed the study; H.M.D. collected the data; S.J.G. and H.M.D. analyzed the data; S.J.G. designed the computational models; S.J.G. and H.M.D. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information Nature Communications thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Dorfman, H.M., Gershman, S.J. Controllability governs the balance between Pavlovian and instrumental action selection. Nat Commun 10, 5826 (2019). https://doi.org/10.1038/s41467-019-13737-7