{"title": "Acquisition in Autoshaping", "book": "Advances in Neural Information Processing Systems", "page_first": 24, "page_last": 30, "abstract": null, "full_text": "Acquisition in Autoshaping \n\nSham Kakade \n\nPeter Dayan \n\nGatsby Computational Neuroscience Unit \n\n17 Queen Square, London, England, WC1N 3AR. \n\nsharn@gatsby.ucl.ac.uk \n\ndayan@gatsby.ucl.ac.uk \n\nAbstract \n\nQuantitative data on the speed with which animals acquire behav(cid:173)\nioral responses during classical conditioning experiments should \nprovide strong constraints on models of learning. However, most \nmodels have simply ignored these data; the few that have attempt(cid:173)\ned to address them have failed by at least an order of magnitude. \nWe discuss key data on the speed of acquisition, and show how to \naccount for them using a statistically sound model of learning, in \nwhich differential reliabilities of stimuli playa crucial role. \n\n1 Introduction \n\nConditioning experiments probe the ways that animals make predictions about \nrewards and punishments and how those predictions are used to their advantage. \nSubstantial quantitative data are available as to how pigeons and rats acquire con(cid:173)\nditioned responses during autoshaping, which is one of the simplest paradigms \nof classical conditioning.4 These data are revealing about the statistical, and ulti(cid:173)\nmately also the neural, substrate underlying the ways that animals learn about the \ncausal texture of their environments. \nIn autoshaping experiments on pigeons, the birds acquire a peck response to a \nlighted key associated (irrespective of their actions) with the delivery of food. One \nattractive feature of autoshaping is that there is no need for separate 'probe trials' \nto assess the degree of association formed between the light and the food by the \nanimal- rather, the rate of key pecking during the light (and before the food) can \nbe used as a direct measure of this association. 
In particular, acquisition speeds are often measured by the number of trials until a certain behavioral criterion is met, such as pecking during the light on three out of four successive trials.4,8,10 \nAs stressed persuasively by Gallistel & Gibbon4 (GG; forthcoming), the critical feature of autoshaping is that there is substantial experimental evidence on how acquisition speed depends on the three critical variables shown in figure 1A. The first is I, the inter-trial interval; the second is T, the time during the trial for which the light is presented; the third is the training schedule, 1/S, which is the fractional number of deliveries per light - some birds were only partially reinforced. \nFigure 1 makes three key points. First, figure 1B shows that the median number of trials to the acquisition criterion depends on the ratio I/T, and not on I and T separately - experiments reported for the same I/T are actually performed with I and T differing by more than an order of magnitude.4,8 Second, figure 1B shows convincingly that the number of reinforcements is approximately inversely proportional to I/T - the relatively shorter the presentation of the light, the faster the learning. \n
Figure 1: Autoshaping. A) Experimental paradigm. Top: the light is presented for T seconds every C seconds and is always followed by the delivery of food (filled circle). Bottom: the food is delivered with probability 1/S = 1/2 per trial. In some cases I is stochastic, with the appropriate mean. B) Log-log plot4 of the number of reinforcements to a given acquisition criterion versus the I/T ratio for S = 1. The data are median acquisition times from 12 different laboratories. C) Log-log acquisition curves for various I/T ratios and S values. The main graph shows trials versus S; the inset shows reinforcements versus S. \n
Third, figure 1C shows that partial reinforcement has almost no effect when measured as a function of the number of reinforcements (rather than the number of trials),4,10 since although it takes S times as many trials to acquire, there are reinforcements on only 1/S of the trials. Changing S does not change the effective I/T when measured as a function of reinforcements, so this result might actually be expected on the basis of figure 1B, and we only consider S = 1 in this paper. Altogether, the data show that: \n
n ≈ 300 T/I    (1) \n
where n is the number of rewards to the acquisition criterion. Remarkably, these effects seem to hold for over an order of magnitude in both I/T and S. \nThese quantitative data should be a most seductive target for statistically sound models of learning. However, few models have even attempted to capture the strong constraints they provide, and those that have attempted all fail in critical respects. The best of them, rate estimation theory4 (RET), is closely related to the Rescorla-Wagner13 (RW) model, and actually captures the proportionality in equation 1. However, as shown below, RET grossly overestimates the observed speed of acquisition (underestimating the proportionality constant). Further, RET is designed to account for the time at which a particular, standard, acquisition criterion is met. Figure 2A shows that this is revealing only about the very early stages of learning - RET is silent about the remainder of the learning curve. \nWe look at additional quantitative data on learning, which collectively suggest that stimuli compete to predict the delivery of reward. 
Dayan & Long3 (DL) discussed various statistically inspired competitive models of classical conditioning, concluding with one in which stimuli are differently reliable as predictors of reward. However, DL ignored the data shown in figures 1 and 2, basing their analysis on conditioning paradigms in which I/T was not a factor. Figures 1 and 2 demand a more sophisticated statistical model - building such a model is the focus of this paper. \n
2 Rate Estimation Theory \n
Gallistel & Gibbon4 (GG; forthcoming) are amongst the strongest proponents of the quantitative relationships in figure 1. To account for them, GG suggest that animals are estimating the rates of rewards - one, λ_l, for the rate associated with the light and another, λ_b, for the rate associated with the background context. The context is the ever-present environment which can itself gain associative value. \n
[Figure 2: A) behavioral response versus number of reinforcements; B) acquisition speed versus number of prior context-only rewards.] \n
The overall rate of reward expected during the light is λ_l + λ_b, so after n rewards, all delivered during a cumulative light time t_l = nT and none during the cumulative light-free time t_b = nI, the joint distribution over the rates is \n
P(λ_l, λ_b | n) ∝ (λ_l + λ_b)^n exp(-(λ_l + λ_b)t_l - λ_b t_b)    (2) \n
GG suggest that acquisition occurs once the animal is sufficiently certain that the rate during the light exceeds the background rate by a given fraction, ie when \n
P(λ_l/λ_b > β | n) = 1 - α    (3) \n
where α is the uncertainty threshold and β is slightly greater than one, reflecting the fractional increase. The n that first satisfies equation 3 can be found by integrating the joint probability in equation 2. It turns out that n ∝ t_l/t_b, which has the appropriate, inverse dependence on the ratio I/T (as in figure 1B), since t_l/t_b = nT/nI = T/I. It also has no dependence on partial reinforcement, as observed in figure 1C. \nHowever, even with a very low uncertainty, α = 0.001, and a reasonable fractional increase, β = 1.5, this model predicts that learning should be more than ten times as fast as observed, since we get n ≈ 20 T/I as opposed to the 300 T/I observed. Equation 1 can only be satisfied by setting α between 10^-20 and 10^-50 (depending on the precise values of I/T and β)! 
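The order-of-magnitude mismatch can be checked exactly. The sketch below is our own illustration, not GG's code: it places independent Gamma posteriors on λ_l and λ_b (n rewards in exposure times nT and nC respectively, a deliberate simplification of equation 2 that ignores the additivity of the rates) and evaluates the criterion of equation 3 in closed form via the Beta-binomial tail identity:

```python
from math import comb

def p_light_exceeds(n, I_over_T, beta):
    # Simplified posteriors (flat priors): lambda_l ~ Gamma(n+1, n*T),
    # lambda_b ~ Gamma(n+1, n*C), with C = T + I.  Writing lambda_l/lambda_b
    # as a ratio of unit-rate Gamma variables,
    #   P(lambda_l > beta*lambda_b) = P(Beta(n+1, n+1) > x),
    # with x = beta*T/(beta*T + C), and the Beta tail is a binomial tail:
    #   P(Beta(a, a) > x) = P(Binomial(2a-1, x) <= a-1).
    x = beta / (beta + 1.0 + I_over_T)
    m = 2 * n + 1
    return sum(comb(m, k) * x**k * (1.0 - x)**(m - k) for k in range(n + 1))

def reinforcements_to_criterion(I_over_T, beta=1.5, alpha=0.001, n_max=200):
    # Smallest n satisfying equation 3: P(lambda_l/lambda_b > beta | n) >= 1 - alpha.
    for n in range(1, n_max):
        if p_light_exceeds(n, I_over_T, beta) >= 1.0 - alpha:
            return n
    return n_max
```

Even with α = 0.001, this ideal-detector criterion is met within a handful of reinforcements for typical I/T, far below the observed 300 T/I.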
This spells problems for GG as a normative, ideal detector model of learning - it cannot, for instance, be repaired with any reasonable prior for the rates, as α drops drastically with n. In other circumstances, though, Gallistel, Mark & King5 (forthcoming) have shown that animals can be ideal detectors of changes in rates. \nOne hint of the flaw with GG is that simple manipulations to the context before starting autoshaping (in particular extinction) can produce very rapid learning.2 More generally, the data show that acquisition speed is strongly controlled by prior rewards being given only in the context (without the light present).2 Figure 2B shows a parametric study of subsequent acquisition speeds during autoshaping as a function of the number of rewards given only with the context. This effect cannot simply be modeled by assuming a different prior distribution for the rates (which does not fix the problem of the speed of acquisition in any case), since the rate at which these prior context rewards were given has little effect on subsequent acquisition speed for a given number of prior reinforcements.9 Note that the data in figure 2B (ie equation 1) suggest that there were about thirty prior rewards in the context - this is consistent with the experimental procedures used,8-10 although prior experience was not a carefully controlled factor. \n
3 The Competitive Model \n
Five sets of constraints govern our new model. First, since animals can be ideal detectors of rates in some circumstances,5 we only consider accounts under which their acquisition of responding has a rational statistical basis. Second, the number of reinforcements to acquisition must be n ≈ 300 T/I, as in equation 1. This requires that the constant of proportionality should come from rational, not absurd, uncertainties. 
Third, pecking rates after the acquisition criterion is satisfied should also follow the form of figure 2A (in the end, we are prevented from a normative account of this by a dearth of data). Fourth, the overall learning speed should be strongly affected by the number of prior context rewards (figure 2B), but not by the rate at which they were presented. That is, the context, as an established predictor, regardless of the rate it predicts, should be able to substantially block learning to a less established predictor. Finally, the asymptotic accuracy of rate estimates should satisfy the substantial experimental data on the intrinsic uncertainty in the predictions, in the form of a quantitative account called scalar expectancy theory7 (SET). \nIn our model, as in DL, an independent prediction of the rate of reward delivery is made on the basis of each stimulus that is present (w_c for the context; w_l for the light). These separate predictions are combined based on estimated reliabilities of the predictions. Here, we present a heuristic version of a more rigorously specified model.12 \n
3.1 Rate Predictions \n
SET7 was originally developed to capture the nature of uncertainty in the way that animals estimate time intervals. Its most important result is that the standard deviation of an estimate is consistently proportional to the mean, even after an asymptotic number of presentations of the interval. Since the estimated time to a reward is just the inverse rate, asymptotic rate estimates might also be expected to have constant coefficients of variation. Therefore, we constrain the standard deviations of rate estimates not to drop below a multiple of their means. Evidence suggests that this multiple is about 0.2.7 RET clearly does not satisfy this constraint, as the joint distribution (equation 2) becomes arbitrarily accurate over time. 
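The contrast can be made concrete with the posterior-variance recursion of a scalar Kalman filter on a log rate: with a random-walk drift term the posterior variance settles at a floor, whereas with no drift (a fixed world, as RET implicitly assumes) it shrinks towards zero. A minimal sketch, using the drift variance 1/(η(η+1)) and unit observation variance adopted below; η = 25 is our own illustrative choice, giving a coefficient of variation of 1/√η = 0.2:

```python
def steady_state_std(q, v2=1.0, steps=20000, s=10.0):
    # Posterior-variance recursion for a scalar Kalman filter: predict
    # (variance grows by the drift q), then update against an observation
    # of variance v2.  Returns the final posterior standard deviation.
    for _ in range(steps):
        s = 1.0 / (1.0 / (s + q) + 1.0 / v2)
    return s ** 0.5

eta = 25.0
q_drift = 1.0 / (eta * (eta + 1.0))  # drift variance, as in equations 5-6 below

# With drift, the std of log w floors near 1/sqrt(eta) = 0.2, the SET
# coefficient of variation; with q = 0 it decays towards zero like 1/sqrt(m).
```

The fixed point of the recursion solves s^2 + sq - qv2 = 0, so the floor is approximately q^(1/4) ≈ 1/√η for the chosen q, which is how the model below satisfies the SET constraint.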
\nInspired by Sutton,14 we consider Kalman filter models for independent log-predictions, log w_c(m) and log w_l(m), on trial m. The output models for the filters specify the relationship between the predicted and observed rates. We use a simple log-normal, LN, approximation (to an underlying truly Poisson model): \n
P(o_c(m) | w_c(m)) ~ LN(w_c(m), v_c^2)    P(o_l(m) | w_l(m)) ~ LN(w_l(m), v_l^2)    (4) \n
where o_*(m) is the observed average reward whilst predictor * is present, so if a reward occurs with the light in trial m, then o_l(m) = 1/T and o_c(m) = 1/C (where C = T + I). The values of v_*^2 can be determined, from the Poisson model, to be v_c^2 = v_l^2 = 1. \nThe other part of the Kalman filter is a model of change in the world for the w's: \n
log w_c(m) = log w_c(m - 1) + ε_c(m)    ε_c(m) ~ N(0, (η(η + 1))^-1)    (5) \nlog w_l(m) = log w_l(m - 1) + ε_l(m)    ε_l(m) ~ N(0, (η(η + 1))^-1)    (6) \n
We use log(rates) so that there is no inherent scale to change in the world. Here, η is a constant chosen to satisfy the SET constraint, imposed as σ_* = w_*/√η at asymptote. Notice that η acts as the effective number of rewards remembered, which will be less than 30, to get the observed coefficient of variation above 0.2. \nAfter observing the data from m trials, the posterior distributions for the predictions will become approximately: \n
P(w_c(m) | data) ~ N(1/C, σ_c^2(m))    P(w_l(m) | data) ~ N(1/T, σ_l^2(m))    (7) \n
and, in about m = η trials, σ_c(m) → (1/C)/√η and σ_l(m) → (1/T)/√η. This captures the fastest acquisition in figure 2, and also extinction. \n
3.2 Cooperative Mixture of Experts \n
The two predictions (equation 7) are combined using the factorial experts model of Jacobs et al11 that was also used by DL. 
For this, during the presentation of the light (and the context, of course), we consider that, independently, the relationships between the actual reward rate r(m) and the outputs w_l(m) and w_c(m) of 'experts' associated with each stimulus are: \n
P(w_l(m) | r(m)) ~ N(r(m), ρ_l(m)^-1)    P(w_c(m) | r(m)) ~ N(r(m), ρ_c(m)^-1)    (8) \n
where ρ_l(m) and ρ_c(m) are inverse variances, or reliabilities, for the stimuli. These reliabilities reflect the belief as to how close w_l(m) and w_c(m) are to r(m). The estimates are combined, giving \n
P(r(m) | w_l(m), w_c(m)) ~ N(r̄(m), (ρ_l(m) + ρ_c(m))^-1) \n
r̄(m) = π_l(m) w_l(m) + (1 - π_l(m)) w_c(m)    π_l(m) = ρ_l(m)/(ρ_l(m) + ρ_c(m)) \n
The prediction of the reward rate without the light, r_c(m), is determined just by the context value w_c(m). \nIn this formulation, the context can block the light's prediction if it is more reliable (ρ_c >> ρ_l), since π_l ≈ 0, making the mean r̄(m) ≈ w_c(m), and this blocking occurs regardless of the context's rate, w_c(m). If ρ_l slowly increases, then r̄(m) → w_l slowly as π_l(m) → 1. We expect this to model the post-acquisition part of the learning shown in figure 2A. \nA fully normative model of acquisition would come from a statistically correct account of how the reliabilities should change over time, which, in turn, would come from a statistical model of the expectations the animal has of how predictabilities change in the world. Unfortunately, the slow phase of learning in figure 2A, which should provide the most useful data on these expectations, is almost ubiquitously ignored in experiments. \n
Figure 3: Satisfaction of the Constraints. A) The fit to the behavioral response curve (figure 2A), using equation 9 and π_0 = 0.004. B) Possible acquisition curves showing r̄(m) versus m. The ↔ on the criterion line denotes the range of 15 to 120 reinforcements that are indicated by figure 2B. The central curve is the same as in figure 3A; the other curves use the values of π_0 displayed, in multiples of the central curve's π_0. C) A theoretical fit to the data using equation 11. Here, α = 5% and π_0 √ρ_0 = 0.004. \n
We therefore make two assumptions about this, which are chosen to fit the acquisition data, but whose normative underpinnings are unclear. The first assumption, chosen to obtain the slow learning curve, is that: \n
π_l(m) = tanh(π_0 m)    (9) \n
Assuming that the strength of the behavioral response is approximately proportional to r̄(m) - r_c(m), which we will estimate by π_l(m)(w_l(m) - w_c(m)), figure 3A compares the rate of key pecking in the model with the data from figure 2A. Figure 3B shows the effect on the behavioral response of varying π_0. Within just half an order of magnitude of variation of π_0, the acquisition speeds (judged at the criterion line shown) due to between 1200 and 0 prior context rewards (figure 2B) can be obtained. Note the slightly counter-intuitive explanation - the actual reward rate associated with the light is established very quickly; slow learning comes from slow changes in the importance paid to these rates. \nWe make a second assumption, that the coefficient of variation of the context's prediction, from equation 8, does not change significantly for the early trials before the acquisition criterion is met (it could change thereafter). This gives: \n
ρ_c(m) ≈ ρ_0/w_c(m)^2 for early m    (10) \n
It is plausible that the context is not becoming a relatively worse 'expert' for early m, since no other predictor has yet proven more reliable. 
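Assumptions 9 and 10 can be turned into a simulated learning curve. In this sketch (our own illustration; T = 10 s and I = 50 s are arbitrary choices), the rate estimates w_l and w_c are taken to be already at their asymptotes 1/T and 1/C, so the whole shape of the curve comes from the slowly growing reliability weight π_l(m) = tanh(π_0 m):

```python
import math

T, I = 10.0, 50.0            # illustrative light and inter-trial durations (s)
C = T + I
w_l, w_c = 1.0 / T, 1.0 / C  # rate estimates, learned within ~eta trials (Section 3.1)

def response(m, pi0=0.004):
    # Behavioural response ~ pi_l(m) * (w_l - w_c), with pi_l(m) from equation 9.
    return math.tanh(pi0 * m) * (w_l - w_c)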
\nFollowing GG's suggestion, we model acquisition as occurring on trial m if P(r̄(m) > r_c(m) | data) ≥ 1 - α, ie if the animal has sound reasons to expect a higher reward rate with the light. Integrating over the Kalman filter distributions in equation 7 gives the distribution of r̄(m) - r_c(m) for early m as \n
P(r̄(m) - r_c(m) | data) ~ N((tanh(π_0 m))(1/T - 1/C), (ρ_0 C^2)^-1) \n
where σ_*(m) has dropped out due to π_l(m) being small at early m. Finding the number of rewards, n, that satisfies the acquisition criterion gives: \n
n ≈ (z_α/(π_0 √ρ_0)) (T/I)    (11) \n
where the factor z_α depends on the uncertainty, α, used. Figure 3C shows the theoretical fit to the data. \n
4 Discussion \n
Although a noble attempt, RET fails to satisfy the strong body of constraints under which any acquisition model must labor. Under RET, the acquisition of responding cannot have a rational statistical basis, as the animal's modeled uncertainty in the association between light and reward at the time of acquisition is below 10^-20. Further, RET ignores constraints set forth by the data establishing SET and also data on prior context manipulations. These latter data show that the context, regardless of the rate it predicts, will substantially block learning to a less established predictor. Additive models, such as RET, are unable to capture this effect. \nWe have suggested a model in which each stimulus is like an 'expert' that learns independently about the world. Expert predictions can adapt quickly to changes in contingencies, as they are based on a Kalman filter model, with variances chosen to satisfy the constraint suggested by SET, and they can be combined based on their reliabilities. We have demonstrated the model's close fit to substantial experimental data. 
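The linear dependence on T/I in equation 11 can be verified against the exact criterion. The sketch below uses illustrative parameters of our own choosing: ρ_0 = 1 and π_0 = 0.004, so that π_0 √ρ_0 = 0.004 as in figure 3C, with z_α ≈ 1.645 for α = 5%:

```python
import math

pi0, rho0 = 0.004, 1.0  # assumed values with pi0*sqrt(rho0) = 0.004 (figure 3C)
z_alpha = 1.645         # Gaussian quantile for alpha = 5%

def trials_to_criterion(I_over_T):
    # Smallest m with tanh(pi0*m)*(1/T - 1/C) >= z_alpha/(sqrt(rho0)*C);
    # multiplying through by C reduces this to
    #   tanh(pi0*m) >= z_alpha*(T/I)/sqrt(rho0).
    threshold = z_alpha / (math.sqrt(rho0) * I_over_T)
    if threshold >= 1.0:
        return None  # criterion unreachable for this I/T
    return math.ceil(math.atanh(threshold) / pi0)
```

For small π_0 m, tanh(π_0 m) ≈ π_0 m, recovering n ≈ (z_α/(π_0 √ρ_0))(T/I): doubling I/T roughly halves the number of trials to criterion.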
In particular, the new model captures the I/T dependence of the number of rewards to acquisition, with a constant of proportionality that reflects rational statistical beliefs. The slow learning that occurs in some circumstances is due to a slow change in the reliabilities of predictors, not to the rates being unable to adapt quickly. Although we have not shown it here, the model is also able to account for quantitative data as to the speed of extinction of the association between the light and the reward. \nThe model leaves many directions for future study. In particular, we have not specified a sound statistical basis for the changes in reliabilities given in equations 9 and 10. Such a basis is key to understanding the slow phase of learning. Second, we have not addressed data from more sophisticated conditioning paradigms. For instance, overshadowing, in which multiple conditioned stimuli are similarly predictive of the reward, should be able to be incorporated into the model in a natural way. \n
Acknowledgements \nWe are most grateful to Randy Gallistel and John Gibbon for freely sharing, prior to publication, their many ideas about timing and conditioning. We thank Sam Roweis for comments on an earlier version of the manuscript. Funding is from a NSF Graduate Research Fellowship (SK) and the Gatsby Charitable Foundation. \n
References \n
[1] Balsam, PD & Gibbon, J (1988). Journal of Experimental Psychology: Animal Behavior Processes, 14:401-412. \n[2] Balsam, PD & Schwartz, AL (1981). Journal of Experimental Psychology: Animal Behavior Processes, 7:382-393. \n[3] Dayan, P & Long, T (1997). Neural Information Processing Systems, 10:117-124. \n[4] Gallistel, CR & Gibbon, J (1999). Time, Rate, and Conditioning. Forthcoming. \n[5] Gallistel, CR, Mark, TS & King, A (1999). Is the Rat an Ideal Detector of Changes in Rates of Reward? Forthcoming. \n[6] Gamzu, ER & Williams, DR (1973). 
Journal of the Experimental Analysis of Behavior, 19:225-232. \n[7] Gibbon, J (1977). Psychological Review, 84:279-325. \n[8] Gibbon, J, Baldock, MD, Locurto, C, Gold, L & Terrace, HS (1977). Journal of Experimental Psychology: Animal Behavior Processes, 3:264-284. \n[9] Gibbon, J & Balsam, P (1981). In CM Locurto, HS Terrace & J Gibbon, editors, Autoshaping and Conditioning Theory, 219-253. New York, NY: Academic Press. \n[10] Gibbon, J, Farrell, L, Locurto, CM, Duncan, JH & Terrace, HS (1980). Animal Learning and Behavior, 8:45-59. \n[11] Jacobs, RA, Jordan, MI & Barto, AG (1991). Cognitive Science, 15:219-250. \n[12] Kakade, S & Dayan, P (2000). In preparation. \n[13] Rescorla, RA & Wagner, AR (1972). In AH Black & WF Prokasy, editors, Classical Conditioning II: Current Research and Theory, 64-69. New York, NY: Appleton-Century-Crofts. \n[14] Sutton, R (1992). In Proceedings of the 7th Yale Workshop on Adaptive and Learning Systems. \n", "award": [], "sourceid": 1777, "authors": [{"given_name": "Sham", "family_name": "Kakade", "institution": null}, {"given_name": "Peter", "family_name": "Dayan", "institution": null}]}