For the experiment in the MRI scanner, two tasks, Control and Other, were employed. Three conditions, one Control and two Others, were used in a separate behavioral experiment (Figure 1C). The settings for the Control and “Other I” tasks were the same as in the fMRI experiment, but in the “Other II” task, a risk-averse RL model was used to generate the other’s choices.

Several computational models, based on and modified from the Q-learning model (Sutton and Barto, 1998), were fit to the subjects’ choice behaviors in both tasks. In the Control task, the RL model, being risk neutral, constructed Q values of both stimuli; the value of a stimulus was the product of the stimulus’ reward probability, $p(A)$ (for stimulus A; the following description is made for this case), and the reward magnitude of the stimulus in a given trial, $R(A)$:

$$Q_A = p(A)\,R(A). \tag{1}$$

To account for possible risk behavior of the subjects, we followed the approach of Behrens et al. (2007) by using a simple nonlinear function (see the Supplemental Information for more details and for a control analysis of the nonlinear function). The choice probability is given by $q(A) = f(Q_A - Q_B)$, where $f$ is a sigmoidal function. The reward prediction error was used to update the stimulus’ reward probability (see the Supplemental Information for a control analysis):

$$\delta = r - p(A), \tag{2}$$

where $r$ is the reward outcome (1 if stimulus A is rewarded and 0 otherwise). The reward probability was updated using $p(A) \leftarrow p(A) + \eta\delta$.

In the Other task, the S-RLsRPE+sAPE model computed the subject’s choice probability using $q(A) = f(Q_A - Q_B)$; here, the value of a stimulus is the product of the subject’s fixed reward outcome and their reward probability based on simulating the other’s decision making, which is equivalent to the simulated-other’s choice probability: $q_O(A) = f(Q_O(A) - Q_O(B))$, wherein the other’s value of a stimulus is the product of the other’s reward magnitude of the stimulus and the simulated-other’s reward probability, $p_O(A)$. When the outcome for the other, $r_O$, was revealed, the S-RLsRPE+sAPE model updated the simulated-other’s reward probability, using both the sRPE and the sAPE:

$$p_O(A) \leftarrow p_O(A) + \eta_{\mathrm{sRPE}}\,\delta_O(A) + \eta_{\mathrm{sAPE}}\,\sigma_O(A), \tag{3}$$

where the two $\eta$’s indicate the respective learning rates. The sRPE was given by

$$\delta_O(A) = r_O - p_O(A). \tag{4}$$

The sAPE was defined at the value level so as to be comparable to the sRPE. It was first generated at the action level,

$$\sigma'_O(A) = I_A(A) - q_O(A) = 1 - q_O(A), \tag{5}$$

and then obtained by a variational transformation that pulls it back to the value level,

$$\sigma_O(A) = \sigma'_O(A)\,K \tag{6}$$

(see the Supplemental Information for the algebraic expression of K).
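The update rules in equations (1) through (6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the sigmoidal function f is assumed to be a logistic with an inverse-temperature parameter `beta`, the nonlinear risk function is omitted (risk-neutral case only), and K is passed in as a plain number that multiplies the action-level sAPE, following equation (6) as printed here, because its algebraic form is given only in the Supplemental Information.

```python
import math


def sigmoid(x, beta=1.0):
    """Sigmoidal choice function f. The logistic form and beta are assumptions."""
    return 1.0 / (1.0 + math.exp(-beta * x))


def control_choice_prob(p_a, p_b, r_a, r_b, beta=1.0):
    """Choice probability q(A) = f(Q_A - Q_B), with Q = p * R as in equation (1)."""
    value_a = p_a * r_a
    value_b = p_b * r_b
    return sigmoid(value_a - value_b, beta)


def control_update(p_a, rewarded, eta):
    """Equation (2) followed by p(A) <- p(A) + eta * delta."""
    r = 1.0 if rewarded else 0.0
    delta = r - p_a
    return p_a + eta * delta, delta


def simulated_other_update(p_o_a, q_o_a, r_o, eta_srpe, eta_sape, k):
    """Equations (3)-(6) for the case in which the other chose stimulus A."""
    srpe = r_o - p_o_a            # equation (4)
    sape_action = 1.0 - q_o_a     # equation (5), action level
    sape_value = sape_action * k  # equation (6) as printed; k is supplied by the caller
    p_new = p_o_a + eta_srpe * srpe + eta_sape * sape_value  # equation (3)
    return p_new, srpe, sape_value


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the study.
    print(control_choice_prob(0.6, 0.4, 10.0, 20.0, beta=0.5))
    print(control_update(0.5, True, 0.2))
    print(simulated_other_update(0.5, 0.7, 1.0, 0.2, 0.1, 1.0))
```

The sketch does not clip the updated probabilities to the interval [0, 1]; whether and how the original models constrained them is not stated in this excerpt.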
