For the experiment in the MRI scanner, two tasks, Control and Other, were employed. Three conditions, one Control and two Others, were used in a separate behavioral experiment (Figure 1C). The settings for the Control and “Other I” task were the same as in the fMRI experiment, but in the
“Other II” task, a risk-averse RL model was used to generate the other’s choices. Several computational models, based on and modified from the Q-learning model (Sutton and Barto, 1998), were fit to the subjects’ choice behaviors in both tasks. In the Control task, the RL model, being risk neutral, constructed Q values of both stimuli; the value of a stimulus was the product of the stimulus’ reward probability, $p(A)$ (for stimulus A; the following description is made for this case), and the reward magnitude of the stimulus in a given trial, $R(A)$: $Q_A = p(A)\,R(A)$ (Equation 1). To account for possible risk behavior of the subjects, we followed the approach of Behrens et al. (2007) by using a simple nonlinear function (see the Supplemental Information for more details and for a control analysis of the nonlinear function). The choice probability is given by $q(A) = f(Q_A - Q_B)$, where $f$ is a sigmoidal function. The reward prediction error was used to update the stimulus’ reward probability (see the Supplemental
Information for a control analysis), $\delta = r - p(A)$ (Equation 2), where $r$ is the reward outcome (1 if stimulus A is rewarded and 0 otherwise). The reward probability was updated using $p(A) \leftarrow p(A) + \eta\delta$. In the Other task, the S-RLsRPE+sAPE model computed the subject’s choice probability using $q(A) = f(Q_A - Q_B)$; here, the value of a stimulus is the product of the subject’s fixed reward outcome and their reward probability
based on simulating the other’s decision making, which is equivalent to the simulated-other’s choice probability: $q_O(A) = f(Q_O(A) - Q_O(B))$, wherein the other’s value of a stimulus is the product of the other’s reward magnitude of the stimulus and the simulated-other’s reward probability, $p_O(A)$. When the outcome for the other ($r_O$) was revealed, the S-RLsRPE+sAPE model updated the simulated-other’s reward probability using both the sRPE and the sAPE, $p_O(A) \leftarrow p_O(A) + \eta_{\mathrm{sRPE}}\,\delta_O(A) + \eta_{\mathrm{sAPE}}\,\sigma_O(A)$ (Equation 3), where the two $\eta$’s indicate the respective learning rates. The sRPE was given by $\delta_O(A) = r_O - p_O(A)$ (Equation 4). The sAPE was defined at the value level, making it comparable to the sRPE. It was first generated at the action level, $\sigma'_O(A) = I_A(A) - q_O(A) = 1 - q_O(A)$ (Equation 5), and then obtained by a variational transformation, pulled back to the value level, $\sigma_O(A) = \sigma'_O(A)/K$ (Equation 6) (see the Supplemental Information for the algebraic expression of $K$).
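The update rules described above can be illustrated with a minimal Python sketch. It is not the fitted model code, and several details are assumptions made only for illustration: the sigmoidal function f is taken to be a logistic function with a hypothetical inverse-temperature parameter `beta`; the reward probability of stimulus B is taken to be 1 − p(A); the nonlinear risk function is omitted; and K, whose expression is given only in the Supplemental Information, is passed in by the caller and applied as a divisor. All function and argument names are hypothetical.

```python
import math


def sigmoid(x, beta=1.0):
    """Logistic stand-in for the sigmoidal function f (form and beta are assumptions)."""
    return 1.0 / (1.0 + math.exp(-beta * x))


def choice_prob(p_a, mag_a, mag_b, beta=1.0):
    """q(A) = f(Q_A - Q_B), with each Q value = reward probability * magnitude."""
    q_val_a = p_a * mag_a
    q_val_b = (1.0 - p_a) * mag_b  # assumption: p(B) = 1 - p(A)
    return sigmoid(q_val_a - q_val_b, beta)


def control_update(p_a, reward, eta):
    """Control task: delta = r - p(A), then p(A) <- p(A) + eta * delta."""
    delta = reward - p_a
    return p_a + eta * delta, delta


def other_update(p_o_a, q_o_a, r_o, eta_srpe, eta_sape, k):
    """Other task, for a trial in which the other chose stimulus A.

    k is supplied by the caller; its algebraic form is not reproduced here.
    """
    srpe = r_o - p_o_a            # simulated-other's reward prediction error
    sape_action = 1.0 - q_o_a     # action prediction error at the action level
    sape_value = sape_action / k  # pulled back to the value level
    new_p = p_o_a + eta_srpe * srpe + eta_sape * sape_value
    return new_p, srpe, sape_value


if __name__ == "__main__":
    # Control task: one rewarded trial for stimulus A.
    p_a, delta = control_update(p_a=0.5, reward=1, eta=0.2)
    print(p_a, delta, choice_prob(p_a, mag_a=10, mag_b=10, beta=0.3))

    # Other task: simulate the other's choice, then learn from the outcome.
    p_o = 0.5
    q_o = choice_prob(p_o, mag_a=10, mag_b=10, beta=0.3)
    p_o, srpe, sape = other_update(p_o, q_o, r_o=1, eta_srpe=0.2, eta_sape=0.1, k=2.0)
    print(p_o, srpe, sape)
```

In this sketch the same `choice_prob` function serves for the subject in the Control task and for the simulated other in the Other task, since both are described as a sigmoid of a value difference. The updated probability is not clipped to [0, 1]; whether and how the original models constrained it is not stated in this section.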