Policies.Thompson module¶

The Thompson (Bayesian) index policy.

• By default, it uses a Beta posterior (Policies.Posterior.Beta), one per arm.
• Reference: [Thompson - Biometrika, 1933].
class Policies.Thompson.Thompson(nbArms, posterior=<class 'Policies.Posterior.Beta.Beta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

The Thompson (Bayesian) index policy.

• By default, it uses a Beta posterior (Policies.Posterior.Beta), one per arm.

• The prior is initially flat, i.e., $$a=\alpha_0=1$$ and $$b=\beta_0=1$$.

• A non-flat prior for each arm can be given with parameters a and b, for instance:

import numpy as np
from Policies.Thompson import Thompson

nbArms = 2
prior_failures  = a = 100
prior_successes = b = 50
policy = Thompson(nbArms, a=a, b=b)
np.mean([policy.choice() for _ in range(1000)])  # 0.515 ~= 0.5: both arms share the same prior!

• A different prior for each arm can be given with parameters params_for_each_posterior, for instance:

import numpy as np
from Policies.Thompson import Thompson

nbArms = 2
params0 = {'a': 10, 'b': 5}  # prior mean 1/3 (a = prior failures, b = prior successes, as above)
params1 = {'a': 5, 'b': 10}  # prior mean 2/3
params = [params0, params1]
policy = Thompson(nbArms, params_for_each_posterior=params)
np.mean([policy.choice() for _ in range(1000)])  # 0.9719 ~= 1: arm 1 is better than arm 0!

• Reference: [Thompson - Biometrika, 1933].
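The behavior described above can be illustrated with a minimal, self-contained sketch of Beta–Thompson sampling on Bernoulli arms, using only NumPy. This is not the library's implementation; the arm means, horizon, and seed are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
means = [0.2, 0.8]   # hypothetical Bernoulli arm means
horizon = 500
nbArms = len(means)

# Flat Beta(1, 1) prior on each arm: success counts S_k and pull counts N_k start at 0.
successes = np.zeros(nbArms)
pulls = np.zeros(nbArms)

for t in range(horizon):
    # Sample one index per arm from its Beta posterior, then play the argmax.
    samples = rng.beta(1 + successes, 1 + pulls - successes)
    arm = int(np.argmax(samples))
    reward = rng.random() < means[arm]   # Bernoulli reward
    successes[arm] += reward
    pulls[arm] += 1

print(pulls)  # the best arm (index 1) should receive the large majority of the pulls
```

As the posteriors concentrate, the policy samples the suboptimal arm less and less often, which is the exploration–exploitation trade-off that Thompson sampling resolves by randomization.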

__str__() -> str[source]

computeIndex(arm)[source]

Compute the current index, at time t and after $$N_k(t)$$ pulls of arm k that yielded $$S_k(t)$$ rewards of 1, by sampling from the Beta posterior:

$$\begin{split}A(t) &\sim U(\arg\max_{1 \leq k \leq K} I_k(t)),\\ I_k(t) &\sim \mathrm{Beta}(1 + \tilde{S}_k(t), 1 + \tilde{N}_k(t) - \tilde{S}_k(t)).\end{split}$$
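The two sampling steps above can be sketched in plain NumPy. This is a hypothetical standalone version, not the library's code: the rescaled counts are replaced by plain integer counts, and the function names are chosen to mirror the API:

```python
import numpy as np

rng = np.random.default_rng(0)

def compute_index(successes, pulls, rng):
    """Sample I_k(t) ~ Beta(1 + S_k(t), 1 + N_k(t) - S_k(t)) for every arm k at once."""
    return rng.beta(1 + successes, 1 + pulls - successes)

def choice(successes, pulls, rng):
    """Draw A(t) uniformly at random among the arms maximizing the sampled indexes."""
    indexes = compute_index(successes, pulls, rng)
    best = np.flatnonzero(indexes == indexes.max())
    return int(rng.choice(best))

# Arm 0 looks much better a posteriori (50/100 successes vs 5/100),
# so it should be chosen almost every time.
S = np.array([50.0, 5.0])
N = np.array([100.0, 100.0])
choices = [choice(S, N, rng) for _ in range(200)]
```

The uniform tie-breaking in `choice` only matters when several sampled indexes are exactly equal, which is almost never the case with continuous Beta samples, but it matches the $U(\arg\max)$ notation in the formula.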
__module__ = 'Policies.Thompson'