Policies.DiscountedBayesianIndexPolicy module

Discounted Bayesian index policy.


This is still highly experimental!

Policies.DiscountedBayesianIndexPolicy.GAMMA = 0.95

Default value for the discount factor \(\gamma\in(0,1)\). 0.95 is empirically a reasonable value for short-term non-stationary experiments.
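As a rule of thumb for choosing \(\gamma\), the geometric weights \(\gamma^i\) sum to \(1/(1-\gamma)\), so a discount factor defines an effective memory of roughly that many recent pulls (a standard geometric-series fact, not code from this module):

```python
# Effective window of a discount factor: sum_{i >= 0} gamma^i = 1 / (1 - gamma).
# With the default gamma = 0.95, roughly the last 20 pulls dominate the counts.
gamma = 0.95
effective_window = 1.0 / (1.0 - gamma)
print(effective_window)  # ~20 (up to floating-point rounding)
```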

class Policies.DiscountedBayesianIndexPolicy.DiscountedBayesianIndexPolicy(nbArms, gamma=0.95, posterior=<class 'Policies.Posterior.DiscountedBeta.DiscountedBeta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: Policies.BayesianIndexPolicy.BayesianIndexPolicy

Discounted Bayesian index policy.

  • By default, it uses a DiscountedBeta posterior (Policies.Posterior.DiscountedBeta), one by arm.

  • Use discount factor \(\gamma\in(0,1)\).

  • It keeps \(\widetilde{S_k}(t)\) and \(\widetilde{F_k}(t)\), the discounted counts of successes and failures, for each arm \(k\).

  • Instead of using the plain counts, \(\widetilde{S_k}(t) = S_k(t)\) and \(\widetilde{F_k}(t) = F_k(t)\), both are updated at each time step using the discount factor \(\gamma\):

\[\begin{split}\widetilde{S_{A(t)}}(t+1) &= \gamma \widetilde{S_{A(t)}}(t) + r(t),\\ \widetilde{S_{k'}}(t+1) &= \gamma \widetilde{S_{k'}}(t), \forall k' \neq A(t).\end{split}\]
\[\begin{split}\widetilde{F_{A(t)}}(t+1) &= \gamma \widetilde{F_{A(t)}}(t) + (1 - r(t)),\\ \widetilde{F_{k'}}(t+1) &= \gamma \widetilde{F_{k'}}(t), \forall k' \neq A(t).\end{split}\]
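The two update rules above can be sketched in plain Python. Note `discounted_update` is a hypothetical helper for illustration, not this module's API (the real class delegates the updates to its per-arm `DiscountedBeta` posteriors):

```python
def discounted_update(S, F, arm, reward, gamma=0.95):
    """Discount every arm's success/failure counts by gamma, then credit
    the pulled arm with the new binary reward r(t) in {0, 1}."""
    for k in range(len(S)):
        S[k] *= gamma          # S~_k(t+1) = gamma * S~_k(t)
        F[k] *= gamma          # F~_k(t+1) = gamma * F~_k(t)
    S[arm] += reward           # + r(t) for the pulled arm A(t)
    F[arm] += 1 - reward       # + (1 - r(t)) for the pulled arm A(t)

# Example: two arms, pull arm 0 and observe reward 1
S, F = [1.0, 2.0], [3.0, 4.0]
discounted_update(S, F, arm=0, reward=1, gamma=0.5)
# S == [1.5, 1.0], F == [1.5, 2.0]
```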
__init__(nbArms, gamma=0.95, posterior=<class 'Policies.Posterior.DiscountedBeta.DiscountedBeta'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Create a new Bayesian policy, by creating a default posterior on each arm.

gamma = None

Discount factor \(\gamma\in(0,1)\).


__str__() -> str[source]

getReward(arm, reward)[source]

Update the posterior on each arm, with the normalized reward.
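For intuition, one step of discounted Thompson sampling built on such counts might look as follows. This is a sketch assuming \(\mathrm{Beta}(1, 1)\) priors (the `+ 1.0` smoothing on each count), not the module's actual index computation:

```python
import random

def choose_arm(S, F, rng=random):
    """Sample theta_k ~ Beta(S~_k + 1, F~_k + 1) for each arm and
    pull the arm with the largest sample (Thompson sampling)."""
    samples = [rng.betavariate(S[k] + 1.0, F[k] + 1.0) for k in range(len(S))]
    return max(range(len(S)), key=samples.__getitem__)

# With overwhelming evidence for arm 0, it is chosen almost surely:
random.seed(0)
print(choose_arm([100.0, 0.0], [0.0, 100.0]))  # 0
```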

__module__ = 'Policies.DiscountedBayesianIndexPolicy'