Policies.Exp3S module¶

The historical Exp3.S algorithm for non-stationary bandits.

Reference: [[“The nonstochastic multiarmed bandit problem”, P. Auer, N. Cesa-Bianchi, Y. Freund, R.E. Schapire, SIAM journal on computing, 2002]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8735&rep=rep1&type=pdf)

It is a simple extension of the Exp3 policy:

>>> policy = Exp3S(nbArms, C=1)
>>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)

It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).

class Policies.Exp3S.Exp3S(nbArms, gamma=None, alpha=None, gamma0=1.0, alpha0=1.0, horizon=None, max_nb_random_events=None, *args, **kwargs)[source]¶

Bases: Policies.Exp3.Exp3

The historical Exp3.S algorithm for non-stationary bandits.

Reference: [[“The nonstochastic multiarmed bandit problem”, P. Auer, N. Cesa-Bianchi, Y. Freund, R.E. Schapire, SIAM journal on computing, 2002]](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.21.8735&rep=rep1&type=pdf)

__init__(nbArms, gamma=None, alpha=None, gamma0=1.0, alpha0=1.0, horizon=None, max_nb_random_events=None, *args, **kwargs)[source]¶: New policy.

weights = None¶: Weights on the arms

__str__()[source]¶: -> str

property gamma¶: Constant \(\gamma_t = \gamma\).

property alpha¶: Constant \(\alpha_t = \alpha\).

startGame()[source]¶: Start with uniform weights.

property trusts¶

Update the trusts probabilities according to Exp3 formula, and the parameter \(\gamma_t\).

\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - \gamma_t) w_k(t) + \gamma_t \frac{1}{K}, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]

If \(w_k(t)\) is the current weight from arm k.

getReward(arm, reward)[source]¶

Give a reward: accumulate rewards on that arm k, then update the weight \(w_k(t)\) and renormalize the weights.

With unbiased estimators, divide by the trust on that arm k, i.e., the probability of observing arm k: \(\tilde{r}_k(t) = \frac{r_k(t)}{\mathrm{trusts}_k(t)}\).
But with a biased estimators, \(\tilde{r}_k(t) = r_k(t)\).

\[\begin{split}w'_k(t+1) &= w_k(t) \times \exp\left( \frac{\tilde{r}_k(t)}{\gamma_t N_k(t)} \right) \\ w(t+1) &= w'(t+1) / \sum_{k=1}^{K} w'_k(t+1).\end{split}\]

__module__ = 'Policies.Exp3S'¶