Policies.Hedge module

The Hedge randomized index policy.

Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)

Policies.Hedge.EPSILON = 0.01

Default \(\varepsilon\) parameter.

class Policies.Hedge.Hedge(nbArms, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The Hedge randomized index policy.

Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf).

__init__(nbArms, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

New policy.

weights = None

Weights on the arms

startGame()[source]

Start with uniform weights.
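
A minimal sketch of this initialization, written as a standalone function (not the library's exact method; only numpy is assumed):

```python
import numpy as np

def startGame_sketch(nbArms):
    """Uniform initial weights: every arm starts with weight 1/K."""
    return np.full(nbArms, 1.0 / nbArms)
```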

__str__()[source]

-> str

property epsilon

Constant \(\varepsilon_t = \varepsilon\).

property trusts

Update the trust probabilities according to the Hedge formula, with the parameter \(\varepsilon_t\).

\[\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - \varepsilon_t) w_k(t) + \varepsilon_t \frac{1}{K}, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]

where \(w_k(t)\) is the current weight of arm \(k\).
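
A minimal sketch of this mixing-and-renormalizing step, as a standalone function (the names weights and epsilon mirror the attribute and property documented above; this is not the library's exact code):

```python
import numpy as np

def trusts_sketch(weights, epsilon):
    """Mix the normalized weights with the uniform distribution, then renormalize."""
    K = len(weights)
    mixed = (1.0 - epsilon) * np.asarray(weights) + epsilon / K
    return mixed / np.sum(mixed)
```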

getReward(arm, reward)[source]

Give a reward: accumulate rewards on that arm k, then update the weight \(w_k(t)\) and renormalize the weights.

\[\begin{split}w'_k(t+1) &= w_k(t) \times \exp\left( \frac{\tilde{r}_k(t)}{\varepsilon_t N_k(t)} \right) \\ w(t+1) &= w'(t+1) / \sum_{k=1}^{K} w'_k(t+1).\end{split}\]
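
A hedged sketch of this multiplicative update on the played arm (the reward accumulation is omitted; r_tilde and N_k are illustrative names for the scaled reward \(\tilde{r}_k(t)\) and the number of pulls of arm k):

```python
import numpy as np

def weight_update_sketch(weights, arm, r_tilde, epsilon, N_k):
    """Multiply the played arm's weight by exp(r_tilde / (epsilon * N_k)), then renormalize."""
    new_weights = np.array(weights, dtype=float)
    new_weights[arm] *= np.exp(r_tilde / (epsilon * N_k))
    return new_weights / np.sum(new_weights)
```
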
choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().
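
For instance, the draw can be reproduced directly with numpy, assuming trusts is the probability vector from the property above:

```python
import numpy as np

def choice_sketch(trusts):
    """One random arm index, drawn with probabilities given by trusts."""
    return np.random.choice(len(trusts), p=trusts)
```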

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (least probable).

  • Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).

choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().
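
A sketch of such a multiple draw (whether the library samples with or without replacement is an assumption here; this version draws nb distinct arms):

```python
import numpy as np

def choiceMultiple_sketch(trusts, nb=1):
    """nb distinct arm indexes, drawn with probabilities given by trusts."""
    return np.random.choice(len(trusts), size=nb, replace=False, p=trusts)
```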

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing trust probabilities.

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M best arms. Identify the set of the M best arms.
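
Both quantities can be sketched with a single argsort on the trusts vector (illustrative functions, not the library's exact code):

```python
import numpy as np

def estimatedOrder_sketch(trusts):
    """Permutation of [0..K-1] ordering the arms by increasing trust probability."""
    return np.argsort(trusts)

def estimatedBestArms_sketch(trusts, M=1):
    """Indexes of the M arms with the largest trusts (not necessarily sorted)."""
    return np.argsort(trusts)[-M:]
```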

class Policies.Hedge.HedgeWithHorizon(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Hedge.Hedge

Hedge with fixed epsilon, \(\varepsilon_t = \varepsilon_0\), chosen with knowledge of the horizon.

__init__(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__()[source]

-> str

property epsilon

Fixed \(\varepsilon\), small, chosen with knowledge of the horizon: \(\varepsilon_t = \sqrt{\frac{2 \log(K)}{T K}}\) (heuristic).
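
A sketch of this heuristic, written as a standalone function (K is the number of arms, T the known horizon):

```python
import numpy as np

def epsilon_with_horizon(nbArms, horizon):
    """Heuristic constant epsilon = sqrt(2 * log(K) / (T * K))."""
    return np.sqrt(2.0 * np.log(nbArms) / (horizon * nbArms))
```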

class Policies.Hedge.HedgeDecreasing(nbArms, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Hedge.Hedge

Hedge with decreasing parameter \(\varepsilon_t\).

__str__()[source]

-> str

property epsilon

Decreasing \(\varepsilon_t\) with time: \(\varepsilon_t = \min\left(\frac{1}{K}, \sqrt{\frac{\log(K)}{t K}}\right)\) (heuristic).
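
A sketch of this heuristic, for a time step t >= 1 (standalone function, not the library's exact code):

```python
import numpy as np

def epsilon_decreasing(nbArms, t):
    """Heuristic decreasing epsilon_t = min(1/K, sqrt(log(K) / (t * K))), for t >= 1."""
    return min(1.0 / nbArms, np.sqrt(np.log(nbArms) / (t * nbArms)))
```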
