# Policies.Hedge module¶

The Hedge randomized index policy.

Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf)

Policies.Hedge.EPSILON = 0.01

Default $$\varepsilon$$ parameter.

class Policies.Hedge.Hedge(nbArms, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

The Hedge randomized index policy.

Reference: [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, S.Bubeck & N.Cesa-Bianchi, §3.1](http://research.microsoft.com/en-us/um/people/sebubeck/SurveyBCB12.pdf).

__init__(nbArms, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

New policy.

weights = None

Weights on the arms

startGame()[source]

__str__() -> str[source]

property epsilon

Constant $$\varepsilon_t = \varepsilon$$.

property trusts

Update the trust probabilities according to the Hedge formula, with parameter $$\varepsilon_t$$:

$\begin{split}\mathrm{trusts}'_k(t+1) &= (1 - \varepsilon_t) w_k(t) + \varepsilon_t \frac{1}{K}, \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}$

where $$w_k(t)$$ is the current weight of arm k.
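
The mixing-and-renormalizing step above can be sketched in a few lines of numpy. The names `weights` and `epsilon` are illustrative stand-ins, not the class's actual attributes, and we assume uniform initial weights:

```python
import numpy as np

# A minimal sketch of the trusts computation, assuming K arms,
# a weight vector `weights` and a constant epsilon (illustrative values)
K = 5
epsilon = 0.01
weights = np.full(K, 1.0 / K)  # uniform initial weights

# Mix the weights with the uniform distribution over the K arms,
# then renormalize so the trusts sum to one
trusts = (1 - epsilon) * weights + epsilon / K
trusts = trusts / np.sum(trusts)
```

With uniform weights the mixture is itself uniform; the $$\varepsilon_t / K$$ term guarantees every arm keeps a positive probability of being drawn.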

getReward(arm, reward)[source]

Give a reward: accumulate rewards on that arm k, then update the weight $$w_k(t)$$ and renormalize the weights.

$\begin{split}w'_k(t+1) &= w_k(t) \times \exp\left( \frac{\tilde{r}_k(t)}{\varepsilon_t N_k(t)} \right) \\ w(t+1) &= w'(t+1) / \sum_{k=1}^{K} w'_k(t+1).\end{split}$

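
The multiplicative update above can be sketched as follows. The names `rewards`, `pulls` and `epsilon` are illustrative, not the class's real attributes: `rewards[k]` plays the role of $$\tilde{r}_k(t)$$ and `pulls[k]` of $$N_k(t)$$:

```python
import numpy as np

# Hypothetical per-arm statistics (illustrative values, not real attributes)
K = 3
epsilon = 0.5
weights = np.full(K, 1.0 / K)
rewards = np.array([0.9, 0.1, 0.5])  # accumulated reward per arm
pulls = np.array([3, 3, 3])          # number of pulls N_k(t) per arm

# w'_k(t+1) = w_k(t) * exp(r_k(t) / (epsilon * N_k(t))), then renormalize
weights = weights * np.exp(rewards / (epsilon * pulls))
weights = weights / np.sum(weights)
```

After the update, the arm with the largest accumulated reward carries the largest weight.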
choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().
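
The selection boils down to a single weighted draw; a minimal sketch with an illustrative trust vector:

```python
import numpy as np

# One weighted draw over the arms, with probabilities given by trusts
# (the trust values here are illustrative)
trusts = np.array([0.1, 0.6, 0.3])
arm = np.random.choice(len(trusts), p=trusts)  # an index in {0, 1, 2}
```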

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), keeping the last one (the least probable).

• Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).

choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation of [0..K-1] that would sort the arms by increasing trust probabilities.
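
This amounts to an argsort of the trust vector; a sketch with illustrative values:

```python
import numpy as np

# Arm indexes sorted by increasing trust (trust values are illustrative)
trusts = np.array([0.2, 0.5, 0.1, 0.2])
order = np.argsort(trusts)
# order[0] is the least trusted arm, order[-1] the most trusted one
```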

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M best arms, i.e., identify the set of the M best arms.

__module__ = 'Policies.Hedge'

class Policies.Hedge.HedgeWithHorizon(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

Hedge with fixed epsilon, $$\varepsilon_t = \varepsilon_0$$, chosen with a knowledge of the horizon.

__init__(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter $$T$$ = known horizon of the experiment.

__str__() -> str[source]

property epsilon

Fixed temperature, small, chosen with knowledge of the horizon: $$\varepsilon_t = \sqrt{\frac{2 \log(K)}{T K}}$$ (heuristic).
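
The heuristic above is computed once from the known horizon; a sketch with illustrative values of $$K$$ and $$T$$:

```python
import numpy as np

# epsilon_0 = sqrt(2 log(K) / (T K)), fixed for the whole run
# (K and T are illustrative values)
K, T = 10, 10000
epsilon0 = np.sqrt(2 * np.log(K) / (T * K))
```

Larger horizons give a smaller, more conservative $$\varepsilon_0$$.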

__module__ = 'Policies.Hedge'
class Policies.Hedge.HedgeDecreasing(nbArms, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

Hedge with decreasing parameter $$\varepsilon_t$$.

__str__() -> str[source]

property epsilon

Epsilon decreasing with time: $$\varepsilon_t = \min\left(\frac{1}{K}, \sqrt{\frac{\log(K)}{t K}}\right)$$ (heuristic).
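
The schedule can be sketched as a small function of the time step $$t$$, with an illustrative number of arms:

```python
import numpy as np

K = 10  # illustrative number of arms

def epsilon_t(t):
    """Decreasing schedule epsilon_t = min(1/K, sqrt(log(K) / (t K)))."""
    return min(1.0 / K, np.sqrt(np.log(K) / (t * K)))

# The schedule is capped at 1/K for small t, then decreases like 1/sqrt(t)
```

The cap at $$1/K$$ dominates early on; once $$t$$ exceeds roughly $$K \log(K)$$, the $$1/\sqrt{t}$$ term takes over.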

__module__ = 'Policies.Hedge'