# Policies.Exp3PlusPlus module¶

The EXP3++ randomized index policy, an improved version of the EXP3 policy.

Reference: [[One practical algorithm for both stochastic and adversarial bandits, by Y. Seldin & A. Slivkins, ICML, 2014](http://www.jmlr.org/proceedings/papers/v32/seldinb14-supp.pdf)].

See also [[An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits, by Y. Seldin & G. Lugosi, COLT, 2017](https://arxiv.org/pdf/1702.06103)].

Policies.Exp3PlusPlus.ALPHA = 3

Value for the $$\alpha$$ parameter.

Policies.Exp3PlusPlus.BETA = 256

Value for the $$\beta$$ parameter.

class Policies.Exp3PlusPlus.Exp3PlusPlus(nbArms, alpha=3, beta=256, lower=0.0, amplitude=1.0)[source]

The EXP3++ randomized index policy, an improved version of the EXP3 policy.

Reference: [[One practical algorithm for both stochastic and adversarial bandits, by Y. Seldin & A. Slivkins, ICML, 2014](http://www.jmlr.org/proceedings/papers/v32/seldinb14-supp.pdf)].

See also [[An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits, by Y. Seldin & G. Lugosi, COLT, 2017](https://arxiv.org/pdf/1702.06103)].

__init__(nbArms, alpha=3, beta=256, lower=0.0, amplitude=1.0)[source]

New policy.

alpha = None

$$\alpha$$ parameter for computations of $$\xi_t(a)$$.

beta = None

$$\beta$$ parameter for computations of $$\xi_t(a)$$.

weights = None

Weights on the arms

losses = None

Cumulative sum of loss estimates for each arm

unweighted_losses = None

Cumulative sum of unweighted losses for each arm

startGame()[source]

Start a new game: reset the weights to uniform and the cumulative losses to zero.

__str__() → str[source]

property eta

Decreasing sequence of learning rates, given by $$\eta_t = \frac{1}{2} \sqrt{\frac{\log K}{t K}}$$.

property gamma

Constant $$\gamma_t = \gamma$$.

property gap_estimate

Compute the gap estimate $$\widehat{\Delta}^{\mathrm{LCB}}_t(a)$$ from:

• Compute the UCB: $$\mathrm{UCB}_t(a) = \min\left( 1, \frac{\widehat{L}_{t-1}(a)}{N_{t-1}(a)} + \sqrt{\frac{\alpha \log(t K^{1/\alpha})}{2 N_{t-1}(a)}} \right)$$,

• Compute the LCB: $$\mathrm{LCB}_t(a) = \max\left( 0, \frac{\widehat{L}_{t-1}(a)}{N_{t-1}(a)} - \sqrt{\frac{\alpha \log(t K^{1/\alpha})}{2 N_{t-1}(a)}} \right)$$,

• Then the gap: $$\widehat{\Delta}^{\mathrm{LCB}}_t(a) = \max\left( 0, \mathrm{LCB}_t(a) - \min_{a'} \mathrm{UCB}_t(a') \right)$$.

• The gap should be in $$[0, 1]$$.
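The UCB/LCB computation above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the library's implementation: the function and argument names (`gap_estimate`, `cum_losses`, `pulls`) are hypothetical, and play counts are floored at 1 to avoid division by zero.

```python
import numpy as np

def gap_estimate(cum_losses, pulls, t, alpha=3):
    """Sketch of the LCB-based gap estimate of EXP3++ (illustrative names).

    `cum_losses` holds the cumulative unweighted losses, `pulls` the play
    counts N_{t-1}(a); `t` is the current round and K the number of arms.
    """
    K = len(cum_losses)
    n = np.maximum(pulls, 1)  # avoid dividing by zero for unplayed arms
    avg = cum_losses / n      # empirical mean loss of each arm
    conf = np.sqrt(alpha * np.log(t * K ** (1.0 / alpha)) / (2 * n))
    ucb = np.minimum(1.0, avg + conf)  # UCB_t(a), clipped above at 1
    lcb = np.maximum(0.0, avg - conf)  # LCB_t(a), clipped below at 0
    # Gap of each arm against the smallest UCB, clipped to [0, 1]
    return np.maximum(0.0, lcb - np.min(ucb))
```

For the arm achieving the minimum UCB, the gap is always 0, since its LCB is below its own UCB; clearly suboptimal arms get a strictly positive gap once their confidence intervals separate.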

property xi

Compute the $$\xi_t(a) = \frac{\beta \log t}{t \widehat{\Delta}^{\mathrm{LCB}}_t(a)^2}$$ vector of indexes.

property epsilon

Compute the vector of parameters $$\eta_t(a) = \min\left(\frac{1}{2 K}, \frac{1}{2} \sqrt{\frac{\log K}{t K}}, \xi_t(a) \right)$$.
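The two formulas above (the indexes $$\xi_t(a)$$ and the per-arm exploration parameters) can be sketched together as follows. The function name `exploration_parameters` and its arguments are hypothetical, not the library's API; the `errstate` guard lets a zero gap produce an infinite $$\xi_t(a)$$, which the outer minimum then discards.

```python
import numpy as np

def exploration_parameters(t, K, gaps, beta=256):
    """Sketch of the per-arm exploration parameters of EXP3++.

    xi_t(a) = beta * log(t) / (t * gap(a)^2), then the parameter is
    min(1/(2K), (1/2) sqrt(log(K) / (t K)), xi_t(a)).
    """
    with np.errstate(divide='ignore'):
        xi = beta * np.log(t) / (t * gaps ** 2)  # +inf when the gap is 0
    eta = 0.5 * np.sqrt(np.log(K) / (t * K))     # decreasing learning rate
    return np.minimum(np.minimum(1.0 / (2 * K), eta), xi)
```

Early on, when gap estimates are still zero, the parameter reduces to the minimum of $$\frac{1}{2K}$$ and the learning rate, so every arm keeps a uniform amount of exploration.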

property trusts

Update the trust probabilities according to the EXP3++ formula, with the parameters $$\eta_t(a)$$.

$\begin{split}\tilde{\rho}'_{t+1}(a) &= (1 - \sum_{a'=1}^{K}\eta_t(a')) w_t(a) + \eta_t(a), \\ \tilde{\rho}_{t+1} &= \tilde{\rho}'_{t+1} / \sum_{a=1}^{K} \tilde{\rho}'_{t+1}(a).\end{split}$

where $$w_t(a)$$ is the current weight of arm a.
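The mixing step above can be sketched as follows; the function name `trust_probabilities` and its arguments are illustrative, not the library's API. The normalized weights are mixed with the per-arm exploration terms, then renormalized into a probability vector.

```python
import numpy as np

def trust_probabilities(weights, eps):
    """Sketch of the EXP3++ trust probabilities (illustrative names)."""
    w = weights / np.sum(weights)          # current weight distribution w_t(a)
    mixed = (1.0 - np.sum(eps)) * w + eps  # (1 - sum eps) * w_t(a) + eps_t(a)
    return mixed / np.sum(mixed)           # renormalize to a probability vector
```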

getReward(arm, reward)[source]

Give a reward: accumulate the loss estimate for that arm a, then update the weight $$\rho_t(a)$$ and renormalize the weights.

• Divide by the trust on that arm a, i.e., the probability of observing arm a: $$\tilde{l}_t(a) = \frac{l_t(a)}{\tilde{\rho}_t(a)} 1(A_t = a)$$.

• Add this loss to the cumulative loss: $$\tilde{L}_t(a) := \tilde{L}_{t-1}(a) + \tilde{l}_t(a)$$.

• But the un-weighted loss is added to the other cumulative loss: $$\widehat{L}_t(a) := \widehat{L}_{t-1}(a) + l_t(a) 1(A_t = a)$$.

$\begin{split}\rho'_{t+1}(a) &= \exp\left( - \tilde{L}_t(a) \eta_t \right) \\ \rho_{t+1} &= \rho'_{t+1} / \sum_{a=1}^{K} \rho'_{t+1}(a).\end{split}$
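The update above can be sketched as follows. This is an illustrative sketch, not the library's `getReward` method: the function name `loss_update` and its arguments are hypothetical, and it returns fresh arrays instead of mutating the policy's state.

```python
import numpy as np

def loss_update(cum_loss_estimates, arm, loss, trust, eta):
    """Sketch of the EXP3++ update on one observed loss (illustrative names)."""
    L = cum_loss_estimates.copy()
    L[arm] += loss / trust        # importance-weighted loss: l_t(a) / trust_t(a)
    w = np.exp(-eta * L)          # rho'_{t+1}(a) = exp(-eta_t * L_t(a))
    return L, w / np.sum(w)       # renormalized weights rho_{t+1}
```

Dividing by the trust makes the loss estimate unbiased: arms that are rarely played receive larger loss increments when they are observed, compensating for the low probability of observing them.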

choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable).

• Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).

choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation of [0..K-1] that would order the arms by increasing trust probabilities.

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M best arms, i.e., identify the M-best set.

__module__ = 'Policies.Exp3PlusPlus'