Policies.Exp3PlusPlus module

The EXP3++ randomized index policy, an improved version of the EXP3 policy.

Reference: [[One practical algorithm for both stochastic and adversarial bandits, Y.Seldin & A.Slivkins, ICML, 2014](http://www.jmlr.org/proceedings/papers/v32/seldinb14-supp.pdf)].

See also [[An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits, by Y.Seldin & G.Lugosi, COLT, 2017](https://arxiv.org/pdf/1702.06103)].

Policies.Exp3PlusPlus.ALPHA = 3

Value for the \(\alpha\) parameter.

Policies.Exp3PlusPlus.BETA = 256

Value for the \(\beta\) parameter.

class Policies.Exp3PlusPlus.Exp3PlusPlus(nbArms, alpha=3, beta=256, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The EXP3++ randomized index policy, an improved version of the EXP3 policy.

Reference: [[One practical algorithm for both stochastic and adversarial bandits, Y.Seldin & A.Slivkins, ICML, 2014](http://www.jmlr.org/proceedings/papers/v32/seldinb14-supp.pdf)].

See also [[An Improved Parametrization and Analysis of the EXP3++ Algorithm for Stochastic and Adversarial Bandits, by Y.Seldin & G.Lugosi, COLT, 2017](https://arxiv.org/pdf/1702.06103)].

__init__(nbArms, alpha=3, beta=256, lower=0.0, amplitude=1.0)[source]

New policy.
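
A minimal usage sketch, assuming SMPyBandits is installed and the class is importable under the module path shown on this page; the Bernoulli arm means are made up for the example, and the loop follows the methods documented below (startGame, choice, getReward, estimatedOrder):

```python
import numpy as np
from Policies.Exp3PlusPlus import Exp3PlusPlus

nbArms = 3
means = [0.1, 0.5, 0.8]                 # hypothetical Bernoulli arm means
policy = Exp3PlusPlus(nbArms, alpha=3, beta=256)
policy.startGame()

rng = np.random.default_rng(42)
for t in range(1000):
    arm = policy.choice()                      # sample an arm from the trusts
    reward = float(rng.random() < means[arm])  # Bernoulli reward in [0, 1]
    policy.getReward(arm, reward)              # update losses and weights

print(policy.estimatedOrder())  # arms ordered by increasing trust
```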

alpha = None

\(\alpha\) parameter for computations of \(\xi_t(a)\).

beta = None

\(\beta\) parameter for computations of \(\xi_t(a)\).

weights = None

Weights on the arms

losses = None

Cumulative sum of loss estimates for each arm

unweighted_losses = None

Cumulative sum of unweighted losses for each arm

startGame()[source]

Start with uniform weights.
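
The attributes documented above (weights, losses, unweighted_losses) suggest the following hypothetical reset; this is a sketch of the documented behavior, not the library's actual internals:

```python
import numpy as np

# Hypothetical reset: uniform weights and zeroed cumulative losses.
def start_game(nbArms):
    weights = np.full(nbArms, 1.0 / nbArms)   # uniform weights over the K arms
    losses = np.zeros(nbArms)                 # cumulative importance-weighted losses
    unweighted_losses = np.zeros(nbArms)      # cumulative raw (unweighted) losses
    return weights, losses, unweighted_losses
```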

__str__()[source]

-> str

property eta

Decreasing sequence of learning rates, given by \(\eta_t = \frac{1}{2} \sqrt{\frac{\log K}{t K}}\).
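
A one-line sketch of this sequence, with `K` standing for the number of arms and \(t \geq 1\) assumed:

```python
import numpy as np

# Learning rate eta_t = (1/2) * sqrt(log(K) / (t * K)), assuming t >= 1.
def eta(t, K):
    return 0.5 * np.sqrt(np.log(K) / (t * K))
```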

property gamma

Constant \(\gamma_t = \gamma\).

property gap_estimate

Compute the gap estimate \(\widehat{\Delta}^{\mathrm{LCB}}_t(a)\) from the following steps (a sketch is given after this list):

  • Compute the UCB: \(\mathrm{UCB}_t(a) = \min\left( 1, \frac{\widehat{L}_{t-1}(a)}{N_{t-1}(a)} + \sqrt{\frac{\alpha \log(t K^{1/\alpha})}{2 N_{t-1}(a)}} \right)\),

  • Compute the LCB: \(\mathrm{LCB}_t(a) = \max\left( 0, \frac{\widehat{L}_{t-1}(a)}{N_{t-1}(a)} - \sqrt{\frac{\alpha \log(t K^{1/\alpha})}{2 N_{t-1}(a)}} \right)\),

  • Then the gap: \(\widehat{\Delta}^{\mathrm{LCB}}_t(a) = \max\left( 0, \mathrm{LCB}_t(a) - \min_{a'} \mathrm{UCB}_t(a') \right)\).

  • The gap should be in \([0, 1]\).
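
A sketch of these three steps, where `hatL` is the array of cumulative unweighted losses \(\widehat{L}_{t-1}(a)\) and `N` the array of pull counts \(N_{t-1}(a)\); all counts are assumed positive, and the names are illustrative rather than the library's:

```python
import numpy as np

# Gap estimate from the UCB/LCB on the mean losses; assumes every N[a] > 0.
def gap_estimate(hatL, N, t, alpha=3):
    K = len(N)
    mean_loss = hatL / N
    radius = np.sqrt(alpha * np.log(t * K ** (1.0 / alpha)) / (2 * N))
    ucb = np.minimum(1.0, mean_loss + radius)
    lcb = np.maximum(0.0, mean_loss - radius)
    return np.clip(lcb - np.min(ucb), 0.0, 1.0)  # gap kept in [0, 1]
```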

property xi

Compute the \(\xi_t(a) = \frac{\beta \log t}{t \widehat{\Delta}^{\mathrm{LCB}}_t(a)^2}\) vector of indexes.
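
A sketch of this index vector from the gap estimates above (for \(t \geq 2\); a zero gap yields an infinite index, which the minimum taken in \(\varepsilon_t(a)\) below absorbs):

```python
import numpy as np

# xi_t(a) = beta * log(t) / (t * gap(a)^2); infinite when the gap is zero.
def xi(gap, t, beta=256):
    with np.errstate(divide="ignore"):
        return beta * np.log(t) / (t * gap ** 2)
```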

property epsilon

Compute the vector of exploration parameters \(\varepsilon_t(a) = \min\left(\frac{1}{2 K}, \frac{1}{2} \sqrt{\frac{\log K}{t K}}, \xi_t(a) \right)\).
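
A sketch of this element-wise minimum, reusing the `eta` and `xi` sketches above:

```python
import numpy as np

# Per-arm exploration: min of 1/(2K), the learning rate eta_t, and xi_t(a).
def epsilon(xi_values, t, K):
    eta_t = 0.5 * np.sqrt(np.log(K) / (t * K))
    return np.minimum(np.minimum(1.0 / (2 * K), eta_t), xi_values)
```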

property trusts

Update the trusts probabilities according to the Exp3PlusPlus formula, with the exploration parameters \(\varepsilon_t(a)\).

\[\begin{split}\tilde{\rho}'_{t+1}(a) &= \left(1 - \sum_{a'=1}^{K} \varepsilon_t(a')\right) w_t(a) + \varepsilon_t(a), \\ \tilde{\rho}_{t+1} &= \tilde{\rho}'_{t+1} / \sum_{a=1}^{K} \tilde{\rho}'_{t+1}(a).\end{split}\]

where \(w_t(a)\) is the current weight of arm \(a\).
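
A sketch of this mixing step, with `w` the (normalized) weight vector and `eps` the exploration vector \(\varepsilon_t\):

```python
import numpy as np

# Mix the exponential weights with the exploration vector, then renormalize.
def trusts(w, eps):
    mixed = (1.0 - np.sum(eps)) * w + eps
    return mixed / np.sum(mixed)
```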

getReward(arm, reward)[source]

Give a reward: accumulate the loss on that arm a, then update the weight \(w_t(a)\) and renormalize the weights (a sketch of the update follows the formula below).

  • Divide the loss by the trust of that arm a, i.e., the probability of playing arm a: \(\tilde{l}_t(a) = \frac{l_t(a)}{\tilde{\rho}_t(a)} 1(A_t = a)\).

  • Add this loss to the cumulative loss: \(\tilde{L}_t(a) := \tilde{L}_{t-1}(a) + \tilde{l}_t(a)\).

  • But the unweighted loss is added to the other cumulative loss: \(\widehat{L}_t(a) := \widehat{L}_{t-1}(a) + l_t(a) 1(A_t = a)\).

\[\begin{split}w'_{t+1}(a) &= \exp\left( - \tilde{L}_t(a) \eta_t \right) \\ w_{t+1} &= w'_{t+1} / \sum_{a=1}^{K} w'_{t+1}(a).\end{split}\]
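
A sketch of this update, assuming the loss is the complement of the reward rescaled by `lower` and `amplitude` (a convention not spelled out on this page); `trust` is the probability of the played arm and `eta_t` the current learning rate:

```python
import numpy as np

# Accumulate the importance-weighted and raw losses for the played arm,
# then recompute and renormalize the exponential weights.
def get_reward(arm, reward, trust, eta_t, losses, unweighted_losses,
               lower=0.0, amplitude=1.0):
    loss = 1.0 - (reward - lower) / amplitude   # assumed loss convention, in [0, 1]
    losses[arm] += loss / trust                 # cumulative \tilde{L}_t(a)
    unweighted_losses[arm] += loss              # cumulative \widehat{L}_t(a)
    new_weights = np.exp(-eta_t * losses)       # w'_{t+1}(a)
    return new_weights / np.sum(new_weights)    # renormalized w_{t+1}
```
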
choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().
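
A sketch of the draw described here, with `trusts` the probability vector:

```python
import numpy as np

# One arm index sampled according to the trusts probabilities.
def choose(trusts):
    return np.random.choice(len(trusts), p=trusts)
```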

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable).

  • Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).

choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice() (see the sketch after choiceMultiple() below).

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().
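
A sketch of choiceFromSubSet and choiceMultiple; whether the library samples with or without replacement is not stated on this page, so this sketch draws without replacement, and the names are illustrative:

```python
import numpy as np

# nb distinct arms sampled according to the trusts probabilities.
def choice_multiple(trusts, nb=1):
    return np.random.choice(len(trusts), size=nb, replace=False, p=trusts)

# One arm sampled from a subset of arms, trusts renormalized on that subset.
def choice_from_subset(trusts, available_arms):
    p = trusts[available_arms] / np.sum(trusts[available_arms])
    return np.random.choice(available_arms, p=p)
```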

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation of [0..K-1] that would order the arms by increasing trust probabilities.
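
A sketch of this permutation with numpy.argsort:

```python
import numpy as np

# Permutation of [0..K-1] sorting the arms by increasing trust probability.
def estimated_order(trusts):
    return np.argsort(trusts)
```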

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M best arms, i.e., identify the set of the M best arms.
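
A sketch: keep the last M indexes of the increasing order above, i.e., the M arms with the largest trusts:

```python
import numpy as np

# Indexes of the M arms with the largest trusts (order within the set not guaranteed).
def estimated_best_arms(trusts, M=1):
    return np.argsort(trusts)[-M:]
```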

__module__ = 'Policies.Exp3PlusPlus'