# Policies.PHE module

The PHE, Perturbed-History Exploration, policy for bounded bandits.

• Reference: [Perturbed-History Exploration in Stochastic Multi-Armed Bandits, by Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier, 26 Feb 2019, arXiv:1902.10089](https://arxiv.org/abs/1902.10089)

Policies.PHE.DEFAULT_PERTURBATION_SCALE = 1.0

By default, the perturbation scale $$a$$ in PHE is 1; that is, at the current time step $$t$$, if there are $$s = T_{i,t-1}$$ samples of arm $$i$$, PHE generates $$\lceil a s \rceil = s$$ pseudo-rewards (each of mean $$1/2$$).
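As a quick illustration (not the library's code), the sketch below shows how many pseudo-rewards are drawn for a few values of $$a$$, assuming the pseudo-rewards are i.i.d. Bernoulli draws of mean $$1/2$$:

```python
import numpy as np

# Illustration only (not the library's code): number of pseudo-rewards
# drawn for an arm with s observed samples, for a few perturbation scales a.
# The pseudo-rewards are assumed to be i.i.d. Bernoulli(1/2) draws.
rng = np.random.default_rng(42)
s = 7
for a in (0.5, 1.0, 2.1):
    nb_pseudo = int(np.ceil(a * s))
    pseudo_rewards = rng.integers(0, 2, size=nb_pseudo)
    print(f"a={a}: {nb_pseudo} pseudo-rewards, empirical mean {pseudo_rewards.mean():.2f}")
```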

class Policies.PHE.PHE(nbArms, perturbation_scale=1.0, lower=0.0, amplitude=1.0)[source]

The PHE, Perturbed-History Exploration, policy for bounded bandits.

• Reference: [Perturbed-History Exploration in Stochastic Multi-Armed Bandits, by Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier, 26 Feb 2019, arXiv:1902.10089](https://arxiv.org/abs/1902.10089)

• They prove that PHE achieves a regret of $$\mathcal{O}(K \Delta^{-1} \log(T))$$ for horizon $$T$$, where $$\Delta$$ is the minimum gap between the expected rewards of the optimal and the suboptimal arms, for $$a > 1$$.

• Note that the limit case of $$a=0$$ gives the Follow-the-Leader algorithm (FTL), which is known to fail.

__init__(nbArms, perturbation_scale=1.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

• nbArms: the number of arms,

• lower, amplitude: lower value and known amplitude of the rewards.
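A minimal usage sketch, assuming the usual SMPyBandits index-policy interface (`startGame()`, `choice()`, `getReward()`) and the `pulls` attribute inherited from the base policy; the Bernoulli arm means below are an arbitrary toy example:

```python
import numpy as np
from Policies.PHE import PHE

means = [0.1, 0.5, 0.9]          # toy Bernoulli arms (arbitrary example)
policy = PHE(nbArms=len(means), perturbation_scale=1.0)
policy.startGame()

rng = np.random.default_rng(0)
for t in range(1000):
    arm = policy.choice()                      # arm with the largest perturbed-history index
    reward = float(rng.random() < means[arm])  # bounded reward in [0, 1]
    policy.getReward(arm, reward)

print("Pull counts:", policy.pulls)            # the best arm should dominate
```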

perturbation_scale = None

Perturbation scale, denoted $$a$$ in the paper. Should be a positive float or int. With $$s$$ current samples of an arm, $$\lceil a s \rceil$$ additional pseudo-rewards are generated.

__str__() -> str[source]

computeIndex(arm)[source]

Compute a randomized index by adding $$\lceil a s \rceil$$ pseudo-rewards (each of mean $$1/2$$) to the $$s$$ current observations of this arm, then taking the empirical mean of this perturbed history.
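A self-contained sketch of such a randomized index (an illustration, not the library's implementation; the Bernoulli $$1/2$$ pseudo-rewards and the infinite index for unexplored arms are assumptions):

```python
import numpy as np

def perturbed_history_index(arm_rewards, perturbation_scale=1.0, rng=None):
    """Sketch of a PHE-style index: empirical mean of the observed rewards
    together with ceil(a * s) pseudo-rewards of mean 1/2 (illustration only)."""
    rng = np.random.default_rng() if rng is None else rng
    s = len(arm_rewards)
    if s == 0:
        return float("inf")                      # try unexplored arms first (assumption)
    nb_pseudo = int(np.ceil(perturbation_scale * s))
    pseudo = rng.integers(0, 2, size=nb_pseudo)  # Bernoulli(1/2) pseudo-rewards
    return (np.sum(arm_rewards) + np.sum(pseudo)) / (s + nb_pseudo)

# Example: an arm observed 5 times, mostly with reward 1.
print(perturbed_history_index([1, 1, 0, 1, 1], rng=np.random.default_rng(1)))
```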

__module__ = 'Policies.PHE'