# Policies.PHE module

The PHE, Perturbed-History Exploration, policy for bounded bandits.

• Reference: [Perturbed-History Exploration in Stochastic Multi-Armed Bandits, by Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier, 26 Feb 2019, arXiv:1902.10089](https://arxiv.org/abs/1902.10089)

Policies.PHE.DEFAULT_PERTURBATION_SCALE = 1.0

By default, the perturbation scale $$a$$ in PHE is 1; that is, at the current time step $$t$$, if there are $$s = T_{i,t-1}$$ samples of arm $$i$$, PHE generates $$\lceil a s \rceil = s$$ pseudo-rewards (each of mean $$1/2$$).
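As a quick illustration (not the library's code), the sketch below shows how many pseudo-rewards are drawn for a few values of $$a$$, assuming the pseudo-rewards are i.i.d. Bernoulli draws of mean $$1/2$$:

```python
import numpy as np

# Illustration only (not the library's code): number of pseudo-rewards
# drawn for an arm with s observed samples, for a few perturbation scales a.
# The pseudo-rewards are assumed to be i.i.d. Bernoulli(1/2) draws.
rng = np.random.default_rng(42)
s = 7
for a in (0.5, 1.0, 2.1):
    nb_pseudo = int(np.ceil(a * s))
    pseudo_rewards = rng.integers(0, 2, size=nb_pseudo)
    print(f"a={a}: {nb_pseudo} pseudo-rewards, empirical mean {pseudo_rewards.mean():.2f}")
```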

class Policies.PHE.PHE(nbArms, perturbation_scale=1.0, lower=0.0, amplitude=1.0)[source]

The PHE, Perturbed-History Exploration, policy for bounded bandits.

• Reference: [Perturbed-History Exploration in Stochastic Multi-Armed Bandits, by Branislav Kveton, Csaba Szepesvari, Mohammad Ghavamzadeh, Craig Boutilier, 26 Feb 2019, arXiv:1902.10089](https://arxiv.org/abs/1902.10089)

• They prove that PHE achieves a regret of $$\mathcal{O}(K \Delta^{-1} \log(T))$$ for horizon $$T$$, where $$\Delta$$ is the minimum gap between the expected rewards of the optimal and the suboptimal arms, for $$a > 1$$.

• Note that the limit case of $$a=0$$ gives the Follow-the-Leader algorithm (FTL), which is known to fail.

__init__(nbArms, perturbation_scale=1.0, lower=0.0, amplitude=1.0)[source]

New generic index policy.

• nbArms: the number of arms,

• lower, amplitude: lower value and known amplitude of the rewards.
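A minimal usage sketch, assuming the usual SMPyBandits index-policy interface (`startGame()`, `choice()`, `getReward()`) and the `pulls` attribute inherited from the base policy; the Bernoulli arm means below are an arbitrary toy example:

```python
import numpy as np
from Policies.PHE import PHE

means = [0.1, 0.5, 0.9]          # toy Bernoulli arms (arbitrary example)
policy = PHE(nbArms=len(means), perturbation_scale=1.0)
policy.startGame()

rng = np.random.default_rng(0)
for t in range(1000):
    arm = policy.choice()                      # arm with the largest perturbed-history index
    reward = float(rng.random() < means[arm])  # bounded reward in [0, 1]
    policy.getReward(arm, reward)

print("Pull counts:", policy.pulls)            # the best arm should dominate
```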

perturbation_scale = None

Perturbation scale, denoted $$a$$ in the paper. Should be a positive float or int. With $$s$$ current samples of an arm, $$\lceil a s \rceil$$ additional pseudo-rewards are generated.

__str__() -> str[source]

computeIndex(arm)[source]

Compute a randomized index by adding $$\lceil a s \rceil$$ pseudo-rewards (each of mean $$1/2$$) to the $$s$$ current observations of this arm, then taking the empirical mean of this perturbed history.
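A self-contained sketch of such a randomized index (an illustration, not the library's implementation; the Bernoulli $$1/2$$ pseudo-rewards and the infinite index for unexplored arms are assumptions):

```python
import numpy as np

def perturbed_history_index(arm_rewards, perturbation_scale=1.0, rng=None):
    """Sketch of a PHE-style index: empirical mean of the observed rewards
    together with ceil(a * s) pseudo-rewards of mean 1/2 (illustration only)."""
    rng = np.random.default_rng() if rng is None else rng
    s = len(arm_rewards)
    if s == 0:
        return float("inf")                      # try unexplored arms first (assumption)
    nb_pseudo = int(np.ceil(perturbation_scale * s))
    pseudo = rng.integers(0, 2, size=nb_pseudo)  # Bernoulli(1/2) pseudo-rewards
    return (np.sum(arm_rewards) + np.sum(pseudo)) / (s + nb_pseudo)

# Example: an arm observed 5 times, mostly with reward 1.
print(perturbed_history_index([1, 1, 0, 1, 1], rng=np.random.default_rng(1)))
```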

__module__ = 'Policies.PHE'