# Policies.UCBV module

The UCB-V policy for bounded bandits, with a variance correction term. Reference: [Audibert, Munos, & Szepesvári - Theoret. Comput. Sci., 2009].

class Policies.UCBV.UCBV(nbArms, lower=0.0, amplitude=1.0)[source]

The UCB-V policy for bounded bandits, with a variance correction term. Reference: [Audibert, Munos, & Szepesvári - Theoret. Comput. Sci., 2009].

__str__() -> str[source]

String representation of the policy.

__init__(nbArms, lower=0.0, amplitude=1.0)[source]

New generic index policy.

• nbArms: the number of arms,

• lower, amplitude: lower value and known amplitude of the rewards.
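
For orientation, here is a minimal usage sketch of this API on a toy Bernoulli problem. Only methods documented on this page are used; the import path and the argmax-based interaction loop are assumptions (the library's generic index policy provides an equivalent choice() method):

```python
import numpy as np
from Policies.UCBV import UCBV  # assumed import path for this module

rng = np.random.default_rng(42)
means = [0.1, 0.5, 0.9]  # hidden Bernoulli means of the 3 arms

policy = UCBV(nbArms=3, lower=0.0, amplitude=1.0)
policy.startGame()

for t in range(1000):
    # Pull the arm with the largest UCB-V index.
    indexes = [policy.computeIndex(arm) for arm in range(3)]
    arm = int(np.argmax(indexes))
    reward = float(rng.random() < means[arm])  # Bernoulli reward in [0, 1]
    policy.getReward(arm, reward)
```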

rewardsSquared = None

Keep track of the squared rewards, to compute an empirical variance.
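
Storing both running sums makes the empirical variance of an arm an O(1) computation, with no need to keep individual rewards. A tiny illustration (the names X, Z, N are illustrative values, not attributes of the class):

```python
# X = sum of rewards, Z = sum of squared rewards, N = number of pulls.
X, Z, N = 12.4, 9.8, 30  # illustrative values for one arm
mean = X / N
variance = Z / N - mean ** 2  # V_k(t) = Z_k(t)/N_k(t) - mu_k(t)^2
```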

startGame()[source]

Initialize the policy for a new game.

getReward(arm, reward)[source]

Give a reward: increase t and the number of pulls of that arm, then update the cumulated sums of rewards and of squared rewards for that arm (rewards are normalized to [0, 1]).
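
A minimal sketch of the bookkeeping this method performs, assuming the counters and cumulative sums are stored in NumPy arrays (all attribute names except rewardsSquared are assumptions based on the generic policy interface):

```python
def getReward(self, arm, reward):
    """Sketch: update counters and both cumulative sums for one arm."""
    self.t += 1                                      # global time step
    self.pulls[arm] += 1                             # pulls of this arm
    reward = (reward - self.lower) / self.amplitude  # normalize to [0, 1]
    self.rewards[arm] += reward                      # X_k(t): sum of rewards
    self.rewardsSquared[arm] += reward ** 2          # Z_k(t): sum of squared rewards
```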

computeIndex(arm)[source]

Compute the current index, at time t and after $N_k(t)$ pulls of arm k:

$$\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ V_k(t) &= \frac{Z_k(t)}{N_k(t)} - \hat{\mu}_k(t)^2, \\ I_k(t) &= \hat{\mu}_k(t) + \sqrt{\frac{2 \log(t) V_k(t)}{N_k(t)}} + 3 (b - a) \frac{\log(t)}{N_k(t)}.\end{split}$$

Here rewards lie in $[a, b]$, and $V_k(t)$ is an empirical estimator of the variance of the rewards of arm k, computed from $X_k(t) = \sum_{\sigma=1}^{t} \mathbb{1}(A(\sigma) = k)\, r_k(\sigma)$, the sum of rewards obtained from arm k, and $Z_k(t) = \sum_{\sigma=1}^{t} \mathbb{1}(A(\sigma) = k)\, r_k(\sigma)^2$, the sum of squared rewards.
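
A direct transcription of this formula into Python, as a sketch under the same assumed attribute names as above; returning +inf for an arm that was never pulled makes the policy try every arm once, the usual convention for index policies. After normalization, b - a = 1.

```python
from math import log, sqrt, inf

def computeIndex(self, arm):
    """Sketch: UCB-V index of one arm, from the running sums."""
    N = self.pulls[arm]
    if N < 1:
        return inf  # unexplored arm: try it first
    mean = self.rewards[arm] / N                                   # hat(mu)_k(t)
    variance = max(0.0, self.rewardsSquared[arm] / N - mean ** 2)  # V_k(t), clipped at 0
    return mean + sqrt(2.0 * log(self.t) * variance / N) + 3.0 * log(self.t) / N
```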

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.
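
A vectorized NumPy sketch of the same computation over all arms at once, assuming the same array attributes and that results are stored in an index array, as in the generic index policy:

```python
import numpy as np

def computeAllIndex(self):
    """Sketch: UCB-V indexes of all arms in one vectorized pass."""
    with np.errstate(divide='ignore', invalid='ignore'):
        means = self.rewards / self.pulls
        variances = np.maximum(self.rewardsSquared / self.pulls - means ** 2, 0.0)
        indexes = (means
                   + np.sqrt(2.0 * np.log(self.t) * variances / self.pulls)
                   + 3.0 * np.log(self.t) / self.pulls)
    indexes[self.pulls < 1] = np.inf  # unexplored arms come first
    self.index[:] = indexes
```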

__module__ = 'Policies.UCBV'