# Policies.UCBV module

The UCB-V policy for bounded bandits, with a variance correction term. Reference: [Audibert, Munos, & Szepesvári - Theoret. Comput. Sci., 2009].

class Policies.UCBV.UCBV(nbArms, lower=0.0, amplitude=1.0)[source]

The UCB-V policy for bounded bandits, with a variance correction term. Reference: [Audibert, Munos, & Szepesvári - Theoret. Comput. Sci., 2009].

__str__() -> str[source]

String representation of the policy.

__init__(nbArms, lower=0.0, amplitude=1.0)[source]

New generic index policy.

• nbArms: the number of arms,

• lower, amplitude: lower value and known amplitude of the rewards.
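
For orientation, here is a minimal usage sketch of this API on a toy Bernoulli problem. Only methods documented on this page are used; the import path and the argmax-based interaction loop are assumptions (the library's generic index policy provides an equivalent choice() method):

```python
import numpy as np
from Policies.UCBV import UCBV  # assumed import path for this module

rng = np.random.default_rng(42)
means = [0.1, 0.5, 0.9]  # hidden Bernoulli means of the 3 arms

policy = UCBV(nbArms=3, lower=0.0, amplitude=1.0)
policy.startGame()

for t in range(1000):
    # Pull the arm with the largest UCB-V index.
    indexes = [policy.computeIndex(arm) for arm in range(3)]
    arm = int(np.argmax(indexes))
    reward = float(rng.random() < means[arm])  # Bernoulli reward in [0, 1]
    policy.getReward(arm, reward)
```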

rewardsSquared = None

Keep track of the squared rewards, to compute an empirical variance.
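
Storing both running sums makes the empirical variance of an arm an O(1) computation, with no need to keep individual rewards. A tiny illustration (the names X, Z, N are illustrative values, not attributes of the class):

```python
# X = sum of rewards, Z = sum of squared rewards, N = number of pulls.
X, Z, N = 12.4, 9.8, 30  # illustrative values for one arm
mean = X / N
variance = Z / N - mean ** 2  # V_k(t) = Z_k(t)/N_k(t) - mu_k(t)^2
```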

startGame()[source]

Initialize the policy for a new game.

getReward(arm, reward)[source]

Give a reward: increase t and the number of pulls of that arm, then update the cumulated sums of rewards and of squared rewards for that arm (rewards are normalized to [0, 1]).
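
A minimal sketch of the bookkeeping this method performs, assuming the counters and cumulative sums are stored in NumPy arrays (all attribute names except rewardsSquared are assumptions based on the generic policy interface):

```python
def getReward(self, arm, reward):
    """Sketch: update counters and both cumulative sums for one arm."""
    self.t += 1                                      # global time step
    self.pulls[arm] += 1                             # pulls of this arm
    reward = (reward - self.lower) / self.amplitude  # normalize to [0, 1]
    self.rewards[arm] += reward                      # X_k(t): sum of rewards
    self.rewardsSquared[arm] += reward ** 2          # Z_k(t): sum of squared rewards
```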

computeIndex(arm)[source]

Compute the current index, at time t and after $N_k(t)$ pulls of arm k:

$$\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ V_k(t) &= \frac{Z_k(t)}{N_k(t)} - \hat{\mu}_k(t)^2, \\ I_k(t) &= \hat{\mu}_k(t) + \sqrt{\frac{2 \log(t) V_k(t)}{N_k(t)}} + 3 (b - a) \frac{\log(t)}{N_k(t)}.\end{split}$$

Here rewards lie in $[a, b]$, and $V_k(t)$ is an empirical estimator of the variance of the rewards of arm k, computed from $X_k(t) = \sum_{\sigma=1}^{t} \mathbb{1}(A(\sigma) = k)\, r_k(\sigma)$, the sum of rewards obtained from arm k, and $Z_k(t) = \sum_{\sigma=1}^{t} \mathbb{1}(A(\sigma) = k)\, r_k(\sigma)^2$, the sum of squared rewards.
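
A direct transcription of this formula into Python, as a sketch under the same assumed attribute names as above; returning +inf for an arm that was never pulled makes the policy try every arm once, the usual convention for index policies. After normalization, b - a = 1.

```python
from math import log, sqrt, inf

def computeIndex(self, arm):
    """Sketch: UCB-V index of one arm, from the running sums."""
    N = self.pulls[arm]
    if N < 1:
        return inf  # unexplored arm: try it first
    mean = self.rewards[arm] / N                                   # hat(mu)_k(t)
    variance = max(0.0, self.rewardsSquared[arm] / N - mean ** 2)  # V_k(t), clipped at 0
    return mean + sqrt(2.0 * log(self.t) * variance / N) + 3.0 * log(self.t) / N
```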

computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.
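
A vectorized NumPy sketch of the same computation over all arms at once, assuming the same array attributes and that results are stored in an index array, as in the generic index policy:

```python
import numpy as np

def computeAllIndex(self):
    """Sketch: UCB-V indexes of all arms in one vectorized pass."""
    with np.errstate(divide='ignore', invalid='ignore'):
        means = self.rewards / self.pulls
        variances = np.maximum(self.rewardsSquared / self.pulls - means ** 2, 0.0)
        indexes = (means
                   + np.sqrt(2.0 * np.log(self.t) * variances / self.pulls)
                   + 3.0 * np.log(self.t) / self.pulls)
    indexes[self.pulls < 1] = np.inf  # unexplored arms come first
    self.index[:] = indexes
```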

__module__ = 'Policies.UCBV'