Policies.DiscountedUCB module¶

The Discounted-UCB index policy, with a discount factor of $$\gamma\in(0,1]$$.

• Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
• $$\gamma$$ should not be 1: in that case, use Policies.UCBalpha.UCBalpha instead.
• The smaller $$\gamma$$ is, the shorter the “memory” of the algorithm.
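As an illustration (a hypothetical helper, not part of the module): if an arm is pulled at every step, its discounted count $$\sum_{s=1}^{t} \gamma^{t-s}$$ is a geometric sum converging to $$1/(1-\gamma)$$, which quantifies this “memory”:

```python
# Hypothetical illustration: the discounted count of an arm pulled at every
# step is the geometric sum
#   sum_{s=1}^{t} gamma^(t-s) = (1 - gamma^t) / (1 - gamma) -> 1 / (1 - gamma).
def discounted_count(gamma, t):
    """Discounted number of pulls after t steps, if pulled at every step."""
    return sum(gamma ** (t - s) for s in range(1, t + 1))

for gamma in (0.9, 0.99):
    print(gamma, discounted_count(gamma, 1000), 1.0 / (1.0 - gamma))
```

With $$\gamma = 0.99$$ (the module default) the effective memory is about 100 steps; with $$\gamma = 0.9$$, about 10.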
Policies.DiscountedUCB.ALPHA = 1

Default parameter for alpha.

Policies.DiscountedUCB.GAMMA = 0.99

Default parameter for gamma.

class Policies.DiscountedUCB.DiscountedUCB(nbArms, alpha=1, gamma=0.99, useRealDiscount=True, *args, **kwargs)[source]

The Discounted-UCB index policy, with a discount factor of $$\gamma\in(0,1]$$.

__init__(nbArms, alpha=1, gamma=0.99, useRealDiscount=True, *args, **kwargs)[source]

New generic index policy.

• nbArms: the number of arms,
• lower, amplitude: lower value and known amplitude of the rewards.
discounted_pulls = None

Discounted number of pulls of each arm.

discounted_rewards = None

Discounted cumulative reward of each arm.

alpha = None

Parameter alpha

gamma = None

Parameter gamma

delta_time_steps = None

Keeps a memory of $$\Delta_k(t)$$, the number of time steps since each arm was last pulled.

useRealDiscount = None

Flag controlling whether the exact update is used, i.e., a multiplication by $$\gamma^{1+\Delta_k(t)}$$ rather than simply by $$\gamma$$.

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update the cumulative sum of rewards for that arm (normalized in [0, 1]).

• Keeps the following two quantities up to date, using a definition and notation different from the article’s, but consistent within this project:
$\begin{split}N_{k,\gamma}(t+1) &:= \sum_{s=1}^{t} \gamma^{t - s} N_k(s), \\ X_{k,\gamma}(t+1) &:= \sum_{s=1}^{t} \gamma^{t - s} X_k(s).\end{split}$
• Instead of keeping the whole history of rewards, as the formula above suggests, we keep only the running sum of discounted rewards, because updating it is cheap (2 operations per step, instead of 1 for the classical Policies.UCBalpha.UCBalpha, and instead of the $$\mathcal{O}(t)$$ cost suggested by the formula). Denote by $$\Delta_k(t)$$ the number of time steps during which arm k was not selected (0 if it is selected twice in a row). The update then reduces to a multiplication by $$\gamma^{1+\Delta_k(t)}$$:
$\begin{split}N_{k,\gamma}(t+1) &= \gamma^{1+\Delta_k(t)} \times N_{k,\gamma}(\text{last pull}) + \mathbb{1}(A(t+1) = k), \\ X_{k,\gamma}(t+1) &= \gamma^{1+\Delta_k(t)} \times X_{k,\gamma}(\text{last pull}) + X_k(t+1).\end{split}$
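This update can be sketched as a minimal standalone function (hypothetical names, not the actual class method), assuming the per-arm statistics and the $$\Delta_k(t)$$ counters are stored in NumPy arrays:

```python
import numpy as np

# Minimal sketch of the discounted update (hypothetical standalone function,
# not the actual class method).
def get_reward(arm, reward, pulls, rewards, delta, gamma=0.99):
    """Discount the stored statistics by gamma^(1 + Delta_k(t)), where
    delta[arm] is the number of steps since `arm` was last pulled, then
    add the new observation."""
    discount = gamma ** (1 + delta[arm])
    pulls[arm] = discount * pulls[arm] + 1.0
    rewards[arm] = discount * rewards[arm] + reward
    delta += 1       # every arm waited one more step...
    delta[arm] = 0   # ...except the one just pulled

pulls, rewards = np.zeros(3), np.zeros(3)
delta = np.zeros(3, dtype=int)
get_reward(0, 1.0, pulls, rewards, delta)
get_reward(0, 0.5, pulls, rewards, delta)  # pulled twice in a row: Delta = 0
print(pulls[0], rewards[0])  # 0.99 * 1 + 1 = 1.99 and 0.99 * 1 + 0.5 = 1.49
```

Note that an arm pulled twice in a row is discounted by exactly $$\gamma$$, as in the simple update, while an arm left idle for $$\Delta_k(t)$$ steps is discounted by $$\gamma^{1+\Delta_k(t)}$$ in one multiplication.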
computeIndex(arm)[source]

Compute the current index, at time $$t$$ and after $$N_{k,\gamma}(t)$$ “discounted” pulls of arm k, and $$n_{\gamma}(t)$$ “discounted” pulls of all arms:

$\begin{split}I_k(t) &:= \frac{X_{k,\gamma}(t)}{N_{k,\gamma}(t)} + \sqrt{\frac{\alpha \log(n_{\gamma}(t))}{2 N_{k,\gamma}(t)}}, \\ \text{where}\;\; n_{\gamma}(t) &:= \sum_{k=1}^{K} N_{k,\gamma}(t).\end{split}$
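A hedged sketch of this index computation (assumed standalone names, not the class API); unexplored arms get an infinite index so they are pulled first:

```python
import numpy as np

# Hedged sketch of the D-UCB index: empirical discounted mean plus an
# exploration bonus (hypothetical standalone function, not the class API).
def compute_index(arm, pulls, rewards, alpha=1.0):
    """I_k(t) = X_{k,gamma}/N_{k,gamma} + sqrt(alpha log(n_gamma) / (2 N_{k,gamma}))."""
    n_gamma = pulls.sum()                    # n_gamma(t) = sum_k N_{k,gamma}(t)
    if pulls[arm] < 1e-12 or n_gamma <= 1.0:
        return float('inf')                  # unexplored arm: force a pull
    mean = rewards[arm] / pulls[arm]
    bonus = np.sqrt(alpha * np.log(n_gamma) / (2.0 * pulls[arm]))
    return mean + bonus

pulls = np.array([2.0, 5.0, 0.0])
rewards = np.array([1.0, 4.0, 0.0])
print([compute_index(k, pulls, rewards) for k in range(3)])
```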
computeAllIndex()[source]

Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.DiscountedUCB'
class Policies.DiscountedUCB.DiscountedUCBPlus(nbArms, horizon=None, max_nb_random_events=None, alpha=1, *args, **kwargs)[source]

The Discounted-UCB index policy, with a particular value of the discount factor $$\gamma\in(0,1]$$ chosen from the horizon and the number of breakpoints (or an upper bound on it).

• Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
• Uses $$\gamma = 1 - \frac{1}{4}\sqrt{\frac{\Upsilon}{T}}$$ if the horizon $$T$$ is given and an upper bound $$\Upsilon$$ on the number of random events (“breakpoints”) is known; otherwise it uses the default value.
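This choice of $$\gamma$$ can be sketched as follows (hypothetical standalone function; the fallback mirrors the module default GAMMA = 0.99):

```python
from math import sqrt

# Sketch of the DiscountedUCBPlus choice of gamma (hypothetical standalone
# function, not the class itself).
def choose_gamma(horizon=None, max_nb_random_events=None, default=0.99):
    """gamma = 1 - (1/4) sqrt(Upsilon / T) when both T and Upsilon are known."""
    if horizon is not None and max_nb_random_events is not None:
        return 1.0 - 0.25 * sqrt(max_nb_random_events / horizon)
    return default

print(choose_gamma(horizon=10000, max_nb_random_events=1))  # -> about 0.9975
print(choose_gamma())                                       # -> the default, 0.99
```

The longer the horizon relative to the number of breakpoints, the closer $$\gamma$$ gets to 1, i.e., the longer the memory.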
__init__(nbArms, horizon=None, max_nb_random_events=None, alpha=1, *args, **kwargs)[source]

New generic index policy.

• nbArms: the number of arms,
• lower, amplitude: lower value and known amplitude of the rewards.
__module__ = 'Policies.DiscountedUCB'
Policies.DiscountedUCB.constant_c = 1.0

Default value, as it was in pymaBandits v1.0.

Policies.DiscountedUCB.tolerance = 0.0001

Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.

class Policies.DiscountedUCB.DiscountedklUCB(nbArms, klucb=CPUDispatcher(<function klucbBern>), *args, **kwargs)[source]

The Discounted-klUCB index policy, with a discount factor $$\gamma\in(0,1]$$, using kl-UCB indexes instead of UCB indexes.

__init__(nbArms, klucb=CPUDispatcher(<function klucbBern>), *args, **kwargs)[source]

New generic index policy.

• nbArms: the number of arms,
• lower, amplitude: lower value and known amplitude of the rewards.
klucb = None

kl function to use

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time $$t$$ and after $$N_{k,\gamma}(t)$$ “discounted” pulls of arm k, and $$n_{\gamma}(t)$$ “discounted” pulls of all arms:

$\begin{split}\hat{\mu'}_k(t) &= \frac{X_{k,\gamma}(t)}{N_{k,\gamma}(t)} , \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu'}_k(t), q) \leq \frac{c \log(t)}{N_{k,\gamma}(t)} \right\},\\ I_k(t) &= U_k(t),\\ \text{where}\;\; n_{\gamma}(t) &:= \sum_{k=1}^{K} N_{k,\gamma}(t).\end{split}$

Here, rewards are assumed to lie in $$[a, b]$$ (default $$[0, 1]$$), $$\mathrm{kl}(x, y)$$ is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default 1).
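A hedged sketch of computing $$U_k(t)$$ by bisection for Bernoulli rewards (hypothetical standalone stand-ins for klucbBern, assuming the module’s default tolerance of 0.0001):

```python
from math import log

# Hedged stand-in for klucbBern (hypothetical standalone functions): compute
# the kl-UCB upper bound by bisection.
def kl_bern(x, y, eps=1e-15):
    """Bernoulli Kullback-Leibler divergence kl(x, y), with clamping."""
    x = min(max(x, eps), 1.0 - eps)
    y = min(max(y, eps), 1.0 - eps)
    return x * log(x / y) + (1.0 - x) * log((1.0 - x) / (1.0 - y))

def klucb_upper(mean, level, tolerance=1e-4):
    """Largest q in [mean, 1] such that kl(mean, q) <= level, by bisection."""
    lo, hi = mean, 1.0
    while hi - lo > tolerance:
        mid = (lo + hi) / 2.0
        if kl_bern(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

# U_k(t) with exploration level c * log(t) / N_{k,gamma}(t), for c = 1:
t, n_k_gamma, mean = 100, 10.0, 0.5
print(klucb_upper(mean, log(t) / n_k_gamma))
```

Since $$\mathrm{kl}(\hat{\mu}, q)$$ is increasing in q for $$q \geq \hat{\mu}$$, the bisection converges to the supremum in the formula, up to the tolerance.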

computeAllIndex()[source]

Compute the current indexes for all arms. Possibly vectorized; by default it cannot be vectorized automatically.

__module__ = 'Policies.DiscountedUCB'
class Policies.DiscountedUCB.DiscountedklUCBPlus(nbArms, klucb=CPUDispatcher(<function klucbBern>), *args, **kwargs)[source]

The Discounted-klUCB index policy, with a particular value of the discount factor $$\gamma\in(0,1]$$ chosen from the horizon and the number of breakpoints (or an upper bound on it).

• Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
• Uses $$\gamma = 1 - \frac{1}{4}\sqrt{\frac{\Upsilon}{T}}$$ if the horizon $$T$$ is given and an upper bound $$\Upsilon$$ on the number of random events (“breakpoints”) is known; otherwise it uses the default value.
__str__()[source]

-> str

__module__ = 'Policies.DiscountedUCB'