Policies.DiscountedUCB module¶

The Discounted-UCB index policy, with a discount factor of \(\gamma\in(0,1]\).

Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
\(\gamma\) should not be 1, otherwise you should rather use Policies.UCBalpha.UCBalpha instead.
The smaller the \(\gamma\), the shorter the “memory” of the algorithm is.

Policies.DiscountedUCB.ALPHA = 1¶: Default parameter for alpha.

Policies.DiscountedUCB.GAMMA = 0.99¶: Default parameter for gamma.

class Policies.DiscountedUCB.DiscountedUCB(nbArms, alpha=1, gamma=0.99, useRealDiscount=True, *args, **kwargs)[source]¶

Bases: Policies.UCBalpha.UCBalpha

The Discounted-UCB index policy, with a discount factor of \(\gamma\in(0,1]\).

Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)

__init__(nbArms, alpha=1, gamma=0.99, useRealDiscount=True, *args, **kwargs)[source]¶

New generic index policy.

nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.

discounted_pulls = None¶: Number of pulls of each arms

discounted_rewards = None¶: Cumulated rewards of each arms

alpha = None¶: Parameter alpha

gamma = None¶: Parameter gamma

delta_time_steps = None¶: Keep memory of the \(\Delta_k(t)\) for each time step.

useRealDiscount = None¶: Flag to know if the real update should be used, the one with a multiplication by \(\gamma^{1+\Delta_k(t)}\) and not simply a multiplication by \(\gamma\).

__str__()[source]¶: -> str

getReward(arm, reward)[source]¶

Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).

Keep up-to date the following two quantities, using different definition and notation as from the article, but being consistent w.r.t. my project:

\[\begin{split}N_{k,\gamma}(t+1) &:= \sum_{s=1}^{t} \gamma^{t - s} N_k(s), \\ X_{k,\gamma}(t+1) &:= \sum_{s=1}^{t} \gamma^{t - s} X_k(s).\end{split}\]

Instead of keeping the whole history of rewards, as expressed in the math formula, we keep the sum of discounted rewards from s=0 to s=t, because updating it is easy (2 operations instead of just 1 for classical Policies.UCBalpha.UCBalpha, and 2 operations instead of \(\mathcal{O}(t)\) as expressed mathematically). Denote \(\Delta_k(t)\) the number of time steps during which the arm k was not selected (maybe 0 if it is selected twice in a row). Then the update can be done easily by multiplying by \(\gamma^{1+\Delta_k(t)}\):

\[\begin{split}N_{k,\gamma}(t+1) &= \gamma^{1+\Delta_k(t)} \times N_{k,\gamma}(\text{last pull}) + \mathbb{1}(A(t+1) = k), \\ X_{k,\gamma}(t+1) &= \gamma^{1+\Delta_k(t)} \times X_{k,\gamma}(\text{last pull}) + X_k(t+1).\end{split}\]

computeIndex(arm)[source]¶: Compute the current index, at time \(t\) and after \(N_{k,\gamma}(t)\) “discounted” pulls of arm k, and \(n_{\gamma}(t)\) “discounted” pulls of all arms:

\[\begin{split}I_k(t) &:= \frac{X_{k,\gamma}(t)}{N_{k,\gamma}(t)} + \sqrt{\frac{\alpha \log(n_{\gamma}(t))}{2 N_{k,\gamma}(t)}}, \\ \text{where}\;\; n_{\gamma}(t) &:= \sum_{k=1}^{K} N_{k,\gamma}(t).\end{split}\]

computeAllIndex()[source]¶: Compute the current indexes for all arms, in a vectorized manner.

__module__ = 'Policies.DiscountedUCB'¶

class Policies.DiscountedUCB.DiscountedUCBPlus(nbArms, horizon=None, max_nb_random_events=None, alpha=1, *args, **kwargs)[source]¶

Bases: Policies.DiscountedUCB.DiscountedUCB

The Discounted-UCB index policy, with a particular value of the discount factor of \(\gamma\in(0,1]\), knowing the horizon and the number of breakpoints (or an upper-bound).

Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
Uses \(\gamma = 1 - \frac{1}{4}\sqrt{\frac{\Upsilon}{T}}\), if the horizon \(T\) is given and an upper-bound on the number of random events (“breakpoints”) \(\Upsilon\) is known, otherwise use the default value.

__init__(nbArms, horizon=None, max_nb_random_events=None, alpha=1, *args, **kwargs)[source]¶

New generic index policy.

nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.

__module__ = 'Policies.DiscountedUCB'¶

Policies.DiscountedUCB.constant_c = 1.0¶: default value, as it was in pymaBandits v1.0

Policies.DiscountedUCB.tolerance = 0.0001¶: Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.

class Policies.DiscountedUCB.DiscountedklUCB(nbArms, klucb=CPUDispatcher(<function klucbBern>), *args, **kwargs)[source]¶

Bases: Policies.DiscountedUCB.DiscountedUCB

The Discounted-klUCB index policy, with a particular value of the discount factor of \(\gamma\in(0,1]\), knowing the horizon and the number of breakpoints (or an upper-bound).

Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)

__init__(nbArms, klucb=CPUDispatcher(<function klucbBern>), *args, **kwargs)[source]¶

New generic index policy.

nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.

klucb = None¶: kl function to use

__str__()[source]¶: -> str

computeIndex(arm)[source]¶

Compute the current index, at time \(t\) and after \(N_{k,\gamma}(t)\) “discounted” pulls of arm k, and \(n_{\gamma}(t)\) “discounted” pulls of all arms:

\[\begin{split}\hat{\mu'}_k(t) &= \frac{X_{k,\gamma}(t)}{N_{k,\gamma}(t)} , \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu'}_k(t), q) \leq \frac{c \log(t)}{N_{k,\gamma}(t)} \right\},\\ I_k(t) &= U_k(t),\\ \text{where}\;\; n_{\gamma}(t) &:= \sum_{k=1}^{K} N_{k,\gamma}(t).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1).

computeAllIndex()[source]¶: Compute the current indexes for all arms. Possibly vectorized, by default it can not be vectorized automatically.

__module__ = 'Policies.DiscountedUCB'¶

class Policies.DiscountedUCB.DiscountedklUCBPlus(nbArms, klucb=CPUDispatcher(<function klucbBern>), *args, **kwargs)[source]¶

Bases: Policies.DiscountedUCB.DiscountedklUCB, Policies.DiscountedUCB.DiscountedUCBPlus

The Discounted-klUCB index policy, with a particular value of the discount factor of \(\gamma\in(0,1]\), knowing the horizon and the number of breakpoints (or an upper-bound).

Reference: [“On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems”, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
Uses \(\gamma = 1 - \frac{1}{4}\sqrt{\frac{\Upsilon}{T}}\), if the horizon \(T\) is given and an upper-bound on the number of random events (“breakpoints”) \(\Upsilon\) is known, otherwise use the default value.

__str__()[source]¶: -> str

__module__ = 'Policies.DiscountedUCB'¶