Policies.SlidingWindowUCB module

An experimental policy, using only a sliding window (of for instance \(\tau=1000\) steps, not counting draws of each arms) instead of using the full-size history.

  • Reference: [On Upper-Confidence Bound Policies for Non-Stationary Bandit Problems, by A.Garivier & E.Moulines, ALT 2011](https://arxiv.org/pdf/0805.3415.pdf)
  • It uses an additional \(\mathcal{O}(\tau)\) memory but do not cost anything else in terms of time complexity (the average is done with a sliding window, and costs \(\mathcal{O}(1)\) at every time step).

Warning

This is very experimental!

Note

This is similar to SlidingWindowRestart.SWR_UCB but slightly different: SlidingWindowRestart.SWR_UCB uses a window of size \(T_0=100\) to keep in memory the last 100 draws of each arm, and restart the index if the small history mean is too far away from the whole mean, while this SWUCB uses a fixed-size window of size \(\tau=1000\) to keep in memory the last 1000 steps.

Policies.SlidingWindowUCB.TAU = 1000

Size of the sliding window.

Policies.SlidingWindowUCB.ALPHA = 1.0

Default value for the constant \(\alpha\).

class Policies.SlidingWindowUCB.SWUCB(nbArms, tau=1000, alpha=1.0, *args, **kwargs)[source]

Bases: Policies.IndexPolicy.IndexPolicy

An experimental policy, using only a sliding window (of for instance \(\tau=1000\) steps, not counting draws of each arms) instead of using the full-size history.

__init__(nbArms, tau=1000, alpha=1.0, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
tau = None

Size \(\tau\) of the sliding window.

alpha = None

Constant \(\alpha\) in the square-root in the computation for the index.

last_rewards = None

Keep in memory all the rewards obtained in the last \(\tau\) steps.

last_choices = None

Keep in memory the times where each arm was last seen.

__str__()[source]

-> str

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards and update small history (sliding window) for that arm (normalized in [0, 1]).

computeIndex(arm)[source]

Compute the current index, at time \(t\) and after \(N_{k,\tau}(t)\) pulls of arm \(k\):

\[\begin{split}I_k(t) &= \frac{X_{k,\tau}(t)}{N_{k,\tau}(t)} + c_{k,\tau}(t),\\ \text{where}\;\; c_{k,\tau}(t) &:= \sqrt{\alpha \frac{\log(\min(t,\tau))}{N_{k,\tau}(t)}},\\ \text{and}\;\; X_{k,\tau}(t) &:= \sum_{s=t-\tau+1}^{t} X_k(s) \mathbb{1}(A(t) = k),\\ \text{and}\;\; N_{k,\tau}(t) &:= \sum_{s=t-\tau+1}^{t} \mathbb{1}(A(t) = k).\end{split}\]
__module__ = 'Policies.SlidingWindowUCB'
class Policies.SlidingWindowUCB.SWUCBPlus(nbArms, horizon=None, *args, **kwargs)[source]

Bases: Policies.SlidingWindowUCB.SWUCB

An experimental policy, using only a sliding window (of \(\tau\) steps, not counting draws of each arms) instead of using the full-size history.

  • Uses \(\tau = 4 \sqrt{T \log(T)}\) if the horizon \(T\) is given, otherwise use the default value.
__init__(nbArms, horizon=None, *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
__str__()[source]

-> str

__module__ = 'Policies.SlidingWindowUCB'
Policies.SlidingWindowUCB.constant_c = 1.0

default value, as it was in pymaBandits v1.0

Policies.SlidingWindowUCB.tolerance = 0.0001

Default value for the tolerance for computing numerical approximations of the kl-UCB indexes.

class Policies.SlidingWindowUCB.SWklUCB(nbArms, tau=1000, klucb=CPUDispatcher(<function klucbBern>), *args, **kwargs)[source]

Bases: Policies.SlidingWindowUCB.SWUCB

An experimental policy, using only a sliding window (of \(\tau\) steps, not counting draws of each arms) instead of using the full-size history, and using klUCB (see Policy.klUCB) indexes instead of UCB.

__init__(nbArms, tau=1000, klucb=CPUDispatcher(<function klucbBern>), *args, **kwargs)[source]

New generic index policy.

  • nbArms: the number of arms,
  • lower, amplitude: lower value and known amplitude of the rewards.
klucb = None

kl function to use

__str__()[source]

-> str

computeIndex(arm)[source]

Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:

\[\begin{split}\hat{\mu'}_k(t) &= \frac{X_{k,\tau}(t)}{N_{k,\tau}(t)} , \\ U_k(t) &= \sup\limits_{q \in [a, b]} \left\{ q : \mathrm{kl}(\hat{\mu'}_k(t), q) \leq \frac{c \log(\min(t,\tau))}{N_{k,\tau}(t)} \right\},\\ I_k(t) &= U_k(t),\\ \text{where}\;\; X_{k,\tau}(t) &:= \sum_{s=t-\tau+1}^{t} X_k(s) \mathbb{1}(A(t) = k),\\ \text{and}\;\; N_{k,\tau}(t) &:= \sum_{s=t-\tau+1}^{t} \mathbb{1}(A(t) = k).\end{split}\]

If rewards are in \([a, b]\) (default to \([0, 1]\)) and \(\mathrm{kl}(x, y)\) is the Kullback-Leibler divergence between two distributions of means x and y (see Arms.kullback), and c is the parameter (default to 1).

__module__ = 'Policies.SlidingWindowUCB'
class Policies.SlidingWindowUCB.SWklUCBPlus(nbArms, tau=1000, klucb=CPUDispatcher(<function klucbBern>), *args, **kwargs)[source]

Bases: Policies.SlidingWindowUCB.SWklUCB, Policies.SlidingWindowUCB.SWUCBPlus

An experimental policy, using only a sliding window (of \(\tau\) steps, not counting draws of each arms) instead of using the full-size history, and using klUCB (see Policy.klUCB) indexes instead of UCB.

  • Uses \(\tau = 4 \sqrt{T \log(T)}\) if the horizon \(T\) is given, otherwise use the default value.
__str__()[source]

-> str

__module__ = 'Policies.SlidingWindowUCB'