Policies.CD_UCB module

The CD-UCB generic policy for non-stationary bandits.

  • Reference: [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539)

  • It runs on top of a simple policy, e.g., UCB, and UCBLCB_IndexPolicy is a wrapper:

    >>> policy = UCBLCB_IndexPolicy(nbArms, UCB)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).

Warning

It can only work on basic index policies based on empirical averages (and an exploration bias), like UCB, and cannot work on any Bayesian policy (for which we would have to remember all previous observations in order to reset the history to a smaller one)!

Policies.CD_UCB.VERBOSE = False

Whether to be verbose when running the change-detection algorithm.

Policies.CD_UCB.PROBA_RANDOM_EXPLORATION = 0.1

Default probability of random exploration \(\alpha\).

Policies.CD_UCB.PER_ARM_RESTART = True

Should we reset only one arm's empirical average, or all of them? Default is True, as it is usually more efficient!

Policies.CD_UCB.FULL_RESTART_WHEN_REFRESH = False

Should we fully restart the algorithm, or simply reset one arm's empirical average? Default is False, as it is usually more efficient!

Policies.CD_UCB.EPSILON = 0.05

Precision of the test. For CUSUM/PHT, \(\varepsilon\) is the drift correction threshold (see algorithm).

Policies.CD_UCB.LAMBDA = 1

Default value of \(\lambda\).

Policies.CD_UCB.MIN_NUMBER_OF_OBSERVATION_BETWEEN_CHANGE_POINT = 50

Hypothesis on the speed of changes: between two change points, there are at least \(M \times K\) time steps, where \(K\) is the number of arms and \(M\) is this constant.

Policies.CD_UCB.LAZY_DETECT_CHANGE_ONLY_X_STEPS = 10

Be lazy and try to detect changes only every X steps, where X is small, e.g., 10. It is a simple but efficient way to speed up CD tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. A value of 1 tests at every step (no speed-up), while 10 should speed up the tests by about x10.
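
For illustration only, here is a minimal sketch of such a throttle (the names maybe_detect_change, t and detect_change are placeholders, not the class attributes): the costly test is simply skipped unless the step counter is a multiple of the stride.

    LAZY_DETECT_CHANGE_ONLY_X_STEPS = 10  # run the CD test only once every 10 steps

    def maybe_detect_change(t, arm, detect_change):
        """Run the (costly) change-detection test only once every X time steps."""
        if t % LAZY_DETECT_CHANGE_ONLY_X_STEPS != 0:
            return False  # skip the test on most time steps
        return detect_change(arm)

    # Example with a dummy detector that never fires:
    print(maybe_detect_change(42, 0, lambda arm: False))  # False: 42 % 10 != 0, test skipped
    print(maybe_detect_change(40, 0, lambda arm: False))  # False: test ran, no change found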

class Policies.CD_UCB.CD_IndexPolicy(nbArms, full_restart_when_refresh=False, per_arm_restart=True, epsilon=0.05, proba_random_exploration=None, lazy_detect_change_only_x_steps=10, *args, **kwargs)[source]

Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy

The CD-UCB generic policy for non-stationary bandits, from [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539).

__init__(nbArms, full_restart_when_refresh=False, per_arm_restart=True, epsilon=0.05, proba_random_exploration=None, lazy_detect_change_only_x_steps=10, *args, **kwargs)[source]

New policy.

epsilon = None

Parameter \(\varepsilon\) for the test.

lazy_detect_change_only_x_steps = None

Be lazy and try to detect changes only every X steps, where X is small, e.g., 10.

proba_random_exploration = None

What they call \(\alpha\) in their paper: the probability of uniform exploration at each time step.

all_rewards = None

Keep in memory all the rewards obtained since the last restart on that arm.

last_pulls = None

Keep in memory the number of pulls since the last restart. Starts with -1 (never seen).

last_restart_times = None

Keep in memory the times of last restarts (for each arm).

number_of_restart = None

Keep in memory the number of restarts.

__str__()[source]

-> str

choice()[source]

With a probability \(\alpha\), play uniformly at random, otherwise, pass the call to choice() of the underlying policy.
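
As a hedged sketch of this forced-exploration step (not the exact implementation; underlying_choice stands for the wrapped policy's choice()):

    import numpy as np

    def choice(nbArms, alpha, underlying_choice, rng=np.random):
        """With probability alpha, pick an arm uniformly at random;
        otherwise delegate to the underlying index policy."""
        if rng.random() < alpha:        # forced uniform exploration
            return rng.randint(nbArms)  # uniform over {0, ..., nbArms - 1}
        return underlying_choice()      # e.g., the wrapped UCB's choice()

    # Example: explore with probability 0.1, otherwise always play arm 3
    print(choice(5, 0.1, lambda: 3))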

choiceWithRank(rank=1)[source]

With a probability \(\alpha\), play uniformly at random, otherwise, pass the call to choiceWithRank() of the underlying policy.

getReward(arm, reward)[source]

Give a reward: increase t and the number of pulls, update the cumulative sum of rewards, and update the small history (sliding window) for that arm (reward normalized in [0, 1]).

  • Reset the whole empirical average if the change-detection algorithm (method detect_change()) detects a change for this arm at the current time step.

Warning

This is computationally costly, so an easy way to speed up this step is to set lazy_detect_change_only_x_steps \(= \mathrm{Step}_t\) to a small value (e.g., 10), so that the test is not run at every \(t\in\mathbb{N}^*\) but only at the steps \(s\in\mathbb{N}^*\) with \(s \bmod \mathrm{Step}_t = 0\) (e.g., one out of every 10 steps).

Warning

If the detect_change() method also returns an estimate \(\hat{\tau}\) of the position of the change-point, then it is used to reset the memory of the changing arm, keeping only the observations from time \(\hat{\tau}+1\) onward.
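
A minimal sketch of this update-then-detect step, with per-arm histories stored in plain Python lists (illustrative names, not the actual class attributes):

    def get_reward(arm, reward, all_rewards, detect_change):
        """Store the new reward for `arm`, then run the change-detection test.
        `all_rewards[arm]` holds the rewards of that arm since its last restart."""
        all_rewards[arm].append(reward)
        changed, tau_hat = detect_change(arm)  # tau_hat may be None
        if changed:
            if tau_hat is not None:
                # localization: keep only the observations after the estimated change point
                all_rewards[arm] = all_rewards[arm][tau_hat + 1:]
            else:
                all_rewards[arm] = []          # plain restart: forget this arm's history
        return changed

    # Example with a dummy detector that flags a change at position 2:
    history = {0: [0.1, 0.9, 0.8, 0.7]}
    get_reward(0, 0.6, history, lambda arm: (True, 2))
    print(history[0])  # [0.7, 0.6]: only the observations after tau_hat are kept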

detect_change(arm, verbose=False)[source]

Try to detect a change in the current arm.

Warning

This is not implemented for the generic CD algorithm; it has to be implemented by a child class of CD_IndexPolicy.

__module__ = 'Policies.CD_UCB'

class Policies.CD_UCB.SlidingWindowRestart_IndexPolicy(nbArms, full_restart_when_refresh=False, per_arm_restart=True, epsilon=0.05, proba_random_exploration=None, lazy_detect_change_only_x_steps=10, *args, **kwargs)[source]

Bases: Policies.CD_UCB.CD_IndexPolicy

A more generic implementation is the Policies.SlidingWindowRestart class.

Warning

I have no idea if what I wrote is correct or not!

detect_change(arm, verbose=False)[source]

Try to detect a change in the current arm.

Warning

This one simply uses a sliding window of fixed size 100. A more generic implementation is the Policies.SlidingWindowRestart class.
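
A plausible sketch of such a fixed-window test (an illustration of the idea, not necessarily the exact rule implemented here): compare the empirical mean over the last window_size rewards with the mean since the last restart, and flag a change when they differ by more than a threshold \(\varepsilon\).

    import numpy as np

    def sliding_window_detect_change(rewards, window_size=100, epsilon=0.05):
        """Return True when the mean of the last `window_size` rewards deviates from
        the mean of all rewards since the last restart by more than epsilon."""
        if len(rewards) < window_size:
            return False  # not enough data in the window yet
        full_mean = np.mean(rewards)
        window_mean = np.mean(rewards[-window_size:])
        return abs(window_mean - full_mean) >= epsilon

    # Example: a clear jump in the mean inside the last window triggers the test
    rewards = [0.2] * 200 + [0.9] * 100
    print(sliding_window_detect_change(rewards))  # True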

__module__ = 'Policies.CD_UCB'

Policies.CD_UCB.LAZY_TRY_VALUE_S_ONLY_X_STEPS = 10

Be lazy and scan the candidate change points \(s\) with a stride of steps_s instead of trying every value. Using steps_s=2 should already speed things up by about 2x. It is a simple but efficient way to speed up GLR tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. A value of 1 tries every \(s\) (no speed-up), while 10 should speed up the test by about x10.
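
Purely for illustration, this laziness amounts to scanning the candidate change points with a stride (steps_s here mirrors the lazy_try_value_s_only_x_steps parameter):

    def candidate_change_points(t0, t, steps_s=10):
        """Positions s at which a change is tested, scanned with a stride of steps_s
        (steps_s=1 recovers the exhaustive scan over every position)."""
        return range(t0, t, steps_s)

    print(list(candidate_change_points(0, 50)))     # [0, 10, 20, 30, 40]
    print(list(candidate_change_points(0, 50, 1)))  # every position 0, 1, ..., 49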

Policies.CD_UCB.USE_LOCALIZATION = True

Default value of use_localization for policies. All the experiments I tried showed that localization always helps improve learning, so the default value is set to True.

class Policies.CD_UCB.UCBLCB_IndexPolicy(nbArms, delta=None, delta0=1.0, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]

Bases: Policies.CD_UCB.CD_IndexPolicy

The UCBLCB generic policy for non-stationary bandits, from [[“Improved Changepoint Detection for Piecewise i.i.d Bandits”. S. Mukherjee and O.-A. Maillard. Preprint, 2018]](https://subhojyoti.github.io/pdf/aistats_2019.pdf).

Warning

This is still experimental! See https://github.com/SMPyBandits/SMPyBandits/issues/177

__init__(nbArms, delta=None, delta0=1.0, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]

New policy.

proba_random_exploration = None

What they call \(\alpha\) in their paper: the probability of uniform exploration at each time step.

lazy_try_value_s_only_x_steps = None

Be lazy and scan the candidate change points \(s\) with a stride of steps_s instead of trying every value.

use_localization = None

Experiment with localization of the break-point, i.e., when restarting, reset the arm's memory by keeping the observations \(s+1, \dots, n\) instead of just the last one.

__module__ = 'Policies.CD_UCB'

__str__()[source]

-> str

delta(t)[source]

Use \(\delta = \delta_0\) if it was given as an argument to the policy, or \(\frac{\delta_0}{t}\) as the confidence level of the UCB/LCB test (default is \(\delta_0=1\)).

Warning

It is unclear (in the article) whether \(t\) is the time since the last restart or the total time.
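
A small sketch of this rule, mirroring the delta and delta0 constructor arguments:

    def confidence_level(t, delta=None, delta0=1.0):
        """Confidence level of the UCB/LCB test: the constant delta0 if a fixed
        delta was requested, otherwise the decreasing value delta0 / t."""
        return delta0 if delta is not None else delta0 / t

    print(confidence_level(100))              # 0.01 = delta0 / t
    print(confidence_level(100, delta=0.05))  # 1.0: the fixed delta0 is used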

detect_change(arm, verbose=False)[source]

Detect a change in the current arm, using the two-sided UCB-LCB algorithm [Mukherjee & Maillard, 2018].

  • Let \(\hat{\mu}_{i,t:t'}\) be the empirical mean of the rewards obtained for arm \(i\) from time \(t\) to \(t'\), and \(N_{i,t:t'}\) be the number of samples.

  • Let \(S_{i,t:t'} = \sqrt{\frac{\log(4 t^2 / \delta)}{2 N_{i,t:t'}}}\) be the half-width of the confidence interval.

  • When we have data from \(t_0=0\) (since the last restart) up to the current time \(t\), for each arm \(i\),
    • For each intermediate time step \(t' \in [t_0, t)\),
      • Compute \(LCB_{\text{before}} = \hat{\mu}_{i,t_0:t'} - S_{i,t_0:t'}\),

      • Compute \(UCB_{\text{before}} = \hat{\mu}_{i,t_0:t'} + S_{i,t_0:t'}\),

      • Compute \(LCB_{\text{after}} = \hat{\mu}_{i,t'+1:t} - S_{i,t'+1:t}\),

      • Compute \(UCB_{\text{after}} = \hat{\mu}_{i,t'+1:t} + S_{i,t'+1:t}\),

      • If \(UCB_{\text{before}} < LCB_{\text{after}}\) or \(UCB_{\text{after}} < LCB_{\text{before}}\), then restart.
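
Putting these steps together, here is a self-contained sketch of the two-sided test on one arm's reward history (the stride over \(t'\) plays the role of lazy_try_value_s_only_x_steps; this is an illustration of the formulas above, not the library's exact code):

    import math

    def ucblcb_detect_change(rewards, delta=0.01, steps_s=10):
        """Two-sided UCB/LCB change test on one arm's rewards since the last restart.
        Returns (True, t_prime) at the first split point whose before/after confidence
        intervals do not overlap, else (False, None)."""
        t = len(rewards)

        def interval(samples):
            n = len(samples)
            mean = sum(samples) / n
            radius = math.sqrt(math.log(4 * t * t / delta) / (2 * n))
            return mean - radius, mean + radius  # (LCB, UCB)

        for t_prime in range(1, t - 1, steps_s):  # candidate change points, with a stride
            lcb_before, ucb_before = interval(rewards[:t_prime + 1])
            lcb_after, ucb_after = interval(rewards[t_prime + 1:])
            if ucb_before < lcb_after or ucb_after < lcb_before:
                return True, t_prime
        return False, None

    # Example: an abrupt jump in the mean reward is flagged as a change
    rewards = [0.1] * 200 + [0.9] * 200
    print(ucblcb_detect_change(rewards)[0])  # True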