Policies.CD_UCB module¶
The generic CD-UCB policies for non-stationary bandits.
- Reference: [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539) 
- It runs on top of a simple policy, e.g., UCB, and UCBLCB_IndexPolicy is a wrapper (a fuller usage sketch follows this list):

>>> policy = UCBLCB_IndexPolicy(nbArms, UCB)
>>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
- It uses an additional \(\mathcal{O}(\tau_{\max})\) memory for a game of maximum stationary length \(\tau_{\max}\).
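A slightly fuller usage sketch, to complement the doctest above. The Bernoulli arms, the horizon and the import path are illustrative assumptions; only the calls startGame(), choice() and getReward() come from this page:

    import numpy as np
    from Policies import UCB, UCBLCB_IndexPolicy  # import path assumed from this page's module path

    nbArms, horizon = 3, 1000
    means = [0.1, 0.5, 0.9]                  # assumed Bernoulli arm means (illustrative only)
    rng = np.random.default_rng(42)

    policy = UCBLCB_IndexPolicy(nbArms, UCB) # wrap a simple UCB policy, as in the doctest above
    policy.startGame()
    for t in range(horizon):
        arm = policy.choice()                         # pick an arm
        reward = float(rng.random() < means[arm])     # simulated Bernoulli reward
        policy.getReward(arm, reward)                 # feed the reward back to the wrapper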
Warning
It can only work on a basic index policy based on empirical averages (and an exploration bias), like UCB, and cannot work on any Bayesian policy (for which we would have to remember all previous observations in order to rebuild the posterior from the reduced history after a restart)!
- 
Policies.CD_UCB.VERBOSE= False¶
- Whether to be verbose when running the change-detection algorithm.
- 
Policies.CD_UCB.PROBA_RANDOM_EXPLORATION= 0.1¶
- Default probability of random exploration \(\alpha\). 
- 
Policies.CD_UCB.PER_ARM_RESTART= True¶
- Should we reset only one arm's empirical average, or all of them? Default is True (per-arm restart), which is usually more efficient!
- 
Policies.CD_UCB.FULL_RESTART_WHEN_REFRESH= False¶
- Should we fully restart the algorithm, or simply reset one arm's empirical average? Default is False, which is usually more efficient!
- 
Policies.CD_UCB.EPSILON= 0.05¶
- Precision of the test. For CUSUM/PHT, \(\varepsilon\) is the drift correction threshold (see algorithm). 
- 
Policies.CD_UCB.LAMBDA= 1¶
- Default value of \(\lambda\). 
- 
Policies.CD_UCB.MIN_NUMBER_OF_OBSERVATION_BETWEEN_CHANGE_POINT= 50¶
- Hypothesis on the speed of changes: between two change points, there are at least \(M \times K\) time steps, where \(K\) is the number of arms and \(M\) is this constant.
- 
Policies.CD_UCB.LAZY_DETECT_CHANGE_ONLY_X_STEPS= 10¶
- Be lazy and try to detect changes only once every X steps, where X is small, like 10 for instance. It is a simple but efficient way to speed up CD tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. Setting it to 1 tests at every step (no speed-up); the value of 10 used here should speed up the test by about x10.
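A hedged illustration of this laziness (the helper below is an assumption about the mechanism, not the library's code):

    def should_test_for_change(t, lazy_detect_change_only_x_steps=10):
        """Illustrative helper (assumption): run the costly change-detection
        test only one step out of every `lazy_detect_change_only_x_steps`."""
        return t % lazy_detect_change_only_x_steps == 0

    # With the value of 10, only steps 0, 10, 20, ... actually run the CD test
    tested_steps = [t for t in range(30) if should_test_for_change(t)]
    # tested_steps == [0, 10, 20]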
- 
class Policies.CD_UCB.CD_IndexPolicy(nbArms, full_restart_when_refresh=False, per_arm_restart=True, epsilon=0.05, proba_random_exploration=None, lazy_detect_change_only_x_steps=10, *args, **kwargs)[source]¶
- Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy
The CD-UCB generic policy for non-stationary bandits, from [[“A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem”. F. Liu, J. Lee and N. Shroff. arXiv preprint arXiv:1711.03539, 2017]](https://arxiv.org/pdf/1711.03539).
__init__(nbArms, full_restart_when_refresh=False, per_arm_restart=True, epsilon=0.05, proba_random_exploration=None, lazy_detect_change_only_x_steps=10, *args, **kwargs)[source]¶
- New policy. 
 - 
epsilon= None¶
- Parameter \(\varepsilon\) for the test. 
 - 
lazy_detect_change_only_x_steps= None¶
- Be lazy and try to detect changes only once every X steps, where X is small, like 10 for instance.
 - 
proba_random_exploration= None¶
- What they call \(\alpha\) in their paper: the probability of uniform exploration at each time. 
 - 
all_rewards= None¶
- Keep in memory all the rewards obtained since the last restart on that arm. 
 - 
last_pulls= None¶
- Keep in memory the number of pulls of each arm since the last restart. Starts at -1 (arm never seen).
 - 
last_restart_times= None¶
- Keep in memory the times of last restarts (for each arm). 
 - 
number_of_restart= None¶
- Keep in memory the number of restarts. 
 - 
choice()[source]¶
- With a probability \(\alpha\), play uniformly at random; otherwise, pass the call to choice() of the underlying policy.
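A minimal sketch of this behaviour, assuming NumPy for the randomness and that the wrapped policy is reachable as self.policy (an assumption, not a documented attribute on this page):

    import numpy as np

    def choice(self):
        # With probability alpha (= proba_random_exploration), explore uniformly at random
        if np.random.random() < self.proba_random_exploration:
            return np.random.randint(self.nbArms)
        # Otherwise, delegate the choice to the wrapped index policy (e.g., UCB)
        return self.policy.choice()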
 - 
choiceWithRank(rank=1)[source]¶
- With a probability \(\alpha\), play uniformly at random; otherwise, pass the call to choiceWithRank() of the underlying policy.
 - 
getReward(arm, reward)[source]¶
- Give a reward: increase t, pulls, and update the cumulated sum of rewards and the small history (sliding window) for that arm (rewards normalized in [0, 1]).
Reset the whole empirical average if the change-detection algorithm says so, with method detect_change(), for this arm at the current time step.
Warning
This is computationally costly, so an easy way to speed up this step is to use lazy_detect_change_only_x_steps \(= \mathrm{Step}_t\) for a small value (e.g., 10), so as not to test for all \(t \in \mathbb{N}^*\) but only for \(s \in \mathbb{N}^*\) such that \(s \,\%\, \mathrm{Step}_t = 0\) (e.g., one out of every 10 steps).
Warning
If the detect_change() method also returns an estimate \(\hat{\tau}\) of the position of the change-point, then it is used to reset the memory of the changing arm while keeping the observations from \(\hat{\tau}+1\) onwards.
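A condensed sketch of the flow just described. The return shape of detect_change(), the self.policy attribute and the localization slice are assumptions used for illustration, not the exact implementation:

    def getReward(self, arm, reward):
        self.t += 1
        self.all_rewards[arm].append(reward)     # small per-arm history since the last restart
        self.policy.getReward(arm, reward)       # update the wrapped index policy
        # Lazy speed-up: only run the costly CD test one step out of every X
        if self.t % self.lazy_detect_change_only_x_steps != 0:
            return
        has_changed, estimated_tau = self.detect_change(arm)  # assumed return shape
        if not has_changed:
            return
        if self.full_restart_when_refresh:
            self.startGame()                     # forget everything and restart
        elif estimated_tau is not None:
            # Localization: keep only the observations after the estimated change point
            self.all_rewards[arm] = self.all_rewards[arm][estimated_tau + 1:]
        else:
            self.all_rewards[arm] = []           # per-arm restart: forget this arm's history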
 - 
detect_change(arm, verbose=False)[source]¶
- Try to detect a change in the current arm.
Warning
This is not implemented for the generic CD algorithm; it has to be implemented by a child class of CD_IndexPolicy.
 - 
__module__= 'Policies.CD_UCB'¶
 
- 
- 
class Policies.CD_UCB.SlidingWindowRestart_IndexPolicy(nbArms, full_restart_when_refresh=False, per_arm_restart=True, epsilon=0.05, proba_random_exploration=None, lazy_detect_change_only_x_steps=10, *args, **kwargs)[source]¶
- Bases: Policies.CD_UCB.CD_IndexPolicy
A more generic implementation is the Policies.SlidingWindowRestart class.
Warning
I have no idea if what I wrote is correct or not!
detect_change(arm, verbose=False)[source]¶
- Try to detect a change in the current arm.
Warning
This one simply uses a sliding window of fixed size 100. A more generic implementation is the Policies.SlidingWindowRestart class.
 - 
__module__= 'Policies.CD_UCB'¶
 
- 
- 
Policies.CD_UCB.LAZY_TRY_VALUE_S_ONLY_X_STEPS= 10¶
- Be lazy and try candidate change points \(s\) only with steps of size steps_s. Using steps_s=1 tests every value (no speed-up), while already steps_s=2 should speed up the test by x2; the value of 10 used here should speed it up by about x10. It is a simple but efficient way to speed up GLR tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173.
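A hedged illustration of this second laziness (scanning the candidate change points with a stride is an assumption about the mechanism, not the library's exact code):

    # Illustrative sketch (assumption): scan candidate change points s with a
    # stride of `steps_s` instead of testing every single value in [t0, t).
    t0, t, steps_s = 0, 200, 10
    candidate_s_values = list(range(t0, t, steps_s))  # 20 tests instead of 200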
- 
Policies.CD_UCB.USE_LOCALIZATION= True¶
- Default value of use_localization for policies. All the experiments I tried showed that localization always helps improve learning, so the default value is set to True.
- 
class Policies.CD_UCB.UCBLCB_IndexPolicy(nbArms, delta=None, delta0=1.0, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]¶
- Bases: Policies.CD_UCB.CD_IndexPolicy
The UCB-LCB generic policy for non-stationary bandits, from [[“Improved Changepoint Detection for Piecewise i.i.d Bandits”. S. Mukherjee and O.-A. Maillard, preprint, 2018]](https://subhojyoti.github.io/pdf/aistats_2019.pdf).
Warning
This is still experimental! See https://github.com/SMPyBandits/SMPyBandits/issues/177
__init__(nbArms, delta=None, delta0=1.0, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]¶
- New policy. 
 - 
proba_random_exploration= None¶
- What they call \(\alpha\) in their paper: the probability of uniform exploration at each time. 
 - 
lazy_try_value_s_only_x_steps= None¶
- Be lazy and try candidate change points \(s\) only with steps of size steps_s.
 - 
use_localization= None¶
- Experimental option to use localization of the break-point, i.e., restart the memory of the arm by keeping the observations \(s+1, \dots, n\) instead of just the last one.
 - 
__module__= 'Policies.CD_UCB'¶
 - 
delta(t)[source]¶
- Use \(\delta = \delta_0\) if it was given as an argument to the policy, or \(\delta = \frac{\delta_0}{t}\) as the confidence level of the UCB/LCB test (default is \(\delta_0 = 1\)).
Warning
It is unclear (in the article) whether \(t\) is the time since the last restart or the total time.
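As a one-line sketch of this rule (the attribute names self._delta and self.delta0 are assumptions matching the constructor arguments delta and delta0):

    def delta(self, t):
        # Fixed confidence level if delta was given, otherwise delta0 / t
        return self._delta if self._delta is not None else self.delta0 / t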
 - 
detect_change(arm, verbose=False)[source]¶
- Detect a change in the current arm, using the two-sided UCB-LCB algorithm of [Mukherjee & Maillard, 2018] (a standalone sketch of the test follows this list). Let \(\hat{\mu}_{i,t:t'}\) be the empirical mean of the rewards obtained for arm \(i\) from time \(t\) to \(t'\), and \(N_{i,t:t'}\) the number of samples.
- Let \(S_{i,t:t'} = \sqrt{\frac{\log(4 t^2 / \delta)}{2 N_{i,t:t'}}}\) be the length of the confidence interval.
- When we have data starting at \(t_0 = 0\) (since the last restart) and up to the current time \(t\), for each arm \(i\):
- For each intermediate time step \(t' \in [t_0, t)\):
- Compute \(LCB_{\text{before}} = \hat{\mu}_{i,t_0:t'} - S_{i,t_0:t'}\), 
- Compute \(UCB_{\text{before}} = \hat{\mu}_{i,t_0:t'} + S_{i,t_0:t'}\), 
- Compute \(LCB_{\text{after}} = \hat{\mu}_{i,t'+1:t} - S_{i,t'+1:t}\), 
- Compute \(UCB_{\text{after}} = \hat{\mu}_{i,t'+1:t} + S_{i,t'+1:t}\), 
- If \(UCB_{\text{before}} < LCB_{\text{after}}\) or \(UCB_{\text{after}} < LCB_{\text{before}}\), then restart. 
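A self-contained sketch of this two-sided test on one arm's reward history. The function name, data layout and returned change-point estimate are illustrative assumptions; the confidence radius follows the definition of \(S_{i,t:t'}\) above:

    import numpy as np

    def ucblcb_detect_change(rewards, delta=0.05):
        """Return (True, t_prime) if the confidence intervals before and after
        some split t_prime are disjoint, else (False, None). `rewards` is the
        1D history of one arm since its last restart (an assumed layout)."""
        rewards = np.asarray(rewards, dtype=float)
        t = len(rewards)
        for t_prime in range(1, t - 1):
            before, after = rewards[:t_prime + 1], rewards[t_prime + 1:]
            # Confidence radius S = sqrt(log(4 t^2 / delta) / (2 N))
            s_before = np.sqrt(np.log(4 * t**2 / delta) / (2 * len(before)))
            s_after = np.sqrt(np.log(4 * t**2 / delta) / (2 * len(after)))
            lcb_before, ucb_before = before.mean() - s_before, before.mean() + s_before
            lcb_after, ucb_after = after.mean() - s_after, after.mean() + s_after
            # Disjoint confidence intervals => the mean of this arm has likely changed
            if ucb_before < lcb_after or ucb_after < lcb_before:
                return True, t_prime
        return False, None

    # Example: a clear jump from mean 0.1 to mean 0.9 halfway through the history
    changed, tau_hat = ucblcb_detect_change([0.1] * 50 + [0.9] * 50)
    # changed is True, and tau_hat is an estimate of the change position (near index 49)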
 
 
 
 
 
 