Policies.Monitored_UCB module¶
The Monitored-UCB generic policy for non-stationary bandits.
Reference: [[“Nearly Optimal Adaptive Procedure for Piecewise-Stationary Bandit: a Change-Point Detection Approach”. Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie. arXiv preprint arXiv:1802.03692, 2018]](https://arxiv.org/pdf/1802.03692)
It runs on top of a simple policy, e.g., UCB, and Monitored_IndexPolicy is a wrapper:

>>> policy = Monitored_IndexPolicy(nbArms, UCB)
>>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
It uses an additional \(\mathcal{O}(K w)\) memory for a window of size \(w\).
Warning
It can only work on top of a basic index policy based on empirical averages (and an exploration bias), like UCB, and cannot work on any Bayesian policy (for which we would have to remember all previous observations in order to restart it with a smaller history)!
-
Policies.Monitored_UCB.DELTA = 0.1¶ Default value for the parameter \(\delta\), the lower-bound on \(\delta_k^{(i)}\), the amplitude of change of arm k at a break-point. Default is 0.05.
-
Policies.Monitored_UCB.PER_ARM_RESTART = False¶ Should we reset only one arm's empirical average, or all of them? For M-UCB it is False by default.
-
Policies.Monitored_UCB.FULL_RESTART_WHEN_REFRESH = True¶ Should we fully restart the algorithm, or simply reset one arm's empirical average? For M-UCB it is True by default.
-
Policies.Monitored_UCB.WINDOW_SIZE = None¶ Default value of the window size. Give None to use the default value computed from knowledge of the horizon and the number of break-points.
-
Policies.Monitored_UCB.GAMMA_SCALE_FACTOR = 1¶ For any algorithm with uniform exploration and a formula to tune it, the exploration rate \(\gamma\) is usually too large and leads to larger regret. Multiplying it by 0.1 or 0.2 helps a lot!
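To make the interplay between WINDOW_SIZE, DELTA and GAMMA_SCALE_FACTOR concrete, here is a heavily hedged sketch of how \(w\), \(b\) and \(\gamma\) could be derived from the horizon and the number of break-points. The formulas are my reading of the tuning suggested in the paper (Remark 1), written from memory, and the function name is hypothetical; this is not necessarily what Monitored_IndexPolicy actually computes:

```python
from math import ceil, log, sqrt

def tune_m_ucb(horizon, nbArms, nb_break_points, delta=0.1, gamma_scale=0.2):
    """Hedged sketch of one possible tuning of (w, b, gamma) from the horizon T,
    the number of arms K and of break-points M, given the change lower-bound delta.

    ASSUMPTION: these formulas are my reading of Remark 1 of Cao et al. (2018),
    reproduced from memory; the class may compute its defaults differently.
    """
    T, K, M = horizon, nbArms, nb_break_points
    # Window size w: large enough that the two half-window sums can separate
    # a change of amplitude at least delta from noise.
    w = int(ceil((4.0 / delta ** 2) * (sqrt(log(2 * K * T ** 2)) + sqrt(log(2 * T))) ** 2))
    w += w % 2  # keep w even, so the window splits into two equal halves
    # Detection threshold b used in |sum(second half) - sum(first half)| > b.
    b = sqrt((w / 2.0) * log(2 * K * T ** 2))
    # Forced-exploration rate gamma, shrunk by a GAMMA_SCALE_FACTOR-like factor
    # (here 0.2 instead of 1), since the formula-tuned value is usually too large.
    gamma = sqrt(M * K * (2 * b + 3 * sqrt(w)) / (2.0 * T))
    return w, b, gamma_scale * min(1.0, gamma)

# Example: T = 10000 rounds, K = 5 arms, M = 3 break-points, delta = 0.5
print(tune_m_ucb(10000, 5, 3, delta=0.5))
```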
-
class
Policies.Monitored_UCB.Monitored_IndexPolicy(nbArms, full_restart_when_refresh=True, per_arm_restart=False, horizon=None, delta=0.1, max_nb_random_events=None, w=None, b=None, gamma=None, *args, **kwargs)[source]¶ Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy
The Monitored-UCB generic policy for non-stationary bandits, from [[“Nearly Optimal Adaptive Procedure for Piecewise-Stationary Bandit: a Change-Point Detection Approach”. Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie. arXiv preprint arXiv:1802.03692, 2018]](https://arxiv.org/pdf/1802.03692)
For a window size w, it uses only \(\mathcal{O}(K w)\) memory.
-
__init__(nbArms, full_restart_when_refresh=True, per_arm_restart=False, horizon=None, delta=0.1, max_nb_random_events=None, w=None, b=None, gamma=None, *args, **kwargs)[source]¶ New policy.
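As a usage sketch complementing the doctest above, here is one way to construct the wrapper with an explicit horizon and number of break-points, so the default window size, threshold and exploration rate can be computed from them. The import path, the choice of klUCB, and the policy keyword (forwarded to BaseWrapperPolicy) are assumptions about the surrounding library, not taken from this page:

```python
from Policies import Monitored_IndexPolicy, klUCB  # assumed import path

nbArms = 5
horizon = 10000            # total number of rounds T (known or estimated)
max_nb_random_events = 3   # number of break-points M (known or estimated)

policy = Monitored_IndexPolicy(
    nbArms,
    policy=klUCB,                 # assumed keyword for the underlying index policy
    horizon=horizon,
    max_nb_random_events=max_nb_random_events,
    delta=0.1,                    # lower-bound on the amplitude of changes
    per_arm_restart=False,        # M-UCB default (see PER_ARM_RESTART above)
)

policy.startGame()
# At each round: arm = policy.choice(); then policy.getReward(arm, observed_reward)
```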
-
window_size = None¶ Parameter \(w\) for the M-UCB algorithm.
-
threshold_b = None¶ Parameter \(b\) for the M-UCB algorithm.
-
gamma = None¶ What they call \(\gamma\) in their paper: the share of uniform exploration.
-
last_update_time_tau = None¶ Keep in memory the last time a change was detected, i.e., the variable \(\tau\) in the algorithm.
-
last_w_rewards = None¶ Keep in memory all the rewards obtained since the last restart on that arm.
-
last_pulls = None¶ Keep in memory the times when each arm was last seen. Starts at -1 (never seen).
-
last_restart_times = None¶ Keep in memory the times of the last restarts (for each arm).
-
choice()[source]¶ Essentially play uniformly at random with probability \(\gamma\); otherwise, pass the call to choice of the underlying policy (e.g., UCB).
Warning
Actually, it's more complicated: if \(t\) is the current time and \(\tau\) is the latest restart time, then forced uniform exploration is done if:
\[\begin{split}A &:= (t - \tau) \mod \lceil \frac{K}{\gamma} \rceil,\\ A &\leq K \implies A_t = A.\end{split}\]
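A minimal Python sketch of this schedule, transcribing the formula above (arm indices are 1-based here, as in the formula, and the treatment of the boundary case \(A = 0\) and the exact rounding are assumptions; the actual implementation may differ):

```python
from math import ceil

def forced_exploration_arm(t, tau, nbArms, gamma):
    """Return the 1-based arm index to force-explore at time t, or None if
    the underlying index policy (e.g. UCB) should choose instead.

    Transcribes: A := (t - tau) mod ceil(K / gamma); if 1 <= A <= K, play arm A.
    (Treating A = 0 as "no forced exploration" is an assumption.)
    """
    if gamma is None or gamma <= 0:
        return None  # no forced exploration at all
    K = nbArms
    A = (t - tau) % int(ceil(K / gamma))
    return A if 1 <= A <= K else None

# Example: K = 5 arms and gamma = 0.1 give a cycle of ceil(5 / 0.1) = 50 steps;
# the first 5 steps after a restart sweep through arms 1..5 once per cycle.
for t in range(1, 8):
    print(t, forced_exploration_arm(t, tau=0, nbArms=5, gamma=0.1))
```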
-
choiceWithRank(rank=1)[source]¶ Essentially play uniformly at random with probability \(\gamma\); otherwise, pass the call to choiceWithRank of the underlying policy (e.g., UCB).
-
getReward(arm, reward)[source]¶ Give a reward: increase t and the number of pulls, update the cumulated sum of rewards, and update the small history (sliding window) for that arm (rewards are normalized in [0, 1]).
Reset the whole empirical average if the change-detection algorithm says so.
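A minimal, self-contained sketch of the sliding-window bookkeeping described above (a simplification: rewards are assumed already normalized, the change-detection call that would follow is only indicated in a comment, and the name update_sliding_window is hypothetical):

```python
def update_sliding_window(last_w_rewards, arm, reward, window_size):
    """Append the new reward for `arm` and keep only the latest `window_size`
    observations, so total memory stays O(K w) across the K arms.

    `last_w_rewards` is a simplified stand-in (dict of lists) for the
    attribute documented above.
    """
    last_w_rewards[arm].append(reward)
    if len(last_w_rewards[arm]) > window_size:
        last_w_rewards[arm].pop(0)  # drop the oldest observation
    # In the real policy, detect_change(arm) would now be run on this window,
    # triggering a full or per-arm restart if it returns True.

# Example: one arm, window of size 4
history = {0: []}
for r in [0.1, 0.9, 0.5, 0.3, 0.7]:
    update_sliding_window(history, 0, r, window_size=4)
print(history[0])  # [0.9, 0.5, 0.3, 0.7] -- only the last 4 rewards are kept
```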
-
__module__ = 'Policies.Monitored_UCB'¶
-
detect_change(arm)[source]¶ A change is detected for the current arm if the following test is true:
\[\left| \sum_{i=w/2+1}^{w} Y_i - \sum_{i=1}^{w/2} Y_i \right| > b\]
where \(Y_i\) is the i-th data point among the latest \(w\) data from this arm (i.e., \(X_k(t)\) for \(t = n_k - w + 1\) to \(t = n_k\), with \(n_k\) the current number of samples from arm k), threshold_b is the threshold \(b\) of the test, and window_size is the window size \(w\). A minimal sketch of this test in Python is given after the warning below.
Warning
Fixed: only the last \(w\) data points are stored, using lists whose first element gets pop()-ed out (deleted). See https://github.com/SMPyBandits/SMPyBandits/issues/174
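As announced above, here is a minimal, self-contained sketch of that two-sided half-window test. It is a simplification, not the exact code of detect_change: it only shows the core comparison, once the arm has accumulated at least \(w\) observations, and the function name is hypothetical:

```python
def half_window_change_test(last_w_rewards, window_size, threshold_b):
    """Return True if a change is detected from the recent rewards of one arm.

    Implements |sum(second half) - sum(first half)| > b over the latest
    `window_size` observations, as in the formula above. Assumes
    `window_size` is even and only fires once enough data is available.
    """
    if len(last_w_rewards) < window_size:
        return False  # not enough data yet to run the test
    data = last_w_rewards[-window_size:]   # the latest w observations Y_1..Y_w
    half = window_size // 2
    first_half_sum = sum(data[:half])      # sum of Y_1 .. Y_{w/2}
    second_half_sum = sum(data[half:])     # sum of Y_{w/2+1} .. Y_w
    return abs(second_half_sum - first_half_sum) > threshold_b

# Example: a window of 8 rewards whose mean clearly jumps from ~0.1 to ~0.9
rewards = [0.1, 0.0, 0.2, 0.1, 0.9, 1.0, 0.8, 0.9]
print(half_window_change_test(rewards, window_size=8, threshold_b=2.0))  # True
```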