Policies.Monitored_UCB module¶
The Monitored-UCB generic policy for non-stationary bandits.
Reference: [[“Nearly Optimal Adaptive Procedure for Piecewise-Stationary Bandit: a Change-Point Detection Approach”. Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie. arXiv preprint arXiv:1802.03692, 2018]](https://arxiv.org/pdf/1802.03692)
It runs on top of a simple policy, e.g., UCB, and Monitored_IndexPolicy is a wrapper:

>>> policy = Monitored_IndexPolicy(nbArms, UCB)
>>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
It uses an additional \(\mathcal{O}(K w)\) memory for a window of size \(w\).
Warning
It can only work on top of a basic index policy based on empirical averages (and an exploration bias), like UCB, and cannot work on any Bayesian policy (for which we would have to remember all previous observations in order to restart it with a smaller history)!
-
Policies.Monitored_UCB.DELTA = 0.1¶ Default value for the parameter \(\delta\), the lower-bound on \(\delta_k^{(i)}\), the amplitude of change of arm k at a break-point. Default is 0.05.
-
Policies.Monitored_UCB.PER_ARM_RESTART = False¶ Should we reset only one arm's empirical average, or all of them? For M-UCB it is False by default.
-
Policies.Monitored_UCB.FULL_RESTART_WHEN_REFRESH = True¶ Should we fully restart the algorithm, or simply reset one arm's empirical average? For M-UCB it is True by default.
-
Policies.Monitored_UCB.WINDOW_SIZE = None¶ Default value of the window size. Give None to use the default value computed from knowledge of the horizon and the number of break-points.
-
Policies.Monitored_UCB.GAMMA_SCALE_FACTOR = 1¶ For any algorithm with uniform exploration and a formula to tune it, the exploration rate \(\gamma\) is usually too large and leads to larger regret. Multiplying it by 0.1 or 0.2 helps a lot!
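To make the interplay between WINDOW_SIZE, DELTA and GAMMA_SCALE_FACTOR concrete, here is a heavily hedged sketch of how \(w\), \(b\) and \(\gamma\) could be derived from the horizon and the number of break-points. The formulas are my reading of the tuning suggested in the paper (Remark 1), written from memory, and the function name is hypothetical; this is not necessarily what Monitored_IndexPolicy actually computes:

```python
from math import ceil, log, sqrt

def tune_m_ucb(horizon, nbArms, nb_break_points, delta=0.1, gamma_scale=0.2):
    """Hedged sketch of one possible tuning of (w, b, gamma) from the horizon T,
    the number of arms K and of break-points M, given the change lower-bound delta.

    ASSUMPTION: these formulas are my reading of Remark 1 of Cao et al. (2018),
    reproduced from memory; the class may compute its defaults differently.
    """
    T, K, M = horizon, nbArms, nb_break_points
    # Window size w: large enough that the two half-window sums can separate
    # a change of amplitude at least delta from noise.
    w = int(ceil((4.0 / delta ** 2) * (sqrt(log(2 * K * T ** 2)) + sqrt(log(2 * T))) ** 2))
    w += w % 2  # keep w even, so the window splits into two equal halves
    # Detection threshold b used in |sum(second half) - sum(first half)| > b.
    b = sqrt((w / 2.0) * log(2 * K * T ** 2))
    # Forced-exploration rate gamma, shrunk by a GAMMA_SCALE_FACTOR-like factor
    # (here 0.2 instead of 1), since the formula-tuned value is usually too large.
    gamma = sqrt(M * K * (2 * b + 3 * sqrt(w)) / (2.0 * T))
    return w, b, gamma_scale * min(1.0, gamma)

# Example: T = 10000 rounds, K = 5 arms, M = 3 break-points, delta = 0.5
print(tune_m_ucb(10000, 5, 3, delta=0.5))
```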
-
class
Policies.Monitored_UCB.Monitored_IndexPolicy(nbArms, full_restart_when_refresh=True, per_arm_restart=False, horizon=None, delta=0.1, max_nb_random_events=None, w=None, b=None, gamma=None, *args, **kwargs)[source]¶ Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy
The Monitored-UCB generic policy for non-stationary bandits, from [[“Nearly Optimal Adaptive Procedure for Piecewise-Stationary Bandit: a Change-Point Detection Approach”. Yang Cao, Zheng Wen, Branislav Kveton, Yao Xie. arXiv preprint arXiv:1802.03692, 2018]](https://arxiv.org/pdf/1802.03692)
For a window size w, it uses only \(\mathcal{O}(K w)\) memory.
-
__init__(nbArms, full_restart_when_refresh=True, per_arm_restart=False, horizon=None, delta=0.1, max_nb_random_events=None, w=None, b=None, gamma=None, *args, **kwargs)[source]¶ New policy.
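As a usage sketch complementing the doctest above, here is one way to construct the wrapper with an explicit horizon and number of break-points, so the default window size, threshold and exploration rate can be computed from them. The import path, the choice of klUCB, and the policy keyword (forwarded to BaseWrapperPolicy) are assumptions about the surrounding library, not taken from this page:

```python
from Policies import Monitored_IndexPolicy, klUCB  # assumed import path

nbArms = 5
horizon = 10000            # total number of rounds T (known or estimated)
max_nb_random_events = 3   # number of break-points M (known or estimated)

policy = Monitored_IndexPolicy(
    nbArms,
    policy=klUCB,                 # assumed keyword for the underlying index policy
    horizon=horizon,
    max_nb_random_events=max_nb_random_events,
    delta=0.1,                    # lower-bound on the amplitude of changes
    per_arm_restart=False,        # M-UCB default (see PER_ARM_RESTART above)
)

policy.startGame()
# At each round: arm = policy.choice(); then policy.getReward(arm, observed_reward)
```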
-
window_size = None¶ Parameter \(w\) for the M-UCB algorithm.
-
threshold_b = None¶ Parameter \(b\) for the M-UCB algorithm.
-
gamma = None¶ What they call \(\gamma\) in their paper: the share of uniform exploration.
-
last_update_time_tau = None¶ Keep in memory the last time a change was detected, i.e., the variable \(\tau\) in the algorithm.
-
last_w_rewards = None¶ Keep in memory all the rewards obtained since the last restart on that arm.
-
last_pulls = None¶ Keep in memory the times when each arm was last seen. Starts at -1 (never seen).
-
last_restart_times = None¶ Keep in memory the times of the last restarts (for each arm).
-
choice()[source]¶ Essentially play uniformly at random with probability \(\gamma\); otherwise, pass the call to choice of the underlying policy (e.g., UCB).
Warning
Actually, it's more complicated: if \(t\) is the current time and \(\tau\) is the latest restart time, then forced uniform exploration is done if:
\[\begin{split}A &:= (t - \tau) \mod \lceil \frac{K}{\gamma} \rceil,\\ A &\leq K \implies A_t = A.\end{split}\]
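A minimal Python sketch of this schedule, transcribing the formula above (arm indices are 1-based here, as in the formula, and the treatment of the boundary case \(A = 0\) and the exact rounding are assumptions; the actual implementation may differ):

```python
from math import ceil

def forced_exploration_arm(t, tau, nbArms, gamma):
    """Return the 1-based arm index to force-explore at time t, or None if
    the underlying index policy (e.g. UCB) should choose instead.

    Transcribes: A := (t - tau) mod ceil(K / gamma); if 1 <= A <= K, play arm A.
    (Treating A = 0 as "no forced exploration" is an assumption.)
    """
    if gamma is None or gamma <= 0:
        return None  # no forced exploration at all
    K = nbArms
    A = (t - tau) % int(ceil(K / gamma))
    return A if 1 <= A <= K else None

# Example: K = 5 arms and gamma = 0.1 give a cycle of ceil(5 / 0.1) = 50 steps;
# the first 5 steps after a restart sweep through arms 1..5 once per cycle.
for t in range(1, 8):
    print(t, forced_exploration_arm(t, tau=0, nbArms=5, gamma=0.1))
```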
-
choiceWithRank(rank=1)[source]¶ Essentially play uniformly at random with probability \(\gamma\); otherwise, pass the call to choiceWithRank of the underlying policy (e.g., UCB).
-
getReward(arm, reward)[source]¶ Give a reward: increase t and the number of pulls, update the cumulated sum of rewards, and update the small history (sliding window) for that arm (rewards are normalized in [0, 1]).
Reset the whole empirical average if the change-detection algorithm says so.
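A minimal, self-contained sketch of the sliding-window bookkeeping described above (a simplification: rewards are assumed already normalized, the change-detection call that would follow is only indicated in a comment, and the name update_sliding_window is hypothetical):

```python
def update_sliding_window(last_w_rewards, arm, reward, window_size):
    """Append the new reward for `arm` and keep only the latest `window_size`
    observations, so total memory stays O(K w) across the K arms.

    `last_w_rewards` is a simplified stand-in (dict of lists) for the
    attribute documented above.
    """
    last_w_rewards[arm].append(reward)
    if len(last_w_rewards[arm]) > window_size:
        last_w_rewards[arm].pop(0)  # drop the oldest observation
    # In the real policy, detect_change(arm) would now be run on this window,
    # triggering a full or per-arm restart if it returns True.

# Example: one arm, window of size 4
history = {0: []}
for r in [0.1, 0.9, 0.5, 0.3, 0.7]:
    update_sliding_window(history, 0, r, window_size=4)
print(history[0])  # [0.9, 0.5, 0.3, 0.7] -- only the last 4 rewards are kept
```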
-
__module__ = 'Policies.Monitored_UCB'¶
-
detect_change(arm)[source]¶ A change is detected for the current arm if the following test is true:
\[\left| \sum_{i=w/2+1}^{w} Y_i - \sum_{i=1}^{w/2} Y_i \right| > b\]
where \(Y_i\) is the i-th data point among the latest \(w\) data from this arm (i.e., \(X_k(t)\) for \(t = n_k - w + 1\) to \(t = n_k\), with \(n_k\) the current number of samples from arm k), threshold_b is the threshold \(b\) of the test, and window_size is the window size \(w\). A minimal sketch of this test in Python is given after the warning below.
Warning
Fixed: only the last \(w\) data points are stored, using lists whose first element gets pop()-ed out (deleted). See https://github.com/SMPyBandits/SMPyBandits/issues/174
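As announced above, here is a minimal, self-contained sketch of that two-sided half-window test. It is a simplification, not the exact code of detect_change: it only shows the core comparison, once the arm has accumulated at least \(w\) observations, and the function name is hypothetical:

```python
def half_window_change_test(last_w_rewards, window_size, threshold_b):
    """Return True if a change is detected from the recent rewards of one arm.

    Implements |sum(second half) - sum(first half)| > b over the latest
    `window_size` observations, as in the formula above. Assumes
    `window_size` is even and only fires once enough data is available.
    """
    if len(last_w_rewards) < window_size:
        return False  # not enough data yet to run the test
    data = last_w_rewards[-window_size:]   # the latest w observations Y_1..Y_w
    half = window_size // 2
    first_half_sum = sum(data[:half])      # sum of Y_1 .. Y_{w/2}
    second_half_sum = sum(data[half:])     # sum of Y_{w/2+1} .. Y_w
    return abs(second_half_sum - first_half_sum) > threshold_b

# Example: a window of 8 rewards whose mean clearly jumps from ~0.1 to ~0.9
rewards = [0.1, 0.0, 0.2, 0.1, 0.9, 1.0, 0.8, 0.9]
print(half_window_change_test(rewards, window_size=8, threshold_b=2.0))  # True
```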