The AdSwitch policy for non-stationary bandits, from [[“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner]](https://ewrl.files.wordpress.com/2018/09/ewrl_14_2018_paper_28.pdf)

• It uses an additional $$\mathcal{O}(\tau_\max)$$ memory for a game of maximum stationary length $$\tau_\max$$.

Warning

This implementation is still experimental!

class Policies.AdSwitch.Phase

Bases: enum.Enum

Different phases during the AdSwitch algorithm

Checking = 2
Estimation = 1
Exploitation = 3
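
The phase values above can be reproduced with a minimal standalone enum (a sketch mirroring the listing above, not the package's exact source; the comments on each member are informal glosses):

```python
import enum

class Phase(enum.Enum):
    """Different phases during the AdSwitch algorithm."""
    Estimation = 1    # estimate the gaps between arms
    Checking = 2      # test whether a distribution change occurred
    Exploitation = 3  # play the estimated best arm
```
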
Policies.AdSwitch.mymean(x)[source]

Simply numpy.mean() on x if x is non-empty, otherwise 0.0. This avoids the warning that numpy.mean() raises on an empty input:

>>> np.mean([])
/usr/local/lib/python3.6/dist-packages/numpy/core/fromnumeric.py:2957: RuntimeWarning: Mean of empty slice.
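
A helper matching this description can be sketched as follows (a hypothetical re-implementation, not the package's exact source):

```python
import numpy as np

def mymean(x):
    """numpy.mean() of x if x is non-empty, otherwise 0.0 (avoids the RuntimeWarning above)."""
    return np.mean(x) if len(x) > 0 else 0.0
```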

Policies.AdSwitch.Constant_C1 = 1.0

Default value for the constant $$C_1$$. Should be $$>0$$; the theory suggests taking it as large as possible, but too large a value hurts performance in practice.

Policies.AdSwitch.Constant_C2 = 1.0

Default value for the constant $$C_2$$. Should be $$>0$$; the theory suggests taking it as large as possible, but too large a value hurts performance in practice.

class Policies.AdSwitch.AdSwitch(nbArms, horizon=None, C1=1.0, C2=1.0, *args, **kwargs)[source]

The AdSwitch policy for non-stationary bandits, from [[“Adaptively Tracking the Best Arm with an Unknown Number of Distribution Changes”. Peter Auer, Pratik Gajane and Ronald Ortner]](https://ewrl.files.wordpress.com/2018/09/ewrl_14_2018_paper_28.pdf)

__init__(nbArms, horizon=None, C1=1.0, C2=1.0, *args, **kwargs)[source]

New policy.

horizon = None

Parameter $$T$$ for the AdSwitch algorithm: the horizon of the experiment. TODO: try to use DoublingTrickWrapper to remove the dependency on $$T$$.

C1 = None

Parameter $$C_1$$ for the AdSwitch algorithm.

C2 = None

Parameter $$C_2$$ for the AdSwitch algorithm.

phase = None

Current phase, one of Estimation, Checking or Exploitation (see Phase).

current_exploration_arm = None

Currently explored arm. It cycles uniformly over the arms, in step 2.

current_exploitation_arm = None

Currently exploited arm. It is $$\overline{a_k}$$ in the algorithm.

batch_number = None

Number of the current batch, $$k$$ in the algorithm.

last_restart_time = None

Time step of the last restart (beginning of an Estimation phase).

length_of_current_phase = None

Length of the current testing phase, computed as $$s_i$$ by compute_di_pi_si().

step_of_current_phase = None

Timer inside the current phase.

current_best_arm = None

Current best arm, when finishing step 3. Denoted $$\overline{a_k}$$ in the algorithm.

current_worst_arm = None

Current worst arm, when finishing step 3. Denoted $$\underline{a_k}$$ in the algorithm.

current_estimated_gap = None

Gap between the current best and worst arms, i.e., the largest gap, when finishing step 3. Denoted $$\widehat{\Delta_k}$$ in the algorithm.

last_used_di_pi_si = None

Memory of the currently used $$(d_i, p_i, s_i)$$.

all_rewards = None

Memory of all the rewards: one dictionary per arm, mapping time steps to rewards. Its size keeps growing until that arm is restarted!

__str__()[source]

-> str

startGame()[source]

Start the game (fill pulls and rewards with 0).

getReward(arm, reward)[source]

Get a reward from an arm.

read_range_of_rewards(arm, start, end)[source]

Read the all_rewards attribute to extract all the rewards for that arm, obtained between times start (included) and end (excluded).
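
Assuming the all_rewards structure described above (one dictionary per arm, mapping time step to reward), the extraction can be sketched as follows (hypothetical helper taking the structure explicitly, not the package's source):

```python
def read_range_of_rewards(all_rewards, arm, start, end):
    """Rewards of `arm` observed at times t with start <= t < end."""
    return [r for (t, r) in all_rewards[arm].items() if start <= t < end]
```

For instance, with all_rewards = {0: {1: 0.5, 3: 1.0, 7: 0.2}}, reading the range [1, 7) for arm 0 keeps the rewards observed at times 1 and 3 but excludes the one at time 7.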

statistical_test(t, t0)[source]

Test if at time $$t$$ there is a $$\sigma$$, $$t_0 \leq \sigma < t$$, and a pair of arms $$a,b$$, satisfying this test:

$| \hat{\mu_a}[\sigma,t] - \hat{\mu_b}[\sigma,t] | > \sqrt{\frac{C_1 \log T}{t - \sigma}}.$

where $$\hat{\mu_a}[t_1,t_2]$$ is the empirical mean for arm $$a$$ over samples obtained at times $$s \in [t_1,t_2)$$.

• Returns (True, sigma), with the smallest $$\sigma$$ satisfying the test, if the test was satisfied, or (False, None) otherwise.
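
The test above can be sketched as a brute-force scan over $$\sigma$$ and arm pairs (a hypothetical, unoptimized implementation; `rewards` is assumed to be the per-arm dict of time → reward described for all_rewards):

```python
import itertools
import numpy as np

def statistical_test(t, t0, rewards, C1, T):
    """Look for t0 <= sigma < t and arms a, b with
    |mu_a[sigma, t] - mu_b[sigma, t]| > sqrt(C1 * log(T) / (t - sigma)).
    Returns (True, smallest such sigma) or (False, None)."""
    def mu_hat(arm, t1, t2):
        # Empirical mean over the samples observed at times in [t1, t2).
        vals = [r for (s, r) in rewards[arm].items() if t1 <= s < t2]
        return np.mean(vals) if vals else 0.0
    for sigma in range(t0, t):  # smallest sigma first
        threshold = np.sqrt(C1 * np.log(T) / (t - sigma))
        for a, b in itertools.combinations(rewards, 2):
            if abs(mu_hat(a, sigma, t) - mu_hat(b, sigma, t)) > threshold:
                return True, sigma
    return False, None
```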

find_Ik()[source]

Follow the algorithm and, with a gap estimate $$\widehat{\Delta_k}$$, find $$I_k = \max\{ i : d_i \geq \widehat{\Delta_k} \}$$, where $$d_i := 2^{-i}$$. There is no need to do an exhaustive search:

$I_k := \lfloor - \log_2(\widehat{\Delta_k}) \rfloor.$
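
The closed form can be checked directly (a small sketch; the method in the class takes no arguments and uses the stored gap estimate, so this explicit-argument version is hypothetical):

```python
import math

def find_Ik(gap):
    """I_k = max{i : 2^{-i} >= gap} = floor(-log2(gap)), with d_i := 2^{-i}."""
    return math.floor(-math.log2(gap))
```

For example, for a gap of 0.3: $$2^{-1} = 0.5 \geq 0.3$$ but $$2^{-2} = 0.25 < 0.3$$, so $$I_k = 1$$.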
compute_di_pi_si()[source]

Compute the values of $$d_i$$, $$p_{k,i}$$, $$s_i$$ according to the AdSwitch algorithm.

choice()[source]

Choose an arm, following the different phases of growing length, according to the AdSwitch algorithm.