Policies.ExploreThenCommit module¶
Different variants of the Explore-Then-Commit policy.
Reference: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies
And [Kaufmann & Moy, 2017, ICC](http://icc2017.ieee-icc.org/program/tutorials#TT01), E. Kaufmann’s slides at IEEE ICC 2017.
See also: https://github.com/SMPyBandits/SMPyBandits/issues/62 and https://github.com/SMPyBandits/SMPyBandits/issues/102
Also [On Explore-Then-Commit Strategies, by A.Garivier et al, NIPS, 2016](https://arxiv.org/pdf/1605.08988.pdf)
Warning
These policies sometimes do not perform as well empirically as the theory predicts…
Warning
TODO: I should refactor all this code and write all these policies in a more “unified” way…
-
Policies.ExploreThenCommit.GAP= 0.1¶ Default value for the gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\); \(\Delta = 0.1\) as in many basic experiments (a small computational sketch follows).
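As a quick illustration, the gap can be computed directly from a vector of arm means. A minimal sketch (the helper name smallest_gap is hypothetical, assuming the true means are known, e.g. in a simulation; this is not part of the library):

```python
# Minimal sketch (not library code): the gap Delta used throughout this module,
# read as the smallest difference between two distinct arm means.
import numpy as np

def smallest_gap(means):
    """Return min_{i != j} |mu_i - mu_j|, computed on the sorted vector of means."""
    sorted_means = np.sort(np.asarray(means, dtype=float))
    return float(np.min(np.diff(sorted_means)))

print(smallest_gap([0.1, 0.5, 0.6]))  # 0.1, the same value as the default GAP
```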
-
class
Policies.ExploreThenCommit.ETC_KnownGap(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.EpsilonGreedy.EpsilonGreedy
Variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).
-
horizon= None¶ Parameter \(T\) = known horizon of the experiment.
-
gap= None¶ Known gap parameter for the stopping rule.
-
max_t= None¶ Time until pure exploitation, \(m\) steps in each arm.
-
property
epsilon¶ 1 while \(t \leq T_0\), 0 after, where \(T_0\) is defined by (a standalone sketch of this schedule follows this class entry):
\[T_0 = \lfloor \frac{4}{\Delta^2} \log(\frac{T \Delta^2}{4}) \rfloor.\]
-
__module__= 'Policies.ExploreThenCommit'¶
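The exploration length \(T_0\) and the resulting epsilon schedule of ETC_KnownGap are easy to reproduce outside the class. A minimal standalone sketch (hypothetical helper names, not the class's actual implementation; it simply evaluates the formula above):

```python
# Sketch of the ETC_KnownGap schedule: explore uniformly for T_0 steps, then commit.
from math import floor, log

def exploration_length(horizon, gap):
    """T_0 = floor((4 / gap**2) * log(horizon * gap**2 / 4)), clipped at 0."""
    return max(0, floor((4.0 / gap**2) * log(horizon * gap**2 / 4.0)))

def epsilon(t, horizon, gap):
    """Exploration probability: 1 while t <= T_0 (uniform exploration), 0 afterwards (commit)."""
    return 1.0 if t <= exploration_length(horizon, gap) else 0.0

print(exploration_length(10000, 0.1))  # 1287 exploration steps for T=10000, Delta=0.1
```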
-
-
Policies.ExploreThenCommit.ALPHA= 4¶ Default value for the parameter \(\alpha\) for ETC_RandomStop.
-
class
Policies.ExploreThenCommit.ETC_RandomStop(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.EpsilonGreedy.EpsilonGreedy
Variant of the Explore-Then-Commit policy, with known horizon \(T\) and a random stopping time. Uniform exploration until the stopping time.
-
horizon= None¶ Parameter \(T\) = known horizon of the experiment.
-
alpha= None¶ Parameter \(\alpha\) in the formula (4 by default).
-
stillRandom= None¶ Flag: is the policy still exploring randomly?
-
property
epsilon¶ 1 while \(t \leq \tau\), 0 after, where \(\tau\) is a random stopping time, defined by (a standalone sketch of this test follows this class entry):
\[\tau = \inf\{ t \in\mathbb{N},\; \max_{i \neq j} \| \widehat{X_i}(t) - \widehat{X_j}(t) \| > \sqrt{\frac{4 \log(T/t)}{t}} \}.\]
-
__module__= 'Policies.ExploreThenCommit'¶
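The random stopping test above can be illustrated with a small standalone function. A sketch (hypothetical helper name, not the class's actual code), assuming the empirical means \(\widehat{X_i}(t)\) are available as a vector:

```python
# Sketch of the ETC_RandomStop stopping test: stop exploring as soon as the largest
# difference between empirical means exceeds the threshold sqrt(4 * log(T/t) / t).
import numpy as np

def should_stop_exploring(empirical_means, t, horizon):
    means = np.asarray(empirical_means, dtype=float)
    largest_difference = means.max() - means.min()      # = max_{i != j} |X_i(t) - X_j(t)|
    threshold = np.sqrt(4.0 * np.log(horizon / t) / t)
    return bool(largest_difference > threshold)

# With well-separated empirical means, the test fires early:
print(should_stop_exploring([0.2, 0.5, 0.8], t=100, horizon=10000))  # True
```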
-
-
class
Policies.ExploreThenCommit.ETC_FixedBudget(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.EpsilonGreedy.EpsilonGreedy
The Fixed-Budget variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\). Sequential exploration until the stopping time.
Reference: [On Explore-Then-Commit Strategies, by A.Garivier et al, NIPS, 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 1.
-
horizon= None¶ Parameter \(T\) = known horizon of the experiment.
-
gap= None¶ Known gap parameter for the stopping rule.
-
max_t= None¶ Time until pure exploitation.
-
round_robin_index= None¶ Internal index to keep track of the Round-Robin phase.
-
best_identified_arm= None¶ Arm on which we commit; not defined at the beginning.
-
choice()[source]¶ For \(n\) rounds, choose each arm sequentially in a Round-Robin phase, then commit to the arm with the highest empirical average (a standalone sketch of the budget computation follows this class entry).
\[n = \left\lfloor \frac{2}{\Delta^2} \mathcal{W}\left(\frac{T^2 \Delta^4}{32 \pi}\right) \right\rfloor,\]
where \(\mathcal{W}\) is the Lambert W function, defined implicitly by \(W(y) \exp(W(y)) = y\) for any \(y > 0\) (and computed with scipy.special.lambertw()).
-
property
epsilon¶ 1 while \(t \leq n\), 0 after.
-
__module__= 'Policies.ExploreThenCommit'¶
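The fixed exploration budget \(n\) used by ETC_FixedBudget can be computed directly with scipy. A minimal sketch (hypothetical helper name, not the class's internals), evaluating the Lambert W formula above:

```python
# Sketch of the fixed exploration budget n of ETC_FixedBudget, via the Lambert W function.
from math import floor, pi
from scipy.special import lambertw

def fixed_budget(horizon, gap):
    """n = floor((2 / gap**2) * W(horizon**2 * gap**4 / (32 * pi))), clipped at 0."""
    w = lambertw(horizon**2 * gap**4 / (32.0 * pi)).real  # principal branch, real part
    return max(0, floor((2.0 / gap**2) * w))

print(fixed_budget(horizon=10000, gap=0.1))  # total budget of the Round-Robin exploration phase
```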
-
class
Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.EpsilonGreedy.EpsilonGreedy
Base class for variants of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\). Sequential exploration until the stopping time.
Reference: [On Explore-Then-Commit Strategies, by A.Garivier et al, NIPS, 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithms 2 and 3.
-
horizon= None¶ Parameter \(T\) = known horizon of the experiment.
-
gap= None¶ Known gap parameter for the stopping rule.
-
round_robin_index= None¶ Internal index to keep track of the Round-Robin phase.
-
best_identified_arm= None¶ Arm on which we commit; not defined at the beginning.
-
choice()[source]¶ Choose each arm sequentially in a Round-Robin phase, as long as the following inequality holds, then commit to the arm with the highest empirical average (a standalone simulation sketch follows this class entry):
\[(t/2) \max_{i \neq j} |\hat{\mu_i} - \hat{\mu_j}| < \log(T \Delta^2).\]
-
property
epsilon¶ 1 while the committed arm is not yet fixed, 0 after.
-
__module__= 'Policies.ExploreThenCommit'¶
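The Round-Robin exploration with this stopping rule can be simulated end-to-end in a few lines. A standalone sketch (hypothetical function, not the class's choice() method), assuming the inequality above is the continue-exploring condition and that we commit to the arm with the highest empirical mean as soon as it fails:

```python
# Sketch of Round-Robin exploration with the stopping rule of
# _ETC_RoundRobin_WithStoppingCriteria, then pure exploitation of the identified arm.
import numpy as np

def explore_then_commit(draw_arm, nb_arms, horizon, gap):
    rewards = np.zeros(nb_arms)
    pulls = np.zeros(nb_arms)
    best_arm = None
    for t in range(1, horizon + 1):
        arm = (t - 1) % nb_arms if best_arm is None else best_arm   # Round-Robin, then commit
        rewards[arm] += draw_arm(arm)
        pulls[arm] += 1
        if best_arm is None and t % nb_arms == 0:                   # test at the end of each round
            means = rewards / pulls
            if (t / 2.0) * (means.max() - means.min()) >= np.log(horizon * gap**2):
                best_arm = int(np.argmax(means))                    # commit to the empirically best arm
    return best_arm

# Example with two Bernoulli arms of means 0.1 and 0.6 (toy setup):
rng = np.random.default_rng(0)
print(explore_then_commit(lambda a: float(rng.random() < [0.1, 0.6][a]), 2, 10000, 0.5))  # typically 1
```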
-
class
Policies.ExploreThenCommit.ETC_SPRT(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria
The Sequential Probability Ratio Test variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).
Very similar to ETC_RandomStop, but with sequential exploration until the stopping time.
Reference: [On Explore-Then-Commit Strategies, by A.Garivier et al, NIPS, 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 2.
-
__module__= 'Policies.ExploreThenCommit'¶
-
class
Policies.ExploreThenCommit.ETC_BAI(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria
The Best Arm Identification variant of the Explore-Then-Commit policy, with known horizon \(T\).
Very similar to ETC_RandomStop, but with sequential exploration until the stopping time.
Reference: [On Explore-Then-Commit Strategies, by A.Garivier et al, NIPS, 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 3.
-
alpha= None¶ Parameter \(\alpha\) in the formula (4 by default).
-
__module__= 'Policies.ExploreThenCommit'¶
-
class
Policies.ExploreThenCommit.DeltaUCB(nbArms, horizon, gap=0.1, alpha=4, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.BasePolicy.BasePolicy
The DeltaUCB policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).
Reference: [On Explore-Then-Commit Strategies, by A.Garivier et al, NIPS, 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 4.
-
horizon= None¶ Parameter \(T\) = known horizon of the experiment.
-
gap= None¶ Known gap parameter for the stopping rule.
-
alpha= None¶ Parameter \(\alpha\) in the formula (4 by default).
-
epsilon_T= None¶ Parameter \(\varepsilon_T = \Delta (\log(\mathrm{e} + T \Delta^2))^{-1/8}\).
-
choice()[source]¶ Choose between the most chosen and the least chosen arm, based on the following rule (a standalone sketch follows this class entry):
\[\begin{split}A_{t,\min} &= \arg\min_k N_k(t),\\ A_{t,\max} &= \arg\max_k N_k(t).\end{split}\]
\[\begin{split}UCB_{\min} &= \hat{\mu}_{A_{t,\min}}(t-1) + \sqrt{\alpha \frac{\log\left(\frac{T}{N_{A_{t,\min}}}\right)}{N_{A_{t,\min}}}}, \\ UCB_{\max} &= \hat{\mu}_{A_{t,\max}}(t-1) + \Delta - \alpha \varepsilon_T.\end{split}\]
\[A(t) = \begin{cases} A_{t,\min} & \text{if } UCB_{\min} \geq UCB_{\max},\\ A_{t,\max} & \text{otherwise}. \end{cases}\]
-
__module__= 'Policies.ExploreThenCommit'¶
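The decision rule of DeltaUCB can be written as a small standalone function of the current empirical statistics. A sketch (hypothetical helper name and toy numbers, not the class's actual choice() method):

```python
# Sketch of the DeltaUCB decision rule: compare an index of the least pulled arm
# against a gap-corrected index of the most pulled arm.
import numpy as np

def delta_ucb_choice(means, pulls, horizon, gap, alpha=4):
    eps_T = gap * np.log(np.e + horizon * gap**2) ** (-1.0 / 8.0)   # epsilon_T from the class
    a_min, a_max = int(np.argmin(pulls)), int(np.argmax(pulls))
    ucb_min = means[a_min] + np.sqrt(alpha * np.log(horizon / pulls[a_min]) / pulls[a_min])
    ucb_max = means[a_max] + gap - alpha * eps_T
    return a_min if ucb_min >= ucb_max else a_max

# Example call with toy empirical statistics:
print(delta_ucb_choice(np.array([0.4, 0.6]), np.array([10.0, 50.0]), horizon=10000, gap=0.1))  # 0
```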