Policies.ExploreThenCommit module
Different variants of the Explore-Then-Commit policy.
Reference: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies
And [Kaufmann & Moy, 2017, ICC](http://icc2017.ieee-icc.org/program/tutorials#TT01), E. Kaufmann's slides at IEEE ICC 2017.
See also: https://github.com/SMPyBandits/SMPyBandits/issues/62 and https://github.com/SMPyBandits/SMPyBandits/issues/102
Also [On Explore-Then-Commit Strategies, by A. Garivier et al., NIPS 2016](https://arxiv.org/pdf/1605.08988.pdf).
Warning
These policies sometimes do not work as well empirically as the theory predicts…
Warning
TODO: I should factor all this code and write all the variants in a more “unified” way…
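For context, here is a minimal sketch of how one of these policies can be driven in a hand-written simulation loop. It assumes the usual SMPyBandits policy interface (startGame(), choice(), getReward()) and that ETC_KnownGap can be imported from SMPyBandits.Policies; the Bernoulli arms are simulated directly with NumPy rather than with the library's arm classes:

```python
import numpy as np
from SMPyBandits.Policies import ETC_KnownGap  # assumed re-exported by the package

horizon, means = 10000, np.array([0.3, 0.4, 0.5])  # three simulated Bernoulli arms
policy = ETC_KnownGap(nbArms=len(means), horizon=horizon, gap=0.1)
policy.startGame()

rng = np.random.default_rng(0)
for t in range(horizon):
    arm = policy.choice()                       # uniform exploration, then commitment
    reward = float(rng.random() < means[arm])   # simulated Bernoulli reward
    policy.getReward(arm, reward)

print(policy.pulls)  # number of pulls of each arm
```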
Policies.ExploreThenCommit.GAP = 0.1
Default value for the gap, \(\Delta = \min_{i\neq j} \mu_i - \mu_j\); \(\Delta = 0.1\) as in many basic experiments.
class Policies.ExploreThenCommit.ETC_KnownGap(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)
Bases: Policies.EpsilonGreedy.EpsilonGreedy
Variant of the Explore-Then-Commit policy, with known horizon \(T\) and known gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).
horizon = None
Parameter \(T\): known horizon of the experiment.
gap = None
Known gap parameter for the stopping rule.
max_t = None
Time until pure exploitation, m_ steps in each arm.
property epsilon
1 while \(t \leq T_0\), 0 after, where \(T_0\) is defined by:
\[T_0 = \lfloor \frac{4}{\Delta^2} \log(\frac{T \Delta^2}{4}) \rfloor.\]
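For illustration, the exploration length \(T_0\) can be computed directly from \(T\) and \(\Delta\). A minimal sketch, assuming only NumPy; the helper exploration_length below is hypothetical and not part of the module:

```python
import numpy as np

def exploration_length(horizon, gap):
    """Hypothetical helper: length T_0 of the uniform exploration phase of
    ETC_KnownGap, following the formula above."""
    # T_0 = floor( (4 / Delta^2) * log(T * Delta^2 / 4) ), clamped to at least 1 in this sketch
    return max(1, int(np.floor((4.0 / gap**2) * np.log(horizon * gap**2 / 4.0))))

# Example: T = 10000 and Delta = 0.1 (the default GAP)
T_0 = exploration_length(10000, 0.1)
print(T_0)  # epsilon is 1 (uniform exploration) while t <= T_0, then 0 (pure exploitation)
```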
Policies.ExploreThenCommit.ALPHA = 4
Default value for parameter \(\alpha\) for ETC_RandomStop.
class Policies.ExploreThenCommit.ETC_RandomStop(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)
Bases: Policies.EpsilonGreedy.EpsilonGreedy
Variant of the Explore-Then-Commit policy, with known horizon \(T\) and a random stopping time. Uniform exploration until the stopping time.
horizon = None
Parameter \(T\): known horizon of the experiment.
alpha = None
Parameter \(\alpha\) in the formula (4 by default).
stillRandom = None
Whether the policy is still exploring uniformly at random.
property epsilon
1 while \(t \leq \tau\), 0 after, where \(\tau\) is a random stopping time, defined by:
\[\tau = \inf\{ t \in\mathbb{N},\; \max_{i \neq j} \| \widehat{X_i}(t) - \widehat{X_j}(t) \| > \sqrt{\frac{\alpha \log(T/t)}{t}} \}.\]
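For illustration, the stopping rule can be checked from the vector of current empirical means. A minimal sketch, assuming only NumPy; the helper still_exploring is hypothetical, and it uses the fact that the largest pairwise gap between empirical means is their maximum minus their minimum:

```python
import numpy as np

def still_exploring(means, t, horizon, alpha=4):
    """Hypothetical check of the ETC_RandomStop rule: keep exploring uniformly
    while no pair of empirical means is separated by more than the threshold."""
    largest_gap = np.max(means) - np.min(means)        # = max_{i != j} |X_i(t) - X_j(t)|
    threshold = np.sqrt(alpha * np.log(horizon / t) / t)
    return largest_gap <= threshold                    # True: tau not reached, epsilon stays 1

# Example after t = 50 draws with T = 10000
print(still_exploring(np.array([0.40, 0.52, 0.61]), t=50, horizon=10000))
```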
class Policies.ExploreThenCommit.ETC_FixedBudget(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)
Bases: Policies.EpsilonGreedy.EpsilonGreedy
The Fixed-Budget variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\). Sequential exploration until the stopping time.
Reference: [On Explore-Then-Commit Strategies, by A. Garivier et al., NIPS 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 1.
horizon = None
Parameter \(T\): known horizon of the experiment.
gap = None
Known gap parameter for the stopping rule.
max_t = None
Time until pure exploitation.
round_robin_index = None
Internal index to keep track of the Round-Robin phase.
best_identified_arm = None
Arm to which the policy commits; not defined at the beginning.
choice()
For \(n\) rounds, choose each arm sequentially in a Round-Robin phase, then commit to the arm with the highest empirical average:
\[n = \lfloor \frac{2}{\Delta^2} \mathcal{W}(\frac{T^2 \Delta^4}{32 \pi}) \rfloor,\]
where \(\mathcal{W}\) is the Lambert W function, defined implicitly by \(\mathcal{W}(y) \exp(\mathcal{W}(y)) = y\) for any \(y > 0\) (computed with scipy.special.lambertw()).
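For illustration, the budget \(n\) can be evaluated with scipy.special.lambertw. A minimal sketch; the helper fixed_budget is hypothetical and not part of the module:

```python
import numpy as np
from scipy.special import lambertw

def fixed_budget(horizon, gap):
    """Hypothetical helper: n = floor( (2 / Delta^2) * W(T^2 * Delta^4 / (32 * pi)) )."""
    w = np.real(lambertw(horizon**2 * gap**4 / (32 * np.pi)))  # principal branch, real part
    return max(1, int(np.floor((2.0 / gap**2) * w)))

# Example: T = 10000 and Delta = 0.1 (the default GAP)
print(fixed_budget(10000, 0.1))  # number of Round-Robin exploration rounds before committing
```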
property epsilon
1 while \(t \leq n\), 0 after.
class Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)
Bases: Policies.EpsilonGreedy.EpsilonGreedy
Base class for variants of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\). Sequential exploration until the stopping time.
Reference: [On Explore-Then-Commit Strategies, by A. Garivier et al., NIPS 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithms 2 and 3.
horizon = None
Parameter \(T\): known horizon of the experiment.
gap = None
Known gap parameter for the stopping rule.
round_robin_index = None
Internal index to keep track of the Round-Robin phase.
best_identified_arm = None
Arm to which the policy commits; not defined at the beginning.
choice()
Choose each arm sequentially in a Round-Robin phase as long as the following criterion is not satisfied, then commit to the arm with the highest empirical average:
\[(t/2) \max_{i \neq j} |\hat{\mu_i} - \hat{\mu_j}| < \log(T \Delta^2).\]
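For illustration, a minimal sketch of this stopping test, assuming only NumPy; the helper keep_exploring is hypothetical and not part of the module:

```python
import numpy as np

def keep_exploring(means, t, horizon, gap):
    """Hypothetical check of the criterion above: the Round-Robin phase goes on
    while (t/2) * max_{i != j} |mu_i - mu_j| < log(T * Delta^2)."""
    largest_gap = np.max(means) - np.min(means)  # = max_{i != j} |mu_i - mu_j|
    return (t / 2.0) * largest_gap < np.log(horizon * gap**2)

# Example after t = 20 draws with T = 10000 and Delta = 0.1
print(keep_exploring(np.array([0.45, 0.55]), t=20, horizon=10000, gap=0.1))
```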
property epsilon
1 while the committed arm is not yet fixed (exploration), 0 after (commitment).
class Policies.ExploreThenCommit.ETC_SPRT(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)
Bases: Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria
The Sequential Probability Ratio Test variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\). Very similar to ETC_RandomStop, but with sequential exploration until the stopping time.
Reference: [On Explore-Then-Commit Strategies, by A. Garivier et al., NIPS 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 2.
class Policies.ExploreThenCommit.ETC_BAI(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)
Bases: Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria
The Best Arm Identification variant of the Explore-Then-Commit policy, with known horizon \(T\). Very similar to ETC_RandomStop, but with sequential exploration until the stopping time.
Reference: [On Explore-Then-Commit Strategies, by A. Garivier et al., NIPS 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 3.
alpha = None
Parameter \(\alpha\) in the formula (4 by default).
class Policies.ExploreThenCommit.DeltaUCB(nbArms, horizon, gap=0.1, alpha=4, lower=0.0, amplitude=1.0)
Bases: Policies.BasePolicy.BasePolicy
The DeltaUCB policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).
Reference: [On Explore-Then-Commit Strategies, by A. Garivier et al., NIPS 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 4.
horizon = None
Parameter \(T\): known horizon of the experiment.
gap = None
Known gap parameter for the stopping rule.
alpha = None
Parameter \(\alpha\) in the formula (4 by default).
epsilon_T = None
Parameter \(\varepsilon_T = \Delta (\log(\mathrm{e} + T \Delta^2))^{-1/8}\).
choice()
Choose between the least chosen and the most chosen arm, based on the following rule:
\[\begin{split}A_{t,\min} &= \arg\min_k N_k(t),\\ A_{t,\max} &= \arg\max_k N_k(t),\end{split}\]
\[\begin{split}UCB_{\min} &= \hat{\mu}_{A_{t,\min}}(t-1) + \sqrt{\alpha \frac{\log(\frac{T}{N_{A_{t,\min}}})}{N_{A_{t,\min}}}},\\ UCB_{\max} &= \hat{\mu}_{A_{t,\max}}(t-1) + \Delta - \alpha \varepsilon_T,\end{split}\]
\[A(t) = \begin{cases} A_{t,\min} & \text{if } UCB_{\min} \geq UCB_{\max},\\ A_{t,\max} & \text{otherwise}. \end{cases}\]
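For illustration, a minimal sketch of this decision rule computed from the empirical means and the pull counts, assuming only NumPy; the helper delta_ucb_choice is hypothetical and not part of the module:

```python
import numpy as np

def delta_ucb_choice(means, pulls, horizon, gap, alpha=4):
    """Hypothetical sketch of the DeltaUCB rule above, given the empirical means
    and the pull counts N_k(t) of each arm."""
    eps_T = gap * np.log(np.e + horizon * gap**2) ** (-1.0 / 8.0)   # epsilon_T
    a_min, a_max = np.argmin(pulls), np.argmax(pulls)               # least / most chosen arm
    ucb_min = means[a_min] + np.sqrt(alpha * np.log(horizon / pulls[a_min]) / pulls[a_min])
    ucb_max = means[a_max] + gap - alpha * eps_T
    return a_min if ucb_min >= ucb_max else a_max

# Example with 3 arms, T = 10000 and Delta = 0.1
means = np.array([0.45, 0.50, 0.60])
pulls = np.array([10, 12, 30])
print(delta_ucb_choice(means, pulls, horizon=10000, gap=0.1))
```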