Policies.ExploreThenCommit module
Different variants of the Explore-Then-Commit policy.
Reference: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies
And [Kaufmann & Moy, 2017, ICC](http://icc2017.ieee-icc.org/program/tutorials#TT01), E. Kaufmann's slides at IEEE ICC 2017.
See also: https://github.com/SMPyBandits/SMPyBandits/issues/62 and https://github.com/SMPyBandits/SMPyBandits/issues/102
Also [On Explore-Then-Commit Strategies, by A. Garivier et al., NIPS 2016](https://arxiv.org/pdf/1605.08988.pdf).
Warning
These policies sometimes do not work as well empirically as the theory predicts…
Warning
TODO: I should factor all this code and write all the variants in a more “unified” way…
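For context, here is a minimal sketch of how one of these policies can be driven in a hand-written simulation loop. It assumes the usual SMPyBandits policy interface (startGame(), choice(), getReward()) and that ETC_KnownGap can be imported from SMPyBandits.Policies; the Bernoulli arms are simulated directly with NumPy rather than with the library's arm classes:

```python
import numpy as np
from SMPyBandits.Policies import ETC_KnownGap  # assumed re-exported by the package

horizon, means = 10000, np.array([0.3, 0.4, 0.5])  # three simulated Bernoulli arms
policy = ETC_KnownGap(nbArms=len(means), horizon=horizon, gap=0.1)
policy.startGame()

rng = np.random.default_rng(0)
for t in range(horizon):
    arm = policy.choice()                       # uniform exploration, then commitment
    reward = float(rng.random() < means[arm])   # simulated Bernoulli reward
    policy.getReward(arm, reward)

print(policy.pulls)  # number of pulls of each arm
```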
Policies.ExploreThenCommit.GAP = 0.1
Default value for the gap, \(\Delta = \min_{i\neq j} \mu_i - \mu_j\); \(\Delta = 0.1\) as in many basic experiments.
class Policies.ExploreThenCommit.ETC_KnownGap(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)
Bases: Policies.EpsilonGreedy.EpsilonGreedy
Variant of the Explore-Then-Commit policy, with known horizon \(T\) and known gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).
horizon = None
Parameter \(T\): known horizon of the experiment.
gap = None
Known gap parameter for the stopping rule.
max_t = None
Time until pure exploitation, m_ steps in each arm.
property epsilon
1 while \(t \leq T_0\), 0 after, where \(T_0\) is defined by:
\[T_0 = \lfloor \frac{4}{\Delta^2} \log(\frac{T \Delta^2}{4}) \rfloor.\]
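For illustration, the exploration length \(T_0\) can be computed directly from \(T\) and \(\Delta\). A minimal sketch, assuming only NumPy; the helper exploration_length below is hypothetical and not part of the module:

```python
import numpy as np

def exploration_length(horizon, gap):
    """Hypothetical helper: length T_0 of the uniform exploration phase of
    ETC_KnownGap, following the formula above."""
    # T_0 = floor( (4 / Delta^2) * log(T * Delta^2 / 4) ), clamped to at least 1 in this sketch
    return max(1, int(np.floor((4.0 / gap**2) * np.log(horizon * gap**2 / 4.0))))

# Example: T = 10000 and Delta = 0.1 (the default GAP)
T_0 = exploration_length(10000, 0.1)
print(T_0)  # epsilon is 1 (uniform exploration) while t <= T_0, then 0 (pure exploitation)
```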
Policies.ExploreThenCommit.ALPHA = 4
Default value for parameter \(\alpha\) for ETC_RandomStop.
class Policies.ExploreThenCommit.ETC_RandomStop(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)
Bases: Policies.EpsilonGreedy.EpsilonGreedy
Variant of the Explore-Then-Commit policy, with known horizon \(T\) and a random stopping time. Uniform exploration until the stopping time.
horizon = None
Parameter \(T\): known horizon of the experiment.
alpha = None
Parameter \(\alpha\) in the formula (4 by default).
stillRandom = None
Whether the policy is still exploring uniformly at random.
property epsilon
1 while \(t \leq \tau\), 0 after, where \(\tau\) is a random stopping time, defined by:
\[\tau = \inf\{ t \in\mathbb{N},\; \max_{i \neq j} \| \widehat{X_i}(t) - \widehat{X_j}(t) \| > \sqrt{\frac{\alpha \log(T/t)}{t}} \}.\]
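For illustration, the stopping rule can be checked from the vector of current empirical means. A minimal sketch, assuming only NumPy; the helper still_exploring is hypothetical, and it uses the fact that the largest pairwise gap between empirical means is their maximum minus their minimum:

```python
import numpy as np

def still_exploring(means, t, horizon, alpha=4):
    """Hypothetical check of the ETC_RandomStop rule: keep exploring uniformly
    while no pair of empirical means is separated by more than the threshold."""
    largest_gap = np.max(means) - np.min(means)        # = max_{i != j} |X_i(t) - X_j(t)|
    threshold = np.sqrt(alpha * np.log(horizon / t) / t)
    return largest_gap <= threshold                    # True: tau not reached, epsilon stays 1

# Example after t = 50 draws with T = 10000
print(still_exploring(np.array([0.40, 0.52, 0.61]), t=50, horizon=10000))
```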
class Policies.ExploreThenCommit.ETC_FixedBudget(nbArms, horizon=None, gap=0.1, lower=0.0, amplitude=1.0)
Bases: Policies.EpsilonGreedy.EpsilonGreedy
The Fixed-Budget variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\). Sequential exploration until the stopping time.
Reference: [On Explore-Then-Commit Strategies, by A. Garivier et al., NIPS 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 1.
horizon = None
Parameter \(T\): known horizon of the experiment.
gap = None
Known gap parameter for the stopping rule.
max_t = None
Time until pure exploitation.
round_robin_index = None
Internal index to keep track of the Round-Robin phase.
best_identified_arm = None
Arm to which the policy commits; not defined at the beginning.
choice()
For \(n\) rounds, choose each arm sequentially in a Round-Robin phase, then commit to the arm with the highest empirical average:
\[n = \lfloor \frac{2}{\Delta^2} \mathcal{W}(\frac{T^2 \Delta^4}{32 \pi}) \rfloor,\]
where \(\mathcal{W}\) is the Lambert W function, defined implicitly by \(\mathcal{W}(y) \exp(\mathcal{W}(y)) = y\) for any \(y > 0\) (computed with scipy.special.lambertw()).
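For illustration, the budget \(n\) can be evaluated with scipy.special.lambertw. A minimal sketch; the helper fixed_budget is hypothetical and not part of the module:

```python
import numpy as np
from scipy.special import lambertw

def fixed_budget(horizon, gap):
    """Hypothetical helper: n = floor( (2 / Delta^2) * W(T^2 * Delta^4 / (32 * pi)) )."""
    w = np.real(lambertw(horizon**2 * gap**4 / (32 * np.pi)))  # principal branch, real part
    return max(1, int(np.floor((2.0 / gap**2) * w)))

# Example: T = 10000 and Delta = 0.1 (the default GAP)
print(fixed_budget(10000, 0.1))  # number of Round-Robin exploration rounds before committing
```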
property epsilon
1 while \(t \leq n\), 0 after.
class Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)
Bases: Policies.EpsilonGreedy.EpsilonGreedy
Base class for variants of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\). Sequential exploration until the stopping time.
Reference: [On Explore-Then-Commit Strategies, by A. Garivier et al., NIPS 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithms 2 and 3.
horizon = None
Parameter \(T\): known horizon of the experiment.
gap = None
Known gap parameter for the stopping rule.
round_robin_index = None
Internal index to keep track of the Round-Robin phase.
best_identified_arm = None
Arm to which the policy commits; not defined at the beginning.
choice()
Choose each arm sequentially in a Round-Robin phase as long as the following criterion is not satisfied, then commit to the arm with the highest empirical average:
\[(t/2) \max_{i \neq j} |\hat{\mu_i} - \hat{\mu_j}| < \log(T \Delta^2).\]
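For illustration, a minimal sketch of this stopping test, assuming only NumPy; the helper keep_exploring is hypothetical and not part of the module:

```python
import numpy as np

def keep_exploring(means, t, horizon, gap):
    """Hypothetical check of the criterion above: the Round-Robin phase goes on
    while (t/2) * max_{i != j} |mu_i - mu_j| < log(T * Delta^2)."""
    largest_gap = np.max(means) - np.min(means)  # = max_{i != j} |mu_i - mu_j|
    return (t / 2.0) * largest_gap < np.log(horizon * gap**2)

# Example after t = 20 draws with T = 10000 and Delta = 0.1
print(keep_exploring(np.array([0.45, 0.55]), t=20, horizon=10000, gap=0.1))
```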
property epsilon
1 while the committed arm is not yet fixed (exploration), 0 after (commitment).
class Policies.ExploreThenCommit.ETC_SPRT(nbArms, horizon, gap=0.1, lower=0.0, amplitude=1.0)
Bases: Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria
The Sequential Probability Ratio Test variant of the Explore-Then-Commit policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\). Very similar to ETC_RandomStop, but with sequential exploration until the stopping time.
Reference: [On Explore-Then-Commit Strategies, by A. Garivier et al., NIPS 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 2.
class Policies.ExploreThenCommit.ETC_BAI(nbArms, horizon=None, alpha=4, lower=0.0, amplitude=1.0)
Bases: Policies.ExploreThenCommit._ETC_RoundRobin_WithStoppingCriteria
The Best Arm Identification variant of the Explore-Then-Commit policy, with known horizon \(T\). Very similar to ETC_RandomStop, but with sequential exploration until the stopping time.
Reference: [On Explore-Then-Commit Strategies, by A. Garivier et al., NIPS 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 3.
alpha = None
Parameter \(\alpha\) in the formula (4 by default).
class Policies.ExploreThenCommit.DeltaUCB(nbArms, horizon, gap=0.1, alpha=4, lower=0.0, amplitude=1.0)
Bases: Policies.BasePolicy.BasePolicy
The DeltaUCB policy, with known horizon \(T\) and gap \(\Delta = \min_{i\neq j} \mu_i - \mu_j\).
Reference: [On Explore-Then-Commit Strategies, by A. Garivier et al., NIPS 2016](https://arxiv.org/pdf/1605.08988.pdf), Algorithm 4.
horizon = None
Parameter \(T\): known horizon of the experiment.
gap = None
Known gap parameter for the stopping rule.
alpha = None
Parameter \(\alpha\) in the formula (4 by default).
epsilon_T = None
Parameter \(\varepsilon_T = \Delta (\log(\mathrm{e} + T \Delta^2))^{-1/8}\).
choice()
Choose between the least chosen and the most chosen arm, based on the following rule:
\[\begin{split}A_{t,\min} &= \arg\min_k N_k(t),\\ A_{t,\max} &= \arg\max_k N_k(t),\end{split}\]
\[\begin{split}UCB_{\min} &= \hat{\mu}_{A_{t,\min}}(t-1) + \sqrt{\alpha \frac{\log(\frac{T}{N_{A_{t,\min}}})}{N_{A_{t,\min}}}},\\ UCB_{\max} &= \hat{\mu}_{A_{t,\max}}(t-1) + \Delta - \alpha \varepsilon_T,\end{split}\]
\[A(t) = \begin{cases} A_{t,\min} & \text{if } UCB_{\min} \geq UCB_{\max},\\ A_{t,\max} & \text{otherwise}. \end{cases}\]
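For illustration, a minimal sketch of this decision rule computed from the empirical means and the pull counts, assuming only NumPy; the helper delta_ucb_choice is hypothetical and not part of the module:

```python
import numpy as np

def delta_ucb_choice(means, pulls, horizon, gap, alpha=4):
    """Hypothetical sketch of the DeltaUCB rule above, given the empirical means
    and the pull counts N_k(t) of each arm."""
    eps_T = gap * np.log(np.e + horizon * gap**2) ** (-1.0 / 8.0)   # epsilon_T
    a_min, a_max = np.argmin(pulls), np.argmax(pulls)               # least / most chosen arm
    ucb_min = means[a_min] + np.sqrt(alpha * np.log(horizon / pulls[a_min]) / pulls[a_min])
    ucb_max = means[a_max] + gap - alpha * eps_T
    return a_min if ucb_min >= ucb_max else a_max

# Example with 3 arms, T = 10000 and Delta = 0.1
means = np.array([0.45, 0.50, 0.60])
pulls = np.array([10, 12, 30])
print(delta_ucb_choice(means, pulls, horizon=10000, gap=0.1))
```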