Policies.UCBoost module¶
The UCBoost policy for bounded bandits (on [0, 1]).
Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
Warning
The whole goal of their paper is to provide a numerically efficient alternative to kl-UCB, so for my comparison to be fair, I should either use the Python versions of the klUCB utility functions (from kullback) or write C or Cython versions of this UCBoost module. My conclusion is that kl-UCB is always faster than UCBoost.
- Policies.UCBoost.c = 0.0¶ Default value for better practical performance.
- Policies.UCBoost.tolerance_with_upperbound = 1.0001¶ Tolerance when checking (with assert) that the solution(s) of any convex problem are correct.
- Policies.UCBoost.CHECK_SOLUTION = False¶ Whether to check that the solution(s) of any convex problem are correct.
Warning
This is currently disabled, to try to optimize this module. Warning: re-enable it when debugging!
- Policies.UCBoost.squadratic_distance(p, q)[source]¶ The quadratic distance, \(d_{sq}(p, q) := 2 (p - q)^2\).
- Policies.UCBoost.solution_pb_sq(p, upperbound, check_solution=False)[source]¶ Closed-form solution of the following optimisation problem, for \(d = d_{sq}\) the squadratic_distance() function:
\[\begin{split}P_1(d_{sq})(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_{sq}(p, q) \leq \delta.\end{split}\]
The solution is:
\[q^* = p + \sqrt{\frac{\delta}{2}}.\]
\(\delta\) is the upperbound parameter on the semi-distance between input \(p\) and solution \(q^*\).
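For concreteness, here is a minimal Python sketch of this closed form (names mirror the signatures above; clipping \(q^*\) to 1 is an assumption, since \(\Theta = [0, 1]\)):

```python
from math import sqrt

def squadratic_distance(p, q):
    """Quadratic distance d_sq(p, q) = 2 (p - q)^2."""
    return 2.0 * (p - q) ** 2

def solution_pb_sq(p, upperbound):
    """Solution of P_1(d_sq): q* = p + sqrt(delta / 2), clipped to [0, 1]."""
    return min(1.0, p + sqrt(upperbound / 2.0))
```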
- class Policies.UCBoost.UCB_sq(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.IndexPolicy.IndexPolicy
The UCB(d_sq) policy for bounded bandits (on [0, 1]).
It uses solution_pb_sq() as a closed-form solution to compute the UCB indexes (using the quadratic distance). Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
- __init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.
- c = None¶ Parameter c, used in the \(c \log(\log t)\) term of the index.
- computeIndex(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_{sq})\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
- __module__ = 'Policies.UCBoost'¶
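Putting the pieces together, a minimal sketch of the index computed above (the function name and the X_k, N_k arguments are mine, not the class API; t is assumed large enough that \(\log(\log t)\) is defined):

```python
from math import log, sqrt

def compute_index_sq(X_k, N_k, t, c=0.0):
    """Index of arm k: P_1(d_sq)(mu_k, (log(t) + c log(log(t))) / N_k)."""
    mu_k = X_k / N_k  # empirical mean of arm k
    delta = (log(t) + c * log(log(t))) / N_k  # exploration level, assuming t > e
    return min(1.0, mu_k + sqrt(delta / 2.0))  # closed form of solution_pb_sq
```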
- Policies.UCBoost.biquadratic_distance(p, q)[source]¶ The biquadratic distance, \(d_{bq}(p, q) := 2 (p - q)^2 + \frac{4}{9} (p - q)^4\).
- Policies.UCBoost.solution_pb_bq(p, upperbound, check_solution=False)[source]¶ Closed-form solution of the following optimisation problem, for \(d = d_{bq}\) the biquadratic_distance() function:
\[\begin{split}P_1(d_{bq})(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_{bq}(p, q) \leq \delta.\end{split}\]
The solution is:
\[q^* = \min\left(1, p + \sqrt{-\frac{9}{4} + \sqrt{\frac{81}{16} + \frac{9}{4} \delta}}\right).\]
\(\delta\) is the upperbound parameter on the semi-distance between input \(p\) and solution \(q^*\).
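A minimal sketch of this closed form (solving \(2x^2 + \frac{4}{9}x^4 = \delta\) for \(x = q - p\) as a quadratic in \(x^2\) gives the nested square roots):

```python
from math import sqrt

def biquadratic_distance(p, q):
    """Biquadratic distance d_bq(p, q) = 2 (p - q)^2 + (4/9) (p - q)^4."""
    return 2.0 * (p - q) ** 2 + (4.0 / 9.0) * (p - q) ** 4

def solution_pb_bq(p, upperbound):
    """Solution of P_1(d_bq): q* = min(1, p + sqrt(-9/4 + sqrt(81/16 + (9/4) delta)))."""
    return min(1.0, p + sqrt(-9.0 / 4.0 + sqrt(81.0 / 16.0 + 9.0 / 4.0 * upperbound)))
```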
- class Policies.UCBoost.UCB_bq(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.IndexPolicy.IndexPolicy
The UCB(d_bq) policy for bounded bandits (on [0, 1]).
It uses solution_pb_bq() as a closed-form solution to compute the UCB indexes (using the biquadratic distance). Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
- __init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.
- c = None¶ Parameter c, used in the \(c \log(\log t)\) term of the index.
- computeIndex(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_{bq})\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
- __module__ = 'Policies.UCBoost'¶
- Policies.UCBoost.hellinger_distance(p, q)[source]¶ The Hellinger distance, \(d_{h}(p, q) := (\sqrt{p} - \sqrt{q})^2 + (\sqrt{1 - p} - \sqrt{1 - q})^2\).
- Policies.UCBoost.solution_pb_hellinger(p, upperbound, check_solution=False)[source]¶ Closed-form solution of the following optimisation problem, for \(d = d_{h}\) the hellinger_distance() function:
\[\begin{split}P_1(d_h)(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_h(p, q) \leq \delta.\end{split}\]
The solution is:
\[q^* = \left( \left(1 - \frac{\delta}{2}\right) \sqrt{p} + \sqrt{(1 - p) \left(\delta - \frac{\delta^2}{4}\right)} \right)^{2 \times \boldsymbol{1}(\delta < 2 - 2 \sqrt{p})}.\]
\(\delta\) is the upperbound parameter on the semi-distance between input \(p\) and solution \(q^*\).
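A minimal sketch of this closed form (the exponent \(2 \times \boldsymbol{1}(\delta < 2 - 2 \sqrt{p})\) is written as an explicit branch: when \(\delta \geq 2 - 2\sqrt{p}\) the exponent is 0 and \(q^* = 1\)):

```python
from math import sqrt

def hellinger_distance(p, q):
    """Hellinger distance d_h(p, q) = (sqrt(p) - sqrt(q))^2 + (sqrt(1-p) - sqrt(1-q))^2."""
    return (sqrt(p) - sqrt(q)) ** 2 + (sqrt(1.0 - p) - sqrt(1.0 - q)) ** 2

def solution_pb_hellinger(p, upperbound):
    """Solution of P_1(d_h), by cases on the indicator in the exponent."""
    delta = upperbound
    if delta >= 2.0 - 2.0 * sqrt(p):
        return 1.0
    return ((1.0 - delta / 2.0) * sqrt(p) + sqrt((1.0 - p) * (delta - delta ** 2 / 4.0))) ** 2
```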
- class Policies.UCBoost.UCB_h(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.IndexPolicy.IndexPolicy
The UCB(d_h) policy for bounded bandits (on [0, 1]).
It uses solution_pb_hellinger() as a closed-form solution to compute the UCB indexes (using the Hellinger distance). Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
- __init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.
- c = None¶ Parameter c, used in the \(c \log(\log t)\) term of the index.
- computeIndex(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_h)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
- __module__ = 'Policies.UCBoost'¶
- Policies.UCBoost.eps = 1e-15¶ Threshold value: everything in [0, 1] is truncated to [eps, 1 - eps].
- Policies.UCBoost.kullback_leibler_distance_on_mean(p, q)[source]¶ Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence
\[\mathrm{kl}(p, q) = \mathrm{KL}(\mathcal{B}(p), \mathcal{B}(q)) = p \log\left(\frac{p}{q}\right) + (1-p) \log\left(\frac{1-p}{1-q}\right).\]
- Policies.UCBoost.kullback_leibler_distance_lowerbound(p, q)[source]¶ Lower-bound on the Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence
\[d_{lb}(p, q) = p \log(p) + (1-p) \log\left(\frac{1-p}{1-q}\right).\]
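A minimal sketch of both divergences, using the eps truncation defined above to avoid \(\log(0)\) (the clamp helper is mine):

```python
from math import log

eps = 1e-15  # everything in [0, 1] is truncated to [eps, 1 - eps]

def clamp(x):
    """Truncate x to [eps, 1 - eps], to avoid log(0) and division by zero."""
    return min(max(x, eps), 1.0 - eps)

def kullback_leibler_distance_on_mean(p, q):
    """Bernoulli KL: kl(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    p, q = clamp(p), clamp(q)
    return p * log(p / q) + (1.0 - p) * log((1.0 - p) / (1.0 - q))

def kullback_leibler_distance_lowerbound(p, q):
    """Lower bound: d_lb(p, q) = p log(p) + (1-p) log((1-p)/(1-q))."""
    p, q = clamp(p), clamp(q)
    return p * log(p) + (1.0 - p) * log((1.0 - p) / (1.0 - q))
```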
- Policies.UCBoost.solution_pb_kllb(p, upperbound, check_solution=False)[source]¶ Closed-form solution of the following optimisation problem, for \(d = d_{lb}\) the proposed lower-bound on the Kullback-Leibler binary distance (kullback_leibler_distance_lowerbound()):
\[\begin{split}P_1(d_{lb})(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_{lb}(p, q) \leq \delta.\end{split}\]
The solution is:
\[q^* = 1 - (1 - p) \exp\left(\frac{p \log(p) - \delta}{1 - p}\right).\]
\(\delta\) is the upperbound parameter on the semi-distance between input \(p\) and solution \(q^*\).
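A minimal sketch of this closed form (p is truncated as above so that \(\log(p)\) and the division by \(1 - p\) are safe):

```python
from math import exp, log

def solution_pb_kllb(p, upperbound):
    """Solution of P_1(d_lb): q* = 1 - (1 - p) exp((p log(p) - delta) / (1 - p))."""
    p = min(max(p, 1e-15), 1.0 - 1e-15)  # eps truncation
    return 1.0 - (1.0 - p) * exp((p * log(p) - upperbound) / (1.0 - p))
```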
- class Policies.UCBoost.UCB_lb(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.IndexPolicy.IndexPolicy
The UCB(d_lb) policy for bounded bandits (on [0, 1]).
It uses solution_pb_kllb() as a closed-form solution to compute the UCB indexes (using the lower-bound on the Kullback-Leibler distance). Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
- __init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.
- c = None¶ Parameter c, used in the \(c \log(\log t)\) term of the index.
- computeIndex(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_{lb})\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
- __module__ = 'Policies.UCBoost'¶
- Policies.UCBoost.distance_t(p, q)[source]¶ A shifted tangent line function of kullback_leibler_distance_on_mean():
\[d_t(p, q) = \frac{2 q}{p + 1} + p \log\left(\frac{p}{p + 1}\right) + \log\left(\frac{2}{\mathrm{e}(p + 1)}\right).\]
Warning
I think there might be a typo in the formula in the article, as this \(d_t\) does not seem to depend enough on \(q\) (just an intuition).
- Policies.UCBoost.solution_pb_t(p, upperbound, check_solution=False)[source]¶ Closed-form solution of the following optimisation problem, for \(d = d_t\) the distance_t() function (a shifted tangent line of kullback_leibler_distance_on_mean()):
\[\begin{split}P_1(d_t)(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_t(p, q) \leq \delta.\end{split}\]
The solution is:
\[q^* = \min\left(1, \frac{p + 1}{2} \left( \delta - p \log\left(\frac{p}{p + 1}\right) - \log\left(\frac{2}{\mathrm{e} (p + 1)}\right) \right)\right).\]
\(\delta\) is the upperbound parameter on the semi-distance between input \(p\) and solution \(q^*\).
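A minimal sketch of \(d_t\) and this closed form (p is truncated away from 0 so that \(\log(p/(p+1))\) stays finite):

```python
from math import e, log

def distance_t(p, q):
    """Shifted tangent-line distance d_t(p, q), as defined above."""
    p = max(p, 1e-15)  # eps truncation, to avoid log(0)
    return 2.0 * q / (p + 1.0) + p * log(p / (p + 1.0)) + log(2.0 / (e * (p + 1.0)))

def solution_pb_t(p, upperbound):
    """Solution of P_1(d_t): set d_t(p, q) = delta and solve the linear-in-q equation."""
    p = max(p, 1e-15)
    q_star = ((p + 1.0) / 2.0) * (
        upperbound - p * log(p / (p + 1.0)) - log(2.0 / (e * (p + 1.0)))
    )
    return min(1.0, q_star)
```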
- class Policies.UCBoost.UCB_t(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.IndexPolicy.IndexPolicy
The UCB(d_t) policy for bounded bandits (on [0, 1]).
It uses solution_pb_t() as a closed-form solution to compute the UCB indexes (using a shifted tangent line function of kullback_leibler_distance_on_mean()). Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
Warning
It has bad performance, as expected (see the paper for their remark).
- __init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.
- c = None¶ Parameter c, used in the \(c \log(\log t)\) term of the index.
- computeIndex(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= P_1(d_t)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
- __module__ = 'Policies.UCBoost'¶
- Policies.UCBoost.is_a_true_number(n)[source]¶ Check whether n is a number or not (int, float, complex, etc.: any instance of the numbers.Number class).
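Such a check is a one-liner with the standard numbers module; a minimal sketch of what this presumably does:

```python
import numbers

def is_a_true_number(n):
    """True if n is any kind of number: int, float, complex, Fraction, Decimal, ..."""
    return isinstance(n, numbers.Number)

# Examples:
assert is_a_true_number(3) and is_a_true_number(3.14) and is_a_true_number(1j)
assert not is_a_true_number("42") and not is_a_true_number([1, 2])
```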
- class Policies.UCBoost.UCBoost(nbArms, set_D=None, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.IndexPolicy.IndexPolicy
The UCBoost policy for bounded bandits (on [0, 1]).
It is quite simple: given a set D of kl-dominated candidate semi-distances, the UCB index of each arm at time t is the smallest of the upper confidence bounds obtained from the distances d in D.
set_D should be either a set of strings (and NOT functions), or a number (3, 4 or 5): 3 indicates using d_bq, d_h, d_lb; 4 adds d_t; and 5 adds d_sq (see the article, Corollary 3, p5, for more details). Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
- __init__(nbArms, set_D=None, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.
- set_D = None¶ Set of strings that indicate which d functions are in the set of functions D. Warning: do not use real functions here, or the object won’t be hashable!
- c = None¶ Parameter c, used in the \(c \log(\log t)\) term of the index.
- computeIndex(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
- __module__ = 'Policies.UCBoost'¶
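A minimal sketch of this boosted index, reusing the closed-form solvers sketched earlier (this block assumes those definitions are in scope; the string-to-function mapping and the helper name are my assumptions about the internals, not the module's actual API):

```python
from math import log

# Hypothetical mapping from the strings of set_D to the closed-form solvers above.
SOLVERS = {
    "bq": solution_pb_bq,
    "h": solution_pb_hellinger,
    "lb": solution_pb_kllb,
    "t": solution_pb_t,
    "sq": solution_pb_sq,
}

def boosted_index(mu_k, N_k, t, set_D=("bq", "h", "lb"), c=0.0):
    """I_k(t) = min over d in D of P_1(d)(mu_k, (log(t) + c log(log(t))) / N_k)."""
    delta = (log(t) + c * log(log(t))) / N_k  # assuming t > e
    return min(SOLVERS[name](mu_k, delta) for name in set_D)
```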
- class Policies.UCBoost.UCBoost_bq_h_lb(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.UCBoost.UCBoost
The UCBoost policy for bounded bandits (on [0, 1]).
It is quite simple: given a set D of kl-dominated candidate semi-distances, the UCB index of each arm at time t is the smallest of the upper confidence bounds obtained from the distances d in D.
Here set_D is d_bq, d_h, d_lb (see the article, Corollary 3, p5, for more details). Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
- __init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.
- computeIndex(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
- __module__ = 'Policies.UCBoost'¶
- class Policies.UCBoost.UCBoost_bq_h_lb_t(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.UCBoost.UCBoost
The UCBoost policy for bounded bandits (on [0, 1]).
It is quite simple: given a set D of kl-dominated candidate semi-distances, the UCB index of each arm at time t is the smallest of the upper confidence bounds obtained from the distances d in D.
Here set_D is d_bq, d_h, d_lb, d_t (see the article, Corollary 3, p5, for more details). Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
- __init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.
- computeIndex(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
- __module__ = 'Policies.UCBoost'¶
- class Policies.UCBoost.UCBoost_bq_h_lb_t_sq(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.UCBoost.UCBoost
The UCBoost policy for bounded bandits (on [0, 1]).
It is quite simple: given a set D of kl-dominated candidate semi-distances, the UCB index of each arm at time t is the smallest of the upper confidence bounds obtained from the distances d in D.
Here set_D is d_bq, d_h, d_lb, d_t, d_sq (see the article, Corollary 3, p5, for more details). Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
- __init__(nbArms, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.
- computeIndex(arm)[source]¶ Compute the current index, at time t and after \(N_k(t)\) pulls of arm k:
\[\begin{split}\hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ I_k(t) &= \min_{d\in D} P_1(d)\left(\hat{\mu}_k(t), \frac{\log(t) + c\log(\log(t))}{N_k(t)}\right).\end{split}\]
- __module__ = 'Policies.UCBoost'¶
- Policies.UCBoost.min_solutions_pb_from_epsilon(p, upperbound, epsilon=0.001, check_solution=False)[source]¶ List of closed-form solutions of the following optimisation problems, for \(d = d_s^k\) an approximation of \(d_{kl}\) and any \(\tau_1(p) \leq k \leq \tau_2(p)\):
\[\begin{split}P_1(d_s^k)(p, \delta): & \max_{q \in \Theta} q,\\ \text{such that } & d_s^k(p, q) \leq \delta.\end{split}\]
The solution is:
\[\begin{split}q^* &= q_k^{\boldsymbol{1}(\delta < d_{kl}(p, q_k))},\\ d_s^k &: (p, q) \mapsto d_{kl}(p, q_k) \boldsymbol{1}(q > q_k),\\ q_k &:= 1 - \left( 1 - \frac{\varepsilon}{1 + \varepsilon} \right)^k.\end{split}\]
\(\delta\) is the upperbound parameter on the semi-distance between input \(p\) and solution \(q^*\).
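A minimal sketch of one such solution for a fixed k, reusing kullback_leibler_distance_on_mean() from the sketch above (the function name is mine; the valid range \(\tau_1(p) \leq k \leq \tau_2(p)\) comes from the paper and is left to the caller). The exponent notation means \(q^* = q_k\) when \(\delta < d_{kl}(p, q_k)\), and \(q^* = 1\) otherwise:

```python
def solution_pb_sk(p, upperbound, k, epsilon=0.001):
    """Solution of P_1(d_s^k): q* = q_k if delta < d_kl(p, q_k), else 1."""
    q_k = 1.0 - (1.0 - epsilon / (1.0 + epsilon)) ** k
    if upperbound < kullback_leibler_distance_on_mean(p, q_k):
        return q_k
    return 1.0
```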
- class Policies.UCBoost.UCBoostEpsilon(nbArms, epsilon=0.01, c=0.0, lower=0.0, amplitude=1.0)[source]¶ Bases: Policies.IndexPolicy.IndexPolicy
The UCBoostEpsilon policy for bounded bandits (on [0, 1]).
It is quite simple: given a set D of kl-dominated candidate semi-distances, the UCB index of each arm at time t is the smallest of the upper confidence bounds obtained from the distances d in D.
This variant uses min_solutions_pb_from_epsilon() to also compute the \(\varepsilon\) approximation of the kullback_leibler_distance_on_mean() function (see the article for details, Th.3 p6). Reference: [Fang Liu et al, 2018](https://arxiv.org/abs/1804.05929).
- __init__(nbArms, epsilon=0.01, c=0.0, lower=0.0, amplitude=1.0)[source]¶ New generic index policy.
nbArms: the number of arms,
lower, amplitude: lower value and known amplitude of the rewards.
- c = None¶ Parameter c, used in the \(c \log(\log t)\) term of the index.
- epsilon = None¶ Parameter epsilon (accuracy of the approximation of the KL divergence).
- __module__ = 'Policies.UCBoost'¶