Policies.GLR_UCB module

The GLR-UCB policy and variants, for non-stationary bandits.

  • Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)

  • It runs on top of a simple policy, e.g., UCB, and BernoulliGLR_IndexPolicy is a wrapper:

    >>> policy = BernoulliGLR_IndexPolicy(nbArms, policy=UCB)
    >>> # use policy as usual: policy.startGame(), arm = policy.choice(), policy.getReward(arm, reward)
    
  • It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).

Warning

It can only work on top of a basic index policy based on empirical averages (and an exploration bias), like UCB; it cannot work with any Bayesian policy (for which we would have to store all previous observations in order to restart with only a partial history)!
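
For concreteness, here is a minimal usage sketch; the import path, the policy=UCB keyword and the toy piecewise-stationary environment are illustrative assumptions, not guaranteed to match the installed package exactly:

    # Illustrative sketch: BernoulliGLR_IndexPolicy running on top of UCB,
    # on a hand-made piecewise-stationary Bernoulli problem.
    import numpy as np
    from SMPyBandits.Policies import BernoulliGLR_IndexPolicy, UCB  # assumed import path

    nbArms, horizon = 3, 2000
    policy = BernoulliGLR_IndexPolicy(nbArms, policy=UCB, horizon=horizon)
    policy.startGame()

    rng = np.random.default_rng(42)
    for t in range(horizon):
        means = [0.1, 0.5, 0.9] if t < 1000 else [0.9, 0.5, 0.1]  # one break-point at t=1000
        arm = policy.choice()
        reward = float(rng.random() < means[arm])
        policy.getReward(arm, reward)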

Policies.GLR_UCB.VERBOSE = False

Whether to be verbose when doing the change detection algorithm.

Policies.GLR_UCB.PROBA_RANDOM_EXPLORATION = 0.1

Default probability of random exploration \(\alpha\).

Policies.GLR_UCB.PER_ARM_RESTART = True

Should we reset only the empirical average of the arm where a change was detected (True), or all arms (False)? Default is True, which is usually more efficient!

Policies.GLR_UCB.FULL_RESTART_WHEN_REFRESH = False

Should we fully restart the algorithm, or simply reset the empirical average of one arm? Default is False, which is usually more efficient!

Policies.GLR_UCB.LAZY_DETECT_CHANGE_ONLY_X_STEPS = 10

Be lazy and only try to detect changes every X steps, where X is small (e.g., 10). It is a simple but efficient way to speed up CD tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. A value of 0 disables this feature; the default of 10 should speed up the test by about x10.

Policies.GLR_UCB.LAZY_TRY_VALUE_S_ONLY_X_STEPS = 10

Be lazy and only try to detect changes for values of \(s\) taken with a step size steps_s. It is a simple but efficient way to speed up GLR tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. A value of 1 disables this feature (every \(s\) is tested); steps_s=2 already speeds up the test by about x2, and the default of 10 by about x10.

Policies.GLR_UCB.USE_LOCALIZATION = True

Default value of use_localization for policies. All the experiments I tried showed that localization always helps improve learning, so the default value is set to True.

Policies.GLR_UCB.eps = 1e-10

Threshold value: everything in [0, 1] is truncated to [eps, 1 - eps]

Policies.GLR_UCB.klBern(x, y)[source]

Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence

\[\mathrm{KL}(\mathcal{B}(x), \mathcal{B}(y)) = x \log(\frac{x}{y}) + (1-x) \log(\frac{1-x}{1-y}).\]
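
A self-contained sketch of this divergence, truncating the inputs to \([\varepsilon, 1-\varepsilon]\) as the module constant eps suggests (an illustrative re-implementation, not the module's exact code):

    import numpy as np

    eps = 1e-10  # same role as Policies.GLR_UCB.eps

    def klBern(x, y):
        """KL(Bern(x), Bern(y)), with inputs truncated to [eps, 1 - eps]."""
        x = min(max(x, eps), 1 - eps)
        y = min(max(y, eps), 1 - eps)
        return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

    print(klBern(0.5, 0.5))  # 0.0
    print(klBern(0.1, 0.9))  # ≈ 1.758
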
Policies.GLR_UCB.klGauss(x, y, sig2x=1)[source]

Kullback-Leibler divergence for Gaussian distributions of means x and y and variances \(\sigma_x^2\) and \(\sigma_y^2\), \(\nu_1 = \mathcal{N}(x, \sigma_x^2)\) and \(\nu_2 = \mathcal{N}(y, \sigma_y^2)\):

\[\mathrm{KL}(\nu_1, \nu_2) = \frac{(x - y)^2}{2 \sigma_y^2} + \frac{1}{2}\left( \frac{\sigma_x^2}{\sigma_y^2} - 1 - \log\left(\frac{\sigma_x^2}{\sigma_y^2}\right) \right).\]

Here a single variance parameter sig2x is given, so \(\sigma_x^2 = \sigma_y^2 = \) sig2x and the divergence reduces to \(\frac{(x - y)^2}{2 \, \mathrm{sig2x}}\).

See https://en.wikipedia.org/wiki/Normal_distribution#Other_properties
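
Since the signature above only exposes sig2x, the relevant case here is the common-variance one; a minimal illustrative sketch (not the module's exact code):

    def klGauss(x, y, sig2x=1.0):
        """KL(N(x, sig2x), N(y, sig2x)): with a common variance, the divergence
        reduces to the squared mean difference over twice the variance."""
        return (x - y) ** 2 / (2.0 * sig2x)

    print(klGauss(0.0, 1.0))        # 0.5
    print(klGauss(0.0, 1.0, 0.25))  # 2.0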

Policies.GLR_UCB.threshold_GaussianGLR(t, horizon=None, delta=None, variant=None)[source]

Compute the value \(c\) from the corollary of Theorem 2 from [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].

  • The threshold is computed as (with \(t_0 = 0\)):

\[\beta(t_0, t, \delta) := \left(1 + \frac{1}{t - t_0 + 1}\right) 2 \log\left(\frac{2 (t - t_0) \sqrt{(t - t_0) + 2}}{\delta}\right).\]
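
A direct transcription of this formula (with \(t_0 = 0\)) as a standalone sketch; the simplified signature is illustrative, not the module's exact one:

    import numpy as np

    def threshold_GaussianGLR(t, delta, t0=0):
        """beta(t0, t, delta) from the formula above."""
        return (1.0 + 1.0 / (t - t0 + 1)) * 2.0 * np.log(
            2.0 * (t - t0) * np.sqrt((t - t0) + 2.0) / delta
        )

    print(round(threshold_GaussianGLR(t=100, delta=0.01), 3))  # ≈ 24.674
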
Policies.GLR_UCB.function_h(u)[source]

The function \(h(u) = u - \log(u)\).

Policies.GLR_UCB.function_h_minus_one(x)[source]

The inverse function of \(h(u)\), that is \(h^{-1}(x) = u \Leftrightarrow h(u) = x\). It is given by the Lambert W function, see scipy.special.lambertw():

\[h^{-1}(x) = - \mathcal{W}(- \exp(-x)).\]
  • Example:

>>> np.random.seed(105)
>>> y = np.random.randn() ** 2
>>> print(f"y = {y}")
y = 0.060184682907834595
>>> x = function_h(y)
>>> print(f"h(y) = {x}")
h(y) = 2.8705220786966508
>>> z = function_h_minus_one(x)
>>> print(f"h^-1(x) = {z}")
h^-1(x) = 0.060184682907834595
>>> assert np.isclose(z, y), "Error: h^-1(h(y)) = z = {} should be very close to y = {}...".format(z, y)
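
For reference, a standalone sketch of this inverse using scipy.special.lambertw(); the branch choice below (principal branch, giving the solution in \((0, 1]\) as in the doctest above) is an assumption, and the k=-1 branch would give the solution in \([1, +\infty)\):

    import numpy as np
    from scipy.special import lambertw

    def function_h(u):
        return u - np.log(u)

    def function_h_minus_one(x):
        # Principal branch (k=0): solution of h(u) = x with u in (0, 1],
        # matching the doctest above; use k=-1 for the solution in [1, +oo).
        return float(np.real(-lambertw(-np.exp(-x), k=0)))

    y = 0.25
    x = function_h(y)  # 0.25 - log(0.25) ≈ 1.6363
    assert np.isclose(function_h_minus_one(x), y)
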
Policies.GLR_UCB.constant_power_function_h = 1.5

The constant \(\frac{3}{2}\), used in the definition of functions \(h\), \(h^{-1}\), \(\tilde{h}\) and \(\mathcal{T}\).

Policies.GLR_UCB.threshold_function_h_tilde = 3.801770285137458

The constant \(h^{-1}(1/\log(\frac{3}{2}))\), used in the definition of function \(\tilde{h}\).

Policies.GLR_UCB.constant_function_h_tilde = -0.90272045571788

The constant \(\log(\log(\frac{3}{2}))\), used in the definition of function \(\tilde{h}\).

Policies.GLR_UCB.function_h_tilde(x)[source]

The function \(\tilde{h}(x)\), defined by:

\[\begin{split}\tilde{h}(x) = \begin{cases} e^{1/h^{-1}(x)} h^{-1}(x) & \text{ if } x \ge h^{-1}(1/\ln (3/2)), \\ (3/2) (x-\ln \ln (3/2)) & \text{otherwise}. \end{cases}\end{split}\]
Policies.GLR_UCB.zeta_of_two = 1.6449340668482264

The constant \(\zeta(2) = \frac{\pi^2}{6}\).

Policies.GLR_UCB.function_T_mathcal(x)[source]

The function \(\mathcal{T}(x)\), defined by:

\[\mathcal{T}(x) = 2 \tilde h\left(\frac{h^{-1}(1+x) + \ln(2\zeta(2))}{2}\right).\]
Policies.GLR_UCB.approximation_function_T_mathcal(x)[source]

An efficiently computed approximation of \(\mathcal{T}(x)\), valid for \(x \geq 5\):

\[\mathcal{T}(x) \simeq x + 4 \log(1 + x + \sqrt{2 x}).\]
Policies.GLR_UCB.threshold_BernoulliGLR(t, horizon=None, delta=None, variant=None)[source]

Compute the value \(c\) from the corollary of Theorem 2 from [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].

Warning

This is still experimental; you can try different variants of the threshold function:

  • Variant #0 (default) is:

\[\beta(t, \delta) := \log\left(\frac{3 t^{3/2}}{\delta}\right) = \log(\frac{1}{\delta}) + \log(3) + 3/2 \log(t).\]
  • Variant #1 is smaller:

\[\beta(t, \delta) := \log(\frac{1}{\delta}) + \log(1 + \log(t)).\]
  • Variant #2 is using \(\mathcal{T}\):

\[\beta(t, \delta) := 2 \mathcal{T}\left(\frac{\log(2 t^{3/2} / \delta)}{2}\right) + 6 \log(1 + \log(t)).\]
  • Variant #3 uses \(\tilde{\mathcal{T}}(x) = x + 4 \log(1 + x + \sqrt{2x})\), an approximation of \(\mathcal{T}(x)\) (valid and quite accurate as soon as \(x \geq 5\)):

\[\beta(t, \delta) := 2 \tilde{\mathcal{T}}\left(\frac{\log(2 t^{3/2} / \delta)}{2}\right) + 6 \log(1 + \log(t)).\]
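
For concreteness, a direct transcription of variants #0 and #1 as a standalone sketch (variants #2 and #3 additionally need \(\mathcal{T}\) or its approximation; this is not the module's exact code):

    import numpy as np

    def threshold_BernoulliGLR(t, delta, variant=0):
        if variant == 0:   # log(3 t^{3/2} / delta)
            return np.log(3.0 * t ** 1.5 / delta)
        if variant == 1:   # log(1/delta) + log(1 + log(t))
            return np.log(1.0 / delta) + np.log(1.0 + np.log(t))
        raise NotImplementedError("variants #2 and #3 need function_T_mathcal or its approximation")

    print(round(threshold_BernoulliGLR(1000, delta=0.01), 3))             # ≈ 16.065
    print(round(threshold_BernoulliGLR(1000, delta=0.01, variant=1), 3))  # ≈ 6.673
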
Policies.GLR_UCB.EXPONENT_BETA = 1.01

The default value of parameter \(\beta\) for the function decreasing_alpha__GLR().

Policies.GLR_UCB.ALPHA_T1 = 0.05

The default value of parameter \(\alpha_{t=1}\) for the function decreasing_alpha__GLR().

Policies.GLR_UCB.decreasing_alpha__GLR(alpha0=None, t=1, exponentBeta=1.01, alpha_t1=0.05)[source]

Either use a fixed alpha, or compute it as a polynomially decreasing function of t (if alpha0=None).

Note

I am currently exploring the following variant (November 2018):

  • The probability of uniform exploration, \(\alpha\), is computed as a function of the current time:

\[\forall t>0, \alpha = \alpha_t := \alpha_{t=1} \frac{1}{\max(1, t^{\beta})}.\]
  • with \(\beta > 1\), \(\beta\) = exponentBeta (default 1.01) and \(\alpha_{t=1} < 1\), \(\alpha_{t=1}\) = alpha_t1 (default 0.05).

  • the only requirement on \(\alpha_t\) seems to be that \(\sum_{t=1}^T \alpha_t < +\infty\) (i.e., the sum is finite), which is the case for \(\alpha_t = \alpha = \frac{1}{T}\), but also for any \(\alpha_t = \frac{\alpha_1}{t^{\beta}}\) with \(\beta > 1\) (cf. Riemann series).
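
A sketch matching the formula in this note (illustrative; the module's actual implementation may differ in details):

    def decreasing_alpha__GLR(alpha0=None, t=1, exponentBeta=1.01, alpha_t1=0.05):
        """Fixed alpha if alpha0 is given, else alpha_t = alpha_{t=1} / max(1, t^beta)."""
        if alpha0 is not None:
            return alpha0
        return alpha_t1 / max(1.0, float(t) ** exponentBeta)

    print(decreasing_alpha__GLR(alpha0=0.1, t=100))  # 0.1 (fixed value)
    print(decreasing_alpha__GLR(t=100))              # 0.05 / 100^1.01 ≈ 0.00048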

Policies.GLR_UCB.smart_delta_from_T_UpsilonT(horizon=1, max_nb_random_events=1, scaleFactor=1.0, per_arm_restart=True, nbArms=1)[source]

Compute a smart estimate of the optimal value for the confidence level \(\delta\), with scaleFactor \(= \delta_0\in(0,1)\) a constant.

  • If per_arm_restart is True (Local option):

\[\delta = \frac{\delta_0}{\sqrt{K \Upsilon_T T}}.\]
  • If per_arm_restart is False (Global option):

\[\delta = \frac{\delta_0}{\sqrt{\Upsilon_T T}}.\]

Note that if \(\Upsilon_T\) is unknown, it is assumed to be \(\Upsilon_T=1\).

Policies.GLR_UCB.smart_alpha_from_T_UpsilonT(horizon=1, max_nb_random_events=1, scaleFactor=0.1, per_arm_restart=True, nbArms=1)[source]

Compute a smart estimate of the optimal value for the fixed or random forced exploration probability \(\alpha\) (or tracking based), with scaleFactor \(= \alpha_0\in(0,1)\) a constant.

  • If per_arm_restart is True (Local option):

\[\alpha = \alpha_0 \times \sqrt{\frac{K \Upsilon_T}{T} \log(T)}.\]
  • If per_arm_restart is False (Global option):

\[\alpha = \alpha_0 \times \sqrt{\frac{\Upsilon_T}{T} \log(T)}.\]

Note that if \(\Upsilon_T\) is unknown, it is assumed to be \(\Upsilon_T=1\).
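
A sketch of both smart_delta_from_T_UpsilonT() and smart_alpha_from_T_UpsilonT() side by side (illustrative transcriptions of the two formulas above, with shortened names; not the module's exact code):

    import numpy as np

    def smart_delta(horizon=1, max_nb_random_events=1, scaleFactor=1.0, per_arm_restart=True, nbArms=1):
        """delta = delta_0 / sqrt(K * Upsilon_T * T) (Local) or delta_0 / sqrt(Upsilon_T * T) (Global)."""
        K = nbArms if per_arm_restart else 1
        return scaleFactor / np.sqrt(K * max_nb_random_events * horizon)

    def smart_alpha(horizon=1, max_nb_random_events=1, scaleFactor=0.1, per_arm_restart=True, nbArms=1):
        """alpha = alpha_0 * sqrt(K * Upsilon_T * log(T) / T) (Local), without the K factor (Global)."""
        K = nbArms if per_arm_restart else 1
        return scaleFactor * np.sqrt(K * max_nb_random_events * np.log(horizon) / horizon)

    print(smart_delta(horizon=10000, max_nb_random_events=5, nbArms=3))  # ≈ 0.00258
    print(smart_alpha(horizon=10000, max_nb_random_events=5, nbArms=3))  # ≈ 0.0118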

class Policies.GLR_UCB.GLR_IndexPolicy(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

Bases: Policies.CD_UCB.CD_IndexPolicy

The GLR-UCB generic policy for non-stationary bandits, using the Generalized Likelihood Ratio test (GLR), for 1-dimensional exponential families.

  • It works for any 1-dimensional exponential family; you just have to give a kl function.

  • For instance kullback.klBern(), for Bernoulli distributions, gives BernoulliGLR_IndexPolicy,

  • And kullback.klGauss(), for univariate Gaussian distributions, gives GaussianGLR_IndexPolicy.

  • threshold_function computes the threshold \(\beta(t, \delta)\), it can be for instance threshold_GaussianGLR() or threshold_BernoulliGLR().

  • From [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].

  • Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)

__init__(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

New policy.

horizon = None

The horizon \(T\).

max_nb_random_events = None

The number of breakpoints \(\Upsilon_T\).

use_localization = None

Experimental option to use localization of the break-point, i.e., restart the arm's memory keeping the observations \(s+1, \dots, n\) (after the estimated change point) instead of just the last one.

delta = None

The confidence level \(\delta\). Defaults to \(\delta=\frac{1}{\sqrt{T}}\) if horizon is given and delta=None but \(\Upsilon_T\) is unknown. Defaults to \(\delta=\frac{1}{\sqrt{\Upsilon_T T}}\) if both \(T\) and \(\Upsilon_T\) are given (horizon and max_nb_random_events).

kl = None

The parametrized Kullback-Leibler divergence (\(\mathrm{kl}(x,y) = KL(D(x),D(y))\)) for the 1-dimensional exponential family \(x\mapsto D(x)\). Example: kullback.klBern() or kullback.klGauss().

lazy_try_value_s_only_x_steps = None

Be lazy and try to detect changes for \(s\) taking steps of size steps_s.

compute_threshold_h(t)[source]

Compute the threshold \(h\) with _threshold_function.

property proba_random_exploration

What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.

__str__()[source]

-> str

getReward(arm, reward)[source]

Do the same as CD_UCB to handle the new reward, and also update the internal times of each arm for the indexes of klUCB_forGLR (or other index policies), which use \(f(t - \tau_i(t))\) as the exploration function of arm \(i\) at time \(t\), where \(\tau_i(t)\) denotes the (last) restart time of the arm.

detect_change(arm, verbose=False)[source]

Detect a change in the current arm, using the Generalized Likelihood Ratio test (GLR) and the kl function.

  • For each time step \(s\) between \(t_0=0\) and \(t\), compute:

\[G^{\mathrm{kl}}_{t_0:s:t} = (s-t_0+1) \mathrm{kl}(\mu_{t_0,s}, \mu_{t_0,t}) + (t-s) \mathrm{kl}(\mu_{s+1,t}, \mu_{t_0,t}).\]
  • The change is detected if there is a time \(s\) such that \(G^{\mathrm{kl}}_{t_0:s:t} > h\), where threshold_h is the threshold of the test,

  • And \(\mu_{a,b} = \frac{1}{b-a+1} \sum_{s=a}^{b} y_s\) is the mean of the samples between \(a\) and \(b\).

Warning

This is computationally costly, so an easy way to speed up the test is to use lazy_try_value_s_only_x_steps \(= \mathrm{Step_s}\) with a small value (e.g., 10), so as not to test all \(s\in[t_0, t-1]\) but only the \(s\in[t_0, t-1]\) with \(s \mod \mathrm{Step_s} = 0\) (e.g., one out of every 10 steps).
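
A standalone sketch of this scan applied to one window of observations (illustrative, not the policy's internal code; the threshold value and the lazy step size below are arbitrary, and klBern is re-implemented locally so the snippet is self-contained):

    import numpy as np

    def klBern(x, y, eps=1e-10):
        x = min(max(x, eps), 1 - eps)
        y = min(max(y, eps), 1 - eps)
        return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

    def detect_change_glr(data, threshold_h, kl=klBern, step_s=10):
        """Scan s = 0, step_s, 2*step_s, ... and test
        (s+1) kl(mu_{0,s}, mu_{0,t}) + (t-s) kl(mu_{s+1,t}, mu_{0,t}) > threshold_h."""
        t = len(data) - 1          # observations y_0, ..., y_t, with t_0 = 0
        mu_all = np.mean(data)     # mu_{0,t}
        for s in range(0, t, step_s):
            mu_left = np.mean(data[:s + 1])    # mu_{0,s}
            mu_right = np.mean(data[s + 1:])   # mu_{s+1,t}
            glr = (s + 1) * kl(mu_left, mu_all) + (t - s) * kl(mu_right, mu_all)
            if glr > threshold_h:
                return True, s     # first lazy split point at which the test triggers
        return False, None

    rng = np.random.default_rng(1)
    data = np.concatenate([rng.binomial(1, 0.2, 300), rng.binomial(1, 0.8, 300)])
    print(detect_change_glr(data, threshold_h=10.0))  # (True, s): a change is detected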

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.GLR_IndexPolicy_WithTracking(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy

A variant of the GLR policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven’t been explored enough.

  • Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)

choice()[source]

If any arm is not explored enough (\(n_k \leq \frac{\alpha}{K} \times (t - n_k)\)), play one of these arms chosen uniformly at random; otherwise, pass the call to choice() of the underlying policy.
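
A tiny sketch of this under-exploration condition (standalone and illustrative; pulls stands for the pull counts \(n_k\)):

    import numpy as np

    def underexplored_arms(pulls, t, alpha):
        """Return the indices k with n_k <= (alpha / K) * (t - n_k)."""
        pulls = np.asarray(pulls, dtype=float)
        K = len(pulls)
        return np.where(pulls <= (alpha / K) * (t - pulls))[0]

    print(underexplored_arms([50, 3, 60], t=113, alpha=0.1))  # [1]: only arm 1 needs forced exploration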

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy

A variant of the GLR policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed.

  • If \(t\) is the current time and \(\tau\) is the latest restarting time, then uniform exploration is done if:

\[\begin{split}A &:= (t - \tau) \mod \lceil \frac{K}{\gamma} \rceil,\\ A &\leq K \implies A_t = A.\end{split}\]
  • Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)

choice()[source]

For some (deterministically chosen) time steps, play an exploration arm; otherwise, pass the call to choice() of the underlying policy.

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.GaussianGLR_IndexPolicy(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy

The GaussianGLR-UCB policy for non-stationary bandits, for fixed-variance Gaussian distributions (i.e., \(\sigma^2\) known and fixed).

__init__(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]

New policy.

_sig2 = None

Fixed variance \(\sigma^2\) of the Gaussian distributions. Extra parameter given to kullback.klGauss(). Defaults to \(\sigma^2 = \frac{1}{4}\).

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.GaussianGLR_IndexPolicy_WithTracking(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy_WithTracking, Policies.GLR_UCB.GaussianGLR_IndexPolicy

A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven’t been explored enough.

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.GaussianGLR_IndexPolicy_WithDeterministicExploration(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration, Policies.GLR_UCB.GaussianGLR_IndexPolicy

A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed.

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.BernoulliGLR_IndexPolicy(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy

The BernoulliGLR-UCB policy for non-stationary bandits, for Bernoulli distributions.

  • Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)

__init__(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

New policy.

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.BernoulliGLR_IndexPolicy_WithTracking(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy_WithTracking, Policies.GLR_UCB.BernoulliGLR_IndexPolicy

A variant of the BernoulliGLR-UCB policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven’t been explored enough.

  • Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.BernoulliGLR_IndexPolicy_WithDeterministicExploration(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration, Policies.GLR_UCB.BernoulliGLR_IndexPolicy

A variant of the BernoulliGLR-UCB policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed.

  • Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.OurGaussianGLR_IndexPolicy(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy

The GaussianGLR-UCB policy for non-stationary bandits, for fixed-variance Gaussian distributions (i.e., \(\sigma^2\) known and fixed), but with our threshold designed for the sub-Bernoulli case.

  • Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)

__init__(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

New policy.

_sig2 = None

Fixed variance \(\sigma^2\) of the Gaussian distributions. Extra parameter given to kullback.klGauss(). Defaults to \(\sigma^2 = \frac{1}{4}\).

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.OurGaussianGLR_IndexPolicy_WithTracking(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy_WithTracking, Policies.GLR_UCB.OurGaussianGLR_IndexPolicy

A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven’t been explored enough, with our threshold designed for the sub-Bernoulli case.

  • Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)

__module__ = 'Policies.GLR_UCB'
class Policies.GLR_UCB.OurGaussianGLR_IndexPolicy_WithDeterministicExploration(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]

Bases: Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration, Policies.GLR_UCB.OurGaussianGLR_IndexPolicy

A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed, but with our threshold designed for the sub-Bernoulli case.

  • Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)

__module__ = 'Policies.GLR_UCB'
Policies.GLR_UCB.SubGaussianGLR_DELTA = 0.01

Default confidence level for SubGaussianGLR_IndexPolicy.

Policies.GLR_UCB.SubGaussianGLR_SIGMA = 0.25

By default, SubGaussianGLR_IndexPolicy assumes distributions are 0.25-sub Gaussian, like Bernoulli or any distributions with support on \([0,1]\).

Policies.GLR_UCB.SubGaussianGLR_JOINT = True

Whether to use the joint or disjoint threshold function (threshold_SubGaussianGLR_joint() or threshold_SubGaussianGLR_disjoint()) for SubGaussianGLR_IndexPolicy.

Policies.GLR_UCB.threshold_SubGaussianGLR_joint(s, t, delta=0.01, sigma=0.25)[source]

Compute the threshold \(b^{\text{joint}}_{t_0}(s,t,\delta)\) according to this formula:

\[b^{\text{joint}}_{t_0}(s,t,\delta) := \sigma \sqrt{ \left(\frac{1}{s-t_0+1} + \frac{1}{t-s}\right) \left(1 + \frac{1}{t-t_0+1}\right) 2 \log\left( \frac{2(t-t_0)\sqrt{t-t_0+2}}{\delta} \right)}.\]
Policies.GLR_UCB.threshold_SubGaussianGLR_disjoint(s, t, delta=0.01, sigma=0.25)[source]

Compute the threshold \(b^{\text{disjoint}}_{t_0}(s,t,\delta)\) according to this formula:

\[b^{\text{disjoint}}_{t_0}(s,t,\delta) := \sqrt{2} \sigma \sqrt{\frac{1 + \frac{1}{s - t_0 + 1}}{s - t_0 + 1} \log\left( \frac{4 \sqrt{s - t_0 + 2}}{\delta}\right)} + \sqrt{\frac{1 + \frac{1}{t - s + 1}}{t - s + 1} \log\left( \frac{4 (t - t_0) \sqrt{t - s + 1}}{\delta}\right)}.\]
Policies.GLR_UCB.threshold_SubGaussianGLR(s, t, delta=0.01, sigma=0.25, joint=True)[source]

Compute the threshold \(b^{\text{joint}}_{t_0}(s,t,\delta)\) or \(b^{\text{disjoint}}_{t_0}(s,t,\delta)\).

class Policies.GLR_UCB.SubGaussianGLR_IndexPolicy(nbArms, horizon=None, max_nb_random_events=None, full_restart_when_refresh=False, policy=<class 'Policies.UCB.UCB'>, delta=0.01, sigma=0.25, joint=True, exponentBeta=1.05, alpha_t1=0.1, alpha0=None, lazy_detect_change_only_x_steps=10, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]

Bases: Policies.CD_UCB.CD_IndexPolicy

The SubGaussianGLR-UCB policy for non-stationary bandits, using the Generalized Likelihood Ratio test (GLR), for sub-Gaussian distributions.

  • It works for any sub-Gaussian family of distributions, being \(\sigma^2\)-sub Gaussian with known \(\sigma\).

  • From [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].

__init__(nbArms, horizon=None, max_nb_random_events=None, full_restart_when_refresh=False, policy=<class 'Policies.UCB.UCB'>, delta=0.01, sigma=0.25, joint=True, exponentBeta=1.05, alpha_t1=0.1, alpha0=None, lazy_detect_change_only_x_steps=10, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]

New policy.

horizon = None

The horizon \(T\).

max_nb_random_events = None

The number of breakpoints \(\Upsilon_T\).

delta = None

The confidence level \(\delta\). Defaults to \(\delta=\frac{1}{T}\) if horizon is given and delta=None.

sigma = None

Parameter \(\sigma\) for the Sub-Gaussian-GLR test.

joint = None

Parameter joint for the Sub-Gaussian-GLR test.

lazy_try_value_s_only_x_steps = None

Be lazy and try to detect changes for \(s\) taking steps of size steps_s.

use_localization = None

Experimental option to use localization of the break-point, i.e., restart the arm's memory keeping the observations \(s+1, \dots, n\) (after the estimated change point) instead of just the last one.

compute_threshold_h(s, t)[source]

Compute the threshold \(h\) with threshold_SubGaussianGLR().

__module__ = 'Policies.GLR_UCB'
property proba_random_exploration

What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.

__str__()[source]

-> str

detect_change(arm, verbose=False)[source]

Detect a change in the current arm, using the non-parametric sub-Gaussian Generalized Likelihood Ratio test (GLR), which works like this:

  • For each time step \(s\) between \(t_0=0\) and \(t\), compute:

\[G^{\text{sub-}\sigma}_{t_0:s:t} = |\mu_{t_0,s} - \mu_{s+1,t}|.\]
  • The change is detected if there is a time \(s\) such that \(G^{\text{sub-}\sigma}_{t_0:s:t} > b_{t_0}(s,t,\delta)\), where \(b_{t_0}(s,t,\delta)\) is the threshold of the test,

  • The threshold is computed as:

\[b_{t_0}(s,t,\delta) := \sigma \sqrt{ \left(\frac{1}{s-t_0+1} + \frac{1}{t-s}\right) \left(1 + \frac{1}{t-t_0+1}\right) 2 \log\left( \frac{2(t-t_0)\sqrt{t-t_0+2}}{\delta} \right)}.\]
  • And \(\mu_{a,b} = \frac{1}{b-a+1} \sum_{s=a}^{b} y_s\) is the mean of the samples between \(a\) and \(b\).
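
A standalone sketch of this sub-Gaussian test applied to one window of observations, with the joint threshold above inlined (illustrative, not the policy's internal code; the data and default parameters are arbitrary):

    import numpy as np

    def detect_change_subgaussian(data, delta=0.01, sigma=0.25):
        """Scan the split point s and compare |mu_{0,s} - mu_{s+1,t}| to b_{0}(s, t, delta)."""
        t = len(data) - 1  # observations y_0, ..., y_t, with t_0 = 0
        for s in range(t):
            mu_left = np.mean(data[:s + 1])
            mu_right = np.mean(data[s + 1:])
            b = sigma * np.sqrt(
                (1.0 / (s + 1) + 1.0 / (t - s))
                * (1.0 + 1.0 / (t + 1))
                * 2.0 * np.log(2.0 * t * np.sqrt(t + 2.0) / delta)
            )
            if abs(mu_left - mu_right) > b:
                return True, s  # first split point at which the test triggers
        return False, None

    rng = np.random.default_rng(0)
    data = np.concatenate([rng.binomial(1, 0.1, 400), rng.binomial(1, 0.9, 400)])
    print(detect_change_subgaussian(data))  # (True, s) for some early split s: the two regimes differ a lot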