Policies.GLR_UCB module¶
The GLR-UCB policy and variants, for non-stationary bandits.
Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)
It runs on top of a simple policy, e.g., UCB, and BernoulliGLR_IndexPolicy is a wrapper:

>>> policy = BernoulliGLR_IndexPolicy(nbArms, UCB)
>>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
It uses an additional \(\mathcal{O}(\tau_\max)\) memory for a game of maximum stationary length \(\tau_\max\).
Warning
It can only work on top of a basic index policy based on empirical averages (and an exploration bias), like UCB, and cannot work with a Bayesian policy (for which we would have to keep all previous observations in memory in order to restart with only a recent part of the history)!
-
Policies.GLR_UCB.VERBOSE= False¶ Whether to be verbose when doing the change detection algorithm.
-
Policies.GLR_UCB.PROBA_RANDOM_EXPLORATION= 0.1¶ Default probability of random exploration \(\alpha\).
-
Policies.GLR_UCB.PER_ARM_RESTART= True¶ Should we reset only one arm's empirical average, or all of them? Default is True, which is usually more efficient!
-
Policies.GLR_UCB.FULL_RESTART_WHEN_REFRESH= False¶ Should we fully restart the algorithm, or simply reset one arm's empirical average? Default is False, which is usually more efficient!
-
Policies.GLR_UCB.LAZY_DETECT_CHANGE_ONLY_X_STEPS= 10¶ XXX Be lazy and try to detect changes only once every X steps, where X is small, like 10 for instance. It is a simple but efficient way to speed up CD tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. The default value of 10 should speed up the test by about x10.
-
Policies.GLR_UCB.LAZY_TRY_VALUE_S_ONLY_X_STEPS= 10¶ XXX Be lazy and only try values of \(s\) taking steps of size steps_s. Using steps_s=2 should already speed up the test by about x2. It is a simple but efficient way to speed up GLR tests, see https://github.com/SMPyBandits/SMPyBandits/issues/173. A value of 1 disables this feature; the default of 10 should speed up the test by about x10.
-
Policies.GLR_UCB.USE_LOCALIZATION= True¶ Default value of use_localization for the policies. All the experiments I tried showed that localization always helps improve learning, so the default value is set to True.
-
Policies.GLR_UCB.eps= 1e-10¶ Threshold value: everything in [0, 1] is truncated to [eps, 1 - eps]
-
Policies.GLR_UCB.klBern(x, y)[source]¶ Kullback-Leibler divergence for Bernoulli distributions. https://en.wikipedia.org/wiki/Bernoulli_distribution#Kullback.E2.80.93Leibler_divergence
\[\mathrm{KL}(\mathcal{B}(x), \mathcal{B}(y)) = x \log(\frac{x}{y}) + (1-x) \log(\frac{1-x}{1-y}).\]
-
Policies.GLR_UCB.klGauss(x, y, sig2x=1)[source]¶ Kullback-Leibler divergence for Gaussian distributions of means x and y and variances sig2x and sig2y, \(\nu_1 = \mathcal{N}(x, \sigma_x^2)\) and \(\nu_2 = \mathcal{N}(y, \sigma_y^2)\):

\[\mathrm{KL}(\nu_1, \nu_2) = \frac{(x - y)^2}{2 \sigma_y^2} + \frac{1}{2}\left( \frac{\sigma_x^2}{\sigma_y^2} - 1 - \log\left(\frac{\sigma_x^2}{\sigma_y^2}\right) \right).\]

See https://en.wikipedia.org/wiki/Normal_distribution#Other_properties
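Both divergences transcribe directly to NumPy. The sketch below only illustrates the two formulas above; the names kl_bern and kl_gauss and the optional sig2y argument are ours, not the module's API, and kl_bern truncates its inputs to \([\epsilon, 1-\epsilon]\) using the eps constant documented above:

import numpy as np

eps = 1e-10  # same truncation threshold as Policies.GLR_UCB.eps

def kl_bern(x, y):
    """KL(B(x), B(y)) between two Bernoulli distributions, inputs clipped to [eps, 1 - eps]."""
    x = min(max(x, eps), 1 - eps)
    y = min(max(y, eps), 1 - eps)
    return x * np.log(x / y) + (1 - x) * np.log((1 - x) / (1 - y))

def kl_gauss(x, y, sig2x=1.0, sig2y=None):
    """KL(N(x, sig2x), N(y, sig2y)); with equal variances this reduces to (x - y)^2 / (2 sig2x)."""
    if sig2y is None or np.isclose(sig2y, sig2x):
        return (x - y) ** 2 / (2.0 * sig2x)
    return (x - y) ** 2 / (2.0 * sig2y) + 0.5 * (sig2x / sig2y - 1 - np.log(sig2x / sig2y))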
-
Policies.GLR_UCB.threshold_GaussianGLR(t, horizon=None, delta=None, variant=None)[source]¶ Compute the value \(c\) from the corollary of Theorem 2 from [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].
The threshold is computed as (with \(t_0 = 0\)):
\[\beta(t_0, t, \delta) := \left(1 + \frac{1}{t - t_0 + 1}\right) 2 \log\left(\frac{2 (t - t_0) \sqrt{(t - t_0) + 2}}{\delta}\right).\]
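As a rough illustration, with \(t_0 = 0\) and a given confidence level delta (both assumptions of this sketch, not the function's actual argument handling), the formula reads:

import numpy as np

def gaussian_glr_threshold(t, delta):
    """beta(0, t, delta) from the formula above, for t >= 1 and 0 < delta < 1 (illustrative sketch)."""
    return (1.0 + 1.0 / (t + 1.0)) * 2.0 * np.log(2.0 * t * np.sqrt(t + 2.0) / delta)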
-
Policies.GLR_UCB.function_h_minus_one(x)[source]¶ The inverse function of \(h(u)\), that is \(h^{-1}(x) = u \Leftrightarrow h(u) = x\). It is given by the Lambert W function, see scipy.special.lambertw():

\[h^{-1}(x) = - \mathcal{W}(- \exp(-x)).\]

Example:

>>> np.random.seed(105)
>>> y = np.random.randn() ** 2
>>> print(f"y = {y}")
y = 0.060184682907834595
>>> x = function_h(y)
>>> print(f"h(y) = {x}")
h(y) = 2.8705220786966508
>>> z = function_h_minus_one(x)
>>> print(f"h^-1(x) = {z}")
h^-1(x) = 0.060184682907834595
>>> assert np.isclose(z, y), "Error: h^-1(h(y)) = z = {} should be very close to y = {}...".format(z, y)
-
Policies.GLR_UCB.constant_power_function_h= 1.5¶ The constant \(\frac{3}{2}\), used in the definition of functions \(h\), \(h^{-1}\), \(\tilde{h}\) and \(\mathcal{T}\).
-
Policies.GLR_UCB.threshold_function_h_tilde= 3.801770285137458¶ The constant \(h^{-1}(1/\log(\frac{3}{2}))\), used in the definition of function \(\tilde{h}\).
-
Policies.GLR_UCB.constant_function_h_tilde= -0.90272045571788¶ The constant \(\log(\log(\frac{3}{2}))\), used in the definition of function \(\tilde{h}\).
-
Policies.GLR_UCB.function_h_tilde(x)[source]¶ The function \(\tilde{h}(x)\), defined by:
\[\begin{split}\tilde{h}(x) = \begin{cases} e^{1/h^{-1}(x)} h^{-1}(x) & \text{ if } x \ge h^{-1}(1/\ln (3/2)), \\ (3/2) (x-\ln \ln (3/2)) & \text{otherwise}. \end{cases}\end{split}\]
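A direct transcription of this piecewise definition, reusing function_h_minus_one and the constants documented nearby (only a sketch under the assumption that the package exposes the module as Policies.GLR_UCB; the name h_tilde is ours):

import numpy as np
from Policies.GLR_UCB import (function_h_minus_one,
                              threshold_function_h_tilde,   # h^{-1}(1 / log(3/2)) ~= 3.8018
                              constant_function_h_tilde,    # log(log(3/2)) ~= -0.9027
                              constant_power_function_h)    # 3/2

def h_tilde(x):
    """Piecewise definition of h~(x) given above (illustrative sketch)."""
    if x >= threshold_function_h_tilde:
        h_m1 = function_h_minus_one(x)
        return np.exp(1.0 / h_m1) * h_m1
    return constant_power_function_h * (x - constant_function_h_tilde)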
-
Policies.GLR_UCB.zeta_of_two= 1.6449340668482264¶ The constant \(\zeta(2) = \frac{\pi^2}{6}\).
-
Policies.GLR_UCB.function_T_mathcal(x)[source]¶ The function \(\mathcal{T}(x)\), defined by:
\[\mathcal{T}(x) = 2 \tilde h\left(\frac{h^{-1}(1+x) + \ln(2\zeta(2))}{2}\right).\]
-
Policies.GLR_UCB.approximation_function_T_mathcal(x)[source]¶ An efficiently computed approximation of \(\mathcal{T}(x)\), valid for \(x \geq 5\):
\[\mathcal{T}(x) \simeq x + 4 \log(1 + x + \sqrt{2 x}).\]
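Both the exact definition and its approximation fit in a couple of lines (again a sketch with our own function names, reusing the helpers documented above and assuming the module is importable as Policies.GLR_UCB):

import numpy as np
from Policies.GLR_UCB import function_h_tilde, function_h_minus_one, zeta_of_two

def T_mathcal(x):
    """Exact definition: T(x) = 2 h~((h^{-1}(1 + x) + log(2 zeta(2))) / 2)."""
    return 2.0 * function_h_tilde((function_h_minus_one(1.0 + x) + np.log(2.0 * zeta_of_two)) / 2.0)

def T_mathcal_approx(x):
    """Cheap approximation of T(x), accurate for x >= 5."""
    return x + 4.0 * np.log(1.0 + x + np.sqrt(2.0 * x))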
-
Policies.GLR_UCB.threshold_BernoulliGLR(t, horizon=None, delta=None, variant=None)[source]¶ Compute the value \(c\) from the corollary of Theorem 2 from [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].
Warning
This is still experimental; you can try different variants of the threshold function:

Variant #0 (default) is:

\[\beta(t, \delta) := \log\left(\frac{3 t^{3/2}}{\delta}\right) = \log\left(\frac{1}{\delta}\right) + \log(3) + \frac{3}{2} \log(t).\]

Variant #1 is smaller:

\[\beta(t, \delta) := \log\left(\frac{1}{\delta}\right) + \log(1 + \log(t)).\]

Variant #2 uses \(\mathcal{T}\):

\[\beta(t, \delta) := 2 \mathcal{T}\left(\frac{\log(2 t^{3/2} / \delta)}{2}\right) + 6 \log(1 + \log(t)).\]

Variant #3 uses \(\tilde{\mathcal{T}}(x) = x + 4 \log(1 + x + \sqrt{2x})\), an approximation of \(\mathcal{T}(x)\) (valid and quite accurate as soon as \(x \geq 5\)):

\[\beta(t, \delta) := 2 \tilde{\mathcal{T}}\left(\frac{\log(2 t^{3/2} / \delta)}{2}\right) + 6 \log(1 + \log(t)).\]
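For instance, variants #0 and #1 reduce to one-line closed forms (an illustrative sketch for \(t \geq 1\) and \(0 < \delta < 1\); the function names are ours, while the module's threshold_BernoulliGLR selects the variant through its variant argument):

import numpy as np

def bernoulli_glr_threshold_variant0(t, delta):
    """Variant #0: log(3 t^{3/2} / delta)."""
    return np.log(3.0 * t ** 1.5 / delta)

def bernoulli_glr_threshold_variant1(t, delta):
    """Variant #1 (smaller): log(1/delta) + log(1 + log(t))."""
    return np.log(1.0 / delta) + np.log(1.0 + np.log(t))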
-
Policies.GLR_UCB.EXPONENT_BETA= 1.01¶ The default value of parameter \(\beta\) for the function
decreasing_alpha__GLR().
-
Policies.GLR_UCB.ALPHA_T1= 0.05¶ The default value of parameter \(\alpha_{t=1}\) for the function
decreasing_alpha__GLR().
-
Policies.GLR_UCB.decreasing_alpha__GLR(alpha0=None, t=1, exponentBeta=1.01, alpha_t1=0.05)[source]¶ Either use a fixed alpha, or compute it with an exponential decay (if alpha0=None).

Note

I am currently exploring the following variant (November 2018):

The probability of uniform exploration, \(\alpha\), is computed as a function of the current time:

\[\forall t>0, \alpha = \alpha_t := \alpha_{t=1} \frac{1}{\max(1, t^{\beta})}.\]

with \(\beta > 1\), \(\beta\) = exponentBeta (default 1.01) and \(\alpha_{t=1} < 1\), \(\alpha_{t=1}\) = alpha_t1 (default 0.05).

The only requirement on \(\alpha_t\) seems to be that \(\sum_{t=1}^T \alpha_t < +\infty\) (i.e., be finite), which is the case for \(\alpha_t = \alpha = \frac{1}{T}\), but also for any \(\alpha_t = \frac{\alpha_1}{t^{\beta}}\) with \(\beta > 1\) (cf. Riemann series).
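The rule above is only a few lines of Python (a sketch mirroring the documented signature, not the module's exact code):

def decreasing_alpha_glr(alpha0=None, t=1, exponentBeta=1.01, alpha_t1=0.05):
    """Fixed alpha0 if given, otherwise alpha_t = alpha_{t=1} / max(1, t^beta)."""
    if alpha0 is not None:
        return alpha0
    return alpha_t1 / max(1.0, float(t) ** exponentBeta)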
-
Policies.GLR_UCB.smart_delta_from_T_UpsilonT(horizon=1, max_nb_random_events=1, scaleFactor=1.0, per_arm_restart=True, nbArms=1)[source]¶ Compute a smart estimate of the optimal value for the confidence level \(\delta\), with scaleFactor \(= \delta_0 \in (0,1)\) a constant.

If per_arm_restart is True (Local option):

\[\delta = \frac{\delta_0}{\sqrt{K \Upsilon_T T}}.\]

If per_arm_restart is False (Global option):

\[\delta = \frac{\delta_0}{\sqrt{\Upsilon_T T}}.\]

Note that if \(\Upsilon_T\) is unknown, it is assumed to be \(\Upsilon_T=1\).
-
Policies.GLR_UCB.smart_alpha_from_T_UpsilonT(horizon=1, max_nb_random_events=1, scaleFactor=0.1, per_arm_restart=True, nbArms=1)[source]¶ Compute a smart estimate of the optimal value for the fixed or random forced-exploration probability \(\alpha\) (or tracking based), with scaleFactor \(= \alpha_0 \in (0,1)\) a constant.

If per_arm_restart is True (Local option):

\[\alpha = \alpha_0 \times \sqrt{\frac{K \Upsilon_T}{T} \log(T)}.\]

If per_arm_restart is False (Global option):

\[\alpha = \alpha_0 \times \sqrt{\frac{\Upsilon_T}{T} \log(T)}.\]

Note that if \(\Upsilon_T\) is unknown, it is assumed to be \(\Upsilon_T=1\).
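Both heuristics (this one and smart_delta_from_T_UpsilonT above) transcribe directly to code. The sketch below uses our own function names and simply clips \(\Upsilon_T\) to be at least 1, which is one possible reading of the note above:

import numpy as np

def smart_delta(horizon=1, max_nb_random_events=1, scaleFactor=1.0, per_arm_restart=True, nbArms=1):
    """delta = delta_0 / sqrt(K * Upsilon_T * T) (Local) or delta_0 / sqrt(Upsilon_T * T) (Global)."""
    UpsilonT = max(1, max_nb_random_events if max_nb_random_events is not None else 1)
    K = nbArms if per_arm_restart else 1
    return scaleFactor / np.sqrt(K * UpsilonT * horizon)

def smart_alpha(horizon=1, max_nb_random_events=1, scaleFactor=0.1, per_arm_restart=True, nbArms=1):
    """alpha = alpha_0 * sqrt(K * Upsilon_T / T * log(T)) (Local), without the K factor (Global)."""
    UpsilonT = max(1, max_nb_random_events if max_nb_random_events is not None else 1)
    K = nbArms if per_arm_restart else 1
    return scaleFactor * np.sqrt(K * UpsilonT / horizon * np.log(horizon))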
-
class
Policies.GLR_UCB.GLR_IndexPolicy(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]¶ Bases:
Policies.CD_UCB.CD_IndexPolicy

The GLR-UCB generic policy for non-stationary bandits, using the Generalized Likelihood Ratio test (GLR), for 1-dimensional exponential families.
It works for any 1-dimensional exponential family, you just have to give a kl function.

For instance kullback.klBern(), for Bernoulli distributions, gives BernoulliGLR_IndexPolicy, and kullback.klGauss(), for univariate Gaussian distributions, gives GaussianGLR_IndexPolicy.

threshold_function computes the threshold \(\beta(t, \delta)\); it can be for instance threshold_GaussianGLR() or threshold_BernoulliGLR().

From [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].
Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)
-
__init__(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]¶ New policy.
-
horizon= None¶ The horizon \(T\).
-
max_nb_random_events= None¶ The number of breakpoints \(\Upsilon_T\).
-
use_localization= None¶ Experimental option to use localization of the break-point, i.e., restart the arm's memory by keeping the observations \(s+1, \ldots, n\) instead of just the last one.
-
delta= None¶ The confidence level \(\delta\). Defaults to \(\delta=\frac{1}{\sqrt{T}}\) if horizon is given and delta=None but \(\Upsilon_T\) is unknown. Defaults to \(\delta=\frac{1}{\sqrt{\Upsilon_T T}}\) if both \(T\) and \(\Upsilon_T\) are given (horizon and max_nb_random_events).
-
kl= None¶ The parametrized Kullback-Leibler divergence (\(\mathrm{kl}(x,y) = \mathrm{KL}(D(x),D(y))\)) for the 1-dimensional exponential family \(x\mapsto D(x)\). Example: kullback.klBern() or kullback.klGauss().
-
lazy_try_value_s_only_x_steps= None¶ Be lazy and try to detect changes for \(s\) taking steps of size
steps_s.
-
property
proba_random_exploration¶ What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.
-
getReward(arm, reward)[source]¶ Do as CD_UCB to handle the new reward, and also update the internal times of each arm for the indexes of klUCB_forGLR (or other index policies), which use \(f(t - \tau_i(t))\) for the exploration function of each arm \(i\) at time \(t\), where \(\tau_i(t)\) denotes the (last) restart time of the arm.
-
detect_change(arm, verbose=False)[source]¶ Detect a change in the current arm, using the Generalized Likelihood Ratio test (GLR) and the kl function.

For each time step \(s\) between \(t_0=0\) and \(t\), compute:

\[G^{\mathrm{kl}}_{t_0:s:t} = (s-t_0+1) \mathrm{kl}(\mu_{t_0,s}, \mu_{t_0,t}) + (t-s) \mathrm{kl}(\mu_{s+1,t}, \mu_{t_0,t}).\]

The change is detected if there is a time \(s\) such that \(G^{\mathrm{kl}}_{t_0:s:t} > h\), where threshold_h is the threshold of the test, and \(\mu_{a,b} = \frac{1}{b-a+1} \sum_{s=a}^{b} y_s\) is the mean of the samples between \(a\) and \(b\).

Warning

This is computationally costly, so an easy way to speed up this test is to use lazy_try_value_s_only_x_steps \(= \mathrm{Step_s}\) for a small value (e.g., 10), so as not to test all \(s\in[t_0, t-1]\) but only those \(s\in[t_0, t-1]\) with \(s \mod \mathrm{Step_s} = 0\) (e.g., one out of every 10 steps).
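The scan itself fits in a few lines of NumPy. This is only an illustration of the statistic above, with \(t_0 = 0\) and the lazy steps_s trick, not the module's actual implementation:

import numpy as np

def glr_detect_change(rewards, kl, threshold_h, steps_s=1):
    """Sketch of the GLR scan over the rewards of one arm since its last restart (t0 = 0).

    rewards: observations y_0, ..., y_t collected since the last restart,
    kl: the parametrized divergence (e.g. a Bernoulli or Gaussian kl function),
    threshold_h: the detection threshold beta(t, delta),
    steps_s: only test one split point s out of every steps_s (the "lazy" speed-up).
    Returns (detected, s) with the first split point exceeding the threshold, or (False, None).
    """
    y = np.asarray(rewards, dtype=float)
    t = len(y) - 1                      # last time index, data is y[0..t]
    mu_0_t = np.mean(y)                 # mu_{t0, t}
    for s in range(0, t, steps_s):
        mu_0_s = np.mean(y[:s + 1])     # mu_{t0, s}
        mu_s1_t = np.mean(y[s + 1:])    # mu_{s+1, t}
        glr = (s + 1) * kl(mu_0_s, mu_0_t) + (t - s) * kl(mu_s1_t, mu_0_t)
        if glr > threshold_h:
            return True, s
    return False, None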
-
__module__= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.GLR_IndexPolicy_WithTracking(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy

A variant of the GLR policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven't been explored enough.
Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)
-
choice()[source]¶ If any arm is not explored enough (\(n_k \leq \frac{\alpha}{K} \times (t - n_k)\)), play uniformly at random one of these arms; otherwise, pass the call to choice() of the underlying policy.
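A minimal sketch of this tracking rule; the names pulls, t, alpha and default_choice are placeholders for the policy's internal state, not the module's actual attributes:

import numpy as np

def tracking_choice(pulls, t, alpha, default_choice):
    """Play an under-explored arm if one exists (n_k <= (alpha / K) * (t - n_k)), else defer to the index policy."""
    K = len(pulls)
    under_explored = [k for k, n_k in enumerate(pulls) if n_k <= (alpha / K) * (t - n_k)]
    if under_explored:
        return np.random.choice(under_explored)
    return default_choice()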
-
__module__= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration(nbArms, horizon=None, delta=None, max_nb_random_events=None, kl=<function klGauss>, alpha0=None, exponentBeta=1.01, alpha_t1=0.05, threshold_function=<function threshold_BernoulliGLR>, variant=None, use_increasing_alpha=False, lazy_try_value_s_only_x_steps=10, per_arm_restart=True, use_localization=True, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy

A variant of the GLR policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed.
If \(t\) is the current time and \(\tau\) is the latest restarting time, then uniform exploration is done if:
\[\begin{split}A &:= (t - \tau) \mod \lceil \frac{K}{\gamma} \rceil,\\ A &\leq K \implies A_t = A.\end{split}\]

Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)
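Under one reading of the rule above (0-indexed arms, with \(\gamma\) playing the role of the exploration probability), the schedule forces each of the \(K\) arms once at the beginning of every window of length \(\lceil K/\gamma \rceil\) after the latest restart. A hypothetical sketch, with default_choice standing in for the underlying index policy:

import numpy as np

def deterministic_exploration_choice(t, tau, nbArms, gamma, default_choice):
    """M-UCB-style deterministic exploration schedule (illustrative sketch only)."""
    period = int(np.ceil(nbArms / gamma))   # ceil(K / gamma)
    A = (t - tau) % period
    if A < nbArms:            # forced exploration slot: play arm A
        return A
    return default_choice()   # otherwise the underlying index policy chooses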
-
choice()[source]¶ For the deterministically chosen exploration steps, play the corresponding arm; otherwise, pass the call to choice() of the underlying policy.
-
__module__= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.GaussianGLR_IndexPolicy(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy

The GaussianGLR-UCB policy for non-stationary bandits, for fixed-variance Gaussian distributions (i.e., \(\sigma^2\) known and fixed).
-
__init__(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]¶ New policy.
-
_sig2= None¶ Fixed variance \(\sigma^2\) of the Gaussian distributions. Extra parameter given to kullback.klGauss(). Defaults to \(\sigma^2 = \frac{1}{4}\).
-
__module__= 'Policies.GLR_UCB'¶
-
-
class
Policies.GLR_UCB.GaussianGLR_IndexPolicy_WithTracking(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy_WithTracking, Policies.GLR_UCB.GaussianGLR_IndexPolicy

A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven't been explored enough.
-
__module__= 'Policies.GLR_UCB'¶
-
-
class
Policies.GLR_UCB.GaussianGLR_IndexPolicy_WithDeterministicExploration(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_GaussianGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration, Policies.GLR_UCB.GaussianGLR_IndexPolicy

A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed.
-
__module__= 'Policies.GLR_UCB'¶
-
-
class
Policies.GLR_UCB.BernoulliGLR_IndexPolicy(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy

The BernoulliGLR-UCB policy for non-stationary bandits, for Bernoulli distributions.
Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)
-
__init__(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ New policy.
-
__module__= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.BernoulliGLR_IndexPolicy_WithTracking(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy_WithTracking, Policies.GLR_UCB.BernoulliGLR_IndexPolicy

A variant of the BernoulliGLR-UCB policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven't been explored enough.
Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)
-
__module__= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.BernoulliGLR_IndexPolicy_WithDeterministicExploration(nbArms, kl=<function klBern>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration, Policies.GLR_UCB.BernoulliGLR_IndexPolicy

A variant of the BernoulliGLR-UCB policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed.
Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)
-
__module__= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.OurGaussianGLR_IndexPolicy(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy

The GaussianGLR-UCB policy for non-stationary bandits, for fixed-variance Gaussian distributions (i.e., \(\sigma^2\) known and fixed), but with our threshold designed for the sub-Bernoulli case.
Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)
-
__init__(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ New policy.
-
_sig2= None¶ Fixed variance \(\sigma^2\) of the Gaussian distributions. Extra parameter given to kullback.klGauss(). Defaults to \(\sigma^2 = \frac{1}{4}\).
-
__module__= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.OurGaussianGLR_IndexPolicy_WithTracking(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy_WithTracking, Policies.GLR_UCB.OurGaussianGLR_IndexPolicy

A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but based on a tracking of arms that haven't been explored enough, with our threshold designed for the sub-Bernoulli case.
Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)
-
__module__= 'Policies.GLR_UCB'¶
-
class
Policies.GLR_UCB.OurGaussianGLR_IndexPolicy_WithDeterministicExploration(nbArms, sig2=0.25, kl=<function klGauss>, threshold_function=<function threshold_BernoulliGLR>, *args, **kwargs)[source]¶ Bases:
Policies.GLR_UCB.GLR_IndexPolicy_WithDeterministicExploration, Policies.GLR_UCB.OurGaussianGLR_IndexPolicy

A variant of the GaussianGLR-UCB policy where the exploration is not forced to be uniformly random but deterministic, inspired by what M-UCB proposed, but with our threshold designed for the sub-Bernoulli case.
Reference: [“Combining the Generalized Likelihood Ratio Test and kl-UCB for Non-Stationary Bandits”, E. Kaufmann and L. Besson, 2019](https://hal.inria.fr/hal-02006471/)
-
__module__= 'Policies.GLR_UCB'¶
-
Policies.GLR_UCB.SubGaussianGLR_DELTA= 0.01¶ Default confidence level for
SubGaussianGLR_IndexPolicy.
-
Policies.GLR_UCB.SubGaussianGLR_SIGMA= 0.25¶ By default, SubGaussianGLR_IndexPolicy assumes distributions are 0.25-sub-Gaussian, like Bernoulli or any distribution with support in \([0,1]\).
-
Policies.GLR_UCB.SubGaussianGLR_JOINT= True¶ Whether to use the joint or disjoint threshold function (threshold_SubGaussianGLR_joint() or threshold_SubGaussianGLR_disjoint()) for SubGaussianGLR_IndexPolicy.
-
Policies.GLR_UCB.threshold_SubGaussianGLR_joint(s, t, delta=0.01, sigma=0.25)[source]¶ Compute the threshold \(b^{\text{joint}}_{t_0}(s,t,\delta)\) according to this formula:
\[b^{\text{joint}}_{t_0}(s,t,\delta) := \sigma \sqrt{ \left(\frac{1}{s-t_0+1} + \frac{1}{t-s}\right) \left(1 + \frac{1}{t-t_0+1}\right) 2 \log\left( \frac{2(t-t_0)\sqrt{t-t_0+2}}{\delta} \right)}.\]
-
Policies.GLR_UCB.threshold_SubGaussianGLR_disjoint(s, t, delta=0.01, sigma=0.25)[source]¶ Compute the threshold \(b^{\text{disjoint}}_{t_0}(s,t,\delta)\) according to this formula:
\[b^{\text{disjoint}}_{t_0}(s,t,\delta) := \sqrt{2} \sigma \sqrt{\frac{1 + \frac{1}{s - t_0 + 1}}{s - t_0 + 1} \log\left( \frac{4 \sqrt{s - t_0 + 2}}{\delta}\right)} + \sqrt{\frac{1 + \frac{1}{t - s + 1}}{t - s + 1} \log\left( \frac{4 (t - t_0) \sqrt{t - s + 1}}{\delta}\right)}.\]
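With \(t_0 = 0\), both thresholds transcribe directly. The sketch below uses our own function names and assumes the \(\sqrt{2}\,\sigma\) factor of the disjoint formula multiplies the sum of the two square roots, which is one reading of the displayed formula:

import numpy as np

def subgaussian_threshold_joint(s, t, delta=0.01, sigma=0.25):
    """b^joint_{t0=0}(s, t, delta) from the formula above, for 0 <= s < t (illustrative sketch)."""
    return sigma * np.sqrt(
        (1.0 / (s + 1.0) + 1.0 / (t - s))
        * (1.0 + 1.0 / (t + 1.0))
        * 2.0 * np.log(2.0 * t * np.sqrt(t + 2.0) / delta)
    )

def subgaussian_threshold_disjoint(s, t, delta=0.01, sigma=0.25):
    """b^disjoint_{t0=0}(s, t, delta) from the formula above, for 0 <= s < t (illustrative sketch)."""
    left = np.sqrt((1.0 + 1.0 / (s + 1.0)) / (s + 1.0) * np.log(4.0 * np.sqrt(s + 2.0) / delta))
    right = np.sqrt((1.0 + 1.0 / (t - s + 1.0)) / (t - s + 1.0) * np.log(4.0 * t * np.sqrt(t - s + 1.0) / delta))
    return np.sqrt(2.0) * sigma * (left + right)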
-
Policies.GLR_UCB.threshold_SubGaussianGLR(s, t, delta=0.01, sigma=0.25, joint=True)[source]¶ Compute the threshold \(b^{\text{joint}}_{t_0}(s,t,\delta)\) or \(b^{\text{disjoint}}_{t_0}(s,t,\delta)\).
-
class
Policies.GLR_UCB.SubGaussianGLR_IndexPolicy(nbArms, horizon=None, max_nb_random_events=None, full_restart_when_refresh=False, policy=<class 'Policies.UCB.UCB'>, delta=0.01, sigma=0.25, joint=True, exponentBeta=1.05, alpha_t1=0.1, alpha0=None, lazy_detect_change_only_x_steps=10, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]¶ Bases:
Policies.CD_UCB.CD_IndexPolicy

The SubGaussianGLR-UCB policy for non-stationary bandits, using the Generalized Likelihood Ratio test (GLR), for sub-Gaussian distributions.

It works for any sub-Gaussian family of distributions, i.e., \(\sigma^2\)-sub-Gaussian with known \(\sigma\).
From [“Sequential change-point detection: Laplace concentration of scan statistics and non-asymptotic delay bounds”, O.-A. Maillard, 2018].
-
__init__(nbArms, horizon=None, max_nb_random_events=None, full_restart_when_refresh=False, policy=<class 'Policies.UCB.UCB'>, delta=0.01, sigma=0.25, joint=True, exponentBeta=1.05, alpha_t1=0.1, alpha0=None, lazy_detect_change_only_x_steps=10, lazy_try_value_s_only_x_steps=10, use_localization=True, *args, **kwargs)[source]¶ New policy.
-
horizon= None¶ The horizon \(T\).
-
max_nb_random_events= None¶ The number of breakpoints \(\Upsilon_T\).
-
delta= None¶ The confidence level \(\delta\). Defaults to \(\delta=\frac{1}{T}\) if horizon is given and delta=None.
-
sigma= None¶ Parameter \(\sigma\) for the Sub-Gaussian-GLR test.
-
joint= None¶ Parameter joint for the Sub-Gaussian-GLR test.
-
lazy_try_value_s_only_x_steps= None¶ Be lazy and try to detect changes for \(s\) taking steps of size
steps_s.
-
use_localization= None¶ Experimental option to use localization of the break-point, i.e., restart the arm's memory by keeping the observations \(s+1, \ldots, n\) instead of just the last one.
-
compute_threshold_h(s, t)[source]¶ Compute the threshold \(h\) with
threshold_SubGaussianGLR().
-
__module__= 'Policies.GLR_UCB'¶
-
property
proba_random_exploration¶ What they call \(\alpha\) in their paper: the probability of uniform exploration at each time.
-
detect_change(arm, verbose=False)[source]¶ Detect a change in the current arm, using the non-parametric sub-Gaussian Generalized Likelihood Ratio test (GLR), which works as follows:

For each time step \(s\) between \(t_0=0\) and \(t\), compute:

\[G^{\text{sub-}\sigma}_{t_0:s:t} = |\mu_{t_0,s} - \mu_{s+1,t}|.\]

The change is detected if there is a time \(s\) such that \(G^{\text{sub-}\sigma}_{t_0:s:t} > b_{t_0}(s,t,\delta)\), where \(b_{t_0}(s,t,\delta)\) is the threshold of the test.

The threshold is computed as:

\[b_{t_0}(s,t,\delta) := \sigma \sqrt{ \left(\frac{1}{s-t_0+1} + \frac{1}{t-s}\right) \left(1 + \frac{1}{t-t_0+1}\right) 2 \log\left( \frac{2(t-t_0)\sqrt{t-t_0+2}}{\delta} \right)}.\]

And \(\mu_{a,b} = \frac{1}{b-a+1} \sum_{s=a}^{b} y_s\) is the mean of the samples between \(a\) and \(b\).
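Putting the statistic and the threshold together, the whole test is a short scan over the split points \(s\). An illustrative sketch with \(t_0 = 0\), reusing the documented threshold_SubGaussianGLR and assuming the module is importable as Policies.GLR_UCB:

import numpy as np
from Policies.GLR_UCB import threshold_SubGaussianGLR

def subgaussian_glr_detect_change(rewards, delta=0.01, sigma=0.25, joint=True, steps_s=1):
    """Sketch of the sub-Gaussian GLR scan over the rewards of one arm since its last restart (t0 = 0)."""
    y = np.asarray(rewards, dtype=float)
    t = len(y) - 1                                           # last time index, data is y[0..t]
    for s in range(0, t, steps_s):
        gap = abs(np.mean(y[:s + 1]) - np.mean(y[s + 1:]))   # |mu_{t0,s} - mu_{s+1,t}|
        if gap > threshold_SubGaussianGLR(s, t, delta=delta, sigma=sigma, joint=joint):
            return True, s
    return False, None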