# Policies.UCBdagger module¶

The UCB-dagger ($$\mathrm{UCB}^{\dagger}$$, UCB†) policy, a significant improvement over UCB by auto-tuning the confidence level.

• Reference: [[Auto-tuning the Confidence Level for Optimistic Bandit Strategies, Lattimore, unpublished, 2017]](http://tor-lattimore.com/)

Policies.UCBdagger.ALPHA = 1

Default value for the parameter $$\alpha > 0$$ for UCBdagger.

Policies.UCBdagger.log_bar(x)[source]

The $$\mathrm{l\overline{og}}$$ function defined by Lattimore:

$\mathrm{l\overline{og}}(x) := \log\left((x+e)\sqrt{\log(x+e)}\right)$
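A direct NumPy translation of this definition (a sketch, not necessarily the module's exact source):

```python
import numpy as np

def log_bar(x):
    """Lattimore's l-og-bar function: log((x + e) * sqrt(log(x + e)))."""
    return np.log((x + np.e) * np.sqrt(np.log(x + np.e)))
```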

Some values:

>>> import numpy as np
>>> for x in np.logspace(0, 7, 8):
...     print("x = {:<5.3g} gives log_bar(x) = {:<5.3g}".format(x, log_bar(x)))
x = 1     gives log_bar(x) = 1.45
x = 10    gives log_bar(x) = 3.01
x = 100   gives log_bar(x) = 5.4
x = 1e+03 gives log_bar(x) = 7.88
x = 1e+04 gives log_bar(x) = 10.3
x = 1e+05 gives log_bar(x) = 12.7
x = 1e+06 gives log_bar(x) = 15.1
x = 1e+07 gives log_bar(x) = 17.5


Illustration:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> X = np.linspace(0, 1000, 2000)
>>> Y = log_bar(X)
>>> plt.plot(X, Y)
>>> plt.title(r"The $\mathrm{l\overline{og}}$ function")
>>> plt.show()

Policies.UCBdagger.Ki_function(pulls, i)[source]

Compute the $$K_i(t)$$ index as defined in the article, for one arm i.
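Based on the definition $$K_i(t) = \sum_{j=1}^{K} \min(1, \sqrt{N_j(t)/N_i(t)})$$ used in the index, a sketch of this helper (assuming `pulls` is the array of pull counts $$N_j(t)$$, all positive):

```python
import numpy as np

def Ki_function(pulls, i):
    """Sketch of K_i(t) = sum_j min(1, sqrt(N_j(t) / N_i(t))), assuming pulls[i] > 0."""
    pulls = np.asarray(pulls, dtype=float)
    return np.sum(np.minimum(1.0, np.sqrt(pulls / pulls[i])))
```

For example, with pull counts `[4, 1, 9]` and `i = 0`, the terms are `min(1, 1) + min(1, 0.5) + min(1, 1.5) = 2.5`.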

Policies.UCBdagger.Ki_vectorized(pulls)[source]

Compute the $$K_i(t)$$ index as defined in the article, for all arms (in a vectorized manner).

Warning

I didn’t find a fast vectorized formula, so don’t use this one.
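For illustration only, one broadcast-based sketch does compute all $$K_i(t)$$ at once, but it allocates a full $$K \times K$$ ratio matrix, so it is not necessarily faster than the per-arm loop (which may be why the author advises against the vectorized version):

```python
import numpy as np

def Ki_vectorized(pulls):
    """All K_i(t) at once via broadcasting; builds a K x K matrix (O(K^2) memory)."""
    pulls = np.asarray(pulls, dtype=float)
    # ratios[i, j] = N_j(t) / N_i(t); clip sqrt at 1, then sum over j
    ratios = pulls[np.newaxis, :] / pulls[:, np.newaxis]
    return np.minimum(1.0, np.sqrt(ratios)).sum(axis=1)
```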

class Policies.UCBdagger.UCBdagger(nbArms, horizon=None, alpha=1, lower=0.0, amplitude=1.0)[source]

The UCB-dagger ($$\mathrm{UCB}^{\dagger}$$, UCB†) policy, a significant improvement over UCB by auto-tuning the confidence level.

__init__(nbArms, horizon=None, alpha=1, lower=0.0, amplitude=1.0)[source]

New generic index policy.

• nbArms: the number of arms,

• lower, amplitude: lower value and known amplitude of the rewards.

alpha = None

Parameter $$\alpha > 0$$.

horizon = None

Parameter $$T > 0$$.

__str__() → str[source]

getReward(arm, reward)[source]

Give a reward: increase t, pulls, and update cumulated sum of rewards for that arm (normalized in [0, 1]).
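A sketch of what this update does, using a hypothetical `get_reward` helper and a plain dict for the policy state rather than the class's actual attributes:

```python
def get_reward(state, arm, reward, lower=0.0, amplitude=1.0):
    """Sketch: increment t and pulls[arm], add the reward normalized to [0, 1]."""
    state["t"] += 1
    state["pulls"][arm] += 1
    state["rewards"][arm] += (reward - lower) / amplitude
```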

computeIndex(arm)[source]

Compute the current index, at time t and after $$N_k(t)$$ pulls of arm k:

$\begin{split}I_k(t) &= \frac{X_k(t)}{N_k(t)} + \sqrt{\frac{2 \alpha}{N_k(t)} \mathrm{l\overline{og}}\left( \frac{T}{H_k(t)} \right)}, \\ \text{where}\;\; & H_k(t) := N_k(t) K_k(t) \\ \text{and}\;\; & K_k(t) := \sum_{j=1}^{K} \min\left(1, \sqrt{\frac{N_j(t)}{N_k(t)}}\right).\end{split}$
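The index can be sketched as a standalone function (hypothetical names: `X` is the vector of cumulated rewards $$X_k(t)$$, `pulls` the vector of pull counts $$N_k(t)$$; this is a sketch of the formula, not the class's actual method):

```python
import numpy as np

def log_bar(x):
    """log((x + e) * sqrt(log(x + e))), as defined above."""
    return np.log((x + np.e) * np.sqrt(np.log(x + np.e)))

def compute_index(X, pulls, k, horizon, alpha=1.0):
    """Sketch of the UCB-dagger index I_k(t) for arm k, assuming pulls[k] > 0."""
    pulls = np.asarray(pulls, dtype=float)
    # K_k(t) = sum_j min(1, sqrt(N_j(t) / N_k(t)))
    K_k = np.sum(np.minimum(1.0, np.sqrt(pulls / pulls[k])))
    H_k = pulls[k] * K_k  # H_k(t) = N_k(t) K_k(t)
    exploration = np.sqrt((2.0 * alpha / pulls[k]) * log_bar(horizon / H_k))
    return X[k] / pulls[k] + exploration
```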
__module__ = 'Policies.UCBdagger'