# Policies.TsallisInf module

The 1/2-Tsallis-Inf policy for bounded bandits, (order-)optimal for both stochastic and adversarial bandits.

• Reference: [“An Optimal Algorithm for Stochastic and Adversarial Bandits”, Julian Zimmert, Yevgeny Seldin, 2018, arXiv:1807.07623](https://arxiv.org/abs/1807.07623)

Policies.TsallisInf.ALPHA = 0.5

Default value of $$\alpha$$, the parameter of the Tsallis entropy. We focus on the 1/2-Tsallis algorithm, i.e., $$\alpha=\frac{1}{2}$$.

class Policies.TsallisInf.TsallisInf(nbArms, alpha=0.5, lower=0.0, amplitude=1.0)[source]

The 1/2-Tsallis-Inf policy for bounded bandits, (order-)optimal for both stochastic and adversarial bandits.

• Reference: [“An Optimal Algorithm for Stochastic and Adversarial Bandits”, Julian Zimmert, Yevgeny Seldin, 2018, arXiv:1807.07623](https://arxiv.org/abs/1807.07623)
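
A minimal usage sketch, assuming the standard SMPyBandits policy interface (`startGame()`, `choice()`, `getReward()`); the Bernoulli arms and the simulation loop are illustrative, not part of this module:

```python
import numpy as np
from Policies.TsallisInf import TsallisInf  # import path as documented here

means = [0.1, 0.5, 0.9]                 # hypothetical Bernoulli arm means
policy = TsallisInf(nbArms=len(means))  # uses the default alpha = 0.5
policy.startGame()                      # reset internal state (assumed API)

rng = np.random.default_rng(42)
for t in range(1000):
    arm = policy.choice()                      # draw an arm from the trusts
    reward = float(rng.random() < means[arm])  # Bernoulli reward in [0, 1]
    policy.getReward(arm, reward)              # one OMD update of the trusts
```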

__init__(nbArms, alpha=0.5, lower=0.0, amplitude=1.0)[source]

New policy.

alpha = None

Store the constant $$\alpha$$ used by the Online-Mirror-Descent step with the $$\alpha$$-Tsallis entropy.

inverse_exponent = None

Store $$\frac{1}{\alpha-1}$$ so it is computed only once.

cumulative_losses = None

Keep in memory the vector $$\hat{L}_t$$ of cumulative unbiased loss estimates.
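
Taken together, the constructor plausibly reduces to something like the following sketch, based only on the attribute descriptions above (an assumption, not the library's exact source):

```python
import numpy as np

def __init__(self, nbArms, alpha=0.5, lower=0., amplitude=1.):
    # Sketch inferred from the documented attributes; the real constructor
    # may differ. The BasePolicy parent handles nbArms/lower/amplitude.
    super(TsallisInf, self).__init__(nbArms, lower=lower, amplitude=amplitude)
    self.alpha = alpha                           # Tsallis entropy parameter
    self.inverse_exponent = 1.0 / (alpha - 1.0)  # = -2 for alpha = 1/2
    self.cumulative_losses = np.zeros(nbArms)    # \hat{L}_t starts at zero
```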

__str__() → str[source]

property eta

Decreasing learning rate, $$\eta_t = \frac{1}{\sqrt{t}}$$.
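
As a one-liner, this property could look as follows; this is a sketch, and `self.t` is assumed to count the rounds played so far:

```python
import numpy as np

@property
def eta(self):
    # eta_t = 1 / sqrt(t); the max() guards against t = 0 on the first round.
    return 1.0 / np.sqrt(max(1, self.t))
```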

property trusts

Trust probabilities $$\mathrm{trusts}(t+1)$$ are just the normalized weights $$w_k(t)$$.

getReward(arm, reward)[source]

Give a reward: accumulate rewards on that arm k, then recompute the trusts.

Compute the trust probabilities $$w_k(t)$$ with one step of Online-Mirror-Descent for bandits, using the $$\alpha$$-Tsallis entropy for the $$\Psi_t$$ functions. A numerical sketch of this step is given after the list below.

$$\begin{split}\mathrm{trusts}'_k(t+1) &= \nabla (\Psi_t + \mathcal{I}_{\Delta^K})^* (- \hat{L}_{t-1}), \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}$$
• where $$\Delta^K$$ is the probability simplex of dimension $$K$$,

• $$\hat{L}_{t-1}$$ is the cumulative loss vector, i.e., the sum of the unbiased loss estimates $$\hat{\ell}_s$$ for the previous time steps $$s < t$$,

• $$\hat{\ell}_{t,i} = \mathbb{1}(I_t = i) \frac{\ell_{t,i}}{\mathrm{trusts}_i(t)}$$ is the unbiased estimate of the loss of arm $$i$$ at time $$t$$,

• $$\Psi_t = \Psi_{t,\alpha}(w) := - \sum_{k=1}^{K} \frac{w_k^{\alpha}}{\alpha \eta_t}$$ is the regularizer built from the $$\alpha$$-Tsallis entropy,

• and $$\eta_t = \frac{1}{\sqrt{t}}$$ is the (decreasing) learning rate.
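
Here is a numerical sketch of this step for $$\alpha = \frac{1}{2}$$, in the spirit of Algorithm 1 of Zimmert & Seldin (2018). With the $$\Psi_t$$ above, the solution has the closed form $$w_i = (\eta_t (\hat{L}_i - x))^{-2}$$ for a scalar $$x < \min_i \hat{L}_i$$ chosen by Newton's method so that the weights sum to one. The function name and fixed iteration count are illustrative choices, not the library's code:

```python
import numpy as np

def tsallis_inf_trusts(L_hat, eta, n_newton=50):
    """Sketch of the 1/2-Tsallis OMD step: solve sum_i (eta*(L_hat_i - x))^-2 = 1."""
    L_hat = np.asarray(L_hat, dtype=float)
    x = np.min(L_hat) - 1.0 / eta       # start below min(L_hat): weights finite
    for _ in range(n_newton):
        w = (eta * (L_hat - x)) ** -2   # candidate weights, all positive
        f = w.sum() - 1.0               # residual of the normalization constraint
        f_prime = 2.0 * eta * (w ** 1.5).sum()  # df/dx, always positive
        x -= f / f_prime                # Newton step; f is convex, so f stays >= 0
    return w / w.sum()                  # final renormalization for safety

# After pulling arm I_t and observing a reward r in [0, 1], the update is:
#   loss = 1.0 - r
#   L_hat[I_t] += loss / trusts[I_t]    # the unbiased loss estimate
```

Starting from $$x = \min_i \hat{L}_i - 1/\eta_t$$, the residual is non-negative and the Newton iterates converge monotonically, since the constraint function is convex and increasing in $$x$$.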

__module__ = 'Policies.TsallisInf'