Policies.TsallisInf module

The 1/2-Tsallis-Inf policy for bounded bandits, (order-)optimal for both stochastic and adversarial bandits.

  • Reference: [[“An Optimal Algorithm for Stochastic and Adversarial Bandits”, Julian Zimmert, Yevgeny Seldin, 2018, arXiv:1807.07623]](https://arxiv.org/abs/1807.07623)

Policies.TsallisInf.ALPHA = 0.5

Default value for \(\alpha\), the parameter of the Tsallis entropy. We focus on the 1/2-Tsallis algorithm, i.e., with \(\alpha=\frac{1}{2}\).

class Policies.TsallisInf.TsallisInf(nbArms, alpha=0.5, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Exp3.Exp3

The 1/2-Tsallis-Inf policy for bounded bandits, (order-)optimal for both stochastic and adversarial bandits.

  • Reference: [[“An Optimal Algorithm for Stochastic and Adversarial Bandits”, Julian Zimmert, Yevgeny Seldin, 2018, arXiv:1807.07623]](https://arxiv.org/abs/1807.07623)

__init__(nbArms, alpha=0.5, lower=0.0, amplitude=1.0)[source]

New policy.

alpha = None

Store the constant \(\alpha\) used by the Online-Mirror-Descent step with the \(\alpha\)-Tsallis entropy.

inverse_exponent = None

Store \(\frac{1}{\alpha-1}\) so it is computed only once. For the default \(\alpha=\frac{1}{2}\), this exponent equals \(-2\).

cumulative_losses = None

Keep in memory the vector \(\hat{L}_t\) of cumulative unbiased loss estimates.

__str__()[source]

-> str

property eta

Decreasing learning rate, \(\eta_t = \frac{1}{\sqrt{t}}\).
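
For illustration, a minimal sketch of this learning-rate schedule (the guard at \(t=0\) is an assumption for the sketch, not necessarily what the policy does internally):

```python
import numpy as np

def eta(t):
    """Decreasing learning rate eta_t = 1/sqrt(t), guarded so it is defined at t = 0."""
    return 1.0 / np.sqrt(max(1, t))

# eta(1) == 1.0, eta(4) == 0.5, eta(100) == 0.1
```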

property trusts

Trust probabilities \(\mathrm{trusts}(t+1)\) are simply the normalized weights \(w_k(t)\).

getReward(arm, reward)[source]

Give a reward: accumulate rewards on that arm k, then recompute the trusts.

Compute the trust probabilities \(w_k(t)\) with one step of Online-Mirror-Descent for bandits, using the \(\alpha\)-Tsallis entropy for the \(\Psi_t\) functions; a code sketch of this update follows the list below.

\[\begin{split}\mathrm{trusts}'_k(t+1) &= \nabla (\Psi_t + \mathcal{I}_{\Delta^K})^* (- \hat{L}_{t-1}), \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]
  • where \(\Delta^K\) is the probability simplex of dimension \(K\),

  • \(\hat{L}_{t-1}\) is the cumulative loss vector, i.e., the sum of the unbiased estimates \(\hat{\ell}_s\) over the previous time steps \(s < t\),

  • \(\hat{\ell}_{t,i} = 1(I_t = i) \frac{\ell_{t,i}}{\mathrm{trusts}_i(t)}\) is the unbiased estimate of the loss,

  • \(\Psi_t = \Psi_{t,\alpha}(w) := - \sum_{k=1}^{K} \frac{w_k^{\alpha}}{\alpha \eta_t}\) is the \(\alpha\)-Tsallis regularizer,

  • and \(\eta_t = \frac{1}{\sqrt{t}}\) is the (decreasing) learning rate.
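
To make the update concrete, here is a hedged sketch of one full round (loss estimation followed by the OMD step) for \(\alpha=\frac{1}{2}\), using the Newton-style search for the normalizing constant \(x\) described in Zimmert & Seldin (2018). The helper name `update_after_reward` and its arguments are illustrative, not the policy's actual API:

```python
import numpy as np

def update_after_reward(cumulative_losses, trusts, arm, reward, t,
                        lower=0.0, amplitude=1.0):
    """Sketch of one round of 1/2-Tsallis-Inf after pulling `arm` (hypothetical
    helper): build the importance-weighted loss estimate, add it to the
    cumulative losses, then run one Online-Mirror-Descent step."""
    # Map the observed reward in [lower, lower + amplitude] to a loss in [0, 1].
    loss = 1.0 - (reward - lower) / amplitude
    L = np.array(cumulative_losses, dtype=float)
    # Unbiased estimate: only the pulled arm gets a non-zero loss,
    # rescaled by its sampling probability (importance weighting).
    L[arm] += loss / trusts[arm]
    # OMD step with the 1/2-Tsallis entropy: find x by Newton's method so that
    # the weights w_i = 4 / (eta_t * (L_i - x))^2 sum to one.
    eta = 1.0 / np.sqrt(max(1, t))
    x = np.min(L) - 1.0                      # start below min(L): every (L_i - x) > 0
    for _ in range(50):
        w = 4.0 / (eta * (L - x)) ** 2
        gap = np.sum(w) - 1.0                # distance to the simplex constraint
        if abs(gap) < 1e-12:
            break
        x -= gap / (eta * np.sum(w ** 1.5))  # Newton step on the normalization equation
    w = 4.0 / (eta * (L - x)) ** 2
    return L, w / np.sum(w)                  # updated losses and new trust probabilities

# Illustrative use: 3 arms, uniform trusts, arm 1 pulled with reward 0.8 at t = 10.
new_L, new_trusts = update_after_reward(np.zeros(3), np.ones(3) / 3, 1, 0.8, 10)
```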

__module__ = 'Policies.TsallisInf'