# Policies.Softmax module¶

The Boltzmann Exploration (Softmax) index policy.

Policies.Softmax.UNBIASED = False

self.unbiased is a flag indicating whether the rewards are used as biased estimators, i.e., just $$r_t$$, or as unbiased estimators, $$r_t / \mathrm{trusts}_t$$.
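For illustration, the two estimators differ only by an importance weight. This is a minimal sketch with hypothetical values, not the library's own update code:

```python
import numpy as np

rng = np.random.default_rng(0)
trusts = np.array([0.6, 0.3, 0.1])  # hypothetical current trust probabilities
arm = int(rng.choice(3, p=trusts))  # arm drawn according to the trusts
reward = 1.0                        # hypothetical observed reward r_t

biased_estimate = reward                  # biased: feed back r_t directly
unbiased_estimate = reward / trusts[arm]  # unbiased: r_t / trusts_t
```

Dividing by the probability of having played the arm makes the estimate unbiased in expectation, at the cost of a higher variance when the trust is small.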

class Policies.Softmax.Softmax(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

The Boltzmann Exploration (Softmax) index policy, with a constant temperature $$\eta_t$$.

__init__(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

New policy.

unbiased = None

Flag indicating whether the unbiased estimator $$r_t / \mathrm{trusts}_t$$ is used (see Policies.Softmax.UNBIASED).

startGame()[source]

Nothing special to do.

__str__() → str[source]

property temperature

Constant temperature, $$\eta_t$$.

property trusts

Update the trust probabilities according to the Softmax (i.e., Boltzmann) distribution on accumulated rewards, with the temperature $$\eta_t$$.

$\begin{split}\mathrm{trusts}'_k(t+1) &= \exp\left( \frac{X_k(t)}{\eta_t N_k(t)} \right) \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}$

where $$X_k(t) = \sum_{\sigma=1}^{t} 1(A(\sigma) = k) r_k(\sigma)$$ is the sum of rewards obtained from arm $$k$$, and $$N_k(t)$$ is its number of pulls up to time $$t$$.
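The two-step computation above (exponentiate, then normalise) can be sketched in a few lines of NumPy. This is a simplified illustration assuming the accumulated rewards and pull counts are given as arrays; the function name is hypothetical and this is not the library's implementation:

```python
import numpy as np

def softmax_trusts(X, N, eta):
    """Boltzmann trusts: exp(X_k / (eta * N_k)), normalised to sum to 1."""
    # Empirical mean reward per arm; unpulled arms get mean 0
    means = np.where(N > 0, X / np.maximum(N, 1), 0.0)
    z = means / eta
    w = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return w / np.sum(w)
```

Subtracting the maximum before exponentiating leaves the normalised result unchanged but avoids overflow when the temperature is small.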

choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().
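The selection step amounts to a single weighted draw, sketched here with a hypothetical trust vector (using NumPy's Generator API rather than the legacy numpy.random.choice the docstring mentions):

```python
import numpy as np

rng = np.random.default_rng(42)
trusts = np.array([0.7, 0.2, 0.1])       # hypothetical trust probabilities
arm = int(rng.choice(len(trusts), p=trusts))  # one weighted draw
```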

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable one).

• Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).

choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation of [0..K-1] that would order the arms by increasing trust probabilities.
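One plausible way to obtain such a permutation (a sketch with a hypothetical trust vector, not necessarily the library's internals) is numpy.argsort:

```python
import numpy as np

trusts = np.array([0.5, 0.1, 0.4])  # hypothetical trust probabilities
order = np.argsort(trusts)          # arm indices from least to most trusted
```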

__module__ = 'Policies.Softmax'
class Policies.Softmax.SoftmaxWithHorizon(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

Softmax with fixed temperature $$\eta_t = \eta_0$$ chosen with a knowledge of the horizon.

__init__(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter $$T$$ = known horizon of the experiment.

__str__() → str[source]

property temperature

Fixed temperature, small, knowing the horizon: $$\eta_t = \sqrt{\frac{2 \log(K)}{T K}}$$ (heuristic).
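Numerically, this fixed temperature is straightforward to compute. A small sketch (the function name is hypothetical):

```python
import math

def horizon_temperature(K, T):
    # eta_0 = sqrt(2 * log(K) / (T * K)): the heuristic fixed temperature,
    # chosen with knowledge of the horizon T and the number of arms K
    return math.sqrt(2.0 * math.log(K) / (T * K))
```

Note that a longer horizon gives a smaller (cooler) temperature, i.e., less exploration per round.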

__module__ = 'Policies.Softmax'
class Policies.Softmax.SoftmaxDecreasing(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

Softmax with decreasing temperature $$\eta_t$$.

__str__() → str[source]

property temperature

Decreasing temperature with time: $$\eta_t = \sqrt{\frac{\log(K)}{t K}}$$ (heuristic).
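The schedule can be sketched as follows (hypothetical function name; the temperature cools as $$t$$ grows, so exploration fades over time):

```python
import math

def decreasing_temperature(t, K):
    # eta_t = sqrt(log(K) / (t * K)): the heuristic decreasing schedule
    return math.sqrt(math.log(K) / (t * K))
```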

__module__ = 'Policies.Softmax'
class Policies.Softmax.SoftMix(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

Another Softmax with decreasing temperature $$\eta_t$$.

__str__() → str[source]

property temperature

Decreasing temperature with time: $$\eta_t = c \frac{\log(t)}{t}$$ (heuristic).

__module__ = 'Policies.Softmax'
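This SoftMix schedule can be sketched similarly (hypothetical function name; $$c$$ is a constant):

```python
import math

def softmix_temperature(t, c=1.0):
    # eta_t = c * log(t) / t: the SoftMix cooling schedule
    return c * math.log(t) / t
```

Unlike the square-root schedule above, this one decays roughly as $$\log(t)/t$$, i.e., faster for large $$t$$.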