Policies.Softmax module

The Boltzmann Exploration (Softmax) index policy.

Policies.Softmax.UNBIASED = False

self.unbiased is a flag indicating whether the rewards are used as biased estimators, i.e., just \(r_t\), or as unbiased estimators, \(r_t / \mathrm{trusts}_t\).
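For instance, a minimal sketch of the two estimators (illustrative only, not code from the module), where reward is the observed reward and trust is the probability with which the played arm was chosen:

    reward = 0.7    # observed reward r_t for the arm that was played
    trust = 0.25    # probability trusts_t with which that arm was chosen

    biased_estimate = reward             # biased: just r_t
    unbiased_estimate = reward / trust   # unbiased: r_t / trusts_t (importance weighting)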

class Policies.Softmax.Softmax(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The Boltzmann Exploration (Softmax) index policy, with a constant temperature \(\eta_t\).

__init__(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

New policy.
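A minimal usage sketch, assuming SMPyBandits' Policies package is importable and using getReward() inherited from BasePolicy; the Bernoulli arm means below are chosen purely for illustration:

    import numpy as np

    from Policies.Softmax import Softmax  # assumes the Policies package is on the path

    rng = np.random.default_rng(42)
    means = [0.1, 0.5, 0.9]               # three illustrative Bernoulli arms

    policy = Softmax(nbArms=len(means), temperature=0.1)
    policy.startGame()

    for _ in range(1000):
        arm = policy.choice()                       # draw an arm with probabilities = trusts
        reward = float(rng.random() < means[arm])   # Bernoulli reward
        policy.getReward(arm, reward)               # getReward() comes from BasePolicy

    print(policy.estimatedOrder())  # permutation ordering arms by increasing trust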

unbiased = None

Flag to know if the rewards are used as biased or unbiased estimators (see UNBIASED).

startGame()[source]

Nothing special to do.

__str__() -> str[source]

property temperature

Constant temperature, \(\eta_t\).

property trusts

Update the trust probabilities according to the Softmax (i.e., Boltzmann) distribution on the accumulated rewards, with temperature \(\eta_t\):

\[\begin{split}\mathrm{trusts}'_k(t+1) &= \exp\left( \frac{X_k(t)}{\eta_t N_k(t)} \right) \\ \mathrm{trusts}(t+1) &= \mathrm{trusts}'(t+1) / \sum_{k=1}^{K} \mathrm{trusts}'_k(t+1).\end{split}\]

where \(X_k(t) = \sum_{\sigma=1}^{t} \mathbb{1}(A(\sigma) = k) r_k(\sigma)\) is the sum of rewards collected from arm \(k\) up to time \(t\).
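A standalone NumPy re-implementation of this update, for illustration only (the max-shift is a standard numerical-stability trick and does not change the normalized result):

    import numpy as np

    def softmax_trusts(rewards, pulls, eta):
        """Boltzmann trusts from accumulated rewards X_k(t), pull counts N_k(t) and temperature eta."""
        pulls = np.maximum(pulls, 1)        # guard against division by zero for unplayed arms
        scores = rewards / (eta * pulls)    # X_k(t) / (eta_t * N_k(t))
        scores -= scores.max()              # shift for numerical stability
        trusts = np.exp(scores)
        return trusts / trusts.sum()        # normalize into a probability vector

    print(softmax_trusts(np.array([3.0, 5.0, 1.0]), np.array([10, 10, 10]), eta=0.1))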

choice()[source]

One random selection, with probabilities = trusts, thanks to numpy.random.choice().

choiceWithRank(rank=1)[source]

Multiple (rank >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice(), and select the last one (the least probable one).

  • Note that if not enough entries in the trust vector are non-zero, then choice() is called instead (rank is ignored).
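A sketch of the underlying idea with numpy.random.choice() (an illustration, not the module's exact code):

    import numpy as np

    def choice_with_rank(trusts, rank=1, rng=None):
        """Draw `rank` distinct arms with probabilities `trusts` and return the last one drawn."""
        rng = rng if rng is not None else np.random.default_rng()
        arms = rng.choice(len(trusts), size=rank, replace=False, p=trusts)
        return arms[-1]

    print(choice_with_rank(np.array([0.6, 0.3, 0.1]), rank=2))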

choiceFromSubSet(availableArms='all')[source]

One random selection, from availableArms, with probabilities = trusts, thanks to numpy.random.choice().

choiceMultiple(nb=1)[source]

Multiple (nb >= 1) random selections, with probabilities = trusts, thanks to numpy.random.choice().

estimatedOrder()[source]

Return the estimated order of the arms, as a permutation of [0..K-1] that would order the arms by increasing trust probabilities.
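This is essentially an argsort of the trust vector; for instance:

    import numpy as np

    trusts = np.array([0.2, 0.5, 0.3])
    print(np.argsort(trusts))  # [0 2 1]: arm indices sorted by increasing trust probability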

class Policies.Softmax.SoftmaxWithHorizon(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Softmax.Softmax

Softmax with fixed temperature \(\eta_t = \eta_0\) chosen with a knowledge of the horizon.

__init__(nbArms, horizon, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter \(T\) = known horizon of the experiment.

__str__() -> str[source]

property temperature

Fixed small temperature, chosen with knowledge of the horizon: \(\eta_t = \eta_0 = \sqrt{\frac{2 \log(K)}{T K}}\) (heuristic).
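For example, with K = 10 arms and horizon T = 10000 (illustrative values), this fixed temperature evaluates to:

    import numpy as np

    K, T = 10, 10_000
    eta0 = np.sqrt(2 * np.log(K) / (T * K))
    print(eta0)  # about 0.0068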

class Policies.Softmax.SoftmaxDecreasing(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Softmax.Softmax

Softmax with decreasing temperature \(\eta_t\).

__str__() -> str[source]

property temperature

Temperature decreasing with time: \(\eta_t = \sqrt{\frac{\log(K)}{t K}}\) (heuristic).

class Policies.Softmax.SoftMix(nbArms, temperature=None, unbiased=False, lower=0.0, amplitude=1.0)[source]

Bases: Policies.Softmax.Softmax

Another Softmax with decreasing temperature \(\eta_t\).

__str__() -> str[source]

property temperature

Temperature decreasing with time: \(\eta_t = c \frac{\log(t)}{t}\) (heuristic).
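A short sketch comparing the two decreasing schedules, SoftmaxDecreasing and SoftMix, with an illustrative constant c = 1 and K = 10 arms:

    import numpy as np

    K, c = 10, 1.0                                       # c is an illustrative constant for SoftMix
    for t in (10, 100, 1000, 10_000):
        eta_decreasing = np.sqrt(np.log(K) / (t * K))    # SoftmaxDecreasing schedule
        eta_softmix = c * np.log(t) / t                  # SoftMix schedule
        print(t, eta_decreasing, eta_softmix)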