# Policies.DMED module

The DMED policy of [Honda & Takemura, COLT 2010], in the special case of Bernoulli rewards. It can be used on any [0,1]-valued rewards, but beware: in the non-binary case, this is not the algorithm of [Honda & Takemura, COLT 2010] (see the note below on the variant).

class Policies.DMED.DMED(nbArms, genuine=False, tolerance=0.0001, kl=CPUDispatcher(<function klBern>), lower=0.0, amplitude=1.0)[source]

The DMED policy of [Honda & Takemura, COLT 2010], in the special case of Bernoulli rewards. It can be used on any [0,1]-valued rewards, but beware: in the non-binary case, this is not the algorithm of [Honda & Takemura, COLT 2010] (see the note below on the variant).
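
A minimal usage sketch (the arm means and simulation loop are illustrative, assuming the usual SMPyBandits policy interface of startGame() / choice() / getReward()):

```python
import numpy as np
from Policies.DMED import DMED

means = [0.1, 0.5, 0.9]           # illustrative Bernoulli arm means
policy = DMED(nbArms=len(means))  # genuine=False gives the naive DMED variant
policy.startGame()

rng = np.random.default_rng(42)
for t in range(1000):
    arm = policy.choice()                      # pop the next planned action
    reward = float(rng.random() < means[arm])  # Bernoulli reward in {0, 1}
    policy.getReward(arm, reward)              # update pull counts and reward sums
```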

__init__(nbArms, genuine=False, tolerance=0.0001, kl=CPUDispatcher(<function klBern>), lower=0.0, amplitude=1.0)[source]

New policy.

kl = None

The KL divergence function to use (by default, klBern for Bernoulli rewards)

tolerance = None

Numerical tolerance

genuine = None

Flag to know which variant is implemented: DMED (genuine = False) or DMED+ (genuine = True)

nextActions = None

List of next actions to play; every next step plays nextActions.pop(0)

__str__() -> str[source]

startGame()[source]

Initialize the policy for a new game.

choice()[source]

If there is still a next action to play, pop it and play it; otherwise compute a new list and play its first action.

The list of actions is obtained as all the indices $k$ satisfying the following inequality, depending on the variant.

• For the naive version (genuine = False), DMED:

  $$\mathrm{kl}(\hat{\mu}_k(t), \hat{\mu}^*(t)) < \frac{\log(t)}{N_k(t)}.$$

• For the original version (genuine = True), DMED+:

  $$\mathrm{kl}(\hat{\mu}_k(t), \hat{\mu}^*(t)) < \frac{\log\left(\frac{t}{N_k(t)}\right)}{N_k(t)}.$$

Here $X_k(t)$ is the sum of rewards from arm $k$, $N_k(t)$ is its number of pulls up to time $t$, $\hat{\mu}_k(t)$ is its empirical mean, and $\hat{\mu}^*(t)$ is the best empirical mean:

$$\begin{aligned} X_k(t) &= \sum_{\sigma=1}^{t} \mathbb{1}(A(\sigma) = k)\, r_k(\sigma), \\ \hat{\mu}_k(t) &= \frac{X_k(t)}{N_k(t)}, \\ \hat{\mu}^*(t) &= \max_{1 \leq k \leq K} \hat{\mu}_k(t). \end{aligned}$$
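
For illustration, here is a standalone sketch of how this list could be recomputed (hypothetical helper rebuild_action_list, not the class's exact internals), using the two thresholds above and the klBern function from Policies.kullback:

```python
import numpy as np
from Policies.kullback import klBern  # Bernoulli KL divergence kl(x, y)

def rebuild_action_list(pulls, sums, t, genuine=False):
    """Indices k whose empirical mean is still close enough to the best one.

    pulls, sums: 1D numpy arrays with pulls[k] = N_k(t) (assumed > 0)
    and sums[k] = X_k(t).
    """
    means = sums / pulls              # hat mu_k(t)
    best = np.max(means)              # hat mu^*(t)
    if genuine:                       # DMED+: log(t / N_k(t)) / N_k(t)
        thresholds = np.log(t / pulls) / pulls
    else:                             # naive DMED: log(t) / N_k(t)
        thresholds = np.log(t) / pulls
    kl_values = np.array([klBern(mu, best) for mu in means])
    return np.nonzero(kl_values < thresholds)[0].tolist()
```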
choiceMultiple(nb=1)[source]

If there are still enough actions to play, pop them and play them; otherwise compute a new list and play its nb first actions.
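
For instance, in a multiple-play setting one could ask for several arms at once (continuing the hypothetical loop above):

```python
arms = policy.choiceMultiple(nb=2)  # two arms to play at this step
```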

__module__ = 'Policies.DMED'
class Policies.DMED.DMEDPlus(nbArms, tolerance=0.0001, kl=CPUDispatcher(<function klBern>), lower=0.0, amplitude=1.0)[source]

The DMED+ policy of [Honda & Takemura, COLT 2010], in the special case of Bernoulli rewards. It can be used on any [0,1]-valued rewards, but beware: in the non-binary case, this is not the algorithm of [Honda & Takemura, COLT 2010].

__init__(nbArms, tolerance=0.0001, kl=CPUDispatcher(<function klBern>), lower=0.0, amplitude=1.0)[source]

New policy.
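
A sketch of the relation between the two classes (hypothetical check; the genuine flag is what selects the DMED+ threshold):

```python
from Policies.DMED import DMED, DMEDPlus

plus = DMEDPlus(nbArms=3)            # DMED+ directly
same = DMED(nbArms=3, genuine=True)  # equivalent: genuine=True uses the DMED+ threshold
```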

__module__ = 'Policies.DMED'