# Policies.EpsilonGreedy module¶

The epsilon-greedy random policies: the naive one and some variants.

• At every time step, with probability $$\varepsilon(t)$$ the policy explores by drawing an arm uniformly at random; otherwise it exploits, based on the accumulated rewards (not the empirical means). A minimal sketch follows the warning below.

Warning

Unless $$\varepsilon(t)$$ is optimally tuned for the specific problem at hand, none of these policies can hope to be efficient.
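
A minimal sketch of this decision rule, assuming numpy and illustrative names (`epsilon_greedy_choice` and `rewards` are not part of this module's API):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy_choice(rewards, epsilon):
    """One epsilon-greedy step over a vector of accumulated rewards."""
    rewards = np.asarray(rewards)
    if rng.random() < epsilon:
        # Exploration: draw any arm uniformly at random.
        return rng.integers(len(rewards))
    # Exploitation: the arm with the highest *accumulated* reward
    # (not the highest empirical mean), ties broken uniformly at random.
    best = np.flatnonzero(rewards == np.max(rewards))
    return rng.choice(best)
```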

Policies.EpsilonGreedy.random() → x in the interval [0, 1).
class Policies.EpsilonGreedy.EpsilonGreedy(nbArms, epsilon=0.1, lower=0.0, amplitude=1.0)[source]

The epsilon-greedy random policy.

• At every time step, with probability $$\varepsilon(t)$$ the policy explores by drawing an arm uniformly at random; otherwise it exploits, based on the accumulated rewards (not the empirical means).

__init__(nbArms, epsilon=0.1, lower=0.0, amplitude=1.0)[source]

New policy.

property epsilon

Constant $$\varepsilon(t) = \varepsilon_0$$: the naive policy keeps it fixed over time.
__str__() → str[source]

choice()[source]

With probability $$\varepsilon$$, explore (uniform choice); otherwise exploit, based on just the accumulated rewards (not the empirical mean rewards).

choiceWithRank(rank=1)[source]

With probability $$\varepsilon$$, explore (uniform choice); otherwise exploit the arm of the given rank, based on just the accumulated rewards (not the empirical mean rewards).
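
A hedged sketch of this rank-based variant, again with illustrative names and ignoring the exact tie-breaking of the real implementation:

```python
import numpy as np

rng = np.random.default_rng()

def choice_with_rank(rewards, epsilon, rank=1):
    """Explore w.p. epsilon; otherwise play the arm whose accumulated
    reward is the rank-th largest (rank=1 picks the best arm)."""
    if rng.random() < epsilon:
        return rng.integers(len(rewards))
    # Arm indices sorted by decreasing accumulated reward.
    order = np.argsort(rewards)[::-1]
    return order[rank - 1]
```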

choiceFromSubSet(availableArms='all')[source]

Not defined.

choiceMultiple(nb=1)[source]

Not defined.

__module__ = 'Policies.EpsilonGreedy'
class Policies.EpsilonGreedy.EpsilonDecreasing(nbArms, epsilon=0.1, lower=0.0, amplitude=1.0)[source]

The epsilon-decreasing random policy.

• $$\varepsilon(t) = \min(1, \varepsilon_0 / \max(1, t))$$

__init__(nbArms, epsilon=0.1, lower=0.0, amplitude=1.0)[source]

New policy.

__str__() → str[source]

property epsilon

Decreasing $$\varepsilon(t) = \min(1, \varepsilon_0 / \max(1, t))$$.
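
A quick numerical check of this schedule (plain Python, illustrative only, not the class's actual code):

```python
def epsilon_decreasing(epsilon0, t):
    """epsilon(t) = min(1, epsilon0 / max(1, t))."""
    return min(1.0, epsilon0 / max(1, t))

# With the default epsilon0 = 0.1:
# epsilon(1) = 0.1, epsilon(10) = 0.01, epsilon(100) = 0.001.
```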

__module__ = 'Policies.EpsilonGreedy'
Policies.EpsilonGreedy.C = 0.1

Constant C in the MEGA formula

Policies.EpsilonGreedy.D = 0.5

Constant D in the MEGA formula

Policies.EpsilonGreedy.epsilon0(c, d, nbArms)[source]

MEGA heuristic:

$$\varepsilon_0 = \frac{c K^2}{d^2 (K - 1)}$$
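
For example, with the module's default constants c = C = 0.1 and d = D = 0.5, and K = 9 arms (an illustrative check; the `epsilon0` function below just mirrors the formula, not the module's exact code):

```python
def epsilon0(c, d, nbArms):
    """MEGA heuristic: epsilon_0 = c * K**2 / (d**2 * (K - 1))."""
    K = nbArms
    return (c * K ** 2) / (d ** 2 * (K - 1))

print(epsilon0(0.1, 0.5, 9))   # 0.1 * 81 / (0.25 * 8) = 4.05
```

Values above 1 are harmless here, since the decreasing schedule caps $$\varepsilon(t)$$ at 1.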
class Policies.EpsilonGreedy.EpsilonDecreasingMEGA(nbArms, c=0.1, d=0.5, lower=0.0, amplitude=1.0)[source]

The epsilon-decreasing random policy, using MEGA’s heuristic for a good choice of the $$\varepsilon_0$$ value.

• $$\varepsilon(t) = \min(1, \varepsilon_0 / \max(1, t))$$

• $$\varepsilon_0 = \frac{c K^2}{d^2 (K - 1)}$$

__init__(nbArms, c=0.1, d=0.5, lower=0.0, amplitude=1.0)[source]

New policy.

__str__() → str[source]

property epsilon

Decreasing $$\varepsilon(t) = \min(1, \varepsilon_0 / \max(1, t))$$.

__module__ = 'Policies.EpsilonGreedy'
class Policies.EpsilonGreedy.EpsilonFirst(nbArms, horizon, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

The epsilon-first random policy. Ref: https://en.wikipedia.org/wiki/Multi-armed_bandit#Semi-uniform_strategies

__init__(nbArms, horizon, epsilon=0.01, lower=0.0, amplitude=1.0)[source]

New policy.

horizon = None

Parameter $$T$$ = known horizon of the experiment.

__str__() → str[source]

property epsilon

$$\varepsilon(t) = 1$$ while $$t \leq \varepsilon_0 T$$ (pure exploration), then $$\varepsilon(t) = 0$$ afterwards (pure exploitation).
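
A hedged sketch of this explore-then-commit schedule (illustrative names; T = 10000 is an arbitrary example, not a default):

```python
def epsilon_first(epsilon0, horizon, t):
    """Pure exploration while t <= epsilon0 * T, pure exploitation after."""
    return 1.0 if t <= epsilon0 * horizon else 0.0

# With the default epsilon0 = 0.01 and a known horizon T = 10000,
# the policy explores uniformly for the first 100 steps only.
```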

__module__ = 'Policies.EpsilonGreedy'
Policies.EpsilonGreedy.EPSILON = 0.1

Default value for epsilon for EpsilonDecreasing

Policies.EpsilonGreedy.DECREASINGRATE = 1e-06

Default value for the decreasing-rate constant

class Policies.EpsilonGreedy.EpsilonExpDecreasing(nbArms, epsilon=0.1, decreasingRate=1e-06, lower=0.0, amplitude=1.0)[source]

The epsilon exp-decreasing random policy.

• $$\varepsilon(t) = \varepsilon_0 \exp(- t \cdot \mathrm{decreasingRate})$$.

__init__(nbArms, epsilon=0.1, decreasingRate=1e-06, lower=0.0, amplitude=1.0)[source]

New policy.

__module__ = 'Policies.EpsilonGreedy'
__str__() → str[source]

property epsilon

Decreasing $$\varepsilon(t) = \min(1, \varepsilon_0 \exp(- t \cdot \mathrm{decreasingRate}))$$.
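
A hedged sketch of this exponential schedule (plain Python, illustrative only):

```python
import math

def epsilon_exp_decreasing(epsilon0, decreasing_rate, t):
    """epsilon(t) = min(1, epsilon0 * exp(-t * decreasingRate))."""
    return min(1.0, epsilon0 * math.exp(-t * decreasing_rate))

# With the defaults epsilon0 = 0.1 and decreasingRate = 1e-6,
# the decay is very slow: epsilon(1e6) = 0.1 * exp(-1) ≈ 0.0368.
```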