Policies.ProbabilityPursuit module¶
The basic Probability Pursuit algorithm.
We use the simple version of the pursuit algorithm, as described in the seminal book by Sutton and Barto (1998), https://webdocs.cs.ualberta.ca/~sutton/book/the-book.html.
Initially, a uniform probability is set on each arm, \(p_k(0) = 1/K\).
At each time step \(t\), the probabilities are all recomputed, following this equation:
\[\begin{split}p_k(t+1) = \begin{cases} (1 - \beta) p_k(t) + \beta \times 1 & \text{if}\; \hat{\mu}_k(t) = \max_j \hat{\mu}_j(t) \\ (1 - \beta) p_k(t) + \beta \times 0 & \text{otherwise}. \end{cases}\end{split}\]
Here \(\beta \in (0, 1)\) is a learning rate; its default value is BETA = 0.5.
And then arm \(A_k(t+1)\) is randomly selected from the distribution \((p_k(t+1))_{1 \leq k \leq K}\).
References: [Kuleshov & Precup - JMLR, 2000](http://www.cs.mcgill.ca/~vkules/bandits.pdf#page=6), [Sutton & Barto, 1998]
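The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the library's code; the name `pursuit_step` and the sample means are invented for the example:

```python
import numpy as np

K = 3                     # number of arms (illustrative)
beta = 0.5                # learning rate, matching the default BETA
p = np.full(K, 1.0 / K)   # uniform initial probabilities, p_k(0) = 1/K

def pursuit_step(p, means, beta=beta):
    """One pursuit update: move probability mass toward the empirically best arm."""
    target = np.zeros_like(p)
    target[np.argmax(means)] = 1.0           # indicator of the arm with maximal empirical mean
    return (1.0 - beta) * p + beta * target  # a convex combination, so it still sums to 1

means = np.array([0.1, 0.7, 0.3])  # pretend empirical means after some plays
p = pursuit_step(p, means)
# arm 1 gains probability: (1 - 0.5) * 1/3 + 0.5 = 2/3; the others drop to 1/6
rng = np.random.default_rng()
arm = rng.choice(K, p=p)           # next arm drawn from (p_k(t+1))
```

Because the update is a convex combination of the old distribution and an indicator vector, the probabilities always remain a valid distribution.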
-
Policies.ProbabilityPursuit.BETA= 0.5¶ Default value for the \(\beta\) parameter.
-
class
Policies.ProbabilityPursuit.ProbabilityPursuit(nbArms, beta=0.5, prior='uniform', lower=0.0, amplitude=1.0)[source]¶ Bases:
Policies.BasePolicy.BasePolicy

The basic Probability Pursuit algorithm.
References: [Kuleshov & Precup - JMLR, 2000](http://www.cs.mcgill.ca/~vkules/bandits.pdf#page=6), [Sutton & Barto, 1998]
-
probabilities= None¶ Probabilities of each arm
-
property
beta¶ Constant parameter \(\beta(t) = \beta(0)\).
-
getReward(arm, reward)[source]¶ Give a reward: accumulate the reward on that arm \(k\), then update the probabilities \(p_k(t)\) of every arm.
-
choice()[source]¶ One random selection, with probabilities \((p_k(t))_{1 \leq k \leq K}\), thanks to
numpy.random.choice().
-
choiceWithRank(rank=1)[source]¶ Multiple (rank >= 1) random selections, with probabilities \((p_k(t))_{1 \leq k \leq K}\), thanks to
numpy.random.choice(), and select the last one (the least probable of those drawn).
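One plausible reading of this behaviour, sketched with numpy.random.choice(). This is an illustration under assumed semantics (distinct draws, keep the last), not the library's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.3, 0.1])   # current arm probabilities (illustrative)
rank = 2
# Draw `rank` distinct arms weighted by probability, then keep the last one
# drawn, which tends to be a less probable arm than the first.
drawn = rng.choice(len(p), size=rank, replace=False, p=p)
arm = drawn[-1]
```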
-
choiceFromSubSet(availableArms='all')[source]¶ One random selection, from availableArms, with probabilities \((p_k(t))_{1 \leq k \leq K}\), thanks to
numpy.random.choice().
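A plausible way to restrict the draw to a subset of arms, shown as a sketch. The renormalisation over the subset is an assumption for the example, not necessarily the library's code:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])            # current arm probabilities (illustrative)
available = [1, 2]                       # hypothetical subset of available arm indices
sub = p[available] / p[available].sum()  # renormalise over the subset (assumption)
arm = available[rng.choice(len(available), p=sub)]
```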
-
__module__= 'Policies.ProbabilityPursuit'¶