# Policies.MusicalChair module¶

MusicalChair: implementation of the decentralized multi-player policy from [A Musical Chair approach, Shamir et al., 2015](https://arxiv.org/abs/1512.02866).

• Each player has 3 states: the 1st is random exploration, the 2nd is musical chair, the 3rd is staying seated
• 1st step:
• Every player tries an arm uniformly at random for $$T_0$$ steps, computing the empirical mean of each arm and counting the number of observed collisions $$C_{T_0}$$
• Finally, $$N^* = M$$ = nbPlayers is estimated from the number of collisions $$C_{T_0}$$, and the $$N^*$$ best arms are identified from their empirical means
• 2nd step:
• Every player chooses an arm uniformly at random among the $$N^*$$ best arms, until she encounters no collision right after choosing it
• When an arm was chosen by only one player, she decides to sit on this chair (= arm)
• 3rd step:
• Every player stays seated on her chair for the rest of the game
• $$\implies$$ constant regret if $$N^*$$ is well estimated and the estimated $$N^*$$ best arms are correct
• $$\implies$$ linear regret otherwise
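The 2nd step above can be sketched as a small toy simulation. The function name, the tie-free sitting rule, and the fixed RNG seed below are illustrative assumptions, not the package's API:

```python
import random

def musical_chair_phase(nb_players, best_arms, seed=42):
    """Toy simulation of the musical chair step: each unseated player
    picks uniformly among the estimated best arms, and sits as soon as
    she is alone on her arm (no collision right after choosing it)."""
    rng = random.Random(seed)
    chairs = [None] * nb_players  # chair (arm) of each player; None = standing
    steps = 0
    while any(chair is None for chair in chairs):
        picks = [chair if chair is not None else rng.choice(best_arms)
                 for chair in chairs]
        for player, arm in enumerate(picks):
            if chairs[player] is None and picks.count(arm) == 1:
                chairs[player] = arm  # alone on this arm: sit down for good
        steps += 1
    return chairs, steps
```

With nb_players equal to len(best_arms), every player ends up seated on a distinct arm after a finite (random) number of steps, which is why the regret becomes constant once the first phase succeeded.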
Policies.MusicalChair.optimalT0(nbArms=10, epsilon=0.1, delta=0.05)[source]

Compute the lower bound suggesting “large-enough” values for $$T_0$$ that should guarantee constant regret with probability at least $$1 - \delta$$, if the gap $$\Delta$$ is larger than $$\epsilon$$.

Examples:

• For $$K=2$$ arms, in order to have a constant regret with probability at least $$95\%$$ ($$\delta = 0.05$$), if the gap $$\Delta$$ is known to be $$\geq 0.1$$, their theoretical analysis suggests to use $$T_0 \geq 18459$$. That is huge, for just two arms!
>>> optimalT0(2, 0.1, 0.05)     # Just 2 arms !
18459                           # ==> That's a LOT of steps for just 2 arms!

• For a harder problem with $$K=6$$ arms, a risk $$\delta = 5\%$$ and a gap $$\Delta \geq 0.01$$, they suggest at least $$T_0 \geq 7646924$$, i.e., about 7 million trials. That is simply too much for any realistic system, and starts to be too large even for simulated systems.
>>> optimalT0(6, 0.01, 0.05)    # Constant regret with >95% proba
7646924                         # ==> That's a LOT of steps!
>>> optimalT0(6, 0.001, 0.05)   # Reasonable value of epsilon
764692376                       # ==> That's a LOT of steps!!!

• For an even harder problem with $$K=17$$ arms, the values given by their Theorem 1 start to be really unrealistic:
>>> optimalT0(17, 0.01, 0.05)   # Constant regret with >95% proba
27331794                        # ==> That's a LOT of steps!
>>> optimalT0(17, 0.001, 0.05)  # Reasonable value of epsilon
2733179304                      # ==> That's a LOT of steps!!!
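A sketch of the computation behind these numbers, as a maximum of three terms from the paper's Theorem 1. The exact constants below were inferred so as to reproduce the doctest values above; treat them as an assumption, not as the package's exact source:

```python
from math import ceil, log

def optimal_T0(nb_arms, epsilon=0.1, delta=0.05):
    """Lower bound on T0 guaranteeing constant regret with probability
    >= 1 - delta, when the gap Delta is >= epsilon (sketch)."""
    K = nb_arms
    term1 = (K / 2.0) * log(2 * K ** 2 / delta)
    term2 = (16 * K / epsilon ** 2) * log(4 * K ** 2 / delta)
    term3 = (K ** 2 / 0.02) * log(2 * K ** 2 / epsilon)
    return int(ceil(max(term1, term2, term3)))
```

The $$1/\epsilon^2$$ term dominates in all the examples above, which explains why $$T_0$$ explodes by a factor $$100$$ each time $$\epsilon$$ shrinks by a factor $$10$$.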

Policies.MusicalChair.boundOnFinalRegret(T0, nbPlayers)[source]

Use the upper bound on the regret when $$T_0$$ and $$M$$ are known.

• The “constant” regret of course grows linearly with $$T_0$$, as:

$\forall T \geq T_0, \;\; R_T \leq T_0 K + 2 \mathrm{exp}(2) K.$

Warning

This bound is not a deterministic result: it only holds with a certain probability (at least $$1 - \delta$$, if $$T_0$$ is chosen as given by optimalT0()).

>>> boundOnFinalRegret(18459, 2)        # Crazy constant regret!  # doctest: +ELLIPSIS
36947.5...
>>> boundOnFinalRegret(7646924, 6)      # Crazy constant regret!!  # doctest: +ELLIPSIS
45881632.6...
>>> boundOnFinalRegret(764692376, 6)    # Crazy constant regret!!  # doctest: +ELLIPSIS
4588154344.6...
>>> boundOnFinalRegret(27331794, 17)    # Crazy constant regret!!  # doctest: +ELLIPSIS
464640749.2...
>>> boundOnFinalRegret(2733179304, 17)  # Crazy constant regret!!  # doctest: +ELLIPSIS
46464048419.2...

class Policies.MusicalChair.State

Bases: enum.Enum

Different states during the Musical Chair algorithm

InitialPhase = 2
MusicalChair = 3
NotStarted = 1
Sitted = 4
__module__ = 'Policies.MusicalChair'
class Policies.MusicalChair.MusicalChair(nbArms, Time0=0.25, Time1=None, N=None, lower=0.0, amplitude=1.0)[source]

MusicalChair: implementation of the decentralized multi-player policy from [A Musical Chair approach, Shamir et al., 2015](https://arxiv.org/abs/1512.02866).

__init__(nbArms, Time0=0.25, Time1=None, N=None, lower=0.0, amplitude=1.0)[source]
• nbArms: number of arms,
• Time0: required, the number of steps, or a fraction of the horizon Time1 (optional), for the first step (pure random exploration by each player),
• N: optional, exact value of or upper bound on the number of players,
• Time1: optional, only used to compute Time0 if Time0 is fractional (e.g., 0.2).

Example:

>>> nbArms, Time0, Time1, N = 17, 0.1, 10000, 6
>>> player1 = MusicalChair(nbArms, Time0, Time1, N)


For multi-players use:

>>> configuration["players"] = Selfish(NB_PLAYERS, MusicalChair, nbArms, Time0=0.25, Time1=HORIZON, N=NB_PLAYERS).children

state = None

Current state

Time0 = None

Parameter T0

nbPlayers = None

Number of players

chair = None

Current chair. Not seated yet.

cumulatedRewards = None

That’s the s_i(t) of the paper

nbObservations = None

That’s the o_i of the paper

A = None

A random permutation of arms; it will then be of size nbPlayers!

nbCollision = None

Number of collisions, that’s the C_Time0 of the paper

t = None

Internal time

__str__()[source]

-> str

startGame()[source]

Just reinitialize all the internal memory, and decide how to start (state 1 or 2).

choice()[source]

Choose an arm, as described by the Musical Chair algorithm.
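The choice rule has one branch per value of State. A hedged dictionary-based sketch, not the class's actual attributes or method:

```python
import random

def choose_arm(state, memory, rng=random.Random(1)):
    """One arm choice per time step, depending on the current state (sketch)."""
    if state in ("NotStarted", "InitialPhase"):
        # Phase 1: pure uniform exploration over all arms
        return rng.randrange(memory["nb_arms"])
    elif state == "MusicalChair":
        # Phase 2: uniform choice among the estimated N* best arms
        return rng.choice(memory["best_arms"])
    else:  # "Sitted"
        # Phase 3: stay on the chosen chair forever
        return memory["chair"]
```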

getReward(arm, reward)[source]

Receive a reward on arm of index ‘arm’, as described by the Musical Chair algorithm.

• If there is no collision, a reward is received after pulling the arm.
_endInitialPhase()[source]

Small computation needed at the end of the initial random exploration phase.
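The estimation of $$N^*$$ at the end of phase 1 can be sketched by inverting the expected collision frequency: with $$M$$ players sampling uniformly over $$K$$ arms, a given player avoids a collision with probability $$(1 - 1/K)^{M-1}$$ at each step, so $$C_{T_0}/T_0 \approx 1 - (1 - 1/K)^{M-1}$$. The rounding and exact estimator below are an assumption consistent with this reasoning, not necessarily the package's exact code:

```python
from math import log

def estimate_nb_players(nb_collisions, T0, nb_arms):
    """Estimate M = N* by inverting C_T0/T0 = 1 - (1 - 1/K)^(M - 1)."""
    no_collision_freq = (T0 - nb_collisions) / T0  # empirical (1 - 1/K)^(M-1)
    return int(round(1 + log(no_collision_freq) / log(1 - 1 / nb_arms)))
```

For instance, with $$K = 10$$ arms and $$M = 4$$ players, the no-collision frequency is $$0.9^3 = 0.729$$, and the estimator recovers $$M = 4$$ from $$271$$ collisions in $$T_0 = 1000$$ steps.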

handleCollision(arm, reward=None)[source]

Handle a collision, on arm of index ‘arm’.

• Warning: this method has to be implemented in the collision model; it is NOT implemented in the EvaluatorMultiPlayers.
__module__ = 'Policies.MusicalChair'