PoliciesMultiPlayers.RandTopM module¶
RandTopM: four proposals for an efficient multiplayers learning policy. RandTopM
and MCTopM
are the two main algorithms, with variants (see below).
Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KLUCB, BayesUCB etc),
But instead of aiming at the best (the 1st best) arm, player i constantly aims at one of the M best arms (denoted \(\hat{M}^j(t)\), according to its index policy of indexes \(g^j_k(t)\) (where M is the number of players),
When a collision occurs or when the currently chosen arm lies outside of the current estimate of the set Mbest, a new current arm is chosen.
Note
This is not fully decentralized: as each child player needs to know the (fixed) number of players.
Reference: [[MultiPlayer Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017]](https://hal.inria.fr/hal01629733)

PoliciesMultiPlayers.RandTopM.
WITH_CHAIR
= False¶ Whether to use or not the variant with the “chair”: after using an arm successfully (no collision), a player won’t move after future collisions (she assumes the other will move). But she will still change her chosen arm if it lies outside of the estimated Mbest.
RandTopM
(and variants) uses False andMCTopM
(and variants) uses True.

PoliciesMultiPlayers.RandTopM.
OPTIM_PICK_WORST_FIRST
= False¶ XXX First experimental idea: when the currently chosen arm lies outside of the estimated Mbest set, force to first try (at least once) the arm with lowest UCB indexes in this Mbest_j(t) set. Used by
RandTopMCautious
andRandTopMExtraCautious
, and byMCTopMCautious
andMCTopMExtraCautious
.

PoliciesMultiPlayers.RandTopM.
OPTIM_EXIT_IF_WORST_WAS_PICKED
= False¶ XXX Second experimental idea: when the currently chosen arm becomes the worst of the estimated Mbest set, leave it (even before it lies outside of Mbest_j(t)). Used by
RandTopMExtraCautious
andMCTopMExtraCautious
.

PoliciesMultiPlayers.RandTopM.
OPTIM_PICK_PREV_WORST_FIRST
= True¶ XXX Third experimental idea: when the currently chosen arm becomes the worst of the estimated Mbest set, leave it (even before it lies outside of Mbest_j(t)). Default now!. False only for
RandTopMOld
andMCTopMOld
.

class
PoliciesMultiPlayers.RandTopM.
oneRandTopM
(maxRank, withChair, pickWorstFirst, exitIfWorstWasPicked, pickPrevWorstFirst, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.ChildPointer.ChildPointer
Class that acts as a child policy, but in fact it pass all its method calls to the mother class, who passes it to its ith player.
Except for the handleCollision method: a new random arm is sampled after observing a collision,
And the player does not aim at the best arm, but at one of the best arm, based on her index policy.
(See variants for more details.)

__init__
(maxRank, withChair, pickWorstFirst, exitIfWorstWasPicked, pickPrevWorstFirst, *args, **kwargs)[source]¶ Initialize self. See help(type(self)) for accurate signature.

maxRank
= None¶ Max rank, usually nbPlayers but can be different.

chosen_arm
= None¶ Current chosen arm.

sitted
= None¶ Not yet sitted. After 1 step without collision, don’t react to collision (but still react when the chosen arm lies outside Mbest).

prevWorst
= None¶ Keep track of the last arms worst than the chosen one (at previous time step).

t
= None¶ Internal time

worst_Mbest
()[source]¶ Index of the worst of the current estimate of the Mbest arms. M is the maxRank given to the algorithm.

worst_previous__and__current_Mbest
(current_arm)[source]¶ Return the set from which to select a random arm for
MCTopM
(the optimization is now the default):\[\hat{M}^j(t) \cap \{ m : g^j_m(t1) \leq g^j_k(t1) \}.\]

handleCollision
(arm, reward=None)[source]¶ Get a new random arm from the current estimate of Mbest, and give reward to the algorithm if not None.

getReward
(arm, reward)[source]¶ Pass the call to self.mother._getReward_one(playerId, arm, reward) with the player’s ID number.

choice
()[source]¶ Reconsider the choice of arm, and then use the chosen arm.
For all variants, if the chosen arm is no longer in the current estimate of the Mbest set, a new one is selected,
The basic RandTopM selects uniformly an arm in estimate Mbest,
MCTopM starts by being “non sitted” on its new chosen arm,
MCTopMCautious is forced to first try the arm with lowest UCB indexes (or whatever index policy is used).

__module__
= 'PoliciesMultiPlayers.RandTopM'¶

class
PoliciesMultiPlayers.RandTopM.
RandTopM
(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy
RandTopM: a proposal for an efficient multiplayers learning policy.

__init__
(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ nbPlayers: number of players to create (in self._players).
playerAlgo: class to use for every players.
nbArms: number of arms, given as first argument to playerAlgo.
withChair: see
WITH_CHAIR
,pickWorstFirst: see
OPTIM_PICK_WORST_FIRST
,exitIfWorstWasPicked: see
EXIT_IF_WORST_WAS_PICKED
,pickPrevWorstFirst: see
OPTIM_PICK_PREV_WORST_FIRST
,maxRank: maximum rank allowed by the RandTopM child (default to nbPlayers, but for instance if there is 2 × RandTopM[UCB] + 2 × RandTopM[klUCB], maxRank should be 4 not 2).
*args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import * >>> import random; random.seed(0); import numpy as np; np.random.seed(0) >>> nbArms = 17 >>> nbPlayers = 6 >>> s = RandTopM(nbPlayers, nbArms, UCB) >>> [ child.choice() for child in s.children ] [12, 15, 0, 3, 3, 7]
To get a list of usable players, use
s.children
.
Warning
s._players
is for internal use ONLY!

maxRank
= None¶ Max rank, usually nbPlayers but can be different

nbPlayers
= None¶ Number of players

withChair
= None¶ Using a chair ?

pickWorstFirst
= None¶ Using first optimization ?

exitIfWorstWasPicked
= None¶ Using second optimization ?

pickPrevWorstFirst
= None¶ Using third optimization ? Default to yes now.

children
= None¶ List of children, fake algorithms

nbArms
= None¶ Number of arms

__module__
= 'PoliciesMultiPlayers.RandTopM'¶


class
PoliciesMultiPlayers.RandTopM.
RandTopMCautious
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
RandTopMCautious: another proposal for an efficient multiplayers learning policy, more “stationary” than RandTopM.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ nbPlayers: number of players to create (in self._players).
playerAlgo: class to use for every players.
nbArms: number of arms, given as first argument to playerAlgo.
maxRank: maximum rank allowed by the RandTopMCautious child (default to nbPlayers, but for instance if there is 2 × RandTopMCautious[UCB] + 2 × RandTopMCautious[klUCB], maxRank should be 4 not 2).
*args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import * >>> import random; random.seed(0); import numpy as np; np.random.seed(0) >>> nbArms = 17 >>> nbPlayers = 6 >>> s = RandTopMCautious(nbPlayers, nbArms, UCB) >>> [ child.choice() for child in s.children ] [12, 15, 0, 3, 3, 7]

__module__
= 'PoliciesMultiPlayers.RandTopM'¶


class
PoliciesMultiPlayers.RandTopM.
RandTopMExtraCautious
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
RandTopMExtraCautious: another proposal for an efficient multiplayers learning policy, more “stationary” than RandTopM.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ nbPlayers: number of players to create (in self._players).
playerAlgo: class to use for every players.
nbArms: number of arms, given as first argument to playerAlgo.
maxRank: maximum rank allowed by the RandTopMExtraCautious child (default to nbPlayers, but for instance if there is 2 × RandTopMExtraCautious[UCB] + 2 × RandTopMExtraCautious[klUCB], maxRank should be 4 not 2).
*args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import * >>> import random; random.seed(0); import numpy as np; np.random.seed(0) >>> nbArms = 17 >>> nbPlayers = 6 >>> s = RandTopMExtraCautious(nbPlayers, nbArms, UCB) >>> [ child.choice() for child in s.children ] [12, 15, 0, 3, 3, 7]

__module__
= 'PoliciesMultiPlayers.RandTopM'¶


class
PoliciesMultiPlayers.RandTopM.
RandTopMOld
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
RandTopMOld: another proposal for an efficient multiplayers learning policy, more “stationary” than RandTopM.

__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ nbPlayers: number of players to create (in self._players).
playerAlgo: class to use for every players.
nbArms: number of arms, given as first argument to playerAlgo.
maxRank: maximum rank allowed by the RandTopMOld child (default to nbPlayers, but for instance if there is 2 × RandTopMOld[UCB] + 2 × RandTopMOld[klUCB], maxRank should be 4 not 2).
*args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import * >>> import random; random.seed(0); import numpy as np; np.random.seed(0) >>> nbArms = 17 >>> nbPlayers = 6 >>> s = RandTopMOld(nbPlayers, nbArms, UCB) >>> [ child.choice() for child in s.children ] [12, 15, 0, 3, 3, 7]

__module__
= 'PoliciesMultiPlayers.RandTopM'¶


class
PoliciesMultiPlayers.RandTopM.
MCTopM
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
MCTopM: another proposal for an efficient multiplayers learning policy, more “stationary” than RandTopM.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ nbPlayers: number of players to create (in self._players).
playerAlgo: class to use for every players.
nbArms: number of arms, given as first argument to playerAlgo.
maxRank: maximum rank allowed by the MCTopM child (default to nbPlayers, but for instance if there is 2 × MCTopM[UCB] + 2 × MCTopM[klUCB], maxRank should be 4 not 2).
*args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import * >>> import random; random.seed(0); import numpy as np; np.random.seed(0) >>> nbArms = 17 >>> nbPlayers = 6 >>> s = MCTopM(nbPlayers, nbArms, UCB) >>> [ child.choice() for child in s.children ] [12, 15, 0, 3, 3, 7]

__module__
= 'PoliciesMultiPlayers.RandTopM'¶


class
PoliciesMultiPlayers.RandTopM.
MCTopMCautious
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
MCTopMCautious: another proposal for an efficient multiplayers learning policy, more “stationary” than RandTopM.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ nbPlayers: number of players to create (in self._players).
playerAlgo: class to use for every players.
nbArms: number of arms, given as first argument to playerAlgo.
maxRank: maximum rank allowed by the MCTopMCautious child (default to nbPlayers, but for instance if there is 2 × MCTopMCautious[UCB] + 2 × MCTopMCautious[klUCB], maxRank should be 4 not 2).
*args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import * >>> import random; random.seed(0); import numpy as np; np.random.seed(0) >>> nbArms = 17 >>> nbPlayers = 6 >>> s = MCTopMCautious(nbPlayers, nbArms, UCB) >>> [ child.choice() for child in s.children ] [12, 15, 0, 3, 3, 7]

__module__
= 'PoliciesMultiPlayers.RandTopM'¶


class
PoliciesMultiPlayers.RandTopM.
MCTopMExtraCautious
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
MCTopMExtraCautious: another proposal for an efficient multiplayers learning policy, more “stationary” than RandTopM.
Warning
Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ nbPlayers: number of players to create (in self._players).
playerAlgo: class to use for every players.
nbArms: number of arms, given as first argument to playerAlgo.
maxRank: maximum rank allowed by the MCTopMExtraCautious child (default to nbPlayers, but for instance if there is 2 × MCTopMExtraCautious[UCB] + 2 × MCTopMExtraCautious[klUCB], maxRank should be 4 not 2).
*args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import * >>> import random; random.seed(0); import numpy as np; np.random.seed(0) >>> nbArms = 17 >>> nbPlayers = 6 >>> s = MCTopMExtraCautious(nbPlayers, nbArms, UCB) >>> [ child.choice() for child in s.children ] [12, 15, 0, 3, 3, 7]

__module__
= 'PoliciesMultiPlayers.RandTopM'¶


class
PoliciesMultiPlayers.RandTopM.
MCTopMOld
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ Bases:
PoliciesMultiPlayers.RandTopM.RandTopM
MCTopMOld: another proposal for an efficient multiplayers learning policy, more “stationary” than RandTopM.
Warning
Still very experimental! But it seems to be one of the most efficient decentralized MP algorithm we have so far… The two other variants of MCTopM seem even better!

__module__
= 'PoliciesMultiPlayers.RandTopM'¶

__init__
(nbPlayers, nbArms, playerAlgo, maxRank=None, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶ nbPlayers: number of players to create (in self._players).
playerAlgo: class to use for every players.
nbArms: number of arms, given as first argument to playerAlgo.
maxRank: maximum rank allowed by the MCTopMOld child (default to nbPlayers, but for instance if there is 2 × MCTopMOld[UCB] + 2 × MCTopMOld[klUCB], maxRank should be 4 not 2).
*args, **kwargs: arguments, named arguments, given to playerAlgo.
Example:
>>> from Policies import * >>> import random; random.seed(0); import numpy as np; np.random.seed(0) >>> nbArms = 17 >>> nbPlayers = 6 >>> s = MCTopMOld(nbPlayers, nbArms, UCB) >>> [ child.choice() for child in s.children ] [12, 15, 0, 3, 3, 7]
