PoliciesMultiPlayers.RandTopMEst module

RandTopMEst: four proposals for an efficient multi-players learning policy. RandTopMEst and MCTopMEst are the two main algorithms, with variants (see below).

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB, etc.),

  • But instead of aiming at the best (the 1-st best) arm, player j constantly aims at one of the M best arms (denoted \(\hat{M}^j(t)\)), according to its index policy with indexes \(g^j_k(t)\) (where M is the number of players),

  • When a collision occurs, or when the currently chosen arm lies outside of the current estimate of the M-best set, a new arm is chosen (see the sketch after this list).

  • The (fixed) number of players is estimated on the fly.
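For concreteness, here is a minimal hypothetical sketch (not the library's implementation; the function name, arguments and uniform re-sampling are illustrative assumptions) of the re-sampling rule described in the list above, for one player whose index policy gives values indexes[k] = \(g^j_k(t)\):

import numpy as np

def choose_arm(indexes, current_arm, nb_players_estimate, collided, rng=np.random):
    """Keep the current arm unless a collision occurred or the arm left the
    estimated M-best set; in that case re-sample uniformly inside that set."""
    # Estimated M-best set: the nb_players_estimate arms with the largest indexes.
    Mbest = np.argsort(indexes)[-nb_players_estimate:]
    if collided or current_arm not in Mbest:
        return rng.choice(Mbest)   # pick a new arm inside the estimated M-best set
    return current_arm             # otherwise stay on the current arm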

Note

This is fully decentralized: players do not need to know the (fixed) number of players!

Warning

This is still very experimental!

Note

For a more generic approach, see the wrapper defined in EstimateM.EstimateM.

class PoliciesMultiPlayers.RandTopMEst.oneRandTopMEst(threshold, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.oneRandTopM

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • The procedure to estimate \(\hat{M}_i(t)\) is not so simple: every player starts with \(\hat{M}_i(0) = 1\), and on a collision the estimate is increased, \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), but only once enough collisions have been observed since the last increase, as measured by a threshold function (see the sketch below).
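A hypothetical sketch of this update (the signature threshold(t, nbPlayersEstimate) is an assumption made here for illustration; see the EstimateM module for the actual logic):

def update_estimate(nb_players_estimate, collision_count, t, threshold):
    """Called after a collision: maybe increase the estimated number of players."""
    collision_count += 1                                      # one more collision since the last increase
    if collision_count > threshold(t, nb_players_estimate):   # "enough" collisions were observed?
        return nb_players_estimate + 1, 0                     # increase the estimate, reset the counter
    return nb_players_estimate, collision_count               # otherwise keep the current estimate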

__init__(threshold, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

threshold = None

Threshold function

nbPlayersEstimate = None

Estimated number of players. Optimistic: start by assuming the player is alone!

collisionCount = None

Count of collisions on each arm since the last increase of nbPlayersEstimate

timeSinceLastCollision = None

Time since last collision. Don’t remember why I thought using this could be useful… But it’s not!

t = None

Internal time

__str__()[source]

Return str(self).

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Select a new rank, and maybe update nbPlayersEstimate.

getReward(arm, reward)[source]

One transmission without collision.

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
PoliciesMultiPlayers.RandTopMEst.WITH_CHAIR = False

Whether or not to use the variant with the “chair”: after using an arm successfully (no collision), a player won’t move after future collisions (she assumes the others will move). But she will still change her chosen arm if it lies outside of the estimated M-best set. RandTopMEst (and variants) uses False and MCTopMEst (and variants) uses True.

PoliciesMultiPlayers.RandTopMEst.OPTIM_PICK_WORST_FIRST = False

XXX First experimental idea: when the currently chosen arm lies outside of the estimated M-best set, force the player to first try (at least once) the arm with the lowest index in this \(\hat{M}^j(t)\) set. Used by RandTopMEstCautious and RandTopMEstExtraCautious, and by MCTopMEstCautious and MCTopMEstExtraCautious.

PoliciesMultiPlayers.RandTopMEst.OPTIM_EXIT_IF_WORST_WAS_PICKED = False

XXX Second experimental idea: when the currently chosen arm becomes the worst of the estimated M-best set, leave it (even before it lies outside of \(\hat{M}^j(t)\)). Used by RandTopMEstExtraCautious and MCTopMEstExtraCautious.

PoliciesMultiPlayers.RandTopMEst.OPTIM_PICK_PREV_WORST_FIRST = True

XXX Third experimental idea: when the currently chosen arm becomes the worst of the estimated M-best set, leave it (even before it lies outside of \(\hat{M}^j(t)\)). This is now the default; it is False only for RandTopMEstOld and MCTopMEstOld.
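These module-level flags are only defaults; the corresponding keyword arguments of RandTopMEst (withChair, pickWorstFirst, exitIfWorstWasPicked, pickPrevWorstFirst, documented below) let one select any combination. A short illustrative sketch (values mirror the example further below; that MCTopMEst corresponds to the chair variant is only suggested by the WITH_CHAIR note above):

from Policies import UCB
from PoliciesMultiPlayers.RandTopMEst import RandTopMEst

# RandTopMEst defaults: no chair, only the third optimization enabled.
s1 = RandTopMEst(6, 17, UCB, withChair=False, pickPrevWorstFirst=True)
# MCTopMEst-like behaviour: the same policy with the "chair" variant enabled.
s2 = RandTopMEst(6, 17, UCB, withChair=True, pickPrevWorstFirst=True)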

class PoliciesMultiPlayers.RandTopMEst.RandTopMEst(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

RandTopMEst: a proposal for an efficient multi-players learning policy, with no prior knowledge of the number of players.

__init__(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).

  • playerAlgo: class to use for every player.

  • nbArms: number of arms, given as first argument to playerAlgo.

  • withChair: see WITH_CHAIR,

  • pickWorstFirst: see OPTIM_PICK_WORST_FIRST,

  • exitIfWorstWasPicked: see OPTIM_EXIT_IF_WORST_WAS_PICKED,

  • pickPrevWorstFirst: see OPTIM_PICK_PREV_WORST_FIRST,

  • threshold: the threshold function to use, see EstimateM.threshold_on_t_with_horizon(), EstimateM.threshold_on_t_doubling_trick() or EstimateM.threshold_on_t() above.

  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMEst(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
  • To get a list of usable players, use s.children.

Warning

s._players is for internal use ONLY!
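Below is a hedged sketch (not part of the library) of how an external environment could drive these children using only the methods documented above (startGame(), choice(), getReward(), handleCollision()); it continues the doctest (s, nbArms) and the Bernoulli arms are purely illustrative:

import numpy as np
rng = np.random.RandomState(1)
means = rng.uniform(size=nbArms)            # hypothetical Bernoulli arm means

for child in s.children:
    child.startGame()
for t in range(100):
    choices = [child.choice() for child in s.children]
    for child, arm in zip(s.children, choices):
        if choices.count(arm) > 1:          # at least one other player chose this arm
            child.handleCollision(arm)      # collision: no reward is delivered
        else:
            child.getReward(arm, float(rng.rand() < means[arm]))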

nbPlayers = None

Number of players

withChair = None

Using a chair?

pickWorstFirst = None

Using the first optimization?

exitIfWorstWasPicked = None

Using the second optimization?

pickPrevWorstFirst = None

Using the third optimization? Defaults to yes now.

children = None

List of children, fake child policies

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
class PoliciesMultiPlayers.RandTopMEst.RandTopMEstPlus(nbPlayers, nbArms, playerAlgo, horizon, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

RandTopMEstPlus: a proposal for an efficient multi-players learning policy, with no prior knowledge of the number of players.

__init__(nbPlayers, nbArms, playerAlgo, horizon, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).

  • playerAlgo: class to use for every player.

  • nbArms: number of arms, given as first argument to playerAlgo.

  • horizon: need to know the horizon \(T\).

  • withChair: see WITH_CHAIR,

  • pickWorstFirst: see OPTIM_PICK_WORST_FIRST,

  • exitIfWorstWasPicked: see OPTIM_EXIT_IF_WORST_WAS_PICKED,

  • pickPrevWorstFirst: see OPTIM_PICK_PREV_WORST_FIRST,

  • threshold: the threshold function used, see EstimateM.threshold_on_t_with_horizon() or EstimateM.threshold_on_t() above.

  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> horizon = 1000
>>> s = RandTopMEstPlus(nbPlayers, nbArms, UCB, horizon)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
  • To get a list of usable players, use s.children.

Warning

s._players is for internal use ONLY!

nbPlayers = None

Number of players

withChair = None

Using a chair?

pickWorstFirst = None

Using the first optimization?

exitIfWorstWasPicked = None

Using the second optimization?

pickPrevWorstFirst = None

Using the third optimization? Defaults to yes now.

children = None

List of children, fake child policies

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
class PoliciesMultiPlayers.RandTopMEst.MCTopMEst(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopMEst.RandTopMEst

MCTopMEst: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopMEst.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).

  • playerAlgo: class to use for every player.

  • nbArms: number of arms, given as first argument to playerAlgo.

  • *args, **kwargs: arguments, named arguments, given to playerAlgo.
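
Example (a sketch mirroring the RandTopMEst example above; MCTopMEst enables the “chair” variant internally, see WITH_CHAIR):

s = MCTopMEst(nbPlayers, nbArms, UCB)
players = s.children   # list of usable child policies, as for RandTopMEst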

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
__str__()[source]

Return str(self).

class PoliciesMultiPlayers.RandTopMEst.MCTopMEstPlus(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopMEst.RandTopMEstPlus

MCTopMEstPlus: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopMEst.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
__init__(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).

  • playerAlgo: class to use for every player.

  • nbArms: number of arms, given as first argument to playerAlgo.

  • *args, **kwargs: arguments, named arguments, given to playerAlgo.
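
Example (a sketch mirroring the RandTopMEstPlus example above; the horizon \(T\) must be given):

s = MCTopMEstPlus(nbPlayers, nbArms, UCB, horizon)
players = s.children   # list of usable child policies, as for RandTopMEstPlus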

__str__()[source]

Return str(self).