PoliciesMultiPlayers.RandTopMEst module

RandTopMEst: four proposals for an efficient multi-players learning policy. RandTopMEst and MCTopMEst are the two main algorithms, with variants (see below).

  • Each child player is selfish, and plays according to an index policy (any index policy, e.g., UCB, Thompson, KL-UCB, BayesUCB, etc.),

  • But instead of aiming at the best (the 1-st best) arm, player j constantly aims at one of the M best arms (denoted \(\hat{M}^j(t)\)), according to its index policy with indexes \(g^j_k(t)\) (where M is the number of players),

  • When a collision occurs, or when the currently chosen arm lies outside of the current estimate of the M-best set, a new arm is chosen (see the sketch after this list).

  • The (fixed) number of players is estimated on the fly.
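For concreteness, here is a minimal hypothetical sketch (not the library's implementation; the function name, arguments and uniform re-sampling are illustrative assumptions) of the re-sampling rule described in the list above, for one player whose index policy gives values indexes[k] = \(g^j_k(t)\):

import numpy as np

def choose_arm(indexes, current_arm, nb_players_estimate, collided, rng=np.random):
    """Keep the current arm unless a collision occurred or the arm left the
    estimated M-best set; in that case re-sample uniformly inside that set."""
    # Estimated M-best set: the nb_players_estimate arms with the largest indexes.
    Mbest = np.argsort(indexes)[-nb_players_estimate:]
    if collided or current_arm not in Mbest:
        return rng.choice(Mbest)   # pick a new arm inside the estimated M-best set
    return current_arm             # otherwise stay on the current arm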

Note

This is fully decentralized: players do not need to know the (fixed) number of players!

Warning

This is still very experimental!

Note

For a more generic approach, see the wrapper defined in EstimateM.EstimateM.

class PoliciesMultiPlayers.RandTopMEst.oneRandTopMEst(threshold, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopM.oneRandTopM

Class that acts as a child policy, but in fact it passes all its method calls to the mother class, which forwards them to its i-th player.

  • The procedure to estimate \(\hat{M}_i(t)\) is not so simple: every player starts with \(\hat{M}_i(0) = 1\), and on a collision the estimate is increased, \(\hat{M}_i(t+1) = \hat{M}_i(t) + 1\), but only once enough collisions have been observed since the last increase, as measured by a threshold function (see the sketch below).
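A hypothetical sketch of this update (the signature threshold(t, nbPlayersEstimate) is an assumption made here for illustration; see the EstimateM module for the actual logic):

def update_estimate(nb_players_estimate, collision_count, t, threshold):
    """Called after a collision: maybe increase the estimated number of players."""
    collision_count += 1                                      # one more collision since the last increase
    if collision_count > threshold(t, nb_players_estimate):   # "enough" collisions were observed?
        return nb_players_estimate + 1, 0                     # increase the estimate, reset the counter
    return nb_players_estimate, collision_count               # otherwise keep the current estimate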

__init__(threshold, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

threshold = None

Threshold function

nbPlayersEstimate = None

Estimated number of players. Optimistic: start by assuming the player is alone!

collisionCount = None

Count of collisions on each arm since the last increase of nbPlayersEstimate

timeSinceLastCollision = None

Time since last collision. Don’t remember why I thought using this could be useful… But it’s not!

t = None

Internal time

__str__()[source]

Return str(self).

startGame()[source]

Start game.

handleCollision(arm, reward=None)[source]

Select a new rank, and maybe update nbPlayersEstimate.

getReward(arm, reward)[source]

One transmission without collision.

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
PoliciesMultiPlayers.RandTopMEst.WITH_CHAIR = False

Whether or not to use the variant with the “chair”: after using an arm successfully (no collision), a player won’t move after future collisions (she assumes the others will move). But she will still change her chosen arm if it lies outside of the estimated M-best set. RandTopMEst (and variants) uses False and MCTopMEst (and variants) uses True.

PoliciesMultiPlayers.RandTopMEst.OPTIM_PICK_WORST_FIRST = False

XXX First experimental idea: when the currently chosen arm lies outside of the estimated M-best set, force the player to first try (at least once) the arm with the lowest index in this \(\hat{M}^j(t)\) set. Used by RandTopMEstCautious and RandTopMEstExtraCautious, and by MCTopMEstCautious and MCTopMEstExtraCautious.

PoliciesMultiPlayers.RandTopMEst.OPTIM_EXIT_IF_WORST_WAS_PICKED = False

XXX Second experimental idea: when the currently chosen arm becomes the worst of the estimated M-best set, leave it (even before it lies outside of \(\hat{M}^j(t)\)). Used by RandTopMEstExtraCautious and MCTopMEstExtraCautious.

PoliciesMultiPlayers.RandTopMEst.OPTIM_PICK_PREV_WORST_FIRST = True

XXX Third experimental idea: when the currently chosen arm becomes the worst of the estimated M-best set, leave it (even before it lies outside of \(\hat{M}^j(t)\)). This is now the default; it is False only for RandTopMEstOld and MCTopMEstOld.
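These module-level flags are only defaults; the corresponding keyword arguments of RandTopMEst (withChair, pickWorstFirst, exitIfWorstWasPicked, pickPrevWorstFirst, documented below) let one select any combination. A short illustrative sketch (values mirror the example further below; that MCTopMEst corresponds to the chair variant is only suggested by the WITH_CHAIR note above):

from Policies import UCB
from PoliciesMultiPlayers.RandTopMEst import RandTopMEst

# RandTopMEst defaults: no chair, only the third optimization enabled.
s1 = RandTopMEst(6, 17, UCB, withChair=False, pickPrevWorstFirst=True)
# MCTopMEst-like behaviour: the same policy with the "chair" variant enabled.
s2 = RandTopMEst(6, 17, UCB, withChair=True, pickPrevWorstFirst=True)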

class PoliciesMultiPlayers.RandTopMEst.RandTopMEst(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

RandTopMEst: a proposal for an efficient multi-players learning policy, with no prior knowledge of the number of players.

__init__(nbPlayers, nbArms, playerAlgo, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, threshold=<function threshold_on_t_doubling_trick>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).

  • playerAlgo: class to use for every player.

  • nbArms: number of arms, given as first argument to playerAlgo.

  • withChair: see WITH_CHAIR,

  • pickWorstFirst: see OPTIM_PICK_WORST_FIRST,

  • exitIfWorstWasPicked: see OPTIM_EXIT_IF_WORST_WAS_PICKED,

  • pickPrevWorstFirst: see OPTIM_PICK_PREV_WORST_FIRST,

  • threshold: the threshold function to use, see EstimateM.threshold_on_t_with_horizon(), EstimateM.threshold_on_t_doubling_trick() or EstimateM.threshold_on_t() above.

  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> s = RandTopMEst(nbPlayers, nbArms, UCB)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
  • To get a list of usable players, use s.children.

Warning

s._players is for internal use ONLY!
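Below is a hedged sketch (not part of the library) of how an external environment could drive these children using only the methods documented above (startGame(), choice(), getReward(), handleCollision()); it continues the doctest (s, nbArms) and the Bernoulli arms are purely illustrative:

import numpy as np
rng = np.random.RandomState(1)
means = rng.uniform(size=nbArms)            # hypothetical Bernoulli arm means

for child in s.children:
    child.startGame()
for t in range(100):
    choices = [child.choice() for child in s.children]
    for child, arm in zip(s.children, choices):
        if choices.count(arm) > 1:          # at least one other player chose this arm
            child.handleCollision(arm)      # collision: no reward is delivered
        else:
            child.getReward(arm, float(rng.rand() < means[arm]))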

nbPlayers = None

Number of players

withChair = None

Using a chair?

pickWorstFirst = None

Using the first optimization?

exitIfWorstWasPicked = None

Using the second optimization?

pickPrevWorstFirst = None

Using the third optimization? Defaults to yes now.

children = None

List of children, fake child policies

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
class PoliciesMultiPlayers.RandTopMEst.RandTopMEstPlus(nbPlayers, nbArms, playerAlgo, horizon, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.BaseMPPolicy.BaseMPPolicy

RandTopMEstPlus: a proposal for an efficient multi-players learning policy, with no prior knowledge of the number of players.

__init__(nbPlayers, nbArms, playerAlgo, horizon, withChair=False, pickWorstFirst=False, exitIfWorstWasPicked=False, pickPrevWorstFirst=True, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).

  • playerAlgo: class to use for every player.

  • nbArms: number of arms, given as first argument to playerAlgo.

  • horizon: need to know the horizon \(T\).

  • withChair: see WITH_CHAIR,

  • pickWorstFirst: see OPTIM_PICK_WORST_FIRST,

  • exitIfWorstWasPicked: see OPTIM_EXIT_IF_WORST_WAS_PICKED,

  • pickPrevWorstFirst: see OPTIM_PICK_PREV_WORST_FIRST,

  • threshold: the threshold function used, see EstimateM.threshold_on_t_with_horizon() or EstimateM.threshold_on_t() above.

  • *args, **kwargs: arguments, named arguments, given to playerAlgo.

Example:

>>> from Policies import *
>>> import random; random.seed(0); import numpy as np; np.random.seed(0)
>>> nbArms = 17
>>> nbPlayers = 6
>>> horizon = 1000
>>> s = RandTopMEstPlus(nbPlayers, nbArms, UCB, horizon)
>>> [ child.choice() for child in s.children ]
[12, 15, 0, 3, 3, 7]
  • To get a list of usable players, use s.children.

Warning

s._players is for internal use ONLY!

nbPlayers = None

Number of players

withChair = None

Using a chair?

pickWorstFirst = None

Using the first optimization?

exitIfWorstWasPicked = None

Using the second optimization?

pickPrevWorstFirst = None

Using the third optimization? Defaults to yes now.

children = None

List of children, fake child policies

nbArms = None

Number of arms

__str__()[source]

Return str(self).

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
class PoliciesMultiPlayers.RandTopMEst.MCTopMEst(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopMEst.RandTopMEst

MCTopMEst: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopMEst.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__init__(nbPlayers, nbArms, playerAlgo, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).

  • playerAlgo: class to use for every player.

  • nbArms: number of arms, given as first argument to playerAlgo.

  • *args, **kwargs: arguments, named arguments, given to playerAlgo.
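
Example (a sketch mirroring the RandTopMEst example above; MCTopMEst enables the “chair” variant internally, see WITH_CHAIR):

s = MCTopMEst(nbPlayers, nbArms, UCB)
players = s.children   # list of usable child policies, as for RandTopMEst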

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
__str__()[source]

Return str(self).

class PoliciesMultiPlayers.RandTopMEst.MCTopMEstPlus(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

Bases: PoliciesMultiPlayers.RandTopMEst.RandTopMEstPlus

MCTopMEstPlus: another proposal for an efficient multi-players learning policy, more “stationary” than RandTopMEst.

Warning

Still very experimental! But it seems to be the most efficient decentralized MP algorithm we have so far…

__module__ = 'PoliciesMultiPlayers.RandTopMEst'
__init__(nbPlayers, nbArms, playerAlgo, horizon, lower=0.0, amplitude=1.0, *args, **kwargs)[source]
  • nbPlayers: number of players to create (in self._players).

  • playerAlgo: class to use for every player.

  • nbArms: number of arms, given as first argument to playerAlgo.

  • *args, **kwargs: arguments, named arguments, given to playerAlgo.
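
Example (a sketch mirroring the RandTopMEstPlus example above; the horizon \(T\) must be given):

s = MCTopMEstPlus(nbPlayers, nbArms, UCB, horizon)
players = s.children   # list of usable child policies, as for RandTopMEstPlus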

__str__()[source]

Return str(self).