# Policies.LearnExp module

The LearnExp aggregation bandit algorithm, similar to Exp4 but not equivalent.

The algorithm is a master algorithm $$A$$, managing several “slave” algorithms, $$A_1, ..., A_N$$.

• At every step, one slave algorithm is selected at random from a trust distribution over $$[1,...,N]$$.

• Its decision is then listened to and played by the master algorithm, and a feedback reward is received.

• The reward is reweighted by the trust of the listened-to slave, and given back to it with a certain probability.

• The other slaves, whose decisions were not even asked for, receive no feedback.

• The trust probabilities are initially uniform, $$P_i = 1/N$$. Then at every step, after receiving the feedback (the reward) for the chosen arm $$k$$, the trust $$P_i$$ in the selected slave $$A_i$$ is updated using the received reward.

• The details of how the probabilities are increased or decreased are specified in the reference article.
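The loop above can be sketched as follows. This is a minimal illustration, not the library implementation: the class and helper names are hypothetical, and an Exp3-style exponential-weights update with a uniform-exploration mixture is assumed; the exact formulas are those of the reference article.

```python
import math
import random

class ConstantSlave:
    """Hypothetical dummy slave that always plays the same arm."""
    def __init__(self, arm):
        self.arm = arm
    def choice(self):
        return self.arm
    def getReward(self, arm, reward):
        pass  # a real slave would update its internal statistics here

class LearnExpSketch:
    """Sketch of the LearnExp master: pick a slave by trust, play its
    decision, and feed the (reweighted) reward back only to that slave."""
    def __init__(self, slaves, eta=0.5, unbiased=True):
        self.slaves = slaves                     # the N "slave" algorithms
        self.N = len(slaves)
        self.eta = eta                           # constant eta in (0, 1]
        self.rate = eta / self.N                 # constant eta / N
        self.unbiased = unbiased
        self.weights = [1.0] * self.N            # weights w_j^t
        self.trusts = [1.0 / self.N] * self.N    # trusts p_j^t, uniform at start
        self.last_choice = None

    def choice(self):
        # Select one slave at random from the trust distribution,
        # then play its decision.
        self.last_choice = random.choices(range(self.N), weights=self.trusts)[0]
        return self.slaves[self.last_choice].choice()

    def getReward(self, arm, reward):
        j = self.last_choice
        self.slaves[j].getReward(arm, reward)    # only this slave gets feedback
        # Unbiased estimator: reweight the reward by the trust p_j^t.
        r = reward / self.trusts[j] if self.unbiased else reward
        # Exponential-weights update of the listened-to slave (assumption).
        self.weights[j] *= math.exp(self.rate * r)
        total = sum(self.weights)
        # Recompute trusts, mixing with uniform exploration eta / N.
        self.trusts = [(1 - self.eta) * w / total + self.eta / self.N
                       for w in self.weights]
```

With two slaves, one always playing a rewarding arm and the other a non-rewarding one, the master's trust concentrates on the first slave while the `eta / N` mixture keeps every trust bounded away from zero.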

Note

Reference: [[Learning to Use Learners’ Advice, A.Singla, H.Hassani & A.Krause, 2017](https://arxiv.org/abs/1702.04825)].

Policies.LearnExp.random() → x in the interval [0, 1).
Policies.LearnExp.renormalize_reward(reward, lower=0.0, amplitude=1.0, trust=1.0, unbiased=True, mintrust=None)[source]

Renormalize the reward to [0, 1]:

• divide by (trust/mintrust) if unbiased is True,

• simply project to [0, 1] if unbiased is False.

Warning

If mintrust is unknown, the unbiased estimator CANNOT be projected back to a bounded interval.

Policies.LearnExp.unnormalize_reward(reward, lower=0.0, amplitude=1.0)[source]

Project back reward to [lower, lower + amplitude].
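Based only on the conventions stated above, the two helpers can be sketched like this. This is an illustrative reimplementation, not the library source; the behaviour when mintrust is unknown (returning the unbounded estimate reward/trust) is an assumption consistent with the warning above.

```python
def renormalize_reward(reward, lower=0.0, amplitude=1.0,
                       trust=1.0, unbiased=True, mintrust=None):
    """Sketch: renormalize a raw reward to [0, 1] (or to an unbiased estimate)."""
    reward = (reward - lower) / amplitude       # project to [0, 1]
    if unbiased:
        if mintrust is not None:
            # trust >= mintrust, so dividing by trust/mintrust stays in [0, 1]
            return reward / (trust / mintrust)
        # without mintrust, the unbiased estimate cannot be kept bounded
        return reward / trust
    return reward

def unnormalize_reward(reward, lower=0.0, amplitude=1.0):
    """Sketch: project back from [0, 1] to [lower, lower + amplitude]."""
    return lower + reward * amplitude
```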

Policies.LearnExp.UNBIASED = True

self.unbiased is a flag to know if the rewards are used as biased estimators, i.e., just $$r_t$$, or as unbiased estimators, $$r_t / p_t$$, where $$p_t$$ is the probability of selecting that arm at time $$t$$. It seemed to work better with unbiased estimators (of course).
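A quick Monte-Carlo check of why $$r_t / p_t$$ is unbiased: if a reward $$r$$ is only observed with probability $$p$$, then feeding back $$r / p$$ when it is observed (and 0 otherwise) has expectation exactly $$r$$. The numeric values below are arbitrary, chosen only for illustration.

```python
import random

random.seed(12345)
p, r = 0.25, 0.8          # selection probability and true reward (arbitrary)
n = 100_000
total = 0.0
for _ in range(n):
    if random.random() < p:   # the arm was actually selected this round
        total += r / p        # unbiased estimator r_t / p_t
    # otherwise the contribution is 0 (the reward was not observed)
estimate = total / n          # should concentrate around r = 0.8
```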

Policies.LearnExp.ETA = 0.5

Default value for the constant $$\eta \in (0, 1]$$.

class Policies.LearnExp.LearnExp(nbArms, children=None, unbiased=True, eta=0.5, prior='uniform', lower=0.0, amplitude=1.0)[source]

The LearnExp aggregation bandit algorithm, similar to Exp4 but not equivalent.

__init__(nbArms, children=None, unbiased=True, eta=0.5, prior='uniform', lower=0.0, amplitude=1.0)[source]

New policy.

nbArms = None

Number of arms.

lower = None

Lower values for rewards.

amplitude = None

Amplitude of the rewards (so rewards lie in [lower, lower + amplitude]).

unbiased = None

Flag, see above.

eta = None

Constant parameter $$\eta$$.

rate = None

Constant $$\eta / N$$, stored once for faster computations.

children = None

List of slave algorithms.

last_choice = None

Remember the index of the last child trusted for a decision.

trusts = None

Initial trusts in the slaves, $$p_j^t$$. Default is uniform, but a prior can also be given.

weights = None

Weights $$w_j^t$$.

__str__()[source]

Nicely print the name of the algorithm with its relevant parameters.

startGame()[source]

Start the game for each child.

getReward(arm, reward)[source]

Give reward for each child, and then update the trust probabilities.

choice()[source]

Trust one of the slaves and listen to its choice.

choiceWithRank(rank=1)[source]

Trust one of the slaves and listen to its choiceWithRank.

choiceFromSubSet(availableArms='all')[source]

Trust one of the slaves and listen to its choiceFromSubSet.

choiceMultiple(nb=1)[source]

Trust one of the slaves and listen to its choiceMultiple.

choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

Trust one of the slaves and listen to its choiceIMP.

estimatedOrder()[source]

Trust one of the slaves and listen to its estimatedOrder.

• Return the estimated order of the arms, as a permutation on $$[0,...,K-1]$$ that would order the arms by increasing means.

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M best arms, i.e., identify the set of the M best arms.
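An illustration of the ordering convention (not the library code): with a permutation that sorts arms by increasing means, the M best arms are the last M entries. The mean values below are made up for the example.

```python
# Hypothetical estimated means for K = 4 arms
means = [0.1, 0.9, 0.5, 0.3]

# Permutation on [0, ..., K-1] ordering arms by increasing means,
# as returned by estimatedOrder()
order = sorted(range(len(means)), key=lambda i: means[i])

# The M best arms are the last M indexes of that permutation,
# as returned (possibly unsorted) by estimatedBestArms(M)
M = 2
best_arms = order[-M:]
```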