Policies.CORRAL module

The CORRAL aggregation bandit algorithm, similar to Exp4 but not exactly equivalent.

The algorithm consists of a master algorithm \(A\), managing several “slave” algorithms, \(A_1, ..., A_N\).

  • At every step, one slave algorithm is selected at random from a trust distribution on \([1,...,N]\).

  • Its decision is then listened to and played by the master algorithm, and a feedback reward is received.

  • The reward is reweighted by the trust placed in the selected algorithm, and given back to it.

  • The other slaves, whose decisions were not asked for, receive either a zero reward or no reward at all.

  • The trust probabilities are initially uniform, \(P_i = 1/N\); then at every step, after receiving the feedback (the reward) for the chosen arm \(k\), the trust \(P_i\) in each slave \(A_i\) is updated according to the reward received.

  • The details of how the probabilities are increased or decreased are specified in the reference article; a rough sketch of the master loop is given below.
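A minimal, hypothetical sketch of this master loop (illustrative only, not the library's exact code), assuming each slave exposes the usual choice() and getReward(arm, reward) methods:

```python
import numpy as np

def corral_like_master_loop(slaves, pull_arm, horizon):
    """Illustrative master loop: sample a slave from the trust distribution,
    play its decision, and feed the importance-weighted reward back to it."""
    N = len(slaves)
    trusts = np.full(N, 1.0 / N)           # initial uniform trusts P_i = 1/N
    for _ in range(horizon):
        i = np.random.choice(N, p=trusts)  # select one slave at random
        arm = slaves[i].choice()           # listen to its decision
        reward = pull_arm(arm)             # play it and observe the feedback
        slaves[i].getReward(arm, reward / trusts[i])  # reweight by the trust
        # (the actual trust update is the log-barrier OMD step described below)
    return trusts
```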

Note

Reference: [[“Corralling a Band of Bandit Algorithms”, by A. Agarwal, H. Luo, B. Neyshabur, R.E. Schapire, 01.2017](https://arxiv.org/abs/1612.06246v2)].

Policies.CORRAL.renormalize_reward(reward, lower=0.0, amplitude=1.0, trust=1.0, unbiased=True, mintrust=None)[source]

Renormalize the reward to [0, 1]:

  • divide by (trust/mintrust) if unbiased is True,

  • simply project to [0, 1] if unbiased is False.

Warning

If mintrust is unknown, the unbiased estimator CANNOT be projected back to a bounded interval.

Policies.CORRAL.unnormalize_reward(reward, lower=0.0, amplitude=1.0)[source]

Project back reward to [lower, lower + amplitude].
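The two helpers can be sketched as follows (a minimal sketch consistent with the descriptions above; the library's handling of corner cases, in particular a missing mintrust, may differ):

```python
def renormalize_reward_sketch(reward, lower=0.0, amplitude=1.0,
                              trust=1.0, unbiased=True, mintrust=None):
    """Map a reward from [lower, lower + amplitude] to [0, 1], optionally
    dividing by trust/mintrust for an importance-weighted (unbiased) estimate."""
    if unbiased:
        if mintrust is not None:
            return (reward - lower) / (amplitude * (trust / mintrust))
        # without mintrust the unbiased estimate cannot stay in a bounded interval
        return (reward - lower) / (amplitude * trust)
    return (reward - lower) / amplitude

def unnormalize_reward_sketch(reward, lower=0.0, amplitude=1.0):
    """Project a [0, 1] reward back to [lower, lower + amplitude]."""
    return lower + reward * amplitude
```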

Policies.CORRAL.log_Barrier_OMB(trusts, losses, rates)[source]

A step of the log-barrier Online Mirror Descent, updating the trusts:

  • Find \(\lambda \in [\min_i l_{t,i}, \max_i l_{t,i}]\) such that \(\sum_i \frac{1}{1/p_{t,i} + \eta_{t,i}(l_{t,i} - \lambda)} = 1\).

  • Return \(\mathbf{p}_{t+1,i}\) such that \(\frac{1}{p_{t+1,i}} = \frac{1}{p_{t,i}} + \eta_{t,i}(l_{t,i} - \lambda)\).

  • Note: uses scipy.optimize.minimize_scalar() for the optimization (see the sketch after this list).

  • Reference: [“Learning in games: Robustness of fast convergence”, by D. Foster, Z. Li, T. Lykouris, K. Sridharan and E. Tardos, NIPS 2016].
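A minimal, hypothetical sketch of such a step, assuming trusts, losses and rates are NumPy arrays of the same length; it searches for \(\lambda\) with a bounded scipy.optimize.minimize_scalar() on the squared residual, which may differ from the library's exact implementation:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_barrier_omd_step(trusts, losses, rates):
    """One log-barrier OMD step: find lambda such that the updated inverse
    trusts, once inverted, sum to one, then return the new distribution."""
    def residual(lam):
        inv = 1.0 / trusts + rates * (losses - lam)
        return (np.sum(1.0 / inv) - 1.0) ** 2
    result = minimize_scalar(residual, method='bounded',
                             bounds=(np.min(losses), np.max(losses)))
    lam = result.x
    new_trusts = 1.0 / (1.0 / trusts + rates * (losses - lam))
    return new_trusts / np.sum(new_trusts)   # renormalize against numerical drift
```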

Policies.CORRAL.UNBIASED = True

self.unbiased is a flag to know if the rewards are used as biased estimators, i.e., just \(r_t\), or unbiased estimators, \(r_t / p_t\), where \(p_t\) is the probability of selecting that arm at time \(t\). It seemed to work better with unbiased estimators (of course).
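As a trivial illustration (the numbers are made up), the two estimators differ only by the importance weight:

```python
r_t, p_t = 0.8, 0.25            # observed reward, probability of the selected arm
biased_estimate = r_t           # used when unbiased is False
unbiased_estimate = r_t / p_t   # importance-weighted estimate, used when unbiased is True
```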

Policies.CORRAL.BROADCAST_ALL = False

Whether to give the reward back to only one slave algorithm (default, False) or to all the slaves that voted for the same arm.

class Policies.CORRAL.CORRAL(nbArms, children=None, horizon=None, rate=None, unbiased=True, broadcast_all=False, prior='uniform', lower=0.0, amplitude=1.0)[source]

Bases: Policies.BasePolicy.BasePolicy

The CORRAL aggregation bandit algorithm, similar to Exp4 but not exactly equivalent.

__init__(nbArms, children=None, horizon=None, rate=None, unbiased=True, broadcast_all=False, prior='uniform', lower=0.0, amplitude=1.0)[source]

New policy.
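A possible way to construct and use such a policy (assuming, as a hypothesis, that children accepts already-instantiated SMPyBandits policies; check the constructor for the exact conventions it supports):

```python
from Policies import CORRAL, UCB, Thompson

nbArms, horizon = 10, 10000
children = [UCB(nbArms), Thompson(nbArms)]   # the slave algorithms
policy = CORRAL(nbArms, children=children, horizon=horizon, unbiased=True)

policy.startGame()
arm = policy.choice()        # trust one slave and play its choice
policy.getReward(arm, 1.0)   # feed the observed reward back
```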

nbArms = None

Number of arms.

lower = None

Lower value of the rewards.

amplitude = None

Amplitude of the rewards (they lie in [lower, lower + amplitude]).

unbiased = None

Flag, see above.

broadcast_all = None

Flag, see above.

gamma = None

Constant \(\gamma = 1 / T\).

beta = None

Constant \(\beta = \exp(1 / \log(T))\).
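For instance, with a horizon of \(T = 10000\) (purely illustrative values):

```python
import numpy as np

T = 10000
gamma = 1.0 / T                  # gamma = 1/T = 1e-4
beta = np.exp(1.0 / np.log(T))   # beta = exp(1/log(T)) ≈ 1.115
```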

rates = None

Value of the learning rate (it will increase over time).

children = None

List of slave algorithms.

trusts = None

Initial trusts in the slaves. Defaults to uniform, but a prior can also be given.

bar_trusts = None

Initial bar trusts in the slaves. Defaults to uniform, but a prior can also be given.

choices = None

Keep track of the last choice of each slave, to know whom to update if update_all_children is false.

last_choice = None

Remember the index of the last child trusted for a decision.

losses = None

For the log-barrier OMD step, a vector of losses has to be given; it is faster to keep it as an attribute than to reallocate it at every step.

rhos = None

I use the inverses of the \(\rho_{t,i}\) from the algorithm in the reference article: simpler to understand, and fewer numerical errors.

__str__()[source]

Nicely print the name of the algorithm with its relevant parameters.

__setattr__(name, value)[source]

Trick method, to update the \(\gamma\) and \(\beta\) parameters of the CORRAL algorithm if the horizon T changes.

Warning

Not tested yet!
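A minimal sketch of what such a hook could look like (a hypothetical toy class, not the library's code), recomputing \(\gamma\) and \(\beta\) whenever horizon is set:

```python
import numpy as np

class HorizonAware:
    """Toy illustration: refresh gamma and beta whenever the horizon is set."""
    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        if name == 'horizon' and value is not None and value > 1:
            object.__setattr__(self, 'gamma', 1.0 / value)
            object.__setattr__(self, 'beta', np.exp(1.0 / np.log(value)))

p = HorizonAware()
p.horizon = 10000    # gamma and beta are updated automatically
print(p.gamma, p.beta)
```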

startGame()[source]

Start the game for each child.

getReward(arm, reward)[source]

Give the reward to each child, and then update the trust probabilities.

choice()[source]

Trust one of the slaves and listen to its choice.
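Conceptually, the delegation can be sketched as follows (attribute names mirror the documentation above; this is a hypothetical sketch, not the library code):

```python
import numpy as np

def choice_sketch(children, bar_trusts, rng=np.random):
    """Sample the index of the slave to trust from the (mixed) trust
    distribution, remember it, and play that slave's choice."""
    last_choice = rng.choice(len(children), p=bar_trusts)
    arm = children[last_choice].choice()
    return last_choice, arm
```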

choiceWithRank(rank=1)[source]

Trust one of the slaves and listen to its choiceWithRank.

choiceFromSubSet(availableArms='all')[source]

Trust one of the slaves and listen to its choiceFromSubSet.

choiceMultiple(nb=1)[source]

Trust one of the slaves and listen to its choiceMultiple.

choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

Trust one of the slaves and listen to its choiceIMP.

estimatedOrder()[source]

Trust one of the slaves and listen to its estimatedOrder.

  • Return the estimated order of the arms, as a permutation on \([0,...,K-1]\) that would order the arms by increasing means.
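For instance, with three arms of (made-up) estimated means 0.1, 0.7 and 0.4:

```python
import numpy as np

means = np.array([0.1, 0.7, 0.4])
order = np.argsort(means)
print(order)   # [0 2 1]: arm 0 has the smallest mean, arm 1 the largest
```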

estimatedBestArms(M=1)[source]

Return a (not necessarily sorted) list of the indexes of the M best arms. Identify the set of the M best arms.