Policies.LearnExp module¶
The LearnExp aggregation bandit algorithm, similar to Exp4 but not equivalent.
The algorithm is a master A, managing several “slave” algorithms, \(A_1, ..., A_N\).
- At every step, one slave algorithm is selected, by a random selection from a trust distribution on \([1,...,N]\). 
- Then its decision is listen to, played by the master algorithm, and a feedback reward is received. 
- The reward is reweighted by the trust of the listened algorithm, and given back to it with a certain probability. 
- The other slaves, whose decision was not even asked, receive nothing. 
- The trust probabilities are first uniform, \(P_i = 1/N\), and then at every step, after receiving the feedback for one arm k (the reward), the trust in each slave Ai is updated: \(P_i\) by the reward received. 
- The detail about how to increase or decrease the probabilities are specified in the reference article. 
Note
Reference: [[Learning to Use Learners’ Advice, A.Singla, H.Hassani & A.Krause, 2017](https://arxiv.org/abs/1702.04825)].
- 
Policies.LearnExp.random() → x in the interval [0, 1).¶
- 
Policies.LearnExp.renormalize_reward(reward, lower=0.0, amplitude=1.0, trust=1.0, unbiased=True, mintrust=None)[source]¶
- Renormalize the reward to [0, 1]: - divide by (trust/mintrust) if unbiased is True. 
- simply project to [0, 1] if unbiased is False, 
 - Warning - If mintrust is unknown, the unbiased estimator CANNOT be projected back to a bounded interval. 
- 
Policies.LearnExp.unnormalize_reward(reward, lower=0.0, amplitude=1.0)[source]¶
- Project back reward to [lower, lower + amplitude]. 
- 
Policies.LearnExp.UNBIASED= True¶
- self.unbiased is a flag to know if the rewards are used as biased estimator, i.e., just \(r_t\), or unbiased estimators, \(r_t / p_t\), if \(p_t\) is the probability of selecting that arm at time \(t\). It seemed to work better with unbiased estimators (of course). 
- 
Policies.LearnExp.ETA= 0.5¶
- Default value for the constant Eta in (0, 1] 
- 
class Policies.LearnExp.LearnExp(nbArms, children=None, unbiased=True, eta=0.5, prior='uniform', lower=0.0, amplitude=1.0)[source]¶
- Bases: - Policies.BasePolicy.BasePolicy- The LearnExp aggregation bandit algorithm, similar to Exp4 but not equivalent. - 
__init__(nbArms, children=None, unbiased=True, eta=0.5, prior='uniform', lower=0.0, amplitude=1.0)[source]¶
- New policy. 
 - 
nbArms= None¶
- Number of arms. 
 - 
lower= None¶
- Lower values for rewards. 
 - 
amplitude= None¶
- Larger values for rewards. 
 - 
unbiased= None¶
- Flag, see above. 
 - 
eta= None¶
- Constant parameter \(\eta\). 
 - 
rate= None¶
- Constant \(\eta / N\), faster computations if it is stored once. 
 - 
children= None¶
- List of slave algorithms. 
 - 
last_choice= None¶
- Remember the index of the last child trusted for a decision. 
 - 
trusts= None¶
- Initial trusts in the slaves \(p_j^t\). Default to uniform, but a prior can also be given. 
 - 
weights= None¶
- Weights \(w_j^t\). 
 - 
getReward(arm, reward)[source]¶
- Give reward for each child, and then update the trust probabilities. 
 - 
choiceFromSubSet(availableArms='all')[source]¶
- Trust one of the slave and listen to his choiceFromSubSet. 
 - 
choiceIMP(nb=1, startWithChoiceMultiple=True)[source]¶
- Trust one of the slave and listen to his choiceIMP. 
 - 
__module__= 'Policies.LearnExp'¶
 
-