Policies.WrapRange module¶
A policy that acts as a wrapper on another policy P which requires knowing the range \([a, b]\) of the rewards, by implementing a “doubling trick” to adapt to an unknown range of rewards.
It’s an interesting variant of the “doubling trick”, used to tackle another unknown aspect of sequential experiments: some algorithms need to use rewards in \([0,1]\), and are easy to use if the rewards are known to be in some interval \([a, b]\) (I did this here from the very beginning, with [lower, lower + amplitude]).
But if the interval \([a,b]\) is unknown, what can we do?
The “Doubling Trick”, in this setting, refers to this algorithm:
- Start with \([a_0, b_0] = [0, 1]\), 
- If a reward \(r_t\) is seen below \(a_i\), use \(a_{i+1} = r_t\), 
- If a reward \(r_t\) is seen above \(b_i\), use \(b_{i+1} = r_t\). 
Instead of just doubling the length of the interval (the usual “doubling trick”), we use \([r_t, b_i]\) or \([a_i, r_t]\), as it is the smallest interval compatible with the past and the new observation \(r_t\) (a short code sketch of this rule is given after the “See also” note below).
- Reference: I’m not sure which work was the first to propose this idea, but [Normalized online learning, Stéphane Ross, Paul Mineiro & John Langford, 2013](https://arxiv.org/pdf/1305.6646.pdf) proposes a similar idea. 
See also
See for instance Obandit.WrapRange by @freuk.
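To make the update rule above concrete, here is a minimal standalone sketch of the interval adaptation; the helper name `update_range` and the sample rewards are purely illustrative, not part of the Policies.WrapRange module:

```python
def update_range(a, b, r):
    """Smallest interval compatible with the old range [a, b] and a new reward r.

    Illustrative helper only: it mirrors the rule described above,
    not the internals of Policies.WrapRange.
    """
    if r < a:            # reward below the current lower bound
        return r, b      # new interval [r_t, b_i]
    elif r > b:          # reward above the current upper bound
        return a, r      # new interval [a_i, r_t]
    return a, b          # reward already in [a_i, b_i]: keep the range

# Start with [a_0, b_0] = [0, 1] and adapt as rewards arrive
a, b = 0.0, 1.0
for r in [0.3, 1.7, -0.2, 0.9]:
    a, b = update_range(a, b, r)
print((a, b))  # (-0.2, 1.7)
```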
class Policies.WrapRange.WrapRange(nbArms, policy=<class 'Policies.UCB.UCB'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶
Bases: Policies.BasePolicy.BasePolicy

A policy that acts as a wrapper on another policy P which requires knowing the range \([a, b]\) of the rewards, by implementing a “doubling trick” to adapt to an unknown range of rewards.
__init__(nbArms, policy=<class 'Policies.UCB.UCB'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶
New policy.
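A short usage sketch, assuming the package layout implied by the signature above (Policies.WrapRange, Policies.UCB); the `startGame` and `choice` calls are assumed to be inherited from Policies.BasePolicy.BasePolicy, which is not detailed on this page:

```python
from Policies.WrapRange import WrapRange
from Policies.UCB import UCB

nbArms = 3
policy = WrapRange(nbArms, policy=UCB)  # wrap UCB, initial range [0, 1]
policy.startGame()                      # assumed inherited from BasePolicy

# Rewards may fall outside [0, 1]; WrapRange rescales the wrapped
# policy's history whenever the observed range has to grow.
for reward in [0.5, 2.5, -1.0]:
    arm = policy.choice()               # assumed inherited from BasePolicy
    policy.getReward(arm, reward)
```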
policy = None¶
Underlying policy.
getReward(arm, reward)[source]¶
Maybe change the current range and rescale all the past history, then pass the reward, and update the time \(t\).

Call \(r_s\) the reward at time \(s\), \(l_{t-1}\) and \(a_{t-1}\) the lower-bound and amplitude of rewards at the previous time \(t-1\), and \(l_t\) and \(a_t\) the new lower-bound and amplitude for the current time \(t\). The previous history is \(R_t := \sum_{s=1}^{t-1} r_s\).

The generic formula for rescaling the previous history is the following:

\[R_t := \frac{(a_{t-1} \times R_t + l_{t-1}) - l_t}{a_t}.\]

So we have the following efficient algorithm:

- If \(r < l_{t-1}\), let \(l_t = r\) and \(R_t := R_t + \frac{l_{t-1} - l_t}{a_t}\), 
- Else if \(r > l_{t-1} + a_{t-1}\), let \(a_t = r - l_{t-1}\) and \(R_t := R_t \times \frac{a_{t-1}}{a_t}\), 
- Otherwise, there is nothing to do: the current reward is still correctly in \([l_{t-1}, l_{t-1} + a_{t-1}]\), so simply keep \(l_t = l_{t-1}\) and \(a_t = a_{t-1}\). 
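As a sanity check of the generic rescaling formula, here is a small standalone sketch (variable names are illustrative; the actual method works on the wrapped policy's internal statistics). It takes the second case above: a reward \(r = 2.5\) is seen above \(l_{t-1} + a_{t-1} = 1\), so the amplitude grows to \(a_t = r - l_{t-1} = 2.5\):

```python
# Old range [l_prev, l_prev + a_prev] = [0, 1]; a reward 2.5 was just seen,
# so (second case above) the new range is [l_new, l_new + a_new] = [0, 2.5].
l_prev, a_prev = 0.0, 1.0
l_new, a_new = 0.0, 2.5

r_s = 0.8                                # a past reward
x_old = (r_s - l_prev) / a_prev          # how it was stored under the old range
x_generic = ((a_prev * x_old + l_prev) - l_new) / a_new   # generic formula
x_fast = x_old * (a_prev / a_new)        # efficient update for this case
x_direct = (r_s - l_new) / a_new         # direct normalization, for reference

assert abs(x_generic - x_direct) < 1e-12 and abs(x_fast - x_direct) < 1e-12
print(x_generic)  # 0.32 (up to floating-point rounding)
```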
 
property index¶
Get attribute `index` from the underlying policy.
choiceFromSubSet(availableArms='all')[source]¶
Pass the call to `choiceFromSubSet` of the underlying policy.
choiceIMP(nb=1, startWithChoiceMultiple=True)[source]¶
Pass the call to `choiceIMP` of the underlying policy.
__module__ = 'Policies.WrapRange'¶
 