Policies.WrapRange module¶
A policy that acts as a wrapper on another policy P which requires knowing the range \([a, b]\) of the rewards, by implementing a “doubling trick” to adapt to an unknown range of rewards.
It’s an interesting variant of the “doubling trick”, used to tackle another unknown aspect of sequential experiments: some algorithms need to use rewards in \([0,1]\), and are easy to use if the rewards are known to be in some interval \([a, b]\) (I did this here from the very beginning, with [lower, lower + amplitude]).
But if the interval \([a,b]\) is unknown, what can we do?
The “Doubling Trick”, in this setting, refers to this algorithm:
- Start with \([a_0, b_0] = [0, 1]\), 
- If a reward \(r_t\) is seen below \(a_i\), use \(a_{i+1} = r_t\), 
- If a reward \(r_t\) is seen above \(b_i\), use \(b_{i+1} = r_t\). 
Instead of just doubling the length of the interval (the usual “doubling trick”), we use \([r_t, b_i]\) or \([a_i, r_t]\), as it is the smallest interval compatible with the past and the new observation \(r_t\) (a short code sketch of this rule is given after the “See also” note below).
- Reference: I’m not sure which work was the first to propose this idea, but [Normalized online learning, Stéphane Ross, Paul Mineiro & John Langford, 2013](https://arxiv.org/pdf/1305.6646.pdf) proposes a similar idea. 
See also
See for instance Obandit.WrapRange by @freuk.
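To make the update rule above concrete, here is a minimal standalone sketch of the interval adaptation; the helper name `update_range` and the sample rewards are purely illustrative, not part of the Policies.WrapRange module:

```python
def update_range(a, b, r):
    """Smallest interval compatible with the old range [a, b] and a new reward r.

    Illustrative helper only: it mirrors the rule described above,
    not the internals of Policies.WrapRange.
    """
    if r < a:            # reward below the current lower bound
        return r, b      # new interval [r_t, b_i]
    elif r > b:          # reward above the current upper bound
        return a, r      # new interval [a_i, r_t]
    return a, b          # reward already in [a_i, b_i]: keep the range

# Start with [a_0, b_0] = [0, 1] and adapt as rewards arrive
a, b = 0.0, 1.0
for r in [0.3, 1.7, -0.2, 0.9]:
    a, b = update_range(a, b, r)
print((a, b))  # (-0.2, 1.7)
```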
class Policies.WrapRange.WrapRange(nbArms, policy=<class 'Policies.UCB.UCB'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶
Bases: Policies.BasePolicy.BasePolicy

A policy that acts as a wrapper on another policy P which requires knowing the range \([a, b]\) of the rewards, by implementing a “doubling trick” to adapt to an unknown range of rewards.
__init__(nbArms, policy=<class 'Policies.UCB.UCB'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]¶
New policy.
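A short usage sketch, assuming the package layout implied by the signature above (Policies.WrapRange, Policies.UCB); the `startGame` and `choice` calls are assumed to be inherited from Policies.BasePolicy.BasePolicy, which is not detailed on this page:

```python
from Policies.WrapRange import WrapRange
from Policies.UCB import UCB

nbArms = 3
policy = WrapRange(nbArms, policy=UCB)  # wrap UCB, initial range [0, 1]
policy.startGame()                      # assumed inherited from BasePolicy

# Rewards may fall outside [0, 1]; WrapRange rescales the wrapped
# policy's history whenever the observed range has to grow.
for reward in [0.5, 2.5, -1.0]:
    arm = policy.choice()               # assumed inherited from BasePolicy
    policy.getReward(arm, reward)
```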
policy = None¶
Underlying policy.
getReward(arm, reward)[source]¶
Maybe change the current range and rescale all the past history, then pass the reward, and update the time \(t\).

Call \(r_s\) the reward at time \(s\), \(l_{t-1}\) and \(a_{t-1}\) the lower-bound and amplitude of rewards at the previous time \(t-1\), and \(l_t\) and \(a_t\) the new lower-bound and amplitude for the current time \(t\). The previous history is \(R_t := \sum_{s=1}^{t-1} r_s\).

The generic formula for rescaling the previous history is the following:

\[R_t := \frac{(a_{t-1} \times R_t + l_{t-1}) - l_t}{a_t}.\]

So we have the following efficient algorithm:

- If \(r < l_{t-1}\), let \(l_t = r\) and \(R_t := R_t + \frac{l_{t-1} - l_t}{a_t}\), 
- Else if \(r > l_{t-1} + a_{t-1}\), let \(a_t = r - l_{t-1}\) and \(R_t := R_t \times \frac{a_{t-1}}{a_t}\), 
- Otherwise, there is nothing to do: the current reward is still correctly in \([l_{t-1}, l_{t-1} + a_{t-1}]\), so simply keep \(l_t = l_{t-1}\) and \(a_t = a_{t-1}\). 
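As a sanity check of the generic rescaling formula, here is a small standalone sketch (variable names are illustrative; the actual method works on the wrapped policy's internal statistics). It takes the second case above: a reward \(r = 2.5\) is seen above \(l_{t-1} + a_{t-1} = 1\), so the amplitude grows to \(a_t = r - l_{t-1} = 2.5\):

```python
# Old range [l_prev, l_prev + a_prev] = [0, 1]; a reward 2.5 was just seen,
# so (second case above) the new range is [l_new, l_new + a_new] = [0, 2.5].
l_prev, a_prev = 0.0, 1.0
l_new, a_new = 0.0, 2.5

r_s = 0.8                                # a past reward
x_old = (r_s - l_prev) / a_prev          # how it was stored under the old range
x_generic = ((a_prev * x_old + l_prev) - l_new) / a_new   # generic formula
x_fast = x_old * (a_prev / a_new)        # efficient update for this case
x_direct = (r_s - l_new) / a_new         # direct normalization, for reference

assert abs(x_generic - x_direct) < 1e-12 and abs(x_fast - x_direct) < 1e-12
print(x_generic)  # 0.32 (up to floating-point rounding)
```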
 
property index¶
Get attribute `index` from the underlying policy.
choiceFromSubSet(availableArms='all')[source]¶
Pass the call to `choiceFromSubSet` of the underlying policy.
choiceIMP(nb=1, startWithChoiceMultiple=True)[source]¶
Pass the call to `choiceIMP` of the underlying policy.
__module__ = 'Policies.WrapRange'¶
 