# Policies.WrapRange module

A policy that acts as a wrapper around another policy P that requires knowing the range $$[a, b]$$ of the rewards, implementing a “doubling trick” to adapt to an unknown range of rewards.

It’s an interesting variant of the “doubling trick”, used to tackle another unknown aspect of sequential experiments: some algorithms need rewards in $$[0,1]$$, and are easy to use if the rewards are known to lie in some interval $$[a, b]$$ (I did this from the very beginning here, with [lower, lower+amplitude]). But if the interval $$[a, b]$$ is unknown, what can we do? The “doubling trick”, in this setting, refers to the following algorithm:

1. Start with $$[a_0, b_0] = [0, 1]$$,

2. If a reward $$r_t$$ is seen below $$a_i$$, use $$a_{i+1} = r_t$$,

3. If a reward $$r_t$$ is seen above $$b_i$$, use $$b_{i+1} = r_t$$.

Instead of just doubling the length of the interval (as a literal “doubling trick” would), we use $$[r_t, b_i]$$ or $$[a_i, r_t]$$, as it is the smallest interval compatible with the past observations and the new observation $$r_t$$ (see the sketch below).
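For illustration, here is a minimal Python sketch of this shrink-to-fit interval update (the standalone function and its name are mine, not the module’s code; the class below actually stores the range as lower and amplitude, and also rescales the past history, see getReward()):

```python
def update_interval(a, b, reward):
    """Smallest-interval update of the assumed reward range [a, b].

    Illustrative sketch only: extend the interval just enough to contain
    the new observation, keeping the other endpoint fixed.
    """
    if reward < a:    # observation below the current interval
        a = reward
    elif reward > b:  # observation above the current interval
        b = reward
    return a, b
```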

• Reference. I’m not sure which work was the first to propose this idea, but [[Normalized online learning, Stéphane Ross & Paul Mineiro & John Langford, 2013](https://arxiv.org/pdf/1305.6646.pdf)] proposes a similar one.

See for instance Obandit.WrapRange by @freuk.

class Policies.WrapRange.WrapRange(nbArms, policy=<class 'Policies.UCB.UCB'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

A policy that acts as a wrapper around another policy P that requires knowing the range $$[a, b]$$ of the rewards, implementing a “doubling trick” to adapt to an unknown range of rewards.

__init__(nbArms, policy=<class 'Policies.UCB.UCB'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

New policy.

policy = None

Underlying policy
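A minimal usage sketch (the Gaussian rewards and the interaction loop are purely illustrative; only WrapRange, UCB and the methods documented below come from the module):

```python
import random

from Policies.UCB import UCB
from Policies.WrapRange import WrapRange

nbArms = 3
policy = WrapRange(nbArms, policy=UCB)  # initially assumes rewards in [0, 1]
policy.startGame()

for t in range(1000):
    arm = policy.choice()
    # Hypothetical unbounded rewards, just to exercise the range adaptation
    reward = random.gauss(2.0 * arm, 1.0)
    policy.getReward(arm, reward)
```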

__str__()[source]

-> str

startGame()[source]

Initialize the policy for a new game.

getReward(arm, reward)[source]

Maybe change the current range and rescale all the past history accordingly, then pass the (rescaled) reward to the underlying policy, and update the time $$t$$.

Let us call $$r_s$$ the reward at time $$s$$, $$l_{t-1}$$ and $$a_{t-1}$$ the lower bound and amplitude of the rewards at the previous time $$t-1$$, and $$l_t$$ and $$a_t$$ the new lower bound and amplitude at the current time $$t$$. The previous history is $$R_t := \sum_{s=1}^{t-1} r_s$$.

The generic formula for rescaling the previous history is the following:

$$R_t := \frac{(a_{t-1} \times R_t + l_{t-1}) - l_t}{a_t}.$$

So we have the following efficient algorithm (a code sketch is given after the list):

1. If $$r < l_{t-1}$$, let $$l_t = r$$ and keep $$a_t = a_{t-1}$$ (the amplitude is unchanged), so $$R_t := R_t + \frac{l_{t-1} - l_t}{a_t}$$,

2. Else if $$r > l_{t-1} + a_{t-1}$$, let $$a_t = r - l_{t-1}$$ and keep $$l_t = l_{t-1}$$ (the lower bound is unchanged), so $$R_t := R_t \times \frac{a_{t-1}}{a_t}$$,

3. Otherwise, nothing to do: the current reward is still correctly in $$[l_{t-1}, l_{t-1} + a_{t-1}]$$, so simply keep $$l_t = l_{t-1}$$ and $$a_t = a_{t-1}$$.
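A minimal sketch of these three cases, assuming the past history is stored as a number or a NumPy array rewards, already expressed on the current $$[l, l + a]$$ scale (the function and variable names are illustrative, not the module’s actual code):

```python
def rescale_on_new_reward(rewards, lower, amplitude, reward):
    """Update (lower, amplitude) and rescale the stored history if needed.

    `rewards` holds the past (normalized) rewards on the current scale
    (a scalar or a NumPy array, thanks to broadcasting).
    Illustrative sketch of the three cases described above.
    """
    if reward < lower:
        # Case 1: lower bound decreases, amplitude unchanged -> shift the history.
        new_lower = reward
        rewards = rewards + (lower - new_lower) / amplitude
        lower = new_lower
    elif reward > lower + amplitude:
        # Case 2: amplitude increases, lower bound unchanged -> rescale the history.
        new_amplitude = reward - lower
        rewards = rewards * (amplitude / new_amplitude)
        amplitude = new_amplitude
    # Case 3: reward already in [lower, lower + amplitude] -> nothing to rescale.
    return rewards, lower, amplitude
```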

property index

Get attribute index from the underlying policy.

choice()[source]

Pass the call to choice of the underlying policy.

choiceWithRank(rank=1)[source]

Pass the call to choiceWithRank of the underlying policy.

choiceFromSubSet(availableArms='all')[source]

Pass the call to choiceFromSubSet of the underlying policy.

choiceMultiple(nb=1)[source]

Pass the call to choiceMultiple of the underlying policy.

choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

Pass the call to choiceIMP of the underlying policy.

estimatedOrder()[source]

Pass the call to estimatedOrder of the underlying policy.

estimatedBestArms(M=1)[source]

Pass the call to estimatedBestArms of the underlying policy.

computeIndex(arm)[source]

Pass the call to computeIndex of the underlying policy.

computeAllIndex()[source]

Pass the call to computeAllIndex of the underlying policy.

__module__ = 'Policies.WrapRange'