# Policies.WrapRange module

A policy that acts as a wrapper around another policy P that requires knowing the range $$[a, b]$$ of the rewards, implementing a “doubling trick” to adapt to an unknown range of rewards.

It’s an interesting variant of the “doubling trick”, used to tackle another unknown aspect of sequential experiments: some algorithms need rewards in $$[0,1]$$, and are easy to use if the rewards are known to lie in some interval $$[a, b]$$ (I did this from the very beginning here, with [lower, lower+amplitude]). But if the interval $$[a, b]$$ is unknown, what can we do? The “doubling trick”, in this setting, refers to the following algorithm:

1. Start with $$[a_0, b_0] = [0, 1]$$,

2. If a reward $$r_t$$ is seen below $$a_i$$, use $$a_{i+1} = r_t$$,

3. If a reward $$r_t$$ is seen above $$b_i$$, use $$b_{i+1} = r_t$$.

Instead of just doubling the length of the interval (as a literal “doubling trick” would), we use $$[r_t, b_i]$$ or $$[a_i, r_t]$$, as it is the smallest interval compatible with the past observations and the new observation $$r_t$$ (see the sketch below).
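For illustration, here is a minimal Python sketch of this shrink-to-fit interval update (the standalone function and its name are mine, not the module’s code; the class below actually stores the range as lower and amplitude, and also rescales the past history, see getReward()):

```python
def update_interval(a, b, reward):
    """Smallest-interval update of the assumed reward range [a, b].

    Illustrative sketch only: extend the interval just enough to contain
    the new observation, keeping the other endpoint fixed.
    """
    if reward < a:    # observation below the current interval
        a = reward
    elif reward > b:  # observation above the current interval
        b = reward
    return a, b
```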

• Reference. I’m not sure which work was the first to propose this idea, but [[Normalized online learning, Stéphane Ross & Paul Mineiro & John Langford, 2013](https://arxiv.org/pdf/1305.6646.pdf)] proposes a similar one.

See for instance Obandit.WrapRange by @freuk.

class Policies.WrapRange.WrapRange(nbArms, policy=<class 'Policies.UCB.UCB'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

A policy that acts as a wrapper around another policy P that requires knowing the range $$[a, b]$$ of the rewards, implementing a “doubling trick” to adapt to an unknown range of rewards.

__init__(nbArms, policy=<class 'Policies.UCB.UCB'>, lower=0.0, amplitude=1.0, *args, **kwargs)[source]

New policy.

policy = None

Underlying policy
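A minimal usage sketch (the Gaussian rewards and the interaction loop are purely illustrative; only WrapRange, UCB and the methods documented below come from the module):

```python
import random

from Policies.UCB import UCB
from Policies.WrapRange import WrapRange

nbArms = 3
policy = WrapRange(nbArms, policy=UCB)  # initially assumes rewards in [0, 1]
policy.startGame()

for t in range(1000):
    arm = policy.choice()
    # Hypothetical unbounded rewards, just to exercise the range adaptation
    reward = random.gauss(2.0 * arm, 1.0)
    policy.getReward(arm, reward)
```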

__str__()[source]

-> str

startGame()[source]

Initialize the policy for a new game.

getReward(arm, reward)[source]

Maybe change the current range and rescale all the past history accordingly, then pass the (rescaled) reward to the underlying policy, and update the time $$t$$.

Let us call $$r_s$$ the reward at time $$s$$, $$l_{t-1}$$ and $$a_{t-1}$$ the lower bound and amplitude of the rewards at the previous time $$t-1$$, and $$l_t$$ and $$a_t$$ the new lower bound and amplitude at the current time $$t$$. The previous history is $$R_t := \sum_{s=1}^{t-1} r_s$$.

The generic formula for rescaling the previous history is the following:

$$R_t := \frac{(a_{t-1} \times R_t + l_{t-1}) - l_t}{a_t}.$$

So we have the following efficient algorithm (a code sketch is given after the list):

1. If $$r < l_{t-1}$$, let $$l_t = r$$ and keep $$a_t = a_{t-1}$$ (the amplitude is unchanged), so $$R_t := R_t + \frac{l_{t-1} - l_t}{a_t}$$,

2. Else if $$r > l_{t-1} + a_{t-1}$$, let $$a_t = r - l_{t-1}$$ and keep $$l_t = l_{t-1}$$ (the lower bound is unchanged), so $$R_t := R_t \times \frac{a_{t-1}}{a_t}$$,

3. Otherwise, nothing to do: the current reward is still correctly in $$[l_{t-1}, l_{t-1} + a_{t-1}]$$, so simply keep $$l_t = l_{t-1}$$ and $$a_t = a_{t-1}$$.
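A minimal sketch of these three cases, assuming the past history is stored as a number or a NumPy array rewards, already expressed on the current $$[l, l + a]$$ scale (the function and variable names are illustrative, not the module’s actual code):

```python
def rescale_on_new_reward(rewards, lower, amplitude, reward):
    """Update (lower, amplitude) and rescale the stored history if needed.

    `rewards` holds the past (normalized) rewards on the current scale
    (a scalar or a NumPy array, thanks to broadcasting).
    Illustrative sketch of the three cases described above.
    """
    if reward < lower:
        # Case 1: lower bound decreases, amplitude unchanged -> shift the history.
        new_lower = reward
        rewards = rewards + (lower - new_lower) / amplitude
        lower = new_lower
    elif reward > lower + amplitude:
        # Case 2: amplitude increases, lower bound unchanged -> rescale the history.
        new_amplitude = reward - lower
        rewards = rewards * (amplitude / new_amplitude)
        amplitude = new_amplitude
    # Case 3: reward already in [lower, lower + amplitude] -> nothing to rescale.
    return rewards, lower, amplitude
```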

property index

Get attribute index from the underlying policy.

choice()[source]

Pass the call to choice of the underlying policy.

choiceWithRank(rank=1)[source]

Pass the call to choiceWithRank of the underlying policy.

choiceFromSubSet(availableArms='all')[source]

Pass the call to choiceFromSubSet of the underlying policy.

choiceMultiple(nb=1)[source]

Pass the call to choiceMultiple of the underlying policy.

choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

Pass the call to choiceIMP of the underlying policy.

estimatedOrder()[source]

Pass the call to estimatedOrder of the underlying policy.

estimatedBestArms(M=1)[source]

Pass the call to estimatedBestArms of the underlying policy.

computeIndex(arm)[source]

Pass the call to computeIndex of the underlying policy.

computeAllIndex()[source]

Pass the call to computeAllIndex of the underlying policy.

__module__ = 'Policies.WrapRange'