Policies.UCBimproved module

The UCB-Improved policy for bounded bandits, which requires knowledge of the horizon, as an example of a successive elimination algorithm.

Policies.UCBimproved.ALPHA = 0.5

Default value for parameter \(\alpha\).

Policies.UCBimproved.n_m(horizon, delta_m)[source]

Function \(\lceil \frac{2 \log(T \Delta_m^2)}{\Delta_m^2} \rceil\).
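A direct translation of this formula into Python, as a sketch only (the library's actual implementation may guard differently against a non-positive argument of the logarithm):

    from math import ceil, log

    def n_m(horizon, delta_m):
        """Sketch of n_m(T, Delta_m) = ceil(2 * log(T * Delta_m**2) / Delta_m**2)."""
        return int(ceil(2 * log(horizon * delta_m ** 2) / delta_m ** 2))

    print(n_m(10000, 0.5))  # 63 pulls per active arm for T = 10000, Delta_m = 0.5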

class Policies.UCBimproved.UCBimproved(nbArms, horizon=None, alpha=0.5, lower=0.0, amplitude=1.0)[source]

Bases: Policies.SuccessiveElimination.SuccessiveElimination

The UCB-Improved policy for bounded bandits, which requires knowledge of the horizon, as an example of a successive elimination algorithm.

__init__(nbArms, horizon=None, alpha=0.5, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,

  • lower, amplitude: lower value and known amplitude of the rewards.
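A minimal usage sketch, assuming the package is installed as SMPyBandits and that the standard policy interface (startGame(), choice(), getReward()) applies; the arm means and the Bernoulli reward simulation are illustrative only:

    import numpy as np
    from SMPyBandits.Policies import UCBimproved

    horizon = 10000
    means = [0.1, 0.5, 0.9]  # illustrative Bernoulli arm means
    policy = UCBimproved(nbArms=len(means), horizon=horizon, alpha=0.5)
    policy.startGame()

    rng = np.random.default_rng(0)
    for t in range(horizon):
        arm = policy.choice()                      # pick an arm among the active ones
        reward = float(rng.random() < means[arm])  # simulated Bernoulli reward in [0, 1]
        policy.getReward(arm, reward)              # update the policy's statistics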

horizon = None

Parameter \(T\) = known horizon of the experiment.

alpha = None

Parameter \(\alpha\).

activeArms = None

Set of active arms

estimate_delta = None

Current estimate of the gap \(\Delta_0\)

max_nb_of_exploration = None

Keep in memory the quantity \(n_m\), computed with n_m().

current_m = None

Current round \(m\).

max_m = None

Bound on the number of rounds, \(m \leq \lfloor \frac{1}{2} \log_2(\frac{T}{e}) \rfloor\).
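As a quick numeric check of this bound (a sketch, with a hypothetical helper name max_nb_of_rounds):

    from math import e, floor, log2

    def max_nb_of_rounds(horizon):
        """Sketch: floor(0.5 * log2(T / e)), the maximal number of rounds."""
        return int(floor(0.5 * log2(horizon / e)))

    print(max_nb_of_rounds(10000))  # 5 rounds for T = 10000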

when_did_it_leave = None

Also keep in memory when each arm was removed from the activeArms set, so that a fake index can be given if the arms have to be ordered, for instance.

__str__()[source]

-> str

update_activeArms()[source]

Update the set activeArms of active arms.
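For reference, the elimination rule of UCB-Improved keeps an arm only if its optimistic estimate stays above the best pessimistic estimate over the active set. The following is a standalone sketch, not the exact code of this method; the names means, pulls and delta_m are hypothetical:

    from math import log, sqrt

    def eliminate(activeArms, means, pulls, horizon, delta_m):
        """One sketched elimination step: keep arm i iff
        means[i] + c(i) >= max_j (means[j] - c(j)),
        with c(i) = sqrt(log(horizon * delta_m**2) / (2 * pulls[i]))."""
        def radius(i):
            return sqrt(log(horizon * delta_m ** 2) / (2 * pulls[i]))

        best_lower_bound = max(means[j] - radius(j) for j in activeArms)
        return {i for i in activeArms if means[i] + radius(i) >= best_lower_bound}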

choice(recursive=False)[source]

In a policy based on successive elimination, choosing an arm amounts to choosing an arm from the set of active arms (self.activeArms) with the method choiceFromSubSet.
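Schematically, this amounts to the following delegation (a sketch only; the actual method may interleave additional bookkeeping for the exploration phases):

    def choice(self, recursive=False):
        # Sketch: choose among the currently active arms only,
        # via the generic choiceFromSubSet method of the parent index policy.
        return self.choiceFromSubSet(list(self.activeArms))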

computeIndex(arm)[source]

Nothing to do, just copy from when_did_it_leave.

__module__ = 'Policies.UCBimproved'