Policies.UCBimproved module

The UCB-Improved policy for bounded bandits, which requires knowledge of the horizon, as an example of a successive elimination algorithm.

Policies.UCBimproved.ALPHA = 0.5

Default value for parameter \(\alpha\).

Policies.UCBimproved.n_m(horizon, delta_m)[source]

Function \(\lceil \frac{2 \log(T \Delta_m^2)}{\Delta_m^2} \rceil\).
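A direct translation of this formula into Python, as a sketch only (the library's actual implementation may guard differently against a non-positive argument of the logarithm):

    from math import ceil, log

    def n_m(horizon, delta_m):
        """Sketch of n_m(T, Delta_m) = ceil(2 * log(T * Delta_m**2) / Delta_m**2)."""
        return int(ceil(2 * log(horizon * delta_m ** 2) / delta_m ** 2))

    print(n_m(10000, 0.5))  # 63 pulls per active arm for T = 10000, Delta_m = 0.5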

class Policies.UCBimproved.UCBimproved(nbArms, horizon=None, alpha=0.5, lower=0.0, amplitude=1.0)[source]

Bases: Policies.SuccessiveElimination.SuccessiveElimination

The UCB-Improved policy for bounded bandits, which requires knowledge of the horizon, as an example of a successive elimination algorithm.

__init__(nbArms, horizon=None, alpha=0.5, lower=0.0, amplitude=1.0)[source]

New generic index policy.

  • nbArms: the number of arms,

  • lower, amplitude: lower value and known amplitude of the rewards.
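A minimal usage sketch, assuming the package is installed as SMPyBandits and that the standard policy interface (startGame(), choice(), getReward()) applies; the arm means and the Bernoulli reward simulation are illustrative only:

    import numpy as np
    from SMPyBandits.Policies import UCBimproved

    horizon = 10000
    means = [0.1, 0.5, 0.9]  # illustrative Bernoulli arm means
    policy = UCBimproved(nbArms=len(means), horizon=horizon, alpha=0.5)
    policy.startGame()

    rng = np.random.default_rng(0)
    for t in range(horizon):
        arm = policy.choice()                      # pick an arm among the active ones
        reward = float(rng.random() < means[arm])  # simulated Bernoulli reward in [0, 1]
        policy.getReward(arm, reward)              # update the policy's statistics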

horizon = None

Parameter \(T\) = known horizon of the experiment.

alpha = None

Parameter \(\alpha\).

activeArms = None

Set of active arms

estimate_delta = None

Current estimate of the gap \(\Delta_0\)

max_nb_of_exploration = None

Keep in memory the quantity \(n_m\), computed with n_m().

current_m = None

Current round \(m\).

max_m = None

Bound on the number of rounds, \(m \leq \lfloor \frac{1}{2} \log_2(\frac{T}{e}) \rfloor\).
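As a quick numeric check of this bound (a sketch, with a hypothetical helper name max_nb_of_rounds):

    from math import e, floor, log2

    def max_nb_of_rounds(horizon):
        """Sketch: floor(0.5 * log2(T / e)), the maximal number of rounds."""
        return int(floor(0.5 * log2(horizon / e)))

    print(max_nb_of_rounds(10000))  # 5 rounds for T = 10000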

when_did_it_leave = None

Also keep in memory when each arm was removed from the activeArms set, so that a fake index can be given if the arms have to be ordered, for instance.

__str__()[source]

-> str

update_activeArms()[source]

Update the set activeArms of active arms.
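For reference, the elimination rule of UCB-Improved keeps an arm only if its optimistic estimate stays above the best pessimistic estimate over the active set. The following is a standalone sketch, not the exact code of this method; the names means, pulls and delta_m are hypothetical:

    from math import log, sqrt

    def eliminate(activeArms, means, pulls, horizon, delta_m):
        """One sketched elimination step: keep arm i iff
        means[i] + c(i) >= max_j (means[j] - c(j)),
        with c(i) = sqrt(log(horizon * delta_m**2) / (2 * pulls[i]))."""
        def radius(i):
            return sqrt(log(horizon * delta_m ** 2) / (2 * pulls[i]))

        best_lower_bound = max(means[j] - radius(j) for j in activeArms)
        return {i for i in activeArms if means[i] + radius(i) >= best_lower_bound}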

choice(recursive=False)[source]

In a policy based on successive elimination, choosing an arm amounts to choosing an arm from the set of active arms (self.activeArms) with the method choiceFromSubSet.
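Schematically, this amounts to the following delegation (a sketch only; the actual method may interleave additional bookkeeping for the exploration phases):

    def choice(self, recursive=False):
        # Sketch: choose among the currently active arms only,
        # via the generic choiceFromSubSet method of the parent index policy.
        return self.choiceFromSubSet(list(self.activeArms))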

computeIndex(arm)[source]

Nothing to do, just copy from when_did_it_leave.

__module__ = 'Policies.UCBimproved'