Policies.OracleSequentiallyRestartPolicy module

An oracle policy for non-stationary bandits, restarting an underlying stationary bandit policy at each breakpoint.

  • It runs on top of a simple policy, e.g., UCB, and OracleSequentiallyRestartPolicy is a wrapper:

    >>> policy = OracleSequentiallyRestartPolicy(nbArms, UCB)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It uses the knowledge of the breakpoints to restart the underlying algorithm at each breakpoint.

  • It is very simple but unrealistic: in any real problem it is impossible to know the locations of the breakpoints in advance, yet it serves as an efficient baseline (see the usage sketch after this list).
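For instance, when the breakpoints (and optionally the per-segment means) are known in advance, they can be passed to the constructor. This is only a usage sketch: the exact expected format of changePoints and listOfMeans (a list of segment starting times and a list of per-segment mean vectors) and the import path are assumptions, and the underlying policy is assumed to be forwarded through the policy keyword of BaseWrapperPolicy:

    >>> from SMPyBandits.Policies import UCB, OracleSequentiallyRestartPolicy  # import path assumed
    >>> nbArms = 3
    >>> changePoints = [0, 400, 700]            # known breakpoint locations (assumed format)
    >>> listOfMeans = [[0.2, 0.5, 0.8],         # means on [0, 400)
    ...                [0.8, 0.5, 0.2],         # means on [400, 700)
    ...                [0.5, 0.8, 0.2]]         # means on [700, horizon)
    >>> policy = OracleSequentiallyRestartPolicy(nbArms, policy=UCB,
    ...                                          changePoints=changePoints,
    ...                                          listOfMeans=listOfMeans)
    >>> policy.startGame()
    >>> # then at each step: arm = policy.choice(); policy.getReward(arm, reward)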

Warning

It is an efficient baseline, but it has no reason to be the best algorithm on a given problem (empirically)! I found that Policies.DiscountedThompson.DiscountedThompson is usually the most efficient.

Policies.OracleSequentiallyRestartPolicy.PER_ARM_RESTART = True

Should we reset one arm's empirical average or all of them? Default is True for this algorithm.

Policies.OracleSequentiallyRestartPolicy.FULL_RESTART_WHEN_REFRESH = False

Should we fully restart the algorithm or simply reset one arm's empirical average? Default is False, which is usually more efficient!

Policies.OracleSequentiallyRestartPolicy.RESET_FOR_ALL_CHANGE = False

True if the algorithm resets one/all arm memories when a change occurs on any arm. False if the algorithm only resets the memory of an arm when a change occurs on that arm (this requires knowing listOfMeans). Default is False, which should be more efficient.

Policies.OracleSequentiallyRestartPolicy.RESET_FOR_SUBOPTIMAL_CHANGE = True

True if the algorithm resets the memories of an arm no matter whether the arm stays optimal or becomes suboptimal (default, it should be more efficient). False if the algorithm resets memories only when a change makes the previously best arm become suboptimal.
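These module-level constants only set the defaults; each behaviour can be chosen per instance through the matching constructor keyword arguments of the class below. A sketch, reusing the nbArms, UCB, changePoints and listOfMeans names from the usage example above:

    >>> policy = OracleSequentiallyRestartPolicy(nbArms, policy=UCB,
    ...     changePoints=changePoints, listOfMeans=listOfMeans,
    ...     per_arm_restart=True,                # reset only the arm whose mean changed
    ...     full_restart_when_refresh=False,     # do not restart the whole wrapped policy
    ...     reset_for_all_change=False,          # ignore changes happening on other arms
    ...     reset_for_suboptimal_change=True)    # reset even if the arm stays (sub)optimal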

class Policies.OracleSequentiallyRestartPolicy.OracleSequentiallyRestartPolicy(nbArms, changePoints=None, listOfMeans=None, reset_for_all_change=False, reset_for_suboptimal_change=True, full_restart_when_refresh=False, per_arm_restart=True, *args, **kwargs)[source]

Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy

An oracle policy for non-stationary bandits, restarting an underlying stationary bandit policy at each breakpoint.

__init__(nbArms, changePoints=None, listOfMeans=None, reset_for_all_change=False, reset_for_suboptimal_change=True, full_restart_when_refresh=False, per_arm_restart=True, *args, **kwargs)[source]

New policy.

reset_for_all_change = None

See RESET_FOR_ALL_CHANGE

reset_for_suboptimal_change = None

See RESET_FOR_SUBOPTIMAL_CHANGE

changePoints = None

Locations of the break points (or change points) of the switching bandit problem, for each arm. If None, an empty list is used.

all_rewards = None

Keep in memory all the rewards obtained since the last restart on that arm.

last_pulls = None

Keep in memory the time when each arm was last seen. Starts at -1 (never seen).
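A minimal sketch of how these two per-arm memories could be laid out (illustrative only; the actual container types used by the class are not documented here), reusing nbArms from the example above:

    import numpy as np
    all_rewards = [[] for _ in range(nbArms)]   # reward history since the last restart, per arm
    last_pulls = np.full(nbArms, -1)            # time each arm was last seen, -1 = never seen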

compute_optimized_changePoints(changePoints=None, listOfMeans=None)[source]

Compute the list of change points for each arm.
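One plausible way to implement this filtering, following the semantics of the four flags documented above, is sketched below. This is not the library's code, only an illustration under the assumption that changePoints[i] is the first time step of segment i and listOfMeans[i] holds the nbArms means on that segment:

    import numpy as np

    def per_arm_changePoints_sketch(nbArms, changePoints, listOfMeans,
                                    reset_for_all_change=False,
                                    reset_for_suboptimal_change=True):
        """Sketch: build one list of restart times per arm (not the library's code)."""
        listOfMeans = np.asarray(listOfMeans)
        per_arm = [[] for _ in range(nbArms)]
        for i in range(1, len(changePoints)):
            tau = changePoints[i]
            before, after = listOfMeans[i - 1], listOfMeans[i]
            any_change = bool(np.any(before != after))
            prev_best = int(np.argmax(before))
            # did the previously best arm become suboptimal after this breakpoint?
            best_demoted = after[prev_best] < np.max(after)
            for arm in range(nbArms):
                arm_changed = before[arm] != after[arm]
                if reset_for_all_change:
                    triggered = any_change           # any change anywhere restarts this arm
                elif reset_for_suboptimal_change:
                    triggered = arm_changed          # restart as soon as this arm's mean moves
                else:
                    triggered = arm_changed and best_demoted
                if triggered:
                    per_arm[arm].append(tau)
        return per_arm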

__str__()[source]

-> str

__module__ = 'Policies.OracleSequentiallyRestartPolicy'

getReward(arm, reward)[source]

Give a reward: increase t and the pull count, update the cumulative sum of rewards and the stored reward history for that arm (reward normalized in [0, 1]).

  • Reset the whole empirical average if the current time step is in the list of change points.
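A hypothetical illustration of that bullet (not the library's exact code); only changePoints, all_rewards and last_pulls are documented attributes, and the full_restart argument mirrors the full_restart_when_refresh option:

    def reset_if_breakpoint(policy, arm, t, full_restart=False):
        """Sketch of the oracle reset: restart when t is a known breakpoint of this arm."""
        if t in policy.changePoints[arm]:      # current step is a known change point
            if full_restart:
                policy.startGame()             # restart the whole wrapped policy
            else:
                policy.all_rewards[arm] = []   # forget only this arm's reward history
                policy.last_pulls[arm] = -1    # mark the arm as never seen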

detect_change(arm)[source]

Try to detect a change in the current arm.
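For this oracle policy there is nothing to estimate: a plausible reading (a sketch, not the library's code) is that "detecting" a change reduces to a membership test on the precomputed change points of that arm, using the current internal time t:

    def detect_change_sketch(policy, arm, t):
        # The oracle knows the breakpoints, so detection is a simple lookup.
        return t in policy.changePoints[arm]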