Policies.OracleSequentiallyRestartPolicy module

An oracle policy for non-stationary bandits, restarting an underlying stationary bandit policy at each breakpoint.

  • It runs on top of a simple policy, e.g., UCB, and OracleSequentiallyRestartPolicy is a wrapper:

    >>> policy = OracleSequentiallyRestartPolicy(nbArms, UCB)
    >>> # use policy as usual, with policy.startGame(), r = policy.choice(), policy.getReward(arm, r)
    
  • It uses the knowledge of the breakpoints to restart the underlying algorithm at each breakpoint.

  • It is very simple but unrealistic: in any real problem it is impossible to know the locations of the breakpoints in advance, yet it serves as an efficient baseline (see the usage sketch after this list).
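For instance, when the breakpoints (and optionally the per-segment means) are known in advance, they can be passed to the constructor. This is only a usage sketch: the exact expected format of changePoints and listOfMeans (a list of segment starting times and a list of per-segment mean vectors) and the import path are assumptions, and the underlying policy is assumed to be forwarded through the policy keyword of BaseWrapperPolicy:

    >>> from SMPyBandits.Policies import UCB, OracleSequentiallyRestartPolicy  # import path assumed
    >>> nbArms = 3
    >>> changePoints = [0, 400, 700]            # known breakpoint locations (assumed format)
    >>> listOfMeans = [[0.2, 0.5, 0.8],         # means on [0, 400)
    ...                [0.8, 0.5, 0.2],         # means on [400, 700)
    ...                [0.5, 0.8, 0.2]]         # means on [700, horizon)
    >>> policy = OracleSequentiallyRestartPolicy(nbArms, policy=UCB,
    ...                                          changePoints=changePoints,
    ...                                          listOfMeans=listOfMeans)
    >>> policy.startGame()
    >>> # then at each step: arm = policy.choice(); policy.getReward(arm, reward)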

Warning

It is an efficient baseline, but it has no reason to be the best algorithm on a given problem (empirically)! I found that Policies.DiscountedThompson.DiscountedThompson is usually the most efficient.

Policies.OracleSequentiallyRestartPolicy.PER_ARM_RESTART = True

Should we reset one arm's empirical average or all of them? Default is True for this algorithm.

Policies.OracleSequentiallyRestartPolicy.FULL_RESTART_WHEN_REFRESH = False

Should we fully restart the algorithm or simply reset one arm's empirical average? Default is False, which is usually more efficient!

Policies.OracleSequentiallyRestartPolicy.RESET_FOR_ALL_CHANGE = False

True if the algorithm resets one/all arm memories when a change occurs on any arm. False if the algorithm only resets the memory of an arm when a change occurs on that arm (this requires knowing listOfMeans). Default is False, which should be more efficient.

Policies.OracleSequentiallyRestartPolicy.RESET_FOR_SUBOPTIMAL_CHANGE = True

True if the algorithm resets the memories of an arm no matter whether the arm stays optimal or becomes suboptimal (default, it should be more efficient). False if the algorithm resets memories only when a change makes the previously best arm become suboptimal.
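These module-level constants only set the defaults; each behaviour can be chosen per instance through the matching constructor keyword arguments of the class below. A sketch, reusing the nbArms, UCB, changePoints and listOfMeans names from the usage example above:

    >>> policy = OracleSequentiallyRestartPolicy(nbArms, policy=UCB,
    ...     changePoints=changePoints, listOfMeans=listOfMeans,
    ...     per_arm_restart=True,                # reset only the arm whose mean changed
    ...     full_restart_when_refresh=False,     # do not restart the whole wrapped policy
    ...     reset_for_all_change=False,          # ignore changes happening on other arms
    ...     reset_for_suboptimal_change=True)    # reset even if the arm stays (sub)optimal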

class Policies.OracleSequentiallyRestartPolicy.OracleSequentiallyRestartPolicy(nbArms, changePoints=None, listOfMeans=None, reset_for_all_change=False, reset_for_suboptimal_change=True, full_restart_when_refresh=False, per_arm_restart=True, *args, **kwargs)[source]

Bases: Policies.BaseWrapperPolicy.BaseWrapperPolicy

An oracle policy for non-stationary bandits, restarting an underlying stationary bandit policy at each breakpoint.

__init__(nbArms, changePoints=None, listOfMeans=None, reset_for_all_change=False, reset_for_suboptimal_change=True, full_restart_when_refresh=False, per_arm_restart=True, *args, **kwargs)[source]

New policy.

reset_for_all_change = None

See RESET_FOR_ALL_CHANGE

reset_for_suboptimal_change = None

See RESET_FOR_SUBOPTIMAL_CHANGE

changePoints = None

Locations of the break points (or change points) of the switching bandit problem, for each arm. If None, an empty list is used.

all_rewards = None

Keep in memory all the rewards obtained since the last restart on that arm.

last_pulls = None

Keep in memory the time when each arm was last seen. Starts at -1 (never seen).
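A minimal sketch of how these two per-arm memories could be laid out (illustrative only; the actual container types used by the class are not documented here), reusing nbArms from the example above:

    import numpy as np
    all_rewards = [[] for _ in range(nbArms)]   # reward history since the last restart, per arm
    last_pulls = np.full(nbArms, -1)            # time each arm was last seen, -1 = never seen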

compute_optimized_changePoints(changePoints=None, listOfMeans=None)[source]

Compute the list of change points for each arm.
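One plausible way to implement this filtering, following the semantics of the four flags documented above, is sketched below. This is not the library's code, only an illustration under the assumption that changePoints[i] is the first time step of segment i and listOfMeans[i] holds the nbArms means on that segment:

    import numpy as np

    def per_arm_changePoints_sketch(nbArms, changePoints, listOfMeans,
                                    reset_for_all_change=False,
                                    reset_for_suboptimal_change=True):
        """Sketch: build one list of restart times per arm (not the library's code)."""
        listOfMeans = np.asarray(listOfMeans)
        per_arm = [[] for _ in range(nbArms)]
        for i in range(1, len(changePoints)):
            tau = changePoints[i]
            before, after = listOfMeans[i - 1], listOfMeans[i]
            any_change = bool(np.any(before != after))
            prev_best = int(np.argmax(before))
            # did the previously best arm become suboptimal after this breakpoint?
            best_demoted = after[prev_best] < np.max(after)
            for arm in range(nbArms):
                arm_changed = before[arm] != after[arm]
                if reset_for_all_change:
                    triggered = any_change           # any change anywhere restarts this arm
                elif reset_for_suboptimal_change:
                    triggered = arm_changed          # restart as soon as this arm's mean moves
                else:
                    triggered = arm_changed and best_demoted
                if triggered:
                    per_arm[arm].append(tau)
        return per_arm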

__str__()[source]

-> str

__module__ = 'Policies.OracleSequentiallyRestartPolicy'

getReward(arm, reward)[source]

Give a reward: increase t and the pull count, update the cumulative sum of rewards and the stored reward history for that arm (reward normalized in [0, 1]).

  • Reset the whole empirical average if the current time step is in the list of change points.
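A hypothetical illustration of that bullet (not the library's exact code); only changePoints, all_rewards and last_pulls are documented attributes, and the full_restart argument mirrors the full_restart_when_refresh option:

    def reset_if_breakpoint(policy, arm, t, full_restart=False):
        """Sketch of the oracle reset: restart when t is a known breakpoint of this arm."""
        if t in policy.changePoints[arm]:      # current step is a known change point
            if full_restart:
                policy.startGame()             # restart the whole wrapped policy
            else:
                policy.all_rewards[arm] = []   # forget only this arm's reward history
                policy.last_pulls[arm] = -1    # mark the arm as never seen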

detect_change(arm)[source]

Try to detect a change in the current arm.
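For this oracle policy there is nothing to estimate: a plausible reading (a sketch, not the library's code) is that "detecting" a change reduces to a membership test on the precomputed change points of that arm, using the current internal time t:

    def detect_change_sketch(policy, arm, t):
        # The oracle knows the breakpoints, so detection is a simple lookup.
        return t in policy.changePoints[arm]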