Policies package

Policies module: contains all the (single-player) bandit algorithms:

“Stupid” algorithms: Uniform, UniformOnSome, TakeFixedArm, TakeRandomFixedArm,
Greedy algorithms: EpsilonGreedy, EpsilonFirst, EpsilonDecreasing, EpsilonDecreasingMEGA, EpsilonExpDecreasing,
And variants of the Explore-Then-Commit policy: ExploreThenCommit.ETC_KnownGap, ExploreThenCommit.ETC_RandomStop, ExploreThenCommit.ETC_FixedBudget, ExploreThenCommit.ETC_SPRT, ExploreThenCommit.ETC_BAI, ExploreThenCommit.DeltaUCB,
Probabilistic weighting algorithms: Hedge, Softmax, Softmax.SoftmaxDecreasing, Softmax.SoftMix, Softmax.SoftmaxWithHorizon, Exp3, Exp3.Exp3Decreasing, Exp3.Exp3SoftMix, Exp3.Exp3WithHorizon, Exp3.Exp3ELM, ProbabilityPursuit, Exp3PlusPlus, a smart variant BoltzmannGumbel, and a recent extension TsallisInf,
Index based UCB algorithms: EmpiricalMeans, UCB, UCBalpha, UCBmin, UCBplus, UCBrandomInit, UCBV, UCBVtuned, UCBH, CPUCB, UCBimproved,
Index based MOSS algorithms: MOSS, MOSSH, MOSSAnytime, MOSSExperimental,
Bayesian algorithms: Thompson, BayesUCB, and DiscountedThompson,
Based on Kullback-Leibler divergence: klUCB, klUCBloglog, klUCBPlus, klUCBH, klUCBHPlus, klUCBPlusPlus, klUCBswitch,
Other index algorithms: DMED, DMED.DMEDPlus, IMED, OCUCBH, OCUCBH.AOCUCBH, OCUCB, UCBdagger,
Hybrid algorithms, mixing Bayesian and UCB indexes: AdBandits,
Aggregation algorithms: Aggregator (mine, it’s awesome, go on try it!), and CORRAL, LearnExp,
Finite-Horizon Gittins index, approximated version: ApproximatedFHGittins,
An experimental policy that uses a sliding window (of, for instance, 100 draws) and resets the algorithm as soon as the empirical average over the window is too far from the empirical average over the full history (or just restarts a single arm, if possible): SlidingWindowRestart, with 3 versions for UCB, UCBalpha and klUCB: SlidingWindowRestart.SWR_UCB, SlidingWindowRestart.SWR_UCBalpha, SlidingWindowRestart.SWR_klUCB (my algorithm, unpublished yet),
An experimental policy that uses just a sliding window (of, for instance, 100 draws): SlidingWindowUCB.SWUCB, and SlidingWindowUCB.SWUCBPlus if the horizon is known. There is also SlidingWindowUCB.SWklUCB.
Another experimental policy with a discount factor, DiscountedUCB and DiscountedUCB.DiscountedUCBPlus, as well as versions using klUCB, DiscountedUCB.DiscountedklUCB, and DiscountedUCB.DiscountedklUCBPlus.
Other policies for the non-stationary problems: LM_DSEE, SWHash_UCB.SWHash_IndexPolicy, CD_UCB.CUSUM_IndexPolicy, CD_UCB.PHT_IndexPolicy, CD_UCB.UCBLCB_IndexPolicy, CD_UCB.GaussianGLR_IndexPolicy, CD_UCB.BernoulliGLR_IndexPolicy, Monitored_UCB.Monitored_IndexPolicy, OracleSequentiallyRestartPolicy, AdSwitch.
A policy designed to tackle sparse stochastic bandit problems, SparseUCB, SparseklUCB, and SparseWrapper that can be used with any index policy.
A policy that implements a “smart doubling trick” to turn any horizon-dependent policy into a horizon-independent policy without losing performance: DoublingTrickWrapper,
An experimental policy, implementing another kind of doubling trick to turn any policy that needs to know the range \([a,b]\) of rewards into a policy that does not need to know the range, and that adapts dynamically to new observations: WrapRange,
The Optimal Sampling for Structured Bandits (OSSB) policy: OSSB (it is more generic and can be applied to almost any kind of bandit problem; it works fine for classical stationary bandits but it is not optimal), a variant for Gaussian problems GaussianOSSB, and a variant for sparse bandits SparseOSSB. There are also two variants with decreasing rates, OSSB_DecreasingRate and OSSB_AutoDecreasingRate,
The Best Empirical Sampled Average (BESA) policy: BESA (it works crazily well),
New! The UCBoost (Upper Confidence bounds with Boosting) policies, first with no boosting: UCBoost.UCB_sq, UCBoost.UCB_bq, UCBoost.UCB_h, UCBoost.UCB_lb, UCBoost.UCB_t, and then the ones with non-adaptive boosting: UCBoost.UCBoost_bq_h_lb, UCBoost.UCBoost_bq_h_lb_t, UCBoost.UCBoost_bq_h_lb_t_sq, UCBoost.UCBoost, and finally the epsilon-approximation boosting with UCBoost.UCBoostEpsilon,
Some are designed only for (fully decentralized) multi-player games: MusicalChair, MEGA, TrekkingTSN, MusicalChairNoSensing, SIC_MMAB…
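The idea behind the doubling trick mentioned in the list above can be sketched as follows. This is a simplified illustration, not the code of DoublingTrickWrapper: the names `DoublingWrapperSketch`, `HorizonStub` and `make_policy` are hypothetical, and the sketch assumes the simplest schedule, where the horizon guess is doubled each time the elapsed time reaches it and a fresh policy is started.

```python
class HorizonStub:
    """Hypothetical horizon-dependent policy: always plays arm 0."""

    def __init__(self, horizon):
        self.horizon = horizon  # a real policy would tune itself with this value

    def startGame(self):
        pass

    def choice(self):
        return 0

    def getReward(self, arm, reward):
        pass


class DoublingWrapperSketch:
    """Hypothetical wrapper: restart the policy with a doubled horizon guess."""

    def __init__(self, make_policy, first_horizon=100):
        # make_policy(horizon) must return an object exposing
        # startGame(), choice() and getReward(arm, reward)
        self.make_policy = make_policy
        self.first_horizon = first_horizon
        self.startGame()

    def startGame(self):
        self.t = 0
        self.horizon = self.first_horizon
        self.policy = self.make_policy(self.horizon)
        self.policy.startGame()

    def choice(self):
        return self.policy.choice()

    def getReward(self, arm, reward):
        self.t += 1
        self.policy.getReward(arm, reward)
        if self.t >= self.horizon:
            # The current guess is exhausted: double it and start afresh
            self.horizon *= 2
            self.policy = self.make_policy(self.horizon)
            self.policy.startGame()


created = []  # horizons handed to the underlying policy, in order


def make_policy(horizon):
    created.append(horizon)
    return HorizonStub(horizon)


wrapper = DoublingWrapperSketch(make_policy, first_horizon=100)
for t in range(350):
    arm = wrapper.choice()
    wrapper.getReward(arm, 0.0)
print(created)
```

The wrapper exposes the same three methods as the policy it wraps, so the calling loop never needs to know the true horizon.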

Note

The list above might not be complete; see the details below.

All policies share the same interface, described in BasePolicy, so that any of them can be used in an experiment with the following approach:

my_policy = Policy(nbArms)
my_policy.startGame()  # start the game
for t in range(T):
    chosen_arm_t = k_t = my_policy.choice()  # choose one arm
    reward_t     = sample_reward(k_t)        # sample a reward from arm k_t (sample_reward is a placeholder for your environment)
    my_policy.getReward(k_t, reward_t)       # give it to the policy
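As a self-contained illustration of this interface, the sketch below defines a toy policy exposing the same three methods (`startGame`, `choice`, `getReward`) and runs the loop against simulated Bernoulli arms. `ToyUCB`, `run_bandit` and the arm means are hypothetical names and values chosen for this example; they are not part of the package, and the real policies built on BasePolicy may differ in detail.

```python
import math
import random


class ToyUCB:
    """Hypothetical minimal policy with the startGame/choice/getReward interface."""

    def __init__(self, nbArms):
        self.nbArms = nbArms
        self.startGame()

    def startGame(self):
        # Reset all internal statistics before a new run
        self.t = 0
        self.pulls = [0] * self.nbArms
        self.rewards = [0.0] * self.nbArms

    def choice(self):
        # Play each arm once, then pick the arm with the largest UCB1 index
        for k in range(self.nbArms):
            if self.pulls[k] == 0:
                return k

        def index(k):
            mean = self.rewards[k] / self.pulls[k]
            return mean + math.sqrt(2.0 * math.log(self.t) / self.pulls[k])

        return max(range(self.nbArms), key=index)

    def getReward(self, arm, reward):
        # Update the statistics of the arm that was just played
        self.t += 1
        self.pulls[arm] += 1
        self.rewards[arm] += reward


def run_bandit(policy, means, T, seed=0):
    """Run the generic loop on Bernoulli arms with the given means."""
    rng = random.Random(seed)
    policy.startGame()
    for t in range(T):
        k_t = policy.choice()
        reward_t = 1.0 if rng.random() < means[k_t] else 0.0
        policy.getReward(k_t, reward_t)
    return policy.pulls


pulls = run_bandit(ToyUCB(3), means=[0.1, 0.5, 0.9], T=2000)
print(pulls)  # number of times each arm was played
```

Because the loop only relies on the three interface methods, any other object exposing them could be passed to `run_bandit` unchanged.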

Policies.klucb_mapping = {'Bernoulli': CPUDispatcher(<function klucbBern>), 'Exponential': CPUDispatcher(<function klucbExp>), 'Gamma': CPUDispatcher(<function klucbGamma>), 'Gaussian': CPUDispatcher(<function klucbGauss>), 'Poisson': CPUDispatcher(<function klucbPoisson>)}

Maps arm names to the corresponding klucb functions.
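To illustrate how such a mapping can be used, here is a minimal sketch of a name-to-function dictionary that dispatches to a kl-UCB index. The functions `kl_bernoulli` and `klucb_bern_sketch` and the dictionary `toy_klucb_mapping` are hypothetical pure-Python stand-ins written for this example; the package's own `klucbBern` is wrapped in a `CPUDispatcher` in the listing above, and its signature may differ.

```python
import math


def kl_bernoulli(p, q, eps=1e-15):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def klucb_bern_sketch(mean, level, precision=1e-6):
    """Largest q in [mean, 1] such that kl(mean, q) <= level, by bisection."""
    low, high = mean, 1.0
    while high - low > precision:
        mid = (low + high) / 2.0
        if kl_bernoulli(mean, mid) > level:
            high = mid  # mid is too optimistic
        else:
            low = mid   # mid is still within the confidence level
    return (low + high) / 2.0


# Hypothetical mapping, in the spirit of klucb_mapping: arm name -> index function
toy_klucb_mapping = {"Bernoulli": klucb_bern_sketch}

klucb = toy_klucb_mapping["Bernoulli"]
level = math.log(100) / 20   # e.g. log(t) / N_k(t) with t = 100, N_k(t) = 20
index = klucb(0.3, level)    # upper confidence bound for an empirical mean of 0.3
print(index)
```

Looking up the function by arm name keeps the index policy itself independent of the reward distribution.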

Subpackages

Submodules