# Policies.Aggregator module

My Aggregated bandit algorithm, similar to Exp4 but not exactly equivalent.

The algorithm consists of a master A managing several “slave” algorithms, $$A_1, ..., A_N$$.

• At every step, the prediction of every slave is gathered, and a vote decides A’s choice.

• The vote is simply a majority vote, weighted by the trust probabilities: if $$A_i$$ decides arm $$I_i$$, then the probability of selecting arm $$k$$ is the sum of the trust probabilities $$P_i$$ of every $$A_i$$ for which $$I_i = k$$.

• The trust probabilities start uniform, $$P_i = 1/N$$; then at every step, after receiving the feedback (the reward) for one arm $$k$$, the trust in each slave $$A_i$$ is updated: $$P_i$$ increases if $$A_i$$ advised $$k$$ ($$I_i = k$$), and decreases otherwise.

• The details of how the probabilities are increased or decreased are specified below.

• Reference: [[Aggregation of Multi-Armed Bandits Learning Algorithms for Opportunistic Spectrum Access, Lilian Besson and Emilie Kaufmann and Christophe Moy, 2017]](https://hal.inria.fr/hal-01705292)
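The vote and trust update described above can be sketched as follows. This is a simplified illustration, not the module’s actual API: the function names and the constant `rate` are assumptions.

```python
import numpy as np

def vote_probabilities(choices, trusts, nb_arms):
    """Probability of selecting arm k = sum of the trusts of slaves advising k."""
    probs = np.zeros(nb_arms)
    for arm, trust in zip(choices, trusts):
        probs[arm] += trust
    return probs

def update_trusts(choices, trusts, arm_played, reward, rate=0.5):
    """Increase the trust of the slaves whose advice matched the played arm,
    then renormalize so the trusts stay a probability distribution."""
    trusts = np.array(trusts, dtype=float)
    advised = (np.asarray(choices) == arm_played)
    trusts[advised] *= np.exp(rate * reward)
    return trusts / trusts.sum()

# 3 slaves over 4 arms, uniform initial trusts P_i = 1/N
choices = [2, 2, 0]
trusts = np.full(3, 1.0 / 3.0)
probs = vote_probabilities(choices, trusts, nb_arms=4)  # arm 2 gets 2/3, arm 0 gets 1/3
trusts = update_trusts(choices, trusts, arm_played=2, reward=1.0)
```

After the update, the two slaves that advised the played arm share a higher trust than the third, and the trusts still sum to one.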

Note

Why call it Aggregator? Because this algorithm is an efficient aggregation algorithm, and like The Terminator, it beats its opponents with an iron fist! (OK, that’s a stupid joke but a cool name, thanks Emilie!)

Note

I wanted to call it Aggragorn. Because this algorithm is like Aragorn the ranger, it starts like a simple bandit, but soon it will become king!!

Policies.Aggregator.UNBIASED = True

A flag to know if the rewards are used as biased estimates, i.e., just $$r_t$$, or as unbiased estimates, $$r_t / p_t$$, where $$p_t$$ is the probability of selecting that arm at time $$t$$. It seemed to work better with unbiased estimators (of course).
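A minimal illustration of the two estimates (the helper function is hypothetical, not part of the module):

```python
def reward_estimate(reward, p_t, unbiased=True):
    """Biased estimate: the raw reward r_t.
    Unbiased estimate: r_t / p_t, where p_t is the probability
    with which the arm was selected at time t."""
    return reward / p_t if unbiased else reward

# An arm selected with probability 0.25 that yields reward 1.0
# contributes an estimate of 4.0 when UNBIASED is True.
estimate = reward_estimate(1.0, 0.25)
```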

Policies.Aggregator.UPDATE_LIKE_EXP4 = False

Flag to know if we should update the trust probabilities like in Exp4 or like in my initial Aggregator proposal:

• First choice: like Exp4, the trusts are fully recomputed, trusts^(t+1) = exp(rate_t * estimated mean rewards up to time t),

• Second choice: my proposal, the trusts are just updated multiplicatively, trusts^(t+1) <-- trusts^t * exp(rate_t * estimated instant reward at time t).

Both choices seem fine, and anyway the trusts are renormalized to be a probability distribution, so it doesn’t matter much.
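The two update rules can be sketched like this (hypothetical helper functions, not the module’s API). With a constant learning rate the two coincide after renormalization, which supports the remark above:

```python
import numpy as np

def trusts_exp4_style(cum_estimated_rewards, rate):
    """First choice: fully recompute the trusts from the
    cumulated estimated rewards up to time t."""
    trusts = np.exp(rate * np.asarray(cum_estimated_rewards))
    return trusts / trusts.sum()

def trusts_multiplicative(trusts, instant_estimates, rate):
    """Second choice: multiply the current trusts by
    exp(rate * estimated instant reward), then renormalize."""
    trusts = np.asarray(trusts) * np.exp(rate * np.asarray(instant_estimates))
    return trusts / trusts.sum()
```

Starting from a uniform prior and iterating the multiplicative update with a constant rate gives exactly the Exp4-style recomputation on the summed estimates; the two choices only differ when rate_t varies over time.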

Policies.Aggregator.USE_LOSSES = False

Flag to know if the Exp4-like update uses losses or rewards. Losses are $$1 - \text{reward}$$, in which case rate_t is negative.
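Both formulations yield the same trusts: with loss = 1 - reward and a negated rate, the exponents differ only by a constant, which cancels in the normalization. A quick check with hypothetical helpers:

```python
import numpy as np

def trusts_from_rewards(cum_rewards, rate):
    """Exp4-like trusts computed from cumulated rewards."""
    t = np.exp(rate * np.asarray(cum_rewards))
    return t / t.sum()

def trusts_from_losses(cum_losses, rate):
    """Same update written with losses: the rate is negated."""
    t = np.exp(-rate * np.asarray(cum_losses))
    return t / t.sum()

cum_rewards = np.array([3.0, 1.0, 2.0])  # after 5 steps
cum_losses = 5.0 - cum_rewards           # loss = 1 - reward at each step
```

The loss version multiplies every exponent by exp(-5 * rate), a common factor that disappears after renormalization.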

Policies.Aggregator.UPDATE_ALL_CHILDREN = False

Should all trusts be updated, or only the trusts of the slaves $$A_i$$ who advised the decision that Aggregator[A1..AN] followed?

class Policies.Aggregator.Aggregator(nbArms, children=None, learningRate=None, decreaseRate=None, horizon=None, update_all_children=False, update_like_exp4=False, unbiased=True, prior='uniform', lower=0.0, amplitude=1.0, extra_str='')[source]

My Aggregated bandit algorithm, similar to Exp4 but not exactly equivalent.

__init__(nbArms, children=None, learningRate=None, decreaseRate=None, horizon=None, update_all_children=False, update_like_exp4=False, unbiased=True, prior='uniform', lower=0.0, amplitude=1.0, extra_str='')[source]

New policy.

nbArms = None

Number of arms

lower = None

Lower values for rewards

amplitude = None

Amplitude of rewards (rewards lie in [lower, lower + amplitude])

unbiased = None

Flag, see above.

horizon = None

Horizon T, if given and not None, can be used to compute a “good” constant learning rate, $$\sqrt{\frac{2 \log(N)}{T K}}$$ for N slaves, K arms (heuristic).
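For instance, the heuristic rate above can be computed as follows (a hypothetical helper, not the module’s API):

```python
import math

def default_learning_rate(horizon, nb_children, nb_arms):
    """Heuristic constant learning rate sqrt(2 log(N) / (T K))
    for N slaves, K arms and horizon T."""
    return math.sqrt(2.0 * math.log(nb_children) / (horizon * nb_arms))

# e.g. N = 8 slaves, K = 10 arms, horizon T = 10000
rate = default_learning_rate(10000, 8, 10)
```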

extra_str = None

A string to add at the end of the str(self), to specify which algorithms are aggregated for instance.

update_all_children = None

Flag, see above.

nbChildren = None

Number N of slave algorithms.

t = None

Internal time

update_like_exp4 = None

Flag, see above.

learningRate = None

Value of the learning rate (can be decreasing in time)

decreaseRate = None

Value of the constant used in the decreasing of the learning rate

children = None

List of slave algorithms.

trusts = None

Initial trusts in the slaves. Default to uniform, but a prior can also be given.

choices = None

Keep track of the last choices of each slave, to know whom to update if update_all_children is false.

children_cumulated_losses = None

Keep track of the cumulated loss of each child (empirical mean)

index = None

Numerical index for each arm

__str__()[source]

Nicely print the name of the algorithm with its relevant parameters.

property rate

Learning rate, can be constant if self.decreaseRate is None, or decreasing.

• if horizon is known, use the formula which uses it,

• if horizon is not known, use the formula which uses current time $$t$$,

• else, if decreaseRate is a number, use an exponentially decreasing learning rate, rate = learningRate * exp(- t / decreaseRate). Empirically a bad choice.
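A sketch of this branching; the parameter names, default values and exact branch order are assumptions about the implementation, not the module’s API:

```python
import math

def current_rate(t, learningRate, decreaseRate=None, horizon=None,
                 nbChildren=8, nbArms=10):
    """Hypothetical sketch of the cases listed above."""
    if decreaseRate is None:
        return learningRate  # constant learning rate
    if horizon is not None:
        # horizon known: use the horizon-based formula
        return math.sqrt(2.0 * math.log(nbChildren) / (horizon * nbArms))
    if t > 0:
        # horizon unknown: plug the current time t into the same formula
        return math.sqrt(2.0 * math.log(nbChildren) / (t * nbArms))
    # otherwise: exponentially decreasing rate (described above as a bad choice)
    return learningRate * math.exp(- t / decreaseRate)
```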

startGame()[source]

Start the game for each child.

getReward(arm, reward)[source]

Give reward for each child, and then update the trust probabilities.

_makeChildrenChoose()[source]

Convenience method to make every child choose their best arm, and store their decisions in self.choices.

choice()[source]

Make each child vote, then sample the decision by importance sampling on their votes with the trust probabilities.

choiceWithRank(rank=1)[source]

Make each child vote, with rank, then sample the decision by importance sampling on their votes with the trust probabilities.

choiceFromSubSet(availableArms='all')[source]

Make each child vote, on subsets of arms, then sample the decision by importance sampling on their votes with the trust probabilities.

choiceMultiple(nb=1)[source]

Make each child vote, multiple times, then sample the decision by importance sampling on their votes with the trust probabilities.

choiceIMP(nb=1, startWithChoiceMultiple=True)[source]

Make each child vote, multiple times (with IMP scheme), then sample the decision by importance sampling on their votes with the trust probabilities.

estimatedOrder()[source]

Make each child vote for their estimated order of the arms, then randomly select an ordering by importance sampling with the trust probabilities. Return the estimated order of the arms, as a permutation on [0..K-1] that would order the arms by increasing means.

estimatedBestArms(M=1)[source]

Return a (not-necessarily sorted) list of the indexes of the M best arms, i.e., identify the set of the M best arms.

computeIndex(arm)[source]

Compute the current index of arm ‘arm’, by computing all the indexes of the children policies and taking a convex combination weighted by the trust probabilities.
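The convex combination can be sketched as follows (a hypothetical helper, not the module’s API):

```python
import numpy as np

def aggregated_index(children_indexes, trusts):
    """Convex combination of the children's index vectors,
    weighted by the trust probabilities."""
    return np.asarray(trusts) @ np.asarray(children_indexes)

# Two children's indexes over 3 arms, with trusts (0.75, 0.25)
idx = aggregated_index([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]], [0.75, 0.25])
# → array([0.75, 0.25, 0.5])
```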

computeAllIndex()[source]

Compute the current indexes for all arms. Possibly vectorized; by default it cannot be vectorized automatically.

handleCollision(arm, reward=None)[source]

Default: give a reward of 0 (or self.lower).