{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "# An example of a small Multi-Player simulation, with Centralized Algorithms\n", "\n", "First, be sure to be in the main folder, or to have [SMPyBandits](https://github.com/SMPyBandits/SMPyBandits) installed, and import `EvaluatorMultiPlayers` from `Environment` package:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: SMPyBandits in ./venv3/lib/python3.6/site-packages (0.9.4)\n", "Requirement already satisfied: watermark in ./venv3/lib/python3.6/site-packages (1.7.0)\n", "Requirement already satisfied: numpy in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (1.15.4)\n", "Requirement already satisfied: scikit-learn in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (0.20.0)\n", "Requirement already satisfied: scikit-optimize in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (0.5.2)\n", "Requirement already satisfied: matplotlib>=2 in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (3.0.2)\n", "Requirement already satisfied: joblib in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (0.13.0)\n", "Requirement already satisfied: scipy>0.9 in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (1.1.0)\n", "Requirement already satisfied: seaborn in ./venv3/lib/python3.6/site-packages (from SMPyBandits) (0.9.0)\n", "Requirement already satisfied: ipython in ./venv3/lib/python3.6/site-packages (from watermark) (7.1.1)\n", "Requirement already satisfied: python-dateutil>=2.1 in ./venv3/lib/python3.6/site-packages (from matplotlib>=2->SMPyBandits) (2.7.5)\n", "Requirement already satisfied: kiwisolver>=1.0.1 in ./venv3/lib/python3.6/site-packages (from matplotlib>=2->SMPyBandits) (1.0.1)\n", "Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in ./venv3/lib/python3.6/site-packages (from matplotlib>=2->SMPyBandits) (2.3.0)\n", "Requirement already satisfied: cycler>=0.10 in ./venv3/lib/python3.6/site-packages (from matplotlib>=2->SMPyBandits) (0.10.0)\n", "Requirement already satisfied: pandas>=0.15.2 in ./venv3/lib/python3.6/site-packages (from seaborn->SMPyBandits) (0.23.4)\n", "Requirement already satisfied: prompt-toolkit<2.1.0,>=2.0.0 in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (2.0.7)\n", "Requirement already satisfied: decorator in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (4.3.0)\n", "Requirement already satisfied: pickleshare in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (0.7.5)\n", "Requirement already satisfied: jedi>=0.10 in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (0.13.1)\n", "Requirement already satisfied: setuptools>=18.5 in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (40.6.2)\n", "Requirement already satisfied: traitlets>=4.2 in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (4.3.2)\n", "Requirement already satisfied: pexpect; sys_platform != \"win32\" in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (4.6.0)\n", "Requirement already satisfied: pygments in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (2.2.0)\n", "Requirement already satisfied: backcall in ./venv3/lib/python3.6/site-packages (from ipython->watermark) (0.1.0)\n", "Requirement already satisfied: six>=1.5 in ./venv3/lib/python3.6/site-packages (from python-dateutil>=2.1->matplotlib>=2->SMPyBandits) (1.11.0)\n", "Requirement already satisfied: pytz>=2011k in ./venv3/lib/python3.6/site-packages (from pandas>=0.15.2->seaborn->SMPyBandits) (2018.7)\n", "Requirement already satisfied: wcwidth in ./venv3/lib/python3.6/site-packages (from prompt-toolkit<2.1.0,>=2.0.0->ipython->watermark) (0.1.7)\n", "Requirement already satisfied: parso>=0.3.0 in ./venv3/lib/python3.6/site-packages (from jedi>=0.10->ipython->watermark) (0.3.1)\n", "Requirement already satisfied: ipython-genutils in ./venv3/lib/python3.6/site-packages (from traitlets>=4.2->ipython->watermark) (0.2.0)\n", "Requirement already satisfied: ptyprocess>=0.5 in ./venv3/lib/python3.6/site-packages (from pexpect; sys_platform != \"win32\"->ipython->watermark) (0.6.0)\n", "Info: Using the Jupyter notebook version of the tqdm() decorator, tqdm_notebook() ...\n", "Lilian Besson \n", "\n", "CPython 3.6.6\n", "IPython 7.1.1\n", "\n", "SMPyBandits 0.9.4\n", "\n", "compiler : GCC 8.0.1 20180414 (experimental) [trunk revision 259383\n", "system : Linux\n", "release : 4.15.0-38-generic\n", "machine : x86_64\n", "processor : x86_64\n", "CPU cores : 4\n", "interpreter: 64bit\n" ] } ], "source": [ "!pip install SMPyBandits watermark\n", "%load_ext watermark\n", "%watermark -v -m -p SMPyBandits -a \"Lilian Besson\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Local imports\n", "from SMPyBandits.Environment import EvaluatorMultiPlayers, tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need arms, for instance `Bernoulli`-distributed arm:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": false }, "outputs": [], "source": [ "# Import arms\n", "from SMPyBandits.Arms import Bernoulli" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And finally we need some single-player and multi-player Reinforcement Learning algorithms:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Import algorithms\n", "from SMPyBandits.Policies import *\n", "from SMPyBandits.PoliciesMultiPlayers import *" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "code_folding": [ 0 ] }, "outputs": [], "source": [ "# Just improving the ?? in Jupyter. Thanks to https://nbviewer.jupyter.org/gist/minrk/7715212\n", "from __future__ import print_function\n", "from IPython.core import page\n", "def myprint(s):\n", " try:\n", " print(s['text/plain'])\n", " except (KeyError, TypeError):\n", " print(s)\n", "page.page = myprint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For instance, this imported the `UCB` algorithm:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0;31mInit signature:\u001b[0m \u001b[0mUCBalpha\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnbArms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0malpha\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m4\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlower\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m0.0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mamplitude\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;36m1.0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m \n", "The UCB1 (UCB-alpha) index policy, modified to take a random permutation order for the initial exploration of each arm (reduce collisions in the multi-players setting).\n", "Reference: [Auer et al. 02].\n", "\u001b[0;31mInit docstring:\u001b[0m\n", "New generic index policy.\n", "\n", "- nbArms: the number of arms,\n", "- lower, amplitude: lower value and known amplitude of the rewards.\n", "\u001b[0;31mFile:\u001b[0m /tmp/SMPyBandits/notebooks/venv3/lib/python3.6/site-packages/SMPyBandits/Policies/UCBalpha.py\n", "\u001b[0;31mType:\u001b[0m type\n", "\n" ] } ], "source": [ "UCBalpha?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As well as the `CentralizedMultiplePlay` multi-player policy:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0;31mInit signature:\u001b[0m \u001b[0mCentralizedMultiplePlay\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mnbPlayers\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnbArms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mplayerAlgo\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0muniformAllocation\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mFalse\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m \n", "CentralizedMultiplePlay: a multi-player policy where ONE policy is used by a centralized agent; asking the policy to select nbPlayers arms at each step.\n", " \n", "\u001b[0;31mInit docstring:\u001b[0m\n", "- nbPlayers: number of players to create (in self._players).\n", "- playerAlgo: class to use for every players.\n", "- nbArms: number of arms, given as first argument to playerAlgo.\n", "- uniformAllocation: Should the affectations of users always be uniform, or fixed when UCB indexes have converged? First choice is more fair, but linear nb of switches, second choice is not fair, but cst nb of switches.\n", "- `*args`, `**kwargs`: arguments, named arguments, given to playerAlgo.\n", "\n", "Examples:\n", "\n", ">>> import sys; sys.path.insert(0, '..'); from Policies import *\n", ">>> s = CentralizedMultiplePlay(2, 3, UCB)\n", ">>> [ child.choice() for child in s.children ]\n", "[2, 0]\n", "\n", "- To get a list of usable players, use ``s.children``.\n", "- Warning: ``s._players`` is for internal use ONLY!\n", "\u001b[0;31mFile:\u001b[0m /tmp/SMPyBandits/notebooks/venv3/lib/python3.6/site-packages/SMPyBandits/PoliciesMultiPlayers/CentralizedMultiplePlay.py\n", "\u001b[0;31mType:\u001b[0m type\n", "\n" ] } ], "source": [ "CentralizedMultiplePlay?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also need a collision model. The usual ones are defined in the `CollisionModels` package, and the only one we need is the classical one, where two or more colliding users don't receive any rewards." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0;31mSignature:\u001b[0m \u001b[0monlyUniqUserGetsReward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mt\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0marms\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mplayers\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mchoices\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrewards\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mpulls\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcollisions\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mDocstring:\u001b[0m\n", "Simple collision model where only the players alone on one arm samples it and receives the reward.\n", "\n", "- This is the default collision model, cf. [[Multi-Player Bandits Revisited, Lilian Besson and Emilie Kaufmann, 2017]](https://hal.inria.fr/hal-01629733).\n", "- The numpy array 'choices' is increased according to the number of users who collided (it is NOT binary).\n", "\u001b[0;31mFile:\u001b[0m /tmp/SMPyBandits/notebooks/venv3/lib/python3.6/site-packages/SMPyBandits/Environment/CollisionModels.py\n", "\u001b[0;31mType:\u001b[0m function\n", "\n" ] } ], "source": [ "# Collision Models\n", "from SMPyBandits.Environment.CollisionModels import onlyUniqUserGetsReward\n", "\n", "onlyUniqUserGetsReward?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "## Creating the problem" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Parameters for the simulation\n", "- $T = 10000$ is the time horizon,\n", "- $N = 100$ is the number of repetitions (should be larger to have consistent results),\n", "- $M = 2$ is the number of players,\n", "- `N_JOBS = 4` is the number of cores used to parallelize the code." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "HORIZON = 10000\n", "REPETITIONS = 100\n", "NB_PLAYERS = 2\n", "N_JOBS = 4\n", "collisionModel = onlyUniqUserGetsReward" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Three MAB problems with Bernoulli arms\n", "We consider in this example $3$ problems, with `Bernoulli` arms, of different means.\n", "\n", "1. The first problem is very easy, with two good arms and three arms, with a fixed gap $\\Delta = \\max_{\\mu_i \\neq \\mu_j}(\\mu_{i} - \\mu_{j}) = 0.1$.\n", "2. The second problem is as easier, with a larger gap.\n", "3. Third problem is harder, with a smaller gap, and a very large difference between the two optimal arms and the suboptimal arms.\n", "\n", "> Note: right now, the multi-environments evaluator does not work well for MP policies, if there is a number different of arms in the scenarios. So I use the same number of arms in all the problems." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "ENVIRONMENTS = [ # 1) Bernoulli arms\n", " { # Scenario 1 from [Komiyama, Honda, Nakagawa, 2016, arXiv 1506.00779]\n", " \"arm_type\": Bernoulli,\n", " \"params\": [0.3, 0.4, 0.5, 0.6, 0.7]\n", " },\n", " { # Classical scenario\n", " \"arm_type\": Bernoulli,\n", " \"params\": [0.1, 0.3, 0.5, 0.7, 0.9]\n", " },\n", " { # Harder scenario\n", " \"arm_type\": Bernoulli,\n", " \"params\": [0.005, 0.01, 0.015, 0.84, 0.85]\n", " }\n", " ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Some RL algorithms\n", "We will compare Thompson Sampling against $\\mathrm{UCB}_1$, using two different centralized policy:\n", "\n", "1. `CentralizedMultiplePlay` is the naive use of a Bandit algorithm for Multi-Player decision making: at every step, the internal decision making process is used to determine not $1$ arm but $M$ to sample. For UCB-like algorithm, the decision making is based on a $\\arg\\max$ on UCB-like indexes, usually of the form $I_j(t) = X_j(t) + B_j(t)$, where $X_j(t) = \\hat{\\mu_j}(t) = \\sum_{\\tau \\leq t} r_j(\\tau) / N_j(t)$ is the empirical mean of arm $j$, and $B_j(t)$ is a bias term, of the form $B_j(t) = \\sqrt{\\frac{\\alpha \\log(t)}{2 N_j(t)}}$.\n", "\n", "2. `CentralizedIMP` is very similar, but instead of following the internal decision making for all the decisions, the system uses just the empirical means $X_j(t)$ to determine $M-1$ arms to sample, and the bias-corrected term (i.e., the internal decision making, can be sampling from a Bayesian posterior for instance) is used just for one decision. It is an heuristic, proposed in [[Komiyama, Honda, Nakagawa, 2016]](https://arxiv.org/abs/1506.00779)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "name": "stdout", "output_type": "stream", "text": [ " - One new child, of index 0, and class #1