Bandits and Reinforcement Learning (Fall 2017)

Course number: COMS E6998.001, Columbia University
Instructors: Alekh Agarwal and Alex Slivkins (Microsoft Research NYC)
Schedule: Wednesdays 4:10-6:40pm
Location: 404 International Affairs Building.
Office Hours
: after each class, and online (see below).
Q&A and announcements: we use a Piazza site (sign up). We are not using Courseworks.

Online office hours

[Alekh] Fridays    10:30-11:30am by appointment.
[Alex]   Mondays 11am-noon by appointment (over Skype).

Please send us an email [at least] the night before.

Please consider asking via Piazza first, for everyone to see a reply.

Course description

This course covers several foundational topics in sequential decision making. We start with multi-armed bandits, and proceed to contextual bandits, a generalization motivated by several real-world applications. We cover the main algorithms, performance guarantees and lower bounds, along with applications in the industry. We also discuss “offline off-policy evaluation”: performance assessment techniques for real-world online systems. The remainder of the course concerns the more general problem of reinforcement learning, from small-state MDPs to recent extensions that handle large state spaces.

Throughout, the course will primarily focus on the theoretical foundations, with some discussion of the practical aspects.


Exposure to algorithms and proofs at the level of an undergraduate algorithms course (CSOR 4231). Graduate ML course (COMS 4771) or current enrollment therein. If you do not meet these, please email the instructors.

Coursework and assessment

A final project, letter grade. The project will involve a mix of reading, coding, and/or original research on a course-related topic of your choice.

Tentative topics

  • Fundamental algorithms for multi-armed bandits:
    • IID rewards: UCB1, Successive Elimination
    • adversarial rewards: multiplicative weights, follow-the-leader, EXP3/EXP4
    • Bayesian/IID rewards: Thompson Sampling
  • Advanced topics in multi-armed bandits (*)
    • Lower bounds
    • structured action spaces (e.g., Lipschitz, linear, combinatorial)
    • global constraints ("bandits with knapsacks")
    • connections to game theory and mechanism design
    (*) some but not all of these topics will be covered
  • Contextual Bandits (CB)
    • CB with classification oracle: Epsilon-greedy, "Taming the Monster"
    • CB with linear rewards: LinUCB
    • Off-Policy Evaluation
  • Reinforcement Learning
    • MDPs, Value functions, Value Iteration, Q-Learning
    • R-MAX, lower bounds
    • Bellman Rank and resulting algorithm


Some of the lectures will be based on chapters from this book draft. For the rest, reading and/or lecture notes will be provided.

Some resources on course-related topics are listed below:

Concurrent and upcoming courses at Columbia

Our course focuses more heavily on contextual bandits and off-policy evaluation than either of these, and is complimentary to these other offerings.

Recent courses with closely related content


Lecture #1 (Sep 6 | Alex):  Class organization and intro to the problem space (slides).

[Brief and very basic] recap of probability and concentration inequalities (slides).
Understanding these slides is necessary for the rest of the course!

Lecture #2 (Sep 13 | Alex): Multi-armed bandits with IID rewards. 
Main reference: chapter 2 of Alex's book draft (revised Sep 15).

Additional references:

Lecture #3 (Sep 20 | Alex): Lower bounds on regret
Main reference: chapter 3 of Alex's book draft (revised Sep 20).

Lecture #4 (Sep 27 | Alex) Online learning with experts (i.e., bandits with full feedback & adversarial rewards)
Main reference: chapter 6 of Alex's book draft (revised Sep 28).

Additional reading: for online learning with a very large number of experts, see Chapter 8.4. For bandit problems with structure, see Interlude A.

Lecture #5 (Oct 4 | Alex) Adversarial bandits
Main reference: chapter 7 of Alex's book draft (revised Oct 5).

Lecture #6 (Oct 11 | Alekh) Contextual bandits basics Lecture Notes. Slides for first part.

Lecture #7 ( Oct 18 | Alekh) Off-policy evaluation and learning Lecture Notes.

Lecture #8 ( Oct 25 | Alekh) TBD

Lecture #9 ( Nov 1 | Alekh) TBD

Lecture #10 (Nov 8 | Alekh) TBD

Lecture #11 (Nov 15 | Alekh) TBD

Lecture #12 (Nov 29 | Alex) Bandits & incentives (tent.)

Lecture #13 (Dec 6 | Alex) project presentations

Sign-up sheet: Please use this sign-up sheet ("projects" tab) to help us keep track of the projects. Please update it as appropriate, including your names/emails, brief project description (or the current/tentative version thereof), and project status. Also, use this sheet to to find partners and to be aware of each other's topics and progress.

Deliverables & due dates
  1. Initial discussion regarding the possible project topics (probably via email and/or Skype). 

    When: the week of Mon, Oct 23 (more details coming up soon).

    See below for advice on choosing a topic and topic suggestions.

  2. Project proposal: a short document explaining the topic and the proposed approach (due by the end of Oct 31).
  3. Mid-project discussion: optional but encouraged, especially if you are doing simulations or research. (Week of Mon, Nov 13 -- sign-up sheet will be up.)
  4. Final report: a short (3-7 pp long) academic-style paper describing the background and what you've done for the project. (Tentative deadline: Dec 5).
  5. Final presentation: A presentation during the last class (Dec 6).

Please do not hesitate to send us questions at any point.

Our crew. The projects will be supervised by the two instructors, and two more people: Steven Wu and Nan Jiang. There will be one supervisor for each project, to which all project-related communication should be directed.   

Choosing a topic.  The project topic can be anything directly related to bandits, contextual bandits, or RL, subject to our approval. (If you are unsure whether a given topic is OK,  please ask.) Generally, there are three types of projects:

  1. Reading (literature review): read up on a given sub-topic.  What is are the problem formulations and how they are motivated? What are the main results and the crucual techniques to obtain these results? What are the open questions? What are the papers in this line of work, and who did what?
  2. Experiments:  pick a sub-topic and several algorithms for this sub-topic, run experiments with these algorithms to see which ones are better (for which "regimes"). You can use simulated data and/or some of the real datasets.
  3. Original research: pick an open problem and try to make a progress on it. Could be theoretical and/or experimental, up to you. Such a project is likely to involve some amount of reading, to figure out what is known and what is likely to help. It is great if by the end of the class you make a substantial progress towards a future publication, but it is totally OK to only have preliminary results, or no results at all. (After all, good research is rarely done in a few weeks.) 

These three types are not mutually exclusive: an original research may involve much reading; and reading- and experimets- type projects may be driven by a specific research question, and/or may help towards research projects in the future.

To keep #projects managable, we kindly ask you to work in  groups of three of more. We may allow, at our discretion, a few groups of two for some simulations or research projects. Working in groups tends to be more productive, more fun and also a useful collaborative experience.

If you are working on a course-related research project with some collaborators who are not attending this course, you are encouraged to submit an account of this research project as the "class project". But please ask one of us first.

Project proposal. For a reading project, mention a few papers to start with. For simulations, mention a few algorithms, a few types of problem instances you are planning to consider (if doing simulations), and the data sources you are planning to use (if working with real data). For an open research project, describe the open problem and how you plan/hope to go about solving it. While a project may evolve and morph compared to the proposal, a good proposal  usually helps to organize your thinking.

NB: It is OK if an original research project evolves into a reading and/or experiments project and vice versa. However, please keep us posted.

NB: Sometimes a bigh chunk of the overall work for the project happens before (and while) you write the project proposal. So, don't be alarmed if generating a project proposal takes much work, this probably means that you'd have less work to do later!

A page of text should usually suffice, possibly less. Please use LaTeX, and insert bibliography as appropriate. Send the PDF to, with a subject like "project proposal (name1, name2, name3)".

Topic suggestions. [We may modify these over time... ]
  • [reading: bandits with auxiliary structure] much literature on bandits assumes auxiliary structure, compared to the "standard" bandit model. Algorithms can take advantage of such structure, e.g., to cope with a huge #actions (cf.  Interlude A in Alex's book). Examples include:  Take one type of structure (e.g., linear bandits), and read up on it.
  • [reading & simulations: bandits with auxiliary data] In many scenarios, algorithm has access to extra data, compared to the "standard" bandit model. How to leverage this data? For example:
    • after each round, rewards for several arms may be revealed, not just for the chosen arm. (e.g.: Bandits with feedback graphs)
    • some exploration data may be available before the algorithm starts,  (e.g. Exploration Scavenging)
    • some exploration may be happening anyway, in parrallel with the algorithm (e.g., because different people just try out different things).
  • [reading and simulations: bandits with structured actions] in many applications, actions have a specific structure, e.g., action is a slate of search results. There are several ways to model this, depending on:
    • how users react to the slate E.g., click once and leave, or consider each item no matter what? (e.g.: Cascading models)
    • what is the reward metric? E.g., does it help to have two clicks on the slate?
    • what feedback is observed? E.g., all clicks, or only the total reward? (e.g.: Slates, Semibandits)
    • [simulations] Take a real-world ranking dataset, such as from MSLR. Simulate different user models on this data, and evaluate the effect of an algorithm tailored to that model, an algorithm just operating on global rewards of the slate, and a greedy algorithm.
    This has been studied both with and without contexts. (Current interests: Alekh.)
  • [reading: general techniques] same technique may appear in many settings, from bandits to contextual bandits to RL.  Prominent examples: 
    • multiplicative weights (e.g., see this survey).
    • follow the perturbed/reguralized leader (e.g., see ch. 8 from Alex's book).
    • UCB / optimism under uncertainty (e.g., see ch. 5 and ch. 10 in Alex's book for applications beyond the basic bandit model).
    • non-adaprive exploration (explore-first, epsilon-greedy, etc.)
    • Thompson Sampling (start with ch. 4 of Alex's book).
  • Take one such technique and read up on it.
  • [simulations: Thompson Sampling] a famous bandit algorithm called "Thompson Sampling" relies on exactly sampling from Bayesian posteriors on mean rewards. Whereas in many practical applications one can only sample approximately. What is the effect of approximate sampling on algorithm's performance?
  • [bandits & social learning] As people make decisions over time (e.g., where to eat today?) they utilize information revealed by others in the past (think Yelp) and produce information that may help others in the future (e.g., leave a review). Do they explore enough for the common good? How to incentivize them to explore more? What if a bandit algorithm issues recommendations, but people do not have to follow them?
    • [reading]  start with this informal article and this tutorial (part I and part II).
    • [simulations] see what happens under different models of human behavior, and different ways to present or censor past observations. Such simulations would plug nicely into one of our active research projects.
    (Current interests: Alex, Steven.)
  • [reading: bandits & fairness] bandit algorithms often make choices for people (e.g., web search results) or among people (whom to give a loan?). Are some individuals or groups treated unfairly? What "fairness" means, and how to ensure it? (e.g., see this paper and that paper).
    (Current interests: Alekh, Alex, Steven.)
  • [online learning & games] what if an online learning algorithm is playing a game against an adversary or another algorithm? What does this process converge to, and how fast?
    • [reading] Start with this lecture note and that paper.
    • [simulations] what if the algorithm is best-responding to the "empirical play" (time-averages) of the adversary? (Current interests: Steven)
  • [reading: online learning & economic mechanisms] what if a bandit / online learning algorithm ...
    • ... is used in an auction, either by the auction maker or by auction participants? (e.g., see this paper and that paper)
    • ... is used to determine prices & offers for sale? (e.g., see this paper and references therein).
    • ... determines who does what and at which price at a crowdsourcing market? (e.g., see this survey)
    (Current interests: Alex) 
  • [reading, simulations and research: contextual bandits theory] Possible topics include:
    • [reading] beyond IID: adversarial contexts and/or rewards (e.g.: BISTRO+ and predecessors).
    • [reading] algorithms that adapt to data, achieving better performance for "nice" problem instances (e.g.: first-order regret bounds, open problem, CORRAL).
    • [simulations] How do the adversarial algorithms behave on i.i.d. contexts and rewards, compared with algorithms for those settings? How do the i.i.d. algorithms work on non-i.i.d. and adversarial instances? Which seems more practically useful?
    • [research] Study the implicit exploration technique and try combining it with efficient contextual bandit methods as a replacement for uniform exploration? Does this lead to improvements in a structured action setting? You can also empirically evaluate the resulting approach in lieu of theory, or both.
    (Current interests: Alekh) 
  • [reading and simulations: contextual bandits applications and deployments] Possible topics include:
    • [reading] Contextual bandits in the wild: deployed systems and learnings (e.g.: Decision Service paper, Technical debt paper, ICML Tutorial)
    • [reading/simulations] Contextual bandit applications ( See for a partial ref of application papers: ICML Tutorial. Pick one or more papers around an application, and run simulations with some algorithms you have learned in this course on that application.
    (Current interests: Alekh) 
  • [reading and research: Off-policy evaluation] How to best evaluate policies given logged data?
    • [reading] Biased estimators (e.g.: doubly robust with bias, SWITCH)
    • [reading] Applications in Information Retrieval (see e.g.: Hofmann et al. monograph)
    • [research] How to improve our evaluation approach for exploration algorithms using biased algorithms like SWITCH? Can you analyze in theory if the exploration algorithm has some nice properties? (a starting point for such a property might be the structural assumption in the CORRAL paper).
    • [research] How to extend the minimax lower bounds from the SWITCH paper for structural conditions on rewards such as linear?
    (Current interests: Alekh) 
  • [reading, simulations and research: Off-policy evaluation] How to best learn good policies given logged data?
    • [reading] Reducing the variance in off-policy learning (see e.g. Self-normalized poem)
    • [reading] Can you extend the IPS-based off-policy learning ideas to further incorporate a variance term like the work above? You might get some inspiration from the EXP4.P paper.
    • [reading] Study the effect of using different off-policy learning methods inside your favorite contextual bandit algorithm (such as to train the greedy policy in epsilon-greedy).
    (Current interests: Alekh) 
  • [reading, simulations and research: Exploration in RL] Read on and experiment with different exploration algorithms for reinforcement learning
    • [reading] Exploration in small-state MDPs (e.g.: E3, R-MAX, UCRL, Delayed-Q learning)
    • [reading] Exploration with complex observations (e.g.: Eluder dimension, Bellman Rank, Curiosity driven exploration)
    • [simulations] Take some standard RL benchmark problems where exploration matters. Compare an algorithm like R-MAX or UCRL with a (contextual) bandit algorithm. Can you try the same for an Atari game?
    • [research] Come up with a theoretical model to analyze an algorithm like the curiosity-driven exploration paper. Hint: try to capture the idea that only some aspects of an observation might matter to the optimal policy/value function given a task.
    (Current interests: Alekh, Nan) 
  • [reading: Predictive State Representations] What constitutes a state in RL? Is there a procedural way to define states in RL? PSRs provide one such notion (e.g.: Original PSR paper, Spectral learning for PSRs)
  • (Current interests: Alekh, Nan) 
  • [reading: Hierarchical Reinforcement Learning] If the planning horizon is long, and the task you are trying to accomplish has re-usable subtasks, can you effectively leverage this structure? (see e.g. Options paper, General value functions and application to Atari)
  • (Current interests: Alekh, Nan) 
  • [reading/simulations: Proper evaluation of RL algorithms] The training-validation-test paradigm in supervised learning is robust to many types of experimental errors. We have repositories of standardized dataset splits, which makes it easier to compare experiments across papers. RL agents generate their own data, and are often run in simulators which makes it harder to do experiments which can be compared. How to make sure that our results are genuinely interesting? Do you want to reproduce the results of your favorite empirical RL paper according to a recommended methodology? (see e.g.: Deep RL that matters, Atari evaluation paper)
  • [reading: Policy improvement and gradient]
  • Given a starting policy supplied by an expert at training time, how does an RL agent learn a policy as good as it or better? Can the agent bootstrap from an expert and continue to get even better over time?

Final report.  The final report should clearly and concisely describe the project: what was the chosen topic, why it is interesting, what you've set out to accomplish, what you've learned/understood/accomplished, and what it means in a broader context.

The report should resemble a research paper in terms of style and presentation. Try to make it accessible to students in this course: explain things clearly, and try not assume background beyond what was covered in class.

  • "Introduction" section should state what is the topic/problem, why it is interesting, background and [some] prior work with appropriate citations, what you've set out to accomplish, and (on a high level) what you've accomplished.
  • "Preliminaries" section is a place to introduce models and/or notation. Also, if you'll be using tools that are somewhat "advanced" for this course (e.g., martingales, linear duality, KL-divergence) mention them here, and perhaps explain briefly what it is and how you'd be using it.
  • "Body" of the report presents all the details.

For new results: explain what the results mean, be as explicit as possible about why they are interesting given the prior work.  Include enough detail:

  • for theory: precise theorem statements, and full proofs, or at least proof sketches that contain the main ideas,
  • for simulations/experiments: detailed experimental setup, algorithms used, and what was observed.

If you tried some approach and it didn't work out, you can write about it, too! Try to explain why it didn't work and what are the obstacles; present lessons / conclusions, should you have any. 

For reading projects: what is the goal of your project? Clearly state and cite the papers you've looked at. Try to state the main results as precisely and explicitly as possible (but also try to be succinct, and do use informal descriptions if needed). Explain what these results mean. What are the main techniques to achieve them. Ideally, also what are the cool open problems in the area, and why prior work falls short in solving them.

Mapping out the relevant work is an imporant part of the project. In particular, when citing a line of work, cite several papers, not just one; ideally, the initial paper or two, and a few of the most important follow-ups. Cite surveys whenever applicable. DBLP is a good tool to track down correct references (e.g., do not cite an arxiv version if there is a conference publication), and also to get BibTeX entries.

Formalities. The document should be 3-5 pp long, not including fugures and references, in 11pt font with 1-inch margins and single spacing. Include full names and emails for all authors. LaTeX strongly preferred (but do talk to me if this is a problem). Then the desired formatting can be achieved by 
BibTeX is recommended for bibliography.

Submission. Submit PDF file by email to, with a subject like "final project report (supervisor: name1, name2, name3)".

Project presentation. Each project will need to make a presentation in the final class, via a (very) short talk or a poster. We will provide a more specific plan once we see how many projects we have.



Homeworks are for self-study only: they will not be not collected or graded. However, we will be happy to discuss the solutions in office hours and/or on Piazza.

Homework 1: out Sep 21.

Homework 2: out Oct 6.