Bandits and Reinforcement Learning (Fall 2017)

Roster/waitlist situation (Sept 18, 11am): enough spots on the roster for everyone who is still on the waitlist. We do not have access to the waitlist any more, so please check to make sure that you officially get on the roster.

Course number: COMS E6998.001, Columbia University
Instructors: Alekh Agarwal and Alex Slivkins (Microsoft Research NYC)
Schedule: Wednesdays 4:10-6:40pm
Location: 404 International Affairs Building.
Office Hours
: after each class, and online (see below).
Q&A and announcements: we use a Piazza site (sign up). We are not using Courseworks.

Online office hours

[Alekh] Fridays    10:30-11:30am by appointment.
[Alex]   Mondays 11am-noon by appointment (over Skype).

Please send us an email [at least] the night before.

Please consider asking via Piazza first, for everyone to see a reply.

Course description

This course covers several foundational topics in sequential decision making. We start with multi-armed bandits, and proceed to contextual bandits, a generalization motivated by several real-world applications. We cover the main algorithms, performance guarantees and lower bounds, along with applications in the industry. We also discuss “offline off-policy evaluation”: performance assessment techniques for real-world online systems. The remainder of the course concerns the more general problem of reinforcement learning, from small-state MDPs to recent extensions that handle large state spaces.

Throughout, the course will primarily focus on the theoretical foundations, with some discussion of the practical aspects.


Exposure to algorithms and proofs at the level of an undergraduate algorithms course (CSOR 4231). Graduate ML course (COMS 4771) or current enrollment therein. If you do not meet these, please email the instructors.

Coursework and assessment

A final project, letter grade. The project will involve a mix of reading, coding, and/or original research on a course-related topic of your choice.

Tentative topics

  • Fundamental algorithms for multi-armed bandits:
    • IID rewards: UCB1, Successive Elimination
    • adversarial rewards: multiplicative weights, follow-the-leader, EXP3/EXP4
    • Bayesian/IID rewards: Thompson Sampling
  • Advanced topics in multi-armed bandits (*)
    • Lower bounds
    • structured action spaces (e.g., Lipschitz, linear, combinatorial)
    • global constraints ("bandits with knapsacks")
    • connections to game theory and mechanism design
    (*) some but not all of these topics will be covered
  • Contextual Bandits (CB)
    • CB with classification oracle: Epsilon-greedy, "Taming the Monster"
    • CB with linear rewards: LinUCB
    • Off-Policy Evaluation
  • Reinforcement Learning
    • MDPs, Value functions, Value Iteration, Q-Learning
    • R-MAX, lower bounds
    • Bellman Rank and resulting algorithm


Some of the lectures will be based on chapters from this book draft. For the rest, reading and/or lecture notes will be provided.

Some resources on course-related topics are listed below:

Concurrent and upcoming courses at Columbia

Our course focuses more heavily on contextual bandits and off-policy evaluation than either of these, and is complimentary to these other offerings.

Recent courses with closely related content


Lecture #1 (Sep 6 | Alex):  Class organization and intro to the problem space (slides).

[Brief and very basic] recap of probability and concentration inequalities (slides).
Understanding these slides is necessary for the rest of the course!

Lecture #2 (Sep 13 | Alex): Multi-armed bandits with IID rewards. 
Main reference: chapter 2 of Alex's book draft (revised Sep 15).

Additional references:

Lecture #3 (Sep 20 | Alex): Lower bounds on regret
Main reference: chapter 3 of Alex's book draft (revised Sep 20).

Lecture #4 (Sep 27 | Alex) The best-expert problem (full feedback & adversarial rewards)
Main reference: chapter 6 of Alex's book draft.

Lecture #5 (Oct 4 | Alex) Adversarial bandits
Main reference: chapter 7 of Alex's book draft.

Lecture #6 (Oct 11 | Alekh) TBD

Lecture #7 ( Oct 18 | Alekh) TBD

Lecture #8 ( Oct 25 | Alekh) TBD

Lecture #9 ( Nov 1 | Alekh) TBD

Lecture #10 (Nov 8 | Alekh) TBD

Lecture #11 (Nov 15 | Alekh) TBD

Lecture #12 (Nov 29 | Alex) Bandits & incentives (tent.)

Lecture #13 (Dec 6 | Alex) project presentations

coming ...