**Course number**: COMS E6998.001, Columbia University

**Instructors**: Alekh Agarwal and Alex Slivkins (Microsoft Research NYC)

**Schedule**: Wednesdays 4:10-6:40pm

**Location**: 404 International Affairs Building

**Office hours**: after each class, and online (see below).

**Q&A and announcements**: we use a Piazza site (sign up). We are **not** using Courseworks.

**Contact**: bandits-fa17-instr@microsoft.com.

**Online office hours**

[Alekh] Fridays 10:30-11:30am, by appointment.

[Alex] Mondays 11am-noon, by appointment (over Skype).

Please email us at least the night before.

Please consider asking via Piazza first, so that everyone can see the reply.

**Course description**

This course covers several foundational topics in sequential decision making. We start with multi-armed bandits, and proceed to contextual bandits, a generalization motivated by several real-world applications. We cover the main algorithms, performance guarantees, and lower bounds, along with industry applications. We also discuss "offline off-policy evaluation": techniques for assessing the performance of real-world online systems. The remainder of the course concerns the more general problem of reinforcement learning, from small-state MDPs to recent extensions that handle large state spaces.

*Throughout, the course will primarily focus on the
theoretical foundations, with some discussion of the practical aspects.*
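To give a concrete taste of the off-policy evaluation topic mentioned above, here is a minimal sketch of the inverse propensity scoring (IPS) estimator, one standard technique in this area. The function name and log format are illustrative, not taken from the course materials.

```python
def ips_estimate(logs, target_policy):
    """Inverse propensity scoring (IPS): estimate the value of a new policy
    from data logged by a different, randomized policy.

    `logs` is a list of (context, action, reward, logging_prob) tuples, where
    `logging_prob` is the probability with which the logging policy chose
    `action`. `target_policy(context, action)` returns the probability the
    new policy assigns to the logged action in that context.
    (Both names are illustrative.)
    """
    total = 0.0
    for context, action, reward, logging_prob in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the target policy was to choose the logged action.
        total += reward * target_policy(context, action) / logging_prob
    return total / len(logs)
```

For example, if the logging policy chose uniformly between two actions and we evaluate a policy that always plays action 0, the reweighting recovers that policy's expected reward without ever running it online.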

**Prerequisites**

Exposure to algorithms and proofs at the level of an undergraduate algorithms course (CSOR 4231), and a graduate ML course (COMS 4771) or current enrollment therein. If you do not meet these prerequisites, please email the instructors.

**Coursework and assessment**

A final project, for a letter grade. The project will involve a mix of reading, coding, and/or original research on a course-related topic of your choice.

**Tentative topics**

- Fundamental algorithms for multi-armed bandits
  - IID rewards: UCB1, Successive Elimination
  - Adversarial rewards: multiplicative weights, follow-the-leader, EXP3/EXP4
  - Bayesian/IID rewards: Thompson Sampling

- Advanced topics in multi-armed bandits (*)
  - Lower bounds
  - Structured action spaces (e.g., Lipschitz, linear, combinatorial)
  - Global constraints ("bandits with knapsacks")
  - Connections to game theory and mechanism design

- Contextual bandits (CB)
  - CB with a classification oracle: epsilon-greedy, "Taming the Monster"
  - CB with linear rewards: LinUCB
  - Off-policy evaluation

- Reinforcement learning
  - MDPs, value functions, Value Iteration, Q-learning
  - R-MAX, lower bounds
  - Bellman rank and the resulting algorithm
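To give a flavor of the first topic, here is a minimal sketch of UCB1 for IID rewards in [0, 1]. The bandit interface (`pull`) and variable names are our own, for illustration only.

```python
import math

def ucb1(pull, n_arms, horizon):
    """UCB1: play each arm once, then always pull the arm with the highest
    upper confidence bound, mean + sqrt(2 ln t / n_pulls).

    `pull(arm)` returns a reward in [0, 1] for the chosen arm
    (an illustrative interface, not from the course materials)."""
    counts = [0] * n_arms      # times each arm was pulled
    sums = [0.0] * n_arms      # total reward per arm
    history = []
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1        # initialization: try every arm once
        else:
            # Optimism under uncertainty: empirical mean plus a bonus
            # that shrinks as an arm is pulled more often.
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
        history.append((arm, reward))
    return history, counts
```

Run against a toy two-armed bandit with means 0.1 and 0.9, the algorithm quickly concentrates its pulls on the better arm while still occasionally exploring the worse one.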

**Resources**

Some of the lectures will be based on chapters from this book draft. For the rest, reading and/or lecture notes will be provided.

Some resources on course-related topics are listed below:

- A survey on (non-Bayesian) bandits by Sebastien Bubeck and Nicolo Cesa-Bianchi (2012).
- Prediction, Learning, and Games: a book by Nicolo Cesa-Bianchi and Gabor Lugosi (2006).
- (Brief) lectures on bandit theory by Sebastien Bubeck: part 1 and part 2 (Spring'16).
- A survey on concentration inequalities by Fan Chung and Linyuan Lu (2010).
- Another survey on concentration inequalities by Colin McDiarmid (1998).

**Concurrent and upcoming courses at Columbia**

- Daniel Russo is teaching a course on dynamic optimization and reinforcement learning in Fall'17 (in the Business School).
- Shipra Agrawal will be teaching a course on reinforcement learning in Spring'18 (in the IEOR department).

Our course focuses more heavily on contextual bandits and off-policy evaluation than either of these, and is complementary to those offerings.

**Recent courses with closely related content**

- Shipra Agrawal, Learning and Optimization for Sequential Decision Making, Columbia University, IEOR dept, Spring 2016.
- Alex Slivkins, Bandits, Experts, and Games, University of Maryland, Computer Science Department, Fall 2016.
- Csaba Szepesvari and Tor Lattimore, Bandit Algorithms, University of Alberta (Fall 2016) and Indiana University (Spring 2017).