Course number: COMS E6998.001, Columbia
Instructors: Alekh Agarwal and Alex Slivkins (Microsoft Research NYC)
Schedule: Wednesdays 4:10-6:40pm
Office Hours: TBD.
This course covers several foundational topics in sequential decision making. We start with multi-armed bandits, and proceed to contextual bandits, a generalization motivated by several real-world applications. We cover the main algorithms, performance guarantees and lower bounds, along with applications in industry. We also discuss “offline off-policy evaluation”: performance assessment techniques for real-world online systems. The remainder of the course concerns the more general problem of reinforcement learning, from small-state MDPs to recent extensions that handle large state spaces.
Prerequisites
Exposure to algorithms and proofs at the level of an undergraduate algorithms course (CSOR 4231), and a graduate ML course (COMS 4771) or current enrollment therein. If you do not meet these prerequisites, please email the instructors.
Coursework and assessment
Assessment is based on a final project, graded with a letter grade. The project will involve a mix of reading, coding, and/or original research on a course-related topic of your choice (more details later).
Topics
- Fundamental algorithms for multi-armed bandits:
- IID rewards: UCB1, Successive Elimination
- Adversarial rewards: multiplicative weights, follow-the-leader, EXP3/EXP4
- Bayesian/IID rewards: Thompson Sampling
- Advanced topics in multi-armed bandits (*)
- Lower bounds
- Structured action spaces (e.g., Lipschitz, linear, combinatorial)
- Global constraints ("bandits with knapsacks")
- Connections to game theory and mechanism design
- Contextual Bandits (CB)
- CB with classification oracle: Epsilon-greedy, "Taming the Monster"
- CB with linear rewards: LinUCB
- Off-Policy Evaluation
- Reinforcement Learning
- MDPs, Value functions, Value Iteration, Q-Learning
- R-MAX, lower bounds
- Bellman Rank and resulting algorithm
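To give a taste of the algorithms covered early in the course, here is a minimal sketch of UCB1 on a simulated Bernoulli bandit. The arm means, horizon, and function name below are illustrative choices for this sketch, not from the course materials; the index rule (empirical mean plus a confidence radius of sqrt(2 ln t / n)) is the standard UCB1 form.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run UCB1 on a simulated Bernoulli bandit with the given arm means.

    Returns (total reward collected, pull counts per arm) over `horizon` rounds.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k          # number of times each arm was pulled
    sums = [0.0] * k          # total reward observed from each arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1       # initialization: pull each arm once
        else:
            # UCB index: empirical mean + confidence radius sqrt(2 ln t / n)
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts
```

With a large gap between arm means (say 0.9 vs. 0.5), the confidence radius of the inferior arm shrinks slowly relative to its empirical deficit, so most pulls concentrate on the best arm; the course makes this precise via logarithmic regret bounds.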
The first half of the course will be based on chapters from this book draft. For the rest, reading and/or lecture notes will be provided.
Some resources on course-related topics are listed below:
- A survey on (non-Bayesian) bandits by Sebastien Bubeck and Nicolo Cesa-Bianchi (2012).
- Prediction, Learning, and Games: a book by Nicolo Cesa-Bianchi and Gabor Lugosi (2006).
- (Brief) lectures on bandit theory by Sebastien Bubeck: part 1 and part 2 (Spring'16).
- A survey on concentration inequalities by Fan Chung and Linyuan Lu (2010).
- Another survey on concentration inequalities by Colin McDiarmid (1998).
Recent courses with closely related content
- Shipra Agrawal, Learning and Optimization for Sequential Decision Making, Columbia University, IEOR dept, Spring 2016.
- Alex Slivkins, Bandits, Experts, and Games, University of Maryland, Computer Science Department, Fall 2016.
- Csaba Szepesvari and Tor Lattimore, Bandit Algorithms, University of Alberta (Fall 2016) and Indiana University (Spring 2017).