Bandits and Reinforcement Learning (Fall 2017)

Course number: COMS E6998.001, Columbia University
Instructors: Alekh Agarwal and Alex Slivkins (Microsoft Research NYC)
Schedule: Wednesdays 4:10-6:40pm
Location: TBD.
Office hours: TBD.
Q&A: TBD.
Contact: bandits-fa17-instr@microsoft.com.

Course description

This course covers several foundational topics in sequential decision making. We start with multi-armed bandits and proceed to contextual bandits, a generalization motivated by several real-world applications. We cover the main algorithms, performance guarantees, and lower bounds, along with industry applications. We also discuss “offline off-policy evaluation”: techniques for assessing the performance of real-world online systems. The remainder of the course concerns the more general problem of reinforcement learning, from small-state MDPs to recent extensions that handle large state spaces.

Prerequisites

Exposure to algorithms and proofs at the level of an undergraduate algorithms course (CSOR 4231), and a graduate machine learning course (COMS 4771) or current enrollment therein. If you do not meet these prerequisites, please email the instructors.

Coursework and assessment

A final project, assessed with a letter grade. The project will involve a mix of reading, coding, and/or original research on a course-related topic of your choice (more details later).

Tentative topics

  • Fundamental algorithms for multi-armed bandits:
    • IID rewards: UCB1, Successive Elimination (a UCB1 sketch appears after this list)
    • Adversarial rewards: multiplicative weights, follow-the-leader, EXP3/EXP4
    • Bayesian/IID rewards: Thompson Sampling
  • Advanced topics in multi-armed bandits (*)
    • Lower bounds
    • Structured action spaces (e.g., Lipschitz, linear, combinatorial)
    • Global constraints ("bandits with knapsacks")
    • Connections to game theory and mechanism design
    (*) Some, but not all, of these topics will be covered.
  • Contextual bandits (CB)
    • CB with a classification oracle: epsilon-greedy, "Taming the Monster"
    • CB with linear rewards: LinUCB
    • Off-policy evaluation (an IPS sketch appears after this list)
  • Reinforcement learning
    • MDPs, value functions, Value Iteration, Q-Learning (a value iteration sketch appears after this list)
    • R-MAX, lower bounds
    • Bellman rank and the resulting algorithm
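
For concreteness, here is a minimal Python sketch of UCB1 (not from the course materials; the Bernoulli arms and horizon below are made up for illustration). Each round, the algorithm pulls the arm maximizing the empirical mean plus a confidence radius sqrt(2 ln t / n):

    import math
    import random

    def ucb1(reward_fns, horizon):
        # reward_fns: one function per arm, each returning an IID reward in [0, 1]
        k = len(reward_fns)
        counts = [0] * k      # number of pulls per arm
        means = [0.0] * k     # empirical mean reward per arm
        total = 0.0
        for t in range(1, horizon + 1):
            if t <= k:
                arm = t - 1   # pull each arm once to initialize
            else:
                # UCB index: empirical mean + sqrt(2 ln t / #pulls)
                arm = max(range(k),
                          key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
            r = reward_fns[arm]()
            counts[arm] += 1
            means[arm] += (r - means[arm]) / counts[arm]  # incremental mean update
            total += r
        return total

    # Example: two Bernoulli arms with means 0.5 and 0.6.
    arms = [lambda: float(random.random() < 0.5),
            lambda: float(random.random() < 0.6)]
    print(ucb1(arms, 10000))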
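
The off-policy evaluation bullet also has a compact core. Below is a sketch of the inverse propensity scoring (IPS) estimator for a deterministic target policy; the log format (context, action, reward, logging probability) is an assumption made for illustration:

    def ips_estimate(logs, target_policy):
        # logs: (context, action, reward, prob) tuples, where prob is the
        # probability with which the logging policy chose the logged action
        total = 0.0
        for context, action, reward, prob in logs:
            if target_policy(context) == action:
                total += reward / prob  # reweight matching actions by 1/prob
        return total / len(logs)

If the logging probabilities are recorded correctly and bounded away from zero, this average is an unbiased estimate of the target policy's expected reward.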
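
On the reinforcement learning side, value iteration is the basic dynamic-programming routine: repeatedly apply the Bellman optimality update until the value function stops changing. A minimal sketch for a tabular MDP, with the input format (transition lists P[s][a] and expected rewards R[s][a]) chosen here for illustration:

    def value_iteration(num_states, num_actions, P, R, gamma=0.9, tol=1e-6):
        # P[s][a]: list of (next_state, probability) pairs; R[s][a]: expected reward
        V = [0.0] * num_states
        while True:
            delta = 0.0
            for s in range(num_states):
                # Bellman optimality update: V(s) = max_a [ R(s,a) + gamma * E[V(s')] ]
                best = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                           for a in range(num_actions))
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                return V

With discount factor gamma < 1 this update is a contraction, so the loop terminates with an approximation of the optimal value function.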

Resources

The first half of the course will be based on chapters from this book draft. For the rest, readings and/or lecture notes will be provided.

Some resources on course-related topics are listed below:

Recent courses with closely related content

 

coming ...