Sign-up sheet: Please use
this sign-up sheet
("projects" tab) to help us
keep track of the projects. Please update it as appropriate,
including your names/emails, a brief project description (or the
current/tentative version thereof), and project status. Also,
use this sheet to find partners and to stay aware of each
other's topics and progress.
Deliverables & due dates
- Initial discussion regarding the possible project
topics (the week of Mon, Oct 23). See below for advice
on choosing a topic and
topic suggestions.
- Project proposal: a short document explaining the
topic and the proposed approach (due
by the end of Oct 31).
- Mid-project discussion: optional but
encouraged, especially if you are doing simulations or
research (week of Mon, Nov 13).
- Final report: a short (3-7 pp long) academic-style
paper describing the background and what you've done for the
project. (Deadline: Dec 6).
- Final presentation:
A presentation during the last class (Dec 6).
[presentation
schedule ("presentations" tab)]
Please do not hesitate to send us questions at any point.
Project
presentation. Each project team will give a
presentation in the final class.
- Please aim for 9 minutes or so. We will allocate 10
minutes for each presentation, including questions and
change-over.
- We recommend one speaker for each presentation; it is up
to you to decide who presents. It is fine if all of you come
forward for your project's presentation.
- Please prepare slides. You will present from your own
laptop. (As usual, Mac-based folks may need a VGA
connector.)
After the presentation, please submit the slides. Submit as
a PDF, by
email to
bandits-fa17-instr@microsoft.com, with a subject like "presentation
slides (supervisor: name1, name2, name3)".
Please prepare in advance -- this is your chance to tell the
class about the exciting stuff you've done and/or learned about!
Some generic advice:
- Aim to make your presentation as accessible as possible.
Explain what the topic/problem is and why it is exciting.
Transmitting the high-level bits and excitement is more
important in a short talk than going into any details.
- Try to distill crucial ideas and find simple high-level
explanations for them. Include hard details if you can state
them clearly and concisely, in a way accessible to the
class. But if you know that this or that technical statement
is not likely to be understood, then no need to say it.
- It is totally OK to “sweep things under the rug” if it
helps you transmit crucial points.
It is totally OK to say “we’ve been trying to do this, but
we don’t know how to do it”.
Re slides:
- Slides are useful, among other things, to remind you
what you are supposed to say!
- Strive to make the slides
clear and non-crowded. Write the main points, but no need to
write full sentences, or to put on the slide everything that
you intend to say.
- If you need to discuss a formula, it is often useful to
put it on a slide and point to it.
- Use a large enough font size, so that the slides are visible
from the back row.
- No need to make slides fancy. In particular, a bulleted
list with plain text is totally fine!
Final report.
The final report should
clearly and concisely describe the project: what was the chosen
topic, why it is interesting, what you've set out to accomplish,
what you've learned/understood/accomplished, and what it means
in a broader context.
The report should resemble a
research paper in terms of style and presentation. Try to make
it accessible to students in this course: explain things
clearly, and try not to assume background beyond what was covered
in class.
- "Introduction" section should state what is the
topic/problem, why it is interesting, background and [some]
prior work with appropriate citations, what you've set out
to accomplish, and (on a high level) what you've
accomplished.
- "Preliminaries" section is a place to introduce
models and/or notation. Also, if you'll be using tools that
are somewhat "advanced" for this course (e.g., martingales,
linear duality, KL-divergence) mention them here, and
perhaps explain briefly what it is and how you'd be using
it.
- "Body" of the report presents all the details.
For new results: explain what the results mean, be
as explicit as possible about why they are interesting given the
prior work. Include enough detail:
- for theory: precise
theorem statements, and full proofs, or at least proof sketches
that contain the main ideas,
- for simulations/experiments:
detailed experimental setup, algorithms used, and what was
observed.
If you tried some approach and it didn't work out, you can write
about it, too! Try to explain why it didn't work and what
the obstacles are; present lessons/conclusions, if you have
any.
For reading projects: what is the goal of your project?
Clearly state and cite the papers you've looked at. Try to state
the main results as precisely and explicitly as possible (but
also try to be succinct, and do use informal descriptions if
needed). Explain what these results mean. What are the main
techniques used to achieve them? Ideally, also discuss the
cool open problems in the area, and why prior work falls short
of solving them.
Mapping out the relevant work is an important part of the
project. In particular, when citing a line of work, cite several
papers, not just one; ideally, the initial paper or two, and a
few of the most important follow-ups. Cite surveys whenever
applicable. DBLP is a good tool to track down correct references
(e.g., do not cite an arXiv version if there is a conference
publication), and also to get BibTeX entries.
Formalities. The document should be 3-5 pp long, not
including figures and references, in 11pt font with 1-inch
margins and single spacing. Include full names and emails for
all authors. LaTeX is strongly preferred (but do talk to me if
this is a problem). The desired formatting can be achieved by
\documentclass[11pt,letterpaper]{article}
\usepackage{fullpage}.
BibTeX is recommended for
the bibliography.
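For concreteness, here is a minimal skeleton with these settings; the section names and placeholders are only suggestions, not requirements:

  \documentclass[11pt,letterpaper]{article}
  \usepackage{fullpage}
  \begin{document}
  \title{Project Title}  % placeholder
  \author{Full Name One \\ \texttt{email1} \and Full Name Two \\ \texttt{email2}}
  \maketitle
  \section{Introduction}
  \section{Preliminaries}
  \section{Main Results}  % the "body"
  % bibliography via BibTeX, assuming a refs.bib file:
  \bibliographystyle{plain}
  \bibliography{refs}
  \end{document}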
Submission. Submit PDF file by
email to
bandits-fa17-instr@microsoft.com, with a subject like "final
project report (supervisor: name1, name2, name3)".
Our crew. The projects will be supervised by
the two instructors, and two more people:
Steven Wu and
Nan
Jiang. There will be one supervisor for each project, to
whom all project-related communication should be directed.
Choosing a topic. The project topic can be
anything directly related to bandits, contextual bandits, or RL,
subject to our approval. (If you are
unsure whether a given topic is OK, please
ask.) Generally, there are three types of projects:
- Reading (literature review): read up on a given
sub-topic. What are the problem formulations, and
how are they motivated? What are the main results and the
crucial techniques to obtain these results? What are the
open questions? What are the papers in this line of work,
and who did what?
- Experiments: pick a sub-topic and several
algorithms for this sub-topic, run experiments with these
algorithms to see which ones are better (for which
"regimes"). You can use simulated data and/or some of the
real datasets.
- Original research: pick an open problem and try
to make progress on it. Could be theoretical and/or
experimental, up to you. Such a project is likely
to involve some amount of reading, to figure out what is
known and what is likely to help. It is great if by the end
of the class you make substantial progress towards a
future publication, but it is totally OK to only have
preliminary results, or no results at all. (After all, good
research is rarely done in a few weeks.)
These three types are not mutually exclusive: an original
research project may involve much reading, and reading-
and experiments-type projects may be driven by a
specific research question and/or may help towards research
projects in the future.
To keep #projects manageable, we kindly ask you to work
in groups of three or more.
We may allow, at our discretion, a few groups of two for some
simulations or research projects. Working in groups tends to be
more productive and more fun, and is also a useful collaborative
experience.
If you are working on a course-related research project with
some collaborators who are not attending this course, you are
encouraged to submit an account of this research project as the
"class project". But please ask one of us first.
Project proposal. For a reading project,
mention a few papers to start with. For simulations, mention a
few algorithms and a few types of problem instances you are
planning to consider, as well as the data
sources you are planning to use (if working with real data). For
an open research project, describe the open problem and how you
plan/hope to go about solving it. While a project may evolve and morph compared to the
proposal, a good proposal usually helps to organize your
thinking.
NB: It is OK if an original research
project evolves into a reading and/or experiments project and
vice versa. However, please keep us posted.
NB:
Sometimes a big chunk of the overall work for the project
happens
before (and while) you write the project
proposal. So, don't be alarmed if generating a project proposal
takes much work; it probably means that you'll have less work
to do later!
A page of text should usually suffice, possibly
less. Please use LaTeX, and insert bibliography as appropriate.
Send the PDF to
bandits-fa17-instr@microsoft.com, with a subject like "project proposal
(project name)". CC all collaborators, and Steve or Nan or they
are involved.
Both in the email and in the PDF, please include the following,
for each person in the group: full name, UNI,
PhD/year or Masters/year, department/school, and a few words on background and interests.
Topic suggestions. [We
may modify these over time... ]
- [reading: bandits with auxiliary
structure] Much literature on bandits assumes
auxiliary structure, compared to the "standard" bandit
model. Algorithms can take advantage of such structure,
e.g., to cope with a huge #actions (cf. Interlude A in
Alex's book). Take one
type of structure (e.g., linear bandits) and read up on it.
- [reading & simulations:
bandits with auxiliary data] In many scenarios, the
algorithm has access to extra data, compared to the
"standard" bandit model. How to leverage this data? For
example:
- after each round, rewards for several arms may be
revealed, not just for the chosen arm. (e.g.: Bandits with feedback graphs)
- some exploration data may be available before the
algorithm starts (e.g., Exploration Scavenging).
- some exploration may be happening anyway, in
parallel with the algorithm (e.g., because different
people just try out different things).
- [reading and simulations: bandits with
structured actions] In many applications,
actions have a specific structure, e.g., action is a slate
of search results. There are several ways to model this,
depending on:
- how users react to the slate. E.g., click once and
leave, or consider each item no matter what? (e.g.: Cascading models)
- what is the reward metric? E.g., does it help to
have two clicks on the slate?
- what feedback is observed? E.g., all clicks, or only
the total reward? (e.g.: Slates, Semibandits)
- [simulations] Take a real-world ranking dataset, such as from MSLR. Simulate different user models on this data, and compare an algorithm tailored to that model, an algorithm operating only on the global rewards of the slate, and a greedy algorithm. (A minimal simulation sketch follows below.)
This has been studied both with and without contexts. (Current interests: Alekh.)
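For instance, a minimal sketch of such a simulation in Python with numpy; the cascade user model and the per-item click probabilities are hypothetical stand-ins for an actual dataset like MSLR:

  import numpy as np

  rng = np.random.default_rng(0)
  n_items, slate_size, T = 20, 5, 10000
  relevance = rng.uniform(0.05, 0.5, n_items)  # hypothetical per-item click probabilities

  def cascade_feedback(slate):
      # cascade model: the user scans top-down and clicks at most once
      for pos, item in enumerate(slate):
          if rng.random() < relevance[item]:
              return pos
      return None

  # UCB-style ranker on semi-bandit feedback: items at or above the click
  # (or the whole slate, if no click) count as observed
  clicks, views, total = np.zeros(n_items), np.zeros(n_items), 0.0
  for t in range(1, T + 1):
      v = np.maximum(views, 1)
      ucb = clicks / v + np.sqrt(2 * np.log(t) / v)
      slate = np.argsort(-ucb)[:slate_size]
      pos = cascade_feedback(slate)
      observed = slate if pos is None else slate[: pos + 1]
      views[observed] += 1
      if pos is not None:
          clicks[slate[pos]] += 1
          total += 1
  print("empirical CTR:", total / T)

Swapping in a different user model (e.g., position-based), or an algorithm that only sees the slate-level reward, then gives the comparison suggested above.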
- [reading: general techniques]
The same technique may
appear in many settings, from bandits to contextual bandits
to RL. Prominent examples:
- multiplicative weights (e.g., see
this survey).
- follow the perturbed/regularized
leader (e.g., see ch. 8 from Alex's book).
- UCB / optimism under uncertainty (e.g., see ch. 5
and ch. 10 in Alex's book for applications beyond the
basic bandit model).
- non-adaptive exploration (explore-first,
epsilon-greedy, etc.)
- Thompson Sampling (start with ch. 4 of Alex's book).
Take one such technique and read up on it.
- [simulations: Thompson
Sampling] A famous bandit algorithm called
"Thompson Sampling" relies on exactly sampling from Bayesian
posteriors on mean rewards, whereas in many practical
applications one can only sample approximately. What is the
effect of approximate sampling on the algorithm's performance?
(A minimal simulation sketch follows below.)
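For example, a minimal sketch in Python with numpy, on Bernoulli arms; the "approximate" posterior here is a hypothetical Gaussian matched to the Beta posterior's mean and variance, just to illustrate the comparison:

  import numpy as np

  rng = np.random.default_rng(1)
  means = np.array([0.3, 0.5, 0.7])  # hypothetical arm means
  T = 20000

  def run(sample_posterior):
      s, f = np.ones(3), np.ones(3)  # Beta(1,1) priors
      regret = 0.0
      for _ in range(T):
          a = int(np.argmax(sample_posterior(s, f)))
          r = float(rng.random() < means[a])
          s[a] += r
          f[a] += 1 - r
          regret += means.max() - means[a]
      return regret

  exact = lambda s, f: rng.beta(s, f)  # exact posterior sampling

  def approx(s, f):  # Gaussian approximation to the Beta posterior
      m = s / (s + f)
      v = s * f / ((s + f) ** 2 * (s + f + 1))
      return rng.normal(m, np.sqrt(v))

  print("regret, exact TS: ", run(exact))
  print("regret, approx TS:", run(approx))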
- [bandits & social learning] As people make
decisions over time (e.g., where to eat today?) they utilize
information revealed by others in the past (think Yelp) and
produce information that may help others in the future
(e.g., leave a review). Do they explore enough for the
common good? How to incentivize them to explore more? What
if a bandit algorithm issues recommendations, but people do
not have to follow them?
- [reading]
start with this informal article and this tutorial
(part
I and part
II).
- [simulations] see what happens under
different models of human behavior, and different ways
to present or censor past observations. Such simulations
would plug nicely into one of our active research
projects.
(Current interests: Alex,
Steven.)
- [reading: bandits & fairness]
Bandit algorithms
often make choices for people (e.g., web search
results) or among people (whom to give a loan?).
Are some individuals or groups treated unfairly? What does
"fairness" mean, and how can one ensure it? (e.g., see
this paper and
that paper).
(Current interests: Alekh,
Alex, Steven.)
- [online learning &
games] What if an online learning algorithm is playing a game against an adversary or
another algorithm? What does this process converge to,
and how fast?
- [reading]
Start with
this lecture note and
that paper.
- [simulations]
what if the algorithm is best-responding to the
"empirical play" (time-averages) of the adversary? (See the
self-play sketch below for a starting point.)
(Current interests:
Steven)
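As a starting point, a minimal self-play sketch in Python with numpy: both players of rock-paper-scissors run multiplicative weights against each other, and the time-averaged play approaches the uniform equilibrium:

  import numpy as np

  # payoff matrix for the row player (the column player gets the negative)
  A = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
  eta, T = 0.05, 5000
  wx, wy = np.ones(3), np.ones(3)
  avg_x = np.zeros(3)
  for _ in range(T):
      x, y = wx / wx.sum(), wy / wy.sum()
      avg_x += x / T
      wx *= np.exp(eta * (A @ y))     # row player: multiplicative weights on payoffs
      wy *= np.exp(-eta * (A.T @ x))  # column player: same, on its own payoffs
  print("time-averaged row play:", np.round(avg_x, 3))  # close to [1/3, 1/3, 1/3]

Replacing one player's update with a best response to the opponent's time-averaged play gives the variant asked about in the [simulations] item above.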
- [reading: online learning &
economic mechanisms] What if a bandit / online
learning algorithm
...
- ... is used in an auction, either by the auction
maker or by auction participants? (e.g., see
this paper and
that paper)
- ... is used to determine prices & offers for sale?
(e.g., see this paper and references therein).
- ... determines who does what, and at which price, in a
crowdsourcing market? (e.g., see
this survey)
(Current interests: Alex)
- [reading, simulations and research: contextual
bandits theory] Possible topics include:
- [reading] beyond IID: adversarial contexts and/or rewards (e.g.:
BISTRO+ and predecessors).
- [reading] algorithms that adapt to data, achieving better
performance for "nice" problem instances (e.g.:
first-order regret bounds,
open problem,
CORRAL).
- [simulations] How do the adversarial algorithms behave on i.i.d. contexts and rewards, compared with algorithms designed for those settings? How do the i.i.d. algorithms work on non-i.i.d. and adversarial instances? Which seems more practically useful? (A minimal comparison harness is sketched below.)
- [research] Study the
implicit exploration technique and try combining it with efficient contextual bandit methods as a replacement for uniform exploration. Does this lead to improvements in a structured action setting? You can also empirically evaluate the resulting approach in lieu of theory, or do both.
(Current interests: Alekh)
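One possible starting harness, in Python with numpy: a synthetic i.i.d. instance with epsilon-greedy over per-action linear models. The instance parameters are arbitrary, and the epsilon-greedy learner is just a placeholder to swap the i.i.d. and adversarial algorithms into:

  import numpy as np

  rng = np.random.default_rng(4)
  d, K, T = 5, 4, 20000
  W = rng.normal(size=(K, d))  # hypothetical true weights per action

  # epsilon-greedy with one least-squares model per action
  A = [np.eye(d) for _ in range(K)]
  b = [np.zeros(d) for _ in range(K)]
  eps, regret = 0.05, 0.0
  for _ in range(T):
      x = rng.normal(size=d)
      means = 1 / (1 + np.exp(-(W @ x)))  # mean reward per action given context
      if rng.random() < eps:
          a = rng.integers(K)
      else:
          a = int(np.argmax([x @ np.linalg.solve(A[k], b[k]) for k in range(K)]))
      r = float(rng.random() < means[a])
      A[a] += np.outer(x, x)
      b[a] += r * x
      regret += means.max() - means[a]
  print("regret of eps-greedy on an i.i.d. instance:", regret)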
- [reading and simulations: contextual
bandits applications and deployments] Possible topics include:
- [reading] Contextual bandits in the wild: deployed systems and learnings (e.g.:
Decision Service paper,
Technical debt paper,
ICML Tutorial)
- [reading/simulations] Contextual bandit applications (see the
ICML Tutorial for a partial list of application papers). Pick one or more papers around an application, and run simulations on that application with some algorithms you have learned in this course.
(Current interests: Alekh)
- [reading and research: Off-policy evaluation] How to best evaluate policies given logged data? (A minimal IPS baseline is sketched below.)
- [reading] Biased estimators (e.g.:
doubly robust with bias,
SWITCH)
- [reading] Applications in Information Retrieval (see e.g.:
Hofmann et al. monograph)
- [research] How to improve our evaluation approach for exploration algorithms using biased estimators like SWITCH? Can you analyze it in theory, assuming the exploration algorithm has some nice properties? (a starting point for such a property might be the structural assumption in the
CORRAL paper).
- [research] How to extend the minimax lower bounds from the SWITCH paper to structural conditions on rewards, such as linearity?
(Current interests: Alekh)
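As a baseline for such comparisons, a minimal sketch of the unbiased inverse propensity scoring (IPS) estimator in Python with numpy; the logging policy, target policy, and rewards are all synthetic:

  import numpy as np

  rng = np.random.default_rng(2)
  K, n = 5, 100000
  means = rng.uniform(0, 1, K)      # hypothetical mean reward per action
  logging = np.full(K, 1.0 / K)     # uniform logging policy
  target = np.zeros(K)
  target[np.argmax(means)] = 1.0    # deterministic target policy

  # logged data: actions, rewards, and recorded propensities
  actions = rng.choice(K, size=n, p=logging)
  rewards = (rng.random(n) < means[actions]).astype(float)
  props = logging[actions]

  ips = np.mean(target[actions] / props * rewards)
  print("IPS estimate:", ips, " true value:", means.max())

Biased estimators like SWITCH trade some of this estimator's variance for bias; a harness like this is one place to measure that trade-off.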
- [reading, simulations and research: Off-policy learning] How to best learn good policies given logged data?
- [reading] Reducing the variance in off-policy learning (see e.g. Self-normalized POEM)
- [research] Can you extend the IPS-based off-policy learning ideas to further incorporate a variance term like the work above? You might get some inspiration from the
EXP4.P paper.
- [simulations] Study the effect of using different off-policy learning methods inside your favorite contextual bandit algorithm (such as to train the greedy policy in epsilon-greedy).
(Current interests: Alekh)
- [reading, simulations and research: Exploration in RL] Read up on and experiment with different exploration algorithms for reinforcement learning.
- [reading] Exploration in small-state MDPs (e.g.:
E3,
R-MAX,
UCRL,
Delayed Q-learning)
- [reading] Exploration with complex observations (e.g.:
Eluder dimension,
Bellman Rank,
Curiosity driven exploration)
- [simulations] Take some standard RL benchmark problems where exploration matters. Compare an algorithm like R-MAX or UCRL with a (contextual) bandit algorithm. Can you try the same for an Atari game? (A minimal chain-MDP example where exploration matters is sketched below.)
- [research] Come up with a theoretical model to analyze an algorithm like the curiosity-driven exploration paper. Hint: try to capture the idea that only some aspects of an observation might matter to the optimal policy/value function given a task.
(Current interests: Alekh, Nan)
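To illustrate, a minimal chain-MDP sketch in Python with numpy, where reward sits only at the far end of the chain; optimistic initialization (a crude stand-in for R-MAX-style optimism) is compared with epsilon-greedy Q-learning:

  import numpy as np

  rng = np.random.default_rng(3)
  N, H, episodes = 10, 20, 500  # chain length, horizon, number of episodes

  def step(s, a):  # actions: 0 = left, 1 = right; reward only at the far end
      s2 = max(s - 1, 0) if a == 0 else min(s + 1, N - 1)
      return s2, float(s2 == N - 1)

  def train(q_init, eps):
      Q = np.full((N, 2), q_init)
      alpha, gamma, total = 0.1, 0.95, 0.0
      for _ in range(episodes):
          s = 0
          for _ in range(H):
              a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s]))
              s2, r = step(s, a)
              Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
              s, total = s2, total + r
      return total

  print("total reward, eps-greedy (Q init 0):", train(0.0, 0.1))
  print("total reward, optimistic init:      ", train(1 / (1 - 0.95), 0.0))

On instances like this, epsilon-greedy rarely reaches the rewarding state within the horizon, while optimism does; that is the kind of gap worth measuring on standard benchmarks.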
- [reading: Predictive State Representations] What constitutes a state in RL? Is there a procedural way to define states in RL? PSRs provide one such notion (e.g.:
Original PSR paper,
Spectral learning for PSRs)
(Current interests: Alekh, Nan)
- [reading: Hierarchical Reinforcement Learning] If the planning horizon is long, and the task you are trying to accomplish has re-usable subtasks, can you effectively leverage this structure? (see e.g.
Options paper,
General value functions and
application to Atari)
(Current interests: Alekh, Nan)
- [reading/simulations: Proper evaluation of RL algorithms] The training-validation-test paradigm in supervised learning is robust to many types of experimental errors. We have repositories of standardized dataset splits, which makes it easier to compare experiments across papers. RL agents generate their own data, and are often run in simulators, which makes it harder to run experiments that can be compared. How to make sure that our results are genuinely interesting? Do you want to reproduce the results of your favorite empirical RL paper according to a recommended methodology? (see e.g.:
Deep RL that matters,
Atari evaluation paper)
- [reading: Policy improvement and gradient]
Given a starting policy supplied by an expert at training time, how does an RL agent learn a policy as good as it, or better? Can the agent bootstrap from an expert and continue to improve over time?