Sign-up sheet: Please use
this sign-up sheet
("projects" tab) to help us
keep track of the projects. Please update it as appropriate,
including your names/emails, brief project description (or the
current/tentative version thereof), and project status. Also,
use this sheet to to find partners and to be aware of each
other's topics and progress.
Deliverables & due dates
- Initial discussion regarding the possible project
topics (probably via email and/or Skype).
When: the week of Mon, Oct 23
(more details coming up soon).
See below for advice
on choosing a topic and
- Project proposal: a short document explaining the
topic and the proposed approach (due
by the end of Oct 31).
- Mid-project discussion: optional but
encouraged, especially if you are doing simulations or
research. (Week of Mon, Nov 13
-- sign-up sheet will be up.)
- Final report: a short (3-7 pp long) academic-style
paper describing the background and what you've done for the
project. (Tentative deadline: Dec 5).
- Final presentation:
A presentation during the last class (Dec 6).
Please do not hesitate to send us questions at any point.
Our crew. The projects will be supervised by
the two instructors, and two more people:
Steven Wu and
Jiang. There will be one supervisor for each project, to
which all project-related communication should be directed.
Choosing a topic. The project topic can be
anything directly related to bandits, contextual bandits, or RL,
subject to our approval. (If you are
unsure whether a given topic is OK, please
ask.) Generally, there are three types of projects:
- Reading (literature review): read up on a given
sub-topic. What is are the problem formulations and
how they are motivated? What are the main results and the
crucual techniques to obtain these results? What are the
open questions? What are the papers in this line of work,
and who did what?
- Experiments: pick a sub-topic and several
algorithms for this sub-topic, run experiments with these
algorithms to see which ones are better (for which
"regimes"). You can use simulated data and/or some of the
- Original research: pick an open problem and try
to make a progress on it. Could be theoretical and/or
experimental, up to you. Such a project is likely
to involve some amount of reading, to figure out what is
known and what is likely to help. It is great if by the end
of the class you make a substantial progress towards a
future publication, but it is totally OK to only have
preliminary results, or no results at all. (After all, good
research is rarely done in a few weeks.)
These three types are not mutually exclusive: an original
research may involve much reading; and reading-
and experimets- type projects may be driven by a
specific research question, and/or may help towards research
projects in the future.
To keep #projects managable, we kindly ask you to work
in groups of three of more.
We may allow, at our discretion, a few groups of two for some
simulations or research projects. Working in groups tends to be
more productive, more fun and also a useful collaborative
If you are working on a course-related research project with
some collaborators who are not attending this course, you are
encouraged to submit an account of this research project as the
"class project". But please ask one of us first.
For a reading project,
mention a few papers to start with. For simulations, mention a
few algorithms, a few types of problem instances you are
planning to consider (if doing simulations), and the data
sources you are planning to use (if working with real data). For
an open research project, describe the open problem and how you
plan/hope to go about solving it. While a project may evolve and morph compared to the
proposal, a good proposal usually helps to organize your
It is OK if an original research
project evolves into a reading and/or experiments project and
vice versa. However, please keep us posted. NB:
Sometimes a bigh chunk of the overall work for the project
happens before (and while)
you write the project
proposal. So, don't be alarmed if generating a project proposal
takes much work, this probably means that you'd have less work
to do later!
A page of text should usually suffice, possibly
less. Please use LaTeX, and insert bibliography as appropriate.
Send the PDF to
, with a subject like "project proposal
(name1, name2, name3)".
Topic suggestions. [We
may modify these over time... ]
- [reading: bandits with auxiliary
structure] much literature on bandits assumes
auxiliary structure, compared to the "standard" bandit
model. Algorithms can take advantage of such structure,
e.g., to cope with a huge #actions (cf. Interlude A in
Alex's book). Examples include:
type of structure (e.g., linear bandits), and read up on it.
- [reading & simulations:
bandits with auxiliary data] In many scenarios,
algorithm has access to extra data, compared to the
"standard" bandit model. How to leverage this data? For
- after each round, rewards for several arms may be
revealed, not just for the chosen arm. (e.g.: Bandits with feedback graphs)
- some exploration data may be available before the
algorithm starts, (e.g. Exploration Scavenging)
- some exploration may be happening anyway, in
parrallel with the algorithm (e.g., because different
people just try out different things).
- [reading and simulations: bandits with
structured actions] in many applications,
actions have a specific structure, e.g., action is a slate
of search results. There are several ways to model this,
This has been studied both with and without contexts. (Current interests: Alekh.)
- how users react to the slate E.g., click once and
leave, or consider each item no matter what? (e.g.: Cascading models)
- what is the reward metric? E.g., does it help to
have two clicks on the slate?
- what feedback is observed? E.g., all clicks, or only
the total reward? (e.g.: Slates, Semibandits)
- [simulations] Take a real-world ranking dataset, such as from MSLR. Simulate different user models on this data, and evaluate the effect of an algorithm tailored to that model, an algorithm just operating on global rewards of the slate, and a greedy algorithm.
- [reading: general techniques]
same technique may
appear in many settings, from bandits to contextual bandits
to RL. Prominent examples:
Take one such technique and read up on
- multiplicative weights (e.g., see
- follow the perturbed/reguralized
leader (e.g., see ch. 8 from Alex's book).
- UCB / optimism under uncertainty (e.g., see ch. 5
and ch. 10 in Alex's book for applications beyond the
basic bandit model).
- non-adaprive exploration (explore-first,
- Thompson Sampling (start with ch. 4 of Alex's book).
- [simulations: Thompson
Sampling] a famous bandit algorithm called
"Thompson Sampling" relies on exactly sampling from Bayesian
posteriors on mean rewards. Whereas in many practical
applications one can only sample approximately. What is the
effect of approximate sampling on algorithm's performance?
- [bandits & social learning] As people make
decisions over time (e.g., where to eat today?) they utilize
information revealed by others in the past (think Yelp) and
produce information that may help others in the future
(e.g., leave a review). Do they explore enough for the
common good? How to incentivize them to explore more? What
if a bandit algorithm issues recommendations, but people do
not have to follow them?
(Current interests: Alex,
start with this informal article and this tutorial
I and part
- [simulations] see what happens under
different models of human behavior, and different ways
to present or censor past observations. Such simulations
would plug nicely into one of our active research
- [reading: bandits & fairness]
often make choices for people (e.g., web search
results) or among people (whom to give a loan?).
Are some individuals or groups treated unfairly? What
"fairness" means, and how to ensure it? (e.g., see
this paper and
(Current interests: Alekh,
- [online learning &
games] what if an online learning algorithm is playing a game against an adversary or
another algorithm? What does this process converge to,
and how fast?
this lecture note and
what if the algorithm is best-responding to the
"empirical play" (time-averages) of the adversary?
- [reading: online learning &
economic mechanisms] what if a bandit / online
(Current interests: Alex)
- ... is used in an auction, either by the auction
maker or by auction participants? (e.g., see
this paper and
- ... is used to determine prices & offers for sale?
(e.g., see this paper and references therein).
- ... determines who does what and at which price at a
crowdsourcing market? (e.g., see
- [reading, simulations and research: contextual
bandits theory] Possible topics include:
(Current interests: Alekh)
- [reading] beyond IID: adversarial contexts and/or rewards (e.g.:
BISTRO+ and predecessors).
- [reading] algorithms that adapt to data, achieving better
performance for "nice" problem instances (e.g.:
first-order regret bounds,
- [simulations] How do the adversarial algorithms behave on i.i.d. contexts and rewards, compared with algorithms for those settings? How do the i.i.d. algorithms work on non-i.i.d. and adversarial instances? Which seems more practically useful?
- [research] Study the
implicit exploration technique and try combining it with efficient contextual bandit methods as a replacement for uniform exploration? Does this lead to improvements in a structured action setting? You can also empirically evaluate the resulting approach in lieu of theory, or both.
- [reading and simulations: contextual
bandits applications and deployments] Possible topics include:
(Current interests: Alekh)
- [reading] Contextual bandits in the wild: deployed systems and learnings (e.g.:
Decision Service paper,
Technical debt paper,
- [reading/simulations] Contextual bandit applications ( See for a partial ref of application papers:
ICML Tutorial. Pick one or more papers around an application, and run simulations with some algorithms you have learned in this course on that application.
- [reading and research: Off-policy evaluation] How to best evaluate policies given logged data?
(Current interests: Alekh)
- [reading] Biased estimators (e.g.:
doubly robust with bias,
- [reading] Applications in Information Retrieval (see e.g.:
Hofmann et al. monograph)
- [research] How to improve our evaluation approach for exploration algorithms using biased algorithms like SWITCH? Can you analyze in theory if the exploration algorithm has some nice properties? (a starting point for such a property might be the structural assumption in the
- [research] How to extend the minimax lower bounds from the SWITCH paper for structural conditions on rewards such as linear?
- [reading, simulations and research: Off-policy evaluation] How to best learn good policies given logged data?
(Current interests: Alekh)
- [reading] Reducing the variance in off-policy learning (see e.g. Self-normalized poem)
- [reading] Can you extend the IPS-based off-policy learning ideas to further incorporate a variance term like the work above? You might get some inspiration from the
- [reading] Study the effect of using different off-policy learning methods inside your favorite contextual bandit algorithm (such as to train the greedy policy in epsilon-greedy).
- [reading, simulations and research: Exploration in RL] Read on and experiment with different exploration algorithms for reinforcement learning
(Current interests: Alekh, Nan)
- [reading] Exploration in small-state MDPs (e.g.:
- [reading] Exploration with complex observations (e.g.:
Curiosity driven exploration)
- [simulations] Take some standard RL benchmark problems where exploration matters. Compare an algorithm like R-MAX or UCRL with a (contextual) bandit algorithm. Can you try the same for an Atari game?
- [research] Come up with a theoretical model to analyze an algorithm like the curiosity-driven exploration paper. Hint: try to capture the idea that only some aspects of an observation might matter to the optimal policy/value function given a task.
- [reading: Predictive State Representations] What constitutes a state in RL? Is there a procedural way to define states in RL? PSRs provide one such notion (e.g.:
Original PSR paper,
Spectral learning for PSRs)
(Current interests: Alekh, Nan)
- [reading: Hierarchical Reinforcement Learning] If the planning horizon is long, and the task you are trying to accomplish has re-usable subtasks, can you effectively leverage this structure? (see e.g.
General value functions and
application to Atari)
(Current interests: Alekh, Nan)
- [reading/simulations: Proper evaluation of RL algorithms] The training-validation-test paradigm in supervised learning is robust to many types of experimental errors. We have repositories of standardized dataset splits, which makes it easier to compare experiments across papers. RL agents generate their own data, and are often run in simulators which makes it harder to do experiments which can be compared. How to make sure that our results are genuinely interesting? Do you want to reproduce the results of your favorite empirical RL paper according to a recommended methodology? (see e.g.:
Deep RL that matters,
Atari evaluation paper)
- [reading: Policy improvement and gradient] Given a starting policy supplied by an expert at training time, how does an RL agent learn a policy as good as it or better? Can the agent bootstrap from an expert and continue to get even better over time?
The final report should
clearly and concisely describe the project: what was the chosen
topic, why it is interesting, what you've set out to accomplish,
what you've learned/understood/accomplished, and what it means
in a broader context.
The report should resemble a
research paper in terms of style and presentation. Try to make
it accessible to students in this course: explain things
clearly, and try not assume background beyond what was covered
- "Introduction" section should state what is the
topic/problem, why it is interesting, background and [some]
prior work with appropriate citations, what you've set out
to accomplish, and (on a high level) what you've
- "Preliminaries" section is a place to introduce
models and/or notation. Also, if you'll be using tools that
are somewhat "advanced" for this course (e.g., martingales,
linear duality, KL-divergence) mention them here, and
perhaps explain briefly what it is and how you'd be using
- "Body" of the report presents all the details.
For new results: explain what the results mean, be
as explicit as possible about why they are interesting given the
prior work. Include enough detail:
- for theory: precise
theorem statements, and full proofs, or at least proof sketches
that contain the main ideas,
- for simulations/experiments:
detailed experimental setup, algorithms used, and what was
If you tried some approach and it didn't work out, you can write
about it, too! Try to explain why it didn't work and what are
the obstacles; present lessons / conclusions, should you have
For reading projects: what is the goal of your project?
Clearly state and cite the papers you've looked at. Try to state
the main results as precisely and explicitly as possible (but
also try to be succinct, and do use informal descriptions if
needed). Explain what these results mean. What are the main
techniques to achieve them. Ideally, also what are the cool open
problems in the area, and why prior work falls short in solving
Mapping out the relevant work is an imporant part of the
project. In particular, when citing a line of work, cite several
papers, not just one; ideally, the initial paper or two, and a
few of the most important follow-ups. Cite surveys whenever
applicable. DBLP is a good tool to track down correct references
(e.g., do not cite an arxiv version if there is a conference
publication), and also to get BibTeX entries.
The document should be 3-5 pp long, not
including fugures and references, in 11pt font with 1-inch
margins and single spacing. Include full names and emails for
all authors. LaTeX strongly preferred (but do talk to me if this
is a problem). Then the desired formatting can be achieved by
BibTeX is recommended for
Submit PDF file by
, with a subject like "final
project report (supervisor: name1, name2, name3)".
Each project will need to make a
presentation in the final class, via a (very) short talk or a
poster. We will provide a more specific plan once we see how
many projects we have.