CS 747: Foundations of Intelligent and Learning Agents
(Autumn 2015)
Instructor
Shivaram Kalyanakrishnan
Office: SIA204
Phone: 7716
Email: shivaram@cse.iitb.ac.in
Teaching Assistants
Apoorv Aggarwal
Office: SIC212 (Info Lab)
Email: apag@cse.iitb.ac.in
Anand Babu N. B.
Office: SIA311
Email: anandb@cse.iitb.ac.in
Class
Lectures will be held in SIC301 during Slot 6: 11.05 a.m. –
12.30 p.m. on Wednesdays and Fridays.
The instructor will hold office hours 4.00 p.m. – 6.00 p.m. on
Thursdays in SIA204; meetings can also be arranged by
appointment.
Course Description
Today's computing systems are becoming increasingly adaptive and
autonomous: they are akin to intelligent, decision-making
``agents''. With its roots in artificial intelligence and machine
learning, this course covers the foundational principles of designing
such agents. Topics covered include: (1) agency, intelligence, and
learning; (2) exploration and multi-armed bandits; (3) Markov Decision
Problems and planning; (4) reinforcement learning; (5) search; (6)
multi-agent systems and multi-agent learning; and (7) case
studies.
The course will adopt a ``hands-on'' approach, with programming
assignments designed to highlight the relationship between theory and
practice. Case studies, as well as invited talks from experts, will
offer an ``end-to-end'' view of deployed agents. It is hoped that
students can apply what they learn in this course to their respective
pursuits in various areas of computer science and related fields.
Prerequisites
The course does not formally have other courses as
prerequisites. However, class lectures and assignments will assume
that the student is comfortable with probability and algorithms. The
course has an intensive programming component: based on ideas
discussed in class, the student must be able to independently design,
implement, and evaluate programs in a language of his/her choice. The
student must be prepared to spend a significant amount of time on each
programming assignment.
Evaluation
Grades will be based on four programming assignments, each
worth 15 marks; a mid-semester examination worth 15 marks; and an
end-semester examination worth 25 marks.
Programming assignments must be turned in through Moodle.
Students auditing the course must score 50 or more marks in the
course to be awarded an ``AU'' grade.
Academic Honesty
Students are expected to adhere to the highest standards of
integrity and academic honesty. Acts such as copying in the
examinations and sharing code for the programming assignments will be
dealt with strictly, in accordance with the
institute's procedures
and disciplinary
actions for academic malpractice.
Texts and References
Reinforcement Learning: An Introduction, Richard S. Sutton and
Andrew G. Barto, MIT Press, 1998. Online version.
Artificial Intelligence: A Modern Approach, Stuart J. Russell and
Peter Norvig, 3rd edition, Prentice-Hall, 2009.
Algorithms for Reinforcement Learning, Csaba Szepesvári,
Morgan & Claypool,
2009. Online version.
Dynamic Programming and Optimal Control, Volume II, Dimitri
P. Bertsekas, 4th edition, Athena Scientific, 2012.
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit
Problems, Sébastien Bubeck and Nicolò Cesa-Bianchi, Foundations and
Trends in Machine Learning, Volume 5, Number 1, 2012. Online version.
Selected research papers:

A Formal Basis for the Heuristic Determination of Minimum Cost Paths
Peter E. Hart, Nils J. Nilsson, and Bertram Raphael, 1968

Modified Policy Iteration Algorithms for Discounted Markov Decision Problems
Martin L. Puterman and Moon Chirl Shin, 1978

Asymptotically Efficient Adaptive Allocation Rules
T. L. Lai and Herbert Robbins, 1985

Learning to Predict by the Methods of Temporal Differences
Richard S. Sutton, 1988

Programming Robots Using Reinforcement Learning and Teaching
Long-Ji Lin, 1991

Q-learning
Christopher J. C. H. Watkins and Peter Dayan, 1992

Tight Performance Bounds on Greedy Policies Based on Imperfect Value Functions
Ronald J. Williams and Leemon C. Baird, III, 1993

Markov games as a framework for multi-agent reinforcement learning
Michael L. Littman, 1994

An Upper Bound on the Loss from Approximate Optimal-Value Functions
Satinder P. Singh and Richard C. Yee, 1994

On the Complexity of Solving Markov Decision Problems
Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling, 1995

Average Reward Reinforcement Learning: Foundations, Algorithms, and Empirical Results
Sridhar Mahadevan, 1996

On the Complexity of Policy Iteration
Yishay Mansour and Satinder Singh, 1999

Policy invariance under reward transformations: Theory and application to reward shaping
Andrew Y. Ng, Daishi Harada, and Stuart Russell, 1999

Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms
Satinder Singh, Tommi Jaakkola, Michael L. Littman, and Csaba Szepesvári, 2000

Policy Gradient Methods for Reinforcement Learning with Function Approximation
Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour, 2000

Rational and Convergent Learning in Stochastic Games
Michael Bowling and Manuela Veloso, 2001

Finite-time Analysis of the Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, 2002

Nash Q-Learning for General-Sum Stochastic Games
Junling Hu and Michael P. Wellman, 2003

The Nonstochastic Multiarmed Bandit Problem
Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire, 2002a

Variable Resolution Discretization in Optimal Control
Rémi Munos and Andrew Moore, 2002

Function Approximation via Tile Coding: Automating Parameter Choice
Alexander A. Sherstov and Peter Stone, 2005

Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
Eyal Even-Dar, Shie Mannor, and Yishay Mansour, 2006

If multiagent learning is the answer, what is the question?
Yoav Shoham, Rob Powers, and Trond Grenager, 2006

Half Field Offense in RoboCup Soccer: A Multiagent Reinforcement Learning Case Study
Shivaram Kalyanakrishnan, Yaxin Liu, and Peter Stone, 2007

Stochastic Linear Optimization under Bandit Feedback
Varsha Dani, Thomas P. Hayes, and Sham M. Kakade, 2008

Policy Search for Motor Primitives in Robotics
Jens Kober and Jan Peters, 2008

Exponential Lower Bounds For Policy Iteration
John Fearnley, 2010

Near-optimal Regret Bounds for Reinforcement Learning
Thomas Jaksch, Ronald Ortner, and Peter Auer, 2010

Learning Complementary Multiagent Behaviors: A Case Study
Shivaram Kalyanakrishnan and Peter Stone, 2010

A contextual-bandit approach to personalized news article recommendation
Lihong Li, Wei Chu, John Langford, and Robert E. Schapire, 2010

Toward Off-Policy Learning Control with Function Approximation
Hamid Reza Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton, 2010

Artificial Intelligence: Foundations of Computational Agents
David Poole and Alan Mackworth, 2010

An Empirical Evaluation of Thompson Sampling
Olivier Chapelle and Lihong Li, 2011

The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond
Aurélien Garivier and Olivier Cappé, 2011

Learning Methods for Sequential Decision Making with Imperfect Representations
Shivaram Kalyanakrishnan, 2011

Thompson Sampling: An Asymptotically Optimal Finite-Time Analysis
Emilie Kaufmann, Nathaniel Korda, and Rémi Munos, 2012
Communication
This page will serve as the primary source of information regarding
the course, the schedule, and related announcements. The Moodle page
for the course will be used for sharing resources for the lectures and
assignments, and also for recording grades.
Email is the best means of communicating with the instructor;
students must include ``[CS747]'' in the subject line and mark a copy
to the TAs.
Schedule

July 22: Welcome, Introduction to the course.
Summary: Conception of course, syllabus, policies, administrative matters.

July 24: Multi-armed bandits.
Reading: Sections 2 – 2.4, Sutton and Barto (1998).
Summary: Definition of a stochastic multi-armed bandit, notion of an
algorithm, greedy and ε-greedy algorithms, graph of expected
reward against number of pulls.
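The ε-greedy rule from this lecture can be sketched in a few lines. This is a minimal illustrative sketch for Bernoulli arms, not the interface required in the programming assignments; the function name and three-armed instance below are invented for the example.

```python
import random

def epsilon_greedy(means, epsilon, horizon, seed=0):
    """Play a Bernoulli multi-armed bandit for `horizon` pulls.
    With probability epsilon, explore a uniformly random arm;
    otherwise, exploit the arm with the highest empirical mean."""
    rng = random.Random(seed)
    n = len(means)
    counts = [0] * n      # number of pulls of each arm
    totals = [0.0] * n    # total reward obtained from each arm
    reward = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            arm = rng.randrange(n)    # explore
        else:                         # exploit (unpulled arms first)
            arm = max(range(n), key=lambda a:
                      totals[a] / counts[a] if counts[a] else float("inf"))
        r = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += r
        reward += r
    return reward

# e.g. epsilon_greedy([0.2, 0.5, 0.8], epsilon=0.1, horizon=10000)
```

Note the trade-off the graph from the lecture depicts: a smaller ε wastes fewer pulls on exploration but risks locking onto a suboptimal arm.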

July 29: Multi-armed bandits.
Reading: Auer et al. (2002).
Summary: GLIE variants of ε-greedy algorithms, Softmax
exploration, deterministic exploration algorithms, Lai and Robbins'
lower bound on regret.
References: Singh et al. (2000), Lai and Robbins (1985).

July 31: Multi-armed bandits.
Reading: Auer et al. (2002).
Summary: UCB algorithm, Hoeffding's inequality, union bound, steps towards analysing UCB.
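The UCB algorithm from this lecture can be sketched as follows for Bernoulli arms: pull every arm once, then always pull the arm maximising the empirical mean plus a confidence term. This is an illustrative sketch with an invented interface, not the assignment's.

```python
import math
import random

def ucb(means, horizon, seed=0):
    """UCB (Auer et al., 2002) on a Bernoulli bandit: pull every arm
    once, then pull the arm maximising
    empirical mean + sqrt(2 ln t / number of pulls)."""
    rng = random.Random(seed)
    n = len(means)
    counts, totals = [0] * n, [0.0] * n
    reward = 0.0

    def pull(arm):
        nonlocal reward
        r = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += r
        reward += r

    for a in range(n):            # initialisation: one pull per arm
        pull(a)
    for t in range(n + 1, horizon + 1):
        arm = max(range(n), key=lambda a: totals[a] / counts[a]
                  + math.sqrt(2.0 * math.log(t) / counts[a]))
        pull(arm)
    return reward
```

The confidence term is exactly where Hoeffding's inequality enters the analysis: it bounds how far an arm's empirical mean can stray from its true mean after a given number of pulls.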

August 5: Multi-armed bandits.
Reading: Chapelle and Li (2011).
Summary: Proof of regret bound for UCB, introduction to KL-UCB and Thompson Sampling.
References: Garivier and Cappé (2011), Kaufmann et al. (2012).

August 7: Multi-armed bandits.
Summary: Preview of Programming Assignment 1, presentation of KL-UCB and Thompson Sampling, discussion of other bandit settings (adversarial, pure exploration, contextual and linear bandits).
References: Auer et al. (2002a), EvenDar et al. (2006), Li et al. (2010), Dani et al. (2008).
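Thompson Sampling for Bernoulli arms admits an especially short sketch, since the Beta family is conjugate to Bernoulli rewards. The interface below is invented for illustration.

```python
import random

def thompson(means, horizon, seed=0):
    """Thompson Sampling on a Bernoulli bandit: maintain a
    Beta(successes + 1, failures + 1) posterior over each arm's mean,
    sample once from every posterior, and pull the arm whose sample
    is largest."""
    rng = random.Random(seed)
    n = len(means)
    succ, fail = [0] * n, [0] * n
    reward = 0.0
    for _ in range(horizon):
        arm = max(range(n),
                  key=lambda a: rng.betavariate(succ[a] + 1, fail[a] + 1))
        if rng.random() < means[arm]:
            succ[arm] += 1
            reward += 1.0
        else:
            fail[arm] += 1
    return reward
```

Exploration here is implicit: arms with few pulls have wide posteriors, so their samples occasionally exceed the empirically best arm's.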

August 12: Multi-Armed Bandit Sampling in Nested Simulation
for Financial Portfolio Risk Measurement. Invited talk
by Sandeep Juneja.

August 14: Some Cool Problems in Some Hot Areas. Invited talk by Krithi Ramamritham.

August 19: MDP Planning.
Reading: Chapter 3, Sutton and Barto (1998).
Summary: Agent-environment interface, ``design'' of agent behaviour through specification of rewards, Markov Decision Problem.

August 21: MDP Planning.
Summary: Definitions (MDP, policy, value function), optimal policy, Bellman's equations.
Reference: Mahadevan (1996).

August 26: MDP Planning.
Reading: Chapter 4, Sutton and
Barto (1998).
Summary: Illustration of policy evaluation through
example, solution of Bellman's equations (both by solving a system of
linear equations, and through iteration), Bellman's Optimality
Equations, MDP planning through Linear Programming.
Reference: Szepesvári (2009, Chapter 2 and Appendix A).
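The iterative solution of Bellman's equations shown in this lecture can be sketched as follows; the transition-list format is an assumption made for the example.

```python
def evaluate_policy(n_states, transitions, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation on a finite MDP with a fixed policy.
    `transitions[s]` lists (next_state, probability, reward) triples
    induced by the policy at state s. Sweep the Bellman backup
    V(s) <- sum_{s'} p(s'|s) * (r + gamma * V(s')) until convergence."""
    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            v = sum(p * (r + gamma * V[s2]) for (s2, p, r) in transitions[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```

Solving the same linear system directly (as also shown in class) gives the exact values; iteration merely approaches them geometrically at rate γ.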

August 28: MDP Planning.
Summary: Policy Improvement Theorem, Policy Iteration, Value Iteration.
References: Singh and Yee (1994), Littman et al. (1995), Mansour and Singh (1999), Fearnley (2010).
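Value Iteration, with a greedy policy read off at the end, can be sketched in the same style; again, the transition-list encoding is an assumption for the example.

```python
def value_iteration(n_states, n_actions, T, gamma=0.9, tol=1e-8):
    """Value Iteration on a finite MDP. `T[s][a]` lists
    (next_state, probability, reward) triples. Apply the Bellman
    optimality backup until the values converge, then read off a
    greedy policy."""
    def q(s, a, V):
        return sum(p * (r + gamma * V[s2]) for (s2, p, r) in T[s][a])

    V = [0.0] * n_states
    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(q(s, a, V) for a in range(n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = [max(range(n_actions), key=lambda a: q(s, a, V))
              for s in range(n_states)]
    return V, policy
```

Policy Iteration differs only in structure: it alternates full policy evaluation (as in the previous sketch) with greedy improvement, which the Policy Improvement Theorem guarantees is monotone.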

September 2: MDP Planning.
Summary: Proof of Policy Improvement Theorem, brief discussions on the
approximation achieved by Value Iteration, Asynchronous Policy Iteration, Generalised Policy Iteration.
References: Williams and Baird (1993), Puterman and Shin (1978).

September 4: RoboCup: A Grand Challenge for AI.
References: Kalyanakrishnan et al. (2007), Kalyanakrishnan and Stone (2010).

September 12: Mid-semester examination. 8.30 a.m. – 10.30 a.m., LA 301.

September 16: Reinforcement Learning.

September 18: Reinforcement Learning.
Reading: Chapter 5, Sutton and Barto (1998).
Summary: First-visit and Every-visit Monte Carlo methods for policy evaluation, estimating Q-values.

September 23: Reinforcement Learning.
Summary: PAC analysis of Policy Iteration with Monte Carlo estimation of Q-values.

September 30: Reinforcement Learning.
Reading: Chapter 6, Sutton and Barto (1998).
Summary: TD(0), contrast with Monte Carlo methods.
Reference: Sutton (1988).
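The TD(0) update can be sketched as follows; the `step` interface for sampling transitions is invented for the example. Unlike Monte Carlo methods, the update is applied after every transition, bootstrapping from the current estimate of the next state's value.

```python
import random

def td0_evaluate(n_states, step, start, episodes,
                 alpha=0.1, gamma=1.0, seed=0):
    """TD(0) policy evaluation from sampled transitions. `step(s, rng)`
    returns (reward, next_state), with next_state None on termination.
    After each transition apply the temporal-difference update
    V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    rng = random.Random(seed)
    V = [0.0] * n_states
    for _ in range(episodes):
        s = start
        while s is not None:
            r, s2 = step(s, rng)
            target = r + (gamma * V[s2] if s2 is not None else 0.0)
            V[s] += alpha * (target - V[s])
            s = s2
    return V
```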

October 7: Reinforcement Learning.
Reading: Chapter 6, Sutton and Barto (1998).
Summary: Off-policy and on-policy control.
References: Watkins and Dayan (1992), Singh et al. (2000), Jaksch et al. (2010).
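Q-learning, the canonical off-policy control method of this lecture, can be sketched as follows (illustrative only; the `step` interface is an assumption made for the example). Behaviour is ε-greedy in Q, but the update bootstraps from the greedy action's value, which is what makes it off-policy.

```python
import random

def q_learning(n_states, n_actions, step, start, episodes,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning (Watkins and Dayan, 1992). `step(s, a, rng)`
    returns (reward, next_state), with next_state None on termination.
    The update bootstraps from max_a Q(s', a) regardless of the action
    the behaviour policy will actually take in s'."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = start
        while s is not None:
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)                      # explore
            else:
                a = max(range(n_actions), key=lambda x: Q[s][x])  # exploit
            r, s2 = step(s, a, rng)
            target = r + (gamma * max(Q[s2]) if s2 is not None else 0.0)
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```

The on-policy variant (Sarsa) replaces `max(Q[s2])` with the value of the action actually selected in s'.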

October 9: Reinforcement Learning.
Reading: Chapter 9, Sutton and Barto (1998).
Summary: Model-based RL (example: Dyna) and batch RL (example: Experience Replay).
Reference: Lin (1991).

October 14: Reinforcement Learning.
Summary: Illustration of the need for generalisation using Half Field Offense as an example.
Reference: Kalyanakrishnan et al. (2007).

October 16: Reinforcement Learning.
Reading: Chapter 8, Sutton and Barto (1998).
Summary: Neural networks, linear methods, gradient descent.

October 21: Reinforcement Learning.
Summary: Tile coding, gradient descent with linear function
approximation, convergence results, off-policy bootstrapping.
References: Munos and Moore (2002), Sherstov and Stone (2005), Maei et al. (2010).

October 23: Reinforcement Learning.
Reading: Sutton et al. (2000).
Summary: Representations; approximate value functions and policies; the Policy Gradient Theorem.
Reference: Kober and Peters (2008).

October 28: Learning Methods for Sequential Decision Making with Imperfect Representations.
Reference: Kalyanakrishnan (2011).

October 30: Multi-agent learning.
Reading: Littman (1994).
Summary: Brief introduction to reward shaping, matrix games, Minimax-optimality, Markov games, the Minimax-Q algorithm.
References: Ng et al. (1999), Bowling and Veloso (2001), Hu and Wellman (2003), Shoham et al. (2006).

November 4: Overview of CS 748: Advances in Intelligent and Learning Agents.

November 6: Search.
Summary: Applications of search, relationship with MDPs, Dijkstra's algorithm for graphs, breadth-first search and depth-first search, A* search.
References: Chapter 3, Poole and Mackworth (2010), Hart, Nilsson, and Raphael (1968).
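A* search can be sketched as follows; the graph encoding, function names, and heuristic interface are assumptions made for the example. With a heuristic that is identically zero, the procedure reduces to Dijkstra's algorithm.

```python
import heapq
import itertools

def astar(start, goal, neighbours, h):
    """A* search (Hart, Nilsson, and Raphael, 1968). `neighbours(n)`
    yields (successor, edge_cost) pairs and `h` is an admissible
    heuristic. Returns (path, cost), or (None, inf) if the goal is
    unreachable."""
    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    frontier = [(h(start), next(tie), 0.0, start, None)]
    parent, g_cost = {}, {}
    while frontier:
        _, _, g, node, prev = heapq.heappop(frontier)
        if node in g_cost:       # already expanded with no greater cost
            continue
        g_cost[node], parent[node] = g, prev
        if node == goal:         # reconstruct the path back to start
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1], g
        for nxt, c in neighbours(node):
            if nxt not in g_cost:
                heapq.heappush(frontier,
                               (g + c + h(nxt), next(tie), g + c, nxt, node))
    return None, float("inf")
```

Admissibility of h (never overestimating the remaining cost) is what guarantees that the first expansion of the goal is optimal.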

November 21: End-semester examination. 9.30 a.m. – 12.30 p.m., LA 301.
Assignments

August 19: Programming assignment 1 announced on
Moodle. Submission deadline: 11.59 p.m., September 3, 2015.

September 17: Programming assignment 2 announced on
Moodle. Submission deadline: 11.59 p.m., October 3, 2015.

October 7: Programming assignment 3 announced on
Moodle. Submission deadline: 11.59 p.m., October 22, 2015.

October 25: Programming assignment 4 announced on
Moodle. Submission deadline: 11.59 p.m., November 10, 2015.