Pure Exploration Reading Group

exploration
About | Schedule | Plan | Suggested papers | About the name

Welcome to the CWI reading group on Pure Exploration. We try to meet every other week on Thursday from 11-12h, to present and discuss one paper from the pure exploration literature.

Contact: Wouter Koolen.

The first meeting is on Thursday October 11 at 11:00 in L016.

What is Pure Exploration?

Pure Exploration is the Machine Learning term for what a statistician would call active sequential multiple composite hypothesis testing. The question is how to design systems that efficiently explore their stochastic environment, by choosing experiments to perform, with the goal of learning something (specific) about it. Pure exploration is about how what you want to learn and which experiments are available affects how you should allocate experiments and what you should infer from their outcomes. Why the name?.

The schedule

When, Where Presenter Paper
11 Oct, L016 Wouter

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and markov decision processes. In Computational Learning Theory, 15th Annual Conference on Computational Learning Theory, COLT 2002, Sydney, Australia, July 8-10, 2002, Proceedings, volume 2375 of Lecture Notes in Computer Science, pages 255–270. Springer, 2002. [ DOI | http ]

and the later journal version

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006. [ .html ]

25 Oct, L016 Judith? TBD
8 Nov, L016 Rémy? TBD

The plan for the reading group

We will be meeting roughly every 2 weeks for a 1h presentation of one paper, by a volunteer participant. We will start with the classics and then proceed to some recent ('17, '18) material. If you have any special tastes please let me know! If you want to volunteer, have a look at the list of suggested material.

Scope

We will be looking at Pure Exploration from a theory perspective. The objects of interest are algorithms for which we can prove statistical and computational guarantees, and fundamental limits to what you can learn. The focus is very much on understanding how things work at the lego level, and so we will study the simplest interesting problems. Things to expect are confidence intervals, martingales, (quasi-)Bayesian updating, and the supporting tools and techniques, incl. optimization.

The reading group will be open to anyone interested, including people from other groups inside/outside CWI, and people selectively attending.

Volunteers

Every meeting, a volunteer reads and then presents a paper. Presentations take place on Thursday 11-12h. We aim for a 45 minutes talk, followed by a 15 minutes discussion.

The goal of the talk is to communicate clearly the highlights of the paper, and to place it in context. Using the blackboard is recommended. Slides are overkill.


Suggested papers

These papers are candidate papers to be read. The citation graph below shows the dependencies between titles (abbreviated to a unique term).

graph

References

Key Paper
PAC

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and markov decision processes. In Computational Learning Theory, 15th Annual Conference on Computational Learning Theory, COLT 2002, Sydney, Australia, July 8-10, 2002, Proceedings, volume 2375 of Lecture Notes in Computer Science, pages 255–270. Springer, 2002. [ DOI | http ]

and journal version

Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006. [ .html ]

Introduce the BAI problem. Propose the Successive Elimination algorithm.
Finitely-Armed

Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theor. Comput. Sci., 412(19):1832–1852, 2011. [ DOI | http ]

Propose the Simple Regret criterion, argue incompatibility with minimising cumulative regret.
BAI in MAB

Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 41–53. Omnipress, 2010. [ http ]

Introduce the Successive Rejects algorithm.
Unified

Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., pages 3221–3229, 2012. [ http ]

Compare fixed budget and fixed confidence.
Unified

Kazuki TERAOKA, Kohei HATANO, and Eiji TAKIMOTO. Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions on Information and Systems, E97.D(3):392–398, 2014. [ DOI ]

FindTopWinner algorithm for pure exploration in game trees.
lil'UCB

Kevin G. Jamieson, Matthew Malloy, Robert D. Nowak, and Sébastien Bubeck. lil' UCB : An optimal exploration algorithm for multi-armed bandits. In Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13-15, 2014, volume 35 of JMLR Workshop and Conference Proceedings, pages 423–439. JMLR.org, 2014. [ .html ]

Introduce the lil'UCB algorithm.
Correlation

Matthew D. Hoffman, Bobak Shahriari, and Nando de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, volume 33 of JMLR Workshop and Conference Proceedings, pages 365–374. JMLR.org, 2014. [ .html ]

Linear

Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 828–836, 2014. [ http ]

Complexity

Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17:1:1–1:42, 2016. [ .html ]

Optimal

Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 998–1027. JMLR.org, 2016. [ .html ]

Matroid

Lijie Chen, Anupam Gupta, and Jian Li. Pure exploration of multi-armed bandit under matroid constraints. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 647–669. JMLR.org, 2016. [ .html ]

Fixed Budget Lower

Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 590–604. JMLR.org, 2016. [ .html ]

Hyperband

Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18:185:1–185:52, 2017. [ .html ]

Max-quantile

Yahel David and Nahum Shimkin. Pure exploration for max-quantile bandits. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I, volume 9851 of Lecture Notes in Computer Science, pages 556–571. Springer, 2016. [ DOI | http ]

FDR Control

Fanny Yang, Aaditya Ramdas, Kevin G. Jamieson, and Martin J. Wainwright. A framework for multi-a(rmed)/b(andit) testing with online FDR control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5959–5968, 2017. [ http ]

Simple Bayesian

Daniel Russo. Simple Bayesian algorithms for best arm identification. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 1417–1418. JMLR.org, 2016. [ .html ]

Causal

Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 1181–1189, 2016. [ http ]

Instance Optimal

Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, volume 65 of Proceedings of Machine Learning Research, pages 535–592. PMLR, 2017. [ .html ]

Thresholding

Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1690–1698. JMLR.org, 2016. [ .html ]

Multiple Arm

Jiecao Chen, Xi Chen, Qin Zhang, and Yuan Zhou. Adaptive multiple-arm identification. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 722–730. PMLR, 2017. [ .html ]

Combinatorial

Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, and Ruosong Wang. Nearly optimal sampling algorithms for combinatorial pure exploration. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, volume 65 of Proceedings of Machine Learning Research, pages 482–534. PMLR, 2017. [ .html ]

Skyline

Albert Cheu, Ravi Sundaram, and Jonathan Ullman. Skyline identification in multi-arm bandits. In 2018 IEEE International Symposium on Information Theory, ISIT 2018, Vail, CO, USA, June 17-22, 2018, pages 1006–1010. IEEE, 2018. [ DOI | http ]

Bonus References

S. R. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon. Exponential line-crossing inequalities. ArXiv e-prints, August 2018. [ arXiv ]

S. Delattre and N. Fournier. On Monte-Carlo tree search for deterministic games with alternate moves and complete information. ArXiv e-prints, April 2017. [ arXiv ]

Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18-22, 2006, Proceedings, volume 4212 of Lecture Notes in Computer Science, pages 282–293. Springer, 2006. [ DOI | http ]



Why is it called Pure Exploration?

Rover

Research in Machine Learning, in particular on Bandits, focused strongly on the "Exploration/Exploitation" trade-off in the '00s. Here the starting point is that data points represent reward, and you care about accumulating high reward (as you do). Some experiments are very informative, while others yield high reward, so you need a mix of both and getting this trade-off right is very interesting. Pure Exploration is the rebellious counter-movement: "pure" refers to how fast it learns, with no regard for how much it earns (or in fact without any reward/cost structure). In other words, pure exploration cares only about generalisation.

A particular example where "pure" is the more useful model is in planning (say for a robot) with a generative model (i.e. in a simulator). It is perfectly fine if the planning system destroys your robot in the simulation. As long as it recommends a useful move for the physical robot.