Pure Exploration Reading Group

About | Schedule | Plan | Suggested papers | About the name

Welcome to the CWI reading group on Pure Exploration. We try to meet every other week on Thursday from 11-12h, to present and discuss one paper from the pure exploration literature.

Contact: Wouter Koolen.

The first meeting is on Thursday October 11 at 11:00 in L016.

What is Pure Exploration?

Pure Exploration is the Machine Learning term for what a statistician would call active sequential multiple composite hypothesis testing. The question is how to design systems that efficiently explore their stochastic environment, by choosing experiments to perform, with the goal of learning something (specific) about it. Pure exploration is about how what you want to learn and which experiments are available affects how you should allocate experiments and what you should infer from their outcomes. Why the name?.

The schedule

When, Where	Presenter	Paper
11 Oct, L016	Wouter Koolen	This will be an introductory session focusing on the basic definitions and problem. We will also look at three algorithms. We will closely follow PAC bounds for ... and Action elimination and ....
25 Oct, L016	Rémy Degenne	This will be a session on lower bounds for the sample complexity, and the techniques used to prove them. It will be a synthesis based on the papers On the complexity ..., Optimal Best Arm ... (without the algorithmic part), Explore first, exploit next ....
8 Nov, L016	Judith ter Schure	Spending functions from Statistical monitoring of ..., chapter 5.
22 Nov, L016	Rianne de Heide	Simple Bayesian algorithms ... on a variant of Thompson Sampling for pure exploration.
13 Dec, L016	Thijs van Ommen	Causal bandits
20 Dec, L016	Rianne de Heide	... testing with online FDR control...
11 Apr, L120	Rianne de Heide	... Hyperparameter Optimization

The plan for the reading group

We will be meeting roughly every 2 weeks for a 1h presentation of one paper, by a volunteer participant. We will start with the classics and then proceed to some recent ('17, '18) material. If you have any special tastes please let me know! If you want to volunteer, have a look at the list of suggested material.

Scope

We will be looking at Pure Exploration from a theory perspective. The objects of interest are algorithms for which we can prove statistical and computational guarantees, and fundamental limits to what you can learn. The focus is very much on understanding how things work at the lego level, and so we will study the simplest interesting problems. Things to expect are confidence intervals, martingales, (quasi-)Bayesian updating, and the supporting tools and techniques, incl. optimization.

The reading group will be open to anyone interested, including people from other groups inside/outside CWI, and people selectively attending.

Volunteers

Every meeting, a volunteer reads and then presents a paper. Presentations take place on Thursday 11-12h. We aim for a 45 minutes talk, followed by a 15 minutes discussion.

The goal of the talk is to communicate clearly the highlights of the paper, and to place it in context. Using the blackboard is recommended. Slides are overkill.

Suggested papers

These papers are candidate papers to be read. The citation graph below shows the dependencies between titles (abbreviated to a unique term).

References

Key	Paper
PAC	Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In Computational Learning Theory, 15th Annual Conference on Computational Learning Theory, COLT 2002, Sydney, Australia, July 8-10, 2002, Proceedings, volume 2375 of Lecture Notes in Computer Science, pages 255–270. Springer, 2002. [ DOI \| http ] and journal version Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006. [ .html ] Introduce the BAI problem. Propose the Successive Elimination algorithm.
Finitely-Armed	Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theor. Comput. Sci., 412(19):1832–1852, 2011. [ DOI \| http ] Propose the Simple Regret criterion, argue incompatibility with minimising cumulative regret.
BAI in MAB	Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 41–53. Omnipress, 2010. [ http ] Introduce the Successive Rejects algorithm.
LUCB	Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012. icml.cc / Omnipress, 2012. [ .pdf ] Introduce the LUCB algorithm.
Unified	Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States., pages 3221–3229, 2012. [ http ] Compare fixed budget and fixed confidence.
Monte Carlo	Kazuki TERAOKA, Kohei HATANO, and Eiji TAKIMOTO. Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions on Information and Systems, E97.D(3):392–398, 2014. [ DOI ] FindTopWinner algorithm for pure exploration in game trees.
lil'UCB	Kevin G. Jamieson, Matthew Malloy, Robert D. Nowak, and Sébastien Bubeck. lil' UCB : An optimal exploration algorithm for multi-armed bandits. In Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13-15, 2014, volume 35 of JMLR Workshop and Conference Proceedings, pages 423–439. JMLR.org, 2014. [ .html ] Introduce the lil'UCB algorithm.
Correlation	Matthew D. Hoffman, Bobak Shahriari, and Nando de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, volume 33 of JMLR Workshop and Conference Proceedings, pages 365–374. JMLR.org, 2014. [ .html ]
Linear	Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 828–836, 2014. [ http ]
Complexity	Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17:1:1–1:42, 2016. [ .html ]
Optimal	Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 998–1027. JMLR.org, 2016. [ .html ]
Matroid	Lijie Chen, Anupam Gupta, and Jian Li. Pure exploration of multi-armed bandit under matroid constraints. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 647–669. JMLR.org, 2016. [ .html ]
Fixed Budget Lower	Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 590–604. JMLR.org, 2016. [ .html ]
Hyperband	Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18:185:1–185:52, 2017. [ .html ]
Max-quantile	Yahel David and Nahum Shimkin. Pure exploration for max-quantile bandits. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I, volume 9851 of Lecture Notes in Computer Science, pages 556–571. Springer, 2016. [ DOI \| http ]
FDR Control	Fanny Yang, Aaditya Ramdas, Kevin G. Jamieson, and Martin J. Wainwright. A framework for multi-a(rmed)/b(andit) testing with online FDR control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5959–5968, 2017. [ http ]
Simple Bayesian	Daniel Russo. Simple Bayesian algorithms for best arm identification. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23-26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 1417–1418. JMLR.org, 2016. [ .html ]
Causal	Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 1181–1189, 2016. [ http ]
Instance Optimal	Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, volume 65 of Proceedings of Machine Learning Research, pages 535–592. PMLR, 2017. [ .html ]
Thresholding	Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1690–1698. JMLR.org, 2016. [ .html ]
Multiple Arm	Jiecao Chen, Xi Chen, Qin Zhang, and Yuan Zhou. Adaptive multiple-arm identification. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 722–730. PMLR, 2017. [ .html ]
Combinatorial	Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, and Ruosong Wang. Nearly optimal sampling algorithms for combinatorial pure exploration. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7-10 July 2017, volume 65 of Proceedings of Machine Learning Research, pages 482–534. PMLR, 2017. [ .html ]
Skyline	Albert Cheu, Ravi Sundaram, and Jonathan Ullman. Skyline identification in multi-arm bandits. In 2018 IEEE International Symposium on Information Theory, ISIT 2018, Vail, CO, USA, June 17-22, 2018, pages 1006–1010. IEEE, 2018. [ DOI \| http ]
InfArm	Maryam Aziz, Jesse Anderton, Emilie Kaufmann, and Javed Aslam. Pure exploration in infinitely-armed bandit models with fixed-confidence. In Proceedings of Algorithmic Learning Theory, volume 83 of Proceedings of Machine Learning Research, pages 3–24. PMLR, 07–09 Apr 2018. [ .html \| .pdf ]
Simulator	Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1794–1834, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR. [ .html \| .pdf ]
Lipschitz	Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 975–999, Barcelona, Spain, 13–15 Jun 2014. PMLR. [ .html \| .pdf ]
Monotonicity	Aurélien Garivier, Pierre Ménard, Laurent Rossi, and Pierre Menard. Thresholding bandit for dose-ranging: The impact of monotonicity. arXiv preprint arXiv:1711.04454, 2017. [ http ]
Multiple Answers	Rémy Degenne and Wouter M Koolen. Pure exploration with multiple correct answers. arXiv preprint arXiv:1902.03475, 2019. [ http ]

Bonus References

S. R. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon. Exponential line-crossing inequalities. ArXiv e-prints, August 2018. [ arXiv ]

S. Delattre and N. Fournier. On Monte-Carlo tree search for deterministic games with alternate moves and complete information. ArXiv e-prints, April 2017. [ arXiv ]

Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18-22, 2006, Proceedings, volume 4212 of Lecture Notes in Computer Science, pages 282–293. Springer, 2006. [ DOI | http ]

Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. CoRR, abs/1602.07182, 2016. [ arXiv | http ]

Michael A. Proschan, K. K. Gordon Lan, and Janet Turk Wittes. Statistical Monitoring of Clinical Trials: a Unified Approach. Statistics for Biology and Health. Springer, 2006. [ http ]

Why is it called Pure Exploration?

Research in Machine Learning, in particular on Bandits, focused strongly on the "Exploration/Exploitation" trade-off in the '00s. Here the starting point is that data points represent reward, and you care about accumulating high reward (as you do). Some experiments are very informative, while others yield high reward, so you need a mix of both and getting this trade-off right is very interesting. Pure Exploration is the rebellious counter-movement: "pure" refers to how fast it learns, with no regard for how much it earns (or in fact without any reward/cost structure). In other words, pure exploration cares only about generalisation.

A particular example where "pure" is the more useful model is in planning (say for a robot) with a generative model (i.e. in a simulator). It is perfectly fine if the planning system destroys your robot in the simulation. As long as it recommends a useful move for the physical robot.