Welcome to the CWI reading group on Pure Exploration. We try to meet every other week on Thursday, 11–12h, to present and discuss one paper from the pure exploration literature.
Contact: Wouter Koolen.
The first meeting is on Thursday October 11 at 11:00 in L016.
Pure Exploration is the Machine Learning term for what a statistician would call active sequential multiple composite hypothesis testing. The question is how to design systems that efficiently explore their stochastic environment, by choosing which experiments to perform, with the goal of learning something (specific) about it. Pure exploration studies how what you want to learn and which experiments are available should affect how you allocate experiments and what you infer from their outcomes. Why the name? See the note at the bottom of this page.
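To make this interaction protocol concrete, here is a minimal sketch of fixed-confidence best-arm identification via Successive Elimination (the algorithm from the PAC / Action-elimination papers covered in the first session). The confidence radius, constants, and the noiseless toy sampler are illustrative choices, not taken from any particular paper.

```python
import math

def successive_elimination(sample, n_arms, delta, max_rounds=10_000):
    """Fixed-confidence best-arm identification by Successive Elimination:
    keep pulling all remaining arms, and drop an arm once its confidence
    interval falls entirely below that of the empirical leader.
    `sample(i)` returns an observation in [0, 1] for arm i;
    `delta` is the allowed error probability."""
    active = list(range(n_arms))
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, max_rounds + 1):
        for i in active:  # one pull per remaining arm
            counts[i] += 1
            means[i] += (sample(i) - means[i]) / counts[i]
        # Hoeffding-style radius with a crude union bound over arms and rounds.
        radius = math.sqrt(math.log(4 * n_arms * t * t / delta) / (2 * t))
        leader = max(means[i] for i in active)
        active = [i for i in active if means[i] + radius >= leader - radius]
        if len(active) == 1:
            return active[0]  # stop: a single arm survives
    return max(active, key=lambda i: means[i])  # budget exhausted: best guess

# Toy illustration with noiseless 'rewards' so the run is deterministic:
mus = [0.2, 0.5, 0.8]
winner = successive_elimination(lambda i: mus[i], n_arms=3, delta=0.05)  # arm 2
```

The key design point, which the lower-bound sessions will revisit, is that the stopping rule and the elimination rule are both driven by the same simultaneous confidence intervals.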
When, Where | Presenter | Paper
11 Oct, L016 | Wouter Koolen | An introductory session focusing on the basic definitions and problem setup. We will also look at three algorithms, closely following "PAC bounds for ..." and "Action elimination and ...".
25 Oct, L016 | Rémy Degenne | A session on lower bounds for the sample complexity, and the techniques used to prove them; a synthesis of several papers.
8 Nov, L016 | Judith ter Schure | Spending functions, from "Statistical monitoring of ...", chapter 5.
22 Nov, L016 | Rianne de Heide | "Simple Bayesian algorithms ...", on a variant of Thompson Sampling for pure exploration.
13 Dec, L016 | Thijs van Ommen | Causal bandits
20 Dec, L016 | Rianne de Heide | "... testing with online FDR control ..."
11 Apr, L120 | Rianne de Heide | "... Hyperparameter Optimization"
We will be meeting roughly every two weeks for a one-hour presentation of one paper by a volunteer participant. We will start with the classics and then proceed to some recent ('17, '18) material. If you have any special tastes, please let me know! If you want to volunteer, have a look at the list of suggested material.
We will be looking at Pure Exploration from a theory perspective. The objects of interest are algorithms for which we can prove statistical and computational guarantees, and fundamental limits on what can be learned. The focus is very much on understanding how things work at the Lego-brick level, so we will study the simplest interesting problems. Things to expect are confidence intervals, martingales, (quasi-)Bayesian updating, and the supporting tools and techniques, including optimization.
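For a first taste of the confidence-interval toolkit, here is a sketch of a two-sided Hoeffding interval for the mean of bounded observations; the helper name and the sample data are made up for illustration.

```python
import math

def hoeffding_ci(samples, delta):
    """Two-sided Hoeffding confidence interval for the mean of i.i.d.
    observations in [0, 1]; valid with probability at least 1 - delta."""
    n = len(samples)
    mean = sum(samples) / n
    radius = math.sqrt(math.log(2 / delta) / (2 * n))
    return mean - radius, mean + radius

# Example: five observations with empirical mean exactly 0.5.
lo, hi = hoeffding_ci([0.4, 0.6, 0.5, 0.55, 0.45], delta=0.05)
```

With so few samples the interval is wide (radius about 0.61); much of pure exploration is about deciding where to sample so that intervals like this shrink exactly where it matters.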
The reading group is open to anyone interested, including people from other groups inside or outside CWI, and people attending only selected sessions.
Every meeting, a volunteer reads and then presents a paper. Presentations take place on Thursday, 11–12h. We aim for a 45-minute talk, followed by a 15-minute discussion.
The goal of the talk is to communicate clearly the highlights of the paper, and to place it in context. Using the blackboard is recommended. Slides are overkill.
These papers are candidates to be read. The citation graph below shows the dependencies between titles (each abbreviated to a unique key).
Key | Paper
PAC | Eyal Even-Dar, Shie Mannor, and Yishay Mansour. PAC bounds for multi-armed bandit and Markov decision processes. In Computational Learning Theory, 15th Annual Conference on Computational Learning Theory, COLT 2002, Sydney, Australia, July 8–10, 2002, Proceedings, volume 2375 of Lecture Notes in Computer Science, pages 255–270. Springer, 2002. [ DOI | http ] Journal version: Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 2006. [ .html ] Introduce the BAI problem. Propose the Successive Elimination algorithm.
Finitely-Armed | Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theor. Comput. Sci., 412(19):1832–1852, 2011. [ DOI | http ] Propose the Simple Regret criterion; argue incompatibility with minimising cumulative regret.
BAI in MAB | Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In COLT 2010 – The 23rd Conference on Learning Theory, Haifa, Israel, June 27–29, 2010, pages 41–53. Omnipress, 2010. [ http ] Introduce the Successive Rejects algorithm.
LUCB | Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, and Peter Stone. PAC subset selection in stochastic multi-armed bandits. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 – July 1, 2012. icml.cc / Omnipress, 2012. [ .pdf ] Introduce the LUCB algorithm.
Unified | Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, December 3–6, 2012, Lake Tahoe, Nevada, United States, pages 3221–3229, 2012. [ http ] Compare fixed budget and fixed confidence.
Monte Carlo | Kazuki Teraoka, Kohei Hatano, and Eiji Takimoto. Efficient sampling method for Monte Carlo tree search problem. IEICE Transactions on Information and Systems, E97.D(3):392–398, 2014. [ DOI ] FindTopWinner algorithm for pure exploration in game trees.
lil'UCB | Kevin G. Jamieson, Matthew Malloy, Robert D. Nowak, and Sébastien Bubeck. lil' UCB: An optimal exploration algorithm for multi-armed bandits. In Proceedings of The 27th Conference on Learning Theory, COLT 2014, Barcelona, Spain, June 13–15, 2014, volume 35 of JMLR Workshop and Conference Proceedings, pages 423–439. JMLR.org, 2014. [ .html ] Introduce the lil'UCB algorithm.
Correlation | Matthew D. Hoffman, Bobak Shahriari, and Nando de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22–25, 2014, volume 33 of JMLR Workshop and Conference Proceedings, pages 365–374. JMLR.org, 2014. [ .html ]
Linear | Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8–13, 2014, Montreal, Quebec, Canada, pages 828–836, 2014. [ http ]
Complexity | Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. Journal of Machine Learning Research, 17(1):1–42, 2016. [ .html ]
Optimal | Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23–26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 998–1027. JMLR.org, 2016. [ .html ]
Matroid | Lijie Chen, Anupam Gupta, and Jian Li. Pure exploration of multi-armed bandit under matroid constraints. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23–26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 647–669. JMLR.org, 2016. [ .html ]
Fixed Budget Lower | Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23–26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 590–604. JMLR.org, 2016. [ .html ]
Hyperband | Lisha Li, Kevin G. Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning Research, 18(185):1–52, 2017. [ .html ]
Max-quantile | Yahel David and Nahum Shimkin. Pure exploration for max-quantile bandits. In Machine Learning and Knowledge Discovery in Databases – European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19–23, 2016, Proceedings, Part I, volume 9851 of Lecture Notes in Computer Science, pages 556–571. Springer, 2016. [ DOI | http ]
FDR Control | Fanny Yang, Aaditya Ramdas, Kevin G. Jamieson, and Martin J. Wainwright. A framework for multi-A(rmed)/B(andit) testing with online FDR control. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4–9 December 2017, Long Beach, CA, USA, pages 5959–5968, 2017. [ http ]
Simple Bayesian | Daniel Russo. Simple Bayesian algorithms for best arm identification. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, New York, USA, June 23–26, 2016, volume 49 of JMLR Workshop and Conference Proceedings, pages 1417–1418. JMLR.org, 2016. [ .html ]
Causal | Finnian Lattimore, Tor Lattimore, and Mark D. Reid. Causal bandits: Learning good interventions via causal inference. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pages 1181–1189, 2016. [ http ]
Instance Optimal | Lijie Chen, Jian Li, and Mingda Qiao. Towards instance optimal bounds for best arm identification. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7–10 July 2017, volume 65 of Proceedings of Machine Learning Research, pages 535–592. PMLR, 2017. [ .html ]
Thresholding | Andrea Locatelli, Maurilio Gutzeit, and Alexandra Carpentier. An optimal algorithm for the thresholding bandit problem. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19–24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pages 1690–1698. JMLR.org, 2016. [ .html ]
Multiple Arm | Jiecao Chen, Xi Chen, Qin Zhang, and Yuan Zhou. Adaptive multiple-arm identification. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 722–730. PMLR, 2017. [ .html ]
Combinatorial | Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, and Ruosong Wang. Nearly optimal sampling algorithms for combinatorial pure exploration. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, Amsterdam, The Netherlands, 7–10 July 2017, volume 65 of Proceedings of Machine Learning Research, pages 482–534. PMLR, 2017. [ .html ]
Skyline | Albert Cheu, Ravi Sundaram, and Jonathan Ullman. Skyline identification in multi-arm bandits. In 2018 IEEE International Symposium on Information Theory, ISIT 2018, Vail, CO, USA, June 17–22, 2018, pages 1006–1010. IEEE, 2018. [ DOI | http ]
InfArm | Maryam Aziz, Jesse Anderton, Emilie Kaufmann, and Javed Aslam. Pure exploration in infinitely-armed bandit models with fixed-confidence. In Proceedings of Algorithmic Learning Theory, volume 83 of Proceedings of Machine Learning Research, pages 3–24. PMLR, 07–09 Apr 2018. [ .html | .pdf ]
Simulator | Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1794–1834, Amsterdam, Netherlands, 07–10 Jul 2017. PMLR. [ .html | .pdf ]
Lipschitz | Stefan Magureanu, Richard Combes, and Alexandre Proutiere. Lipschitz bandits: Regret lower bound and optimal algorithms. In Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 975–999, Barcelona, Spain, 13–15 Jun 2014. PMLR. [ .html | .pdf ]
Monotonicity | Aurélien Garivier, Pierre Ménard, and Laurent Rossi. Thresholding bandit for dose-ranging: The impact of monotonicity. arXiv preprint arXiv:1711.04454, 2017. [ http ]
Multiple Answers | Rémy Degenne and Wouter M. Koolen. Pure exploration with multiple correct answers. arXiv preprint arXiv:1902.03475, 2019. [ http ]
S. R. Howard, A. Ramdas, J. McAuliffe, and J. Sekhon. Exponential line-crossing inequalities. arXiv e-prints, August 2018. [ arXiv ]
S. Delattre and N. Fournier. On Monte-Carlo tree search for deterministic games with alternate moves and complete information. arXiv e-prints, April 2017. [ arXiv ]
Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, 17th European Conference on Machine Learning, Berlin, Germany, September 18–22, 2006, Proceedings, volume 4212 of Lecture Notes in Computer Science, pages 282–293. Springer, 2006. [ DOI | http ]
Aurélien Garivier, Pierre Ménard, and Gilles Stoltz. Explore first, exploit next: The true shape of regret in bandit problems. CoRR, abs/1602.07182, 2016. [ arXiv  http ]
Michael A. Proschan, K. K. Gordon Lan, and Janet Turk Wittes. Statistical Monitoring of Clinical Trials: a Unified Approach. Statistics for Biology and Health. Springer, 2006. [ http ]
Research in Machine Learning, in particular on bandits, focused strongly on the "Exploration/Exploitation" trade-off in the '00s. There, the starting point is that data points represent reward, and you care about accumulating high reward (as you do). Some experiments are very informative, while others yield high reward, so you need a mix of both, and getting this trade-off right is very interesting. Pure Exploration is the rebellious counter-movement: "pure" means the system cares only about how fast it learns, with no regard for how much it earns (or, in fact, without any reward/cost structure at all). In other words, pure exploration cares only about generalisation.
A particular example where the "pure" model is more useful is planning (say, for a robot) with a generative model (i.e. in a simulator). It is perfectly fine if the planning system destroys the robot in simulation, as long as it recommends a useful move for the physical robot.