Keywords :
Reinforcement learning; direct policy search; look-ahead tree search
Abstract :
[en] Direct policy search (DPS) and look-ahead tree (LT) policies are two popular techniques for solving difficult sequential decision-making problems. Both are simple to implement, widely applicable without strong assumptions on the structure of the problem, and capable of producing high-performance control policies. However, both are computationally very expensive, each in its own way. DPS can require huge offline resources (the effort required to obtain the policy): one must first select a space of parameterized policies that works well for the targeted problem and then determine the best values of the parameters via global optimization. LT policies require no offline resources; however, they typically require huge online resources (the effort required to compute the best decision at each step) in order to grow trees of sufficient depth. In this paper, we propose optimized look-ahead trees (OLT), a model-based policy learning scheme that lies at the intersection of DPS and LT. In OLT, the control policy is represented indirectly by an algorithm that, at each decision step, develops a small look-ahead tree using a model of the dynamics, as in LT, until a prespecified online budget is exhausted. Unlike in LT, the development of the tree is not driven by a generic heuristic; rather, the heuristic is optimized for the target problem and implemented as a parameterized node-scoring function learned offline via DPS. We experimentally compare OLT with pure DPS and pure LT variants on optimal control benchmark domains. The results show that the LT-based representation is a versatile way of compactly representing policies in a DPS scheme (so OLT is easier to tune and has lower offline complexity than pure DPS), while at the same time DPS significantly reduces the size of the look-ahead trees required to take high-quality decisions (so OLT has lower online complexity than pure LT). Moreover, OLT produces better-performing policies overall than pure DPS and pure LT, and yields policies that are robust with respect to perturbations of the initial conditions.
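To make the scheme concrete, the following is a minimal Python sketch of the online part of an OLT policy as described in the abstract. All names (olt_decision, model, score, theta, budget) are illustrative assumptions, not the authors' code: model(x, u) stands for a deterministic generative model returning the next state and reward, and score is the parameterized node-scoring heuristic whose parameter vector theta would be tuned offline by DPS (for example, with a derivative-free global optimizer) so that the induced policy maximizes return.

    import heapq
    import itertools

    def olt_decision(x0, model, actions, score, theta, budget, gamma=0.95):
        """One OLT decision step (sketch): grow a look-ahead tree from state x0,
        always expanding the open node ranked best by the learned scoring
        function, until `budget` calls to the model have been spent; then
        return the root action of the best discounted-return path found."""
        counter = itertools.count()          # tie-breaker so heap never compares states
        frontier = []                        # max-heap of open nodes, via negated scores
        best_ret, best_u0 = float('-inf'), None

        # Seed the frontier with the children of the root state.
        for u in actions:
            x1, r = model(x0, u)
            if r > best_ret:
                best_ret, best_u0 = r, u
            heapq.heappush(frontier, (-score(x1, r, 1, theta), next(counter), (x1, r, 1, u)))
        calls = len(actions)

        while frontier and calls < budget:
            # Select the open node the learned heuristic ranks highest.
            _, _, (x, ret, d, u0) = heapq.heappop(frontier)
            if ret > best_ret:
                best_ret, best_u0 = ret, u0
            # Expand it: simulate every action with the model of the dynamics.
            for u in actions:
                x_next, r = model(x, u)
                calls += 1
                child = (x_next, ret + gamma ** d * r, d + 1, u0)
                heapq.heappush(frontier,
                               (-score(child[0], child[1], child[2], theta),
                                next(counter), child))
        return best_u0

At each time step the controller would call olt_decision on the current state and apply the returned action. The pure-LT baseline described above corresponds to replacing the learned score by a generic heuristic (e.g., the node's accumulated reward or an optimistic upper bound on it); OLT instead learns score offline so that small trees suffice.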
Disciplines :
Computer science
Author, co-author :
Jung, Tobias ; Université de Liège - ULiège > Dept. of Electrical Engineering and Computer Science (Montefiore Institute) > Computer Networks
Wehenkel, Louis ; Université de Liège - ULiège > Dept. of Electrical Engineering and Computer Science (Montefiore Institute) > Systems and Modeling
Ernst, Damien ; Université de Liège - ULiège > Dept. of Electrical Engineering and Computer Science (Montefiore Institute) > Smart Grids
Maes, Francis ; Université de Liège - ULiège > Dept. of Electrical Engineering and Computer Science (Montefiore Institute) > Systems and Modeling
Language :
English
Title :
Optimized look-ahead tree policies: a bridge between look-ahead tree policies and direct policy search
Publication date :
March 2014
Journal title :
International Journal of Adaptive Control and Signal Processing