Tree-based batch mode reinforcement learning

Ernst, Damien; Geurts, Pierre; Wehenkel, Louis

Download

Article (Scientific journals)

Tree-based batch mode reinforcement learning

Ernst, Damien; Geurts, Pierre; Wehenkel, Louis

2005 • In Journal of Machine Learning Research, 6, p. 503-556

Peer Reviewed verified by ORBi

Permalink
https://hdl.handle.net/2268/9360

Files (3)Send to Details Statistics Bibliography Similar publications

Files

Full Text

ernst05a.pdf

Publisher postprint (1.32 MB)

The fitted Q iteration (FQI) algorithm was first described in the paper "Iteratively extending time horizon reinforcement learning" (see below) but this paper is the first one to name it fitted Q iteration (or FQI in short).

Download

Annexes

ernst-fittedQIteration.pdf

Publisher postprint (335.04 kB)

Presentation that gives a brief overview of our work on fitted Q iteration.

Download

ernst-icopi2005-slides.pdf

Publisher postprint (769.1 kB)

Presentation that discusses several strategies for using supervised learning in the context of batch-mode reinforcement learning.

Download

All documents in ORBi are protected by a user license.

Send to

RIS BibTex APA Chicago Permalink X Linkedin

Details

Keywords :

fitted Q iteration; batch mode reinforcement learning; ensemble of regression trees; supervised learning; fitted value iteration; optimal control

Abstract :

[en] Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, it can be achieved by approximating the so-called Q-function based on a set of four-tuples (x(t), u(t), r(t), x(t+1)) where x(t) denotes the system state at time t, u(t) the control action taken, r(t) the instantaneous reward obtained and x(t+1) the successor state of the system, and by determining the control policy from this Q-function. The Q-function approximation may be obtained from the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely and totally randomized trees. We study their performances on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, the totally randomized trees give good results while ensuring the convergence of the sequence, whereas by relaxing the convergence constraint even better accuracy results are provided by the extremely randomized trees.

Disciplines :

Computer science

Author, co-author :

Ernst, Damien ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Geurts, Pierre ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Wehenkel, Louis ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Systèmes et modélisation

Language :

English

Title :

Tree-based batch mode reinforcement learning

Publication date :

April 2005

Journal title :

Journal of Machine Learning Research

ISSN :

1532-4435

eISSN :

1533-7928

Publisher :

Microtome Publishing, Brookline, United States - Massachusetts

Volume :

Pages :

503-556

Peer reviewed :

Peer Reviewed verified by ORBi

Funders :

F.R.S.-FNRS - Fonds de la Recherche Scientifique [BE]

Available on ORBi :

since 22 March 2009

Statistics

Number of views

1194 (80 by ULiège)

Number of downloads

1261 (34 by ULiège)

More statistics

Scopus citations^®

733

Scopus citations^®
without self-citations

693

Bibliography

D. Bagnell, S. Kakade, A. Y. Ng, and J. Schneider. Policy search by dynamic programming. In Proceedings of Neural Information Processing Systems, 2003.
L. C. Baird. Residual algorithms: reinforcement learning with function approximation. In Armand Prieditis and Stuart Russell, editors, Machine Learning: Proceedings of the Twelfth International Conference, pages 9-12, San Francisco, CA, July 1995. Morgan Kaufman.
R. Bellman. Dynamic Programming. Princeton University Press, 1957.
R. Bellman, R. Kalaba, and B. Kotkin. Polynomial approximation - a new computational technique in dynamic programming: allocation processes. Mathematical Computation, 17:155-161, 1973.
J. A. Boyan. Technical update: least-squares temporal difference learning. Machine Learning, 49 (2-3):233-246, 2002.
J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: safely approximating the value function. Advances in Neural Information Processing Systems, 7:369-376,1995.
L. Breiman. Bagging predictors. Machine Learning, 24(2): 123-140, 1996.
L. Breiman. Some infinity theory for predictor ensembles. Technical Report 577, University of California, Department of Statistics, 2000.
L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
L. Breiman, J. H. Friedman, R. A. Olsen, and C. J. Stone. Classification and Regression Trees. Wadsworth International (California), 1984.
D. Ernst. Near Optimal Closed-Loop Control. Application to Electric Power Systems. PhD thesis, University of Liège, Belgium, March 2003.
D. Ernst, P. Geurts, and L. Wehenkel. Iteratively extending time horizon reinforcement learning. In N. Lavra, L. Gamberger, and L. Todorovski, editors, Proceedings of the 14th European Conference on Machine Learning, pages 96-107, Dubrovnik, Croatia, September 2003. Springer-Verlag Heidelberg.
D. Ernst, M. Glavic, P. Geurts, and L. Wehenkel. Approximate value iteration in the reinforcement learning context. Application to electrical power system control. To appear in Intelligent Automation and Soft Computing, 2005.
P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Submitted, 2004.
G. J. Gordon. Online fitted reinforcement learning. In VFA workshop at ML-95, 1995a.
G. J. Gordon. Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning, pages 261-268, San Francisco, CA, 1995b. Morgan Kaufmann.
G. J. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, June 1999.
O. Hernández-Lerma and B. Lasserre. Discrete-Time Markov Control Processes. Springer, New-York, 1996.
L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: a survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pages 267-274, 2002.
M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4:1107-1149, 2003a.
M. G. Lagoudakis and R. Parr. Reinforcement learning as classification: leveraging modern classifiers. In Proceedings of ICML 2003, pages 424-431, 2003b.
J. Langford and B. Zadrozny. Reducing T-step reinforcement learning to classification. Submitted, 2004.
L. J. Lin. Reinforcement Learning for Robots Using Neural Networks. PhD thesis, Carnegie Mellon University, Pittsburgh, USA, 1993.
Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Technical Report 1005, Department of Statistics, University of Wisconsin, 2002.
D. G. Luenberger. Optimization by Vector Space Methods. Wiley, N.Y., 1969.
A. K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, University of Rochester, Rochester, New-York, 1996.
A. W. Moore and C. G. Atkeson. Prioritized sweeping: reinforcement learning with less data and less real time. Machine Learning, 13:103-130, 1993.
A. W. Moore and C. G. Atkeson. The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21(3):199-233, 1995.
A. Y. Ng and M. Jordan. PEGASUS: a policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth Conference on Uncertainly in Artificial Intelligence, pages 406-415, 1999.
D. Ormoneit and P. Glynn. Kernel-based reinforcement learning in average-cost problems. IEEE Transactions on Automatic Control, 47(10): 1624-163 6, 2002.
D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 49(2-3): 161-178, 2002.
J. Randløv and P. Alstrøm. Learning to drive a bicycle using reinforcement learning and shaping. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 463-471, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
J. Rust. Using randomization to break the curse of dimensionality. Econometrica, 65(3):487-516, 1997.
S. P. Singh, T. Jaakkola, and M. I. Jordan. Reinforcement learning with soft state aggregation. In G. Tesauro, D. S. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pages 359-368, Cambridge, MA, 1995. MIT press.
W. D. Smart and L. P. Kaelbling. Practical reinforcement learning in continuous spaces. In Proceedings of the Sixteenth International Conference on Machine Learning, pages 903-910, 2000.
M. W. Spong. Swing up control of the Acrobot. In 1994 IEEE International Conference on Robotics and Automation, pages 2356-2361, San Diego, CA, May 1994.
R. S. Sutton. Learning to predict by the method of temporal differences. Machine Learning, 3(1): 9-44, 1988.
R. S. Sutton. Generalization in reinforcement learning: successful examples using sparse coarse coding. Advances in Neural Information Processing Systems, 8:1038-1044, 1996.
R. S. Sutton and A. G. Barto. Reinforcement Learning, an Introduction. MIT Press, 1998.
J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3): 185-202, 1994.
J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large-scale dynamic programming. Machine Learning, 22:59-94, 1996.
W. T. B. Uther and M. M. Veloso. Tree based discretization for continuous state space reinforcement learning. In Proceedings of AAAI-98, pages 769-774, 1998.
X. Wang and T. G. Diettrich. Efficient value function approximation using regression trees. In Proceedings of IJCAI-99 Workshop on Statistical Machine Learning for Large-Scale Optimization, Stockholm, Sweden, 1999.
C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.
J. Yoshimoto, S. Ishii, and M. Sato. Application of reinforcement learning to balancing Acrobot. In Proceedings of the 1999 IEEE International Conference on Systems, Man and Cybernetics, pages 516-521, 1999.