Keywords :
Artificial intelligence; reinforcement learning; Partially Observable Markov Decision Process
Abstract :
This paper provides an analysis of the tradeoff between asymptotic bias (suboptimality with unlimited data) and overfitting (additional suboptimality due to limited data) in reinforcement learning with partial observability. Our theoretical analysis formally shows that, while potentially increasing the asymptotic bias, a smaller state representation decreases the risk of overfitting. The analysis relies on expressing the quality of a state representation by bounding L1 error terms on the associated belief states. The theoretical results are illustrated empirically with the state representation taken to be a truncated history of observations, both on synthetic POMDPs and on a large-scale POMDP in the context of smart grids with real-world data. Finally, in line with known results in the fully observable setting, we also briefly discuss and empirically illustrate how using function approximators and adapting the discount factor can improve the tradeoff between asymptotic bias and overfitting in the partially observable setting.
Disciplines :
Computer science
Author, co-author :
François-Lavet, Vincent
Rabusseau, Guillaume
Pineau, Joëlle
Ernst, Damien ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore) > Smart grids
Fonteneau, Raphaël ; Université de Liège - ULiège > Dép. d'électric., électron. et informat. (Inst.Montefiore)
Language :
English
Title :
On overfitting and asymptotic bias in batch reinforcement learning with partial observability
Publication date :
May 2019
Journal title :
Journal of Artificial Intelligence Research
ISSN :
1076-9757
eISSN :
1943-5037
Publisher :
AI Access Foundation, United States - California
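
To make the abstract's central objects concrete, below is a minimal, hypothetical Python sketch of batch reinforcement learning with a truncated history of observations as the state representation. The tabular fitted-Q routine, the names phi and fitted_q, and the trajectory format are our own illustrative assumptions, not the paper's algorithm or code.

from collections import defaultdict

def phi(obs_history, h):
    """Truncated-history state representation: keep only the last h
    observations. Smaller h aggregates more histories together (higher
    asymptotic bias) but yields fewer distinct states to estimate from
    a finite batch (lower risk of overfitting)."""
    return tuple(obs_history[-h:])

def fitted_q(dataset, h, gamma=0.95, n_iter=200, actions=(0, 1)):
    """Tabular value iteration on a fixed batch of POMDP trajectories,
    with truncated histories standing in for the unobserved state.

    dataset: list of trajectories; each trajectory is a list of
             (observation, action, reward) steps.
    Returns Q: dict mapping (state, action) to an estimated value.
    """
    # Convert trajectories into (s, a, r, s') transitions under phi.
    transitions = []
    for traj in dataset:
        obs_hist = []
        for t, (obs, a, r) in enumerate(traj):
            obs_hist.append(obs)
            s = phi(obs_hist, h)
            if t + 1 < len(traj):
                s_next = phi(obs_hist + [traj[t + 1][0]], h)
            else:
                s_next = None  # end of trajectory
            transitions.append((s, a, r, s_next))

    # Empirical Bellman iteration: each (s, a) value is the average of
    # the sampled one-step targets computed from the previous iterate.
    Q = defaultdict(float)
    for _ in range(n_iter):
        sums, counts = defaultdict(float), defaultdict(int)
        for s, a, r, s_next in transitions:
            target = r
            if s_next is not None:
                target += gamma * max(Q[(s_next, b)] for b in actions)
            sums[(s, a)] += target
            counts[(s, a)] += 1
        Q = defaultdict(float, {sa: sums[sa] / counts[sa] for sa in sums})
    return Q

if __name__ == "__main__":
    # Two short trajectories; h should be tuned to the amount of data.
    batch = [[("o1", 0, 0.0), ("o2", 1, 1.0), ("o1", 0, 0.0)],
             [("o2", 1, 1.0), ("o1", 0, 0.0), ("o2", 1, 1.0)]]
    print(sorted(fitted_q(batch, h=2, gamma=0.9).items()))

Both h and gamma act as capacity knobs in this sketch: shortening the history or lowering the discount factor reduces the effective size of the hypothesis space, which, in line with the paper's analysis, trades additional asymptotic bias for a lower risk of overfitting on a finite batch.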