Wehenkel, Louis [Université de Liège - ULg > Dép. d'électric., électron. et informat. (Inst. Montefiore) > Systèmes et modélisation]
Ernst, Damien [Université de Liège - ULg > Dép. d'électric., électron. et informat. (Inst. Montefiore) > Systèmes et modélisation]
2009
Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09)
117-123
Yes
No
International
978-1-4244-2761-1
IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09)
March 30 - April 2, 2009
Nashville
USA
[en] reinforcement learning ; model-free ; lower bound on a policy
[en] performance guarantee
[en] We propose an approach for inferring bounds on the finite-horizon return of a control policy from an off-policy sample of trajectories consisting of state transitions, rewards, and control actions. The dynamics, control policy, and reward function are assumed to be deterministic and Lipschitz continuous. Under these assumptions, we derive an algorithm that computes these bounds in time polynomial in the sample size and the length of the optimization horizon, and we characterize the tightness of the bounds in terms of the sample density.
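The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way a Lipschitz-based lower bound of this kind could be computed. It is not taken from the paper. It assumes scalar states and actions, a sample of one-step transitions (x, u, r, y), and known Lipschitz constants `l_f`, `l_rho`, `l_h` for the dynamics, reward, and policy with respect to the distance |x - x'| + |u - u'|. All function and variable names are hypothetical. The search over sequences of sampled transitions is a Viterbi-style dynamic program whose cost is quadratic in the sample size and linear in the horizon.

```python
from typing import Callable, List, Tuple

# One sampled transition: state x, action u, reward r, successor state y.
Transition = Tuple[float, float, float, float]


def lipschitz_q(n_steps: int, l_f: float, l_rho: float, l_h: float) -> float:
    """Lipschitz constant of the n-step return, seen as a function of (x, u).

    Follows from L_1 = l_rho and L_n = l_rho + l_f * (1 + l_h) * L_{n-1}.
    """
    growth = l_f * (1.0 + l_h)
    return l_rho * sum(growth ** i for i in range(n_steps))


def lower_bound(sample: List[Transition],
                policy: Callable[[int, float], float],
                x0: float, horizon: int,
                l_f: float, l_rho: float, l_h: float) -> float:
    """Best lower bound on the return over all chains of sampled transitions."""
    if not sample or horizon < 1:
        raise ValueError("need a non-empty sample and a horizon >= 1")

    # Step 0: penalise the gap between each transition and (x0, policy(0, x0)).
    lq = lipschitz_q(horizon, l_f, l_rho, l_h)
    u0 = policy(0, x0)
    values = [r - lq * (abs(x - x0) + abs(u - u0)) for (x, u, r, _) in sample]

    for t in range(1, horizon):
        lq = lipschitz_q(horizon - t, l_f, l_rho, l_h)
        # Where each previous transition ended, and what the policy does there.
        targets = [(y, policy(t, y)) for (_, _, _, y) in sample]
        new_values = []
        for (x, u, r, _) in sample:
            best = max(v - lq * (abs(x - y_prev) + abs(u - u_prev))
                       for v, (y_prev, u_prev) in zip(values, targets))
            new_values.append(r + best)
        values = new_values

    return max(values)


# Toy system (an assumption for illustration only).
def dynamics(x: float, u: float) -> float:
    return 0.5 * x + u          # Lipschitz constant 1 for the distance above


def reward(x: float, u: float) -> float:
    return -abs(x)              # Lipschitz constant 1


def policy(t: int, x: float) -> float:
    return -0.5 * x             # Lipschitz constant 0.5


sample = [(x, u, reward(x, u), dynamics(x, u))
          for x in (-1.0, -0.5, 0.0, 0.5, 1.0)
          for u in (-0.5, 0.0, 0.5)]
demo_bound = lower_bound(sample, policy, x0=1.0, horizon=3,
                         l_f=1.0, l_rho=1.0, l_h=0.5)
print(demo_bound)
```

In this toy setting the sample happens to contain the transitions the policy would actually generate from x0 = 1, so every distance penalty along that chain is zero and the bound coincides with the true return. With a sparser sample the penalties grow with the gaps between consecutive transitions, which is the sense in which tightness depends on sample density.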
Fonds de la Recherche Scientifique (Communauté française de Belgique) - F.R.S.-FNRS