Reference : Learning exploration/exploitation strategies for single trajectory reinforcement learning
Scientific congresses and symposiums : Paper published in a book
Engineering, computing & technology : Computer science
http://hdl.handle.net/2268/127985
Learning exploration/exploitation strategies for single trajectory reinforcement learning
English
Castronovo, Michaël [Université de Liège - ULg > 2e an. master sc. infor., fin. appr.]
Maes, Francis [Université de Liège - ULg > Dép. d'électric., électron. et informat. (Inst. Montefiore) > Systèmes et modélisation]
Fonteneau, Raphaël [Université de Liège - ULg > Dép. d'électric., électron. et informat. (Inst. Montefiore) > Systèmes et modélisation]
Ernst, Damien [Université de Liège - ULg > Dép. d'électric., électron. et informat. (Inst. Montefiore) > Smart grids]
2012
Proceedings of the 10th European Workshop on Reinforcement Learning (EWRL 2012)
JMLR Workshop and Conference Proceedings 24
1-9
Yes
No
International
10th European Workshop on Reinforcement Learning (EWRL 2012)
June 30-July 1, 2012
Edinburgh
UK
[en] reinforcement learning ; Exploration/Exploitation dilemma ; formula discovery
[en] We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution p_M(·). The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite-length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performance on MDPs drawn according to p_M(·). As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred, and the optimal value functions V and Q of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performance of the approach with R-max as well as with ε-Greedy strategies, and the results are promising.
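The abstract's pipeline can be illustrated with a minimal, self-contained sketch. All names here (draw_mdp, FORMULAS, run_strategy, search_best_formula) and the toy setting are illustrative assumptions, not taken from the paper: each drawn "MDP" is reduced to a two-armed Bernoulli bandit, the candidate index formulas combine only the estimated mean reward and visit counts, and a UCB1 bandit over the formulas searches for the one with the best average return on freshly drawn problems.

```python
import math
import random

# Hypothetical stand-in for the MDP distribution p_M(.): each draw is a
# two-armed Bernoulli bandit with uniformly random success probabilities.
def draw_mdp(rng):
    return [rng.random(), rng.random()]

# Candidate index formulas f(r_hat, n, t): small expressions combining the
# estimated mean reward and the visit count (an illustrative formula set).
FORMULAS = {
    "greedy":   lambda r, n, t: r,
    "ucb-like": lambda r, n, t: r + math.sqrt(2 * math.log(t + 1) / (n + 1)),
    "count":    lambda r, n, t: 1.0 / (n + 1),
}

def run_strategy(f, probs, horizon, rng):
    """Play one trajectory with the index-based strategy f; return total reward."""
    n = [0, 0]          # visit counts per action
    s = [0.0, 0.0]      # reward sums per action
    total = 0.0
    for t in range(horizon):
        r_hat = [s[a] / n[a] if n[a] else 0.0 for a in range(2)]
        a = max(range(2), key=lambda a: f(r_hat[a], n[a], t))
        reward = 1.0 if rng.random() < probs[a] else 0.0
        n[a] += 1
        s[a] += reward
        total += reward
    return total

def search_best_formula(n_pulls=3000, horizon=50, seed=0):
    """Multi-armed bandit over formulas: each pull evaluates one formula on a
    freshly drawn problem; return the formula with the best average return."""
    rng = random.Random(seed)
    names = list(FORMULAS)
    counts = {k: 0 for k in names}
    sums = {k: 0.0 for k in names}
    for i in range(n_pulls):
        # UCB1 choice among the formula arms (unpulled arms first).
        k = max(names, key=lambda k: float("inf") if counts[k] == 0
                else sums[k] / counts[k] + math.sqrt(2 * math.log(i + 1) / counts[k]))
        ret = run_strategy(FORMULAS[k], draw_mdp(rng), horizon, rng)
        counts[k] += 1
        sums[k] += ret
    return max(names, key=lambda k: sums[k] / max(counts[k], 1))
```

In the paper itself the candidate formulas are richer (they may involve the estimated reward function and the optimal value functions V and Q of the estimated MDP), and the performance criterion is a discounted return over an infinite-length trajectory; this sketch only shows the overall structure of searching a formula space with a bandit.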
Fonds de la Recherche Scientifique (Communauté française de Belgique) - F.R.S.-FNRS
Researchers ; Professionals ; Students
http://jmlr.csail.mit.edu/proceedings/papers/v24/

File(s) associated to this reference

Fulltext file(s):

File: castronovo12a.pdf — Publisher postprint — 264.51 kB — Open access


All documents in ORBi are protected by a user license.