Reference : Learning to play K-armed bandit problems
Scientific congresses and symposiums : Paper published in a book
Engineering, computing & technology : Computer science
http://hdl.handle.net/2268/101066
Title: Learning to play K-armed bandit problems
Language: English
Maes, Francis [Université de Liège - ULg > Dép. d'électric., électron. et informat. (Inst. Montefiore) > Systèmes et modélisation]
Wehenkel, Louis [Université de Liège - ULg > Dép. d'électric., électron. et informat. (Inst. Montefiore) > Systèmes et modélisation]
Ernst, Damien [Université de Liège - ULg > Dép. d'électric., électron. et informat. (Inst. Montefiore) > Smart grids]
Publication date: Feb-2012
Published in: Proceedings of the 4th International Conference on Agents and Artificial Intelligence (ICAART 2012)
Peer reviewed: Yes
Audience: International
Event: 4th International Conference on Agents and Artificial Intelligence (ICAART 2012)
Event date: 6-8 February 2012
Event place: Vilamoura, Algarve
Event country: Portugal
Keywords: Multi-armed bandit problems; reinforcement learning; exploration-exploitation dilemma
Abstract: We propose a learning approach to pre-compute K-armed bandit playing policies by exploiting prior information describing the class of problems targeted by the player. Our algorithm first samples a set of K-armed bandit problems from the given prior, and then chooses, from a space of candidate policies, the one that gives the best average performance over these problems. The candidate policies use an index for ranking the arms and pick at each play the arm with the highest index; the index of each arm is a linear combination of features describing the history of plays (e.g., number of draws, average reward, variance of rewards, and higher-order moments), and an estimation of distribution algorithm determines the optimal feature weights. We carry out simulations in the case where the prior assumes a fixed number of Bernoulli arms, a fixed horizon, and uniformly distributed parameters of the Bernoulli arms. These simulations show that the learned strategies perform very well compared with several strategies previously proposed in the literature (UCB1, UCB2, UCB-V, KL-UCB and $\epsilon_n$-GREEDY); they also highlight the robustness of the learned strategies to wrong prior information.
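The abstract describes the method in enough detail to sketch it. Below is a minimal, self-contained Python illustration, not the authors' implementation: it samples Bernoulli bandit problems from a uniform prior, plays them with an index policy whose per-arm score is a linear combination of history features, and tunes the feature weights with a simple Gaussian estimation of distribution algorithm. The feature set, horizon, problem count, and population sizes are illustrative assumptions.

# Sketch of learning an index policy for K-armed Bernoulli bandits from a prior.
# All hyperparameters below are illustrative choices, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)
K, HORIZON = 10, 100        # number of arms and plays per problem (assumed)
N_PROBLEMS = 30             # training problems sampled from the prior

def sample_problem():
    """Draw Bernoulli arm parameters uniformly, matching the paper's prior."""
    return rng.uniform(0.0, 1.0, size=K)

def features(counts, sums, sq_sums, t):
    """Per-arm history features: draws, mean reward, variance, sqrt(log t / n)."""
    n = np.maximum(counts, 1)
    mean = sums / n
    var = sq_sums / n - mean ** 2
    return np.stack([counts, mean, var, np.sqrt(np.log(t + 1) / n)])

def play(theta, mu):
    """Run one bandit problem with the index policy; return the total reward."""
    counts = np.zeros(K); sums = np.zeros(K); sq = np.zeros(K); total = 0.0
    for t in range(HORIZON):
        if t < K:                    # play each arm once to initialize statistics
            a = t
        else:                        # pick the arm with the highest linear index
            a = int(np.argmax(theta @ features(counts, sums, sq, t)))
        r = float(rng.random() < mu[a])
        counts[a] += 1; sums[a] += r; sq[a] += r * r; total += r
    return total

def learn_policy(n_iters=15, pop=24, elite=8):
    """Gaussian EDA: sample weight vectors, keep the elites, refit mean/std."""
    problems = [sample_problem() for _ in range(N_PROBLEMS)]
    mean, std = np.zeros(4), np.ones(4)
    for _ in range(n_iters):
        thetas = rng.normal(mean, std, size=(pop, 4))
        scores = [np.mean([play(th, mu) for mu in problems]) for th in thetas]
        best = thetas[np.argsort(scores)[-elite:]]
        mean, std = best.mean(axis=0), best.std(axis=0) + 1e-3
    return mean

if __name__ == "__main__":
    theta = learn_policy()
    test = [sample_problem() for _ in range(100)]
    print("learned weights:", theta)
    print("avg reward on fresh problems:", np.mean([play(theta, mu) for mu in test]))

The Gaussian elite refit used here is just one simple estimation of distribution variant, chosen for brevity; the feature list mirrors the examples named in the abstract (number of draws, average reward, variance of rewards) plus a UCB-style exploration term.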

Files associated with this reference

Fulltext file(s):

File: icaart-2012.pdf
Version: Publisher postprint
Size: 135.2 kB
Access: Open access

Additional material(s):

File: Ernst-INRIA-2011-talk.pdf
Commentary: This paper, together with the papers "Optimized look-ahead tree search policies" and "Automatic discovery of ranking formulas for playing with multi-armed bandits", is part of a body of work on automatically learning good strategies for exploration-exploitation problems in reinforcement learning. This file is a presentation of that body of work.
Size: 389.2 kB
Access: Open access

