References of "Castronovo, Michaël"

Learning for exploration/exploitation in reinforcement learning
Castronovo, Michaël (ULg)

Master's dissertation (2012)

We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is assumed to be drawn from a known probability distribution p_M(·). The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite-length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performance on MDPs drawn according to p_M(·). As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred, and the optimal value functions V̂ and Q̂ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performance of the approach with R-max as well as with ε-Greedy strategies, and the results are promising.
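
The abstract above compresses the method into a few sentences; as an illustration only, here is a minimal Python sketch of what an index-based E/E strategy of this kind could look like. The particular formula and all names (index_formula, q_hat, r_hat, counts) are hypothetical placeholders, not taken from the dissertation.

    import numpy as np

    def index_formula(q_hat, r_hat, n_sa):
        # One hypothetical candidate formula over the estimated MDP's
        # statistics: optimism that decays with the visit count n(s, a).
        return q_hat + r_hat / np.sqrt(1.0 + n_sa)

    def act(state, q_hat, r_hat, counts):
        # Index-based strategy: in the current state, play the action
        # whose index (the formula's value) is largest.
        scores = index_formula(q_hat[state], r_hat[state], counts[state])
        return int(np.argmax(scores))

Here q_hat and r_hat would be the action-value and reward estimates of the MDP learned so far, and counts the per-(state, action) visit counts; a learned formula of this shape replaces a hand-designed rule such as ε-Greedy.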

Peer Reviewed
Learning exploration/exploitation strategies for single trajectory reinforcement learning
Castronovo, Michaël (ULg); Maes, Francis (ULg); Fonteneau, Raphaël (ULg) et al.

In Proceedings of the 10th European Workshop on Reinforcement Learning (EWRL 2012)

We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is assumed to be drawn from a known probability distribution p_M(·). The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite-length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performance on MDPs drawn according to p_M(·). As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred, and the optimal value functions V̂ and Q̂ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performance of the approach with R-max as well as with ε-Greedy strategies, and the results are promising.
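
The abstract says the search over formulas is formalized as a multi-armed bandit problem but does not name a specific bandit algorithm; the sketch below uses UCB1 as one standard choice, purely to make the idea concrete. The names formulas, evaluate, and budget are assumptions for this example, not the paper's API.

    import math

    def formula_search(formulas, evaluate, budget):
        # Each arm is a candidate formula; pulling an arm means running
        # its E/E strategy on one MDP drawn from p_M and observing the
        # discounted return, which evaluate(formula) is assumed to give.
        n = [0] * len(formulas)        # pulls per arm
        mean = [0.0] * len(formulas)   # running mean return per arm
        for t in range(1, budget + 1):
            if t <= len(formulas):
                arm = t - 1            # pull every arm once first
            else:                      # then follow the UCB1 index
                arm = max(range(len(formulas)),
                          key=lambda i: mean[i] + math.sqrt(2 * math.log(t) / n[i]))
            r = evaluate(formulas[arm])
            n[arm] += 1
            mean[arm] += (r - mean[arm]) / n[arm]
        return formulas[max(range(len(formulas)), key=lambda i: mean[i])]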
