References of "Castronovo, Michaël"
Offline Policy-search in Bayesian Reinforcement Learning
Castronovo, Michaël (ULiège)

Doctoral thesis (2017)

This thesis presents research contributions in the field of Bayesian Reinforcement Learning — a subfield of Reinforcement Learning in which, even though the dynamics of the system are unknown, some prior knowledge is assumed to exist in the form of a distribution over Markov decision processes. Two algorithms are presented: OPPS (Offline Prior-based Policy Search) and ANN-BRL (Artificial Neural Networks for Bayesian Reinforcement Learning). Their shared philosophy consists in analysing and exploiting the knowledge available before interacting with the system(s), and they differ in the nature of the model they make use of. The former relies on the formula-based agents introduced by Maes et al. (Maes, Wehenkel, and Ernst, 2012), while the latter relies on Artificial Neural Networks built via SAMME (Stagewise Additive Modelling using a Multi-class Exponential loss function) — an AdaBoost algorithm developed by Zhu et al. (Zhu et al., 2009). We also describe a comprehensive benchmark created to compare Bayesian Reinforcement Learning algorithms. In real-life applications, the choice of the best agent to fulfil a given task depends not only on its performance, but also on the computation time required to deploy it. This benchmark has been designed to identify the best algorithms by taking both criteria into account, and resulted in the development of an open-source library: BBRL (Benchmarking tools for Bayesian Reinforcement Learning) (https://github.com/mcastron/BBRL/wiki).
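As a reading aid, the sketch below shows the offline/online split that, according to the abstract, both OPPS and ANN-BRL follow: the prior over MDPs is exploited before any interaction, and the online phase only queries the resulting model. All class and method names are illustrative assumptions, not the actual OPPS/ANN-BRL implementations or the BBRL API.

```python
# Illustrative sketch only: interface names below are assumptions,
# not the actual OPPS/ANN-BRL implementations or the BBRL library API.
from abc import ABC, abstractmethod


class OfflinePriorBasedAgent(ABC):
    """Common shape shared by OPPS and ANN-BRL as described in the abstract."""

    @abstractmethod
    def train_offline(self, prior, time_budget_s):
        """Exploit MDPs sampled from the prior within a fixed offline time budget."""

    @abstractmethod
    def act(self, observation):
        """Return an action during the online phase, at low computational cost."""


class OPPSAgent(OfflinePriorBasedAgent):
    """Would select the best small formula-based policy on prior-sampled MDPs."""


class ANNBRLAgent(OfflinePriorBasedAgent):
    """Would fit an artificial neural network (boosted via SAMME) on data generated from the prior."""
```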

Peer Reviewed
Approximate Bayes Optimal Policy Search using Neural Networks
Castronovo, Michaël (ULiège); François-Lavet, Vincent (ULiège); Fonteneau, Raphaël (ULiège) et al.

in Proceedings of the 9th International Conference on Agents and Artificial Intelligence (ICAART 2017) (2017, February)

Bayesian Reinforcement Learning (BRL) agents aim to maximise the expected sum of rewards collected when interacting with an unknown Markov Decision Process (MDP), while exploiting some prior knowledge. State-of-the-art BRL agents rely on frequent updates of the belief on the MDP as new observations of the environment are made. This offers theoretical guarantees of convergence to an optimum, but is computationally intractable, even on small-scale problems. In this paper, we present a method that circumvents this issue by training a parametric policy able to recommend an action directly from raw observations. Artificial Neural Networks (ANNs) are used to represent this policy, and are trained on trajectories sampled from the prior. The trained model is then used online, and is able to act on the real MDP at a very low computational cost. Our new algorithm shows strong empirical performance on a wide range of test problems and is robust to inaccuracies of the prior distribution.
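The offline-training/online-acting scheme described in this abstract could look roughly like the sketch below. The feature encoder, the prior/MDP interfaces, and the use of the sampled MDP's optimal actions as training targets are assumptions made for illustration; the paper's exact construction and training procedure may differ.

```python
# Hedged sketch, not the paper's implementation: interfaces such as
# prior.sample(), mdp.optimal_action() and mdp.step() are assumed for illustration.
import numpy as np
from sklearn.neural_network import MLPClassifier


def encode(observation):
    """Assumed encoder turning a raw observation/history into a fixed-size feature vector."""
    return np.asarray(observation, dtype=float).ravel()


def train_offline_policy(prior, n_mdps=500, horizon=50):
    """Offline phase: fit a parametric policy on trajectories sampled from the prior."""
    X, y = [], []
    for _ in range(n_mdps):
        mdp = prior.sample()                 # the sampled MDP is fully known ...
        s = mdp.reset()
        for _ in range(horizon):
            a_star = mdp.optimal_action(s)   # ... so a (near-)optimal action can be computed
            X.append(encode(s))
            y.append(a_star)
            s, _, done = mdp.step(a_star)
            if done:
                break
    policy = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500)
    policy.fit(np.array(X), np.array(y))
    return policy


def act(policy, observation):
    """Online phase: a single forward pass, with no belief update or planning."""
    return policy.predict(encode(observation).reshape(1, -1))[0]
```

In this sketch, acting online costs only one network evaluation per step, which is the computational advantage highlighted in the abstract.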

Peer Reviewed
Benchmarking for Bayesian Reinforcement Learning
Castronovo, Michaël (ULiège); Ernst, Damien (ULiège); Couëtoux, Adrien (ULiège) et al.

in PLoS ONE (2016)

In the Bayesian Reinforcement Learning (BRL) setting, agents try to maximise the rewards collected while interacting with their environment, using some prior knowledge that is accessed beforehand. Many BRL algorithms have already been proposed, but even though a few toy examples exist in the literature, there are still no extensive or rigorous benchmarks to compare them. This paper addresses this problem and provides a new BRL comparison methodology along with the corresponding open-source library. This methodology defines a comparison criterion that measures the performance of algorithms on large sets of Markov Decision Processes (MDPs) drawn from some probability distributions. In order to enable the comparison of non-anytime algorithms, our methodology also includes a detailed analysis of the computation time requirement of each algorithm. Our library is released with all source code and documentation: it includes three test problems, each of which has two different prior distributions, and seven state-of-the-art RL algorithms. Finally, we illustrate the library by comparing all the available algorithms, and the results are discussed.
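The comparison methodology described above can be pictured with the following sketch: each agent is trained offline on the prior, then scored on many MDPs drawn from a test distribution, while both the offline time and the per-step online time are recorded. The interfaces (sample(), train_offline(), act(), step()) and default parameters are assumptions, not the actual BBRL API.

```python
# Hedged sketch of the benchmarking loop; agent, prior and test_distribution
# are assumed objects exposing the methods used below.
import time


def benchmark(agent, prior, test_distribution, n_mdps=1000, horizon=250, gamma=0.95):
    """Score one agent on MDPs drawn from a test distribution, tracking time and return."""
    t0 = time.perf_counter()
    agent.train_offline(prior)                 # offline phase: exploit the prior
    offline_time = time.perf_counter() - t0

    returns, online_times = [], []
    for _ in range(n_mdps):
        mdp = test_distribution.sample()       # test problem, unknown to the agent
        s, ret, discount = mdp.reset(), 0.0, 1.0
        for _ in range(horizon):
            t0 = time.perf_counter()
            a = agent.act(s)                   # online phase: measured per decision
            online_times.append(time.perf_counter() - t0)
            s, r, done = mdp.step(a)
            ret += discount * r
            discount *= gamma
            if done:
                break
        returns.append(ret)

    return {
        "mean_return": sum(returns) / len(returns),
        "offline_time_s": offline_time,
        "mean_online_time_s": sum(online_times) / len(online_times),
    }
```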

Peer Reviewed
Bayes Adaptive Reinforcement Learning versus Off-line Prior-based Policy Search: an Empirical Comparison
Castronovo, Michaël (ULiège); Ernst, Damien (ULiège); Fonteneau, Raphaël (ULiège)

in Proceedings of the 23rd annual machine learning conference of Belgium and the Netherlands (BENELEARN 2014) (2014, June)

This paper addresses the problem of decision making in unknown finite Markov decision processes (MDPs). The uncertainty about the MDPs is modeled using a prior distribution over a set of candidate MDPs. The performance criterion is the expected sum of discounted rewards collected over an infinite-length trajectory. Time constraints are defined as follows: (i) an off-line phase with a given time budget can be used to exploit the prior distribution, and (ii) at every time step of the on-line phase, decisions have to be computed within a given time budget. In this setting, we compare two decision-making strategies: OPPS, a recently proposed meta-learning scheme which mainly exploits the off-line phase to perform policy search, and BAMCP, a state-of-the-art model-based Bayesian reinforcement learning algorithm, which mainly exploits the on-line time budget. We empirically compare these approaches in a real Bayesian setting by computing their performance over a large set of problems. To the best of our knowledge, it is the first time that this is done in the reinforcement learning literature. Several settings are considered by varying the prior distribution and the distribution from which test problems are drawn. The main finding of these experiments is that there may be a significant benefit to having an off-line prior-based optimization phase in the case of informative and accurate priors, especially when on-line time constraints are tight.
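Point (ii) above, the per-step on-line time budget, is where the two strategies differ most: an OPPS-style agent evaluates a precomputed formula almost instantly, whereas a BAMCP-style planner spends the whole budget on Monte-Carlo planning. A minimal sketch of such a budgeted decision step follows; run_one_simulation() and best_action() are assumed method names, not the published BAMCP interface.

```python
# Hedged sketch of enforcing a per-step online time budget for an anytime planner.
import time


def act_within_budget(planner, observation, budget_s):
    """Run Monte-Carlo simulations until the time budget expires, then commit to an action."""
    deadline = time.perf_counter() + budget_s
    planner.reset(observation)                # re-root the search at the current belief/state
    while time.perf_counter() < deadline:
        planner.run_one_simulation()          # one tree-search rollout (assumed method)
    return planner.best_action()
```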

Peer Reviewed
Apprentissage par renforcement bayésien versus recherche directe de politique hors-ligne en utilisant une distribution a priori: comparaison empirique [Bayesian reinforcement learning versus off-line direct policy search using a prior distribution: an empirical comparison]
Castronovo, Michaël (ULiège); Ernst, Damien (ULiège); Fonteneau, Raphaël (ULiège)

in Proceedings of the 9èmes Journées Francophones de Planification, Décision et Apprentissage (2014, May)

This paper addresses the problem of sequential decision making in finite, unknown Markov decision processes (MDPs). The lack of knowledge about the MDP is modelled as a probability distribution over a set of candidate MDPs, known a priori. The performance criterion used is the expected sum of discounted rewards over an infinite trajectory. Alongside the optimality criterion, the constraints on computation time are formalised rigorously. First, an "off-line" phase preceding the interaction with the unknown MDP gives the agent the opportunity to exploit the prior distribution for a limited time. Then, during the interaction phase with the MDP, at each time step the agent must make a decision within a fixed time limit. In this context, we compare two decision-making strategies: OPPS, a recent approach that essentially exploits the off-line phase to select a policy from a set of candidate policies, and BAMCP, a recent Bayesian on-line planning approach. We compare these approaches empirically in a Bayesian setting, in the sense that we evaluate their performance over a large set of problems drawn from a test distribution. To the best of our knowledge, these are the first experimental tests of this kind in reinforcement learning. We study several scenarios by considering various distributions that can be used both as prior distributions and as test distributions. The results obtained suggest that exploiting a prior distribution during an off-line optimisation phase is a non-negligible advantage for accurate prior distributions and/or tight on-line time budgets.

Learning for exploration/exploitation in reinforcement learning
Castronovo, Michaël (ULiège)

Master's dissertation (2012)

We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution p_M(·). The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite-length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performance on MDPs drawn according to p_M(·). As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred, and the optimal value functions V̂ and Q̂ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performance of the approach with R-max as well as with ε-Greedy strategies, and the results are promising.
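To illustrate what an index-based, formula-defined E/E strategy looks like, here is a small sketch. The statistics object and the two example formulas are hypothetical, chosen only to show the ingredients named in the abstract (estimated rewards, visit counts, and the value functions of the estimated MDP); they are not the formulas found by the search.

```python
# Hypothetical formulas for illustration; the `stats` interface is assumed.
import math


def index_ucb_like(s, a, stats):
    """Example formula: estimated Q-value plus a count-based exploration bonus."""
    n_sa = stats.count(s, a)
    return stats.q_hat(s, a) + math.sqrt(2.0 * math.log(stats.total_count() + 1) / (n_sa + 1))


def index_greedy(s, a, stats):
    """Degenerate formula: pure exploitation of the estimated MDP."""
    return stats.q_hat(s, a)


def act(formula, s, actions, stats):
    """An index-based strategy plays the action maximising the formula's index."""
    return max(actions, key=lambda a: formula(s, a, stats))
```

The search described in the abstract would then treat each candidate formula as one arm of a multi-armed bandit, whose payoff is the formula's average return on MDPs sampled from p_M(·).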

Peer Reviewed
Learning exploration/exploitation strategies for single trajectory reinforcement learning
Castronovo, Michaël (ULiège); Maes, Francis (ULiège); Fonteneau, Raphaël (ULiège) et al.

in Proceedings of the 10th European Workshop on Reinforcement Learning (EWRL 2012) (2012)

We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution p_M(·). The performance criterion is the sum of discounted rewards collected by the E/E strategy over an infinite-length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performance on MDPs drawn according to p_M(·). As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred, and the optimal value functions V̂ and Q̂ of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performance of the approach with R-max as well as with ε-Greedy strategies, and the results are promising.
