Browse ORBi by ORBi project


Apprentissage actif par modification de la politique de décision courante ("Active learning by modifying the current decision policy")
Fonteneau, Raphaël; Wehenkel, Louis et al. In Sixièmes Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes (JFPDA 2011) (2011, June).

Active exploration by searching for experiments that falsify the computed control policy
Fonteneau, Raphaël; Wehenkel, Louis et al. In Proceedings of the 2011 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-11) (2011, April).
We propose a strategy for experiment selection, in the context of reinforcement learning, based on the idea that the most interesting experiments to carry out at some stage are those most liable to falsify the current hypothesis about the optimal control policy. We cast this idea in a setting where a policy learning algorithm and a model identification method are given a priori. Experiments are selected if, using the learnt environment model, they are predicted to yield a revision of the learnt control policy. Algorithms and simulation results are provided for a deterministic system with a discrete action space; they show that the proposed approach is promising.

Approximate reinforcement learning: an overview
In Proceedings of the 2011 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-11) (2011, April).
Reinforcement learning (RL) allows agents to learn how to interact optimally with complex environments. Fueled by recent advances in approximation-based algorithms, RL has obtained impressive successes in robotics, artificial intelligence, control, operations research, etc. However, the scarcity of survey papers about approximate RL makes it difficult for newcomers to grasp this intricate field. With the present overview, we take a step toward alleviating this situation. We review methods for approximate RL, starting from their dynamic programming roots and organizing them into three major classes: approximate value iteration, policy iteration, and policy search. Each class is subdivided into representative categories, highlighting among others offline and online algorithms, policy gradient methods, and simulation-based techniques. We also compare the different categories of methods and outline possible ways to enhance the reviewed algorithms.

Cross-entropy optimization of control policies with adaptive basis functions
Ernst, Damien et al. In IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics (2011), 41(1), 196-209.
This paper introduces an algorithm for direct search of control policies in continuous-state, discrete-action Markov decision processes. The algorithm looks for the best closed-loop policy that can be represented using a given number of basis functions (BFs), where a discrete action is assigned to each BF. The type of the BFs and their number are specified in advance and determine the complexity of the representation. Considerable flexibility is achieved by optimizing the locations and shapes of the BFs, together with the action assignments.
The optimization is carried out with the cross-entropy method, and the policies are evaluated by their empirical return from a representative set of initial states; the return for each representative state is estimated using Monte Carlo simulations. The resulting algorithm for cross-entropy policy search with adaptive BFs is extensively evaluated in problems with two to six state variables, for which it reliably obtains good policies with only a small number of BFs. In these experiments, cross-entropy policy search requires vastly fewer BFs than value-function techniques with equidistant BFs, and outperforms policy search with a competing optimization algorithm called DIRECT.

Towards min max generalization in reinforcement learning
Fonteneau, Raphaël; Wehenkel, Louis et al. In Filipe, Joaquim; Fred, Ana; Sharp, Bernadette (Eds.), Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, January 2010, Revised Selected Papers (2011).
In this paper, we introduce a min max approach for addressing the generalization problem in reinforcement learning. The min max approach works by determining a sequence of actions that maximizes the worst return that could possibly be obtained considering any dynamics and reward function compatible with the sample of trajectories and some prior knowledge on the environment. We consider the particular case of deterministic Lipschitz continuous environments over continuous state spaces, finite action spaces, and a finite optimization horizon. We discuss the non-triviality of computing an exact solution of the min max problem, even after reformulating it so as to avoid search in function spaces.
To address this problem, we propose to replace, inside the min max problem, the search for the worst environment given a sequence of actions by an expression that lower-bounds the worst return obtainable for that sequence of actions. The tightness of this lower bound depends on the sample sparsity. From there, we propose an algorithm of polynomial complexity that returns a sequence of actions maximizing this lower bound. We give a condition on the sample sparsity ensuring that, for a given initial state, the proposed algorithm produces an optimal open-loop sequence of actions. Our experiments show that this algorithm can lead to more cautious policies than algorithms combining dynamic programming with function approximators.

Automatic discovery of ranking formulas for playing with multi-armed bandits
Maes, Francis; Wehenkel, Louis; Ernst, Damien. In Proceedings of the 9th European Workshop on Reinforcement Learning (EWRL 2011) (2011).
We propose an approach for automatically discovering formulas for ranking arms while playing with multi-armed bandits. The approach works by defining a grammar made of basic elements, such as addition, subtraction, the max operator, the average reward collected by an arm, its standard deviation, etc., and by exploiting this grammar to generate and test a large number of formulas. The systematic search for good candidate formulas is carried out by a built-on-purpose optimization algorithm that navigates inside this large set of candidate formulas towards those that perform well on some multi-armed bandit problems.
We have applied this approach to a set of bandit problems made of Bernoulli, Gaussian and truncated Gaussian distributions, and have identified a few simple ranking formulas that provide interesting results on every problem of this set. In particular, they clearly outperform several reference policies previously introduced in the literature. We argue that these newly found formulas, as well as the procedure for generating them, may suggest new directions for studying bandit problems.

Optimized look-ahead tree policies
Maes, Francis; Wehenkel, Louis; Ernst, Damien. In Proceedings of the 9th European Workshop on Reinforcement Learning (EWRL 2011) (2011).
We consider look-ahead tree techniques for the discrete-time control of a deterministic dynamical system, so as to maximize a sum of discounted rewards over an infinite time horizon. Given the current system state x_t at time t, these techniques explore the look-ahead tree representing possible evolutions of the system states and rewards conditioned on subsequent actions u_t, u_{t+1}, ... . When the computing budget is exhausted, they output the action u_t that led to the best found sequence of discounted rewards. In this context, we are interested in computing good strategies for exploring the look-ahead tree. We propose a generic approach that looks for such strategies by solving an optimization problem whose objective is to compute a (budget-compliant) tree-exploration strategy yielding a control policy that maximizes the average return over a postulated set of initial states.
This generic approach is fully specified for the case where the candidate tree-exploration strategies are "best-first" strategies parameterized by a linear combination of look-ahead path features, some of which have been advocated in the literature before, and where the optimization problem is solved using an EDA algorithm based on Gaussian distributions. Numerical experiments carried out on a model of the treatment of HIV infection show that the optimized tree-exploration strategy is orders of magnitude better than the previously advocated ones.

Optimal sample selection for batch-mode reinforcement learning
Rachelson, Emmanuel; Schnitzler, François; Wehenkel, Louis et al. In Proceedings of the 3rd International Conference on Agents and Artificial Intelligence (ICAART 2011) (2011).
We introduce the Optimal Sample Selection (OSS) meta-algorithm for solving discrete-time optimal control problems. This meta-algorithm maps the problem of finding a near-optimal closed-loop policy to the identification of a small set of one-step system transitions that leads to high-quality policies when used as input of a batch-mode reinforcement learning (RL) algorithm. We detail a particular instance of this OSS meta-algorithm that uses tree-based Fitted Q-Iteration as the batch-mode RL algorithm and cross-entropy search as the method for navigating efficiently in the space of sample sets.
The results show that this particular instance of the OSS meta-algorithm rapidly identifies small sample sets leading to high-quality policies.

Voltage control in an HVDC system to share primary frequency reserves between non-synchronous areas
Sarlette, Alain et al. In Proceedings of the 17th Power Systems Computation Conference (PSCC-11) (2011).
This paper addresses the problem of frequency control for non-synchronous AC areas connected by a multi-terminal HVDC grid. It proposes a decentralized control scheme for the DC voltages of the HVDC converters, aimed at making the AC areas collectively react to power imbalances. A theoretical study shows that, using local information only, the control scheme significantly reduces the impact of a power imbalance by distributing the associated frequency deviation over all areas. A secondary frequency control strategy that can be combined with this scheme is also proposed, so as to restore the frequencies and the power exchanges to their nominal values in the aftermath of a power imbalance. Simulation results on a benchmark system with five AC areas illustrate the good performance of the control scheme.

Multistage stochastic programming: A scenario tree based approach to planning under uncertainty
Defourny, Boris; Ernst, Damien; Wehenkel, Louis. In Sucar, L. Enrique; Morales, Eduardo F.; Hoey, Jesse (Eds.), Decision Theory Models for Applications in Artificial Intelligence: Concepts and Solutions (2011).
In this chapter, we present the multistage stochastic programming framework for sequential decision making under uncertainty. We discuss its differences with Markov decision processes, from the point of view of both decision models and solution algorithms. We describe the standard technique for approximately solving multistage stochastic problems, which is based on a discretization of the disturbance space called a scenario tree. We insist on a critical issue of the approach: the decisions can be very sensitive to the parameters of the scenario tree, whereas no efficient tool for checking the quality of approximate solutions exists. We then show how supervised learning techniques can be used to reliably evaluate the quality of an approximation, and thus facilitate the selection of a good scenario tree. The framework and solution techniques presented in the chapter are explained and detailed on several examples. Along the way, we define notions from decision theory that can be used to quantify, for a particular problem, the advantage of switching to a more sophisticated decision model.
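To make the scenario-tree idea concrete, here is a toy two-stage problem solved by enumerating a small tree. This is a hypothetical newsvendor-style example of mine, not taken from the chapter; the scenario set, `price` and `cost` are invented.

```python
# Toy two-stage stochastic program on a one-stage-deep scenario tree:
# choose a first-stage order quantity x before demand is known, then
# revenue is realized per scenario. Hypothetical illustration only.

# Scenario tree: root decision x, then three demand scenarios (prob, demand).
scenarios = [(0.3, 5.0), (0.5, 10.0), (0.2, 15.0)]
price, cost = 3.0, 1.0

def expected_profit(x):
    # First-stage cost is paid for sure; sales revenue depends on the scenario.
    return sum(p * (price * min(x, d) - cost * x) for p, d in scenarios)

# Exhaustive search over a small discrete first-stage decision set.
best_x = max(range(0, 21), key=expected_profit)
print(best_x, expected_profit(best_x))  # → 10 15.5
```

Changing the scenario probabilities or demand values can shift the optimal first-stage decision, which is exactly the sensitivity to the scenario-tree parameters that the chapter warns about.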
Beyond function approximators for batch mode reinforcement learning: rebuilding trajectories
Ernst, Damien. Speech/Talk (2010).

Model-free Monte Carlo-like policy evaluation
Ernst, Damien. Speech/Talk (2010).

Consequence driven decomposition of large-scale power systems security analysis
Ernst, Damien. Speech/Talk (2010).

Consequence driven decomposition of large-scale power system security analysis
Fonteneau, Florence; Ernst, Damien et al. In Proceedings of the 2010 IREP Symposium - Bulk Power Systems Dynamics and Control - VIII (2010, August).
This paper presents an approach for assessing, in operation planning studies, the security of a large-scale power system by decomposing the analysis into elementary subproblems, each corresponding to a structural weak point of the system. We suppose that the structural weak points are known a priori by the system operators, and that each is described by a set of constraints localized in some relatively small area of the system. The security analysis with respect to a given weak point thus reduces to identifying the combinations of power system configurations and disturbances that could lead to the violation of some of its constraints. We propose an iterative rare-event simulation approach for identifying such combinations among the very large set of possible ones. The procedure is illustrated on a simplified version of this problem applied to the Belgian transmission system.
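Iterative rare-event simulation in general can be sketched with a cross-entropy importance-sampling estimator of a small tail probability. This is a generic, hypothetical example and not the decomposition procedure of the paper above; the Gaussian model, `gamma` and `rho` are my choices.

```python
import math, random

# Generic cross-entropy rare-event estimator for p = P(X >= gamma), X ~ N(0, 1).
random.seed(0)
gamma, n, rho = 4.0, 10_000, 0.01

# Stage 1: iteratively shift the sampling mean toward the rare region.
mu = 0.0
while True:
    xs = sorted(random.gauss(mu, 1.0) for _ in range(n))
    elite = xs[int((1 - rho) * n):]      # best rho fraction of the batch
    if elite[0] >= gamma:                # rare region reached by the elites
        break
    mu = sum(elite) / len(elite)         # cross-entropy update of the mean

# Stage 2: importance sampling under N(mu, 1) with likelihood-ratio weights.
def lr(x):
    # Density ratio N(0, 1) / N(mu, 1).
    return math.exp(-x * mu + mu * mu / 2)

samples = [random.gauss(mu, 1.0) for _ in range(n)]
p_hat = sum(lr(x) for x in samples if x >= gamma) / n
print(p_hat)  # close to the true tail probability 1 - Phi(4) ≈ 3.17e-5
```

Naive Monte Carlo would need millions of samples to see even a handful of such events; the adaptive tilting concentrates the sampling effort on the rare combinations of interest, which is the general motivation for rare-event techniques in security analysis.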
Impact of delays on a consensus-based primary frequency control scheme for AC systems connected by a multi-terminal HVDC grid
Sarlette, Alain et al. In Proceedings of the 2010 IREP Symposium - Bulk Power Systems Dynamics and Control - VIII (2010, August).
This paper addresses the problem of sharing primary frequency control reserves among non-synchronous AC systems connected by a multi-terminal HVDC grid. We focus on a control scheme that modifies the power injections from the different areas into the DC grid based on remote measurements of the other areas' frequencies; this scheme was proposed and applied to a simplified system in previous work by the authors. The present paper investigates the effects of delays on the control scheme's effectiveness. The study shows that there generally exists a maximum acceptable delay, beyond which the areas' frequency deviations fail to converge to an equilibrium point. This constraint should be taken into account when commissioning such a control scheme.

Coordination of voltage control in a power system operated by multiple transmission utilities
Ernst, Damien et al. In Proceedings of the 2010 IREP Symposium - Bulk Power Systems Dynamics and Control - VIII (2010, August).
This paper addresses the problem of coordinating voltage control in a large-scale power system partitioned into control areas operated by independent utilities.
Two types of coordination modes are considered to obtain settings for tap changers, generator voltages, and reactive power injections from compensation devices. In the first, a supervisor entity with full knowledge and control of the system makes decisions with respect to the long-term settings of the individual utilities. In the second, the system is operated according to a decentralized coordination scheme that involves no information exchange between utilities. These methods are compared with current practices on a 4141-bus system with 7 transmission system operators, where the generation dispatch and load demand models vary in discrete steps; such a discrete-time model is sufficient to capture any event of relevance with respect to long-term system dynamics. Simulations show that centrally coordinated voltage control yields a significant improvement in terms of both operation costs and reserves for emergency control actions. The paper also emphasizes that, although it involves few changes with respect to current practices, the decentralized coordination scheme improves the operation of multi-utility power systems.

Upper confidence bound based decision making strategies and dynamic spectrum access
Ernst, Damien et al. In Proceedings of the 2010 IEEE International Conference on Communications (2010, May).
In this paper, we consider the problem of exploiting spectrum resources for a secondary user (SU) of a wireless communication network. We suggest that Upper Confidence Bound (UCB) algorithms could be useful for designing decision-making strategies that allow SUs to intelligently exploit spectrum resources based on their past observations.
These algorithms use an index that provides an optimistic estimate of the availability of the resources to the SU. The suggestion is supported by experimental results carried out on a specific dynamic spectrum access (DSA) framework.

Model-free Monte Carlo-like policy evaluation
Fonteneau, Raphaël; Wehenkel, Louis et al. In Proceedings of Conférence Francophone sur l'Apprentissage Automatique (CAp) 2010 (2010, May).
We propose an algorithm for estimating the finite-horizon expected return of a closed-loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.

Model-free Monte Carlo-like policy evaluation
Fonteneau, Raphaël; Wehenkel, Louis et al. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010) (2010, May).
We propose an algorithm for estimating the finite-horizon expected return of a closed-loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.

Generating informative trajectories by using bounds on the return of control policies
Fonteneau, Raphaël; Wehenkel, Louis et al. In Proceedings of the Workshop on Active Learning and Experimental Design 2010 (in conjunction with AISTATS 2010) (2010, May).
We propose new methods for guiding the generation of informative trajectories when solving discrete-time optimal control problems. These methods exploit recently published results that provide ways of computing bounds on the return of control policies from a set of trajectories.
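The “broken trajectories” construction used in the model-free Monte Carlo entries above can be sketched as follows. This is a simplified one-dimensional illustration under my own choices (function name, L1 distance, toy system), not the papers' exact algorithm.

```python
# Sketch of a model-free Monte Carlo-like estimator: rebuild "broken
# trajectories" from an off-policy sample of one-step transitions
# (x, u, r, x_next), picking at each step the not-yet-used transition whose
# (state, action) pair is closest to the current state and policy action.

def mfmc_return(transitions, policy, x0, horizon, n_traj=1):
    pool = list(transitions)
    total = 0.0
    for _ in range(n_traj):
        x, ret = x0, 0.0
        for _t in range(horizon):
            u = policy(x)
            # Nearest transition in (state, action) space; each sample
            # transition is used at most once across all rebuilt trajectories.
            i = min(range(len(pool)),
                    key=lambda j: abs(pool[j][0] - x) + abs(pool[j][1] - u))
            _, _, r, x_next = pool.pop(i)
            ret += r
            x = x_next
        total += ret
    return total / n_traj

# Tiny deterministic system x' = x + u with reward r = x and policy u = 1:
sample = [(x, 1, x, x + 1) for x in range(5)]   # transitions from states 0..4
est = mfmc_return(sample, lambda x: 1, x0=0, horizon=3)
print(est)  # → 3.0 (rewards 0 + 1 + 2 along the rebuilt trajectory)
```

With `n_traj` greater than one and a richer sample, averaging several rebuilt trajectories mimics Monte Carlo rollouts of the policy without ever querying the real system, which is the estimator's point.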