Coordination of voltage control in a power system operated by multiple transmission utilities
; ; Ernst, Damien
in Proceedings of the 2010 IREP Symposium - Bulk Power Systems Dynamics and Control - VIII (2010, August)
This paper addresses the problem of coordinating voltage control in a large-scale power system partitioned into control areas operated by independent utilities. Two types of coordination modes are considered to obtain settings for tap changers, generator voltages, and reactive power injections from compensation devices. First, it is supposed that a supervisor entity, with full knowledge and control of the system, makes decisions with respect to long-term settings of individual utilities. Second, the system is operated according to a decentralized coordination scheme that involves no information exchange between utilities. Those methods are compared with current practices on a 4141 bus system with 7 transmission system operators, where the generation dispatch and load demand models vary in discrete steps. Such a discrete-time model is sufficient to model any event of relevance with respect to long-term system dynamics. Simulations show that centrally coordinated voltage control yields a significant improvement in terms of both operation costs and reserves for emergency control actions. This paper also emphasizes that, although it involves few changes with respect to current practices, the decentralized coordination scheme improves the operation of multi-utility power systems.

Upper confidence bound based decision making strategies and dynamic spectrum access
; Ernst, Damien ; et al.
in Proceedings of the 2010 IEEE International Conference on Communications (2010, May)
In this paper, we consider the problem of exploiting spectrum resources for a secondary user (SU) of a wireless communication network. We suggest that Upper Confidence Bound (UCB) algorithms could be useful to design decision making strategies for SUs to intelligently exploit the spectrum resources based on their past observations. The algorithms use an index that provides an optimistic estimation of the availability of the resources to the SU. The suggestion is supported by some experimental results carried out on a specific dynamic spectrum access (DSA) framework.

Model-free Monte Carlo–like policy evaluation
Fonteneau, Raphaël ; ; Wehenkel, Louis et al.
in Proceedings of Conférence Francophone sur l'Apprentissage Automatique (CAp) 2010 (2010, May)
We propose an algorithm for estimating the finite-horizon expected return of a closed loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy.
Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.

Model-free Monte Carlo-like policy evaluation
Fonteneau, Raphaël ; ; Wehenkel, Louis et al.
in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010) (2010, May)
We propose an algorithm for estimating the finite-horizon expected return of a closed loop control policy from an a priori given (off-policy) sample of one-step transitions. It averages cumulated rewards along a set of “broken trajectories” made of one-step transitions selected from the sample on the basis of the control policy. Under some Lipschitz continuity assumptions on the system dynamics, reward function and control policy, we provide bounds on the bias and variance of the estimator that depend only on the Lipschitz constants, on the number of broken trajectories used in the estimator, and on the sparsity of the sample of one-step transitions.

Generating informative trajectories by using bounds on the return of control policies
Fonteneau, Raphaël ; ; Wehenkel, Louis et al.
in Proceedings of the Workshop on Active Learning and Experimental Design 2010 (in conjunction with AISTATS 2010) (2010, May)
We propose new methods for guiding the generation of informative trajectories when solving discrete-time optimal control problems. These methods exploit recently published results that provide ways for computing bounds on the return of control policies from a set of trajectories.

Using prior knowledge to accelerate online least-squares policy iteration
; ; et al.
in Proceedings of the 2010 IEEE International Conference on Automation, Quality and Testing, Robotics (2010, May)
Reinforcement learning (RL) is a promising paradigm for learning optimal control. Although RL is generally envisioned as working without any prior knowledge about the system, such knowledge is often available and can be exploited to great advantage. In this paper, we consider prior knowledge about the monotonicity of the control policy with respect to the system states, and we introduce an approach that exploits this type of prior knowledge to accelerate a state-of-the-art RL algorithm called online least-squares policy iteration (LSPI). Monotonic policies are appropriate for important classes of systems appearing in control applications. LSPI is a data-efficient RL algorithm that we previously extended to online learning, but that did not provide until now a way to use prior knowledge about the policy. In an empirical evaluation, online LSPI with prior knowledge learns much faster and more reliably than the original online LSPI.

Approximate dynamic programming with a fuzzy parameterization
; Ernst, Damien ; et al.
in Automatica (2010), 46(5), 804-814
Dynamic programming (DP) is a powerful paradigm for general, nonlinear optimal control.
Computing exact DP solutions is in general only possible when the process states and the control actions take values in a small discrete set. In practice, it is necessary to approximate the solutions. Therefore, we propose an algorithm for approximate DP that relies on a fuzzy partition of the state space, and on a discretization of the action space. This fuzzy Q-iteration algorithm works for deterministic processes, under the discounted return criterion. We prove that fuzzy Q-iteration asymptotically converges to a solution that lies within a bound of the optimal solution. A bound on the suboptimality of the solution obtained in a finite number of iterations is also derived. Under continuity assumptions on the dynamics and on the reward function, we show that fuzzy Q-iteration is consistent, i.e., that it asymptotically obtains the optimal solution as the approximation accuracy increases. These properties hold both when the parameters of the approximator are updated in a synchronous fashion, and when they are updated asynchronously. The asynchronous algorithm is proven to converge at least as fast as the synchronous one. The performance of fuzzy Q-iteration is illustrated in a two-link manipulator control problem.
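To make the idea in the preceding abstract concrete, here is a minimal Python sketch of synchronous fuzzy Q-iteration. It is an illustration under stated assumptions, not the authors' implementation: it assumes a one-dimensional state, triangular membership functions on a sorted grid of centers, and the commonly used update theta[i][j] <- rho(x_i, u_j) + gamma * max_j' sum_i' phi_i'(f(x_i, u_j)) * theta[i'][j']. The function names, the toy dynamics `f`, and the toy reward `rho` are all hypothetical.

```python
from typing import Callable, List, Sequence


def triangular_memberships(x: float, centers: Sequence[float]) -> List[float]:
    """Membership degrees of x in triangular fuzzy sets on a sorted grid.

    Assumes at least two centers. Degrees are non-negative and sum to 1.
    """
    phi = [0.0] * len(centers)
    x = min(max(x, centers[0]), centers[-1])  # saturate at the grid boundaries
    for i in range(len(centers) - 1):
        left, right = centers[i], centers[i + 1]
        if left <= x <= right:
            w = (x - left) / (right - left)
            phi[i], phi[i + 1] = 1.0 - w, w
            break
    return phi


def fuzzy_q_iteration(
    f: Callable[[float, float], float],
    rho: Callable[[float, float], float],
    centers: Sequence[float],
    actions: Sequence[float],
    gamma: float = 0.95,
    tol: float = 1e-8,
    max_iter: int = 10_000,
) -> List[List[float]]:
    """Synchronous fuzzy Q-iteration for a deterministic process (sketch)."""
    n, m = len(centers), len(actions)
    theta = [[0.0] * m for _ in range(n)]
    # The process is deterministic, so rewards and successor memberships
    # can be computed once for every (center, discrete action) pair.
    rewards = [[rho(x, u) for u in actions] for x in centers]
    next_phi = [[triangular_memberships(f(x, u), centers) for u in actions]
                for x in centers]
    for _ in range(max_iter):
        new_theta = [[0.0] * m for _ in range(n)]
        delta = 0.0
        for i in range(n):
            for j in range(m):
                phi = next_phi[i][j]
                # Approximate optimal value of the successor state.
                best = max(sum(phi[k] * theta[k][jp] for k in range(n))
                           for jp in range(m))
                new_theta[i][j] = rewards[i][j] + gamma * best
                delta = max(delta, abs(new_theta[i][j] - theta[i][j]))
        theta = new_theta
        if delta < tol:  # parameters have stopped changing
            break
    return theta


def greedy_action(x: float, theta: List[List[float]],
                  centers: Sequence[float], actions: Sequence[float]) -> float:
    """Discrete action maximizing the interpolated Q-value at state x."""
    phi = triangular_memberships(x, centers)
    q = [sum(phi[k] * theta[k][j] for k in range(len(centers)))
         for j in range(len(actions))]
    return actions[q.index(max(q))]


# Hypothetical toy problem: drive a scalar state in [-2, 2] toward 0.
centers = [-2.0, -1.0, 0.0, 1.0, 2.0]
actions = [-1.0, 0.0, 1.0]


def f(x: float, u: float) -> float:
    return min(max(x + u, -2.0), 2.0)


def rho(x: float, u: float) -> float:
    return -abs(f(x, u))


theta = fuzzy_q_iteration(f, rho, centers, actions)
policy = [greedy_action(x, theta, centers, actions) for x in centers]
```

In this toy setting the greedy policy evaluated at the grid centers moves the state toward the origin and then holds it there. The asynchronous variant mentioned in the abstract would instead update `theta` in place, so that each new parameter is used immediately by later updates in the same sweep.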

Reinforcement Learning and Dynamic Programming using Function Approximators
; ; et al.
Book published by CRC Press (2010)

A cautious approach to generalization in reinforcement learning
Fonteneau, Raphaël ; ; Wehenkel, Louis et al.
in Proceedings of the 2nd International Conference on Agents and Artificial Intelligence (2010, January)
In the context of a deterministic Lipschitz continuous environment over continuous state spaces, finite action spaces, and a finite optimization horizon, we propose an algorithm of polynomial complexity which exploits weak prior knowledge about its environment for computing from a given sample of trajectories and for a given initial state a sequence of actions. The proposed Viterbi-like algorithm maximizes a recently proposed lower bound on the return depending on the initial state, and uses to this end prior knowledge about the environment provided in the form of upper bounds on its Lipschitz constants. It thereby avoids, in a way depending on the initial state and on the prior knowledge, those regions of the state space where the sample is too sparse to make safe generalizations. Our experiments show that it can lead to more cautious policies than algorithms combining dynamic programming with function approximators. We also give a condition on the sample sparsity ensuring that, for a given initial state, the proposed algorithm produces an optimal sequence of actions in open-loop.

Exploiting policy knowledge in online least-squares policy iteration: An empirical study
; Ernst, Damien ; et al.
in Automation, Computers, Applied Mathematics (2010), 19(4), 521-529
Reinforcement learning (RL) is a promising paradigm for learning optimal control. Traditional RL works for discrete variables only, so to deal with the continuous variables appearing in control problems, approximate representations of the solution are necessary. The field of approximate RL has tremendously expanded over the last decade, and a wide array of effective algorithms is now available. However, RL is generally envisioned as working without any prior knowledge about the system or the solution, whereas such knowledge is often available and can be exploited to great advantage. Therefore, in this paper we describe a method that exploits prior knowledge to accelerate online least-squares policy iteration (LSPI), a state-of-the-art algorithm for approximate RL. We focus on prior knowledge about the monotonicity of the control policy with respect to the system states. Such monotonic policies are appropriate for important classes of systems appearing in control applications, including for instance nearly linear systems and linear systems with monotonic input nonlinearities. In an empirical evaluation, online LSPI with prior knowledge is shown to learn much faster and more reliably than the original online LSPI.

Computing bounds for kernel-based policy evaluation in reinforcement learning
Fonteneau, Raphaël ; ; Wehenkel, Louis et al.
Report (2010)
This technical report proposes an approach for computing bounds on the finite-time return of a policy using kernel-based approximators from a sample of trajectories in a continuous state space and deterministic framework.

Voronoi model learning for batch mode reinforcement learning
Fonteneau, Raphaël ; Ernst, Damien
Report (2010)
We consider deterministic optimal control problems with continuous state spaces where the information on the system dynamics and the reward function is constrained to a set of system transitions. Each system transition gathers a state, the action taken while being in this state, the immediate reward observed and the next state reached. In such a context, we propose a new model learning-type reinforcement learning (RL) algorithm in batch mode, finite-time and deterministic setting. The algorithm, named Voronoi reinforcement learning (VRL), approximates from a sample of system transitions the system dynamics and the reward function of the optimal control problem using piecewise constant functions on a Voronoi-like partition of the state-action space.

Online least-squares policy iteration for reinforcement learning control
; Ernst, Damien ; et al.
in Proceedings of the 2010 American Control Conference (2010)
Reinforcement learning is a promising paradigm for learning optimal control. We consider policy iteration (PI) algorithms for reinforcement learning, which iteratively evaluate and improve control policies. State-of-the-art, least-squares techniques for policy evaluation are sample-efficient and have relaxed convergence requirements. However, they are typically used in offline PI, whereas a central goal of reinforcement learning is to develop online algorithms. Therefore, we propose an online PI algorithm that evaluates policies with the so-called least-squares temporal difference for Q-functions (LSTD-Q). The crucial difference between this online least-squares policy iteration (LSPI) algorithm and its offline counterpart is that, in the online case, policy improvements must be performed once every few state transitions, using only an incomplete evaluation of the current policy. In an extensive experimental evaluation, online LSPI is found to work well for a wide range of its parameters, and to learn successfully in a real-time example. Online LSPI also compares favorably with offline LSPI and with a different flavor of online PI, which instead of LSTD-Q employs another least-squares method for policy evaluation.

Multi-armed bandit based policies for cognitive radio's decision making issues
; Ernst, Damien ; et al.
in Proceedings of the 3rd International Conference on Signals, Circuits and Systems (SCS) (2009, November)
We suggest in this paper that many problems related to Cognitive Radio’s (CR) decision making inside CR equipment can be formalized as Multi-Armed Bandit problems and that solving such problems by using Upper Confidence Bound (UCB) algorithms can lead to high-performance CR devices. An application of these algorithms to an academic Cognitive Radio problem is reported.

Apoptosis characterizes immunological failure of HIV infected patients
; ; Fonteneau, Raphaël et al.
in Control Engineering Practice (2009), 17(7), 798-804
This paper studies the influence of apoptosis in the dynamics of the HIV infection. A new modeling of the healthy CD4+ T-cells activation-induced apoptosis is used. The parameters of this model are identified by using clinical data generated by monitoring patients starting Highly Active Anti-Retroviral Therapy (HAART). The sampling of blood tests is performed to satisfy the constraints of dynamical system parameter identification. The apoptosis parameter, which is inferred from clinical data, is then shown to play a key role in the early diagnosis of immunological failure.

What is the likely future of real-time transient stability?
Ernst, Damien ; Wehenkel, Louis ; Pavella, Mania
in Proceedings of the 2009 IEEE/PES Power Systems Conference & Exposition (PSCE 2009) (2009)
Despite very intensive research efforts in the field of transient stability during the last five decades, the large majority of the derived techniques have hardly moved from the research laboratories to the industrial world and, as a matter of fact, the very large majority of today's control centers do not make use of any real-time transient stability software. On the other hand, throughout these years the techniques developed for real-time transient stability have mainly focused on the definition of stability margins and speeding-up techniques rather than on preventive or emergency control strategies. In the light of the above observations, this paper attempts to explain the reasons for the lack of industrial interest in real-time transient stability, and also to examine an even more fundamental question, namely: is transient stability, as was stated many decades ago, still the relevant issue in the context of the new power systems morphology towards more dispersed generation, higher penetration of power electronics, larger and more complex structures, and, in addition, of economic and environmental constraints? Or, maybe, there is a need for techniques different from those developed so far?

Inferring bounds on the performance of a control policy from a sample of trajectories
Fonteneau, Raphaël ; ; Wehenkel, Louis et al.
in Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09) (2009)
We propose an approach for inferring bounds on the finite-horizon return of a control policy from an off-policy sample of trajectories collecting state transitions, rewards, and control actions. In this paper, the dynamics, control policy, and reward function are supposed to be deterministic and Lipschitz continuous. Under these assumptions, a polynomial algorithm, in terms of the sample size and length of the optimization horizon, is derived to compute these bounds, and their tightness is characterized in terms of the sample density.

Planning under uncertainty, ensembles of disturbance trees and kernelized discrete action spaces
Defourny, Boris ; Ernst, Damien ; Wehenkel, Louis
in Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09) (2009)
Optimizing decisions on an ensemble of incomplete disturbance trees and aggregating their first stage decisions has been shown to be a promising approach to (model-based) planning under uncertainty in large continuous action spaces and in small discrete ones.
The present paper extends this approach and deals with large but highly structured action spaces, through a kernel-based aggregation scheme. The technique is applied to a test problem with a discrete action space of 6561 elements adapted from the NIPS 2005 SensorNetwork benchmark.

Policy search with cross-entropy optimization of basis functions
; Ernst, Damien ; et al.
in Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09) (2009)
This paper introduces a novel algorithm for approximate policy search in continuous-state, discrete-action Markov decision processes (MDPs). Previous policy search approaches have typically used ad-hoc parameterizations developed for specific MDPs. In contrast, the novel algorithm employs a flexible policy parameterization, suitable for solving general discrete-action MDPs. The algorithm looks for the best closed-loop policy that can be represented using a given number of basis functions, where a discrete action is assigned to each basis function. The locations and shapes of the basis functions are optimized, together with the action assignments. This allows a large class of policies to be represented. The optimization is carried out with the cross-entropy method and evaluates the policies by their empirical return from a representative set of initial states. We report simulation experiments in which the algorithm reliably obtains good policies with only a small number of basis functions, albeit at sizable computational costs.
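The cross-entropy optimization step named in the preceding abstract can be illustrated with a generic Python sketch. This is not the paper's algorithm: it omits the basis-function policy parameterization and the discrete action assignments, and simply maximizes a hypothetical black-box `score` (standing in for the empirical return of a parameterized policy) with a Gaussian sampling density. The function name, default parameters, and `toy_score` are assumptions made for illustration.

```python
import random
import statistics
from typing import Callable, List


def cross_entropy_maximize(
    score: Callable[[List[float]], float],
    dim: int,
    n_samples: int = 50,
    n_elite: int = 10,
    n_iters: int = 40,
    init_std: float = 5.0,
    seed: int = 0,
) -> List[float]:
    """Maximize `score` over R^dim with a Gaussian cross-entropy search."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    for _ in range(n_iters):
        # Draw candidate parameter vectors from the current sampling density.
        samples = [[rng.gauss(mean[d], std[d]) for d in range(dim)]
                   for _ in range(n_samples)]
        # Keep the best-scoring candidates (the elite set).
        elite = sorted(samples, key=score, reverse=True)[:n_elite]
        # Refit the density to the elite set; a small floor keeps std positive.
        for d in range(dim):
            column = [p[d] for p in elite]
            mean[d] = statistics.fmean(column)
            std[d] = max(statistics.pstdev(column), 1e-12)
    return mean


def toy_score(p: List[float]) -> float:
    """Hypothetical stand-in for a policy's empirical return; peak at (3, -1)."""
    return -((p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2)


best = cross_entropy_maximize(toy_score, dim=2)
```

In a policy search setting, each call to `score` would require simulating the policy from a set of initial states, which is consistent with the sizable computational cost the abstract reports.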

A rare-event approach to build security analysis tools when N-k (k > 1) analyses are needed (as they are in large-scale power systems)
Belmudes, Florence ; Ernst, Damien ; Wehenkel, Louis
in Proceedings of the 2009 IEEE Bucharest PowerTech (2009)
We consider the problem of performing N − k security analyses in large scale power systems. In such a context, the number of potentially dangerous N − k contingencies may rapidly become very large when k grows, and so running a security analysis for each one of them is often intractable. We assume in this paper that the number of dangerous N − k contingencies is very small with respect to the number of non-dangerous ones. Under this assumption, we suggest using importance sampling techniques for identifying rare events in combinatorial search spaces. With such techniques, it is possible to identify dangerous contingencies by running security analyses for only a small number of events. A procedure relying on these techniques is proposed in this work for steady-state security analyses. This procedure has been evaluated on the IEEE 118 bus test system. The results show that it is indeed able to efficiently identify among a large set of contingencies some of the rare ones which are dangerous.