
Using prior knowledge to accelerate online least-squares policy iteration
et al.; in Proceedings of the 2010 IEEE International Conference on Automation, Quality and Testing, Robotics (May 2010)
Reinforcement learning (RL) is a promising paradigm for learning optimal control. Although RL is generally envisioned as working without any prior knowledge about the system, such knowledge is often available and can be exploited to great advantage. In this paper, we consider prior knowledge about the monotonicity of the control policy with respect to the system states, and we introduce an approach that exploits this type of prior knowledge to accelerate a state-of-the-art RL algorithm called online least-squares policy iteration (LSPI). Monotonic policies are appropriate for important classes of systems appearing in control applications. LSPI is a data-efficient RL algorithm that we previously extended to online learning, but which until now offered no way to use prior knowledge about the policy. In an empirical evaluation, online LSPI with prior knowledge learns much faster and more reliably than the original online LSPI.

Approximate dynamic programming with a fuzzy parameterization
Ernst, Damien; et al.; in Automatica (2010), 46(5), 804-814
Dynamic programming (DP) is a powerful paradigm for general, nonlinear optimal control. Computing exact DP solutions is in general only possible when the process states and the control actions take values in a small discrete set.
In practice, it is necessary to approximate the solutions. We therefore propose an algorithm for approximate DP that relies on a fuzzy partition of the state space and on a discretization of the action space. This fuzzy Q-iteration algorithm works for deterministic processes under the discounted return criterion. We prove that fuzzy Q-iteration asymptotically converges to a solution that lies within a bound of the optimal solution. A bound on the suboptimality of the solution obtained in a finite number of iterations is also derived. Under continuity assumptions on the dynamics and on the reward function, we show that fuzzy Q-iteration is consistent, i.e., that it asymptotically obtains the optimal solution as the approximation accuracy increases. These properties hold both when the parameters of the approximator are updated in a synchronous fashion and when they are updated asynchronously. The asynchronous algorithm is proven to converge at least as fast as the synchronous one. The performance of fuzzy Q-iteration is illustrated in a two-link manipulator control problem.

Reinforcement Learning and Dynamic Programming using Function Approximators
et al.; book published by CRC Press (2010)

A cautious approach to generalization in reinforcement learning
Fonteneau, Raphaël; Wehenkel, Louis; et al.; in Proceedings of the 2nd International Conference on Agents and Artificial Intelligence (January 2010)
In the context of a deterministic Lipschitz continuous environment over continuous state spaces, finite action spaces, and a finite optimization horizon, we propose an algorithm of polynomial complexity which exploits weak prior knowledge about its environment for computing, from a given sample of trajectories and for a given initial state, a sequence of actions. The proposed Viterbi-like algorithm maximizes a recently proposed lower bound on the return depending on the initial state, and uses to this end prior knowledge about the environment provided in the form of upper bounds on its Lipschitz constants. It thereby avoids, in a way depending on the initial state and on the prior knowledge, those regions of the state space where the sample is too sparse to make safe generalizations. Our experiments show that it can lead to more cautious policies than algorithms combining dynamic programming with function approximators. We also give a condition on the sample sparsity ensuring that, for a given initial state, the proposed algorithm produces an optimal sequence of actions in open loop.

Model-free Monte Carlo-like policy evaluation
Fonteneau, Raphaël; Wehenkel, Louis; et al.; in 29th Benelux Meeting on Systems and Control (2010)

Exploiting policy knowledge in online least-squares policy iteration: An empirical study
Ernst, Damien; et al.; in Automation, Computers, Applied Mathematics (2010), 19(4), 521-529
Reinforcement learning (RL) is a promising paradigm for learning optimal control. Traditional RL works for discrete variables only, so to deal with the continuous variables appearing in control problems, approximate representations of the solution are necessary.
The field of approximate RL has expanded tremendously over the last decade, and a wide array of effective algorithms is now available. However, RL is generally envisioned as working without any prior knowledge about the system or the solution, whereas such knowledge is often available and can be exploited to great advantage. Therefore, in this paper we describe a method that exploits prior knowledge to accelerate online least-squares policy iteration (LSPI), a state-of-the-art algorithm for approximate RL. We focus on prior knowledge about the monotonicity of the control policy with respect to the system states. Such monotonic policies are appropriate for important classes of systems appearing in control applications, including for instance nearly linear systems and linear systems with monotonic input nonlinearities. In an empirical evaluation, online LSPI with prior knowledge is shown to learn much faster and more reliably than the original online LSPI.

Computing bounds for kernel-based policy evaluation in reinforcement learning
Fonteneau, Raphaël; Wehenkel, Louis; et al.; report (2010)
This technical report proposes an approach for computing bounds on the finite-time return of a policy using kernel-based approximators, from a sample of trajectories, in a continuous state space and deterministic framework.

Voronoi model learning for batch mode reinforcement learning
Fonteneau, Raphaël; Ernst, Damien; report (2010)
We consider deterministic optimal control problems with continuous state spaces where the information on the system dynamics and the reward function is constrained to a set of system transitions. Each system transition gathers a state, the action taken in this state, the immediate reward observed, and the next state reached. In this context, we propose a new model-learning-type reinforcement learning (RL) algorithm in a batch-mode, finite-time, deterministic setting. The algorithm, named Voronoi reinforcement learning (VRL), approximates from a sample of system transitions the system dynamics and the reward function of the optimal control problem, using piecewise constant functions on a Voronoi-like partition of the state-action space.

Online least-squares policy iteration for reinforcement learning control
Ernst, Damien; et al.; in Proceedings of the 2010 American Control Conference (2010)
Reinforcement learning is a promising paradigm for learning optimal control. We consider policy iteration (PI) algorithms for reinforcement learning, which iteratively evaluate and improve control policies. State-of-the-art least-squares techniques for policy evaluation are sample-efficient and have relaxed convergence requirements. However, they are typically used in offline PI, whereas a central goal of reinforcement learning is to develop online algorithms. Therefore, we propose an online PI algorithm that evaluates policies with the so-called least-squares temporal difference for Q-functions (LSTD-Q).
The crucial difference between this online least-squares policy iteration (LSPI) algorithm and its offline counterpart is that, in the online case, policy improvements must be performed once every few state transitions, using only an incomplete evaluation of the current policy. In an extensive experimental evaluation, online LSPI is found to work well for a wide range of its parameters and to learn successfully in a real-time example. Online LSPI also compares favorably with offline LSPI and with a different flavor of online PI which, instead of LSTD-Q, employs another least-squares method for policy evaluation.

Multi-armed bandit based policies for cognitive radio's decision making issues
Ernst, Damien; et al.; in Proceedings of the 3rd International Conference on Signals, Circuits and Systems (SCS) (November 2009)
We suggest in this paper that many problems related to cognitive radio (CR) decision making inside CR equipment can be formalized as multi-armed bandit problems, and that solving such problems with upper confidence bound (UCB) algorithms can lead to high-performance CR devices. An application of these algorithms to an academic cognitive radio problem is reported.

Apoptosis characterizes immunological failure of HIV infected patients
Fonteneau, Raphaël; et al.; in Control Engineering Practice (2009), 17(7), 798-804
This paper studies the influence of apoptosis on the dynamics of the HIV infection.
A new model of the activation-induced apoptosis of healthy CD4+ T-cells is used. The parameters of this model are identified using clinical data generated by monitoring patients starting Highly Active Anti-Retroviral Therapy (HAART). The sampling of blood tests is performed to satisfy the constraints of dynamical-system parameter identification. The apoptosis parameter, which is inferred from clinical data, is then shown to play a key role in the early diagnosis of immunological failure.

Lower bounds in reinforcement learning: the intelligent agent dream is getting closer
Ernst, Damien; speech/talk (2009)

What is the likely future of real-time transient stability?
Ernst, Damien; Wehenkel, Louis; Pavella, Mania; in Proceedings of the 2009 IEEE/PES Power Systems Conference & Exposition (PSCE 2009)
Despite very intensive research efforts in the field of transient stability during the last five decades, the large majority of the derived techniques have hardly moved from the research laboratories to the industrial world; as a matter of fact, the very large majority of today's control centers do not make use of any real-time transient stability software. Moreover, over all these years the techniques developed for real-time transient stability have mainly focused on the definition of stability margins and speeding-up techniques rather than on preventive or emergency control strategies.
In light of the above observations, this paper attempts to explain the reasons for the lack of industrial interest in real-time transient stability, and also to examine an even more fundamental question: is transient stability, as it was framed many decades ago, still the relevant issue in the context of the new power systems morphology, with more dispersed generation, higher penetration of power electronics, larger and more complex structures, and, in addition, economic and environmental constraints? Or is there perhaps a need for techniques different from those developed so far?

Inferring bounds on the performance of a control policy from a sample of one-step system transitions
Fonteneau, Raphaël; Wehenkel, Louis; et al.; in 28th Benelux Meeting on Systems and Control (2009)

Dynamic treatment regimes using reinforcement learning: a cautious generalization approach
Fonteneau, Raphaël; Wehenkel, Louis; et al.; poster (2009)

Evaluation of network equivalents for voltage optimization in multi-area power systems
et al.; in IEEE Transactions on Power Systems (2009), 24(2), 729-743
The paper addresses the problem of decentralized optimization for a power system partitioned into several areas controlled by different transmission system operators (TSOs). The optimization variables are the settings of taps, generators' voltages, and compensators, and the objective function is based either on the minimization of reactive power support, the minimization of active power losses, or a combination of both criteria.
We suppose that each TSO assumes an external network equivalent for its neighboring areas and optimizes its own objective function without concern for the neighboring systems' objectives. We study, in the context where every TSO adopts the same type of objective function, the performance of an iterative scheme where every TSO refreshes at each iteration the parameters of its external network equivalents depending on its past internal observations, solves its local optimization problem, and then applies its "optimal actions" to the power system. In the context of voltage optimization, we find that this decentralized control scheme can converge to nearly optimal global performance for relatively simple equivalents and simple procedures for fitting their parameters.

Inferring bounds on the performance of a control policy from a sample of trajectories
Fonteneau, Raphaël; Wehenkel, Louis; et al.; in Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09) (2009)
We propose an approach for inferring bounds on the finite-horizon return of a control policy from an off-policy sample of trajectories collecting state transitions, rewards, and control actions. In this paper, the dynamics, control policy, and reward function are supposed to be deterministic and Lipschitz continuous. Under these assumptions, an algorithm that is polynomial in the sample size and the length of the optimization horizon is derived to compute these bounds, and their tightness is characterized in terms of the sample density.
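The abstract above describes lower bounds on a policy's return built from sampled transitions and Lipschitz constants. The following is only a toy one-dimensional sketch of that general idea, not the paper's algorithm: the function, the greedy nearest-transition selection, and the constants `L_f` and `L_rho` are illustrative assumptions.

```python
import math

def lipschitz_lower_bound(x0, actions, transitions, L_f, L_rho):
    """Toy lower bound on the return of an open-loop action sequence.

    transitions: list of one-step transitions (x, u, r, y) with scalar states.
    At each step we pick a sampled transition using the prescribed action and
    discount its reward by L_rho times a bound on the distance between the
    true (unknown) current state and the sampled state.
    """
    bound = 0.0
    prev_y = x0   # sampled successor state of the previous step
    drift = 0.0   # accumulated bound on |true state - prev_y|
    for u in actions:
        cands = [t for t in transitions if t[1] == u]
        if not cands:
            return -math.inf  # no information for this action: no finite bound
        # greedily take the sampled transition closest to our state estimate
        x, _, r, y = min(cands, key=lambda t: abs(t[0] - prev_y))
        delta = drift + abs(x - prev_y)  # bound on |true state - x|
        bound += r - L_rho * delta       # pessimistic (Lipschitz) reward
        drift = L_f * delta              # uncertainty propagated by dynamics
        prev_y = y
    return bound
```

When the sample happens to contain the exact trajectory, every `delta` is zero and the bound equals the observed return; starting away from the sampled states tightens or loosens the bound in proportion to the assumed constants.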
Planning under uncertainty, ensembles of disturbance trees and kernelized discrete action spaces
Defourny, Boris; Ernst, Damien; Wehenkel, Louis; in Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09) (2009)
Optimizing decisions on an ensemble of incomplete disturbance trees and aggregating their first-stage decisions has been shown to be a promising approach to (model-based) planning under uncertainty in large continuous action spaces and in small discrete ones. The present paper extends this approach to large but highly structured action spaces, through a kernel-based aggregation scheme. The technique is applied to a test problem with a discrete action space of 6561 elements, adapted from the NIPS 2005 SensorNetwork benchmark.

Policy search with cross-entropy optimization of basis functions
Ernst, Damien; et al.; in Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL-09) (2009)
This paper introduces a novel algorithm for approximate policy search in continuous-state, discrete-action Markov decision processes (MDPs). Previous policy search approaches have typically used ad hoc parameterizations developed for specific MDPs. In contrast, the novel algorithm employs a flexible policy parameterization suitable for solving general discrete-action MDPs.
The algorithm looks for the best closed-loop policy that can be represented using a given number of basis functions, where a discrete action is assigned to each basis function. The locations and shapes of the basis functions are optimized, together with the action assignments. This allows a large class of policies to be represented. The optimization is carried out with the cross-entropy method, and the policies are evaluated by their empirical return from a representative set of initial states. We report simulation experiments in which the algorithm reliably obtains good policies with only a small number of basis functions, albeit at sizable computational cost.

A rare-event approach to build security analysis tools when N-k (k > 1) analyses are needed (as they are in large-scale power systems)
Belmudes, Florence; Ernst, Damien; Wehenkel, Louis; in Proceedings of the 2009 IEEE Bucharest PowerTech (2009)
We consider the problem of performing N − k security analyses in large-scale power systems. In such a context, the number of potentially dangerous N − k contingencies may rapidly become very large when k grows, so running a security analysis for each one of them is often intractable. We assume in this paper that the number of dangerous N − k contingencies is very small with respect to the number of non-dangerous ones. Under this assumption, we suggest using importance sampling techniques for identifying rare events in combinatorial search spaces. With such techniques, it is possible to identify dangerous contingencies by running security analyses for only a small number of events. A procedure relying on these techniques is proposed in this work for steady-state security analyses.
This procedure has been evaluated on the IEEE 118-bus test system. The results show that it is indeed able to efficiently identify, among a large set of contingencies, some of the rare ones which are dangerous.
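The last abstract describes screening rare dangerous N − k contingencies with importance sampling rather than exhaustive analysis. The sketch below illustrates that general idea on a toy problem; it is not the paper's procedure, and the stub `is_dangerous` test, the line indices, and the weight-update rule are all illustrative assumptions.

```python
import random

def is_dangerous(outage):
    # Stub "security analysis": dangerous only when lines 3 and 7 fail
    # together (a rare combination). In practice this would be a costly
    # power-flow or stability computation.
    return {3, 7} <= set(outage)

def weighted_subset(rng, weights, k):
    # Draw k distinct line indices, each proportional to its current weight.
    pool = list(range(len(weights)))
    w = list(weights)
    chosen = []
    for _ in range(k):
        i = rng.choices(range(len(pool)), weights=w)[0]
        chosen.append(pool.pop(i))
        w.pop(i)
    return tuple(sorted(chosen))

def find_dangerous(n_lines=20, k=3, iters=15, batch=200, smooth=0.7, seed=0):
    # Cross-entropy-flavored importance sampling over N-k outages: shift
    # sampling weight toward lines that appear in dangerous samples, so
    # later batches concentrate on the rare dangerous region.
    rng = random.Random(seed)
    weights = [1.0] * n_lines
    found = set()
    for _ in range(iters):
        samples = [weighted_subset(rng, weights, k) for _ in range(batch)]
        elite = [s for s in samples if is_dangerous(s)]
        found.update(elite)
        if elite:
            freq = [sum(l in s for s in elite) / len(elite)
                    for l in range(n_lines)]
            weights = [smooth * w + (1 - smooth) * (f + 0.01)
                       for w, f in zip(weights, freq)]
    return found
```

Only `iters * batch` stub analyses are run instead of all C(20, 3) = 1140 (and combinatorially more for larger k), yet the biased sampler quickly accumulates outages containing the dangerous pair.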