Scenario Trees and Policy Selection for Multistage Stochastic Programming Using Machine Learning; Ernst, Damien ; Wehenkel, Louis ![]() in INFORMS Journal on Computing (in press) In the context of multistage stochastic optimization problems, we propose a hybrid strategy for generalizing to nonlinear decision rules, using machine learning, a finite data set of constrained vector ... [more ▼] In the context of multistage stochastic optimization problems, we propose a hybrid strategy for generalizing to nonlinear decision rules, using machine learning, a finite data set of constrained vector-valued recourse decisions optimized using scenario-tree techniques from multistage stochastic programming. The decision rules are based on a statistical model inferred from a given scenario-tree solution and are selected by out-of-sample simulation given the true problem. Because the learned rules depend on the given scenario tree, we repeat the procedure for a large number of randomly generated scenario trees and then select the best solution (policy) found for the true problem. The scheme leads to an ex post selection of the scenario tree itself. Numerical tests evaluate the dependence of the approach on the machine learning aspects and show cases where one can obtain near-optimal solutions, starting with a “weak” scenario-tree generator that randomizes the branching structure of the trees. [less ▲] Detailed reference viewed: 54 (6 ULg) The global grid; Ernst, Damien ; in Renewable Energy : An International Journal (2013), 57 This paper puts forward the vision that a natural future stage of the electricity network could be a grid spanning the whole planet and connecting most of the large power plants in the world: this is the ... [more ▼] This paper puts forward the vision that a natural future stage of the electricity network could be a grid spanning the whole planet and connecting most of the large power plants in the world: this is the “Global Grid”. The main driving force behind the Global Grid will be the harvesting of remote renewable sources, and its key infrastructure element will be the high capacity long transmission lines. Wind farms and solar power plants will supply load centers with green power over long distances. This paper focuses on the introduction of the concept, showing that a globally interconnected network can be technologically feasible and economically competitive. We further highlight the multiple opportunities emerging from a global electricity network such as smoothing the renewable energy supply and electricity demand, reducing the need for bulk storage, and reducing the volatility of the energy prices. We also discuss possible investment mechanisms and operating schemes. Among others, we envision in such a system a global power market and the establishment of two new coordinating bodies, the “Global Regulator” and the “Global System Operator”. [less ▲] Detailed reference viewed: 27 (6 ULg) Optimal discovery with probabilistic expert advice: finite time analysis and macroscopic optimality; Ernst, Damien ; in Journal of Machine Learning Research (2013), 14 We consider an original problem that arises from the issue of security analysis of a power system and that we name optimal discovery with probabilistic expert advice. We address it with an algorithm based ... [more ▼] We consider an original problem that arises from the issue of security analysis of a power system and that we name optimal discovery with probabilistic expert advice. We address it with an algorithm based on the optimistic paradigm and on the Good-Turing missing mass estimator. We prove two different regret bounds on the performance of this algorithm under weak assumptions on the probabilistic experts. Under more restrictive hypotheses, we also prove a macroscopic optimality result, comparing the algorithm both with an oracle strategy and with uniform sampling. Finally, we provide numerical experiments illustrating these theoretical findings. [less ▲] Detailed reference viewed: 4 (2 ULg) Stratégies d'échantillonnage pour l'apprentissage par renforcement batchFonteneau, Raphaël ; ; Wehenkel, Louis et alin Revue d'Intelligence Artificielle [=RIA] (2013) Outbound SPIT Filter with Optimal Performance GuaranteesJung, Tobias ; Martin, Sylvain ; et alin Computer Networks (2013) This paper presents a formal framework for identifying and filtering SPIT calls (SPam in Internet Telephony) in an outbound scenario with provable optimal performance. In so doing, our work is largely ... [more ▼] This paper presents a formal framework for identifying and filtering SPIT calls (SPam in Internet Telephony) in an outbound scenario with provable optimal performance. In so doing, our work is largely different from related previous work: our goal is to rigorously formalize the problem in terms of mathematical decision theory, find the optimal solution to the problem, and derive concrete bounds for its expected loss (number of mistakes the SPIT filter will make in the worst case). This goal is achieved by considering an abstracted scenario amenable to theoretical analysis, namely SPIT detection in an outbound scenario with pure sources. Our methodology is to first define the cost of making an error (false positive and false negative), apply Wald’s sequential probability ratio test to the individual sources, and then determine analytically error probabilities such that the resulting expected loss is minimized. The benefits of our approach are: (1) the method is optimal (in a sense defined in the paper); (2) the method does not rely on manual tuning and tweaking of parameters but is completely self-contained and mathematically justified; (3) the method is computationally simple and scalable. These are desirable features that would make our method a component of choice in larger, autonomic frameworks. [less ▲] Detailed reference viewed: 40 (22 ULg) Meta-learning of Exploration/Exploitation Strategies: The Multi-Armed Bandit Case; Wehenkel, Louis ; Ernst, Damien ![]() in Filipe, Joaquim; Fred, Ana (Eds.) Agents and Artificial Intelligence: 4th International Conference, ICAART 2012, Vilamoura, Portugal, February 6-8, 2012. Revised Selected Papers (2013) The exploration/exploitation (E/E) dilemma arises naturally in many subfields of Science. Multi-armed bandit problems formalize this dilemma in its canonical form. Most current research in this field ... [more ▼] The exploration/exploitation (E/E) dilemma arises naturally in many subfields of Science. Multi-armed bandit problems formalize this dilemma in its canonical form. Most current research in this field focuses on generic solutions that can be applied to a wide range of problems. However, in practice, it is often the case that a form of prior information is available about the specific class of target problems. Prior knowledge is rarely used in current solutions due to the lack of a systematic approach to incorporate it into the E/E strategy. To address a specific class of E/E problems, we propose to proceed in three steps: (i) model prior knowledge in the form of a probability distribution over the target class of E/E problems; (ii) choose a large hypothesis space of candidate E/E strategies; and (iii), solve an optimization problem to find a candidate E/E strategy of maximal average performance over a sample of problems drawn from the prior distribution. We illustrate this meta-learning approach with two different hypothesis spaces: one where E/E strategies are numerically parameterized and another where E/E strategies are represented as small symbolic formulas. We propose appropriate optimization algorithms for both cases. Our experiments, with two-armed “Bernoulli” bandit problems and various playing budgets, show that the metalearnt E/E strategies outperform generic strategies of the literature (UCB1, UCB1-T UNED, UCB-V, KL-UCB and epsilon-GREEDY); they also evaluate the robustness of the learnt E/E strategies, by tests carried out on arms whose rewards follow a truncated Gaussian distribution. [less ▲] Detailed reference viewed: 11 (2 ULg) Optimized Look-Ahead Trees: Extensions to Large and Continuous Action SpacesJung, Tobias ; Ernst, Damien ; in Proc. of IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL'13) (2013) This paper studies look-ahead tree based control policies from the viewpoint of online decision making with constraints on the computational budget allowed per decision (expressed as number of calls to ... [more ▼] This paper studies look-ahead tree based control policies from the viewpoint of online decision making with constraints on the computational budget allowed per decision (expressed as number of calls to the generative model). We consider optimized look-ahead tree (OLT) policies, a recently introduced family of hybrid techniques, which combine the advantages of look-ahead trees (high precision) with the advantages of direct policy search (low online cost) and which are specifically designed for limited online budgets. We present two extensions of the basic OLT algorithm that on the one side allow tackling deterministic optimal control problems with large and continuous action spaces and that on the other side can also help to further reduce the online complexity. [less ▲] Detailed reference viewed: 8 (5 ULg) Optimized Look-Ahead Tree Policies: A Bridge Between Look-Ahead Tree Policies and Direct Policy SearchJung, Tobias ; Wehenkel, Louis ; Ernst, Damien et alin International Journal of Adaptive Control and Signal Processing (2013) Direct policy search (DPS) and look-ahead tree (LT) policies are two popular techniques for solving difficult sequential decision-making problems. They both are simple to implement, widely applicable ... [more ▼] Direct policy search (DPS) and look-ahead tree (LT) policies are two popular techniques for solving difficult sequential decision-making problems. They both are simple to implement, widely applicable without making strong assumptions on the structure of the problem, and capable of producing high performance control policies. However, computationally both of them are, each in their own way, very expensive. DPS can require huge offline resources (effort required to obtain the policy) to first select an appropriate space of parameterized policies that works well for the targeted problem, and then to determine the best values of the parameters via global optimization. LT policies do not require any offline resources; however, they typically require huge online resources (effort required to calculate the best decision at each step) in order to grow trees of sufficient depth. In this paper, we propose optimized look-ahead trees (OLT), a model-based policy learning scheme that lies at the intersection of DPS and LT. In OLT, the control policy is represented indirectly through an algorithm that at each decision step develops, as in LT using a model of the dynamics, a small look-ahead tree until a prespecified online budget is exhausted. Unlike LT, the development of the tree is not driven by a generic heuristic; rather, the heuristic is optimized for the target problem and implemented as a parameterized node scoring function learned offline via DPS. We experimentally compare OLT with pure DPS and pure LT variants on optimal control benchmark domains. The results show that the LT-based representation is a versatile way of compactly representing policies in a DPS scheme (which results in OLT being easier to tune and having lower offline complexity than pure DPS); while at the same time, DPS helps to significantly reduce the size of the look-ahead trees that are required to take high-quality decisions (which results in OLT having lower online complexity than pure LT). Moreover, OLT produces overall better performing policies than pure DPS and pure LT and also results in policies that are robust with respect to perturbations of the initial conditions. [less ▲] Detailed reference viewed: 66 (26 ULg) A Learning Procedure for Sampling Semantically Different Valid ExpressionsLupien St-Pierre, David ; ; Ernst, Damien et alE-print/Working paper (2013) Detailed reference viewed: 21 (3 ULg) Biorthogonalization Techniques for Least Squares Temporal Difference LearningJung, Tobias ; Ernst, Damien ![]() Poster (2012, December 07) We consider Markov reward processes and study OLS-LSTD, a framework for selecting basis functions from a set of candidates to obtain a sparse representation of the value function in the context of least ... [more ▼] We consider Markov reward processes and study OLS-LSTD, a framework for selecting basis functions from a set of candidates to obtain a sparse representation of the value function in the context of least squares temporal difference learning. To support efficient both updating and downdating operations, OLS-LSTD uses a biorthogonal representation for the selected basis vectors. Empirical comparisons with the recently proposed MP and LARS frameworks for LSTD are made. [less ▲] Detailed reference viewed: 33 (11 ULg) Optimal discovery with probabilistic expert advice; Ernst, Damien ; in Proceedings of the 51st IEEE Conference on Decision and Control (CDC 2012) (2012, December) Motivated by issues of security analysis for power systems, we analyze a new problem, called optimal discovery with probabilistic expert advice. We address it with an algorithm based on the optimistic ... [more ▼] Motivated by issues of security analysis for power systems, we analyze a new problem, called optimal discovery with probabilistic expert advice. We address it with an algorithm based on the optimistic paradigm and the Good-Turingmissing mass estimator. We show that this strategy attains the optimal discovery rate in a macroscopic limit sense, under some assumptions on the probabilistic experts. We also provide numerical experiments suggesting that this optimal behavior may still hold under weaker assumptions. [less ▲] Detailed reference viewed: 7 (2 ULg) Cooperative frequency control with a multi-terminal high-voltage DC network; ; et al in Automatica (2012), 48(12), 31283134 We consider frequency control in power systems made of several non-synchronous AC areas connected by a multi-terminal high-voltage direct current (HVDC) grid. We propose two HVDC control schemes to make ... [more ▼] We consider frequency control in power systems made of several non-synchronous AC areas connected by a multi-terminal high-voltage direct current (HVDC) grid. We propose two HVDC control schemes to make the areas collectively react to power imbalances, so that individual areas can schedule smaller power reserves. The first scheme modifies the power injected by each area into the DC grid as a function of frequency deviations of neighboring AC areas. The second scheme changes the DC voltage of each converter as a function of its own area's frequency only, relying on the physical network to obtain a collective reaction. For both schemes, we prove convergence of the closed-loop system with heterogeneous AC areas. [less ▲] Detailed reference viewed: 45 (3 ULg) A computationally efficient algorithm for the provision of a day-ahead modulation service by a load aggregatorMathieu, Sébastien ; Karangelos, Efthymios ; Louveaux, Quentin et alPoster (2012, October 08) We study a decision making problem faced by an aggregator willing to offer a load modulation service to a Transmission System Operator (TSO). In particular, we concentrate on a day-ahead service ... [more ▼] We study a decision making problem faced by an aggregator willing to offer a load modulation service to a Transmission System Operator (TSO). In particular, we concentrate on a day-ahead service consisting of a load modulation option, which can be called by the TSO once per day. The option specifies the maximum amplitude of a potential modification on the demand of the loads within a certain time interval. We consider the specific case where the loads can be modeled by a generic tank model whose inflow depends on the power consumed by the load and outflow is assumed to be known the day before for every market period. The level of the reservoir at the beginning of the market day is also assumed to be known. We show that, under these assumptions, the problem of maximizing the amplitude of the load modulation service can be formulated as a mixed integer linear programming problem (MILP). In order to solve this problem in a computationally efficient manner we introduce a novel heuristic-method. We test this method on a set of problems and demonstrate that our approach is orders of magnitude faster than CPLEX - a state-of-the-art software for solving MILP problems - without considerably compromising the solution accuracy. [less ▲] Detailed reference viewed: 35 (7 ULg) Policy search in a space of simple closed-form formulas: towards interpretability of reinforcement learningMaes, Francis ; Fonteneau, Raphaël ; Wehenkel, Louis et alin Lecture Notes in Computer Science (2012), 7569 In this paper, we address the problem of computing interpretable solutions to reinforcement learning (RL) problems. To this end, we propose a search algorithm over a space of simple losed-form formulas ... [more ▼] In this paper, we address the problem of computing interpretable solutions to reinforcement learning (RL) problems. To this end, we propose a search algorithm over a space of simple losed-form formulas that are used to rank actions. We formalize the search for a high-performance policy as a multi-armed bandit problem where each arm corresponds to a candidate policy canonically represented by its shortest formula-based representation. Experiments, conducted on standard benchmarks, show that this approach manages to determine both efficient and interpretable solutions. [less ▲] Detailed reference viewed: 20 (6 ULg) Généralisation min max pour l'apprentissage par renforcement batch et déterministe : schémas de relaxationFonteneau, Raphaël ; Ernst, Damien ; Boigelot, Bernard et alin Septièmes Journées Francophones de Planification, Décision et Apprentissage pour la conduite de systèmes (JFPDA 2012) (2012, May) On s’intéresse au problème de généralisation min max dans le cadre de l’apprentissage par renforcement batch et déterministe. Le problème a été originellement introduit par Fonteneau et al. (2011). Dans ... [more ▼] On s’intéresse au problème de généralisation min max dans le cadre de l’apprentissage par renforcement batch et déterministe. Le problème a été originellement introduit par Fonteneau et al. (2011). Dans un premier temps, on montre que le problème est NP-dur. Dans le cas où l’horizon d’optimisation vaut 2, on développe deux schémas de relaxation. Le premier schéma fonctionne en éliminant des contraintes de telle sorte qu’on obtienne un problème soluble en temps polynomial. Le deuxième schéma est une relaxation Lagrangienne conduisant à un problème conique-quadratique. On montre théoriquement et empiriquement que ces deux schémas permettent d’obtenir de meilleurs résultats que ceux proposés par Fonteneau et al. (2011). [less ▲] Detailed reference viewed: 45 (7 ULg) Coordinated primary frequency control among nonsynchronous systems connected by a multi-terminal HVDC grid; ; et al in IET Generation, Transmission & Distribution (2012), 6(2), 99-108 The authors consider a power system composed of several non-synchronous AC areas connected by a multiterminal high-voltage direct current (HVDC) grid. In this context, the authors propose a distributed ... [more ▼] The authors consider a power system composed of several non-synchronous AC areas connected by a multiterminal high-voltage direct current (HVDC) grid. In this context, the authors propose a distributed control scheme that modifies the power injections from the different AC areas into the DC grid so as to make the system collectively react to load imbalances. This collective reaction allows each individual AC area to downscale its primary reserves. The scheme is inspired by algorithms for the consensus problem extensively studied by the control theory community. It modifies the power injections based on frequency deviations of the AC areas so as to make them stay close to each other. A stability analysis of the closed-loop system is reported as well as simulation results on a benchmark power system with five AC areas. These results show that with proper tuning, the control scheme makes the frequency deviations converge rapidly to a common value following a load imbalance in an area. [less ▲] Detailed reference viewed: 41 (3 ULg) Learning to play K-armed bandit problemsMaes, Francis ; Wehenkel, Louis ; Ernst, Damien ![]() in Proceedings of the 4th International Conference on Agents and Artificial Intelligence (ICAART 2012) (2012, February) We propose a learning approach to pre-compute K-armed bandit playing policies by exploiting prior information describing the class of problems targeted by the player. Our algorithm first samples a set of K ... [more ▼] We propose a learning approach to pre-compute K-armed bandit playing policies by exploiting prior information describing the class of problems targeted by the player. Our algorithm first samples a set of K-armed bandit problems from the given prior, and then chooses in a space of candidate policies one that gives the best average performances over these problems. The candidate policies use an index for ranking the arms and pick at each play the arm with the highest index; the index for each arm is computed in the form of a linear combination of features describing the history of plays (e.g., number of draws, average reward, variance of rewards and higher order moments), and an estimation of distribution algorithm is used to determine its optimal parameters in the form of feature weights. We carry out simulations in the case where the prior assumes a fixed number of Bernoulli arms, a fixed horizon, and uniformly distributed parameters of the Bernoulli arms. These simulations show that learned strategies perform very well with respect to several other strategies previously proposed in the literature (UCB1, UCB2, UCB-V, KL-UCB and $\epsilon_n$-GREEDY); they also highlight the robustness of these strategies with respect to wrong prior information. [less ▲] Detailed reference viewed: 74 (16 ULg) Learning exploration/exploitation strategies for single trajectory reinforcement learningCastronovo, Michaël ; Maes, Francis ; Fonteneau, Raphaël et alin Proceedings of the 10th European Workshop on Reinforcement Learning (EWRL 2012) (2012) We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a ... [more ▼] We consider the problem of learning high-performance Exploration/Exploitation (E/E) strategies for finite Markov Decision Processes (MDPs) when the MDP to be controlled is supposed to be drawn from a known probability distribution pM( ). The performance criterion is the sum of discounted rewards collected by the E/E strategy over an in finite length trajectory. We propose an approach for solving this problem that works by considering a rich set of candidate E/E strategies and by looking for the one that gives the best average performances on MDPs drawn according to pM( ). As candidate E/E strategies, we consider index-based strategies parametrized by small formulas combining variables that include the estimated reward function, the number of times each transition has occurred and the optimal value functions V and Q of the estimated MDP (obtained through value iteration). The search for the best formula is formalized as a multi-armed bandit problem, each arm being associated with a formula. We experimentally compare the performances of the approach with R-max as well as with e-Greedy strategies and the results are promising. [less ▲] Detailed reference viewed: 55 (10 ULg) Min max generalization for two-stage deterministic batch mode reinforcement learning: relaxation schemesFonteneau, Raphaël ; Ernst, Damien ; Boigelot, Bernard et alReport (2012) Detailed reference viewed: 20 (1 ULg) Comparison of Different Selection Strategies in Monte-Carlo Tree Search for the Game of Tron; Lupien St-Pierre, David ; Maes, Francis et alin IEEE Conference on Computational and Intelligence in Games 2012 (2012) Monte-Carlo Tree Search (MCTS) techniques are essentially known for their performance on turn-based games, such as Go, for which players have considerable time for choosing their moves. In this paper, we ... [more ▼] Monte-Carlo Tree Search (MCTS) techniques are essentially known for their performance on turn-based games, such as Go, for which players have considerable time for choosing their moves. In this paper, we apply MCTS to the game of Tron, a simultaneous real-time two-player game. The fact that players have to react fast and that moves occur simultaneously creates an unusual setting for MCTS, in which classical selection policies such as UCB1 may be suboptimal. In this paper, we perform an empirical comparison of a wide range of selection policies for MCTS applied to Tron, with both deterministic policies (UCB1, UCB1-Tuned, UCB-V, UCBMinimal, OMC-Deterministic, MOSS) and stochastic policies (Epsilon-greedy, EXP3, Thompson Sampling, OMC-Stochastic, PBBM). From the experiments, we observe that UCB1-Tuned has the best behavior shortly followed by UCB1. Even if UCB-Minimal is ranked fourth, this is a remarkable result for this recently introduced selection policy found through automatic discovery of good policies on generic multi-armed bandit problems. We also show that deterministic policies perform better than stochastic ones for this problem. [less ▲] Detailed reference viewed: 20 (5 ULg) |
||