References of "Louppe, Gilles"
Understanding Random Forests: From Theory to Practice
Louppe, Gilles ULg

Doctoral thesis (2014)

Data analysis and machine learning have become an integral part of the modern scientific methodology, offering automated procedures for the prediction of a phenomenon based on past observations, unraveling underlying patterns in data and providing insights about the problem. Yet, machine learning should not be used as a black-box tool, but rather considered as a methodology, with a rational thought process that is entirely dependent on the problem under study. In particular, the use of algorithms should ideally require a reasonable understanding of their mechanisms, properties and limitations, in order to better apprehend and interpret their results. Accordingly, the goal of this thesis is to provide an in-depth analysis of random forests, consistently calling into question each and every part of the algorithm, in order to shed new light on its learning capabilities, inner workings and interpretability. The first part of this work studies the induction of decision trees and the construction of ensembles of randomized trees, motivating their design and purpose whenever possible. Our contributions follow with an original complexity analysis of random forests, showing their good computational performance and scalability, along with an in-depth discussion of their implementation details, as contributed within Scikit-Learn. In the second part of this work, we analyze and discuss the interpretability of random forests through the lens of variable importance measures. The core of our contributions rests in the theoretical characterization of the Mean Decrease of Impurity variable importance measure, from which we prove and derive some of its properties in the case of multiway totally randomized trees and in asymptotic conditions. As a consequence, our analysis demonstrates that variable importances as computed from non-totally randomized trees (e.g., standard Random Forest) suffer from a combination of defects, due to masking effects, misestimation of node impurity, or the binary structure of decision trees. Finally, the last part of this dissertation addresses limitations of random forests in the context of large datasets. Through extensive experiments, we show that subsampling both samples and features simultaneously provides performance on par with that of standard forests, while lowering the memory requirements. Overall, this paradigm highlights an intriguing practical fact: there is often no need to build single models over immensely large datasets. Good performance can often be achieved by building models on (very) small random parts of the data and then combining them all in an ensemble, thereby avoiding all practical burdens of making large data fit into memory.
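The interpretability analysis above centers on the Mean Decrease of Impurity measure, which Scikit-Learn exposes as `feature_importances_`. A minimal sketch of how irrelevant variables receive near-zero MDI importance (the XOR dataset and parameter values are illustrative, and `ExtraTreesClassifier` with `max_features=1` only approximates the totally randomized trees studied in the theory):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
# Three binary inputs; the output depends only on the first two.
X = rng.randint(0, 2, size=(2000, 3))
y = X[:, 0] ^ X[:, 1]  # XOR of the two relevant variables

# max_features=1 approximates totally randomized trees: at each node,
# the split variable is drawn at random, independently of the output.
forest = ExtraTreesClassifier(n_estimators=200, max_features=1,
                              random_state=0).fit(X, y)

importances = forest.feature_importances_  # Mean Decrease of Impurity
```

The two relevant variables should share most of the (normalized) importance, while the third, irrelevant, variable should score near zero.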

Peer Reviewed
Simple connectome inference from partial correlation statistics in calcium imaging
Sutera, Antonio ULg; Joly, Arnaud ULg; François-Lavet, Vincent ULg et al

E-print/Working paper (2014)

In this work, we propose a simple yet effective solution to the problem of connectome inference in calcium imaging data. The proposed algorithm consists of two steps. First, the raw signals are processed to detect neural peak activities. Second, the degree of association between neurons is inferred from partial correlation statistics. This paper summarises the methodology that led us to win the Connectomics Challenge, proposes a simplified version of our method, and finally compares our results with those of other inference methods.
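The second step relies on the standard identity linking partial correlations to the inverse covariance (precision) matrix. A self-contained sketch on synthetic signals (the peak-detection step and the exact preprocessing of the paper are omitted; the toy data is illustrative only):

```python
import numpy as np

rng = np.random.RandomState(0)
# Synthetic "activity" signals for 5 neurons over 1000 time steps.
signals = rng.randn(1000, 5)
signals[:, 1] += 0.8 * signals[:, 0]  # neuron 1 driven by neuron 0

# Partial correlation between neurons i and j, controlling for all
# other neurons, from the precision matrix P = inv(cov):
#   pc_ij = -P_ij / sqrt(P_ii * P_jj)
P = np.linalg.inv(np.cov(signals, rowvar=False))
d = np.sqrt(np.diag(P))
partial_corr = -P / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
```

The coupled pair (0, 1) stands out with a large partial correlation, while unrelated pairs stay near zero.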

Peer Reviewed
A hybrid human-computer approach for large-scale image-based measurements using web services and machine learning
Marée, Raphaël ULg; Rollus, Loïc ULg; Stevens, Benjamin ULg et al

in Proceedings IEEE International Symposium on Biomedical Imaging (2014, May)

We present a novel methodology combining web-based software development practices, machine learning, and spatial databases for computer-aided quantification of regions of interest (ROIs) in large-scale imaging data. We describe our main methodological choices, and then illustrate the benefits of the approach (workload reduction, improved precision, scalability, and traceability) on hundreds of whole-slide images of biological tissue slices in cancer research.

Peer Reviewed
Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies
Botta, Vincent ULg; Louppe, Gilles ULg; Geurts, Pierre ULg et al

in PLoS ONE (2014)

The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlations due to linkage disequilibrium nor for combinations of several markers. To discover and leverage such potential multivariate interactions, we propose in this work an extension of the Random Forest algorithm tailored for structured GWAS data. In terms of risk prediction, we show empirically on several GWAS datasets that the proposed T-Trees method significantly outperforms both the original Random Forest algorithm and standard linear models, thereby suggesting the actual existence of multivariate non-linear effects due to the combinations of several SNPs. We also demonstrate that variable importances as derived from our method can help identify relevant loci. Finally, we highlight the strong impact that quality control procedures may have, both in terms of predictive power and loci identification.
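The T-Trees extension itself is not part of standard libraries, but the Random Forest baseline it is compared against, and the use of variable importances to flag relevant loci, can be sketched with Scikit-Learn on a synthetic genotype matrix (all data, the interacting SNP pair, and parameter values below are illustrative, not taken from the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
# 500 individuals x 100 SNPs, coded as 0/1/2 minor-allele counts.
X = rng.randint(0, 3, size=(500, 100))
# Trait status driven by a combination of SNPs 10 and 11, mimicking
# a pair of interacting markers within a small genomic region.
y = ((X[:, 10] + X[:, 11]) >= 3).astype(int)

forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X, y)

# Rank loci by importance; the causal pair should surface on top.
ranking = np.argsort(forest.feature_importances_)[::-1]
```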

Gradient Boosted Regression Trees in Scikit-Learn
Prettenhofer, Peter; Louppe, Gilles ULg

Conference (2014, February 23)

This talk describes Gradient Boosted Regression Trees (GBRT), a powerful statistical learning technique with applications in a variety of areas, ranging from web page ranking to environmental niche modeling. GBRT is a key ingredient of many winning solutions in data-mining competitions such as the Netflix Prize, the GE Flight Quest, or the Heritage Health Prize. We give a brief introduction to the GBRT model and regression trees -- focusing on intuition rather than mathematical formulas. The majority of the talk is dedicated to an in-depth discussion of how to apply GBRT in practice using scikit-learn. We cover important topics such as regularization, model tuning and model interpretation that should significantly improve your score on Kaggle.
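The regularization and tuning knobs discussed in the talk map onto scikit-learn's `GradientBoostingRegressor`. A minimal sketch (the toy dataset and parameter values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Shrinkage (learning_rate), row subsampling and shallow trees are
# the main regularization levers of GBRT.
gbrt = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, subsample=0.8,
                                 random_state=0).fit(X_train, y_train)

# staged_predict lets one monitor test error as trees are added,
# which is the usual way to pick n_estimators in practice.
test_errors = [np.mean((y_test - pred) ** 2)
               for pred in gbrt.staged_predict(X_test)]
```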

Forecasting Daily Solar Energy Production Using Robust Regression Techniques
Louppe, Gilles ULg; Prettenhofer, Peter

Conference (2014, February 05)

We describe a novel approach to forecast daily solar energy production based on the output of a numerical weather prediction (NWP) model using non-parametric robust regression techniques. Our approach comprises two steps: First, we use a non-linear interpolation technique, Gaussian Process regression (also known as Kriging in Geostatistics), to interpolate the coarse NWP grid to the location of the solar energy production facilities. Second, we use Gradient Boosted Regression Trees, a non-parametric regression technique, to predict the daily solar energy output based on the interpolated NWP model and additional spatio-temporal features. Experimental evidence suggests that two aspects of our approach are crucial for its effectiveness: a) the ability of Gaussian Process regression to incorporate both input and output uncertainty, which we leverage by deriving input uncertainty from an ensemble of 11 NWP models and including confidence intervals alongside the interpolated point estimates, and b) the ability of Gradient Boosted Regression Trees to handle outliers in the outputs by using robust loss functions - a property that is very important due to the volatile nature of solar energy output. We evaluated the approach on a dataset of daily solar energy measurements from 98 stations in Oklahoma. The results show a relative improvement of 17.17% and 46.19% over the baselines, Spline Interpolation and Gaussian Mixture Models, respectively.
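The two-step pipeline can be sketched with scikit-learn components. The toy "weather grid" and all numbers below are stand-ins; the paper's handling of the 11-model NWP ensemble is only hinted at via the Gaussian Process's predictive standard deviation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
# Step 1 (Kriging): interpolate a coarse "weather grid" to the
# station locations, carrying uncertainty along.
grid = np.linspace(0, 10, 15)[:, None]
weather = np.sin(grid).ravel() + rng.normal(scale=0.1, size=15)
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(grid, weather)
stations = rng.uniform(0, 10, size=(200, 1))
mean, std = gp.predict(stations, return_std=True)

# Step 2 (robust GBRT): the Huber loss limits the influence of
# outliers in the volatile energy output.
features = np.column_stack([stations.ravel(), mean, std])
energy = 2.0 * mean + rng.normal(scale=0.2, size=200)
energy[:5] += 10.0  # a few outlier days
model = GradientBoostingRegressor(loss="huber", random_state=0)
model.fit(features, energy)
```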

Scikit-Learn: Machine Learning in the Python ecosystem
Joly, Arnaud ULg; Louppe, Gilles ULg

Poster (2014, January 27)

The scikit-learn project is an increasingly popular machine learning library written in Python. It is designed to be simple and efficient, useful to both experts and non-experts, and reusable in a variety of contexts. The primary aim of the project is to provide a compendium of efficient implementations of classic, well-established machine learning algorithms. Among other things, it includes classical supervised and unsupervised learning algorithms, tools for model evaluation and selection, as well as tools for data preprocessing and feature engineering. This presentation will illustrate the use of scikit-learn as a component of the larger scientific Python environment to solve complex data analysis tasks. Examples will include end-to-end workflows based on powerful and popular algorithms in the library. Among others, we will show how to use out-of-core learning with on-the-fly feature extraction to tackle very large natural language processing tasks, how to exploit an IPython cluster for distributed cross-validation, or how to build and use random forests to explore biological data.
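The out-of-core workflow mentioned in the abstract, on-the-fly feature extraction feeding an incrementally trained linear model, can be sketched as follows (the tiny in-memory batch list is a stand-in for a text stream too large to fit in memory):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# A stateless hashing vectorizer: no vocabulary has to be kept in
# memory, so features can be computed batch by batch, on the fly.
vectorizer = HashingVectorizer(n_features=2 ** 16)
classifier = SGDClassifier(random_state=0)

# Stand-in for mini-batches streamed from disk or over the network.
batches = [
    (["good movie", "great film"], [1, 1]),
    (["terrible plot", "bad acting"], [0, 0]),
] * 20

for texts, labels in batches:
    X = vectorizer.transform(texts)          # features on the fly
    classifier.partial_fit(X, labels, classes=[0, 1])  # incremental fit
```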

Peer Reviewed
Scikit-Learn: Machine Learning in the Python ecosystem
Louppe, Gilles ULg; Varoquaux, Gaël

Conference (2013, December 10)

The scikit-learn project is an increasingly popular machine learning library written in Python. It is designed to be simple and efficient, useful to both experts and non-experts, and reusable in a variety of contexts. The primary aim of the project is to provide a compendium of efficient implementations of classic, well-established machine learning algorithms. Among other things, it includes classical supervised and unsupervised learning algorithms, tools for model evaluation and selection, as well as tools for data preprocessing and feature engineering. This presentation will illustrate the use of scikit-learn as a component of the larger scientific Python environment to solve complex data analysis tasks. Examples will include end-to-end workflows based on powerful and popular algorithms in the library. Among others, we will show how to use out-of-core learning with on-the-fly feature extraction to tackle very large natural language processing tasks, how to exploit an IPython cluster for distributed cross-validation, or how to build and use random forests to explore biological data.

Peer Reviewed
Understanding variable importances in forests of randomized trees
Louppe, Gilles ULg; Wehenkel, Louis ULg; Sutera, Antonio ULg et al

in Advances in Neural Information Processing Systems 26 (2013, December)

Despite growing interest and practical use in various scientific areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. We derive a three-level decomposition of the information jointly provided by all input variables about the output in terms of i) the MDI importance of each input variable, ii) the degree of interaction of a given input variable with the other input variables, and iii) the different interaction terms of a given degree. We then show that the MDI importance of a variable is equal to zero if and only if the variable is irrelevant, and that the MDI importance of a relevant variable is invariant with respect to the removal or the addition of irrelevant variables. We illustrate these properties on a simple example and discuss how they may change in the case of non-totally randomized trees such as Random Forests and Extra-Trees.
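The invariance property stated above, that the MDI importance of a relevant variable is unaffected by adding irrelevant ones, can be checked empirically. A hedged sketch (the AND target is illustrative; `max_features=1` in `ExtraTreesClassifier` only approximates totally randomized trees, so the theoretical equalities hold only approximately at finite sample size):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
n = 3000
X_rel = rng.randint(0, 2, size=(n, 2))
y = X_rel[:, 0] & X_rel[:, 1]  # output depends on both variables

def mdi(X, y):
    # max_features=1: the split variable at each node is drawn at
    # random, the closest stock approximation of totally randomized trees.
    forest = ExtraTreesClassifier(n_estimators=300, max_features=1,
                                  random_state=0).fit(X, y)
    return forest.feature_importances_

imp_before = mdi(X_rel, y)
# Append three irrelevant variables and recompute the importances.
X_aug = np.hstack([X_rel, rng.randint(0, 2, size=(n, 3))])
imp_after = mdi(X_aug, y)
```

The importances of the two relevant variables should change little, while the added irrelevant variables should score near zero.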

Peer Reviewed
API design for machine learning software: experiences from the scikit-learn project
Buitinck, Lars; Louppe, Gilles ULg; Blondel, Mathieu et al

Conference (2013, September 23)

scikit-learn is an increasingly popular machine learning library. Written in Python, it is designed to be simple and efficient, accessible to non-experts, and reusable in various contexts. In this paper, we present and discuss our design choices for the application programming interface (API) of the project. In particular, we describe the simple and elegant interface shared by all learning and processing units in the library and then discuss its advantages in terms of composition and reusability. The paper also comments on implementation details specific to the Python ecosystem and analyzes obstacles faced by users and developers of the library.
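The shared interface described in the paper is the `fit`/`transform`/`predict` contract, and it is precisely what makes composition possible: a `Pipeline` of estimators honours the same contract as any single estimator. A minimal illustration (the dataset and steps are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

# Every step is an estimator exposing fit(); intermediate steps also
# expose transform(), the last one predict(). The composed pipeline
# is itself an estimator, so it can be cross-validated or grid-searched
# exactly like its parts.
model = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])
model.fit(X, y)
accuracy = model.score(X, y)
```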

Peer Reviewed
Ensembles on Random Patches
Louppe, Gilles ULg; Geurts, Pierre ULg

in Machine Learning and Knowledge Discovery in Databases (2012)

In this paper, we consider supervised learning under the assumption that the available memory is small compared to the dataset size. This general framework is relevant in the context of big data, distributed databases and embedded systems. We investigate a very simple, yet effective, ensemble framework that builds each individual model of the ensemble from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset. We carry out an extensive and systematic evaluation of this method on 29 datasets, using decision tree-based estimators. With respect to popular ensemble methods, these experiments show that the proposed method provides performance on par in terms of accuracy while simultaneously lowering the memory needs, and attains significantly better performance when memory is severely constrained.
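The random patches scheme later became available in scikit-learn's `BaggingClassifier`, where drawing random subsets of both instances and features corresponds to setting `max_samples` and `max_features` below 1.0. A sketch (the dataset and the 10%/50% fractions are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each model in the ensemble sees only a "patch" of the data: 10% of
# the instances and 50% of the features, so no single model ever needs
# the full dataset in memory (the default base estimator is a decision
# tree, as in the paper's experiments).
patches = BaggingClassifier(n_estimators=100, max_samples=0.1,
                            max_features=0.5, random_state=0).fit(X, y)
accuracy = patches.score(X, y)
```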

Peer Reviewed
Learning to rank with extremely randomized trees
Geurts, Pierre ULg; Louppe, Gilles ULg

in JMLR: Workshop and Conference Proceedings (2011, January), 14

In this paper, we report on our experiments on the Yahoo! Labs Learning to Rank challenge organized in the context of the 27th International Conference on Machine Learning (ICML 2010). We competed in both the learning to rank and the transfer learning tracks of the challenge with several tree-based ensemble methods, including Tree Bagging, Random Forests, and Extremely Randomized Trees. Our methods ranked 10th in the first track and 4th in the second track. Although not at the very top of the ranking, our results show that ensembles of randomized trees are quite competitive for the “learning to rank” problem. The paper also analyzes computing times of our algorithms and presents some post-challenge experiments with transfer learning methods.

Peer Reviewed
A zealous parallel gradient descent algorithm
Louppe, Gilles ULg; Geurts, Pierre ULg

Poster (2010, December 11)

Parallel and distributed algorithms have become a necessity in modern machine learning tasks. In this work, we focus on parallel asynchronous gradient descent and propose a zealous variant that minimizes the idle time of processors to achieve a substantial speedup. We then experimentally study this algorithm in the context of training a restricted Boltzmann machine on a large collaborative filtering task.
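The asynchronous idea can be illustrated with a minimal lock-free sketch on a least-squares problem, in which workers read and update shared parameters without waiting for one another. This is illustrative only: the zealous scheduling of the poster is not reproduced, and Python threads only approximate truly parallel workers:

```python
import threading
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
true_w = np.arange(1.0, 6.0)
y = X @ true_w  # noiseless linear targets

w = np.zeros(5)  # shared parameters, updated without locking

def worker(seed, steps=2000, lr=0.01):
    local = np.random.RandomState(seed)
    for _ in range(steps):
        i = local.randint(len(X))          # pick a random example
        grad = (X[i] @ w - y[i]) * X[i]    # stochastic gradient
        w[:] = w - lr * grad               # unsynchronized update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Occasional lost updates from racing workers merely slow convergence on this consistent problem; the shared vector still approaches the true weights.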

Collaborative filtering: Scalable approaches using restricted Boltzmann machines
Louppe, Gilles ULg

Master's dissertation (2010)

Parallel to the growth of electronic commerce, recommender systems have become a very active area of research, both in industry and in the academic world. The goal of these systems is to make automatic but personal recommendations when customers are overwhelmed with thousands of possibilities and do not know what to look for. In that context, the object of this work is threefold. The first part consists in a survey of recommendation algorithms and emphasizes a class of algorithms known as collaborative filtering algorithms. The second part consists in studying in more depth a specific model of neural networks known as restricted Boltzmann machines. That model is then extensively examined experimentally on a recommendation problem. The third part of this work focuses on how restricted Boltzmann machines can be made more scalable. Three different and original approaches are proposed and studied. In the first approach, we revisit the learning and test algorithms of restricted Boltzmann machines in the context of shared-memory architectures. In the second approach, we propose to reformulate these algorithms as MapReduce tasks. Finally, in the third approach, ensembles of RBMs are investigated. The best and most promising results are obtained with the MapReduce approach.
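A restricted Boltzmann machine of the kind studied here is available in scikit-learn as `BernoulliRBM`. A toy sketch on binary "user x item" preference vectors (the collaborative-filtering model of the thesis, with its per-item softmax visible units, is more elaborate; the data and parameters below are illustrative):

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.RandomState(0)
# Binary "liked / not liked" matrix: 200 users x 20 items, generated
# from two underlying taste profiles plus a little noise.
profiles = rng.randint(0, 2, size=(2, 20))
users = profiles[rng.randint(0, 2, size=200)]
X = np.clip(users + rng.binomial(1, 0.05, users.shape), 0, 1)

# Hidden units act as latent taste factors; training uses
# contrastive-divergence-style updates.
rbm = BernoulliRBM(n_components=8, learning_rate=0.05,
                   n_iter=20, random_state=0)
hidden = rbm.fit_transform(X)  # hidden-unit activation probabilities
```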
