References of "Geurts, Pierre"
Peer Reviewed
Exploiting SNP Correlations within Random Forest for Genome-Wide Association Studies
Botta, Vincent ULg; Louppe, Gilles ULg; Geurts, Pierre ULg et al

in PLoS ONE (2014)

The primary goal of genome-wide association studies (GWAS) is to discover variants that could lead, in isolation or in combination, to a particular trait or disease. Standard approaches to GWAS, however, are usually based on univariate hypothesis tests and therefore can account neither for correlations due to linkage disequilibrium nor for combinations of several markers. To discover and leverage such potential multivariate interactions, we propose in this work an extension of the Random Forest algorithm tailored for structured GWAS data. In terms of risk prediction, we show empirically on several GWAS datasets that the proposed T-Trees method significantly outperforms both the original Random Forest algorithm and standard linear models, thereby suggesting the actual existence of multivariate non-linear effects due to the combinations of several SNPs. We also demonstrate that variable importances as derived from our method can help identify relevant loci. Finally, we highlight the strong impact that quality control procedures may have, both in terms of predictive power and loci identification.

Peer Reviewed
Data normalization and supervised learning to assess the condition of patients with multiple sclerosis based on gait analysis
Azrour, Samir ULg; Pierard, Sébastien ULg; Geurts, Pierre ULg et al

Poster (2014, April)

Gait impairment is considered an important feature of disability in multiple sclerosis, but its evaluation in the clinical routine remains limited. In this paper, we assess, by means of supervised learning, the condition of patients with multiple sclerosis based on their gait descriptors obtained with a gait analysis system. As the morphological characteristics of individuals influence their gait while being, in first approximation, independent of the disease level, an original strategy of data normalization with respect to these characteristics is described and applied beforehand in order to obtain more reliable predictions. In addition, we explain how we address the problem of missing data, which is a common issue in the field of clinical evaluation. Results show that, based on machine learning combined with the proposed data handling techniques, we can predict a score highly correlated with the condition of patients.

Peer Reviewed
Identification of a microRNA landscape targeting the PI3K/Akt signaling pathway in inflammation-induced colorectal carcinogenesis
JOSSE, Claire ULg; Bouznad, Nassim ULg; Geurts, Pierre ULg et al

in American Journal of Physiology - Gastrointestinal and Liver Physiology (2014), 306

Inflammation can contribute to tumor formation; however, markers that predict progression are still lacking. In the present study, the well-established azoxymethane (AOM)/dextran sulfate sodium (DSS)-induced mouse model of colitis-associated cancer was used to analyze microRNA (miRNA) modulation accompanying inflammation-induced tumor development and to determine whether inflammation-triggered miRNA alterations affect the expression of genes or pathways involved in cancer. A miRNA microarray experiment was performed to establish miRNA expression profiles in mouse colon at early and late time points during inflammation and/or tumor growth. Chronic inflammation and carcinogenesis were associated with distinct changes in miRNA expression. Nevertheless, prediction algorithms of miRNA-mRNA interactions and computational analyses based on ranked miRNA lists consistently identified putative target genes that play essential roles in tumor growth or that belong to key carcinogenesis-related signaling pathways. We identified PI3K/Akt and the insulin growth factor-1 (IGF-1) as major pathways being affected in the AOM/DSS model. DSS-induced chronic inflammation downregulates miR-133a and miR-143/145, which is reportedly associated with human colorectal cancer and PI3K/Akt activation. Accordingly, conditioned medium from inflammatory cells decreases the expression of these miRNA in colorectal adenocarcinoma Caco-2 cells. Overexpression of miR-223, one of the main miRNA showing strong upregulation during AOM/DSS tumor growth, inhibited Akt phosphorylation and IGF-1R expression in these cells. Cell sorting from mouse colons delineated distinct miRNA expression patterns in epithelial and myeloid cells during the periods preceding and spanning tumor growth. Hence, cell-type-specific miRNA dysregulation and subsequent PI3K/Akt activation may be involved in the transition from intestinal inflammation to cancer.

Peer Reviewed
On protocols and measures for the validation of supervised methods for the inference of biological networks
Schrynemackers, Marie ULg; Kuffner, Robert; Geurts, Pierre ULg

in Frontiers in genetics (2013), 4(262)

Networks provide a natural representation of molecular biology knowledge, in particular to model relationships between biological entities such as genes, proteins, drugs, or diseases. Because of the effort, the cost, or the lack of the experiments necessary for the elucidation of these networks, computational approaches for network inference have been frequently investigated in the literature. In this paper, we examine the assessment of supervised network inference. Supervised inference is based on machine learning techniques that infer the network from a training sample of known interacting and possibly non-interacting entities and additional measurement data. While these methods are very effective, their reliable validation in silico poses a challenge, since both prediction and validation need to be performed on the basis of the same partially known network. Cross-validation techniques need to be specifically adapted to classification problems on pairs of objects. We perform a critical review and assessment of protocols and measures proposed in the literature and derive specific guidelines on how to best exploit and evaluate machine learning techniques for network inference. Through theoretical considerations and in silico experiments, we analyze in depth how important factors influence the outcome of performance estimation. These factors include the amount of information available for the interacting entities, the sparsity and topology of biological networks, and the lack of experimentally verified non-interacting pairs.

Peer Reviewed
Understanding variable importances in forests of randomized trees
Louppe, Gilles ULg; Wehenkel, Louis ULg; Sutera, Antonio ULg et al

in Advances in Neural Information Processing Systems 26 (2013, December)

Despite growing interest and practical use in various scientific areas, variable importances derived from tree-based ensemble methods are not well understood from a theoretical point of view. In this work we characterize the Mean Decrease Impurity (MDI) variable importances as measured by an ensemble of totally randomized trees in asymptotic sample and ensemble size conditions. We derive a three-level decomposition of the information jointly provided by all input variables about the output in terms of i) the MDI importance of each input variable, ii) the degree of interaction of a given input variable with the other input variables, iii) the different interaction terms of a given degree. We then show that this MDI importance of a variable is equal to zero if and only if the variable is irrelevant and that the MDI importance of a relevant variable is invariant with respect to the removal or the addition of irrelevant variables. We illustrate these properties on a simple example and discuss how they may change in the case of non-totally randomized trees such as Random Forests and Extra-Trees.

Peer Reviewed
DMFSGD: A Decentralized Matrix Factorization Algorithm for Network Distance Prediction
Liao, Yongjun ULg; Du, Wei; Geurts, Pierre ULg et al

in IEEE/ACM Transactions on Networking (2013), 21(5), 1511-1524

The knowledge of end-to-end network distances is essential to many Internet applications. As active probing of all pairwise distances is infeasible in large-scale networks, a natural idea is to measure a few pairs and to predict the other ones without actually measuring them. This paper formulates the prediction problem as matrix completion where the unknown entries in a pairwise distance matrix constructed from a network are to be predicted. By assuming that the distance matrix has low-rank characteristics, the problem is solvable by low-rank approximation based on matrix factorization. The new formulation circumvents the well-known drawbacks of existing approaches based on Euclidean embedding. A new algorithm, so-called Decentralized Matrix Factorization by Stochastic Gradient Descent (DMFSGD), is proposed. By letting network nodes exchange messages with each other, the algorithm is fully decentralized and only requires each node to collect and to process local measurements, with neither explicit matrix constructions nor special nodes such as landmarks and central servers. In addition, we comprehensively compared matrix factorization and Euclidean embedding to demonstrate the suitability of the former on network distance prediction. We further studied the incorporation of a robust loss function and of non-negativity constraints. Extensive experiments on various publicly-available datasets of network delays show not only the scalability and the accuracy of our approach, but also its usability in real Internet applications.
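
As a rough illustration of the matrix-completion idea in this abstract, the sketch below fits a low-rank factorization to a partially observed distance matrix with plain stochastic gradient descent. It is a centralized toy version, not the decentralized DMFSGD protocol itself: the function names, the squared loss, the L2 regularization, and all hyperparameter values are assumptions made for the example.

```python
import random


def factorize_sgd(observed, n_nodes, rank=2, lr=0.05, reg=0.01, epochs=200, seed=0):
    """Fit D[i][j] ~ U[i] . V[j] from observed entries {(i, j): distance}.

    Centralized sketch with squared loss and L2 regularization (assumed here);
    the decentralized message exchange described in the abstract is not modeled.
    """
    rng = random.Random(seed)
    U = [[rng.uniform(0.0, 1.0) for _ in range(rank)] for _ in range(n_nodes)]
    V = [[rng.uniform(0.0, 1.0) for _ in range(rank)] for _ in range(n_nodes)]
    pairs = list(observed.items())
    for _ in range(epochs):
        rng.shuffle(pairs)  # visit the measured pairs in random order
        for (i, j), d in pairs:
            err = d - sum(U[i][k] * V[j][k] for k in range(rank))
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                # Gradient step on the squared error of this single measurement.
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)
    return U, V


def predict(U, V, i, j):
    """Predicted distance from node i to node j."""
    return sum(a * b for a, b in zip(U[i], V[j]))
```

Calling `predict(U, V, i, j)` on a pair that was never measured gives the completed entry; how close it is to the true distance depends on how well the low-rank assumption holds for the data at hand.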

Peer Reviewed
Automated Processing of Zebrafish Imaging Data: A Survey
Mikut, Ralf; Dickmeis, Thomas; Driever, Wolfgang et al

in Zebrafish (2013), 10(3), 401-421

Due to the relative transparency of its embryos and larvae, the zebrafish is an ideal model organism for bioimaging approaches in vertebrates. Novel microscope technologies allow the imaging of developmental processes in unprecedented detail, and they enable the use of complex image-based read-outs for high-throughput/high-content screening. Such applications can easily generate Terabytes of image data, the handling and analysis of which becomes a major bottleneck in extracting the targeted information. Here, we describe the current state of the art in computational image analysis in the zebrafish system. We discuss the challenges encountered when handling high-content image data, especially with regard to data quality, annotation, and storage. We survey methods for preprocessing image data for further analysis, and describe selected examples of automated image analysis, including the tracking of cells during embryogenesis, heartbeat detection, identification of dead embryos, recognition of tissues and anatomical landmarks, and quantification of behavioral patterns of adult fish. We review recent examples for applications using such methods, such as the comprehensive analysis of cell lineages during early development, the generation of a three-dimensional brain atlas of zebrafish larvae, and high-throughput drug screens based on movement patterns. Finally, we identify future challenges for the zebrafish image analysis community, notably those concerning the compatibility of algorithms and data formats for the assembly of modular analysis pipelines.

Peer Reviewed
Extremely Randomized Trees and Random Subwindows for Image Classification, Annotation, and Retrieval
Marée, Raphaël ULg; Wehenkel, Louis ULg; Geurts, Pierre ULg

in Criminisi, A; Shotton, J (Eds.) Decision Forests in Computer Vision and Medical Image Analysis, Advances in Computer Vision and Pattern Recognition (2013)

We present a unified framework involving the extraction of random subwindows within images and the induction of ensembles of extremely randomized trees. We discuss the specialization of this framework for solving several general problems in computer vision, ranging from image classification and segmentation to content-based image retrieval and interest point detection. The methods are illustrated on various applications and datasets from the biomedical domain.

Peer Reviewed
Gene regulatory network inference from systems genetics data using tree-based methods
Huynh-Thu, Vân Anh ULg; Wehenkel, Louis ULg; Geurts, Pierre ULg

in de la Fuente, Alberto (Ed.) Gene Network Inference - Verification of Methods for Systems Genetics Data (2013)

One of the pressing open problems of computational systems biology is the elucidation of the topology of gene regulatory networks (GRNs). In an attempt to solve this problem, the idea of systems genetics is to exploit the natural variations that exist between the DNA sequences of related individuals and that can represent the randomized and multifactorial perturbations necessary to recover GRNs. In this chapter, we present new methods, called GENIE3-SG-joint and GENIE3-SG-sep, for the inference of GRNs from systems genetics data. Experiments on the artificial data of the StatSeq benchmark and of the DREAM5 Systems Genetics challenge show that exploiting jointly expression and genetic data is very helpful for recovering GRNs, and one of our methods outperforms by a large extent the official best performing method of the DREAM5 challenge.

Ordinal Rating of Network Performance and Inference by Matrix Completion
Du, Wei; Liao, Yongjun ULg; Geurts, Pierre ULg et al

Report (2012)

This paper addresses the large-scale acquisition of end-to-end network performance. We made two distinct contributions: ordinal rating of network performance and inference by matrix completion. The former reduces measurement costs and unifies various metrics which eases their processing in applications. The latter enables scalable and accurate inference with no requirement of structural information of the network nor geometric constraints. By combining both, the acquisition problem bears strong similarities to recommender systems. This paper investigates the applicability of various matrix factorization models used in recommender systems. We found that the simple regularized matrix factorization is not only practical but also produces accurate results that are beneficial for peer selection.

Peer Reviewed
Embedding Monte Carlo search of features in tree-based ensemble methods
Maes, Francis ULg; Geurts, Pierre ULg; Wehenkel, Louis ULg

in Flach, Peter; De Bie, Tijl; Cristianini, Nello (Eds.) Machine Learning and Knowledge Discovery in Data Bases (2012, September)

Feature generation is the problem of automatically constructing good features for a given target learning problem. While most feature generation algorithms belong either to the filter or to the wrapper approach, this paper focuses on embedded feature generation. We propose a general scheme to embed feature generation in a wide range of tree-based learning algorithms, including single decision trees, random forests and tree boosting. It is based on the formalization of feature construction as a sequential decision making problem addressed by a tractable Monte Carlo search algorithm coupled with node splitting. This leads to fast algorithms that are applicable to large-scale problems. We empirically analyze the performances of these tree-based learners combined or not with the feature generation capability on several standard datasets.

Peer Reviewed
Mixtures of Bagged Markov Tree Ensembles
Schnitzler, François ULg; Geurts, Pierre ULg; Wehenkel, Louis ULg

in Cano Utrera, Andrès; Gómez-Olmedo, Manuel; Nielsen, Thomas (Eds.) Proceedings of the 6th European Workshop on Probabilistic Graphical Models (2012, September)

Markov trees, a probabilistic graphical model for density estimation, can be expanded in the form of a weighted average of Markov Trees. Learning these mixtures or ensembles from observations can be performed to reduce the bias or the variance of the estimated model. We propose a new combination of both, where the upper level seeks to reduce bias while the lower level seeks to reduce variance. This algorithm is evaluated empirically on datasets generated from a mixture of Markov trees and from other synthetic densities.

Peer Reviewed
Comparator selection for RPC with many labels
Hiard, Samuel ULg; Geurts, Pierre ULg; Wehenkel, Louis ULg

in ECAI 2012 : 20th European Conference on Artificial Intelligence : 27-31 August 2012, Montpellier, France (2012, August)

The Ranking by Pairwise Comparison algorithm (RPC) is a well-established label ranking method. However, its complexity is O(N²) in the number N of labels. We present algorithms for selecting, before model construction, a subset of comparators of size O(N), to reduce the computational complexity without loss in accuracy.
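
To make the O(N²) versus O(N) gap concrete, the sketch below counts the comparators that full pairwise comparison needs and contrasts this with a linear-size subset. The `chain_comparators` rule (adjacent labels only) is a purely hypothetical placeholder chosen for illustration; it is not the selection algorithm proposed in the paper, and both function names are assumptions.

```python
from itertools import combinations


def all_comparators(labels):
    """Every unordered label pair: N * (N - 1) / 2 binary comparators."""
    return list(combinations(labels, 2))


def chain_comparators(labels):
    """Hypothetical O(N) subset: only pairs of adjacent labels, N - 1 in total."""
    return list(zip(labels, labels[1:]))


labels = [f"label_{i}" for i in range(100)]
print(len(all_comparators(labels)), "comparators for full pairwise comparison")
print(len(chain_comparators(labels)), "comparators for the linear-size subset")
```

For N = 100 labels the full scheme trains N(N - 1)/2 = 4950 comparators, whereas any subset of size O(N) stays in the hundreds, which is where the computational saving comes from.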

Peer Reviewed
Wisdom of crowds for robust gene network inference
Marbach, Daniel; Costello, James C.; Küffner, Robert et al

in Nature Methods (2012), 9

Reconstructing gene regulatory networks from high-throughput data is a long-standing challenge. Through the Dialogue on Reverse Engineering Assessment and Methods (DREAM) project, we performed a comprehensive blind assessment of over 30 network inference methods on Escherichia coli, Staphylococcus aureus, Saccharomyces cerevisiae and in silico microarray data. We characterize the performance, data requirements and inherent biases of different inference approaches, and we provide guidelines for algorithm application and development. We observed that no single inference method performs optimally across all data sets. In contrast, integration of predictions from multiple inference methods shows robust and high performance across diverse data sets. We thereby constructed high-confidence networks for E. coli and S. aureus, each comprising ~1,700 transcriptional interactions at a precision of ~50%. We experimentally tested 53 previously unobserved regulatory interactions in E. coli, of which 23 (43%) were supported. Our results establish community-based methods as a powerful and robust tool for the inference of transcriptional gene regulatory networks.

L1-based compression of random forest models
Joly, Arnaud ULg; Schnitzler, François ULg; Geurts, Pierre ULg et al

in Proceeding of the 21st Belgian-Dutch Conference on Machine Learning (2012, May 24)

Random forests are effective supervised learning methods applicable to large-scale datasets. However, the space complexity of tree ensembles, in terms of their total number of nodes, is often prohibitive, especially in the context of problems with very high-dimensional input spaces. We propose to study their compressibility by applying an L1-based regularization to the set of indicator functions defined by all their nodes. We show experimentally that preserving or even improving the model accuracy while significantly reducing its space complexity is indeed possible.

Peer Reviewed
Approximation efficace de mélanges bootstrap d’arbres de Markov pour l’estimation de densité
Schnitzler, François ULg; Ammar, Sourour; Leray, Philippe et al

in Bougrain, Laurent (Ed.) Actes de la 14e Conférence Francophone sur l'Apprentissage Automatique (CAp 2012) (2012, May 23)

We consider algorithms for learning bootstrap mixtures of Markov trees for density estimation. For problems with a large number of variables and few observations, these mixtures generally estimate the density better than a single tree learned by maximum likelihood, but they are more expensive to learn. We therefore study an algorithm that learns these models approximately, in order to speed up learning without sacrificing accuracy. More specifically, while computing a first Markov tree we collect the edges that are good candidates for the structure, and consider only those when learning the subsequent trees. We compare this algorithm to the original mixture algorithm, to a tree learned by maximum likelihood, to a regularized tree, and to another approximate method.

Peer Reviewed
Statistical interpretation of machine learning-based feature importance scores for biomarker discovery
Huynh-Thu, Vân Anh ULg; Saeys, Yvan; Wehenkel, Louis ULg et al

in Bioinformatics (2012), 28(13), 1766-1774

Motivation: Univariate statistical tests are widely used for biomarker discovery in bioinformatics. These procedures are simple, fast and their output is easily interpretable by biologists but they can only identify variables that provide a significant amount of information in isolation from the other variables. As biological processes are expected to involve complex interactions between variables, univariate methods thus potentially miss some informative biomarkers. Variable relevance scores provided by machine learning techniques, however, are potentially able to highlight multivariate interacting effects, but unlike the p-values returned by univariate tests, these relevance scores are usually not statistically interpretable. This lack of interpretability hampers the determination of a relevance threshold for extracting a feature subset from the rankings and also prevents the wide adoption of these methods by practitioners. Results: We evaluated several procedures, both existing and novel, that extract relevant features from rankings derived from machine learning approaches. These procedures replace the relevance scores with measures that can be interpreted in a statistical way, such as p-values, false discovery rates, or family-wise error rates, for which it is easier to determine a significance level. Experiments were performed on several artificial problems as well as on real microarray datasets. Although the methods differ in terms of computing times and the tradeoff they achieve in terms of false positives and false negatives, some of them greatly help in the extraction of truly relevant biomarkers and should thus be of great practical interest for biologists and physicians. As a side conclusion, our experiments also clearly highlight that using model performance as a criterion for feature selection is often counter-productive.

Peer Reviewed
L1-based compression of random forest models
Joly, Arnaud ULg; Schnitzler, François ULg; Geurts, Pierre ULg et al

in 20th European Symposium on Artificial Neural Networks (2012, April)

Random forests are effective supervised learning methods applicable to large-scale datasets. However, the space complexity of tree ensembles, in terms of their total number of nodes, is often prohibitive, especially in the context of problems with very high-dimensional input spaces. We propose to study their compressibility by applying an L1-based regularization to the set of indicator functions defined by all their nodes. We show experimentally that preserving or even improving the model accuracy while significantly reducing its space complexity is indeed possible.

DMFSGD: A Decentralized Matrix Factorization Algorithm for Network Distance Prediction
Liao, Yongjun ULg; Du, Wei; Geurts, Pierre ULg et al

Report (2012)

The knowledge of end-to-end network distances is essential to many Internet applications. As active probing of all pairwise distances is infeasible in large-scale networks, a natural idea is to measure a few pairs and to predict the other ones without actually measuring them. This paper formulates the distance prediction problem as matrix completion where unknown entries of an incomplete matrix of pairwise distances are to be predicted. The problem is solvable because strong correlations among network distances exist and cause the constructed distance matrix to be low rank. The new formulation circumvents the well-known drawbacks of existing approaches based on Euclidean embedding. A new algorithm, so-called Decentralized Matrix Factorization by Stochastic Gradient Descent (DMFSGD), is proposed to solve the network distance prediction problem. By letting network nodes exchange messages with each other, the algorithm is fully decentralized and only requires each node to collect and to process local measurements, with neither explicit matrix constructions nor special nodes such as landmarks and central servers. In addition, we comprehensively compared matrix factorization and Euclidean embedding to demonstrate the suitability of the former on network distance prediction. We further studied the incorporation of a robust loss function and of non-negativity constraints. Extensive experiments on various publicly-available datasets of network delays show not only the scalability and the accuracy of our approach but also its usability in real Internet applications.

Peer Reviewed
Ensembles on Random Patches
Louppe, Gilles ULg; Geurts, Pierre ULg

in Machine Learning and Knowledge Discovery in Databases (2012)

In this paper, we consider supervised learning under the assumption that the available memory is small compared to the dataset size. This general framework is relevant in the context of big data, distributed databases and embedded systems. We investigate a very simple, yet effective, ensemble framework that builds each individual model of the ensemble from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset. We carry out an extensive and systematic evaluation of this method on 29 datasets, using decision tree-based estimators. With respect to popular ensemble methods, these experiments show that the proposed method provides on par performance in terms of accuracy while simultaneously lowering the memory needs, and attains significantly better performance when memory is severely constrained.
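
The patch-drawing step described in this abstract can be sketched as follows. This is a minimal illustration, assuming sampling without replacement and fractional patch sizes; the function names and parameters are hypothetical, and the base learner that would be trained on each patch is left out.

```python
import random


def draw_patch(n_samples, n_features, sample_frac, feature_frac, rng):
    """Pick random row and column indices defining one patch of the dataset."""
    n_rows = max(1, int(sample_frac * n_samples))
    n_cols = max(1, int(feature_frac * n_features))
    # Assumption: subsets are drawn without replacement.
    rows = sorted(rng.sample(range(n_samples), n_rows))
    cols = sorted(rng.sample(range(n_features), n_cols))
    return rows, cols


def extract_patch(X, rows, cols):
    """Copy the selected rows and columns out of a list-of-lists dataset X."""
    return [[X[r][c] for c in cols] for r in rows]


rng = random.Random(0)
X = [[10 * r + c for c in range(8)] for r in range(10)]  # toy 10 x 8 dataset
rows, cols = draw_patch(len(X), len(X[0]), 0.5, 0.25, rng)
patch = extract_patch(X, rows, cols)  # 5 instances, 2 features
```

Each model in the ensemble would be fitted on its own patch and would need to remember `cols` so that the same features are selected at prediction time. Only the patch, not the whole dataset, has to be held in memory while a model is trained.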
