Browse ORBi by ORBi project

Approximation efficace de mélanges bootstrap d’arbres de Markov pour l’estimation de densité
Schnitzler, François ; ; et al
in Bougrain, Laurent (Ed.) Actes de la 14e Conférence Francophone sur l'Apprentissage Automatique (CAp 2012) (2012, May 23)
We consider algorithms for learning bootstrap Mixtures of Markov Trees for density estimation. For problems with many variables and few observations, these mixtures generally estimate the density better than a single tree learned by maximum likelihood, but they are more expensive to learn. We therefore study an algorithm that learns these models approximately, in order to speed up learning without sacrificing accuracy. More specifically, while computing a first Markov tree we collect the edges that are good candidates for the structure, and we consider only those edges when learning the subsequent trees. We compare this algorithm to the original mixture algorithm, to a tree learned by maximum likelihood, to a regularized tree, and to another approximate method.

Statistical interpretation of machine learning-based feature importance scores for biomarker discovery
Huynh-Thu, Vân Anh ; ; Wehenkel, Louis et al
in Bioinformatics (2012), 28(13), 1766-1774
Motivation: Univariate statistical tests are widely used for biomarker discovery in bioinformatics.
These procedures are simple and fast, and their output is easily interpreted by biologists, but they can only identify variables that provide a significant amount of information in isolation from the other variables. As biological processes are expected to involve complex interactions between variables, univariate methods potentially miss some informative biomarkers. Variable relevance scores provided by machine learning techniques, however, are potentially able to highlight multivariate interacting effects, but unlike the p-values returned by univariate tests, these relevance scores are usually not statistically interpretable. This lack of interpretability hampers the determination of a relevance threshold for extracting a feature subset from the rankings and also prevents the wide adoption of these methods by practitioners. Results: We evaluated several procedures, both existing and novel, that extract relevant features from rankings derived from machine learning approaches. These procedures replace the relevance scores with measures that can be interpreted in a statistical way, such as p-values, false discovery rates, or family-wise error rates, for which it is easier to determine a significance level. Experiments were performed on several artificial problems as well as on real microarray datasets. Although the methods differ in terms of computing times and the tradeoff they achieve between false positives and false negatives, some of them greatly help in the extraction of truly relevant biomarkers and should thus be of great practical interest for biologists and physicians. As a side conclusion, our experiments also clearly highlight that using model performance as a criterion for feature selection is often counter-productive.
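The abstract above does not say which procedures were evaluated, so the following is only an illustrative sketch of the general idea of turning relevance scores into statistically interpretable quantities. It assumes a label-permutation test to obtain empirical p-values, followed by the Benjamini-Hochberg adjustment to control the false discovery rate. The function names, the `n_perm` and `seed` parameters, and the toy score `abs_mean_difference` (a stand-in for a machine learning relevance score) are all hypothetical and not taken from the paper.

```python
import random


def abs_mean_difference(X, y):
    """Toy relevance score (an assumption, not the paper's score):
    |mean of class 1 - mean of class 0| for each feature."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return scores


def permutation_p_values(score_fn, X, y, n_perm=200, seed=0):
    """Empirical p-value per feature: the fraction of label permutations
    whose score reaches the score observed with the real labels."""
    rng = random.Random(seed)
    observed = score_fn(X, y)
    exceed = [0] * len(observed)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        for j, s in enumerate(score_fn(X, y_perm)):
            if s >= observed[j]:
                exceed[j] += 1
    # The +1 terms prevent an estimated p-value of exactly zero.
    return [(c + 1) / (n_perm + 1) for c in exceed]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Feature 0 separates the classes; feature 1 is constant.
    y = [0] * 10 + [1] * 10
    X = [[float(label), 0.0] for label in y]
    p = permutation_p_values(abs_mean_difference, X, y)
    print(p, benjamini_hochberg(p))
```

Any scoring function with the same signature could be passed as `score_fn`, for example one that wraps a tree-ensemble importance measure; features are then selected by thresholding the adjusted values at a chosen significance level rather than by thresholding raw scores.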

L1-based compression of random forest models
Joly, Arnaud ; Schnitzler, François ; Geurts, Pierre et al
in 20th European Symposium on Artificial Neural Networks (2012, April)
Random forests are effective supervised learning methods applicable to large-scale datasets. However, the space complexity of tree ensembles, in terms of their total number of nodes, is often prohibitive, especially in the context of problems with very high-dimensional input spaces. We propose to study their compressibility by applying an L1-based regularization to the set of indicator functions defined by all their nodes. We show experimentally that preserving or even improving the model accuracy while significantly reducing its space complexity is indeed possible.

DMFSGD: A Decentralized Matrix Factorization Algorithm for Network Distance Prediction
Liao, Yongjun ; ; Geurts, Pierre et al
Report (2012)
The knowledge of end-to-end network distances is essential to many Internet applications. As active probing of all pairwise distances is infeasible in large-scale networks, a natural idea is to measure a few pairs and to predict the other ones without actually measuring them. This paper formulates the distance prediction problem as matrix completion where unknown entries of an incomplete matrix of pairwise distances are to be predicted. The problem is solvable because strong correlations among network distances exist and cause the constructed distance matrix to be low rank.
The new formulation circumvents the well-known drawbacks of existing approaches based on Euclidean embedding. A new algorithm, called Decentralized Matrix Factorization by Stochastic Gradient Descent (DMFSGD), is proposed to solve the network distance prediction problem. By letting network nodes exchange messages with each other, the algorithm is fully decentralized and only requires each node to collect and process local measurements, with neither explicit matrix constructions nor special nodes such as landmarks and central servers. In addition, we comprehensively compared matrix factorization and Euclidean embedding to demonstrate the suitability of the former for network distance prediction. We further studied the incorporation of a robust loss function and of non-negativity constraints. Extensive experiments on various publicly available datasets of network delays show not only the scalability and accuracy of our approach but also its usability in real Internet applications.

Ensembles on Random Patches
Louppe, Gilles ; Geurts, Pierre
in Machine Learning and Knowledge Discovery in Databases (2012)
In this paper, we consider supervised learning under the assumption that the available memory is small compared to the dataset size. This general framework is relevant in the context of big data, distributed databases and embedded systems. We investigate a very simple, yet effective, ensemble framework that builds each individual model of the ensemble from a random patch of data obtained by drawing random subsets of both instances and features from the whole dataset. We carry out an extensive and systematic evaluation of this method on 29 datasets, using decision tree-based estimators.
Compared with popular ensemble methods, these experiments show that the proposed method achieves on-par accuracy while lowering memory needs, and attains significantly better performance when memory is severely constrained.

Phenotype Classification of Zebrafish Embryos by Supervised Learning
Jeanray, Nathalie ; Marée, Raphaël ; Pruvot, Benoist et al
Poster (2011, December 08)

Decentralized Prediction of End-to-End Network Performance Classes
Liao, Yongjun ; ; Geurts, Pierre et al
in Proc. of the 7th International Conference on emerging Networking EXperiments and Technologies (CoNEXT) (2011, December 08)
In large-scale networks, full-mesh active probing of end-to-end performance metrics is infeasible. Measuring a small set of pairs and predicting the others is more scalable. Under this framework, we formulate the prediction problem as matrix completion, whereby unknown entries of an incomplete matrix of pairwise measurements are to be predicted. This problem can be solved by matrix factorization because performance matrices have a low rank, thanks to the correlations among measurements. Moreover, its resolution can be fully decentralized without actually building matrices or relying on special landmarks or central servers. In this paper we demonstrate that this approach is also applicable when the performance values are not measured exactly, but are only known to belong to one among some predefined performance classes, such as "good" and "bad".
Such a classification-based formulation not only fulfills the requirements of many Internet applications but also reduces the measurement cost and enables a unified treatment of various performance metrics. We propose a decentralized approach based on Stochastic Gradient Descent to solve this class-based matrix completion problem. Experiments on various datasets, relative to two kinds of metrics, show the accuracy of the approach, its robustness against erroneous measurements and its usability for peer selection.

Pruning randomized trees with L1-norm regularization
Joly, Arnaud ; Schnitzler, François ; Geurts, Pierre et al
Poster (2011, November 29)
The growing amount of high-dimensional data requires robust analysis techniques. Tree-based ensemble methods provide such accurate supervised learning models. However, the model complexity can become huge depending on the dimension of the dataset. Here we propose a method to compress such ensembles using a random-tree-induced space and L1-norm regularisation. This leads to a drastic pruning, preserving or improving the model accuracy. Moreover, our approach increases robustness with respect to the selection of complexity parameters.

Phenotype Classification of Zebrafish Embryos by Supervised Learning
Jeanray, Nathalie ; Marée, Raphaël ; Pruvot, Benoist et al
Conference (2011, September 02)

Efficiently approximating Markov tree bagging for high-dimensional density estimation
Schnitzler, François ; ; et al
in Gunopulos, Dimitrios; Hofmann, Thomas; Malerba, Donato (Eds.)
et al Machine Learning and Knowledge Discovery in Databases, Part III (2011, September)
We consider algorithms for generating Mixtures of Bagged Markov Trees, for density estimation. In problems defined over many variables and when few observations are available, those mixtures generally outperform a single Markov tree maximizing the data likelihood, but are far more expensive to compute. In this paper, we describe new algorithms for approximating such models, with the aim of speeding up learning without sacrificing accuracy. More specifically, we propose to use a filtering step obtained as a by-product from computing a first Markov tree, so as to avoid considering poor candidate edges in the subsequently generated trees. We compare these algorithms (on synthetic data sets) to Mixtures of Bagged Markov Trees, as well as to a single Markov tree derived by the classical Chow-Liu algorithm and to a recently proposed randomized scheme used for building tree mixtures.

High-density lipoprotein proteome dynamics in human endotoxemia.
; Geurts, Pierre ; et al
in Proteome science (2011), 9(1), 34
BACKGROUND: A large variety of proteins involved in inflammation, coagulation, lipid-oxidation and lipid metabolism have been associated with high-density lipoprotein (HDL) and it is anticipated that changes in the HDL proteome have implications for the multiple functions of HDL.
Here, SELDI-TOF mass spectrometry (MS) was used to study the dynamic changes of HDL protein composition in a human experimental low-dose endotoxemia model. Ten healthy men with low HDL cholesterol (0.7+/-0.1 mmol/L) and 10 men with high HDL cholesterol levels (1.9+/-0.4 mmol/L) were challenged with endotoxin (LPS) intravenously (1 ng/kg bodyweight). We previously showed that subjects with low HDL cholesterol are more susceptible to an inflammatory challenge. The current study tested the hypothesis that this discrepancy may be related to differences in the HDL proteome. RESULTS: Plasma drawn at 7 time-points over a 24 hour time period after LPS challenge was used for direct capture of HDL using antibodies against apolipoprotein A-I followed by subsequent SELDI-TOF MS profiling. Upon LPS administration, profound changes in 21 markers (adjusted p-value < 0.05) were observed in the proteome in both study groups. These changes were observed 1 hour after LPS infusion and sustained up to 24 hours, but unexpectedly were not different between the 2 study groups. Hierarchical clustering of the protein spectra at all time points of all individuals revealed 3 distinct clusters, which were largely independent of baseline HDL cholesterol levels but correlated with paraoxonase 1 activity. The acute phase protein serum amyloid A-1/2 (SAA-1/2) was clearly upregulated after LPS infusion in both groups and comprised both native and N-terminal truncated variants that were identified by two-dimensional gel electrophoresis and mass spectrometry. Individuals of one of the clusters were distinguished by a lower SAA-1/2 response after LPS challenge and a delayed time-response of the truncated variants. CONCLUSIONS: This study shows that the semi-quantitative differences in the HDL proteome as assessed by SELDI-TOF MS cannot explain why subjects with low HDL cholesterol are more susceptible to a challenge with LPS than those with high HDL cholesterol. 
Instead, the results indicate that hierarchical clustering could be useful to predict HDL functionality in acute phase responses towards LPS.

Phenotype Classification of Zebrafish Embryos by Supervised Learning
Jeanray, Nathalie ; Marée, Raphaël ; Pruvot, Benoist et al
Poster (2011, May 20)

Zebrafish Skeleton Measurements using Image Analysis and Machine Learning Methods
Stern, Olivier ; Marée, Raphaël ; Aceto, Jessica et al
Poster (2011, May 20)
The zebrafish is a model organism for biological studies on development and gene function. Our work aims at automating the detection of the cartilage skeleton and measuring several distances and angles to quantify its development following different experimental conditions.

Learning from positive and unlabeled examples by enforcing statistical significance
Geurts, Pierre
in JMLR: Workshop and Conference Proceedings (2011, April), 15
Given a finite but large set of objects described by a vector of features, only a small subset of which have been labeled as ‘positive’ with respect to a class of interest, we consider the problem of characterizing the positive class. We formalize this as the problem of learning a feature-based score function that minimizes the p-value of a nonparametric statistical hypothesis test.
For linear score functions over the original feature space or over one of its kernelized versions, we provide a solution of this problem computed by a one-class SVM applied on a surrogate dataset obtained by sampling subsets of the overall set of objects and representing them by their average feature-vector shifted by the average feature-vector of the original sample of positive examples. We carry out experiments with this method on the prediction of targets of transcription factors in two different organisms, E. coli and S. cerevisiae. Our method extends enrichment analysis commonly carried out in bioinformatics and its results outperform common solutions to this problem.

Looking for applications of mixtures of Markov trees in bioinformatics
Schnitzler, François ; Geurts, Pierre ; Wehenkel, Louis
Scientific conference (2011, March 21)
Probabilistic graphical models (PGM) efficiently encode a probability distribution on a large set of variables. While they have already had several successful applications in biology, their poor scaling in terms of the number of variables may make them unfit to tackle problems of increasing size. Mixtures of trees however scale well by design. Experiments on synthetic data have shown the interest of our new learning methods for this model, and we now wish to apply them to relevant problems in bioinformatics.

Learning to rank with extremely randomized trees
Geurts, Pierre ; Louppe, Gilles
in JMLR: Workshop and Conference Proceedings (2011, January), 14
In this paper, we report on our experiments on the Yahoo! Labs Learning to Rank challenge organized in the context of the 23rd International Conference of Machine Learning (ICML 2010). We competed in both the learning to rank and the transfer learning tracks of the challenge with several tree-based ensemble methods, including Tree Bagging, Random Forests, and Extremely Randomized Trees. Our methods ranked 10th in the first track and 4th in the second track. Although not at the very top of the ranking, our results show that ensembles of randomized trees are quite competitive for the “learning to rank” problem. The paper also analyzes computing times of our algorithms and presents some post-challenge experiments with transfer learning methods.

Automatic localization of interest points in zebrafish images with tree-based methods
Stern, Olivier ; Marée, Raphaël ; Aceto, Jessica et al
in Proceedings of the 6th IAPR International Conference on Pattern Recognition in Bioinformatics (2011)
In many biological studies, scientists assess effects of experimental conditions by visual inspection of microscopy images. They are able to observe whether a protein is expressed or not, if cells are going through normal cell cycles, how organisms evolve in different experimental conditions, etc. But, with the large number of images acquired in high-throughput experiments, this manual inspection becomes lengthy, tedious and error-prone.
In this paper, we propose to automatically detect specific interest points in microscopy images using machine learning methods with the aim of performing automatic morphometric measurements in the context of zebrafish studies. We systematically evaluate variants of ensembles of classification and regression trees on four datasets corresponding to different imaging modalities and experimental conditions. Our results show that all variants are effective, with a slight advantage for multiple output methods, which are more robust to parameter choices.

Statistical interpretation of machine learning-based feature rankings for biomarker discovery
Huynh-Thu, Vân Anh ; ; Wehenkel, Louis et al
Conference (2011)

Inferring gene regulatory networks from expression data using tree-based methods
Huynh-Thu, Vân Anh ; Irrthum, Alexandre ; Wehenkel, Louis et al
Conference (2011)

MicroRNAs Profiling in Murine Models of Acute and Chronic Asthma: A Relationship with mRNAs Targets
Garbacki, Nancy ; Di Valentin, Emmanuel ; Huynh-Thu, Vân Anh et al
in PLoS ONE (2011)