Browse ORBi by ORBi project

Robustness properties of the ordered logistic discrimination
Ruwet, Christel ; Haesbroeck, Gentiane ;
Scientific conference (2010, May 20)
Logistic regression is a widely used tool designed to model the success probability of a Bernoulli random variable depending on some explanatory variables. A generalization of this bimodal model is the multinomial case where the dependent variable has more than two categories. When these categories are naturally ordered (e.g. in questionnaires where individuals are asked whether they strongly disagree, disagree, are indifferent, agree or strongly agree with a given statement), one speaks about ordered or ordinal logistic regression. The classical technique for estimating the unknown parameters is based on Maximum Likelihood estimation. Maximum Likelihood procedures are however known to be vulnerable to contamination in the data. The lack of robustness of this technique in the simple logistic regression setting has already been investigated in the literature, either by computing breakdown points or influence functions. Robust alternatives have also been constructed for that model. In this talk, the breakdown behavior of the ML-estimation procedure will be considered in the context of ordinal logistic regression. Influence functions will be computed and shown to be unbounded. A robust alternative based on a weighting idea will then be suggested and illustrated on some examples. These influence functions may be used to derive diagnostic measures, as will be illustrated on some examples. Furthermore, breakdown points will also be computed.
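
As an illustration of the ordinal setting described in this abstract, the following is a minimal sketch of how category probabilities can be computed under a cumulative-logit (proportional odds) model. The abstract does not state which parametrisation the authors use, so the form P(Y <= j | x) = logistic(alpha_j - x·beta), the function name `ordinal_logit_probs` and the parameter names are assumptions made for this example only.

```python
import math


def ordinal_logit_probs(x, beta, thresholds):
    """Category probabilities under a cumulative-logit model (assumed form).

    P(Y <= j | x) = 1 / (1 + exp(-(alpha_j - x . beta))), where the
    thresholds alpha_1 < ... < alpha_{J-1} separate J ordered categories.
    This parametrisation is an assumption, not taken from the abstract.
    """
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    if len(x) != len(beta):
        raise ValueError("x and beta must have the same length")

    # Linear predictor shared by all categories (proportional odds).
    eta = sum(b * xi for b, xi in zip(beta, x))

    # Cumulative probabilities P(Y <= j | x); the last one is 1 by definition.
    cumulative = [1.0 / (1.0 + math.exp(-(a - eta))) for a in thresholds]
    cumulative.append(1.0)

    # Differences of consecutive cumulative probabilities give P(Y = j | x).
    probs = [cumulative[0]]
    probs += [cumulative[j] - cumulative[j - 1] for j in range(1, len(cumulative))]
    return probs
```

With four thresholds the function returns five probabilities, matching a five-point scale such as the "strongly disagree" to "strongly agree" example above.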

RelaxMCD: smooth optimisation for the Minimum Covariance Determinant estimator
Schyns, Michael ; Haesbroeck, Gentiane ;
in Computational Statistics & Data Analysis (2010), 54(4), 843-857
The Minimum Covariance Determinant (MCD) estimator is a highly robust procedure for estimating the center and shape of a high dimensional data set. It consists of determining a subsample of h points out of n which minimizes the generalized variance. By definition, the computation of this estimator gives rise to a combinatorial optimization problem, for which several approximative algorithms have been developed. Some of these approximations are quite powerful, but they do not take advantage of any smoothness in the objective function. In this paper, focus is on the approach outlined in a general framework in Critchley et al. (2009) and which transforms any discrete and high dimensional combinatorial problem of this type into a continuous and low-dimensional one. The idea is to build on the general algorithm proposed by Critchley et al. (2009) in order to take into account the particular features of the MCD methodology. More specifically, both the adaptation of the algorithm to the specific MCD target function as well as the comparison of this “specialized” algorithm with the usual competitors for computing MCD are the main goals of this paper. The adaptation focuses on the design of “clever” starting points in order to systematically investigate the search domain. Accordingly, a new and surprisingly efficient procedure based on the well known k-means algorithm is constructed.
The adapted algorithm, called RelaxMCD, is then compared by means of simulations and examples with FASTMCD and the Feasible Subset Algorithm, both benchmark algorithms for computing MCD. As a by-product, it is shown that RelaxMCD is a general technique encompassing the two others, yielding insight about their overall good performance.

A relaxed approach to combinatorial problems in robustness and diagnostics
; Schyns, Michael ; Haesbroeck, Gentiane et al
in Statistics and Computing (2010), 20(1), 99-115
A range of procedures in both robustness and diagnostics require optimisation of a target functional over all subsamples of given size. Whereas such combinatorial problems are extremely difficult to solve exactly, something less than the global optimum can be ‘good enough’ for many practical purposes, as shown by example. Again, a relaxation strategy embeds these discrete, high-dimensional problems in continuous, low-dimensional ones. Overall, nonlinear optimisation methods can be exploited to provide a single, reasonably fast algorithm to handle a wide variety of problems of this kind, thereby providing a certain unity. Four running examples illustrate the approach. On the robustness side, algorithmic approximations to minimum covariance determinant (MCD) and least trimmed squares (LTS) estimation. And, on the diagnostic side, detection of multiple multivariate outliers and global diagnostic use of the likelihood displacement function. This last is developed here as a global complement to Cook’s (in J. R. Stat. Soc. 48:133–169, 1986) local analysis.
Appropriate convergence of each branch of the algorithm is guaranteed for any target functional whose relaxed form is ‘gravitational’ (a natural generalisation of concavity, introduced here). Again, its descent strategy can downweight to zero contaminating cases in the starting position. A simulation study shows that, although not optimised for the LTS problem, our general algorithm holds its own with algorithms that are so optimised. An adapted algorithm relaxes the gravitational condition itself.

Detection of influential observations on the error rate based on the generalized k-means clustering procedure
Ruwet, Christel ; Haesbroeck, Gentiane
Conference (2009, October 14)
Cluster analysis may be performed when one wishes to group similar objects into a given number of clusters. Several algorithms are available in order to construct these clusters. In this talk, focus will be on the generalized k-means algorithm, while the data of interest are assumed to come from an underlying population consisting of a mixture of two groups. Among the outputs of this clustering technique, a classification rule is provided in order to classify the objects into one of the clusters. When classification is the main objective of the statistical analysis, performance is often measured by means of an error rate ER(F; Fm) where F is the distribution of the training sample used to set up the classification rule and Fm (model distribution) is the distribution under which the quality of the rule is assessed (via a test sample). Under contamination, one has to replace the distribution F of the training sample by a contaminated one, F(eps) say (where eps corresponds to the fraction of contamination).
In that case, the error rate will be corrupted since it relies on a contaminated rule, while the test sample may still be considered as being distributed according to the model distribution. To measure the robustness of classification based on this clustering procedure, influence functions of the error rate may be computed. The idea has already been exploited by Croux et al. (2008) and Croux et al. (2008) in the context of linear and logistic discrimination. In this setup, the contaminated distribution takes the form F(eps) = (1 - eps)*Fm + eps*Dx, where Dx is the Dirac distribution putting all its mass at x. After studying the influence function of the error rate of the generalized k-means procedure, which depends on the influence functions of the generalized k-means centers derived by Garcia-Escudero and Gordaliza (1999), a diagnostic tool based on its value will be presented. The aim is to detect observations in the training sample which can be influential for the error rate.

Outlier detection with the minimum covariance determinant estimator in practice
Fauconnier, Cécile ; Haesbroeck, Gentiane
in Statistical Methodology (2009), 6(4), 363-379
Robust statistics has slowly become familiar to all practitioners. Books entirely devoted to the subject are without any doubt responsible for the increased practice of robust statistics in all fields of applications. Even classical books often have at least one chapter (or parts of chapters) which develops robust methodology. The improvement of computing power has also contributed to the development of a wider and wider range of available robust procedures.
However, this success story is now threatening to go into reverse: non-specialists interested in the application of robust methodology are faced with a large set of (assumed equivalent) methods and with over-sophistication of some of them. Which method should one use? How should the (numerous) parameters be optimally tuned? These questions are not so easy to answer for non-specialists! One could then argue that default procedures are available in most statistical software (Splus, R, SAS, Matlab,...). However, using as illustration the detection of outliers in multivariate data, it is shown that, on one hand, it is not obvious that one would feel confident with the output of default procedures, and that, on the other hand, trying to understand thoroughly the tuning parameters involved in the procedures might require some extensive research. This is not conceivable when trying to compete with the classical methodology which (while clearly unreliable) is so straightforward. The aim of the paper is to help practitioners who wish to detect outliers in a multivariate data set in a reliable way. The chosen methodology is the Minimum Covariance Determinant estimator, which is widely available and intuitively appealing.

Impact of contamination on empirical and theoretical error
Ruwet, Christel ; Haesbroeck, Gentiane
Conference (2009, June 18)
Classification analysis makes it possible to group similar objects into a given number of groups by means of a classification rule. Many classification procedures are available: linear discrimination, logistic discrimination, etc. Focus in this poster will be on classification resulting from a clustering analysis.
Indeed, among the outputs of classical clustering techniques, a classification rule is provided in order to classify the objects into one of the clusters. More precisely, let F denote the underlying distribution and assume that the generalized k-means algorithm with penalty function is used to construct the k clusters C1(F), ..., Ck(F) with centers T1(F), ..., Tk(F). When one feels that k true groups exist among the data, classification might be the main objective of the statistical analysis. Performance of a particular classification technique can be measured by means of an error rate. Depending on the availability of data, two types of error rates may be computed: a theoretical one and a more empirical one. In the first case, the rule is estimated on a training sample with distribution F while the evaluation of the classification performance may be done through a test sample distributed according to a model distribution of interest, Fm say. In the second case, the same data are used to set up the rule and to evaluate the performance. Under contamination, one has to replace the distribution F of the training sample by a contaminated one, F(eps) say (where eps corresponds to the fraction of contamination). In that case, the theoretical error rate will be corrupted since it relies on a contaminated rule but it may still consider a test sample distributed according to the model distribution. The empirical error rate will be affected twice: via the rule and also via the sample used for the evaluation of the classification performance. To measure the robustness of classification based on clustering, influence functions of the error rate may be computed. The idea has already been exploited by Croux et al (2008) and Croux et al (2008) in the context of linear and logistic discrimination.
In the computation of influence functions, the contaminated distribution takes the form F(eps) = (1 - eps)*Fm + eps*Dx, where Dx is the Dirac distribution putting all its mass at x. It is interesting to note that the impact of the point mass x may be positive, i.e. may decrease the error rate, when the data at hand is used to evaluate the error.

Influence function of the error rate of classification based on clustering
Ruwet, Christel ; Haesbroeck, Gentiane
Conference (2009, May 19)
Cluster analysis may be performed when one wishes to group similar objects into a given number of clusters. Several algorithms are available in order to construct these clusters. In this talk, focus will be on two particular cases of the generalized k-means algorithm: the classical k-means procedure as well as the k-medoids algorithm, while the data of interest are assumed to come from an underlying population consisting of a mixture of two groups. Among the outputs of these clustering techniques, a classification rule is provided in order to classify the objects into one of the clusters. When classification is the main objective of the statistical analysis, performance is often measured by means of an error rate. Two types of error rates can be computed: a theoretical one and a more empirical one. The first one can be written as ER(F, Fm) where F is the distribution of the training sample used to set up the classification rule and Fm (model distribution) is the distribution under which the quality of the rule is assessed (via a test sample). The empirical error rate corresponds to ER(F, F), meaning that the classification rule is tested on the same sample as the one used to set up the rule.
This talk will present the results concerning the theoretical error rate. In case there are some outliers in the data, the classification rule may be corrupted. Even if it is evaluated at the model distribution, the theoretical error rate may then be contaminated. To measure the robustness of classification based on clustering, influence functions have been computed. Results similar to those derived by Croux et al (2008) and Croux et al (2008) in discriminant analysis were observed. More specifically, under optimality (which happens when the model distribution is FN = 0.5 N(μ1, σ) + 0.5 N(μ2, σ), Qiu and Tamhane 2007), the contaminated error rate can never be smaller than the optimal value, resulting in a first order influence function identically equal to 0. Second order influence functions then need to be computed. When optimality does not hold, the first order influence function of the theoretical error rate no longer vanishes and shows that contamination may improve the error rate achieved under the non-optimal model. The first and, when required, second order influence functions of the theoretical error rate are useful in their own right to compare the robustness of the 2-means and 2-medoids classification procedures. They also have other applications. For example, they may be used to derive diagnostic tools in order to detect observations having an unduly large influence on the error rate. Also, under optimality, the second order influence function of the theoretical error rate can yield asymptotic relative classification efficiencies.

Influence functions of the error rates of classification based on clustering
Ruwet, Christel ; Haesbroeck, Gentiane
Poster (2009, May)
Cluster analysis may be performed when one wishes to group similar objects into a given number of clusters. Several algorithms are available in order to construct these clusters. In this poster, focus will be on two particular cases of the generalized k-means algorithm: the classical k-means procedure as well as the k-medoids algorithm, while the data of interest are assumed to come from an underlying population consisting of a mixture of two groups. Among the outputs of these clustering techniques, a classification rule is provided in order to classify the objects into one of the clusters. When classification is the main objective of the statistical analysis, performance is often measured by means of an error rate. Two types of error rates can be computed: a theoretical one and a more empirical one. The first one can be written as ER(F, Fm) where F is the distribution of the training sample used to set up the classification rule and Fm (model distribution) is the distribution under which the quality of the rule is assessed (via a test sample). The empirical error rate corresponds to ER(F, F), meaning that the classification rule is tested on the same sample as the one used to set up the rule. In case there are some outliers in the data, the classification rule may be corrupted. Even if it is evaluated at the model distribution, the theoretical error rate may then be contaminated, while the effect of contamination on the empirical error rate is two-fold: both the rule and the test sample are contaminated. To measure the robustness of classification based on clustering, influence functions have been computed, both for the theoretical and the empirical error rates. When using the theoretical error rate, results similar to those derived by Croux et al (2008) and Croux et al (2008) in discriminant analysis were observed.
More specifically, under optimality (which happens when the model distribution is FN = 0.5 N(μ1, σ) + 0.5 N(μ2, σ), Qiu and Tamhane 2007), the contaminated error rate can never be smaller than the optimal value, resulting in a first order influence function identically equal to 0. Second order influence functions would then need to be computed, which will be done in future research. When optimality does not hold, the first order influence function of the theoretical error rate no longer vanishes and shows that contamination may improve the error rate achieved under the non-optimal model. Similar computations have been performed for the empirical error rate, as the poster will show. The first and, when required, second order influence functions of the theoretical and empirical error rates are useful in their own right to compare the robustness of the 2-means and 2-medoids classification procedures. They also have other applications. For example, they may be used to derive diagnostic tools in order to detect observations having an unduly large influence on the error rate. Also, under optimality, the second order influence function of the theoretical error rate can yield asymptotic relative classification efficiencies.

Influence function of the error rate of the generalized k-means
Ruwet, Christel ; Haesbroeck, Gentiane
Scientific conference (2009, March 30)
Cluster analysis may be performed when one wishes to group similar objects into a given number of clusters. Several algorithms are available in order to construct these clusters. In this talk, focus will be on two particular cases of the generalized k-means algorithm: the classical k-means procedure as well as the k-medoids algorithm.
Among the outputs of these clustering techniques, a classification rule is provided in order to classify the objects into one of the clusters. When classification is the main objective of the statistical analysis, performance is often measured by means of an error rate. In the clustering setting, the error rate has to be measured on the training sample while test samples are usually used in other settings like linear discrimination or logistic discrimination. This characteristic of classification resulting from a clustering implies that contamination in the training sample may not only affect the classification rule but also other parameters involved in the error rate. In the talk, influence functions will be used to measure the impact of contamination on the error rate and will show that contamination may decrease the error rate that one would expect under a given model. Moreover, a kind of second-order influence function will also be derived to measure the bias in error rate that the k-means and k-medoids procedures suffer from in finite samples. Simulations will confirm the results obtained via the first and second-order influence functions. Future research perspectives will conclude the talk.

Logistic discrimination using robust estimators: an influence function approach
; Haesbroeck, Gentiane ;
in Canadian Journal of Statistics = Revue Canadienne de Statistique (2008), 36(1), 157-174
Logistic regression is frequently used for classifying observations into two groups. Unfortunately there are often outlying observations in a data set and these might affect the estimated model and the associated classification error rate.
In this paper, the authors study the effect of observations in the training sample on the error rate by deriving influence functions. They obtain a general expression for the influence function of the error rate, and they compute it for the maximum likelihood estimator as well as for several robust logistic discrimination procedures. Besides being of interest in their own right, the influence functions are also used to derive asymptotic classification efficiencies of different logistic discrimination rules. The authors also show how influential points can be detected by means of a diagnostic plot based on the values of the influence function.

Pratique de la statistique descriptive
Henry, Valérie ; Haesbroeck, Gentiane
Book published by Coédition Ferrer et Céfal (2004)
This is a book of solved exercises that invites the reader to reflect on the judicious use of the appropriate tool, introduces the analysis of statistical data and the interpretation of the results obtained, and covers the most recent statistical techniques, such as exploratory data analysis and robust statistics.

The case sensitivity function approach to diagnostic and robust computation: a relaxation strategy
; Schyns, Michael ; Haesbroeck, Gentiane et al
in Antoch, Jaromir (Ed.) COMPSTAT 2004: Proceedings in Computational Statistics (2004)

Implementing the Bianco and Yohai estimator for logistic regression
; Haesbroeck, Gentiane
in Computational Statistics & Data Analysis (2003), 44(1-2), 273-295
A fast and stable algorithm to compute a highly robust estimator for the logistic regression model is proposed. A criterion for the existence of this estimator at finite samples is derived and the problem of the selection of an appropriate loss function is discussed. It is shown that the loss function can be chosen such that the robust estimator exists if and only if the maximum likelihood estimator exists. The advantages of using a weighted version of this estimator are also considered. Simulations and an example give further support for the good performance of the implemented estimators. (C) 2003 Elsevier B.V. All rights reserved.

The breakdown behavior of the maximum likelihood estimator in the logistic regression model
; ; Haesbroeck, Gentiane
in Statistics & Probability Letters (2002), 60(4), 377-386
In this note we discuss the breakdown behavior of the maximum likelihood (ML) estimator in the logistic regression model. We formally prove that the ML-estimator never explodes to infinity, but rather breaks down to zero when adding severe outliers to a data set. An example confirms this behavior. (C) 2002 Published by Elsevier Science B.V.

Location adjustment for the minimum volume ellipsoid estimator
; Haesbroeck, Gentiane ;
in Statistics and Computing (2002), 12(3), 191-200
Estimating multivariate location and scatter with both affine equivariance and positive breakdown has always been difficult. A well-known estimator which satisfies both properties is the Minimum Volume Ellipsoid Estimator (MVE). Computing the exact MVE is often not feasible, so one usually resorts to an approximate algorithm. In the regression setup, algorithms for positive-breakdown estimators like Least Median of Squares typically recompute the intercept at each step, to improve the result. This approach is called intercept adjustment. In this paper we show that a similar technique, called location adjustment, can be applied to the MVE. For this purpose we use the Minimum Volume Ball (MVB), in order to lower the MVE objective function. An exact algorithm for calculating the MVB is presented. As an alternative to MVB location adjustment we propose L-1 location adjustment, which does not necessarily lower the MVE objective function but yields more efficient estimates for the location part. Simulations compare the two types of location adjustment. We also obtain the maxbias curves of both L-1 and the MVB in the multivariate setting, revealing the superiority of L-1.

Maxbias curves of robust location estimators based on subranges
; Haesbroeck, Gentiane
in Journal of Nonparametric Statistics (2002), 14(3), 295-306
A maxbias curve is a powerful tool to describe the robustness of an estimator. It tells us how much an estimator can change due to a given fraction of contamination. In this paper, maxbias curves are computed for some univariate location estimators based on subranges: midranges, trimmed means and the univariate Minimum Volume Ellipsoid (MVE) location estimators.
These estimators are intuitively appealing and easy to calculate.

A note on finite-sample efficiencies of estimators for the minimum volume ellipsoid
; Haesbroeck, Gentiane
in Journal of Statistical Computation & Simulation (2002), 72(7), 585-596
Among the most well known estimators of multivariate location and scatter is the Minimum Volume Ellipsoid (MVE). Many algorithms have been proposed to compute it. Most of these attempt merely to approximate the exact MVE as closely as possible, but some of them led to the definition of new estimators which maintain the properties of robustness and affine equivariance that make the MVE so attractive. Rousseeuw and van Zomeren (1990) used the (p+1)-subset estimator, which was modified by Croux and Haesbroeck (1997) to give rise to the averaged (p+1)-subset estimator. This note shows by means of simulations that the averaged (p+1)-subset estimator outperforms the exact estimator as far as finite-sample efficiency is concerned. We also present a new robust estimator for the MVE, closely related to the averaged (p+1)-subset estimator, but yielding a natural ranking of the data.

Sur l'enseignement de la statistique en Communauté française de Belgique
Bair, Jacques ; Haesbroeck, Gentiane
in Repères-IREM (2002), 48

Maxbias curves of robust scale estimators based on subranges
; Haesbroeck, Gentiane
in Metrika (2001), 53(2), 101-122
A maxbias curve is a powerful tool to describe the robustness of an estimator. It is an asymptotic concept which tells how much an estimator can change due to a given fraction of contamination. In this paper, maxbias curves are computed for some univariate scale estimators based on subranges: trimmed standard deviations, interquantile ranges and the univariate Minimum Volume Ellipsoid (MVE) and Minimum Covariance Determinant (MCD) scale estimators. These estimators are intuitively appealing and easy to calculate. Since the bias behavior of scale estimators may differ depending on the type of contamination (outliers or inliers), expressions for both explosion and implosion maxbias curves are given. On the basis of robustness and efficiency arguments, the MCD scale estimator with 25% breakdown point can be recommended for practical use.

Régression logistique robuste
Croux, Christophe ; Haesbroeck, Gentiane
in Droesbeke, J. J.; Lejeune, M.; Saporta, G. (Eds.) Modèles statistiques pour données qualitatives (2001)
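
The univariate MCD scale estimator mentioned in the Metrika (2001) abstract above can be illustrated with a short sketch. In one dimension, the subsample of h points with the smallest variance is always a run of h consecutive order statistics, so a sliding window over the sorted data is enough. The function name `mcd_scale_1d` is hypothetical, and the result is the raw (uncorrected) scale: any consistency factor the paper may use is not reproduced here.

```python
import math
import statistics


def mcd_scale_1d(data, h):
    """Raw univariate MCD location and scale (illustrative sketch).

    Scans every window of h consecutive order statistics and keeps the one
    with the smallest variance. No consistency correction is applied, so the
    returned scale is an assumption-laden raw value, not a calibrated estimate.
    """
    xs = sorted(data)
    n = len(xs)
    if not 2 <= h <= n:
        raise ValueError("h must satisfy 2 <= h <= len(data)")

    best_var = None
    best_mean = None
    for start in range(n - h + 1):
        window = xs[start:start + h]
        var = statistics.pvariance(window)  # variance of the h-point subsample
        if best_var is None or var < best_var:
            best_var = var
            best_mean = statistics.fmean(window)

    return best_mean, math.sqrt(best_var)
```

Choosing h close to 0.75 n leaves roughly a quarter of the observations outside the selected subsample, which is in the spirit of the 25% breakdown point mentioned in the abstract; the exact choice of h used by the authors is not stated there.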