Multivariate coefficients of variation: Comparison and influence functions
Aerts, Stéphanie; Haesbroeck, Gentiane; Ruwet, Christel
Journal of Multivariate Analysis (2015), 142
In the univariate setting, coefficients of variation are well known and used to compare the variability of populations characterized by variables expressed in different units or having really different means. When dealing with more than one variable, the use of such a relative dispersion measure is much less common, even though several generalizations of the coefficient of variation to the multivariate setting have been introduced in the literature. In this paper, the lack of robustness of the sample versions of the multivariate coefficients of variation (MCV) is illustrated by means of influence functions, and some robust counterparts based either on the Minimum Covariance Determinant (MCD) estimator or on the S estimator are advocated. Then, focusing on two of the considered MCVs, a diagnostic tool is derived and its efficiency in detecting observations having an unduly large effect on variability is illustrated on a real-life data set. The influence functions are also used to compute asymptotic variances under elliptical distributions, yielding approximate confidence intervals. Finally, simulations are conducted in order to compare, in a finite-sample setting, the performance of the classical and robust MCVs in terms of variability and in terms of coverage probability of the corresponding asymptotic confidence intervals.
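The estimators discussed in this abstract are plug-in estimators: any location/scatter pair can be substituted into an MCV formula. As a minimal illustration (not the paper's code), the sketch below computes one common multivariate CV, the Voinov-Nikulin coefficient (mean' Cov^{-1} mean)^(-1/2), once with the classical mean and covariance and once with the MCD estimator from scikit-learn; the simulated data, the contamination scheme, and the choice of this particular MCV are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def mcv_voinov_nikulin(mean, cov):
    # Voinov-Nikulin multivariate CV: (mean' cov^{-1} mean)^(-1/2);
    # any location/scatter pair can be plugged in.
    return float(1.0 / np.sqrt(mean @ np.linalg.solve(cov, mean)))

rng = np.random.default_rng(0)
X = rng.multivariate_normal([5.0, 3.0], [[1.0, 0.3], [0.3, 0.5]], size=500)
X[:10] += 20.0  # contaminate 2% of the sample with gross outliers

# Classical plug-in version: inflated by the outliers
classical = mcv_voinov_nikulin(X.mean(axis=0), np.cov(X, rowvar=False))

# Robust plug-in version based on the Minimum Covariance Determinant estimator
mcd = MinCovDet(random_state=0).fit(X)
robust = mcv_voinov_nikulin(mcd.location_, mcd.covariance_)
```

With this contamination, the classical value rises well above the robust one, which stays near the uncontaminated coefficient (about 0.18 for these parameters), illustrating the unbounded influence of the classical plug-in estimator.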
Robustness and efficiency of multivariate coefficients of variation
Aerts, Stéphanie; Haesbroeck, Gentiane; Ruwet, Christel
Conference (2014, August 12)
The coefficient of variation is a well-known measure used in many fields to compare the variability of a variable in several populations. However, when the dimension is greater than one, comparing the variability only marginally may lead to controversial results. Several multivariate extensions of the univariate coefficient of variation have been introduced in the literature. In practice, these coefficients can be estimated by using any pair of location and covariance estimators. However, as soon as the classical mean and covariance matrix are under consideration, the influence functions are unbounded, while the use of any robust estimators yields bounded influence functions. While useful in their own right, the influence functions of the multivariate coefficients of variation are further exploited in this talk to derive a general expression for the corresponding asymptotic variances under elliptical symmetry. Then, focusing on two of the considered multivariate coefficients, a diagnostic tool based on their influence functions is derived and compared, on a real-life dataset, with the usual distance-plot.
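A simple empirical stand-in for the influence-function diagnostic described above is the deletion (leave-one-out) influence. The sketch below is a hedged illustration rather than the talk's actual tool: it flags the observation whose removal changes a classical multivariate CV the most, which is the kind of "unduly large effect on variability" the diagnostic targets.

```python
import numpy as np

def mcv_vn(X):
    # Classical plug-in Voinov-Nikulin multivariate CV: (mean' Cov^{-1} mean)^(-1/2)
    m = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    return float(1.0 / np.sqrt(m @ np.linalg.solve(S, m)))

def deletion_influence(X):
    # Empirical (leave-one-out) analogue of the influence function:
    # (n - 1) * (MCV(full sample) - MCV(sample without observation i))
    n = len(X)
    full = mcv_vn(X)
    return np.array([(n - 1) * (full - mcv_vn(np.delete(X, i, axis=0)))
                     for i in range(n)])

rng = np.random.default_rng(1)
X = rng.multivariate_normal([5.0, 3.0], np.eye(2), size=100)
X[0] = [30.0, 30.0]                      # plant one gross outlier
infl = deletion_influence(X)
suspect = int(np.argmax(np.abs(infl)))   # index of the most influential point
```

Plotting |infl| against the observation index gives a crude version of the diagnostic plot: the planted outlier stands far above the rest.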
Distribution under elliptical symmetry of a distance-based multivariate coefficient of variation
Aerts, Stéphanie; Haesbroeck, Gentiane; Ruwet, Christel
E-print/Working paper (2014)

On the breakdown behavior of the TCLUST clustering procedure
Ruwet, Christel; et al.
Test (2013), 22(3), 466-487
Clustering procedures allowing for general covariance structures of the obtained clusters need some constraints on the solutions. With this in mind, several proposals have been introduced in the literature. The TCLUST procedure works with a restriction on the "eigenvalues-ratio" of the clusters' scatter matrices. In order to try to achieve robustness with respect to outliers, the procedure allows a proportion of the most outlying observations to be trimmed off. The resistance of TCLUST to infinitesimal contamination has already been studied. This paper aims to look at its resistance to a higher amount of contamination by means of the study of its breakdown behavior. The rather new concept of restricted breakdown point will demonstrate that the TCLUST procedure resists a proportion of contamination equal to the trimming rate as soon as the data set is sufficiently "well clustered".

Classification performance resulting from 2-means
Ruwet, Christel; Haesbroeck, Gentiane
Journal of Statistical Planning & Inference (2013), 143(2), 408-418
The k-means procedure is probably one of the most common nonhierarchical clustering techniques. From a theoretical point of view, it is related to the search for the k principal points of the underlying distribution. In this paper, the classification resulting from that procedure for k=2 is shown to be optimal under a balanced mixture of two spherically symmetric and homoscedastic distributions. Then, the classification efficiency of the 2-means rule is assessed using the second-order influence function and compared to the classification efficiencies of the Fisher and logistic discriminations. Influence functions are also considered here to compare the robustness to infinitesimal contamination of the 2-means method with respect to the generalized 2-means technique.

Robust estimation for ordinal regression
Croux, Christophe; Haesbroeck, Gentiane; Ruwet, Christel
Journal of Statistical Planning & Inference (2013), 143(9), 1486-1499
Ordinal regression is used for modelling an ordinal response variable as a function of some explanatory variables. The classical technique for estimating the unknown parameters of this model is Maximum Likelihood (ML). The lack of robustness of this estimator is formally shown by deriving its breakdown point and its influence function. To robustify the procedure, a weighting step is added to the Maximum Likelihood estimator, yielding an estimator with bounded influence function. We also show that the loss in efficiency due to the weighting step remains limited. A diagnostic plot based on the Weighted Maximum Likelihood estimator allows outliers of different types to be detected in a single plot.
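The optimality claim for the 2-means classification rule is easy to probe numerically: under a balanced mixture of two spherical, homoscedastic Gaussians, the nearest-center rule learned by 2-means should approach the Bayes error. The sketch below is an illustrative check only; the mixture parameters and the use of scikit-learn's KMeans are assumptions, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
n = 2000
# Balanced mixture of two spherical, homoscedastic Gaussians at -mu and +mu
mu = np.array([2.0, 0.0])
y = rng.integers(0, 2, size=n)                    # true group labels (unused for fitting)
X = rng.standard_normal((n, 2)) + np.where(y[:, None] == 1, mu, -mu)

# Fit 2-means on the unlabeled data; its nearest-center rule is linear
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Evaluate on an independent test sample from the same mixture
y_test = rng.integers(0, 2, size=n)
X_test = rng.standard_normal((n, 2)) + np.where(y_test[:, None] == 1, mu, -mu)
pred = km.predict(X_test)
# Cluster labels are arbitrary: take the best of the two possible matchings
err = min(np.mean(pred == y_test), np.mean(pred != y_test))

# Bayes error for this mixture is Phi(-||mu||), i.e. about 0.023 here
```

The observed misclassification rate lands close to the Bayes error, consistent with the optimality result for this symmetric, homoscedastic setting.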
Robustness analysis of clustering and classification techniques
Ruwet, Christel
Doctoral thesis (2012)
As mentioned in the title, the framework of this doctoral dissertation encompasses two different subjects: robust statistics on the one hand and classification and clustering techniques on the other hand. Robust procedures try at the same time to emulate classical procedures and to produce results that are not unduly affected by contaminated observations or deviations from model assumptions. Classification and clustering techniques try to find groups among observations. Grouping is one of the most basic abilities of living creatures; the simple fact of naming objects is already grouping. The main interest lies in the fact that the characteristics of a group, as well as its differences from other groups, can be used as a summary of the dataset.

Classification efficiency of the trimmed k-means procedure
Ruwet, Christel
Conference (2012, May 21)
The k-means method is used in classification to group similar observations in k groups. When a second sample is available to test the obtained groupings, the rate of misclassification can be computed. If the samples are generated from a mixture of two homoscedastic and spherically symmetric distributions, the rate of misclassification equals that of the Bayes rule. Therefore, the k-means method is optimal under such a mixture model.
However, it is not robust with respect to outliers in the dataset used to construct the groups. To avoid this problem, the k-means procedure has been adapted in many ways. This presentation focuses on the trimmed k-means method, defined by trimming some of the observations. The advantage of this method, besides its resistance to outliers, is that optimality is preserved. However, it is well known that trimming observations leads to a loss in classification efficiency. The latter can be measured by means of the influence function of the misclassification rate.

The influence function of the TCLUST robust clustering procedure
Ruwet, Christel; et al.
Advances in Data Analysis and Classification (2012), 6(2), 107-130
The TCLUST procedure performs robust clustering with the aim of finding clusters with different scatter structures and proportions. An eigenvalue-ratio constraint is considered by TCLUST in order to avoid finding spurious clusters. In order to guarantee the robustness of the method against the presence of outliers and background noise, the method allows for trimming of a given proportion of observations self-determined by the data. This article studies robustness properties of the TCLUST procedure by means of the influence function, obtaining a robustness behavior close to that of the trimmed k-means.

Impact of contamination on the TCLUST procedure
Ruwet, Christel; et al.
Conference (2011, December 18)
The TCLUST procedure is a robust clustering procedure that performs clustering with the aim of fitting clusters with different scatters and weights. As the corresponding objective function can be unbounded, a restriction is added on the eigenvalues-ratio of the scatter matrices. The robustness of the method is guaranteed by allowing the trimming of a given proportion of observations. The resistance to contamination of that procedure will be studied. Results concerning breakdown points and some new criteria in robust cluster analysis, such as the dissolution point and the isolation robustness, will be presented.

Breakdown points of the TCLUST procedure
Ruwet, Christel
Scientific conference (2011, September 22)
The TCLUST procedure is a new robust clustering method introduced by García-Escudero et al. (2008). It performs clustering with the aim of finding clusters with different scatters and weights. As the corresponding objective function can be unbounded, a restriction is added on the eigenvalues-ratio of the scatter matrices. The robustness of the method is guaranteed by allowing the trimming of a given proportion of observations. This trimming level has to be chosen by the practitioner, as well as the number of clusters. Suitable values for these parameters can be obtained through careful examination of some classification trimmed likelihood curves (García-Escudero et al., 2010). The first part of this talk will consist of a brief presentation of this clustering procedure and the related R package (tclust). In the second part of the talk, the robustness of the TCLUST procedure, and more precisely its breakdown behavior, will be studied.
We will see that the estimator of the scatter matrices can resist more outliers than the number of trimmed observations. However, the breakdown point of the estimator of the centers is very poor: two observations are sufficient to make the centers break down. This is due to the stringency of the classical breakdown point; the estimator has to behave well even on samples which can hardly be clustered. For this reason, Gallegos and Ritter (2005) introduced the restricted breakdown point. The idea is to restrict the analysis to the class of "well-separated" data sets. On this class, the estimator of the centers has a breakdown point of α, the level of trimming.

Robustesse des classifications obtenues par clustering
Ruwet, Christel
Scientific conference (2011, September 02)
The difference between classification methods and clustering methods lies in the fact that, in clustering, no training sample with known group memberships is available. Nevertheless, even when this information is available, it is always possible to apply a clustering method to the data while ignoring the memberships. One might then expect a loss of efficiency. In this seminar, we will see that, by applying the 2-means method, it is possible to gain efficiency over some classification methods when the distribution of the observations is symmetric. Besides this, we will study the impact that the introduction of contamination in the observations can have on the 2-means procedure.
To do so, we will use two well-known tools from robust statistics: the influence function, which measures the impact of an infinitesimal contamination at a given point, and the breakdown point, which measures the amount of contamination needed to completely destabilize an estimator. We will also consider other clustering procedures that are more resistant to contamination, such as the generalized 2-means method and the TCLUST procedure.

Robustness properties of the TCLUST procedure
Ruwet, Christel; et al.
Conference (2011, July 27)
The TCLUST procedure is a robust clustering procedure introduced by García-Escudero et al. (2008). It performs clustering with the aim of fitting clusters with different scatters and weights. As the corresponding objective function can be unbounded, a restriction is added on the eigenvalues-ratio of the scatter matrices. The robustness of the method is guaranteed by allowing the trimming of a given proportion of observations. As García-Escudero and Gordaliza (1999) have done for the k-means and trimmed k-means methodologies, the robustness properties of the TCLUST procedure are studied by means of the influence function and the breakdown point. In order to be able to compare the robustness of TCLUST with other clustering methods, the dissolution point and isolation robustness (Hennig, 2008) are also considered. It turns out that the TCLUST procedure has a behavior close to that of the trimmed k-means.

The breakdown behavior of the TCLUST procedure
Ruwet, Christel; et al.
Conference (2011, May 18)
The TCLUST procedure is a new robust clustering method introduced by García-Escudero et al. (2008).
It performs clustering with the aim of finding clusters with different scatters and weights. As the corresponding objective function can be unbounded, a restriction is added on the eigenvalues-ratio of the scatter matrices. The robustness of the method is guaranteed by allowing the trimming of a given proportion of observations. This trimming level has to be chosen by the practitioner, as well as the number of clusters. Suitable values for these parameters can be obtained through careful examination of some classification trimmed likelihood curves (García-Escudero et al., 2010). The first part of this talk will consist of a brief presentation of this clustering procedure and the related R package (tclust). In the second part of the talk, the robustness of the TCLUST procedure, and more precisely its breakdown behavior, will be studied. In the context of cluster analysis, Hennig (2004, 2008) has defined some useful concepts to characterize the breakdown of a procedure: the r-components breakdown point, the dissolution point and the isolation robustness. These tools will be applied to the TCLUST procedure and some examples will be presented.

Impact of contamination on training and test error rates in statistical clustering
Ruwet, Christel; Haesbroeck, Gentiane
Communications in Statistics: Simulation & Computation (2011), 40(3), 394-411
The k-means algorithm is one of the most common nonhierarchical methods of clustering.
It aims to construct clusters in order to minimize the within-cluster sum of squared distances. However, as most estimators defined in terms of objective functions depending on global sums of squares, the k-means procedure is not robust with respect to atypical observations in the data. Alternative techniques have thus been introduced in the literature, e.g. the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. In this paper, focus is on the error rate these clustering procedures achieve when one expects the data to be distributed according to a mixture distribution. Two different definitions of the error rate are under consideration, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequence of this will be emphasized with the comparison of influence functions and breakdown points of these error rates.

Robustness in ordinal regression
Ruwet, Christel; Haesbroeck, Gentiane
Conference (2010, October 14)
Logistic regression is a widely used tool designed to model the success probability of a Bernoulli random variable depending on some explanatory variables. A generalization of this binary model is the multinomial case where the dependent variable has more than two categories. When these categories are naturally ordered (e.g. in questionnaires where individuals are asked whether they strongly disagree, disagree, are indifferent, agree or strongly agree with a given statement), one speaks about ordered or ordinal regression. The classical technique for estimating the unknown parameters is based on Maximum Likelihood estimation (e.g.
Powers and Xie, 2008 or Agresti, 2002). However, as Albert and Anderson (1984) showed in the binary context, Maximum Likelihood estimates sometimes do not exist. Existence conditions in the ordinal setting, derived by Haberman in a discussion of McCullagh's paper (1980), as well as a procedure to verify that they are fulfilled on a particular dataset, will be presented. On the other hand, Maximum Likelihood procedures are known to be vulnerable to contamination in the data. The lack of robustness of this technique in the simple logistic regression setting has already been investigated in the literature (e.g. Croux et al., 2002 or Croux et al., 2008). The breakdown behaviour of the ML estimation procedure will be considered in the context of ordinal logistic regression. A robust alternative based on a weighting idea will then be suggested and compared to the classical one by means of their influence functions. Influence functions can be used to construct a diagnostic plot allowing influential observations for the classical ML procedure to be detected (Pison and Van Aelst, 2004).

Robust ordinal logistic regression
Ruwet, Christel; Haesbroeck, Gentiane; Croux, Christophe
Conference (2010, June 28)
Logistic regression is a widely used tool designed to model the success probability of a Bernoulli random variable depending on some explanatory variables. A generalization of this binary model is the multinomial case where the dependent variable has more than two categories. When these categories are naturally ordered (e.g.
in questionnaires where individuals are asked whether they strongly disagree, disagree, are indifferent, agree or strongly agree with a given statement), one speaks about ordered or ordinal logistic regression. The classical technique for estimating the unknown parameters is based on Maximum Likelihood estimation. Maximum Likelihood procedures are, however, known to be vulnerable to contamination in the data. The lack of robustness of this technique in the simple logistic regression setting has already been investigated in the literature, either by computing breakdown points or influence functions. Robust alternatives have also been constructed for that model. In this talk, the breakdown behaviour of the ML estimation procedure will be considered in the context of ordinal logistic regression. Influence functions will be computed and shown to be unbounded. A robust alternative based on a weighting idea will then be suggested and illustrated on some examples. The influence functions of the ordinal logistic regression estimators may be used to compute classification efficiencies or to derive diagnostic measures, as will be illustrated on some examples.

Robustness properties of the ordered logistic discrimination
Ruwet, Christel; Haesbroeck, Gentiane
Scientific conference (2010, May 20)
Logistic regression is a widely used tool designed to model the success probability of a Bernoulli random variable depending on some explanatory variables. A generalization of this binary model is the multinomial case where the dependent variable has more than two categories. When these categories are naturally ordered (e.g.
in questionnaires where individuals are asked whether they strongly disagree, disagree, are indifferent, agree or strongly agree with a given statement), one speaks about ordered or ordinal logistic regression. The classical technique for estimating the unknown parameters is based on Maximum Likelihood estimation. Maximum Likelihood procedures are, however, known to be vulnerable to contamination in the data. The lack of robustness of this technique in the simple logistic regression setting has already been investigated in the literature, either by computing breakdown points or influence functions. Robust alternatives have also been constructed for that model. In this talk, the breakdown behavior of the ML estimation procedure will be considered in the context of ordinal logistic regression. Influence functions will be computed and shown to be unbounded. A robust alternative based on a weighting idea will then be suggested and illustrated on some examples. These influence functions may be used to derive diagnostic measures, as will be illustrated on some examples. Furthermore, breakdown points will also be computed.

Detection of influential observations on the error rate based on the generalized k-means clustering procedure
Ruwet, Christel; Haesbroeck, Gentiane
Conference (2009, October 14)
Cluster analysis may be performed when one wishes to group similar objects into a given number of clusters. Several algorithms are available in order to construct these clusters. In this talk, focus will be on the generalized k-means algorithm, while the data of interest are assumed to come from an underlying population consisting of a mixture of two groups.
Among the outputs of this clustering technique, a classification rule is provided in order to classify the objects into one of the clusters. When classification is the main objective of the statistical analysis, performance is often measured by means of an error rate ER(F, Fm), where F is the distribution of the training sample used to set up the classification rule and Fm (model distribution) is the distribution under which the quality of the rule is assessed (via a test sample). Under contamination, one has to replace the distribution F of the training sample by a contaminated one, F(eps) say (where eps corresponds to the fraction of contamination). In that case, the error rate will be corrupted since it relies on a contaminated rule, while the test sample may still be considered as being distributed according to the model distribution. To measure the robustness of classification based on this clustering procedure, influence functions of the error rate may be computed. The idea has already been exploited by Croux et al. (2008) in the context of linear and logistic discrimination. In this setup, the contaminated distribution takes the form F(eps) = (1 - eps)*Fm + eps*Dx, where Dx is the Dirac distribution putting all its mass at x. After studying the influence function of the error rate of the generalized k-means procedure, which depends on the influence functions of the generalized k-means centers derived by García-Escudero and Gordaliza (1999), a diagnostic tool based on its value will be presented. The aim is to detect observations in the training sample which can be influential for the error rate.

Impact of contamination on empirical and theoretical error
Ruwet, Christel; Haesbroeck, Gentiane
Conference (2009, June 18)
Classification analysis allows similar objects to be grouped into a given number of groups by means of a classification rule.
Many classification procedures are available: linear discrimination, logistic discrimination, etc. Focus in this poster will be on classification resulting from a clustering analysis. Indeed, among the outputs of classical clustering techniques, a classification rule is provided in order to classify the objects into one of the clusters. More precisely, let F denote the underlying distribution and assume that the generalized k-means algorithm with penalty function is used to construct the k clusters C1(F), ..., Ck(F) with centers T1(F), ..., Tk(F). When one feels that k true groups exist among the data, classification might be the main objective of the statistical analysis. Performance of a particular classification technique can be measured by means of an error rate. Depending on the availability of data, two types of error rates may be computed: a theoretical one and a more empirical one. In the first case, the rule is estimated on a training sample with distribution F while the evaluation of the classification performance may be done through a test sample distributed according to a model distribution of interest, Fm say. In the second case, the same data are used to set up the rule and to evaluate the performance. Under contamination, one has to replace the distribution F of the training sample by a contaminated one, F(eps) say (where eps corresponds to the fraction of contamination). In that case, the theoretical error rate will be corrupted since it relies on a contaminated rule, but it may still consider a test sample distributed according to the model distribution. The empirical error rate will be affected twice: via the rule and also via the sample used for the evaluation of the classification performance.
To measure the robustness of classification based on clustering, influence functions of the error rate may be computed. The idea has already been exploited by Croux et al. (2008) in the context of linear and logistic discrimination. In the computation of influence functions, the contaminated distribution takes the form F(eps) = (1 - eps)*Fm + eps*Dx, where Dx is the Dirac distribution putting all its mass at x. It is interesting to note that the impact of the point mass x may be positive, i.e. may decrease the error rate, when the data at hand are used to evaluate the error.
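The theoretical error rate described in these abstracts is easy to reproduce in a small experiment: train 2-means on a contaminated sample of the form (1 - eps)*Fm + eps*Dx and evaluate on a clean test sample drawn from Fm. The sketch below is illustrative only; the mixture, the contamination point x = (10, 10), and the use of scikit-learn's KMeans are assumptions, not the authors' setup.

```python
import numpy as np
from sklearn.cluster import KMeans

def error_rate(km, X, y):
    # Theoretical-style error: predict on a clean test sample and take the
    # best of the two possible cluster-to-group label matchings.
    pred = km.predict(X)
    return min(np.mean(pred == y), np.mean(pred != y))

rng = np.random.default_rng(7)
n = 1000

def sample(n):
    # Model distribution Fm: balanced mixture of two spherical Gaussians at (+-2, 0)
    y = rng.integers(0, 2, size=n)
    X = rng.standard_normal((n, 2)) + np.where(y[:, None] == 1, [2.0, 0.0], [-2.0, 0.0])
    return X, y

X_train, _ = sample(n)
X_test, y_test = sample(n)

# Contaminated training sample: replace a fraction eps by a point mass at x = (10, 10)
eps = 0.10
X_cont = X_train.copy()
X_cont[: int(eps * n)] = [10.0, 10.0]

km_clean = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
km_cont = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_cont)

err_clean = error_rate(km_clean, X_test, y_test)
err_cont = error_rate(km_cont, X_test, y_test)  # corrupted rule, clean test sample
```

With this much mass at a distant point, one 2-means center gets captured by the contamination and the rule learned on the corrupted sample misclassifies a large share of the clean test sample, while the clean rule stays near the Bayes error.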