|Scientific congresses and symposiums : Unpublished conference|
|Physical, chemical, mathematical & earth Sciences : Mathematics|
|Detection of influential observations on the error rate based on the generalized k-means clustering procedure|
|Ruwet, Christel [Université de Liège - ULg > Département de mathématique > Statistique (aspects théoriques) >]|
|Haesbroeck, Gentiane [Université de Liège - ULg > Département de mathématique > Statistique (aspects théoriques) >]|
|17th Annual meeting of the Belgian Statistical Society|
|14 October 2009 to 16 October 2009|
|[en] Cluster analysis may be performed when one wishes to group similar objects
into a given number of clusters. Several algorithms are available in order to
construct these clusters. In this talk, focus will be on the generalized k-means
algorithm, while the data of interest are assumed to come from an underlying
population consisting of a mixture of two groups. Among the outputs of this
clustering technique, a classification rule is provided in order to classify the
objects into one of the clusters. When classification is the main objective of the
statistical analysis, performance is often measured by means of an error rate
ER(F; Fm), where F is the distribution of the training sample used to set up the
classification rule and Fm (model distribution) is the distribution under which
the quality of the rule is assessed (via a test sample).
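
As an illustration of the error rate ER(F; Fm), the following minimal sketch fits two centers on a training sample and evaluates the resulting nearest-center rule on an independent test sample drawn from the model. It is not the authors' implementation: it assumes a one-dimensional mixture of two unit-variance normals centered at -2 and 2, uses the classical k-means penalty (squared distance) as a special case of the generalized procedure, and the function names are hypothetical.

```python
import numpy as np


def two_means_1d(x, n_iter=200):
    """Lloyd iterations for two centers in one dimension (squared-distance penalty)."""
    x = np.asarray(x, dtype=float)
    t1, t2 = np.quantile(x, [0.25, 0.75])  # assumed starting values: sample quartiles
    for _ in range(n_iter):
        cut = 0.5 * (t1 + t2)  # nearest-center rule: boundary is the midpoint
        left, right = x[x <= cut], x[x > cut]
        if left.size == 0 or right.size == 0:
            break
        new_t1, new_t2 = left.mean(), right.mean()
        if new_t1 == t1 and new_t2 == t2:
            break
        t1, t2 = new_t1, new_t2
    return float(t1), float(t2)


def error_rate(centers, x_test, labels_test):
    """Share of test points whose nearest center disagrees with their true group."""
    t1, t2 = sorted(centers)
    predicted = (np.asarray(x_test) > 0.5 * (t1 + t2)).astype(int)
    return float(np.mean(predicted != labels_test))


def sample_mixture(n, rng, mu=(-2.0, 2.0), pi1=0.5):
    """Draw n points from the assumed two-group normal mixture, with group labels."""
    labels = (rng.random(n) >= pi1).astype(int)
    x = rng.normal(loc=np.asarray(mu)[labels], scale=1.0)
    return x, labels


rng = np.random.default_rng(0)
x_train, _ = sample_mixture(2000, rng)      # training sample, distribution F
x_test, y_test = sample_mixture(20000, rng)  # test sample, distribution Fm
centers = two_means_1d(x_train)
er = error_rate(centers, x_test, y_test)
print(centers, er)
```

Here the training and test samples come from the same distribution, so the printed value estimates ER(Fm; Fm).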
Under contamination, one has to replace the distribution F of the training
sample by a contaminated one, F(eps) say (where eps corresponds to the fraction of
contamination). In that case, the error rate will be corrupted since it relies
on a contaminated rule, while the test sample may still be considered as being
distributed according to the model distribution.
To measure the robustness of classification based on this clustering procedure,
influence functions of the error rate may be computed. The idea has
already been exploited by Croux et al. (2008) and Croux et al. (2008) in the
context of linear and logistic discrimination. In this setup, the contaminated
distribution takes the form F(eps) = (1 - eps)*Fm + eps*Dx, where Dx is the Dirac
distribution putting all its mass at x.
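
For reference, the influence function of the error rate under this point-mass contamination can be written with the usual definition of an influence function. The display below is a sketch in that standard form; the notation (F_eps for F(eps), F_m for Fm, Delta_x for Dx) is an assumed transcription of the abstract's symbols, not a formula quoted from the talk.

```latex
F_\varepsilon = (1-\varepsilon)\,F_m + \varepsilon\,\Delta_x ,
\qquad
\mathrm{IF}(x;\mathrm{ER},F_m)
  = \lim_{\varepsilon \to 0}
    \frac{\mathrm{ER}(F_\varepsilon;F_m) - \mathrm{ER}(F_m;F_m)}{\varepsilon}
  = \left.\frac{\partial}{\partial\varepsilon}\,
    \mathrm{ER}(F_\varepsilon;F_m)\right|_{\varepsilon=0}
```

Only the first argument (the training distribution) is contaminated; the second argument stays at the model, matching the assumption that the test sample still follows Fm.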
After studying the influence function of the error rate of the generalized
k-means procedure, which depends on the influence functions of the generalized
k-means centers derived by Garcia-Escudero and Gordaliza (1999), a diagnostic
tool based on its value will be presented. The aim is to detect observations in
the training sample which can be influential for the error rate.
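
The diagnostic presented in the talk is built on the influence function itself, which is not given in this record. As a rough empirical stand-in, the sketch below uses case deletion instead: each training observation is removed in turn, the two centers are refitted, and the change in the model error rate is recorded. Everything here is an assumption for illustration: the one-dimensional normal mixture (means -2 and 2, unit variances, mixing proportion 0.7), the squared-distance penalty, the planted outlier, and the function names.

```python
import math

import numpy as np


def two_means_cut(x, n_iter=200):
    """Boundary (midpoint of the two centers) from Lloyd iterations in one dimension."""
    x = np.asarray(x, dtype=float)
    t1, t2 = np.quantile(x, [0.25, 0.75])  # assumed starting values: sample quartiles
    cut = 0.5 * (t1 + t2)
    for _ in range(n_iter):
        left, right = x[x <= cut], x[x > cut]
        if left.size == 0 or right.size == 0:
            break
        new_cut = 0.5 * (left.mean() + right.mean())
        if new_cut == cut:
            break
        cut = new_cut
    return float(cut)


def model_error_rate(cut, mu=(-2.0, 2.0), pi1=0.7):
    """Exact error rate of the rule 'group 2 if x > cut' under the assumed model."""
    def phi(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return pi1 * (1.0 - phi(cut - mu[0])) + (1.0 - pi1) * phi(cut - mu[1])


def deletion_influence(x):
    """Scaled change in error rate when each observation is left out in turn."""
    x = np.asarray(x, dtype=float)
    n = x.size
    base = model_error_rate(two_means_cut(x))
    out = np.empty(n)
    for i in range(n):
        reduced = model_error_rate(two_means_cut(np.delete(x, i)))
        out[i] = (n - 1) * (base - reduced)  # > 0: observation i raises the error rate
    return out


rng = np.random.default_rng(1)
labels = (rng.random(200) >= 0.7).astype(int)
x = rng.normal(loc=np.where(labels == 0, -2.0, 2.0), scale=1.0)
x = np.append(x, 15.0)  # one planted outlier (illustrative only)
influence = deletion_influence(x)
ranked = np.argsort(-np.abs(influence))  # indices, most influential first
print(ranked[:5], influence[ranked[:5]])
```

Observations at the top of the ranking are the candidates to inspect. Case deletion is a finite-sample substitute for the influence function and need not agree with the analytical diagnostic.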