Article (Scientific journals)
Impact of contamination on training and test error rates in statistical clustering
Ruwet, Christel; Haesbroeck, Gentiane
2011, in Communications in Statistics: Simulation and Computation, 40 (3), p. 394-411
Peer Reviewed verified by ORBi
 


© Taylor and Francis Group, 2011. This is the author's version of the work. It is posted here by permission of Taylor and Francis Group for personal use, not for redistribution. The definitive version was published in Communications in Statistics - Simulation and Computation, Volume 40 Issue 3, March 2011.





Details



Keywords :
Clustering analysis; Error rate; Generalized k-means; Influence Function; Principal points; Robustness
Abstract :
The k-means algorithm is one of the most common nonhierarchical clustering methods. It constructs clusters so as to minimize the within-cluster sum of squared distances. However, like most estimators defined through objective functions that depend on global sums of squares, the k-means procedure is not robust to atypical observations in the data. Alternative techniques have therefore been introduced in the literature, e.g. the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. This paper focuses on the error rate these clustering procedures achieve when the data are expected to follow a mixture distribution. Two different definitions of the error rate are considered, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequences of this are emphasized by comparing the influence functions and breakdown points of these error rates.
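As an illustration of the sensitivity described in the abstract, the following is a minimal sketch and not the authors' procedure or a reproduction of their results. It assumes one-dimensional data and two clusters, and it uses the cluster median as a stand-in for an absolute-distance penalty; linking that variant to the k-medoids method named above is an assumption of the sketch. The helper names `generalized_kmeans_1d` and `error_rate` and the toy data are hypothetical.

```python
from statistics import mean, median


def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))


def generalized_kmeans_1d(xs, centers, update, n_iter=50):
    """Lloyd-style iteration on 1-D data (hypothetical helper).

    update=mean minimizes the within-cluster sum of squared distances;
    update=median minimizes the within-cluster sum of absolute distances.
    """
    centers = list(centers)
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for x in xs:
            groups[nearest(x, centers)].append(x)
        # Keep the old center if a cluster becomes empty.
        new_centers = [update(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


def error_rate(xs, labels, centers):
    """Share of points whose nearest center differs from their true label.

    Assumes the centers are sorted so that index i matches label i.
    """
    wrong = sum(nearest(x, centers) != y for x, y in zip(xs, labels))
    return wrong / len(xs)


# Toy sample: two well-separated groups, plus one gross outlier (assumed data).
clean = [-1.2, -1.0, -0.8, 0.8, 1.0, 1.2]
labels = [0, 0, 0, 1, 1, 1]
contaminated = clean + [50.0]

results = {}
for name, update in [("mean", mean), ("median", median)]:
    centers = generalized_kmeans_1d(contaminated, [-1.0, 1.0], update)
    # Error rate evaluated on the uncontaminated points only.
    results[name] = error_rate(clean, labels, centers)
    print(name, centers, results[name])
```

In this toy case the mean-based fit lets the single outlier capture one center, so the two genuine groups are merged, while the median-based fit keeps them apart. Evaluating the fitted centers on the contaminated sample instead of the clean one would give a different figure, which is the kind of distinction between error-rate definitions that the abstract alludes to.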
Disciplines :
Mathematics
Author, co-author :
Ruwet, Christel ;  Université de Liège - ULiège > Département de mathématique > Statistique (aspects théoriques)
Haesbroeck, Gentiane ;  Université de Liège - ULiège > Département de mathématique > Statistique (aspects théoriques)
Language :
English
Title :
Impact of contamination on training and test error rates in statistical clustering
Publication date :
2011
Journal title :
Communications in Statistics: Simulation and Computation
ISSN :
0361-0918
eISSN :
1532-4141
Publisher :
Taylor & Francis
Volume :
40
Issue :
3
Pages :
394-411
Peer reviewed :
Peer Reviewed verified by ORBi
Available on ORBi :
since 04 November 2009
