Reference : Impact of contamination on training and test error rates in statistical clustering
Scientific journals : Article
Physical, chemical, mathematical & earth Sciences : Mathematics
http://hdl.handle.net/2268/27527
Impact of contamination on training and test error rates in statistical clustering
English
Ruwet, Christel [Université de Liège - ULg > Département de mathématique > Statistique (aspects théoriques)]
Haesbroeck, Gentiane [Université de Liège - ULg > Département de mathématique > Statistique (aspects théoriques)]
2011
Communications in Statistics : Simulation & Computation
Taylor & Francis Ltd
40
3
394-411
Yes (verified by ORBi)
International
0361-0918
[en] Clustering analysis ; Error rate ; Generalized k-means ; Influence Function ; Principal points ; Robustness
[en] The k-means algorithm is one of the most common nonhierarchical clustering methods. It constructs clusters so as to minimize the within-cluster sum of squared distances. However, like most estimators defined through objective functions that depend on global sums of squares, the k-means procedure is not robust to atypical observations in the data. Alternative techniques have therefore been introduced in the literature, e.g. the k-medoids method. The k-means and k-medoids methodologies are particular cases of the generalized k-means procedure. This paper focuses on the error rate these clustering procedures achieve when the data are expected to follow a mixture distribution. Two different definitions of the error rate are considered, depending on the data at hand. It is shown that contamination may make one of these two error rates decrease even under optimal models. The consequences of this are highlighted by comparing the influence functions and breakdown points of these error rates.
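
The non-robustness of k-means mentioned in the abstract can be illustrated with a minimal one-dimensional sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy data, the single outlier at 50, the starting centers, and the helper names `lloyd_1d` and `error_rate`. The median update minimizes the summed absolute distances within a cluster and is used here only as a stand-in for a more robust alternative, not as the paper's definition of k-medoids or of the generalized k-means procedure.

```python
from statistics import mean, median


def nearest(x, centers):
    """Index of the center closest to x (ties go to the lower index)."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))


def lloyd_1d(xs, centers, update=mean, n_iter=50):
    """Alternate between assigning points to the nearest center and
    recomputing each center with `update`.

    update=mean   -> classical k-means (squared distances)
    update=median -> absolute-distance variant (illustrative stand-in)
    """
    centers = list(centers)
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for x in xs:
            clusters[nearest(x, centers)].append(x)
        # Keep the old center if a cluster happens to be empty.
        new_centers = [update(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


def error_rate(xs, labels, centers):
    """Fraction of points whose nearest center is not their true group."""
    wrong = sum(1 for x, g in zip(xs, labels) if nearest(x, centers) != g)
    return wrong / len(xs)


# Hypothetical toy data: two well-separated groups plus one outlier.
clean = [-3.0, -2.5, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 2.5, 3.0]
labels = [0] * 5 + [1] * 5
contaminated = clean + [50.0]

km_centers = lloyd_1d(contaminated, [-2.0, 2.0], update=mean)
md_centers = lloyd_1d(contaminated, [-2.0, 2.0], update=median)

print(km_centers, error_rate(clean, labels, km_centers))  # [0.0, 50.0] 0.5
print(md_centers, error_rate(clean, labels, md_centers))  # [-2.0, 2.25] 0.0
```

In this toy setting the single outlier captures one of the two mean-based centers, so half of the clean points end up misclassified, while the median-based centers stay close to the two groups.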
10.1080/03610918.2010.542847
(c) Taylor and Francis Group, 2011.
This is the author's version of the work. It is posted here by permission of Taylor and Francis Group for personal use, not for redistribution.
The definitive version was published in Communications in Statistics - Simulation and Computation, Volume 40 Issue 3, March 2011.

File(s) associated to this reference

Fulltext file(s):

File | Commentary | Version | Size | Access
Impact of contamination on training and test error rates in statistical clustering analysis.pdf | | Author preprint | 247.12 kB | Open access

All documents in ORBi are protected by a user license.