Influence function of the error rate of the generalized k-means

English

Ruwet, Christel[Université de Liège - ULg > Département de mathématique > Statistique mathématique >]

Haesbroeck, Gentiane[Université de Liège - ULg > Département de mathématique > Statistique mathématique >]

30-Mar-2009

Séminaire de statistique

Service de statistiques du Département de mathématiques

Liège

Belgium

[en] Clustering ; Generalized k-means ; Influence function

[en] Cluster analysis is used to group similar objects into a given number of clusters, and several algorithms are available to construct these clusters. This talk focuses on two particular cases of the generalized k-means algorithm: the classical k-means procedure and the k-medoids algorithm. Among their outputs, these clustering techniques provide a classification rule that assigns each object to one of the clusters. When classification is the main objective of the statistical analysis, performance is often measured by an error rate. In the clustering setting, the error rate has to be measured on the training sample, whereas test samples are usually used in other settings such as linear or logistic discrimination. As a consequence, contamination in the training sample may affect not only the classification rule but also the other parameters involved in the error rate. In the talk, influence functions are used to measure the impact of contamination on the error rate, and they show that contamination may decrease the error rate that one would expect under a given model. Moreover, a kind of second-order influence function is derived to measure the finite-sample bias in the error rate of the k-means and k-medoids procedures. Simulations confirm the results obtained via the first- and second-order influence functions. The talk concludes with perspectives for future research.
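
For readers unfamiliar with the tool named in the abstract, the standard definition of a first-order influence function can be sketched as follows. The notation is an assumption made for illustration and is not taken from the talk: T denotes a statistical functional (here it would be the error rate), F the model distribution, and \Delta_x a point mass at x. The second-order term is the usual next Taylor coefficient, and it may differ from the "kind of second-order influence function" the authors derive.

```latex
% Assumed notation (not from the talk): T = functional of interest,
% F = model distribution, \Delta_x = point mass at x, \varepsilon = contamination fraction.
\[
  F_\varepsilon = (1-\varepsilon)\,F + \varepsilon\,\Delta_x ,
  \qquad
  \mathrm{IF}(x;T,F)
  = \lim_{\varepsilon \downarrow 0} \frac{T(F_\varepsilon) - T(F)}{\varepsilon}
  = \left.\frac{\partial}{\partial\varepsilon}\, T(F_\varepsilon)\right|_{\varepsilon = 0}.
\]
% Generic second-order Taylor expansion in \varepsilon; IF2 denotes the second
% derivative of T(F_\varepsilon) at \varepsilon = 0 (illustrative only).
\[
  T(F_\varepsilon) \approx T(F)
  + \varepsilon\,\mathrm{IF}(x;T,F)
  + \frac{\varepsilon^{2}}{2}\,\mathrm{IF2}(x;T,F),
  \qquad
  \mathrm{IF2}(x;T,F)
  = \left.\frac{\partial^{2}}{\partial\varepsilon^{2}}\, T(F_\varepsilon)\right|_{\varepsilon = 0}.
\]
```

Under this reading, a negative value of IF(x;T,F) at some point x corresponds to the situation described in the abstract, where a small amount of contamination at x lowers the error rate relative to its value under the model.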