Reference : Detection of influential observations on the error rate based on the generalized k-means...

Scientific congresses and symposiums : Unpublished conference

Physical, chemical, mathematical & earth Sciences : Mathematics

http://hdl.handle.net/2268/27529

Detection of influential observations on the error rate based on the generalized k-means clustering procedure

English

Ruwet, Christel [Université de Liège - ULg > Département de mathématique > Statistique (aspects théoriques) >]

Haesbroeck, Gentiane [Université de Liège - ULg > Département de mathématique > Statistique (aspects théoriques) >]

14-Oct-2009

No

17th Annual meeting of the Belgian Statistical Society

From 14 October 2009 to 16 October 2009

Lommel

Belgium

[en] Cluster analysis may be performed when one wishes to group similar objects into a given number of clusters. Several algorithms are available to construct these clusters. In this talk, the focus will be on the generalized k-means algorithm, and the data of interest are assumed to come from an underlying population consisting of a mixture of two groups. Among the outputs of this clustering technique is a classification rule that assigns each object to one of the clusters. When classification is the main objective of the statistical analysis, performance is often measured by means of an error rate ER(F; Fm), where F is the distribution of the training sample used to set up the classification rule and Fm (the model distribution) is the distribution under which the quality of the rule is assessed (via a test sample). Under contamination, one has to replace the distribution F of the training sample by a contaminated one, F(eps) say, where eps is the fraction of contamination. In that case, the error rate will be corrupted since it relies on a contaminated rule, while the test sample may still be considered as being distributed according to the model distribution. To measure the robustness of classification based on this clustering procedure, influence functions of the error rate may be computed. The idea has already been exploited by Croux et al. (2008) and Croux et al. (2008) in the context of linear and logistic discrimination. In this setup, the contaminated distribution takes the form F(eps) = (1-eps)*Fm + eps*Dx, where Dx is the Dirac distribution putting all its mass at x. After studying the influence function of the error rate of the generalized k-means procedure, which depends on the influence functions of the generalized k-means centers derived by Garcia-Escudero and Gordaliza (1999), a diagnostic tool based on its value will be presented. The aim is to detect observations in the training sample that can be influential for the error rate.
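The contamination scheme described in the abstract can be illustrated with a small numerical sketch. It is not the authors' method: it assumes a hypothetical one-dimensional model Fm = 0.6*N(-1.5, 1) + 0.4*N(1.5, 1), uses classical 2-means (assumed here to be the squared-distance special case of the generalized procedure), and approximates the influence function by a finite difference with a small eps instead of taking the limit. All names (`PI`, `MU`, `two_means_cut`, `error_rate`, `empirical_influence`) are illustrative.

```python
import math

# Hypothetical model distribution Fm: PI[0]*N(MU[0], 1) + PI[1]*N(MU[1], 1).
PI = (0.6, 0.4)
MU = (-1.5, 1.5)


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def Phi(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_means_cut(eps=0.0, x0=0.0, n_iter=500):
    """Cut-off of the 2-means rule under F(eps) = (1-eps)*Fm + eps*Dx0.

    Population-level Lloyd iterations: each center is the mean of the
    distribution on its side of the cut, and the cut is the midpoint
    of the two centers.
    """
    cut = 0.5 * (MU[0] + MU[1])
    m_all = sum(p * m for p, m in zip(PI, MU))  # mean of Fm
    for _ in range(n_iter):
        # Mass and first moment of Fm on (-inf, cut].
        p_left = sum(p * Phi(cut - m) for p, m in zip(PI, MU))
        m_left = sum(p * (m * Phi(cut - m) - phi(cut - m))
                     for p, m in zip(PI, MU))
        # Mix in the point mass eps at x0.
        atom_left = 1.0 if x0 <= cut else 0.0
        p_left_c = (1.0 - eps) * p_left + eps * atom_left
        m_left_c = (1.0 - eps) * m_left + eps * x0 * atom_left
        m_all_c = (1.0 - eps) * m_all + eps * x0
        c1 = m_left_c / p_left_c
        c2 = (m_all_c - m_left_c) / (1.0 - p_left_c)
        cut = 0.5 * (c1 + c2)
    return cut


def error_rate(cut):
    """ER under Fm when the left cluster is identified with group 1."""
    return PI[0] * (1.0 - Phi(cut - MU[0])) + PI[1] * Phi(cut - MU[1])


def empirical_influence(x0, eps=0.01):
    """Finite-difference stand-in for the influence function of ER at x0."""
    er_contaminated = error_rate(two_means_cut(eps, x0))
    er_clean = error_rate(two_means_cut())
    return (er_contaminated - er_clean) / eps


if __name__ == "__main__":
    for x0 in (-8.0, -2.0, 2.0, 8.0):
        print(x0, empirical_influence(x0))
```

Note that the rule is fitted under the contaminated distribution while the error rate is always evaluated under Fm, as in the abstract. A diagnostic in the spirit of the talk could evaluate such a quantity at each training observation and flag those with a large absolute value; the threshold to use is left open here.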


