Sesión Análisis Numérico y Optimización

Diciembre 14, 11:00 ~ 11:20

Clustering for classification: a mathematical programming approach

Blanco, Víctor

Clustering Analysis studies the ways of making groups of individuals/observations such that individuals belonging to the same group (cluster) are similar while different from the individuals in other groups. Several approaches are available in the literature for this task, among them hierarchical clustering, k-means, Gaussian-based models, Density-based clustering, amongst others. One of the most used methodolologies is $k$-means [4], in which a number of clusters, $k$, is provided and then $k$ centroid are computed to act as representatives of the clusters. The observations are then clustered by closeness (using some distance measure) to the centroids. This method assumes that when a new observation comes into scene and want to be assigned to a cluster, it is classified in the group whose centroid is closest. However, in machine learning, the classification task is not usually performed using the centroid, but constructing a decision rule which using the training data allows to classify outsample data. At this point, one of the most successful methodologies in the last years is the one of support vector machines (SVM) in which decision rules based on constructing separating hyperplanes are designed to classify data. Although SVMs are mostly used to classify among two classes, some approaches are available for the multiclass problem by extending the ideas under the binary case (see [3,5], among others). In this work, we consider the problem of clustering observations by applying a $k$-mean based methodology but trying to minimize the risk of misclassifying observations when a SVM is applied to outsample data. The problem is then the combination of two problems that are usually considered separately: multifacility continuous location [1] and multiclass support vector machines [2]. We then provide a general framework in which different distance measures are modeled and different multiclass SVM models (\textit{one-versus-all} and \textit{all-in-one}) are applied in the classification task. Mathematical programming models for the problem are presented and compared, and resolution strategies are developed. We test some of the benchmarking instances, and measure the quality of the clusters by means of different known indices in the field. \begin{thebibliography}{1} \bibitem{BPE16} V.~Blanco, J.~Puerto, and S.~El-Haj Ben-Ali. \newblock Continuous multifacility ordered median location problems. \newblock {\em European Journal of Operational Research}, 250(1):56--64, 2016. \bibitem{Bredensteiner-Bennett_COAP99} E.~J. Bredensteiner and K.~P. Bennett. \newblock Multicategory classification by support vector machines. \newblock {\em Computational Optimization and Applications}, 12:53--79, 1999. \bibitem{Cramer-Singer_JMLR01} K.~Crammer and Y.~Singer. \newblock On the algorithmic implementation of multiclass kernel-based vector machines. \newblock {\em Journal of Machine Learning Research}, 2:265--292, 2001. \bibitem{Johnson:1967} S.~C. Johnson. \newblock Hierarchical clustering schemes. \newblock {\em Psychometrika}, 32(3):241--254, 1967. \bibitem{Weston-Watkins_ESANN99} J.~Weston and C.~Watkins. \newblock Support vector machines for multi-class pattern recognition. \newblock In {\em European Symposium on Artificial Neural Networks}, 219--224, 1999. \end{thebibliography}

Autores: Blanco, Víctor / Puerto, Justo / Rodríguez-Chía, Antonio.