Welcome to my lightning talk, "Extending flexmix to model-based clustering with sparse data". My name is Bettina Grün. Package flexmix implements model-based clustering using finite mixtures estimated with the EM algorithm. The idea is that package flexmix provides an extensible framework which takes care of all data handling and provides the E-step. Users only have to provide the M-step for different mixture models. Clustering of sparse data is becoming more and more of interest due to the availability of high-dimensional data, where only a few non-zero entries are present for each observation. Possible application contexts are text mining, where we have document-term matrices which we want to cluster, or marketing, where we use market basket transaction data. Package flexmix so far had no support for sparse data formats. We have now extended the package and defined a new model class to enable model-based clustering of sparse data. The sparse data format supported is the simple triplet matrix, which is defined in package slam. In addition to the model class, we defined a specific method which takes care of suitable data handling for sparse data. In order to apply our new model class, we looked at mixtures of von Mises-Fisher distributions. This is a model class which has been proposed for model-based clustering of spherical data, and it has previously been applied in text mining, where document-term matrices were clustered, and also in bioinformatics, where gene expression data were clustered. There is already an implementation available in the R package movMF, but this has a completely separate implementation of the EM algorithm. We now looked at extending package movMF by providing a flexmix model driver. This flexmix model driver supports dense and sparse data, but reuses all the functionality for the M-step which was already available in package movMF.
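To make the division of labour between E-step and M-step concrete, here is a minimal sketch in plain Python. It is not the flexmix or movMF code and does not use their APIs: all function names are hypothetical, the triplet indices are 0-based for convenience, and the concentration parameter `kappa` is assumed to be fixed and shared by all components, a simplification under which the von Mises-Fisher normalizing constant cancels in the E-step. Rows are stored sparsely as `{column: value}` dictionaries built from triplets, and every row is assumed to have at least one non-zero entry.

```python
import math
from collections import defaultdict

def rows_from_triplets(i, j, v, nrow):
    """Build unit-length sparse rows from triplets (row index, column index, value)."""
    rows = [defaultdict(float) for _ in range(nrow)]
    for r, c, x in zip(i, j, v):
        rows[r][c] += x
    unit = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row.values()))  # assumes a non-empty row
        unit.append({c: x / norm for c, x in row.items()})
    return unit

def e_step(rows, priors, mus, kappa):
    """Generic part: a posteriori probabilities of each row for each component."""
    post = []
    for row in rows:
        # log prior + kappa * <mu, x>; only the non-zero entries of x are visited
        logw = [math.log(max(p, 1e-300))
                + kappa * sum(x * mu.get(c, 0.0) for c, x in row.items())
                for p, mu in zip(priors, mus)]
        m = max(logw)  # subtract the maximum for numerical stability
        w = [math.exp(l - m) for l in logw]
        s = sum(w)
        post.append([x / s for x in w])
    return post

def m_step(rows, post, k):
    """Model-specific part: update prior weights and mean directions."""
    n = len(rows)
    priors = [sum(p[h] for p in post) / n for h in range(k)]
    mus = []
    for h in range(k):
        acc = defaultdict(float)
        for row, p in zip(rows, post):
            for c, x in row.items():
                acc[c] += p[h] * x
        norm = math.sqrt(sum(x * x for x in acc.values()))
        mus.append({c: x / norm for c, x in acc.items()})  # renormalize to unit length
    return priors, mus

def fit(rows, k, kappa=10.0, iters=50):
    """Alternate E-step and M-step for a fixed number of iterations."""
    mus = [dict(rows[h]) for h in range(k)]  # arbitrary start: the first k rows
    priors = [1.0 / k] * k
    for _ in range(iters):
        post = e_step(rows, priors, mus, kappa)
        priors, mus = m_step(rows, post, k)
    return priors, mus, e_step(rows, priors, mus, kappa)

# Toy "document-term matrix": 4 documents, 4 terms, stored as triplets.
i = [0, 0, 1, 1, 2, 2, 3, 3]
j = [0, 1, 2, 3, 0, 1, 2, 3]
v = [2, 1, 1, 4, 1, 3, 3, 1]
rows = rows_from_triplets(i, j, v, nrow=4)
priors, mus, post = fit(rows, k=2)
labels = [p.index(max(p)) for p in post]  # hard assignment by maximum a posteriori probability
print(labels)
```

In this toy example, documents 0 and 2 share one set of terms and documents 1 and 3 share another, so the hard assignments separate the two pairs. A real implementation would also estimate the concentration parameters in the M-step, which this sketch deliberately leaves out.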
The advantage of using the flexmix model driver is that the returned object is of class flexmix, so a range of methods is now available for these fitted objects. Let's have a look at an example. We use the Oz books dataset, which has previously been used for clustering with spherical k-means, and which is available as a corpus for the tm package, where tm stands for text mining. The tm package provides functionality for transforming a corpus into a document-term matrix. We use that, and we see that the tm package stores document-term matrices as simple triplet matrices, so that is exactly the sparse data format we are supporting. We also see that our dataset contains 21 observations and in total more than 20,000 terms. To now fit mixtures of von Mises-Fisher distributions, we set the random seed for reproducibility and load the two packages: flexmix to get access to function flexmix, and movMF to get access to the model driver. We now fit a finite mixture of von Mises-Fisher distributions using function flexmix. We use the formula to specify on the left-hand side the data which should be clustered, and the argument k specifies how many clusters we would like to get. The argument model specifies which kind of model should be used for the components of our finite mixture model. Here we use the model driver for clustering with von Mises-Fisher distributions. The function returns an object of class flexmix, whose print method shows how it was created. It also shows how many observations are assigned to which cluster using the a posteriori probabilities, and how many iterations the EM algorithm took to converge. Now that we have a flexmix object, we can, for example, use the plot method to inspect the a posteriori probabilities. More information on the two packages is available on the web. Both are available from CRAN, and there are also JSS articles available for both. Thank you very much for your attention.