Matti Lyra - Evaluating Topic Models





Published on Jul 26, 2017

Unsupervised models in natural language processing (NLP) have become very popular recently. Word2vec, GloVe and LDA provide powerful computational tools to deal with natural language and make exploring large document collections feasible. We would like to be able to say whether a model is objectively good or bad, and to compare different models to each other, but this is often tricky to do in practice.

Supervised models are trained on labelled data and optimised against an external metric, for example by minimising log loss or maximising accuracy. Unsupervised models, on the other hand, typically try to fit a predefined distribution so that it is consistent with the statistics of some large unlabelled data set, or to maximise the vector similarity of words that appear in similar contexts. Evaluating the trained model often starts by "eye-balling" the results, i.e. checking that your own expectations of similarity are fulfilled by the model.
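As a rough illustration of that "eye-balling" step, the sketch below compares cosine similarities between a few word vectors. The three-dimensional vectors in `toy_vectors` are made up for this example and do not come from any trained model; with real embeddings the vectors would be looked up from the model instead.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented purely for illustration.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "pen": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_pen = cosine_similarity(toy_vectors["cat"], toy_vectors["pen"])

# The informal check: does the model agree with our expectation?
print("cat~dog:", round(sim_cat_dog, 3))
print("cat~pen:", round(sim_cat_pen, 3))
print("expectation met:", sim_cat_dog > sim_cat_pen)
```

A check like this only confirms the expectations you thought to test, which is one reason a single summary metric is attractive.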

Documents that talk about football should be in the same category, and "cat" should be more similar to "dog" than to "pen". But is "cat" more similar to "tiger" than to "dog"? Ideally this information should be captured in a single metric that can be maximised. Tools such as pyLDAvis and gensim provide many different ways to get an overview of the learned model or a single metric that can be maximised: topic coherence, perplexity, ontological similarity, term co-occurrence, word analogy. Using these methods without a good understanding of what the metric represents can give misleading results. Unsupervised models are also often used as part of larger processing pipelines, and it is not clear whether these intrinsic evaluation measures are appropriate in such cases. Perhaps the models should instead be evaluated against an external metric, such as accuracy, for the entire pipeline.
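To make one of these metrics concrete, here is a minimal sketch of a document co-occurrence coherence score in the style of the UMass measure: for each pair of top words in a topic, it takes the log of the smoothed number of documents containing both words divided by the number of documents containing the higher-ranked word. The corpus `docs` and both topics are hypothetical toy data, and this is not the implementation used by gensim or by the talk's notebook.

```python
import math


def umass_style_coherence(topic_words, documents):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ranked word pairs.

    topic_words: top words of one topic, most probable first.
    documents:   list of tokenised documents (lists of strings).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            denom = doc_freq(topic_words[l])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            joint = doc_freq(topic_words[m], topic_words[l])
            score += math.log((joint + 1) / denom)
    return score


# Hypothetical toy corpus and topics, invented for illustration.
docs = [
    ["football", "goal", "match", "team"],
    ["football", "team", "league", "goal"],
    ["cat", "dog", "pet", "tiger"],
    ["match", "team", "goal", "league"],
]
coherent_topic = ["team", "goal", "football"]
mixed_topic = ["team", "cat", "league"]

coherent_score = umass_style_coherence(coherent_topic, docs)
mixed_score = umass_style_coherence(mixed_topic, docs)
print("coherent topic:", coherent_score)
print("mixed topic:", mixed_score)
```

Higher (less negative) scores indicate that a topic's top words tend to appear in the same documents. The score depends heavily on the reference corpus, so a number on its own says little without knowing what it was computed against.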

In this talk I will give an intuition of what the evaluation metrics are trying to achieve and some recommendations for when to use them. I will also cover the pitfalls one should be aware of when using topic models, and the inherent difficulty of measuring, or even concisely defining, semantic similarity.

I assume that you are familiar with topic models, so I will not cover how they are defined or trained. I talk specifically about the tools that are available for evaluating a topic model, irrespective of which algorithm you've used to learn one. The talk is accompanied by a notebook at github.com/mattilyra/pydataberlin-2017


PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
